Feb 15 2010

Mixing bots to gain new services

Category: Php, Spiders & web bots
Giulio Pons @ 12:39 pm

Spiders and bots let you pull services from other web sites. This can be very cool, but it can also become a problem: you are using material made by other people, so is that legitimate? Do they know what you're doing? Could you cause bandwidth problems for them? Do your bots respect copyright?

Well, let's get past all these problems and try to make the spiders' work even cooler: you can mix two or more spiders to create something new. In this example I've mixed a geographic IP lookup bot and a weather bot to get a weather service localized for the user who connects to your site.
This is geolocated weather, like the kind you find on smartphones.
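
The post does not show how the two bots are wired together, so here is a minimal sketch of the idea, not the author's actual code. The function name localizedWeather and the two callables $geoBot and $weatherBot are hypothetical; the stubs return canned data so the example runs without any network access.

```php
<?php
// Sketch: chain two bots. $geoBot and $weatherBot are hypothetical callables
// standing in for real spiders; neither comes from the original post.
function localizedWeather(string $ip, callable $geoBot, callable $weatherBot): ?array
{
    // First bot: IP address -> location (assumed to return a 'city' key).
    $place = $geoBot($ip);
    if (!is_array($place) || empty($place['city'])) {
        return null; // cannot localize, so there is nothing to ask the second bot
    }

    // Second bot: city name -> forecast array.
    $forecast = $weatherBot($place['city']);
    if (!is_array($forecast)) {
        return null;
    }

    // The "new service" is simply the combined output of the two sources.
    return array('city' => $place['city'], 'forecast' => $forecast);
}

// Stub bots with canned data, used only for illustration.
$geoStub = function ($ip) {
    return array('city' => 'Milano', 'country' => 'IT');
};
$weatherStub = function ($city) {
    return array('sky' => 'cloudy', 'temp_c' => 7);
};

$result = localizedWeather('203.0.113.10', $geoStub, $weatherStub);
print_r($result);
```

On a real page the first argument would probably be $_SERVER['REMOTE_ADDR'], and each stub would be replaced by a spider that queries an actual source.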

Do you have any ideas for other mixes? You could grab restaurant listings and show the restaurants or shops near the user… Many iPhone applications already do this. The open question is how good the result is, and that depends on how good the sources are. But these geo mixes already sound old; we have to find new mixes.



Jan 12 2010

Bot that retrieves URL metadata and other info

Category: Php, Spiders & web bots
Giulio Pons @ 2:55 pm

Given a URL, this function retrieves the page title, meta description, keywords, favicon, and an array of up to 5 images to use for links. It calls file_get_contents and then does the rest with regular expressions.

This function is included in the Mini Bots Class.

print_r(getLinksInfo("http://www.rockit.it/articolo/825/nada-studio-report-quando-nasce-una-canzone"));

function getLinksInfo($url) {
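	$web_page = @file_get_contents($url);
	// fetch failed: keep going with an empty page so every field stays empty
	if ($web_page === false) $web_page = "";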

	$data['keywords']="";
	$data['description']="";
	$data['title']="";
	$data['favicon']="";
	$data['images']=array();

	preg_match_all('#<title([^>]*)?>(.*)</title>#Uis', $web_page, $title_array);
	// keep the default when the page has no <title>
	if (isset($title_array[2][0])) $data['title'] = trim($title_array[2][0]);
	preg_match_all('#<meta([^>]*)(.*)>#Uis', $web_page, $meta_array);
	for($i=0;$i<count($meta_array[0]);$i++) {
		if (strtolower(attr($meta_array[0][$i],"name"))=='description') $data['description'] = attr($meta_array[0][$i],"content");
		if (strtolower(attr($meta_array[0][$i],"name"))=='keywords') $data['keywords'] = attr($meta_array[0][$i],"content");
	}
	preg_match_all('#<link([^>]*)(.*)>#Uis', $web_page, $link_array);
	for($i=0;$i<count($link_array[0]);$i++) {
		if (strtolower(attr($link_array[0][$i],"rel"))=='shortcut icon') $data['favicon'] = makeabsolute($url,attr($link_array[0][$i],"href"));
	}
	preg_match_all('#<img([^>]*)(.*)/?>#Uis', $web_page, $imgs_array);
	$imgs = array();
	for($i=0;$i<count($imgs_array[0]);$i++) {
		if ($src = attr($imgs_array[0][$i],"src")) {
			$src = makeabsolute($url,$src);
			if (getRemoteFileSize($src)>15000) array_push($imgs,$src);
		}
		// stop as soon as 5 images have been collected
		if (count($imgs)>=5) break;
	}
	$data['images']=$imgs;

	return $data;
}

Here is the output:

Array
(
    [keywords] => Nada
    [description] => (Nada e il compagno Gerri Manzoli, foto d archivio) Nada &egrave; al Naural HeadQuarter di Ferrara per la registrazione del suo ultimo album in studio, il ventitreesimo, un nuovo capitolo che segna un ulteriore punto nella sua carriera da musicista, iniziata da giovanissima alla fine dei 60. Il titolo non &egrave; stato ancora scelto, cos&igrave; come la data d uscita. Ma possiamo anticiparvi...
    [title] => Nada Studio report - Quando nasce una canzone
    [favicon] => http://www.rockit.it/favicon.ico
    [images] => Array
        (
            [0] => http://ww2.rockit.it/rockit/immagini/Nadain2.jpg
            [1] => http://ww2.rockit.it/rockit/immagini/NadaIn3.jpg
        )

)

And here are the helper functions it uses:

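function attr($s,$attrname) {
	// return the value of an html attribute, or "" when the attribute is missing
	// (["\']) captures the opening quote and \2 requires the same closing quote
	if (preg_match('#\s('.preg_quote($attrname,'#').')\s*=\s*(["\'])(.*?)\2#is', $s, $x)) return $x[3];
	return "";
}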

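function makeabsolute($url,$link) {
	// already absolute (http or https)
	if (preg_match('#^https?://#i', $link)) return $link;
	$p = parse_url($url);
	$scheme = isset($p['scheme']) ? $p['scheme'] : "http";
	if (strpos($link,"//")===0) return $scheme.":".$link;
	$root = $scheme."://".$p['host'];
	if (strpos($link,"/")===0) return $root.$link;
	// relative link: resolve it against the directory of the page
	$path = isset($p['path']) ? $p['path'] : "/";
	return $root.substr($path,0,strrpos($path,"/")+1).$link;
}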

function getRemoteFileSize($url) {
	if (substr($url,0,4)=='http') {
		$x = @get_headers($url, 1);
		if ($x === false) return 0;
		$x = array_change_key_case($x,CASE_LOWER);
		if (!isset($x['content-length'])) return 0;
		// after a redirect there is one value per response: keep the last one
		$x = is_array($x['content-length']) ? end($x['content-length']) : $x['content-length'];
		return (int)$x;
	}
	return (int)@filesize($url);
}
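
Regular expressions are quick to write but easy to trip up on messy markup. As an alternative, here is a hedged sketch that reads the title and meta tags with PHP's DOMDocument instead; this is a swapped-in technique, not what the Mini Bots Class does. The name pageMeta is hypothetical, and the sketch assumes the DOM extension is enabled and that the HTML has already been downloaded into a string.

```php
<?php
// Sketch: extract title and meta tags with DOMDocument instead of regexes.
// pageMeta() is a hypothetical name, not part of the Mini Bots Class.
function pageMeta(string $html): array
{
    $data = array('title' => '', 'description' => '', 'keywords' => '');

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if any.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }

    // <meta name="description|keywords" content="...">, name compared case-insensitively.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $data[$name] = $meta->getAttribute('content');
        }
    }
    return $data;
}

// Inline sample page, used only for illustration.
$sample = '<html><head><title> Demo page </title>'
        . '<meta name="Description" content="A short summary">'
        . '<meta name="keywords" content="bots, php"></head><body></body></html>';
$meta = pageMeta($sample);
print_r($meta);
```

The trade-off is an extra dependency and a full parse of the page in exchange for not having to worry about attribute order, quoting style, or line breaks inside tags.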