Jan 12 2010

Bot that retrieves URL metadata and other info

Category: Php, Spiders & webbots | admin @ 2:55 pm

Given a URL, this function retrieves the page title, meta description, keywords, favicon, and an array of up to 5 images to use for links. It calls file_get_contents and then parses the HTML with regular expressions. Only images larger than 15000 bytes are kept.

This function is included in the Mini Bots Class.

print_r(getLinksInfo("http://www.rockit.it/articolo/825/nada-studio-report-quando-nasce-una-canzone"));

function getLinksInfo($url) {
	$web_page = file_get_contents($url);

	$data['keywords']="";
	$data['description']="";
	$data['title']="";
	$data['favicon']="";
	$data['images']=array();

	// page title (left empty when the page has no <title> tag)
	if (preg_match('#<title[^>]*>(.*)</title>#Uis', $web_page, $title_array)) $data['title'] = trim($title_array[1]);
	preg_match_all('#<meta([^>]*)(.*)>#Uis', $web_page, $meta_array);
	for($i=0;$i<count($meta_array[0]);$i++) {
		if (strtolower(attr($meta_array[0][$i],"name"))=='description') $data['description'] = attr($meta_array[0][$i],"content");
		if (strtolower(attr($meta_array[0][$i],"name"))=='keywords') $data['keywords'] = attr($meta_array[0][$i],"content");
	}
	preg_match_all('#<link([^>]*)(.*)>#Uis', $web_page, $link_array);
	for($i=0;$i<count($link_array[0]);$i++) {
		// accept both rel="shortcut icon" and rel="icon"
		if (in_array(strtolower(attr($link_array[0][$i],"rel")), array('shortcut icon','icon'))) $data['favicon'] = makeabsolute($url,attr($link_array[0][$i],"href"));
	}
	preg_match_all('#<img([^>]*)(.*)/?>#Uis', $web_page, $imgs_array);
	$imgs = array();
	for($i=0;$i<count($imgs_array[0]);$i++) {
		if ($src = attr($imgs_array[0][$i],"src")) {
			$src = makeabsolute($url,$src);
			if (getRemoteFileSize($src)>15000) array_push($imgs,$src);
		}
		// stop as soon as 5 images have been collected
		if (count($imgs)>=5) break;
	}
	$data['images']=$imgs;

	return $data;
}

Here is the output:

Array
(
    [keywords] => Nada
    [description] => (Nada e il compagno Gerri Manzoli, foto d archivio) Nada &egrave; al Naural HeadQuarter di Ferrara per la registrazione del suo ultimo album in studio, il ventitreesimo, un nuovo capitolo che segna un ulteriore punto nella sua carriera da musicista, iniziata da giovanissima alla fine dei 60. Il titolo non &egrave; stato ancora scelto, cos&igrave; come la data d uscita. Ma possiamo anticiparvi...
    [title] => Nada Studio report - Quando nasce una canzone
    [favicon] => http://www.rockit.it/favicon.ico
    [images] => Array
        (
            [0] => http://ww2.rockit.it/rockit/immagini/Nadain2.jpg
            [1] => http://ww2.rockit.it/rockit/immagini/NadaIn3.jpg
        )

)
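To see the title and meta parsing at work without hitting a live site, here is a minimal sketch that runs the same kind of regular expressions on a hard-coded page. The sample HTML and the helper name sampleAttr are made up for this illustration; they are not part of the Mini Bots Class.

```php
<?php
// Hypothetical sample page, used instead of a live URL.
$html = '<html><head><title>Sample page</title>'
      . '<meta name="Description" content="A short summary">'
      . '<meta name="keywords" content="php, bots">'
      . '</head><body></body></html>';

// Hypothetical helper: read one attribute from a tag string.
function sampleAttr($tag, $name) {
    if (preg_match('#\s' . preg_quote($name, '#') . '\s*=\s*["\']([^"\']*)["\']#i', $tag, $m)) {
        return $m[1];
    }
    return "";
}

$info = array('title' => "", 'description' => "", 'keywords' => "");

// Title: lazy match between <title> and </title>.
if (preg_match('#<title[^>]*>(.*)</title>#Uis', $html, $t)) {
    $info['title'] = trim($t[1]);
}

// Meta tags: keep only description and keywords, matching the name case-insensitively.
preg_match_all('#<meta[^>]*>#i', $html, $metas);
foreach ($metas[0] as $tag) {
    $name = strtolower(sampleAttr($tag, "name"));
    if ($name == 'description' || $name == 'keywords') {
        $info[$name] = sampleAttr($tag, "content");
    }
}

print_r($info);
```

Note that the attribute name is lowercased before the comparison, so name="Description" and name="description" are treated the same way.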

And here are the helper functions it uses:

function attr($s,$attrname) {
	// return the value of an HTML attribute, or "" when the tag does not have it
	if (preg_match('#\s'.preg_quote($attrname,'#').'\s*=\s*["\']([^"\']*)["\']#i', $s, $x)) return $x[1];
	return "";
}

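function makeabsolute($url,$link) {
	// already absolute (http or https)
	if (preg_match('#^https?://#i', $link)) return $link;
	$p = parse_url($url);
	$scheme = isset($p['scheme']) ? $p['scheme'] : "http";
	// root-relative link: keep only the scheme and host of the page
	if (strpos($link, "/")===0) return $scheme."://".$p['host'].$link;
	// relative link: resolve it against the directory of the page
	$dir = isset($p['path']) ? substr($p['path'], 0, strrpos($p['path'], "/")+1) : "/";
	return $scheme."://".$p['host'].$dir.$link;
}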

function getRemoteFileSize($url) {
	// local path: ask the filesystem
	if (substr($url,0,4)!='http') return (int) @filesize($url);
	$h = @get_headers($url, 1);
	if ($h === false) return 0;
	$h = array_change_key_case($h, CASE_LOWER);
	if (!isset($h['content-length'])) return 0;
	// after a redirect there is one Content-Length per response: keep the last one
	$len = $h['content-length'];
	return (int) (is_array($len) ? end($len) : $len);
}
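The link resolution has three cases: a link that is already absolute, a link that starts at the site root, and a link relative to the page's directory. Here is a small sketch of those three cases. The function name resolveLink and the example.com URLs are made up for this illustration, and the sketch ignores ports, query strings and "../" segments.

```php
<?php
// Hypothetical stand-in for makeabsolute(), written out so the block runs on its own.
function resolveLink($pageUrl, $link) {
    // Case 1: the link is already absolute.
    if (preg_match('#^https?://#i', $link)) return $link;

    $p = parse_url($pageUrl);
    $base = $p['scheme'] . "://" . $p['host'];

    // Case 2: root-relative link, only scheme and host come from the page.
    if (strpos($link, "/") === 0) return $base . $link;

    // Case 3: relative link, resolved against the directory of the page.
    $path = isset($p['path']) ? $p['path'] : "/";
    return $base . substr($path, 0, strrpos($path, "/") + 1) . $link;
}

$page = "http://www.example.com/articles/2010/post.html";
echo resolveLink($page, "http://cdn.example.com/a.png"), "\n"; // http://cdn.example.com/a.png
echo resolveLink($page, "/favicon.ico"), "\n";                 // http://www.example.com/favicon.ico
echo resolveLink($page, "img/photo.jpg"), "\n";                // http://www.example.com/articles/2010/img/photo.jpg
```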


Dec 24 2009

PHP bot to grab weather information from Google

Category: Php, Spiders & webbots | admin @ 3:37 pm

Google has many useful features that give you data quickly, such as cinema listings or weather forecasts. I think Google gathers this information from the many sites its bots index.

As I did in a previous post for spell checking, you can retrieve this information with a mini bot. The mini bot I’ve made for the weather queries the Italian Google servers for the forecast of a given city (not only Italian cities).

Since Google only gives a 4-day forecast, the function returns an empty string if you ask for a date too far in the future.
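The 4-day limit boils down to a date comparison. Because dates written as Y-m-d sort the same way as strings and as calendar days, a plain string comparison is enough. A minimal sketch, where the function name isWithinForecastWindow and its parameters are made up for this illustration:

```php
<?php
// Hypothetical helper: true when $date falls in the $days-day window starting at $today.
// Both dates must be in Y-m-d format so that string comparison matches calendar order.
function isWithinForecastWindow($date, $today, $days = 4) {
    $last = date("Y-m-d", strtotime("+" . ($days - 1) . " days", strtotime($today)));
    return $date >= $today && $date <= $last;
}

var_dump(isWithinForecastWindow("2009-12-25", "2009-12-24")); // bool(true)
var_dump(isWithinForecastWindow("2009-12-27", "2009-12-24")); // bool(true), last day of the window
var_dump(isWithinForecastWindow("2009-12-28", "2009-12-24")); // bool(false)
```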

Below you can find the PHP source code, and here is a working demo!

This function is included in the Mini Bots Class.

function dayadd($days,$date=null , $format="d/m/Y"){
	// add $days days to $date (default: today) and return the result in $format
	return date($format,strtotime($days." days", $date ? strtotime($date) : time()));
}

function attr($s,$attrname) {
	// get the attribute value of an HTML tag, or "" when the tag does not have it
	if (preg_match('#\s'.preg_quote($attrname,'#').'\s*=\s*["\']([^"\']*)["\']#i', $s, $x)) return $x[1];
	return "";
}

function doGoogleMeteo($q,$date) {
	if ($date>dayadd(3,date("Y-m-d"),"Y-m-d"))return "";

	// grab google page with meteo query
	$web_page = file_get_contents( "http://www.google.it/search?q=meteo+" . urlencode($q) );

	//parse html to find data, and store them in an array
	preg_match_all('#<div class=e>(.*)</table>#Us', $web_page, $m);
	// $m always holds one entry per capture group, so test for an actual match
	if (!empty($m[0])) {
		$p = array();
		preg_match_all('#<img([^>]*)?>#Us', $m[0][0], $img);
		for ($i=0;$i<count($img[0]);$i++) {
			$tag = str_replace("src=\"/","src=\"http://www.google.it/",$img[0][$i]);
			$p[dayadd($i,date("Y-m-d"),"Y-m-d")]["title"] = attr($tag,"title");
			$p[dayadd($i,date("Y-m-d"),"Y-m-d")]["img"] = attr($tag,"src");
		}
		preg_match_all('#<nobr>(.*)</nobr>#Uis', $m[0][0], $nobr);
		for ($i=0;$i<count($nobr[1]);$i++) {
			$temp= explode("|",$nobr[1][$i]);
			$p[dayadd($i,date("Y-m-d"),"Y-m-d")]["min"] = trim($temp[1]);
			$p[dayadd($i,date("Y-m-d"),"Y-m-d")]["max"] = trim($temp[0]);
		}
		// empty string when the page has no forecast for the requested date
		return isset($p[$date]) ? $p[$date] : "";
	}

	return "nada.";
}

print_r ( doGoogleMeteo("milano","2009-12-25") );
//Array (
// [title] => Rovesci
// [img] => http://www.google.it/images/weather/rain.gif
// [min] => -4°C
// [max] => 7°C
//)
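The temperature step can be tried offline. In this sketch each cell holds "max | min", and the cells are assigned to consecutive days starting from a given date. The fragment below is an assumed shape, since the real Google markup is not shown in this post, and the variable names are made up for this illustration.

```php
<?php
// Hypothetical fragment: an assumed shape, not the real Google markup.
$fragment = '<nobr>7°C | -4°C</nobr><nobr>5°C | -2°C</nobr>';
$start = "2009-12-25"; // first forecast day, fixed here so the output is reproducible

$forecast = array();
preg_match_all('#<nobr>(.*)</nobr>#Uis', $fragment, $nobr);
foreach ($nobr[1] as $i => $cell) {
    // Each cell is "max | min".
    $parts = explode("|", $cell);
    // The i-th cell belongs to the i-th day after $start.
    $day = date("Y-m-d", strtotime("+$i days", strtotime($start)));
    $forecast[$day] = array("max" => trim($parts[0]), "min" => trim($parts[1]));
}

print_r($forecast);
```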


Nov 10 2009

Spell checking with the Google spell checker

Category: Php, Spiders & webbots | admin @ 11:31 am

If you have user input that may contain errors, you can check the spelling with the Google Spelling Suggestion service (there is an API, and you have to register for an API key to use their web services).

But you can obtain the same result by querying the Google search engine and parsing the HTML to find the link after the phrase “Did you mean”. You can think of this code as a mini web bot spell checker.

This code works whatever language the results page is in, because it looks for the anchor tag whose class name is “spell” rather than for the phrase itself:

echo doGoogleSpelling("wokipedia");  //returns "wikipedia"

function doGoogleSpelling($q) {

	// grab google page with search
	$web_page = file_get_contents( "http://www.google.it/search?q=" . urlencode($q) );

	// put anchors tag in an array
	preg_match_all('#<a([^>]*)?>(.*)</a>#Us', $web_page, $a_array);
	for($j=0;$j<count($a_array[0]);$j++) {

		// find link with spell suggestion and return it
		if(stristr($a_array[0][$j],"class=spell")) return strip_tags($a_array[0][$j]);

	}

	return "";
}
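To check the parsing step without querying Google, the same anchor search can be run on a hard-coded fragment. The sample markup and the function name findSpellSuggestion are made up for this illustration; the real results page may look different.

```php
<?php
// Hypothetical sample of the kind of markup the function looks for.
$html = '<p>Did you mean: '
      . '<a href="/search?q=wikipedia" class=spell><b><i>wikipedia</i></b></a></p>'
      . '<a href="/other">other link</a>';

// Hypothetical helper: same anchor scan as doGoogleSpelling(), but on a given string.
function findSpellSuggestion($html) {
    // Collect every anchor tag with its content (lazy matching thanks to the U flag).
    preg_match_all('#<a([^>]*)?>(.*)</a>#Us', $html, $anchors);
    foreach ($anchors[0] as $anchor) {
        // The suggestion is the anchor carrying class=spell; strip the inner tags.
        if (stristr($anchor, "class=spell")) return strip_tags($anchor);
    }
    return "";
}

echo findSpellSuggestion($html); // wikipedia
```

When no anchor carries the class, the function returns an empty string, which matches the behaviour of doGoogleSpelling when Google has no suggestion.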