Jan 12 2010

Bot that retrieves URL metadata and other info

Category: Php, Spiders & web bots | Giulio Pons @ 2:55 pm

Given a URL, this function retrieves the page title, meta description, keywords, favicon, and an array of up to 5 images to use for links (only images larger than about 15 KB are kept). It calls file_get_contents and then does the rest with regular expressions.

This function is included in the Mini Bots Class.

print_r(getLinksInfo("http://www.rockit.it/articolo/825/nada-studio-report-quando-nasce-una-canzone"));

function getLinksInfo($url) {
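	// start from empty defaults so every key exists even if the fetch fails
	$data['keywords']="";
	$data['description']="";
	$data['title']="";
	$data['favicon']="";
	$data['images']=array();

	$web_page = @file_get_contents($url);
	if ($web_page === false) return $data; // page could not be retrieved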

	// page title: only set it when a <title> tag was actually found
	if (preg_match('#<title[^>]*>(.*)</title>#Uis', $web_page, $title_array)) $data['title'] = trim($title_array[1]);
	preg_match_all('#<meta([^>]*)(.*)>#Uis', $web_page, $meta_array);
	for($i=0;$i<count($meta_array[0]);$i++) {
		if (strtolower(attr($meta_array[0][$i],"name"))=='description') $data['description'] = attr($meta_array[0][$i],"content");
		if (strtolower(attr($meta_array[0][$i],"name"))=='keywords') $data['keywords'] = attr($meta_array[0][$i],"content");
	}
	preg_match_all('#<link([^>]*)(.*)>#Uis', $web_page, $link_array);
	for($i=0;$i<count($link_array[0]);$i++) {
		if (strtolower(attr($link_array[0][$i],"rel"))=='shortcut icon') $data['favicon'] = makeabsolute($url,attr($link_array[0][$i],"href"));
	}
	preg_match_all('#<img([^>]*)(.*)/?>#Uis', $web_page, $imgs_array);
	$imgs = array();
	for($i=0;$i<count($imgs_array[0]);$i++) {
		if ($src = attr($imgs_array[0][$i],"src")) {
			$src = makeabsolute($url,$src);
			if (getRemoteFileSize($src)>15000) array_push($imgs,$src);
		}
		if (count($imgs)>=5) break; // stop once 5 images have been collected
	}
	$data['images']=$imgs;

	return $data;
}

Here is the output:

Array
(
    [keywords] => Nada
    [description] => (Nada e il compagno Gerri Manzoli, foto d archivio) Nada &egrave; al Naural HeadQuarter di Ferrara per la registrazione del suo ultimo album in studio, il ventitreesimo, un nuovo capitolo che segna un ulteriore punto nella sua carriera da musicista, iniziata da giovanissima alla fine dei 60. Il titolo non &egrave; stato ancora scelto, cos&igrave; come la data d uscita. Ma possiamo anticiparvi...
    [title] => Nada Studio report - Quando nasce una canzone
    [favicon] => http://www.rockit.it/favicon.ico
    [images] => Array
        (
            [0] => http://ww2.rockit.it/rockit/immagini/Nadain2.jpg
            [1] => http://ww2.rockit.it/rockit/immagini/NadaIn3.jpg
        )

)

And here are the helper functions it uses:

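function attr($s,$attrname) {
	// return the value of an html attribute, or "" when the attribute is missing
	// (the lookbehind keeps "src" from matching inside "data-src"; \1 requires the same closing quote)
	if (preg_match('#(?<![\w-])'.preg_quote($attrname,'#').'\s*=\s*(["\'])(.*?)\1#is', $s, $x)) return $x[2];
	return "";
}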

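function makeabsolute($url,$link) {
	if (preg_match('#^https?://#i', $link)) return $link; // already absolute
	$p = parse_url($url);
	$scheme = isset($p['scheme']) ? $p['scheme'] : 'http';
	if (strpos($link, "//")===0) return $scheme.":".$link; // protocol-relative link
	$root = $scheme."://".$p['host'];
	if (strpos($link, "/")===0) return $root.$link; // root-relative link
	$path = isset($p['path']) ? $p['path'] : "/";
	return $root.substr($path, 0, strrpos($path, "/")+1).$link; // relative to the page's directory
}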

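function getRemoteFileSize($url) {
	// local path: ask the filesystem
	if (substr($url,0,4)!='http') return @filesize($url);
	$h = @get_headers($url, 1);
	if ($h === false) return 0; // request failed
	$h = array_change_key_case($h, CASE_LOWER);
	if (!isset($h['content-length'])) return 0; // server did not send a size
	// after redirects the header is repeated, one value per response: keep the last one
	$len = $h['content-length'];
	return (int)(is_array($len) ? end($len) : $len);
}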
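
Regular expressions like the ones above are easy to trip up on real-world markup. As a rough alternative, here is a minimal sketch that reads the title and meta tags with PHP's DOM extension instead. It assumes the dom extension is enabled; the function name getPageMetaDom is made up for this example and is not part of the Mini Bots Class.

```php
<?php
// Hypothetical helper (not part of the Mini Bots Class): reads the title
// and the description/keywords meta tags from an HTML string via DOMDocument.
function getPageMetaDom($html) {
    $data = array('title' => '', 'description' => '', 'keywords' => '');

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if any.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }

    // <meta name="description"> and <meta name="keywords">, case-insensitive.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $data[$name] = $meta->getAttribute('content');
        }
    }
    return $data;
}

// Works on any HTML string, so it can be tried without a network request.
$html = '<html><head><title> Demo page </title>'
      . '<meta name="Description" content="It\'s a demo">'
      . '<meta name="keywords" content="demo, test"></head><body></body></html>';
print_r(getPageMetaDom($html));
```

The parser handles quoting and attribute order by itself, so a content value that contains an apostrophe comes back whole.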


Jan 06 2010

Test if a remote url exists with PHP and CURL

Category: Php, Spiders & web bots | Giulio Pons @ 10:13 am

If you have to test whether a local file exists you will probably use the PHP file_exists function, but to test a remote file, that is to say a remote URL, you can use cURL and read the headers returned by the HTTP request. If you receive a 200 status code the URL is fine; otherwise it is not reachable as given. Redirects are not followed here, so a 301 or 302 also counts as a failure.

This function is included in the Mini Bots Class.

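function url_exists($url) {
	$ch = @curl_init($url);
	if ($ch === false) return false; // cURL could not be initialised
	curl_setopt($ch, CURLOPT_HEADER, TRUE);          // include the headers in the output
	curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // HEAD request: skip the body
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // do not follow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // return the response as a string
	$status = array();
	// the status line looks like "HTTP/1.1 200 OK"; no match means the request failed
	return preg_match('#^HTTP/\S+ (\d+)#', (string)@curl_exec($ch), $status) && $status[1] == 200;
}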

If you don’t have the cURL library installed you can use the PHP get_headers function, which returns an array with the headers:

$url = 'http://www.example.com';
print_r(get_headers($url));
print_r(get_headers($url, 1));

If you apply preg_match to the first element of that array you get the same result:

function url_exists($url) {
	$h = @get_headers($url);
	if ($h === false) return false; // request failed
	$status = array();
	// $h[0] is the status line of the first response
	return preg_match('#^HTTP/\S+ (\d+)#', $h[0], $status) && $status[1] == 200;
}
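
Both versions come down to pulling the numeric code out of the first header line. Here is a small sketch of that step on its own, so it can be tried without any network access. The function name statusCode is made up for this example.

```php
<?php
// Hypothetical helper: extracts the numeric status code from an HTTP
// status line such as "HTTP/1.1 200 OK". Returns 0 when the line does
// not look like a status line.
function statusCode($statusLine) {
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int)$m[1];
    }
    return 0;
}

var_dump(statusCode('HTTP/1.1 200 OK'));        // int(200)
var_dump(statusCode('HTTP/1.0 404 Not Found')); // int(404)
var_dump(statusCode('HTTP/2 301'));             // int(301)
var_dump(statusCode('garbage'));                // int(0)
```

Returning the code rather than a boolean leaves the decision to the caller, for example treating any 2xx code as "exists" or handling redirects separately.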


Nov 10 2009

ASP equivalent to PHP ereg_replace function

Category: Asp | Giulio Pons @ 11:05 pm

I’ve used the PHP function ereg_replace so many times that when I have to work in ASP (because sometimes you have to use that terrible old language) I want it there too.
I’ve also read on the PHP site that this function will soon become deprecated. I’m sad about that.
But ereg_replace doesn’t exist in Microsoft ASP, so here is an ASP equivalent to PHP ereg_replace:

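Function ereg_replace(pattern, change, str)
	Dim ObjRegexp
	Set ObjRegexp = New RegExp
	ObjRegexp.Global = True       ' replace every match, not just the first
	ObjRegexp.IgnoreCase = False  ' ereg_replace is case-sensitive; True would mimic eregi_replace
	ObjRegexp.Pattern = pattern
	' assign the result directly: arguments are passed ByRef by default,
	' so writing to str would also change the caller's variable
	ereg_replace = ObjRegexp.Replace(str, change)
	Set ObjRegexp = Nothing
End Function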
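
On the PHP side, the usual replacement for ereg_replace is preg_replace from the PCRE functions. Here is a rough sketch of a wrapper that keeps the same argument order. The name regex_replace and the $ignoreCase flag are made up for this example, and it assumes simple patterns that behave the same in POSIX and PCRE syntax and do not already contain an escaped "#".

```php
<?php
// Hypothetical wrapper: an ereg_replace-style call built on preg_replace.
// Assumption: the pattern is simple enough to mean the same thing in
// POSIX and PCRE syntax, and contains no already-escaped "#".
function regex_replace($pattern, $change, $str, $ignoreCase = false) {
    // PCRE patterns need delimiters; escape any "#" inside the pattern.
    $delimited = '#' . str_replace('#', '\#', $pattern) . '#' . ($ignoreCase ? 'i' : '');
    return preg_replace($delimited, $change, $str);
}

echo regex_replace('[0-9]+', 'N', 'room 12, floor 3'), "\n";   // room N, floor N
echo regex_replace('hello', 'bye', 'Hello hello', true), "\n"; // bye bye
```

Passing true as the last argument gives the case-insensitive behaviour of eregi_replace.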