Jan 20 2010

New version of Mini Bots PHP Class (v.1.4)

Category: Php,Spiders & webbotsGiulio Pons @ 11:53 pm

I’ve added three more bots to the Mini Bots Php Class, now the version number is 1.4 and it has these new features.

Follow the link, watch demos and download the class from the Mini Bots Php Class page:

  1. addeded twitterInfo method to retrieve the numbers of following, followers, and lists.
  2. addeded url2pdf method to convert and save locally a page to a pdf (thanks to pdfmyurl.com service).
  3. addeded webpage2txt method to extract all the text from a url, stripping html but also scripts and styles.

What else do you need? Suggest some usefull small bots. Comment to suggest.

  • Share/Bookmark

Tags: , , , , , ,


Jan 16 2010

PHP Web page to text function

Category: Php,Spiders & webbotsGiulio Pons @ 2:55 pm

I’ve found this nice small bot on the www.php.net site, thanks to the author of the script on the preg_replace page.
This bot returns the text content of a url and it could be used to take text from a site and find relevant words to search.

It’s nice because it uses CURL and let us see some nice stuff that CURL does:

  1. hide himself presenting with a specific user agent, to seem a browser and not a spider
  2. follows redirect (this means that you can call this function with a tiny url to retrieve the text in the real page!)
  3. use very powerful regular expression to remove html tags, but also javascript and styles (they remain if you simply do a strip_tags)

This script will be included in the next version of the Mini Bots PHP Class.

function webpage2txt($url) {
	$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

	$ch = curl_init();    // initialize curl handle
	curl_setopt($ch, CURLOPT_URL, $url); // set url to post to
	curl_setopt($ch, CURLOPT_FAILONERROR, 1);              // Fail on errors
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // allow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
	curl_setopt($ch, CURLOPT_PORT, 80);            //Set the port number
	curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s

	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

	$document = curl_exec($ch);

	$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
		'@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
		'@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
		'@<![\s\S]*?–[ \t\n\r]*>@',         // Strip multi-line comments including CDATA
		'/\s{2,}/',
	);

	$text = preg_replace($search, "\n", html_entity_decode($document));

	$pat[0] = "/^\s+/";
	$pat[2] = "/\s+\$/";
	$rep[0] = "";
	$rep[2] = " ";

	$text = preg_replace($pat, $rep, trim($text));

	return $text;
}

echo webpage2txt("http://www.rockit.it");
  • Share/Bookmark

Tags: , , , ,


Jan 12 2010

Bot that retrieves url meta data and other infos

Category: Php,Spiders & webbotsGiulio Pons @ 2:55 pm

From a given url this function retrieves page title, meta description, keywords, favicon, and an array of 5 images to use for links. It call file_get_contents and then make some regular expression job.

This function is included in the Mini Bots Class.

print_r(getLinksInfo("http://www.rockit.it/articolo/825/nada-studio-report-quando-nasce-una-canzone"));

function getLinksInfo($url) {
	$web_page = file_get_contents($url);

	$data['keywords']="";
	$data['description']="";
	$data['title']="";
	$data['favicon']="";
	$data['images']=array();

	preg_match_all('#<title([^>]*)?>(.*)</title>#Uis', $web_page, $title_array);
	$data['title'] = $title_array[2][0];
	preg_match_all('#<meta([^>]*)(.*)>#Uis', $web_page, $meta_array);
	for($i=0;$i<count($meta_array[0]);$i++) {
		if (strtolower(attr($meta_array[0][$i],"name"))=='description') $data['description'] = attr($meta_array[0][$i],"content");
		if (strtolower(attr($meta_array[0][$i],"name"))=='keywords') $data['keywords'] = attr($meta_array[0][$i],"content");
	}
	preg_match_all('#<link([^>]*)(.*)>#Uis', $web_page, $link_array);
	for($i=0;$i<count($link_array[0]);$i++) {
		if (strtolower(attr($link_array[0][$i],"rel"))=='shortcut icon') $data['favicon'] = makeabsolute($url,attr($link_array[0][$i],"href"));
	}
	preg_match_all('#<img([^>]*)(.*)/?>#Uis', $web_page, $imgs_array);
	$imgs = array();
	for($i=0;$i<count($imgs_array[0]);$i++) {
		if ($src = attr($imgs_array[0][$i],"src")) {
			$src = makeabsolute($url,$src);
			if (getRemoteFileSize($src)>15000) array_push($imgs,$src);
		}
		if (count($imgs)>5) break;
	}
	$data['images']=$imgs;

	return $data;
}

Here is the output:

Array
(
    [keywords] => Nada
    [description] => (Nada e il compagno Gerri Manzoli, foto d archivio) Nada &egrave; al Naural HeadQuarter di Ferrara per la registrazione del suo ultimo album in studio, il ventitreesimo, un nuovo capitolo che segna un ulteriore punto nella sua carriera da musicista, iniziata da giovanissima alla fine dei 60. Il titolo non &egrave; stato ancora scelto, cos&igrave; come la data d uscita. Ma possiamo anticiparvi...
    [title] => Nada Studio report - Quando nasce una canzone
    [favicon] => http://www.rockit.it/favicon.ico
    [images] => Array
        (
            [0] => http://ww2.rockit.it/rockit/immagini/Nadain2.jpg
            [1] => http://ww2.rockit.it/rockit/immagini/NadaIn3.jpg
        )

)

And here there are the used functions:

function attr($s,$attrname) {
		//retrn html attribute
		preg_match_all('#\s*('.$attrname.')\s*=\s*["|\']([^"\']*)["|\']\s*#i', $s, $x);
		if (count($x)>=3) return $x[2][0];
		return "";
	}

function makeabsolute($url,$link) {
	if (strpos( $link,"http://")===0 ) return $link;
	$p = parse_url($url);
	if (strpos( $link, "/")===0) return "http://".$p['host'].$link;
	return str_replace(substr(strrchr($url, "/"), 1),"",$url).$link;
}

function getRemoteFileSize($url) {
	if (substr($url,0,4)=='http') {
		$x = array_change_key_case(get_headers($url, 1),CASE_LOWER);
		if ( strcasecmp($x[0], 'HTTP/1.1 200 OK') != 0 ) { $x = $x['content-length'][1]; }
		else { $x = $x['content-length']; }
	}
	else { $x = @filesize($url); }
	return $x;
}
  • Share/Bookmark

Tags: , , , ,


Next Page »