How to build a spider… uh, well an email scraper


November 26, 2013

What is a spider?

A spider is a program that automatically navigates web pages to find information. This information can be of any kind. Google, for example, has a spider named Googlebot, the most famous and ubiquitous spider that ever existed: it scans pages and copies their contents to Google's servers so that your pages can be found on the Google search engine. After copying pages, other programs build indexes, analyze your contents and determine a page rank for each keyword on your page. Search engines like Google, Bing and Yahoo! can exist only because there are spiders.
Spiders are good. But they can be evil… and Googlebot reminds me of Agent Smith. Do you remember Agent Smith in The Matrix? :)

So, developing a spider usually involves two steps:
1) collect data from URLs
2) analyze and use the data

The second step may fall outside the scope of the spider itself if it is too complex.

There are different names for spiders: they are also called web crawlers, scrapers, bots or harvesters. Programs that go on the web are called web bots (or simply bots) when they perform small tasks, whereas when they collect a lot of data, parse thousands of pages and do complex things they are called spiders.

Building a spider or a web bot

Building a spider/bot means thinking about the web in a different way: it's not only a place where you can read articles, log in to your favorite social network or buy something on an e-shop. It's a place where you can use programs to get things done.
For example, you can set up a spider that checks whether the page of your favorite artist has changed and sends a push notification to your device when it happens. You can use spiders to track the views of your competitors and build a ranking to monitor them.
You can build a spider that logs in to a site and acts like a normal user. There are a lot of possibilities.
A spider can also use APIs to find data and perform tasks inside third-party sites.

How to build an email scraper

A very simple spider to build is an email scraper. Which, if used to send spam, becomes a spambot, which is a BAD THING. As the name says, an email scraper is a program that crawls all the links of a site and finds email addresses.
The emails are then stored in a database and can later be used to contact users with (maybe unsolicited) messages.

I really don’t like spam and spammers, but building part of a spambot (the part that finds emails) for didactic purposes can show you how simple it is to extract emails from a site, and you can easily understand why there is so much spam in our inboxes today.

So, I will show the code of an email scraper that collects emails from a web site.

You will also understand, as a developer, why it is so important never to display email addresses in plain text on web pages.
In the next paragraphs I will explain the code of a simple email scraper built with my Mini Bots PHP Class.
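As a quick aside, one common (if imperfect) defense is to encode the address as HTML entities, so that naive scrapers matching plain-text emails miss it. This is just an illustrative sketch, not part of the scraper below:

```php
<?php
// Encode every character of an email address as a numeric HTML
// entity. Browsers render it normally, but a scraper that only
// matches plain-text addresses will not see it.
// Note: smarter scrapers decode entities first, so treat this
// only as a first line of defense.
function obfuscateEmail($email) {
	$out = "";
	for ($i = 0; $i < strlen($email); $i++) {
		$out .= "&#" . ord($email[$i]) . ";";
	}
	return $out;
}

echo obfuscateEmail("info@example.com");
// prints &#105;&#110;&#102;&#111;&#64;... and so on
```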

Be polite and keep control of your spider

Since every spider needs time to run, it’s really important to keep it under control. This means knowing what you’re doing and being aware that you can cause problems for servers and for the developers behind those servers. So you have to control your code, and be able to stop or throttle your spider.

This is because if you send a lot of HTTP requests to a web site, you can overload it and “kill” it.

The simplest way I’ve found to run a PHP spider completely under my control is to let it work in chunks inside a web browser. Working in chunks means that it does a job lasting a few seconds, then stops and waits for a while, then restarts and does another little chunk of work. And so on…
This approach also respects the rule of politeness, since you make only a small number of calls to the target server, spaced out in time. This rule is very important and is a policy that developers must follow.
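If you preferred to run the crawler from the command line instead of a browser, the same chunk-and-wait idea could be sketched like this (crawlOneChunk() is a hypothetical placeholder for the per-page work shown later in the article, not a real function of the Mini Bots class):

```php
<?php
// Hypothetical sketch of the chunk-and-wait loop from the CLI.
// crawlOneChunk() stands in for the real per-page work: fetch one
// page, save its links and emails, and return 0 when there is
// nothing left to crawl.
function crawlOneChunk() {
	// ...fetch one page, save links and emails...
	return 0; // placeholder: stop immediately in this sketch
}

while (crawlOneChunk()) {
	// Wait a random 2-5 seconds between requests so the
	// target server is never hammered.
	sleep(rand(2, 5));
}
```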


The crawler starts from a “seed”, that is to say a starting URL, finds all the links in that page and stores them in a database. Then it searches the page for email addresses and, if there are any, stores them in the database too. Finding emails is very easy, since a single regular expression is enough to extract them from the HTML. At the end of this process the crawler marks the analyzed page as read and the spider stops.
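For example, extracting emails from raw HTML really does take just one call to preg_match_all. The pattern below is a simplified illustration, not necessarily the exact one used by findEmails:

```php
<?php
// Extract email addresses from a chunk of HTML with a single
// regular expression. The pattern is deliberately simple and
// will miss some exotic but valid addresses.
function extractEmails($html) {
	preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $m);
	// Deduplicate and reindex, since the same address often
	// appears both in a mailto: link and in the link text.
	return array_values(array_unique($m[0]));
}

$html = '<p>Write to <a href="mailto:info@example.com">info@example.com</a> or sales@example.org</p>';
print_r(extractEmails($html));
// Array ( [0] => info@example.com [1] => sales@example.org )
```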

Since this email scraper was designed to run in a browser, at the end of the process it waits for a random interval of time and then reloads itself through JavaScript.
When it restarts, it picks as the new seed one of the previously saved links not yet marked as read.
This process goes on until all the URLs in the database have been read.
With a script like this you can crawl an entire site and find its emails.

Database

Here is a very simple database for our email scraper:

CREATE TABLE IF NOT EXISTS `spider` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `saved` datetime NOT NULL,
  `url` varchar(255) NOT NULL,
  `visited` tinyint(3) unsigned NOT NULL default '0' COMMENT '0=not visited,1=visited',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

CREATE TABLE IF NOT EXISTS `emails` (
  `cd_spider` int(10) unsigned NOT NULL,
  `email` varchar(150) NOT NULL,
  UNIQUE KEY `email` (`email`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

As you can see there are only two tables: one to store the links found and mark them as visited or not, and the emails table, which collects the found data.
Here is the code. Remember that to make it work you need my Mini Bots PHP Class, available for just $7 on CodeCanyon.

PHP source code

This code uses these methods of the Mini Bots PHP Class:

getPage
To grab a page from a URL, masking your bot as a browser

findLinks
To grab the list of links in a page, fixing absolute paths, removing duplicates and filtering internal links

findEmails
To extract all the email addresses from a web page

So, create the tables in your database, copy the code, configure it with your database credentials and your $SEED, and go.
You can close the browser and reopen it, or stop the automatic refresh of the scraper by pressing ESC.
To restart it, just reload the page.

<?php
header('Content-type: text/html; charset=utf-8');
ini_set('default_charset', 'UTF-8');
setlocale(LC_CTYPE, 'it_IT.UTF-8');

define("WEBDOMAIN", "localhost");
define("DEFDBNAME", "dbname");
define("DEFUSERNAME", "username");
define("DEFDBPWD",  "password");
if(!Connect()) {die("err db");} else {mysql_query("SET NAMES 'utf8';");}

//
// this is the seed, the starting
// page.
$SEED = "http://www.your-very-interesting-target-site.com";

//
// includes the Mini Bots PHP Class
include("minibots/minibots.class.php");
$mb = new Minibots();

//
// this is the main function
// that does everything
function goSpiderGo($target="") {
	global $mb;
	$f = (integer)execute_scalar("select count(*) from spider where url='".addslashes($target)."' and visited=1");
	if($f>0) {
		$target = execute_scalar("select url from spider where visited=0 order by saved limit 0,1");
	}
	if(!$target) {
		return 0;
	}

	echo "target: <code>".$target."</code><br>";
	$web_page = $mb->getPage($target);
	echo "length: <code>".strlen($web_page)." bytes</code><br>";
	$links = $mb->findLinks($target, $web_page, false, "pdf,zip,jpg,gif,png");

	$countUrls0 = execute_scalar("select count(*) from spider");
	foreach($links as $link){saveUrl($link,0);}
	$countUrls1 = execute_scalar("select count(*) from spider");
	$countUrlsV1 = execute_scalar("select count(*) from spider where visited=1");

	echo "Urls added: <code>".($countUrls1-$countUrls0)."</code><br>";
	echo "Total URLs: <code>".($countUrls1)."</code><br>";
	if($countUrls1>0) echo "Urls scraped: <code>".($countUrlsV1)." (".round($countUrlsV1/$countUrls1*100)."%)</code><br>";

	$spider = saveUrl($target,1);
	echo "record scraped: <code>".$spider."</code><br>";

	$countEmails0 = execute_scalar("select count(*) from emails");
	$emails = $mb->findEmails($web_page);

	foreach($emails as $email){
		mysql_query($sql = "insert ignore into emails (cd_spider,email) values (".(integer)$spider.",'".addslashes($email)."')") or die(mysql_error().$sql);
	}
	$countEmails1 = execute_scalar("select count(*) from emails");
	echo "Emails added: <code>".($countEmails1-$countEmails0)."</code><br>";
	echo "Total emails: <code>".($countEmails1)."</code><br>";

	return 1;
}






//
// database related functions
// -------------------------------------------------
function Connect() { if (@mysql_connect( WEBDOMAIN, DEFUSERNAME, DEFDBPWD ) && @mysql_select_db( DEFDBNAME)) return 1; else return 0; }
function saveUrl($url,$visited=0) { 
	$f = execute_scalar("select id from spider where url='".addslashes($url)."'");
	if(!$f) {
		mysql_query("insert ignore into spider (saved,url,visited) values (NOW(),'".addslashes($url)."',".(integer)$visited.")") or die(mysql_error());
		$f = mysql_insert_id();
	} else {
		if($visited==1) mysql_query("update spider set visited=".(integer)$visited." where url='".addslashes($url)."'") or die(mysql_error());
	}
	return $f;
}
function execute_scalar($sql,$def="") {
	$rs = mysql_query($sql) or die(mysql_error().$sql);
	if (mysql_num_rows($rs)) {$r = mysql_fetch_row($rs);mysql_free_result($rs);return $r[0];}
	return $def;
}
function execute_row($sql) {
	$rs = mysql_query($sql) or die(mysql_error().$sql);
	if (mysql_num_rows($rs)) {$r = mysql_fetch_array($rs);mysql_free_result($rs);return $r;}
	mysql_free_result($rs);
	return "";
}
// -------------------------------------------------

?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="content-language" content="en" />
<title>Email Scraper</title>
<style>
	body{font-size:12px;font-family:sans-serif,arial;line-height:17px;}
	code {color:#777;background-color:#f0f0f0;padding:0 3px;}
</style>
</head>
<body>
<?php
echo date("Y-m-d H:i:s");
echo "<br>";

$refresh = goSpiderGo($SEED);

if($refresh) {
	//
	// stops and wait a small
	// amount of time to be polite with
	// the target server.
	$timer = rand(1,3)*1231;
	echo "reloading in... <code id='timer'>".$timer."</code> milliseconds";
	echo "<script>
	setTimeout(function(){
		document.location.href = document.location.href;
	},".($timer).");
	setInterval(function(){
		q=parseInt(document.getElementById('timer').innerHTML);
		if(q-10>0) q = q-10; else q=0;
		document.getElementById('timer').innerHTML=q;
	},10);
	</script>";
}
?>
</body>
</html>

Author

PHP expert. Wordpress plugin and theme developer. Father, Maker, Arduino and ESP8266 enthusiast.
