Scraping content with PHP as if it was jQuery

December 8, 2013

Building a spider or a bot takes some knowledge of regular expressions: you need to know and use preg_match or preg_match_all to selectively find tags and extract information from the HTML source. Sometimes, while looking at the HTML code of a page, I’ve thought: “If only I could use jQuery to get it! It would be so easy!”

This happens because nested items are often hard to recognize with regular expressions: there is no clear, stable anchor point to use to detect the information. So you need to use the DOM (Document Object Model), that is to say the document structure.
To do that you have to load the target page with a function like simplexml_load_string and then navigate through the resulting objects to find the information you need. This is not easy. If you’re also a front-end developer you probably know jQuery: basically, it’s a library that helps programmers handle the differences between browsers and lets you develop in JavaScript much faster.
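To give an idea of what navigating the DOM by hand looks like, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. It uses DOMDocument::loadHTML rather than simplexml_load_string because real-world HTML is rarely well-formed XML. The markup, class names and URLs are made up for illustration.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$markup = '<div class="post"><h3><a href="/item/1">First item</a></h3></div>'
        . '<div class="post"><h3><a href="/item/2">Second item</a></h3></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

// Walk the tree with an XPath query: every link inside an h3 inside div.post.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@class="post"]/h3/a') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}

print_r($links);
```

It works, but the XPath expression is a lot less friendly than a CSS selector like `div.post h3 a`.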

The first thing you learn with jQuery is its simple system for finding and getting HTML tags, selecting them by class or other attributes.
So what you need is something like a jQuery for PHP, a PHP jQuery lib!

I’m not going to teach you jQuery, but I’m going to tell you about the Simple HTML DOM PHP class, which you can find on SourceForge. It is the PHP jQuery lib I was searching for. It’s a class that lets you build scrapers using methods to navigate the DOM like the ones used in jQuery. Using this class I was able to build, in just ten minutes, the mini-widget in the right sidebar that embeds an animated GIF and its description from the very funny tumblr the_coding_love, which publishes funny animated GIFs about coding every day. The code is only this and could be done better:

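// Requires the Simple HTML DOM class (simple_html_dom.php) to be included first.
$html = file_get_html('');
$src = $link = $text = "";
// The post title and its link are in the anchor inside the heading.
foreach ($html->find('div.centre h3') as $e) {
	foreach ($e->find("a") as $a) {
		$text = $a->innertext;
		$link = $a->href;
	}
}
// The animated GIF is the image inside the post body.
foreach ($html->find('div.bodytype') as $e) {
	foreach ($e->find("img") as $a) $src = $a->src;
}
echo $src."\n";
echo $link."\n";
echo $text."\n";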

I know the above code is not so well written and could be better, but I wrote it at 1:00 am.

You can see the result (with some css) in the right sidebar (or below if you are on mobile).

Here is a complete list of the methods/functions and properties in the latest version, 1.5:

Helper functions

Name | Description
object str_get_html ( string $content ) | Creates a DOM object from a string.
object file_get_html ( string $filename ) | Creates a DOM object from a file or a URL.
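A minimal sketch of the two helpers in use. It assumes simple_html_dom.php (version 1.5) sits in the same directory as the script; the markup is made up for illustration.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

// Hypothetical markup standing in for a downloaded page.
$markup = '<div class="post"><h3><a href="/item/1">First item</a></h3></div>';

// str_get_html() parses a string; file_get_html() does the same
// starting from a file path or a URL.
$html = str_get_html($markup);

// Passing an index to find() returns a single element instead of an array.
$a = $html->find('div.post h3 a', 0);
$href = $a->href;
$text = $a->innertext;

echo $href, "\n";
echo $text, "\n";

// Free the memory held by the parsed tree.
$html->clear();
```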

DOM methods & properties

Name | Description
void __construct ( [string $filename] ) | Constructor. Setting the filename parameter automatically loads the contents, either text or file/URL.
string plaintext | Returns the contents extracted from HTML.
void clear () | Cleans up memory.
void load ( string $content ) | Loads contents from a string.
string save ( [string $filename] ) | Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is saved to that file.
void load_file ( string $filename ) | Loads contents from a file or a URL.
void set_callback ( string $function_name ) | Sets a callback function.
mixed find ( string $selector [, int $index] ) | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
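Here is a small sketch of the DOM object methods, again assuming simple_html_dom.php (version 1.5) is in the same directory. The list markup is invented for the example.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

$dom = new simple_html_dom();
$dom->load('<ul><li class="a">one</li><li>two</li></ul>');

// Without an index, find() returns an array of elements.
$items = $dom->find('li');
$count = count($items);

// With an index, find() returns the Nth element (zero-based).
$second = $dom->find('li', 1);
$secondText = $second->plaintext;

// save() dumps the tree back into a string.
$dump = $dom->save();

echo $count, "\n";
echo $secondText, "\n";
echo $dump, "\n";

$dom->clear();
```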

Element methods & properties

Name | Description
string [attribute] | Read or write the element’s attribute value.
string tag | Read or write the tag name of the element.
string outertext | Read or write the outer HTML text of the element.
string innertext | Read or write the inner HTML text of the element.
string plaintext | Read or write the plain text of the element.
mixed find ( string $selector [, int $index] ) | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
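The element properties can be read and written like plain object properties. A minimal sketch, with the same assumption about where simple_html_dom.php lives and with made-up markup:

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

$html = str_get_html('<p id="msg" class="old">Hello <b>world</b></p>');
$p = $html->find('p', 0);

// Reading: tag name, inner HTML and text without tags.
$tag   = $p->tag;
$inner = $p->innertext;
$plain = $p->plaintext;

// Writing: attributes and inner HTML are changed by assignment.
$p->class = 'new';
$p->innertext = 'Bye';

// The modified document as a string.
$result = $html->save();

echo $tag, "\n", $inner, "\n", $plain, "\n", $result, "\n";

$html->clear();
```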

DOM traversing

Name | Description
mixed $e->children ( [int $index] ) | Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent () | Returns the parent of the element.
element $e->first_child () | Returns the first child of the element, or null if not found.
element $e->last_child () | Returns the last child of the element, or null if not found.
element $e->next_sibling () | Returns the next sibling of the element, or null if not found.
element $e->prev_sibling () | Returns the previous sibling of the element, or null if not found.
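The traversal methods above, sketched on a tiny invented menu. As before, this assumes simple_html_dom.php (version 1.5) is in the same directory.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

$html = str_get_html('<ul id="menu"><li>Home</li><li>Blog</li><li>About</li></ul>');
$ul = $html->find('#menu', 0);

$childCount = count($ul->children());        // all child elements
$middle     = $ul->children(1)->plaintext;   // Nth child, zero-based
$first      = $ul->first_child();
$firstText  = $first->plaintext;
$nextText   = $first->next_sibling()->plaintext;
$lastText   = $ul->last_child()->plaintext;
$parentTag  = $first->parent()->tag;
$noPrevious = $first->prev_sibling();        // null: nothing before the first child

echo $childCount, " ", $middle, " ", $firstText, " ", $nextText, " ", $lastText, " ", $parentTag, "\n";

$html->clear();
```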

Camel naming conventions

You can also call methods using the W3C standard camel-case naming conventions.

Method | Mapping
array $e->getAllAttributes () | array $e->attr
string $e->getAttribute ( $name ) | string $e->attribute
void $e->setAttribute ( $name, $value ) | void $e->attribute = $value
bool $e->hasAttribute ( $name ) | bool isset($e->attribute)
void $e->removeAttribute ( $name ) | void $e->attribute = null
element $e->getElementById ( $id ) | mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [, $index] ) | mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name ) | mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] ) | mixed $e->find ( $name [, int $index] )
element $e->parentNode () | element $e->parent ()
mixed $e->childNodes ( [$index] ) | mixed $e->children ( [int $index] )
element $e->firstChild () | element $e->first_child ()
element $e->lastChild () | element $e->last_child ()
element $e->nextSibling () | element $e->next_sibling ()
element $e->previousSibling () | element $e->prev_sibling ()
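A short sketch showing a few of the camel-case aliases next to the calls they map to. Same assumption as before about simple_html_dom.php (version 1.5) being in the same directory; the markup is invented.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

$html = str_get_html('<div id="box"><a href="/a" title="A">link</a></div>');

// getElementById($id) maps to find("#$id", 0).
$box = $html->getElementById('box');

// firstChild() maps to first_child().
$a = $box->firstChild();
$tag = $a->tag;

// getAttribute() / hasAttribute() map to $a->href / isset($a->title).
$href     = $a->getAttribute('href');
$hasTitle = $a->hasAttribute('title');

// setAttribute() maps to a plain property assignment.
$a->setAttribute('title', 'B');
$title = $a->title;

// parentNode() maps to parent().
$parentTag = $a->parentNode()->tag;

echo $tag, " ", $href, " ", $title, " ", $parentTag, "\n";

$html->clear();
```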


I'm a software engineer, an everyday web developer and a maker. I usually build sites with PHP, with or without WordPress. I build Internet of Things projects with Arduino and ESP8266. I'm the founder of and and I'm currently the Chief Technical Officer of Better Days web agency.

