Scraping content with PHP as if it was jQuery

Building a spider or a bot requires some knowledge of regular expressions: you need preg_match or preg_match_all to selectively find tags and extract information from the HTML source. Sometimes, while looking at the HTML code of a page, I’ve thought: “If only I could use jQuery to get it! It would be so easy!”
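
As a minimal sketch of the regex approach, the snippet below pulls link targets and labels out of a small HTML string with preg_match_all. The sample markup and the pattern are made up for illustration, and a pattern this simple would not survive messier real-world HTML.

```php
<?php
// Sample markup invented for this illustration.
$html = '<ul><li><a href="/one">First</a></li><li><a href="/two">Second</a></li></ul>';

// Capture the href value and the link text of every <a> tag.
// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all('/<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'text' => $m[2]];
}

foreach ($links as $link) {
    echo $link['href'] . ' => ' . $link['text'] . "\n";
}
```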

This happens because sometimes there are nested items that are not easily recognizable with regular expressions: you don’t have a clear and stable anchor point to locate the information. So you need to use the DOM (Document Object Model), that is to say the document structure.
To do that, you have to load the target page with a function like simplexml_load_string and then navigate through the objects to find the information you need. This is not easy. If you’re also a front-end developer you probably know jQuery: it’s a JavaScript library that smooths over the differences between browsers and makes front-end development much faster.
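
For comparison, here is a rough sketch of the plain DOM route using simplexml_load_string. It assumes the markup is well-formed XML (real pages often are not), and the sample document and class name are invented for illustration.

```php
<?php
// Invented, well-formed sample markup; simplexml_load_string fails on sloppy HTML.
$source = '<html><body>'
        . '<div class="post"><h3><a href="/post/1">Title one</a></h3></div>'
        . '<div class="post"><h3><a href="/post/2">Title two</a></h3></div>'
        . '</body></html>';

$xml = simplexml_load_string($source);

$titles = [];
// Walk the tree by hand: body -> div -> h3 -> a.
foreach ($xml->body->div as $div) {
    if ((string) $div['class'] !== 'post') {
        continue; // skip unrelated blocks
    }
    $a = $div->h3->a;
    $titles[(string) $a['href']] = (string) $a;
}

print_r($titles);
```

Every step of the path has to be spelled out, which is exactly the part that gets tedious on deeply nested pages.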

The first thing you learn with jQuery is its simple way of finding HTML tags, selecting them by class or other attributes.
So what you need is something like a jQuery for PHP!

I’m not going to teach you jQuery, but I’m going to tell you about the Simple HTML DOM PHP class, which you can find on SourceForge. It is the PHP jQuery lib I was searching for. It’s a class that lets you build scrapers using methods to navigate the DOM like the ones used in jQuery. Using this class I was able to build, in just ten minutes, the mini-widget in the right sidebar that embeds an animated GIF and its description from the very funny tumblr the_coding_love, which publishes funny animated GIFs about coding every day. The code is only this and could be done better:

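// Assumes simple_html_dom.php is available next to this script.
include 'simple_html_dom.php';

$html = file_get_html('');
$src = $link = $text = "";
// Caption and permalink come from the link inside the h3 heading.
foreach ($html->find('div.centre h3') as $e) {
	foreach ($e->find("a") as $a) {
		$text = $a->innertext;
		$link = $a->href;
	}
}
// The GIF is the image inside the post body.
foreach ($html->find('div.bodytype') as $e) {
	foreach ($e->find("img") as $a) $src = $a->src;
}
echo $src."\n";
echo $link."\n";
echo $text."\n";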

I know the code above is not well written, but I wrote it at 1:00 am.

You can see the result (with some css) in the right sidebar (or below if you are on mobile).

Here is a complete list of the methods, functions and properties in the latest version, 1.5:

Helper functions

Name | Description
object str_get_html ( string $content ) | Creates a DOM object from a string.
object file_get_html ( string $filename ) | Creates a DOM object from a file or a URL.

DOM methods & properties

Name | Description
void __construct ( [string $filename] ) | Constructor. If the filename parameter is set, the contents are loaded automatically, from either text or a file/URL.
string plaintext | Returns the contents extracted from the HTML.
void clear () | Cleans up memory.
void load ( string $content ) | Loads contents from a string.
string save ( [string $filename] ) | Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file.
void load_file ( string $filename ) | Loads contents from a file or a URL.
void set_callback ( string $function_name ) | Sets a callback function.
mixed find ( string $selector [, int $index] ) | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise an array of objects.

Element methods & properties

Name | Description
string [attribute] | Reads or writes the element’s attribute value.
string tag | Reads or writes the tag name of the element.
string outertext | Reads or writes the outer HTML text of the element.
string innertext | Reads or writes the inner HTML text of the element.
string plaintext | Reads or writes the plain text of the element.
mixed find ( string $selector [, int $index] ) | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise an array of objects.
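
A short sketch of how the helper function, find() and the element properties listed above fit together. It assumes a copy of simple_html_dom.php that works with your PHP version sits next to the script; the sample markup and class name are made up.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once 'simple_html_dom.php';

// Invented sample markup.
$html = str_get_html('<div class="post"><h3><a href="/post/1">Title <b>one</b></a></h3></div>');

// With an index, find() returns a single element (or null if nothing matches).
$a = $html->find('div.post h3 a', 0);

$href  = $a->href;       // attribute read through a magic property
$inner = $a->innertext;  // inner HTML, tags included
$plain = $a->plaintext;  // text only, tags stripped

// Attributes can be written the same way.
$a->href = '/post/1?ref=widget';

echo $href . "\n" . $inner . "\n" . $plain . "\n";
echo $html->save() . "\n"; // the document with the modified link

$html->clear(); // free memory
```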

DOM traversing

Name | Description
mixed $e->children ( [int $index] ) | Returns the Nth child object if index is set, otherwise return an array of children.
element $e->parent () | Returns the parent of element.
element $e->first_child () | Returns the first child of element, or null if not found.
element $e->last_child () | Returns the last child of element, or null if not found.
element $e->next_sibling () | Returns the next sibling of element, or null if not found.
element $e->prev_sibling () | Returns the previous sibling of element, or null if not found.
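
The traversal methods can be tried on a tiny list. This is only a sketch: it assumes simple_html_dom.php is next to the script, and the markup is invented.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once 'simple_html_dom.php';

// Invented sample markup.
$html = str_get_html('<ul id="menu"><li>one</li><li>two</li><li>three</li></ul>');

$ul = $html->find('ul', 0);

$first  = $ul->first_child();      // <li>one</li>
$last   = $ul->last_child();       // <li>three</li>
$second = $first->next_sibling();  // <li>two</li>
$count  = count($ul->children());  // all direct children as an array

echo $first->plaintext . "\n";
echo $second->plaintext . "\n";
echo $last->plaintext . "\n";
echo $second->parent()->id . "\n"; // back up to the <ul>
echo $count . "\n";

// Past either end of the list the sibling methods return null.
var_dump($last->next_sibling() === null);
var_dump($first->prev_sibling() === null);
```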

Camel naming conventions

You can also call the methods using the W3C standard camel-case naming convention.

Method | Mapping
array $e->getAllAttributes () | array $e->attr
string $e->getAttribute ( $name ) | string $e->attribute
void $e->setAttribute ( $name, $value ) | void $e->attribute = $value
bool $e->hasAttribute ( $name ) | bool isset($e->attribute)
void $e->removeAttribute ( $name ) | void $e->attribute = null
element $e->getElementById ( $id ) | mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [, $index] ) | mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name ) | mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] ) | mixed $e->find ( $name [, int $index] )
element $e->parentNode () | element $e->parent ()
mixed $e->childNodes ( [$index] ) | mixed $e->children ( [int $index] )
element $e->firstChild () | element $e->first_child ()
element $e->lastChild () | element $e->last_child ()
element $e->nextSibling () | element $e->next_sibling ()
element $e->previousSibling () | element $e->prev_sibling ()
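
To make the mapping concrete, here is a small sketch that uses a few of the camel-case names, with the equivalent native call noted beside each one. As before, it assumes simple_html_dom.php is next to the script and the markup is invented.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once 'simple_html_dom.php';

// Invented sample markup.
$html = str_get_html('<div id="box"><p class="intro">Hello</p><p>World</p></div>');

$box   = $html->getElementById('box');   // same as $html->find('#box', 0)
$first = $box->firstChild();             // same as $box->first_child()
$class = $first->getAttribute('class');  // same as $first->class

$first->setAttribute('class', 'lead');   // same as $first->class = 'lead'

$next     = $first->nextSibling();       // same as $first->next_sibling()
$parentId = $next->parentNode()->getAttribute('id'); // same as $next->parent()->id

echo $class . "\n";
echo $first->class . "\n";
echo $next->plaintext . "\n";
echo $parentId . "\n";
```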
