Learning to crawl before you can run
Wednesday, July 8th, 2009
Crawling websites for data using PHP running in a browser
I’ve had an idea for a website for almost a year now (won’t spill the beans just yet though) and today I finally started work on it. To lift the veil of secrecy a little, I’m putting information about certain places onto a Google Map, because somehow nobody has thought of doing it yet.
All that information about the places is already available on the internet, just not embedded in a map, so my first step was to crawl some websites to get hold of it. A slight problem was that this required running a script to automatically trawl through all the pages. I only know PHP, which, as far as I was aware, could only be run in a browser, and browsers time out after a while, so it would be impossible to just leave it running.
A little more research revealed that it is possible to run PHP scripts as stand-alone entities outside the browser, but only using something called CGI. I had no idea what CGI was, and my hosting company don’t allow you to use CGI anyway. But I did manage to find another solution.
Although a PHP script crawling lots of pages for data would cause a browser timeout, a script crawling just one page almost certainly wouldn’t. So what I had to do was:
1. Write a script that crawls the data of one page but…
2. … checks what page was last crawled and moves on to the next one when it starts and…
3. … tells the script to execute again once it’s finished.
Step 1 will differ depending on your needs, but my solutions to steps 2 and 3, I believe, provide a good technique to crawl web pages if you only know PHP.
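To make the shape of the idea concrete, here is a minimal sketch of a single run of such a script. Everything in it is made up for illustration: the function name crawl_one_page, the $store array standing in for the database table, and the $fetchPage callback standing in for whatever step 1 looks like for your site. The loop at the bottom only simulates the browser requesting the script over and over.

```php
<?php
// Sketch of one run of the script. All names here are hypothetical.
// $store stands in for the database table: crawled page id => extracted data.
// $fetchPage stands in for step 1: it takes an id and returns data, or null on failure.
function crawl_one_page(array &$store, callable $fetchPage, int $firstId): bool
{
    // Step 2: work out the next page from what has already been stored.
    $nextId = empty($store) ? $firstId : max(array_keys($store)) + 1;

    // Step 1: crawl exactly one page.
    $data = $fetchPage($nextId);
    if ($data === null) {
        return false; // nothing stored, so do not ask for another run
    }
    $store[$nextId] = $data;

    // Step 3: tell the caller to run the script again.
    return true;
}

// Simulate the browser re-requesting the script until it says stop.
$store = [];
$fakeFetch = function (int $id) {
    return $id <= 3 ? "data for page $id" : null; // pretend the site has 3 pages
};
while (crawl_one_page($store, $fakeFetch, 1)) {
    // in the real script each pass would be a fresh browser request
}
print_r(array_keys($store));
```

Because the next id is always derived from what has been stored, no state has to survive between runs other than the stored data itself.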
Solution to 2.
Each iteration will presumably write the information to a database. Provided you’re iterating over an integer (e.g. the web pages are of the form http://www.thesite.com/thepage?id=theinteger) then you’ll probably be storing that integer in your database. Then the following code at the beginning of your script will advance you to the next web page to crawl.
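$first_entry = 1; // id of the first page to crawl; adjust for your site
$last_entry_query = @mysql_query("SELECT theinteger FROM thetable ORDER BY theinteger DESC LIMIT 1");
if ($last_entry_query && mysql_num_rows($last_entry_query)) {
    // The table already has rows: resume from the highest id stored so far
    $last_entry_row = mysql_fetch_array($last_entry_query, MYSQL_ASSOC);
    $last_entry = (int) $last_entry_row['theinteger'];
} else {
    // Empty table: start one below the first id so the +1 below lands on it
    $last_entry = $first_entry - 1;
}
// Fetch the next page as an array of lines (needs allow_url_fopen enabled)
$current_page = file('http://www.thesite.com/thepage?id=' . ($last_entry + 1));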
You can stop and start the crawl whenever you like as the database will always tell you which page to crawl next.
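On newer PHP versions, where the mysql_* functions no longer exist (they were removed in PHP 7), the same lookup could be written with PDO instead. This is only a sketch under a few assumptions: the helper name next_page_id and the $firstId parameter are invented for illustration, and the demo uses an in-memory SQLite database (which needs the pdo_sqlite extension) rather than MySQL so that it runs on its own.

```php
<?php
// Hypothetical helper: returns the id of the next page to crawl.
// Table and column names follow the snippet above; $firstId is an assumption.
function next_page_id(PDO $db, int $firstId): int
{
    $max = $db->query('SELECT MAX(theinteger) FROM thetable')->fetchColumn();
    // MAX() yields NULL on an empty table, so start from the first id.
    return $max === null ? $firstId : (int) $max + 1;
}

// Demo against an in-memory SQLite database (needs the pdo_sqlite extension).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE thetable (theinteger INTEGER PRIMARY KEY)');

$first = next_page_id($db, 1); // empty table: start at the first id
$db->exec('INSERT INTO thetable (theinteger) VALUES (3)');
$next = next_page_id($db, 1);  // highest stored id is 3, so crawl 4 next

echo $first, "\n", $next, "\n";
```

Using MAX() instead of ORDER BY ... LIMIT 1 means the query always returns exactly one row, so the empty-table case is handled by a single null check.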
Solution to 3.
Really simple this one. At the end of the script you need to run it again. What better way than just redirecting the browser to the same page again?
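if (mysql_affected_rows() == 1) {
    // The page was stored, so reload the script to crawl the next one
    header('Location: http://localhost/campcrawl/index.php?id=' . $last_entry);
    exit; // stop here so nothing else runs after the redirect
}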
Still one little tweak though: browsers tend to limit the number of times a page can be redirected, in order to avoid infinite loops. In Firefox, to override this go to about:config and change network.http.redirection-limit to something really big. If there is a danger of an infinite loop you can limit the number of iterations in your script with a counter or a timeout, but for me it wasn’t a problem as just closing the browser tab stopped my script.
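If you did want the counter, one way to do it is to carry the iteration number along in the redirect URL and stop redirecting once it reaches a limit. The sketch below is an assumption about how that might look, not what my script does: the helper name next_redirect_url, the n query parameter and the limit value are all invented for illustration.

```php
<?php
// Hypothetical helper: returns the URL to redirect to, or null once the limit is reached.
// $iteration is the number of the current run, counting from 0.
function next_redirect_url(string $scriptUrl, int $iteration, int $maxIterations): ?string
{
    if ($iteration + 1 >= $maxIterations) {
        return null; // this was the last allowed run, so do not redirect again
    }
    return $scriptUrl . '?n=' . ($iteration + 1);
}

// In the crawl script itself this would be used roughly as:
// $n = isset($_GET['n']) ? (int) $_GET['n'] : 0;
// $next = next_redirect_url('http://localhost/campcrawl/index.php', $n, 500);
// if ($next !== null) { header('Location: ' . $next); exit; }

$continue = next_redirect_url('http://localhost/campcrawl/index.php', 0, 3);
$stop = next_redirect_url('http://localhost/campcrawl/index.php', 2, 3);
var_dump($continue, $stop);
```

With a limit of 3 the script runs for n = 0, 1 and 2 and then stops, so the crawl ends even if the tab is left open.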
This technique is workable even though it needs a browser to run the script: with today’s multi-tab browsers, and with all the calculation done on the server, it doesn’t get in the way of whatever else you’re using the browser for. The one exception is occasionally having to refresh the tab running the script, as sometimes it stops for seemingly no reason, but that might just be bad programming from me.
So that’s stage one nearly completed (2,000 out of about 9,000 pages trawled so far, but at this rate it should be finished within the hour). Now on to actually doing something with the data.

