Posts Tagged ‘php’

Delicious, though not so easy to swallow

Thursday, February 11th, 2010

For a long time I’ve wanted to work with the Delicious API. Initially it was because the Delicious website not only had the difficult-to-remember del.icio.us URL, but was also very badly designed. If you compared its progress – addition of new features, cleaning up of design, making use of new techniques such as AJAX – with its Web 2.0 compatriots (Flickr, Digg, boris-johnson.com) it lagged way behind.

So I initially planned to build a new front-end for it, making it easier to work with your bookmarks, but before I could progress far enough in my coding abilities they completely redesigned the site; a vast improvement.

Though still not perfect. For a while I’ve found it frustrating that there is no easy way to see the content of a bookmarked page and, at the same time, delete the bookmark if you deem it no longer useful, so my Delicious account gradually got more and more cluttered. Well, this afternoon I decided to do something about it (and not just because I’m avoiding doing more important stuff).

But I was foiled for a long time by the laziness of the Delicious developers. My initial plan was to use JavaScript to get a JSON list of all my bookmarks (or alternatively request them one at a time) and go through them one by one, displaying the webpage in an iframe, and offering the option to discard or keep the bookmark. However, Delicious only publish this data as XML, which means that, due to cross-domain restrictions on AJAX, you can’t fetch it with JavaScript alone. I may be a bit hasty in pinning this on developer laziness, but I imagine creating alternate templates (because that’s all the difference between JSON and XML really) wouldn’t be too time-consuming, and would greatly enhance the versatility of the API.

Anyway, I realised I would have to use a bit of PHP to get the XML and create pages from which my JavaScript would be able to access the data. Luckily, before I dived straight in, I came across phpdelicious (which, appropriately, I have now bookmarked in Delicious), a very easy-to-use PHP class for wrapping the Delicious API, which is very handy indeed. Less than an hour later I had built exactly what I wanted.
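As a rough illustration of the "PHP fetches the XML, JavaScript reads the result" idea, here is a minimal sketch that converts a bookmarks XML document into JSON. It does not use phpdelicious; the function name posts_xml_to_json is made up, and the XML shape (a posts element containing post elements with href, description and tag attributes) is an assumption about the feed, with invented sample data standing in for a real API response.

```php
<?php
// Hypothetical helper: turns a Delicious-style <posts> XML document into JSON.
// The attribute names (href, description, tag) are assumptions about the feed.
function posts_xml_to_json(string $xml): string
{
    $posts = simplexml_load_string($xml);
    if ($posts === false) {
        // Unparseable XML: hand the JavaScript an empty list.
        return json_encode([]);
    }

    $bookmarks = [];
    foreach ($posts->post as $post) {
        $tag = trim((string) $post['tag']);
        $bookmarks[] = [
            'url'   => (string) $post['href'],
            'title' => (string) $post['description'],
            // Tags are assumed to arrive as one space-separated string.
            'tags'  => $tag === '' ? [] : explode(' ', $tag),
        ];
    }

    return json_encode($bookmarks);
}

// Made-up sample data standing in for the API response.
$sample = '<posts user="someone">'
        . '<post href="http://example.com/" description="Example" tag="demo test"/>'
        . '</posts>';
echo posts_xml_to_json($sample), "\n";
```

Because a page like this would live on the same domain as the JavaScript, the script could request it with an ordinary AJAX call.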

I reckon with a few more hours’ development I can make it a publicly available service. All I need to do is include a form so that other users can log in, and (ideally) preload websites in the iframe to speed things up (though this is problematic, as some sites force the whole web page to be redirected if you try to put them in an iframe).

Learning to crawl before you can run

Wednesday, July 8th, 2009

Crawling websites for data using PHP running in a browser

I’ve had an idea for a website for almost a year now (won’t spill the beans just yet though) and today I finally started work on it. To lift the veil of secrecy a little, I’m putting information about certain places onto a Google Map, because somehow nobody has thought of doing it yet.

All that information about the places is already available on the internet, just not embedded in a map, so my first step was to crawl some websites to get hold of it. A slight problem I had is that this required running a script to automatically trawl through all the pages. I only know PHP, which, as far as I was aware, could only be run in a browser, and browsers time out after a while, so it would be impossible to just leave it running.

A little more research revealed that it is possible to run PHP scripts as stand-alone entities outside the browser, but only using something called CGI. I had no idea what CGI was, and my hosting company don’t allow you to use CGI anyway. But I did manage to find another solution.

Although a PHP script crawling lots of pages for data would cause a browser timeout, a script crawling just one page almost certainly wouldn’t. So what I had to do was:

  1. Write a script that crawls the data of one page but…
  2. … checks what page was last crawled and moves on to the next one when it starts and…
  3. … tells the script to execute again once it’s finished.

1. will differ depending on your needs, but my solutions to 2. and 3., I believe, provide a good technique to crawl web pages if you only know PHP.
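For what it’s worth, step 1 might look something like the sketch below. It is only an illustration: the function name extract_page_data is made up, and the idea that the useful data sits in the page’s title and first h1 is an assumption, since what you extract depends entirely on the site you are crawling.

```php
<?php
// Hypothetical step-1 helper: pulls the <title> and first <h1> out of a page.
function extract_page_data(string $html): array
{
    $empty = ['title' => null, 'heading' => null];
    if (trim($html) === '') {
        return $empty; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $empty;
    }

    $title = $doc->getElementsByTagName('title')->item(0);
    $heading = $doc->getElementsByTagName('h1')->item(0);

    return [
        'title'   => $title ? trim($title->textContent) : null,
        'heading' => $heading ? trim($heading->textContent) : null,
    ];
}

// Made-up page standing in for one crawled URL.
$html = '<html><head><title>Some Place</title></head>'
      . '<body><h1>Some Place, Somewhere</h1></body></html>';
print_r(extract_page_data($html));
```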

Solution to 2.

Each iteration will presumably write the information to a database. Provided you’re iterating over an integer (e.g. the webpages are of the form http://www.thesite.com/thepage?id=theinteger) then you’ll probably be storing that integer in your database. Then the following code at the beginning of your script will advance you to the next web page to crawl.

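 // Find the highest id already stored, i.e. the last page crawled.
 $last_entry_query = @mysql_query("SELECT theinteger FROM thetable ORDER BY theinteger DESC LIMIT 1");
 if ($last_entry_query && mysql_num_rows($last_entry_query)) {
    $last_entry_row = mysql_fetch_array($last_entry_query, MYSQL_ASSOC);
    $last_entry = (int) $last_entry_row['theinteger'];
 } else {
    // Nothing crawled yet: use one less than the first id to start the script off
    // (0 here assumes the site's ids start at 1; adjust for your site).
    $last_entry = 0;
 }
 // Fetch the next page as an array of lines.
 $current_page = file('http://www.thesite.com/thepage?id=' . ($last_entry + 1));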

You can stop and start the crawl whenever you like as the database will always tell you which page to crawl next.
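The same "ask the database where to resume" idea can be sketched with PDO instead of the mysql_* functions, which no longer exist in current PHP versions. This is a swapped-in variant, not the code I ran: the function name next_page_id and the content column are made up, and an in-memory SQLite database stands in for the real one so that the example is self-contained.

```php
<?php
// Returns the id of the next page to crawl: one past the highest stored id,
// or $first_id when the table is still empty.
function next_page_id(PDO $db, int $first_id): int
{
    $max = $db->query('SELECT MAX(theinteger) FROM thetable')->fetchColumn();
    return $max === null ? $first_id : (int) $max + 1;
}

// In-memory SQLite stands in for the real database (an assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE thetable (theinteger INTEGER PRIMARY KEY, content TEXT)');

echo next_page_id($db, 1), "\n"; // empty table: prints 1

$insert = $db->prepare('INSERT INTO thetable (theinteger, content) VALUES (?, ?)');
$insert->execute([1, 'first page']);
$insert->execute([2, 'second page']);

echo next_page_id($db, 1), "\n"; // two pages stored: prints 3
```

Passing the first id in as a parameter replaces the "one less than the first entry" trick, but the effect is the same: an empty table starts the crawl at the beginning.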

Solution to 3.

Really simple this one. At the end of the script you need to run it again. What better way than just redirecting the browser to the same page again.

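 // If this run stored a new row, reload the script to crawl the next page.
 // header() only works if nothing has been sent to the browser yet.
 if (mysql_affected_rows() == 1) {
    header('Location: http://localhost/campcrawl/index.php?id=' . $last_entry);
    exit;
 }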

Still one little tweak though; browsers tend to limit the number of times a page can redirect in order to avoid infinite loops. In Firefox, to override this go to about:config and change network.http.redirection-limit to something really big. If there is a danger of an infinite loop you can limit the number of iterations in your script with a counter or a timeout, but for me it wasn’t a problem as just closing the browser tab stopped my script.
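The counter idea could be sketched roughly as follows. The run query parameter, the limit of 500 and the function name next_redirect_url are all hypothetical additions for illustration; my own script didn’t include them.

```php
<?php
// Decides where to redirect next, or returns null once the limit is reached.
// The 'run' query parameter and $max_runs are hypothetical additions.
function next_redirect_url(string $base_url, int $run, int $max_runs): ?string
{
    if ($run >= $max_runs) {
        return null; // stop: the chain of redirects has run long enough
    }
    return $base_url . '?run=' . ($run + 1);
}

// Read the counter carried over from the previous redirect (0 on the first run).
$run = isset($_GET['run']) ? (int) $_GET['run'] : 0;
$next = next_redirect_url('http://localhost/campcrawl/index.php', $run, 500);

// Only send the redirect when running under a web server.
if ($next !== null && PHP_SAPI !== 'cli') {
    header('Location: ' . $next);
    exit;
}
echo $next === null ? "Limit reached, stopping.\n" : "Would redirect to $next\n";
```

Each redirect carries the counter forward in the URL, so the script stops by itself after a fixed number of pages even if the tab is left open.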

This technique is workable because, even though it needs a browser to run the script, today’s multi-tab browsers and the fact that all the calculation is done on the server mean that it doesn’t infringe on whatever else you’re using the browser for (aside from maybe occasionally having to refresh the tab running the script, as sometimes it stops for seemingly no reason, but that might just be bad programming from me).

So that’s stage one nearly completed (2,000 out of about 9,000 pages trawled so far, but at this rate it should be finished within the hour). Now on to actually doing something with the data.