Please visit my new campsite listing site ukcampingmap.co.uk


Archive for July, 2009

Candy floss UK

Tuesday, July 14th, 2009

No, this isn’t an artist’s impression of the UK trapped in the midst of a swine flu epidemic, but is in fact what all my geocoded points look like plotted on a map of the UK*. All 4,528 of them. Now I just have to figure out how to make this look presentable, i.e. feed the points to the map gradually as the user zooms in. I bet this is where things will really start to get difficult.

UK sites

*with apologies to Cornwall and northern Scotland

Anarchy in the UK

Monday, July 13th, 2009

This damn economic crisis/swine flu outbreak isn’t quite that bad yet, but nevertheless there is a very limited sense where the UK is quite anarchic: geocoding addresses using Google Maps.

Having completed my download of addresses for my new Google Maps website the next stage was to geocode them so that I can plot them on the map. I had no idea how tricky it would be when I started out.

The most irritating and fundamental difficulty is that geocodes for UK postcodes are not available for free. The data is owned by the Royal Mail, and there is at least one website where you can buy access to this information (it has a free trial, but I discovered that this only covers about 10 geocodes). You can search by postcode on Google Maps, but if you put a postcode, e.g. LL13 7YH, into the geocoder API you’re given the geocode for LL13 7 – not accurate enough to be of any real use.

So you have to go for geocoding full addresses instead. The geocodes for this data aren’t owned by Royal Mail, but by the Ordnance Survey, and for some reason they are less restrictive about sharing the information. But there’s still a long hard slog before you can get the geocodes out of this.

Google offer a really useful tutorial on geocoding addresses, and this, combined with my approach to iterating over a large number of records, meant I was collecting the geocodes in no time. However, it wasn’t as peachy as it seemed.

For example, the address Llantysilio, Denbighshire, UK brings up a pretty accurate geocode for the village of Llantysilio in North Wales. However, the full address, including the postal town is Llantysilio, Llangollen, Denbighshire, UK, and this unexpectedly brings up the geocode for an address on Castle Street, right in the middle of Llangollen. So a more complete address leads to a far less accurate geocode. This is immensely problematic.

In general I was feeding in the longest possible address made up out of the data I had, so in my php script I had something like the following:

while (count($arr_address) > 1 && !$str_lat)
{
 // Longest address first; attempt_geocode() is expected to set $str_lat on success
 $str_address = implode(', ', $arr_address);
 attempt_geocode($url.$str_name.', '.$str_address.', '.$str_county.', UK');
 attempt_geocode($url.$str_address.', '.$str_county.', UK');
 attempt_geocode($url.$str_address.', UK');
 array_pop($arr_address); // drop the last address line before the next pass
}

This starts with the longest, most detailed address string, and then gradually cuts the string down (possibly sacrificing accuracy in order to get a passable geocode), with attempt_geocode() exiting the loop on success.

But the fact that longer addresses can lead to incorrect geocodes meant I had to work in a way to start off with shorter addresses and, only if those don’t get a geocode, fall back to the longest address and gradually shorten it. So I now have:

// Short guesses first: first address line alone, then first line plus last line
attempt_geocode($url.$arr_address[0].', '.$str_county.', UK');
attempt_geocode($url.$arr_address[0].', '.end($arr_address).', '.$str_county.', UK');
while (count($arr_address) > 1 && !$str_lat)
{
 // Then the full address, shortened by one line on each pass
 $str_address = implode(', ', $arr_address);
 attempt_geocode($url.$str_address.', '.$str_county.', UK');
 attempt_geocode($url.$str_address.', UK');
 array_pop($arr_address);
}

A long process in order to get a geocode that could still quite likely be wrong, and even if it’s basically correct might not be as accurate as a postcode; but nevertheless an improvement on what I had before.
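The order in which addresses get tried can be sketched on its own, away from the PHP and the geocoder. This is only a rough JavaScript illustration of the ordering described above: buildCandidates and its parameters are names made up for the example, and unlike the PHP it skips a string it has already listed.

```javascript
// Hypothetical helper: lists the address strings to try, in order.
// parts  - address lines, most specific first (e.g. village, postal town)
// county - county name appended to most attempts
function buildCandidates(parts, county) {
  const remaining = parts.slice(); // copy so the caller's array is untouched
  const candidates = [];
  const add = (pieces) => {
    const text = pieces.concat('UK').join(', ');
    if (!candidates.includes(text)) candidates.push(text); // skip repeats
  };

  // Short guesses first: first line only, then first line plus last line.
  add([remaining[0], county]);
  add([remaining[0], remaining[remaining.length - 1], county]);

  // Then the full address, dropping the last line on each pass.
  while (remaining.length > 1) {
    add(remaining.concat(county));
    add(remaining);
    remaining.pop();
  }
  return candidates;
}

const candidates = buildCandidates(['Llantysilio', 'Llangollen'], 'Denbighshire');
console.log(candidates);
```

For the Llantysilio example the short form "Llantysilio, Denbighshire, UK" comes out first, which is the one that geocoded accurately.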

A glimmer of hope though is that Google Maps itself doesn’t suffer from this issue – both address versions return the same accurate point on the map, and as someone pointed out to me on Stack Overflow, Google Maps is in beta, so maybe the geocoder API just hasn’t been updated to the newer, better address parser, and maybe one day reliable geocoding for free in the UK will be a reality. Also, somebody has found a way to geocode in the UK using postcodes, by hacking together the Google Maps and search APIs, and I may well try it, as this address geocoding malarkey leaves a lot to be desired. (*edit – turns out it’s heavily reliant on JavaScript so can’t be used for geocoding masses of pages without slowing down your browser.)

Finally, if this article wasn’t any help, there’s loads of geocoding links here.

jQuery.each() for single objects

Thursday, July 9th, 2009

While refining jQuery.crossselect.js recently I was briefly faced with a problem which often rears its ugly head, though this time I found a solution.

Consider a function alters_item(), which can be applied to certain DOM elements. Further, consider that it can be triggered in two distinct ways:

  1. By a click (or other event) on the item to be altered, so the DOM element is the context and can be accessed via the pseudo-variable this
  2. Just applied like a normal JavaScript function, which means the DOM element needs to be passed in as a parameter, i.e. you need to call the function using alters_item(element)

So to have a function usable in both circumstances I would write some conditionals at the start which check if an argument has been passed, if it hasn’t then set var element = this, etc…

But there is another way.

For the second case instead of

alters_item(element)

we can write

$(element).each(alters_item)

because jQuery.each works even on jQueries that return only one object.

Doing this is a bit of a trade off – the second line of code I bet takes measurably longer to execute, but it does mean my functions get to be simpler, so it’s my weapon of choice at the moment.

But it does make me think that jQuery should have a call() method, that runs a function on an object, but also setting the object as the context.

Incidentally, if anyone knows of a better way of dealing with this issue than the one I’ve found, please let me know.
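For comparison, plain JavaScript functions already carry a built-in call() method (Function.prototype.call) that runs the function with a chosen object as this, which looks close to what I was wishing for. A minimal sketch, with a plain object standing in for a DOM element and altersItem as a made-up function:

```javascript
// Written once, always reading its target from `this`.
function altersItem() {
  this.altered = true;
}

// Stand-in for a DOM element (assumption: a plain object is enough here).
const element = { id: 'item-1', altered: false };

// Case 2 from the list above: call it directly, passing the element as the context.
altersItem.call(element);
console.log(element.altered);
```

The same function can still be bound as an event handler for case 1, since there the element arrives as this anyway.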

Learning to crawl before you can run

Wednesday, July 8th, 2009

Crawling websites for data using php running in a browser

I’ve had an idea for a website for almost a year now (won’t spill the beans just yet though) and today I finally started work on it. To lift the veil of secrecy a little, I’m putting information about certain places onto a Google Map, because somehow nobody has thought of doing it yet.

All that information about the places is already available on the internet, just not embedded in a map, so my first step was to crawl some websites to get hold of all that information. A slight problem I had is that this required running a script to automatically trawl through all the pages. I only know PHP, which, as far as I was aware, could only be run in a browser, and browsers time out after a while, so it would be impossible to just leave it running.

A little more research revealed that it is possible to run PHP scripts as stand-alone entities outside the browser, but only using something called CGI. I had no idea what CGI was, and my hosting company don’t allow you to use CGI anyway. But I did manage to find another solution.

Although a php script crawling lots of pages for data would cause a browser timeout, a script crawling just one page almost certainly wouldn’t. So what I had to do was:

  1. Write a script that crawls the data of one page but…
  2. … checks what page was last crawled and moves on to the next one when it starts and…
  3. … tells the script to execute again once it’s finished.

1. will differ depending on your needs, but my solutions to 2. and 3., I believe, provide a good technique to crawl web pages if you only know PHP.

Solution to 2.

Each iteration will presumably write the information to a database. Provided you’re iterating over an integer (e.g. the webpages are of the form http://www.thesite.com/thepage?id=theinteger) then you’ll probably be storing that integer in your database. Then the following code at the beginning of your script will advance you to the next web page to crawl.

 // Find the highest id stored so far
 $last_entry_query = @mysql_query("SELECT theinteger FROM thetable ORDER BY theinteger DESC LIMIT 1");
 if (($last_entry_query) && mysql_num_rows($last_entry_query)) {
    $last_entry_row = mysql_fetch_array($last_entry_query, MYSQL_ASSOC);
    $last_entry = $last_entry_row['theinteger'];
 } else {
    // Nothing stored yet: one less than the first entry (assumed here to be 1) starts the script off
    $last_entry = 0;
 }
 // Fetch the next page as an array of lines
 $current_page = file('http://www.thesite.com/thepage?id='.($last_entry+1));

You can stop and start the crawl whenever you like as the database will always tell you which page to crawl next.
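The resume logic is small enough to sketch without a database at all. In this rough JavaScript version an array stands in for the table, and nextPageId, pageUrl and firstId are names invented for the example:

```javascript
// Hypothetical helper: works out which page id to crawl next.
// storedIds - ids already saved (stands in for the database table)
// firstId   - id of the first page, used when nothing is stored yet
function nextPageId(storedIds, firstId) {
  if (storedIds.length === 0) return firstId;
  return Math.max(...storedIds) + 1;
}

// Assumed URL pattern, copied from the example address above.
function pageUrl(id) {
  return 'http://www.thesite.com/thepage?id=' + id;
}

console.log(pageUrl(nextPageId([], 1)));        // fresh start
console.log(pageUrl(nextPageId([1, 2, 3], 1))); // resuming after three pages
```

Because the next id is always derived from what has been stored, nothing else needs remembering between runs.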

Solution to 3.

Really simple this one. At the end of the script you need to run it again. What better way than just redirecting the browser to the same page again.

 if (mysql_affected_rows() == 1) {
    // The last insert worked, so reload the script to crawl the next page
    header('Location: http://localhost/campcrawl/index.php?id='.$last_entry);
    exit;
 }

Still one little tweak though; browsers will tend to limit the number of times a page can be redirected in order to avoid infinite loops. In Firefox, to override this go to about:config and change network.http.redirection-limit to something really big. If there is a danger of an infinite loop you can limit the number of iterations in your script with a counter or a timeout, but for me it wasn’t a problem as just closing the browser tab stopped my script.
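The counter idea looks something like the sketch below. It is not what I ran: in the PHP version each pass is a redirect, so the counter would have to travel along in the query string, whereas here it is an ordinary JavaScript loop, and crawlWithLimit and pageExists are invented names.

```javascript
// Hypothetical driver: repeats one crawl step until it fails or a cap is hit.
// crawlOne(id) should return true when the page was saved, false otherwise.
function crawlWithLimit(crawlOne, startId, maxPages) {
  let id = startId;
  let crawled = 0;
  while (crawled < maxPages && crawlOne(id)) {
    crawled += 1;
    id += 1;
  }
  return crawled;
}

// Pretend only pages 1 to 5 exist.
const pageExists = (id) => id <= 5;
console.log(crawlWithLimit(pageExists, 1, 3));   // stopped by the cap
console.log(crawlWithLimit(pageExists, 1, 100)); // stopped by running out of pages
```

Either stopping condition ends the run, so a bad page can no longer send the script round forever.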

This technique is workable even though it needs a browser to run the script: with today’s multi-tab browsers, and with all the calculation done on the server, it doesn’t get in the way of whatever else you’re using the browser for (aside from maybe occasionally having to refresh the tab running the script, as sometimes it stops for seemingly no reason, but that might just be bad programming from me).

So that’s stage one nearly completed (2,000 out of about 9,000 pages trawled so far, but at this rate it should be finished within the hour). Now on to actually doing something with the data.

What’s up with wordpress?

Sunday, July 5th, 2009

My old blog was a blogspot blog, and while using Blogger to write posts I had very few complaints. But when it came to setting up this blog I plumped for WordPress, mainly on reputation; wherever I turned I would read phrases along the lines of “while not a full-featured publishing platform, Blogger does offer an impressive, easy-to-use interface for the novice blogger,” and damn me if I was going to be labelled a novice!

I have, by and large, though, been very impressed with WordPress. As a front-end developer, I particularly like the fact that it very rarely gets confused about what to do when you do something (e.g. delete some HTML, which leaves some tags unexpectedly unclosed) that would drive Blogger potty. And the new version of the admin interface (from about version 2.6 onwards) is a staggering improvement on the old version and on Blogger.

However it has one irritating feature: it constantly breaks.

Since upgrading I’ve had the WYSIWYG text editor break several times (fixed by deleting and then recopying the wp-includes folder), saving posts break once (fixed by overwriting the wp-admin folder), and for a long period of time I couldn’t log out of admin (no idea how this was fixed – it just seemed to go away of its own accord). It’s immensely frustrating (though not frustrating enough to cause me to abandon WordPress).

It’s probably a flaw in the WordPress automatic upgrade plugin, but I’m loath to abandon that too as it takes the pain out of making tea, so to speak. Though I can’t help thinking it may be a false economy as in the end I have to manually overwrite files anyway.

This phoku is throwing the baby out with the bathwater

Saturday, July 4th, 2009

Blue tits once nested
Now planning permission sought
For more tidied flats

Canal side dishevelled building

Optimize how?

Thursday, July 2nd, 2009

Yesterday and today I’ve been rewriting my jQuery crossSelect plugin (probably over half of the code has changed) to: a) fix the serious bugs brought about by trying to bring my plugin closer in line with how it’s supposed to be done, without fully understanding the implications in advance; b) make the code more efficient, in part applying the ideas in this excellent article; and c) prepare the code for bringing in more functionality in later releases.

With regard to c), the main thing I needed to do was rewrite all my selection and removal functions so that moving many items into the selected column at once could just be the move-one-item function iterated a number of times. I’ve now (I think) found a pretty efficient solution (each move-many function is only 3 lines long), but along the way I came across an interesting dilemma.

My selectOne() function essentially moves a list item and then checks how many items are in each list before adjusting the buttons appropriately. Now, to do a selectAll() or a selectMany() the obvious thing to do is just to iterate that selectOne() function over all list items – just a handful of lines of code – … but this unfortunately leads to a less efficient (and probably slower) function. Writing a selectAll()/selectMany() function from scratch would enable me to only adjust the buttons once and, in the selectAll() case, not have to care about tracking which list item I’m dealing with as they all get moved over in the end… but this way would not only be less elegant, I feel, but also lead to more lines of code.

I’d always assumed that optimising code meant two things – faster and smaller – and I’d always thought that one more or less implies the other. Turns out I was wrong.

In the end, the escape from this trade off involved removing the button adjustment from selectOne() and putting it in selectNow(), a new function triggered by a click. But this required a feature of jQuery which I’ll talk about in some other post.
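The shape of that split can be shown without jQuery or a DOM. This is only a sketch of the idea, not the plugin’s code: plain arrays stand in for the two lists, a counter stands in for the button adjustment, and createCrossSelect, adjustButtons and buttonUpdates are names made up for the example.

```javascript
// Plain arrays stand in for the two lists; all names are illustrative.
function createCrossSelect(items) {
  const state = { available: items.slice(), selected: [], buttonUpdates: 0 };

  // Moves one item and nothing else.
  function selectOne(item) {
    const index = state.available.indexOf(item);
    if (index === -1) return;
    state.available.splice(index, 1);
    state.selected.push(item);
  }

  // Stand-in for enabling/disabling the buttons.
  function adjustButtons() {
    state.buttonUpdates += 1;
  }

  // Click handler: move one item, then adjust.
  function selectNow(item) {
    selectOne(item);
    adjustButtons();
  }

  // Move many: iterate selectOne over a copy, adjust once at the end.
  function selectMany(list) {
    list.slice().forEach((item) => selectOne(item));
    adjustButtons();
  }

  const selectAll = () => selectMany(state.available);

  return { state, selectNow, selectMany, selectAll };
}

const widget = createCrossSelect(['a', 'b', 'c', 'd']);
widget.selectNow('a'); // one click: one move, one button adjustment
widget.selectAll();    // three moves, still only one button adjustment
console.log(widget.state.selected, widget.state.buttonUpdates);
```

The many-item functions stay short because they reuse selectOne(), and the buttons are only touched once per user action.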