Thursday, October 18, 2012

New ISTI dataset - duplicates

This is my third post on the new beta release of the ISTI temperature database. In the first post, a Google Maps display, I noticed a number of stations which appeared to be duplicates. So I thought I'd check more comprehensively.

I first ordered the inventory alphabetically by name. A complication here is that 430 stations have no name. Some of those still showed up as duplicates.

The next step was to collect pairs of adjacent stations whose data began in the same or adjacent years. Then I did a rough distance check and retained pairs for which the sum of the absolute latitude and longitude differences was less than 1°. That's within about 111 km at most near the equator (1° of arc), requiring greater closeness near the poles. In fact most pairs at this stage have near-identical coordinates.

That left 1077 pairs. I've made a list as a zipped CSV file here.
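
For anyone who wants to try the same kind of screen, here is a minimal Python sketch of the steps described above. It is not the code that produced the list; the Station record, its field names and the sample entries are invented for illustration, and it ignores longitude wrap-around at ±180°.

```python
from dataclasses import dataclass


@dataclass
class Station:
    # Hypothetical record layout; the real inventory has more fields.
    name: str
    lat: float
    lon: float
    first_year: int


def candidate_duplicates(stations, max_deg=1.0):
    """Flag alphabetically adjacent stations that start in the same or an
    adjacent year and whose |dlat| + |dlon| is below max_deg degrees."""
    ordered = sorted(stations, key=lambda s: s.name.upper())
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        starts_match = abs(a.first_year - b.first_year) <= 1
        # Rough distance check in degrees (no wrap-around handling).
        is_close = abs(a.lat - b.lat) + abs(a.lon - b.lon) < max_deg
        if starts_match and is_close:
            pairs.append((a, b))
    return pairs


# Made-up entries, not taken from the ISTI inventory.
stations = [
    Station("ALPHA", 10.00, 20.00, 1950),
    Station("ALPHA", 10.01, 20.02, 1951),            # flagged
    Station("BRAVO", 45.0, 5.0, 1900),
    Station("BRAVO", 45.0, 5.0, 1930),               # start years too far apart
    Station("CHARLIE", -30.0, 150.0, 1960),
    Station("CHARLIE AIRPORT", -30.9, 150.9, 1960),  # coordinates too far apart
]

pairs = candidate_duplicates(stations)
for a, b in pairs:
    print(a.name, a.first_year, "<->", b.name, b.first_year)
```

Because only alphabetically adjacent entries are compared, a screen like this is cheap but can only find duplicates whose names sort next to each other.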

There will be some missing. I suspect Vienna/Wien are duplicates, but are missed alphabetically. The two Trondheims I noticed are assigned coords too far apart. And of course, my test doesn't prove duplication - just flags for checking.


  1. Hi Nick,

    It's great to see someone is looking at the data and turning over the rocks. The station merge code does include station name in the algorithm (through a Jaccard Index - used by postal services to work out when someone goofed on an address!) along with lat / lon and (where available) elevation. Then data matching. Further details are in the readme file and being written up for a journal article. It, of course, depends upon the reliability of the geolocation metadata in the source decks. One thing we are doing is developing a blacklist facility where known issues can be looked up. We'll take a look at your spreadsheet.

    As a terminally busy, stretched team, it would be great if you could alert us to posts on the databank issues you (or others) find, either at the surfacetemperatures blog or by email. Otherwise it's kind of down to happenstance on our part whether we find posts by third parties, I'm afraid. Yet this stuff, and the perspectives that those less in the weeds than we are bring, are hugely valuable in developing this resource for the benefit of all.

    This is the precise reason behind a beta release for three months so that we can get constructive feedback to improve the first version release before it is locked down.


    1. Peter,
      Yes, I'd planned to do some more analysis and send you a note, but a stomach bug kept me quiet for a while. I'll be prompter with alerts in the future.

      I think your approach is just fine - get a lot of data together (the basic resource) and then see what people find in it.

  2. Just to note that we have now incorporated the station start date as a piece of information going into the merge decision-making algorithm. This should have reduced the propensity to bleed long-term stations present in multiple source decks into multiple neighbours which is what, after some digging, we concluded to be happening in the cases you highlighted for us. This serves to reduce an artificial inflation in the station count that gets worse further back in time. There are other changes too with respect to sources and some thresholds. There is more at the second beta version release. The net impact is relatively small all in all compared to the first release in terms of 'headlines' folks care about (station count, timeseries behaviour at the global mean etc.) but I think it's technically preferable. There is still a further month left to turn over rocks before we freeze a first version.

    1. Thanks Peter. Good to hear there's a new, more settled version - I'll try some analysis in preparation for the frozen version.