Wednesday, July 21, 2010

Spatial coverage of the GHCN and GSOD station sets

The Greenland study with the NOAA GSOD surface temperature dataset looked very promising. There are many more GSOD stations than GHCN, especially in modern times. But it turned out that they weren't very well distributed, being mainly along the west coast. In that way GHCN did better, and its fewer stations may have performed at least as well as GSOD.

So I tried to compare more closely for other parts of the world. It's a big and complex picture, but this may be a start.

The coverage of both GHCN and GSOD has varied a lot over the years. GHCN had many more stations in the decades before 1992; GSOD has not declined in numbers recently, but numbers varied between 1940 and 1972, and there is very little before 1940. I decided to focus on the big contrast in the numbers of stations recently reporting, where recent means since the start of 2008.

Here are the side by side plots. I tried superimposing, but there is too much overlap. I tried combining images using color addition, but found it too hard to get the pixels aligned, so the result was blurry. Side-by-side worked out best.

Many people have noted the Africa gaps in GHCN. GSOD fills some but not all - esp Zaire.

GSOD certainly has more Antarctic data. It will be interesting to see if it has stations that GISS does not cover.

GSOD has much denser coverage of both China and India, although GHCN may well be adequate. Tibet is a gap in both.

GSOD does better in the Eastern arid regions, but the Western desert is still sparse.

GSOD fills some gaps, and there's overkill as well. N Scandinavia is shown on the Arctic map.
GSOD is generally better. Curious concentrations in Alberta/Sask and in Cuba.

Generally better coverage, even in NE Siberia.
Well, we have Bolivia covered. GSOD is still a bit sparse in the Amazon.

A general observation is that GHCN does seem to have a well chosen distribution. One would generally think that the tendency of GSOD to cluster should do no harm, but in fact it does put more pressure on the weighting system to make sure that the very dense regions don't disproportionately affect the result. Gridding limits the damage that could be done.
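
To make the point about gridding concrete, here is a minimal sketch of cell averaging. It is not the code behind the maps or any particular analysis; the function name `gridded_mean`, the 5 degree cell size, and the cosine-of-latitude area weight are all assumptions made for illustration, and the station values are synthetic.

```python
import math
from collections import defaultdict


def gridded_mean(stations, cell_deg=5.0):
    """Average (lat, lon, value) stations by grid cell, then combine cells.

    Each occupied cell contributes one value (the mean of its stations),
    weighted by cos(latitude of the cell centre) as a rough proxy for area.
    Hypothetical helper for illustration only.
    """
    n_rows = int(180.0 / cell_deg)
    n_cols = int(360.0 / cell_deg)
    cells = defaultdict(list)
    for lat, lon, value in stations:
        # floor() keeps negative coordinates in the right cell;
        # min() puts lat = +90 into the top row instead of off the grid.
        row = min(int(math.floor((lat + 90.0) / cell_deg)), n_rows - 1)
        col = int(math.floor((lon + 180.0) / cell_deg)) % n_cols
        cells[(row, col)].append(value)

    weighted_sum = 0.0
    weight_total = 0.0
    for (row, _col), values in cells.items():
        centre_lat = -90.0 + (row + 0.5) * cell_deg
        weight = math.cos(math.radians(centre_lat))
        weighted_sum += weight * sum(values) / len(values)
        weight_total += weight
    return weighted_sum / weight_total


# Synthetic example: nine clustered stations reading 1.0 in one cell,
# and a single station reading 0.0 in another cell at the same latitude.
cluster = [(52.1 + 0.1 * i, -113.0, 1.0) for i in range(9)]
isolated = [(52.5, -80.0, 0.0)]
stations = cluster + isolated

simple_mean = sum(v for _, _, v in stations) / len(stations)
cell_mean = gridded_mean(stations)
```

In the plain station average the cluster counts nine times; in the gridded average it counts once, the same as the isolated station. Gridding does not help with cells that have no stations at all, which is the sparsity side of the problem.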


  1. Lots of overkill, but some areas are helped out. Angola, Namibia, DR Congo (former Zaire) and Madagascar are looking better.

    The US GHCN map is a bit deceptive, since it's a lot denser with the USHCN tacked on.

    This is a good beginning. Some possible ideas for focusing in:
    To judge how well this fills in the 1990 station dropout, you could show GHCN and GSOD maps for only those stations which have data for 80% of the months between 1991 and 2008, or something like that. Data in a single year is nice, but that doesn't tell you if there's enough there to be usable.

    Or, take the list of stations that dropped out of GHCN in 1990, and see how many of them have 80% or so data after 1990 in GSOD. Ron may have already done this?

    One word of caution is that where GHCN and GSOD give data for the same station at the same time, I would not expect a perfect match. GSOD would have less or maybe zero QC. There's also the question of whether the source countries calculated the monthly means exactly the same way Ron did. They usually would, but not always.

  2. oh, and the well chosen distribution of GHCN is to some extent by design. I think it was the NCDC guys who picked out a network of stations that they'd really want to get CLIMATs from, with this in mind. If all those stations actually reported on time, there wouldn't be any holes in coverage.

  3. The beatings will continue until morale improves.

    Recall, carrot, that the GSOD is a summary and does not include hourly data. You have only daily min, max, and mean to choose from in GSOD. I think the choice to take the monthly mean as the mean of the daily means ( mon_mean = mean(daily_mean,na.rm=T) ) is the only sensible thing to do.

    I do have evidence that when hourly data is available, GHCN will sometimes use the mean of the daily data rather than Tmean = (Tmax + Tmin)/2.

  4. BTW, I really like this post, Nick. Nice job on the regions.

    Is there a way to quantify the spatial distribution? Mean/Max/Min distance between stations? Maximum area of unfilled polygon? There's got to be some kind of network analysis that provides that info ...

  5. Ron,

    I didn't mean you were making a different choice. For as you point out, you have no choice to make.

    Rather, my point was that the individual countries may take the mean of 4 daily measurements, or something like that, instead of the max/min average. In which case they wouldn't match what you got, through no fault of yours.

    Which is basically repeating everything you just said.

  6. Thanks, Ron. I don't know much about the network analysis possibilities, but R has a huge number of packages which may offer something. I could do a fine grid density estimate (say 1 deg), smooth it a bit and plot as a continuum variable with colors.

    But it's a question of what kind of thing we want to know. I think you'd want to focus on quantifying sparsity rather than density, which I guess is where your suggestions are directed.

    Anyway, I'll check what R has - there may even be something in Steven's raster package.

  7. Strike my comment above about a case where GHCN chose the mean of hourly over Tmean = (Tmax + Tmin)/2. I switched the data sets in my head driving home. In the case of the Kuska, GHCN chose the max/min method over the mean of hourlies. Sorry for the mix-up.

  8. Was the GHCN choosing anything in that case? I thought they just took whatever they found.

  9. mmmm... both NDP040 (min/max) and NDP048 (6- and 3- hourly) seem to be developed at roughly the same time from roughly the same source. GHCN appears to have chosen NDP040. So I *think*, yes, that they had a choice.

  10. Hmm. NDP048 as such came out too late for GHCN v1, but in time for GHCN v2. Yet v1 seems to list the original source, as "243 station temperature database for the USSR", Research Institute for Hydro-meteorological Information, Obninsk, Russia.

    The documentation for GHCN v2 doesn't mention what they do with sources that give 4x daily measurements, instead of pre-calculated monthly means. Maybe they just didn't use such sources, if monthly means were available. Maybe the GHCN v1 docs would help on that question. They do avoid synoptic-sourced data that hasn't been QCed.
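
For readers following the thread, the monthly-mean definitions being discussed can be sketched as below. This is only an illustration with synthetic numbers, not GHCN or GSOD code; the function names are hypothetical, and `None` stands in for a missing day in the same spirit as `na.rm=T` in the R expression quoted in comment 3.

```python
def mean_of_daily_means(daily_means):
    """Monthly mean as the mean of the available daily means (skips None)."""
    present = [v for v in daily_means if v is not None]
    return sum(present) / len(present)


def minmax_monthly_mean(daily_tmax, daily_tmin):
    """Monthly mean of (Tmax + Tmin)/2, using days where both are present."""
    pairs = [(hi, lo) for hi, lo in zip(daily_tmax, daily_tmin)
             if hi is not None and lo is not None]
    return sum((hi + lo) / 2.0 for hi, lo in pairs) / len(pairs)


def fixed_hour_monthly_mean(readings_by_day):
    """Monthly mean where each day's mean is the average of its fixed-hour
    readings (for example four observations a day)."""
    daily = [sum(day) / len(day) for day in readings_by_day]
    return mean_of_daily_means(daily)


# Synthetic three-day "month", chosen only to show the methods can disagree.
tmax = [8.0, 9.0, 10.0]
tmin = [0.0, 1.0, 2.0]
four_times_daily = [
    [1.0, 0.5, 7.0, 3.5],
    [2.0, 1.5, 8.0, 4.5],
    [3.0, 2.5, 9.0, 5.5],
]

from_minmax = minmax_monthly_mean(tmax, tmin)
from_fixed_hours = fixed_hour_monthly_mean(four_times_daily)
```

With these made-up values the two methods differ by a full degree, which is why records for the same station and month need not match exactly when the sources compute the mean differently.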