Friday, March 18, 2011

Area weighting and a 60-station global temperature.

I've been experimenting more with area-based weighting. I found strict Voronoi tessellation was slow for global coverage. The reason is that one really has to loop over triangles, and in R that is slow. So I investigated a variant in which a triangular mesh is still used, but to get the area associated with each node, the mid-point of each side is connected to the centroid (where the medians meet) instead of to the circumcentre. That is equivalent to apportioning each triangle's area equally among its three nodes. It is also equivalent to using linear finite element basis functions.
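
For concreteness, here is a minimal R sketch of that node weighting - not the code behind these results - assuming the triangulation is already in hand as a matrix tri of node indices (one row per triangle) and station positions xyz on the unit sphere. Flat-triangle areas stand in for the spherical patches, which is adequate for a reasonably fine mesh.

    # Sketch only: apportion a third of each triangle's area to each of its nodes.
    # Assumed inputs (illustrative, not the post's actual data structures):
    #   xyz - n x 3 matrix of station positions on the unit sphere
    #   tri - m x 3 matrix of node indices, one row per mesh triangle
    node_weights <- function(xyz, tri) {
      # edge vectors of every triangle, computed on whole vectors at once
      a <- xyz[tri[, 2], ] - xyz[tri[, 1], ]
      b <- xyz[tri[, 3], ] - xyz[tri[, 1], ]
      # cross product: its length is twice the (flat) triangle area
      cr <- cbind(a[, 2] * b[, 3] - a[, 3] * b[, 2],
                  a[, 3] * b[, 1] - a[, 1] * b[, 3],
                  a[, 1] * b[, 2] - a[, 2] * b[, 1])
      area <- 0.5 * sqrt(rowSums(cr^2))
      # each node collects a third of every triangle it belongs to
      thirds <- tapply(rep(area / 3, 3), as.vector(tri), sum)
      w <- numeric(nrow(xyz))
      w[as.integer(names(thirds))] <- thirds
      w
    }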

It's much quicker, because most of the calculation can be done on whole vectors of nodes. I found that a 9800-node global calculation took nearly a minute per month for the Voronoi meshing, compared with about 2 minutes for a complete lat/lon grid weighting calculation. That slowness isn't prohibitive, but the modified method brings it back to a few seconds per month.

So as a test I decided to revisit the earlier just-60-stations calculation. Looking back, I was actually surprised that that calculation worked as well as it did, because the stations were not evenly distributed, and there was no weighting that effectively discriminated by area. The usual grid-based weighting was used, but because the sparse stations were generally one to a cell, the only differentiation was based on grid cell area, which was just a function of latitude.
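
For a single station alone in a lat/lon cell, that weight is essentially a cosine-of-latitude factor. A small illustration in R (cell_weight is just a made-up name, not from the original code):

    # Relative weight of one station occupying a cell spanning lat1..lat2 degrees
    # of latitude and dlon degrees of longitude: the slice of sphere it covers.
    cell_weight <- function(lat1, lat2, dlon = 5)
      dlon * (pi / 180) * (sin(lat2 * pi / 180) - sin(lat1 * pi / 180))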

In that earlier calc I wanted to stick to a simple objective criterion for station selection. This was that the stations be rural, have at least 90 years of data, and still be reporting in 2009. That cut the number of stations down to 61.

Now I can use the weighting in the selection process too. It's still objective - the actual temperatures are not consulted. What I do is take a larger set (I relaxed the record-length requirement to 50 years) and mesh that. There is a big range of weights, so I pick roughly the top two-thirds - the stations covering the largest areas. Then I remesh, select again, and repeat (stages of 170, 120, 90 and 60 stations). This gets down to 60 stations, but more evenly distributed. And they are now properly weighted by area, with no missing areas.
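
In outline, that loop looks something like the sketch below; mesh_stations() stands in for whatever mesher is used, and node_weights() is the area routine sketched earlier, so this is illustrative rather than the actual code.

    # Sketch of selection by area: mesh, weight, keep the stations with the
    # largest areas, remesh, and repeat through the stages 170, 120, 90, 60.
    select_by_area <- function(xyz, stages = c(170, 120, 90, 60)) {
      keep <- seq_len(nrow(xyz))               # start from the relaxed candidate set
      for (n in stages) {
        tri <- mesh_stations(xyz[keep, ])      # hypothetical mesher for the current set
        w <- node_weights(xyz[keep, ], tri)    # area attached to each station
        keep <- keep[order(w, decreasing = TRUE)[1:n]]   # keep the n largest areas
      }
      keep                                     # indices of the final selection
    }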



Here is the plot of the initial selection and mesh, from 4 perspectives. The circles show (by area) the weights attached:

If it looks ragged around the rim, that's because I plotted only triangles where all nodes could be seen - otherwise there are end-of-world effects. The faint lines are the underlying mesh, and the dark lines show the area assigned to each station in the weighting.

Here are the stages of concentration, seen from N America.

And here is the final stage, from other perspectives:

and a lat/lon plot:



For comparison, here is the original 61-station selection:


Finally, here is the time series of numbers of these stations reporting in each year:


Results

Firstly, for comparison, here are the results of the original study compared with CRUTEM3 (land only). The observation was that the same general trend was followed, with of course more noise in the 61-station sample.

Now here is the new plot compared with CRUTEM3 and with GISS Land/Ocean (all to base period 1961-90): (Update - corrected following Anon comment below - the graph I originally showed is here)



It's a lot less noisy. It is also rather closer, in both trend and year-to-year variation, to GISS Land/Ocean than to CRUTEM3. This is interesting, because although it has only 60 land stations, they are weighted for total global coverage, with islands heavily represented. This suggests that the difference between the two data sets may reflect the weighting rather than the kind of measurement.

Next

I think the smoothing achieved with refinement by weighting worked rather well. I'll look at rationalising larger subsets that way.

I also want to look at schemes for avoiding monthly meshing. Creating these rather larger cells around a few stations means that stations could be weighted not by their individual areas but by their share of a larger cell's area. Since several stations would inhabit such a cell, frequent remeshing would be unnecessary - it would only be required when a cell contained no stations at all.
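
One possible form of that idea, purely as a sketch (cell_of_station, cell_area and reporting are hypothetical inputs describing a fixed cell decomposition and the stations reporting in a given month):

    # Sketch: each reporting station gets an equal share of its cell's area, so
    # weights change only with the set of reporting stations, and no remeshing
    # is needed unless a cell empties completely.
    share_weights <- function(cell_of_station, cell_area, reporting) {
      counts <- table(cell_of_station[reporting])     # reporting stations per cell
      w <- numeric(length(cell_of_station))
      cells <- as.character(cell_of_station[reporting])
      w[reporting] <- cell_area[cell_of_station[reporting]] / as.numeric(counts[cells])
      w
    }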




11 comments:

  1. The last figure stops in 1999. I would like to see it until 2009.

  2. Sorry, Anon, I linked the wrong image. It's fixed above. It actually makes some difference to the trends.

  3. I had an idea about this. Instead of eliminating the stations by area, you could weight the area by the ratio of the radius of gyration of the polygon about the station coords.

    (Possibly raise the area to some power first if it needs a higher weight.)

    That will tend to disfavour stations which are near-neighbours at the edge of their polygons, for example pairs on the W African coast, Baja California and Venezuela.

    Kevin C

  4. Kevin,
    Yes, that might help. Generally I haven't worried much about details of area shape - the idea is that there should be enough correlation that there isn't really a bias when you change the weighting locally. I'm trying to avoid bias on a larger scale. So locally, once there is an area weighting, the focus shifts to trying to spread the weight (trading areas) to keep the variance down.

    But it's true that the 60-station exercise is pushing to the limit of correlation, so the effect you're describing is more important there. It's still essential to have area consistency - basically so the sums behave like area integrals. But radius of gyration could be used as a smoothing principle while still maintaining area consistency.

  5. OK, I got a 60 station list I'm happy with. After a few attempts I settled on this:

    1. Calculate the area for each station. (I used a cruder method: looping over a 2-degree grid and allocating the area of each grid cell to the nearest station.)

    2. Calculate the nearest neighbouring station for each station in the current list.

    3. Sort the list of neighbour distances, lowest first. For each neighbour pair, eliminate the station with the lower area until the desired number of stations has been eliminated in this cycle.

    Area and distance now both play roles. My area result is different from yours, owing to a different initial set of 'long running' stations, but it shows the same problem of close pairs. The distance version looks better. However, the standard deviation of the resulting areas is about 50% higher.

    Here's a pic:
    http://img30.imageshack.us/img30/8835/mappn.png

    Kevin C

  6. Thanks, Kevin,
    I like the principle that you are using. And the result looks promising.

    I could do a reconstruction with it if you like, for comparison.

  7. Don't worry unless it's useful to you. Using my crude code the reconstruction looks pretty much like yours.

    My next step is to turn the 60 stations into a spreadsheet. Given that the list only contains stations which are 90% complete over 1940-2000, it may be possible to calculate a temperature record without worrying about missing data or anomalies, and the uniform sampling makes area weighting rather less important. So it may be possible to make the calculation accessible to anyone who can use a spreadsheet.

    Kevin C

  8. It didn't work (even with a 60-month moving average). You can see a few features of the ITR, but the appearance and disappearance of stations adds a load of spurious features too. It looks like the anomaly calculation is needed even when working with a set of fairly complete stations.

  9. That is a really interesting study. 60 stations is not many, so it would be feasible to actually study the station histories of these 60 stations, and especially the state they are in now, to confirm that they are really rural. It may be publishable, if this has not been done before.

    Replies
    1. Victor,
      Thanks for the remarks. I don't think this kind of mesh usage has been done before - I haven't come across it. I'll look at it again - I'm handling meshes better now. I think they do make the best use of the data.

    2. Nick, to me the most interesting thing is that the low number of stations would allow one to study their quality, and especially to ascertain that they are really rural. Personally, I would not put the method in the foreground here, even though I am normally a fan of methodological studies.

      Two related papers on the number of stations needed are:

      Jones, P.D., Osborn, T.J. and Briffa, K.R. 1997: Estimating sampling errors in large-scale temperature averages. J. Climate, 10, 2548–2568.

      Hawkins, E. and Jones, P.D. 2013: On increasing global temperatures: 75 years after Callendar. Q. J. Royal Meteorol. Soc., doi:10.1002/qj.2178 (early view).
