Saturday, June 19, 2010

A success for open coding!



Steven Mosher was looking through the code for TempLS v1.4 when he saw something puzzling and asked me about it. And sure enough, it was an error. One of those things a fresh pair of eyes can notice.



It was in the section that takes the station list and assigns them to a cell number, depending on what 5x5 lat/lon block they fall into. It doesn't matter what the number is, as long as the stations in the block have the same number and others have different numbers. The stations with that number are counted, and the sum is used in the least squares weighting.

Here are the relevant lines of code:

d = floor(tv[,5:6]/5);    # station lat and long;

cellnums = unlist(d[1]*36+d[2]+1333);  # 5 deg x 5 deg cells numbered 1 to 72*36=2592;


The first line divides the lat/lon by 5 and rounds to an integer. So each station has an integer pair identifying its cell.

They range from (-18,-36) to (17,35) - there are 2592 cells.

The next line tries to number these from 1 to 2592. Numbering a rectangle array is normally done either by row or by column. The 1333 is an offset to bring them to a positive range. I'm numbering in rows, so 36 should be the number of columns. But alas, there are 36 rows and 72 columns. I made the mistake because lat/lon, on the map, is actually in (y,x) order rather than the conventional (x,y).

I'll post the revised code at Ver 1.41. Ver 2.0 should be out soon.

What is the effect?

Basically, two cells get assigned to every number. It turns out that they are opposite in longitude. This affects the weighting, which is meant to correct for station concentration. When an area is well covered by stations, the least squares process downweights each individual station, so the area does not get undue attention. But by combining with a cell on the other side of the world, this downweight could be out by a factor of two.

I should hastily say that it doesn't mean that temperatures from the other side of the world are modifying the readings. They don't - they only modify the weight given to the readings.

Fortunately, it won't matter for any regional studies. The reason is that there are then no stations included from the other side of the world. The weights are unaffected.

A digression on weighting


Another way of seeing the use of weighting by inverse density is that it does what you would do if you were trying to use the stations in a numerical integral expression. As such. it has a fault when grid cells can be empty. The natural thing to do with an empty cell is just to leave it out, as GISS does, and I suspect the other indices. So did I. But leaving it out, when compiling a global average, is effectively the same as including it with a value artificially assigned to it, which is equal to the global average. And when you look at spatial patterns of anomalies etc, it's clear this often isn't optimal. You might have regions like the Arctic where the cells that do have stations have high (or low) values, but many are missing. Implicitly assigning these to have average value biases the result down (or up).


GIStemp get criticised for extrapolating over wide regions in the Arctic. But it's the best thing to do (in a bad situation). leaving the regions out seems more conservative, but it isn't.


Anyway I've been looking at ways of using irregular triangular subdivisions instead of a regular grid. The idea is that when stations get sparse, you just make the subdivisions bigger. No part of the Earth lacks a representative cover. As I mentioned, sparseness is a problem - here the triangles get big, so the coverage degrades. But it's better than none.

Back to the error

I recalculated some of the main global plots of recent posts. Here old and new are contrasted. As you'll see, the discrepancies are noticeable but not major. Mainly the error made peaks and valleys more extreme. I don't believe the revised data would show out on Zeke's combined blogger plots. At the bottom of each plot the modified trend is shown. The changes are small.


There are sometimes some biggish changes in late 1930's. I think this reflects the fact that these years were hot in ConUS, which were then overweighted in the older version.




Global Land and Sea

Land Only

Global Sea (SST)

Rural stations

61 station subset


5 comments:

  1. Thanks Nick,

    I've been puzzling over the weighting issue myself, i was going to look at delaunay triangulation or Voronoi (just to be different)

    Here is the puzzle. Take a grid with 4 cells. 3 cells have 1 station and the
    4 cell has 10 stations. so at that gridding we would average the 10 to
    get one value, and then average all 4.

    But, at some distance R, we can draw a grid where each station will be in its own grid square. and then the average will be the same as the simple average.. with a bunch of missing grids...



    so it was in puzzling over this I wanted to parameterize the gridding
    and ask "what does grid size do to the average"

    Another way to think about this..You have 5*5 grid with 10 stations
    and another grid with 1 station. which has better information.. which lead me to wonder.. what do the trends in sparse grids look like as opposed to highly sampled grids?

    ReplyDelete
  2. Steve,
    The trouble with Delaunay is that the stations reporting changes every month. If there's an empty cell, you need to re-grid. And Delaunay is striving for equilateral triangles, which you don't really need. The requirements for just dividing the area are much less strict than Delaunay, and the algorithm has to take advantage of that.

    The algorithm I'm currently looking at goes like this. You start with just a few triangles covering the Earth. You then start bisecting, with a requirement that all new triangles have a minimum number of stations (maybe 1). You have a priority based on size of triangle and number of stations within. Form a binary tree of the bisection process. Stop when you can't make any new qualifying triangles.

    Then as you proceed month-to-month, when a triangle gets empty, consult the tree and move up a level, reversing the bisection. That's quick to do. At any time you have a tesselation with all triangles qualifying.

    For a rectangular mesh, this is the idea of the quadtree. But triangles are more flexible.

    ReplyDelete
  3. hmm. ya regriding the whole world every month or even year would be problematic. like your approach better. for now I'll stick with a grid, also thought about this:

    http://www.bostongis.com/images/snippets/bostonschools_voronoi.png

    http://www.sgsi.com/MIUserGroup/Tech_FindNearest_SGSI.htm

    http://www.voronoi.com/wiki/index.php?title=Main_Page

    ReplyDelete
  4. Voronoi has the same issue as Delaunay - you'd have to redo when cells become empty. In fact, Voronoi tessellation is just a dual of Delaunay triangualtion, and has basically the same algorithm. You'll notice that many pairs of points can be connected by drawing a line between them that crosses just one edge of the Voronoi tess. And when that happens, it crosses at right angles. If you make all such connections that you can, you have a triangulation.

    ReplyDelete
  5. Thanks, I had read the 'dual" explanation without fully comprehending what it meant. There appear to be some quick ways to regrid if a single point is dropped. i'm wondering if methods that use CAM could regrid on a annual or decadal basis..

    ReplyDelete