Friday, March 4, 2011

Spatial weighting and Voronoi tessellation.

I've been spending far too much time lately on triangular meshes and Voronoi tessellation. The idea for combining temperature measurements is to weight stations according to the area they could be said to "represent". This is important in Antarctica, for example, where the distribution is very uneven, and the Peninsula is likely to dominate just because of the number of stations.

For the global indices, this is handled indirectly through gridding. A cell average is calculated, and cells combined, usually weighted by their easily calculated area.. There's a problem, usually unresolved, with empty cells. And this is one reason why I'm attracted to Voronoi tessellation. Every cell has a reading, provided you are prepared to create a new tessellation every month.

I experimented with a simpler variable cell scheme last year. But there were still empty cells.

A Voronoi tessellation just divides the space up according to the nearest nodes. The patch around node A consists of points nearer to A than any other node. There are still some practical problems in getting it into TempLS, because the tessellation would have to be done for every month. So it has to be reasonably efficient, but more importantly, done without any intervention to fix things. In this post I'll describe my experience so far.

Voronoi tessellation follows from triangular gridding. If you can form a triangular mesh connecting the nodes, then the bisector of each line is likely to be a cell boundary. It's not certain, because one of the nodes in the adjoining cells could be closer.

I found mesh and Voronoi routines in the R packages geometry, alphahull and tripack. The latter seemed the most used, and is called by alphahull. The key routine is called delvor, and does both. However, I found it advisable to run the mesher tri.mesh() first, because it had a flag for removing duplicate nodes.

Here is a Delaunay mesh of Antarctic stations. I've marked a quadrilateral to explain what Delaunay means. It is usually expressed as a requirement that the circumcircles of each triangle include no other nodes. I find that hard to visualize - a simpler version is that in each quad like that marked, the angles opposite the central line add up to less than 180. That gives the simple algorithm for converting any mesh to Delaunay. Since the sum of angles in the quad is 360, if the angles added up to more than 180, you could flip that central line to cross the other way, and the sum would then be less than 180. Of course this could disturb the property for some other quad, so the process can go on for a while. There are more efficient algorithms, but for small meshes like this, flipping is OK. In fact, you can make it quite efficient by doing an occasional sort.

The mesh includes quite a lot of ocean. That's because tri.mesh, and everything in R that I found, insists on meshing the convex hull of the region. The convex hull is what you would get by running a taut string around the points. I did my own experimenting  to try to generate a mesh where I can chip away to make the mesh better follow the land. I'll probably be able to do this with the tri.mesh structure. But there's the problem of doing it mechanically so it can be done every month as stations reporting change.

As I said above, the Voronoi tessellation is formed by drawing the perpendicular bisectors of the lines, and surrounding each node by the inner envelope. The Delaunay property means that you can be fairly sure that this means just connecting up the circumcentres. But there is a remaining problem, as you'll see from this result:

Although the circumcircles can't include other nodes, near the boundary, this isn't much of a restriction, and the Voronoi lines go a long way. I want to use the areas for weights, so something has to be done. Oddly, I couldn't find anything in R that confines the cells to the meshed region. So I added my own code to cut off at the boundary, which gave this:

So here, finally, is a map showing the weightings produced (by area of circle). A problem is that it is rather uneven. On the peninsula, say, just one or two stations get connected to a large area. I think something will need to be done to fix this by averaging - not to upweight the region, but to avoid suppressing some stations completely.

This effect is somewhat exaggerated because I've included all stations that have ever reported; in any given month it will be sparser.

I have posted the code here ( on the doc depository. Antvor.r is the executable; idx.r is the set of station data which comes from Ryan's code.


  1. Sorry to report, I can't access the code.
    But, then, some weird seems to happen when I try to log in and post as well.

  2. Sorry, Ron, I tried to give a direct like to the file, but maybe that doesn't work if you aren't the owner. I've replaced it with a link to the docs store (same as in the resources list, top right).

    I see I've also messed up the figs - the last two are the same. I'll fix that when I get back home.

  3. Looks interesting. But haven't you gone from one extreme to another??

    eg. the coastal cells have gone from radiating out to infinity, to being restricted to the coast. Don't you need an algorithm to make sure the points/positions on the perimeter are in the centre of their cell?

    (BTW I know nothing about the methods you are using)

  4. Waemcast,
    It's just for weighting. It doesn't in the end matter too much whether a node is a good representative of the exact area that's nominated. It's just accounting to ensure that regions aren't overrepresented. In that sense, if one node gets just a bit of coast, and another nearby gets a bigger chunk of interior, that's "fair". The region is weighted about right.

    However, there is some loss of information. If that situation does arise (as it does) then there should be some local averaging. My thinking is that this analysis should be done to allocate the area, and then some spatial smoothing of the weightsso that the information in "unlucky" nodes can still count.

    Quantitatively it could go like this. One could say that there is good correlation up to 250 km (say), but not so good beyond 500. So you want to have the 500 km scale properly allocated, and we've done that. The price is loss of effective dof, or increased variance, which goes with the variability of the weighting. If we can smooth the weighting locally, without spreading it beyond the 250 scale, then we ameliorate the variance issue without much loss.

  5. So the bigger the area, the bigger the weighting??

    BTW, if you have already explained this, just point me towards the relevant post(s).

    I was drawn to this page because the graphics looked intriguing!


  6. I see some stations, surrounded by others, that have near zero weighting.

    We know from previous studies that stations a few km from each other will have high correlation and the correlation decays with increasing distance.

    I propose using that correlation vs distance function, combined with the distance to the grid cell being considered as the method of weighting stations.

    That avoids the underweighting of stations just because there is a nearby station located between a station and a nearby grid cell.

    In one sense, the traditional gridding of data is doing this same algorithm, but with only the discrete weighting coefficients of 0 or 1.

    I really just suggesting traditional gridding, but using a fuzzy correlation function rather than a discrete yes/no, or station in or not in the cell.

    Does this make any sense?

    My other suggestions were roughly the same, but were described using each station as a reference rather than looking at it from the point of view of the grid cell.

  7. Charlie,
    The problem is that we mostly need a geometric criterion, because we have to grid monthly with the data varying. Now it's true, I expect, that the correlation function does not have to use the latest data etc.

    I'm keen to develop my diffusion idea. It's a bit like multigrid. We've got the weighting approx right on a large scale, but as you say, unsatisfactory locally. So we can then smooth locally weights without ddisturbing the wider balance.

    Specifically, I'm thinking this method. Usingthe mesh, we do a few diffusion steps (relaxation) in which, along lines of <100 km say, weight is exchanged. Each node might send out up to a third of what it has, and of course will receive back. The amount exchanged could taper with distance. Ten or so steps of that should even things up, and little weight would go beyond, say, 200 km.

  8. The smoothly weighted grid method I propose has no trouble with missing months.

    I'm assuming that the weighting vs. distance function would stay constant over the run. I only mention the station to station correlation as a way of justifying/selecting a particular weighting function. One could also use some arbitrary weighting such as inverse distance squared.

    The handling of missing data would be no different than for traditional grid methods. In those, you just count the number of stations in the grid with data for that month and average them together.

    In terms of my proposed solution I would describe that averaging as combining all stations inside the grid cell with weighting coefficient of 1. Then of course, you divide the summed station values by the number of stations to get the average.

    In the modified method, all that changes is the weighting would not be limited to either 1 or 0. The result for each grid is simply the summation of each station value times its weight, divided by the total of all weights.

    The more I look at that method, the more I'm convinced that it is not merely a quick and dirty method, but that it is also perhaps the most accurate way of estimating values for each grid cells.

    The handling of a variety of combinations does what I would want an algorithm to do.

    For nearly co-located stations, it removes the influence of the direction between the stations. For what would be empty grid cells in a normal gridding method, the algorithm "reaches out" to include far away stations. In grids with cells in it and nearby, the far away stations are included in the calculations, but have small contributions compared to the nearby stations.

    The only other tweaking the algorithm might need is a way to truncate and ignore far away stations once a sufficient number of closer stations have been included.


    The approach I'm suggesting is so obvious that undoubtably somebody has already used it for spatial interpolation to infill missing data or something similar.

    I recognize that all this is very much a divergence away from the direction you are going and will stop cluttering up the thread at this point.

  9. Charlie,
    I think you are on to something there. It's more than just spatial weighting though - it's an inadequacy in least squares. The weighting function really should incorporatee a correlation matrix. Here is Joe Triscari making that point.

    My defence is that much work is still done with OLS and the matrix kernel complicates the numerical algebra a lot. But yes, it should be done, one day.

    It's related to a controversy about S09 originally not allowing for first order correlation. That led to a corrigendum in which a Quenouille correction was used. I haven't found where O10 did that, but I presume they did. But this correction is a small step, and doesn't actually change the expected values - just the CI's.

  10. Nick,
    "That led to a corrigendum in which a Quenouille correction was used. I haven't found where O10 did that, but I presume they did."

    Yes, indeed, I think we stated as much in the paper. And, unlike S09, Ryan posted full code so people can check that we actually did what we said we did. It is in lmFn, which is called by getRecon to do the trend and trend CI calculations:

    ### Apply DoF correction
    Q = sqrt((1 - r) / (1 + r))
    SE = SE / Q

    I think you will find that the RLS method we used in effect incorporated full spatial correlation information, as derived from the AVHRR satellite data. Although this data suffers form temporal drift and inhomogeneities, that doesn't degrade the spatial correlation information nearly as much as it affects the accuracy of its temporal data.

  11. Thanks, Nic
    Yes, it wasn't hard to see.