Wednesday, March 31, 2010

GHCN processing algorithm

As Zeke says, it seems to be a custom now for bloggers to post code to grid GHCN data and calculate trends. Here's my offering. It's influenced by Jeff Id's program, and particularly by RomanM's least squares method. In fact, it's an extension of Roman's offset method to the whole global set. It's fairly different from the conventional approach: it doesn't actually do gridding, or (per Roman) calculate anomalies. It's probably the shortest of the codes, and runs quite fast.

A summary of this series of posts is here.

The linear model

I'll attach a pdf file explaining the mathematics in more detail (soon). But here's a sketch. The monthly data, from v2.mean after averaging over duplicates, is in a big array x(S,m,y). The indices are station, month and year.

The model is that x can be represented in terms of a global temperature G(y) depending on year alone (not month or station) plus a local offset L(S,m) for each station S, with a seasonal component dependent on month m (but not year), plus error:
x(S,m,y) = G(y) + L(S,m) + eps(S,m,y)
I do a least squares weighted fit.

The weight function w(S,m,y) is inversely proportional to the area density of stations, and is, of course, zero for (S,m,y) combinations where there is no data. So we're minimising (in LaTeX-like notation) Sum_{S,m,y} w(S,m,y)*(x(S,m,y) - G(y) - L(S,m))^2
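As a concrete illustration, the objective above can be written down directly (a minimal Python/numpy sketch, not the R code used here; the (S,m) pairs are flattened into a single row index):

```python
import numpy as np

def weighted_ss(x, w, G, L):
    """Sum_{S,m,y} w*(x - G - L)^2, with (S,m) flattened to one row index.
    x, w : (n_station_months, n_years) arrays; G is per year, L is per
    station-month. Missing data carries w = 0, so those terms drop out
    of the sum automatically. (Illustrative sketch only.)
    """
    resid = x - G[None, :] - L[:, None]
    return np.sum(w * resid * resid)
```

If the data are exactly G(y) + L(S,m), the sum is zero for any weights, which is the property the synthetic test below relies on.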


The weight function uses a grid, but only to count stations in each cell - no explicit grid averaging is done. The count is taken for each month and year. The cell area is divided by this count to give w(S,m,y). A conventional 5° x 5° lat/lon grid is used.
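A sketch of that weight construction (Python/numpy for illustration - the post's code is R, and the input names here are hypothetical):

```python
import numpy as np

def cell_weights(lat, lon, present):
    """Weights inversely proportional to station density, per (month, year).

    lat, lon : (S,) station coordinates in degrees (hypothetical inputs)
    present  : (S, M, Y) boolean mask, True where a reading exists
    Returns w : (S, M, Y) with w = cell_area / station_count, 0 where no data.
    """
    # Assign each station to a conventional 5 x 5 degree lat/lon cell
    ilat = np.clip(((lat + 90) // 5).astype(int), 0, 35)
    ilon = np.clip(((lon + 180) // 5).astype(int), 0, 71)
    cell = ilat * 72 + ilon                    # flat cell index over 36*72 cells
    # Relative cell area shrinks with cos(latitude) of the cell centre
    area = np.cos(np.deg2rad(ilat * 5 - 90 + 2.5))

    S, M, Y = present.shape
    w = np.zeros((S, M, Y))
    for m in range(M):
        for y in range(Y):
            mask = present[:, m, y]
            # count reporting stations in each cell for this month and year
            counts = np.bincount(cell[mask], minlength=36 * 72)
            w[mask, m, y] = area[mask] / counts[cell[mask]]
    return w
```

Two stations sharing a cell split that cell's area between them; a lone station in a cell carries the whole cell area as its weight.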

The design matrix system

Minimising LS gives the system
L(S,m)*Sum_{y} w(S,m,y) + Sum_{y} {w(S,m,y)*G(y)}
            = Sum_{y} {w(S,m,y)*x(S,m,y)}

Sum_{S,m} {w(S,m,y)*L(S,m)} + G(y)*Sum_{S,m} w(S,m,y)
           = Sum_{S,m} {w(S,m,y)*x(S,m,y)}

Regarding (S,m) as a single index, this is actually a symmetric system:
A1*L +B*G = R1
t(B)*L +A2*G = R2

A1 is the very big block, but it is diagonal. A2 is diagonal too. t(B) is the transpose of B.


This can no doubt be solved (for the local offsets L and global temperature G) using R's lm()-type functions. Numerically, though, it is essential to take advantage of A1 being diagonal. I don't know how to make R do that, and would appreciate advice.

I constructed a home-made block solver which was quite fast and apparently effective (although I have an idea for a faster version). A wrinkle is that the matrix is singular, because you could add any constant to the global temperature G and subtract it from L. So I added an equation to fix the initial G to zero.
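The solve step can be sketched like so (Python/numpy for illustration; the actual solver is R, and the variable names here are mine). With A1 diagonal, forming the Schur complement is just a scaled matrix product, and pinning G[0] = 0 removes the singular direction:

```python
import numpy as np

def solve_offset_model(a1, B, a2, R1, R2):
    """Solve the symmetric block system
        [diag(a1)   B       ] [L]   [R1]
        [t(B)       diag(a2)] [G] = [R2]
    by eliminating L via the Schur complement, exploiting the diagonal
    A1 block. The system is singular (a constant can move between G and
    L), so the gauge is fixed by pinning G[0] = 0. A numpy sketch of the
    idea, not the post's R solver itself.
    """
    Binv = B / a1[:, None]              # A1^{-1} B, cheap because A1 is diagonal
    Schur = np.diag(a2) - B.T @ Binv    # A2 - t(B) A1^{-1} B
    r = R2 - Binv.T @ R1                # reduced right-hand side
    G = np.zeros(len(a2))
    G[1:] = np.linalg.solve(Schur[1:, 1:], r[1:])   # drop the singular direction
    L = (R1 - B @ G) / a1               # back-substitute for the local offsets
    return L, G
```

Here B has one row per (S,m) pair and one column per year, with entries w(S,m,y), so a1 and a2 are its row and column sums. Fed exact data x = G + L, the solve recovers both to machine precision.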


To test the code, I created a synthetic data file in the v2.mean format, made up of just a trend plus a sinusoidal seasonal cycle (the same for all stations), with a random scattering of missing data. It should give an exact answer, and it did. The test code is included here.
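A generator for that kind of synthetic file might look like this (a Python sketch, not the actual test code; the field widths are illustrative rather than an exact replica of v2.mean):

```python
import numpy as np

def write_synthetic_v2mean(path, n_stations=20, years=range(1950, 2000),
                           trend=0.02, seed=1):
    """Write a synthetic temperature file in a v2.mean-like layout:
    a 12-character station id, the year, then 12 monthly values in
    tenths of a degree C, with -9999 for missing. Every station gets
    the same trend-plus-sinusoid signal, so the fitted G(y) is known
    exactly. (Illustrative layout, not an exact v2.mean clone.)
    """
    rng = np.random.default_rng(seed)
    y0 = min(years)
    with open(path, "w") as f:
        for s in range(n_stations):
            sid = f"{s:012d}"               # stand-in for the station id
            for y in years:
                fields = []
                for m in range(12):
                    if rng.random() < 0.1:  # ~10% of readings go missing
                        fields.append(-9999)
                    else:
                        t = trend * (y - y0) + 5.0 * np.sin(2 * np.pi * m / 12)
                        fields.append(int(round(t * 10)))
                f.write(sid + str(y) + "".join(f"{v:5d}" for v in fields) + "\n")
```

Because the signal has no station-dependent part beyond the seasonal sinusoid, the recovered G(y) should reproduce the trend exactly wherever data survive.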


I haven't tried hemispheres yet, just global. I'll show comparisons with HADCRUT etc in a following post. For the moment, I just looked at global temperatures and trends from 1900 to present, and from 1978 to present. I was particularly interested to compare with Jeff's value of 0.2476 C/dec. Here's my figure, with the result 0.2542 C/dec. (The AR1 correction is coming.)

For 1900-2009, my slope is 0.0747 C/dec - well within Zeke's range.

Code and next steps

The code, and the test code, is in the repository. In a future post, I'll show results for the local seasonal temperatures from the LS fit - they are pretty interesting too.


  1. Hi Nick,

    Looks very interesting. Could you put together a tarball of your source files?

    And maybe version number (and name) for the project.

    Great work!

  2. Thanks, Carrick and Steve

    If anyone tried to download the code: I had left in an if(F){ ..} ##F block which actually blocked out all the calcs - I was scrubbing up the graphics. It's OK now.

    I've created a doc repository (which is what Zeke uses). Currently there's only the numbers for the G output, but I'll put up a zipfile of code and my math description of the algorithm.

  3. So the gridding is basically implicitly embedded within the calculation. Which is fine if all you want is some sort of 'global' record, but what if you want hemispheric records or regional records? Seems to me that you'd have to filter the source data and run the script again.

    I put 'global' in scare quotes as there will be empty or undersampled grid boxes. It's important to remember that the G(y) has this limitation.

    I haven't thought through the degrees of freedom carefully yet, but wouldn't there be multiple solutions here? Meaning, G(y) could be shifted up or down, and the offsets L would merely shift with them. Seems to me that you'd have to specify the offsets L at one station somewhere, as GISS and Tamino do. What am I missing?

  4. CE
    Gridding is now a minor part of the process - just determines weights. The weights are valid for subsets too.

    I'll write a hemisphere restriction into the next version. There are many ways it could be done. It's no big deal to rerun the calc, because most of the time is spent solving the design matrix system - more specifically, in computing the explicit Schur complement (which I am hoping to avoid in a speedup).

    Yes, as I mentioned under solving, the matrix has a rank deficiency of 1, which I solve by setting G(1) to zero.

  5. I wouldn't say gridding is now minor; it's just being done all at once through the weights, instead of going box-to-box and then combining boxes. But it's still being done. If rerunning the calc isn't cumbersome, I guess having to repeat it for hemisphere/regional calcs isn't so bad.

    Ah, I didn't see that you addressed my exact concern under 'solving'. Sorry.

  6. Carrot Eater, I would say gridding is a form of area weighting, not the other way around.

  7. I use the terms interchangeably.

    Gridding accomplishes area weighting; area weighting is hard to do without drawing a grid of some sort, at some point.

  8. Nick, I have been thinking about your model. It is basically the one that I have used except that you have chosen a single "global temperature" for each year rather than each month as I have done. I am not sure I understand why.

    It will certainly not simplify the complexity of the calculations. The number of values you are estimating at one time is enormous.

    If there are S stations and Y years, then you have 12S + Y parameters. With 3000 stations and 100 years, this comes out as 36100 parameters with matrix operations on a square matrix with that many rows and the same number of columns.

    On the other hand, if the "global" value is monthly, the sum of squares splits apart into twelve separate SS's, each of which is minimized separately with only S + Y parameters for each month (3100 in the example) but repeated 12 times - a process considerably less complicated and much quicker than that above.

    I haven't done the math yet, but I suspect that your yearly value would also be a weighted average of the twelve monthly values from my procedure.

    With numbers of parameters of the order involved here, using the R lm procedure would not be efficient and would require a great deal of computer memory. However, it is possible to write reasonably simple scripts for this purpose (as you seem to have done here).

    Good on ya for working on the problem with the rest of us sloggers!
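Roman's comparison of the two problem sizes can be checked with quick back-of-envelope arithmetic (a Python sketch using his example figures of 3000 stations and 100 years):

```python
# Unknowns in the single yearly-G fit versus a month-by-month split,
# using Roman's example of 3000 stations and 100 years
S, Y = 3000, 100
full = 12 * S + Y          # one big system: 12 offsets per station + yearly G
per_month = S + Y          # each of the 12 separate monthly systems
# A dense solve scales roughly as n^3, so the split wins by a wide margin
speedup = full**3 / (12 * per_month**3)
print(full, per_month, round(speedup))  # → 36100 3100 132
```

The n^3 scaling assumes a dense factorisation; exploiting the diagonal A1 block, as the post does, narrows the gap considerably.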

  9. Roman, thanks for your comment. I saw the restriction of G(y) to annual values as reducing the number of equations, but I can now see that if I don't restrict it, the larger set could be partitioned into 12 subsets, which would likely be easier overall. I'll try it.