Saturday, July 23, 2011

Using RghcnV3 - a very simple TempLS

I've been experimenting with Steven Mosher's R package RghcnV3. It has many useful features, and I think the R package format serves it well. I'm not as enthusiastic as Steven about the zoo and raster structures, at least for this purpose, but still, I may yet be convinced.

Anyway, it certainly gets some of the messy initial data structuring out of the way. So since I am currently working on Ver 2.2, I thought I would put together a very simple version using some of the more recent ideas, in the RghcnV3 style.
Update - I've added a version of the code with detailed comments at the end.

TempLS basics

Firstly, a reminder of how TempLS works. After data rearrangement, we have an array of temperature readings, over some number S of stations, and over M months (usually 12, but can do seasons etc) and Y years. This is a SxMxY array with many missing values.

I'll explain the math in a similar style to this account of V2. I'll use a notation, described there, in which arrays (matrices etc) are represented with suffixes, and equations including them are true for the natural range of values. To save summation signs, I use a summation convention in which a repeated index is assumed to be summed over the range. But to emphasise that, I use colors. Summed indices are blue, independent ones are red. There are some I need to exempt from the summation rule, and these are green.

So we assume that the observed temps x can be modelled with
  • a local temperature L, which varies over station and months, but not over years. Think of it as a monthly average, and
  • a global temperature G uniform over months and stations, but varying with years. This corresponds to the global anomaly that we are after.
So we write a model equation

xsmy  ~  Lsm  +  Gy
I'll repeat here the least squares part from that earlier post. It's useful to maintain a convention that every term has to include every red index. So I'll introduce an identity Iy, for which every component is 1, and rewrite the model as
xsmy  ~  IyLsm  +  IsmGy

Weighted least squares.

The L and G are parameters to be fitted. We minimise a weighted least squares expression for the residuals. The weight function w has two important roles.
  • It is zero for missing values
  • It needs to ensure that stations make "fair" contributions to the sum, so that regions with many stations aren't over-represented. This means that the sum with weights should be like a numericam integral over the surface. w should be inversely proportional to density.
So we want to minimise w(smy)(xsmy - IyLsm - IsmGy)2


This is done by differentiating this expression wrt each parameter component. In the summation convention, when you differentiate a paired index, it disappears. The remaining index goes from blue to red. The original scalar SS becomes a system of equations, as the red indices indicate:
Minimise:w(smy)(xsmy - IyLsm - IsmGy)2
Differentiate wrt L:   w(smy)Iy(xsmy - IyLsm - IsmGy)
Differentiate wrt G:w(smy)Ism(xsmy - IyLsm - IsmGy)
Notice what we have here - linear equations in L and G, and all we need is the data x and the weights w.

Equations to be solved

These are the equations to be solved for L and G. It can be seen as a block system:
A*L + B*G  = X1
BT*L + C*G  = X2
combining s and m as one index, and where T signifies transpose. Examining the various terms, one sees that A and C are diagonal matrices, so the overall system is symmetric.

The first equation can be used to eliminate L, so
(C -  BTA-1B)*G = X2  -  BTA-1X1
This is not a large system - G has of order 100 components (years).

Implementation with RghcnV3

There are just three steps. We have to get the array x. RghcnV3 can get us (as explained in the manual as far as the output of readV3Data(). This is a large array ordered first by months (in rows), then by year, then by station. There is then a function v3ToZoo() in V1.3) which should get it into a structured array. Unfortunately, I ran out of memory at that stage. But the further rearrangement was a basic step in TempLS, so I used the sortdata() routine to do that. The arguments are the data array, and a vector with the start and end year of the period you want to look at.The resulting x array is ordered as suggested above, by station, month and year.


Then we need a weight function w. A basic method is to divide the surface into 5x5 lat/lon cells, and get a station density by dividing the area of each cell by the number its stations reporting in each month. This is done in the routine makeweights, which returns the array w.


So that's it. We have the needed w and x; the rest is just some summing over rows etc to get the block matrices, and then a block inversion. This is done in the routine solveTemperature, which returns the temperature anomalies about a mean zero. I've added a plot statement.

The code

As promised, just three routines. Start with dat, the array from readV3Data(), and your choice of years yrs[1:2]. There is another version with detailed comments below:

sortdata = function(v,yrs=c(1900,2010)) # Sorts data from v2.mean etc into matrix array x
for(i in 1:jq){ # loop over stations
oy=(iy[,2] > y0) & (iy[,2]<=yrs[2]);
iz= iy[oy,];
x[i,,iz[,2]-y0] = t(iz[,3:14]);

Then the routine to make the weights - needs x and lat/lon:
makeweights = function( x, inv)
d = dim(x); wt = array(NA,d);
ns = floor(inv$Lat/5)*72 + floor(inv$Lon/5) + 1333;
wtt = sin((1:36-0.5)*pi/36);
w_area = rep(wtt,each=72);
for(i in 1:d[2]){
for(j in 1:d[3]){
ow = ![,i,j]);
ok=kc>0; wc=w_area;
wt[ow,i,j]= wc[n];

Finally, the solve routine:

x=x*wt; d=dim(x); d=c(d[1]*d[2],d[3]);
b = drop(R2-R1 %*% aw);
S=diag(C) - t(wt) %*% aw;
S[1,1]=S[1,1]+1; # there is a dof to be fixed

And the program to run it all:


And a (no-frills) result:

Here is a version of the code above with line by line comments.

#You'll want your own version of yrs
sortdata = function(v,yrs=c(1900,2010)) # Sorts data from v2.mean etc into matrix array x.
#v is just the v3.mean file read in - one year per row, one station block after another. Within a block, the years are in order.

   nd=dim(v)[1]; # Number of stations in inventory
   y0=yrs[1]-1; # Year zero
   iq=c(0,which(diff(v[,1])>0),nd); # Endpoints of each station block of data (where the stat num ber increments)
   jq=length(iq)-1; # Number of stations in data set
   x=array(NA,c(jq,12,yrs[2]-y0)); # setting up the big temp matrix
   for(i in 1:jq){ # loop over stations
       kq=c(iq[i]+1,iq[i+1]); # marks the block of station i
       iy=as.matrix(v[kq[1]:kq[2],]); # copies that block of info
       oy=(iy[,2] > y0) & (iy[,2]<=yrs[2]); # selects the years within range
       if(sum(oy)<2)next; # reject stations with less than a year of data (can't get an anomaly with 1 dof)
       iz= iy[oy,]; # the selected block
       x[i,,iz[,2]-y0] = t(iz[,3:14]); # copy into x
x; # return x
makeweights = function( x, inv)
# To return wt, a matrix with the same structure as x
#ie a weight for every station/month/year, but zero or NA if missing
  d = dim(x); wt = array(NA,d); # set up the big blank
  ns = floor(inv$Lat/5)*72 + floor(inv$Lon/5) + 1333;
# ns lists a (5x5 degree) cell number for every station in the inventory
  wtt = sin((1:36-0.5)*pi/36); w_area = rep(wtt,each=72);
# w_area is an area weight for each cell
  for(i in 1:d[2]){ # loop over months
    for(j in 1:d[3]){ # loop over years
       ow = ![,i,j]); # which stations reported in i/j
       n=ns[ow]; # and what were their cell numbers
       kc=tabulate(c(n,72*36)); # frequency of cell numbers
       ok=kc>0; # logical for empty cells (that month)
       wc=w_area; #Initialize the weights to the areas
       wc[ok]=w_area[ok]/kc[ok]; # now divide the weights by number of stats, to get inverse density
       wt[ow,i,j]= wc[n]; # set the weights for that i,j
 wt; # return the big weight matrix
#Now we have the weights wt and temp matrix x, can solve linear systen for temp anomaly
# to avoid NA fuss, missing readings have zero weight, so temp doesn't matter
   # redimension to a matrix with s&m vs year
   x=x*wt; d=dim(x); d=c(d[1]*d[2],d[3]);
   A=rowSums(wt)+1.0e-9; # Top left diag block matrix
# There can be missing lines with all zero. Instead of eliminating them, just add something to the diag to avoid a 0/0 error. No effect on result.
   R1=rowSums(x); # Corresponding RHS
   C=colSums(wt); #Bottom right diag matrix
   R2=colSums(x); # Corresponding RHS
   aw=array(rep(1/A,d[2])*wt,d); # LHS term after dividing by A to eliminate L
   b = drop(R2-R1 %*% aw); # Elimination op for RHS
   S=diag(C) - t(wt) %*% aw; # Local temp has now been eliminated, along with variables s and m.
#Now the equation is just in yearly T anomalies over years. Symmetric.
# S is singular, because an arbitrary constant could be added to G and subtracted from L.
# So arbitrarily set anomaly 1 to zero
# Now solve - S could be up to 150x150 say (yr[2]-yr[1])
# Now reset to mean zero (fixing that arbitrariness)
   y-mean(y); # zero mean anomaly returned.
# End functions.
# Execution You start with v from data file (readV3Data()) and inv from inventory (readInventory())
# All you need from inventory is lat/lon
# Many other makeweights() are possible
# A simple plot on a png file; comment out to see on screen (also


  1. Perfect, worked for me like a charm.

    With this description I'll see what I can do to integrate it into raster. The main benefit would be to make the calculation of the weights parameterized based on the size of the grid cell. I tested with the texas data. slick and fast.

  2. Hi Steven,
    the current weights are based on cell area too. I'll add a commented version of the code above.

    I was thinking that it wouldn't be hard to incorporate the full TempLS. Currently it's organised so there is a "user" file which sets a list of variables that the user might want to define. Then it reads a file written by the user that redefines whatever he wants. Then it runs.

    That's just the structure of an R function. The user list u would be the arguments, with defaults, and then in actually calling, the user would provide just those arguments he would have set in the comtrol file. Just a wrapper around the whole thing.

  3. Nick,
    Thanks for doing this. I sense a very high value in being able to construct different expansions of the temp dataset localization effects and geometry - if that is a good characterization of what's afoot here. Likely the results of this activity will not vary much from what's been published and knowing that the issues of post-processing have really been wrung out and not shown much difference is comforting.

    I continue not to see how systemic bias in the datasets themselves can be accounted for by looking at the sets themselves. It seems to me that we need more physical analysis of siting and the mechanics and protocols for recording the temperatures, but then I'm pretty ignorant.

    Thank you as well for your commenting in all the usual places. I worry much more about adopting ill-conceived observations of the people I think I agree with. And again,there's a very high value to showing that conclusions are very often unsupported by the reference or of apparently supported, not the only conclusion which can be inferred from the "facts."

    Thank you, john ferguson

  4. Thanks, John,
    Yes, it is about looking at different characterisations. The LS method is a bit different from what is traditionally used, and it's good for pucking out subsets.

    And you're right - it can't improve faulty data - though it can make it easier to spot.