Monday, April 15, 2019

The math basis for the TempLS V4 global temperature program.

This is, I hope, the last of the preparatory posts before I post the code and description of TempLS V4. Earlier posts were on ideas about spatial integration, an icosahedral grid with equal-area mapping for a new method, the documentation system that I'll be using, details of the new methods, tests, and some math methods that I'll be referring to.

The math basis is similar to what I described back in 2010 for V2, and in the guide to V3, still the most complete reference. The statistical model for temperature for a given month (Jan-Dec) is:

T = L + G, where T is measured temperature, L is a time-invariant offset for stations, and G a space-invariant global average temperature. In more detail, using my index notation, with s for station, y for year, and m for month, it is

T_{smy} = I_y L_{sm} + I_s G_{my}

The I's are added for consistency of indexing; they denote arrays of 1's.
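
To make the indexing concrete, here is a small NumPy sketch of the model for one month. This is my own toy illustration, not TempLS code; in array terms the I's (arrays of 1's) become implicit broadcasting.

    import numpy as np

    # Toy numbers, not real data: 5 stations, 10 years, month m fixed.
    rng = np.random.default_rng(0)
    L = rng.normal(10.0, 3.0, size=5)    # L_s: time-invariant station offsets
    G = rng.normal(0.0, 0.5, size=10)    # G_y: space-invariant global average

    # T_sy = I_y * L_s + I_s * G_y; the arrays of 1's are NumPy broadcasting.
    T = L[:, None] + G[None, :]          # shape (stations, years)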

A change in V4 is that the analysis is done separately over months, to avoid overly large arrays in RAM. That simplifies the analysis; the subscript m can be dropped. Another change is more subtle. The model fitting should be by area and time, ie by minimising the integral of squares over the sphere and over time:

∫∫ (T - L - G)^2 dA dt

In earlier versions I discretised the variables jointly, giving a weighted sum

w_{(sy)} (T_{sy} - I_y L_s - I_s G_y)^2

In fitting, this gave a matrix which was symmetric positive definite, which seemed natural and good. But it means that if you choose the weights to be right for the spatial integration, they can't also be controlled for the time integration, so locally they vary over years, which isn't really wanted. For the methods I used in V3, the weights were all positive and of reasonably uniform size, so it didn't really matter. But in V4 I use methods of higher order, which take account of derivatives. Sometimes, for good reason, the weights can be negative. For spatial integration they still add up to the area, so that is OK. But for time integration there might be significant cancellation, with a sum that is near zero, or even negative. It doesn't often happen, but since you have to divide by the sum of weights to get a time average, it is bad.
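
A small numerical illustration of the hazard (toy weights of my own, not actual TempLS output):

    import numpy as np

    # Four years of readings at one station, with weights from a
    # higher-order method that happen to nearly cancel over time.
    w = np.array([0.6, -0.5, 0.1, -0.15])   # sums to 0.05
    T = np.array([14.2, 14.0, 14.3, 14.1])

    print(w @ T / w.sum())   # 16.7 - dividing by a near-zero sum distorts the "average"

    # With the 0/1 mask J (1 where a reading exists) the time mean is stable:
    J = np.ones_like(T)
    print(J @ T / J.sum())   # 14.15, the plain mean over observed years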

Revised weighting - Gauss-Seidel iteration

The previous equations, derived from the sum of squares, were

w_{(s)y} (T_{sy} - I_y L_s - I_s G_y) = 0        (1a)
w_{s(y)} (T_{sy} - I_y L_s - I_s G_y) = 0        (2a)

To get even time weighting in (1a), I use a weighting J_{(s)y}, which is 1 where there is a reading, and 0 where there isn't (where w also had zeroes).

J_{(s)y} (T_{sy} - I_y L_s - I_s G_y) = 0        (1b)
w_{s(y)} (T_{sy} - I_y L_s - I_s G_y) = 0        (2b)
This can be written in part-solved form as
J_{(s)y} I_y L_s = J_{(s)y} (T_{sy} - I_s G_y)        (1)
w_{s(y)} I_s G_y = w_{s(y)} (T_{sy} - I_y L_s)        (2)

The first just says that L is the time average of T - G where observed, since the multiplier of L is just the set of row sums of J, which is the number of years of data for each station. The second is then just the area average of T corrected for L (the anomaly, when L has converged). Suitably normalised, the multiplier of G is 1. Starting with G = 0, this gives "naive averaging", with L just the overall time mean at station s. As I wrote here, this gives a bad result, which is why the main methods insist on averaging over a fixed time period (eg 1961-90). But that then leaves the problem of stations that do not have data in that period, which this method avoids.
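
In array terms, eq (1) is just a masked time mean. A minimal NumPy sketch (my own paraphrase of the step, with assumed array shapes):

    import numpy as np

    def update_L(T, J, G):
        # Eq (1): L_s is the time average of T - G over observed years.
        # T, J are (stations x years); J is the 0/1 mask; G is (years,).
        resid = np.where(J == 1, T - G[None, :], 0.0)
        return resid.sum(axis=1) / J.sum(axis=1)  # row sums of J: years of data per station

With G = 0 this returns just the overall time mean at each station, the "naive averaging" starting point.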

So the first equation is solved, and the L used to estimate G, which is then used to correct L, and so on, iterating. The reason the iteration works is that the equations are weakly coupled. Eq (1) almost fixes L, with variations in G having only a small effect. Conversely, Eq (2) almost fixes G, and is not very sensitive to local variations in L. There is an exception: if you add a constant value to G, it will cause the L's to drop by the same constant amount.

So that is the iteration sequence, which can be characterised as block Gauss-Seidel. Start with a guessed value for G. Solve (1) for L, then solve (2) for an updated value of G. For GHCN V4, as used here, this converges, gaining at least one significant figure per iteration, so four or five steps are sufficient. In practice, I now start with a previous estimate of G, which converges even faster. But in any case, the step takes only a few seconds. At each step, G is normalised to have mean zero between 1961 and 1990 (for each month), to deal with the ambiguity of exchanging a constant value with L.
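
Putting the two steps together, the block Gauss-Seidel loop is, in outline (a sketch with assumed array shapes and argument names, not the actual V4 code):

    import numpy as np

    def fit_global(T, J, w, years, n_iter=5, G0=None):
        # T, J, w: (stations x years) arrays for one month; w are the spatial
        # weights (zero where no reading), J the 0/1 mask; years: calendar years.
        G = np.zeros(T.shape[1]) if G0 is None else G0.copy()
        base = (years >= 1961) & (years <= 1990)
        for _ in range(n_iter):
            # Eq (1): L is the time mean of T - G where observed
            L = np.where(J == 1, T - G, 0.0).sum(axis=1) / J.sum(axis=1)
            # Eq (2): G is the area-weighted mean of T - L
            G = (w * np.where(J == 1, T - L[:, None], 0.0)).sum(axis=0) / w.sum(axis=0)
            # Fix the constant ambiguity: zero mean over the 1961-90 base period
            G -= G[base].mean()
        return L, G

Starting from a previous estimate of G, as described above, just means passing it as G0.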

Program structure

As before, TempLS has four blocks of code, now expressed as functions:
  • SortLandData
  • SortSSTData
  • DeriveWeights
  • FitGlobal
The first two are just housekeeping, ending with x_sy, the array of monthly temperatures. The third derives the corresponding spatial integration weight matrix w_sy by one of five methods described here. The fourth, FitGlobal(), performs the iteration described above. The results are the parameters G and L, of which G is the desired global temperature anomaly, which I publish every month.
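
Schematically, the four blocks chain like this (the block names are from the list above, but taking them as arguments, and the glue code, are my own sketch):

    import numpy as np

    def templs_month(SortLandData, SortSSTData, DeriveWeights, FitGlobal):
        x_land = SortLandData()                      # housekeeping -> land series
        x_sst = SortSSTData()                        # housekeeping -> SST series
        x = np.concatenate([x_land, x_sst], axis=0)  # x_sy, monthly temperatures
        w = DeriveWeights(x)                         # weight matrix w_sy
        L, G = FitGlobal(x, w)                       # the iteration above
        return G                                     # the published anomaly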

For the more accurate methods, DeriveWeights is the most time-consuming step; a full meshing can take an hour, and LOESS takes ten minutes or so to do the full record since 1900. But the weights depend only on which stations report, not what they report, and for most of those years this doesn't change. So I store the weights and reuse them unless there is evidence of a change in the list of stations that reported that year. This brings the compute time back to a few seconds.
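
The caching can be as simple as keying stored weights on the list of stations that reported. A sketch of the idea in Python (the real storage scheme in TempLS may differ):

    import hashlib
    import numpy as np

    _weight_cache = {}

    def weights_for_year(station_ids, derive):
        # station_ids: ids of stations reporting that year;
        # derive: the expensive weight computation (e.g. full meshing or LOESS).
        key = hashlib.sha1(np.sort(np.asarray(station_ids)).tobytes()).hexdigest()
        if key not in _weight_cache:
            _weight_cache[key] = derive(station_ids)  # only when the list changed
        return _weight_cache[key]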

In V3, I had a separate method based on Spherical Harmonics. As described here, I now treat this as an enhancement of any method of integration. In V3 it was, in effect, an enhancement of the very crude method of unweighted averaging over space. It actually worked well. In V4 it is implemented, optionally, at the start of FitGlobal(). The weights from part 3 are modified in a quite general way to implement the enhancement, with results described in the post on tests. I think it is of interest that, as a separate integration principle, it yields results very similar to the more accurate (higher order) integration methods. But I don't think it will have a place in routine usage. It takes time, although I now have a much faster method, and it does not give much benefit if the more accurate methods are used. So why not just use them?






