Thursday, April 14, 2011


Improving averaging of global temperature anomaly with fitted functions - seen as an aspect of a general method.

I have written a lot about two topics - anomalies in general, and spatial averaging of temperature anomalies. Both are headings in the Moyhu interactive index. My main point on anomalies as used in temperature statistics is that they play an essential role in reducing inhomogeneity, and so make the averages we must form from a limited sample of stations more reliable.

But I think they have a wider use. The general pattern is that a field, temperature T, say, is partitioned:
T=E+A
where E is some kind of expected value, and A is an anomaly. That sounds vague, and there are many ways you could make such a partition. But it becomes clear when associated with the use made of it. The purpose is often to calculate some linear function of T, say L(T). Here L will be some kind of average. But suppose L is an ideal operator, and we actually have to rely on an approximate version L1 (eg coarse sampling).

So L(T) is approximated by L1(T)=L1(E)+L1(A). Now the objective is to choose E in such a way that
  1. L1(E) can be replaced by L(E), which is more accurate for that part
  2. L1(A) loses some of the error that marred L1(T)
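To see where the gain comes from, the error of the approximate operator splits in the same way as T, just by linearity:

\[ L_1(T) - L(T) = \bigl(L_1(E) - L(E)\bigr) + \bigl(L_1(A) - L(A)\bigr) \]

Reporting L(E)+L1(A) instead of L1(T) removes the first bracket entirely; the partition is then chosen so that the second bracket, the error in integrating the anomaly, is small.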

I will explain with three examples here, of which the third will be developed in detail, because it bears on my long-term project of more accurately calculating surface temperatures:
  1. Temperature anomalies for global average (GISS). Here the station temperatures have some kind of mean of past values subtracted. It is usually the mean of some fixed period, eg 1961-90. Here is a typical post in which I explain why absolute temperatures are too inhomogeneous to average with the sample density we have, while the anomalies thus constructed are homogeneous enough, if only because they have approximately the same expected value (0). But in terms of this structure, they don't quite fit. The component E (climatology) that is removed is not calculated with a more accurate estimator, but is discarded, with only the anomaly average reported. But the mode of partition is still important, to be sure that the discarded part does not carry with it variation that should have been attributed to A. Using a consistent base period is one way of ensuring that.
  2. Another example is one I developed in a post at WUWT. This was about the time samples, for a single station, that go into forming a monthly average. An objection sometimes raised is that characterising each day by two values (max and min) "violates Nyquist". Nyquist doesn't really tell us about this kind of sampling, but I looked at the error introduced by sampling regularly a few times a day (eg every 12 hours). This does create aliasing of some harmonics of the diurnal frequency to zero, which adds spurious amounts to the monthly average. I showed that this is a limited error, but one which can be reduced with a T=E+A partition. Now E is an estimate of the regular diurnal cycle, based on fine (eg hourly) sampling of a few reference years. That takes most of the diurnal variation out of A, and so sparse sampling (L1) aliases very little to zero, and causes little error. The fine operator (L) can be applied to the E component.
  3. The third example, to be developed here, is for the monthly spatial integration of anomalies. I'll describe it first rather generally. Suppose T is a variable to be integrated over any domain, and L1 is an approximate integration operator. Then suppose E is a Fourier fit with the first n orthogonal functions, where the inner product is integration with L1.

    Then E contains the low-frequency terms which contribute most to the integral, leaving A with just high-frequency fluctuations. These may be of lower amplitude than the whole, but more importantly, they will make little contribution to the low-pass integration operator. This remains true for the approximate L1, since the basic reason the integration acts as a low-pass filter is cancellation.

    In fact, by the construction of the fit, L1(A) is exactly 0, since A is orthogonal (in the L1 inner product) to the fitting functions, of which the lowest is a constant. So if the improved integration operator L is just exact integration, the oscillatory fitted terms make zero contribution to the integral; only the constant term remains, and the improved integral is just the integral of that constant. The improvement comes because the approximate integral of those low-frequency components is replaced by the exact one. A small numerical sketch of this follows.
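Here is a minimal sketch in R of that zero-contribution property, on a toy 1-D periodic domain rather than the sphere. The function, sample points and Fourier basis are all invented for illustration; this is not TempLS code, just the same algebra in miniature.

# Toy example: estimate the mean of a periodic function from an uneven sample,
# first directly (L1), then after removing a low-order Fourier fit (E) made with
# the L1 inner product. All names here are made up for the illustration.
set.seed(1)
Tfun <- function(x) 12 + 3*cos(x) + 2*sin(2*x) + 0.5*sin(7*x)  # "true" field; its mean is 12
x  <- sort(runif(40, 0, 2*pi))         # irregular "station" positions
w  <- rep(1/length(x), length(x))      # L1 = plain unweighted sample mean
Tx <- Tfun(x)

H   <- cbind(1, cos(x), sin(x), cos(2*x), sin(2*x))  # fitting functions, constant first
HwH <- t(H) %*% (w * H)                # the HwH matrix of the next section
b   <- solve(HwH, t(H) %*% (w * Tx))   # fitted coefficients
A   <- Tx - H %*% b                    # anomaly (residual) part

sum(w * A)   # L1(A): zero to rounding error, by orthogonality to the constant
mean(Tx)     # naive estimate L1(T)
b[1]         # improved estimate: the exact integral of the fit is just its constant term

Typically b[1] lands much closer to the true mean (12) than mean(Tx) does, since only the un-fitted sin(7x) term is left in A to be mis-integrated; that is the whole point of the partition.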

Integration methods for the sphere

I have developed four main methods for averaging temperature anomalies via numerical integration on the sphere, which I have incorporated into my monthly program TempLS, with results regularly reported here (click TempLS button). They are described in greatest detail here, with some further ideas. The methods described are
  • Simple gridding - just assign to each cell of a lat/lon grid the average of the anomalies in the cell, and then use the area-weighted average of the cells with data (a minimal sketch follows this list). This is basically what is done in HADCRUT.
  • Infilled gridding - where cells without data are assigned a value derived from neighbouring cells in some way, and then the area-weighted average of all cells can be used. With the assignment done by kriging, this is the Cowtan and Way improvement of HADCRUT.
  • Finite element integration using an irregular triangular mesh. This is my preferred method; for each month it assigns to each point on Earth the value interpolated from the data at the three vertices of the surrounding mesh triangle.
  • Spherical Harmonics. I'll now talk more about that. In its original form, I did a least squares regression of the temperature against a set of spherical harmonics, and took the lowest order coefficient as the average.
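Before moving on, here is a minimal sketch in R of the first (simple gridding) method. It assumes a data frame anom with columns lat, lon and anom for a single month; it shows the idea only, not the TempLS implementation.

# Simple gridding: cell averages of anomalies, then an area-weighted mean of the
# cells that have data. The data frame layout (lat, lon, anom) is assumed here.
grid_average <- function(anom, dlat = 5, dlon = 5) {
  nlon <- 360 / dlon
  ilat <- floor((anom$lat + 90) / dlat)          # latitude band index
  ilon <- floor((anom$lon %% 360) / dlon)        # longitude band index
  cell <- ilat * nlon + ilon                     # unique cell code
  cellmean <- tapply(anom$anom, cell, mean)      # mean anomaly in each cell with data
  band <- as.numeric(names(cellmean)) %/% nlon   # recover the latitude band of each cell
  latc <- (band + 0.5) * dlat - 90               # cell centre latitude
  wts  <- cos(latc * pi / 180)                   # cell area is proportional to cos(latitude)
  sum(wts * cellmean) / sum(wts)                 # area-weighted mean over cells with data
}

Cells without data simply drop out of the weighted mean, which is the HADCRUT-style behaviour that the infilled and mesh variants above try to improve on.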


The final method, using Spherical Harmonics (SHs), is not like the others, in that it doesn't introduce a subdivision of the area. But I now think it is best seen not as an independent method, but as an improvement applicable to any of the others. To write out the regression maths: if the scalar product of two fields x and y is taken to be L1(xy), and H is the vector of fitting functions (SHs), then the regression coefficients of the fit are

b = L1(H⊗H)⁻¹ L1(H⊗T), where ⊗ is the outer product from the weighted integration

and so b0 L1(H0), the contribution of the constant function H0, is taken as the average of T

I generally denote L1(H⊗H) as HwH, since L1 is just a weighted sum with weights w (this is the architecture of TempLS). The functions H are orthogonal wrt exact integration L, so L(H⊗H) would be a unit matrix. The approximate version deviates from this, and a figure of merit is the condition number of HwH. This is kept down if the integration method L1 is good.
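For concreteness, this is roughly how that bookkeeping looks in R; the shapes and names are my shorthand for the notation above, not the actual TempLS source. H holds the spherical harmonics evaluated at the stations, w the weights of whichever integration method is being improved, and Tm the month's anomalies.

# Sketch of the SH improvement step, following the notation above (assumed shapes,
# not the TempLS code): H is n_stations x n_functions, w a weight vector, Tm the anomalies.
sh_improve <- function(H, w, Tm) {
  HwH <- t(H) %*% (w * H)               # approximate Gram matrix L1(H⊗H)
  b   <- solve(HwH, t(H) %*% (w * Tm))  # regression coefficients
  list(mean = b[1] * sum(w * H[, 1]),   # b0*L1(H0), the improved estimate of the average
       cond = kappa(HwH, exact = TRUE)) # condition number, the figure of merit used below
}

The returned condition number is the quantity plotted (as log10) in the tests below; as discussed there, a good underlying integration method keeps it small and so allows a higher order M.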

In addition to the averaging methods listed above, I should add a very primitive one - the simple, unweighted average. This is generally regarded as too unreliable for use, but my original SH method, which actually performed very well, was basically the improvement of this method. You can of course expect better outcomes from improving better methods, and as I shall show, this happens. However, the relative improvement is much less. The reason is that the improvement comes from the improved integration of the SH components, and with, say, the mesh method, the finite element integration was already very good.

More on Spherical Harmonics

I described the functions here, with visualisation here, and a movie here. There are two parameters, L and M; it is natural to think of M as determining the number of oscillations, with L (for fixed M) varying from 0 to M, going from purely latitudinal oscillation to purely longitudinal. In the latter case (L=M), there are M periods of the sinusoidal variation around the equator. For SHs of order up to M, there are (M+1)² functions, and these are the groupings that I will test below.

Tests

I looked at various means of anomalies for each of the months of 2018. I used the anomalies calculated by the mesh method; other integration schemes give slightly different values, but I don't think that changes the issues of integration. For each month and order M of SH, I calculate the mean for each of the four methods
  • No weighting (simple average)
  • Simple grid with no infill
  • Infilled grids
  • Integrating on irregular triangular mesh.
The value for M=0 is the unimproved value. Here is a plot of results. You can see the different months by using the buttons at the bottom:



You can see that the better methods, mesh and infilled grid, usually start out close together, with the simple average often furthest away. But then, as the order M increases, they all converge for a while. Eventually this gets ragged, as the higher order SHs are less well integrated by the integration scheme. The direct cause is ill-conditioning of the HwH matrix. I'll show here a similar plot of log10 of this condition number for each case.

There is an earlier discussion of SH improvements here. It shows (for Jan 2017) the sum of squares improvement, but more relevantly, a detailed map of the residuals. It shows how, after removing up to M=10, there are still spatial patterns in the residuals, some of which look like higher order harmonics.

Discussion

As you look through the months, the blue mesh line is usually fairly stable, at least up to about M=10. This indicates that the removal of SH components has little effect, because the mesh integration was already good, and replacing it with exact integration makes little difference. The other lines start out separated, with the black unweighted average often a long way off. But they converge toward the blue line, which says that the SH separation is giving better results.

The condition numbers also give a good quantification of the merit of the integration schemes. Mesh is markedly better than infilled grid, which is in turn better than simple grid weighting, which in turn is better than no weighting at all. Better conditioning has the practical effect that mesh integration can be used to approximate temperature anomalies with an SH fit (as done monthly here) to higher order, and better resolution, than the other methods.

An anomaly in approximation behaviour is February, in which the other methods converge toward the blue mesh line, but that line shows a gradual increase after about M=5 rather than staying stable. I have looked into that; I don't really know what is different about the February mesh that would cause it. The meshes can be seen here. It is true that the February mesh does a less accurate job of integrating the harmonics, as can be seen from the top row of HwH. I can't track that down to a visible difference in the mesh. The condition number for February rises earlier than it does for March, although it is not so different from January's.

I had thought that, although subtracting fitted SHs did not improve mesh integration here, it might do so if stations were fewer. I tried that in conjunction with my culling process. However, no improvement was found there either. The reason seems to be that when stations are too sparse for SHs to be well integrated, so there is room for improvement, they are also too sparse to allow the fit to be identified. There just aren't enough degrees of freedom.

Conclusion

The idea of separating out an "expected" component prior to integration, which can then be integrated with a more accurate operator, has general application. In the case of spatial integration of monthly anomalies on the sphere, it gives a radical improvement of inferior methods. It does not do much for the better mesh integration, but the different behaviour quantifies the order of merit of the various schemes.

Although the mesh method seems to be at least as good as SH upgrades of grid methods, the latter are faster when meshes would otherwise have to be calculated. Simple gridding is fast, as is the SH fitting. However, on my PC even full meshing takes only a few minutes. So I have upgraded the "SH method" in TempLS to be the enhancement of the mesh method, since I can generally re-use meshes anyway.



Saturday, April 9, 2011

TempLS Ver 2.1 release

This is the first release for a while. The structure is still much like ver 2, and the math basis is the same. The main differences are:
  • The code has been internally reorganized. Most variables have been gathered into R-style lists. And more code has been sequestered into functions, to clarify the underlying structure of the program.
  • The new work on area-based weighting has been incorporated.
  • Other work has been done with weighting. In particular, subsets of stations can be weighted differently relative to each other.
  • Hooks have been added so that the adaptations for the Antarctic studies can be included.


The preprocessor, which merges and converts data files available on the web into a standard TempLS format, is unchanged. Data and inventory files from Ver 2 will still work.

A reminder of how TempLS is used. In an R environment, a jobname is defined, and a .r file with that name is created which defines some variables relevant to the wanted problem. This may have several consecutive runs defined. The variables are variants of supplied defaults.

Then source("TempLSv2.1.r") solves and writes data and jpg files to a directory jobname.

The files you need are on a zip file TempLSv2.1.zip. This 16Mb file contains preprocessed data sets, so you won't need to use the preprocessor. I've had to separate GSOD data; GSOD.zip is another 7 Mb.

If you just want to look at the code and information (everything readable), there is a 170 Kb TempLS2.1_small.zip. A ReadMe.txt file is included. There is also a partly updated manual from V2.0, most of which is still applicable. More details below, and past files on TempLS can be found on the Moyhu index. I'll describe below the things that are new.



TempLS internal lists

Since TempLS is R code, you can of course modify it as you wish. But it's generally better to work through the jobname.r file. To understand the code, you'll want to know about these global lists:
  • u - the one most encountered. It is defined in TempLSuserv2.1.r, and includes the variables that I think you're most likely to want to change. Some important ones:
    • u$inputstem - for selecting data. eg if ="GHCN", uses files GHCN.data, GHCN.inv
    • u$name - a string that gets attached to output and filenames - can be varied between runs
    • u$num_runs - your file can have several expressions {} and a run could be done with each - num_runs says how many are used
    • u$station_mask - select from the inventory which stations to include in the analysis. A logical expression based on the names of columns
    • u$model - what kind of analysis: 1 for time series, 2 for spatial trend, 3 for spatial period (currently a year)
    • u$yr0, u$yr1 - analysis endpoints
    • u$postp() - an output function where you can add requests
  • v - list from the data read - loaddata()
    • v$d - the full data set from the .data file; v$nd = number of rows
    • v$i - the inventory table from the .inv file; v$ni = number of rows
  • s - things created following station selection by u$station_mask
    • s$o - logical vec (length v$ni) of station selection
    • s$ns, s$nm, s$ny - numbers of stations selected, months and years
    • s$i - index vec (length s$ns) of station selection
    • s$ib - reverse index vec (length v$ni) of station selection
    • s$dm - various dimensions that can be assigned to the s$ns*s$nm data array
  • out - list for accumulating runs for compiling run report

TempLS V2.1 output

You'll generally get a map of the region with stations marked, and a plot of the number of stations reporting in each year. The map is a projection of the sphere; you can specify the lat/lon of the centre point (u$mapfocus). If the region is global (u$isglobe=T) you get a Mercator-like map as well.

Model 1 will produce a time series of temperature anomaly, with a smoothed curve and trends. You can ask for several different trend periods to be covered.

Models 2 and 3 will produce shaded plots. If you have asked for a direct solver (oCG=F) you'll get a graph of eigenvalues. If the verif() function is called, you'll get R² for the fit, and you can request plots of residuals at a list of stations (u$residual=c(,,,)).

If you are requesting area weighting, you can ask for a map of the weights and mesh for a specified month, numbered from the start (u$showmeshmonth=..).

Weighting (new)

Before V2.1, the only weighting was by estimated density, based on the number of stations in lat/lon cells. Now 4 kinds of weighting are allowed:
  1. Uniform weighting
  2. Cell weighting (as before)
  3. Mesh-based (ie by area)
  4. Voronoi (also area-based)
In V2.1 Voronoi is not fully implemented, and mesh-based is preferred. You specify the choice with u$wt_type=1, 2 or 3.

You can have two classes of stations that you weight separately. This was created for AVHRR vs land, but would also be useful for SST vs land. You need a mask, u$wt_mask, to select one subgroup. Its value is a logical expression based on columns of the inventory. Then you need u$wt_type to be a vector, eg c(3,1). That gives the wt_type of the set you've chosen, then the rest. They cannot both be area-based (ie >2).

Then you need to express how heavily one class is weighted relative to the other. This is the weight factor u$wtf, which ranges from 0 to 1. The weights are scaled so that (sum of weights of class 1)/(total sum of weights) = u$wtf.

You can also control smoothing of the weights. The parameter is the characteristic length of smoothing, in km, say u$wt_len=100. The default is 0, which means no smoothing. Smoothing gives some benefit, but takes a little time.
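To show how these options fit together, here is a hypothetical fragment of a user file for the two-class case. The inventory column name in the mask is made up, and the exact form expected for u$wt_mask is whatever TempLSuserv2.1.r defines; treat this as a sketch.

# Hypothetical two-class weighting setup, following the description above.
u$wt_mask <- "isAVHRR"    # selects one subgroup; "isAVHRR" is an invented column name
u$wt_type <- c(3, 1)      # mesh-based weights for that subgroup, uniform for the rest
u$wtf     <- 0.5          # the selected class carries half of the total weight
u$wt_len  <- 100          # smooth the weights with a 100 km characteristic length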

Zip file contents

    • ReadMe.txt - summary of the contents
    • DataInv.zip - contains the preprocessed data files (eg GHCN.data) and inventory files. With this you won't have to use the preprocessor until you want an update. This file is 16Mb; because complete data exceeds the 20Mb limit here, I've had to separate the GSOD data and inv, which is in GSOD.zip.
    • TempLSv2.1.r - the master routine
    • TempLSfnsv2.1.r - the collection of functions for the master routine
    • TempLSuserv2.1.r - defaults for the user file, defining list u
    • TempLSprepv2.1.r - preprocessor file
    • TempLSprepinputv2.1.r - some names etc for the preprocessor file
    • userfiles.zip - a zip file of user routines for example problems. Most were used in compiling recent Moyhu posts.
    • TempLSv2a.pdf - a user's guide for v2, which has been updated with new variable names etc for v2.1. It does not yet have a description of the new features.
    • testall.r - a master file for testing - uses all the supplied example files, creates their directories of results, and makes a file testall.htm in which you can see all the code and output. Handy to check if you have everything. Good to run with R CMD BATCH.
    • testall.mht - an output from testall.r that I made recently, stored as IE archive.
    • miscR.zip - files like ryan.r (some of Ryan O's graphics routines, called by the user file for Antarctic.r). Also update.r, used by testall.r.







Monday, April 4, 2011

Blogger's spam filter

In about September last year, Blogger, the Google outfit that hosts this site, introduced a new spam filter. Until that time I had been very happy with them. I still am, in most respects. But the spam filter is making dialogue on the site impossible. It is supposed to learn from the liberations that I make, but it seems to be just getting worse. I have not yet been blest with a single item of real spam, but about one in three genuine comments goes to the bin for no obvious reason at all.

The problem is compounded because, since I'm on Australian time, they can sit in the spam bin for hours before I can fix it.

I've done what I can to raise the problem with Blogger. They don't offer direct communication, but delegate feedback to a forum. The response is that, no, you don't get a choice here, your sacrifice is for the general good.  We seem to be conscripted into an experiment whereby Google uses our rescue activity to improve its spam database.

So I'm looking at Wordpress. Blogger has just announced a revamp, but from their aggressive attitude about the filter, I don't see much hope. I'll wait a few days longer, and listen to advice on whether Wordpress is likely to be better, but it's looking inevitable.