Saturday, April 17, 2010

The Math description of TempLS

I've put up a file called TempLS.pdf (about 80 kb) on the document repository. It sets out the least squares theory, and goes through the analysis code explaining the implementation.

It may already be a little bit obsolete. It's likely that V1.4 will solve the design system by the conjugate gradient method. The matrix is well conditioned, and the method seems reliable and at least five times faster, making inversion a matter of 10-20 seconds.
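For the curious, a generic conjugate gradient iteration in R looks like the sketch below. This is just an illustration of the method for a symmetric positive definite system S*x = b - the function name and details are mine, not the V1.4 code.

  # Conjugate gradient solver for S %*% x = b (S symmetric positive definite).
  # A sketch of the method only, not the TempLS implementation.
  cgsolve <- function(S, b, tol = 1e-8, maxit = 500) {
    x <- numeric(length(b)); r <- b; p <- r
    rs <- sum(r * r)
    for (i in 1:maxit) {
      Sp <- as.vector(S %*% p)         # the one matrix-vector product per step
      a <- rs / sum(p * Sp)            # step length
      x <- x + a * p
      r <- r - a * Sp                  # updated residual
      rsnew <- sum(r * r)
      if (sqrt(rsnew) < tol) break     # converged
      p <- r + (rsnew / rs) * p        # new search direction
      rs <- rsnew
    }
    x
  }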

Thursday, April 15, 2010

TempLS - what's been happening, and where it's going

The development of the least squares calculation of a regional or global temperature is now spread over a series of posts, so I'd like to gather together what has been done, and where it might be going. The list of posts is here:

Least squares

The first post, GHCN processing algorithm, introduced the basic idea, variants of which Tamino and Jeff/Romanm had been using to combine stations in grid cells. The model is that the array of station temperature observations, over month and year, can be described as the sum of a local function L, depending on station and month (not year), and a global temperature G, depending only on year. The idea is that L contains the unwanted effects of site variability and seasonal effect, while G is the item of interest.
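In symbols, if T(s,m,y) is the observed temperature at station s in month m of year y, the model is

  T(s,m,y) ~ L(s,m) + G(y)

and L and G are found by minimising the weighted sum of squared residuals, with weights as described next.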

A weight function is used for the fitting. This has a value for each month-year station combination, and is the product of two components:
1. It is zero if there is no observation
2. It varies inversely with area density of stations.
The density is estimated with a 5° x 5° grid: the stations falling in each cell are counted, and the cell area divided by that number gives the cell estimate.
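As a rough sketch of component 2 (my own minimal version, not the TempLS code, assuming vectors lat and lon taken from the inventory):

  # Inverse-density weighting from a 5x5 degree grid - a sketch only.
  denswt <- function(lat, lon) {
    ilat <- floor((lat + 90) / 5) + 1          # latitude band, 1..36
    ilon <- floor((lon + 180) / 5) + 1         # longitude band, 1..72
    cell <- (ilat - 1) * 72 + ilon             # cell index for each station
    n <- table(cell)                           # station count per occupied cell
    area <- cos((5 * ilat - 92.5) * pi / 180)  # cell area ~ cos(centre latitude)
    area / as.numeric(n[as.character(cell)])   # cell area divided by count
  }

Component 1 then just zeroes the weight wherever there is no observation.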

So a major difference from other methods is that no grid averages are calculated. A grid is invoked, but only to count the stations it contains.

The LS leads to a big linear system (t(B) is transpose):
A1*L + B*G = R1
t(B)*L + A2*G = R2
In principle, there could be about 90,000 equations and unknowns. But A1 and A2 are diagonal matrices, so L can be eliminated (the time-consuming step), giving about 100 (number of years) equations for G. This takes about a minute of computing time for a global set. Total time for big global runs is generally less than two minutes.
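Explicitly, since A1 is diagonal its inverse is cheap, so L = inv(A1)*(R1 - B*G), and substituting into the second equation gives the reduced system for G:

  (A2 - t(B)*inv(A1)*B)*G = R2 - t(B)*inv(A1)*R1

The bracketed matrix is the Schur complement that turns up again in a later post.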

That first post then pointed to the R code, and gave some basic results for land-only GHCN global data.

Comparisons


The next post, Comparison of GHCN results, was meant to compare the results with other efforts and the major indices. However, Zeke had done the hard work there - mine just slotted in, and seemed to give results very much in line with the others.

More results


A side-effect of using R is that it is easy to create all sorts of masks for selecting different kinds of sites. The GHCN file v2.temperature.inv is read in as a dataframe, and each of its components can be used in logic tests. This includes latitude and longitude, country code, altitude, pop, urban status, etc. So the next post, More GHCN results, announced Ver 1.1, and showed some results for much-discussed classifications like urban/rural, airports etc. The main V1.1 change was to separate out the process of removing duplicates into a preprocessor program, since it only needs to be done once for any version of v2.mean.
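For instance, with the inventory read into a data frame tv (the name used in the code excerpts below), selections look like this. tv$urban and tv$ipop appear in the posted run cases; the lat column name is my assumption about how the .inv fields are read:

  # Illustrative selection masks - a sketch, not the V1.1 code itself.
  rural   <- tv$urban == "A"                  # GHCN rural flag
  bigcity <- tv$ipop > 2000                   # population > 2M (ipop in thousands)
  tropics <- tv$lat > -23.5 & tv$lat < 23.5   # latitude band
  i01 <- i01[rural & tropics]                 # combine, as in the run cases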

Mechanics


The most commented post was the next, Tips on File Download. The lead-up to V2 was the effort to make the process more user-friendly - the adoption of a text repository, and structuring the code so the parts needed by users were visible at the top.
The next post, Version 1.2 of GHCN processor, described these improvements in detail.

Using the capability


The stimulus prompting much of this effort was Steve Hempell's determined efforts to get the code to work for him. The code was much improved by this (and helped by Carrick, Mosh and others). I posted another set of tests of some past issues - whether "dropped" GHCN stations had been trending down relative to "kept" stations, for example. This used an extension to the inventory categories whereby stations could be selected also by reporting history.

Then Steve got to work on the questions that were interesting him. Two posts were based on his thorough examination:
Latitudinal temperature histories and trends and Continents and trends.

The Big City Trends post related to an interest that Joseph had been exploring with his code.

Incorporating Ocean data


The least squares approach makes it relatively easy to combine data from multiple sources. The next challenge was a full global index, and this was done in V 1.3. It involved more preprocessing, so the inventory file was augmented by a series of synthetic stations based on the HadSST2 ocean grid cells. These could then be included in the fitting in the normal way. The cell density weighting took care of scaling issues. The results were described in Incorporating SST and Land/Ocean models. The fit to SST is excellent - the fit to combined land/sea is still good, but with more disparity. This could probably be improved by better balancing at the shore. V1.3 is described in TempLS Version 1.3 - SST, Land/Sea index, distance utility.

What's next?


Probably more applications posts. On the technical side, using the R global mapping capability. I expect V1.4 will routinely give a map showing the distribution of stations in each run. But beyond that, LS makes it possible to use other parametrisations, particularly spatial. We could show, say, trends expressed in terms of spherical harmonics. R does not seem to do shade plots, but its filled contours aren't bad.

In another direction, finer temperature plots will be possible. Monthly fitting needs little more calculation than annual. The months can be done separately, but that only cuts the big elimination step to 1/12 of the computing, and it then has to be done 12 times.

TempLS Version 1.3 - SST, Land/Sea index, distance utility

I need a shorter name for the project. So this is TempLS V1.3, which I used for my previous post on Sea Surface Temperatures (SST) and combined indices. SST is the big change, but there are some others you'll notice.

One is a new facility for calculating distances by lat/lon. You can ask for all stations within a distance of a location, for example. You can even vary the weighting function. So if you wanted to replicate the GISS calculation of the trend of all rural stations within 1000 km of a location, weighted with a distance taper, you can.

The code is rearranged so the selection criteria sit in a function near the top, below the other user-choice items. If you build up a lot of these, you can put them on an external file, which you can include with a source().

As usual, the code is on the repository here, and at the document store. More details below.

A summary of this series of posts is here.



Summary of operation for TempLS

  1.  Download v2.mean and v2.temperature.inv from NOAA
  2.  Download preprocessorv1.3.r and GlobTempLSv1.3.r from the document store.
  3.  If you want to use ocean temperatures, download HadSST2_SST_1850on.txt from hadsst2.zip, and change Sea=F to Sea=T at the top of the preprocessor file.
  4.  Check the top of both R files for user settings you might want to make. At the head of the GlobTemp file, choose the cases you want for a particular run. You may have to write conditionals.
  5.  Start R, and source("preprocessorv1.3.r"). You won't have to run this again unless the version or input files change.
  6.  Having decided your run choices, including trend and plot periods, source("GlobTempLSv1.3.r")
  7.  You'll find each case has generated jpeg plots and a combined Output file. The choice string for each run case is in the title.
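Put together, a first session might look like this (the working directory is illustrative):

  # A typical TempLS V1.3 session, following the steps above
  setwd("~/TempLS")              # wherever the downloaded files sit
  source("preprocessorv1.3.r")   # step 5: run once per version or input change
  source("GlobTempLSv1.3.r")     # step 6: runs the chosen cases, writes plots and Output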

SST and restructuring

The TempLS code as before consists of a preprocessor and an analysis stage. Once you have preprocessed (for this version), you don't have to do it again for each new analysis. The preprocessor now modifies both the GHCN data file v2.mean and the inventory file v2.temperature.inv. These are now called modified.data and modified.inv. You'll need to have them in the same directory as the one where you run the analysis. A change from v1.2 is that there is now no intermediate file statyrs.txt.

At the top of the preprocessor file is a logical variable Sea. It's set to T by default, and then you must supply a copy of the HadSST2 file HadSST2_SST_1850on.txt. If you set it to F, then no SST data will be used, and the code behaves as in previous versions.

If you do have Sea=T, you'll need to check previously used selection conditions. I've introduced a new logical vector isLand, which you can add to your choice to make sure you don't include sea stations when restricting by lat/lon, say (unless you want to). Sea=T means that the inventory file has 2592 artificial stations added, with a made-up number starting with 899 and a lat and lon - all other properties are blank.

New user choices

As said above, I've put the user choice criteria as a function (chooser()) near the top. This will help if you want to take it to an external file, for example.

You can now choose an outputdir and a plotdir to direct output to. I have removed the analysis input file choices from the analysis program, since those files are now created by the preprocessor.

As also said, you can choose stations via getdist(lat,lon). If you want to know the temp history of stations within 1000 km of Kansas City (lat=39, lon=-95) you could write:
if(tc=="KC"){i01=i01[ getdist(39,-95)<1000.]}

If you want to change the weights and get rural stations in that range only (as for GISS UHI), then

if(tc=="KC"){gtd=getdist(39,-95); i01=i01[ gtd<1000. & tv$urban=="A"]; weight=weight*(1-gtd/1000.);}

Other Issues

We've previously had trouble with Windows machines returning NULL to the time function used to mark output files etc. I've put in something that I hope will avoid trouble (tho' you won't get the times in the filename). Please let me know if there are still problems.

Trends are now reported with SE. This is calculated using simple OLS - no correction for correlation of residuals.
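In R terms, that's just an ordinary lm() fit; a minimal sketch, with anom and year as illustrative names:

  # OLS trend with standard error - no autocorrelation correction
  fit <- lm(anom ~ year)
  coef(summary(fit))["year", c("Estimate", "Std. Error")]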

Temperature averages are now expressed relative to a 1961-1990 base - ie the mean over this period should be zero.


Wednesday, April 14, 2010

Big City Trends

Joseph asked about an analysis for trends in cities over 2M population. I'm not sure that the GHCN database captures this well. I looked up my own big city (Melbourne) and of the 5 stations listed, three were on the city fringe, and may well be climatically similar to any city fringe area.

However, the analysis is easy to do. The instruction I inserted was:
if(tc == "BigCity") i01 = i01[tv$ipop > 2000 & tv$urban == "C"]   # ipop is in thousands

I also did MedCity, with pop > 500,000. Joseph's analysis is here.


Trends

Trend        BigCity     MedCity
Number       205         562
1901-2005    0.1154      0.08848
Trend se     0.007321    0.007264
1979-2005    0.3073      0.291
Trend se     0.05195     0.05094


Plots

[Plots of the BigCity and MedCity temperature series appeared here.]

Incorporating SST and Land/Ocean models

The part of Gistemp that most people who look into it shy away from is where the gridded ocean data is included to make a global land/sea index. With some reason, because it's awkward, but not particularly challenging - the HadSST data is already gridded, and what remains is the messy stuff of keeping the grid methods consistent.

Other recent reconstructions have concentrated on land stations too, perhaps because, as Zeke says, the context was the examination of claims that drift in selection of stations creates biases.

The least squares method used here makes the incorporation of ocean data easier, because there is no intermediate gridding step. So I tried it: I downloaded HadSST2 data, and added a set of 2592 (36x72) synthetic stations - one at the centre of each grid cell. HadSST2 uses the same 5° x 5° grid as I use for weighting for station density. So where the station sits in an ocean-only cell, it is fully weighted as the sole representative of that cell, and where ocean overlaps land, the new station is weighted according to the number of land stations in the same cell. If the cell is land only, there is no new data, so that cell does not add to the analysis.
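A minimal sketch of generating those 2592 synthetic stations (the variable names are mine, not the preprocessor's; the 899 ID prefix is the convention noted in the V1.3 post):

  # One synthetic station at the centre of each 5x5 degree HadSST2 cell
  lat <- rep(seq(-87.5, 87.5, by = 5), each = 72)     # 36 latitude bands
  lon <- rep(seq(-177.5, 177.5, by = 5), times = 36)  # 72 longitudes per band
  id  <- sprintf("899%08d", 1:2592)                   # made-up IDs starting with 899
  sst.inv <- data.frame(id, lat, lon)                 # other inventory fields left blank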

New code

This all required some new coding. That's tied up with some other changes, so I'll post ver 1.3 shortly. The change was mostly in the preprocessor, which rearranges the Had data to the GHCN format and creates the new station file.

A summary of this series of posts is here.

Trends

Here is the table of trends from my analysis:
Trend        LandSea     SST        NHLandSea   SHLandSea
Number       9872        2592       7281        2591
1901-2005    0.06616     0.06654    0.06428     0.0684
Trend se     0.004501    0.003879   0.005979    0.00381
1979-2005    0.182       0.151      0.2445      0.1094
Trend se     0.0224      0.0178     0.03169     0.0171

Update: I hope to be routinely quoting standard errors for trends. At present these are based on OLS - no correction for correlation of residuals. So they are smaller than the AR4 values below.
 
I chose the time intervals for comparison with Table 3.3 from the AR4:
Temperature trend (°C per decade)

Dataset                            1850-2005      1901-2005      1979-2005
Northern Hemisphere
CRU/UKMO (Brohan et al., 2006)     0.047 ± 0.013  0.075 ± 0.023  0.234 ± 0.070
NCDC (Smith and Reynolds, 2005)                   0.063 ± 0.022  0.245 ± 0.062
Southern Hemisphere
CRU/UKMO (Brohan et al., 2006)     0.038 ± 0.014  0.068 ± 0.017  0.092 ± 0.038
NCDC (Smith and Reynolds, 2005)                   0.066 ± 0.009  0.096 ± 0.038
Globe
CRU/UKMO (Brohan et al., 2006)     0.042 ± 0.012  0.071 ± 0.017  0.163 ± 0.046
NCDC (Smith and Reynolds, 2005)                   0.064 ± 0.016  0.174 ± 0.051
GISS (Hansen et al., 2001)                        0.060 ± 0.014  0.170 ± 0.047

Plots


Here my plots are compared with the published gridded data, taken from WFT. The smoothing uses the 13-point filter from AR4 Chap 3.

The anomaly base periods have been adjusted to match in each case to the published data.

Tuesday, April 13, 2010

The impact of war on GHCN observations


For computational reasons I've been studying the Schur complement of the matrix involved in LS fitting of temperature models - the reduced matrix left after eliminating the local station effects. That sounds hopelessly geeky, but its diagonal is basically a measure of the number of missing values in the GHCN station data. If I can predict them, I can get a much faster algorithm. So I plotted the diagonal, and ...

[Plot of the diagonal of the Schur complement, year by year]

A pretty dramatic marker of the effect of world wars on Met observations.


Monday, April 12, 2010

Continents and trends

A joint post with Steve Hempell

In another run with ver 1.2.1, Steve has focussed on the continents. The trends are interesting, and somewhat different between continents. For comparison, we show the rather rough plot from the AR4 SPM (SPM-4), which seems to be in agreement.

The Output files for this and the previous post are now on the document store.

Table of trends


Trend        North America  South America  Europe   Africa   Asia     Australia
1900-2009    0.0724         0.0451         0.0909   0.0536   0.0973   0.0497
1900-1997    0.0451         0.0312         0.051    0.0218   0.0708   0.0501
1979-1997    0.0444         0.133          0.325    0.144    0.311    -0.169
1999-2009    0.00673        0.293          0.226    0.325    0.372    0.715

The last set of trends, covering just one decade, is the least certain. All trends were calculated by OLS fitting.


The AR4 Plot

[AR4 figure SPM-4 reproduced here.]

SPM-4. Note that the boundary between Europe and Asia might not correspond to ours.

Our plots


[Anomaly plots for each continent - Africa, Asia, Australia, Europe, North America, South America - appeared here.]