Friday, October 21, 2011

The Berkeley Surface Temperature (BEST) analysis

I woke up this morning and saw that the BEST analysis papers and data where online. And there were already posts at Judith Curry's, WUWT, Tamino's and Stoat, and soon one by Zeke at the Blackboard. The dynamic blogroll here put Zeke's upbeat "Some interesting results from BEST" directly above Stoat's pithier "BEST is boring".

My expectations had been somewhat more in line with Stoat's, and indeed the concensus seems to be that it confirms what had been known. But I was interested in the analysis, mainly because the claimed novelty was the least squares method that TempLS uses, which I thought David Brillinger would have improved considerably.

One of the minor surprises was that DB was not on the list of authors of the main analysis paper, despite being one of the big names on the Team. However, he is acknowledged handsomely, and the sophistication of the statistics does indicate his contribution.

There have been a number of other analyses in the last two years which, like BEST, confirm the major indices. Some were land-only, others included sea surface temperatures.

So, with that preamble, here is a very preliminary discussion, mainly of the paper Berkeley Earth Temperature Averaging Process.

The main thing that has puzzled me about this project is that they have restricted themselves to land-only. I heard that the funding ran out before they could include oceans, but I thought the priority was odd. There isn't much point in fancy stats when 2/3 of the globe is missing. They go into a discussion of the different ways the land-only indices handle this problem. They, like NOAA, treat the stations as representing land only. GISS weights the stations in such a way that they attempt to represent the whole globe. BEST says the issue has had insufficient discussion. I think that the BEST discussion fills a much-needed gap in the literature. The land-only indices should really be of only historical interest.

Anyway, they have done some things in the modelling I had been putting off, specifically kriging. TempLS analysis is based on OLS, which effectively assumes that the residuals are independent, identically distributed. Of course that isn't true, and it is possible to improve. I had that on the back burner, mainly because I suspected it would make no real difference to the mean estimate, though it gives a better handle on the error. I think that BEST have affirmed that, which is valuable.

Their model (eq (4)) is
\(d_i(t_j) = \theta(t_j)+b_i+W(\vec(x)_i,t_j) + \epsilon_{i,j}\)
for the temperature in the i-th station at the j-th timestep, and bi is the baseline temp at station i. θ is the global variation. The TempLS equivalent (omitting the noise ε) is:
dsmy  ~  bsm  +  θy

 TempLS decomposes timestep into years y, months m and s corresponds to i above. A big apparent difference is that TempLS includes seasonality in the baseline. It's not currentrly clear to me what BEST does here. The online summary says that they allow for it at a later stage. The TempLS global function can be assumed to vary with year only or with year and month. For trends it makes little difference to the result.

BEST's weather term W  is estimated from correlations. I'm not sure why it is done this way rather than including the correlation in the weight function.

TempLS does a weighted fit in which the weight is by area density (and zero for no record). BEST put a lot mote into their weighting. They use a kind of kriging which adds spatial structure, and they weight by the confidence they have in the reading. Naturally this is only very approximately estimated.

An interesting sub-discussion is on the number of degrees of freedom. They say that 180 well-placed stations should be enough, echoing a figure of Phil Jones. Others  say even sixty would do, and TempLS did this analysis, fairly successfully.

TempLS avoided homogenization (with little apparent effect); BEST introduce a new technique - the scalpel. Or fairly new - they break the series into two when they suspect a station discontinuity, and reduce the weighting nearby to compensate for the greater uncertainty. I suspect the loss of information here is rather great - with the lower weighting, of course, but also the much lower confidence in the baseline means of the fragments (assumed independent) rather than the higher confidence you would have in a longer period.

They also intervene to identify outliers - not by removing them but by reducing the weighting (with similar effect). There are some dangers in this, and one needs to think with a dataset like GHCN how it interacts with the efforts that have already been made to remove outliers. "Outliers" in a preprocessed data set are much more likely to represent real data.

They do a much more sophisticated uncertainty analysis than TempLS, and I like their jackknife approach, and will probably use it. On the other hand, I think their spatial uncertainty analysis is rather pointless when they cover land only.

They break the station baseline temperatures (climatologies) down by latitude and altitude. I'm not sure why, as it isn't helpful for an analysis of global temperature. It is of independent interest, but the analysis could have been done separately on the derived baselines.

There will be several posts to come. I've only just started to look at the Matlab code. An ambition is to port it to R, but there is quite a lot of it. Currently the zip file of text data is somewhat corrupted - I've managed to get all but the flag.txt file, which probably is not needed to get started. Of course, the paper discussed here uses GHCN data only. Maybe I'll be able to compare BEST and GISS on the same data.


  1. An issue to look at might be the extension back to 1800, which has raised some eyebrows.

  2. Belette,
    GHCN had quite a lot of early data too. I think BEST is claiming that their better stats reduces the error to a tolerable level. Maybe.

    I've got a new KMZ file (latest post) which lets you see just where these early stations are, for both BEST and GHCN.