Wednesday, March 31, 2010

Comparison of GHCN results

I promised some comparison plots from the code reported in the last post, but Zeke has made the task much easier. He posted a spreadsheet with all the data nicely arranged, and I added my own data - you can find the spreadsheet "Temp Comps_NS.xls" in the repository.

Below are some of the plots and discussion.

A summary of this series of posts is here.

Here is the global data from 1880 ( or 1881 in my case). Details of the sources are on Zeke's post. All data is set to zero average in 1961-90.

And here is a close-up of the period 1960-2009.

The agreement is good. This is encouraging, because the principle of the least squares method is very different, and bypasses all the merging of stations issues. Of course, the agreement of all the other reconstructions is also impressive.

Here is a plot with just the main indices GISS and Hadcrut:

GHCN processing algorithm

As Zeke says, it seems to be a custom now for bloggers to post codes to grid GHCN data and calculate trends. Here's my offering. It's influenced by Jeff Id's program, and particularly by Romanm's Least Squares. In fact, it's an extension of Roman's offset method to the whole global set. It's fairly different to the conventional approach. It doesn't actually do gridding, or (per Roman) calculate anomalies. It's probably the shortest of the codes, and runs quite fast.

A summary of this series of posts is here.

The linear model

I'll attach a pdf file explaining the mathematics in more detail (soon). But here's a sketch. The monthly data, from v2.mean after averaging over duplicates, is in a big array x(S,m,y). The indices are station, month and year.

The model is that x can be represented in terms of a global temperature G(y) depending on year alone (not month or station) plus a local offset L(S,m) for each station S, with a seasonal component dependent on month m (but not year), plus error:
x(S,m,y) = G(y) + L(S,m) + eps(S,m,y)
I do a least squares weighted fit.

The weight function w(S,m,y) is inversely proportional to area density of stations, and of course, zero for (S,m,y) combinations where there is no data. So we're minimising (Latex-like notation) Sum_{S,m,y} w(S,m,y)*(x(S,m,y) - G(y) - L(S,m))^2


The weight function uses a grid. But it just counts stations in each cell - no explicit grid averaging is done. The count is for each month (and year). The cell area is divded by this count to give wt(S,m,y). A conventional 5 deg x 5 lat/lon cell is used.

The design matrix system

Minimising LS gives the system
L(S,m)*Sum_{y} w(S,m,y) + Sum_{y} {w(S,m,y)*G(y)}
            = Sum_{y} {w(S,m,y)*x(S,m,y)}

Sum_{S,m} {w(S,m,y)*L(S,m)} + G(y)*Sum_{S,m} w(S,m,y)
           = Sum_{S,m} {w(S,m,y)*x(S,m,y)}

Regarding (S,m) as a single index, this is actually a symmetric system:
A1*L +B*G = R1
t(B)*L +A2*G = R2

A1 is the very big block, but is a diagonal matrix. t(B) is the transpose of B. A2 is diagonal too.


This can no doubt be solved (for local L and global temp G) using R's lm() type functions. Numerically, it is essential to take advantage of A1 being diagonal. I don't know how R can do that, and would appreciate advice.

I constructed a home-made block solver which was quite fast and apparently effective (although I have an idea for a faster version). A wrinkle is that the matrix is singular, because you coould add any constant to the global temp G and subtract it from L. So I added an equation to fix the initial G to zero.


To test the code, I created a synthetic data file with the v2,mean format, but made up of just a trend with a sinusoidal temperature (same for all stations), with a random scattering of missing data. It should give an exact answer, and it did. The test code is included here.


I haven't tried hemispheres, yet, just global. I'll show in a following post comparisons with HADCRUT etc. For the moment, I just looked at global temps and trends from 1900 to present, and 1978 to present. I was particularly interested to compare with Jeff's value of 0.2476 C/dec. Here's my fig, with the result 0.2542 C.dec. (The AR1 correction is coming).

For 1900-2009, my slope is 0.0747 C/dec - well within Zeke's range.

Code and next steps

The code, and the test code, is in the repository. In a future post, I'll show results for the local seasonal temperatures from the LS fit - they are pretty interesting too.

Thursday, March 11, 2010

On Polynomial Cointegration and the death of AGW

Well, I never thought I'd write a post with such a title. And I must confess that until about two weeks ago, these terms were very foreign to me. But recently, a paper by the econometricians Beenstock and Reingewertz has been circulating among sceptic blogs. It seems unpublished, but with the file called "Nature_Paper091209.pdf" there has been some jumping to conclusions. There was a brief discussion at WUWT, and more recently, David Stockwell has run a series on the topic.
Update.  Bart has commented has commented and linked an interesting discussion on his blog. Tamino has taken this up with a very thorough discussion (with sequels). The first properly sceptical discussion of the Beenstock paper was by Eli Rabett

Further update. VS has noted that the much more significant thread on Bart's blog (now much discussed in itself) is here.

Ringing phrases like:
"Therefore, greenhouse gas forcings, global temperature and solar irradiance are not polynomially cointegrated, and AGW is refuted."
ensure that those who like that sort of thing will keep it going for a while, so I thought I should try to figure it out. I'm still not an expert, but I hope what I did figure out might help.

Time series and unit roots

The theory is derived from the abstract notion of modelling a stochastic time series by a recurrence relation:
y_t + c1*y_{t-1} + ... + cn*y_{t-n} = ε_t
where ε_t is a stationary stochastic variable with zero mean (serially uncorrelated, constant variance).

This can be re-expressed symbolically using the backward difference operator
Dy_t = y_t-y_{t-1}
(a0 + a1*D + a2*D^2 +...+ D^n) y_t = ε_t
This is just binomial algebra

This is the undriven form, but a non-stochastic driver could be introduced. The corresponding non-stochastic recurrence relation (with ε_t=0) has n power solutions of the form b^t (t=0,1,2,...), and the b's are solutions of a characteristic n-th order polynomial. It is expected that these will all have magnitude not greater than 1, otherwise the process would have blown up.

The case where one or more roots are equal to 1 is called the unit root. It's important because the behaviour is qualitatively different. Instead of disturbances decaying with time, the process can just wander. It is a random walk.

Testing for unit roots

The point of the second difference form is that there is a clear criterion for unit roots. There is one root if a0=0, and the series is called I(1). There is a repeated root if a0=0 and a1=0, and the series is called I(2).

Now although people who like this sort of thing like to talk of asymptotic behaviour, there is usually no known underlying mechanics to justify this. There is just a finite period of observations. So the coefficients are never completely known (or of course, even whether it is an exact model). You can't get true asymptotics from a finite observation period.

So you can never say that a0=0. All you might be able to say is that a0 is bounded away from zero - significantly different at some level of confidence, say 95%. That's a test, and variants include the Dickey-Fuller test, and the augmented ADF test.

So when you say a series is I(1), you are saying that a test failed to yield significance. You weren't able to show, at 95% confidence, that a0 was different from 0. That's a negative conclusion, and certainly doesn't mean that you're 95% confident that the series is I(1). You're not 95% confident of anything.

In fact it's hard to quantify confidence in I(1) status in any way. Higher levels, eg I(2), likewise say that we weren't able to show that both a0 and a1 are significantly different from zero. Two failures is even harder to express as a confidence level.


This is used in causal arguments as follows. If, say, a temperature rise is believed to be caused by a CO2 rise, then there is a consistency requirement for the series. The long-term qualitative difference between a series with roots less than unity, and one with unit roots is great. So it would be odd to find that if temperature is I(1), so its a0 = 0 but a1 is not 0, while for CO2 (with coefs b0, b1 etc) b0 = b1 = 0. For that would mean that the temperature differences were stable, with disturbances decaying, even though the corresponding differences of CO2 could drift. That's the supposed significance of the failure of "polynomal cointegration". Though its still a big jump to say "AGW is refuted".

For let me say again, we can never be totally confident, on finite data, of statements about I(1), I(2) etc. But we don't seem to be able to quantify our uncertainty. Even less can we quantify compound statements like T is I(1), CO2 I(2), therefore incompatible.

Back to Beenstock

So how do B&R quantify their claim that AGW is refuted? Explicitly, not at all. I didn't see any confidence level attached to it. They did describe a number of tests on the CO2 series. That is, they quantified the extent to which they can say that the test for non-zero b0 and b1 failed. But oddly, there seemed to be no test statistics at all on temperature. They just said that it is I(1), apparently relying on papers by Kaufmann and others. But no confidence level was cited.

So with no quantitative confidence in one of the parts of a compound proposition, with what confidence can they say that "AGW is refuted".

They go on to make even more remarkable claims
Although we reject AGW, we find that greenhouse gas forcings have a temporary effect on global temperature. Because the greenhouse effect is temporary rather than permanent, predictions of significant global warming in the 21st century by IPCC are not supported by the data.
Again, with no physics but just a century or so of observations, they are able to say what is temporary and what is permanent.

Saturday, March 6, 2010

A blatant fiddle in the D'Aleo/Watts SPPI report.

The D'Aleo/Watts report has come under justified criticism for it's silly claims about marching thermometers etc, amplified into claims about malfeasance by various scientific groups.

But there's one little thing that particularly bugs me, because I have pointed it out more than once, but it makes no difference. They say  

"See the huge dropout of data in Africa, Canada and Siberia in the two maps from NASA GISS with 250 km smoothing from 1978 to 2008"

and show these pictures:

Well, yes, quite a reduction in going from 1978 to 2008. But actually, if you go to the GISS site and ask for April 2008, what you see is not nearly as sparse.

So what is going on?

I managed to track down where the SPPI image came from. Bob Tisdale has a version of it in a post dated May 20 2008. It's an early version (current when he posted it), made with an early batch of data that came in for that month. And it's compared with the full version of April 1978.

It's a bit like that Langoliers post. But it just keeps turning up.