moyhu: December 2020

There has been a series of posts at WUWT by Andy May on SST averaging, initially comparing HADSST with ERSST. They are here, here, here, here, and here. Naturally I have been involved in the discussions; so has John Kennedy. There has also been Twitter discussion.My initial comment was here:

"Just another in an endless series of why you should never average absolute temperatures. They are too inhomogeneous, and you are at the mercy of however your sample worked out. Just don’t do it. Take anomalies first. They are much more homogeneous, and all the stuff about masks and missing grids won’t matter. That is what every sensible scientist does.

..."

The trend was toward HADSST and a claim that SST had been rather substantially declining this century (based on that flaky averaging of absolute temperatures). It was noted that ERSST does not show the same thing. The reason is that HADSST has missing data, while ERSST interpolates. The problem is mainly due to that interaction of missing cells with the inhomogeneity of T.

Here is one of Andy's graphs:

In these circumstances I usually repeat the calculation that was done replacing the time varying data with some fixed average for each location to show that you get the same claimed pattern. It seems to me obvious that if unchanging data can produce that trend, then the trend is not due to any climate change (there is none) but to the range of locations included in each average, which is the only thing that varies. However at WUWT one meets an avalanche of irrelevancies - maybe the base period had some special property, or maybe it isn't well enough known, the data is manipulated etc etc. I think this is silly, because they key fact is that some set of unchanging temperatures did produce that pattern. So you certainly can't claim that it must have been due to climate change. I set out that in a comment here, with this graph:

Here A is Andy's average, An is the anomaly average, and Ae is the average made from the fixed base (1961-90) values. Although no individual location in Ae is changing, it descends even faster than A.

So I tried another tack. Using base values is the simplest way to see it, but one can just do a partition of the original arithmetic, and along the way find a useful way of showing the components of the average that Andy is calculating. I set out a first rendition of that here. I'll expand on that here, with a more systematic notation and some tables. For simplicity, I will omit area weighting of cells, as Andy did for the early posts.

Breakdown of the anomaly difference between 2001 and 2018

Consider three subsets of the cell/month entries (cemos):

Ra is the set with data in both 2001 and 2018 (Na cemos)
Rb is the set with data in 2001 but not in 2018 (Nb cemos)
Rc is the set with data in 2018 but not in 2001 (Nc cemos)

I'll mark sums S of cemo data with a 1 for 2001, 2 for 2018, and an a,b,c if they are sums for a subset. I use a similar notation for averages with A plus suffices. I'll set out the notation and some values in a table:

Set data	N	Weights	2001	2018
2001 or 2018	N=18229		S1, A1=S1/(Na+Nb)=19.029	S2,A2=S2/(Na+Nc)=18.216
Ra 2001 and 2018	Na=15026	Wa=Na/N=0.824	S1a, A1a=S1a/Na=19.61	S2a, A2a=S2a/Na=19.863
Rb 2001 but not 2018	Nb=1023	Wb=Nb/N=0.056	S1b, A1b=S1b/Nb=10.52	S2b=0
Rc 2018 but not 2001	Nc=2010	Wc=Nc/N=0.120	S1c=0	S2b, A2b=S2b/Nb=6.865

I haven't given values for the sums S, but you can work them out from the A and N. The point is that they are additive, and this can be used to form Andy's A2-A1 as a weighted sum of the other averages. From additive S:
S1=S1a+S1b and S2=S2a+S2c
or
A1*(Na+Nb)=A1a*Na+A1b*Nb, or
A1*N=A1a*Na+A1b*Nb+A1*Nc
and similarly
A2*N=A2a*Na+A2c*Nc+A2*Nb
Differencing
(A2-A1)*N=(A2a-A1a)*Na-(A1b-A2)*Nb+(A2c-A1)*Nc
or, dividing by N

A2-A1=(A2a-A1a)*Wa-(A1b-A2)*Wb+(A2c-A1)*Wc

That expresses A2-A1 as the weighted sum of three terms relating to Ra, Rb and Rc respectively. Looking at these individually

(A2a-A1a)=0.253 are the differences between the data points known for both years. They are the meaningful change measures, and give a positive result
(A1b-A2)=-7.696. The 2001 readings in Rb have no counterpart in 2018, and so no information about increment. Instead they appear as the difference with the 2018 average A2. This isn't a climate change difference, but just reflects whether the points in R2 were from warm or cool places/seasons.
(A2b-A1)=12.164. Likewise these Rc readings in 2018 have no balance in 2001, and just appear relative to overall A1.

Note that the second and third terms are not related to CC increases and are large, although this is ameliorated by their smallish weighting. The overall sum that, with weights, makes up the difference is
A2-A1 = 0.210 + 0.431 -1.455 = -0.813
So the first term representing actual changes is overwhelmed by the other two, which are biases caused by the changing cell population. This turns a small increase into a large decrease.

So why do anomalies help

I'll form anomalies by subtracting from each cemo the 2001-2018 mean for that cemo (chosen to ensure all N cemo's have data there). The resulting table has the same form, but very different numbers:

Set data	N	Weights	2001	2018
2001 or 2018	N=18229		S1, A1=S1/(Na+Nb)=-.116	S2,A2=S2/(Na+Nc)=0.136
Ra 2001 and 2018	Na=15026	Wa=Na/N=0.824	S1a, A1a=S1a/Na=-0.118	S2a, A2a=S2a/Na=0.137
Rb 2001 but not 2018	Nb=1023	Wb=Nb/N=0.056	S1b, A1b=S1b/Nb=-0.084	S2b=0
Rc 2018 but not 2001	Nc=2010	Wc=Nc/N=0.120	S1c=0	S2b, A2b=S2b/Nb=0.130

The main thing to note is that the numbers are all much smaller. That is both because the range of anomalies is much smaller than absolute temperatures, but also, they are more homogeneous, and so more likely to cancel in a sum. The corresponding terms in the weighted sum making up A2-A1 are

A2-A1 = 0.210 + 0.012 + 0.029 = 0.251

The first term is exactly the same as without anomalies. Because it is the difference of T at the same cemo, subtracting the same base from each makes no change to the difference. And it is the term we want.

The second and third spurious terms are still spurious, but very much smaller. And this would be true for any reasonably choice of anomaly base.

So why not just restrict to Ra?

where both 2001 and 2018 have values? For a pairwise comparison, you can do this. But to draw a time series, that would restrict to cemos that have no missing values, which would be excessive. Anomalies avoid this with a small error.

However, you can do better with infilling. Naive anomalies, as used un HADCRUT 4 say, effectively assign to missing cells the average anomaly of the remainder. It is much better to infill with an estimate from local information. This was in effect the Cowtan and Way improvement to HADCRUT. The uses of infilling are described here (with links).

moyhu

Sunday, December 27, 2020

How averaging absolute temperatures goes wrong - use anomalies

How averaging absolute temperatures goes wrong - use anomalies

Breakdown of the anomaly difference between 2001 and 2018

So why do anomalies help

So why not just restrict to Ra?

Tuesday, December 15, 2020

GISS November global up by 0.25°C from October.

GISS November global up by 0.25°C from October.

Saturday, December 5, 2020

November global surface TempLS up 0.188°C from October.

November global surface TempLS up 0.188°C from October.