moyhu: Infilling, climatology and anomalies

Monday, June 30, 2014

Infilling, climatology and anomalies

There's been a lot of grumbling lately about USHCN. For some reason Steven Goddard has gone viral, and has been on some right wing media. Judith Curry gives a summary, with links.

WUWT has had a varied role. Initially Anthony Watts wrote with Zeke some good posts on some flaws in SG's methods. Zeke has continued at Lucia's. I chipped in too.

Then it took an odd turn. Anthony got invested in disputing SG's claims, only moderately exaggerated, about the number of USHCN stations that actually reported each month. When he found that there were quite a lot, "zombie" stations became the enemy. And with that, infilling.

The background is that USHCN tries to do something that no-one else does - to give an average (for the US) in absolute °F (absolute=not an anomaly). That can be done, but needs care (unlike here). USHCN does it by ensuring that every month has an entry for each of its 1218 stations, which ideally never change. But in fact some do become defunct. It's up to 20-30%. So for them USHCN just estimates a value from neighbouring stations, and proceeds.

So that is the latest villainy. I think USHCN should use anomalies, and I suspect they in effect do, and just convert back. But there is nothing wrong with the infilling method. I've been arguing in many forums that the US average, for a month say, is a spatial integral, and they are doing numerical integration. Numerical integration formulae are usually based on integrating an interpolation formula. If you first interpolate extra points using that formula, it makes no difference. Any other good formula will also do.
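To illustrate the integration point with a toy case, here is a minimal Python sketch. It assumes a one-dimensional "region" with four hypothetical stations (the positions and temperatures are made up for illustration, and are not USHCN data) and uses the trapezoidal rule, which integrates a piecewise-linear interpolant. Infilling an extra point from that same interpolant leaves the integral unchanged.

```python
def trapezoid(xs, ys):
    """Integrate piecewise-linear data (xs ascending) with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# Hypothetical station positions and temperatures (illustrative only).
xs = [0.0, 1.0, 3.0, 4.0]
ys = [10.0, 14.0, 12.0, 16.0]

# "Infill" a station at x = 2 by linear interpolation between its neighbours.
x_new = 2.0
y_new = ys[1] + (ys[2] - ys[1]) * (x_new - xs[1]) / (xs[2] - xs[1])

xs_filled = xs[:2] + [x_new] + xs[2:]
ys_filled = ys[:2] + [y_new] + ys[2:]

integral_original = trapezoid(xs, ys)
integral_filled = trapezoid(xs_filled, ys_filled)
print(integral_original, integral_filled)
```

The two printed integrals agree, because the infilled point lies on the line the trapezoidal rule was already integrating. A real network is two-dimensional and the interpolation formula is different, so this is only an analogy.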

I don't have many wins. So I thought I would give a simple and fairly familiar example which would show the roles of averaging, climatology, infilling and anomalies. It's the task of calculating an average for one year for one station. Since it's been in the news, and seems to be generally a good station, I chose Luling, Texas.

Update. From comments, I see that I should emphasise that I'm not, in this example, trying to calculate the temperature of the US, or any kind of trend. The issue is very simple. Given a temperature record for this one place, and the 2005 monthly averages (with one missing), what can be said about the annual average for 2005 for that place?

Update: I have a new post with a graphics version here

To simplify, I'll round numbers, assume months of equal length. All data is raw, and in °C. Climatology for each month will be simply the average of all instances, and the anomaly is just the difference from that.

So here is the basic data for 2005, where all months are available:

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	12.8	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	20.3
Climatology	10.3	12.1	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	20.2
2009 Anomaly	2.4	0.7	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0.1
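For readers who want to check the arithmetic, here is a minimal Python sketch that recomputes the annual column from the rounded monthly values in the table. The variable names are my own. Because the inputs are already rounded, individual anomalies recomputed this way can differ from the table by 0.1.

```python
from statistics import mean

# Rounded monthly values from the table above, Jan..Dec, in °C.
temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]

# Anomaly = observed monthly mean minus the climatology for that month.
anoms = [t - c for t, c in zip(temps, clim)]

# Months are assumed to be of equal length, so a plain mean is used.
annual_temp = mean(temps)
annual_clim = mean(clim)
annual_anom = mean(anoms)

print(round(annual_temp, 2), round(annual_clim, 2), round(annual_anom, 2))
```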

Now suppose we decide too many days are missing in February, and it has to be dropped. And suppose, as WUWT seems to want, we do just that:

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	NA	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	21
Climatology	10.3	NA	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	21
2009 Anomaly	2.4	NA	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0

So the annual average has risen from 20.3°C to 21°C. That's a lot to follow from removing a month that wasn't unusually warm (for Feb). But if you look at the next line, most of that is accounted for by the change in climatology. Its average has risen by the same amount.

Let's note again that the anomaly average has changed only a small amount, from 0.1 to 0. That reflects that the omitted month was warmer than normal, but is a proportionate response. That's the benefit of anomalies. There is no climatology to make a spurious signal.
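The effect of simply dropping February can be reproduced with a short Python sketch. It uses the rounded table values, so the results match the table only to about 0.1; the names are illustrative.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]
anoms = [t - c for t, c in zip(temps, clim)]

FEB = 1  # zero-based index of February


def drop(values, index):
    """Return a copy of values with one month removed."""
    return values[:index] + values[index + 1:]


# Change in each annual mean when February is dropped from the average.
temp_shift = mean(drop(temps, FEB)) - mean(temps)
clim_shift = mean(drop(clim, FEB)) - mean(clim)
anom_shift = mean(drop(anoms, FEB)) - mean(anoms)

print(round(temp_shift, 2), round(clim_shift, 2), round(anom_shift, 2))
```

The temperature and climatology means both move by roughly 0.7°C, while the anomaly mean moves by less than 0.1°C.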

But we didn't want to change the annual climatology. That isn't supposed to change, at least not radically, from year to year.

Another way of seeing why just dropping is bad, which I find useful, is that you can always replace the NA with the Ann average figure. That can't change the average. So this is exactly the same:

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	21	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	21
Climatology	10.3	21	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	21
2009 Anomaly	2.4	0	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0

Infilling Feb with 21°C is obviously bad. And it shows up in the climatology. But that is what just dropping does.
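The equivalence between dropping a month and infilling it with the mean of the rest is easy to confirm. A minimal Python sketch, using the rounded table values:

```python
from statistics import mean

# The eleven months that remain after February is dropped (Jan, Mar..Dec), in °C.
remaining = [12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]

mean_dropped = mean(remaining)

# Put the mean of the remaining months back into the February slot.
filled = remaining[:1] + [mean_dropped] + remaining[1:]
mean_filled = mean(filled)

print(round(mean_dropped, 2), round(mean_filled, 2))
```

Both means are the same, so "dropping" February is arithmetically the same as asserting that February was as warm as the average of the other eleven months.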

Now suppose we infill, rather crudely, replacing Feb with the average of Jan and Mar. You know, fabricating data. We get:

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	14	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	20.4
Climatology	10.3	13.3	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	20.3
2009 Anomaly	2.4	0.7	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0.1

It's not a particularly good infill. But it is effective. The annual average has risen from 20.3 to 20.4. Much better than 21. And the climatology has changed, by the same small amount.
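Here is the same crude infill as a Python sketch, again from the rounded table values. The true February value is kept in the list only so the infilled result can be compared with the full-data average.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]

JAN, FEB, MAR = 0, 1, 2

# Crude infill: February is estimated as the average of January and March.
feb_fill = (temps[JAN] + temps[MAR]) / 2
infilled = temps[:FEB] + [feb_fill] + temps[FEB + 1:]

annual_true = mean(temps)
annual_infilled = mean(infilled)
annual_dropped = mean(temps[:FEB] + temps[FEB + 1:])

# Error of each approach relative to the full-data annual mean.
print(round(annual_infilled - annual_true, 2), round(annual_dropped - annual_true, 2))
```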

This is basically how USHCN could handle the loss of a month without losing anomalies. In fact, they would take steps to adjust for the known climatology error, to get a better infill. But there is an even simpler way: just infill the anomalies, and add them to the climatology:

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	12.8	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	20.3
Climatology	10.3	12.1	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	20.2
2009 Anomaly	2.4	0.7	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0.1

That actually worked artificially well, because the Feb anomaly infill happened to be almost exact. But if you really really don't like infilling, setting the Feb anomaly to zero would do nearly as well. Or to the anomaly average without Feb.

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Ann
2009	12.8	12.1	15.2	18.7	23.2	27.6	28.8	28.9	28.2	20.2	17.1	10.3	20.2
Climatology	10.3	12.1	16.3	20.3	24.2	27.7	29	29.1	26.2	21.2	15.4	11.2	20.2
2009 Anomaly	2.4	0.0	-1.1	-1.6	-0.9	-0.1	-0.2	-0.2	2.1	-1	1.6	-0.9	0.0
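The three anomaly-based choices mentioned above (neighbour average, zero, and the average of the remaining anomalies) can be compared in a short Python sketch. It works from the rounded table values, and the helper name is my own.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]

FEB = 1
known = [i for i in range(12) if i != FEB]  # months with usable data
anoms = {i: temps[i] - clim[i] for i in known}


def annual_with_feb_anomaly(feb_anom):
    """Annual mean when Feb is estimated as its climatology plus feb_anom."""
    feb_temp = clim[FEB] + feb_anom
    return mean([temps[i] for i in known] + [feb_temp])


estimates = {
    "neighbours": annual_with_feb_anomaly((anoms[0] + anoms[2]) / 2),
    "zero": annual_with_feb_anomaly(0.0),
    "mean of rest": annual_with_feb_anomaly(mean(list(anoms.values()))),
}

true_annual = mean(temps)  # uses the actual February value, for comparison
for name, value in estimates.items():
    print(name, round(value, 2), round(value - true_annual, 2))
```

All three estimates land within 0.1°C of the full-data annual mean, because each one keeps the February climatology and only guesses the anomaly.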

Well, this is an analogy for how USHCN can average stations over a month. But one last thing - we don't even have to average the climatologies each time. We can just average the anomalies and add to the annual climatology.
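This shortcut works because averaging is linear: the mean of (climatology + anomaly) equals the mean climatology plus the mean anomaly. A minimal Python sketch with the rounded table values, where the annual climatology is computed once and only the anomalies are averaged:

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]
anoms = [t - c for t, c in zip(temps, clim)]

# Fixed annual climatology: computed once from all twelve months.
annual_clim = mean(clim)

# With complete data, climatology plus mean anomaly reproduces the annual mean.
annual_full = annual_clim + mean(anoms)

# With February missing, average only the eleven known anomalies.
FEB = 1
anoms_known = anoms[:FEB] + anoms[FEB + 1:]
annual_missing_feb = annual_clim + mean(anoms_known)

print(round(annual_full, 2), round(annual_missing_feb, 2))
```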

So the moral is

Infilling didn't hurt.
What did hurt was omitting the climatology part of Feb. That is because Feb is known to be cold. Just omitting Feb is bad. Infilling absolute temp gave a reasonable treatment of the climatology.
But dealing with anomalies alone is even better.
And not "fabricating data" is by far the worst.

Let me put it all yet another way. We're saying we don't know about Feb, because maybe half the days have no reading. But what don't we know?

"Throw it out" means we don't know anything about Feb. It could have been 10°C, 20 or 30. But we do know more. It was winter. In fact, we know the average temp. And our estimate should at least reflect that knowledge. What we don't know is the anomaly. That's what we can throw out.

By "throw it out" again we're really replacing with an estimate, even if we don't say so. And the default is the average of remaining anomalies. But we could also just say zero, or the average of neighbours (infill). It won't matter much, as long as we don't throw out what we do know, which is the climatology.

Weighting

OK, I might as well work in another hobbyhorse. I've said in comments that infilling is harmless when averaging. It actually just reweights. Suppose you have the Feb data. The average is just a weighted sum, each month weighted 1/12. What if you don't know Feb, so you replace it with an average of Jan and March? The annual average is still a weighted sum of data. The weights are:
1/8, 0, 1/8, 1/12, 1/12, ...
It's still an estimate based on the same data. Jan and Mar have been upweighted to cover the gap left by Feb.
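The reweighting can be checked directly. In this Python sketch, February's 1/12 weight is split equally between January and March, and the resulting weighted sum is compared with the average obtained by explicitly infilling February. Values are the rounded ones from the tables.

```python
from fractions import Fraction
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
JAN, FEB, MAR = 0, 1, 2

# Equal weights of 1/12 when all months are present.
weights = [Fraction(1, 12)] * 12

# Infilling Feb with (Jan + Mar) / 2 hands half of Feb's weight to each neighbour.
weights[JAN] += Fraction(1, 24)
weights[MAR] += Fraction(1, 24)
weights[FEB] = Fraction(0)

# Weighted sum over the original data (the Feb value gets weight zero).
weighted_average = sum(float(w) * t for w, t in zip(weights, temps))

# Same thing done by explicit infilling followed by a plain mean.
infilled = temps[:FEB] + [(temps[JAN] + temps[MAR]) / 2] + temps[FEB + 1:]
infilled_average = mean(infilled)

print([str(w) for w in weights[:4]], round(weighted_average, 3), round(infilled_average, 3))
```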

29 comments:

shub, June 30, 2014 at 9:14 PM
Your example is your example.

The dropping of a single February should knock out the anomaly value for the whole month, is it?

What would be the effect of a dropped February in a network of stations that do not have a dropped February?

The rest of the post is circular reasoning at its worst.

shub, June 30, 2014 at 10:26 PM
Like I said, the example is the limitation. One station cannot have a 'climate' signal can it? Climate and climate measurement are for points in a field.

The circular logic is here: "Just infill the anomalies, and add to the climatology". This is clearly a wrong step if the objective is to infer climatology back again from the infilled values.


And Then There's Physics, June 30, 2014 at 10:30 PM
If I understand what you're suggesting, it's essentially that trying to produce an annual average temperature (rather than an anomaly) is problematic because there is a large variation across the year and hence ignoring data for a particular month (for example) could have a large influence on the average value. Hence some kind of infilling is required. Given that - I think - this is not as big an issue for anomalies as it is for the actual temperature, do you think NOAA has created a bit of a rod for their own back in trying to do this? I can see why it's useful, because many may not quite understand what an anomaly is, but trying to produce robust estimates of annual averages without some kind of infilling is probably impossible. Hence they open themselves up to these kinds of criticisms from those who don't really understand (or don't want to understand) the complications in trying to produce such results.

Everett F Sargent, June 30, 2014 at 11:53 PM
Nick,

Slightly OT but what the heck.

FYI, two papers that provide independent verification of global warming (both published in 2013);

Global warming in an independent record of the past 130 years

http://thisse.1x.biz/docs/Anderson_2013_GlobalWarmingInAnIndependent%20RecordOfThePast130Years_GRL.pdf

Independent confirmation of global land warming without the use of station temperatures

http://www.leif.org/EOS/grl50425-global-temps.pdf


Anonymous, July 1, 2014 at 2:43 AM
One problem to begin with. There are only 50 USHCN stations with 360 non-Estimated values from 1961-1990. So your baseline is compromised. As is your anomaly.

Second. If you go ahead with the infilling, USHCN changes it pretty much every day. I think about 20% of the Final data changed from 1 day to the next when I looked.

Third. Infilling tends to reinforce trends. Just infilling Feb = (Jan + Mar) / 2 would probably not reinforce trends, because it is not based on anything but the station's own Jan and Mar values. It wouldn't be based on all the other nearby stations and their trends.


Everett F Sargent, July 1, 2014 at 3:46 AM
Well it's not like Steve Goddard has been outright lying about the USHCN network all that long ...

http://stevengoddard.wordpress.com/2010/09/27/us-thermometer-readings-show-cooling-since-1895-before-adjustments/

Oops, never mind.