WUWT has had a varied role. Initially Anthony Watts wrote with Zeke some good posts on some flaws in SG's methods. Zeke has continued at Lucia's. I chipped in too.
Then it took an odd turn. Anthony got invested in disputing SG's claims, only moderately exaggerated, about the number of USHCN stations that actually reported each month. When he found that there were quite a lot, "zombie" stations became the enemy. And with that, infilling.
The background is that USHCN tries to do something that no-one else does - to give an average (for the US) in absolute °F (absolute=not an anomaly). That can be done, but needs care (unlike here). USHCN does it by ensuring that every month has an entry for each of its 1218 stations, which ideally never change. But in fact some do become defunct. It's up to 20-30%. So for them USHCN just estimates a value from neighbouring stations, and proceeds.
So that is the latest villainy. I think USHCN should use anomalies, and I suspect they in effect do, and just convert back. But there is nothing wrong with the infilling method. I've been arguing in many forums that the US average, for a month say, is a spatial integral, and they are doing numerical integration. Numerical integration formulae are usually based on integrating an interpolation formula. If you first interpolate extra points using that formula, it makes no difference. Any other good formula will also do.
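The point about integration can be checked on a toy case. This is only a sketch: it uses the trapezoidal rule in one dimension (which integrates a piecewise-linear interpolant), and the positions and values are made up for illustration, not taken from USHCN. Infilling an extra point from the same linear interpolation leaves the integral unchanged.

```python
def trapezoid(xs, ys):
    """Composite trapezoidal rule on (possibly unevenly spaced) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def linear_infill(xs, ys, x_new):
    """Return copies of xs, ys with a point added at x_new, whose value
    is linearly interpolated from its two neighbours."""
    for i in range(len(xs) - 1):
        if xs[i] < x_new < xs[i + 1]:
            t = (x_new - xs[i]) / (xs[i + 1] - xs[i])
            y_new = ys[i] + t * (ys[i + 1] - ys[i])
            return (xs[:i + 1] + [x_new] + xs[i + 1:],
                    ys[:i + 1] + [y_new] + ys[i + 1:])
    raise ValueError("x_new must lie strictly between two existing points")


# Hypothetical positions and values, purely illustrative.
xs = [0.0, 1.0, 3.0, 4.0]
ys = [10.0, 12.0, 18.0, 15.0]

before = trapezoid(xs, ys)
xs2, ys2 = linear_infill(xs, ys, 2.0)   # "infill" a point in the gap
after = trapezoid(xs2, ys2)

print(before, after)
```

The two integrals agree because the infilled value adds no information beyond what the interpolation formula already implied.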
I don't have many wins. So I thought I would give a simple and fairly familiar example which would show the roles of averaging, climatology, infilling and anomalies. It's the task of calculating an average for one year for one station. Since it's been in the news, and seems to be generally a good station, I chose Luling, Texas.
Update. From comments, I see that I should emphasise that I'm not, in this example, trying to calculate the temperature of the US, or any kind of trend. The issue is very simple. Given a temp record of this one place, and the 2005 monthly averages (with one month missing), what can be said about the annual average for 2005 for that place?
Update: I have a new post with a graphics version here
To simplify, I'll round numbers, assume months of equal length. All data is raw, and in °C. Climatology for each month will be simply the average of all instances, and the anomaly is just the difference from that.
So here is the basic data for 2005, where all months are available:
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | 12.8 | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 20.3 |
Climatology | 10.3 | 12.1 | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 20.2 |
2009 Anomaly | 2.4 | 0.7 | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0.1 |
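For anyone who wants to reproduce the arithmetic, here is a minimal sketch using the rounded monthly values from the table above. Because the inputs are already rounded to one decimal, some anomalies can differ from the table in the last digit. The variable names are my own choices for illustration.

```python
from statistics import mean

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Rounded monthly means (°C) and climatology (°C) copied from the table.
temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]

# Anomaly = observed monthly mean minus that month's climatology.
anoms = [t - c for t, c in zip(temps, clim)]

# Annual figures, assuming months of equal length.
annual_temp = mean(temps)
annual_clim = mean(clim)
annual_anom = mean(anoms)

for m, t, c, a in zip(MONTHS, temps, clim, anoms):
    print(f"{m}: temp={t:5.1f}  clim={c:5.1f}  anom={a:+5.1f}")
print(f"Ann: temp={annual_temp:.2f}  clim={annual_clim:.2f}  anom={annual_anom:+.2f}")
```

With all twelve months present, the annual anomaly is just the annual temperature minus the annual climatology.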
Now suppose we decide too many days are missing in February, and it has to be dropped. And suppose, as WUWT seems to want, we do just that:
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | NA | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 21 |
Climatology | 10.3 | NA | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 21 |
2009 Anomaly | 2.4 | NA | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0 |
So the annual average has risen from 20.3°C to 21°C. That's a lot to follow from removing a month that wasn't unusually warm (for Feb). But if you look at the next line, most of that is accounted for by the change in climatology. Its average has risen by the same amount.
Let's note again that the anomaly average has changed only a small amount, from 0.1 to 0. That reflects that the omitted month was warmer than normal, but is a proportionate response. That's the benefit of anomalies. There is no climatology to make a spurious signal.
But we didn't want to change the annual climatology. That isn't supposed to change, at least not radically, from year to year.
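The size of each shift is easy to check. The sketch below uses the rounded table values and a small helper, `mean_without`, which is my own name. It compares the annual means with and without Feb for the temperatures, the climatology and the anomalies.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]
anoms = [t - c for t, c in zip(temps, clim)]

FEB = 1  # zero-based index of February


def mean_without(values, skip):
    """Mean of the values with one month simply dropped."""
    return mean(v for i, v in enumerate(values) if i != skip)


# How far each annual figure moves when Feb is dropped.
shift_temp = mean_without(temps, FEB) - mean(temps)
shift_clim = mean_without(clim, FEB) - mean(clim)
shift_anom = mean_without(anoms, FEB) - mean(anoms)

print(f"temperature shift: {shift_temp:+.2f}")
print(f"climatology shift: {shift_clim:+.2f}")
print(f"anomaly shift:     {shift_anom:+.2f}")
```

The temperature and climatology averages both move by roughly 0.7°C, while the anomaly average moves by well under 0.1°C.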
Another way of seeing why just dropping is bad, which I find useful, is that you can always replace the NA with the Ann average figure. That can't change the average. So this is exactly the same:
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | 21 | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 21 |
Climatology | 10.3 | 21 | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 21 |
2009 Anomaly | 2.4 | 0 | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0 |
Infilling Feb with 21°C is obviously bad. And it shows up in the climatology. But that is what just dropping does.
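That equivalence holds for any data, and a short sketch shows it with the rounded table values (`None` stands in for the NA). The mean of the eleven remaining months equals the mean of twelve months once the gap is filled with that eleven-month mean.

```python
from statistics import mean

# Monthly means (°C) with Feb missing.
temps_na = [12.8, None, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]

# "Just dropping": average whatever is left.
present = [t for t in temps_na if t is not None]
dropped_mean = mean(present)

# Explicitly infill the gap with that same average, then average all 12.
filled = [dropped_mean if t is None else t for t in temps_na]
filled_mean = mean(filled)

print(f"dropped: {dropped_mean:.2f}  filled with mean: {filled_mean:.2f}")
```

So dropping a month is itself an infill, just one that ignores what is known about Feb.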
Now suppose we infill, rather crudely, replacing Feb with the average of Jan and Mar. You know, fabricating data. We get:
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | 14 | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 20.4 |
Climatology | 10.3 | 13.3 | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 20.3 |
2009 Anomaly | 2.4 | 0.7 | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0.1 |
It's not a particularly good infill. But it is effective. The annual average has risen from 20.3 to 20.4. Much better than 21. And the climatology has changed, by the same small amount.
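Here is a sketch of that crude infill, using the rounded table values. Feb is treated as unknown and replaced with the average of Jan and Mar. The result is then compared with the full-data average and with the just-drop-it average.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
JAN, FEB, MAR = 0, 1, 2

true_annual = mean(temps)  # what we would get if Feb were known

# Crude infill: pretend Feb is unknown, use the average of its neighbours.
infilled = list(temps)
infilled[FEB] = (temps[JAN] + temps[MAR]) / 2
infilled_annual = mean(infilled)

# Just dropping Feb, for comparison.
dropped_annual = mean(t for i, t in enumerate(temps) if i != FEB)

print(f"full data: {true_annual:.2f}")
print(f"infilled:  {infilled_annual:.2f}  (error {infilled_annual - true_annual:+.2f})")
print(f"dropped:   {dropped_annual:.2f}  (error {dropped_annual - true_annual:+.2f})")
```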
This is basically how USHCN could handle the loss of a month without losing anomalies. In fact, they would take steps to adjust for the known climatology error, to get a better infill. But an even simpler way! Just infill the anomalies, and add to the climatology:
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | 12.8 | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 20.3 |
Climatology | 10.3 | 12.1 | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 20.2 |
2009 Anomaly | 2.4 | 0.7 | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0.1 |
That actually worked artificially well, because the Feb anomaly infill happened to be almost exact. But if you really really don't like infilling, setting the Feb anomaly to zero would do nearly as well. Or to the anomaly average without Feb.
°C | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ann |
2009 | 12.8 | 12.1 | 15.2 | 18.7 | 23.2 | 27.6 | 28.8 | 28.9 | 28.2 | 20.2 | 17.1 | 10.3 | 20.2 |
Climatology | 10.3 | 12.1 | 16.3 | 20.3 | 24.2 | 27.7 | 29 | 29.1 | 26.2 | 21.2 | 15.4 | 11.2 | 20.2 |
2009 Anomaly | 2.4 | 0.0 | -1.1 | -1.6 | -0.9 | -0.1 | -0.2 | -0.2 | 2.1 | -1 | 1.6 | -0.9 | 0.0 |
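The three choices for the Feb anomaly can be compared side by side. This is a sketch on the rounded table values, with names of my own. In each case the Feb climatology is kept and only the anomaly is estimated.

```python
from statistics import mean

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]
JAN, FEB, MAR = 0, 1, 2

true_annual = mean(temps)  # reference: Feb actually known
anoms = [t - c for t, c in zip(temps, clim)]
other_anoms = [a for i, a in enumerate(anoms) if i != FEB]

# Three ways to guess the unknown Feb anomaly.
feb_anomaly_guesses = {
    "neighbours": (anoms[JAN] + anoms[MAR]) / 2,
    "zero": 0.0,
    "mean of others": mean(other_anoms),
}

estimates = {}
for label, guess in feb_anomaly_guesses.items():
    filled = list(temps)
    filled[FEB] = clim[FEB] + guess  # known climatology + estimated anomaly
    estimates[label] = mean(filled)
    print(f"{label:15s} Feb={filled[FEB]:5.2f}  Ann={estimates[label]:.2f}  "
          f"error={estimates[label] - true_annual:+.2f}")
```

All three land within about 0.1°C of the full-data average, because the climatology for Feb is retained in every case.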
Well, this is an analogy of how USHCN can average stations over a month. But one last thing - we don't even have to average the climatologies each time. We can just average the anomalies and add to the annual climatology.
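As a sketch of that shortcut with the rounded table values: the annual climatology is computed once from all twelve months, and the only thing averaged over the available months is the anomaly.

```python
from statistics import mean

# Feb is missing (None); climatology is known for every month.
temps = [12.8, None, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
clim = [10.3, 12.1, 16.3, 20.3, 24.2, 27.7, 29.0, 29.1, 26.2, 21.2, 15.4, 11.2]

annual_clim = mean(clim)  # fixed, does not depend on which months reported

# Anomalies only for the months that have data.
known_anoms = [t - c for t, c in zip(temps, clim) if t is not None]

estimate = annual_clim + mean(known_anoms)
print(f"annual climatology: {annual_clim:.2f}")
print(f"mean anomaly:       {mean(known_anoms):+.2f}")
print(f"annual estimate:    {estimate:.2f}")
```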
So the moral is
- Infilling didn't hurt.
- What did hurt was omitting the climatology part of Feb. That is because Feb is known to be cold. Just omitting Feb is bad. Infilling absolute temp gave a reasonable treatment of the climatology.
- But dealing with anomalies alone is even better.
- And not "fabricating data" is by far the worst.
Let me put it all yet another way. We're saying we don't know about Feb, because maybe half the days have no reading. But what don't we know?
"Throw it out" means we don't know anything about Feb. It could have been 10°C, 20 or 30. But we do know more. It was winter. In fact, we know the average temp. And our estimate should at least reflect that knowledge. What we don't know is the anomaly. That's what we can throw out.
By "throw it out" again we're really replacing with an estimate, even if we don't say so. And the default is the average of remaining anomalies. But we could also just say zero, or the average of neighbours (infill). It won't matter much, as long as we don't throw out what we do know, which is the climatology.
Weighting
OK, I might as well work in another hobbyhorse. I've said in comments that infilling is harmless when averaging. It actually just reweights. Suppose you have the Feb data. The average is just a weighted sum, each month weighted 1/12. What if you don't know Feb, so you replace it with an average of Jan and March? The annual average is still a weighted sum of data. The weights are: 1/8, 0, 1/8, 1/12, 1/12, ...
It's still an estimate based on the same data. Jan and Mar have been upweighted to cover the gap left by Feb.
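The reweighting can be verified directly. In this sketch, which uses the rounded table values and exact fractions for the weights, Feb's 1/12 is split equally between Jan and Mar. The weighted sum of the remaining data matches the plain average of the infilled series.

```python
from fractions import Fraction

temps = [12.8, 12.8, 15.2, 18.7, 23.2, 27.6, 28.8, 28.9, 28.2, 20.2, 17.1, 10.3]
JAN, FEB, MAR = 0, 1, 2

# Start with equal weights of 1/12, then hand Feb's weight to its neighbours.
weights = [Fraction(1, 12)] * 12
half = weights[FEB] / 2
weights[JAN] += half
weights[MAR] += half
weights[FEB] = Fraction(0)

# Weighted sum of the data; Feb's value is never used (weight 0).
weighted = sum(float(w) * t for w, t in zip(weights, temps))

# Plain average after infilling Feb with the mean of Jan and Mar.
infilled = list(temps)
infilled[FEB] = (temps[JAN] + temps[MAR]) / 2
direct = sum(infilled) / 12

print([str(w) for w in weights[:5]])
print(f"weighted: {weighted:.4f}  infilled average: {direct:.4f}")
```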