Friday, February 6, 2015

Why homogenise data?

I've been writing a lot lately (eg here and here) about homogenisation. The main message is that adjustment has relatively small effects and is not, as some say, the arithmetic basis of AGW. But that invites the response: if it has so little effect, why do it? So I thought I'd make some general remarks on homogenisation and adjustment.

Firstly, some common mischaracterisations. It isn't "altering the record". The true record sits with the national met offices, and organisations like NOAA keep unadjusted records online, which are as accessible as the adjusted.

Secondly, records aren't adjusted in the belief that thermometers were read incorrectly. They are usually adjusted specifically for the calculation of a regional average. Station records are used as representative of a region. When something happens to a record that is not climate-based, it stops being representative of the region.

Let's look at a specific case. In 1928, the station for Wellington, NZ, was moved from Thorndon, at sea level, to Kelburn, at about 100m altitude. The temperature dropped by nearly 1°C, and the NOAA algorithm picked this up and reduced the pre-1928 readings.

Now the Thorndon readings aren't "wrong". What is wrong is the 1°C drop. There is no reason to believe that the region experienced that. So the adjustment is to make the record consistent ("homogeneous"). The convention is to leave the present unchanged and adjust the past. It doesn't matter which.
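The convention can be sketched numerically. Here is my own toy illustration in Python (not NOAA's actual pairwise algorithm): given a known break point, shift the earlier segment so its level matches the later one, leaving the present unchanged.

```python
import numpy as np

def adjust_break(series, break_idx, window=10):
    """Shift readings before break_idx so their level matches the
    post-break segment; the present is left unchanged."""
    before = series[max(0, break_idx - window):break_idx].mean()
    after = series[break_idx:break_idx + window].mean()
    adjusted = series.copy()
    adjusted[:break_idx] += after - before   # lower the warm pre-move segment
    return adjusted

# Synthetic Wellington-like record: ~1 degree drop at the station move
rng = np.random.default_rng(0)
temps = np.full(40, 13.0)
temps[20:] -= 1.0                       # moved uphill: cooler site
temps += rng.normal(0, 0.05, 40)        # small measurement noise
homog = adjust_break(temps, 20)
```

After adjustment the artificial 1°C step is gone, while the year-to-year variations are untouched — exactly what "homogeneous" means here.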

Averaging

A lot of pixels are wasted on some alleged mis-adjustment of individual stations. That misses the big picture. When you calculate a global annual average, you take over a million readings to get a single number. The averaging drastically reduces the noise. White noise would reduce by a factor of 1000. There are various dependencies, but still, the reduction is huge. Noise isn't much of a problem. What is a problem is bias. That doesn't reduce.
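The factor-of-1000 claim is just the 1/√N behaviour of independent noise, easy to check numerically (a toy Python check with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 15.0
n = 1_000_000

# A million readings, each with 1 degree of white measurement noise
readings = true_value + rng.normal(0.0, 1.0, n)

# Theory: the standard error of the mean is sd / sqrt(n)
sd_of_mean = 1.0 / np.sqrt(n)            # 0.001: a factor-of-1000 reduction
error_of_mean = abs(readings.mean() - true_value)
```

error_of_mean comes out around a thousandth of a degree, while any single reading is off by about a degree. Note that a constant bias added to every reading would survive this averaging completely.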

But there is a counter to bias too - the use of anomalies. Consistent bias goes out with the mean. So the remaining problem is inconsistent bias. Inhomogeneities.
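A toy demonstration of why consistent bias goes out with the mean: give each station its own fixed offset, take anomalies against each station's own baseline, and the offsets vanish from the regional average. The stations, years and offsets below are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
years, n_stations = 50, 20
climate = np.linspace(0.0, 0.5, years)        # shared regional signal
offsets = rng.normal(10.0, 5.0, n_stations)   # fixed per-station bias

data = climate[None, :] + offsets[:, None]    # shape: stations x years

# Anomaly: subtract each station's own 30-year baseline mean
anomalies = data - data[:, :30].mean(axis=1, keepdims=True)
regional = anomalies.mean(axis=0)             # offsets have cancelled exactly
```

regional reproduces the shared signal (relative to its baseline) no matter how large the per-station offsets are. Only a bias that changes within a record - an inhomogeneity - would survive.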

We saw this with TOBS in the US. There is a bias depending on which time of day you read. And there is no "correct" time. The bias doesn't matter unless the time of observation changes. And still, for trend, that wouldn't matter unless there was a bias in the direction of changes. Otherwise they would cancel. But there is a bias. In the US, volunteers used to be enjoined to observe in late afternoon. That weakened, and morning obs became popular. That created a cooling bias.
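The effect of a change in observation time can be mimicked with invented bias numbers: a fixed warm bias for afternoon reading, a fixed cool bias for morning reading, and a switch between them partway through the record. The magnitudes below are illustrative, not the actual TOBS values:

```python
import numpy as np

years = np.arange(1900, 2000)
true_temp = np.full(years.size, 15.0)        # no real trend at all

# Assumed fixed observation-time biases: afternoon reads warm,
# morning reads slightly cool, relative to the true daily mean
bias = np.where(years < 1960, +0.4, -0.2)    # PM obs until 1960, AM after
recorded = true_temp + bias

trend_per_century = np.polyfit(years, recorded, 1)[0] * 100
```

trend_per_century comes out strongly negative (about -0.9 per century here) although the true series is flat: the bias itself is harmless; only its change matters.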

Correction introduces noise

So NOAA and others try to recognise inhomogeneities and correct them. Sometimes that goes wrong. A real change is wrongly corrected. Or a real but temporary change is corrected by the wrong amount, or the correction isn't removed when the cause goes away.

But this comes back to the tradeoff between noise and bias. Correcting bias is important. If you create noise in the process, that may be acceptable. And a fixed algorithm can be tested with synthetic data to see if it introduces bias.
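Here is a sketch of that kind of synthetic-data test, using a deliberately crude correction of my own invention (not the NOAA pairwise algorithm): inject zero-mean artificial breaks, correct them, and check that the trend errors average out to zero even though individual corrections are noisy.

```python
import numpy as np

rng = np.random.default_rng(4)

def fix_break(y, k):
    """Toy homogenisation: treat the jump at k as an outlier first
    difference and replace it with a typical (median) step."""
    d = np.diff(y)
    d[k - 1] = np.median(d)
    return np.concatenate([[y[0]], y[0] + np.cumsum(d)])

true_slope = 0.01
errors = []
for _ in range(500):
    y = true_slope * np.arange(100) + rng.normal(0, 0.1, 100)
    k = int(rng.integers(20, 80))        # break at a random position
    y[k:] += rng.normal(0, 1.0)          # zero-mean synthetic inhomogeneity
    slope = np.polyfit(np.arange(100), fix_break(y, k), 1)[0]
    errors.append(slope - true_slope)

errors = np.array(errors)
```

Individual trend errors are nonzero, but their mean across the 500 synthetic series is close to zero: the fixed algorithm adds noise, not bias, which is the property that matters for the average.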

Most of the fusses about adjustment are totally spurious. "Look, they are altering the record! Data must not be touched!" But sometimes a definite error shows up. There may have been one at Reykjavik. Part of a temperature dip was wrongly considered an inhomogeneity. So should it be fixed?

No! As noted above, a good algorithm has been tested for lack of bias. If you start intervening, it loses that property. Noise won't hurt the average, but taking out the bits that displease naysayers certainly will.

So does it matter?

In TempLS, I use unadjusted data. I satisfied myself a while ago that there was little difference in outcome. But I could only do that because someone had identified and corrected the inhomogeneities, to give me something to check against. And it may just be good luck that this time the inhomogeneities cancelled.

Naysayers bang the drum about adjustments. But if none were done, then I bet we'd be hearing stories about how some station was moved, and ignored, so now the record is unreliable.

Appendix

Beset by people who couldn't shed the idea that data should only be adjusted if fault is proved, I gave this analogy. Here is a table of BHP share prices. Note the final column - price adjusted for dividends and splits. It's not that the old prices were defective; it's just that dividends and splits do not alter the productivity of the company, or what you get in total from your shareholding. If you want to make historic sense of the raw prices, you have to keep making allowances. But the adjusted prices give a continuous picture of what a holding is worth.
If you were compiling a stock index (eg Dow, cf GISS), that is what you would use.

1. Hello Nick
You say: "Station records are used as representative of a region"

What if, at some point of the calculation of an average temperature, regions are used to infer station temperatures? How would that affect your thinking? Do you think such a step should not occur in the calculation?

You then say: "Let's look at a specific case. In 1928, the station for Wellington, NZ, was moved from Thorndon, at sea level, to Kelburn, at about 100m altitude. The temperature dropped by nearly 1°C, and the NOAA algorithm picked this up and reduced the pre-1928 readings."

That's good. But what if algorithms pick up 1C or ~3C drops in temperature and make local station changes, but without the corresponding station history to back up algorithm-detected drops? What if, following this, the next step in the algorithm involves saying 'well, since there is a drop that looks like it could have come from a move, it's a move' and makes corresponding changes?

How would you feel about it, methodologically speaking?

1. Shub,
"What if, at some point of the calculation of an average temperature, regions are used to infer station temperatures? How would that affect your thinking? Do you think such a step should not occur in the calculation?"
On land, at least, stations are all we have. I assume you mean inferring station temperatures from their neighbors?

I think that's OK. As I keep emphasising, the purpose is to calculate a spatial average, which is an integral. It also ends up as a weighted sum. All that inferring does is to upweight the neighbors and downweight the alteree in the average. Which basically means that for some of the area that had been estimated by the alteree, it is estimated by neighbors instead. Moving the boundaries.

"How would you feel about it, methodologically speaking?"
Well, it's what they do, and what I am talking about. For most of the world, we don't have good station history. Changes have to be inferred (as Wellington was), and sometimes wrongly. That's my point about randomness vs bias. Wrong inferences that don't create bias are much less harmful to the average than bias.

2. Interestingly enough, for Berkeley at least, the difference between a metadata-only based breakpoint homogenization and a metadata + empirical breakpoint homogenization is pretty small globally. More on this soon. There are certain regions and periods of time where good station metadata is lacking, unfortunately.

2. There are two points. First, if stations are used to calculate averages and averages are used to calculate stations, it is clear there is non-independent computation.

1) Why use non-independent, i.e., circular analysis? 'The answer doesn't change' is not a good answer, and there is a possibility it is a wrong answer.

The problems do not stop there. For example, in Paraguay, almost all stations are adjusted, presumably using neighbouring non-Paraguayan stations. Which means, the data represented as Paraguayan temperature does not represent Paraguay.

Why create information out of non-information?

Secondly, changes made to a local station record, en route to the calculation of a global average, *have to be compatible with the actual local history of the place AND the global average*. Why is this not followed? Again, the 'answer doesn't change' logic doesn't work. If you go to the BEST website, they advertise their product for local stations. "Did you know? Berkeley Earth gives you historical temperature data for your home town, state, and country" - they say.

Except we know that your home town is likely an alteree that has been estimated by its neighbours.

Are alternative methods not possible? Of course they are. Take SA. Can 'true' temperatures be calculated for the larger geographic region of Eastern South America, alternatively, excluding Paraguayan temperatures? Yes. But the error margins would be wider. Is this correct?

1. "Why use non-independent, i.e., circular analysis?"
It's certainly not circular. It's actually a common situation in relaxation and other numerical methods for PDE. You just solve simultaneous equations. But "averages are used to calculate stations" suggests more than what happens, which is that a few local stations can be used to interpolate.

"Which means, the data represented as Paraguayan temperature does not represent Paraguay."
Maybe it hurts Paraguayan national pride, but borders mean nothing for a global average.

*have to be compatible with the actual local history of the place AND the global average*
I don't know where that rule came from or what it means.

"Yes. But the error margins would be wider. Is this correct?"
Yes. As said, in principle every point in the region is estimated based on station values. If one station is downweighted, it is estimated with more distant stations, which increases the error estimate somewhat.

3. Nick, do I understand you correctly that your code doesn't perform any sort of homogenization?

By the way, that's a very nice explanation of the tradeoff between measurement bias and measurement noise. I'd add that once the measurement noise is sufficiently large compared to the bias, there's no real advantage to correcting bias any further. I've encountered that tradeoff in localization algorithms.

Also, homogenization algorithms can reduce the spatial resolution of the reconstructed temperature field (we saw this with BEST for example).

I know you've been discussing the global mean signal, defined as an integral over the surface temperature field, but with recent interest in regional-scale climate change, having a higher spatial resolution might be more important than bias-free measurements of trend. When we see large deviations from the global mean value in e.g. the Arctic, it might be more important not to smear out this variation in order to achieve a minor increase in the accuracy of the long-term trend.

1. Carrick,
The code works from SST and land monthly station temperatures. I can feed in GHCN adjusted, or unadjusted. The latter are just the monthly averages of the min/max as recorded (with a little QC).

Yes, homogenisation can reduce resolution. It is meant to identify regional values for averaging. I would expect that most regional averaging of interest would be on a coarser scale than the smearing.

4. "It's certainly not circular."

If local stations are adjusted to match their neighbours - with the evidence supporting the need for adjustment being that local stations don't match their neighbours - the argument is circular. There is no question about circularity itself. My question is why such methods are preferred. After all, you could leave out such stations and yet derive a regional or global average.

"But "averages are used to calculate stations" suggests more than what happens..."

Take Puerto Casado, a now-familiar example. BEST turns the local trend from -0.89 to +1.36 C/century. BEST knows nothing about the station's field history. At best the altitude change is about 10 meters. Clearly, more rather than less has been done to the station's record.

"I don't know where that rule came from or what it means."

It's quite simple. When you make adjustments to a station to 'correct' for something that is not climate-based and isn't representative of the region, the product needs to be representative of the station too.

Otherwise you just use the existence of a station to synthesize a record that has no connection to the station. Taken another way, you have done more to a station than you declare.

"Maybe it hurts Paraguayan national pride, but borders mean nothing for a global average"

Paraguay is not a large country. But it is not a small one either. If all stations in Paraguay needed adjustment based on rural stations from neighbouring countries - which I presume you would agree are no better developed than the Paraguayan towns themselves - the adjustments extend over a large region and are not confined to a 'few local stations'.

What is the impact of synthesized temperature records over such large regions on the error in individual years? Do we know?

1. "When you makes adjustments to a station to 'correct' for something that is not climate-based and isn't representative of the region. the product needs to be representative of the station too. "

Why? The global average is the average of regions. Stations are just a means of getting it.

In fact, the adjusted files that appear on the web should probably be regarded as internal working data. They really are a collection of regional estimates, using the station names for convenience. But that is a handy way of recording and communicating the information. It's useful for other people (like me) wanting to make an index. Why people like Delingpole should care about it is a mystery.

"What is the impact of synthesized temperature records over such large regions on the error in individual years? Do we know?"

The plot on my previous post will tell you (toggle the "trendback"). But the coming breakdown post will tell a lot more.

2. "They really are a collection of regional estimates, using the station names for convenience." ... "Why people like delingpole should care about it is a mystery"

'Cause the story you tell people is different. People like Delingpole go around thinking global averages are derived from local records. It's pretty stupid of them.

Leaving jokes aside, why should adjusted products be 'representative of stations and regions'? That's just one way of putting it. Again with the example of Puerto Casado, the BEST method takes a -1.36 C/century trend and converts it into a +1.36 C/century trend. The regional trend is 1.37 ± 0.43. In other words, BEST's method picked up a station and put the region into it and left nothing of the station behind.

If a large region has just one good station, you use the one station to derive the regional average, but the error is high (assume the grid size is much smaller than the size of the region). If the same large region has several bad stations and one good station, by BEST's method the regional average will derive solely from the good station. The bad stations will be adjusted to make them look like the good one. The error will be low!

Using stations as excuses is bad methodology. Are they real entities? Then use their unadjusted, or minimally adjusted, records. Are they just grid points? Then don't stuff them with information from nearby stations while still counting them toward reducing the error.

I don't know if this is easily done - take out all stations whose linear trend needs >30-40% alteration from the original. Calculate the global average. What would the error be?

3. nigguraths, homogenization should be viewed as a step one takes during the process of averaging temperature over (e.g.) 5°x5° cells. What is important is that homogenization reduces the bias left in the cell average relative to what it would have been otherwise, not whether the homogenization increases or decreases the error in individual stations.

If we look at the type of errors that homogenization is attempting to fix, this includes station moves as well as changes in instrumentation and local environment (the so-called "UHI" effect).

If you wanted to look at the data for a single station, it is technically an error to even consider the combined record of two stations (one near town and one at the airport, for example), since this is really two stations. It would be a mistake to attempt to correct for UHI, since UHI is a real component of the temperature field.

If we had enough stations, and we uniformly sampled the temperature field, we could ignore UHI. Because we don't uniformly sample temperature, and there is a tendency for stations to be located near towns, we do have to make some sort of UHI correction.

But in any case that UHI-corrected station data is no longer a real temperature series, so trying to relate it to the temperature that would be measured by an ideal thermometer is pointless.

Similarly, when we take individual stations that share a single station ID and combine them into a single record in a manner that tends to preserve the regional trend, what we have left is no longer a real temperature series. And again, trying to relate this hybrid station temperature series to individual locations in the temperature field is generally going to be pointless.

I would say that both of these effects must be corrected for, even if we can only do so in a manner that reduces the bias in the measurement of regionally averaged temperature while increasing the noise in the remaining pseudo-temperature series.

In summary, we can't compare homogenized temperature to individual stations to determine the accuracy of the homogenization process. This means that other indirect validation methods are required. It makes the problem harder to study, but it's by no means an insoluble one.

4. Carrick, let me restate my example to make my question/s clear:

Say you have a large region, say the size of Argentina, covering a good few grid squares (the size of which is immaterial). Say you have just one station recording temperature in the whole region. What temperature would you give the squares in the region? The station's. Would the error in estimation of the region's average temperature be high? Of course - much of the temperature field is unsampled.

You have the same large region. You have one good station recording temperature. You have 8 bad ones surrounding it. By BEST's method, the bad stations will get altered to match the good one (to 70%, 80%, 100%, whatever). How high would the error in estimation be? The error would be lower, as the temperature field would count as being sampled better.

Do we know how much change in error occurs due to including such bad stations? That was my question.

"...we can't compare homogenized temperature to individual stations to determine the accuracy of the homogenization process."

Your starting point appears to be the same as Nick's - that there is a temperature field and we have a good sense of what it is, magnitude-wise and trend-wise. The problem then appears as how best to fit the little rural stations to the field. My starting point is different - the field is completely unknowable to us except via the stations that are in it. Homogenization and adjustments appear like weirdo stuff to me - taking data and doing something to it does not give back data. So I don't know what to make of 'the accuracy of the homogenization process'.

Incidentally, Booker's story was in the Sunday Telegraph and featured on Drudge Report over the weekend.

5. Shub, if you trace through my explanation, the purpose of some of the adjustments (UHI, station moves) is to remove bias associated with nonuniform sampling, before averaging over stations to remove noise associated with the local field measurement.

I grant it would be more ideal to apply the homogenization corrections only in such a way that you could preserve the field at the individual stations. That would permit more traditional validation methods. Otherwise, you're stuck with Monte Carlo studies to determine the net influence of the homogenization process associated with UHI and station moves.

That said, see the two links below:

It does not appear that the homogenization correction produces a substantial effect, and for BEST, only applying corrections for which there is metadata does not result in a substantial loss of resolution. I do not think the "empirical adjustments" that BEST has been experimenting with are justified, and I think this step should be removed from their monthly updated product, at least until the problems it introduces are fixed.

5. nigguraths, if you use metadata as the basis for homogenization corrections, then it certainly isn't circular.

You are right that if they use regionalized data series to spot and correct other series, there is a circularity issue. It will produce spatial smearing. The bias introduced by this smearing is probably minor for the global mean temperature, but as I pointed out above, when researchers are leaning on the regional scale to be accurate, this is still a problem for the series.

Put another way, if the only measurement that you think is valid from your series is the global value, that is all you should publish. You certainly shouldn't publish individual cities' reconstructed temperatures if those are expected to be meaningless (due to smearing).

Using Nick's trend plotter, for example, we can see there is a clear spatial smoothing in the BEST approach.

Forget about Paraguay, the entire continent of South America virtually gets spatially homogenized to the same temperature.

Your question about the impact of this homogenization on global mean temperature is I think a valid issue.

If you want to look for an effect of homogenization, see if the BEST temperature is running hotter than other series. We can also compare GISS 1200-km to GISS 250-km smoothing.

Comparing series with metadata-only corrections to ones that internally data-mine would be useful (what Zeke alluded to), but of course that won't tell you very much if the homogenization algorithm is guilty of over-smoothing for other reasons.

If I get a chance today, I'll perform an analysis on some of these series and see what I can come up with.

1. Here's the link for Nick's trend plotter for GISS and BEST.

2. Carrick, Shub,
Spatial smearing isn't circularity. If anything, it is diffusion. Solving a system with time stepping where neighboring nodes modify each other is universal in PDE.

Southwell's 1948 relaxation method for solving the Laplace equation with boundary constraints was just to successively, in a grid, replace each value by the average of its neighbors. It converges efficiently and accurately. In Southwell's paper, "computer" means a person with a pen.
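Southwell's scheme is easy to reproduce; here is a minimal Jacobi-style version in Python (the grid size and boundary values are arbitrary):

```python
import numpy as np

# Relaxation for the 2-D Laplace equation: repeatedly replace each
# interior value with the average of its four neighbours, holding
# the boundary fixed (Southwell's scheme, done by machine).
n = 20
grid = np.zeros((n, n))
grid[0, :] = 1.0                      # fixed boundary condition: hot top edge

for _ in range(2000):
    # compute all updates from the old grid, then write them back
    interior = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                       grid[1:-1, :-2] + grid[1:-1, 2:])
    grid[1:-1, 1:-1] = interior

# At convergence each interior point equals the mean of its neighbours
residual = np.max(np.abs(
    grid[1:-1, 1:-1] - 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                               grid[1:-1, :-2] + grid[1:-1, 2:])))
```

The interior values end up interpolating smoothly between the fixed boundary values - neighbours informing neighbours, iterated to consistency, with no circularity problem.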

Who could do that? But I've been working on my breakdown post. It's true that S America does have the largest adjustments.

4. Nick: Spatial smearing isn't circularity. If anything, it is diffusion. Solving a system with time stepping where neighboring nodes modify each other is universal in PDE

I guess that depends on what you mean by "circular".

If you mean that the quantity to be determined appears on both sides of the equation (on the left as an assignment, on the right as a term in the computation), then I think it's true that BEST for example is, in that sense, "circular".

But that's the problem with using non-technical words. In filtering theory, I would use the word "recursive" instead.

I might use the word "spectral smearing" here instead of "diffusive" because I tend to think of diffusion as something that occurs over time. This splatter happens independently "frame by frame", regardless of what occurred previously or in the future.

5. By the way, I was wrong to suggest that recursion (or circularity) has anything to do, by itself, with spatial smearing. It's unrelated.

How much spatial smearing there is has more to do with how many distant stations you end up consulting when making a homogenization correction.

In fact, you could actually reduce the spatial smearing via an appropriate spatio-temporal recursive filter design.

In any case, the point I think Nick was trying to make about "circularity" is correct. Just because there is recursion, doesn't mean the answer is less accurate.

6. A couple of images showed up on twitter. I'm taking the liberty to repost them here:

Effect of homogenization on global temperature index

The impact of homogenization on the recent global temperature index seems minimal.

Effect of homogenization on spatial resolution.