Thursday, February 9, 2017

Flutter in GHCN V3 adjusted temperatures.

In the recent discussion of the kerfuffle over John Bates and the Karl 2015 paper, Bates' claim that the GHCN adjustment algorithm was subject to instability came up. His claim seemed to be of an actual fault in the code. I explained why I think that is unlikely; rather, the instability is a feature of the Pairwise Homogenisation Algorithm (PHA).

GHCN V3 adjusted is issued approximately daily, although it is not clear how often the underlying algorithm is run. It is posted here - see the readme file and look for the qca label.

Paul Matthews linked to his analysis of variations in Alice Springs adjusted over time. It did look remarkable; fluctuations of a degree or more over quite short intervals, with maximum excursions of about 3°C. This was in about 2012. However Peter O'Neill had done a much more extensive study with many stations and more recent years (and using many more adjustment files). He found somewhat smaller variations, and of frequent but variable occurrence.

I don't have a succession of GHCN adjusted files available, but I do have the latest (downloaded 9 Feb) and I have one with a file date here of 21 June 2015. So I thought I would look at differences between these to try to get an overall picture of what is going on.

I restricted to data since 1880, in line with what most indices use. So the first thing I should show is a histogram of all the differences for all stations:

The mean is -0.004°C and the sd is 0.331°C. Here is a breakdown by months - the result is remarkably even:

I next looked at the years since 1999 - the 21st Century. Here again is the histogram:

Now the mean was -0.0017°C, and the sd 0.221°C. And the breakdown by months was:
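
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post. It assumes each adjusted file has already been parsed into a dictionary keyed by (station id, year, month) with temperatures in °C; the names `diff_stats`, `old` and `new` are hypothetical, and the sample values are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev

def diff_stats(old, new, first_year=1880):
    """Compare two versions of adjusted data.

    old, new: dicts mapping (station_id, year, month) -> temperature in °C.
    Only entries present in both versions, from first_year on, are compared.
    Returns (overall_mean, overall_sd, {month: (mean, sd)}).
    """
    by_month = defaultdict(list)
    for key, new_value in new.items():
        station, year, month = key
        if year >= first_year and key in old:
            by_month[month].append(new_value - old[key])
    all_diffs = [d for diffs in by_month.values() for d in diffs]
    monthly = {m: (mean(d), pstdev(d)) for m, d in sorted(by_month.items())}
    return mean(all_diffs), pstdev(all_diffs), monthly

# Tiny made-up example (values are invented, not GHCN data).
old = {("A", 1990, 1): 10.0, ("A", 1990, 2): 11.0, ("B", 1870, 1): 5.0}
new = {("A", 1990, 1): 10.5, ("A", 1990, 2): 10.5, ("B", 1870, 1): 6.0}
overall_mean, overall_sd, monthly = diff_stats(old, new)
```

In the made-up example the 1870 entry is dropped by the 1880 cutoff, and the two remaining differences cancel in the mean while still contributing to the sd, which is the pattern seen in the real comparison.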



The PHA is a trade-off. It seeks to reduce bias from non-climate events, which would not be reduced by the averaging process. The cost is a degree of uncertain and sometimes wrong identification, which appears as added noise. Now noise is heavily damped by the averaging, as long as it is unbiased. Ensuring that is part of the design of the algorithm, and can be tested on synthetic data.

Here there is quite substantial noise showing up as time discrepancies. I did a demonstration a while ago showing that adding white noise of even 1°C amplitude made virtually no difference to the average. So thinking of the global average, the sd of 0.33°C for the whole period is not necessarily alarming. And what is reassuring is that the mean is very close to zero, not only overall but for each month. This strongly suggests that the noise does not introduce bias.
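
As a rough illustration of why unbiased noise matters so little to an average, here is a small simulation. It is a sketch under stated assumptions, not the demonstration referred to above: the "true" values are invented random numbers, the noise is independent Gaussian with sd 1°C, and the function name `noise_effect_on_average` is hypothetical.

```python
import random
from statistics import mean

def noise_effect_on_average(n_values=20000, noise_sd=1.0, seed=0):
    """Return the change in the plain average caused by adding unbiased noise."""
    rng = random.Random(seed)
    # Invented "true" anomalies; only the change in the average matters here.
    true_values = [rng.gauss(0.0, 2.0) for _ in range(n_values)]
    noisy_values = [v + rng.gauss(0.0, noise_sd) for v in true_values]
    return mean(noisy_values) - mean(true_values)

shift = noise_effect_on_average()
# For independent, zero-mean noise the shift scales as noise_sd / sqrt(n).
expected_scale = 1.0 / 20000 ** 0.5  # about 0.007 °C
```

The point is only the scaling: the shift in the average shrinks with the square root of the number of values, so noise of 1°C on individual values becomes hundredths of a degree or less in a large average. Noise with a non-zero mean would not be damped this way, which is why the near-zero means above are the key result.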

I'd like to take this further with a regional breakdown, and rural/urban. But for the moment, I think this expands the picture of this flutter and what it means.

Appendix - a comment from Bob Koss, which I am posting here to get readable format

I noticed a couple people mentioned v4 USCRN data. They aren't in v3.

Here are a couple data tables giving data means and tallies. Adjusted - Raw calculations.

USCRN from v4.b.1.20170209.
Year  -Mean   -Mths   +Mean   +Mths   ±Mean   ±Mths   All_Mean All_Mths Stns
2001  0.000   0       0.000   0       0.000   0       0.000    11       2
2002  -0.210  5       0.225   18      0.130   23      0.023    129      17
2003  -0.215  24      0.237   48      0.086   72      0.018    339      39
2004  -0.292  29      0.233   97      0.113   126     0.022    638      67
2005  -0.328  37      0.256   133     0.129   170     0.025    861      79
2006  -0.322  26      0.265   155     0.181   181     0.033    982      92
2007  -0.210  11      0.294   170     0.263   181     0.040    1199     104
2008  -0.210  11      0.339   154     0.302   165     0.040    1237     106
2009  -0.210  12      0.358   145     0.315   157     0.039    1252     105
2010  -0.210  11      0.368   135     0.324   146     0.038    1239     106
2011  -0.210  12      0.367   106     0.309   118     0.029    1239     105
2012  -0.210  12      0.381   95      0.314   107     0.027    1267     106
2013  -0.210  11      0.379   48      0.269   59      0.013    1267     106
2014  0.000   0       0.449   11      0.449   11      0.004    1262     106
2015  0.000   0       0.000   0       0.000   0       0.000    1264     106
2016  0.000   0       0.000   0       0.000   0       0.000    1261     106
2017  0.000   0       0.000   0       0.000   0       0.000    105      105

GHCN from v3.3.0.20170201
Year  -Mean   -Mths   +Mean   +Mths   ±Mean   ±Mths   All_Mean All_Mths Stns
2001  -0.527  7774    0.456   7586    -0.041  15360   -0.022   28608    2752
2002  -0.523  7543    0.450   7517    -0.037  15060   -0.019   28993    2786
2003  -0.530  7388    0.446   7547    -0.037  14935   -0.018   29861    2778
2004  -0.523  7131    0.443   7279    -0.035  14410   -0.017   28963    2809
2005  -0.515  6869    0.446   6985    -0.031  13854   -0.015   28215    2677
2006  -0.511  6567    0.442   6968    -0.020  13535   -0.010   28238    2655
2007  -0.511  6333    0.436   6893    -0.017  13226   -0.008   28720    2640
2008  -0.507  6156    0.419   6786    -0.022  12942   -0.010   29013    2653
2009  -0.490  5767    0.401   6618    -0.014  12385   -0.006   29050    2659
2010  -0.467  5528    0.388   6400    -0.008  11928   -0.003   29244    2666
2011  -0.449  5213    0.376   6091    -0.004  11304   -0.002   28670    2663
2012  -0.430  4816    0.353   5645    -0.007  10461   -0.003   28606    2634
2013  -0.403  4331    0.322   5253    -0.006  9584    -0.002   28247    2575
2014  -0.366  4032    0.295   4924    -0.002  8956    -0.001   27937    2525
2015  -0.354  3589    0.281   4386    -0.005  7975    -0.001   26151    2465
2016  -0.355  3279    0.278   4013    -0.007  7292    -0.002   24368    2199
2017  -0.611  86      0.406   332     0.197   418     0.167    494      494
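
The column definitions are not spelled out in the comment, but the numbers are consistent with the following reading: -Mean/-Mths and +Mean/+Mths are the mean and count of negative and positive (Adjusted - Raw) months, ±Mean/±Mths cover all months with a non-zero adjustment, and All_Mean/All_Mths cover every month with data. A minimal sketch of that reading, with a hypothetical function name `tally`:

```python
from statistics import mean

def tally(diffs):
    """diffs: Adjusted - Raw values (°C) for every month with data in one year."""
    neg = [d for d in diffs if d < 0]
    pos = [d for d in diffs if d > 0]
    changed = neg + pos

    def avg(xs):
        # An empty group is reported as 0.000 in the tables.
        return mean(xs) if xs else 0.0

    return {
        "-Mean": avg(neg), "-Mths": len(neg),
        "+Mean": avg(pos), "+Mths": len(pos),
        "±Mean": avg(changed), "±Mths": len(changed),
        "All_Mean": avg(diffs), "All_Mths": len(diffs),
    }

# Check against the 2002 USCRN row. The individual monthly values are
# invented so that the group means match the published -0.210 and 0.225.
row = tally([-0.210] * 5 + [0.225] * 18 + [0.0] * 106)
```

With those inputs the sketch gives ±Mean of about 0.130 over 23 months and All_Mean of about 0.023 over 129 months, matching the 2002 USCRN row.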

Note: GHCN makes no adjustments for the past two years other than using TOBS corrections for USHCN data. A large number of stations are labeled a total failure by PHA. Over the passage of years, many of these failures are eventually accepted as valid, with some being adjusted and others simply passed along. By the time you get back to 1951, 48% of the data is adjusted down while 23% is adjusted up.

2016 had 29162 months at 2594 stations having at least one month of valid data in the qcu. That is after cleaning errors.



  1. The problem seems to be somewhat congruent to multiple sequence alignment in bioinformatics, which suffers the same sort of issues - the area of the energy landscape close to the global minimum is very flat and has lots of local minima.

    The only real solution I know of is to produce an ensemble of outputs (or even better, represent the entire energy landscape). That however means long calculations and vast downloads, which we know from experience (e.g. with the HadCRUT4 ensemble) everyone will ignore anyway. I believe GHCN do have an ensemble, but I've never heard of anyone using it.

    I suspect that the ensemble results are very stable over time, and that the flutter essentially arises from the adjustments crudely sampling within the ensemble space. It's an interesting area for further study though. If I had time I'd start by running the current data through PHA, and then truncating months off the end to see how things change. Then I'd try adding noise to the current data and see if that produces the same kind of spread.

    1. The next version, GHCNv4, will have a limited ensemble. It explores the uncertainty from the main settings of the pairwise homogenization method.

      The (also incomplete) estimates of uncertainties due to inhomogeneities in the HadCRUT ensemble are more complete.

      In the long-term the approach of GHCNv4 is more promising because they estimate the uncertainties from the data, while HadCRUT uses prior information from the literature and needs to assume that that is valid for all stations, while every network and climate has its own problems.

    2. These scientists such as Roy Spencer are pathetically inept. What does it take for someone owning a time series with a clear nuisance variable (not kidding, it's a real statistical term) to blithely ignore that variable and publish results without removing that nuisance variable?

      In the case of Spencer's data, it's clear that he can remove the ENSO variability. There is a model for ENSO which is easily derived from the angular momentum variations in the earth's rotation, so it should be as straightforward as removing a 60 Hz hum from an electrical signal. Show me an electrical engineer or physicist who is not going to do that kind of compensation correction and I will show you one that won't make much progress.

      The entire cabal of Curry, Webster, Tsonis, Salby, Pielke, and Gray who have spun their wheels for years in trying to understand ENSO need to be marginalized and some fresh perspectives need to be introduced.

      I am worked up because I made the mistake of listening to the EPA hearings today. The one witness who was essentially schooling the Republican thugs was Rush Holt PhD, who is now CEO of the AAAS but at one time was a physicist congressman from New Jersey. You could tell he understood how those cretins thought and knew it was hopeless but decided to teach anyone else in the audience who might be listening. My favorite bit of wisdom he imparted was that science isn't going to make any progress by looking at the same data over and over again the same way, but by "approaching the problem with a new perspective".

      Watch it here, set to 100 minutes into the hearing

      Suggest drain the swamp of these charlatans such as Spencer, Bates, Curry, Lindzen, et al. Might as well hit them hard now before they occupy positions in the Trump administration.

    3. Nuisance parameter as defined in wikipedia
      "any parameter which intrudes on the analysis of another may be considered a nuisance parameter."

      Examples of nuisance parameters:
      1. Periodic tide effects when trying to measure sea-level height increases
      2. Daily and seasonal temperature excursions when trying to measure trends

      ENSO is a nuisance parameter because it gets in the way of measuring global temperature trends. They compensate for the two examples above but not ENSO, presumably because it is not as easy to filter and they don't know how much to compensate for it. I say just do the compensation anyways.

    4. thanks much for your excellent response. i would never have thought there'd be a wiki entry for it, but there it is. There's something a bit bizarre about casting aspersions on an influence which is known and part of the data but maybe peripheral to the process being studied.

  2. That the homogenized data for some stations flutters is in itself okay. Also, if the algorithm works right, that will unavoidably happen.

    Every day new data comes in. That makes it possible to see new inhomogeneities. These breaks are detected using the statistical test SNHT. Sometimes breaks will be seen as statistically significant that with one more data point just do not cross the significance threshold and with again new data will be significant again. And so on. One significant break can also influence whether other breaks in the pair are detectable.
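
For readers who want to see the shape of the test, here is a minimal sketch of the standard SNHT statistic applied to a single difference series. It is not NOAA's implementation: the function name `snht` is hypothetical, the use of the sample standard deviation for standardising is an assumption, and the critical value (which depends on series length) is deliberately left out.

```python
from statistics import mean, stdev

def snht(series):
    """Return (T0, k): the maximum SNHT statistic and the break position k,
    meaning the candidate break lies between series[k-1] and series[k].

    Assumes the series is not constant (otherwise the sd is zero).
    """
    n = len(series)
    m, s = mean(series), stdev(series)
    z = [(x - m) / s for x in series]      # standardised series
    best_t, best_k = 0.0, None
    for k in range(1, n):
        z1 = mean(z[:k])                   # mean before the candidate break
        z2 = mean(z[k:])                   # mean after the candidate break
        t = k * z1 ** 2 + (n - k) * z2 ** 2
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

# A clean step of 1 °C halfway through a 20-value difference series.
t0, k = snht([0.0] * 10 + [1.0] * 10)
```

A break is declared when T0 exceeds a critical value. Because both T0 and the critical value change each time a value is appended, a break that sits near the threshold can be significant in one run and not in the next, which is the yes/no behaviour described above.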

    After detecting the breaks in the pairs, these breaks are assigned to a specific station (called attribution in the paper), whether a break is detected and the exact year in which it is detected will influence this attribution. If one station has a break that is near statistically significant, this could thus even influence the results for its surrounding stations.

    The influence of inhomogeneities is largest for stations and becomes less for networks, continents and the world. In the upcoming GHCNv4 homogenization will likely not change the global mean warming much any more.

    Homogenization improves the data the most at the station level and smaller scales, but data at the station level is still highly uncertain. If these small scales are important to you, please contact your local national weather service; they know much better what happened to their network and their data will likely be more accurate than what we can do for a global dataset.

    The pairwise homogenization algorithm is fully automatic. It is thus easy to run it every night and that gives the most accurate results. Last time I asked, but that is years ago, NOAA also actually ran the algorithm every night.

  3. I'd be curious to see what homogenization does to USCRN data. I would expect any changes introduced to be an indication of potential error introduced by homogenization.

    1. You want to see a difference in the mean before and after a break; that is what the algorithm tries to detect. The USCRN only has a bit more than 10 years of data, so the uncertainty in the means of the two short periods before and after the break would be large and you would most likely simply not see anything because nothing is statistically significant even if there were real inhomogeneities.
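
To put rough numbers on the short-series point: if the difference series has independent noise with standard deviation σ, the standard error of the step (difference of the two segment means) is σ·sqrt(1/n1 + 1/n2). The sketch below is only illustrative; the noise level of 0.5°C and the segment lengths are invented, and `step_standard_error` is a hypothetical name.

```python
from math import sqrt

def step_standard_error(noise_sd, n_before, n_after):
    """Standard error of (mean after - mean before), assuming independent noise."""
    return noise_sd * sqrt(1.0 / n_before + 1.0 / n_after)

# Annual values: 5 years either side of a break versus 30 years either side.
short = step_standard_error(0.5, 5, 5)     # about 0.32 °C
long = step_standard_error(0.5, 30, 30)    # about 0.13 °C
```

With only about ten years in total, a real step of a few tenths of a degree would be within the uncertainty of the estimate.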

      The SNHT test used in the pairwise homogenization algorithm (PHA) has some problems with short series; it detects too many breaks in such cases. I would expect that the attribution step of the pairwise homogenization algorithm would remove nearly all of these wrong breaks again. If you really want to do this with such short series, it would be good to replace the SNHT in PHA with the corresponding test of RHtests, which was designed to remove the problem of SNHT with over-detection for short series and near the edges.

    2. Victor, thanks for the information. I don't know if the USCRN stations are included in GHCN V3. However, I understand GHCN V4 is adding tens of thousands of stations, on a par with BEST, and I am guessing the USCRN and many other stations with relatively short periods of record may be included. If true, this is where comparing the USCRN results before and after homogenization could be very informative and might be helpful for improving the routines.

    3. Yes, the new dataset for GHCNv4 will be the ISTI dataset, which has a similar size as the Berkeley Earth dataset and also includes shorter station series. Not sure if they are that short and most are longer ones. I would not be surprised if they first remove such very short series.

      There is something related you may like: After homogenization of the standard US network the data fits better to the USCRN than before homogenization.

      Evaluating the impact of U.S. Historical Climatology Network homogenization using the U.S. Climate Reference Network
      Numerous inhomogeneities including station moves, instrument changes, and time of observation changes in the U.S. Historical Climatological Network (USHCN) complicate the assessment of long-term temperature trends. Detection and correction of inhomogeneities in raw temperature records have been undertaken by NOAA and other groups using automated pairwise neighbor comparison approaches, but these have proven controversial due to the large trend impact of homogenization in the United States. The new U.S. Climate Reference Network (USCRN) provides a homogenous set of surface temperature observations that can serve as an effective empirical test of adjustments to raw USHCN stations. By comparing nearby pairs of USHCN and USCRN stations, we find that adjustments make both trends and monthly anomalies from USHCN stations much more similar to those of neighboring USCRN stations for the period from 2004 to 2015 when the networks overlap. These results improve our confidence in the reliability of homogenized surface temperature records.

    4. Victor, thanks for the additional info. I vaguely remember seeing something about that comparison last time I visited the USCRN web site over a year ago. I've been meaning to go back to update data I downloaded for Texas area stations. I went to the link you provided, but it appears to be paywalled. However, I searched the title and found a publicly available PDF: here (in case anyone else is interested).

  4. I guess the naive question is why doesn't NOAA do the hard grunt work of evaluating stations data on a case by case basis and carefully documenting the adjustments. Once past adjustments are assigned, they should be frozen for all future updates.

    Wind tunnel tests are evaluated and data adjusted differently for each different test set up. Using an automated "algorithm" would be an inferior method. No honest specialist would endorse such a fluttering algorithm. The result is better data and a traceable case by case documentation.

    The noise being randomly distributed for a couple of cases examined is not very convincing to me. NOAA is paid for by US taxpayers. They should prioritize a more defensible analysis of particularly US weather station data.

    1. "Once past adjustments are assigned, they should be frozen for all future updates."
      No, that would be very unwise. PHA makes many thousands of decisions about whether possibly irregular behaviour should be corrected. New information which may affect that decision is coming in. Inflexibility will hurt.

      But there is a very strong case for automated, flexible decision making. For averaging, the enemy is bias, not noise. PHA trades bias for noise. That's OK, provided you can show that the extra noise is itself unbiased. With an automated algorithm you can test that.

      In CFD I used to sometimes be asked - if acoustic oscillations (say) aren't really there in practice, can't you just freeze them? And the answer is, no, they are part of the dynamics. The physics won't work if you intervene in those ways.

    2. Not sure I agree. It's OK of course to go back and revisit an adjustment based on better information. However, in a wind tunnel test, you would do the adjustments based on knowledge of the test setup, perhaps CFD simulations, etc. However, the important point is that this must be done by a real human being using engineering judgment on a case by case basis. An automated "algorithm" would not be acceptable to anyone involved. There needs to be a clearly documented process in every case.

      Another thought based on flight testing. Often there are "bad sensors" giving clearly questionable data. You don't try to "adjust" those sensors based on neighboring sensors. You either fix the sensor or you simply discard that data.

    3. David Young, if you run your computational fluid dynamics code that is an automated algorithm. Do not put your own work down, it has its value.

      There have been just as many unreasonable people in your political movement that have complained about manual adjustments.

      There is a group working on parallel measurements to study influence of changes in observational methods. At the moment it is a volunteer effort. If you know of taxpayers willing to pay for it, I would welcome it. It is always better to have more lines of evidence.

    4. "Another thought based on flight testing. Often there are "bad sensors" giving clearly questionable data. You don't try to "adjust" those sensors based on neighboring sensors. You either fix the sensor or you simply discard that data."
      "discard that data" is an adjustment. And in global temperature averaging it often has a rather specific effect. It says, replace that value by the global average. Although you can improve on that by using some kind of local average (without the bad point).

      Much of what you see in homogenisation is a version of discarding. You replace the doubtful data, usually over some time period, by some estimate based on nearby information. Expressing this as an adjusted value in a table is just part of the mechanics of implementation. It is useful, because it means someone else doing an integration doesn't need to repeat the decision-making process. But the drawback is that it does lead to the sort of WUWT over-analysis, based on the idea that people are really trying to say what Alice Springs should have been. They aren't; they are trying to work out what value assigned to AS would give the best estimate of the region value in the integral. So if they say - replace AS by an average of nearby stations - that is exactly the "discard" effect. Alice is discarded, and the neighboring stations only are used to estimate the region. But it is presented as a superior value for AS, which isn't really the point. I think overall it would probably be better if NOAA didn't publish adjusted values at all, but that this was left as an intermediate stage in integration, which is where it belongs.
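
The equivalence between discarding a value and replacing it with the average of the rest can be checked with a few lines. This is a toy sketch with a plain unweighted average and invented readings; real global averaging uses area weighting, and the function names are hypothetical.

```python
from statistics import mean

def average_without(values, bad_index):
    """Average after simply discarding the doubtful value."""
    kept = [v for i, v in enumerate(values) if i != bad_index]
    return mean(kept)

def average_with_infill(values, bad_index):
    """Average after replacing the doubtful value by the mean of the others."""
    kept = [v for i, v in enumerate(values) if i != bad_index]
    infilled = list(values)
    infilled[bad_index] = mean(kept)
    return mean(infilled)

readings = [14.0, 15.0, 16.0, 40.0]      # the last value is the "bad sensor"
a = average_without(readings, 3)
b = average_with_infill(readings, 3)
```

Both routes give the same average, so publishing the infilled value changes how the table looks but not the regional estimate.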

    5. Climate scientists manually adjusting temperature data based on their "expert judgement". I can see the headlines now.

      Discarding of obviously bad data is done as well, but homogenisation isn't about that. The data is good as recorded, it's just that the measurement conditions may be different compared to other times in the record (e.g. because a station has moved location). To produce a homogeneous like-for-like record that change in conditions needs to be accounted for.

      Since these are events which happened decades ago there just isn't an avenue to do any grunt work even if they thought it might be a good approach.

    6. Yes, Victor, but CFD codes have VASTLY better verification and validation than weather station data. And smart people look at the details of every series of runs for consistency, etc. The analogy is not really valid.

      Yes parallel measurements is a very good idea when there are equipment or siting changes for example. My question is why in the world has NOAA not done that? Just another example of what I would call lack of due diligence at NOAA. Perhaps they are underfunded, but they should prioritize this very highly I would think given its critical importance to critical policy issues.

      In the climate wars I don't have a "political movement" so you should not smear me by trying to place me in your nicely labeled political categories. That's what is called prejudice. As to the substance, yes there will always be disagreements about adjustment methods. I would argue that a well documented case by case expert driven process would be more accurate and result in better visibility.

    7. PaulS, Of course there is scope to do case by case expert evaluation of past instrument changes and station siting changes. We do that all the time with wind tunnel tests. There is always extensive documentation to look at. In many cases, there is some documentation as well for weather stations even though not as extensive as for wind tunnels. Anthony Watts has done some of this work.

      The problem here is that the weather station network was not designed for long term trend determination. That of course makes it very hard to really do this job of adjustments in a defensible and transparent way.

    8. David Young: "CFD codes have VASTLY better verification and validation than weather station data."

      Okay, so your original claim that it was a problem that the algorithm is automatic was wrong? Can happen in a quick internet comment.

      To make that statement you need to be well versed in the scientific literature on the validation of homogenization methods, could you tell me what you see as the 3 most important publications in that field?

    9. Sad that newbie fluid dynamics engineers such as David Young never studied the work of pioneers such as Faraday and Rayleigh back in the 1800's. They realized that applying a periodic sinusoidal modulation to a volume of fluid often causes a period doubling.

      Alas, Faraday and Rayleigh didn't live long enough to explain the variability in climate that we have observed since, ala ENSO. Yet, like Laplace before them in establishing the primitive equations for atmospheric flow, we can imagine that they would have likely realized that a yearly modulation stimulated by the earth's orbit leads to a biennial modulation in the thermocline properties. In fact, this period-doubling modulation, mixed in with the angular momentum variations in the earth's rotation (evidenced by the Chandler wobble and lunar tidal forces) will accurately model the significant ENSO variations. One can take any interval of ENSO and once mapped to this modulation will ergodically extrapolate to any other interval.

      It really is amazing that Lord Rayleigh proposed a modulated wave formulation in 1883 which is identical to the Mathieu wave equation used heavily by ship engineers in every modern-day liquid sloshing model. Mind blowing that this can be applied to ENSO, so cool.

      David Young can be forgiven for being a newbie who hasn't studied the literature, and so goes around battling phantoms of his own making. He only has his wind-tunnel hammer as a tool, so everything to him looks like a turbulent nail.

    10. Let me clarify my view a little. The big problem I see with NOAA's adjustment algorithm is that it appears to be unstable to small additions of new data. That would of course be a serious problem with a CFD code too. It would cause a wind tunnel test to be shut down and a large effort to find the problem and fix it.

      My opinion is that temperature data from weather stations might be better handled the way wind tunnel or flight test data is handled and adjusted. Just a suggestion. You know the field of adjustments better than I do so I would find your technical thoughts interesting.

    11. That last comment was directed to Victor. Has this instability issue been examined in the literature? I really want to know.

    12. David,
      I would see the instability as an analogue of turbulence. It is a confusing factor if you really want to find high resolution velocities. But you can still perfectly well work out the mean flow, and that determines what you often really want to know in the wind tunnel.

    13. David Young, I would not know what to study. What would be your hypothesis? "Does a yes/no process lead to yes/no results?" Not sure if the answer to that is publishable. :-|

      There are naturally many studies on the noise level and how that determines the probability of correctly finding a break and the false alarm rate. Or on how the signal to noise ratio determines how accurate the position of the break is. Or on how much homogenization improves the trend estimates, if I may plug my blind benchmarking study:

    14. Nick Stokes: "I think overall it would probably be better if NOAA didn't publish adjusted values at all, but that this was left as an intermediate stage in integration, which is where it belongs."

      Agree on the one hand, homogenized data is not homogeneous station data. Homogenized data gives an improved estimate of the regional climate. The short-term variability is still the one of the station.

      What I like about homogenized data is that it improves the transparency of the climate data processing. You can clearly see what this step in the processing does.

      In addition people can quickly make an analysis of the specific question they are interested in without having to do the homogenization themselves every time. Weather services cannot pre-compute all numbers and graphs people may need.

    15. Nick, I think the turbulence issue is different in character than data adjustment algorithms. In steady state RANS you model the turbulence to make it a steady state BVP and in that context, you want stable numerical methods. So for example if I changed the grid a little, I want the answer to only change a little. It's muddier in time accurate simulations.

      As I said above, an unstable CFD code is perfectly useless and people would jump to find and fix the problem by finding some way to "stabilize" the algorithm and/or understand if the problem is singular, etc.

    16. Victor, it's the same issue we studied a couple of years ago in AIAA Journal. We found that extremely small details caused dramatically different answers in our CFD codes for one problem. We were able to document that the problem itself was singular and that the codes were OK, but only with very careful analysis and actually seriously looking for negative results.

      You need to look at Paul Matthews' information and then look to duplicate the anomalous behavior. Then one would want to change the algorithm to stabilize it.

    17. David,
      "As I said above, an unstable CFD code is perfectly useless"
      Yes, but I don't believe this is an unstable code. It is an algorithm that generates a somewhat chaotic pattern. That is why the analogy with turbulence. There is a fine scale on which you see chaos, but on the scale you are interested in (spatial mean, of flow or temp) that washes out, and the result does not reflect the local instability.

    18. Nick, I understand your analogy but think it still doesn't justify the instability shown by Paul Matthews. You want to "model" turbulence for a stable calculation. So you smooth and time average it.

      The adjustment methods seem like a sophisticated form of interpolation and averaging. It should be a smoothing operator, not one having high sensitivity to small additions of later data. I still think that's a reason for NOAA to really do a thorough audit of their method. The "turbulence" here is not in the modeled data but is introduced by the unstable adjustment algorithm.

  5. A more important question is why this flutter issue has not received significant attention in the literature. Perhaps it's there and I'm unaware of it. Paul Matthews has documented that NOAA simply refused to reply or respond when the issue was pointed out to them multiple times.

    1. Good grief.

      It's getting hot DY. Have you noticed?

    2. The problem is that David Young's wind tunnels don't operate underwater.

  6. Nice work Nick,
    I think you should redo this exercise with GHCNv4.
    I'll bet that the relative frequency and magnitude of the flutter will be much smaller in v4.

    V4 does a much more sensible adjustment in Alice Springs (if we can accept that it discards all data before 1941; I don't know why, but I believe that the station moved from the town to the airport then).

    I have seen that GHCN v3 can do strange things with remote lonely stations, for instance those in the high Arctic. I believe that GHCN v4 will be a general remedy for this kind of problem. If the lonely stations are supported by new neighbour stations, it will be easier for the PHA to "decide" if the temperature changes are real or not.

    1. Olof,
      Yes. I looked at V4 unadjusted here (Google map here). But I haven't really looked at the adjusted version. I'll start saving some files.

    2. V4 will not adjust arctic stations.

      In the past (in Iceland, for example) it was found that certain stations had abrupt discontinuities that were related to retreat of ice cover. (Ask Zeke; he went to Iceland to talk to them about one case.) Anyway, the algorithm saw a break and "fixed" it, but actually the change was real, with a real physical basis.

  7. Bob Koss tried to post a comment, but ran into trouble. I have posted it as an appendix to the main post above, to preserve the format.

  8. Nick and Victor: When I look at BEST's plots of the difference between station data and the "regional expectation", there often seems to be a strong seasonal signal. Due to local environment, during summer a station may be warmer than average for the region and the opposite in winter. When a breakpoint detection algorithm is on the verge of reporting a shift to warmer readings, that shift is most likely to be detected in the summer. The following winter, there may be less confidence that a breakpoint has been detected. FWIW, this seems to be one mechanism that could cause "flutter" in the homogenized output from some stations.


    1. It is quite common for a station to have a different seasonal cycle than its neighbors. Not only in the mean, but also in how strong the correlations with other stations are, which produces a seasonal cycle in the noise of the difference time series. To remove these effects is difficult because they can also change at a break point.

      NOAA's pairwise homogenization method only looks at the annual average temperature. National datasets, especially manually homogenized datasets, often also look at the size of the seasonal cycle, or at the series of the summer mean or the series of the winter mean. This avoids problems with the annual cycle, and the correlations in time of monthly differences are higher.

      I had expected BEST to do the same as NOAA; their paper says they follow NOAA, but it is not clear to me whether they use monthly or annual data. They went out of their way not to hire anyone with relevant expertise to appease the mitigation skeptics. So maybe they used a sub-optimal method using monthly data. Will page Mosher on Twitter to ask.

    2. Yes. A while back I was looking at what our algorithm did to CRN stations (a gold standard); in 5% or so of the cases we were adjusting them. It had to do with our recalculation of seasonal cycles for stations.

      We haven't finished looking at it; priorities and all that.

    3. Victor and Steve: Thanks for taking the time to reply. Victor: If the NOAA PHA only looks at annual averages, does that limit your statistical power to identify a breakpoint? I vaguely remember that some algorithms were finding as many as one breakpoint every one or two decades. In that case, you won't have very many data points defining a breakpoint surrounded by two stable relationships in a pair of station records. Getting the overall trend correct depends on getting the correction at the breakpoint right. For the 20th century, you could have a half-dozen or more breakpoints. If each adjustment came with a confidence interval, then the uncertainty in the overall change (and trend) is going to be really high.

      Whenever I've looked at BEST aligning split records with the regional expectation, it seems to take only two or three breakpoints for the trend of the aligned record to appear to perfectly match the trend of the regional expectation. Or at least it looks that way in the final product, which I think is smoothed over 13 months. I recognize that the regional expectation is derived from kriging unadjusted individual records, not by averaging the aligned records. Nevertheless, it is distressing to see how easily the segments from a flawed record can be aligned to agree with a particular trend. And if the record one is aligning against is biased ... I'm not saying I believe this is what happens, but it is in the back of my mind.

      Has anyone looked to see if the overall trend of stations varied with the number of corrected breakpoints in the record?

      Thanks, Frank

    4. That was a long list of questions, and it had to wait for a quiet moment on the weekend.

      You will in most cases not be able to detect all breaks, but station temperature data is expected to have one break every 15 to 20 years.

      Just going to monthly data does not have benefits over annual data. Monthly data is also more noisy and you have all the problems I mentioned above. However, there are inhomogeneities that only have a small effect on the annual mean, while they have a clear effect on the annual cycle. You could improve detection of small inhomogeneities by including breaks in the seasonal cycle; people working manually typically do so.

      You are right that errors accumulate over time and are largest in the early period. Not only because of error accumulation of the corrections, but also because the network was much less dense then and the nearby stations are thus less nearby, making the difference time series more noisy. This makes detection harder and corrections more uncertain.
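
      As a rough illustration of the accumulation point, here is a minimal sketch. It assumes (this is not from NOAA's or anyone's published method) that each breakpoint correction carries an independent, zero-mean error with the same standard deviation; the 0.2°C figure and the function name `accumulated_sd` are hypothetical, chosen only for illustration. Under that assumption the error of the earliest segment relative to the present grows as the square root of the number of corrections between them.

```python
import math


def accumulated_sd(sd_per_break: float, n_breaks: int) -> float:
    """SD of the summed error after n_breaks corrections.

    Assumes independent, zero-mean errors of equal SD at each break,
    so variances add and the SD scales with sqrt(n_breaks).
    """
    if n_breaks < 0:
        raise ValueError("n_breaks must be non-negative")
    return sd_per_break * math.sqrt(n_breaks)


# Hypothetical per-break error of 0.2 degC; segments further back in time
# sit behind more corrections and so carry a larger accumulated error.
for n in range(0, 7):
    print(f"{n} breaks back: sd = {accumulated_sd(0.2, n):.3f} degC")
```

      If the errors at successive breaks were correlated, or biased, the growth would differ from this; the sketch only covers the simplest case.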

      Every developer of a homogenisation method has naturally checked how well it works. I was the first author of a large and blind study comparing many homogenisation methods, and for temperature homogenisation improves the trend estimates. NOAA's pairwise homogenisation method also participated and was one of the recommended methods. People have compared the US data before and after homogenisation with the US Climate Reference Network. After homogenisation it fits better.

      NOAA ran a blind test similar to mine for the US and could show it improves the trend estimates (but some of the bias remains). The method of Berkeley Earth was also tested on that same dataset, and it performed similarly well for the US. The International Surface Temperature Initiative is now working on making a global validation dataset.

    5. Victor,

      Do you think it would be a good test to check whether the same flutter properties are exhibited when homogenising synthetic benchmarking data? That would presumably be a good cross-check of the validity of the benchmark test setup.

    6. One could. I will not do it because I have seen nothing that would convince me that this is in any way a problem. But if someone has some precious lifetime to waste: be my guest.

      It could be that the "flutter" is smaller for such benchmarks because their signal-to-noise ratio is that of Europe and the USA, which is larger than for the middle of Australia or Africa.

      The results on a benchmark will on average be the same whether you have one year more or one year less data, but individual stations may well be sometimes different. There is nothing special about the current length. (Large changes in the length and network configuration naturally do start to matter; like I wrote above just 10 years of data is not well suited for homogenization.)
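
      For anyone who does want to try the cross-check, a minimal sketch of the comparison step follows. It assumes two homogenised runs have already been produced (for example with one year more or less of data) and loaded into dictionaries keyed by (station, year, month); that layout, the function name `flutter_stats`, and the toy values are all hypothetical. It only computes the mean and standard deviation of the differences over values present in both runs, the same summary used in the main post.

```python
from statistics import mean, pstdev


def flutter_stats(run_a, run_b):
    """Mean, population SD and count of (run_b - run_a) over shared keys.

    run_a, run_b: dicts mapping (station_id, year, month) -> adjusted temp.
    Keys present in only one run are ignored.
    """
    common = sorted(set(run_a) & set(run_b))
    if not common:
        raise ValueError("the two runs share no station-months")
    diffs = [run_b[key] - run_a[key] for key in common]
    return mean(diffs), pstdev(diffs), len(diffs)


# Toy data, made up for illustration only.
run_a = {("STN1", 2000, 1): 10.0, ("STN1", 2000, 2): 11.0, ("STN2", 2000, 1): 5.0}
run_b = {("STN1", 2000, 1): 10.5, ("STN1", 2000, 2): 10.5, ("STN3", 2000, 1): 1.0}

diff_mean, diff_sd, n_common = flutter_stats(run_a, run_b)
print(f"n = {n_common}, mean = {diff_mean:.3f}, sd = {diff_sd:.3f}")
```

      A mean near zero with a non-zero sd would correspond to unbiased "flutter"; a mean well away from zero would point to something else.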