GHCN V3 adjusted is issued approximately daily, although it is not clear how often the underlying algorithm is run. It is posted here - see the readme file and look for the qca label.
Paul Matthews linked to his analysis of variations in Alice Springs adjusted over time. It did look remarkable: fluctuations of a degree or more over quite short intervals, with maximum excursions of about 3°C. This was in about 2012. However, Peter O'Neill had done a much more extensive study with many stations and more recent years (and using many more adjustment files). He found somewhat smaller variations, of frequent but variable occurrence.
I don't have a succession of GHCN adjusted files available, but I do have the latest (downloaded 9 Feb) and I have one with a file date here of 21 June 2015. So I thought I would look at differences between these to try to get an overall picture of what is going on.
I restricted to data since 1880, in line with what most indices use. So the first thing I should show is a histogram of all the differences for all stations:
The mean is -0.004°C and the sd is 0.331°C. Here is a breakdown by month; the result is remarkably even:
| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean (°C) | -0.0039 | -0.0039 | -0.0038 | -0.0041 | -0.0042 | -0.0041 | -0.0039 | -0.0038 | -0.0039 | -0.0039 | -0.0039 | -0.0041 |
| sd (°C) | 0.3312 | 0.3311 | 0.3310 | 0.3306 | 0.3313 | 0.3312 | 0.3310 | 0.3307 | 0.3304 | 0.3305 | 0.3306 | 0.3305 |
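(For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch. It assumes the standard GHCN v3 fixed-width .dat layout - 11-character station ID, year, element, then twelve 5-character monthly values in hundredths of °C, each followed by three flag characters, with -9999 for missing - and the file names are just placeholders for the two qca versions.)

```python
# Minimal sketch: difference two GHCN v3 adjusted (qca) files month by month.
# File names are placeholders; the fixed-width layout is the usual v3 one.
import numpy as np

def read_v3(fname):
    """Return {(station, year): 12 monthly values in deg C (nan = missing)}."""
    data = {}
    with open(fname) as f:
        for line in f:
            stn, year, elem = line[:11], int(line[11:15]), line[15:19]
            if elem != "TAVG":
                continue
            vals = []
            for m in range(12):
                v = int(line[19 + 8*m : 24 + 8*m])
                vals.append(np.nan if v == -9999 else v / 100.0)
            data[(stn, year)] = np.array(vals)
    return data

old = read_v3("ghcnm.tavg.v3.qca.2015-06-21.dat")   # placeholder file names
new = read_v3("ghcnm.tavg.v3.qca.2017-02-09.dat")

rows = [new[key] - old[key] for key in old.keys() & new.keys() if key[1] >= 1880]
diffs = np.vstack(rows)                  # rows = station-years, cols = months

d = diffs[~np.isnan(diffs)]
print("overall: mean %.4f  sd %.4f  n %d" % (d.mean(), d.std(), d.size))
print("monthly means:", np.round(np.nanmean(diffs, axis=0), 4))
print("monthly sds:  ", np.round(np.nanstd(diffs, axis=0), 4))
```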
I next looked at the years since 1999, i.e. the 21st century. Again, here is the histogram:
Now the mean was -0.0017 and the sd 0.221. And the breakdown by month was:
| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean (°C) | -0.0018 | -0.0017 | -0.0012 | -0.0012 | -0.0029 | -0.0022 | -0.0018 | -0.0015 | -0.0017 | -0.0013 | -0.0017 | -0.0018 |
| sd (°C) | 0.2202 | 0.2194 | 0.2191 | 0.2184 | 0.2236 | 0.2230 | 0.2228 | 0.2215 | 0.2201 | 0.2196 | 0.2197 | 0.2188 |
Analysis
The PHA is a trade-off. It seeks to reduce bias from non-climate events, which would not be reduced by the averaging process. The cost is a degree of uncertain, and sometimes wrong, identification, which appears as added noise. Now noise is heavily damped by the averaging, as long as it is unbiased. Ensuring that is part of the design of the algorithm, and can be tested on synthetic data. Here there is quite substantial noise showing up as time discrepancies. I did a demonstration a while ago showing that adding white noise of even 1°C amplitude made virtually no difference to the average. So thinking of the global average, the sd of 0.33°C for the whole period is not necessarily alarming. And what is reassuring is that the mean is very close to zero, not only overall but for each month. This strongly suggests that the noise does not introduce bias.
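That white-noise point is easy to check numerically. Here is a minimal sketch (not the original demonstration), using made-up station anomalies; the station count and noise levels are illustrative only.

```python
# Minimal sketch of the noise-damping argument: unbiased noise added to
# individual station values barely moves the average over many stations.
import numpy as np

rng = np.random.default_rng(0)
nstations = 5000
true_anom = rng.normal(0.0, 1.0, nstations)      # stand-in station anomalies

for sd in (0.33, 1.0):
    noisy = true_anom + rng.normal(0.0, sd, nstations)
    print("noise sd %.2f degC: mean shifts by %+.4f degC"
          % (sd, noisy.mean() - true_anom.mean()))
# The shift is of order sd/sqrt(nstations), i.e. a few thousandths of a degree.
```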
I'd like to take this further with a regional breakdown, and rural/urban. But for the moment, I think it expands the picture of this flutter and what it means.
Appendix - a comment from Bob Koss, which I am posting here to preserve a readable format
I noticed a couple of people mentioned v4 USCRN data. They aren't in v3. Here are a couple of data tables giving data means and tallies (Adjusted minus Raw calculations).

USCRN from v4.b.1.20170209

| Year | -Mean | -Mths | +Mean | +Mths | ±Mean | ±Mths | All_Mean | All_Mths | Stns |
|---|---|---|---|---|---|---|---|---|---|
| 2001 | 0.000 | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 | 11 | 2 |
| 2002 | -0.210 | 5 | 0.225 | 18 | 0.130 | 23 | 0.023 | 129 | 17 |
| 2003 | -0.215 | 24 | 0.237 | 48 | 0.086 | 72 | 0.018 | 339 | 39 |
| 2004 | -0.292 | 29 | 0.233 | 97 | 0.113 | 126 | 0.022 | 638 | 67 |
| 2005 | -0.328 | 37 | 0.256 | 133 | 0.129 | 170 | 0.025 | 861 | 79 |
| 2006 | -0.322 | 26 | 0.265 | 155 | 0.181 | 181 | 0.033 | 982 | 92 |
| 2007 | -0.210 | 11 | 0.294 | 170 | 0.263 | 181 | 0.040 | 1199 | 104 |
| 2008 | -0.210 | 11 | 0.339 | 154 | 0.302 | 165 | 0.040 | 1237 | 106 |
| 2009 | -0.210 | 12 | 0.358 | 145 | 0.315 | 157 | 0.039 | 1252 | 105 |
| 2010 | -0.210 | 11 | 0.368 | 135 | 0.324 | 146 | 0.038 | 1239 | 106 |
| 2011 | -0.210 | 12 | 0.367 | 106 | 0.309 | 118 | 0.029 | 1239 | 105 |
| 2012 | -0.210 | 12 | 0.381 | 95 | 0.314 | 107 | 0.027 | 1267 | 106 |
| 2013 | -0.210 | 11 | 0.379 | 48 | 0.269 | 59 | 0.013 | 1267 | 106 |
| 2014 | 0.000 | 0 | 0.449 | 11 | 0.449 | 11 | 0.004 | 1262 | 106 |
| 2015 | 0.000 | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 | 1264 | 106 |
| 2016 | 0.000 | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 | 1261 | 106 |
| 2017 | 0.000 | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 | 105 | 105 |

GHCN from v3.3.0.20170201

| Year | -Mean | -Mths | +Mean | +Mths | ±Mean | ±Mths | All_Mean | All_Mths | Stns |
|---|---|---|---|---|---|---|---|---|---|
| 2001 | -0.527 | 7774 | 0.456 | 7586 | -0.041 | 15360 | -0.022 | 28608 | 2752 |
| 2002 | -0.523 | 7543 | 0.450 | 7517 | -0.037 | 15060 | -0.019 | 28993 | 2786 |
| 2003 | -0.530 | 7388 | 0.446 | 7547 | -0.037 | 14935 | -0.018 | 29861 | 2778 |
| 2004 | -0.523 | 7131 | 0.443 | 7279 | -0.035 | 14410 | -0.017 | 28963 | 2809 |
| 2005 | -0.515 | 6869 | 0.446 | 6985 | -0.031 | 13854 | -0.015 | 28215 | 2677 |
| 2006 | -0.511 | 6567 | 0.442 | 6968 | -0.020 | 13535 | -0.010 | 28238 | 2655 |
| 2007 | -0.511 | 6333 | 0.436 | 6893 | -0.017 | 13226 | -0.008 | 28720 | 2640 |
| 2008 | -0.507 | 6156 | 0.419 | 6786 | -0.022 | 12942 | -0.010 | 29013 | 2653 |
| 2009 | -0.490 | 5767 | 0.401 | 6618 | -0.014 | 12385 | -0.006 | 29050 | 2659 |
| 2010 | -0.467 | 5528 | 0.388 | 6400 | -0.008 | 11928 | -0.003 | 29244 | 2666 |
| 2011 | -0.449 | 5213 | 0.376 | 6091 | -0.004 | 11304 | -0.002 | 28670 | 2663 |
| 2012 | -0.430 | 4816 | 0.353 | 5645 | -0.007 | 10461 | -0.003 | 28606 | 2634 |
| 2013 | -0.403 | 4331 | 0.322 | 5253 | -0.006 | 9584 | -0.002 | 28247 | 2575 |
| 2014 | -0.366 | 4032 | 0.295 | 4924 | -0.002 | 8956 | -0.001 | 27937 | 2525 |
| 2015 | -0.354 | 3589 | 0.281 | 4386 | -0.005 | 7975 | -0.001 | 26151 | 2465 |
| 2016 | -0.355 | 3279 | 0.278 | 4013 | -0.007 | 7292 | -0.002 | 24368 | 2199 |
| 2017 | -0.611 | 86 | 0.406 | 332 | 0.197 | 418 | 0.167 | 494 | 494 |

Note: GHCN makes no adjustments for the past two years other than using TOBS corrections for USHCN data. A large number of stations are labeled a total failure by PHA. Over the passage of years many of these failures are eventually accepted as valid, with some being adjusted and others simply passed along. By the time you get back to 1951, 48% of the data is adjusted down while 23% is adjusted up. 2016 had 29162 months at 2594 stations having at least one month of valid data in the qcu. That is after cleaning errors.
The problem seems to be somewhat congruent to multiple sequence alignment in bioinformatics, which suffers the same sort of issues - the area of the energy landscape close to the global minimum is very flat and has lots of local minima.
The only real solution I know of is to produce an ensemble of outputs (or even better, represent the entire energy landscape). That however means long calculations and vast downloads, which we know from experience (e.g. with the HadCRUT4 ensemble) everyone will ignore anyway. I believe GHCN do have an ensemble, but I've never heard of anyone using it.
I suspect that the ensemble results are very stable over time, and that the flutter essentially arises from the adjustments crudely sampling within the ensemble space. It's an interesting area for further study though. If I had time I'd start by running the current data through PCA, and then truncating months off the end to see how things change. Then I'd try adding noise to the current data and see if that produces the same kind of spread.
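As a sketch of how that PCA step might be set up (not a tested analysis): form a stations-by-months matrix of adjustments (adjusted minus unadjusted) and look at its leading singular vectors. The `adjustments` array below is a random placeholder for that matrix.

```python
# Sketch of the PCA idea: decompose the station x month adjustment matrix
# and see how much of the adjustment variation lives in a few leading modes.
import numpy as np

# Placeholder: rows = stations, cols = months, entries = adjusted minus
# unadjusted temperature (deg C). Real data would replace this array.
adjustments = np.random.default_rng(1).normal(0.0, 0.3, size=(200, 600))

centred = adjustments - adjustments.mean(axis=0, keepdims=True)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / (s**2).sum()
print("fraction of variance in first 5 modes:", np.round(explained[:5], 3))
# Truncation experiment: drop the last k months, repeat, and compare the
# leading right singular vectors (rows of Vt) between the two runs.
```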
The next version, GHCNv4, will have a limited ensemble. It explores the uncertainty from the main settings of the pairwise homogenization method.
The (also incomplete) estimates of uncertainties due to inhomogeneities in the HadCRUT ensemble are more complete.
In the long term the approach of GHCNv4 is more promising because they estimate the uncertainties from the data, whereas HadCRUT uses prior information from the literature and needs to assume that it is valid for all stations, even though every network and climate has its own problems.
Scientists such as Roy Spencer are pathetically inept. What does it take for someone who owns a time series with a clear nuisance variable (not kidding, it's a real statistical term) to blithely ignore that variable and publish results without removing it?
In the case of Spencer's data, it's clear that he can remove the ENSO variability. There is a model for ENSO which is easily derived from the angular momentum variations in the earth's rotation, so it should be as straightforward as removing a 60 Hz hum from an electrical signal. Show me an electrical engineer or physicist who is not going to do that kind of compensation correction and I will show you one that won't make much progress.
The entire cabal of Curry, Webster, Tsonis, Salby, Pielke, and Gray, who have spun their wheels for years in trying to understand ENSO, needs to be marginalized, and some fresh perspectives need to be introduced.
I am worked up because I made the mistake of listening to the EPA hearings today. The one witness who was essentially schooling the Republican thugs was Rush Holt PhD, who is now CEO of the AAAS but was at one time a physicist congressman from New Jersey. You could tell he understood how those cretins thought and knew it was hopeless, but decided to teach anyone else in the audience who might be listening. My favorite bit of wisdom he imparted was that science isn't going to make any progress by looking at the same data over and over again the same way, but by "approaching the problem with a new perspective".
Watch it here, set to 100 minutes into the hearing.
Suggest drain the swamp of these charlatans such as Spencer, Bates, Curry, Lindzen, et al. Might as well hit them hard now before they occupy positions in the Trump administration.
nuisance variable?
Nuisance parameter as defined in Wikipedia:
https://en.wikipedia.org/wiki/Nuisance_parameter
"any parameter which intrudes on the analysis of another may be considered a nuisance parameter."
Examples of nuisance parameters:
1. Periodic tide effects when trying to measure sea-level height increases
2. Daily and seasonal temperature excursions when trying to measure trends
ENSO is a nuisance parameter because it gets in the way of measuring global temperature trends. They compensate for the two examples above but not for ENSO, presumably because it is not as easy to filter and they don't know how much to compensate for it. I say just do the compensation anyway.
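As an illustration of what compensating for a nuisance variable means in practice, here is a minimal sketch: regress a temperature series on a lagged ENSO index and subtract the fit. The series, the lag, and the ordinary least-squares model are all placeholders, not the specific ENSO model discussed above.

```python
# Minimal sketch: remove an ENSO nuisance signal from a temperature series
# by regressing on a lagged ENSO index and subtracting the fit.
# Both series and the lag are placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n = 480                                    # 40 years of monthly data
enso = rng.normal(0, 1, n)                 # placeholder ENSO index
temp = 0.1 * np.roll(enso, 3) + rng.normal(0, 0.1, n)   # synthetic temperature

lag = 3                                    # assumed lag in months
X = np.column_stack([np.ones(n - lag), np.roll(enso, lag)[lag:]])
y = temp[lag:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta                    # temperature with ENSO removed
print("estimated ENSO coefficient: %.3f degC per index unit" % beta[1])
print("residual sd: %.3f degC" % residual.std())
```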
Thanks much for your excellent response. I would never have thought there'd be a wiki entry for it, but there it is. There's something a bit bizarre about casting aspersions on an influence which is known and part of the data but maybe peripheral to the process being studied.
That the homogenized data for some stations flutters is in itself okay. Even if the algorithm works right, that will unavoidably happen.
Every day new data comes in. That makes it possible to see new inhomogeneities. These breaks are detected using the statistical test SNHT. Sometimes a break will be seen as statistically significant, then with one more data point just fail to cross the significance threshold, and with yet more new data be significant again. And so on. One significant break can also influence whether other breaks in the pair are detectable.
After detecting the breaks in the pairs, these breaks are assigned to a specific station (called attribution in the paper). Whether a break is detected, and the exact year in which it is detected, will influence this attribution. If one station has a break that is near statistically significant, this could thus even influence the results for its surrounding stations.
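For readers who haven't met SNHT, here is a minimal sketch of the single-break statistic applied to a pair-difference series: standardize the series, then for each candidate break compare the means of the two segments. The injected break and the quoted threshold are illustrative only; the actual critical values and refinements used in PHA differ.

```python
# Minimal sketch of the single-break SNHT statistic on a difference series
# between two neighbouring stations. Thresholds here are illustrative only.
import numpy as np

def snht(diff_series):
    """Return (max test statistic, break position) for a 1-D series."""
    x = np.asarray(diff_series, float)
    z = (x - x.mean()) / x.std()
    n = len(z)
    best_t, best_k = 0.0, None
    for k in range(1, n):              # candidate break after index k-1
        t = k * z[:k].mean()**2 + (n - k) * z[k:].mean()**2
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

rng = np.random.default_rng(3)
series = rng.normal(0, 0.5, 60)        # 60 "years" of pair differences
series[35:] += 0.8                     # inject a 0.8 degC break
t, k = snht(series)
print("T = %.1f at position %d (values above roughly 8-10 would be "
      "flagged for a series of this length)" % (t, k))
```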
The influence of inhomogeneities is largest for stations and becomes less for networks, continents and the world. In the upcoming GHCNv4 homogenization will likely not change the global mean warming much any more.
Homogenization improves the data the most at the station level and smaller scales, but data at the station level is still highly uncertain. If these small scales are important to you, please contact your local national weather service; they know much better what happened to their network, and their data will likely be more accurate than what we can do for a global dataset.
The pairwise homogenization algorithm is fully automatic. It is thus easy to run it every night, and that gives the most accurate results. Last time I asked, though that was years ago, NOAA did in fact run the algorithm every night.
I'd be curious to see what homogenization does to USCRN data. I would expect any changes introduced to be an indication of potential error introduced by homogenization.
You want to see a difference in the mean before and after a break; that is what the algorithm tries to detect. The USCRN only has a bit more than 10 years of data, so the uncertainty in the means of the two short periods before and after the break would be large, and you would most likely simply not see anything, because nothing is statistically significant even if there were real inhomogeneities.
The SNHT test used in the pairwise homogenization algorithm (PHA) has some problems with short series: it detects too many breaks in such cases. I would expect that the attribution step of the pairwise homogenization algorithm would remove nearly all of these wrong breaks again. If you really want to do this with such short series, it would be good to replace the SNHT in PHA with the corresponding test of RHtests, which was designed to remove SNHT's over-detection problem for short series and near the edges.
Victor, thanks for the information. I don't know if the USCRN stations are included in GHCN V3. However, I understand GHCN V4 is adding tens of thousands of stations, on a par with BEST, and I am guessing the USCRN and many other stations with relatively short periods of record may be included. If true, this is where comparing the USCRN results before and after homogenization could be very informative and might be helpful for improving the routines.
Yes, the new dataset for GHCNv4 will be the ISTI dataset, which is similar in size to the Berkeley Earth dataset and also includes shorter station series. Not sure if they are that short; most are longer ones. I would not be surprised if they first remove such very short series.
There is something related you may like: after homogenization of the standard US network, the data fits the USCRN better than before homogenization.
Evaluating the impact of U.S. Historical Climatology Network homogenization using the U.S. Climate Reference Network
Numerous inhomogeneities including station moves, instrument changes, and time of observation changes in the U.S. Historical Climatological Network (USHCN) complicate the assessment of long-term temperature trends. Detection and correction of inhomogeneities in raw temperature records have been undertaken by NOAA and other groups using automated pairwise neighbor comparison approaches, but these have proven controversial due to the large trend impact of homogenization in the United States. The new U.S. Climate Reference Network (USCRN) provides a homogenous set of surface temperature observations that can serve as an effective empirical test of adjustments to raw USHCN stations. By comparing nearby pairs of USHCN and USCRN stations, we find that adjustments make both trends and monthly anomalies from USHCN stations much more similar to those of neighboring USCRN stations for the period from 2004 to 2015 when the networks overlap. These results improve our confidence in the reliability of homogenized surface temperature records.
Victor, thanks for the additional info. I vaguely remember seeing something about that comparison last time I visited the USCRN web site over a year ago. I've been meaning to go back to update data I downloaded for Texas area stations. I went to the link you provided, but it appears to be paywalled. However, I searched the title and found a publicly available PDF: here (in case anyone else is interested).
I guess the naive question is: why doesn't NOAA do the hard grunt work of evaluating station data on a case-by-case basis and carefully documenting the adjustments? Once past adjustments are assigned, they should be frozen for all future updates.
Wind tunnel tests are evaluated, and the data adjusted, differently for each different test setup; the result is better data and traceable case-by-case documentation. Using an automated "algorithm" would be an inferior method. No honest specialist would endorse such a fluttering algorithm.
The noise being randomly distributed for a couple of cases examined is not very convincing to me. NOAA is paid for by US taxpayers. They should prioritize a more defensible analysis of particularly US weather station data.
"Once past adjustments are assigned, they should be frozen for all future updates."
No, that would be very unwise. PHA makes many thousands of decisions about whether possibly irregular behaviour should be corrected. New information which may affect those decisions keeps coming in. Inflexibility will hurt.
But there is a very strong case for automated, flexible decision making. For averaging, the enemy is bias, not noise. PHA trades bias for noise. That's OK, provided you can show that the extra noise is itself unbiased. With an automated algorithm you can test that.
In CFD I used to sometimes be asked - if acoustic oscillations (say) aren't really there in practice, can't you just freeze them? And the answer is, no, they are part of the dynamics. The physics won't work if you intervene in those ways.
Not sure I agree. It's OK of course to go back and revisit an adjustment based on better information. In a wind tunnel test, however, you would do the adjustments based on knowledge of the test setup, perhaps CFD simulations, etc. The important point is that this must be done by a real human being using engineering judgment on a case-by-case basis. An automated "algorithm" would not be acceptable to anyone involved. There needs to be a clearly documented process in every case.
Another thought based on flight testing. Often there are "bad sensors" giving clearly questionable data. You don't try to "adjust" those sensors based on neighboring sensors. You either fix the sensor or you simply discard that data.
David Young, if you run your computational fluid dynamics code, that is an automated algorithm. Do not put your own work down; it has its value.
There are just as many unreasonable people in your political movement who have complained about manual adjustments.
There is a group working on parallel measurements to study the influence of changes in observational methods. At the moment it is a volunteer effort. If you know of taxpayers willing to pay for it, I would welcome that. It is always better to have more lines of evidence.
"Another thought based on flight testing. Often there are "bad sensors" giving clearly questionable data. You don't try to "adjust" those censors based on neighboring censors. You either fix the censor or you simply discard that data"
Delete"discard that data" is an adjustment. And in global temperature averaging it often has a rather specific effect. It says, replace that value by the global average. Although you can improve on that by using some kind of local average (without the bad point).
Much of what you see in homogenisation is a version of discarding. You replace the doubtful data, usually over some time period, by some estimate based on nearby information. Expressing this as an adjusted value in a table is just part of the mechanics of implementation. It is useful, because it means someone else doing an integration doesn't need to repeat the decision-making process. But the drawback is that it does lead to the sort of WUWT over-analysis, based on the idea that people are really trying to say what Alice Springs should have been. They aren't; they are trying to work out what value assigned to AS would give the best estimate of the region value in the integral. So if they say - replace AS by an average of nearby stations - that is exactly the "discard" effect. Alice is discarded, and the neighboring stations only are used to estimate the region. But it is presented as a superior value for AS, which isn't really the point. I think overall it would probably be better if NOAA didn't publish adjusted values at all, but that this was left as an intermediate stage in integration, which is where it belongs.
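A small numerical illustration of that equivalence, as a sketch: in a simple unweighted cell average, replacing one station's value by the mean of the others gives exactly the same cell mean as dropping the station.

```python
# Sketch: in an unweighted cell average, replacing one station by the mean
# of its neighbours is arithmetically the same as discarding it.
import numpy as np

cell_values = np.array([1.2, 0.9, 1.5, 4.0])   # last station looks suspect
neighbours = cell_values[:-1]

dropped  = neighbours.mean()                                 # discard it
adjusted = np.append(neighbours, neighbours.mean()).mean()   # "adjust" it
print(dropped, adjusted)    # identical: 1.2 and 1.2
```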
Climate scientists manually adjusting temperature data based on their "expert judgement". I can see the headlines now.
Discarding of obviously bad data is done as well, but homogenisation isn't about that. The data is good as recorded; it's just that the measurement conditions may be different compared to other times in the record (e.g. because a station has moved location). To produce a homogeneous, like-for-like record, that change in conditions needs to be accounted for.
Since these are events which happened decades ago, there just isn't an avenue to do any grunt work even if they thought it might be a good approach.
Yes, Victor, but CFD codes have VASTLY better verification and validation than weather station data. And smart people look at the details of every series of runs for consistency, etc. The analogy is not really valid.
Yes, parallel measurements are a very good idea when there are equipment or siting changes, for example. My question is: why in the world has NOAA not done that? Just another example of what I would call lack of due diligence at NOAA. Perhaps they are underfunded, but I would think they should prioritize this very highly, given its importance to critical policy issues.
In the climate wars I don't have a "political movement" so you should not smear me by trying to place me in your nicely labeled political categories. That's what is called prejudice. As to the substance, yes there will always be disagreements about adjustment methods. I would argue that a well documented case by case expert driven process would be more accurate and result in better visibility.
PaulS, Of course there is scope to do case by case expert evaluation of past instrument changes and station siting changes. We do that all the time with wind tunnel tests. There is always extensive documentation to look at. In many cases, there is some documentation as well for weather stations even though not as extensive as for wind tunnels. Anthony Watts has done some of this work.
The problem here is that the weather station network was not designed for long term trend determination. That of course makes it very hard to really do this job of adjustments in a defensible and transparent way.
Paul Young: "CFD codes have VASTLY better verification and validation than weather station data."
Okay, so your original claim that it was a problem that the algorithm is automatic was wrong? Can happen in a quick internet comment.
To make that statement you need to be well versed in the scientific literature on the validation of homogenization methods. Could you tell me what you see as the three most important publications in that field?
Sad that newbie fluid dynamics engineers such as David Young never studied the work of pioneers such as Faraday and Rayleigh back in the 1800's. They realized that applying a periodic sinusoidal modulation to a volume of fluid often causes a period doubling.
Alas, Faraday and Rayleigh didn't live long enough to explain the variability in climate that we have observed since, a la ENSO. Yet, like Laplace before them in establishing the primitive equations for atmospheric flow, we can imagine that they would likely have realized that a yearly modulation stimulated by the earth's orbit leads to a biennial modulation in the thermocline properties. In fact, this period-doubling modulation, mixed in with the angular momentum variations in the earth's rotation (evidenced by the Chandler wobble and lunar tidal forces), will accurately model the significant ENSO variations. One can take any interval of ENSO and, once mapped to this modulation, ergodically extrapolate to any other interval.
It really is amazing that Lord Rayleigh proposed a modulated wave formulation in 1883 which is identical to the Mathieu wave equation used heavily by ship engineers in every modern-day liquid sloshing model. Mind blowing that this can be applied to ENSO, so cool.
David Young can be forgiven for being a newbie who hasn't studied the literature, and so goes around battling phantoms of his own making. He only has his wind-tunnel hammer as a tool, so everything to him looks like a turbulent nail.
Let me clarify my view a little. The big problem I see with NOAA's adjustment algorithm is that it appears to be unstable to small additions of new data. That would of course be a serious problem with a CFD code too. It would cause a wind tunnel test to be shut down and a large effort to find the problem and fix it.
My opinion is that temperature data from weather stations might be better handled the way wind tunnel or flight test data is handled and adjusted. Just a suggestion. You know the field of adjustments better than I do, so I would find your technical thoughts interesting.
That last comment was directed to Victor. Has this instability issue been examined in the literature? I really want to know.
David,
I would see the instability as an analogue of turbulence. It is a confusing factor if you really want to find high resolution velocities. But you can still perfectly well work out the mean flow, and that determines what you often really want to know in the wind tunnel.
David Young, I would not know what to study. What would be your hypothesis? "Does a yes/no process lead to yes/no results?" Not sure if the answer to that is publishable. :-|
There are naturally many studies on the noise level and how that determines the probability of correctly finding a break and the false alarm rate. Or on how the signal-to-noise ratio determines how accurate the position of the break is. Or on how much homogenization improves the trend estimates, if I may plug my blind benchmarking study:
http://variable-variability.blogspot.com/2012/01/new-article-benchmarking-homogenization.html
Nick Stokes: "I think overall it would probably be better if NOAA didn't publish adjusted values at all, but that this was left as an intermediate stage in integration, which is where it belongs."
Agreed, on the one hand: homogenized data is not homogeneous station data. Homogenized data gives an improved estimate of the regional climate. The short-term variability is still that of the station.
What I like about homogenized data is that it improves the transparency of the climate data processing. You can clearly see what this step in the processing does.
In addition people can quickly make an analysis of the specific question they are interested in without having to do the homogenization themselves every time. Weather services cannot pre-compute all numbers and graphs people may need.
Nick, I think the turbulence issue is different in character than data adjustment algorithms. In steady state RANS you model the turbulence to make it a steady state BVP and in that context, you want stable numerical methods. So for example if I changed the grid a little, I want the answer to only change a little. It's muddier in time accurate simulations.
As I said above, an unstable CFD code is perfectly useless and people would jump to find and fix the problem by finding some way to "stabilize" the algorithm and/or understand if the problem is singular, etc.
Victor, it's the same issue we studied a couple of years ago in AIAA Journal. We found that extremely small details caused dramatically different answers in our CFD codes for one problem. We were able to document that the problem itself was singular and that the codes were OK, but only with very careful analysis and actually seriously looking for negative results.
You need to look at Paul Matthews' information and then look to duplicate the anomalous behavior. Then one would want to change the algorithm to stabilize it.
David,
Delete"As I said above, an unstable CFD code is perfectly useless"
Yes, but I don't believe this is an unstable code. It is an algorithm that generates a somewhat chaotic pattern. That is why the analogy with turbulence. There is a fine scale on which you see chaos, but on the scale you are interested in (spatial mean, of flow or temp) that washes out, and the result does not reflect the local instability.
Nick, I understand your analogy but think it still doesn't justify the instability shown by Paul Matthews. You want to "model" turbulence for a stable calculation. So you smooth and time average it.
The adjustment methods seem like a sophisticated form of interpolation and averaging. They should be a smoothing operator, not one having high sensitivity to small additions of later data. I still think that's a reason for NOAA to really do a thorough audit of their method. The "turbulence" here is not in the modeled data but is introduced by the unstable adjustment algorithm.
A more important question is why this flutter issue has not received significant attention in the literature. Perhaps it's there and I'm unaware of it. Paul Matthews has documented that NOAA simply refused to reply or respond when the issue was pointed out to them multiple times.
Good grief.
It's getting hot, DY. Have you noticed?
The problem is that David Young's wind tunnels don't operate underwater.
Nice work Nick,
I think you should redo this exercise with GHCNv4.
I'll bet that the relative frequency and magnitude of the flutter will be much smaller in v4.
V4 does a much more sensible adjustment in Alice Springs (if we can accept that it discards all data before 1941; I don't know why, but I believe that the station moved from the town to the airport then).
https://www1.ncdc.noaa.gov/pub/data/ghcn/v4/beta/products/StationPlots/AS/ASN00015590/
I have seen that GHCN v3 can do strange things with remote, lonely stations, for instance those in the high Arctic. I believe that GHCN v4 will be a general remedy for this kind of problem. If the lonely stations are supported by new neighbour stations, it will be easier for the PHA to "decide" if the temperature changes are real or not.
Olof,
Yes. I looked at V4 unadjusted here (Google map here). But I haven't really looked at the adjusted version. I'll start saving some files.
V4 will not adjust arctic stations.
In the past (in Iceland, for example) it was found that certain stations had abrupt discontinuities that were related to retreat of ice cover. (Ask Zeke; he went to Iceland to talk to them about one case.) Anyway, the algorithm saw a break and "fixed" it, but actually the change was real, with a real physical basis.
Bob Koss tried to post a comment, but ran into trouble. I have posted it as an appendix to the main post above, to preserve the format.
Nick and Victor: When I look at BEST's plots of the difference between station data and the "regional expectation", there often seems to be a strong seasonal signal. Due to local environment, during summer a station may be warmer than average for the region and the opposite in winter. When a breakpoint detection algorithm is on the verge of reporting a shift to warmer readings, that shift is most likely to be detected in the summer. The following winter, there may be less confidence that a breakpoint has been detected. FWIW, this seems to be one mechanism that could cause "flutter" in the homogenized output from some stations.
Frank
It is quite common for a station to have a different seasonal cycle from its neighbors. Not only in the mean, but also in how strong the correlations with other stations are, which produces a seasonal cycle in the noise of the difference time series. Removing these effects is difficult because they can also change at a break point.
NOAA's pairwise homogenization method only looks at the annual average temperature. National datasets, especially manually homogenized datasets, often also look at the size of the seasonal cycle, or at the series of the summer mean or the series of the winter mean. This avoids problems with the annual cycle, and the correlation in time of monthly differences is higher.
I had expected BEST to do the same as NOAA; their paper says they follow NOAA, but it is not clear to me whether they use monthly or annual data. They went out of their way not to hire anyone with relevant expertise, to appease the mitigation skeptics. So maybe they used a sub-optimal method using monthly data. Will page Mosher on Twitter to ask.
Yes. A while back I was looking at what our algorithm did to CRN stations (a gold standard); in 5% or so of the cases we were adjusting them. It had to do with our recalculation of seasonal cycles for stations.
We haven't finished looking at it; priorities and all that.
Victor and Steve: Thanks for taking the time to reply. Victor: If the NOAA PHA only looks at annual averages, does that limit your statistical power to identify a breakpoint? I vaguely remember that some algorithms were finding as many as one breakpoint every one or two decades. In that case, you won't have very many data points defining a breakpoint surrounded by two stable relationships in a pair of station records. Getting the overall trend correct depends on getting the correction at the breakpoint right. For the 20th century, you could have a half-dozen or more breakpoints. If each adjustment came with a confidence interval, then the uncertainty in the overall change (and trend) is going to be really high.
Whenever I've looked at BEST aligning split records with the regional expectation, it seems to take only two or three breakpoints for the trend of the aligned record to appear to perfectly match the trend of the regional expectation. Or at least it looks that way in the final product - which I think is smoothed over 13 months. I recognize that the regional expectation is derived from kriging unadjusted individual records - not by averaging the aligned records. Nevertheless, it is distressing to see how easily the segments from a flawed record can be aligned to agree with a particular trend. And if the record one is aligning against is biased ... I'm not saying I believe this is what happens, but it is in the back of my mind.
Has anyone looked to see if the overall trend of stations varies with the number of corrected breakpoints in the record?
Thanks, Frank
That was a long list of questions, and it had to wait for a quiet moment at the weekend.
You will in most cases not be able to detect all breaks, but station temperature data is expected to have one break every 15 to 20 years.
Just going to monthly data does not have benefits over annual data. Monthly data is also more noisy and you have all the problems I mentioned above. However, there are inhomogeneities that only have a small effect on the annual mean, while they have a clear effect on the annual cycle. You could improve detection of small inhomogeneities by including breaks in the seasonal cycle; people working manually typically do so.
You are right that errors accumulate over time and are largest in the early period. Not only because of error accumulation of the corrections, but also because the network was much less dense then and the nearby stations are thus less nearby making the difference time series more noisy. This makes detection harder and corrections more uncertain.
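A toy sketch of that accumulation (illustrative numbers only): if each break correction carries an independent error, the uncertainty of the reconstructed early record grows roughly with the square root of the number of breaks crossed.

```python
# Toy sketch: correction errors accumulate going back in time, so the
# uncertainty of the homogenised early record grows roughly like a random
# walk in the number of breaks crossed. Numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(4)
max_breaks = 8
corr_err_sd = 0.2        # assumed sd of a single correction error (deg C)
sims = rng.normal(0, corr_err_sd, size=(10000, max_breaks))
cumulative_sd = np.cumsum(sims, axis=1).std(axis=0)
for k, s in enumerate(cumulative_sd, start=1):
    print("after %d breaks: sd of accumulated error = %.2f degC" % (k, s))
# Roughly corr_err_sd * sqrt(k).
```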
Every developer of a homogenisation method has naturally checked how well it works. I was the first author of a large and blind study comparing many homogenisation methods and for temperature it improves the trend estimates. NOAA's pairwise homogenisation method also participated and was one of the recommended methods. People have compared the US data before and after homogenisation with the US Climate Reference Network. After homogenisation it fits better.
NOAA made a similar blind test as mine for the US and could show it improves the trend estimates (but some of the bias remains). On that same dataset also the method of Berkeley Earth was tested and it compared similarly well for the US. The International Surface Temperature Initiative is now working on making a global validation dataset.
Victor,
Do you think it would be a good test to check whether the same flutter properties are exhibited when homogenising synthetic benchmarking data? That would presumably be a good cross-check of the validity of the benchmark test setup.
One could. I will not do it because I have seen nothing that would convince me that this is in any way a problem. But if someone has some precious lifetime to waste: be my guest.
It could be that the "flutter" is smaller for such benchmarks because their signal-to-noise ratio is that of Europe and the USA, which is larger than for the middle of Australia or Africa.
The results on a benchmark will on average be the same whether you have one year more or one year less of data, but individual stations may well sometimes be different. There is nothing special about the current length. (Large changes in the length and network configuration naturally do start to matter; as I wrote above, just 10 years of data is not well suited for homogenization.)