There has been some interest in matching performance of GCM data and observations in recent decades. Some is stimulated by a very public dispute between Douglass et al 2007 on tropical tropospheric temperatures, and Santer et al 2008 (a Who's Who list of authors) in response. Some have taken the test in the latter to be an example to follow.
For blog discussion, see eg (just a sample) CA, Lucia, Chad, Tamino
In another thread of activity, Lucia has a running series of checks on the GMST trends and the significance of their deviation from observations. There was a long series of "falsifications" of IPCC predictions listed here, and more recently a test over a longer period of GIStemp and Hadcrut.
I've been participating in the latter discussions, generally arguing that what was "falsified" was not an IPCC or model prediction, but a derived number, which had been stripped of some of its its acknowledged variability, which much amplifies the significance of deviation.
The models, from the KNMI database
People have been using a set of 22 models from the KNMI database, hindcasting from at least 1900 to 2009, and comparing to GIStemp and Hadcrut. SteveM has listed a scraping script here. I've lodged my R script for this whole application in my repository here.
I'm following a graph style that Santer used, and which Chad has developed to present a wide range of results. I'm going to talk first about the model results themselves.
I used a set of 20 model results. I think this was a subset of the 22 models Lucia and Chad used. When I came to download from KNMI, I had error messages from gfdl_cm2_1 and miroc3_2_medres. The models were:
The number in brackets shows the number of runs from each model. I simply averaged these.
I also averaged the monthly results and worked only with annual means. This removed the need for great care with seasonal variation.
I calculated the regressions over the period 1980-2009. Later I'll do different periods, but at the moment I'm going to just describe the forms of the data. Each model mean trend has, from the regression, a standard error. In the following plot, I've shown the slopes (C/yr) with vertical whiskers length 2*se. The centre is marked with a cross green bar, whose width is proportional to the number of runs.
I should add that I did not correct for
Then the trends look like this:
The mean is shown as a violet line.
Discussion of model results
The main thing to say is that there is a lot of scatter. This can legitimately be criticised. The big effort over the years has been to get the model results more consistent, but that's where it stands.
It is impossible to ignore this scatter when testing the "performance" of the models against observations. There's no point in testing some derived quantity, say the mean, and saying that represents the models. It doesn't. It's only a part of what they say.
Inclusion of observations
If we include an observation set, say GIStemp, then a regression of that data also indicates a slope with an error range. Adding that to Fig 1, as a shaded bar with width 2*se, gives the following Figure
Now something about testing the correspondence becomes clear. The scatter of the models is much greater than the observed "weather noise" (hatched area), and the whiskers are long. This will dominate any proper test.
I'ln now describe various limited tests that have been used. These ignore various sources of scatter, which makes it much easier to infer significant deviation.
Lucia's earlier falsifications.
Lucia concentrated on the last 8 or 9 years in this series, and tested a single number, the "IPCC mean", usually of about 0.02 C/year. That's equivalent to testing the following setup:
It's an OK test of a number, but not of any reasonably measure of model performance.
Douglass et al
Curiously, they did the opposite. They
allowed for the effect of the spread of model means, but ignored the errors in regression trends of both models and observations, as Santer et al 2008 observed. They were testing a pattern like this:
Note that the green band here is not the average scatter, but the uncertainty of the mean due to scatter - 2*se.
Lucia's GIStemp test and Santer's test.
This recent test, over a longer period, subtracted the observed from the model trends and tested the significance of the results. She generally did find significant results for longer periods, althoug not for the shorter terms, which had been earlier "falsified".
She said, correctly, that she was pretty much following Santer's multi-model mean. Testing the difference trends, is equivalent to testing the differences of the regression slopes with combined variances. Both of these tests take account of the regression error, but not the scatter of the actual slopes. They are effectively testing this:
And if you turn the whiskers into standard error of the model mean, you get this (band = 2*se):
This test is just what you would use if the 20 models were in fact just 20 runs of the same model, with the same expected trend. Then you would indeed see the regression slopes scattered, but within the error limits marked.
Update - Lucia has pointed out that I have misunderstood Santer's test, and that his variance takes account of the scatter of the trends, and not the regression uncertainties. I'll make a corrected diagram to reflect that.
In Fig 6 I've left the actual model means marked (yellow). They don't contribute to this significance estimate, but they do indicate that they are scattered far too much to be explained as random regression fluctuations of a single model.
A digression on the significance of significance.
The reason for testing statistical significance is to see whether a finding is robust. That is to say, suppose in the future you did something that could be considered the same test. While you would repeat the same inputs, random fluctuations can't be repeated. However, you can estimate how big an effect they would have, and if you think they would only rarely change the result, then it is significant.
But there are other circumstances where you might not want to be restricted to repeating the same inputs. That arises here with the choice of models. There are actually a very large number of model runs that could have been included. We didn't formally make a random choice but someone (maybe at KNMI) did try to select a representative subset. You, or they, could have chosen a different subset, and you don't want that choice to upset the result. For example, in the future there might be new results that you want to use, or you might want to retire some of the ones you did use for other reasons.
So although the selection of models is not strictly a random process, it is not one that you should build into a statement about model performance. Such a statement should work for any reasonable selection of models.
Lucia's HADCrut test.
In discussion at Lucia's, I made this argument, and she seemed to think it had some merit. So in her latest test, she included the effect of scatter of model means. Working through the trend differences, she then subtracted the trend variances. This required the assumption that the observed and model "weather noise" were similar, which is an awkward one. Anyway, it made little difference, which I wondered about, since in the diagrams scatter of model means looks like a very big effect. So I thought I'd try my own analysis.
The model here for the fluctuation of the difference is that the variance comes from three sources:
1. Regression error of the observations
2. Regression error of each of 20 models
3. Scatter of the regression slopes
Treating 2 and 3 as independent is the opposite extreme to treating 3 as just the manifestation of scatter caused by 2. It's justified by the observation that the scatter is large, and the knowledge that the models do have a lot of differences. Still, the truth does lie somewhat between fully independent and identical models. Here's the result, with now a wider model error band, which just touches the observation range:
Then, I tried to do a full range of years. I plotted four standard errors in all, as bands around the difference between model mean trend and GIStemp trend. They were:
1. Observed regression se ("weather noise") Fig 3 above, with models represented just by their mean
2. Model scatter only - the Douglass variant
3. SE from model regression variance and observed regression
4. SE from (3) plus model scatter
The plot shows error bands of 2*se. The x axis is starting year for the trend to 2009.
So if you account for all sources of variance, the error bars of the difference of trends seem to be significant before 1980, but not after.
I suspect the models will seem to perform better with the following analysis improvements, which I'll try soon:
1. Adjustments like MEI
2. Allowance for autocorrelation of residuals
3. Correction for inter-model dependence.
The last I don't know how to do, but it may be important. The error bars on the mean seem tight because it is assumed that model residuals are independent, so the mean will have much lower variance than the average of the models themselves. But to the extent that the model residuals are correlated, this is less true.