Monday, February 8, 2010

Testing the performance of GCM models

There has been some interest in matching the performance of GCM output against observations in recent decades. Some of it was stimulated by a very public dispute between Douglass et al 2007, on tropical tropospheric temperatures, and the response from Santer et al 2008 (a Who's Who list of authors). Some have taken the test in the latter as an example to follow.

For blog discussion, see eg (just a sample) CA, Lucia, Chad, Tamino

In another thread of activity, Lucia has a running series of checks on the GMST trends and the significance of their deviation from observations. There was a long series of "falsifications" of IPCC predictions listed here, and more recently a test over a longer period of GIStemp and Hadcrut.

I've been participating in the latter discussions, generally arguing that what was "falsified" was not an IPCC or model prediction, but a derived number, which had been stripped of some of its acknowledged variability, which greatly amplifies the apparent significance of any deviation.

The models, from the KNMI database

People have been using a set of 22 models from the KNMI database, hindcasting from at least 1900 to 2009, and comparing to GIStemp and Hadcrut. SteveM has listed a scraping script here. I've lodged my R script for this whole application in my repository here.

I'm following a graph style that Santer used, and which Chad has developed to present a wide range of results. I'm going to talk first about the model results themselves.

I used a set of 20 model results. I think this was a subset of the 22 models Lucia and Chad used. When I came to download from KNMI, I had error messages from gfdl_cm2_1 and miroc3_2_medres. The models were:
The number in brackets shows the number of runs from each model. I simply averaged these.

I also averaged the monthly results and worked only with annual means. This removed the need for great care with seasonal variation.
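The averaging steps are straightforward; here is a pure-Python sketch (my actual script is in R, and these function names are invented for illustration):

```python
def annual_means(monthly):
    """Collapse a monthly series (length divisible by 12) to annual means."""
    return [sum(monthly[i:i + 12]) / 12.0 for i in range(0, len(monthly), 12)]

def model_mean(runs):
    """Average several runs (lists of annual values) of one model, year by year."""
    return [sum(vals) / len(vals) for vals in zip(*runs)]
```

Working with annual means in this way sidesteps the seasonal cycle entirely, at the cost of throwing away the sub-annual information.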

I calculated the regressions over the period 1980-2009. Later I'll do different periods, but at the moment I'm going to just describe the forms of the data. Each model mean trend has, from the regression, a standard error. In the following plot, I've shown the slopes (C/yr) with vertical whiskers of length 2*se. The centre is marked with a green cross-bar, whose width is proportional to the number of runs.
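The slope and its standard error come from an ordinary least squares fit on each annual series. A minimal Python sketch of the standard formulas (for illustration; my script does this in R):

```python
from math import sqrt

def trend_and_se(years, temps):
    """OLS trend (C/yr) and its standard error for an annual series."""
    n = len(years)
    xbar = sum(years) / n
    ybar = sum(temps) / n
    sxx = sum((x - xbar) ** 2 for x in years)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(years, temps))
    b = sxy / sxx                      # slope
    a = ybar - b * xbar                # intercept
    resid = [y - (a + b * x) for x, y in zip(years, temps)]
    s2 = sum(r * r for r in resid) / (n - 2)   # residual variance
    return b, sqrt(s2 / sxx)           # slope and its standard error
```

The whiskers in Fig 1 are then just b ± 2*se for each model.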

I should add that I did not correct for AR(1) correlation of residuals. This is not as strong for annual data, but it is still there. Had I included it, the error ranges would have been wider. At the moment, I'm presenting these plots mainly as an intro to testing methods.
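For reference, the usual correction widens the standard error by the effective-sample-size factor sqrt((1+r1)/(1-r1)), where r1 is the lag-1 autocorrelation of the residuals. A sketch of that adjustment (not the calculation actually applied in these plots):

```python
from math import sqrt

def lag1_autocorr(resid):
    """Lag-1 autocorrelation of a residual series."""
    n = len(resid)
    m = sum(resid) / n
    num = sum((resid[i] - m) * (resid[i + 1] - m) for i in range(n - 1))
    den = sum((r - m) ** 2 for r in resid)
    return num / den

def ar1_adjusted_se(se, r1):
    """Inflate a trend se to allow for AR(1) residuals."""
    return se * sqrt((1 + r1) / (1 - r1))
```

With r1 around 0.2, typical of annual residuals, the inflation is modest (about 22%); with monthly data and r1 above 0.8 it is a factor of 3 or more.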

Then the trends look like this:

Fig 1.
The mean is shown as a violet line.

Discussion of model results

The main thing to say is that there is a lot of scatter. This can legitimately be criticised. The big effort over the years has been to get the model results more consistent, but that's where it stands.

It is impossible to ignore this scatter when testing the "performance" of the models against observations. There's no point in testing some derived quantity, say the mean, and saying that represents the models. It doesn't. It's only a part of what they say.

Inclusion of observations

If we include an observation set, say GIStemp, then a regression of that data also indicates a slope with an error range. Adding that to Fig 1, as a shaded bar with width 2*se, gives the following figure:

Fig 2.

Now something about testing the correspondence becomes clear. The scatter of the models is much greater than the observed "weather noise" (hatched area), and the whiskers are long. This will dominate any proper test.

I'll now describe various limited tests that have been used. These ignore various sources of scatter, which makes it much easier to infer significant deviation.

Lucia's earlier falsifications.

Lucia concentrated on the last 8 or 9 years in this series, and tested a single number, the "IPCC mean", usually about 0.02 C/year. That's equivalent to testing the following setup:

Fig 3.

It's an OK test of a number, but not of any reasonable measure of model performance.

Douglass et al

Curiously, they did the opposite. They allowed for the effect of the spread of model means, but ignored the errors in regression trends of both models and observations, as Santer et al 2008 observed. They were testing a pattern like this:

Fig 4
Note that the green band here is not the average scatter, but the uncertainty of the mean due to scatter - 2*se.

Lucia's GIStemp test and Santer's test.

This recent test, over a longer period, subtracted the observed trend from the model trends and tested the significance of the results. She generally did find significant differences for the longer periods, although not for the shorter terms, which had earlier been "falsified".

She said, correctly, that she was pretty much following Santer's multi-model mean test. Testing the difference trends is equivalent to testing the differences of the regression slopes with combined variances. Both of these tests take account of the regression error, but not the scatter of the actual slopes. They are effectively testing this:

Fig 5.
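Testing the difference of slopes with combined regression variances amounts to a z-like statistic. A sketch (function name invented; the se terms here are the regression errors only, which is exactly the limitation being described):

```python
from math import sqrt

def trend_diff_z(b_model, se_model, b_obs, se_obs):
    """z-statistic for the difference of two independent trend estimates."""
    return (b_model - b_obs) / sqrt(se_model ** 2 + se_obs ** 2)

# |z| > 2 would be flagged as significant at roughly the 95% level
```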

And if you turn the whiskers into the standard error of the model mean, you get this (band = 2*se):

Fig 6.

This test is just what you would use if the 20 models were in fact just 20 runs of the same model, with the same expected trend. Then you would indeed see the regression slopes scattered, but within the error limits marked.
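Turning the individual whiskers into a standard error of the mean is the usual 1/n reduction for independent estimates; sketched:

```python
from math import sqrt

def se_of_mean(ses):
    """SE of the mean of n independent trend estimates with the given individual SEs."""
    n = len(ses)
    return sqrt(sum(s * s for s in ses)) / n
```

With 20 models this shrinks the band by a factor of roughly sqrt(20), which is why the Fig 6 band is so much narrower than the whiskers in Fig 5.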
Update - Lucia has pointed out that I have misunderstood Santer's test, and that his variance takes account of the scatter of the trends, and not the regression uncertainties. I'll make a corrected diagram to reflect that.

In Fig 6 I've left the actual model means marked (yellow). They don't contribute to this significance estimate, but they do show that the trends are scattered far too much to be explained as random regression fluctuations of a single model.

A digression on the significance of significance.

The reason for testing statistical significance is to see whether a finding is robust. That is, suppose that in the future you did something that could be considered the same test. You would repeat the same inputs, but the random fluctuations can't be repeated. You can, however, estimate how big an effect they would have, and if you judge that they would only rarely change the result, then the result is significant.

But there are other circumstances where you might not want to be restricted to repeating the same inputs. That arises here with the choice of models. There are actually a very large number of model runs that could have been included. We didn't formally make a random choice but someone (maybe at KNMI) did try to select a representative subset. You, or they, could have chosen a different subset, and you don't want that choice to upset the result. For example, in the future there might be new results that you want to use, or you might want to retire some of the ones you did use for other reasons.

So although the selection of models is not strictly a random process, it is not one that you should build into a statement about model performance. Such a statement should work for any reasonable selection of models.

Lucia's HADCrut test.

In discussion at Lucia's, I made this argument, and she seemed to think it had some merit. So in her latest test, she included the effect of scatter of model means. Working through the trend differences, she then subtracted the trend variances. This required the assumption that the observed and model "weather noise" were similar, which is an awkward one. Anyway, it made little difference, which I wondered about, since in the diagrams scatter of model means looks like a very big effect. So I thought I'd try my own analysis.

My test

The model here for the fluctuation of the difference is that the variance comes from three sources:
1. Regression error of the observations
2. Regression error of each of 20 models
3. Scatter of the regression slopes
Oddly, (3) was the only source used by Douglass et al, and they were the only ones to use it.

Treating 2 and 3 as independent is the opposite extreme to treating 3 as just the manifestation of scatter caused by 2. It's justified by the observation that the scatter is large, and by the knowledge that the models do have a lot of differences. Still, the truth lies somewhere between fully independent and identical models. Here's the result, with a wider model error band now, which just touches the observation range:

Fig 7.
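Under the full-independence assumption, the three sources can be assembled into one standard error for the difference of the model-mean trend and the observed trend. This is a sketch of one plausible weighting (the exact combination is a modelling choice; my R script may weight terms differently):

```python
from math import sqrt

def combined_se(se_obs, model_ses, slopes):
    """SE of (mean model trend - observed trend) from three variance sources:
    (1) observed regression error, (2) each model's regression error,
    (3) model-to-model scatter of the slopes, all treated as independent."""
    n = len(slopes)
    mean_slope = sum(slopes) / n
    var_scatter = sum((b - mean_slope) ** 2 for b in slopes) / (n - 1)
    var_reg = sum(s * s for s in model_ses) / n ** 2  # regression error of the mean
    return sqrt(se_obs ** 2 + var_reg + var_scatter / n)
```

With the scatter term included, the band widens substantially, because var_scatter is much the largest of the three contributions here.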

Then, I tried to do a full range of years. I plotted four standard errors in all, as bands around the difference between model mean trend and GIStemp trend. They were:
1. Observed regression se ("weather noise"), as in Fig 3 above, with the models represented just by their mean
2. Model scatter only - the Douglass variant
3. SE from model regression variance and observed regression
4. SE from (3) plus model scatter

The plot shows error bands of 2*se. The x axis is starting year for the trend to 2009.

Fig 8.

So if you account for all sources of variance, the difference of trends appears to be significant for start years before 1980, but not after.

I suspect the models will seem to perform better with the following analysis improvements, which I'll try soon:
1. Adjustments like MEI
2. Allowance for autocorrelation of residuals
3. Correction for inter-model dependence.

The last I don't know how to do, but it may be important. The error bars on the mean seem tight because it is assumed that model residuals are independent, so the mean will have much lower variance than the average of the models themselves. But to the extent that the model residuals are correlated, this is less true.
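The standard formula for the variance of the mean of n equi-correlated variables illustrates why this matters: var(mean) = s²(1 + (n-1)ρ)/n, which collapses to the familiar s²/n only when ρ = 0.

```python
def var_of_correlated_mean(s2, n, rho):
    """Variance of the mean of n variables, each with variance s2 and
    pairwise correlation rho; rho = 0 recovers the usual s2/n."""
    return s2 * (1 + (n - 1) * rho) / n
```

Even a modest ρ of 0.25 across 20 models inflates the variance of the mean by a factor of nearly 6 relative to the independent case, so the tight error bars on the multi-model mean could be quite misleading.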


  1. Isn't half the trouble coming from averaging multiple runs from the same model, and then forgetting you did that? That in itself should be a reminder that you won't observe the mean trend behavior over short time spans; the individual model runs themselves can diverge from the mean trend over short time spans.

  2. Well, I hadn't forgotten. And I am planning to check. I'll do a run with no averaging.

Hhh... I thought I left a comment this morning? I'm not going to write a lot until I can verify my comments show. I think we discussed some of the issues at my blog anyway.

    Also, I can't seem to paste copied text into your comments. The only other blog where that happens is Pielke's. Is this a blogspot thing?

    On the minor question about my estimating the 'weather' noise contained in the variability across model runs, I thought up two possible ways based on available data and model data. 1) Use the noise in the observation. 2) use the variability in repeat realizations in models with multiple runs. I chose (1), mostly for computational simplicity. However, it happens to be the choice that gives the larger uncertainty intervals. If you want me to use the other alternative, then I'll get tighter uncertainty intervals and more rejections.

  4. Lucia,
    Thanks for your comments here and at your site. Sorry response is a bit slow today - grandparent duties.

    I've occasionally had comments that I thought I made not show. I've set to no moderation, so that isn't it. I'm not very experienced yet at blog management, and haven't delved into the spam filter (if there is one).

    Yes, the no paste is a blogspot thing - the cure is to use an identity (say Google or Typekey). Then it's OK. I found that trick at RP's.

    I'll make a substantive response to your comment at the Blackboard a bit later. I'm going to need to make some changes for the Santer error you pointed out.

  5. Nick ... SWEET!

    Excellent post.
    Very clear visualizations.

    Can you identify the two high 'out of band' models? Looks like maybe ...


  6. Ron,
    Sorry about the slow response. I tried to run the program again to print out the trends, and found that KNMI was not downloading reliably. So I had to use some backup data, which I think should give the correct results. Here are the numerical trends.
    cccma_cgcm3_1_00 0.0436
    cccma_cgcm3_1_t63 0.0352
    cnrm_cm3 0.0168
    csiro_mk3_0 0.0158
    csiro_mk3_5 0.0207
    gfdl_cm2_0 0.0148
    giss_model_e_h_00 0.0172
    giss_model_e_r_00 0.0172
    ingv_echam4 0.00187
    inmcm3_0 0.0248
    ipsl_cm4 0.0240
    miroc3_2_hires 0.0479
    miub_echo_g_00 0.00573
    mpi_echam5_00 0.0119
    mri_cgcm2_3_2a_00 0.0173
    ncar_ccsm3_0_00 0.0306
    ncar_pcm1_00 0.025
    ukmo_hadcm3 0.0386
    ukmo_hadgem1 0.0407

    So yes, your estimate was right.