Thursday, May 27, 2010

Fallacy in the Knappenberger et al study

This is a follow-up post to the previous post on the pending paper:

“Assessing the consistency between short-term global temperature trends in observations and climate model projections"

by Patrick Michaels, Chip Knappenberger, John Christy, Chad Herman, Lucia Liljegren and James Annan

I'm calling it the Knappenberger study because the only hard information I have is  Chip's talk at the ICCC meeting. But James Annan has confirmed that Chip's plots, if not the language, are from the paper.

A fallacy is likely because, as I showed in the previous post, the picture presented there looks considerably different after just four months of new readings. Scientific truth should at least be durable enough to outlast the publication process.

The major fallacy

Chip's talk did not provide an explicit measure of the statistical significance of their claim of non-warming, despite hints that this was the aim. The main message we're meant to take, according to James, is
"the obs are near the bottom end of the model range"
And that's certainly what the plots suggest - the indices are scraping against that big black 95% level. This is explicit in Chip's slide 11:
"In the HadCRUT, RSS, and UAH observed datasets, the current trends of length 8, 12, and 13 years are expected from the models to occur with a probability of less than 1 in 20. "

But here's the fallacy - that 95% range is not a measure of expected spread of the observations. It expresses the likelihood that a model output will be that far from the central measure of this particular selection of models. It measures computational variability and may include some measure of spread of model bias. But it includes nothing of the variability of actual measured weather.

The GISS etc indices of course include measurement uncertainty, which the models don't have. But they also include lots of physical effects which, it is well known, the models can't predict - eg volcanoes, ENSO etc. There haven't been big volcanoes lately, but small ones have an effect too. And that's the main reason why this particular graph looks wobbly as new data arrives. Weather variability is not there, and it's big.

Sources of deviation

I posted back in Feb on testing GCM models with observed temperature series. There were three major sources of likely discrepancy identified:

    Noise in measured weather
    Noise in modelling - unpredictable fluctuations
    Uncertainty from model selection

I showed plots which separated the various effects. But the bottom line is that the measured trends and the model population means both have variance, and to compare them statistically, you have to take account of combined variance (as in a t-test for means).
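To make that concrete, here is a minimal numerical sketch. The trend and standard-error numbers are illustrative assumptions, not figures from the paper; the point is only how combining the two variances, Welch-style, shrinks an apparent discrepancy:

```python
import math

# Illustrative numbers only (not from the paper), in C/decade
obs_trend = 0.06    # observed short-term trend
obs_se = 0.10       # std error of the observed trend (weather noise)
model_mean = 0.21   # ensemble-mean modelled trend
model_se = 0.08     # std error of the ensemble mean (model spread)

# Wrong: judge the observation against the model spread alone
z_wrong = (model_mean - obs_trend) / model_se

# Right: combine the two variances, as in a t-test for two means
z_right = (model_mean - obs_trend) / math.sqrt(obs_se**2 + model_se**2)

print(round(z_wrong, 2))   # larger apparent discrepancy
print(round(z_right, 2))   # smaller once weather noise is included
```

With these made-up numbers the discrepancy drops from about 1.9 standard errors to about 1.2 - from borderline to unremarkable - purely by counting the variance that was left out.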

I railed against Lucia's falsifications of "IPCC projections" a couple of years ago. A big issue was that Lucia was taking account then of weather noise, but not model uncertainty. The result is that something that was then "falsified" is no longer false. The total variability had been underestimated. The same effect is being seen here in reverse (model noise but no weather noise).

Estimating model uncertainty

I don't know in detail how the probability levels on Chip's slides were calculated. But it's hard, because model runs don't form a defined population subject to random fluctuations. They are chosen, and with fuzzy criteria. Individual runs have fluctuations that you can estimate, but there's no reason to suppose that across models they form a homogeneous population.

That is significant when it comes to interpreting the 95% levels that are quoted. As often in statistical analysis, there's no real observation of the tail frequencies. Instead, central moments are calculated from observation, and tail probabilities quoted as if the distribution is normal.

Normality is hard to verify, and even if verified for the central part of the distribution, it's still a leap to apply that to the tail. The unspoken basis for that leap is some variant of the central limit theorem. If getting into the tail requires the conjunction of a number of independent happenings, then it's a reasonable guess.

But if the occurrence of a tail value is dependent on simple selection (of model run), then even if the scatter looks centrally bell-shaped, as in Chip's slide 5, the reason for thinking the tail fades away as quickly as a normal distribution would is not really there. The slide does note correctly that the points on the histogram are not independent.
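A toy simulation shows the danger. This is only a sketch with made-up numbers: draw a fat-tailed sample (Student-t with 5 degrees of freedom, standing in for a non-homogeneous model population), fit a normal by mean and variance as the slide appears to do, and compare the quoted tail probability with the empirical tail frequency:

```python
import math
import random

random.seed(1)

def t_sample(df):
    # One Student-t variate: N(0,1) / sqrt(chi-squared_df / df)
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

# Hypothetical fat-tailed "population" of trends (made up, not the paper's data)
xs = [t_sample(5) for _ in range(20000)]
n = len(xs)
mean = sum(xs) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

# Tail probability beyond mean + 3 sd, as the fitted normal would quote it
p_normal = 1.0 - 0.5 * (1.0 + math.erf(3.0 / math.sqrt(2.0)))  # about 0.00135

# Empirical frequency of sample values beyond mean + 3 sd
p_emp = sum(1 for x in xs if x > mean + 3.0 * sd) / n

print(p_normal, p_emp)  # the empirical tail is typically several times fatter
```

The fitted normal matches the middle of the histogram well enough, but quotes a 3-sigma tail several times too thin - exactly the kind of error that turns an unremarkable observation into a "less than 1 in 20" event.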


  1. It might be good to wait for the paper. Who knows if a presentation at Heartland is a fair representation of it.

    "that 95% range is not a measure of expected spread of the observations."

    Well, until we see how the range is computed, we don't know exactly what it's a measure of.

    "But it includes nothing of the variability of actual measured weather."

    If each model were run 100 times, then I'd say you could repeat the exercise for each model separately and have some idea of how to compare actual weather variability and the weather variability in that model.

    But that's not what we have.

  2. CE,
    Well, that was the basis for calling it the Knap study - it's a stand-alone presentation on the web. One might have wondered about the link to the paper, but Annan's confirmation helps.

  3. Nick Stokes: "Well, that was the basis for calling it the Knap study - it's a stand-alone presentation on the web."

    I agree with CE here, a presentation is hardly the same as a study.

    Anyway, it's obvious the models overestimate the climate sensitivity, at least based on the comparison of the model outputs to the measurements. Four months additional data doesn't change that conclusion.

  4. Carrot Eater: "If each model were run 100 times, then I'd say you could repeat the exercise for each model separately and have some idea of how to compare actual weather variability and the weather variability in that model."

    And even then, you'd have to use a long enough time interval that the variance in the climate models would be expected to be realistic. Down the road, the models may be able to generate realistic fluctuations over shorter periods, and even then you need to consider only the models which have some skill in this regard, not mish-mash the models that don't with those that do.

  5. Carrick,
    Well, the ICCC claims to be a scientific conference, so I suppose that a presentation like this should be treated as some kind of research study.

    There's no separate analysis of sensitivity, so for current purposes any conclusion on that is only as strong as the proof of trend shortfall. And that depends on firstly finding a shortfall and then a proper demonstration that the shortfall is significant.

    Then there's the problem that "sensitivity" as normally spoken of is equilibrium sensitivity, so short term trends aren't really the place to look.

  6. Nick: "Then there's the problem that "sensitivity" as normally spoken of is equilibrium sensitivity, so short term trends aren't really the place to look"

    If the warming over say a 15-year period is to be associated with increased CO2 forcing (for the models this is certainly how it works), and the models are over-predicting the warming, then the constant they are using to connect that forcing to the temperature response, the "sensitivity", is too large.

    They usually speak of "environmental" sensitivity anyway, because there really isn't such a thing as "equilibrium" in climate. Ocean-atmospheric oscillations, which are a natural part of the system, affect the albedo for example, so you will never have radiative balance.

  7. Has there been significant warming from 1990 to the present?

    Watch the attribution of Pinatubo, and consider that there was an El Niño from 1991 to 1995.

    Phil Jones:

    James Hansen:

  8. Well, Anon, there has certainly been warming. Phil Jones, in your link, says 0.12C/dec since 1995. You seem to be querying whether it is a) attributable to other causes, or b) not statistically significant.

    On the search for other causes, it's always possible to think of something else that might have had a role. My view is just this - that we have been accumulating GHG, and warming is expected. If there wasn't any, we'd need to re-examine the theory. But there has been, and much in line with expectations. If it is to be attributed to some other cause, a case needs to be made.

    On statistical significance, you need a stochastic model in order to evaluate that significance. I'm not sure what model PJ had in mind, but statistical significance is usually a high hurdle. You need many years of data. That's my point in these posts - it's very hard to say that the deviation from models is statistically significant based on just a few years' trend. Same with deviation from zero. But again, no-one is arguing that AGW is proved by observed trends. It's based on GHG and radiative physics. Observed warming is confirmation.
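    To illustrate the hurdle, here's a sketch with one assumed stochastic model - monthly anomalies as a linear trend plus AR(1) noise, with guessed-at parameters, not necessarily what PJ used:

```python
import math

# Assumed stochastic model (illustrative, not necessarily PJ's):
# monthly anomalies = linear trend + AR(1) noise
n = 15 * 12       # 15 years of monthly data, 1995-2010
r = 0.9           # assumed lag-1 autocorrelation of monthly residuals
sigma = 0.1       # assumed residual std dev, deg C
trend = 0.12      # Jones's figure, deg C per decade

# White-noise std error of an OLS slope, converted to deg C per decade
sx2 = sum((i - (n - 1) / 2) ** 2 for i in range(n))  # sum of squared x-deviations
se_white = 120 * sigma / math.sqrt(sx2)

# AR(1) inflates the error: effective n shrinks by (1-r)/(1+r) (Quenouille)
se_ar1 = se_white * math.sqrt((1 + r) / (1 - r))

t_stat = trend / se_ar1
print(round(t_stat, 2))  # about 1.6 - short of the ~2 needed for 95%
```

    With these assumptions a trend of 0.12C/decade over 15 years falls short of 95% significance, even though the same trend sustained over more years would clear it easily. That's the high hurdle.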

  9. Warming of 0.12C in 15 years, in 20 years, or even in 30 years (El Chichon) makes some difference.


  10. But again, no-one is arguing that AGW is proved by observed trends. It's based on GHG and radiative physics. Observed warming is confirmation.

    I wish more people would put it this way.

  11. Nick,

    You refer to the tail of the model distribution being poorly quantified. Would the output of the project be useful in this regard? It's just one model, but with thousands of runs being done, giving more info on the spread in the outcome, I guess.

    You say that model output doesn't include weather variability. To what extent is or isn't the variability in a single model run output representative of that weather variability? I'm actually not sure how/where the variability in a model output originates, but I guess it's due to dynamical feedbacks in the parameterized physical processes, of which the effect/error is somehow correlated and cumulative (ie short term positive feedback), but also bounded by physical constraints (conservation of energy). Or am I totally off base here?


  12. Bart,
    I based the comment about the tail on what seemed to be done in Chip's slide 5. A histogram of trends is plotted (6000 elements, it says) and a normal distribution drawn which presumably matches in mean and variance. He says that the elements are not independent, and I think that is very much a problem.

    I agree that a better calculation could probably be based on the residuals, and your extra dataset would help there. The problem though is that it is always hard to characterise the behaviour of tail elements. There just aren't enough to empirically verify normal behaviour there, and it is on these tail properties that null hypothesis tests are based. If the tails are fatter than normal, the tests will give false rejections. In the histogram, they do look fatter than normal.

    In the absence of empirical proof of tail behaviour, people hope that a central limit effect will prevail - that tail behaviour is the combined result of several independent effects, which will combine in a binomial way. But it's not at all clear that this will work for model run outputs.

    Your second query about the source of variation in model output relates to this. Both in real weather and models, a lot of variation is caused by somewhat periodic behaviours (eg ENSO in the real world). GCMs are non-linear, but the state mapping from one timestep to the next is approximately linear - a big sparse matrix with lots of complex eigenvalues. A few repetitions bring the larger ones to the fore, producing oscillations. That's one source of noise. Some will be related to real weather - some to discretisation.
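    As a toy illustration of that mechanism (nothing here comes from any actual GCM): iterate a 2x2 linear "timestep" map whose matrix has a complex eigenvalue pair with modulus just under 1. The iterates oscillate with a period set by the eigenvalue angle, with slowly decaying amplitude - quasi-periodic "noise" from pure dynamics:

```python
import math

# Toy 2x2 timestep matrix with complex eigenvalues: modulus rho, angle theta.
# Assumed values - rho just under 1, period 40 steps - purely illustrative.
rho = 0.98
theta = 2.0 * math.pi / 40.0
a = rho * math.cos(theta)
b = rho * math.sin(theta)

x, y = 1.0, 0.0
series = []
for _ in range(200):
    # One "timestep": multiply the state by the rotation-and-shrink matrix
    x, y = a * x - b * y, b * x + a * y
    series.append(x)

# The x-component oscillates (two sign changes per 40-step period)
# while the amplitude decays like rho**step
changes = sum(1 for p, q in zip(series, series[1:]) if p * q < 0)
print(changes)  # about 10 sign changes over the 5 periods simulated
```

    In a real model the "matrix" changes every step and keeps being re-excited by forcing and discretisation error, so the oscillations persist instead of dying away.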

    Another related source of noise is from discrete changes in forcing, perhaps also induced by cloud models etc. There is ringing from those eigenvalues.

    So models and weather share some noise sources, from models successfully doing their job. But models have their own noise from various discretisations, and climate indices have measurement noise. As I said above, an indication that these are large comes from the disparity between indices that are supposed to be measuring the same thing. This is worse than it seems, because those indices use a large amount of common data, which will have errors that don't show up as discrepancies.