Monday, February 8, 2010

Testing the performance of GCM models



There has been some interest in comparing the performance of GCM output against observations in recent decades. Some of it was stimulated by a very public dispute between Douglass et al 2007, on tropical tropospheric temperatures, and Santer et al 2008 (a Who's Who list of authors) in response. Some have taken the test in the latter as an example to follow.

For blog discussion, see eg (just a sample) CA, Lucia, Chad, Tamino

In another thread of activity, Lucia has a running series of checks on the GMST trends and the significance of their deviation from observations. There was a long series of "falsifications" of IPCC predictions listed here, and more recently a test over a longer period of GIStemp and Hadcrut.



I've been participating in the latter discussions, generally arguing that what was "falsified" was not an IPCC or model prediction, but a derived number, which had been stripped of some of its acknowledged variability, which greatly amplifies the apparent significance of deviation.

The models, from the KNMI database


People have been using a set of 22 models from the KNMI database, hindcasting from at least 1900 to 2009, and comparing to GIStemp and Hadcrut. SteveM has listed a scraping script here. I've lodged my R script for this whole application in my repository here.

I'm following a graph style that Santer used, and which Chad has developed to present a wide range of results. I'm going to talk first about the model results themselves.

I used a set of 20 model results. I think this was a subset of the 22 models Lucia and Chad used. When I came to download from KNMI, I had error messages from gfdl_cm2_1 and miroc3_2_medres. The models were:
"cccma_cgcm3_1"(5),
"cccma_cgcm3_1_t63",
"cnrm_cm3",
"csiro_mk3_0",
"csiro_mk3_5",
"gfdl_cm2_0",
"giss_model_e_h",(3)
"giss_model_e_r",(5)
"ingv_echam4",
"inmcm3_0",
"ipsl_cm4",
"miroc3_2_hires",
"miub_echo_g",(3)
"mpi_echam5",(4)
"mri_cgcm2_3_2a",(5)
"ncar_ccsm3_0",(7)
"ncar_pcm1_00",
"ncar_pcm1_01",
"ukmo_hadcm3",
"ukmo_hadgem1"
The number in brackets shows the number of runs from each model. I simply averaged these.

I also averaged the monthly results and worked only with annual means. This removed the need for great care with seasonal variation.

I calculated the regressions over the period 1980-2009. Later I'll do different periods, but at the moment I'm going to just describe the forms of the data. Each model mean trend has, from the regression, a standard error. In the following plot, I've shown the slopes (C/yr) with vertical whiskers of length 2*se. The centre is marked with a green cross-bar, whose width is proportional to the number of runs.
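For concreteness, here's a minimal sketch (in Python rather than the R of my posted script) of how a trend and its standard error come out of an ordinary least squares fit to an annual series; the series here is illustrative only:

```python
import numpy as np

def trend_and_se(years, temps):
    """OLS slope (C/yr) and its standard error for an annual series."""
    x = np.asarray(years, float)
    y = np.asarray(temps, float)
    n = len(x)
    xc = x - x.mean()
    slope = np.sum(xc * y) / np.sum(xc**2)
    resid = y - (y.mean() + slope * xc)
    s2 = np.sum(resid**2) / (n - 2)       # residual variance, n-2 dof
    return slope, np.sqrt(s2 / np.sum(xc**2))

# Illustrative series: an exact 0.02 C/yr trend over 1980-2009
years = np.arange(1980, 2010)
slope, se = trend_and_se(years, 14.0 + 0.02 * (years - 1980))
```

With real data the residuals give a nonzero se, and the whiskers in Fig 1 are just slope &plusmn; 2*se.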

I should add that I did not correct for autocorrelation of residuals. This is not as strong for annual data, but is still there. Had I included it, the error ranges would have been wider. At the moment, I'm presenting these plots mainly as an introduction to testing methods.

Then the trends look like this:

Fig 1.
The mean is shown as a violet line.

Discussion of model results



The main thing to say is that there is a lot of scatter. This can legitimately be criticised. The big effort over the years has been to get the model results more consistent, but that's where it stands.

It is impossible to ignore this scatter when testing the "performance" of the models against observations. There's no point in testing some derived quantity, say the mean, and saying that represents the models. It doesn't. It's only a part of what they say.

Inclusion of observations


If we include an observation set, say GIStemp, then a regression of that data also yields a slope with an error range. Adding that to Fig 1, as a shaded bar of width 2*se, gives the following figure:

Fig 2.

Now something about testing the correspondence becomes clear. The scatter of the models is much greater than the observed "weather noise" (hatched area), and the whiskers are long. This will dominate any proper test.

I'll now describe various limited tests that have been used. These ignore various sources of scatter, which makes it much easier to infer significant deviation.

Lucia's earlier falsifications.


Lucia concentrated on the last 8 or 9 years in this series, and tested a single number, the "IPCC mean", usually about 0.02 C/year. That's equivalent to testing the following setup:

Fig 3.

It's an OK test of a number, but not of any reasonable measure of model performance.

Douglass et al


Curiously, Douglass et al did the opposite. They allowed for the effect of the spread of model means, but ignored the errors in the regression trends of both models and observations, as Santer et al 2008 observed. They were testing a pattern like this:

Fig 4
Note that the green band here is not the average scatter, but the uncertainty of the mean due to scatter - 2*se.

Lucia's GIStemp test and Santer's test.


This recent test, over a longer period, subtracted the observed from the model trends and tested the significance of the results. She generally did find significant results for longer periods, although not for the shorter terms, which had been earlier "falsified".

She said, correctly, that she was pretty much following Santer's multi-model mean. Testing the difference of trends is equivalent to testing the differences of the regression slopes with combined variances. Both of these tests take account of the regression error, but not the scatter of the actual slopes. They are effectively testing this:

Fig 5.

And if you turn the whiskers into the standard error of the model mean, you get this (band = 2*se):

Fig 6.

This test is just what you would use if the 20 models were in fact just 20 runs of the same model, with the same expected trend. Then you would indeed see the regression slopes scattered, but within the error limits marked.
Update - Lucia has pointed out that I have misunderstood Santer's test, and that his variance takes account of the scatter of the trends, and not the regression uncertainties. I'll make a corrected diagram to reflect that.

In Fig 6 I've left the actual model means marked (yellow). They don't contribute to this significance estimate, but they do indicate that they are scattered far too much to be explained as random regression fluctuations of a single model.
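As a rough sketch of the arithmetic of a test of this kind, model-mean trend minus observed trend scaled by regression errors only, the following uses hypothetical numbers (it is not Santer's exact statistic; see the update above):

```python
import numpy as np

def diff_of_trends(b_obs, se_obs, b_models, se_models):
    """Model-mean trend minus observed trend, scaled by combined
    regression standard errors (model slope scatter is ignored)."""
    b_models = np.asarray(b_models, float)
    se_models = np.asarray(se_models, float)
    n = len(b_models)
    b_mm = b_models.mean()
    # se of the model mean, assuming independent regression errors
    se_mm = np.sqrt(np.sum(se_models**2)) / n
    return (b_mm - b_obs) / np.sqrt(se_obs**2 + se_mm**2)

# Hypothetical trends (C/yr) and standard errors
d = diff_of_trends(0.016, 0.003, [0.018, 0.022, 0.025, 0.015], [0.004] * 4)
```

A |d| above about 2 would then be read as a significant difference at roughly the 5% level.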


A digression on the significance of significance.



The reason for testing statistical significance is to see whether a finding is robust. That is to say, suppose in the future you did something that could be considered the same test. While you would repeat the same inputs, random fluctuations can't be repeated. However, you can estimate how big an effect they would have, and if you think they would only rarely change the result, then it is significant.

But there are other circumstances where you might not want to be restricted to repeating the same inputs. That arises here with the choice of models. There are actually a very large number of model runs that could have been included. We didn't formally make a random choice but someone (maybe at KNMI) did try to select a representative subset. You, or they, could have chosen a different subset, and you don't want that choice to upset the result. For example, in the future there might be new results that you want to use, or you might want to retire some of the ones you did use for other reasons.

So although the selection of models is not strictly a random process, it is not one that you should build into a statement about model performance. Such a statement should work for any reasonable selection of models.

Lucia's HADCrut test.


In discussion at Lucia's, I made this argument, and she seemed to think it had some merit. So in her latest test, she included the effect of scatter of model means. Working through the trend differences, she then subtracted the trend variances. This required the assumption that the observed and model "weather noise" were similar, which is an awkward one. Anyway, it made little difference, which I wondered about, since in the diagrams scatter of model means looks like a very big effect. So I thought I'd try my own analysis.


My test


The model here for the fluctuation of the difference is that the variance comes from three sources:
1. Regression error of the observations
2. Regression error of each of 20 models
3. Scatter of the regression slopes
Oddly, (3) was the only source Douglass et al used, and they were the only ones to use it.

Treating 2 and 3 as independent is the opposite extreme to treating 3 as just the manifestation of scatter caused by 2. It's justified by the observation that the scatter is large, and the knowledge that the models do have many real differences. Still, the truth lies somewhere between fully independent and identical models. Here's the result, with a wider model error band now, which just touches the observation range:

Fig 7.
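The error band in Fig 7 comes from adding the three variances. A sketch of that arithmetic, with hypothetical numbers (the full calculation is in the R script in my repository):

```python
import numpy as np

def combined_se(se_obs, se_models, b_models):
    """se of (model mean trend - observed trend) from three sources:
    observed regression error, model regression errors, and the
    scatter of the model slopes, treated as independent."""
    se_models = np.asarray(se_models, float)
    b_models = np.asarray(b_models, float)
    n = len(b_models)
    v_obs = se_obs**2                       # 1. observed regression error
    v_reg = np.sum(se_models**2) / n**2     # 2. model regression errors
    v_scat = np.var(b_models, ddof=1) / n   # 3. scatter of model slopes
    return np.sqrt(v_obs + v_reg + v_scat)

se_diff = combined_se(0.003, [0.004] * 4, [0.018, 0.022, 0.025, 0.015])
```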


Then I tried a full range of start years. I plotted four standard-error bands in all, around the difference between the model mean trend and the GIStemp trend. They were:
1. Observed regression se ("weather noise"), as in Fig 3 above, with models represented just by their mean
2. Model scatter only - the Douglass variant
3. SE from model regression variance and observed regression
4. SE from (3) plus model scatter

The plot shows error bands of 2*se. The x axis is starting year for the trend to 2009.


Fig 8.

So if you account for all sources of variance, the difference of trends appears significant for start years before 1980, but not after.

I suspect the models will seem to perform better with the following analysis improvements, which I'll try soon:
1. Adjustments like MEI
2. Allowance for autocorrelation of residuals
3. Correction for inter-model dependence.

The last I don't know how to do, but it may be important. The error bars on the mean seem tight because it is assumed that model residuals are independent, so the mean will have much lower variance than the average of the models themselves. But to the extent that the model residuals are correlated, this is less true.
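The dependence point can be made with the standard formula for the variance of a mean of equally correlated quantities; rho here is a hypothetical inter-model correlation, not an estimate:

```python
import numpy as np

def se_of_mean(s, n, rho):
    """se of the mean of n quantities, each with sd s and pairwise
    correlation rho (rho=0: fully independent; rho=1: one model)."""
    return s * np.sqrt((1 + (n - 1) * rho) / n)

independent = se_of_mean(0.1, 20, 0.0)      # s/sqrt(20), the tight case
fully_dependent = se_of_mean(0.1, 20, 1.0)  # averaging gains nothing
```

Even modest inter-model correlation sharply limits how much averaging tightens the error bars on the mean.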


Sunday, January 31, 2010

Carbon dioxide feedback.

The recent paper in Nature of Frank et al on CO2 feedback has had a run in the blogosphere. That was started by a review in Science Daily. Briefly, the paper says that the minor feedback mechanism by which warming causes CO2 to be expelled from the oceans, causing more warming, is less than some recent estimates. The key figure is the increase of CO2 in ppmv resulting from 1C warming, and Frank et al give a median figure of 7.7, with a range from 1.7 to 21.4.

That's actually a pretty wide range. But does it refute any sort of consensus? I looked up what the AR4 had to say. There wasn't much, but in Chap 7, sec 7.3.4.3, I found this:
“A 1°C increase in sea surface temperature produces an increase in pCO2 of 6.9 to 10.2 ppm after 100 to 1,000 years (Heinze et al., 2003; see also Broecker and Peng, 1986; Plattner et al., 2001). “

That seems quite within the range Frank et al find. Yet people who suggest some pillar of AGW has fallen say that recent estimates were up to 40 ppmv/C. Lubos Motl had cited that figure, so I put the IPCC figure to him. He cited this 2006 paper by Scheffer, Brovkin and Cox as the source of the 40 ppmv/C claim.

I looked up that paper. They had deduced their estimate from the CO2 response to the LIA, and said
we arrive at an estimated carbon sensitivity (α) to temperature of 41 (following Mann and Jones) to 12 (following Moberg et al) ppmv CO2 per °C.


A rather ambivalent figure to be refuting.

Thursday, January 28, 2010

GHCN Stations warming?

Update Fri Jan 29 2.56 pm (East Aust Time) Big Oops. Programming error.

It is surprising that such a simple program could have a programming error. But there was a memory overflow, and it had a big effect on the results.

With that corrected, there is indeed a rise in station mean temps. The new graph is here. I should emphasise that the plot is of "naive" means - just averaging all readings for each year, including duplicates.

The new plot is like that on p 14 of the d'Aleo/Watts report, and different from that on p 11.





The revised R code I used to calculate and plot this example is here


Earlier post:

The new report of d'Aleo and Watts, trumpeting calculations of E.M.Smith, makes much of a supposed shift of GHCN stations to warmer areas as an alleged source of warming. Indeed, it is full of accusations that this is done with fraudulent intent.

Of course, anomaly calculations wouldn't show warming for that reason. But is the station set actually warming?

I did a simple calculation. Just the average temperature of all stations in the GHCN set v2.mean, for any year. You might expect a small rise reflecting global warming. But if there is nett movement of stations to warmer climes, that should show as a bigger effect.

Here's the result, plotted from 1950. The trend is actually down.


The R code I used to calculate and plot this example is here

Update 29 Jan. I investigated the downspike around 2006. It's caused by some stations returning a lot of missing months, which yields erratic results. So I put a screen in the program requiring at least 9 months of data before a station could contribute to the year's average. It didn't change the overall picture, but did eliminate the spike at 2006.
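The naive averaging with the 9-month screen can be sketched as follows, in Python rather than the posted R, and with a simplified record layout:

```python
def naive_yearly_means(records, min_months=9):
    """records: (station, year, list of 12 monthly temps, None = missing).
    A station-year contributes its mean only if it has >= min_months of
    data; each year is then a naive average over contributing stations."""
    by_year = {}
    for station, year, months in records:
        vals = [m for m in months if m is not None]
        if len(vals) >= min_months:
            by_year.setdefault(year, []).append(sum(vals) / len(vals))
    return {y: sum(v) / len(v) for y, v in by_year.items()}

records = [
    ("A", 2000, [10.0] * 12),
    ("B", 2000, [20.0] * 6 + [None] * 6),  # only 6 months: screened out
    ("C", 2000, [14.0] * 12),
]
means = naive_yearly_means(records)
```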


Following a suggestion of Carrot Eater, I checked the v2.mean adjusted file. The results, not very different, are here .

Saturday, January 23, 2010

GHCN Station selection.

WUWT has featured Marc Sheppard's American Thinker article, which gives a prominent place to E.M.Smith's claims about how GHCN has been selectively dropping cool stations to boost apparent global warming. I was going to blog extensively on this, but I see that Real Climate. and more completely Zeke Hausfather have it well covered. However, maybe there's a bit to add to the WUWT discussion.



First some elementary things which people like Smith and Sheppard really should check before going into this stuff. All climate trend work is done in terms of temperature anomalies. For each station a base average for each month is established over some base period (1961-1990 for GHCN). For each month, the anomaly is the difference between the average for the month and the base average. So the base average takes out the effect of location, and just reports whether, for that station, the temp was higher or lower than "usual". This is vital in establishing a regional or global average - otherwise you really do have to worry about whether you have a fair distribution of hot and cold places. That is hard, and it's why you'll see few references to a global temp in deg C.
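The anomaly arithmetic is simple: base averages per calendar month over 1961-1990, then subtraction. A minimal sketch for one station:

```python
def monthly_anomalies(series, base_start=1961, base_end=1990):
    """series: {(year, month): temp} for one station. Anomalies are
    relative to that station's own monthly base-period averages."""
    base = {}
    for m in range(1, 13):
        vals = [t for (y, mm), t in series.items()
                if mm == m and base_start <= y <= base_end]
        if vals:
            base[m] = sum(vals) / len(vals)
    return {(y, m): t - base[m] for (y, m), t in series.items() if m in base}

# Toy station: the January base average is 10.0, so a 2000 reading of
# 11.0 is a +1.0 anomaly regardless of how warm the station is absolutely
series = {(1961, 1): 9.0, (1962, 1): 11.0, (2000, 1): 11.0}
anoms = monthly_anomalies(series)
```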

So selectively dropping cool stations would not, of itself, make the global anomaly bigger or smaller at all.

The other elementary step is to enquire why the stations were discontinued. The NOAA explains their station selection thinking here. Zeke expands in his post. The bottom line is that GHCN was a historic database. It could include many more stations because it had years to process the data. But once you start trying to maintain such a database in real time, it's much harder. Data doesn't come preprocessed and checked down the wire from much of the world. A lot comes in print, and has to be digitised. You need to rather carefully work out just how much you need. That's why NOAA did not try to keep up many of the stations that contributed to the historic network.

Anomalies and Zeke's analysis.

The basic study of the independence of anomalies from local topography etc was Hansen and Lebedeff (1987). It showed that over a period of time, anomalies correlated well over distances up to 1200 km and more, without taking account of long term average temp. That means that you can indeed select representative sites without trying to balance this factor.

I'm still not convinced by E.M.Smith's claim that discontinued stations are cooler. But suppose they are. Does it make a difference? Zeke says no. Here's his plot, showing that if you take stations currently reporting, and compare with stations that have a long record in the GHCN database but do not currently report, then there is no real warming or cooling effect.


And to show that high latitudes and altitudes still get coverage, here's his distribution of reporting sites:



Bolivia


In the course of the WUWT thread, I did check one Sheppard/Smith claim in detail. They scoffed at the representation of Bolivia, showing this map:

and saying:

“There’s a wonderful baseline for Bolivia — a very high mountainous country — right up until 1990 when the data ends. And if you look on the [GISS] November 2009 anomaly map, you’ll see a very red rosy hot Bolivia [boxed in blue]. But how do you get a hot Bolivia when you haven’t measured the temperature for 20 years?”

Well, we see a big red patch around Bolivia. It seems uniformly pretty warm, and if that applies to Andean topography generally, then suggesting that Bolivia was warmer than usual seems not unreasonable. So I looked around for stations near Sucre, Bolivia, and found:
Anomaly   Dist    Alt     Name
1.63C     346km   3410m   La Quiaca
2.02C     561km     88m   Arica
1.61C     564km    385m   Tacna
3.21C     571km    189m   Marisca
2.66C     595km    950m   Jujuy Aero

A good range of altitudes there. The Dist is distance from Sucre. Arica is by the sea (desert); others are inland. The anomalies show some variation, but they are all pretty warm, as the red patch suggests. No reason to expect Bolivia to be different.

Wednesday, January 6, 2010

What if there were no Greenhouse Effect?

Roy Spencer recently put up a post on his blog, reposted at WUWT, on the old chestnut of "What would the world be like without the greenhouse effect?". His thesis: "there would be no weather on Earth without the greenhouse effect." That, and some other things he wrote, are wrong.

But first, let's look at how "no GH effect" could be implemented. The simplest way is just to imagine that the Earth had the same gas constituents, but somehow they magically lost their ability to absorb and emit thermal IR. Roy S goes further, saying:
"So, let’s imagine an extremely cold Earth and atmosphere, without any water vapor, carbon dioxide, methane or any other greenhouse gases – and with no surface water to evaporate and create atmospheric water vapor, either."
That's a big change, but let's go with that.



Heat engines and weather


The Earth's atmosphere is often described as a big heat engine, generating weather. It is. A heat engine creates motion from transferring heat from a hot source to a cool sink. Sunlight creates lots of differential heating - between regions of different albedo, different latitudes, and between night and day regions. The latitude difference creates the classic Hadley cell. Hot air in the tropics rises from the surface, moves polewards, and descends in mid-latitudes, warming the cooler surface there. The warm source is maintained by an excess of sunlight over IR loss, and the cool sink is maintained by a corresponding deficit. This happens independently of the greenhouse effect.

The Hadley cell combines with the Coriolis effect to yield the trade winds (the part of the cell where the replacement air flows back, cooled, from mid-latitudes to the tropics), and the mid-latitude westerlies (roaring forties), produced by the angular momentum transported from the tropics. Lots of weather there, and not due to the GHE.

Adiabatic Lapse Rate


Roy says "[without GHE]...Only the surface and a shallow layer of air next to the surface would go through a day-night cycle of heating and cooling. The rest of the atmosphere would be at approximately the same temperature as the average surface temperature. And without a falloff of temperature with height in the atmosphere of at least 10 deg. C per kilometer, all atmospheric convection would stop."
This is a common misunderstanding of the dry adiabatic lapse rate. It is not related to upper air cooling due to GHG emission. It happens in any gas (eg pure N2) which is in motion in a gravity field (g=9.8 m/s^2). It is something like a reverse Carnot cycle. Imagine that you had a balloon with 1 kg of gas (say N2), in an isothermal atmosphere under gravity, and you perform the following cycle:
1. From an initially isothermal state, you raise it rapidly by 1 km. The air expands adiabatically and cools (by 9.8 K). Because it is denser than the surrounding air, work is done to raise it.
2. The balloon is then held still until it warms to the ambient temperature, absorbing heat from the high altitude. It expands further, doing wasted work displacing gas.
3. The balloon is then quickly lowered 1 km. It is compressed, becoming hotter than ambient (by 9.8K), so work is needed to pull it down.
4. Again, the balloon is allowed to cool to ambient, delivering heat to the lower altitude. The cycle can repeat.
This is a classic heat pump. You do work, and move a fixed amount of heat downward in each cycle. Actually, in this case work is not required, because source and sink are at the same temperature, and the work is wasted. But the process can also be carried out with any lapse rate up to the dry adiabat of 9.8 K/km, and then true heat pumping is done, and heat goes from cooler to warmer.

The cycle has neutral effect if there is already a lapse rate (negative temperature gradient) of 9.8 K/km. For then the gas in the balloon is always at ambient temperature. It requires no work to raise it, and transfers no heat on arrival.

You might worry about whether there is some effect due to the external air being displaced up or down. OK, just imagine there are two balloons, one lowered just as the other is raised, so then there is no nett displacement of the atmosphere.

This is how the dry adiabatic lapse rate is maintained. Atmospheric motions, driven by the above mentioned heat engine, do the work. Whenever the gradient is less than the dry adiabat, heat is drawn downwards, until it is restored - well, almost (see next section).
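The 9.8 K/km figure is just g/cp, so it takes a one-line check (cp here is for dry air; pure N2 has a slightly higher cp and hence a slightly lower lapse rate):

```python
g = 9.81          # m/s^2, gravitational acceleration
cp_air = 1004.0   # J/(kg K), specific heat of dry air at constant pressure
gamma = g / cp_air * 1000.0  # dry adiabatic lapse rate, K per km
# comes out close to the 9.8 K/km used above
```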

Moist adiabat, etc


In reality, the dry adiabat is not always attained. There are processes which tend to conduct heat down any temperature gradient. The heat pump process that I described works against these, but can no longer maintain the full 9.8 K/km. These processes include:
1. Molecular conduction (heat diffusion) - very small
2. Turbulent heat transport - larger
3. Latent heat transport - can be larger again. But it requires that actual phase change occurs - evaporation and condensation, within the cycle. It enhances turbulent transport, because the air carries more total heat.
4. Rosseland radiative transport. This is an enhanced thermal diffusion involving repeated absorption and emission of IR (by GHGs). It's often overlooked, and I'll talk more about it in a future post.

The postulates of Roy S would remove the leakage mechanisms 3 and 4, actually reinforcing the dry adiabat.

Spencer's error


Roy says "And without a falloff of temperature with height in the atmosphere of at least 10 deg. C per kilometer, all atmospheric convection would stop.". And that's his error. It wouldn't stop - it just has to be driven, by the atmospheric heat engine, as it is now. And that heat engine just needs regions of different temperature, which there would certainly be, with or without the GHE. In fact, the main latitudinal driver would work in much the same way.

A world without the GHE


So what would it be like? The conventional calculation is 33 C colder (for the same albedo), and that seems to me about right. There would still be latitude differences, and day/night differences, as well as land warm spots due to albedo variation. There would still be strong atmospheric motion - in fact the main circulations (eg Hadley) would still exist. The dry adiabatic lapse rate of 9.8 K/km would be almost universal. There would still be a tropopause somewhere, because at some altitude the atmospheric motions will reduce, the heat pump will fade, and heating from UV absorption by ozone etc will become more significant.

Monday, December 21, 2009

Darwin and GHCN Adjustments


Willis Eschenbach created a stir a few days ago with his post at WUWT entitled The Smoking Gun At Darwin Zero. Darwin Zero is one of five duplicate Darwin records in the GHCN database. It is the longest, and the others seem to be fragments of it, although there are some variations.

The claim is that the homogeneity adjustments are so large, and upward, that they indicate manipulation. This stern statement was underlined:
Those, dear friends, are the clumsy fingerprints of someone messing with the data Egyptian style ... they are indisputable evidence that the "homogenized" data has been changed to fit someone’s preconceptions about whether the earth is warming.


Thursday, December 17, 2009

Repository

This post is intended as a repository for code and bulky materials. Please don't comment here, as your remarks will be hard to find.

If you have problems, my email name is nastokes and is found at westnet and com and au.

I've heard of occasional problems with downloading R code. I think cut and paste should generally work, but I'm trying to now post copies of files on a document store.

Contents:
Code for GHCN Adjustments Histogram
Code for GHCN stations mean temp per year
Code for KNMI GMST model trends
Code for GHCN v2.mean update
Code for GISS UHI Adjustment
Code for GHCN Temperature Global Trend
Code for GHCN Temperature V1.11
Code for GHCN Temperature V1.4.1