Monday, September 23, 2013

Active ocean acidification calculator.

I wrote a post a few weeks ago on ocean acidification. The message was that there is more to acidification than pH, and that it's basically an issue of equilibrium between carbonate species. Difficulties of measuring pH can be avoided.

In the course of discussion, I said I would try to develop an active calculator to display some of the principles.

It calculates equilibrium adjustments, and I expect that is a rather specialised interest. I'm putting it forward to demonstrate
  • that any two of the species are sufficient to determine the others - pH is not essential and can be treated as just a following variable
  • CO2 when dissolved is almost completely reacted (at usual pH)
The main process of interest is the addition of CO2, which causes carbonate to go down. This affects the solubility of CaCO3. I haven't shown that equilibrium, because it only makes sense where solid CaCO3 is present. The calculator has two modes. You can show the effect of adding a fixed amount of any reagent, or you can specify any two. Details and gadget below the jump.

The gadget

There are four species (H+,CO2,HCO3- and CO3--). For details of the chemistry, see my previous post. I also have two combinations, dissolved inorganic carbon (DIC) and total alkalinity (TA). These are conserved during the reaction as total C and nett charge. There are six bars, four in gold for the species, and two for the combinations. They are scaled on a negative log scale (like pH). You can click on each bar to vary each constituent. But first you need to set a constraint using the radio buttons at the top. The default setting is marked "Add". Here you can only vary the species (gold). Your click will indicate what the concentration would have been if the change (+ or -) happened without reaction. But in fact your addition is added appropriately to DIC and TA, and then the equilibrium adjusts with these held constant. Sometimes almost all what you have added reacts, and you see only a small change in that variable. If you click another radio button, you move to "set" mode. The column you have marked is held fixed, and the column you click (a different one) will move to the value you have set (if possible - see below). "Set" is a slight misnomer - it's better seen as the values the other components are known to have when the "set" values have been measured. The blue bars show the current state, with log values beside. The absolute values show in the table on the right, with the pK values. The intent, not yet implemented, is that you can vary these.

Update: You can now vary pK1 and pK2 (but see Notes below). Modify the text box and click the new "Accept Data". You can use this to look at totally different equilibria if you want. To make it monoprotic, just set pK2 to, say, 15.


I'm using a Newton-Raphson process to solve the equations. This is fast and accurate, but may not necessarily converge. On the right there is a purple box which tells you whether your request has converged or failed. If failed, for whatever reason, no change has been made and you can try again. Some requests are impossible. For example, no carbon species can exceed DIC. But sometimes the Newton process just had trouble with the size of change you asked for. Try getting there in two steps.


A Bjerrum plot is shown (from the previous post). It shows equilibrium concentrations normalised against DIC vs pH. A thin gold bar moves to show where you currently are on the plot.

Here it is:


The constants pK1 and pK2 come from Zeebe's review article referred to in my previous post. I do not distinguish between CO2 and H2CO3, and the constants reflect that. [CO2] means the combined value. A pK1 based on carbonic acid alone would be lower, but you would then need the equilibrium between CO2 and H2CO3.

Monday, September 16, 2013

Adjusting temperature series stats for autocorrelation.

I've been writing a lot lately on autocorrelation for temperature time series. This is basically a methods post.

I've often referred to a post by Hu McCulloch, in which he reproves a scientist for quoting just a OLS regression for a monthly series with OLS CI's, and shows how the CI's are much larger if the residuals are assumed to be not white noise but AR(1) correlated. He gives a useful set of formulae, and describes in particular the Quenouille AR(1) correction to the trend uncertainty. This seems to be the current standard, and it is what I use in my trend viewer.

Stats program R has an arima(xreg=) function which will give trends with error estimate for general ARIMA models. It solves a slightly different optimisation for regression, with a different trend for each ARIMA type. But for AR(1), the results are similar to the Quenouille approach.

So why bother with approximations? I'm interested for two reasons, in connection with the trend viewer. In my weekly updates, I do about 5 million regressions. That takes a bit over an hour, but only because I avoid the individual regression summations, instead using differences of cumulative sums calculated for each dataset. You can't to that in a package. arima() would take many hours.

The other reason is that to provide interactive output, I have to be able to repeat the calculation in Javascript. Time isn't an issue there, but programming simplicity is.

In this post, I'll describe some extensions which allow Quenouille style corrections to more complex ARMA() models. I also give corrections to the OLS trend. The trend corrections are O(1/N) (N=#data) and I believe the error is then O(1/N^2), and for the CI's, O(1/N). Those errors are for the approximation to the stochastic model - whether it is the right model is a greater uncertainty.

With these formulae the complete procedure is
  1. Do a normal OLS regression.
  2. Calculate the residuals and estimate the first few autocorrelations (one lag for A(1)m 2 for AR(2) and ARMA(1,1) etc)
  3. Use the Yule-Walker equations to estimate the ARMA parameters. That just solves a few linear equations in the AR() parameters.
  4. Correct the trend and uncertainty using thge formulae given here.

Fitting stochastic models

I'm talking about fitting to regression residuals, so it is simpler to have a regression already done, although I described a method for simultaneously fitting regression and residual model. An ARMA model for regression residuals can be written: A(L)d=M(L)ε Here A is a polynomial in the lag operator L, A=1-A1*L-A2*L^2... applied to the time series d (so L d_i = d_{i-1}) M(L) is a similar polynomial in L: M(L)=1+M1*L+... ε is a vector of iid random variables The series d is taken to be the residuals of a regression on data y: d=y-a-b*x. The notation ARMA(j,k) means that A is of order j and M of order k. The effect of applying the polynomial A to d can also be represented by the lower triangular matrix (AR(2) example) A=
This is of course a fragment - if d has length n then A is (n-j) x n and is Toeplitz. M also corresponds to such a matrix. The regression equation can be written M-1L d = ε Comprehensive programs like the R arima() will minimise d* K* K d where K = M-1L over coefficients a and b, and return the appropriate uncertainties. The alternative described above is to minimise d* d, and then adjust the coefficients and their uncertainty.

Finding and adjusting trend

The regression of the ARMA model gives trend b = y* KK x/ x* KK x, (KK = K* K) assuming K x has zero mean. But KK is symmetric, and also near Toeplitz, deviating near the top left and bottom right corners. The i'th row of KK (far from corners) is symmetric about the i'th value (with unsymmetric truncation). But the arithmetic progression x can be split into a constant xi and a part x-xi that is antisymmetric about i. The latter yields a zero sum with that row of KK, so the result is R*x_i, where R is the (constant near centre) rowsum of KK. So KK * x ≈ R * x This argument gives b = y* KK x/ x* KK x ≈ y* x/ x* x which is the OLS value. That's why only a small adjustment is needed - it is due to end effects. In the case of AR(n) models, the deviation of KK from Toeplitz only affects the top n and bottom n rows, so it is a fairly simple matter to add up the missing terms. The corrections to the OLS trend for the various models are, for time (=x) series y of length N:
OLS or AR(0) b
And here are the results comparing these approximations with values computed with the R arima() program, setting xreg and using method="CSS".The corrected values are much closer than the OLS trend. Trends are from 1980 to July 2013.

AR(1) approx1.57421.56621.49361.42571.3073
AR(2) approx1.59591.59071.50791.46961.3389

Trend uncertainties.

The original adjustment for AR(1) autocorrelated regression residuals was given by Quenouille (1952), and has been widely used. In an earlier post, I gave the corresponding approximations for ARMA(1,1) and AR(2). I'll exrend to AR(3) and give the derivations here. Note that whereas for the trend we were looking for an O(1/N) adjustment, here it is O(1) - ie much bigger.

So it is sufficient to find an uncertainty adjustment for either the OLS trend or the true AR() trend. OLS is simpler.

Our ARMA model is
A(L)d=M(L)ε ε is iid
or, reverting to matrices
d=A-1M ε
The variance of the OLS trend b is then estinated by
var(b)= x*C x/x* x
where x is time (mean removed) and C is the estimated covariance of d. In the model, C=M*A*-1A-1M.
Now C is symmetric, and apart from end effects, Toeplitz, reflecting stationarity. That means that each row is symmetric about the diagonal. So when the i'th row of C multiplies x, the effect of the part that is odd about xi makes no contribution. But the even part is just the constant . So the effect of applying C to x is just to multiply by the constant rowsum of C (apart from O(1/N) end effects). The ignoring of end effects here requires that there is only moderate autocorrelation, so C is diagonally dominant.

Each row of C is just the diagonal value (zero-lag variance of d) multiplied by the acf. That diagonal value divided by x*x. So the correction required to the variance of b is just (with our 1/N caveats) the sum of the autocovariance function.

Now it is good to go back to polynomials. The (Toeplitz) matrix product M*M is equivalent to the polynomial M(L-1)*M(L). And the acf is almost just the set of coefficients of the Laurent series got by expanding M(L-1)*M(L)/(A(L-1)*A(L)) in powers of L. A Laurent series is just a power series with both positive and negative powers.

I say almost, because the zero'th order coefficient of that series is not 1. So we have to find it and divide by it. This could be done with messy series expansion calcs, but it is neater with complex analysis, replacing L by z.

The Laurent series may converge in an annulus in the complex plane, and here it converges on the unit circle. To get the central coeficient, divide by z and integrate on that contour. It's a rational function, so that just means summing the residues inside the unit circle.

How to do that? For the previous post I did the algebra by hand, but for AR(3) I resorted to Mathematica, and I'll describe that process. I write A(z)=(1-az)(1-bz)(1-cz), so
I don't need to ever find the roots a,b,c, and they don't have to be real. To get the residue at a, I define f(a,b,c)=a2/((a-b)(a-c)(1-a^2)(1-ba)(1-ca))
Then I ask Mathematica to Simpliify[f[a,b,c]+f[b,c,a]+f[c,a,b]]
It comes back with a polynomial in the numerator and factors in the denominator, symmetric of course in a,b,c. The factors in the denominator I can reassemble as values of A(z) at specific points. The numerator I can recognise symmetrics like ab+bc+ca, and re-express these as coefficients of A.
Let A0 be the result - the zero'th order coefficient of 1/A(1/z)/A(z). Then the numbers in the acf are just the Laurent coefficients of 1/A(1/z)/A(z)/A0, and the sum - the Quenouille correction - is just 1/A(1)/A(1)/A0.

So here are some results. First the formulae for the Quenouille factors for adjustment of trend uncertainty:


And here are the numerical results for the monthly data, period 1980-July 2013. First the OLS trend and standard error; then the various models, with in each case the R arima() trend and se, and then the se as calculated by the above Quenouille-style approx.

OLS trend β1.70201.70921.75001.40861.5676
OLS s.e. β0.04630.06510.05140.06650.0641
AR(1) trend β1.69841.70841.74491.41611.5678
AR(1) s.e. β0.07850.09360.07650.11010.0978
AR(1) Quen s.e.0.07810.09330.07620.10960.0974
ARMA(1,1) trend β1.69531.70251.74231.41881.5687
ARMA(1,1) s.e. β0.09460.11320.09360.11890.1027
ARMA(1,1) Quen s.e.0.09400.11230.09300.11820.1022
AR(2) trend β1.68541.68311.73631.42211.5756
AR(2) s.e. β0.09290.11030.08990.12280.1045
AR(2) Quen s.e.0.09250.11060.08940.12170.1037
AR(3) trend β1.68931.68531.73951.40981.5673
AR(3) s.e. β0.09080.10600.09010.10730.0971
AR(3) Quen s.e.0.09000.10590.08930.10600.0961


For autocorrelated data, the ordinary OLS trend analysis will get the trend near right with a O(1/N) error, which can be calculated. But the standard error will be substantially underestimated. There are at least three options
  • Use a package like arima() in R, which probably uses a general optimisation algorithm
  • Use the Newton-Raphson-style algorithm that I described here (I'm planning an update with more systematic code). This is fast and does everything at once. It is easy to code in R, but not so much in, say, Javascript.
  • Calculate and correct the OLS trends, using Yule-Walker equations to estimate model coefficients.
The approximations will deteriorate with vary high autocorrelation. Then the best remedy is to instead use, say, annual averages. Little information will be lost, and the autocorrelation will reduce.

Sunday, September 15, 2013

August GISS Temp up by 0.09°C

GISS LOTI went from 0.53°C in June to 0.62°C in August. This rise. similar to last month's fall, was larger than TempLS and contrasts with satellite measures, which declined.

Here is the GISS map for August 2013:

And here, with the same scale and color scheme, is the earlier TempLS map for August:

And here is the WebGL map of actual station temps during the month.

Previous Months

December 2012
December 2011
August 2011

More data and plots

Wednesday, September 11, 2013

TempLS global temp up 0.04°C in August

The TempLS monthly anomaly for August 2013 was 0.500°C, compared with 0.461° in July. This continues a sequences of small changes.

Here is the spherical harmonics plot of the temperature distribution:

Warm spots in East Europe, N Canada. Australia also continued its long warm spell. S America cool.

And here is the map of stations reporting:

Monday, September 9, 2013

More on global temperature spectra and trends.

In my previous post, I looked at ARIMA models and their effect on uncertainty of temperature trend, using autocorrelation functions. I was somewhat surprised by the strength of apparent periodicity with a period of 3-4 years.

It is plausible to connect this with ENSO, so in this post I want to probe a bit with FFT's, removing ENSO and other forcings in the spirit of Foster and Rahmstorf. But first I'd like to say something about stat significance and what we should be looking for.

Statistical significance is to do with sampling. If you derive a statistic from a sample of a population, how much would it vary if you re-sampled. How this applies to time series statistics isn't always clear.

If you measure the average weight of people on a boat, and you are interested in whether that boat will float, then yo don't need statistical significance. You only need to measure accurately. But if you want to make inferences about a class of people on boats, then you do. And you need to define your population carefully, and decide to worry about variables like age and sex.

A trend is just a statistic;  is a measure of growth. You can quote a trend (for a fixed period) for something that is growing exponentially, not of course expecting it to be constant over other times. Trend is a weighted average, and the same issue of statistical significance applies. You can quote a trend for 20 Cen temperature anomaly without worrying about its properties as a sample, or that it was not very linear. But if you want to make inferences about climate, you do need to treat it as a sample, and form some model of randomness. That involves the study of residuals, or deviation from linearity, and so apparently non-random deviation from linearity also becomes important, because you have to distinguish the two.

The autocorrelation functions of the previous post illustrate (and help to solve) the issue. Short term lags form the basis of the stochastic models we use (ARIMA). But once we get beyond lag ten or so, there is behaviour that doesn't look like those models. Instead it is due to periodicities, and quadratic terms etc.

Now in that case, the (lag stochastic/secular) regions were not well separated, and it is likely that the secular effects are messing with the ARIMA models.

So, coming back to what "sampling" means here. It's a notion of, what if we could rerun the 20th cen? Of course, with GCM's we can. But would the secular effects repeat too? ENSO is a case in point. It has aspects that can be characterised - periods etc. But they aren't exactly predictable. Phase certainly isn't. Should they have some kind of stochastic representation?

We can defer that by using what we know of that instance of ENSO forcing (and volcanic and solar) to try to focus more on the ARIMA part of the acf's. I'll do that here by looking at the Foster and Rahmstorf residuals.

Power spectra

But first I'll revisit the previous analysis and its apparent periodicity. Here are the acf's:


And here are the corresponding power spectra, which are the discrete Fourier transforms of the acf's


You can see the prominent peak at about 0.22 yr-1 (45 months) which Is likely ENSO related.

Removal of forcings

Foster and Rahmstorf (2011) used known forcings (ENSO, volcanic and solar) to regress against temperature, and showed that the residuals seemed to have a much more even trend. I did my own blog version here. But for this exercise I used the F&R residuals described by Tamino.

First, here are the power spectra. The marked frequency at about 0.22 yr-1 (or its harmonic at 0.44 yr-1) is still often dominant, but its magnitude is reduced considerably:


And here are the acf's. The critical thing is that the oscillation near zero is reduced, and intrudes less on the peak at zero, which is smaller, and should be freer of secular influences.


Here are the acf's with models fitted. It is no longer the case that ARMA(1,1) is clearly better than AR(1), and both taper close to zero before the acf goes negative. For the satellite indices, there is little difference between the models. In all cases, the central peaks of the acf's are much less spread, indicating much tighter CIs.


Finally, here is the table of model fit results:

OLS trend β1.70201.70921.75001.40861.5676
OLS s.e. β0.04630.06510.05140.06650.0641
AR(1) trend β1.69841.70841.74491.41611.5678
AR(1) s.e. β0.07850.09360.07650.11010.0978
AR(1) Quen s.e.0.07810.09330.07620.10960.0974
AR(1) ρ0.48020.34450.37350.46130.3948
ARMA(1,1) trend β1.69531.70251.74231.41881.5687
ARMA(1,1) s.e. β0.09460.11320.09360.11890.1027
ARMA(1,1) Quen s.e.0.09400.11230.09300.11820.1022
ARMA(1,1) ρ0.69010.64330.66720.56960.4849
ARMA(1,1) MA coef-0.2744-0.3378-0.3447-0.1353-0.1058

As Foster and Rahmstorf found, the trends are not only quite large, but have much reduced standard errors, whatever model is chosen.


Subtracting out the regression contributions of the forcings (ENSO, volcanic and solar) does reduce the long-lag oscillatory contributions to the acf, though they are not eliminated. This enables a better separation between those and the stochastic effects.

Thursday, September 5, 2013

Uncertainty of temperature trends.

This post follows a previous one. There have been intermittent controversies about quoted trends for temperature series. One from about four years ago was well-described by Hu McCulloch, with a good summary of the theory. The issue is that the simple OLS regression gives generally good trend measures, but the associated uncertainty is greatly underestimated if the residuals are correlated in time. Monthly series are usually quite correlated; annual series much less so, but they have higher uncertainties due to fewer degrees of freedom. The effect of correlated residuals is often likened to a reduced number of dof.

The standard remedy has been the one recommended by McCulloch (but long used before); an AR(1) regression model, though usually implemented as the OLS trend with a Quenouille correction (1952) to the confidence intervals of the trend. This post introduces an extension of the Quenouille approx to ARMA(1,1).

The Quenouille approximation is fine, but in 2011 Foster and Rahmstorf argued that AR(1) was inadequate, and ARMA(1,1), with an extra parameter, better described the observed autocorrelation functions.

This has some practical consequences, because ARMA(1,1) was then adopted by the Skeptical Science trend calculator and is being cited as the basis for various numbers of "years without statistically significant warming". However, other such quotes, like the famous "none since 1995" quote of Phil Jones, are based on the more standard AR(1) model.

My trend viewer uses AR(1) with Quenouille. Since I'm regularly updating now, the inconsistency may be an issue. So I decided to further examine the choice of model.

Some linear algebra of regression and ARIMA

This builds on an earlier post showing how autoregressive models could be fitted exactly by solving polynomial equations. Suppose we have a vector y of temperatures over consecutive time periods (months), and a matrix X of regressors, the first column just o\1's, and the second (and last) the times of each y. The 3-col matrix Y is just (y,X).

Then the OLS regression problem is to fit to the stochastic model
d = Y B = ε; B=(1, -b), b=(α,β)
The RHS ε is a vector of random variables, which are assumed iid (indentically independently distributed). b are the coefficients, d the residuals. The OLS fit is X* (Y B) = 0.

An AR(1) model brings in a lag matrix L. Ly is just y with every element moved back 1. It is just a square matrix with 0 everywhere except for a 1 just below each points on the diagonal. The stochastic model is just (I - ρL)d = ε.

The operator on d has these properties: it is banded (just two diagonals), and Toeplitz. Toeplitz means that each row is almost a copy of the one above, shifted one place to the right (so just one new element on the left). Here it expresses stationarity - ie the relation between variables doesn't change between timesteps (except that time moves on a step).

A shift of two places is just expressed by the matrix L2. The general ARMA model is
A d = M ε
where A and M are just low order polynomials in L. The notation ARMA(a,m) just lists the order of these polynomials. There are a+m parameters.

Again ε is assumed iid, so the correct fitting is the LS solution of M-1A d = 0. However, in terms of the coefficients b, this is little different from the OLS fit. The significant difference comes in the uncertainty attributed to the coefficients.

The OLS formula X* Y B=X* ε expresses the coefficients b as stochastic variables (the solution values correspond to mean ε=0),
b= (X*X)-1X* ε + mean(b)
so Covar(b) = (X*X)-1 X*K X (X*X)-1
where K=Covar(ε).


Continuing the idea that stationarity implies Toeplitz matrices, it is reasonable to expect that K can be modelled by a Toeplitz matrix, in which each row is a copy of the autocovariance, centered on the diagonal. And the covariance of b is then made up from the zero, first and second central moments of the acv. We could calculate it that way, but noise from the ragged ends is a problem. That's why ARMA models are used.

But in the ARMA models, we have a similar Toeplitz matrix which is
(A-1M)* A-1M
so it is reasonable to compare these with the acv. I'll normalise so central values are 1, which makes the acv an acf.

Here is a plot of the acf's of five major temperature indices from 1980 to present, using the R acf() command. It's an active plot - you can press the buttons for the case you want:


There are some marked features in these acf's, to an extent that I found surprising. The gerneral pattern is that the part close to zero represents the structure of randomness, while further away it is dominated by systematic variations - periodicities and other non-random deviations from linearity. There is a marked periodicity at about 45 months. It's in the data, and shows up in a FFT. Here it shows up as a sort of sinc function. Naturally the acf models can't follow it, nor should they. It would be better to take out some of these low frequency effects. I'll say more about that in a future post that looks at AIC considerations.

Now AR(1) fits a geometric progression, forcing a match to the first lag. As Foster and Rahmstorf say, this tends to underestimate, and ARMA(1,1) gives an extra degree of freedom which allows the later decay to vary to match the slower decay observed. So here are the same acf's but with the two models added. The coefficients are calculated using the R arima() command with the CSS optimisation, and the model acf's produced by ARMAacf().


But now you can see that, while the ARMA(1,1) is a good match until the acf gets close to zero, it of course stays positive while the acf goes negative. The central moments of the acf include the negative parts, so it's reasonable to say that the ARMA(1,1) approx overshoots, while the earlier undershoot of the AR(1) approx starts to look better.

What about other models? Here is HADCRUT 4 modelled with AR(2) and AR(3) as well.

A family of Quenouille approximations

If you center the time sequence so that it has zero mean, then the trend variance is proportional to the zero'th moment (area) under the acf within the range. Since the ARMA model acf's generally tend rather rapidly to zero relative to the trend range, it's a good approximation to take the whole acf area. If ρ is the parameter of the AR(1) model, then the acf G(i,j) = |ρ|i-j, and if you sum over all j for fixed i, the area is (1+ρ)/(1-ρ).

For general ARMA with polynomials A(L), M(L), L=lag, there are corresponding approx factors. I'll describe the general theory in a coming post (it involves complex analysis), but for ARMA(1,1)
with M(L)=1+mL, A(L)=1-rL,


The coefficients are calculated using the R arima() command with the CSS optimisation. The period is from 1980 to June 2013. Trends are quoted in °C/century. Note that I give standard errors only - you need to multiply these by 2 to get a conventional confidence interval.

OLS trend β1.56621.56831.48421.41391.3039
OLS s.e. β0.06560.07050.06210.09160.0900
AR(1) trend β1.57071.57941.49251.44901.3269
AR(1) s.e. β0.14520.13710.13310.25090.2372
AR(1) Quen s.e.0.14480.13670.13270.25040.2371
AR(1) ρ0.65910.57970.64090.76380.7480
ARMA(1,1) trend β1.58011.59561.50381.47141.3474
ARMA(1,1) s.e. β0.21260.19850.19110.32980.3393
ARMA(1,1) Quen s.e.0.21150.19760.19010.32900.3387
ARMA(1,1) ρ0.85950.83080.84660.87140.8861
ARMA(1,1) MA coef-0.3785-0.4007-0.3690-0.2677-0.3329

Conclusions and next steps

I think AR(1) probably underestimates the variability of the trend and ARMA(1,1) probably overestimates it. I think it is possible to use the area under the ACF directly without use of a model, but it will require tapering, which may introduce similar ambiguity.

One quantitative way of deciding when increasing the number of model parameters should end is the Akaike information criterion. Up to ARMA(1,1) there is no sign of overfitting.

I'll expand on the topic of the extended Quenouille approx in a future post. The mathematics is interesting, and the results are elegant. But it requires the model coefficients (as it does for AR(1)), so increasingly it only sidesteps a smaller part of the work. I described in an earlierautocorrelation post how these could be obtained by polynomial solution; I have a more systematic version which I'll post soon.