## Regression as derivative

In two recent posts here and here, I looked at a moving OLS trend calculation as a numerical derivative for a time series. I was mainly interested in improving the noise performance, leading to an acceleration operator.

Along the way I claimed that you could get essentially the same results by either smoothing and differentiating the smooth, or differencing and smoothing the differences. In this post, I'd like to develop that, because I think it is a good way of seeing the derivative functionality.

This has some relevance in the light of a recent paper of Marotzke et al, discussed here. M&F used "sliding" regressions in this way, and Carrick linked to my earlier posts.

#### Integrating by parts

My earlier derivation was for continuous functions. If we define an operator:
R(t) = t/X, for t from -N to N, zero outside,
and X is a normalising constant, then the OLS moving trend is
β(t) = ∫R(τ)y(τ+t) dτ
where ∫ is over all reals (OK since R has compact support). X is chosen so that y=t has unit trend: ∫R(τ)*(τ+t) dτ = 1.

I'll define W(t) = −∫ R(τ) dτ, integrating from −∞ to t, (modified following a suggestion from HaroldW, thanks) and use D = d/dτ, so DW = −R, and W = −D⁻¹R. Then
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ,
or, β(t) = ∫R(τ)y(τ+t) dτ = ∫W(τ)Dy(τ+t) dτ

Now W is a standard Welch taper.

It is (minus) the cumulative integral of R, and since R has its mean subtracted, its integral over the whole range is zero; so W is a quadratic that starts from zero at -N, returns to zero at N, and is zero outside that range. So that establishes our first proposition:
β(t) = ∫W(τ)Dy(τ+t) dτ
ie a Welch-smoothed derivative of y.

Now D is wrt τ, but would give the same result if it were wrt t. In that case, it can be taken outside the integration:
β(t) = D∫W(τ)y(τ+t) dτ

That is our second result - the sliding trend β(t) is just the derivative of the Welch-smoothed y.

#### Application to time series

I introduced D because it has a nice difference analogue
Δyᵢ = yᵢ − yᵢ₋₁

Its inverse Δ⁻¹ is a cumulative sum (from −∞). So the same summation by parts works:
β(i) = Σⱼ R(j)y(i+j)
Again W = −Δ⁻¹R is a symmetric parabola coming to zero at each end of the range - ie Welch. Then
Σⱼ Δ(W(j)y(i+j)) = 0 = −Σⱼ R(j)y(i+j) + Σⱼ W(j)Δy(i+j)
or β(i) = Σⱼ W(j)Δy(i+j)

Again that's the first result - the sliding trend is exactly the Welch smooth of the differences of y. Smoothed differentiation.

Again, Δ can be regarded as applying to i rather than j.
β(i) = Σⱼ W(j)y(i+j) − Σⱼ W(j)y(i+j−1) = ΔΣⱼ W(j)y(i+j)

The sliding trend is exactly the differences of the Welch smooth of y.
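Both discrete results are easy to check numerically. Below is a quick sketch (numpy; the random-walk series and window half-width are arbitrary choices for illustration) that builds R and the reverse-cumulative-sum W, and verifies the two identities against a direct OLS fit at each window position:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                                    # half-width: window covers j = -N..N
j = np.arange(-N, N + 1)
y = rng.normal(size=200).cumsum()         # arbitrary test series

# Regression operator R(j) = j / X, with X = sum(j^2), so a unit-slope input gives beta = 1
R = j / (j @ j)

# W(j) = sum over k >= j of R(k): reverse cumulative sum, a parabola vanishing at the ends
W = R[::-1].cumsum()[::-1]

for i in range(N + 1, len(y) - N):
    seg = y[i - N:i + N + 1]
    beta_ols = R @ seg                    # sliding OLS trend
    assert np.isclose(beta_ols, np.polyfit(j, seg, 1)[0])

    # First result: Welch-weighted sum of the first differences
    dy = np.diff(y[i - N - 1:i + N + 1])  # dy[k] = y(i+j) - y(i+j-1), j = -N..N
    assert np.isclose(beta_ols, W @ dy)

    # Second result: first difference of the Welch smooth
    smooth = lambda m: W @ y[m - N:m + N + 1]
    assert np.isclose(beta_ols, smooth(i) - smooth(i - 1))
```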

1. Nick,

Of course, all of this is interesting:

http://en.wikipedia.org/wiki/Window_function#Welch_window

In doing nearshore ocean and laboratory water wave analysis we used:a Tukey window (a = 0.9):

http://en.wikipedia.org/wiki/Window_function#Tukey_window

And then my preferred method was the:

http://en.wikipedia.org/wiki/Welch%27s_method

using 2N-1 windows (D = 0 and D = M/2); each window was either linear or quadratic detrended, a Tukey window applied, and the final result was variance adjusted wrt a boxcar window function.

Back in the 60's it was decided that n = 2048 at dt = 0.25 sec, or a ~17 minute wave record; unfortunately, that rather computer-limited method (60's CPU technologies) is still applied today.

As a side issue, I am not aware of any ARIMA methods that can produce/synthesize a realistic water wave time series that can be used to drive a laboratory wave generator (if you know of any 'hands on' application I'd be very interested).

Finally, IMHO, any reasonable discussion of FIR/IIR filtering methods must show the SOP semi-log plot of the filter's magnitude (or better yet amplitude) response function; if you are just going to show either the real or imaginary parts of the response function, at least for me, that is of little engineering utility.

I've been doing this stuff for over three decades now. FIR's have certain strengths (limited autocorrelation via the finite window) and weaknesses (the 'stair steps' inherent in the finite nature, limited attenuation decay and strong nodal behavior), and IIR's have certain strengths (monotonic and strong attenuation) and weaknesses (infinite autocorrelation, with decay features wrt the pole count).

AFAIK, there is no perfect filter, however, there are a lot of filters that can be misapplied in time series analyses.

That is all.

1. Interesting stuff Everett. ".... and final result was variance adjusted wrt a boxcar window function."

Could you explain how you disentangle the negative lobes? I can see how you could compensate for the frequency profile but surely doing this in the presence of negative lobes will enhance the negated but attenuated components to full-scale negated components.

Maybe I'm not following what you are describing.

2. Since you are wave man , could I ask you to comment on an idea I've been toying with for a couple of years now.

The thermocline is a density boundary in a similar way to the ocean surface, but with a density difference about (crudely) 1000 times less than the air/water boundary.

This will mean it will respond to tidal forces in a similar way to the surface but will respond most strongly to periods about 1000 times longer than the primary surface tides of 12h.

1000*12h ~ 1.4y.

Does that idea seem reasonable in the context of your knowledge of the subject, and are you able to suggest a better ball-park figure than what I have roughed out, taking the example of the tropical Pacific ocean?

Obviously we don't do too well at deterministic modelling of tides so far, but it seems to me that there will be inter-annual scale tides on the thermocline.

I have a strong gut rejection of the idea that ENSO 'causes itself'. There is some understanding of the positive feedbacks and I hypothesise that the trigger is such long period tides.

Can you improve on my numbers?

thx.

3. Greg,

On your 1st question, you can pre-compute the variance adjustment factor based on the two windowing functions under the assumption of stationary variance of the time series, most people do this, as it's easier (we are looking at the entire time series variance as a single number). The other way is to directly apply the two windowing functions and compute the ratio of the total variance from both, I've always found the two approaches to be very close to each other. We always assume stationary or ergodic behavior, but one must always look at the actual water wave time series, this is somewhat easier to do automatically with Welch segment averaging. I don't know if I've answered your 'negative lobes' question sufficiently?

As to your 2nd question, the idealized box profile of the stratified ocean would be governed by the densimetric Froude number, which is how we modeled it in numerical and physical OTEC models back in the late 70's (a radial discharge jet goes through its zone of flow establishment (ZOFE); after that it becomes a buoyant plume (except at the frontal zone of mixing in a uniform flow field)). The reality is a different matter; here one might use, for example, a gradient Richardson number, or more likely a convective parameterization as is done in AOGCM's (those mechanical computer details I'm still trying to understand, as I currently have actually zero understanding other than that there are convective parameterizations). The major tides are the diurnal and semi-diurnal, so those two have periods of ~24 and ~12 hours. However, things like currents, winds, air pressure and waves cause differences between idealized tidal harmonics and actual tidal time series. I'm currently working on some tidal time series for VA/NC, and it becomes rather clear that there are also NINO type signatures in those time series. I am no longer with the USACE ERDC CHL, but I still know a fair number of their numerical/statistical modelers; they recently completed an NAD study using their SMS modeling suite (I believe most of those are deterministic; as to how well they hindcasted the historical record, I have no idea):

http://en.wikipedia.org/wiki/SMS_(hydrology_software)

I really don't know about your very low frequency tidal conjecture though, so I'll leave that one alone. Sorry.

2. Nice to see this sort of discussion. In view of the obsession climatologists seem to have with "trends", it would be good if they knew what they were actually doing.

I dislike the term "smoother" because it is imprecise. Lots of things can "smooth", but that tells us nothing of how they distort. All filters 'distort' the data, so we need to ensure we are distorting it in a useful way and not in unexpected ways of which we remain ignorant.

If we talk about low-pass filters this instantly means we are considering the frequency domain, which is what we must do to make appropriate choices.

Since the aim in this case is to find the derivative and also apply some low-pass noise filtering, one good option is to use a well-behaved filter like a gaussian, which is monotonic in the freq domain and is never negating. (The downside being that it has a rather slow transition and is never exactly zero either, so there is some stop-band leakage. This is often not a problem but should be noted.)

As Nick points out, the order is not important. This is a result of the fact that convolution, the process by which FIR filters are applied, is a linear operation, and as a result it is commutative. It can be noted that the first difference is also a trivial convolution with kernel [-1,1]. This introduces a small phase offset since it represents the rate of change half-way between the two points, so the time series needs shifting. An alternative is [-1,0,1], which estimates over three points and does not have a phase-shift.
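The commutativity claim is easy to demonstrate directly. A minimal sketch (my own illustration, with an arbitrary Gaussian-shaped FIR kernel and a random-walk series): smoothing then differencing gives the same output as differencing then smoothing.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=64).cumsum()                       # arbitrary test series

g = np.exp(-np.arange(-6, 7)**2 / 8.0)
g /= g.sum()                                           # any normalised FIR smoother
d = np.array([0.5, 0.0, -0.5])                         # 3-point, phase-neutral difference

a = np.convolve(np.convolve(y, g), d)                  # smooth, then differentiate
b = np.convolve(np.convolve(y, d), g)                  # differentiate, then smooth

# Convolution is associative and commutative, so the two orders agree
assert np.allclose(a, b)
```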

So gauss of diff is mathematically identical to diff of gauss, to extend Nick's argument.

However, in calculating the gaussian kernel we can apply some neat maths. Since we know the analytic form of the gaussian we can calculate its derivative analytically and use this as a kernel that will do the diff and the low-pass gaussian in one step. This has the nice advantage of giving an accurate multipoint estimation of the derivative rather than the clunky two-point numerical estimations.
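A sketch of that idea (my own construction, not Greg's actual script; sigma and the stencil are arbitrary choices): compute the derivative-of-Gaussian kernel analytically and compare it with a Gaussian kernel followed by the three-point difference. The two kernels agree closely.

```python
import numpy as np

sigma = 5.0
half = int(3 * sigma)                # customary 3-sigma truncation
x = np.arange(-half, half + 1, dtype=float)

g = np.exp(-x**2 / (2 * sigma**2))
g /= g.sum()                         # unit-gain Gaussian smoother

dg = -x / sigma**2 * g               # analytic derivative of the Gaussian kernel

# Gaussian followed by the 3-point difference [-1, 0, 1]/2 (no phase shift)
g_then_diff = np.convolve(g, [0.5, 0.0, -0.5])[1:-1]

# Single-step analytic kernel ~ two-step numerical version
assert np.max(np.abs(dg - g_then_diff)) < 1e-3
```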

I did this example earlier in relation to the M&F thread.

http://climategrog.files.wordpress.com/2015/02/cmip5_dgauss-vs-gauss-diff.png?w=800

Now the frequency response is similar to Nick's earlier plots without the horribly defective oscillations and negative lobes. The initial rise is similar but it tails off in a controlled manner. I do not have a plot ready, but if Nick has the tools at his disposal to pop that up, it would be worth visualising in comparison to the other frequency plots.

This filter is used in image analysis as an edge detection aid, but it is a very good solution anywhere the rate of change and some low-pass filtering are required.

What M&F were attempting to do would be a case in point.

I did a test on the same data M&F were using and achieved a much "smoother" result using a filter period much shorter than their 15y base period.

https://climategrog.files.wordpress.com/2015/02/cmip5-xs.png?w=800

To simplify discussions I did diff of gaussian, not the gaussian-derivative, there. So that remaining jitter could probably be removed.

I discuss the implications of that plot over at CA.

1. Yes, I've just confirmed that using gaussian-derivative filter leaves a beautifully smooth result. The furriness is a result of the imprecise two-point numerical approximation to the derivative. A nice example.

2. "However, in calculating the gaussian kernel we can apply some neat maths."
Greg,
The thing about the Gaussian is that it doesn't have compact support. What happens when you truncate?

The Welch function has plenty of neat maths. As I mentioned on the other thread, the spectrum is
π²/6(sin(x)-x*cos(x))/x²
where x=f*π/10, f=freq in 1/Cen
The smoother forms that I plotted also have analytic forms. The way you get the basic form is:
The regress operator is just x * gate function.
So its FT is just the derivative of sinc. To get the next level, multiply by Welch = 1-x² (switching to unit interval (-1,1)).
So add (iω)² times the third deriv of sinc
etc

3. Thanks for the detail, Nick. I'll work through that.

I think there's an error in the formula, it only seems to work for n=10 ;)

Though the gaussian is theoretically infinite, it is customary to truncate the kernel at 3 sigma. The error is <1% IIRC. Usually negligible.
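That recollection checks out: the two-sided mass of a Gaussian beyond 3 sigma is about 0.27%, comfortably under 1%, while at 2 sigma the discarded mass is ~4.6% (which is why cropping there brings ripple). A quick check:

```python
import math

# Two-sided tail mass of a unit Gaussian outside +/- k*sigma is erfc(k / sqrt(2))
tail3 = math.erfc(3 / math.sqrt(2))   # ~0.0027, i.e. ~0.27%
tail2 = math.erfc(2 / math.sqrt(2))   # ~0.0455, i.e. ~4.6%

assert tail3 < 0.01                   # the "<1%" claim
assert 0.04 < tail2 < 0.05
```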

4. This is the actual spectrum from the impulse response of 3-sigma GD truncated as described, not analytically calculated:

http://climategrog.files.wordpress.com/2015/02/dgauss-5y_freq_resp.png?w=800

5. Greg,
"Though gaussian is theoretically infinite, it is customary to truncate the kernel at 3 sigma."

Thanks, that's what I was looking for. I'm not a fan of truncated gaussian; I think the footprint is too long for the value you get. At least 1/3, maybe 1/2 is tail that isn't contributing much.

I haven't had much time for this today (near 11 pm here), but I'll do a comparison plot of gauss and Welch tomorrow.

6. It's all trade-offs, there's no free lunch in all this.

If you don't mind ripple and neg lobes you can crop gaussian at 2-sigma, it's still pretty good. It's the last 1/3 that "doesn't do much" that gives you nice finish. If you just need to paint a barn door, crop it.

I think your blue line on the last post was looking promising, can this be taken a step further , or does that start to need a long kernel too?

3. "n = 2048" should be "n = 4096", sorry about that.

4. "used:a Tukey window (a = 0.9)" should be "used a Tukey window (a = 0.1)" sorry about that.

5. Hi Nick,
Excellent as always. One quibble, though. You've defined W(t) = integral {-inf to t} R(tau) d(tau). This produces an inverted parabola, and results in the counter-intuitive minus sign in the final formula. [And in the analogous discrete-time version.]

Instead, it would be better to define W(t) = integral {t to inf} R(tau)d(tau), which generates a right-side-up parabola like the figure. Then R= -DW, and the minus sign is eliminated in the end.

[I came at this from the other direction, deriving the discrete-time version starting from the OLS formula, which produces the form W(n) = sum {i=n to N} R(i) naturally. And apologies for not using latex for these formulas, but it didn't work for me. Not sure if it's not enabled here, or whether I just entered it wrong.]

1. Latex has not been working for me either, for what it's worth.

2. Thanks, HaroldW,
Yes, the signs need straightening out. It was strictly correct, and worked in my program, but of course negative W is not a Welch. I just changed the sign rather than direction of integration, else I'd have to explain about backward cumsum.

About Latex - MathJax is installed but I think not allowed in comments. Since I don't use it (I find that HTML isn't bad, and less lurchy), I probably should discontinue it, since the library takes time to load. I don't think I can make Jax usable in comments - Blogger (and most platforms) allow limited facilities for security. That's where it helps to host yourself, like Lucia. But that's not all roses either.

6. Before discussing the derivation I'll just test whether I can cut and paste your eqns into comments:
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ,

7. Somebody was asking about the characteristic period of the ENSO basin. It is 4.25 years as estimated by Allan Clarke at FSU
"Wind Stress Curl and ENSO Discharge/Recharge in the Equatorial Pacific." Journal of Physical Oceanography, 2007

Of course this period works well when used in a wave equation model for ENSO, if we also apply the appropriate forcing, primarily from the 2.33 year period of QBO.

1. More like 4.44 IMO. If you amplitude modulate with 28.6y you get the characteristic "roughly 3-5 year" pseudo-periodicity. There's evidence in the power spectrum to suggest that; I don't know why that should be the case. Chance maybe.

https://climategrog.wordpress.com/?attachment_id=283

4.44y is the double passage of the lunar apsides period over the tropics, suggesting a tidal effect.

2. What the heck are you talking about now? Do you even know what a characteristic period is?

A characteristic period or frequency in this case is the root of the undamped 2nd-order wave equation, which is what Clarke derived. This has nothing to do with an external forcing but has to do with the properties of the volume.

Or is this just more of that FUD that you seem to excel at pushing?

3. So if you have a system which can be described by an undamped 2nd-order wave equation, and a multitude of external driving forces of either periodic or broadband stochastic nature, what periodicities would the properties of your volume be likely to respond most strongly to?

4. As I said earlier, the strongest external forcing aligns with the 2.33 year period of the quasi-biennial oscillation (QBO). This is a well known correlation as the upper atmosphere winds clearly interact with the pacific ocean as they descend on a very periodic basis.

The 2.33 year period is very evident but contains jitter so that the result shows evidence of a frequency modulation interaction with the wave equation.

5. You really are a master of avoiding the question. You should be in politics, oh wait....

I did not ask you what "the strongest external forcing aligns with" and you had just stated that "This has nothing to do with an external forcing", now you reply about external forcing. LOL.

I asked you what period your "undamped 2nd-order wave equation" was likely to respond most strongly to. You prefer to reply by talking about something else which you say is nothing to do with it.

Odd. I wonder why that is.

6. Do you really want to play stump the chump with me? Obviously, a second-order undamped equation will resonate strongly to a external forcing that has the same frequency as the characteristic period of the equation itself. This isn't play-time you know.

7. Thanks, it was a bit of a chase but we got there. So you now have an answer to your earlier questions:

"What the heck are you talking about now? Do you even know what a characteristic period is?"

So should there be a variation in tidal forces around the equator with a repetition period of 4.43y, the undamped 2nd order equation you referred to, with a characteristic period of 4.25 years as estimated by Allan Clarke at FSU, would be likely to resonate with it.

This would mean that a relatively small periodic force could lead to sizeable effects. If that is subject to a roughly 29y modulation it would lead to the peaks that I noted in the reanalysis trade wind data that fit the typical description of ENSO as having 3 to 5 year periodicity.

The orthodox account of ENSO is that it "creates itself". I'm presenting an alternative hypothesis that it may be triggered by a lunar tidal forcing.

Until today I was unaware of Clarke's work which seems to tie in with this interpretation. That's useful , thank you.

8. That's not the way it works. You can close your eyes really hard PeterPan-like and try to imagine a 4.43 year forcing, but if it doesn't exist you are out of luck. Better to go with what the data and the other research say: that the primary forcing is the 2.33 year QBO, and include that in the model. I am apparently not the cherry-picking over-zealous over-fitter that you are.

9. That evades the question of what is causing QBO.

Like I said above, if that is the natural resonance of the system, then it could be excited by either random variations or an external driver.

A sinusoidal 4.43y would not resonate a 2.33y system, but a punctual forcing every 4.43y would. Tidal forces are inv. cube and vary about 45% between perigee and apogee. The latitude of perigee changes with a period of 8.85y due to the precession of the lunar apsides; it passes over the tropics twice in that period.

This is analogous to the solar zenith peaking over the equator twice a year.

So there is a lunar forcing of about the right period to excite the 2.33y QBO natural resonance in the system.

You can close your eyes really hard PeterPan-like and try to make it go away. I think it is worth investigation as possible systematic driver of QBO.

10. An annual forcing can also stimulate the QBO. It is further away from a resonant frequency but its strength makes up for it.

8. Cool. Now in that line:
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

the LHS is the integral of the derivative across the interval, so you are saying
W(τ)y(τ+t) is constant. I don't see the reason for that.

Also this seems to be where you are invoking integration by parts, but I can't relate what you're doing to my knowledge of this method. Maybe you could show in a little more detail how you get there. For ex. with reference to the forms given here:
http://tutorial.math.lamar.edu/Classes/CalcII/IntegrationByParts.aspx

".... and X is a normalising constant, then the OLS moving trend is"
β(t) = ∫R(τ)y(τ+t) dτ

I don't see that this has any connection to OLS. This integral is convolution with a short segment of fixed slope. ie convolution with a constant d/dt. With suitable normalisation this is a moving average of dT/dt. Where does OLS come in?

That would seem to imply X is re-evaluated at each point thus R(τ) is R(τ,t) and the rest needs to change to account for that.

Sorry if I'm missing the point; I appreciate it's difficult to explain this kind of derivation in a quick sketch like that.

1. Greg,
The D is derivative, so the left integral is just the difference from one end to the other. And it is zero at each end.

OLS trend on a fixed interval centered at 0 is just β = ∫τy(τ) dτ / ∫τ² dτ
or a similar sum. Here, mean(x)=0.
Mine is just a moving form with the t slide parameter.

2. Thanks Nick. I realise what D is.

OLS trend on a fixed interval centered at 0 is just β = ∫τy(τ) dτ / ∫τ² dτ
or a similar sum. Here, mean(x)=0.

You should say centred on (0,0). You can justify mean(x)=0 because you have t to centre the interval; this is not true of y.

I think there must be a false assumption somewhere since you do not end up with sin(f) and FT(diff)*FT(rect) = f*sinc(f) = sin(f).
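The shape Greg describes can be checked from the impulse response of the composite kernel (a sketch, not his chirp-z script; the boxcar length and FFT padding are arbitrary). The FFT of (first difference convolved with running mean) matches the closed form (2/N)|sin(πfN)|, i.e. the product of the two individual transforms:

```python
import numpy as np

N = 12
boxcar = np.ones(N) / N
diff = np.array([1.0, -1.0])
kernel = np.convolve(diff, boxcar)        # running mean of first differences

# Frequency response from a zero-padded FFT of the impulse response
M = 4096
H = np.fft.rfft(kernel, M)
f = np.fft.rfftfreq(M)                    # cycles per sample

# Theory: |H(f)| = (2/N) |sin(pi f N)|  (FT of diff times FT of boxcar)
expected = (2 / N) * np.abs(np.sin(np.pi * f * N))
assert np.allclose(np.abs(H), expected, atol=1e-9)
```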

3. Greg,
If x is centred (has zero mean), you don't need to centre y. If you add any constant to y, in the product that constant is multiplied by mean(x), ie 0.

4. β=∫τy(τ) dτ/∫τ² dτ
if you substitute y'=y+c in that expression you get an extra term: the integral of the constant, normalised by the variance of x.

If that could be ignored, you would not use the variance of y when calculating OLS; you could use the RMS.

That missing term is the average of y over the interval, so you are removing a y dependence from the result of the sliding process. I think this may explain why you have an extra 1/f in your spectrum as I noted above.

I also think this may be what Carrick picked up on when he pointed out an error in the dimension of your result.

5. Greg -
The extra term is (∫τ c dτ) / (∫τ² dτ). Given that the range of integration is symmetric about the origin, viz., -N to N, the result is 0.

6. Thanks for the reply, Harold.

If we ignore the normalisation factor in the denominator, that integral is 0.5·c·τ².

Tell me how you get zero.

There's something wrong here since, when I do a convolution diff and a convolution RM on an impulse signal, then do an FFT (well, chirp-z is actually what I did), I get the freq resp to be sin(f). This agrees with what I calculate as the product of the individual FTs: f*sinc(f).

So somewhere I have two results, theoretical and empirical, that corroborate each other, and it is not the same as Nick's red line which he gets by analytic maths.

For me the ground truth is the FT of the impulse response. If the maths does not add up, I start to look where I made an error in the maths. At the end of the day all three approaches have to be reconciled.

Now I have much respect for Nick, who is more familiar with the pure maths/ODE angle than I am, and he's probably smarter than I am.

That does not mean he is right.

I've learnt quite a bit from this discussion already, so it was very interesting for me. I think I've found the error in the algebra, so I'd be pleased if Nick can post a correction, if needed, or explain how my impulse response and FT logic is wrong.

7. "Tell me how you get zero."
The integrand ( cτ ) is an odd function of τ, and the integration is over a symmetric interval of τ (-N to +N). Hence, zero.

Alternatively, one can take the indefinite integral (0.5·c·τ²) and evaluate it at the endpoints (-N and +N), and subtract. Still get zero.
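A throwaway numeric check of this point (N and c are arbitrary values for the sketch): the extra term (∫cτ dτ)/(∫τ² dτ) over the symmetric interval vanishes because cτ is odd, and the antiderivative evaluated at the endpoints cancels exactly.

```python
import numpy as np

N, c = 10.0, 3.7
tau = np.linspace(-N, N, 200_001)

# Riemann-sum version of the ratio (the grid spacing cancels top and bottom)
extra = np.sum(c * tau) / np.sum(tau**2)
assert abs(extra) < 1e-9

# Antiderivative of c*tau, evaluated at the endpoints and subtracted
F = lambda t: 0.5 * c * t**2
assert F(N) - F(-N) == 0.0
```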

8. Thanks, HaroldW,
I've been using integration (or summation) by parts quite a bit. You write down something which, when differentiated by the product rule, gives the integrand that you want, and something else that may be more informative. Integrating the derivative gives a difference, here zero, and so equating those product rule integrals to zero gives an alternative expression for the integral in question.

9. Thanks Nick, Harold.

I don't think the maths is being presented in a very rigorous fashion: t is not always zero and y has a finite mean. However the result clearly needs to be invariant w.r.t. translation in x and y, so the result is correct.

If beta is to represent the OLS slope then both variables need to be taken as deviations from their respective means.

I'm being picky because maths is a strict science. The hand-waving engineer's approximations come later.

No one has raised any objection to the spectral response of running mean of rate of change that I've presented, so perhaps the correct conclusion here is that sliding-trend, which minimises variance, provides some roll-off that running mean, which minimises deviation, does not.

I'm not 100% convinced, but if that's correct it would certainly be an interesting result.

10. "No one has raised any objection to the spectral response of running mean of rate of change that I've presented..."
There are no running means here. The OLS trend estimate is the correlation of a windowed ramp R(.) with the original time series, or a parabolic window W(.) with the derivative.

In Nick's notation,
β(t) = ∫R(τ)y(τ+t) dτ
or
β(t) = ∫W(τ)Dy(τ+t) dτ

11. Reply button isn't working at the moment (I tried multiple browsers).

Greg Goodman: I don't think the maths is being presented in very rigorous fashion, t is not always zero and y has a finite mean. However the result clearly needs to be invariant w.r.t. translation in x and y, so the result is correct.

When I implement LSF routines, I subtract the mean from the data I'm fitting. Much improved accuracy. For linear trend, the regression formula people use is based on performing the LSF on the mean-subtracted data.

Incidentally Nick, what you've derived is true for infinite precision arithmetic. Given that there's a finite noise floor, there is very likely going to be a preferred method for computing the running trend.

To give an analogy, formally,

E[(x - xbar)^2] = E[x^2] - xbar^2

but numerically the left-hand-side expression is superior to the right-hand-side one, especially when you have accurately measured data. Also, this was a bigger issue when people used to use single precision to do math.
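The cancellation is easy to demonstrate. This sketch (mine, not Carrick's) uses double precision with a deliberately large offset, analogous to the single-precision case he mentions: on paper both forms give the variance of the noise (1 here), but the one-pass form loses the signal.

```python
import numpy as np

rng = np.random.default_rng(42)
x = 1.0e9 + rng.standard_normal(100_000)      # variance 1 riding on a huge mean

v_two_pass = np.mean((x - x.mean()) ** 2)     # subtract the mean first: stable
v_one_pass = np.mean(x ** 2) - x.mean() ** 2  # catastrophic cancellation

assert abs(v_two_pass - 1.0) < 0.05           # close to the true variance
assert abs(v_one_pass - 1.0) > abs(v_two_pass - 1.0)   # one-pass form is worse
```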

12. That's what I have trouble with intuitively. If I'm applying a bell-shaped window fn to apply low-pass filtering, fine. But why is the data at the ends of the segment any less relevant to the OLS slope than the middle?

Bear in mind that the OLS trend is an estimator of the rate of change of y wrt x, or dy/dx; once you start sliding this along you are doing something very close to the running mean of the rate of change.

I can see that W(τ) will produce roll-off; I can not see that weighting the data that way is the same as minimising square deviation from the line. That criterion cannot depend upon the position of the data point within the window. Each point must be weighted equally.

13. Actually with OLS trend, the end points are weighted more heavily than the center of the data. This leads to the well-known sensitivity to outliers near end points for OLS trend.

What seems odd about Nick's result is that it is backwards (the middle seems to be weighted more heavily).

14. To follow up on that comment, here's the formula for linear regression (specified as subtraction from the mean):

http://www2.owen.vanderbilt.edu/Germain.Boer/images/req2.gif

Note that the y-data are weighted by (x-xbar)/var(x)

Again what seems odd here is it’s linear in x’ = x-xbar and weights the end points more strongly.

Assuming the bounds [-N, N], the formula for the Welch window is 4 · (x’/N) · (1 − x’/N), which is quadratic and weights the points near the center most strongly.

There’s something obvious here that I’m not getting.

15. "There’s something obvious here that I’m not getting."
The OLS trend estimate weights the end-points of the time series the most heavily.
However, considering the alternative formulation of the trend estimate as a [Welch-]weighted average of the first differences, the center *first differences* are weighted the most heavily.
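That resolution can be seen directly in the discrete weights (a quick sketch of my own, using the notation of the post: R(j) = j/Σj², and W the reverse cumulative sum of R):

```python
import numpy as np

N = 10
j = np.arange(-N, N + 1)

# Weight applied to y(i+j) in the OLS trend: linear ramp, largest at the ends
R = j / (j @ j)
assert abs(R[0]) == R.max()      # heaviest weight at j = -N (and +N)
assert R[N] == 0.0               # zero weight at the centre, j = 0

# Weight applied to the first differences: W(j) = sum over k >= j of R(k)
W = R[::-1].cumsum()[::-1]       # parabola, largest in the middle
assert W.argmax() in (N, N + 1)  # peak at the centre
assert np.isclose(W[0], 0.0)     # vanishes at the ends
```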

16. 'k that makes sense.

The question of what this manipulation does to your noise floor becomes even more germane if the most heavily weighted points are near the center for the Welch-smoothed difference, but the "actual" regression formula weights the end points most heavily.

17. My impression is that the discussion does not have any clear direction, because neither the parameter being estimated nor the noise model has been specified accurately enough to allow for unique answers.

18. I often think of the trend with a mechanical analogy. It's the first moment. If the points were on a see-saw, it's the end ones that have most effect.

But as HaroldW says, if you want to move the see-saw by applying a dipole torque (difference), that is best done at the centre.

19. I wrote this last night but it does relate somewhat to Pekka's point and Nick's see-saw.

The denominator is the same for every term. The assumption is that the true relationship is linear and that there is negligible error in x.

To simplify the notation, let's assume both means are zero.

The error model is: yᵢ′ = b·xᵢ + eᵢ
The numerator then becomes:
Σ(xᵢ·yᵢ′) = Σ b·xᵢ·xᵢ + Σ xᵢ·eᵢ = bΣxᵢ² + Σ(xᵢ·eᵢ)
When divided out, the first term gives the slope.

Now, the fitted line has to pass through (0,0) in the zero mean definition. A y error at x=0 cannot influence the choice of fitted slope in terms of minimising errors, since the slope does not affect the y error at x=0. Similar arguments apply as we depart from x=0: the weighting is that which will minimise the total error. [ie Nick's see-saw.]

y errors are assumed to be normally distributed and independent of x. This also implies that there is no systematic variability in the data other than the presumed linear relationship. This means that it will not provide a good estimation of the 15y trend when there is a 10y systematic variability due to volcanoes, for example.

"This leads to the well-known sensitivity to outliers near end points for OLS trend."

If it is "sensitive" to this condition, the basic assumptions that the errors are normally distributed and that the data are sufficiently well sampled to represent the distribution have not been satisfied.

It is true that a large random error at the end will have an effect while the same error at zero would not. This will not cause significant problems if it is part of a large normal sample. OLS is not meant to be perfect. It is the statistically best estimator of the slope.

If the data point is an "outlier" in a small sample, then the OLS condition has not been met.

It is sadly very, very common for people (even a large proportion with the letters P, H and D somewhere in their name) to be either totally ignorant of the basic assumptions under which OLS is a "best estimator", or to think that they don't really matter, and then get misled by their own spurious results.

The one which is most gleefully ignored is the requirement for negligible x errors.

OLS is not a low-pass filter and should not be used as one. It is a valid way to minimise the effects of normally distributed errors, provided there is no significant systematic variability other than the presumed linear relationship.

M&F are trying to use the sliding trend as a "smoother". It is a misuse. They should have used a properly chosen filter.

20. OLS weights the end points and Nick's method weights the middle. This again raises the question of whether it indeed represents OLS. I'm still not convinced that it does.

21. It strikes me that minimising sum of squares of y(t) is equivalent to minimising absolute deviation in dy/dt , making running mean of dy/dt identical to sliding OLS trend (notwithstanding the numerical differences raised by Carrick).

22. Pekka Pirilä: My impression is that the discussion does not have any clear direction, because neither the parameter being estimated nor the noise model have been specified accurately enough to allow for unique answers.

Actually I don't think it is a big deal to add a noise model for any OLS-based processor. While we typically use white noise, it is also easy to evaluate correlated noise by directly applying analytic methods to the Hessian matrix.

With any noise model that we can specify analytically, we can also use Monte Carlo based methods to make these sorts of evaluations.

Because of the relative straightforwardness of the method, I think it is a more robust approach than alternative methods for measuring trend.

I got interested in OLS acceleration filters as an approach to quantifying the slow-down (break-point analysis). Like with trend, it is straightforward to model noise, certain types analytically with the Hessian matrix, or with more realistic noise models using Monte Carlo.

I don't agree (as some people seem to suggest) that we can simply avoid the question of measurement error: In my parlance, if you can't quantify how robust your measurement is, you don't have a true measurement.

23. I think this post still needs a "statement of the problem" as I think Pekka had implied.

What exactly is in need of filtering?

If it is the GISS temperature time-series, I don't see any real noise (apart from within a 1 year window). All the fluctuations are accounted for by known natural variability factors such as ENSO, volcanic disturbances, etc.

The "filter" I use is a multiple regression approach I call CSALT, which uses OLS to properly scale the set of factors. The resultant signal is thus cleaned up via this "noise model".

24. WHT,
"I think this post still needs a "statement of the problem" as I think Pekka had implied.

What exactly is in need of filtering? "

It's a different way of seeing regression as a derivative filter. A convolver which returns an approximate derivative. People use it that way; I'm just showing that it does indeed return a Welch-smoothed derivative.

Differentiation amplifies HF, and without smoothing gives a mess. I've been exploring how much smoothing is needed. But it isn't just a filter problem. Differentiation is the main functionality here.

25. As I understand Nick's post and comments this thread discusses determination of the derivative of a time series at interior points of the long full time series.

When what we wish to determine is the derivative at one point, we have two conflicting requirements. The calculation should be limited to a short enough interval to make it likely that the trend determined over the interval is close to the derivative of the underlying signal at the particular point. On the other hand, we should not use so few points that the statistical uncertainties grow too much.

I describe the requirements qualitatively in the above, but the specifications should be made explicit and quantitative in order to be able to fully judge the merits of alternative tapers. Most important for that are assumptions that concern the smoothness of the underlying process that produces the derivative we wish to determine. If the process is described by a continuous function with continuous derivatives up to a given order and with some given upper limits for those derivatives, we might be able to determine an optimal taper, when we make also assumptions about the properties of the noise.

These requirements may be too difficult to satisfy, but even then we should start by deciding, at some quantitative level, what we think about the underlying process and about the noise, in order to make justified choices.

If the underlying process has poor smoothness properties, then it's probably best to use a very simple and transparent formula, which might well be the OLS trend over a specified interval centered at the point considered.

26. Thanks Nick, now I get it better. I can see a good application of a differentiating filter. First and second derivatives of a time series are useful if the plan is to evaluate differential equations. The trick is to remove the amplified noise while keeping the underlying curvature intact. See this post at the Azimuth Forum for an example:
https://forum.azimuthproject.org/discussion/comment/13806/#Comment_13806

One has to smooth somewhat before doing the derivative otherwise it is just noise.

27. Pekka,
"When, what we wish to determine is the derivative at one point, we have too conflicting requirements. The calculation should be limited to a short enough interval to make it likely that the trend determined over the interval is close to the derivative of the underlying signal at the particular point. On the other hand we should not use so few points that the statistical uncertainties grow too much."

That's why I've been looking at alternative views of regression. The estimator has to do some smoothing, else it just returns HF noise. So you can't get a point value. I've been looking for the best compromise between derivative accuracy and smoothing.

When you numerically differentiate a smooth curve, the enemy is higher derivatives, which the formula will try to allow for to some extent. You try for a small interval. When you differentiate a noisy curve, the enemy is noise, and you try for smoothing and a larger interval. When you are looking for the derivative of a smooth curve overlain with noise, you want to find the balance point.
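Nick's balance between derivative accuracy and smoothing can be illustrated numerically. A minimal sketch (my own construction; the ramp slope, noise scale and half-width N = 20 are arbitrary):

```python
import numpy as np

# On a smooth ramp plus white noise, raw first differences are dominated
# by noise, while the sliding OLS trend (a Welch-smoothed derivative)
# recovers the slope, at the cost of averaging over a wider interval.
rng = np.random.default_rng(3)
t = np.arange(500)
y = 0.01 * t + rng.normal(scale=0.3, size=t.size)   # true derivative 0.01

raw = np.diff(y)                                    # point-wise estimate
N = 20
tau = np.arange(-N, N + 1)
trend = np.convolve(y, (tau / np.sum(tau**2))[::-1], mode="valid")

print(raw.std(), trend.std())   # the OLS trend is far less noisy
```

Widening the window reduces the noise further, but averages the derivative over a longer stretch of the curve, which is exactly the compromise described above.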

28. Even if we can't practically implement it that way, Nick's proof that the running trend filter is equivalent to a Welch-windowed smoothed derivative of the function under study gives us insight into what the mathematical operation is that we're performing.

To me this particular approach is well behaved, well studied, and I find the suggestion that it is somehow difficult to analyze what the operation we're performing or the measurement noise associated with the process to be a little surprising. This is just not very difficult to do.

The other thing is, the noise amplification of high frequency is with respect to the ratio of the output of the filter (which is a derivative) to the input (which is an amplitude series). That is not a dimensionless ratio.

If you write it as a dimensionless ratio of measured trend to actual trend, which involves dividing the transfer function by 2*pi*i*f, you don't actually see any amplification of high frequencies.

Finally, you can suppress the nulls that bother Greg by applying a low-pass filter to the output of this filter. You can pick the first null of that filter to correspond to the maximum of the first lobe of the running trend filter.

9. Posting from the hip a bit here, but my gut feel, without checking the maths, is that your red line is the spectrum of your acceleration trend, not the linear trend.

1. Is this on the right thread?

2. yes, sorry, it was supposed to follow on from my comment above, I got tired of all the hoops you need to jump through to post here.

The red line is from the sister post, but it relates to my comment about 1/f in the previous comment. I think once this is resolved, you will find that the sliding trend is sin(f) in FD and the red line is your acceleration trend.

Greg

3. Sorry make that:

W(x) = 4 * (x/N) * (1-(x/N))

4. Hm... sorry this is for [-N/2, N/2]. Teach me to do physics while sipping Joe and watching TV. lol

DW = -R
∫D(W(τ)y(τ+t)) dτ = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

let's stick with W to keep it clearer and check the working before doing the substitution.

W(τ)y(τ+t) = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ ; integration by parts
http://tutorial.math.lamar.edu/Classes/CalcII/IntegrationByParts_files/eq0010M.gif

∫D( W(τ)y(τ+t) ) dτ = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

Integration by parts ??

What you are effectively doing is:
∫D( W(τ)y(τ+t) ) dτ = 0 = W(τ)y(τ+t)

You've justified the left-hand equality, but there seems to be a logical slip in implying this justifies the RHS.

11. Greg,
"If it is "sensitive" to this condition, the basis assumptions that the errors are normally distributed and that the data sufficiently well sample to represent the distribution, have not been satisfied."
By sensitive, he just means that it is weighted, whether an outlier or not. It's not saying that outliers have any particular prevalence.

"OLS is not a low-pass filter and should not be used as one."
I've shown the spectrum. It does damp high frequency, relative to frequencies with periods around the regression period. It isn't a low-pass filter; it's a smoothing, differentiating filter.

That's the idea of its treatment of noise. HF is damped, low F is differentiated and passed. As usual, mid-range is messy.

"If OLS weights the end points and Nick's method weights the middle."
It weights the middle for differences. To extend the see-saw analogy, suppose it has bolt-heads along the side. You can tilt the see-saw either by adding weights, or by applying a spanner to the bolts. (Why would you do that? Maths means never having to say you're silly).

For best effect, you'd put weights at the ends, but apply the spanner to the middle. Spanner = local torque = difference (dipole).

"It strikes me that minimising sum of squares of y(t) is equivalent to minimising absolute deviation in dy/dt"
If you mean dy/dt as estimated by differences, then that's true, in a Welch-weighted sense.

"W(τ)y(τ+t) = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ ; integration by parts"

In this formula, the LHS means difference from one end of the range to the other, which as explained above, is zero. And with W Welch, DW is a (downhill) sawtooth pulse, as in regression (after minus).
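The discrete version can be verified numerically: summation by parts turns the sliding OLS trend into a Welch-type weighted average of first differences, exactly. A sketch (my own code, not Nick's; the random-walk series and half-width N = 7 are arbitrary, and the closed form for the weights is just one way of writing the partial sums of R):

```python
import numpy as np

# Check that the sliding OLS trend over a window of 2N+1 points equals
# a Welch-weighted average of the 2N first differences in the window.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))    # a random-walk "temperature" series

N = 7                                  # half-width; window length 2N+1
tau = np.arange(-N, N + 1)
X = np.sum(tau**2)                     # normalising constant

# OLS trend at centre t: sum over tau of (tau/X) * y[t+tau]
beta_ols = np.convolve(y, (tau / X)[::-1], mode="valid")

# Welch-type weights on the differences y[t+k+1]-y[t+k], k = -N..N-1,
# obtained by summation by parts; they sum to 1.
k = np.arange(-N, N)
w = (N * (N + 1) - k * (k + 1)) / (2 * X)

beta_welch = np.convolve(np.diff(y), w[::-1], mode="valid")

print(np.allclose(beta_ols, beta_welch))  # the two filters agree exactly
```

So the "trend" really is a weighted mean of the local differences, with the weight hump in the middle, which is the see-saw picture in discrete form.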

1. "In this formula, the LHS means difference from one end of the range to the other, which as explained above, is zero."

No, Nick, you're not addressing the point I've raised.

∫D( W(τ)y(τ+t) ) dτ = 0 = W(τ)y(τ+t)

The LH equality is what you discussed and I agree. That is not the same as the RH equality.
The LH term is identically zero. The RH is a continuous function: the Welch-weighted segment of the time series. This is not identically zero; it is a time series.

The integration-by-parts eqn allows this time series to be evaluated at each value of t as the sum of two finite integrals. Neither of them disappears, nor are they identically equal.

Your mistake is in identifying the time series with the int-of-diff trick, which yields a single number that is equal to zero.

Greg.

2. Greg,
Then it's just a product rule derivative
∫( W(τ)y(τ+t) )' dτ = 0 = ∫W'(τ)y(τ+t) dτ + ∫ W(τ)y(τ+t)' dτ
That's how IBP works. With I = D^-1 = integrate
D(IA*B)=A*B+IA*DB (product rule)
So I(D(IA*B))=Δ(IA*B) = I(A*B)+I(IA*DB)
So I(A*B) = Δ(IA*B) - I(IA*DB)
where Δ is difference over range.

So as long as you can handle IA, you have made progress. Here, if A is linear, IA is Welch.
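The discrete Abel-summation step used here is easy to check on arbitrary sequences. A sketch (my code; A and B are just random vectors, with IA taken as zero before the start of the range):

```python
import numpy as np

# Discrete summation by parts: I(A*B) = Delta(IA*B) - I(IA*DB), where
# I is cumulative sum, Delta(IA*B) is the end-to-end difference of IA*B
# (zero at the lower end since IA starts from zero), and DB = diff(B).
rng = np.random.default_rng(2)
A = rng.normal(size=30)
B = rng.normal(size=30)

IA = np.cumsum(A)                       # discrete antiderivative, I = Delta^-1
lhs = np.sum(A * B)                     # I(A*B) over the whole range
rhs = IA[-1] * B[-1] - np.sum(IA[:-1] * np.diff(B))

print(np.isclose(lhs, rhs))
```

With A the (mean-subtracted) regression weights, IA is the Welch window and the boundary term vanishes, which is the continuous argument of the post in discrete dress.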

3. Your explanation above seems logical enough , thanks.

"It [W] is the cumulative integral of R, and since that has mean subtracted, so integral over the whole range is zero, then it is a quadratic that starts from zero at -N and returns to zero at N, and is zero outside that range."

As we have seen, for the definite integral to be zero, it simply has to start and finish at the same value. Not necessarily at zero.

If it starts at zero then R(-N)=R(+N)=0 , a special case where all R is zero, and zero slope; W is flat.

It seems you are shifting the Welch window if it starts at zero .

4. This is the problem with the diff.int trick, you lose information about the constant. diff.int=0 does not contain enough information.

A sloping line through (0,0) has an integral a.x^2 ; diff.int=0

This weights the extremes not middle.

5. sorry, forget the last comment .

The problem is the Welch can't start at zero if it is the cumulative integral that equals zero.

6. "That's the idea of its treatment of noise. HF is damped, low F is differentiated and passed. As usual, mid-range is messy."

Why would anyone choose a filter with a substantial range that is "messy"?

There are many well-behaved filters: gaussian, triple running mean , lanczos ...
There are always compromises, that does not mean accepting a mess.

7. Greg,
"There are many well-behaved filters: gaussian, triple running mean , lanczos ...
There are always compromises, that does not mean accepting a mess. "

All filters have a messy region. They compromise between incompatible goals: often, to faithfully pass low (or band) frequencies while damping HF. Somewhere there is a region where they do neither. A gaussian is good at damping, but does not give a flat response anywhere. Lanczos tries to, but at the cost of negative weights etc.

You keep not distinguishing between -pass filters and differentiating filters. With diff the incompatibility is more acute, since the desired result increases with HF, so when you switch to damping, it's at the worst time. That's why the results of my smoothed filters have a lot of noise at the freq of the trend period. Messy but inevitable. A Gaussian won't differentiate.

8. ps - new post coming on this.

9. "A Gaussian won't differentiate."

No, but a gauss-diff will, slightly more accurate than diff then gauss. Either way you get a well-formed pass band and monotonic roll-off, no negs, no mess.

Equally Lanczos can be stacked on top of first diff to provide a good transition band and nice flat pass- & stop-band.

"Lanczos tries to, but at cost of negative weights etc."

Not sure what you mean by that. A 3-lobe Lanczos has very little overshoot or ripple. The main drawback is that it requires quite a long kernel to achieve a filter this good.

If I don't have to get as near as possible to the end of the data, this is a good high quality filter.

10. http://climategrog.wordpress.com/2013/11/28/lanczos-filter-script/

I don't know of another filter that looks as nice as that.

cost of negative weights? No cost. I object strongly to negative lobes, but I really see no "cost" in negative weights.

11. Greg,
"cost of negative weights? No cost."
As Wiki says:
"Since the kernel assumes negative values for a > 1, the interpolated signal can be negative even if all samples are positive."

12. Technically that is true. If you use a 5 lobe Lanczos it has substantial ringing artefacts. I would not recommend a 5-lobe L. There is a note to that effect in the script which I provided.

However, the 3 lobe filter for which I linked the freq. plot above has very small overshoot : < 1% at worst

That is the whole point of this filter, to window the sinc fn to provide a finite kernel with minimal ringing. It is probably the theoretically closest solution to an ideal step filter.

That's a whole world away from inverting lobes of the order of 50% we've been discussing here. For someone who seems to think this kind of "mess" is some kind of inevitability, I'm surprised you are so quick to talk about the cost of negative weights without realising it is less than a cent per dollar.

That's the sort of 'mess' I can live with.

12. For most typical cases that do not involve error ranges for both variables, smoothing and differentiating are linear operations that commute exactly and can be combined to one single step weighted sum.

When we know very little about the smoothness properties of the underlying process, including negative lobes in the smoothing function does not make any sense at all; such a filter can be justified only by assumptions that involve limits on higher derivatives. When even the existence of those derivatives is questionable, negative lobes should definitely be avoided.

When the filter is based on a relatively small number of data points it may be worthwhile to look at the coefficients of the combined linear operator that's used to estimate the derivative. It's often much easier to judge its suitability for the task by such direct inspection than by more indirect means like the properties of the Fourier transform. In the case of temperature time series it's essential that the annual and diurnal variability, with accurately known periods, are not allowed to bleed through. When that's done properly, the remaining signal that we are looking for is unlikely to have such smoothness properties that any sophisticated method could be justified.

A simple moving average does not produce as smooth results as other common filters, but that's not necessarily a fault. You could argue also that the smoothness produced by the other filters is misleading and that the moving average gives a more correct impression of the variability that is really present in the signal.

13. Pekka: For most typical cases that do not involve error ranges for both variables, smoothing and differentiating are linear operations that commute exactly and can be combined to one single step weighted sum.

This sort of statement is simply not true for finite-precision arithmetic. Since any algorithm is generally going to be implemented using double-precision numbers, that is a real issue with this sort of statement.

Even simple algebraic properties are not obeyed, including even the associative property:

(A+ –B) + C ≠ A + (–B + C)

(A, B and C are positive numbers here. I've written this in a manner to illustrate the problems.)

This is why languages like C have an order of evaluation specified (it prevents optimizers from reordering your expressions).

This sort of problem is easier to see with single-precision arithmetic, but it is true with double precision (which I typically use instead).

In general, the issue is that taking the difference between two very nearly equal numbers always results in a loss of precision, relative to an ordering of operations that avoids this.
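A quick double-precision demonstration of this non-associativity (my own numbers; A = B = 1e16 and C = 1 are chosen to force the cancellation):

```python
# With A = B = 1e16 and C = 1.0, the two groupings give different answers:
# in the second, -1e16 + 1 rounds back to -1e16, so the 1 is lost.
A, B, C = 1.0e16, 1.0e16, 1.0
left = (A + -B) + C    # exact cancellation first: 0.0 + 1.0
right = A + (-B + C)   # the 1 is absorbed by rounding
print(left, right)     # 1.0 0.0
```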

I mentioned above the preference for calculating:

E((x - mean)^2)

rather than

E(x^2) - mean^2

When you have precisely measured quantities the second expression will result in a loss of precision in your variance (or standard deviation) estimate.

It is also why we should generally subtract the mean before computing an OLS calculation. As I pointed out the ordinary regression formula is obtained in exactly this fashion.
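The two variance formulas can be compared directly. A sketch (my own example, using float32 rather than double precision so the cancellation is easy to see; the offset 1e4 and the grid of values are arbitrary):

```python
import numpy as np

# E((x-mean)^2) is numerically stable; E(x^2) - mean^2 cancels badly
# when the mean is large relative to the spread.
x = np.float32(1.0e4) + np.linspace(0, 1, 10001, dtype=np.float32)

mean = x.mean(dtype=np.float32)
var_stable = np.mean((x - mean) ** 2, dtype=np.float32)
var_naive = np.mean(x * x, dtype=np.float32) - mean * mean

true_var = np.var(1.0e4 + np.linspace(0, 1, 10001))   # float64 reference
print(var_stable, var_naive, true_var)
```

The same cancellation occurs in double precision, just further down in the digits, which is why subtracting the mean first is the safe habit.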

When we know very little about the smoothness properties of the underlying process, including negative lobes in the smoothing function does not make any sense at all, such a filter can be justified only by assumptions that involve limits for higher derivatives. When even the existence of those derivatives is questionable negative lobes should definitely be avoided.

Since you can use spectral analysis to look at the impact of the filter on the time series you are analyzing, it is a rather trivial matter to determine whether there are issues associated with retaining higher-frequency lobes.

As I pointed out above, if there is an issue with the side lobes, it is also trivial to apply a cascaded smoothing filter after the derivative filter. By selecting a smoothing filter that has nulls at the locations of the maxima of the first side lobe, you not only have increased the order of the filter, you've also suppressed the most important lobe of the derivative filter.

As Nick points out, there are always trade-offs between different approaches, and absolute statements such as yours rarely hold up in practice.

A simple moving average does not produce as smooth results as other common filters, but that's not necessarily a fault. You could argue also that the smoothness produced by the other filters is misleading and that the moving average gives a more correct impression of the variability that is really present in the signal

The moving average is exactly the sort of filter that has large side-lobes. Yet you now seem to be objecting to filters that don't include the artifacts associated with the side-lobes, as if this is giving a less correct impression of the variability that is present in the signal. That is really a curious argument.

If you want to show the smoothed version and you want to present the original variability, it is enough to provide the smoothed output overlaid on the original data.

Certainly smoothing by itself always necessitates reduction in the original variability, so if you want the original variability you just present the original variability too.

14. Carrick,

I'm very familiar with problems from floating point arithmetic, and agree that they must often be considered.

Moving average often produces misleading results. It's not uncommon to read press releases that tell of a sudden change in some economic indicator, claiming that something dramatic occurred in the latest data point, possibly even speculating about the reason for that recent change, when the real change is that one exceptional value was dropped from the beginning of the period. The spectral properties are surely in many ways unsatisfactory.

In spite of all that, I do still think that the advantages of moving average make it often the best choice, when we consider an irregularly changing value that's not controlled by an underlying function that has well defined higher derivatives bound by some reasonable bounds.

What's appropriate depends on the application, and what's misleading depends both on the application and the sophistication of the audience. Methods that are best in some fields of signal processing are not necessarily the most appropriate in analysis of temperature time series.

1. Pekka: What's appropriate depends on the application, and what's misleading depends both on the application and the sophistication of the audience. Methods that are best in some fields of signal processing are not necessarily the most appropriate in analysis of temperature time series.

Well I absolutely agree with this: When you are doing filtering, or any other sort of signal processing, you do have to consider the intended audience.

That's why you won't see me back away from the use of running average or running trend filters, even if you may see me caution against only using rectangular windows in that application, or only using one filter width.

15. Thanks for your comments, Pekka.
"In spite of all that, I do still think that the advantages of moving average make it often the best choice"

Well it has two selling points: a compact kernel ( not hard to achieve if you are prepared to accept a crap frequency response ) and a zero in the FR. Most filters manage that, with the rare exception being gaussian.

A three pole RM can approximate a gaussian and provide a zero without massive distortion.

Many seem to use RM by ignorance rather than by design. If the result is "smoother" they have attained their design objectives of "doing a smooth". That they may be inverting peaks etc does not even occur to them.

M&F2015 was a fine example of this problem.

They chose a 15y sliding trend, presumably a nice round number of about the right size, without knowing that it would invert the strong volcanic signal and lead them to the false conclusion that their (untested, non-validated) "innovative" method would detect a dependency on model TCS if one were present.

Not knowing or examining the frequency properties of the "smoother" they chose is very poor practice. Not testing that their novel method was able to do what they assumed it would do is very poor and should certainly have been picked up in peer review.

Failing to notice that their fig 3b (62y trends) did show a clear bifurcation into two groups, and failing to investigate it, was very poor. It also contradicts their conclusions.

This paper is particularly disappointing since Forster and Gregory 2006 had impressed me as being about the only climate paper I've seen to date that recognised the importance of regression dilution in estimating TCS.

Maybe it was Gregory that was the more rigorous of that pair of authors.

1. Greg,
It seems clear that M&F made all these choices following, what was discussed in AR5. From Box 9.2 (p. 769) that discusses the hiatus, we can read:

During the 15-year period beginning in 1998, the ensemble of HadCRUT4 GMST trends lies below almost all model-simulated trends (Box 9.2 Figure 1a), whereas during the 15-year period ending in 1998, it lies above 93 out of 114 modelled trends (Box 9.2 Figure 1b; HadCRUT4 ensemble-mean trend 0.26°C per decade, CMIP5 ensemble-mean trend 0.16°C per decade). Over the 62-year period 1951–2012, observed and CMIP5 ensemble-mean trends agree to within 0.02ºC per decade (Box 9.2 Figure 1c; CMIP5 ensemble-mean trend 0.13°C per decade). There is hence very high confidence that the CMIP5 models show long-term GMST trends consistent with observations, despite the disagreement over the most recent 15-year period. Due to internal climate variability, in any given 15-year period the observed GMST trend sometimes lies near one end of a model ensemble (Box 9.2, Figure 1a, b; Easterling and Wehner, 2009), an effect that is pronounced in Box 9.2, Figure 1a, b because GMST was influenced by a very strong El Niño event in 1998.

From this excerpt we can find both the periods of 15 and 62 years and the use of linear trends over those periods.

In my view their paper is not meant to be great original science; its purpose is only to extract some information about what the CMIP5 model-run ensemble contains, and to do it in a way that mimics the choices of AR5 as far as practical. It's unfortunate that out of all papers on climate science this kind of simple exercise is among the few published in Nature. This is not nearly the most important work the authors themselves have published, but it is among the most visible of their publications.

The "empirical" input comes from a model run ensemble that's not representative in any well defined way of anything more general. The "data" is not worth of any sophisticated statistical analysis. They can answer semiquantitatively some questions, but what the questions or the answers really mean is not obvious at all.

Looking at strongly overlapping periods is also an approach that leads often to misleading conclusions. The effective number of independent intervals is very low for the 62 year trends, and rather limited also for the 15 year trends.

Model runs of the type of the CMIP5 ensemble could, in principle, be used to study how well various smoothing methods can help in interpreting the actual behavior of the models, but that might require so much larger ensembles that the cost of all the model runs cannot be justified.

2. Thanks for that.

So it was AR5 that chose to analyse the situation in terms of trends that can be thrown out by the volcanic forcing. Oh dear.

The idea of the paper was not too bad as a quick poke at the question. It's just unfortunate that they apparently converted the AR5 trend into a sliding trend without understanding what they were doing.

"because GMST was influenced by a very strong El Niño event in 1998." It was also influenced by circa 10y variability due to volcanism.
Both of which, of course, invalidate using OLS fit to a linear model in the first place.

This simplistic obsession with "trends" needs to be banned from climatology if we are to make any progress on the question.

3. Greg,
"If the result is "smoother" they have attained their design objectives of "doing a smooth". That they may be inverting peaks etc does not even occur to them. "

I think you're still missing that regression is not just another filter. It gets a smoothed derivative. That's the design objective, not "doing a smooth". It is supposed to turn peaks into zeroes.

4. Greg: This simplistic obsession with "trends" needs to be banned from climatology if we are to make any progress on the question.

I happen to think this obsession with banning measurements of trends (and higher derivatives) is what needs to be banned. ;-)

There are many places in physics and engineering where slopes are used. One place is to estimate the rate of warming… which is the point of what we're doing here.

Another one is group delay estimated from total phase measurements. As we all know here the group delay τ is given in terms of the total phase Φ by the first derivative with respect to angular frequency ω,

τ(ω) = – Φ'(ω)

(sign depends on convention of course.)

That's actually the application I was studying when I first got involved with running OLS trend (derivative) filters.

Nick---I think Greg is concerned that M&F may be flipping the sign of the peaks & valleys of the trend.

5. Carrick, I'm not saying OLS regression of a linear model should never be used in physics; it is a fundamental and valuable technique. Though you don't see physicists calling it a "trend"; that term has temporal connotations and is probably inherited from finance / econometrics.

There are very few situations in climatology where the necessary condition ( mathematical meaning ) that there be no systematic variation other than the supposed linear relationship, is met.

This is even pointed out by the IPCC quote above but that does not stop them using "trends" like they mean something. They don't. They are data processing errors. This is where all the arguments about "cherry-picking" come from. What no one seems to get is that it is not the start or end dates that are improper it is the whole concept of fitting a "trend" that is invalid.

As for M&F, there is a significant component of the signal in the latter part of 20th.c. which will be inverted by a 15y RM or sliding-trend.

That means they scrambled the data before starting the analysis and any results and conclusions are worthless. ( Not *necessarily* wrong but worthless, invalid and hence uninformative). Had they tested the method before using it they would have discovered that.

6. Greg: Carrick, I'm not saying OLS regression of a linear model should never be used in physics; it is a fundamental and valuable technique. Though you don't see physicists calling it a "trend"; that term has temporal connotations and is probably inherited from finance / econometrics.

I hope not. That's just crazy talk. Also, we don't use the word "trend" because we don't deal with temperature data. We could use the phrase "temperature velocity" here (first derivative of temperature amplitude with time), but that's clumsy language, even if more precise. Generally people understand that "temperature trend" means the same as "temperature velocity".

There are very few situations in climatology where the necessary condition ( mathematical meaning ) that there be no systematic variation other than the supposed linear relationship, is met.

If what you are interested in is a smoothed version of the first derivative of a measured quantity, using a sliding-window OLS filter is a perfectly reasonable method for arriving at this quantity.

This filter can be improved by using tapered windows, but as Nick's analysis showed here, even the rectangular windowed OLS filter gives you a response which is formally equivalent to Welch-windowed running average of the derivative of the original function.

As for M&F, there is a significant component of the signal in the latter part of 20th.c. which will be inverted by a 15y RM or sliding-trend.

I looked at this by comparing different window types and I didn't see this. I think part of the reason is the higher order fall off of the derivative filter (compared to a smoothed amplitude filter) and part is due to the 1/f-to-some-power spectrum exhibited by temperature fluctuations.

As I mentioned on another thread, if you want to compare the OLS filter using different choices for the taper, you do need to adjust the filter width so you get the same equivalent bandwidth for the filter.

That is not hard to do with standard filters such as Welch, Hann and Blackman:

There are straightforward corrections to make here: you need to increase the filter length by factors of 1.3, 1.5 and 1.9 when you go from rectangular to Welch, Hann and Blackman, in that order. (These are based on using a window centered well away from the DC bin, so they should probably be tweaked for a pure low-pass filter.)

7. Sorry… cross-eyed this morning when reading from my table of values: The factor 1.5 is for Hamming & Parzen (triangular). To more decimal places (just for completeness) you would use 1.33333, 1.66666 and 1.91667 ≈ 2 for Welch, Hann and Blackman.
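For what it's worth, these scale factors depend on which bandwidth definition the table uses (3-dB width, equivalent noise bandwidth, etc.). One easy measure to compute is the equivalent noise bandwidth, which gives the same ordering with somewhat different numbers (about 1.20, 1.50 and 1.73 for Welch, Hann and Blackman); a sketch:

```python
import numpy as np

def enbw(w):
    """Equivalent noise bandwidth in bins: N * sum(w^2) / sum(w)^2."""
    return len(w) * np.sum(w**2) / np.sum(w)**2

N = 4096
n = np.arange(N)
rect = np.ones(N)
welch = 1.0 - ((n - (N - 1) / 2) / ((N - 1) / 2))**2  # Welch (parabolic) taper
hann = np.hanning(N)
blackman = np.blackman(N)

for name, w in [("rect", rect), ("welch", welch),
                ("hann", hann), ("blackman", blackman)]:
    print(name, enbw(w))   # approx 1.00, 1.20, 1.50, 1.73
```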

16. Nick, I've been saying for years that all this should be done in dT/dt, no problem with that, though they do not present any reason why they are smoothing dT instead of T(t). I think this is just typical trendology.

If the requirement is for general-purpose LP filtering of dT/dt, a derivative of Gaussian seems the most suitable choice to me. Though one of your higher-order Welch filters would probably be equally good.
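A derivative-of-Gaussian filter of the kind Greg describes is a single convolution that differentiates and low-passes in one step; a minimal sketch (the width sigma, truncation and unit sample spacing are arbitrary choices):

```python
import numpy as np

def gaussian_derivative_filter(y, sigma, truncate=4.0):
    """Smoothed first derivative: convolve y with the derivative of a Gaussian."""
    r = int(truncate * sigma + 0.5)
    t = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2))
    # Derivative of the normalized Gaussian: G'(t) = -t G(t) / sigma^2
    kernel = -t * g / (sigma**2 * g.sum())
    # 'same' convolution; the ends are affected by the implicit zero padding
    return np.convolve(y, kernel, mode='same')

# On a pure ramp the interior output is the exact slope
y = 0.5 * np.arange(200.0)
d = gaussian_derivative_filter(y, sigma=5.0)
print(d[100])   # about 0.5, the slope per sample
```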

17. http://climategrog.wordpress.com/?attachment_id=209

Here is an illustration of why fitting trends in the presence of systematic variability is bunk (as well as being bad maths).

"Fitting a linear “trend” model to data from such a system, that is anything but linear in its behaviour, is invalid."
I don't think so. It gives you an estimate of the derivative at the centre of the range. Nothing wrong with that. It's only a problem if you think the derivative should be the same everywhere.

2. It gives you a number. Now you need to find out whether it is a reliable estimate of the slope.

http://en.wikipedia.org/wiki/Ordinary_least_squares#Assumptions
Non-autocorrelation: the errors are uncorrelated between observations: E[ε_i ε_j | X] = 0 for i ≠ j.

In my "cosine" example the errors will be strongly autocorrelated. Similarly in the presence of any significant systematic variation other than the supposed linear relationship being sought.

Yet again, the fundamental assumption is that deviations from the linear model are normally distributed. This can be allowed some flexibility, but a whacking great cosine goes well beyond "flexible".

Neither of the trends in my graph is legit, and the difference is spurious. The cosine is not going anywhere; there is no "accelerated warming".

3. Greg,
Differentiation doesn't involve any notion of error distribution. The error is secular, due to higher derivatives etc. Here, if anything, the model is differentiable function plus noise. You aim to differentiate the underlying and damp the noise.

18. What M&F did is essentially estimating the changes over the periods concerned. 15 year OLS trend is not very different from what could be determined by calculating the averages of the first 5 and last 5 years and dividing the difference by the time from the first period to the second period. For 62 year trend we could pick similarly 10-15 years from both ends. These values are perfectly meaningful and easy to understand. OLS trends are close to the same but have somewhat smaller statistical uncertainty as they use the information more optimally (at least assuming white noise).
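Pekka's comparison is easy to reproduce; on synthetic data (the trend, noise level and seed below are all illustrative), the 15-point OLS slope and the difference of the 5-point end means give nearly the same number:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(15.0)                       # 15 annual values
y = 0.02 * t + rng.normal(0, 0.1, 15)     # trend 0.02/yr plus noise

slope_ols = np.polyfit(t, y, 1)[0]

# Pekka's simple estimate: difference of the 5-year end means
# divided by the time between the block centers
first5, last5 = y[:5].mean(), y[-5:].mean()
dt = t[-5:].mean() - t[:5].mean()         # 10 years between block centers
slope_simple = (last5 - first5) / dt

print(slope_ols, slope_simple)            # similar values
```

On exactly linear data the two estimates coincide; with noise they differ slightly, with OLS having the somewhat smaller variance, as Pekka says.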

All estimates of derivative at a point discussed in the above can be described as follows:
- pick individual pairs of values symmetrically from both sides of the point considered and calculate the derivative from those values
- form a weighted average of these estimates.

For a given footprint OLS has the smallest statistical uncertainty of such estimates. Some other weights give smoother behavior and are less affected by the size of the footprint. If the underlying process is expected to satisfy some specific smoothness properties (or have some spectral properties) then some other choice than OLS is probably better. All other choices suffer from the larger footprint that prevents getting properly close to the end point of the full time series.
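Pekka's description can be made exact for a centred window: the OLS slope is precisely the k²-weighted average of the symmetric pairwise estimates (y_{+k} − y_{−k})/(2k). A quick check (window size and data arbitrary):

```python
import numpy as np

N = 7
t = np.arange(-N, N + 1)
rng = np.random.default_rng(2)
y = 1.5 * t + 0.1 * t**2 + rng.normal(0, 1, t.size)   # arbitrary series

slope_ols = np.polyfit(t, y, 1)[0]

# Pairwise derivative estimates from symmetric pairs, and their k^2 weights
k = np.arange(1, N + 1)
pair_est = (y[N + k] - y[N - k]) / (2 * k)            # index N is the center
slope_pairs = np.sum(k**2 * pair_est) / np.sum(k**2)

print(slope_ols, slope_pairs)   # identical up to rounding
```

Note the t² term drops out of both, as Pekka says: even terms cancel in every pairwise difference, and they are orthogonal to t over a symmetric window.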

1. To continue along the way I approach the issue.

Assuming that the underlying function can be represented by a Taylor expansion around the point considered, we wish to determine the coefficient of the first order term. All even terms cancel out in every pairwise difference. Thus the higher odd terms are the problem. The third order term distorts the estimate the more the further we get from the point considered and higher odd order terms do the same even more strongly.

In choosing the optimal weights of each pair we should take into account the relative strength of the noise and how large we think the total influence of the higher odd-order terms is - or, more generally, how large we expect the asymmetric deviation from the linear trend to be. Points where the expected deviations are much larger than noise should be excluded. Each pairwise estimate should have a weight that reflects the combined effect of noise and expected asymmetric deviation from the linear trend (and that also takes into account the denominator of the trend calculation).

2. Pekka,
The paired differences that you describe can each be expressed as a cumulative sum of simple differences. So their weighted sums can be expressed as weighted sums of such differences. You're really asking that that sum should be orthogonal to cubic, quintic etc. There are plenty of dof to meet that and still have good smoothing.

But I don't think it solves the main problem. In effect, in the frequency domain (FD) it makes the early part more exactly linear. But the main deviation from linearity comes when attenuation for HF starts. At that point, the response is neither a proper differentiator nor small. That is why there is an evident band pass character in the earlier time series. A more exactly linear rise would not help there.

3. Nick,
I'm not convinced that the temperature time series have properties that can be considered better in the frequency domain than in the time domain. That's possible, but my preference is to make the method maximally transparent in the time domain even if that leads to somewhat worse properties in the frequency domain.

We may also ask several different questions that are best served by different approaches to the analysis.

Very simple and robust methods tend to be best, when very little can be assumed about the underlying signal. Frequency domain is useful, when assumptions are better justified for the spectral properties than for effects seen directly in the time domain.

4. Which approach is better (more pleasing) depends partly on a person's experience, as well as which is more clumsy for a given problem.

I'm not sure here exactly which properties are better described with the time domain analysis. It is certainly much easier to read the transfer function (normally expressed as the complex ratio of the velocity spectrum of the output to the velocity spectrum of the input) to figure out what is happening on the shoulder (rolloff region) of the filter.

The fact the phase is varying tells us that any wiggles you see are specious. They amount to a summation over varying amplitudes of the signal with varying phase shifts. In other words, just 'junk'.

You might find it pleasing to leave these wiggles in your original series, but generally the trained eye is taught to ignore any of the variations that are more than 50% of the roll-off frequency of the signal.

So if you're interested in variations that are say 1/5 the frequency of the roll-off (say ENSO using a 13-month running-average filter), no problem.

But it doesn't really inform you about the high-frequency variability of the signal other than in some very primitive manner.

If I wanted to exhibit this in a time series in a more clean fashion, I'd use varying filter widths (e.g., 5-year, 2-year and 1-year) for example, but using a filter with a much higher roll-off (I use 4th order Butterworth typically).

The running average shouldn't be eschewed in my view, but it is still a very poor filter.

5. Thanks Carrick. How do you implement your Butterworth? IIR, I would guess. How do you determine where it has converged? How much of the TS do you lose at each end?

6. Yes, I use an IIR implementation.

I do it "forward and backwards" so really it's an eighth order acausal zero-phase implementation. I typically use reflection about the end points, and the low-pass, high-pass portions of the filter are set by specifiying the 3-db corner frequencies. Depending on the application, I will subtract the DC offset, then add it back after filtering (this is a flag).

What I do with the forward-backwards + end-point reflection is similar to what's done with MATLAB's filtfilt.

[However, I'm not as clever in my code about how many points to retain in the reflected region.]
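Carrick's forward-backward Butterworth scheme can be sketched with SciPy, whose `filtfilt` defaults to the same odd reflection about the end points; the corner frequency and test data here are illustrative, not his actual settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 12.0                      # monthly data: 12 samples per year
corner = 0.1                   # 3-dB corner at 1/(10 yr)
b, a = butter(4, corner / (fs / 2))   # 4th-order low-pass Butterworth

t = np.arange(1200) / fs       # 100 years of monthly samples
# Slow (50-yr period) component plus fast (6-month period) clutter
y = np.sin(2 * np.pi * 0.02 * t) + 0.5 * np.sin(2 * np.pi * 2.0 * t)

# Zero-phase smoothing: filter forward, then backward (8th-order magnitude).
# padtype='odd' reflects the series about its end points, as Carrick describes.
smooth = filtfilt(b, a, y, padtype='odd')
```

Away from the ends, the slow component comes through essentially untouched and the fast component is removed; the first and last stretch carry whatever the end-point assumption implies, which is the point under discussion.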

Here's an example of the output:

The light blue (cyan) line is the 4th order Butterworth smoother output. Because of the way it is implemented and because I don't go beyond fourth order, convergence/stability is never a problem.

7. Thanks, What is the corner frequency you used there?

Eyeball check it looks OK except for first and last 15-20y. But to be more rigorous about this and how well it works theoretically, ie with any data sample, you need to ask questions about convergence not just state it's "never a problem".

Specifically, all IIR filters have a spin-up. A bit like climate models, they are iterative and need time to settle, or converge, to the correct answer. If you want to specify, say, 1% accuracy, this can be very long on a fixed data set like climate, where the filter period is a significant proportion of the sample length, e.g. a 15-y LP on 150 y of data.

It is analogous to having a FIR truncate the ends of data, except that the user of a IIR filter needs to determine the convergence limit and crop the data himself.
This makes it easy to overlook the issue.

IIR filters are typically used in electronics and audio applications where the data is a continuous stream, and this spin-up is just a turn-on delay and not a problem. It cannot be ignored in TS analysis.

There is also the question of frequency-dependent phase distortion, because an IIR filter is backwards-looking, not symmetrical like most FIR low-pass filters. The phase effect of running back and forth needs to be considered. In the region where both backward and forward runs have converged, it will probably cancel out correctly; for the rest, not so.

It also seems that back-and-forth approach is intended to address spin-up too, applying the converged filter to the early incorrectly filtered data on the return run. I'm not sure that this is valid and have not seen it discussed.

The early data of the spin-up period is not just unfiltered, it is wrong. Corrupted data. Passing it through a fully spun-up filter will remove any unfiltered low frequencies but will not correct the data corruption. Similarly, running the correctly filtered end section through a non-converged filter will corrupt what was good.

So the result will have similarly distorted sections at each end. This is analogous to the FIR solution not being able to run up to the ends of the data, except that it is more pernicious.

All these techniques like reflection about end points are speculative and assume telepathic knowledge of what the system did before it was measured and what it will do in the future.

This is clearly unscientific speculation and IMO should be avoided in any serious work.

8. Greg, I used 1/10-years for the 3-dB corner for that example.

As to accuracy, the purpose of the smoothed curve here is to put a line on a graph to aid the viewer in tracking the long-term variation in the data. And that's it, it serves no other purpose.

A smoothing filter used in this way is a visual device, one that obviously distorts the data (it is throwing away valid data after all), so I wouldn't use the smoothed data for anything other than just visualization purposes.

If you don't think it's necessary to aid the viewer, you don't include the smoothed curve on the graph. In other words, putting it there is a subjective decision (so you can't even argue whether another person should put it there, because that would be like saying "blue" is better than "green").

The curve doesn't have to be more accurate than the thickness of the line, but other than near the end points (which are significantly influenced by the decisions made near the end-points), the accuracy is clearly much better than the thickness of the curve. If I care about accuracy near the edges (I should), more typically I will chop the data within one over the corner frequency of the filter from each end, or use a sensitivity analysis to figure out how far from the edges I can go using different assumptions about the end-points.

But I'm afraid you have a significant misunderstanding of what we are doing and why here, when you say:
All these techniques like reflection about end points are speculative and assume telepathic knowledge of what the did before the system was measured and what it will do in the future.

This is clearly unscientific speculation and IMO should be avoided in any serious work.

There is no right answer near the edges, but not padding is exactly equivalent to padding with zero for IIR filters.

In other words there is no way not to make assumptions in this case:

Padding with zero is obviously not a terrible choice for band-pass measurements, but clearly a much worse assumption than reflection about x, reflection about the end points, or even simply the mean.

In other words, padding with zero for recursive (low-pass) smoothing filters is complete science fiction.

So you have to make assumptions for smoothing filters, otherwise you are dealing with gross inaccuracies near the boundaries. And for IIR filters these large errors propagate throughout the filtered data.

I like to think of the endpoint specification in analogy to splines, where you have to include specification of the derivatives at the boundary. If you have a temperature that stays about 20°C for example, nobody in their right mind is going to specify that it go to zero outside of the region where you have data when interpolating the data near the boundary points.

Using reflection about the end-point, then truncating at the endpoint has the effect of enforcing that the trend at the end-points are preserved.

So you have to make assumptions with IIR, and not padding is definitely a worse assumption than any reasonable end-point assumption that you might make.

9. Thanks for the detailed reply.

"But I'm afraid you have a significant misunderstanding of what we are doing and why here"

No, I see the aim is to prime the run before it actually hits the data, to reduce the errors during the convergence period. But you did not mention the convergence question in your reply.

Ultimately you can not be sure of the output until it has converged. You are attempting to reduce the deviations in the non-converged region by priming it and hoping it does not matter. I doubt that 1/f is realistic even for a very low order filter.

I went into all this for TS analysis hoping to find a solution where a longer-period FIR filter left me with very little data, and found that if you wanted an accurate result, spin-up was just as much a problem as end cropping in FIR.

19. Pekka: "I'm not convinced that the temperature time series have properties that can be considered better in the frequency domain than in the time domain."

since one of the conditions for OLS to be the statistically 'best estimator' is no autocorrelation, and temperature is highly autocorrelated, that is another reason for not fitting trends to T(t).

The white noise, if we are to make that assumption, is in dT, not in T(t).

20. Greg: since one of the conditions for OLS to be the statistically 'best estimator' is no auto-regression and temperature is highly autoregressive, that is another reason for not fitting trends to T(t).

Well, no. That's the reason you have to check for the effects of autocorrelation in the noise on your filter. E.g., "realistic Monte Carlo". I haven't found any cases where it makes much of a difference, so I haven't explored it further. The general theorem is that you still have a linear unbiased estimator, but it's no longer the most efficient, so it's LUE but not BLUE.

If you need to model the autocorrelation, generalized least squares methods exist which can handle it. Nick has some posts up on how to do this. The fastest numerical implementations use a modified Cholesky decomposition. I thought there was a Wiki article that discussed it, but I couldn't find it immediately.

Anyway you can find quite a bit to chew on by searching for the terms:

generalized least squares cholesky autocorrelation
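A minimal sketch of the Cholesky-whitening approach, with an assumed AR(1) error correlation (the covariance model and rho are illustrative, not a recommendation for any particular series):

```python
import numpy as np

def gls_slope(t, y, rho):
    """GLS trend under an assumed AR(1) error correlation with parameter rho."""
    n = len(y)
    X = np.column_stack([np.ones(n), t])
    # AR(1) correlation matrix: Sigma_ij = rho^|i-j|
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    L = np.linalg.cholesky(Sigma)
    # Whiten both sides (solve L z = X and L w = y), then ordinary least squares
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta[1]
```

With rho = 0 this reduces to plain OLS; only the relative structure of the covariance matters, so a correlation matrix suffices.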

1. Thanks for your comments. Yes, WonkyPedia is a moving target. I thought there was more on OLS there but it seems to have changed since last time I looked.

There are other techniques but they usually require more specific knowledge about the error structure which is typically not available in the case of climate data.

My point is that it is best to remove the autocorrelation if possible, and to a large extent taking the derivative of temperature achieves this. One of the good points of M&F is that they are effectively using dT/dt. It's a shame they corrupted the data with a poor choice of filters, to the point where they drew incorrect conclusions.

2. For those here who seem to have difficulty grasping the importance of negative lobes, this is not a pedantic detail. Here is the late-20th-century section of the CMIP5 ensemble mean compared to its 15-y running average.
https://climategrog.files.wordpress.com/2015/03/cmip5_180morm.png?w=800

The El Chichon peak is neatly inverted and the Mt Agung one disappears !

Around El Chichon - Mt Pinatubo period the variation in the mean is almost in anti-phase with the unfiltered model output.

Now if one is to mangle the data in that way, it may be that there is no "traceable imprint" of TCS in what is left. The conclusions of M&F 2015 based on 15-y trends are ill-founded and worthless. Worse, they are misleading.
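The negative lobe is straightforward to quantify: a boxcar average of length L has frequency response sinc(L·f), which is negative between its first two zeros, so a 15-y running mean passes 10-y variability inverted at about a fifth of its amplitude. A sketch:

```python
import numpy as np

L = 15.0                          # running-mean length in years
periods = np.array([30.0, 15.0, 10.0, 7.5])
# Response of a boxcar average at frequency f = 1/period is sin(pi f L)/(pi f L);
# np.sinc(x) = sin(pi x)/(pi x), so evaluate it at L/period
response = np.sinc(L / periods)

for p, r in zip(periods, response):
    print(f"{p:5.1f} yr: {r:+.3f}")
# approx: 30 yr +0.64, 15 yr 0.00 (first null), 10 yr -0.21 (inverted), 7.5 yr 0.00
```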

3. When you say My point is it is best to remove the autocorrelation if possible and to a large extent taking derivative of temperature achieves this

But that is exactly what we are doing here. The question about the efficiency of the filter is a valid one.

What I think you are advocating instead, which is first-difference followed by smoothing, is actually much less efficient than the filter we're using.

Remember, the only valid criticism of the moving-least-squares filter implemented using OLS is that it is inefficient. We shouldn't replace it with another filter that is even less efficient.

Some hay can be made using a generalized least-squares filter instead of OLS, so that's the direction I would suggest people look.

4. Efficiency is only a criterion if you are seeking to establish a linear relationship. If you are simply using a "trend" as a poor man's low-pass filter I would suggest choosing a better filter.

There is no objective statistical reason to be fitting a linear model to a system that has major non-linear forcings (such as erratic eruptions, solar, etc.) and a notable chaotic component with multiple feedbacks, both +ve and -ve.

That is why I say climatology needs to get away from "trends". It is always an implicit and unjustified assumption that it has some meaning and it does not. If you want to "smooth" the data you are wanting a LP filter. Then go and design one, do not start fitting arbitrary models to the data.

5. If, on the other hand, you want to suggest that you are trying to detect a linear rise due to CO2 forcing, then you do need to address the issues of bias presented by the other non-linear variability that is _known_ to be in the data, and establish that any "trend" is not a result of such biases or simply a random walk due to stochastic variability.

The best way to address the latter question is to analyse dT and not the time series itself.

In 30y of waffle about "trends" I don't recall seeing any papers addressing these issues before doing trend analysis, they just assume there is a trend and that OLS will accurately detect it.

My suggestion of banning discussion of trends is intentionally provocative but it would at least force those producing the multitude of banal and misleading "studies" based on trend to start doing some properly considered data processing.

6. "establish that any "trend" is not a result of such biases"
These are numerical procedures. They can't distinguish causes - CO2 or whatever. If there is a "bias" adding to the trend, well, that's what it is.

I'm basically advocating that trend be seen as a derivative estimate. It might indeed be better to call it that.

7. "The El Chichon peak is neatly inverted and the Mt Agung one disappears !"

Not what I see. I see a large lag filter applied that wipes out all the fine structure as it rolls over the details.

This reminds me of the conversation: patient:Doc it hurts when I do this. doctor: Then don't do it.

The generalizing that can be applied is that the filtering needed really depends on what you are trying to do.

What I am trying to do is model the temperature time series with known factors

http://imageshack.com/a/img909/5498/nEjPVr.gif

The amount of filtering required needs to be tailored to capture the features of the data and of the model at the same level of detail. For this particular application, I need a filter that removes the seasonal influences, which in the simplest case is a 12 month window.

8. "Not what I see. I see a large lag filter applied that wipes out all the fine structure as it rolls over the details."

All these filters: running mean, sliding trend & co. are symmetric kernels that do not produce a phase shift in the data.

You may be fooled into thinking this is a lag, which would be a possible hypothesis based on eye-balling that graph. However, theory proves you are mistaken. This is the effect of the negative (inverting) lobe. A 15-y RM inverts variability around 10 y.

It does take out most of the HF, which would be fine if it had not screwed the remaining signal.

9. Greg,
In your argument you imply the assumption that certain behavior in the frequency domain is in some way fundamental for the GMST, but that's only an assumption. It may equally well be that what we see in the time domain is more fundamental. In the latter case your argument makes no sense at all. The running mean does not have any negative lobes in the time domain.

My intuition tells that GMST does not have such properties that would make your argument correct. On this point I side with WHT.

10. Pekka, you are advancing some very loose arguments and 'intuition' here. I suggest you stick to theoretically defensible reasoning.

If a filter has negative lobes in the FD you can be sure it is doing odd things in the TD; you cannot choose to close one eye. Both FD and TD are equally valid. Sometimes one may reveal things which are not so obvious in the other.

Look again at the graph WHT was commenting on:
http://climategrog.files.wordpress.com/2015/03/cmip5_180morm.png?w=800

He "can't see" the inversion, he thinks it's a lag. Again the error of eye-balling the graph and applying intuition. Simple theory tells us that there is no lag resulting from use of a symmetric kernel. That interpretation of the obvious distortions in the time series is erroneous. FD shows us where the neg. lobe lies, it inverts 10y variability. The effect on the TD then becomes clear.

I am not making any assumptions about the system, this data could come from anywhere and same things would be true.

In using FD as well as TD I'm looking at the data from two different angles ( orthogonal in this case ). Viewing the same thing from different angles is often helpful in understanding its form.

You are suggesting closing one eye and relying on intuition. In view of the knowledgeable and insightful remarks I'm used to seeing from you, I find that position surprising.

Greg

11. Greg,

Sticking to theoretical arguments has a point only if there are valid reasons to expect that the preconditions of those theoretical arguments are met.

My claim is that nobody has shown that they are met and my intuition tells that they really are not met.

As long as nobody has shown or even provided weak evidence to support that they are met, I consider my intuition significant.

21. Greg: Efficiency is only a criterion if you are seeking to establish a linear relationship. If you are simply using a "trend" as a poor man's low-pass filter I would suggest choosing a better filter.

This is completely wrong. Efficiency is different than linear.

Efficiency is a measure of how much variance you can explain with a particular fit. For uncorrelated noise, the Gauss-Markov theorem establishes that OLS is the best linear unbiased estimator.

If you feed correlated noise into the derivation, you now find there is an additional term that would have canceled in the absence of correlation that now fails to cancel. Hence OLS is no longer "best" (which refers to efficiency).

This means that OLS based filters will do worse than generalized LSF filters when you have correlated noise. But they will still be linear and unbiased.

So saying we need to replace running OLS filters by smooth difference for example is not a good argument, if the operator you want to replace them by is numerically less efficient.
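Carrick's point (still linear and unbiased, just not "best") can be quantified exactly for an assumed error covariance: the sampling covariance of OLS under correlated noise is (XᵀX)⁻¹XᵀΣX(XᵀX)⁻¹, versus (XᵀΣ⁻¹X)⁻¹ for GLS. A sketch with an illustrative AR(1) covariance:

```python
import numpy as np

n, rho = 60, 0.6
t = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), t])   # intercept + trend design

# Assumed AR(1) error covariance (up to a common scale factor)
Sigma = rho ** np.abs(np.subtract.outer(t, t))

# Exact sampling covariance of the OLS estimator under this noise:
# still unbiased, but the Gauss-Markov optimality is lost
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv

# Sampling covariance of the GLS (BLUE) estimator
cov_gls = np.linalg.inv(X.T @ np.linalg.solve(Sigma, X))

print(cov_ols[1, 1], cov_gls[1, 1])   # slope variances: OLS exceeds GLS
```

The gap between the two slope variances is the efficiency loss under discussion; both estimators remain unbiased regardless.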

1. I was not suggesting that efficiency was linear. It is the trend that I was referring to, which is linear.

If you use an OLS regression to determine the "trend" on a 15y segment of data with a volcano in it, it will give you an estimation of the average rate of change, though this is the average of the whole segment. I'm unclear why Nick wants to see this as representing the derivative at the mid point more than any other part of the data.

So if this is supposed to be a fit to detect the underlying CO2 forcing then it is not at all efficient and is heavily biased by the significant systematic variability that has nothing to do with the linear relationship which is being sought.

This is a reflection of the underlying paradigm that climate is AGW + "internal" noise (where "noise" carries the unstated assumption that it is random). If that were the case there might be a chance of it working. However, we know that there is significant systematic variability, so the use of a linear model and OLS to detect it is not applicable.

That is the case of linear "trend" as a fit to a ln(CO2) forcing model. It does not work because the conditions for it to work are not met.

The other option is "trend" as a LP filter and we are all agreed (with the exception of Pekka ) that it is not a very good one. It would be easy to find a more suitable filter by specifying this is what is desired and defining some design criteria to be met.

There is a corner case where an overriding need to get the last few years may mean accepting some fairly ugly inversions in the data.

In general, it is not a valid way of detecting CO2 forcing and it is a very poor LP filter.

2. OLS based linear trend certainly does not describe GMST changes very well, but for many purposes it may still be the best choice, when we cannot objectively show that any other alternative is better.

3. Well now we are agreed that fitting a trivial linear model does not work very well, I would suggest this implies we need to try harder, not to accept the one we've found to be unsatisfactory.

This is similar to saying warming must be due to AGW because we haven't managed to model the system more satisfactorily.

4. "I'm unclear why Nick wants to see this as representing the derivative at the mid point"
Imagine a Taylor series expansion about the mid-point. The trend operator, and its smoother versions, are odd functions. So even powers operated on give zero. The first term that perturbs the first-order expansion is the third. The trend operator is accurate to that order.

At any other point, that symmetry does not work. There is a second order error.

To put it another way, if you use the sliding trend to differentiate a parabola, you get exactly the mid-point derivative.
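That parabola check is quick to run: for an arbitrary parabola y = a·t² over a window centred at t0 (values below are arbitrary), the OLS slope equals the mid-point derivative 2·a·t0 exactly:

```python
import numpy as np

a, t0, N = 0.7, 5.0, 10          # arbitrary parabola and window centre
tau = np.arange(-N, N + 1)       # window offsets around t0
y = a * (t0 + tau)**2            # parabola sampled over the window

slope = np.polyfit(tau, y, 1)[0]
print(slope, 2 * a * t0)         # both 7.0: sliding trend = mid-point derivative
```

The tau² term contributes nothing to the slope over a symmetric window, which is the even-powers-cancel argument above in miniature.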

5. Thanks Nick, I suspect that is a non representative special case. What would be the result of doing it to a cosine with period 11.5 times less than the window width? Does it still give the derivative of the cosine evaluated at the midpoint?

Now create an AR1 time series, does it give the derivative of midpoint?

I have not run this, but my guess is no. I think it gives an average value for the whole windowed sample.

6. You are probably correct that it ( on average ) estimates the trend at the mid point better than the trend at other points. But what I'm saying is that is it not an estimation of any single point but of the entire period.

7. Greg,
"You are probably correct that it ( on average ) estimates the trend at the mid point"
No, I said it estimates the derivative at the mid point. Derivatives exist at a point, not over a period.

If you're trying to estimate a derivative, you'll take (if noise permits) an interval in which the Taylor series can reasonably be truncated. If you can reasonably get it down to second order, then that has the property, indicated above, that the average trend over a period is the mid-point trend.

If higher derivatives don't become small, then you'll get a poor estimate of the derivative. But it's the mid-point you are trying for.

22. Nick: I'm basically advocating that trend be seen as a derivative estimate. It might indeed be better to call it that.

I think it's sufficient to define trend of a quantity as the time derivative estimate of that given quantity. Then move on. ;-)

1. M&F are using the trend figure as a temperature difference over the period, ie the transient climate response dT caused by radiative changes over that period. This seems to correspond to Nick's integral of derivative in his working. They are not actually using it as a derivative.

They explicitly state that they ignore the time lag between forcing and effect. They explicitly assume the rest can be shunted off into what they erroneously call a "random" error term. This is equivalent to assuming instantaneous equilibration, and the errors of this simplification get shifted into the "random" error term too.

Santer et al 2015 quotes others as showing models display time-constants of 30-45 months. Taking 3 time-constants as approximate equilibration means 7.5-11.25 years to settle. Their 62-y processing may roughly justify crudely ignoring the lag; it certainly will not work for 15 y, where it will introduce significant error.

They then do their dT vs dF OLS regression. The errors are not random and bias the result. Their results are then such a jumble that the effects of model TCS are masked by the 'noise' they induced by their processing.

Saying this is the best we can do, or that OLS is still "efficient", does not help the fact that the OLS processing ignores the basic axioms of how it can be applied, and leads to meaningless and misleading conclusions.

23. Firstly, I'm pretty much out of my depth here. But a smoothing filter which seems to work OK is Hodrick-Prescott (available as an Excel add-in).
On HadCRUT4 data, choosing an appropriate filter value, this is a comparison with the filter above:
Note how the ends match the data better.

http://s29.postimg.org/gj8awbojb/hp_filter_compared.png
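For reference, the Hodrick-Prescott trend solves (I + λ·DᵀD)·τ = y, where D is the second-difference operator, so any straight line passes through exactly, which is part of why the ends behave well. A minimal dense-matrix sketch (λ and the data are whatever you choose):

```python
import numpy as np

def hp_trend(y, lam):
    """Hodrick-Prescott trend: minimize |y - tau|^2 + lam * |D2 tau|^2."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]   # second-difference row
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)
```

Since D annihilates linear series, a pure linear input is returned unchanged for any λ; for long monthly series a sparse solver would be the practical choice, but the dense version shows the structure.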

1. thefordprefect --
Looks like an excellent smoother. I'm curious, though -- the (Excel add-in's) web page suggests lambda=14400 for monthly data. It appears you used 180K instead. Why did you choose that? or, what kind of response do you see when using the 14400 value?

2. The influence of the endpoints depends on the bandwidth of the filter and the end-point conditions. I used reflection around a point and a 10-year smoother.

If you give me the bandwidth of your filter and the endpoint conditions, I can visually match it with virtually any other filter choice (including tapered windows).
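[The "reflection around a point" end condition Carrick mentions can be sketched as follows. This is my guess at the construction (odd reflection about the end values, so a locally linear trend continues smoothly past the boundary); the function name and pad length are illustrative:]

```python
import numpy as np

def reflect_pad(y, m):
    """Pad m points at each end by odd reflection about the end values:
    left[i] = 2*y[0] - y[m-i], right[j] = 2*y[-1] - y[-2-j]."""
    left = 2 * y[0] - y[m:0:-1]
    right = 2 * y[-1] - y[-2:-m-2:-1]
    return np.concatenate([left, y, right])

# For a linear ramp, odd reflection simply extends the line:
padded = reflect_pad(np.arange(10.0), 3)
```

The padded series can then be run through any smoother, and the padding trimmed off afterwards, which is one common way of taming end effects.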

3. HaroldW, the filtering level was chosen to match Carrick's; included in the new plot are various filtering levels between 1e3 and 2e8.

Carrick - no idea what the bandwidth is! This sort of smoothing, I would suggest, is simply for viewing; analysis gets silly if you remove real peaks with filtering. With this in mind, the Hodrick-Prescott filter can be tuned by changing one number. As you can see from the plots, the end points are handled reasonably without adjustment at most filtering levels.

Even at 2e10 the smooth line is not a bad fit to data

http://s10.postimg.org/6b9yn4gll/hodp_at_2e10.jpg

4. thefordprefect: this sort of smoothing, I would suggest, is simply for viewing; analysis gets silly if you remove real peaks with filtering.

It's not clear what you mean by "real peaks". If you don't want to remove "real peaks" you don't smooth.

With this in mind the Hodrick P filter can be tuned by changing one number.

So can a 4th order Butterworth. You select the 3-dB roll-off and that's it.

As you can see from the plots, the end points are handled reasonably without adjustment at most filtering levels

Look, the algorithm is obviously making a choice regarding end-points, otherwise it has to truncate the filter before it reaches the edges. So just because you aren't personally enforcing an end-point decision, doesn't mean one isn't being enforced.

Beyond that, which filter a person uses is pretty subjective, if all you're using it for is graphical exploration of the data.

In the absence of an objective basis for picking a particular filter, it's entirely a matter of personal taste, I'm afraid.
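[Carrick's Butterworth alternative above takes one line with scipy. A hedged sketch, assuming monthly data and a 3-dB roll-off at one cycle per 10 years (the series here is a synthetic random walk, not HadCRUT):]

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 12.0                          # monthly data: 12 samples per year
f3db = 0.1                         # 3-dB roll-off: one cycle per 10 years
b, a = butter(4, f3db / (fs / 2))  # 4th-order low-pass; cutoff as fraction of Nyquist

rng = np.random.default_rng(1)
y = np.cumsum(rng.standard_normal(600))  # synthetic 50-year monthly series
smooth = filtfilt(b, a, y)               # zero-phase (forward-backward) filtering
```

Note that `filtfilt` also makes an end-point choice for you (by default it pads with an odd reflection of the data), which illustrates Carrick's point that some end condition is always being enforced, whether you pick it or not.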

5. Sorry I meant to say "otherwise it has to truncate the [filtered data] before it reaches the edges"

6. Well, the data is really smooth afterwards, so it must be doing its job really well. It's a great "smoother" :roll:

Looks like more econometrics stuff. They don't worry about silly sciency things like frequency response, cut-off frequency or pass-band, it just "smooooths". They talk about removing a cyclic variability but I don't see it targeting a specific period. The second term is first diff of first diff, ie an acceleration term.

Perhaps one needs to refer to the original authors' paper rather than the chumps offering an Excel plug-in.

From their recommended lambdas it appears that 100 for yearly data may indicate a 10y "smooth", whatever that means. So it appears that the value of 180k may equate to a 35y "smooth".

It definitely looks better than a running mean, but that is not saying much. May be worth closer inspection, but I'd want a proper description of what it is supposed to do, and a frequency plot, if such a thing is defined for this filter.

I suspect this is one of those 'filters' which has a frequency spectrum that changes with the data.

7. http://s10.postimg.org/6b9yn4gll/hodp_at_2e10.jpg

OK, with that high value of lambda I think I can see what this is doing: it is a regression on a mixed linear + quadratic model, and the lambda determines the weighting it gives to each. I think what you are doing here with the very large lambda is fitting a pure acceleration quadratic to the data, imposing 1850 as the starting point.

As always, the question then becomes why are you fitting that model. There may be a valid reason to do that but you need to realise that is what you are doing and say why you would want to do it.

8. Greg: As always, the question then becomes why are you fitting that model. There may be a valid reason to do that but you need to realise that is what you are doing and say why you would want to do it.

Exactly. If it's just for presentation purposes, it becomes a question of what you can explain or defend as the presenter.

If it's part of an analysis, you usually aren't trying to smooth so much as feature extract. In that regard, secular variability is a legitimate candidate for feature extraction; but as for smoothing just to remove the high-frequency portion of the data, there really are few places where you'd do this.

9. Few places?

Well, if the HF is truly random, regression-based techniques can usually handle it and there's no need to filter. However, if you have a strong annual cycle, probably an order of magnitude greater than the rest of the variability that you are trying to study, this will decorrelate everything else.

If you are re-sampling to annual and there may be sub-annual periodic variability, due to lunar tides for example, a proper anti-aliasing filter is required. This should have been done before re-sampling to monthly too. Taking monthly averages once again assumes everything is normally distributed.

There is this blind simplistic notion that everything is "random" when we know very well this is not the case.
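[Greg's anti-aliasing point above can be sketched with scipy's `decimate`, which low-pass filters before subsampling. A minimal illustration, assuming monthly data with a strong annual cycle being reduced to annual resolution (the series is synthetic):]

```python
import numpy as np
from scipy.signal import decimate

rng = np.random.default_rng(3)
t = np.arange(600)  # 50 years of monthly samples
# Strong annual cycle plus a little noise
monthly = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(600)

# decimate applies an anti-aliasing low-pass filter, then takes every 12th sample;
# naive slicing monthly[::12] would alias the annual cycle into the annual series.
annual = decimate(monthly, 12, ftype="fir")
```

The annual cycle sits above the new Nyquist frequency, so the filter largely removes it; simple subsampling without the filter would fold it back into the low frequencies instead.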

10. Greg: Well if the HF is truly random regression based techniques can usually handle it and there's no need to filter. However, if you have a strong annual cycle that is probably an order of magnitude greater than rest of variability that you are trying to study this will decorrelate everything else.

Generally there is no reason to smooth to remove the annual cycle. If you are studying a process that is bandwidth limited and does not overlap with the annual cycle, the basis functions associated with your optimization process will not be sensitive to the annual cycle either. This is easy enough to see by a simple extension of the Wiener-Khinchin theorem. So if there is correlated noise and it is "high frequency" compared to the signal of interest, it does not affect the estimate of the model parameters, either the central value or the error in the estimate.

It's only when there is spectral overlap between your basis function and the correlated noise that there is an effect: it makes the estimate noisier. However, a smoothing filter followed by OLS does not remedy this problem; it is not "the best" linear unbiased estimator.

Instead, you have to model the correlated noise in this overlapping band using e.g. generalized least squares, which, by an extension of the Gauss-Markov theorem, can be shown to be the best linear unbiased estimator for that model.

I said "few places", because there really are a "few places" where smoothing the data before processing it improves the fidelity of the measurement. But not many. And generally it's still not the "best" approach to take in this situation.
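[Carrick's GLS suggestion can be sketched in numpy for the simplest case of AR(1) errors, where GLS reduces to the Prais-Winsten whitening transform followed by OLS. Everything here is illustrative: the AR coefficient is assumed known, and the trend series is synthetic.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, beta_true = 200, 0.8, 0.02
t = np.arange(n)

# Synthetic series: linear trend plus AR(1) correlated noise
e = np.zeros(n)
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.standard_normal()
y = beta_true * t + e

X = np.column_stack([np.ones(n), t])  # intercept + trend design matrix

# Plain OLS (unbiased, but not efficient with correlated errors)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# GLS via AR(1) whitening: quasi-difference the data so the transformed
# errors are white, then run OLS on the whitened system.
yw = np.empty(n)
Xw = np.empty_like(X)
yw[0] = np.sqrt(1 - rho**2) * y[0]
Xw[0] = np.sqrt(1 - rho**2) * X[0]
yw[1:] = y[1:] - rho * y[:-1]
Xw[1:] = X[1:] - rho * X[:-1]
beta_gls = np.linalg.lstsq(Xw, yw, rcond=None)[0]
```

In practice ρ would itself be estimated (e.g. iteratively from the OLS residuals), which is what packaged GLS/Cochrane-Orcutt routines do.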

24. Meant to say - the filter compared is Carrick's, March 4, 2015 at 11:24 AM.