moyhu: Regression as derivative

Saturday, February 21, 2015

Regression as derivative

In two recent posts here and here, I looked at a moving OLS trend calculation as a numerical derivative for a time series. I was mainly interested in improving the noise performance, leading to an acceleration operator.

Along the way I claimed that you could get essentially the same results by either smoothing and differentiating the smooth, or differencing and smoothing the differences. In this post, I'd like to develop that, because I think it is a good way of seeing the derivative functionality.

This has some relevance in the light of a recent paper by Marotzke et al, discussed here. M used "sliding" regressions in this way, and Carrick linked to my earlier posts.

Integrating by parts

My earlier derivation was for continuous functions. If we define an operator:
R=t/X, t from -N to N, zero outside
and X is a normalising constant, then the OLS moving trend is
β(t) = ∫R(τ)y(τ+t) dτ
where ∫ is over all reals (OK since R has compact support). X is chosen so that y = t has unit trend: ∫R(τ)*(τ+t) dτ = 1.
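
A minimal discrete sketch of this definition, assuming NumPy and unit time steps. The names `ols_kernel` and `sliding_trend` and the synthetic series are illustrative, not from the posts. It checks that the weighted sum with R(j) = j/Σj² reproduces an explicit least-squares slope in each window.

```python
import numpy as np

def ols_kernel(N):
    """Weights R(j) = j / sum(j^2), j = -N..N, for the centred OLS slope."""
    j = np.arange(-N, N + 1)
    return j / np.sum(j**2)

def sliding_trend(y, N):
    """beta(i) = sum_j R(j) y(i+j) for every i that has a full window."""
    # np.correlate slides the kernel along y without flipping it
    return np.correlate(y, ols_kernel(N), mode="valid")

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))   # synthetic series, for illustration only
N = 7
beta = sliding_trend(y, N)

# Explicit straight-line fit in each window, for comparison
j = np.arange(-N, N + 1)
beta_fit = np.array([np.polyfit(j, y[i - N:i + N + 1], 1)[0]
                     for i in range(N, len(y) - N)])
max_diff = np.max(np.abs(beta - beta_fit))
```

The two agree to rounding error because the window abscissa j has zero mean, so the usual slope formula reduces to Σ j y / Σ j².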

I'll define W(t) = -∫_∞^t R(τ) dτ = ∫_t^∞ R(τ) dτ (modified following suggestion from HaroldW, thanks), and use D=d/dτ, so DW = -R, and W=-D^-1R. Then
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ,
or, β(t) = ∫R(τ)y(τ+t) dτ = ∫W(τ)Dy(τ+t) dτ

Now W is a standard Welch taper.

It is the cumulative integral of R, taken from the upper end. Since R is odd, its integral over the whole range is zero, so W is a quadratic that starts from zero at -N, returns to zero at N, and is zero outside that range. That establishes our first proposition:
β(t) = ∫W(τ)Dy(τ+t) dτ
ie a Welch-smoothed derivative of y.
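
For concreteness, the constants can be worked out from the definitions above. This is a worked version under those definitions (R = τ/X on [-N, N], W(t) = ∫_t^∞ R dτ), not something spelled out in the earlier posts:

```latex
\begin{aligned}
\int_{-N}^{N} \frac{\tau}{X}\,(\tau + t)\,d\tau &= \frac{2N^3}{3X} = 1
  \quad\Rightarrow\quad X = \frac{2N^3}{3},\\
W(t) &= \int_t^{N} \frac{\tau}{X}\,d\tau = \frac{N^2 - t^2}{2X}
  = \frac{3}{4N}\left(1 - \frac{t^2}{N^2}\right), \qquad |t| \le N,\\
\int_{-N}^{N} W(\tau)\,d\tau &= \frac{3}{4N}\left(2N - \frac{2N}{3}\right) = 1 .
\end{aligned}
```

So W is the Welch shape 1 - (t/N)², and with this X it already integrates to one, which is what makes it a proper smoothing weight.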

Now D is wrt τ, but would give the same result if it were wrt t. In that case, it can be taken outside the integration:
β(t) = D∫W(τ)y(τ+t) dτ

That is our second result - the sliding trend β(t) is just the derivative of the Welch-smoothed y.

Application to time series

I introduced D because it has a nice difference analogue
Δy_i = y_i - y_{i-1}

Its inverse Δ^-1 is a cumulative sum (from -∞). So the same summation by parts works:
β(i) = Σ_j R(j)y(i+j)
Again W = -Δ^-1R is a symmetric parabola coming to zero at each end of the range - ie Welch. Then
Σ_j Δ(W(j)y(i+j)) = 0 = -Σ_j R(j)y(i+j) + Σ_j W(j)Δy(i+j)
or β(i) = Σ_j W(j)Δy(i+j)

Again that's the first result - the sliding trend is exactly the Welch smooth of the differences of y. Smoothed differentiation.
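
A numerical check of this, as a sketch assuming NumPy and unit time steps. The indexing convention is an assumption: W(j) = Σ_{k≥j} R(k), the form HaroldW gives in the comments, which is nonzero for j = -N+1..N. The test series is synthetic.

```python
import numpy as np

N = 7
j = np.arange(-N, N + 1)
R = j / np.sum(j**2)                  # OLS slope weights, j = -N..N

# W(j) = sum_{k >= j} R(k); W(-N) = 0, so keep only j = -N+1..N
W = np.cumsum(R[::-1])[::-1][1:]

# Closed form of the same taper: a parabola in j
jw = np.arange(-N + 1, N + 1)
W_closed = (N * (N + 1) - jw * (jw - 1)) / (2 * np.sum(j**2))

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))   # synthetic data, for illustration only

beta = np.correlate(y, R, mode="valid")        # sliding OLS trend
dy = np.diff(y)                                # dy[m] = y[m+1] - y[m]
beta_sd = np.correlate(dy, W, mode="valid")    # Welch-weighted sum of differences
```

With this convention the weights W sum to one and `beta_sd` matches `beta` element for element, with no extra shift needed.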

Again, Δ can be regarded as applying to i rather than j.
β(i) = Σ_j W(j)y(i+j) - Σ_j W(j)y(i+j-1) = ΔΣ_j W(j)y(i+j)

The sliding trend is exactly the differences of the Welch smooth of y.
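
The same kind of check for this second form, again a sketch assuming NumPy, unit time steps and the convention W(j) = Σ_{k≥j} R(k). The helper name `welch_weights` and the test series are made up for illustration.

```python
import numpy as np

def welch_weights(N):
    """Return R on j = -N..N and W(j) = sum_{k >= j} R(k) on j = -N+1..N."""
    j = np.arange(-N, N + 1)
    R = j / np.sum(j**2)
    return R, np.cumsum(R[::-1])[::-1][1:]

N = 10
R, W = welch_weights(N)
t = np.arange(400)
y = 0.01 * t + np.sin(2 * np.pi * t / 60.0)   # synthetic test series

smooth = np.correlate(y, W, mode="valid")     # Welch-smoothed y
beta_ds = np.diff(smooth)                     # differences of the smooth
beta = np.correlate(y, R, mode="valid")       # sliding OLS trend
```

Here `beta_ds` and `beta` agree to rounding error, so the order of smoothing and differencing makes no difference in exact arithmetic.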

138 comments:

Everett F Sargent, February 22, 2015 at 2:03 AM
Nick,

Of course, all of this is interesting:

http://en.wikipedia.org/wiki/Window_function#Welch_window

In doing nearshore ocean and laboratory water wave analysis we used:a Tukey window (a = 0.9):

http://en.wikipedia.org/wiki/Window_function#Tukey_window

And then my preferred method was the:

http://en.wikipedia.org/wiki/Welch%27s_method

using 2N-1 windows (D = 0 and D = M/2) each window was either linear or quadratic detrended, Tukey window applied, and final result was variance adjusted wrt a boxcar window function.

Back in the 60's it was decided to use n = 2048 at dt = 0.25 sec, or a ~17 minute wave record. Unfortunately, that rather computer-limited method (60's CPU technologies) is still applied today.

As a side issue, I am not aware of any ARIMA methods that can produce/synthesize a realistic water wave time series that can be used to drive a laboratory wave generator (if you know of any 'hands on' application I'd be very interested).

Finally, IMHO, any reasonable discussion of FIR/IIR filtering methods must show the SOP semi-log plot of the filter's magnitude (or better yet amplitude) response function. If you are just going to show either the real or imaginary part of the response function, that is, at least for me, of little engineering utility.

I've been doing this stuff for over three decades now. FIRs have certain strengths (limited autocorrelation via the finite window, but with 'stair steps' inherent in the finite nature) and weaknesses (limited attenuation decay and strong nodal behavior), and IIRs have certain strengths (monotonic and strong attenuation) and weaknesses (infinite autocorrelation, with decay features wrt the pole count).

AFAIK, there is no perfect filter, however, there are a lot of filters that can be misapplied in time series analyses.

That is all.

Greg Goodman, February 22, 2015 at 2:09 AM
Nice to see this sort of discussion. In view of the obsession climatologists seem to have with "trends" it would be good if they knew what they are actually doing.

I dislike the term "smoother" because it is imprecise. Lots of things can be "smoother" but that tells us nothing of how they are distorted. All filters 'distort' the data so we need to ensure we are distorting it in a useful way and not in unexpected ways of which we remain ignorant.

If we talk about low-pass filters this instantly means we are considering the frequency domain, which is what we must do to make appropriate choices.

Since the aim in this case is to find the derivative and also apply some low-pass noise filtering, one good option is to use a well-behaved filter like a gaussian, which is monotonic in the freq domain and is never negative. (The downside is that it has a rather slow transition and is never exactly zero either, so there is some stop-band leakage. This is often not a problem but should be noted.)

As Nick points out the order is not important. This is a result of the fact that convolution, the process by which FIR filters are calculated, is a linear operation. As a result it is commutative. It can be noted that the first difference is also a trivial convolution with a kernel: [-1,1]. This introduces a small phase offset since it represents the rate of change half way between the two points. So the time series needs shifting. An alternative is [-1,0,1] which estimates over three points and does not have a phase-shift.
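
A short sketch of the commutation, assuming NumPy. The 5-point running mean is just a stand-in low-pass kernel, and the division by 2 in the three-point kernel is an added normalisation so that it returns a slope per time step. Note that `np.convolve` flips its kernel, so the difference kernels are written reversed here.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=100)                    # arbitrary test data

smooth = np.ones(5) / 5.0                   # stand-in low-pass FIR kernel
diff2 = np.array([1.0, -1.0])               # gives y[i] - y[i-1]
diff3 = np.array([1.0, 0.0, -1.0]) / 2.0    # gives (y[i+1] - y[i-1]) / 2

a = np.convolve(np.convolve(y, smooth), diff2)   # smooth, then difference
b = np.convolve(np.convolve(y, diff2), smooth)   # difference, then smooth

# Both difference kernels recover the slope of a straight line
ramp = 0.5 * np.arange(20.0)
d_two = np.convolve(ramp, diff2, mode="valid")
d_three = np.convolve(ramp, diff3, mode="valid")
```

The two orderings `a` and `b` agree to rounding error. The two difference kernels differ only in where the estimate is centred: between two samples for `diff2`, on a sample for `diff3`.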

So gauss of diff is mathematically identical to diff of gauss, to extend Nick's argument.

However, in calculating the gaussian kernel we can apply some neat maths. Since we know the analytic form of the gaussian we can calculate its derivative analytically and use this as a kernel that will do the diff and the low-pass gaussian in one step. This has the nice advantage of giving an accurate multipoint estimation of the derivative rather than the clunky two-point numerical estimations.
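
A minimal sketch of such a one-step kernel, assuming NumPy and unit time steps. The function name, the truncation at 4 sigma and the normalisation (scaled so a unit ramp returns exactly 1) are illustrative choices, not taken from the plots linked in this thread.

```python
import numpy as np

def gauss_deriv_kernel(sigma, half_width=None):
    """Derivative-of-Gaussian weights k(j), applied as sum_j k(j) y(i+j)."""
    if half_width is None:
        half_width = int(np.ceil(4 * sigma))   # truncation point is an assumption
    j = np.arange(-half_width, half_width + 1)
    g = np.exp(-0.5 * (j / sigma) ** 2)
    k = j * g                       # analytic Gaussian derivative, up to scale and sign
    return k / np.sum(k * j)        # scale so that a unit ramp gives exactly 1

sigma = 3.0
k = gauss_deriv_kernel(sigma)
t = np.arange(200.0)

ramp = 2.0 + 0.5 * t
slope = np.correlate(ramp, k, mode="valid")    # recovers the slope, 0.5

wave = np.sin(2 * np.pi * t / 80.0)
dwave = np.correlate(wave, k, mode="valid")    # smoothed derivative of a sine
```

The kernel is odd, so constants are rejected, and its gain stays below that of an ideal derivative at every frequency: the sine comes out with amplitude slightly under 2π/80.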

I did this example earlier in relation to the M&F thread.

http://climategrog.files.wordpress.com/2015/02/cmip5_dgauss-vs-gauss-diff.png?w=800

Now the frequency response is similar to Nick's earlier plots without the horribly defective oscillations and negative lobes. The initial rise is similar but it tails off in a controlled manner. I do not have a plot of it ready but if Nick has the tools at his disposal to pop that up it would be worth visualising in comparison to the other frequency plots.

This filter is used in image analysis as an edge detection aid but it is a very good solution anywhere that the rate of change and some low-pass filtering are required.

What M&F were attempting to do would be a case in point.

I did a test on the same data M&F were using and achieved a much "smoother" result using a filter period much shorter than their 15y base period.

https://climategrog.files.wordpress.com/2015/02/cmip5-xs.png?w=800

To simplify discussions I did diff of gaussian, not the gaussian-derivative there. So the remaining jitter could probably be removed.

I discuss the implications of that plot over at CA.

Everett F Sargent, February 22, 2015 at 2:11 AM
"n = 2048" should be "n = 4096", sorry about that.

Everett F Sargent, February 22, 2015 at 2:14 AM
"used:a Tukey window (a = 0.9)" should be "used a Tukey window (a = 0.1)" sorry about that.

HaroldW, February 22, 2015 at 2:18 AM
Hi Nick,
Excellent as always. One quibble, though. You've defined W(t) = integral {-inf to t} R(tau) d(tau). This produces an inverted parabola, and results in the counter-intuitive minus sign in the final formula. [And in the analogous discrete-time version.]

Instead, it would be better to define W(t) = integral {t to inf} R(tau)d(tau), which generates a right-side-up parabola like the figure. Then R= -DW, and the minus sign is eliminated in the end.

[I came at this from the other direction, deriving the discrete-time version starting from the OLS formula, which produces the form W(n) = sum {i=n to N} R(i) naturally. And apologies for not using latex for these formulas, but it didn't work for me. Not sure if it's not enabled here, or whether I just entered it wrong.]

Anonymous, February 25, 2015 at 1:25 AM
Before discussing the derivation I'll just test whether I can cut and paste your eqns into comments:
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ,

@whut, February 25, 2015 at 2:19 AM
Somebody was asking about the characteristic period of the ENSO basin. It is 4.25 years as estimated by Allan Clarke at FSU (“Wind Stress Curl and ENSO Discharge/recharge in the Equatorial Pacific.” Journal of Physical Oceanography, 2007).

Of course this period works well when used as in a wave equation model for ENSO, if we also apply the appropriate forcing, primarily from the 2.33 year period of QBO.

Greg Goodman, February 25, 2015 at 2:48 AM
Cool. Now in that line.
∫D(W(τ)y(τ+t)) dτ = 0 = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

the LHS is the integral of the diff across the interval, so you are saying
W(τ)y(τ+t) is constant. I don't see the reason for that.

Also this seems to be where you are invoking integration by parts but I can't relate what you're doing to my knowledge of this method. Maybe you could show in a little more detail how you get there. For ex. with reference to the forms given here:
http://tutorial.math.lamar.edu/Classes/CalcII/IntegrationByParts.aspx

".... and X is a normalising constant, then the OLS moving trend is"
β(t) = ∫R(τ)y(τ+t) dτ

I don't see that this has any connection to OLS. This integral is convolution with a short segment of fixed slope. ie convolution with a constant d/dt. With suitable normalisation this is a moving average of dT/dt. Where does OLS come in?

That would seem to imply X is re-evaluated at each point thus R(τ) is R(τ,t) and the rest needs to change to account for that.

Sorry if I'm missing the point, I appreciate it's difficult to explain this kind of derivation in a quick sketch like that.

Anonymous, February 26, 2015 at 3:57 PM
Posting from the hip a bit here, but my gut feel without checking the maths is that your red line is the spectrum of your acceleration trend , not the linear trend.

Greg Goodman, March 1, 2015 at 5:14 PM
Nick, from your intro:
DW = -R
∫D(W(τ)y(τ+t)) dτ = -∫R(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

let's stick with W to keep it clearer and check the working before doing the substitution.

W(τ)y(τ+t) = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ ; integration by parts
http://tutorial.math.lamar.edu/Classes/CalcII/IntegrationByParts_files/eq0010M.gif

∫D( W(τ)y(τ+t) ) dτ = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ

Integration by parts ??

What you are effectively doing is:
∫D( W(τ)y(τ+t) ) dτ = 0 = W(τ)y(τ+t)

You've justified the left-hand equality, but there seems to be a logical slip in implying this justifies the RHS.

Nick Stokes, March 1, 2015 at 8:29 PM
Greg,
"If it is "sensitive" to this condition, the basis assumptions that the errors are normally distributed and that the data sufficiently well sample to represent the distribution, have not been satisfied."
By sensitive, he just means that it is weighted, whether an outlier or not. It's not saying that outliers have any particular prevalence.

"OLS is not a low-pass filter and should not be used as one."
I've shown the spectrum. It does damp high frequency, relative to those of period around the regression period. It isn't a low-pass filter - it's a smoothing differentiating filter.

That's the idea of its treatment of noise. HF is damped, low F is differentiated and passed. As usual, mid-range is messy.
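
A sketch of how that spectrum can be computed, assuming NumPy and unit time steps. The half-width N = 7 (a 15-point regression) and the frequency grid are arbitrary choices for illustration.

```python
import numpy as np

N = 7                                   # half-width; regression spans 2N+1 points
j = np.arange(-N, N + 1)
R = j / np.sum(j**2)                    # sliding OLS trend weights

omega = np.linspace(1e-3, np.pi, 500)   # radians per time step
# Response to exp(i*omega*t) under beta(i) = sum_j R(j) y(i+j)
H = np.array([np.sum(R * np.exp(1j * w * j)) for w in omega])

gain = np.abs(H)                        # R is odd, so H is purely imaginary
ideal = omega                           # an exact derivative would have gain omega
```

At the low end `gain` follows `ideal`, which is the differentiating part. At higher frequencies it can never exceed Σ|R(j)|, which is 0.2 for N = 7, so it sits far below `ideal` there, with lobes in between.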

"If OLS weights the end points and Nick's method weights the middle."
It weights the middle for differences. To extend the see-saw analogy, suppose it has bolt-heads along the side. You can tilt the see-saw either by adding weights, or by applying a spanner to the bolts. (Why would you do that? Maths means never having to say you're silly).

For best effect, you'd put weights at the ends, but apply the spanner to the middle. Spanner = local torque = difference (dipole).

"It strikes me that minimising sum of squares of y(t) is equivalent to minimising absolute deviation in dy/dt"
If you mean dy/dt as estimated by differences, then that's true, in a Welch-weighted sense.

"W(τ)y(τ+t) = +∫DW(τ)y(τ+t) dτ + ∫W(τ)Dy(τ+t) dτ ; integration by parts"

In this formula, the LHS means difference from one end of the range to the other, which as explained above, is zero. And with W Welch, DW is a (downhill) sawtooth pulse, as in regression (after minus).

Pekka Pirilä, March 2, 2015 at 8:03 PM
For most typical cases that do not involve error ranges for both variables, smoothing and differentiating are linear operations that commute exactly and can be combined to one single step weighted sum.

When we know very little about the smoothness properties of the underlying process, including negative lobes in the smoothing function does not make any sense at all, such a filter can be justified only by assumptions that involve limits for higher derivatives. When even the existence of those derivatives is questionable negative lobes should definitely be avoided.

When the filter is based on a relatively small number of data points it may be worthwhile to look at the coefficients of the combined linear operator that's used to estimate the derivative. It's often much easier to judge its suitability to the task by such a direct inspection than by any more indirect means, like the properties of the Fourier transform. In the case of temperature time series it's essential that the annual and diurnal variability with accurately known periods are not allowed to bleed through. When that's done properly the remaining signal that we are looking for is unlikely to have such smoothness properties that any sophisticated method could be justified.

A simple moving average does not produce as smooth results as other common filters, but that's not necessarily a fault. You could argue also that the smoothness produced by the other filters is misleading and that the moving average gives a more correct impression of the variability that is really present in the signal.

Carrick, March 3, 2015 at 4:13 AM
Pekka: For most typical cases that do not involve error ranges for both variables, smoothing and differentiating are linear operations that commute exactly and can be combined to one single step weighted sum.

This sort of statement is completely not true for finite precision arithmetic. Since any algorithm is generally going to be implemented using double precision numbers, that is a real issue with this sort of statement.

Even simple algebraic properties are not obeyed, including even the associative property:

(A+ –B) + C ≠ A + (–B + C)

(A, B and C are positive numbers here. I've written this in a manner to illustrate the problems.)

This is why languages like C have an order of evaluation specified (it prevents optimizers from reordering your expressions).

This sort of problem is easier to see with single-precision arithmetic, but it is true with double precision (which I typically use instead).
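
A two-line illustration in double precision, using plain Python floats. The particular values of A, B and C are chosen here to make the effect obvious. They are not from the comment.

```python
A, B, C = 1e16, 1e16, 1.0

left = (A + -B) + C     # the large terms cancel first, then C is added
right = A + (-B + C)    # C is added to -B first and is lost to rounding

print(left, right)
```

At a magnitude of 1e16 neighbouring doubles are 2 apart, so -B + C rounds back to -B and the second grouping loses C entirely.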

In general, the issue is that taking the difference between two very nearly equal numbers always results in a loss of precision, relative to an ordering of the operations that avoids this.

I mentioned above the preference of calculating:

E((x - mean)^2)

instead of the mathematically equivalent

E(x^2) - xmean^2

When you have precisely measured quantities the second expression will result in a loss of precision in your variance (or standard deviation) estimate.
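
A small sketch of the difference between the two variance formulas, assuming NumPy. The data are made up: four values with a spread of a few units sitting on an offset of 1e9.

```python
import numpy as np

offsets = np.array([4.0, 7.0, 13.0, 16.0])    # population variance is 22.5
x = 1e9 + offsets                             # large mean, small spread

two_pass = np.mean((x - np.mean(x)) ** 2)     # E((x - mean)^2)
one_pass = np.mean(x ** 2) - np.mean(x) ** 2  # E(x^2) - mean^2
```

The first form returns 22.5. The second subtracts two numbers near 1e18, where doubles are spaced 128 apart, so it cannot resolve a variance of this size at all.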

It is also why we should generally subtract the mean before computing an OLS calculation. As I pointed out the ordinary regression formula is obtained in exactly this fashion.

Pekka: "When we know very little about the smoothness properties of the underlying process, including negative lobes in the smoothing function does not make any sense at all, such a filter can be justified only by assumptions that involve limits for higher derivatives. When even the existence of those derivatives is questionable negative lobes should definitely be avoided."

Since you can use spectral analysis to look at the impact of the filter on the time series you are analyzing, it is a rather trivial matter to determine whether there are issues associated with retaining higher-frequency lobes.

As I pointed out above, if there is an issue with the side lobes, it is also trivial to apply a cascaded smoothing filter after the derivative filter. By selecting a smoothing filter that has nulls at the locations of the maxima of the first side lobe, you not only have increased the order of the filter, you've also suppressed the most important lobe of the derivative filter.

As Nick points out there are always trade-offs between different approaches, and absolute statements such as yours rarely hold up in practice.

Pekka: "A simple moving average does not produce as smooth results as other common filters, but that's not necessarily a fault. You could argue also that the smoothness produced by the other filters is misleading and that the moving average gives a more correct impression of the variability that is really present in the signal"

The moving average is exactly the sort of filter that has large side-lobes. Yet you now seem to be objecting to filters that don't include the artifacts associated with the side-lobes, as if this is giving a less correct impression of the variability that is present in the signal. That is really a curious argument.

If you want to show the smoothed version and you want to present the original variability, it is enough to provide a smoothed, filtered curve overlaid on the original data:

https://dl.dropboxusercontent.com/u/4520911/Climate/Temperature/hadcrut4.cvt.smoothed.png

Certainly smoothing by itself always necessitates reduction in the original variability, so if you want the original variability you just present the original variability too.

Pekka Pirilä, March 3, 2015 at 5:46 AM
Carrick,

I'm very familiar with problems from floating point arithmetic, and agree that they must often be considered.

Moving averages often produce misleading results. It's not uncommon to read press releases that tell about a sudden change in some economic indicator, claiming that something dramatic occurred in the latest data point, possibly even speculating about the reason for that recent change, when the real change is that one exceptional value was dropped out from the beginning of the period. The spectral properties are surely in many ways unsatisfactory.

In spite of all that, I do still think that the advantages of moving average make it often the best choice, when we consider an irregularly changing value that's not controlled by an underlying function that has well defined higher derivatives bound by some reasonable bounds.

What's appropriate depends on the application, and what's misleading depends both on the application and the sophistication of the audience. Methods that are best in some fields of signal processing are not necessarily the most appropriate in analysis of temperature time series.

Greg Goodman, March 3, 2015 at 6:29 AM
Thanks for your comments Pekka.
"In spite of all that, I do still think that the advantages of moving average make it often the best choice"

Well it has two selling points: a compact kernel (not hard to achieve if you are prepared to accept a crap frequency response) and a zero in the FR. Most filters manage that, with the rare exception being gaussian.

A three pole RM can approximate a gaussian and provide a zero without massive distortion.

Many seem to use RM by ignorance rather than by design. If the result is "smoother" they have attained their design objectives of "doing a smooth". That they may be inverting peaks etc does not even occur to them.

M&F2015 was a fine example of this problem.

They chose a 15y sliding trend , presumably a nice round number of about the right size, without knowing that it would invert the strong volcanic signal and lead them to a false conclusion that their ( untested, non validated ) "innovative" method would detect a dependency on model TCS if one was present.

Not knowing or examining the frequency properties of the "smoother" they chose is very poor practice. Not testing that their novel method was able to do what they assumed it would do is very poor and should certainly have been picked up in peer review.

Failing to notice that their fig3b ( 62y trends ) did show a clear bifurcation into two groups and investigate was very poor. It also contradicts their conclusions.

This paper is particularly disappointing since Forster and Gregory 2006 had impressed me as being about the only climate paper I've seen to date that recognised the importance of regression dilution on estimating TCS.

Maybe it was Gregory that was the more rigorous of that pair of authors.

Greg, March 3, 2015 at 3:49 PM
Nick, I've been saying for years that all this should be done in dT/dt , no problem with that, though they do not present any reason why they are smoothing dT instead of T(t). I think this is just typical trendology.

If the requirement is for general purpose LP filtering of dT/dt, derivative of gaussian seems the most suitable choice to me. Though one of your higher order Welch's would probably be equally good.

Greg, March 3, 2015 at 3:56 PM
http://climategrog.wordpress.com/?attachment_id=209

Here is an illustration of why fitting trends in the presence of systematic variability is bunk ( as well as being bad maths ).

Pekka Pirilä, March 3, 2015 at 7:49 PM
What M&F did is essentially estimating the changes over the periods concerned. 15 year OLS trend is not very different from what could be determined by calculating the averages of the first 5 and last 5 years and dividing the difference by the time from the first period to the second period. For 62 year trend we could pick similarly 10-15 years from both ends. These values are perfectly meaningful and easy to understand. OLS trends are close to the same but have somewhat smaller statistical uncertainty as they use the information more optimally (at least assuming white noise).
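
A rough sketch of that comparison for a 15-point window, assuming NumPy, unit (yearly) time steps and white noise. The function names and the synthetic line are illustrative only.

```python
import numpy as np

def ols_slope(y):
    """OLS trend per time step for equally spaced data."""
    t = np.arange(len(y)) - (len(y) - 1) / 2.0
    return np.sum(t * y) / np.sum(t**2)

def end_average_slope(y, m=5):
    """(mean of last m - mean of first m) / (time between the block centres)."""
    return (np.mean(y[-m:]) - np.mean(y[:m])) / (len(y) - m)

years = np.arange(15)
line = 0.3 + 0.02 * years                # noiseless line, slope 0.02 per year
s_ols = ols_slope(line)
s_end = end_average_slope(line)

# Variance of each estimate per unit white-noise variance
t = years - 7.0
var_ols = 1.0 / np.sum(t**2)             # 1/280
var_end = 2.0 / (5 * 10**2)              # 1/250
```

Both estimators return the slope of a straight line exactly, and for white noise the OLS variance (1/280 of the noise variance) is only a little below that of the end-average version (1/250).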

All estimates of derivative at a point discussed in the above can be described as follows:
- pick individual pairs of values symmetrically from both sides of the point considered and calculate the derivative from those values
- form a weighted average of these estimates.

For a given footprint OLS has the smallest statistical uncertainty of such estimates. Some other weights give smoother behavior and are less affected by the size of the footprint. If the underlying process is expected to satisfy some specific smoothness properties (or have some spectral properties) then some other choice than OLS is probably better. All other choices suffer from the larger footprint that prevents getting properly close to the end point of the full time series.
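
For OLS the pair weights can be written down explicitly. The sketch below assumes NumPy and unit time steps, and the variable names are illustrative. Each symmetric pair at distance j gives the estimate (y(j) - y(-j))/(2j), and OLS weights it by 2j²/Σj².

```python
import numpy as np

N = 7
j = np.arange(1, N + 1)
X = 2 * np.sum(j**2)                     # sum of j^2 over j = -N..N
w_ols = 2 * j**2 / X                     # OLS weight on each pair estimate

rng = np.random.default_rng(4)
y = rng.normal(size=2 * N + 1)           # one window, centre at index N
pair_est = (y[N + j] - y[N - j]) / (2 * j)   # slope from each symmetric pair

slope_pairs = np.sum(w_ols * pair_est)   # weighted average of pair estimates
jj = np.arange(-N, N + 1)
slope_ols = np.sum(jj * y) / np.sum(jj**2)   # direct OLS slope
```

The weights sum to one and grow as j², which is the sense in which OLS leans on the ends of the window. Other choices of filter correspond to other weight profiles over the same pairs.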

Greg, March 4, 2015 at 3:48 AM
Pekka: "I'm not convinced that the temperature time series have properties that can be considered better in the frequency domain than in the time domain."

since one of the conditions for OLS to be the statistically 'best estimator' is no auto-regression and temperature is highly autoregressive, that is another reason for not fitting trends to T(t).

The white noise, if we are to make that assumption, is in dT, not in T(t).

Carrick, March 4, 2015 at 11:45 AM
Greg: since one of the conditions for OLS to be the statistically 'best estimator' is no auto-regression and temperature is highly autoregressive, that is another reason for not fitting trends to T(t).

Well, no. That's the reason you have to check for the effects of autocorrelation in the noise on your filter. E.g., "realistic Monte Carlo". I haven't found any cases where it makes much of a difference, so I haven't explored it further. The general theorem is that you still have a linear unbiased estimator, but it's no longer the most efficient, so it's LUE but not BLUE.

If you need to model the autocorrelation, generalized least squares methods exist which can handle the auto-correlation. Nick has some posts up on how to do this. The fastest numerical implementations use a modified Cholesky decomposition. I thought there was a Wiki article that discussed it, but I couldn't find it immediately.

Anyway you can find quite a bit to chew on by searching for the terms:

generalized least squares cholesky autocorrelation
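
A bare-bones sketch of the idea, assuming NumPy and AR(1) noise with a known lag-1 correlation `rho`. It uses a plain dense Cholesky factor to whiten the problem, not the faster modified decomposition mentioned above, and the function name `gls_trend` is hypothetical.

```python
import numpy as np

def gls_trend(y, rho):
    """GLS intercept and slope, assuming AR(1) noise with lag-1 correlation rho."""
    n = len(y)
    t = np.arange(n, dtype=float)
    A = np.column_stack([np.ones(n), t])           # design matrix [1, t]
    # AR(1) correlation matrix: Sigma[i, k] = rho^|i - k|
    Sigma = rho ** np.abs(np.subtract.outer(t, t))
    L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
    # Whitening: L^-1 y = L^-1 A b + uncorrelated noise
    Aw = np.linalg.solve(L, A)
    yw = np.linalg.solve(L, y)
    coef, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
    return coef                                    # [intercept, slope]

t = np.arange(50, dtype=float)
coef = gls_trend(1.0 + 0.05 * t, rho=0.6)          # noiseless line as a sanity check
```

On noiseless data this returns the same line as OLS. The two only differ once correlated noise is present, and then in efficiency, not in bias.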

Carrick, March 6, 2015 at 2:49 AM
Greg: Efficiency is only a criterion if you are seeking to establish a linear relationship. If you are simply using a "trend" as a poor man's low-pass filter I would suggest choosing a better filter.

This is completely wrong. Efficiency is different than linear.

Efficiency is a measure of how much variance you can explain with a particular fit. For uncorrelated noise, the Gauss-Markov theorem establishes that OLS is the best linear unbiased estimator.

If you feed correlated noise into the derivation, you now find there is an additional term that would have canceled in the absence of correlation that now fails to cancel. Hence OLS is no longer "best" (which refers to efficiency).

This means that OLS based filters will do worse than generalized LSF filters when you have correlated noise. But they will still be linear and unbiased.

So saying we need to replace running OLS filters by smooth difference for example is not a good argument, if the operator you want to replace them by is numerically less efficient.

Carrick, March 6, 2015 at 2:51 AM
Nick: I'm basically advocating that trend be seen as a derivative estimate. It might indeed be better to call it that.

I think it's sufficient to define trend of a quantity as the time derivative estimate of that given quantity. Then move on. ;-)

thefordprefect, March 6, 2015 at 11:52 PM
Firstly I'm pretty much out of my depth here. But a smoothing filter which seems to work OK is Hodrick-Prescott (available as an Excel add-in).
Applied to HadCRUT4 data with an appropriate filter value, this is a comparison with the filter above:
Note how the ends match the data better.

http://s29.postimg.org/gj8awbojb/hp_filter_compared.png

Any comments ?

http://www.web-reg.de/hp_addin.html

thefordprefect, March 6, 2015 at 11:55 PM
Meant to say - Filter compared is
Carrick March 4, 2015 at 11:24 AM