Wednesday, October 1, 2014

Analysis of short-centered PCA

There has been a lot of to and fro (see recent posts) about short-centered PCA - MBH style (background). My general view here is that there is an effect on the first principal component, but that merely moves to a particular alignment of basis vectors, and doesn't affect any reasonable reconstruction. And it all happened so long ago...

But still it's an interesting analytic problem, so I did some analysis. It particularly concerns the role of persistence, or autocorrelation. Increased autocorrelation tends to show more effect, because for a given number of proxies, it diminishes the effect of noise relative to the underlying short-centering operator (graphs below). But that underlying pattern is interesting, because it is much less dependent on autocorrelation. And the next PC's in succession are interesting, as I wrote about here.

White noise

You need an infinite quantity of proxies to see a pattern with white noise. But the pattern is the prototype for more difficult cases, so I'll develop it first. Much of what I have done recently is based on the work in the NAS report and its associated code. They compute and scale not only PCA first components from AR1() proxies, but also from the associated correlation matrices, which they scale and put on the same graph.

The algebra of short-centered averaging works like this. Define a N-vector h which has zeroes outside the short averaging period (M years at the end of N years of data), and values 1/M within, and another vector l which has 1's only. Then for a data vector x, length N, the valkue with short mean deducted is x-l*(h.x), or E*x, where E=I-l⊗h. I is the identity matrix, and ⊗ the dyadic product. For future reference, I'll note the dot products
l.l = N, l.h = 1, h.h = 1/M.

Then for a data vector X, Nxp, p the number of proxies, the PC's are the eigenvectors of XX*. The expected value of this, as a random noise vector, is p*K, where K is the autocorrelation matrix. For eigenvectors we can drop the p. The short-averaged version is E*X*X**E*. So for white noise, K is the identity I, and we need just the eigenvectors of E*E*.

Expanding, the eigenvalue equation is
(I-l⊗h - l⊗h + l⊗l h.h)*u = z u
(l⊗h + l⊗h - l⊗l h.h) u = (z-1) u

Since the matrix on the left now has range spanned by l and h, u must be in this space too. So we can set u = A*h+B*k, and there is a 2-dim eigenvalue for u in this space. I'll skip the 2-d algebra (just substitute u and work out the dot products), but the end results are two eigenvalues:
z=0, u=l
z=N/m, u= l-h
The latter is clearly the largest eigenvalue. There are also a set of trivial eigenvalues z-1=0; these correspond to the remaining unchanged eigenvalues of I.

So what has happened here? The short averaging has perturbed the correlation matrix. It has identified the vector u=l-h, which is 1 for years in 1:(N-M) and zero beyond, as the first principal component. This will appear in any reduced set of eigenvalues. All vectors orthogonal to it and l are still eigenvectors. So no new eigenvectors were created; u had been an eigenvalue previously. It just means that now whenever a subset is formed, it will come from a subset of the unperturbed set. They were valid bases for the space before; they still are.

Effect of autocorrelation.

I showed earlier DeepClimate's rendering of two AR1() cases from the NAS code:

AR(r=0.2), 5 PC1s from sets of 50 proxies.AR(r=0.9)

You can see that relative to the HS effect, the noise is much greater when r=0.2 (it depends on proxy numbers). And I can't usefully show white noise, because the noise obliterates the effect. but the eigenvector is not very different.

Here are some plots of PC1 from r=0.1 to r=0.9. I've kept to the same scale. You can see the gradual change in the matrix eigenvector, and the gradual downscaling of the noise, which makes the eigenvalue pattern stand out more.


My main ambition here was to get theoretical expressions for these eigenvalues. It is possible, and interesting, to convert the eigenvalue equation to a difference equation, which emphasises the Sturm-Liouville pattern for higher eigenvectors. For PC1, it just gives an expression for the rounding that you see. That's for another day.


  1. Playing with your code ...

  2. The Wegman Report already did a mathematical calculation of the bias in the eigenvalue.

    1. Where do you think they did a bias analysis for the actual data?
      As opposed to standard math and general description of how to do it? (Appendix A)

    2. Captain FlashheartOctober 2, 2014 at 9:40 AM

      Have you checked it for plagiarism?

    3. Yes, back in 2010, didn't see any ... but as noted, p.187 of Strange Scholarship in the Wegman Report about the Wegman Report:

      "The writing style here differs strongly from the rest of the WR and I believe was written by Scott. Most of WR pp.61-63 is a straightforward description of the basic mathematics of PCA, with only minimal reference to MBH: ...

      One might guess that not many members of the House studied this mathematical Appendix very carefully. For the intended audience, one might normally expect to see a pointer to some standard source on PCA, followed by a nontechnical discussion. Most of the WR seems to have been created by Said and Wegman, except this Appendix A s was requested to make the WR look more impressive?" (sic, English: should have been "Was it requested.)

      To the best of my knowledge, none of us who looked ever found the slightest evidence of plagiarism for Scott, whereas there as plenty for Wegman and/or his students. As best as I know, Scott was really not very involved with the WR, but has chosen not to withdraw his authorship. Said really should have been the second author, not Scott.

      Of course, no one who knows has ever been willing to say, but it looks like Appendix A was standard math, and then W+S inserted a paragraph or two of MBH mentions, bu did not do any real analysis.

      SSWR p.117 W.2.2 refers to a by Deep Climate on WR section 2.2, Background on Principal Components, i.e.,

      Wegman Report update, part 1: More dubious scholarship in full colour, July 29, 2010

      Simple summary: PCA and time-series explanation was mosaic-plagiarized from a bunch of sources, but with extra errors.
      The plagiarism style resembles that seen elsewhere in the WR.

      Now of course we cannot know (since Said has pretty much disappeared and Wegman isn't talking), so we cannot know ...but here is a conjecture consistent with all public evidence:

      1) Wegman+Said, especially the latter, put together WR 2.2 (pp.15-17) which at least was written at a level close to something appropriate for a Congressional hearing.

      2) But it was plagiarized and a bit mangled because Said likely didn't understand this very well ... and the lack of understanding certainly shows up throughout the WR's attempts at statistical analysis.

      3) But maybe Wegman at some point worried that it seemed weak, so he asked his long-time friend David Scott to contribute Appendix A, both getting a distinguished statistician onto the author list and getting a more impressive appendix ... although that appendix A.2 is totally inappropriate for the intended audience.

      Also, recall that Said had been Wegman's PhD student, and was a postdoc back at GMU when the WR was published, and the two others Ack'd for help were also part-time GMU grad students, and what they did was give other affiliations for all three.
      It is pretty hard to claim that a postdoc and 2 grad students were team of "eminent statisticians", so it helped to have Scott's name, even if he really was minimally involved.

      Again, we cannot know, but that is quite consistent with the public evidence.

      Put another way, the Wegman Report was multiply-fraudulent (and I do not use the word lightly, I know what it means.)
      ... and was not pro bono either, since Wegman and Said claimed it for credit against unrelated Federal grants.

  3. Nick,
    I noticed this post only after writing my previous comments. They might have suited also in this thread.

    I think that you made some errors in your mathematics. The covariance matrix, whose eigenvalues and eigenvectors are determined is not XX*, but X*X in case of time series that have already been centered. The difference is essential.

    Carrick linked to a comment of Jean S and Steve added there a further link to Wegman report. The Appendix A of Wegman report seems to contain in A.3 a correct description partly in component form. I may add soon a similar description using fully matrix notation. I that connection I may add some further comments on what that means for the results.

    1. Pekka,
      "The covariance matrix, whose eigenvalues and eigenvectors are determined is not XX*, but X*X in case of time series that have already been centered. The difference is essential."

      The non-zero eigenvalues are the same:
      Proof if X*Xu=λu, then (XX*)Xu=λ Xu,
      This is behind the svd decomposition.

      Here I need to look at XX*, because I am applying the short-centering operator. The eigenvectors I want are time series.

    2. Nick
      In the correct mathematics the factors from short-centering ends up between X* and X, not outside as in your derivation.

      A preliminary version of my derivation is here:

    3. Pekka,
      I am following the logic of the NAS progran cited earlier. By svd, X=UdV, where U is the set of nontrivial eigenvectors of XX*, V the eivecs of X*X and d the set of common eigenvalues. The first column of U is the PC1 that we know and love. I am looking at the effect of short centering (operator S) on PC1, so it is that U analysis that I need. I am looking for the first eivec of SXX*S*. S operates on the columns of X.

      The eigenvectors of this modified matrix are still orthogonal (symmetric). They are not orthogonal with the kernel XX*.

      I read your analysis, but I don't see how it can yield a time series PC1. It's the analysis of V, not U.

    4. Nick,
      If I understood you correctly, we have in my notation

      U = YV = BXV

    5. Pekka,
      Yes. SVD gives U, which are the PC's as time series, and V, which are the coefficients which combine the columns of X to produce those series. U are the non-trivial eigenvecs of XX*; V eivecs of X*X. As you say, they are closely related, and you can do an eival analysis of X*X, as you do, to get the coefs, and then combine, or do a direct eival analysis of XX*, as I do.

  4. Nick,
    I had perhaps too strongly in mind, how the short-centered analysis proceeds in practice rather than the task of figuring out consequences of assuming specific noise models and applying short-centering.

    Mentioning the words "non-trivial" in the original post might have been helpful in leading thoughts to the right track.