Monday, February 14, 2011

Ryan's code

I'm trying to run Ryan O'Donnell's code for the Antarctic paper (O10), in the form that he posted at the CA site. The data files are there too. The code is called RO10 Script.R.

Update 16 Feb: Carrick says in the comments that he used the file RO2010 Code.txt, in the same directory, and that it ran to completion. There is a readme.txt file which mentions this latter file and not the .R file. The .txt file does not have the apparently defective code that I report being stuck on below. I am currently running it, and it reaches the error pointed out by JohnA. This is at a much later stage in the run, so that is good news. The confusion over the two code files is puzzling. In that same CA thread, pax noted an error which I had also encountered, and Ryan said that he had fixed it. In fact, he updated the .R file, not the .txt one. Anyway, I'll switch to the .txt file.

References for Antarctica posts: the discussion is prompted by two papers: S09, a 2009 Nature paper by Steig et al, and RO10 (or O10), a 2010 J Climate paper by O'Donnell et al. There has been heated blog controversy about these papers - this post is just one entry point. Here is a Real Climate post following S09's original appearance, and here is Eric Steig commenting on O10 (this led to other heated postings).
It's a long file. Eric Steig said that he found it hard to follow, preferring that it be broken up into separate files. But for turnkey effect, one file is more convenient.

When I'm trying to understand how a long program works, I often get into it with emacs and make a html version with links. Then one can make a table of contents, and link function calls to their definitions etc. And also color things. I've done that online here, and the source is available in a file (ryan.htm) I've uploaded at the document depository. It's not for running - just an aid to navigating around the code.

As far as running goes, I've come to a halt. I've asked Ryan about it here. Ryan has said that this part wasn't used for the paper. But there's a lot more code below it.

The run has completed the sections
SATELLITE OFFSETS
IMPUTATION EXPERIMENTS
and reached
RECONSTRUCTION
It did an infill with iridge and produced some plots:

main.gnd = emFn(Yo, maxiter = 500, tol = 0.0005, type = "iridge", plt = c(T, 41), DoFL = 12)

But then it gets to the section
###Infill satellite PCs: Eigenvector-weighted method, optimal settings
and the first call to getEigen:
eigen.pcs = getEigen(Y, pc.max = 13, type = "tsvd", maxiter = 500, tol = 0.0005, n.eof = 9, covr = FALSE, wt = TRUE, alpha = 100, progressive = c(T, 1, T, 1))

This is where most people have got stuck, because covr is not an argument of getEigen. But if you take it out, you get, via a call within getEigen to emFn, down to tsvdRegFn:

tsvdRegFn = function (arg, dat, char, fn, n.eof, iter) {

### Remove unnecessary matrices
dat$C = dat$C.res = NA

### Setup Xhat
Xhat = matrix(0, nrow = arg$n, ncol = arg$p)

### Find scales and weighting factors
if (arg$covr == TRUE) {
    D = arg$wts / rep(1, arg$p)
} else {
    D = sqrt(diag(dat$C))
    D = arg$wts / D
}

And there's my current puzzle: D = sqrt(diag(dat$C)) acts on something that has just been set to NA. Although when I look at it with str(), it seems to be a logical scalar anyway.
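The puzzle can be reproduced in isolation. Once dat$C has been replaced by the scalar NA, diag() no longer has a matrix to take the diagonal of; with a scalar argument, diag() instead tries to build an identity matrix of that size, and NA is not a valid size. A minimal sketch, with a toy list standing in for the real dat:

```r
# Toy stand-in for dat after the line "dat$C = dat$C.res = NA"
dat <- list(C = NA, C.res = NA)

# dat$C is now a logical scalar, as str() confirms
str(dat$C)

# diag() on a scalar tries to build an n x n identity matrix with n = NA,
# so sqrt(diag(dat$C)) fails rather than returning the column scales
res <- tryCatch(sqrt(diag(dat$C)), error = function(e) conditionMessage(e))
print(res)
```

So unless arg$covr is TRUE (taking the other branch), this code path cannot produce the intended scales.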

I wish I could explain better what has really happened here, but I'm trying to familiarise myself with the algorithm by running the code. I'll post again when I've made more progress.

1. But following your own link to getEigen, the function definition looks like this:

getEigen = function(Y, C = NA, dat = all.avhrr, pc.data = NA, DoFL = 12, form = 11, idx = NA, pc.max = 28, maxiter = 5000, tol = 0.0005,
n.eof = 9, type = "tsvd", wt = TRUE, progressive = c(T, 1, T, 1), alpha = 10, dat.st = 1982, dat.en = c(2006, 12), covr = FALSE) {

covr is there as the last (presumably optional) argument?

2. I'll note the code ran for me (version 2.12.1, 64 bit code, Mac OS 10.6.6).

How sure are you this isn't a memory related problem?

3. KC,
If you click on that trashcan symbol, there's an option to obliterate comments completely.

Anon,
That's my mistake - my original fix was not to remove the arg from the call, but to add it to the definition. It wasn't there in the original.
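That behaviour is easy to demonstrate: R stops immediately when a call supplies a named argument that the function definition doesn't have (and there is no ... to absorb it). A small sketch with a hypothetical cut-down getEigen:

```r
# Hypothetical stand-in: a definition without covr (and no ... parameter)
getEigenOld <- function(Y, type = "tsvd") type

# Passing covr anyway fails with an "unused argument" error
msg <- tryCatch(getEigenOld(1, covr = FALSE),
                error = function(e) conditionMessage(e))
print(msg)   # "unused argument (covr = FALSE)"

# After adding covr to the definition, the same call succeeds
getEigenNew <- function(Y, type = "tsvd", covr = FALSE) covr
getEigenNew(1, covr = FALSE)
```

Which is why adding covr to the definition (rather than deleting it from the call) also gets past the error.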

Carrick,
That's very interesting, and for me encouraging. Ryan's response was that I shouldn't be in a tsvd part of the code. I think the bit of code I am in is not right - I don't see how it makes sense to set div$C to NA and then make use of it. But it is indeed possible that I shouldn't be there. I did modify the tolerances to speed up the run, since I was only interested at this stage in getting the program sequence to run. I'll try again with the original tolerances. 4. Nick, I'm looking forward to your review of Ryan's code. One thing I noticed was it doesn't terminate if it can't download all of the data files (you can a "cannot open connection" error, and it keeps running, generating a long trail of error messages). My recommendation would be to make code like this slightly less turn-key: Give a list of instructions for how to run it, and an explanation for how to verify each step was performed successfully. Also, on speed, my code ran in about an hour on a rather horsey computer. I'd love to see what Ryan is running his on, if his only took 17 minutes. I have a couple of comments in the "On words" thread that appear to be in limbo. THere are two versions, if possible could you ixnay the older version, and approve the more recent version? 5. Carrick, Yes, I've restored the comment that got caught in the spam filter. I ran the code with original tolerance settings, but still got the same error sequence. 6. Would not implementing RegEM in R slow it down from the original Matlab (interpreted vs compiled)? Or are the operations "standard" enough that library stats processing calls can be combined in such away as to avoid a performance penalty? 7. Deepclimate, I would think, properly written, the codes would run in about the same amount of tine on either platform. I think the biggest issue, as a developer, is which environment is faster to generate a final product in. 
In my experience these days, for most code, more time is spent in developing and in the code verification stage than actually running the final product. 8. One of the things I'm curious about is timing. I spent a lot of time tweaking TempLS. R is good if you keep the numerical work in big chunks. Carrick, if you ever run the code again, I'd be really interested to know if it goes into that tsvdRegFn() function. 9. I will try that tomorrow and let you know. 10. Thanks, Carrick 11. And the answer is "yes". It definitely gets called... a bunch of times. Note, I'm using the file "RO10 Code.txt" as per the readme file. 12. Ah, that may be my problem. The RO10_Script.R version (which I used) is different to the RO10 Code.txt version, and the line I'm stuck on is replaced by a line: D = sqrt(colSums(dat$X ^ 2) / arg\$n.eff)
which should work. OK, I'll try that. Thanks.
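The .txt version's line computes per-column root-mean-square scales directly from X, sidestepping dat$C entirely. A quick check on a random matrix (assuming n.eff plays the role of an effective sample size, as the name suggests) shows it matches what sqrt(diag(.)) of a covariance-style matrix would give:

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 5)   # toy data standing in for dat$X
n.eff <- nrow(X)                   # assumption: effective sample size

# The .txt version's scaling: per-column RMS of X
D <- sqrt(colSums(X^2) / n.eff)

# Equivalent to sqrt(diag(X'X / n.eff)), but without forming the full matrix
all.equal(D, sqrt(diag(crossprod(X) / n.eff)))
```

So the replacement line recovers the same column scales without ever needing the C matrix that the .R version had already discarded.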

Still, it's a puzzle. I found in my original version of the .R file the same error as noted here. Ryan responded, saying that he'd put up a fixed version, and indeed, the .R code had been fixed.

But the actual name of the R script used is not mentioned in that thread.

13. Carrick,
It ran to completion in that .txt version - took 46 minutes on my 4 GB Windows setup. Now I can try to see what it does.

14. The other code is apparently something he was working on after the acceptance of the paper. I too encountered an error during the validation process with the AVHRR data; I assumed (perhaps erroneously) that it was innocuous.

In case Ryan sees this, I'll repeat my recommendations:

My recommendation would be to make code like this slightly less turn-key: give a list of instructions for how to run it, and an explanation of how to verify each step was performed successfully.