Thursday, April 15, 2010

TempLS Version 1.3 - SST, Land/Sea index, distance utility

I need a shorter name for the project. So this is TempLS V1.3, which I used for my previous post on Sea Surface Temperatures (SST) and combined indices. SST is the big change, but there are some others you'll notice.

One is a new facility for calculating distances by lat/lon. You can ask for all stations within a distance of a location, for example. You can even vary the weighting function. So if you wanted to replicate the GISS calculation of all the trend of rural stations within 1000 km of a location, weighted with a distance taper, you can.

The code is rearranged so the selection criteria sit in a function near the top, below the other user-choice items. If you build up a lot of these, you can put them on an external file, which you can include with a source().

As usual, the code is on the repository here, and at the document store. More details below.

A summary of this series of posts is here.

Summary of operation for TempLS

  1.  Download v2.mean and v2.temperature.inv from NOAA
  2.  Download preprocessorv1.3.r and GlobTempLSv1.3.r from the cument store.
  3.  If you want to use ocean temperatures, download HadSST2_SST_1850on.txt from, and change Sea=F to Sea=T at the top of the preprocessir file..
  4.  Check the top of both R files for user settings you might want to make. At the head of the GlobTemp file, choose the cases you want for a particular run. You may have to write conditionals.
  5.  Start R, and source("preprocessorv1.3.r"). You won't have to run this again unless the version or input files change.
  6.  Having decided your run choices, including tend and plot periods. source("GlobTempLSv1.3.r")
  7.  You'll find each case has generated jpeg plots and a combined Output file. The choice string for each run case is in the title.

SST and restructuring

The TempLS code as before consists of a preprocessor and an analysis stage. Once you have preprocessed (for this version), you don't have to do it again for each new analysis. The preprocessor now modifies both the GHCN data file v2.mean and the inventory file v2.temperature.inv. These are now called and modified.inv. You'll need to have them in the same directory as that where you do the analysis. A change from v1.2 is that there is now no intermediate file statyrs.txt.

At the top of the preprocessor file is a logical variable Sea. It's set to T by default, and then you must supply a copy of the HadSST2 file HadSST2_SST_1850on.txt. If you set it to F, then no SST data will be used, and the code behaves as in previous versions.

If you do have Sea=T, you'll need to check previously used selection conditions. I've introduced a new logical vec isLand which you can add to your choice to make sure you don't include sea when restricting by lat/lon, say (unless you wanted to). Sea=T means that the invebtory file has 2592 artificial stations added, with a madeup number starting with 899 and lat and lon - all other properties are blank.

New user choices

As said above, I've put the user choice criteria as a function (chooser()) near the top. This will help if you want to take it to an external file, for example.

You can now choose an outputdir and a plotdir to direct output to. I have removed the analysis input files choice from the analysis prog, since they are now created by the preprocessor.

As also said, you can choose stations via getdist(lat,lon). If you want to know the temp history of stations within 1000 km of Kansas City (lat=39, lon=-95) ypu could write:
if(tc=="KC"){i01=i01[ getdist(39,-95)<1000.]}

If you want to change the weights and get rural stations in that range only (as for GISS UHI), then

if(tc=="KC"){gtd=getdist(39,-95); i01=i01[ gtd<1000. & tv$urban=="A"]; weight=weight*(1-gtd/1000.);}

Other Issues

We've previously had trouble with Windows machines returning NULL to the time function used to mark output files etc. I've put in something that I hope will avoid trouble (tho' you won't get the times in the filename). Please let me know if there are still problems.

Trends are now reported with SE. This is calculated using simple OLS - no correction for correlation of residuals.

Temperature averages are now expressed relative to a 1961-1990 base - ie the mean over this period should be zero.


  1. Cool Nick,

    Work is progressing over here ( slowly) on the metadata side of things. Kinda slow progress due to using R. Coincidentally I was working on some functions to return a list of stations by lat,lon. Either within a distance
    from a given station, within an arbitrary grid box, sorted by nearest neighbor.

    I started working on the data as well, and I'm struggling a bit understanding the duplicates. What they are?
    so for a ghcnid you may have up to 9 duplicates starting and ending
    with different years? I know both you and Jeff are handling these duplicates, ( and I take it that they come from multiple min/max files
    but can you explain them a bit..and how you process them.

  2. Steve, I think duplicates reflect the fact that there may be multiple records of the same data - eg a local logbook and a central office record. They don't cover the same times, but where they overlap they are usually identical. There are also said to be cases where there is more than one thermometer on a site.

    Since there's no apparent reason to prefer one to the oher, I think most people just average. That's what I do. It's done individually for each month and station. That was my original reason for separating the preprocessing stage, since once done, you can store and reuse a smaller file.

  3. Ok,

    That makes sense, I was trying to write some code to look at the duplicates for each ghcnID to see if they were really dups or not
    and to see if averaging just made sense or if it was just what everyone did.
    I was worried about lacuna and cases where they were not identical.
    I was just gunna steal your code or Jeff's but for me this is more about learning R, so its slow going but fun. I still havent heard back on my MAC graphics error from the SIG.

    I looked into some of the GIS stuff for R, some very powerful stuff,
    but my R skills are still infantile. So hard not to write loops.

    I do have some stuff to read in the inv files, clean them up ( put in proper NA, remove commas from names, so that you can export it
    as a proper data frame or CVS. In the end when you ran a case
    you would output the series ( time,values) and also the inv file
    ( with the subsetting ) that produced your results, as well as the actual data your algorithm worked on.

    So: ghcn_mean_aiprorts.Rdata # a file of the actual temps used. a sub
    # of the whole file using your preprocessor
    # format.
    airports_Inv.Rdata # a subset of the inventory you used
    airport_results # the time series
    airport_Kml: a google earth tour of your sites.

    and of course plots

  4. Thanks Nick, I'm now able to understand the duplicates.

  5. Ok, now I have my bearings.

    When looking at this problem it wasnt clear to me that averaging was the only approach. also interested in the error ( too much confidence) added by decisions made at this stage.

    More later

  6. Thanks, Steven, that's helpful. From the choices in that GHCN manual
    "average all data points for that station/year/month to create a mean time series;"
    That's the one I'm doing. I think it's the best general purpose choice. The difference method also has merits.

  7. Thanks, I knew I recognized this issue just took a while for me to re understand the difference between an WMO mod flag and the Duplicate indicator. In the end of course one has to make a decision. I think what I may do is extract that choice as an optional step in the analysis path.

    So: methods for handling duplicates:

    1. Simple averaging of all. (overweighs duplicates)
    2. Hansen's Bias method.
    3. duplicate elimination and then averaging. (whats a real duplicate)

    any more?

    Not anything that will change the final answer in any substantial way,

  8. Steven #3 "So hard not to write loops. "
    One thing I suspect about R is that, as an interpreted language, there's an overhead with each instruction. So for speed I try to pass it computational tasks in large chunks. For example, when counting stations in cells (for each month/year) I don't just do 3 nested loops (s, then m then y, say), I do s and m and then a single instruction to add the y data using a mask. I could have done s and y and then aggregated m, but that's a smaller chunk.

  9. Ya Nick,

    Reading the guides specifically for programmers it's clear that looping is a very slow process in R, partly because its interpreted and partly I imagine because of the underlying programming model. the underlying model is Scheme which is a lisp variant. under the hood so to speak you probably end up with more optimizable code ( for loops) Anyways, I tend to do the same thing that you do. Sometimes though it helps to write it out explicitly as a loop and then get rid of the loop from the inside out. I can always make the looping code WORK..The diffuculting for me is always keeping track of what happens when you use 'functions" as indicies like x<-x[x<3]. The other difficulting is writing stuff that others can understand. at first ur code and jeffs was just inpenetrable ( i was stupid) now I get it, but not always at first glance.