Thursday, October 16, 2014

QC for TempLS

I plan to do more with TempLS (see last post), so I want a stable data quality control (QC) scheme. GHCN unadjusted is a document of record, and it contains some weird values that its maintainers seem reluctant to touch. I noted some current examples earlier in the year. So I did a survey of the data since 1850. Here is R's summary of the monthly averages (°C):

Min.   1st Qu.  Median  Mean  3rd Qu.  Max.   NA's
-87.0  6.6      15.9    14.5  24.1     154.4  1166444

That's out of 10349232 months (of years with some data). Yes, the max is 154.4°C. There were 28 months where the monthly average (the mean of min and max, not the max itself) was above 50°C.
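For readers who do not use R, here is a minimal Python sketch of the same kind of survey. The function name `summarize`, the use of `None` for a missing month, and the 50°C threshold argument are illustrative assumptions; it is not the code behind the numbers above. The "inclusive" quantile method is, as far as I know, the one that corresponds to R's default.

```python
import statistics


def summarize(values, threshold=50.0):
    """Summarize monthly averages in deg C; None marks a missing month."""
    present = sorted(v for v in values if v is not None)
    # Quartiles by linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "min": present[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(present),
        "q3": q3,
        "max": present[-1],
        "na": len(values) - len(present),
        # Count of months above the implausibility threshold.
        "over_threshold": sum(1 for v in present if v > threshold),
    }
```

The `over_threshold` count plays the role of the "months above 50°C" figure quoted above.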

To be fair, they use flags, and these oddities seem to arise mostly where a decimal point slipped in the originating data. But they are big enough to have effects, so I have been using my own QC. On first look, I found the GHCN flags numerous and unhelpful, so I used a scheme where I checked each value against the adjusted file. That seemed to weed out the problem points, which are removed there without replacement. However, it also excluded a lot of other points, so I allowed those back in if they were within 3°C of the appropriate mean.
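The rule just described can be sketched in a few lines of Python. This is one reading of it, not the actual TempLS code: the function name and arguments are hypothetical, and "the appropriate mean" is assumed here to be some reference mean for that station and calendar month.

```python
def passes_qc(raw, adjusted_present, reference_mean, tolerance=3.0):
    """Decide whether to keep an unadjusted monthly value (deg C).

    raw              -- unadjusted value, or None if missing
    adjusted_present -- True if the adjusted file has a value for this month
    reference_mean   -- assumed reference mean for the month, or None
    tolerance        -- allowed departure from the mean, in deg C
    """
    if raw is None:
        return False
    if adjusted_present:
        # The adjusted file kept this month, so accept the raw value.
        return True
    if reference_mean is None:
        return False
    # Dropped from the adjusted file: keep only if close to the mean.
    return abs(raw - reference_mean) <= tolerance
```

With a reference mean of 25°C, a value of 154.4 that is absent from the adjusted file is rejected, while 26.5 is let back in.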

I did that for the last five years of data and it worked well enough. But then I thought I should try for the whole record. I worked out how to extract the QC flags; here is a table of their occurrence:


The first column is no flag. QC flags aren't that common, about 0.24% of the total. The letters mean, according to the readme:
  • D - apparent duplicate
  • L - isolated
  • M - manual
  • O - outlier (>5 sd)
  • S - not so outlier (>2.5 sd), but no nearby data in agreement
  • W - apparent duplicate of the previous month's value
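
A rough sketch of how the QC flags might be extracted and tallied in Python. It assumes the GHCN-M version 3 fixed-width layout: an 11-character station ID, a 4-digit year, a 4-character element, then twelve 8-character monthly fields, each holding a 5-character value in hundredths of a degree followed by three one-character flags, the middle one being the QC flag, with -9999 for missing. That layout should be checked against the readme. The function names are illustrative.

```python
from collections import Counter

MISSING = -9999  # assumed missing-value code


def parse_ghcn_line(line):
    """Split one fixed-width record into station, year, element, months.

    months is a list of 12 (value_in_deg_C_or_None, qc_flag) pairs.
    """
    station = line[0:11]
    year = int(line[11:15])
    element = line[15:19]
    months = []
    for m in range(12):
        start = 19 + 8 * m
        raw = int(line[start:start + 5])
        qc_flag = line[start + 6:start + 7].strip()  # blank means no flag
        value = None if raw == MISSING else raw / 100.0
        months.append((value, qc_flag))
    return station, year, element, months


def count_qc_flags(lines):
    """Tally QC flags over all non-missing monthly values."""
    counts = Counter()
    for line in lines:
        _, _, _, months = parse_ghcn_line(line)
        for value, qc_flag in months:
            if value is not None:
                counts[qc_flag or "none"] += 1
    return counts
```

Dividing each flag's count by the total of all counts gives the kind of percentage quoted above.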

O and S are the big ones, and as expected, the very high ones are flagged O.

So I decided to just omit all flagged data. In future, I'll do that to all GHCN unadjusted before use.
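
In code, that decision is a one-line filter. The sketch below is Python and assumes each month has already been read as a (value, QC flag) pair, with `None` for a missing value and an empty or blank string for no flag; the function name is hypothetical.

```python
def drop_flagged(months):
    """Keep a monthly value only if it is present and carries no QC flag."""
    return [
        value if value is not None and not qc_flag.strip() else None
        for value, qc_flag in months
    ]
```

Flagged months come back as `None`, so downstream code treats them the same way as months that were missing in the first place.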

If you've been watching the latest data over the last day, you'll have seen me experimenting. I think it is stable now.

