Friday, August 21, 2015

NOAA gives you data that you cannot use.

NOAA releases data, and example tools to process it, but they do not release data in a form that those tools can use. Check this note from the "Pairwise Homogenization Algorithm" software readme:
# A New Dataset must have at least a Station List and a Data Files. At a minimum, the
# Station List defines the Station ID/Latitude/Longitude. The Data Files (one for
# each station/element - see data/benchmark/world1/monthly/raw) defines all 
# of the Monthly Temperatures in Annual Records. All files are in the GHCNMv3 format.
Now, the data that is released is yearly, so you cannot use the software they supply to recreate their data.
I have the Pairwise Homogenization Algorithm set up in MATLAB, and I can make it do whatever I want to the data, including giving me totally erroneous results.
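
For reference, here is a rough sketch of reading one record of the kind the readme describes (monthly temperatures in annual records). It is written in Python rather than my MATLAB scripts, and it assumes the usual GHCNMv3 fixed-width layout: an 11-character station ID, a 4-digit year, a 4-character element, then twelve 5-character values in hundredths of a degree C, each followed by three flag characters, with -9999 meaning missing. The function name and the sample line are made up for illustration; this is not NOAA's code.

```python
MISSING = -9999  # assumed GHCNMv3 missing-value marker


def parse_ghcnm_v3_line(line):
    """Parse one annual record into (station_id, year, element, months).

    months is a list of 12 values in degrees C, with None for missing months.
    Assumes the fixed-width GHCNMv3 layout described above.
    """
    if len(line) < 115:
        raise ValueError("record is shorter than the assumed 115 columns")
    station_id = line[0:11]
    year = int(line[11:15])
    element = line[15:19]
    months = []
    for m in range(12):
        start = 19 + m * 8          # 5 value characters + 3 flag characters
        raw = int(line[start:start + 5])
        months.append(None if raw == MISSING else raw / 100.0)
    return station_id, year, element, months


# A made-up record: fake station ID, February missing, blank flags.
values = [1234, MISSING] + [500] * 10
sample = "XXX00000001" + "1990" + "TAVG" + "".join(f"{v:5d}   " for v in values)
print(parse_ghcnm_v3_line(sample))
```

If the real files deviate from this layout, the column offsets are the only thing that should need to change.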

I actually emailed NOAA, and they didn't get back to me, presumably because I'm asking questions that they have already decided everyone "in the know" should know. It's always nice to get an official answer.
--mail--
To whom it may concern,

I am looking for sources for the weights for the Pairwise Homogenization Algorithm that is used in the NOAA data reported in ushcn.tavg.latest.FLs.52j.tar.gz.   I have MATLAB scripts setup to analyze the data, but I’ve been having trouble.  I believe that I cannot reproduce the data because I am missing the monthly entries, and the exported data is yearly.   Any guidance or references would be appreciated. 

Also, there are about 20% fewer weather stations in 2015 than in 1990.  Do you have a reference for this as well?  I'm just curious why there are so many fewer stations.
--end mail--
After reading a bunch of papers, I've decided that techniques such as infilling are pretty much magic, and that they should be disregarded when you can actually have a meaningful dataset.
Here are all of the reasons that you might need to tune a dataset:
Changes in type of equipment.
Changes in the area around the thermometer (an urban heat island, for example).
A station has consistently given bad data.
A station has been moved.

After reading through the papers that I could find, I believe that what NOAA has done is build a self-correlating time series. You can use the Pairwise Homogenization Algorithm to produce a series that contains the same anomalies over time that you are trying to avoid. What I really want to know is why NOAA does not release a useful monthly dataset. This is the sort of thing that makes me shake my head at the soft sciences. They might be right, or wrong, but they are definitely sloppy by the standards of engineering.

Where does this leave me? I'm going to look at the actual data, remove anything that is not complete, and then plot it. Even if there are changes due to urbanization, equipment, and bad data, it is still real data. If there is heating due to urbanization, that's just part of the heat, and it should show up as an increased warming trend over the years.
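
Roughly, the plan looks like the sketch below. It is in Python for illustration, not the actual temperature.sh, and it makes assumptions: "not complete" is taken to mean any year with a missing month, the input is a dict of year to twelve monthly values (None for missing), and the trend is a plain least-squares slope. All function names and numbers are made up, and plotting is left out.

```python
def complete_years(records):
    """Keep only years that have all 12 monthly values present."""
    return {
        year: months
        for year, months in records.items()
        if len(months) == 12 and all(v is not None for v in months)
    }


def annual_means(records):
    """Average the 12 months of each complete year."""
    kept = complete_years(records)
    return {year: sum(kept[year]) / 12.0 for year in sorted(kept)}


def linear_trend(means):
    """Ordinary least-squares slope, in degrees per year."""
    years = sorted(means)
    if len(years) < 2:
        raise ValueError("need at least two complete years for a trend")
    values = [means[y] for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(values) / len(values)
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


# Made-up station data: 2002 is missing December, so it gets dropped.
records = {
    2000: [10.0] * 12,
    2001: [11.0] * 12,
    2002: [12.0] * 11 + [None],
    2003: [13.0] * 12,
}
means = annual_means(records)
slope = linear_trend(means)
print(means, slope)
```

Dropping whole years is the bluntest possible filter, but it means every number that gets plotted came from a thermometer and not from an adjustment.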

The new temperature.sh file is in SVN as revision 6.
