Lots of smoke, hardly any gun. Do climatologists falsify data?

One of climate change denialists’ favorite arguments concerns the fact that weather station temperature data cannot always be used raw. Sometimes they need to be adjusted. Adjustments are necessary to compensate for changes that happened over time, either to the station itself or to the way data were collected: if the weather station gets a new shelter or gets relocated, for instance, we have to account for that and adjust the new values; if the time of day at which we read a certain temperature has changed from morning to afternoon, we have to adjust for that too. Adjustments and homogenizations are necessary in order to compare or pool together data coming from different stations or different times.

Some denialists have problems understanding the very need for adjustments, and they seem rather scared by the word itself. Others, like Willis Eschenbach at Watts Up With That, fully understand the concept but still look at it as a somehow fishy procedure. The denialists’ bottom line is that adjustments interfere with readings and, if they are biased in one direction, they may create a warming that doesn’t actually exist, either by accident or as a result of fraud.

To prove this argument they recurrently show this or that probe to have weird adjustment values and if they find a warming adjustment they often conclude that data are bad – and possibly people too. Now, let’s forget for a moment that warming measurements go way beyond meteorological surface temperatures. Let’s forget satellite measurements and let’s forget that data are collected by dozens of meteorological organizations and processed in several datasets. Let’s pretend, for the sake of argument, that scientists are really trying to “heat up” measurements in order to make the planet appear warmer than it really is.

How do you prove that? Not by looking at the single probes of course but at the big picture, trying to figure out whether adjustments are used as a way to correct errors or whether they are actually a way to introduce a bias. In science, error is good, bias is bad. If we think that a bias is introduced, we should expect the majority of probes to have a warming adjustment. If the error correction is genuine, on the other hand, you’d expect a normal distribution.

So, let’s have a look. I took the GHCN dataset available here and compared all the adjusted data (v2.mean_adj) to their raw counterpart (v2.mean). The GHCN raw dataset consists of more than 13,000 station records, but of these only about half (6737) pass the initial quality control and end up in the final (adjusted) dataset. I calculated the difference for each pair of raw vs. adjusted data and quantified the adjustment as a trend of warming or cooling in degC per decade. In this way I got a set of 6533 adjustments (that is, 97% of the total; a couple of hundred were lost along the way due to the quality of the readings). Did I find the smoking gun? Nope.
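As a rough illustration of the calculation just described, here is a minimal Python sketch of how a single adjustment trend could be computed. It is not necessarily how the original scripts work: the function names and the input layout (one dict per station, mapping year to annual mean temperature in degC) are assumptions made for the example, and the v2.mean and v2.mean_adj files would first have to be parsed into that form.

```python
def linear_trend_per_decade(years, values):
    """Ordinary least-squares slope of values against years, scaled to units per decade."""
    n = len(years)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all years are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return 10.0 * sxy / sxx  # slope per year -> slope per decade


def adjustment_trend(raw, adjusted):
    """Trend (degC/decade) of adjusted-minus-raw, using only years present in both.

    raw, adjusted: dicts mapping year -> annual mean temperature in degC
    (a hypothetical layout chosen for this sketch).
    """
    common = sorted(set(raw) & set(adjusted))
    diffs = [adjusted[y] - raw[y] for y in common]
    return linear_trend_per_decade(common, diffs)


# Made-up station: the adjustment averages out to zero but still adds warming.
raw = {1960: 10.0, 1970: 10.0, 1980: 10.0, 1990: 10.0, 2000: 10.0}
adjusted = {1960: 9.2, 1970: 9.6, 1980: 10.0, 1990: 10.4, 2000: 10.8}
print(adjustment_trend(raw, adjusted))  # approximately 0.4 degC/decade
```

In this made-up example the adjustments sum to roughly zero, yet the fitted trend is about 0.4 degC/decade. That is the reason for measuring each adjustment as a trend rather than as an average (see note 2).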

Distribution of adjustment bias in the GHCN/CRU dataset

Not surprisingly, the distribution of adjustment trends (see note 2) is a quasi-normal (see note 3) distribution with its peak pretty much around 0 (0 is the median adjustment and 0.017 C/decade is the average adjustment; for comparison, the planet’s warming trend over recent decades has been about 0.2 C/decade). In other words, most adjustments hardly modify the readings, and the warming and cooling adjustments end up compensating each other (see notes 1 and 5). I am sure this is no big surprise. The point of this analysis is not to check the good faith of the people handling the data: that is not under scrutiny (and not because I trust the scientists but because I trust the scientific method).
The point is actually to show the denialists that going probe after probe, cherry-picking those with a “weird” adjustment, is a waste of time. Please stop the nonsense.
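The summary numbers quoted above (median, mean, and how warming and cooling adjustments balance out) are easy to compute once you have the list of per-station trends. A minimal sketch follows; the function name is hypothetical and the input values are made up, not taken from GHCN.

```python
import statistics


def summarize_adjustments(trends):
    """Summarize a list of per-station adjustment trends (degC/decade)."""
    if not trends:
        raise ValueError("no trends to summarize")
    n = len(trends)
    warming = sum(1 for t in trends if t > 0)
    cooling = sum(1 for t in trends if t < 0)
    return {
        "n": n,
        "median": statistics.median(trends),
        "mean": statistics.fmean(trends),
        "warming_fraction": warming / n,
        "cooling_fraction": cooling / n,
    }


# Made-up trends, only to show the shape of the output.
example = [-0.2, -0.1, 0.0, 0.1, 0.3]
for key, value in summarize_adjustments(example).items():
    print(key, value)
```

If adjustments were being used to add warming, you would expect the warming fraction to sit well above the cooling fraction and the median to move away from zero.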

Edit December 13.
Following the interesting input in the comments, I added a few notes to clarify what I did. I also feel like I should explain better what we learn from all this, so I am adding a new paragraph here (in fact, it’s just a comment promoted to paragraph).

How do you evaluate whether adjustments are a good thing?

To start, you have to think about why you want to adjust data in the first place. The goal of the adjustments is to modify your readings so that they can be easily compared (a) intra-probe and (b) inter-probe. In other words: you do it because you want to (a) be able to compare the measures you take today with the ones you took 10 years ago at the same spot and (b) be able to compare the measures you take with the ones your next door neighbor is taking.

So, in short, you do want your adjustments to significantly modify your data: this is the whole point! Now, how do you make sure you do it properly? If I were in charge of the adjustments I would do two things. 1) Find another dataset, possibly one that doesn’t need adjustments at all, to compare my stuff with: it doesn’t have to cover the entire period, it just has to overlap enough to be used as a test for my system. The satellite measurements are good for this. If we see that our adjusted data agree well with the satellite measurements from 1980 to 2000, then we can be pretty confident that our way of adjusting data is going to be good before 1980 as well. There are limits, but it’s pretty damn good. Alternatively you can use a dataset from a completely different source. If the two datasets arise from different stations, go through different processing and yet yield the same results, you can go home happy.
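One simple way to put a number on “agree well” is to look at the differences between the two series over the years they share. The sketch below assumes both series are dicts mapping year to a temperature anomaly in degC relative to the same baseline; that layout, the function name, and the values are assumptions for illustration, not something taken from the actual datasets.

```python
import math


def overlap_agreement(series_a, series_b):
    """Compare two year -> anomaly (degC) series over their common years."""
    common = sorted(set(series_a) & set(series_b))
    if not common:
        raise ValueError("the two series do not overlap")
    diffs = [series_a[y] - series_b[y] for y in common]
    mean_diff = sum(diffs) / len(diffs)
    rms_diff = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {
        "first_year": common[0],
        "last_year": common[-1],
        "mean_diff": mean_diff,  # systematic offset between the two series
        "rms_diff": rms_diff,    # typical size of the disagreement
    }


# Made-up anomalies, only to show how the function is called.
adjusted_surface = {1979: 0.0, 1980: 0.1, 1981: 0.2}
independent = {1980: 0.1, 1981: 0.4, 1982: 0.5}
print(overlap_agreement(adjusted_surface, independent))
```

A small offset and a small RMS difference over the overlap would support the adjustment procedure. A difference that grows over time would be a reason to look more closely.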

Another way of doing it is to remember that a mathematical adjustment is just a trick to overcome a lack of information on our side. We can take a random sample of probes and do the statistical adjustment, then go back and look at the history of each station. For instance: our statistical adjustment is telling us that a certain probe needs to be shifted by +1 in 1941, but of course it will not tell us why. So we go back to the metadata and we find that in 1941 there was a major change in the history of our weather station, for instance a war and the subsequent move of the probe. Bingo! It means our statistical tools were very good at reconstructing the actual events of history. That is another strong argument that our adjustments are doing a good job.
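The metadata check described above can be sketched as a simple matching exercise: take the years in which the statistical procedure applied a shift and see how many fall close to a documented event in the station history. The function name, the one-year tolerance window, and the example years below are all assumptions made for illustration.

```python
def match_breaks_to_history(detected_years, documented_years, window=1):
    """Pair each statistically detected shift with the nearest documented
    station event within `window` years; return (matched, unexplained)."""
    matched = {}
    unexplained = []
    for year in detected_years:
        candidates = [d for d in documented_years if abs(d - year) <= window]
        if candidates:
            matched[year] = min(candidates, key=lambda d: abs(d - year))
        else:
            unexplained.append(year)
    return matched, unexplained


# Made-up example: shifts detected in 1941 and 1967; the station history
# records a relocation in 1941 and new equipment in 1985.
matched, unexplained = match_breaks_to_history([1941, 1967], [1941, 1985])
print("explained by metadata:", matched)
print("not explained:", unexplained)
```

The more detected shifts that line up with real events in a random sample of stations, the stronger the case that the statistical adjustment is recovering actual history.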

Did we do any of those things here? Nope. Neither I, nor you, nor Willis Eschenbach nor anyone else on this page actually tested whether adjustments were good! Not even remotely so.
What did we do? We tried to answer a different question, that is: are these adjustments “suspicious”? Do we have enough information to think that scientists are cooking the data? How did we test that?

Willis picked a random probe and decided that the adjustments he saw were suspicious. End of story. If you think about it, his entire post revolves around figure 8, which is simply a plot of the difference between adjusted data and raw data. So, there is no value whatsoever in doing that. I am sorry to be blunt with Willis like this, but that is what he did and I cannot hide it. No information at all.

What did I do? I just took a step back and asked myself: is there actually a reason in the first place to think that scientists are cooking data? I did what is called a unilaterally informative experiment. Experiments can be bilaterally informative, when you learn something no matter what the outcome of the experiment is (these are the best); unilaterally informative, when you learn something only if you get a specific outcome and in the other case you cannot draw conclusions; or not informative at all.
My test was to look for a bias in the dataset. If I were to find that the adjustments are introducing a strong bias then I would know that maybe scientists were cooking the data. I could not be sure about it, though, because (remember!) the whole point of doing adjustments is to change data in the first place! It is possible that most stations suffer from the same flaws and therefore need adjustments going in the same direction. That is why, if my experiment had led to a biased outcome, it would not have been informative.
On the other hand, I found instead that the adjustments themselves hardly change the value of readings at all and that means I can be pretty positive that scientists are not cooking data. This is why my experiment was unilaterally informative. I was lucky.

This is not a perfect experiment though because, as someone pointed out, there is a caveat: in former times the distribution of probes was not as dense as it is today and, since global temperature is calculated by doing spatial averages, you may overrepresent warming or cooling adjustments in a few areas while still maintaining a pretty symmetrical distribution. So, to test this you would have to check the distribution not for the entire sample, as I did, but grid cell by grid cell. (I am not going to do this because I believe it is a waste of time, but if someone wants to, be my guest.)
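For anyone who does want to try the grid-by-grid check, a minimal sketch could look like the following. It bins stations into latitude/longitude cells and takes the median adjustment trend in each cell, so that a dense cluster of stations counts once rather than many times. The 5-degree cell size, the function names, and the station values are assumptions for illustration, not the gridding used by any particular temperature product.

```python
import math
import statistics
from collections import defaultdict


def grid_cell(lat, lon, size_deg=5.0):
    """Index of the size_deg x size_deg cell containing (lat, lon)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def median_trend_by_cell(stations, size_deg=5.0):
    """stations: iterable of (lat, lon, adjustment_trend) tuples.
    Returns a dict mapping cell index -> median adjustment trend in that cell."""
    cells = defaultdict(list)
    for lat, lon, trend in stations:
        cells[grid_cell(lat, lon, size_deg)].append(trend)
    return {cell: statistics.median(values) for cell, values in cells.items()}


# Made-up stations: two share a cell, one sits alone in another cell.
stations = [
    (41.9, 12.5, 0.1),
    (43.0, 11.0, -0.1),
    (-33.9, 151.2, 0.2),
]
print(median_trend_by_cell(stations))
```

The distribution of the per-cell medians can then be inspected in the same way as the per-station distribution, to see whether it is still centered near zero.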

Finding the right relationship between the experiment you are doing and the claim you make is crucial in science.

Notes.
1) Nick Stokes, in this comment, posted R code that does exactly the same thing, confirming the result.

2) What I consider here is the trend of the adjustment, not the average of the adjustment. Considering the average would be methodologically wrong. This graph and this graph both have an average adjustment of 0, yet the first one has trend 0 (and does not produce warming) while the second one has a trend of 0.4 C/decade and produces 0.4 C/decade of warming. If we were to consider the average, we would erroneously place the latter graph in the same category as the former.

3) Not mathematically normal as pointed out by dt in the comments – don’t do parametric statistics on it.

4) The Python scripts used for the quick and dirty analysis can be downloaded as tar.gz here or zip here.

5) RealClimate.org found something very similar but with a more elegant approach and on a different dataset. Again, their goal (like mine) is not to add pieces of scientific evidence to the discussion, because these tests are simple and nice but, let’s face it, quite trivial. The goal is really to show the blogosphere what kind of analysis should be done in order to properly address this kind of issue, if one really wants to.