Towards Open Climate Science

December 08, 2009

The events that have transpired (physically) at University of East Anglia and (virtually) around the globe have raised the important question of whether climate science is open and transparent enough. This has led, naturally, to a call for “open source” science.

Personally, this discussion links two amateur passions of mine, climate science and open source. Coincidentally these are central themes of Greg Papadopoulos’ and my book, “Citizen Engineer”, not because we miraculously anticipated this particular point in time, but because we saw these as the two largest knowledge gaps in today’s engineers.

I could write a long article about why an open approach to climate science is the right thing. Even acknowledging the short-term issues that it will create, such as the one raised by Roger Pielke Jr in his response to Andy Revkin, I can argue strongly that it’s the right thing in the long run. So instead of adding to that discussion, I want to move on and talk about what happens next, and propose two activities that can lay the groundwork for the future.

The next important step in this conversation may not be obvious, but it is to formally define what we mean by “open source science”. It’s easy to say that the raw data and code for all peer-reviewed work should be publicly available. But working day-in and day-out in the world of open source software, I know firsthand that reaching a clear, usable definition is far harder than you might believe.

First, there is a series of practical questions that need to be answered, such as how soon data and code need to be available. Is the live stream from satellites available on the web? Is it OK to sit on the data for 6 months? How about waiting until papers using the data have been published? And as people start to work with the data there will inevitably be demands for related data. For example, how much other data on the real-time operation of a satellite will people want in order to do their own calibration of a raw data product?

But beyond the practical issues there’s a more subtle question of licenses. Your average person on the street would assume that open source software is just plain freely available (i.e., in the public domain), but almost all software that is considered ‘open source’ comes with a license that places real conditions on how you can use it. For example, the license may dictate whether the code can be used in a commercial product, or whether the copyright holder grants the user rights to any patents it holds that relate to the software. Trademarks and attribution may also play a role. To see some easy-to-understand license options, take a look at the excellent site CreativeCommons.org, which provides a tool for choosing a license for your website, blog, music, etc.

As you can imagine, the wide array of possible licenses leads to a long, heated and contentious discussion over what is truly “open” and what isn’t. In the software world the Open Source Initiative is a non-profit that was formed for this purpose, and is generally recognized by the open source community as the standard-bearer of the definition. As you can see on their site, they also maintain a list of widely used licenses and how they stack up against the OSI standard.

One of the most intriguing aspects of the open source licensing world is a class of licenses referred to as “viral” or “reciprocal”. These licenses place requirements on derived works, often that the derived work be placed under the same license. The father of this type of license is the GNU General Public License, or GPL. This clever license uses the US copyright system, not to prevent others from using a work, but instead to propagate free and open software. In other words, it says that you get the benefit of using this work, but in exchange you have to share any work that builds on it in the same way.

It’s not hard to imagine using a GPL-like license in climate science. A data set could come with the requirement that results based on the data set also be freely available. Similarly, code that implements an algorithm could carry the same restriction (note that the algorithm itself is difficult to control, but code is covered under copyright law and can have an attached license). This subtle idea could have broad and lasting implications for the use of data and code in climate science.

So as you can see, the question of “what is open climate science” is less well defined than many would imagine, which leads me to two proposed actions.

The first is to find a formal home for the definition of “open climate science”. This important activity needs a home, just as open source software has OSI, which can manage the process of creating and maintaining the definition. This process will take some time, but if the climate science community is serious about transparency and openness, executing on this process will be required to make true progress. (Note: the Creative Commons project [Science Commons](http://sciencecommons.org/) may be useful here, but I don’t know much about it.)

The second proposed action is simpler and can happen quickly. This activity is to publicly document (presumably on a website) basic facts about the ‘openness’ of the top sources of climate data and algorithms. Is the raw data available? Are the algorithms and code available? Who can have access? How do you get them?
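To make the second activity concrete, here is a minimal sketch of what one entry in such a public listing might look like. Everything in it is an assumption for illustration: the `OpennessRecord` name, its fields, and the example values are hypothetical and do not describe any real data source or existing registry.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class OpennessRecord:
    """One entry in a hypothetical public listing of climate data sources."""
    source_name: str
    raw_data_available: Optional[bool]  # None means "not yet determined"
    code_available: Optional[bool]      # None means "not yet determined"
    who_has_access: str                 # empty string means "not yet documented"
    how_to_obtain: str                  # empty string means "not yet documented"


def unanswered(record: OpennessRecord) -> list:
    """Return the names of the questions that still have no answer."""
    return [
        name
        for name, value in asdict(record).items()
        if value is None or value == ""
    ]


# Placeholder entry; the values are invented, not facts about a real source.
example = OpennessRecord(
    source_name="Example temperature record (placeholder)",
    raw_data_available=True,
    code_available=None,
    who_has_access="",
    how_to_obtain="Download from the provider's website",
)

print(unanswered(example))
```

Keeping "not yet determined" distinct from "no" matters here: a listing like this is only useful if it shows which questions nobody has answered yet, not just which answers are negative.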

While I’m not equipped to spearhead the first activity, I can certainly help get the second underway. Anyone else interested?