Bill Mill web site logo

Why Publish CS Papers Without Code?

Imagine that you read a paper analyzing "Hamlet" in great detail. Intrigued, you went to the bookstore to look for a copy, only to find that there were none. Confused, you checked the library - nothing. Finally, you went online to shakespeare.com, only to find a note saying that "Hamlet" was still being edited and would be released Any Day Now.

Unfortunately, an analogous situation is the norm in the world of academic computer science. Graduate students and professors produce code by the truckload, and a majority of them produce only papers about their work. While this is certainly not always the case, it was difficult for me to even find these examples of academic source code. In all four cases, it has proven very useful to programmers outside of academia.

Life, not Math

The fact is, computer science research is more like biology research than it is like mathematical research. For an experiment to be valid, it should be repeatable. If you're publishing analyses of programs without publishing the code that was analyzed, how can the community possibly verify what you claim?

Bugs are endemic to source code everywhere, and there is no reason to believe that academic code is any different. All of the analysis in the world is irrelevant if the program you are analyzing has a subtle bug embedded in it. We, computer scientists, should expect and demand that published, well tested code be made available with every paper which claims to analyze or draw conclusions from a program of any significant size.

A Problem of Environment?

At the moment, there is just no easy way to do this. Where the open source world has SourceForge and other project hosting sites, there is no similar environment where academic papers can live alongside the code they reference. Instead, computer science researchers are encouraged to publish papers in traditional, copyright-locked, academic journals. These journals are not prepared to handle large volumes of code, nor should they be. Computer science research was made for the web, and there should be a home for it on the web.

It is intimidating for a harried academic that just wants to get work done, and advance in his field, to have to set up the host of tools required to share code effectively. If there were a SourceForge-for-academics site where they could simply register their paper, drop in the source code, and have a project ready made for them, it would be much more likely that they would participate. If such a project were to gain steam, the network effect would make it an invaluable resource for academic and industry programmers alike. Collaboration and open code sharing would lead to much more rapid progress in the field, and hopefully encourage the greater rigor that other fields require of their practitioners.

Two Different Worlds

Right now, academics and open-source programmers live largely in two separate, often parallel, worlds. Where they do collide, as they do in the Haskell language, there is often extremely interesting work being produced. Each brings a different, interesting, viewpoint to the table, and greater coordination between the two would have nearly universal benefits for programmers of every stripe.

So free the code! If you're an academic programmer, consider publishing the code that you have, regardless of what you may think of it. Consider asking the journal that you publish in to retain copyright over your work. If you're an open source programmer, look for some work in a field that interests you, and email the author if he hasn't released his code. Ask him an interesting question, and convince him that his code would be useful. If you're both, tell the world your ideas for how we can all work better together.

Update:

I've been contacted already by two scientists who care about the reproducibility of programs. First, Gavin Baker wrote to tell me about the insight toolkit, which "is a cross-platform open-source image processing toolkit for performing registration and segmentation" that attempts to provide reference implementations of published algorithms. He also pointed me towards The Insight Journal, which allows authors to publish open articles which are automatically verified with CMake. This is exactly the type of thing I was hoping to hear about.

Only a few minutes later, "I. Vlad" wrote to tell me that computational geophysicists have a similar system called Madagascar, which uses SCons to provide automated verification of results. Furthermore, they encourage the use of open data sets from the website for the Society of Exploration Geophysicists.

Good to hear that these scientists are out there making stuff happen, while I sit here on my duff.

Update 2

Made a correction to the update. (Made it clear that the Madagascar project did not set up the SEG site - sloppy writing on my part).