This past month I finished reviewing 4 papers for CHI and 6 for the WWW conference. For CHI, 3 of the 4 papers described small, simple applications intended to test some user interface idea. For WWW, 4 of the 6 were machine learning/model fitting papers that tested algorithms on particular non-sensitive data sets, one ran a small user study, and one was an experimental system like the CHI submissions. For 9 of these 10 papers, there was absolutely no technical barrier to publishing the source code and/or data used for the paper: no privacy concerns and no complicated system architecture dependencies. (The last paper was a theory paper that needed no replicating.)
There’s been a good amount of discussion about replication of prior experiments and its importance to good science. There’s even a workshop on it at CHI 2014. For replication, lack of the prior system or data is fatal.
Right now everyone talks about the importance of replication, but we aren’t doing enough to make it the norm. So here’s a simple proposal. In future conferences, let’s require that every submission (and every final published version) include a single, final paragraph labeled “Replication”. In it, the authors should describe which of their experimental materials will be necessary to replicate their experiment, and either (i) where to find them or (ii) why they can’t make them available.
For an experimental system, the authors should give a link to the source code, or explain why their code cannot be shared. “My code is ugly” is a terrible excuse. “My code won’t run on your system” is only slightly better: even if it won’t run as is, it may be easier for future replicators to modify the previous code than to write their own system from scratch. But there are plenty of legitimate explanations, including “I work at a company and the code is protected intellectual property”; we may not like that one, but we can’t expect a single author to change company policy.
For data, the authors should give a link to their data gathering instruments (e.g. surveys or software) and data sets, or explain why it’s impossible to share the data—privacy concerns being the obvious reason. Of course, such concerns can often be addressed by properly scrubbing the data, so authors who withhold it should also explain why scrubbing won’t work in their case. The right preparation can help—for example, it could be easy and useful to get consent from survey respondents to publish their raw responses.
Ideally, this replication paragraph won’t just be boilerplate. Rather, the program committee will take it into consideration as they make decisions about what to accept. A paper that fails to justify withholding code or data should be rejected, or should be accepted only on the condition that the code or data be made available. But even if the PC doesn’t want to be so strict in the beginning, I suspect that the need to explain, in public, why they aren’t sharing their code or data will be a great prod encouraging more authors to make that small extra effort, to the benefit of science.
I’ll brag that our group is doing reasonably well on this front. Nowadays, all the software projects we build go up on GitHub, either in the general group repository or in a project-dedicated one like exhibit or nb. Using GitHub significantly improves our software development process, and makes public sharing happen by default. We’ve even accomplished a little bit on data: when we published our CHI study of lightweight note-taking, we posted a site where any users of our tool could “contribute their notes to science”, scrubbing them and marking them for public use. Scientists can download the resulting corpus of notes for their own research. At 2,500 notes, our published corpus is only a small fraction of our full 200,000-note collection, but it’s a start. If other groups were making similar efforts and it became the norm, we’d be trying harder to grow that corpus.