I attended the VISSW 2011 workshop last Sunday. It was fun, but a few of the papers exhibited a painfully familiar pattern: they put together a plausible-seeming user interface but didn’t evaluate it with a user study. I left frustrated, with no sense of whether the interface ideas would be good or bad to incorporate into my own work. With the system already implemented, other researchers have little incentive to reimplement it (a reimplementation wouldn’t be novel), so they can’t evaluate it either. Thus, if the original researchers don’t do the evaluation, nobody will. This is a common complaint in computer science—our field doesn’t seem committed to following through with evaluations of the ideas it invents and implements. Some faculty at Stanford have even created a course aimed at teaching students how to properly evaluate their research systems.
So here’s a proposal for improving the incentives a little bit. Change the submission requirements for conference papers: each submission must contain the system description and the hypothesis to be tested, along with a detailed evaluation plan. Papers are then reviewed and accepted on the basis of a commitment to execute the evaluation plan (and update the paper with the results) after acceptance but before the conference.
This approach would have several benefits.
- Researchers could defer the work of evaluation until their submission is accepted. Once it’s accepted, they have strong motivation to do the evaluation (else the paper cannot be presented). And no evaluation effort is wasted on work that turns out not to be publishable.
- The evaluations would take place after the submission deadline, meaning work on the system could continue right up to that deadline. This gives us something to do in the “dead space” between acceptance and presentation (which is forced upon us by the long lead time required for travel planning). The work presented at the conference would be “fresher”; the long lead time on conference submissions would have less impact on the publication of timely results.
- This approach would also address the recently popularized problem of bias toward positive evaluation outcomes, which may lead to incorrect claims of statistical significance. If reviewers consider a paper that contains only the system and the evaluation procedure, they will be forced to assess the paper purely on whether the proposed system is interesting enough to be worthy of evaluation. If it is, then the paper should be accepted regardless of whether the outcome of that evaluation is positive or negative. If it is not, then the inclusion of a positive evaluation should not change the rejection decision.
Turning to logistical concerns, this approach means that the paper is not finalized until shortly before the conference (a couple of weeks before, to give reviewers a chance to confirm that the evaluation plan was followed). But as more conferences move toward electronic-only publication, this schedule becomes feasible. And this scheme wouldn’t cover, e.g., multi-year longitudinal evaluations. But it would certainly cover a large number of the papers with short (inadequate?) user studies appearing in our HCI conferences.
Of course, there’s the simpler approach of requiring evaluations at submission time. This meets the primary goal of having systems evaluated, but loses the three benefits I’ve outlined above: researchers would invest energy evaluating systems that would be rejected independent of the evaluation; the evaluation work would be older and staler by the time of the conference; and the bias of reviewers toward accepting positive results would continue.