A Proposal for Increasing Evaluation in CS Research Publication

I attended the VISSW 2011 workshop last Sunday. It was fun, but a few of the papers exhibited a painfully familiar pattern: they put together a plausible-seeming user interface but didn’t evaluate it with a user study. I left frustrated, with no sense of whether the ideas in those interfaces would be good or bad to incorporate in my own work. With the system already implemented, other researchers are disincentivized from implementing it themselves (it wouldn’t be novel), so they can’t evaluate it. Thus, if the original researchers don’t do the evaluation, nobody will. This is a not uncommon complaint in computer science: our field doesn’t seem committed to following through with evaluations of the ideas it invents and implements. Some faculty at Stanford have even created a course aimed at teaching students how to properly evaluate their research systems.

So here’s a proposal for improving the incentives a little bit. Change the submission requirements for conference papers: a submission must contain the system description and the hypothesis to be tested, along with a detailed evaluation plan. Papers are then reviewed and accepted on the basis of a commitment to execute the evaluation plan (and update the paper with the results) after acceptance but before the conference.

This approach would have several benefits.

  1. Researchers could defer the work of evaluation until their submission is accepted. Once it’s accepted, they have strong motivation to do the evaluation (else the paper cannot be presented). For work that turns out not to be publishable, no evaluation effort is wasted, because none was invested.
  2. The evaluations would take place after the submission deadline, meaning work on the system could continue right up to that deadline. This gives us something to do in the “dead space” between acceptance and presentation (which is forced upon us by the long lead time required for travel planning). The work presented at the conference would be “fresher”; the long lead time on conference submission would have less impact on the publication of timely results.
  3. This approach would also address the recently popularized problem of a bias towards positive-outcome evaluation that may lead to incorrect claims of statistical significance in outcomes. If reviewers consider a paper that contains only the system and evaluation procedure, they will be forced to assess the paper purely on the grounds of whether the proposed system is interesting enough to be worthy of evaluation. If it is, then the paper should be accepted regardless of whether the outcome of that evaluation is positive or negative. If it is not, then the inclusion of a positive evaluation should not change the rejection decision.

Turning to logistical concerns, this approach means that the paper is not finalized until shortly before the conference (a couple weeks, to give reviewers a chance to confirm that the evaluation plan was followed).  But as more conferences move towards electronic-only publication, this schedule becomes feasible.  And this scheme wouldn’t cover e.g. multi-year longitudinal evaluations.  But it would certainly cover a large number of the papers with short (inadequate?) user studies appearing in our HCI conferences.

Of course, there’s the simpler approach of requiring evaluations at submission. This meets the primary goal of having systems evaluated, but loses the three benefits I’ve outlined above: researchers would invest energy evaluating systems that would be rejected regardless of the evaluation; the evaluation work would be staler by the time of the conference; and the bias of reviewers towards accepting positive results would continue.

6 Responses to “A Proposal for Increasing Evaluation in CS Research Publication”

  • [...] This post was mentioned on Twitter by Omar Alonso and David Karger, Jonathan Groves. Jonathan Groves said: Haystack: A Proposal for Increasing Evaluation in CS Research Publication: I attended the VISSW 2011 workshop la… http://bit.ly/h3XYkG [...]

  • David Huynh says:

    David, maybe it’s not about evaluation, but re-evaluation. Whenever I read about experiments done in other disciplines, I always read that those experiments were reproduced in many locations over the world, on different subjects, by many different teams. I don’t think I’ve ever read about an HCI experiment that gets re-evaluated by independent teams. If there’s no re-evaluation, what’s the point of evaluation? Should we just take the authors’ words that their 6 subjects recruited from their local community agreed that their research was promising? There’s no incentive at all to re-evaluate someone else’s work, and it’s next to impossible to get anybody else’s system to run in the first place.

  • David Karger says:

    Indeed, it would be amazing progress if CS began to engage in reevaluation. But I’ll settle for half a loaf over none. Evaluation is helpful even without reevaluation. Evaluation, if done properly, will confirm or refute the authors’ hypothesis. If there’s no evaluation, you have to trust that the authors can know the correctness of their hypothesis without ever testing it. That’s a lot of trust. If there’s evaluation without reevaluation, you only have to trust that the authors were careful and honest in doing the evaluation. This seems easier to believe. Of course it isn’t black and white; even careful, honest scientists make mistakes, which is why reevaluation is so useful. But that first evaluation does provide a significant increase in confidence, even if it doesn’t give you absolute certainty.

  • Sudheendra says:

    I think this is a great idea, but I think committees will find it hard to accept papers without some indication of how the evaluation turns out. And sometimes the data informs the analysis that needs to be done, so it's hard to come up with the detailed evaluation plan without looking at some data. How about if the authors were allowed to report the results of a formative user study (say 5 users, p < 0.15 is ok) and present a detailed evaluation plan to expand the study, for submission? They can expand the study after the paper gets accepted.

    Speaking for myself, this is something I would like to do anyway between acceptance and camera ready (if allowed by the conference and if I think it doesn't change the conclusions except to make them stronger), since it gives me a stronger archival publication. I think people are generally happy to do a good amount of work for an accepted paper; but doing a lot of work for something that may get thrown away because a paper gets rejected seems a pity.

  • David Karger says:

    Formative evaluations to help figure out what to evaluate are a normal part of the research process, and are often described in the submission/publication. I think that’s fine, and expect many people would choose to include these in their pre-evaluation writeup. However, technically these formative evaluations don’t prove anything, so it isn’t a case of giving the committee “some indication of how the evaluation turns out”; it’s just a case of convincing the committee that the direction is interesting and worthy of evaluation.

  • A. Tasso says:

    I don’t know if I agree with the pre-trial acceptance. This could potentially tilt researcher incentives towards focusing on trial design rather than on implementation and data quality control. You could then end up with a lot of very well designed, but poorly implemented, trials.
