While I’m here at CHI, I thought I’d post a bit of analysis I did of the CHI reviewing statistics. I was exploring the question of whether we really need three reviewers on every CHI submission.
At the PC meeting, there was a lot of discussion of the huge amount of work that was going into reviewing for CHI. CHI accepted about 300 of 1350 submissions. As a first-time PC member, I was particularly bothered by the work spent on rejected papers. I felt pretty guilty wasting three separate reviewers’ time on a paper I was sure would be rejected. If we could cut down to two reviewers for such papers, we’d save a lot of work.
In particular, I wanted to explore the following model. The PC member responsible for a paper begins by soliciting three reviewers as usual. However, they initially send the paper to only two of them, chosen at random. Based on the scores by those two reviewers, some “clear reject” papers are dropped. The remainder are sent out to the third reviewer, and a decision is made using the three resulting scores.
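In code, the two-phase procedure might look something like this. It’s a minimal sketch, not anything we’d actually deploy: the helper names, the representation of reviewers as score-producing functions, and the particular decision rules passed in are all illustrative.

```python
import random

def two_phase_decision(reviewers, paper, clear_reject, final_decision):
    """Sketch of the proposed process. `reviewers` is a list of three
    functions mapping a paper to a score; `clear_reject` and
    `final_decision` are the (pluggable) decision rules."""
    first_two = random.sample(reviewers, 2)      # send to 2 of the 3, at random
    scores = [r(paper) for r in first_two]
    if clear_reject(scores):                     # e.g. both scores below 3
        return "reject", scores                  # third review never requested
    third = next(r for r in reviewers if r not in first_two)
    scores.append(third(paper))                  # solicit the third review
    return final_decision(scores), scores
```

For example, `clear_reject` could be `lambda s: all(x < 3 for x in s)`, matching the threshold discussed below.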
I analyzed the procedure using this year’s CHI reviews. I took the preliminary scores reviewers submitted for papers, prior to any discussion. Each paper had three reviewers; if we chose two at random, we’d get one of three pairs. Thus, for analysis, I split each paper into 3 “review pairs” each counting for 1/3 of a paper, which is the right way to analyze outcomes in expectation. I grouped these pairs according to the two scores they represent (I ignored expertise ratings). For each pair of values, I counted how many (thirds of) papers with those two scores ended up accepted, and how many ended up rejected. I looked for (presumably low) pairs of scores that were associated with few or no papers that were ultimately accepted. If we reject papers that get those pairs, we won’t mistakenly reject many papers.
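The pair-splitting step can be sketched as follows. The input format, a list of `(scores, accepted)` records, is my assumption about how the review data would be organized; the counting itself is exactly as described above, with each of the three possible reviewer pairs contributing a third of a paper.

```python
from itertools import combinations
from collections import defaultdict

def pair_table(papers):
    """papers: list of ([s1, s2, s3], accepted_bool).
    Returns {(lo, hi): [accept_thirds, reject_thirds]}, where each of the
    three reviewer pairs contributes 1/3 of a paper to its score pair."""
    counts = defaultdict(lambda: [0.0, 0.0])
    for scores, accepted in papers:
        for a, b in combinations(scores, 2):   # the 3 possible review pairs
            key = (min(a, b), max(a, b))       # order within a pair is irrelevant
            counts[key][0 if accepted else 1] += 1 / 3
    return dict(counts)
```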
Here’s a specific result for those who don’t want to read more: if, after getting the two scores, we had rejected any paper with both scores below three, we would have skipped sending out 469 papers (roughly half the total) for a third review, in expectation. Meanwhile, we would have rejected 6 papers that actually got accepted under the current scheme. Alternatively, if we rejected all papers with both scores below 2.5, or with one score of 3 and the other below 1.5, we would save 250 third reviews, still a noticeable improvement, without sacrificing any acceptances at all!
So, the clear question is: are we willing to make 6 mistakes (out of the roughly 300 acceptances) in order to save ourselves 469 reviews of generally bad papers? I would say yes. I’m pretty sure that 6 is far fewer than the number of undetected “mistakes” we make through imperfect reviewing (in fact, I’m sure of it, since CHI has rejected more than 6 of my submissions, and all those rejections were obviously mistakes). So I don’t think we’re introducing significantly more error using the scheme I suggested, and we’re certainly saving a lot of effort!
For those who want more numbers, the table below contains all the data I worked with. First come two columns representing a score pair. Then, for each pair, the number of paper “thirds” with that pair of scores that were accepted and rejected. From those I compute a “ratio” of accepts to rejects. If you want to get the most accepts with the fewest rejects (the best cost–benefit), you sort by those ratios and accept (for a third review) the score pairs with the biggest ratios; the table is sorted that way. I then take a cumulative sum to compute, for each row, the cumulative number of (thirds of) accepts and rejects that would result if you accepted all score pairs above that row. Dividing by three and subtracting from the number of actual accepts/rejects gives me the number of false rejects (papers that should be accepted but get rejected under the 2-review scheme) and false accepts (papers that will ultimately be rejected, but still get a third review under this scheme). Finally, subtracting false accepts from the total number of rejects gives me the number of third reviews that would not have to be done under the proposed scheme.
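That bookkeeping is easy to get confused about, so here is a sketch of the arithmetic. The table format (a dict mapping score pairs to raw counts of accepted and rejected paper thirds) and the `cutoff_rank` parameter are my assumptions; the ratio sort, cumulative sums, and divisions by three follow the description above.

```python
def savings(table, cutoff_rank):
    """table: {(s1, s2): (accept_thirds, reject_thirds)}, raw counts of
    paper thirds. Rows are sorted by accept/reject ratio (best first);
    pairs ranked above cutoff_rank get a third review, the rest are
    rejected outright. Returns (false_rejects, false_accepts, saved)."""
    rows = sorted(table.items(),
                  key=lambda kv: kv[1][0] / max(kv[1][1], 1e-9),
                  reverse=True)
    total_acc = sum(a for a, _ in table.values()) / 3   # papers, in expectation
    total_rej = sum(r for _, r in table.values()) / 3
    cum_acc = sum(a for _, (a, r) in rows[:cutoff_rank]) / 3
    cum_rej = sum(r for _, (a, r) in rows[:cutoff_rank]) / 3
    false_rejects = total_acc - cum_acc   # accepted papers we'd have dropped
    false_accepts = cum_rej               # rejected papers still sent out
    saved_reviews = total_rej - false_accepts
    return false_rejects, false_accepts, saved_reviews
```

Sweeping `cutoff_rank` over all the rows traces out the full trade-off curve between mistakes and saved reviews.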
There are many possible approaches to reducing reviewing load. The scheme I described is clearly implementable, and has the benefit that we know exactly how it would have performed if we had used it this year. Of course, there is a chance that I am over-fitting and the results would be different next year. But these are pretty large samples we’re working with, so I’m moderately confident it generalizes well.
The only other concern regards the pipeline. Clearly, sending out for a third review could add to the time lag. But it doesn’t have to: in the first phase, we are only asking for 2/3 as many reviews, so we could shorten the timeline. In the second “third review” phase, we are asking for half as many third reviews as before, so again we could shorten the timeline. Hopefully this would work out.
I welcome your thoughts.
| Score 1 | Score 2 | Accepts