Last year I used an analysis of CHI review data to argue that we could save a lot of reviewers’ time on low-quality papers by modifying our review process. With all the current talk of the value of replication, I figured it was worth testing the same procedure with this year’s review data, which Dan Olsen was kind enough to provide.
CHI currently collects three external reviews on every paper before engaging the program committee to make an accept/reject decision. Last year I did an analysis that suggested rejecting papers with two bad reviews (scores below 3) without requiring a third, and showed that this would have saved 469 reviews (of papers that should have been rejected) while accidentally rejecting just 6 of about 300 papers. I argued that the loss of 6 papers was probably dwarfed by other mistakes the PC makes, so this would be a worthwhile tradeoff.
Because I’m replicating, I don’t need to detail the analysis again; you can find that in last year’s post. I explored the following procedure: send each paper to two reviewers and, if the scores are too low, reject it without further consideration. Otherwise, send it to a third reviewer and then on to the program committee for decision. Varying the definition of “too low” provides a tradeoff between false positives (extra reviews for papers that ultimately get rejected) and false negatives (accidental early rejection of papers that would have been accepted). Looking at last year’s data, a pretty appealing threshold was to reject any paper whose two scores were both lower than 3. This is a natural rule because 3 is the “neutral” score in CHI reviews; both reviews falling below it signifies that both reviewers recommend rejection. It was also a good threshold because it saved 469 reviews while creating only 6 accidental rejections.
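The rule itself is simple enough to state in a few lines of code. Here's a minimal sketch (the function name and score representation are my own, not anything from the actual review system):

```python
def early_reject(first_two_scores, threshold=3):
    """Reject without a third review iff both initial scores fall
    below the threshold (3 is the "neutral" score in CHI reviews)."""
    return all(score < threshold for score in first_two_scores)

print(early_reject([2.0, 2.5]))  # both below 3 -> True: reject early
print(early_reject([2.0, 3.5]))  # one at/above 3 -> False: get a third review
```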
So, did the results replicate this year? Applying exactly the same procedure, we get a remarkable level of agreement. Of the 1567 papers submitted, there were 369 acceptances and 1198 rejections. Using the “two below 3” rule, we skip 491 reviews of low-rated papers, while accidentally rejecting 4.7 papers (see last year’s post, or the discussion below, to understand why we get a fraction). An almost identical outcome, just slightly better.
Some caveats: the data I got was slightly messy, with a few papers showing no reviews or fewer than three. I presume these were withdrawn submissions, but haven’t had time to find out. There were fewer than ten of these, not enough to influence the results significantly.
For those who want to explore the data themselves, I’ve prepared a table of the relevant numbers. Score 1 and Score 2 are the reviewer scores; Accepts counts the number of pairs (remember, three pairs to a paper) from accepted papers that got those two scores, Rejects the number of pairs from rejected papers that got those two scores. All other columns are functions of these two. Ratio measures the ratio of accepts to rejects for a given pair of scores. I’ve ordered the rows by this ratio, which is useful for visualizing the optimum false positive/false negative tradeoff. Cum acc and Cum rej total the accept and reject pairs in the lines above, then divide by three so we can count papers (averaged over outcomes) instead of counting pairs. These cumulative totals count the number of accepts and rejects above the current line, thus telling you the number of papers from each category that would be given a third review if you used the given line as the threshold for deciding on that third review. False rejects then shows how many papers that were ultimately accepted would have been rejected in the first round using the given threshold, while saved reviews counts the number of ultimately rejected papers for which the threshold would have led to skipping the third review.
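To make the column definitions concrete, here's a sketch of how the derived columns could be computed from the pair counts, under my reading of the definitions above. The row values are made-up illustrative numbers, not the actual data:

```python
rows = [
    # (score1, score2, accept_pairs, reject_pairs) -- illustrative only
    (5.0, 4.5, 30, 2),
    (4.0, 3.5, 18, 9),
    (2.5, 2.0, 3, 60),
]

# Order rows by accept/reject ratio, highest first, guarding against
# rows with zero reject pairs.
rows.sort(key=lambda r: r[2] / r[3] if r[3] else float("inf"), reverse=True)

total_acc = sum(r[2] for r in rows) / 3  # divide by 3: pairs -> papers
total_rej = sum(r[3] for r in rows) / 3

table = []
cum_acc = cum_rej = 0.0
for s1, s2, acc, rej in rows:
    cum_acc += acc / 3  # accepted papers down to this line (get review 3)
    cum_rej += rej / 3  # rejected papers down to this line (get review 3)
    false_rejects = total_acc - cum_acc  # accepted papers rejected early
    saved_reviews = total_rej - cum_rej  # rejected papers skipping review 3
    table.append((s1, s2, round(false_rejects, 2), round(saved_reviews, 2)))

for line in table:
    print(line)
```

With the threshold set at the last line, every paper gets a third review, so both false rejects and saved reviews fall to zero; moving the threshold up the table trades saved reviews for false rejects.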
Note that the accepts and rejects columns are counting “review pairs”. Each paper in the data set got three reviews. If we imagine the chair selecting three reviewers but then requesting a review from only two of them (keeping the third in reserve for when the first reviews are good), then there are three possible outcomes, one per pair of reviewers. Averaging over all three outcomes for every paper yields the expected outcome should the chair select their two reviewers at random from the pool of three. It is this averaging that produces fractions in the other columns.
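In code, the averaging looks like this: enumerate the three possible reviewer pairs for a paper and ask what fraction of them would trigger an early reject. (The function and example scores are mine, for illustration.)

```python
from itertools import combinations

def early_reject_fraction(scores, threshold=3):
    """Fraction of the three possible first-two-reviewer pairs in which
    both scores fall below the threshold, i.e. the chance of an early
    reject if the chair picks two reviewers at random from the three."""
    pairs = list(combinations(scores, 2))  # 3 reviews -> 3 pairs
    hits = sum(1 for a, b in pairs if a < threshold and b < threshold)
    return hits / len(pairs)

# One of the three pairs (2.0, 2.5) has both scores below 3, so this
# paper contributes a fractional 1/3 to the early-reject count.
print(early_reject_fraction([2.0, 2.5, 4.0]))
```

Summing this fraction over accepted papers gives fractional totals like the 4.7 accidental rejections above.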
| Score 1 | Score 2 | Accepts | Rejects | Ratio | Cum acc | Cum rej | False rejects | Saved reviews |
|---------|---------|---------|---------|-------|---------|---------|---------------|---------------|