As is now an annual tradition, I’ve performed my analysis of the allocation of reviewers for this year’s CHI conference. The data from the CHI review process suggests that we can reduce the number of reviewers per paper (and thus reduce the workload on our community) without significantly affecting the outcome. At present, every paper gets three reviews. As in the past two years, I ask what would happen if we initially used only two reviewers per paper, then sent for a third review only when necessary (and what the right “necessary” criterion should be).
I used exactly the same methodology as in the past two years, which you can read about here and here. This year I had accurate data thanks to PC chair Tovi Grossman, who took a snapshot of the review scores just before the start of the “discussion” phase of the review process (when reviewers can see each other’s comments and may change their scores). A total of 1542 papers were reviewed, and ultimately 454 were accepted.
The results are extremely consistent with the past two years. In particular, if after getting two reviews we immediately reject any paper whose two scores are both below 3 (the neutral score), then in expectation we skip 641 unnecessary reviews while rejecting only 10 of the 454 papers that “should have been” accepted. This represents a savings of 641/5809 ≈ 11% of the total number of reviews, at a “noise” cost of only 10/454 ≈ 2% of the accepted papers. I believe this 2% is dwarfed by other errors (review inaccuracies), falls primarily on papers at the borderline of acceptance, and is worth it if we can save workload for the entire community. To make the point stronger, I have confirmed that 2/3 of the skipped reviews themselves scored below 3: not only are we saving reviewers work, we are saving them work primarily on papers they don’t like!
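To make the criterion concrete, here is a minimal sketch of the rule in Python (my own illustration, not code from the review system), assuming CHI’s 1–5 score scale with 3 as the neutral midpoint:

```python
# Minimal sketch of the proposed early-reject rule (illustration only).
NEUTRAL = 3  # the neutral score on CHI's 1-5 scale

def needs_third_review(score1: float, score2: float) -> bool:
    """Request a third review unless both initial scores are below neutral."""
    return max(score1, score2) >= NEUTRAL

assert not needs_third_review(2.0, 2.5)  # both below 3: reject immediately
assert needs_third_review(2.0, 4.0)      # at least one score >= 3: review on
```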
To assess my proposal, I use exactly the same methodology as last year. For each paper I treat two of its reviews, chosen at random, as the “first” reviews, determine whether those two would trigger a third review, and bucket the result according to whether the paper was ultimately accepted or rejected. For example, under a rule such as “don’t bother with a third review if the initial two scores are both below 3”, a paper that received scores of 1, 2, and 4 would skip the third review if the pair (1, 2) were chosen (probability 1/3) but receive a third review if the pair (1, 4) or (2, 4) were chosen (probability 2/3). Skipping the third review is beneficial if the paper was ultimately rejected, since it saves a “pointless” review, but damaging if the paper was ultimately accepted, since it leads to the “accidental” rejection of a paper that should have been accepted.
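A small sketch of this calculation, again my own illustration: for a paper with a known set of final scores, every unordered pair of reviews is equally likely to be the initial pair, so the probability of skipping the third review is simply the fraction of pairs whose maximum score is below 3.

```python
from itertools import combinations

def p_skip_third(scores):
    """Probability that a random initial pair of reviews skips the
    third review under the "both scores below 3" rule."""
    pairs = list(combinations(scores, 2))  # each unordered pair equally likely
    skipped = sum(1 for pair in pairs if max(pair) < 3)
    return skipped / len(pairs)

# The example from the text: a paper scored 1, 2, and 4.
# Only the pair (1, 2) skips the third review.
print(p_skip_third([1, 2, 4]))  # 0.333... = 1/3
```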
Slightly complicating things was the fact that 86 papers received 4 or even 5 reviews (presumably a race condition in which multiple extra reviewers were recruited simultaneously), 300 received only 2 (presumably the third arrived after the deadline), and 33 had only 1 review. I adjusted the pair-selection probabilities accordingly for the five-, four-, and two-review cases, and dropped the 1-review papers, as they could not be sensibly analyzed; 33 is too small a number to materially affect the analysis.
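The exact adjustment isn’t spelled out here, but one plausible reading is that all C(n, 2) possible initial pairs of an n-review paper are treated as equally likely, which is exactly what the p_skip_third sketch above already does. Aggregating over a hypothetical list of (scores, accepted) tuples might then look like:

```python
def tally(papers):
    """Expected saved reviews and false rejects over the corpus.
    `papers` is a hypothetical list of (scores, accepted) tuples;
    reuses p_skip_third from the sketch above."""
    saved = 0.0      # expected third reviews skipped on rejected papers
    false_rej = 0.0  # expected accepted papers rejected early
    for scores, accepted in papers:
        if len(scores) < 2:
            continue  # 1-review papers cannot be analyzed
        p = p_skip_third(scores)
        if accepted:
            false_rej += p
        else:
            saved += p
    return saved, false_rej
```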
Below is the table of outcomes. For each pair of scores I report the number of papers (in expectation) that would initially get that pair of scores (Score1 and Score2) and that were ultimately accepted (Accepts) or rejected (Rejects). Now suppose all papers with an initial score pair at or below a given row of the table were rejected without a third review. Then every ultimately-rejected paper at or below that row would be spared a third review, while every ultimately-accepted paper at or below that row would be inappropriately rejected for lack of its third review. Accordingly, for a given row, the “Saved Reviews” column counts the reviews that would be skipped using that row as the threshold, while the “False Rejects” column counts the papers that would be rejected by that threshold but were accepted under the current three-review scheme.
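For concreteness, here is how such a table could be assembled from the per-paper pair probabilities (a sketch under the same assumptions as above; build_table and its row format are my own invention):

```python
from collections import defaultdict
from itertools import combinations

def build_table(papers):
    """Expected (Accepts, Rejects) per initial score pair, with the
    cumulative columns described above. `papers` is a hypothetical
    list of (scores, accepted) tuples."""
    acc, rej = defaultdict(float), defaultdict(float)
    for scores, accepted in papers:
        if len(scores) < 2:
            continue
        pairs = list(combinations(scores, 2))
        for pair in pairs:  # each pair carries 1/len(pairs) of the paper
            key = tuple(sorted(pair))
            (acc if accepted else rej)[key] += 1.0 / len(pairs)

    # Order rows by maximum score (the table's sort order), then
    # accumulate: rejecting everything at or below a row makes
    # Cum Acc the False Rejects and Cum Rej the Saved Reviews.
    rows, cum_acc, cum_rej = [], 0.0, 0.0
    for key in sorted(set(acc) | set(rej), key=lambda k: (max(k), min(k))):
        cum_acc += acc[key]
        cum_rej += rej[key]
        rows.append((*key, acc[key], rej[key], cum_acc, cum_rej))
    return rows
```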
The table is sorted by the maximum score of the pair, which makes it easy to see what would happen if we rejected every paper that failed to reach a given maximum score. So, for example, rejecting any paper with no score of 4 or higher would save 1152 reviews (about a fifth of the total) but lead to 113 mistaken rejections (about a quarter of the accepted papers).
| Score1 | Score2 | Accepts | Rejects | Ratio | Cum Acc | Cum Rej | False Rej | Saved Reviews |
|--------|--------|---------|---------|-------|---------|---------|-----------|---------------|