Allocating CHI Reviewers, 2014 Edition

As is now an annual tradition, I’ve performed my analysis of the allocation of reviewers for this year’s CHI conference.  The data from the CHI review process suggests that we can reduce the number of reviewers per paper (and thus reduce the workload on our community) without significantly affecting the outcome.  At present, every paper gets three reviews.  As in the past two years, I ask what would happen if we initially used only two reviewers per paper, then sent for a third review only when necessary (and what the right “necessary” criterion should be).

I used exactly the same methodology as in the past two years, which you can read about here and here.  This year I had accurate data thanks to PC chair Tovi Grossman, who took a snapshot of reviewing scores just before beginning the “discussion” phase of the review process (when reviewers can see each other’s comments and may change their review scores).  A total of 1542 papers were reviewed, and ultimately 454 were accepted.

The results are extremely consistent with the past two years.  In particular, if, after getting two reviews, we immediately reject any paper whose two review scores are both below 3 (the neutral score), then in expectation we skip 641 unnecessary reviews while rejecting only 10 of the 454 papers that “should have been” accepted.  This represents a savings of 641/5809=11% of the total number of reviews, with a “noise” loss of only 10/454=2% of the accepted papers.  I believe this 2% is dwarfed by other errors (review inaccuracies), happens primarily to papers on the borderline of acceptance, and is worth it if we can save workload for the entire community.  To make the point stronger, I have confirmed that 2/3 of the skipped reviews had scores below 3—not only are we saving reviewer work, we are primarily saving reviewers work on papers they don’t like!
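
Concretely, the rule being simulated is trivial in code (a minimal sketch; the function name is mine, and 3 is the neutral score on the review scale):

```python
def needs_third_review(score1, score2, neutral=3.0):
    """Solicit a third review unless both initial scores fall below
    the neutral score; otherwise the paper is rejected after two reviews."""
    return score1 >= neutral or score2 >= neutral
```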

The Details

To assess my proposal, I use exactly the same methodology as last year.  I analyze the process of choosing two of each paper’s reviews at random as the “first” reviews, determine whether those two reviews would trigger a third review, and then bucket the result according to whether the paper was ultimately accepted or rejected.  For example, under a rule such as “don’t bother with a third review if the initial two scores are both less than three”, a paper that received scores of 1, 2, and 4 would be seen as skipping the third review if the pair (1,2) were chosen (probability 1/3), but as receiving a third review if the pair (1,4) or (2,4) were chosen (probability 2/3).  Skipping the third review was beneficial if the paper was ultimately rejected, as it saved a “pointless” review, but damaging if the paper was ultimately accepted, as it would lead to an “accidental” rejection of a paper that should have been accepted.
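
A minimal sketch of that computation in Python (the function name is mine; the rule is the both-below-3 rule from above):

```python
from itertools import combinations

def third_review_probability(scores, threshold=3.0):
    """Probability that a randomly chosen initial pair of this paper's
    reviews triggers a third review, i.e., at least one score of the
    pair is at or above the threshold."""
    pairs = list(combinations(scores, 2))
    triggered = sum(1 for a, b in pairs if a >= threshold or b >= threshold)
    return triggered / len(pairs)

# The worked example from the text: scores 1, 2, and 4.
# Only the pair (1,2) skips the third review.
print(third_review_probability([1, 2, 4]))  # 2/3
```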

Slightly complicating things was the fact that 86 papers received 4 or even 5 reviews (presumably a race condition where multiple reviewers were recruited simultaneously), 300 received only 2 (presumably the third arrived after the deadline), and 33 had only 1 review.  I adjusted the pair-selection probabilities accordingly for the five-, four-, and two-review cases, and dropped the one-review cases as they could not be sensibly analyzed; 33 is too small a number to materially affect the analysis.
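
The same enumeration handles those cases uniformly: with k reviews, each pair simply carries weight 1/C(k,2).  A sketch of the per-paper bookkeeping (the function name and the example scores are mine; I count at most one saved review per paper, matching the table’s 1542 maximum below):

```python
from itertools import combinations

def expected_contribution(scores, accepted, threshold=3.0):
    """Expected (saved reviews, false rejects) contributed by one paper.
    Works uniformly for 2, 3, 4, or 5 reviews, since enumerating all
    pairs weights each one by 1/C(k,2).  One-review papers are dropped
    before this point, as in the post."""
    pairs = list(combinations(scores, 2))
    p_skip = sum(1 for a, b in pairs
                 if a < threshold and b < threshold) / len(pairs)
    return p_skip, (p_skip if accepted else 0.0)

# A hypothetical accepted paper with four reviews: each of the
# C(4,2) = 6 pairs carries weight 1/6, so its one skipping pair
# contributes 1/6 of a saved review and 1/6 of a false reject.
print(expected_contribution([2, 2.5, 4, 3], accepted=True))
```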

Below, the table of outcomes.  For each pair of scores I report the number of papers (in expectation) that would initially get that pair of scores (Score1 and Score2) and were ultimately accepted (Accepts) or rejected (Rejects); Ratio is simply Accepts divided by Rejects, and Cum Acc and Cum Rej are running totals down the table.  Now suppose all papers with initial score-pair at or below a given row of the table were rejected without a third review.  Then all the ultimately-rejected papers at or below this row would be saved from getting a third review, while all the ultimately-accepted papers at or below this row would be inappropriately rejected for lack of their third review.  For a given row, the “Saved Reviews” column counts the number of reviews that would be skipped using that row as a threshold, while the “False Rejects” column counts the number of papers that would be rejected by that threshold but were accepted under the current 3-review scheme.

The table is sorted by the maximum score of the pair, which makes it easy to see what would happen if you rejected papers that didn’t achieve a given maximum score.  So, for example, rejecting any paper that failed to get at least one score of 4 or higher would save 1152 reviews (almost 1/3 of the total) but lead to 113 mistaken rejections (about a quarter of the accepted papers).

Score1 Score2 Accepts Rejects Ratio Cum-Acc Cum-Rej False-Rej Saved-Reviews
5 5 1.67 0.00 NaN 0.00 0.00 454.00 1542.00
5 4.5 9.33 1.67 5.60 1.67 1.67 452.33 1540.33
5 4 13.50 3.50 3.86 11.00 5.17 443.00 1536.83
5 3.5 12.17 5.67 2.15 24.50 10.83 429.50 1531.17
5 3 8.00 3.67 2.18 36.67 14.50 417.33 1527.50
5 2.5 4.00 10.17 0.39 44.67 24.67 409.33 1517.33
5 2 3.50 8.50 0.41 48.67 33.17 405.33 1508.83
5 1.5 4.00 2.00 2.00 52.17 35.17 401.83 1506.83
5 1 0.33 1.33 0.25 56.17 36.50 397.83 1505.50
4.5 4.5 10.50 1.00 10.50 56.50 37.50 397.50 1504.50
4.5 4 33.33 10.33 3.23 67.00 47.83 387.00 1494.17
4.5 3.5 20.17 10.83 1.86 100.33 58.67 353.67 1483.33
4.5 3 12.00 6.17 1.95 120.50 64.83 333.50 1477.17
4.5 2.5 14.00 15.83 0.88 132.50 80.67 321.50 1461.33
4.5 2 9.83 23.17 0.42 146.50 103.83 307.50 1438.17
4.5 1.5 3.33 13.17 0.25 156.33 117.00 297.67 1425.00
4.5 1 2.00 8.17 0.24 159.67 125.17 294.33 1416.83
4 4 36.83 11.67 3.16 161.67 136.83 292.33 1405.17
4 3.5 52.17 29.00 1.80 198.50 165.83 255.50 1376.17
4 3 23.67 31.83 0.74 250.67 197.67 203.33 1344.33
4 2.5 37.00 51.83 0.71 274.33 249.50 179.67 1292.50
4 2 20.50 67.00 0.31 311.33 316.50 142.67 1225.50
4 1.5 6.33 29.67 0.21 331.83 346.17 122.17 1195.83
4 1 2.00 23.50 0.09 338.17 369.67 115.83 1172.33
3.5 3.5 22.67 20.27 1.12 340.17 389.93 113.83 1152.07
3.5 3 20.33 44.33 0.46 362.83 434.27 91.17 1107.73
3.5 2.5 18.83 65.57 0.29 383.17 499.83 70.83 1042.17
3.5 2 12.00 93.83 0.13 402.00 593.67 52.00 948.33
3.5 1.5 5.67 38.93 0.15 414.00 632.60 40.00 909.40
3.5 1 2.67 23.83 0.11 419.67 656.43 34.33 885.57
3 3 4.50 15.33 0.29 422.33 671.77 31.67 870.23
3 2.5 10.00 55.67 0.18 426.83 727.43 27.17 814.57
3 2 6.00 71.00 0.08 436.83 798.43 17.17 743.57
3 1.5 1.00 33.83 0.03 442.83 832.27 11.17 709.73
3 1 0.00 20.00 0.00 443.83 852.27 10.17 689.73
2.5 2.5 2.00 47.93 0.04 443.83 900.20 10.17 641.80
2.5 2 3.50 115.40 0.03 445.83 1015.60 8.17 526.40
2.5 1.5 1.33 59.37 0.02 449.33 1074.97 4.67 467.03
2.5 1 0.67 41.00 0.02 450.67 1115.97 3.33 426.03
2 2 1.67 89.27 0.02 451.33 1205.23 2.67 336.77
2 1.5 0.33 110.93 0.00 453.00 1316.17 1.00 225.83
2 1 0.33 90.50 0.00 453.33 1406.67 0.67 135.33
2 0.5 0.00 0.33 0.00 453.67 1407.00 0.33 135.00
1.5 1.5 0.00 43.77 0.00 453.67 1450.77 0.33 91.23
1.5 1 0.33 57.93 0.01 453.67 1508.70 0.33 33.30
1 1 0.00 32.97 0.00 454.00 1541.67 0.00 0.33
1 0.5 0.00 0.33 0.00 454.00 1542.00 0.00 0.00
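
For anyone auditing the table, the four right-hand columns are simple running sums over the per-row counts.  A sketch that reproduces the arithmetic (conventions inferred from the numbers above: Cum-Acc excludes the current row, while Cum-Rej includes it):

```python
def cumulative_columns(rows, total_accepts=454.0, total_reviews=1542.0):
    """rows: (score1, score2, accepts, rejects) in the table's order.
    Returns each row extended with Cum-Acc, Cum-Rej, False-Rej, and
    Saved-Reviews, reproducing the table's running sums."""
    out, cum_acc, cum_rej = [], 0.0, 0.0
    for s1, s2, acc, rej in rows:
        cum_rej += rej                         # Cum-Rej includes the current row
        out.append((s1, s2, acc, rej,
                    cum_acc,                   # Cum-Acc: accepts strictly above
                    cum_rej,
                    total_accepts - cum_acc,   # False-Rej at this threshold
                    total_reviews - cum_rej))  # Saved-Reviews at this threshold
        cum_acc += acc                         # Cum-Acc excludes the current row
    return out
```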


7 Responses to “Allocating CHI Reviewers, 2014 Edition”

  • Tim Chen says:

    Good analysis, again.

    But the argument here is to propose an early-rejection mechanism to “save a lot of reviewers’ time on low quality papers”, right? Then I am not sure we can call the rejected papers low-quality ones.

    I tend to believe that people always learn from the reviews, so they are not wasted in the end. As a reviewer, I also do not feel my time is wasted whether the reviewed paper is accepted or not (perhaps I will change my mind when I have more than 10 papers to review).

  • David Karger says:

    By definition, the papers being rejected are the ones the reviewers consider low quality (or out of scope, or a few other things). As for learning from reviews, I agree. But in my experience, with weak papers the reviews tend to be almost all the same. So we’re asking for extra work without any benefit.

  • Thanks for doing and sharing this amazing analysis!

    Your proposal sounds great, and I would be very happy to reduce my review workload.

    I totally support thinking this idea through!

    Two questions:

    * Just to be clear, we are talking about 2 instead of 3 external reviewers without counting the AC. So, the number of reviews would be 3 instead of 4, right?

    * What if reviewers discuss the scores? When the initial scores have a high standard deviation, reviewers often tend to change their scores towards the average. Could it be that by removing R3 we change these dynamics significantly?

  • David Karger says:

    To your first point, right; the AC process isn’t changed. In fact, given the AC, we might want to consider raising the threshold for automatically getting the 3rd review, because there is an opportunity for the AC to make an intelligent decision about when it is needed. So, e.g., we might settle for 2 reviews for any paper with scores below 3.5, with the understanding that the AC might ask for a 3rd review if they have doubts about the quick rejection.

    To your second point, note that the data I worked with is scores *prior to the discussion phase*, before they get changed. My analysis shows that it is safe to reject these papers *without reviewer discussion*. Again, if we allow those two reviewers to discuss and change scores before deciding whether a third review is needed, it might improve the process still further.

  • Great! So your analysis is not affected by the changes during the discussion! Thanks for the clarification!

    One additional thought: you said that removing R3 affects 2% of the decisions. This may even be an opportunity, because there won’t be 2% fewer papers, but 2% different papers accepted. Since there is less need to converge on a common decision, more controversial ones might have a better chance.

  • Tovi Grossman says:

    A comment: You say “I believe this 2% is dwarfed by other errors (review inaccuracies)”. Just how many other “errors” are you assuming there to be in the final decisions? How many would there need to be for 2% to be “dwarfed”? I’d be hesitant to adopt a new system that decreases the reliability of the process. I’m sure this year’s 2% of authors would agree :)

  • David Karger says:

    Tovi, sorry for missing this earlier. My own experience in PC meetings is that after dealing with the obvious accepts and rejects, we spend the bulk of our time agonizing over papers right at the boundary, all of which have roughly the same amount of support and opposition. I believe the difference in quality among these papers is generally slight, and that all sorts of random factors (e.g., when we eat lunch) influence which ultimately get in and which don’t. This randomness doesn’t concern me precisely because none of these papers is a clear accept. I don’t expect my proposed change to produce a 2% worse result; rather, I expect it to produce a 2%-perturbed result which is just as good as the one we currently choose.