Allocating CHI reviewers, a sequel

Last year I used an analysis of CHI review data to argue that we could save a lot of reviewers’ time on low-quality papers by modifying our review process. With all the current talk of the value of replication, I figured it was worth testing the same procedure on this year’s review data, which Dan Olsen was kind enough to provide.

CHI currently collects three external reviews on every paper before engaging the program committee to make an accept/reject decision. Last year I did an analysis that suggested rejecting papers with two bad reviews (scores below 3) without requiring a third, and showed that this would have saved 469 reviews of papers that should have been rejected, while accidentally rejecting just 6 of the roughly 300 papers that were ultimately accepted. I argued that 6 papers was probably dwarfed by other mistakes the PC makes, so this would be a worthwhile tradeoff.

Because I’m replicating, I don’t need to detail the analysis again; you can find that in last year’s post. I explored the following procedure: send each paper to two reviewers and, if the scores are too low, reject it without further consideration. Otherwise, send it to a third reviewer and then on to the program committee for a decision. Varying the definition of “too low” provides a tradeoff between false positives (extra reviews for papers that ultimately get rejected) and false negatives (accidental early rejection of papers that would have been accepted). Looking at last year’s data, a pretty appealing threshold was to reject any paper whose two scores were both lower than 3. This is a natural rule because 3 is the “neutral” score in CHI reviews; two scores below it signify that both reviewers recommend rejection. It was also a good threshold because it saved 469 reviews while creating only 6 accidental rejections.
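
To make the rule concrete, here is a minimal sketch of it in Python. The function name, score encoding, and threshold parameter are mine for illustration; nothing here comes from the actual review system.

```python
def early_decision(score1, score2, threshold=3):
    """Apply the two-review rule: reject outright if both first-round scores
    fall below the threshold, otherwise request the third review."""
    if score1 < threshold and score2 < threshold:
        return "early reject"            # both reviewers recommend rejection
    return "request third review"        # proceed to a third review and the PC

# A paper scored (2.5, 2) would be rejected after two reviews,
# while (2.5, 3) would still get a third.
print(early_decision(2.5, 2))   # early reject
print(early_decision(2.5, 3))   # request third review
```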

So, did the results replicate this year? Applying exactly the same procedure, we get a remarkable level of agreement. Of the roughly 1567 papers submitted, there were 369 acceptances and 1198 rejections. Using the “two below 3” rule, we skip 491 reviews of low-rated papers, while accidentally rejecting 4.7 papers (see last year’s post, or the discussion below, to understand why we get a fraction). An almost identical outcome, just slightly better.

Some caveats: the data I got was slightly messy, with a few papers showing no reviews or fewer than three. I presume these were withdrawn submissions, but haven’t had time to find out. There were fewer than ten of these, not enough to influence the results significantly.

For those who want to explore the data themselves, I’ve prepared a table of the relevant numbers. Score 1 and Score 2 are the two reviewer scores; Accepts counts the number of score pairs (remember, three pairs to a paper) from accepted papers with those two scores, and Rejects counts the pairs from rejected papers with those two scores. All other columns are functions of these two. Ratio is the ratio of accepts to rejects for a given pair of scores. I’ve ordered the rows by this ratio, which is useful for visualizing the optimum false positive/false negative tradeoff. Cum acc and Cum rej total the accept and reject pairs on the given line and the lines above it; dividing these totals by three converts pairs back to papers (averaged over outcomes, as explained below), giving the number of papers from each category that would receive a third review if the given line were used as the cutoff for requesting one. False rejects then shows how many papers that were ultimately accepted would have been rejected in the first round using that cutoff, while Saved reviews counts the ultimately rejected papers for which the cutoff would have skipped the third review.
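
For anyone who wants to recompute the derived columns or try other cutoffs, here is a rough sketch of the calculation in Python. The variable names and the handful of example rows are mine; it assumes, as described above, that the cumulative columns sum pair counts at or above each line and that dividing the remaining pairs by three converts them to expected papers.

```python
# Recompute the derived columns from the raw (Score 1, Score 2, Accepts,
# Rejects) pair counts.  Only a few example rows are shown here; the full
# table appears below.
rows = [
    # (score1, score2, accept_pairs, reject_pairs)
    (5.0, 5.0, 14, 0),
    (4.5, 4.5, 37, 3),
    (5.0, 4.0, 63, 8),
    # ... the remaining score pairs go here ...
]

# Order score pairs from most to least promising (ratio of accepts to rejects).
rows.sort(key=lambda r: r[2] / r[3] if r[3] else float("inf"), reverse=True)

total_acc = sum(r[2] for r in rows)   # all accept pairs (1107 in this year's data)
total_rej = sum(r[3] for r in rows)   # all reject pairs

cum_acc = cum_rej = 0
for s1, s2, acc, rej in rows:
    cum_acc += acc    # accept pairs at or above this line
    cum_rej += rej    # reject pairs at or above this line
    # Pairs below the line get an early reject; dividing by 3 converts pairs
    # (three per paper) back to expected papers.
    false_rejects = (total_acc - cum_acc) / 3
    saved_reviews = (total_rej - cum_rej) / 3
    print(s1, s2, cum_acc, cum_rej, round(false_rejects, 1), round(saved_reviews, 1))
```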

Note that the Accepts and Rejects columns count “review pairs”. Each paper in the data set got three reviews. If we imagine the chair selecting three reviewers but then requesting a review from only two of them (keeping the third in reserve for when the first two reviews are good), then there are three possible outcomes per paper, one for each way of choosing the two reviewers. Counting all three outcomes for every paper yields the average outcome: the expectation should the chair select the two reviewers at random from the pool of three. It is this averaging that produces fractions in the other columns.
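
Here is a small illustration of that averaging in Python; the function name and example scores are made up, but the counting matches the description above.

```python
from itertools import combinations

def early_reject_fraction(scores, threshold=3):
    """Fraction of the three possible reviewer pairs in which both scores
    fall below the threshold, i.e. this paper's chance of an early reject
    if the chair picks two of its three reviewers at random."""
    pairs = list(combinations(scores, 2))          # the three possible pairs
    low = sum(1 for a, b in pairs if a < threshold and b < threshold)
    return low / len(pairs)

# Made-up example scores:
print(early_reject_fraction([2.5, 2.0, 3.5]))  # 1/3: only the (2.5, 2.0) pair rejects
print(early_reject_fraction([2.0, 1.5, 2.5]))  # 1.0: every pair rejects it
```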

Score 1 Score 2 Accepts Rejects Ratio Cum acc Cum rej False rejects Saved reviews
5 5 14 0 NaN 14 0 364.3 1197.3
4.5 4.5 37 3 12.33 51 3 352.0 1196.3
5 4 63 8 7.87 114 11 331.0 1193.7
5 4.5 41 6 6.83 155 17 317.3 1191.7
4.5 4 113 21 5.38 268 38 279.7 1184.7
4 4 125 31 4.03 393 69 238.0 1174.3
4.5 3.5 65 20 3.25 458 89 216.3 1167.7
5 3 22 8 2.75 480 97 209.0 1165.0
5 3.5 19 8 2.37 499 105 202.7 1162.3
4 3.5 129 68 1.90 628 173 159.7 1139.7
4.5 3 32 23 1.39 660 196 149.0 1132.0
4 3 70 81 0.86 730 277 125.7 1105.0
3.5 3.5 52 62 0.84 782 339 108.3 1084.3
4.5 2.5 20 29 0.69 802 368 101.7 1074.7
5 1 2 3 0.67 804 371 101.0 1073.7
5 2.5 9 17 0.53 813 388 98.0 1068.0
4 2.5 43 95 0.45 856 483 83.7 1036.3
4.5 1.5 8 18 0.44 864 501 81.0 1030.3
4.5 2 11 25 0.44 875 526 77.3 1022.0
4 2 39 95 0.41 914 621 64.3 990.3
3.5 3 48 117 0.41 962 738 48.3 951.3
5 2 6 20 0.30 968 758 46.3 944.7
5 1.5 2 8 0.25 970 766 45.7 942.0
3 3 14 66 0.21 984 832 41.0 920.0
4 1 5 28 0.18 989 860 39.3 910.7
3.5 2.5 30 172 0.17 1019 1032 29.3 853.3
4 1.5 6 36 0.17 1025 1068 27.3 841.3
4.5 1 2 13 0.15 1027 1081 26.7 837.0
3.5 2 24 199 0.12 1051 1280 18.7 770.7
3.5 1.5 6 68 0.09 1057 1348 16.7 748.0
3 2.5 15 188 0.08 1072 1536 11.7 685.3
3 1 3 52 0.06 1075 1588 10.7 668.0
3.5 1 3 55 0.05 1078 1643 9.7 649.7
3 2 10 214 0.05 1088 1857 6.3 578.3
3 1.5 2 107 0.02 1090 1964 5.7 542.7
2.5 2.5 3 155 0.02 1093 2119 4.7 491.0
2 2 4 223 0.02 1097 2342 3.3 416.7
2 1.5 3 227 0.01 1100 2569 2.3 341.0
2.5 2 4 331 0.01 1104 2900 1.0 230.7
2.5 1 1 83 0.01 1105 2983 0.7 203.0
1.5 1 1 144 0.01 1106 3127 0.3 155.0
2 1 1 157 0.01 1107 3284 0.0 102.7
1 1 0 94 0.00 1107 3378 0.0 71.3
1.5 1.5 0 76 0.00 1107 3454 0.0 46.0
2.5 1.5 0 138 0.00 1107 3592 0.0 0.0

 

4 Responses to “Allocating CHI reviewers, a sequel”

  • Interesting.
    I’m not sure what the CHI procedure was this year; apparently it was streamlined compared to past years. For CSCW, more reviews could be saved, while still giving all papers three reviews, by not assigning a fourth reviewer/metareviewer to papers whose three reviews were all under a 4. For example, if every paper had two external reviewers and one AC assigned to do a review, and was rejected with no further comment if none of them gave it a 4 or 5, and the same data pattern were found, almost 950 papers would have ended up with only three reviews. If in fact all papers ended up with three externals and an AC, it would save 950 or more, and our data indicated that for CSCW no accidental kills would occur. How many reviews/metareviews were there in total this year?

    As to the papers with fewer than 3 reviews, they could have been “desk rejects” ruled out of scope by the chairs prior to reviewing or after one reviewer noted they were entirely out of scope or had some other fatal flaw. 1% of the CSCW 2012 submissions were in this category.

  • Paul Resnick says:

    Are you sure that your dataset reflects the original review scores, and not scores that were revised after discussion or rebuttals? Revised scores are more likely to be compatible with final decisions.

  • David Karger says:

    The data from year 1 of the analysis is perfect: an archival copy of the scores entered just before reviewers began discussion. In the second year, I missed the beginning of discussion by 3 days; some eager reviewers may have begun discussing by then, but I doubt that many had. It’s impossible to know, and I’ll be more careful to request the data in time next year.

  • [...] can make use of data to improve the conference.  I’ve already analyzed historical data that demonstrates that we can substantially reduce reviewer workload.  We’ve also created a way you can use [...]