CHI: Do we really need three reviewers for every paper?

While I’m here at CHI, I thought I’d post a bit of analysis I did of the CHI reviewing statistics. I was exploring the question of whether we really need three reviewers on every CHI submission.

At the PC meeting, there was a lot of discussion of the huge amount of work that was going into reviewing for CHI. CHI accepted about 300 of 1350 submissions. As a first-time PC member, I was particularly bothered by the work spent on rejected papers. I felt pretty guilty wasting three separate reviewers’ time on a paper I was sure would be rejected. If we could cut down to two reviewers for such papers, we’d save a lot of work.

In particular, I wanted to explore the following model. The PC member responsible for a paper begins by soliciting three reviewers as usual. However, they initially send the paper to only two of them, chosen at random. Based on the scores from those two reviewers, some “clear reject” papers are dropped. The remainder are sent out to the third reviewer, and a decision is made using the three resulting scores.
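To make the mechanics concrete, here is a minimal Python sketch of that two-phase process for a single paper. The names (`two_phase_review`, `get_score`, `is_clear_reject`) and the threshold rule in the example are hypothetical, purely for illustration; nothing like this exists in the actual submission system.

```python
import random

def two_phase_review(reviewers, get_score, is_clear_reject):
    """Sketch of the proposed two-phase process for one paper.

    reviewers: the three solicited reviewers.
    get_score(r): obtains reviewer r's preliminary score (1.0 to 5.0).
    is_clear_reject(s1, s2): the "clear reject" rule on the first two scores.
    Returns the list of scores the decision would be based on.
    """
    # Phase 1: send the paper to two of the three reviewers, chosen at random.
    first_two = random.sample(reviewers, 2)
    scores = [get_score(r) for r in first_two]

    # Clear rejects are decided on two reviews; the third is never requested.
    if is_clear_reject(*scores):
        return scores

    # Phase 2: solicit the remaining review and decide on all three scores.
    (third,) = [r for r in reviewers if r not in first_two]
    return scores + [get_score(third)]
```

For example, `two_phase_review(["R1", "R2", "R3"], prelim_scores.get, lambda a, b: a < 3 and b < 3)` would implement the “both scores below three” rule analyzed below.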

I analyzed the procedure using this year’s CHI reviews. I took the preliminary scores reviewers submitted for papers, prior to any discussion. Each paper had three reviewers; if we chose two at random, we’d get one of three possible pairs. Thus, for the analysis, I split each paper into three “review pairs,” each counting for 1/3 of a paper, which is the right way to analyze outcomes in expectation. I grouped these pairs according to the two scores they represent (ignoring expertise ratings). For each pair of scores, I counted how many (thirds of) papers with those scores ended up accepted, and how many ended up rejected. I then looked for (presumably low) pairs of scores associated with few or no papers that were ultimately accepted. If we reject papers that get those score pairs, we won’t mistakenly reject many papers.
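Here is a rough sketch of that tally in Python, assuming each paper is given as its three preliminary scores plus its eventual accept/reject outcome (the input format and the name `pair_table` are my own, for illustration):

```python
from collections import defaultdict
from itertools import combinations

def pair_table(papers):
    """Tally accept/reject outcomes by unordered score pair.

    papers: iterable of (scores, accepted), where scores is the list of
        a paper's three preliminary reviewer scores and accepted is a bool.
    Returns {(high, low): [accept_pairs, reject_pairs]}, with counts in
    review pairs; three per paper, so divide by 3 to get papers in expectation.
    """
    table = defaultdict(lambda: [0, 0])
    for scores, accepted in papers:
        # Each paper contributes all three of its possible reviewer pairs.
        for pair in combinations(scores, 2):
            key = tuple(sorted(pair, reverse=True))  # unordered: (higher, lower)
            table[key][0 if accepted else 1] += 1
    return table
```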

Here’s a specific result for those who don’t want to read more: if, after getting the two scores, we had rejected any paper with both scores below three, we would have skipped sending out 469 papers (roughly half of the eventual rejects) for a third review, in expectation. Meanwhile, we would have rejected 6 papers that were actually accepted under the current scheme. Alternatively, if we rejected all papers with both scores below 2.5, or scoring a 3 and less than 1.5, we would save 250 reviews, still a noticeable improvement, without sacrificing any acceptances at all!
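Given the tallies from the `pair_table` sketch above, any candidate clear-reject rule can be scored the same way; roughly like this (`evaluate_rule` is again a name of my own invention):

```python
def evaluate_rule(table, reject_rule):
    """Estimate, in expectation, what a clear-reject rule costs and saves.

    table: {(high, low): [accept_pairs, reject_pairs]} as built above.
    reject_rule(high, low): True if this score pair is a clear reject.
    Returns (accepted papers lost, third reviews skipped).
    """
    lost = saved = 0.0
    for (high, low), (acc, rej) in table.items():
        if reject_rule(high, low):
            lost += acc / 3    # papers we'd drop that were actually accepted
            saved += rej / 3   # rejected papers spared a third review
    return lost, saved
```

On this year’s data, `evaluate_rule(table, lambda hi, lo: hi < 3 and lo < 3)` should reproduce the 6 lost acceptances and 469 saved reviews quoted above.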

So, the clear question is: are we willing to make 6 mistakes (out of the roughly 300 acceptances) in order to save ourselves 469 reviews of generally bad papers? I would say yes. I’m pretty sure that 6 is far fewer than the number of undetected “mistakes” we make through imperfect reviewing (in fact, I’m sure of it, since CHI has rejected more than 6 of my submissions, and all those rejections were obviously mistakes). So I don’t think we’re introducing significantly more error with the scheme I suggested, and we’re certainly saving a lot of effort!

For those who want more numbers, the table below contains all the data I worked with. First come two columns representing a score pair. Then, for each pair, the number of paper “thirds” with that pair of scores that were accepted and rejected. Next I compute the “ratio” of accepts to rejects. If you want to get the most accepts with the fewest rejects (the best cost-benefit), you sort by those ratios and accept (for a third review) the score pairs with the biggest ratios; the table is sorted that way. I then take a cumulative sum to compute, for each row, the cumulative number of (thirds of) accepts and rejects that would result if you accepted all score pairs above that row. Dividing the cumulative accepts by three and subtracting from the number of actual accepts gives the number of false rejects (papers that should be accepted but get rejected under the two-review scheme); dividing the cumulative rejects by three gives the number of false accepts (papers that will ultimately be rejected but still get a third review under this scheme). Finally, subtracting the false accepts from the total number of rejects gives the number of third reviews that would not have to be done under the proposed scheme.
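For completeness, here is a sketch of how those derived columns can be recomputed from the raw pair counts; `derived_columns` and its argument names are hypothetical:

```python
def derived_columns(table, total_accept_pairs, total_reject_pairs):
    """Recompute the table's derived columns from the raw pair counts.

    Rows are sorted by accept/reject ratio, best cost-benefit first;
    every pair above a cutoff row gets a third review, every pair
    below it is rejected on two reviews.
    """
    rows = sorted(
        table.items(),
        key=lambda kv: kv[1][0] / kv[1][1] if kv[1][1] else float("inf"),
        reverse=True,
    )
    cum_acc = cum_rej = 0
    out = []
    for (high, low), (acc, rej) in rows:
        cum_acc += acc
        cum_rej += rej
        false_rej = (total_accept_pairs - cum_acc) / 3  # accepted papers lost
        false_acc = cum_rej / 3  # eventual rejects still getting a third review
        saved = total_reject_pairs / 3 - false_acc      # third reviews avoided
        out.append((high, low, acc, rej, cum_acc, cum_rej,
                    round(false_rej), round(false_acc), round(saved)))
    return out
```

With this year’s totals (903 accept pairs and 3123 reject pairs), this reproduces the false reject, false accept, and reviews saved columns below.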

There are many possible approaches to reducing reviewing load.  The scheme I described is clearly implementable, and has the benefit that we know exactly how it would have performed if we had used it this year.  Of course, there is a chance that I am over-fitting and the results would be different next year.  But these are pretty large samples we’re working with, so I’m moderately confident it generalizes well.

The only other concern regards the pipeline. Clearly, sending papers out for a third review could add to the time lag. But it doesn’t have to: in the first phase, we are only asking for 2/3 as many reviews, so we could shorten the timeline. In the second, “third review” phase, we are asking for half as many third reviews as before, so again we could shorten the timeline. Hopefully this would work out.

I welcome your thoughts.

(Accept, reject, and cumulative counts are in review pairs; false rejects, false accepts, and reviews saved are in papers, in expectation.)

Score 1  Score 2  Accepts  Rejects  Ratio  Cum Acc  Cum Rej  False Rej  False Acc  Reviews Saved
5.0      5.0      10       0        inf    10       0        298        0          1041
5.0      4.5      29       1        29.00  39       1        288        0          1041
4.5      4.5      36       5        7.20   75       6        276        2          1039
5.0      4.0      52       10       5.20   127      16       259        5          1036
4.5      4.0      86       21       4.10   213      37       230        12         1029
4.0      4.0      96       41       2.34   309      78       198        26         1015
5.0      3.0      22       11       2.00   331      89       191        30         1011
4.5      3.5      57       29       1.97   388      118      172        39         1002
5.0      3.5      23       14       1.64   411      132      164        44         997
4.0      3.5      100      73       1.37   511      205      131        68         973
4.0      3.0      67       71       0.94   578      276      108        92         949
4.5      3.0      27       31       0.87   605      307      99         102        939
5.0      2.5      10       12       0.83   615      319      96         106        935
4.5      2.5      22       27       0.81   637      346      89         115        926
5.0      2.0      8        10       0.80   645      356      86         119        922
4.5      2.0      19       33       0.58   664      389      80         130        911
3.5      3.5      32       57       0.56   696      446      69         149        892
4.0      2.5      35       64       0.55   731      510      57         170        871
5.0      1.5      2        4        0.50   733      514      57         171        870
5.0      1.0      2        6        0.33   735      520      56         173        868
4.0      2.0      35       123      0.28   770      643      44         214        827
4.0      1.5      8        32       0.25   778      675      42         225        816
3.5      3.0      27       117      0.23   805      792      33         264        777
4.5      1.5      3        15       0.20   808      807      32         269        772
3.5      2.0      20       146      0.14   828      953      25         318        723
3.5      2.5      18       132      0.14   846      1085     19         362        679
4.0      1.0      3        32       0.09   849      1117     18         372        669
3.0      2.5      13       152      0.09   862      1269     14         423        618
4.5      1.0      1        12       0.08   863      1281     13         427        614
3.5      1.5      3        54       0.06   866      1335     12         445        596
3.0      2.0      10       188      0.05   876      1523     9          508        533
3.0      1.5      3        72       0.04   879      1595     8          532        509
2.5      2.5      5        121      0.04   884      1716     6          572        469
3.0      3.0      2        49       0.04   886      1765     6          588        453
2.0      2.0      6        168      0.04   892      1933     4          644        397
2.5      1.0      2        82       0.02   894      2015     3          672        369
2.5      2.0      6        257      0.02   900      2272     1          757        284
3.5      1.0      1        47       0.02   901      2319     1          773        268
3.0      1.0      1        51       0.02   902      2370     0          790        251
2.0      1.5      1        193      0.01   903      2563     0          854        187
2.5      1.5      0        116      0.00   903      2679     0          893        148
2.0      1.0      0        162      0.00   903      2841     0          947        94
1.5      1.5      0        77       0.00   903      2918     0          973        68
1.5      1.0      0        119      0.00   903      3037     0          1012       29
1.0      1.0      0        86       0.00   903      3123     0          1041       0
