A short analysis of ICLR 2020 reviews

Tags: research


With ICLR 2020, the International Conference on Learning Representations, being already well into its rebuttal period, I wanted to take a look at all the reviews and try to spot interesting patterns. This is a write-up of some of the (preliminary) results.

Getting and preparing the data

Thanks to the great folks of OpenReview, who provide an easy-to-use Python API, getting all reviews is a walk in the park (skip to the end of the article to see the code). Storing all reviews in JSON format took Michael and me only a few minutes. We did not change anything in the raw data except for a simple mapping of experience levels:

Experience  Meaning
0           I do not know much about this area.
1           I have read many papers in this area.
2           I have published one or two papers in this area.
3           I have published in this field for several years.
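As a rough illustration of this mapping step, here is a minimal sketch. It assumes each review is a plain dictionary with an `experience` field holding the text above; that field name and the helper `encode_experience` are my own placeholders and not necessarily what the raw OpenReview data or the repository uses.

```python
# Mapping from the experience statements to the numerical levels in the table.
EXPERIENCE_LEVELS = {
    "I do not know much about this area.": 0,
    "I have read many papers in this area.": 1,
    "I have published one or two papers in this area.": 2,
    "I have published in this field for several years.": 3,
}


def encode_experience(review):
    """Return a copy of `review` with the experience text replaced by its level.

    The "experience" key is an assumed field name, used for illustration only.
    """
    encoded = dict(review)
    encoded["experience"] = EXPERIENCE_LEVELS[review["experience"].strip()]
    return encoded


example = {"experience": "I have read many papers in this area."}
print(encode_experience(example)["experience"])
```

Working on a copy keeps the raw data untouched, which matches the idea of changing nothing except this one field.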

For all other fields, we used the respective numerical scale. For the rating, there are only the following options this year:

Rating  Meaning
1       Reject
3       Weak Reject
6       Weak Accept
8       Accept

In the interest of readability, the following plots only contain these numerical labels.

How long are the reviews?

Let us look at some distributions now. First, the obligatory histogram of word counts per review. The median number of words per review is 338, while the mean is 395.5. As expected, there is a tail of very long reviews, but in general, ICLR 2020 reviews appear to be rather terse.

A histogram of word counts for all reviews
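For reference, summary statistics of this kind can be computed in a few lines. The sketch below assumes that a "word" is simply a whitespace-separated token; the actual analysis may tokenise differently, and the toy reviews are made up for illustration.

```python
import statistics


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def summarise_lengths(texts):
    """Return the median and mean word count over a collection of review texts."""
    counts = [word_count(text) for text in texts]
    return {
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
    }


# Made-up reviews, only to show the call.
toy_reviews = [
    "Clear paper with solid experiments.",
    "The method is interesting but the evaluation is too limited to be convincing.",
    "Reject.",
]
print(summarise_lengths(toy_reviews))
```

A mean that sits well above the median, as in the real data, is the usual sign of a long right tail.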

A closer inspection of the very short reviews shows that some of them deal with desk rejections, so their length is not too surprising. Are there maybe differences in word count that depend on the rating of a paper? To visualise this, here is a set of boxplots, partitioned by the final rating of a paper.

A set of boxplots of word counts, partitioned by rating

This is mildly surprising to me: I would have expected strongly opinionated reviews (ratings 1 and 8) to be longer in order to justify their decision. Yet the plot shows that this is only true for reviews that reject a paper, which also exhibit more outliers in terms of their length. Curiously, reviews that fully endorse a paper have the lowest mean:

Rating  Mean number of words
1       424.70
3       420.65
6       352.41
8       340.28
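A table like this boils down to a group-by followed by a mean. Here is a small standard-library sketch; the dictionary keys `rating` and `text` are assumed field names, and the toy data is invented.

```python
from collections import defaultdict
from statistics import mean


def mean_words_by(reviews, key):
    """Group reviews by `key` and return the mean word count of each group."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(len(review["text"].split()))
    return {value: mean(counts) for value, counts in sorted(groups.items())}


# Invented reviews; only the structure matters here.
toy_reviews = [
    {"rating": 1, "text": "a b c d"},
    {"rating": 1, "text": "a b"},
    {"rating": 8, "text": "a"},
]
print(mean_words_by(toy_reviews, "rating"))
```

Passing the grouping key as an argument means the same helper also works for the experience level.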

What about the experience assessment, though? Do experienced reviewers write longer reviews? As it turns out, this is not exactly true, as the following plot shows:

A set of boxplots of word counts, partitioned by reviewer experience

There is a clear difference in the mean review length: inexperienced reviewers tend to write shorter reviews, most likely because they are (or feel) unable to judge the content of the paper accurately. The jump between the means is quite sizeable:

Experience  Mean number of words
0           313.94
1           378.28
2           429.63
3           425.14
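To put the jump into numbers, the relative increase over level 0 follows directly from the table. A quick sketch, with the means copied from above and a helper name (`relative_increase`) that is mine rather than taken from the repository:

```python
# Mean word counts per experience level, copied from the table above.
mean_words = {0: 313.94, 1: 378.28, 2: 429.63, 3: 425.14}


def relative_increase(baseline, value):
    """Return how much larger `value` is than `baseline`, in percent."""
    return (value - baseline) / baseline * 100


for level in (1, 2, 3):
    increase = relative_increase(mean_words[0], mean_words[level])
    print(f"level {level} vs. level 0: {increase:+.1f}%")
```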

Reviewers with experience level 2 [1] write reviews that are on average 37% longer than those of reviewers with experience level 0 [2]. To dig deeper, we need to partition by experience level and by rating. Here are the resulting boxplots:

A set of boxplots of word counts, partitioned by reviewer experience and paper rating
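The numbers behind each box can be sketched as follows: partition the reviews by the pair (experience, rating) and compute the quartiles of the word counts in every cell. Field names and toy data are assumptions, and `statistics.quantiles` with its default method is only one of several quartile conventions, so the values need not match a plotting library exactly.

```python
from collections import defaultdict
from statistics import quantiles


def box_stats(reviews):
    """Return quartiles of the word count for every (experience, rating) cell."""
    cells = defaultdict(list)
    for review in reviews:
        cell = (review["experience"], review["rating"])
        cells[cell].append(len(review["text"].split()))

    stats = {}
    for cell, counts in sorted(cells.items()):
        if len(counts) < 2:
            # quantiles() needs at least two data points; skip sparse cells.
            continue
        q1, median, q3 = quantiles(counts, n=4)
        stats[cell] = {"q1": q1, "median": median, "q3": q3, "n": len(counts)}
    return stats


# Invented reviews; only the structure matters here.
toy_reviews = [
    {"experience": 0, "rating": 1, "text": "a"},
    {"experience": 0, "rating": 1, "text": "a b"},
    {"experience": 0, "rating": 1, "text": "a b c"},
    {"experience": 3, "rating": 8, "text": "a b c d"},
]
print(box_stats(toy_reviews))
```

Keeping the cell size `n` next to the quartiles is useful because some combinations, such as inexperienced reviewers giving a full accept, may contain only a handful of reviews.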

This yields some interesting patterns:

  • Inexperienced reviewers write longer reviews when rejecting a paper. This strikes me as a good thing; if someone with no knowledge about my research area recommends rejecting my paper, a good and long justification is appreciated.

  • Weak rejects, with a rating of 3, are on average the longest reviews. I cannot check this assumption easily [3], but I assume that these reviews also discuss steps that authors could take in order to ‘sway’ the rating into a more favourable one.

  • Again, accepts, with a rating of 8, are among the shortest reviews. Maybe reviewers feel that they do not have to justify the acceptance of a paper as much as its rejection? Alternatively, maybe the papers that receive such ratings are of such stellar quality that there is nothing left to improve? I doubt that this is the case, but analysing this further would be interesting—I will keep that idea for a future blog post, though.
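The key-phrase check mentioned for weak rejects could start from something as simple as the sketch below. The phrase list is entirely hypothetical and would need careful curation; the field names `rating` and `text` are assumptions as well.

```python
# Hypothetical phrases that might signal suggestions for improvement.
SUGGESTION_PHRASES = (
    "could be improved",
    "i suggest",
    "i would raise my score",
    "willing to increase",
)


def contains_suggestion(text, phrases=SUGGESTION_PHRASES):
    """Return True if any phrase occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def suggestion_share(reviews, rating):
    """Return the fraction of reviews with `rating` that contain a phrase."""
    selected = [review for review in reviews if review["rating"] == rating]
    if not selected:
        return 0.0
    hits = sum(contains_suggestion(review["text"]) for review in selected)
    return hits / len(selected)


# Invented reviews; only the structure matters here.
toy_reviews = [
    {"rating": 3, "text": "The paper could be improved by adding baselines."},
    {"rating": 3, "text": "The contribution is unclear."},
    {"rating": 8, "text": "Excellent work."},
]
print(suggestion_share(toy_reviews, rating=3))
```

Plain substring matching will miss paraphrases and produce false positives, so this is at best a first filter before any proper NLP treatment.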

How tough are the reviewers?

As a last analysis, let us compare the ‘toughness’ of reviewers. Does the experience level shift the final rating of a reviewer? This is just a matter of counting across categories:

A set of bar plots, partitioned by reviewer experience

As we can see, inexperienced reviewers tend not to be strongly opinionated about a paper; most of their reviews (about 80%) are either weak rejects or weak accepts. For the remaining experience levels, the story is different: here, most reviews are weak rejects (for experience levels 1–3), followed by weak accepts (for experience levels 1 and 2). Interestingly, the second-most common rating for highly experienced reviewers is reject. It appears that veteran reviewers are relatively tough in their final verdict. Moreover, with only 6.8% of their reviews being rated as accept, in contrast to 7.5% (level 2), 9.9% (level 1), and 10.0% (level 0), getting a veteran reviewer appears to slightly decrease your chances of getting your paper accepted.
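The counting itself can be sketched with `collections.Counter`: count every (experience, rating) pair and normalise by the number of reviews per experience level, so that each level sums to 100%. Field names and toy data are again assumptions.

```python
from collections import Counter


def rating_shares(reviews):
    """Return the percentage of each rating within every experience level."""
    pair_counts = Counter((r["experience"], r["rating"]) for r in reviews)
    level_totals = Counter(r["experience"] for r in reviews)
    return {
        (experience, rating): 100 * count / level_totals[experience]
        for (experience, rating), count in sorted(pair_counts.items())
    }


# Invented reviews; only the structure matters here.
toy_reviews = [
    {"experience": 0, "rating": 3},
    {"experience": 0, "rating": 6},
    {"experience": 0, "rating": 6},
    {"experience": 0, "rating": 8},
    {"experience": 3, "rating": 1},
]
for (experience, rating), share in rating_shares(toy_reviews).items():
    print(f"experience {experience}, rating {rating}: {share:.1f}%")
```

Normalising per experience level rather than over all reviews is what makes the groups comparable, since the levels are unlikely to contain the same number of reviews.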

I have to admit that some of these results are surprising to me; I always thought that after a certain level of experience, the rating algorithm applied by reviewers is essentially the same. This is definitely not the case. Of course, such an analysis suffers from inevitable caveats, the most glaring one being the fact that reviewers themselves decide on their level of competence…

Code, data, and coda

Everything (code and data for 2020 as well as 2019) is available on GitHub. It is clear that this analysis only scratched the surface. An interesting direction would be the integration of NLP techniques to go down to the level of individual reviews. In addition, the extraction code and the analysis scripts are still somewhat unpolished. I would love to make this into a repository that contains all reviews from the past ICLR conferences because I am convinced that we, i.e. the machine learning community, should closely watch and analyse our review process. Everyone complains about peer review for their own reasons, but collections such as this one make it possible to investigate certain issues.

Ideally, I would like to close this article with some recommendations for improving the review process. However, I feel I am not in the position to ‘enact’ these changes in our community. If you, my dear reader, happen to be among the ‘movers and shakers’ of the machine learning world, I hope this article gave you some food for thought.

Until next time; may your reviews always be fair!

Acknowledgements: This article was inspired by the scripts and plots of Shao-Hua Sun, whose analysis covers some aspects of this article but focuses more on review scores per paper. Check out the repository for a great overview and ranking of all papers!

  1. ‘I have published one or two papers in this area.’

  2. ‘I do not know much about this area.’

  3. It would require going through all reviews with that rating and checking for the existence of certain key phrases. If you are up for the task, this sounds like an interesting NLP paper.