A Brief Analysis of My Review Scores

Tags: research, musings

Published on Sunday, August 8, 2021


With the Twitterverse aflame and slinging mud concerning the latest round of NeurIPS reviews, I wanted to follow the maxim ‘know thyself’ instead and analyse my own behaviour as a reviewer. To this end, I looked at all the review scores I had given in my initial review of a paper (the final scores might be different, but more about that later). I only included reviews from ICML, ICLR, and NeurIPS, putting them all on the same rating system, which employs scores from 1–10. While not all grades can be provided in all conferences—ICLR likes to remove some ‘in-between’ grades in some instances—this should provide a rough overview of my tendencies as a reviewer. Plus, this is not a scientific publication, so a slipshod statistical analysis will not hurt anyone here. Here is the resulting plot in all its gory details:

The horizontal line indicates the (weak) acceptance threshold. There are some interesting observations arising from this sort of visualisation:

My opinions became more pronounced as time progressed. In 2018, my initial scores were mostly weak rejects and weak accepts. Now, being more familiar with ML, I also feel more confident to express a stronger opinion in both directions!
The overall tendency is not to accept too many papers in the initial review. This is consistent with acceptance rates of the conferences. I make no statements about whether this is also good or justified. I should stress that I am always happy to raise my score if authors are demonstrating that they thought about my suggestions, in particular for borderline papers. In several cases, this proved to be crucial for getting a paper accepted.
The year 2020 appears to have slightly lower scores than this year so far. I hope that this is not indicative of my mood in this particular year, but rather an expression of the scoring system that was used during 2020. ICML, for instance, experimented with a new set of scores and tried to condense the usual scale. Moreover, I feel that the overall quality of papers was also slightly lower in 2020. With conferences receiving a record number of submissions as everyone was quarantined, I hypothesise that it was easier to submit work that was not quite finished yet. I wonder whether the official statistics of review score distributions would corroborate this.
The review cycle for 2021 is not over, so the numbers have to be taken with a grain of salt. I would hope that the mean score goes up, though.

Notice that the distributions are not all of the same size—the review load tends to increase over time. Since there are no glaring outliers, I am glad that my performance appears to be relatively consistent.
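For anyone who wants to run a similar tally on their own reviews, here is a minimal sketch of a per-year summary. It is not the script behind my plot: the function name `summarise_scores`, the example scores, and the choice of 6 as the weak-accept threshold on the 1–10 scale are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Assumption: on a 1-10 scale, a score of 6 corresponds to a weak accept.
WEAK_ACCEPT = 6


def summarise_scores(reviews, threshold=WEAK_ACCEPT):
    """Group (year, score) pairs by year and summarise each group."""
    by_year = defaultdict(list)
    for year, score in reviews:
        by_year[year].append(score)

    summary = {}
    for year in sorted(by_year):
        scores = by_year[year]
        summary[year] = {
            "n": len(scores),  # review load in that year
            "mean": mean(scores),
            "median": median(scores),
            # Fraction of initial scores at or above the threshold.
            "share_accept": sum(s >= threshold for s in scores) / len(scores),
        }
    return summary


# Made-up scores for illustration only; these are not my actual reviews.
example_reviews = [
    (2018, 5), (2018, 6), (2018, 5),
    (2019, 4), (2019, 7), (2019, 6), (2019, 3),
]

for year, stats in summarise_scores(example_reviews).items():
    print(year, stats)
```

Reporting the number of reviews alongside the mean and median keeps the differing group sizes visible, which matters when comparing an unfinished year with a complete one.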

To continue this brief foray into my reviewing endeavours, I also performed a quick sentiment analysis of the review contents, using the marvellous TextBlob package for Python. This resulted in distributions of the polarity of my reviews. Polarity takes values in $[-1, 1]$, with $1$ and $-1$ representing highly positive and highly negative sentiments, respectively, whereas $0$ indicates mostly neutral language.
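A rough sketch of how such a polarity analysis could look is given below. It relies on TextBlob's `sentiment.polarity` attribute; everything else, including the helper names `textblob_polarity` and `mean_polarity_by_year` and the assumption that reviews are available as (year, text) pairs, is hypothetical and not necessarily how I set things up.

```python
from statistics import mean


def textblob_polarity(text):
    """Return the polarity in [-1, 1] from TextBlob's default analyser."""
    # Imported lazily so the grouping logic below can be used (and tested)
    # without TextBlob being installed.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def mean_polarity_by_year(reviews, polarity=textblob_polarity):
    """Average the polarity of review texts per year.

    `reviews` is an iterable of (year, text) pairs; `polarity` is any
    function mapping a text to a number, TextBlob by default.
    """
    grouped = {}
    for year, text in reviews:
        grouped.setdefault(year, []).append(polarity(text))
    return {year: mean(values) for year, values in sorted(grouped.items())}
```

Calling `mean_polarity_by_year(my_reviews)` with a list of (year, text) pairs would then give one average per year. Passing the scoring function as a parameter makes it easy to swap in a different sentiment model later.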

I think these plots uncover my growth as a reviewer—I have also run the same analysis for older reviews and find that my average polarity slightly increased over time. Earlier reviews were slightly more guarded and more negative in terms of their language (I sincerely hope that they were not overly critical, but I will never get an answer to this question except by asking authors directly, which is of course not possible).

As you can see from the score distributions above, more positive language does not necessarily lead to more papers being accepted. I think this is an important distinction to make: as I progress in academia, I like to believe that my understanding of and compassion for authors increase as well, with the ultimate outcome that I can offer more precise critiques that nevertheless have an underlying positive tone. At least that is the goal…

Closing this rather self-indulgent post, I think everyone would benefit from running such an analysis on their reviews every once in a while. We should ensure that we do not become vain, bitter, or belittling when it comes to the work of others. Instead, as machine learning venues—and academia in general—become more and more competitive, we should do our best to offer useful guidance to authors. It is all too easy to focus only on the flaws and find a paper to be ‘rejected until proven worthy.’ My nightmare would be to become someone who writes reviews like the Demoman or the Pyro would write them—a perfect analogy for some of the reviewers encountered in modern peer review, aptly described by Matt Might in his tongue-in-cheek article Peer Fortress: The Scientific Battlefield. I therefore strive to approach every review with a fascination for the topic and with the hope that the paper will (eventually) be published. Let’s hope that we can all keep this spirit.

Until next time! May you write the reviews that you like to receive, and may all your papers be accepted with minor revisions.

PS: if you are a reviewer yourself, you might consider reading On Writing Reviews, which captures my thoughts about how to write useful reviews.