The Power of Defaults

Tags: musings, research


I use every ounce of willpower to keep still. One wrong move might wreak havoc. After carefully assessing the situation, I move my mouse over the right option, select it, and press ‘Next’. Crisis narrowly averted, I am now confronting the next set of inane settings, until finally, this installation of Windows 10 is finished. Midway through the process, I ask myself whether the designers of the setup process are aware that the majority of users will probably select the defaults, thus making it easy for Microsoft to collect extra diagnostic data, install additional software, and decrease the user’s privacy even further. Welcome to 2020!

While setting up my dad’s new computer, I had a lot of time to speculate on the choices that designers make when creating an installation script. It dawned on me at some point that we, i.e. researchers that develop a software package in addition to their research, are doing precisely the same thing. We create a software tool for solving a certain problem. It might be an itch that we want to scratch, or it might be software that is related to our research—in the end, we all write some code in some language to produce some kind of value. How often do we think about the dangers of the API that we are exposing, though?

Misleading parameters

I am not talking about the dangers of our programs turning rogue and trying to take over the world[1]. I am talking about us providing potentially misleading parameters to our users. Take scikit-learn as an example[2]. This package, arguably the default for many data science applications, offers a wealth of classifiers, metrics, and utility classes to train models. I want to focus on the LogisticRegression class in particular. Logistic regression is a classifier that extends linear regression to the classification setting: given data and categorical labels, logistic regression trains a model based on the logistic function. With a trained classifier, we can predict the label of ‘unseen’ data points later on. So far, so good: this is the daily bread of many researchers, data scientists, and data analysts. The implementation of logistic regression in scikit-learn is comprehensive and offers numerous additional configuration options to tweak the model.

Dangers lurk around every corner

Its default parameters, however, are dangerous. Specifically, by default, the model is subject to $L_2$ regularisation with C = 1.0, where C is the inverse of the regularisation strength (smaller values mean stronger regularisation). A little background might be required to see just why this might be dangerous. In machine learning[3], regularisation refers to adding constraints on a classifier to reduce the risk of overfitting. Put somewhat differently, regularisation slightly changes the model that is fit in order to improve generalisation performance. This is because, when creating a classifier, we do not want a model to learn the data ‘by heart’, but rather, we want it to perform well on new data points. A classical regularisation technique is $L_2$ regularisation, which adds a penalty, the squared Euclidean norm of the weight vector, to the objective. This makes it harder to fit the training data exactly, thus increasing the bias of the model and reducing overfitting[4].
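
As a rough sketch of what the penalty looks like for logistic regression, assume labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, a weight vector $w$, an intercept $b$, and a penalty weight $\lambda \geq 0$. This notation is introduced here for illustration and is not taken from any particular library. The $L_2$-regularised model then solves

$$
\min_{w,\, b} \; \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(w^\top x_i + b\right)\right)\right) + \lambda \lVert w \rVert_2^2
$$

Setting $\lambda = 0$ recovers plain logistic regression. In scikit-learn, C plays the role of an inverse penalty weight, so a larger C corresponds to a smaller $\lambda$, up to a constant factor.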

If regularisation appears to be desired, why is this problematic? I would argue that the choice of regularisation is inherently a choice that the user should make. In its textbook form, logistic regression is not regularised at all; regularisation is not part of the model specification. Thus, unless your classifier or model was proven to perform best with a specific regularisation technique, it is outright dangerous to enable regularisation by default. In the worst case, it might trick users into believing that they did not employ regularisation when in fact they did. When comparing to other methods in a publication, it is common practice to report the parameters that one selected for a classifier, so a somewhat hidden assumption on the model can be very problematic for the reproducibility of a paper. This issue was discussed in a Tweet by Zachary Lipton, and I would definitely encourage reading this discussion thread.
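
To make the effect of the default tangible, here is a minimal sketch that fits the classifier twice on a small synthetic data set: once with the defaults and once with a very large C. The data set, the seed, and the value C=1e6 are arbitrary choices for illustration. A very large C only approximates an unregularised fit; the way to switch the penalty off entirely differs between scikit-learn versions, so it is best looked up in the documentation of the installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small, noisy toy problem; all values here are arbitrary illustration choices.
X, y = make_classification(
    n_samples=100,
    n_features=20,
    n_informative=5,
    n_redundant=0,
    flip_y=0.1,
    random_state=0,
)

# Fit with the defaults, i.e. with the L2 penalty silently enabled.
default_model = LogisticRegression(max_iter=5000).fit(X, y)

# Fit with a very large C, i.e. with an almost negligible penalty.
weak_model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

default_C = default_model.get_params()["C"]
default_norm = float(np.linalg.norm(default_model.coef_))
weak_norm = float(np.linalg.norm(weak_model.coef_))

print(f"default C:                       {default_C}")
print(f"coefficient norm with defaults:  {default_norm:.3f}")
print(f"coefficient norm with C = 1e6:   {weak_norm:.3f}")
```

The default fit should report a smaller coefficient norm, because the penalty shrinks the weights. A reader who only sees `LogisticRegression()` in a paper's code has no visible hint that this shrinkage took place.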

Blaming and overreacting

A natural reaction to such a discussion is to play the ‘blame game’. I fully agree that users should read the documentation of the software that they are about to use[5]. Nevertheless, I would say that it is our responsibility as developers to think about these cases and try to develop a safe API. In the case of logistic regression, there is a simple fix:

  1. Remove the default behaviour for regularisation.
  2. Create a new class RegularisedLogisticRegression that implements some form of regularisation by default.

This is less problematic because the name of such a class will make it clear to users what is going on, whereas the current class name is still (slightly) misleading. Of course, such choices should not lead to the overreaction of abandoning all defaults in software. On the contrary! Defaults are useful because they keep our interface clean and make it possible to ‘just try out’ our software. From the perspective of a user, a tool that is set up to do ‘the right thing’ is a blessing, and judging from my personal usage patterns, having ‘sensible defaults’ also increases the likelihood of people using your software.
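
The two-class proposal above could look roughly like the following toy sketch. It is written from scratch in plain Python with simple gradient descent and is not scikit-learn's code; the parameter names `l2_strength`, `learning_rate`, and `n_iterations`, as well as the toy data, are assumptions made for illustration.

```python
import math


class LogisticRegression:
    """Plain logistic regression fitted by gradient descent (toy sketch)."""

    l2_strength = 0.0  # the plain model never penalises its weights

    def __init__(self, learning_rate=0.1, n_iterations=2000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = []
        self.bias = 0.0

    def _probability(self, row):
        # Logistic function applied to the linear score of one sample.
        z = self.bias + sum(w * x for w, x in zip(self.weights, row))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = len(X), len(X[0])
        self.weights = [0.0] * n_features
        self.bias = 0.0
        for _ in range(self.n_iterations):
            errors = [self._probability(row) - label for row, label in zip(X, y)]
            for j in range(n_features):
                # Gradient of the mean log-loss plus (l2_strength / 2) * ||w||^2.
                gradient = sum(e * row[j] for e, row in zip(errors, X)) / n_samples
                gradient += self.l2_strength * self.weights[j]
                self.weights[j] -= self.learning_rate * gradient
            # The intercept is not penalised.
            self.bias -= self.learning_rate * sum(errors) / n_samples
        return self

    def predict(self, X):
        return [1 if self._probability(row) >= 0.5 else 0 for row in X]


class RegularisedLogisticRegression(LogisticRegression):
    """Same model with an L2 penalty; the class name announces the penalty."""

    def __init__(self, l2_strength=1.0, learning_rate=0.1, n_iterations=2000):
        super().__init__(learning_rate, n_iterations)
        self.l2_strength = l2_strength


# Tiny one-dimensional data set whose classes overlap (not linearly separable).
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]

plain = LogisticRegression().fit(X, y)
regularised = RegularisedLogisticRegression(l2_strength=1.0).fit(X, y)

print("plain weight:      ", plain.weights[0])
print("regularised weight:", regularised.weights[0])
```

With this split, a default penalty is still available, but anyone reading `RegularisedLogisticRegression()` in a script knows that the weights were shrunk.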

No surprises!

The practical implementation of this policy is not always easy, though. The principle of least astonishment, also known as the ‘principle of least surprise’, refers to a design philosophy that tries to reduce the astonishment of users when it comes to the behaviour of the system. For example, when numpy.sum is called on an array without an axis argument, it takes the sum over the whole array. This makes sense because there is no ‘default axis’ on which all users can agree: some would probably prefer sums over the rows, while others would prefer sums over the columns. The implementation sidesteps this issue by forcing the user to make a choice in case the default does not do the expected thing.
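
The behaviour of numpy.sum described above can be checked with a small example; the array is made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Default (axis=None): sum over every element of the array.
total = np.sum(a)

# axis=0 collapses the rows and yields one sum per column.
column_sums = np.sum(a, axis=0)

# axis=1 collapses the columns and yields one sum per row.
row_sums = np.sum(a, axis=1)

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```

The default never favours rows over columns; anyone who wants one of the two has to ask for it explicitly.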

Given that more complex algorithms exist for which these defaults are not clear, what can we do? I would suggest the following strategy:

  1. Provide example code that sets all parameters required for using a tool, even those that are set by default to some value. Not only does this ensure reproducibility to some extent, it also gives the user a notion of how complex our tool is, and which things to keep in mind when employing it.
  2. Keep the number of default parameters small and restrict them to a set of safe values. For example, when your algorithm can use an approximation scheme that speeds up the computations at the cost of accuracy, it should not be enabled by default. However, if your algorithm can either perform calculations ‘bottom-up’ or ‘top-down’, with certain computational advantages in some situations, any one of these could be set by default, as long as it does not affect the results.
  3. Realise that different users have different expectations. Be prepared to compromise and adjust your tools accordingly. Try to warn users whenever they are about to venture into dangerous territory.
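
The three points above can be sketched with a small, hypothetical function. The function `mean_pairwise_distance` and its parameters `approximate`, `n_pairs`, and `seed` are invented for this illustration: the approximation is off by default, switching it on triggers a warning, and the example call spells out every parameter, including those that keep their default value.

```python
import itertools
import math
import random
import warnings


def mean_pairwise_distance(points, *, approximate=False, n_pairs=1000, seed=0):
    """Mean Euclidean distance over pairs of distinct points (toy sketch).

    With approximate=False (the safe default), all pairs are used.
    With approximate=True, only n_pairs randomly drawn pairs are used.
    """
    if not approximate:
        pairs = list(itertools.combinations(points, 2))
    else:
        # Warn users before they trade accuracy for speed.
        warnings.warn(
            "approximate=True estimates the result from random pairs; "
            "report n_pairs and seed alongside your results.",
            UserWarning,
            stacklevel=2,
        )
        rng = random.Random(seed)
        pairs = [rng.sample(points, 2) for _ in range(n_pairs)]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Example call that sets every parameter explicitly, defaults included.
exact = mean_pairwise_distance(points, approximate=False, n_pairs=1000, seed=0)
print(exact)
```

Making the parameters keyword-only is a further design choice in this sketch: a call site then always shows which option was changed.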

May your defaults be safe and unsurprising, until next time!

  1. That particular scenario should not be underestimated, but that’s the subject of another article. ↩︎

  2. I should mention that I have the utmost respect for the authors of this package. I am merely using it as an example that many readers can probably relate to, because it is the standard package for Python-based data science. ↩︎

  3. …and statistics, and deep learning, and artificial intelligence, etc. ↩︎

  4. This is only a cursory explanation that is meant to build intuition for grasping the problem. ↩︎

  5. The documentation of the logistic regression classifier now features a prominent warning about this issue. ↩︎