The Power of Defaults
Tags: musings, research
I use every ounce of willpower to keep still. One wrong move might wreak havoc. After carefully assessing the situation, I move my mouse over the right option, select it, and press ‘Next’. Crisis narrowly averted, I now confront the next set of inane settings, until finally this installation of Windows 10 is finished. Midway through the process, I ask myself whether the designers of the setup process are aware that the majority of users will probably select the defaults, thus making it easy for Microsoft to collect extra diagnostic data, install additional software, and decrease the user’s privacy even further. Welcome to 2020!
While setting up my dad’s new computer, I had a lot of time to speculate on the choices that designers make when creating an installation script. It dawned on me at some point that we, i.e. researchers who develop a software package alongside their research, are doing precisely the same thing. We create a software tool for solving a certain problem. It might be an itch that we want to scratch, or it might be software that is related to our research; in the end, we all write some code in some language to produce some kind of value. How often do we think about the dangers of the API that we are exposing, though?
Misleading parameters
I am not talking about the dangers of our programs turning rogue and trying to take over the world[1]. I am talking about us providing potentially misleading parameters to our users. Take scikit-learn as an example[2]. This package, arguably the default for many data science applications, offers a wealth of classifiers, metrics, and utility classes to train models. I want to focus on the LogisticRegression class in particular. Logistic regression is a classifier that extends linear regression to the classification setting: given data and categorical labels, logistic regression fits a model based on the logistic function. With a trained classifier, we can predict the label of ‘unseen’ data points later on. So far, so good: this is the daily bread of many researchers, data scientists, and data analysts. The implementation of logistic regression in scikit-learn is comprehensive and offers numerous additional configuration options to tweak the model.
Dangers lurk around every corner
Its default parameters, however, are dangerous. Specifically, by default, the model is subject to $L_2$ regularisation, with the inverse regularisation strength set to C = 1.0 (in scikit-learn, smaller values of C mean stronger regularisation). A little background might be required to see just why this might be dangerous. In machine learning[3], regularisation refers to adding constraints on a classifier to ensure that the classifier does not suffer from overfitting. Put somewhat differently, regularisation slightly changes the model that is fit in order to improve generalisation performance. This is because, when creating a classifier, we do not want a model to learn the data ‘by heart’; rather, we want it to perform well on new data points. A classical regularisation technique is $L_2$ regularisation, which adds a penalty, the squared Euclidean norm of the weight vector, in order to make fitting the model harder, thus increasing its bias and preventing overfitting[4].
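To make the effect of the penalty concrete, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn fits its models: the function `fit_logistic`, the `l2_strength` parameter, the learning rate, and the toy data are all assumptions made for illustration. The sketch fits a one-dimensional logistic regression by gradient descent, once without a penalty and once with an $L_2$ penalty on the weight.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, l2_strength=0.0, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by plain gradient descent.

    The objective is the mean log-loss plus l2_strength * w**2.
    The bias b is not penalised. All names here are hypothetical.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        grad_w = grad_w / n + 2.0 * l2_strength * w  # gradient of the penalty term
        grad_b = grad_b / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data with overlapping classes, so the unpenalised fit stays finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w_plain, _ = fit_logistic(xs, ys)                # no regularisation
w_l2, _ = fit_logistic(xs, ys, l2_strength=0.1)  # L2-penalised

print(f"weight without penalty: {w_plain:.3f}")
print(f"weight with L2 penalty: {w_l2:.3f}")  # smaller in magnitude
```

The penalised weight is pulled towards zero, so the two fits are genuinely different models, even though both would be called ‘logistic regression’.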
If regularisation appears to be desired, why is this problematic? I would argue that the choice of regularisation is inherently a choice that the user should make. In its standard formulation, logistic regression is not regularised at all; regularisation is not part of the model specification. Thus, unless your classifier or model was proven to perform best with a specific regularisation technique, it is outright dangerous to enable regularisation by default. In the worst case, it might trick users into believing that they did not employ regularisation when in fact they did. When comparing to other methods in a publication, it is common practice to report the parameters that one selected for a classifier, and parameters left at their defaults are easily omitted. A somewhat hidden assumption about the model can therefore be very problematic for the reproducibility of a paper. This issue was discussed in a Tweet by Zachary Lipton, and I would definitely encourage reading this discussion thread.
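The following sketch shows how one might check what a default-constructed LogisticRegression actually does. The synthetic data and the variable names are assumptions for illustration. Because the exact switch for turning the penalty off has changed between scikit-learn versions, the sketch uses a very large value of C as a stand-in for ‘practically no regularisation’; that is an approximation, not an official recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with overlapping classes (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
logits = 3.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.uniform(size=100) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# What a user gets without passing any arguments.
default_model = LogisticRegression()
default_C = default_model.get_params()["C"]
print("default C:", default_C)

# Assumption: a huge C makes the penalty negligible, approximating
# an unregularised fit without relying on version-specific options.
weak_model = LogisticRegression(C=1e8, max_iter=1000)

default_norm = np.linalg.norm(default_model.fit(X, y).coef_)
weak_norm = np.linalg.norm(weak_model.fit(X, y).coef_)

print("coefficient norm, defaults:", default_norm)
print("coefficient norm, C=1e8:   ", weak_norm)
```

The default fit has the smaller coefficient norm, which is the shrinkage one would not notice when reading only the call `LogisticRegression()`. Printing `get_params()` and reporting its full contents is one way to make such hidden choices visible in a paper.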
Blaming and overreacting
A natural reaction to such a discussion is to play the ‘blame game’. I fully agree that users should read the documentation of the software that they are about to use[5]. Nevertheless, I would say that it is our responsibility as developers to think about these cases and try to develop a safe API. In the case of logistic regression, there is a simple fix:
- Remove the default behaviour for regularisation.
- Create a new class RegularisedLogisticRegression that implements some form of regularisation by default.
This is less problematic because the name of such a class will make it clear to users what is going on, whereas the current class name is still (slightly) misleading. Of course, such choices should not lead to the overreaction of abandoning all defaults in software. On the contrary! Defaults are useful because they keep our interface clean and make it possible to ‘just try out’ our software. From the perspective of a user, setting up a tool such that it does ‘the right thing’ is a blessing, and judging from my personal usage patterns, having ‘sensible defaults’ also increases the likelihood of people using your software.
No surprises!
The practical implementation of this policy is not always easy, though. The principle of least astonishment, also known as the ‘principle of least surprise’, refers to a design philosophy that tries to reduce the astonishment of users when it comes to the behaviour of the system. For example, when calling numpy.sum without an axis argument, it takes the sum over the whole array. This makes sense because there is no ‘default axis’ on which all users can agree: some would probably prefer sums over the rows, while others would prefer sums over the columns. The implementation sidesteps this issue by forcing the user to make a choice in case the default does not do the expected thing.
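As a small illustration of this behaviour, the sketch below calls numpy.sum on a 2×3 array with and without an explicit axis. The array contents are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(a)                # no axis: sum over every element
column_sums = np.sum(a, axis=0)  # sum down the rows, one value per column
row_sums = np.sum(a, axis=1)     # sum across the columns, one value per row

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```

The default makes no choice between rows and columns; anyone who wants one of the two has to say so explicitly.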
Given that more complex algorithms exist for which these defaults are not clear, what can we do? I would suggest the following strategy:
- Provide example code that sets all parameters required for using a tool, even those that are set by default to some value. Not only does this ensure reproducibility to some extent, it also gives the user a notion of how complex our tool is, and which things to keep in mind when employing it.
- Keep the number of default parameters small and restrict them to a set of safe values. For example, when your algorithm can use an approximation scheme that speeds up the computations at the cost of accuracy, it should not be enabled by default. However, if your algorithm can either perform calculations ‘bottom-up’ or ‘top-down’, with certain computational advantages in some situations, either one of these could be set by default, as long as it does not affect the results.
- Realise that different users have different expectations. Be prepared to compromise and adjust your tools accordingly. Try to warn users whenever they are about to venture into dangerous territory.
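The points above can be sketched with a toy example. Everything in it is hypothetical: the function `mean_pairwise_distance`, its parameters, and the sampling scheme are made up to illustrate the idea of an exact default, an opt-in approximation that warns, and example code that spells out every parameter.

```python
import random
import warnings


def mean_pairwise_distance(values, *, approximate=False, n_samples=100, seed=0):
    """Hypothetical tool: mean absolute difference over all pairs of values.

    approximate=False (the safe default) evaluates every pair exactly.
    approximate=True samples random pairs instead and warns the caller,
    because the result then depends on n_samples and seed.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")

    if not approximate:
        total, count = 0.0, 0
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                total += abs(values[i] - values[j])
                count += 1
        return total / count

    warnings.warn(
        "approximate=True trades accuracy for speed; "
        "the result depends on n_samples and seed",
        UserWarning,
        stacklevel=2,
    )
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a, b = rng.sample(values, 2)  # two distinct positions, no replacement
        total += abs(a - b)
    return total / n_samples


values = [1.0, 2.0, 4.0, 8.0]

# Example code that sets every parameter, including those left at their defaults.
exact = mean_pairwise_distance(values, approximate=False, n_samples=100, seed=0)
print("exact:", exact)
```

Calling the same function with `approximate=True` still works, but it emits a UserWarning, so the user cannot end up in the less accurate mode without noticing.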
May your defaults be safe and unsurprising, until next time!
[1] That particular scenario should not be underestimated, but that’s the subject of another article.
[2] I should mention that I have the utmost respect for the authors of this package. I am merely using it as an example that many readers can probably relate to, because it is the standard package for Python-based data science.
[3] …and statistics, and deep learning, and artificial intelligence, etc.
[4] This is only a cursory explanation that is meant to build intuition for grasping the problem.
[5] The documentation of the logistic regression classifier now features a prominent warning about this issue.