Open Source and Academia

Tags: academia, musings

Published on Saturday, July 2, 2022


With data science, machine learning, and artificial intelligence rapidly becoming the ‘technology du jour,’ academia has changed. In earlier decades, the output or ‘artefacts’ of academics were publications or, to a minor extent, patents and other translations into practice. Now, however, code plays a major role, and even in the life sciences, researchers are constantly commuting between the bench and their data analyses. While this constitutes an unprecedented opportunity for team-driven science that harnesses the power of open source, many academics are loath to embrace open source principles. This blog post is aimed at both producers and consumers of academic code, and sets out to provide some simple guidelines.

For Producers

Before I turn to the benefits and make a passionate plea for open science, let me first mention a few practical steps. Consider this: are you a trailblazing principal investigator or Ph.D. student bent on changing the world with your science, but you do not know a lot about releasing your code? Here are some simple steps to follow:

Check whether you are allowed to share your code at all. At Helmholtz Munich, we have a great commitment to releasing code, but check with your research institution and your peers whether this is possible. Often, it can be better to ‘ask for forgiveness than to ask for permission.’
Put your code on GitHub (or a similar website) and—this is important—pick a license under which to release it. I have discussed the benefits of software licenses in a previous post; let me summarise that discussion by stating that any license is better than no license since a license provides a set of terms under which others are allowed to use your code. If you are confused by the many licenses out there, you can also just pick CRAPL, a license invented by the inimitable Matt Might and specifically targeting academic settings.
Set your repository to be visible to the public and add a README.md in which you explain the most important things about your code repository, to wit:
1. What does the repository contain?
2. Is there a paper associated with the code?
3. How do you run the code?
Even if you believe that no one will use your code, you at least will provide some documentation for yourself, which may come in handy when you revisit the project at some point in the future.
You are done, congratulations—you have open-sourced your research code!
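
If a blank page is what keeps you from writing the README, the three questions above can be turned into a small template. The following is only a sketch in Python: `render_readme`, its parameters, and the default run command are names invented for this illustration, not part of any existing tool.

```python
def render_readme(title, summary, paper=None, run_command="python main.py"):
    """Return a minimal README that answers the three questions above.

    All arguments are plain strings; `paper` may be None if there is
    no publication (yet).
    """
    paper_line = paper if paper else "No paper is associated with this code yet."
    return "\n".join([
        f"# {title}",
        "",
        "## What does this repository contain?",
        "",
        summary,
        "",
        "## Associated paper",
        "",
        paper_line,
        "",
        "## How to run the code",
        "",
        "    " + run_command,  # four spaces render as a code block in Markdown
        "",
    ])


if __name__ == "__main__":
    # Hypothetical project details; replace them with your own.
    print(render_readme(
        title="my-method",
        summary="Code for the experiments described in our paper.",
    ))
```

Saving the printed text as README.md gives you a starting point that you can refine whenever you revisit the project.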

Captain Picard from Star Trek: The Next Generation enjoying a win. This is the reaction of any researcher upon stumbling over your code on GitHub or somewhere else. You cannot imagine how nice it is to find and use the implementation of someone else in another project.

Is this all? At this point, you might hear an uproar from more knowledgeable readers: ‘What about unit tests? What about documentation? What about maintenance?’ I agree: all of these things are important. At the same time, le mieux est l’ennemi du bien.¹ By this I mean that many academics are hard-pressed for time and are reluctant to commit to what they expect to be tantamount to building a cathedral for their code. No one expects you to deliver a perfectly polished product. It is true that unit tests, code coverage analysis, and superb documentation are important aspects of well-built software packages, but if these requirements preclude researchers from open-sourcing their code in the first place, no one benefits. It is better to strive for excellence over time; having some code for the community is better than none.

Benefits! This brief segue into the struggle for excellence brings me to the subject of benefits. You, as the producer of fine code and papers, benefit quite a lot from open source code:

You open your results up for discussion, making it possible for people to reproduce and replicate them. You are thus doing your part to combat the replication crisis.
You enable applications that you did not have on your radar before; perhaps an aspect of your method turns out to be critical for someone else’s project.
You make it possible to ‘pass the torch’ to someone else in the future. As your research interests shift, your previous work will still endure and can be built upon.
Finally, you substantially increase the reach of your methods, making it possible to potentially acquire new collaboration partners, thus advancing science together.

A personal anecdote. In the spirit of ‘Do as I say, not as I do,’ I regret not open-sourcing the code for several of my early research papers. I have had marvellous conversations with researchers about my work on topology-based evaluations of dimensionality reduction schemes and clustering methods; I think we could have saved everyone so much time if there had been a simple package to use for these purposes. Do not be like my previous self: make your code public.

But… There are some common reservations against making your code publicly available. Let me try to answer them (and feel free to reach out to me and tell me whether I missed some).

…my code is bad! So is mine. So is everyone’s. Beauty is in the eye of the beholder anyway, and no one is going to mind or publicly call you out for releasing ‘unaesthetic code’ into the wild.² We are researchers, not professional software developers.
…I might get scooped! This one is tougher to answer. First, I would say that fear is never a good motivator. If you truly fear being scooped, any papers you are writing about the subject should also add to that fear, though. My take is that after a publication has been accepted, there is no need for secrecy any more. Contrariwise, making code available can decrease the probability of being scooped because you already ‘planted your flag,’ as it were. Maybe, in the spirit outlined above, this might even make it possible to join forces with your potential scooper?
…it’s so much work! It truly is not. The steps outlined above are perfectly sufficient. You are providing something for the community, so of course you get to call the shots here.
…I cannot maintain the code! This one is tough to address. I realise that science does not have the right incentive structure in most places. To put it plainly: we are supposed to deliver papers, not handle GitHub issues. Of course, in the right place (did I mention how awesome Helmholtz Munich is in this respect?), great collaborative software projects can bloom. Take scanpy, for instance: this is rapidly becoming the toolkit for Python-based single-cell data analysis; it has recently surpassed 100 contributors from all over the world. Many hands make light work, and no single person or lab has to shoulder the burden of maintenance.

Of course not every project is of that scope, so let me repeat what I wrote above: You are providing something for the community, so of course you get to call the shots here. It is perfectly acceptable and reasonable to release the code ‘as-is’ and tell people that you do not have the resources to provide guaranteed support. No one should expect this—open source is a little bit like a gift; it is bad form to feel entitled to anything. We will return to this point in the subsequent paragraphs.

A futuristic utopia indicating that the world could be so much better if people released their code all the time.

I hope I have convinced you to put your code out there. Try it out: you in particular and science in general can only win! And if you decide to put something out there, know that I and many others are very grateful. Our community and our daily research would not be the same without the brave folks that set their code free to let it blossom.

For Consumers

At the end of this loquacious post, let me briefly address the ‘other side,’ i.e. the users or consumers of open source code. Here, I would say that my main point is ‘Be excellent to each other.’ To elucidate: whenever you are dealing with open source code (academic or not) there is a high chance that some amateurs were involved. Amateur is not meant as an insult but just to indicate that most open source developers are not being paid for their work. Instead, they provide it because they love what they are doing.³ Or, to use less exalted terms, they provide something voluntarily.

Homer Simpson looking good from the front but dishevelled from the back, representing a paper and its corresponding code. This is a situation you will often encounter when dealing with academic code. Most of us put more time and thought into our papers because papers are a recognised currency much more than code. However, the times are changing!

From this, we can derive a few rules of engagement for consumers or users:⁴

Be respectful and kind when asking for help with a project. On the other end of your GitHub issue or e-mail, you will probably have an exasperated, sleep-deprived Ph.D. student who is writing their papers and their thesis using a combination of hope, despair, and caffeine. They probably want to help you; a few nice words will not hurt your case. It has been written elsewhere that ‘Pleasant words are as a honeycomb, sweet to the soul, and health to the bones.’

(Rich Hickey wrote the essay Open Source is Not About You, a powerful plea for seeing open source projects as gifts rather than something one is entitled to. His essay expresses some frustration about the status quo in a specific community, but the points can also apply to smaller projects.)
Do some research first. Research code is notoriously brittle. It often was packaged pretty quickly by the aforementioned sleep-deprived researchers. It is thus quite likely that certain things might not work; you can help the troubleshooting process by doing some research and trying out a few things. Here are a few examples of that:
- The code does not work with a specific Python version? Use pyenv to try a different one.
- The code does not work with your operating system? Try a virtual machine or container first.
- There are errors arising from other packages? Check their respective issues first. Maybe someone else has a solution.
Jeff Atwood (of Stack Exchange fame) describes numerous relevant strategies for solving and describing problems. This is definitely worth a read!
Whenever possible, try to give back. Many projects are in dire need of additional contributors. If you are a power user of a specific package and you encountered a problem that you solved, consider providing your fix as a patch to the ‘upstream’ project. See an error in the documentation? Fork the repository and fix it. Try to improve the project a little bit and the maintainers will usually be ever so grateful.

Of course, the extent of your contributions is a function of your experience level and interests. If you do not know a lot about programming, there is no need for you to write patches. However, even as a non-technical user of a project, you can still spot errors or inaccuracies in the documentation, for instance. Try to be a ‘good citizen of the open source world.’
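
Coming back to the ‘do some research first’ rule: a question is much easier to answer when it states which Python version, operating system, and package versions are involved. A minimal sketch of how to collect this with the standard library follows; `environment_report` is a hypothetical helper name, and the package list is only an example.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, platform, and package versions as plain text."""
    lines = [
        f"Python: {platform.python_version()}",
        f"Platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example package names; list whatever the project depends on.
    print(environment_report(["numpy", "scanpy"]))
```

Pasting this output into a GitHub issue or e-mail, together with the full error message, saves the maintainers at least one round of questions.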

Closing Thoughts

I hope this article made a convincing case for open source. Before you go, do not forget to enjoy your ride. Being able to contribute both to science, which is already a pretty large human endeavour on its own, and to open source is a privilege. I sincerely hope that it brings you as much joy as it has brought me.

Until next time!

¹ Perfect is the enemy of good.
² Please ‘summon’ me on Twitter or via e-mail if someone tries to bully you because of code you released. I promise you I am going to get medieval with them. Words will be had, denouncements or burn notices will be written, and, in case that is not enough, I shall summon the Spanish inquisition, which will come as an unexpected surprise to everyone. I mean it.
³ That is indeed the origin of the word amateur, amator referring to a ‘lover’ in Latin.
⁴ Some of these rules double as general life advice; that only goes to show that open source is fundamentally a human endeavour, necessitating a social strategy.