Clopen Data

Tags: research, musings

Published on Sunday, May 25, 2025


As much as we sometimes tend to forget it, machine learning runs on data.¹ Despite this, we (that is, the researchers) often treat data like manna: something that is bestowed upon us from on high, typically in an ineffable fashion. Certainly not something that we should question or concern ourselves with! Unfortunately, this mentality has some unexpected side effects, most notably that in many cases, we do not even know how some of our benchmark data sets came to be or for what purpose. One might thus expect that machine-learning research, echoing the real world here, would have spawned some version of theology or philosophy of its own, focusing on the big questions:

Where did our data come from and who collected it?
Why was it collected?
What should we do with it?

Unfortunately, this did not happen. Data is still treated as something one does not speak about. Meanwhile, the real world turns and toils and, almost unbeknownst to machine learning, has seen the rise of initiatives promoting Open Data. While we are still using the MNIST data set, data scientists are running circles around us and are mining the whole German public-transport ecosystem, for instance.

But not all data sets are equally open. Under the moniker of Open Data, some actors are doing the bare minimum, viz. actually publishing something that can be downloaded in some file format. No explanations, no keywords, no tags—nothing. I believe that we should call such data sets Clopen Data. Similar to topology, where a set can be open and closed at the same time, there are data sets that are both open (i.e. available) and closed (i.e. unusable).

Clopen Data arises in various forms:

The aforementioned doing-the-bare-minimum actors that just upload some undocumented data set somewhere because they are forced to but in reality cannot really be bothered. Bonus points if the licensing terms are unclear, making it almost impossible to work with the data. In the worst case, such data sets are used without a good understanding of their provenance, leading to problems with reproducibility later on.

In a talk titled ‘What’s in a Graph?’, I expressed this situation for graph-learning research somewhat humorously, borrowing liberally from Tolkien:

And some things that should not have been forgotten were lost. History became legend. Legend became myth. And even myth was long forgotten when some poor wretch trained a large graph neural network on poorly-understood data.

(The community has since acknowledged that problem in an article that discusses the relevance of strong benchmark data sets. However, there are still formidable chasms to bridge between clopen data, open data, and data that is useful as a benchmark data set…)

Data sets that are technically open in the sense that they can theoretically be obtained but are in practice hidden behind an application process, such as the dreaded ‘We will make this data available to all bona fide researchers.’ A study from 2021 shows that about 27–59% of all such requests are successful; and even if they are, you might get clopen data of the first form. We have few checks and balances in place to ensure that such data requests are honoured not only in the letter of the law but also in its spirit. Data sets that will be published ‘soon,’ but whose repository was last updated a decade ago, also fall under this category.

What happens in reality with such clopen data is that you have to be ‘in the know’ to actually make use of the data. If you are not part of the data cognoscenti, you probably have no way to make use of the data, and do not even think about getting help when something in the data is unclear.

Finally, we have data that is clopen because it cannot be properly found. In contrast to clopen data of the first form, there are numerous actors (cities, countries, organisations) that actually care about publishing their data. They lovingly tag data sets, put them in a database, and upload them. However, their frontend is not conducive to discoverability, meaning that potential users typically do not find the data. A classic frontend failure leading to clopen data is a search engine that is overly literal. For instance, you search for police reports when you should have searched for police report.²

To me, this is the saddest form of clopen data, and easily the most prevalent. Governmental repositories in particular seem to suffer from these issues. I presume that there is a disconnect between the people who collect the data and the people who are tasked with implementing a search engine. And probably, none of them are actually data consumers; if they were, these types of problems would probably not exist. The worst thing about this is that, in the long run, it might make an organisation stop publishing its data because the data is not being used. Little do they know that there would be a lot of data consumers out there if only they could find the data.
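To make the overly literal search failure concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the catalogue titles are invented, the function names are hypothetical, and the plural handling is a deliberately crude stand-in for the stemming a real search backend would use. It only shows how little normalisation it takes for ‘police reports’ to find ‘Police report’.

```python
import re

# Hypothetical catalogue of data-set titles; not taken from any real portal.
CATALOGUE = [
    "Police report 2023",
    "Public transport timetables",
    "Air quality measurements",
]


def literal_search(query, titles):
    """Return titles that contain the query verbatim (case-sensitive)."""
    return [title for title in titles if query in title]


def normalise(text):
    """Lower-case the text, split it into words, and strip a trailing 's'.

    Stripping a trailing 's' is a crude assumption, not proper stemming.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}


def tolerant_search(query, titles):
    """Return titles that contain every normalised word of the query."""
    wanted = normalise(query)
    return [title for title in titles if wanted <= normalise(title)]


print(literal_search("police reports", CATALOGUE))
print(tolerant_search("police reports", CATALOGUE))
```

The literal variant comes back empty for the plural query, whereas the tolerant variant finds the police data set. A production system would more likely rely on the stemming or fuzzy matching that its search engine already ships with.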

…and by opposing end them: Having outlined some of the problems, what can we do about them? What can you do about them? I dislike whining in my posts and I certainly do not want to pass the buck for producing good open data to someone else. On the contrary: I strongly believe that we, as scientists, should be open³ about our needs for open data and assist data producers. This can take many forms:

We can be upfront about data formats that are useful (computer vision, by its very nature, seems to be at an advantage there).
We can provide feedback about existing data sets, letting the producers know that we appreciate their work (maybe this could also lead to new collaborations). The power of a simple ‘Thank you’ e-mail remains very much underestimated.
Finally, and maybe most importantly, we could start a cultural shift towards recognising data-production and data-curation efforts. The fact that NeurIPS and other conferences accept data-set submissions now is already a good start.

Even if data production is not our strong suit, we should endeavour to think about it more carefully if we want to keep proposing algorithms of practical relevance. Let’s aim for more and overcome clopen data together.

¹ And compute. Massive amounts of compute. But first and foremost, data, because how can you use your high-performance compute cluster if you do not have any data to process?
² I am making this up, but the broader pattern exists.
³ Yes, I went there.