Web of Science: A Web of Nonsense
One of the worst outgrowths of academic research is companies trying to track the impact of your research.1 On the surface level—before engaging with the topic for more than a split second—I can fully understand the perceived need for that. Let me tell you why: suppose you are working at a university, hiring some hotshot scientists. You have no clue what they are actually doing. But you do have a couple of questions:
- How do you make sure that they are producing good science?
- Wait, what is good science, actually?
- Could it be that some scientific advances are immediately usable, while others may take longer to be of use?
- Is it a bad idea to view science through the lens of utility?
Any of these questions would involve delving deeper into the complex interplay of science and society, so of course, being a responsible and rational person, you engage with the various scientific departments. First, you talk to the pure mathematicians, trying to ‘grok’ how they discover the fabric of mathematics. Then, you talk to the physicists and marvel with them at our universe. Finally, you talk to the historians and gain a deeper understanding of the frustration they must feel about seeing cycles of behaviour repeat themselves constantly. After some time, you get back to your office, enlightened.
This is how you would act in the ‘best of all worlds.’ In this fallen and tarnished universe, you probably do not have time for this because your bosses (and their bosses) demand answers, stat! Wheels are turning, and so on… This is where the companies behind Web of Science step in. Quoting from Wikipedia, Web of Science claims to be…
[…] a paid-access platform that provides (typically via the internet) access to multiple databases that provide reference and citation data from academic journals, conference proceedings, and other documents in various academic disciplines.
The emphasis is of course on ‘paid.’ If you pay for it, the quality must be good, right? No. No. No. Unfortunately, that is not the case. Behind the self-aggrandising name ‘Web of Science,’ there is actually little substance—at least not in some of the research fields I know best. If it were called ‘A Subset of Science,’ I could maybe let this slide, but as it stands, the omissions are glaring.
Let me start with a simple example. For those who are unfamiliar with him, Yann LeCun is one of the ‘Godfathers of Deep Learning’ and certainly an eminent figure in modern machine learning research. The author search of ‘Web of Science’ turns up a meagre three publications for him and claims that his real name is actually ‘rose, learning’ (sic). You cannot make this stuff up. Who thought it was a good idea not to vet any of this for correctness? The operators behind this useless database even have the gall to show Prof. LeCun’s profile with a green check mark, giving it the veneer of a record that was carefully checked by someone. Let me repeat: you cannot make this stuff up.
This is, unfortunately, just the tip of the iceberg. The data quality is freakishly low, since Web of Science is apparently unable or unwilling to index machine learning conferences such as ICLR, ICML, and NeurIPS. I do not know the reasons behind this, and I do not want to speculate. I can only say that many of my own publications at these venues are indexed only as preprints in this great database, and citations are not tracked reliably either (referring to Prof. LeCun again, you can imagine that he has racked up quite a few citations himself).
Now, where is the problem with this? If you are working in the real world, you must have seen your fair share of crappy databases. Everyone can ignore yet another one of those. However, the problem, to come back to our hapless administrator from the beginning, is that some universities make hiring—and firing—decisions based on this information. We can and should debate whether it is a good idea to measure the ‘impact’ of scientists by citations, but I think we can all agree on one thing: a database that excludes a large number of venues is not a good start for this endeavour. The situation is even juicier because Google Scholar, the de-facto database for publications,2 publishes a regular list of top-cited venues, and two machine learning conferences appear among the top 10, which otherwise consists almost entirely of journals.3 This is food for thought and might lead us to question traditional journal-based publishing, but Web of Science will have none of that, of course!
The list of problems continues: the database has trouble with non-ASCII names,4 and will assign ‘alternative names’ in what I assume must be an essentially random fashion (things like ‘rose, learning’). That’s a slap in the face for many scientists, in particular those who change their names for some reason. It is also a great strategy for alienating a large chunk of the scientific community, who, I am told, exhibit all kinds of names and characters. Erdős, anyone?
At this point, I hope to have convinced you that we are dealing with a ‘Web of Nonsense’ here. How do we move forward? Well, for starters, we could stop throwing away university money on access to crappy databases. I have no magic bullet5 for measuring the impact of science—in fact, I have a hunch that any such solution or metric could simply be gamed by people—but anything, even throwing dice at random, strikes me as a better and cheaper solution than relying on bad databases managed by bad companies that have absolutely no incentive to serve the dissemination of knowledge.
If you are using such databases, consider doing something else. If you are thinking about investing in them, please don’t. The money could be spent on something else…maybe even on science?
The word ‘impact’ always makes me think of an impending crash. Maybe this post will impact my career, huh? ↩︎
Google Scholar is not perfect, either. However, Google certainly does one thing well, namely searching, and in contrast to the questionable business practices of Web of Science and the like, Google does not charge universities—or scholars—for using Google Scholar. ↩︎
Of course, the irony of me celebrating this fact is palpable, since Google Scholar’s list involves yet another way of counting citations. I am aware of that. My point is that if we want to go ahead with these bogus metrics, we should at least strive to include all the venues. ↩︎
Because of course it does! ↩︎
I also think that such an assessment must be multidimensional and multifaceted by nature, since different sciences have different ways of making progress. And let’s not even get started about the different publishing cultures in different fields… ↩︎