The Better Lesson? Geometry and Topology in the Era of Deep Learning

Tags: research, musings


Sooner or later, every deep learning researcher encounters the essay “The Bitter Lesson.” Written by Rich Sutton, an early pioneer of what we today would call “Artificial Intelligence” (AI), the essay presents a crucial insight obtained from decades of research: In the long run, approaches that scale better with available computational power tend to outperform domain-specific solutions. In other words, scaling beats “handcrafted features” most of the time.

This mantra drives deep learning research, and indeed, on the surface, the lesson seems to apply: general-purpose deep learning architectures like convolutional neural networks or transformers have nearly obviated (or, depending on your perspective, finished) entire research fields like computer vision or natural language processing. These fields used to rely on elegant, manually designed feature descriptors that were ultimately outperformed and made obsolete by the capability of deep learning models to “learn” task-specific features.

A Cautionary Tale

Hence, the “Bitter Lesson” is often cited as a cautionary tale when it comes to developing new deep learning models, with prevailing wisdom dictating that one should always eschew complex, domain-specific solutions in favor of simpler, scalable ones. In practice, this often manifests in a general wariness towards seemingly “clever” and complex mathematical ideas being incorporated into models. As someone with a background in mathematics and computer science, I find this wariness understandable but also deplorable, since it often results in referees “shooting down” promising research directions based on, for instance, a perceived lack of scalability.1

(Unintentional) Double Standards

Looking closely, however, such arguments unintentionally2 involve double standards in that they compare recent innovations with well-established computational paradigms. For instance, convolutional neural networks (CNNs) showed in 2012 that they could outperform established computer vision models, but they had a long history before that:3 Yann LeCun, one of the “godparents” of modern AI, described CNNs in an article as early as 1989, and even that description built on earlier work. It just so happens that the early 2010s saw the rise of graphics processing units (GPUs), which turned out to be perfectly suited for performing the calculations required to make CNNs scalable. More precisely, GPUs suddenly made training a neural network feasible in practice, thus opening the door for more applications. By the standards of reviewers, LeCun’s work in the 1980s therefore lacked scalability—no wonder this period is generally referred to as an “AI Winter,” i.e., a period characterized by decreased spending on and interest in AI.

Hindsight is 20/20

The CNN story suggests that, when reviewing a new mechanism, one should at the very least make a careful distinction between practical scalability on contemporary hardware and theoretical scalability. The latter is much harder to predict or extrapolate: a reviewer in the 1980s could certainly extrapolate compute availability based on heuristics like Moore’s law, but such a reviewer would have been unable to predict the existence of GPUs, i.e., hardware that greatly speeds up the required linear algebra calculations. In this light, the “Bitter Lesson” is better treated as an insight that one obtains only in hindsight.

The Paradigms of Tomorrow?

Given the striking example of a paradigm becoming relevant only decades later, where can we expect to discover the paradigms that will drive deep learning research in a couple of decades? I am convinced that the answer lies in a return to the mathematical foundations underpinning the field. Specifically, I believe that two mathematical fields, namely geometry and topology, carry a wealth of concepts that may serve as building blocks for next-generation techniques in deep learning. Vast subjects on their own, geometry and topology are often considered two sides of the same coin in that they describe the same objects but from different perspectives. Roughly speaking, one could say that geometry focuses more on quantitative aspects, whereas topology focuses more on qualitative ones.

For instance, a typical geometrical question is “What is the distance between those two objects?,” whereas a typical topological question is “Are these two parts of an object connected to each other?” This is not the whole story, of course. Modern mathematics is fractured into small fiefdoms,4 depending on the general flavor of tools being used. Depending on whether one is more interested in studying discrete or smooth properties, one will often talk about either “algebraic geometry” or “differential geometry” and, mutatis mutandis, topology. Without attempting to further classify specific methods—an endeavor fraught with difficulties, since some concepts serve as bridges between the discrete and smooth worlds—I would rather expand a little on how geometry and topology can actually be used.
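To make the contrast concrete, here is a minimal sketch in Python (using NumPy and SciPy, on a toy point cloud of my own invention) that asks one geometric and one topological question about the same data:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

# A toy point cloud consisting of two well-separated clusters.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),  # cluster around (0, 0)
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),  # cluster around (5, 5)
])

# Geometric question: "What is the distance between those two objects?"
distances = cdist(points, points)

# Topological question: "Are these two parts connected to each other?"
# Build a neighborhood graph: points closer than a threshold are adjacent.
adjacency = distances < 0.5
n_components, labels = connected_components(adjacency, directed=False)

print(f"Maximum pairwise distance: {distances.max():.2f}")  # quantitative
print(f"Number of connected components: {n_components}")    # qualitative
```

The first answer is a number that changes if we move the points; the second is a qualitative statement that survives any deformation that does not tear the clusters apart.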

Uses of Geometry and Topology

Whenever a new tool arises for deep learning, it can be classified as being of either an “observational” nature or an “interventional” one.5 Tools belonging to the first category help us better understand a given model, for instance by highlighting the relevance of its inputs. They may also provide insights into the training regimen, alerting us to issues with the training process. Importantly, they do not influence the model in any way. Tools of the second category, by contrast, change the way a model processes data. For example, if we are training a model to reconstruct data, we could add a loss term that measures to what extent desirable geometric or topological characteristics of the data are preserved, thus ensuring that the model is incentivized to preserve such characteristics during training.
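As a sketch of what such an “interventional” tool might look like in code, the following PyTorch snippet (purely illustrative; it is not the formulation of any specific paper, and the weight `alpha` is a hypothetical choice) augments a plain reconstruction loss with a term that penalizes distortions of pairwise distances, one simple way to capture a geometric characteristic of the data:

```python
import torch

def geometric_reconstruction_loss(x, x_hat, alpha=0.1):
    """Reconstruction loss plus a simple geometric regularizer.

    x, x_hat: batches of shape (batch_size, n_features).
    alpha:    weight of the geometric term (a hypothetical choice).
    """
    # Standard reconstruction objective.
    reconstruction = torch.mean((x - x_hat) ** 2)

    # Geometric term: pairwise distances within the batch should be
    # preserved by the reconstruction.
    dist_x = torch.cdist(x, x)
    dist_x_hat = torch.cdist(x_hat, x_hat)
    geometric = torch.mean((dist_x - dist_x_hat) ** 2)

    return reconstruction + alpha * geometric
```

Because the extra term enters the loss, it changes the gradients and hence the model itself, which is exactly what distinguishes an interventional tool from an observational one.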

Case Study: The Intrinsic Dimension of Latent Spaces in Large Language Models

As a case study of how geometry can be used in an observational manner, I want to briefly summarize our recent NeurIPS paper on analyzing large language models. Given the sheer size and complexity of these models, new approaches are needed to better understand their training process.

As a brief refresher on how such models work: A large language model like ChatGPT essentially takes the textual input of a user and decomposes it into tokens, i.e., smaller units like parts of a word, which are subsequently embedded into a high-dimensional space. All subsequent operations, such as the generation of new answers, then make use of this high-dimensional token embedding space. As such, we hypothesized that all “interesting” training dynamics should be measurable by assessing this space, and we started measuring its intrinsic dimension (ID). This can be understood as the “degrees of freedom” a model has, with larger numbers typically corresponding to a more complex space. We measured ID locally, arriving at a per-token estimate, which turned out to be quite effective. Not only were we able to predict training convergence, i.e., the point at which further training is wasteful, but we were also able to detect overfitting better, thus preventing a model from essentially “memorizing” its input data instead of “learning” from it. Our measure could even be used to detect data a model had not been trained on, making it possible to train the model more selectively as opposed to showing it the same dataset over and over again—given the sheer size of datasets used for training and tuning large language models, this is a crucial problem to address in practice.
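To give a flavor of what an ID measurement can look like, here is a minimal sketch of the TwoNN estimator of Facco et al., which infers dimension from the ratio of each point’s two nearest-neighbor distances; the per-token estimator used in our paper may differ in its details, so treat this as an illustration of the general idea rather than our exact method:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_intrinsic_dimension(embeddings):
    """Estimate the intrinsic dimension of a point cloud via TwoNN.

    embeddings: array of shape (n_points, n_features), e.g. token
    embeddings; points are assumed to be distinct.
    """
    # Distances to the two nearest neighbors of each point. The first
    # returned column is the distance of a point to itself, hence we
    # query three neighbors and discard column 0.
    nn = NearestNeighbors(n_neighbors=3).fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    r1, r2 = distances[:, 1], distances[:, 2]

    # Ratio of second- to first-nearest-neighbor distance; under the
    # TwoNN model, this ratio follows a Pareto distribution whose shape
    # parameter is the intrinsic dimension d.
    mu = r2 / r1

    # Maximum-likelihood estimate of d.
    return len(mu) / np.sum(np.log(mu))
```

Tracking such an estimate across tokens and training checkpoints then yields the kind of convergence and overfitting signals described above.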

Case Study: Neural Differential Forms

Having seen how geometrical properties can be used in an observational manner, let us now look at an example of using geometry and topology in a more interventional fashion, obtaining a new framework for handling geometric data. This is based on our ICLR paper on simplicial representation learning.

Let us first introduce the necessary ingredients: When dealing with graphs or networks, we often have measurements along the edges or nodes—think of molecules, for instance, where nodes denote individual atoms and edges denote their bonds. Typically, such graphs are processed using a paradigm called message passing, which essentially passes information over the edges to neighboring nodes (see the sketch below). While this leads to models that are relatively simple to implement and run in practice, message passing is known to suffer from shortcomings, since it discards information about graph geometry. We wanted to address this and developed a new way to process graphs based on differential forms, i.e., functions that measure the volume of objects in high-dimensional spaces. Instead of propagating messages along the edges of the graph, we use learnable differential forms to let a model better understand the graph geometry! Thus, we use the graph’s internal topology to learn a shared, consistent geometry. As a result, we can process graph datasets more faithfully since we respect their geometry; on top of that, the resulting model is tiny by modern standards, making it easier to train on commodity hardware. Our formulation even works for higher-order data, i.e., data that is not restricted to the dyadic relations captured by graphs.
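For readers unfamiliar with the paradigm our method departs from, here is a minimal sketch of a single message-passing step (untrained and purely illustrative; real graph neural networks add learnable weights and nonlinearities on top of this aggregation):

```python
import numpy as np

def message_passing_step(adjacency, features):
    """One simplified message-passing step on a graph.

    adjacency: (n_nodes, n_nodes) binary adjacency matrix.
    features:  (n_nodes, n_features) node features, e.g. atom types.
    """
    # Add self-loops so that each node also keeps its own information.
    a_hat = adjacency + np.eye(adjacency.shape[0])

    # Normalize by node degree to average the incoming messages.
    degree = a_hat.sum(axis=1, keepdims=True)

    # Each node's new feature is the mean over its neighborhood;
    # stacking such steps lets information travel further through
    # the graph, one hop at a time.
    return (a_hat @ features) / degree
```

Note how the update only ever sees direct neighbors and their features: everything about the graph beyond its connectivity is invisible to it, which is precisely the information our differential-forms formulation aims to recover.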

Emerging Research Fields and a Better Lesson

The utility of geometrical and topological paradigms is already being recognized by deep learning researchers. An influential paper coined the term geometric deep learning (GDL) to denote deep learning methods that go beyond “ordinary” Euclidean space, thus targeting data like graphs or networks.

Geometric deep learning is predominantly known for giving rise to message-passing algorithms, focusing on capturing and preserving symmetry properties of graphs. This field is now complemented by topological deep learning (TDL), a nascent research direction that focuses more on tackling higher-order data as well as their relational structures. As it stands right now, neither field necessarily passes the spot check of the “Bitter Lesson,” in that not all GDL/TDL methods are computationally efficient yet, despite showing great promise for tackling relevant problems.6 The “Bitter Lesson” may thus, unintentionally, yet again serve as a gatekeeping mechanism for novel ideas in deep learning.

A “Better Lesson” could be that the conflict between scalability on the one hand and domain knowledge on the other is often nothing but a false dichotomy. If we want to create intelligent algorithms,7 they do not have to inherit our own cognitive constraints. The search for improved, efficient general-purpose computational architectures does not preclude us from imbuing our models with new inductive biases that make them better aware of the “shape” of data, as endeavored by GDL/TDL, for instance. As the proverb goes:

Gold is where you find it.

Maybe the next general-purpose deep learning architecture will be found in a geometry or topology textbook? I remain bullish about the role such core mathematical domains can play in the AI revolution. We just need to be fair and give them time to mature. Who knows what the field will look like when we check back in a couple of decades?

(This is an extended and revised version of an essay I wrote for the “Bulletin VSH-AEU” of the Swiss Association of University Teachers.)


  1. Machine learning is certainly not the only field of research that has to contend with this type of issue. ↩︎

  2. Always assuming the best! ↩︎

  3. I believe it would substantially advance our field if more of this history was known. ↩︎

  4. Much to the chagrin of everyone and, I suspect, very much to the detriment of the field. ↩︎

  5. Michael coined this distinction, which we subsequently used in our 2021 survey. ↩︎

  6. If you absolutely disagree with this take, please let me know. ↩︎

  7. Whatever intelligence might mean in this context. ↩︎