Shakespeare, topology, and machine learning

Tags: projects, research


In this post, I want to briefly explain how to combine methods from “topological data analysis” and “machine learning”. Given a Shakespearean social network, I want to predict whether it is likely to be a comedy or not.

Preliminaries & terms

Let’s first discuss some terms: a Shakespearean social network is a co-occurrence network generated from an annotated corpus of Shakespeare’s plays. The basic idea is to build a graph for each play in which characters that co-occur in a scene are connected by an edge. The edge weights roughly reflect how many lines of text the connected characters have, the idea being that less important characters should be connected by edges with smaller weights.
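To make this more concrete, here is a minimal sketch of such a construction in Python, using networkx. The input format (a list of scenes, each mapping characters to their line counts) and the exact weighting scheme are simplifying assumptions on my part; the actual extraction from the annotated corpus is more involved.

```python
# Minimal sketch: build a weighted co-occurrence network from a list of
# scenes, where each scene maps a character to the number of lines that
# character speaks. Both the input format and the weighting scheme are
# illustrative assumptions, not the original extraction pipeline.
import networkx as nx

def cooccurrence_network(scenes):
    G = nx.Graph()
    for scene in scenes:
        characters = list(scene)
        for i, u in enumerate(characters):
            for v in characters[i + 1:]:
                # Weight edges by the lines both characters contribute,
                # so minor characters end up with low-weight edges.
                w = scene[u] + scene[v]
                if G.has_edge(u, v):
                    G[u][v]["weight"] += w
                else:
                    G.add_edge(u, v, weight=w)
    return G

# Example: two scenes with made-up line counts.
scenes = [
    {"Hermia": 40, "Lysander": 35, "Egeus": 10},
    {"Hermia": 25, "Helena": 30},
]
G = cooccurrence_network(scenes)
print(G.edges(data=True))
```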

I have already written two posts about these networks: the first one covers the general ideas behind them and provides some basic statistics, while the second one describes a more formal evaluation in the form of a short paper titled ‘Shall I compare thee to a network?’—Visualizing the Topological Structure of Shakespeare’s Plays.

The short paper focuses more on visualizing structural similarities between plays. Much has happened in the meantime, however. I started working on Aleph, a library for exploring persistent homology. Moreover, encouraged by previous publications (namely, my two papers about analysing clustering algorithms and dimensionality reduction methods), I started to delve a little bit deeper into machine learning.

In some sense, this post is a continuation of my work in that area.

Topological data analysis & persistent homology

Since I want to keep this post brief (I plan on writing up a more detailed explanation as a follow-up post), I will focus only on the basic idea behind my analysis. Using persistent homology, a method from topological data analysis, it is possible to measure the number of ‘holes’ in a network. Yes, you read that correctly. While many algorithms describe what is there, persistent homology focuses on what is missing. This paradigm has proven highly successful and has been applied in many different domains so far; let me just mention Gunnar Carlsson’s paper Topological pattern recognition for point cloud data as one particularly striking example.

At the heart of persistent homology is the concept of a persistence diagram, a two-dimensional scatter plot that depicts scale information about each hole, namely the scale at which it appears and the scale at which it disappears again. For every dimension of an input data set, we have a separate persistence diagram. These diagrams are commonly taken as feature descriptors that summarize a data set. Their appeal lies in certain structural properties such as robustness: small perturbations of the input data lead only to small changes in the diagrams. Furthermore, it is possible to calculate the “Wasserstein distance” between them, which makes it possible to compare two extremely complex objects, shapes, or spaces with a rather simple algorithm!
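To illustrate what such a comparison looks like, here is a small sketch of the Wasserstein distance between two diagrams, phrased as an assignment problem in which every point may also be matched to its orthogonal projection onto the diagonal. This is a naive implementation for illustration only; I use the Euclidean ground distance here, although other norms are common as well.

```python
# Sketch: p-Wasserstein distance between two persistence diagrams,
# solved as an optimal assignment problem with scipy.
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_distance(X, Y, p=1):
    """X, Y: arrays of shape (n, 2) and (m, 2) of (birth, death) pairs."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, m = len(X), len(Y)
    # Distance of each point to its projection onto the diagonal.
    diag_X = np.abs(X[:, 1] - X[:, 0]) / np.sqrt(2)
    diag_Y = np.abs(Y[:, 1] - Y[:, 0]) / np.sqrt(2)
    # Augmented cost matrix; `inf` marks forbidden assignments
    # (supported by scipy >= 1.4).
    C = np.full((n + m, n + m), np.inf)
    C[:n, :m] = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** p
    C[np.arange(n), m + np.arange(n)] = diag_X ** p  # X point -> diagonal
    C[n + np.arange(m), np.arange(m)] = diag_Y ** p  # diagonal -> Y point
    C[n:, m:] = 0.0  # diagonal -> diagonal is free
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum() ** (1.0 / p)

D1 = np.array([[0.0, 1.0], [0.5, 2.0]])
D2 = np.array([[0.0, 1.2]])
print(wasserstein_distance(D1, D2, p=1))
```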

Distances & kernels

In the aforementioned short paper, I used the Wasserstein distance between persistence diagrams to obtain a low-dimensional embedding of Shakespeare’s plays. Proximity in this embedding means that the co-occurrence networks of two plays have a similar structure. Among other things, the short paper showed that comedies such as A Midsummer Night’s Dream appear to form clusters, while tragedies such as Hamlet appear to be more spread out.
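For reference, such an embedding can be computed directly from a matrix of pairwise distances with multidimensional scaling. Below is a small sketch using scikit-learn, with a random symmetric matrix standing in for the real Wasserstein distances.

```python
# Sketch: embed the 37 plays in 2D from a precomputed distance matrix
# using metric multidimensional scaling (MDS).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(23)
A = rng.random((37, 37))
D = (A + A.T) / 2         # placeholder for the Wasserstein distances
np.fill_diagonal(D, 0.0)

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=23).fit_transform(D)
print(embedding.shape)    # (37, 2)
```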

It is thus natural to ask whether one can decide, given only its co-occurrence network, if a random Shakespeare play is a comedy or not. This seems like a perfect job for machine learning! Based on the successful uses of persistent homology in other contexts, it seemed reasonable to me to use it to assess the dissimilarity between the networks. There are other options for doing this, and it would be an interesting side project to compare them all, but since most of my research centres on persistent homology one way or the other, I will stick with it.

In order to unleash the power of machine learning on these unsuspecting networks, we require a suitable “kernel”. A kernel function permits us to use supervised learning algorithms without ever having to specify coordinates in the feature space. Given that the feature space is extremely hard to describe, notwithstanding current efforts to introduce vector-based representations of persistence diagrams such as persistence images, kernel methods are a blessing. Thanks to scikit-learn, they are quite easy to use.

There are multiple options for calculating kernels. Here, I experimented a little with a simple Wasserstein kernel (obtained by negating the Wasserstein distance values) as well as with persistence indicator functions (of which I hope to write more in a subsequent post).
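As a sketch, and independent of how the distance matrices themselves are generated, the simple Wasserstein kernel mentioned above amounts to little more than a sign flip. Note that this construction does not guarantee a positive definite kernel, a point I will come back to below.

```python
# Sketch: turn a symmetric matrix of pairwise Wasserstein distances
# into a (possibly indefinite) kernel matrix by negation.
import numpy as np

def wasserstein_kernel(D):
    return -np.asarray(D)
```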

I am leaving out the generation process of the kernel matrices for now, because I plan on describing them in a subsequent blog post: my library for calculating persistent homology is still subject to change and I do not want to prematurely write about APIs that are not yet ready.

Experiments & results

After generating the kernel matrices with Aleph, we can use “support vector machines” to test how easily we can detect comedies. Some caution is necessary, though: there are only 37 plays, of which 17 are comedies. A standard split of the data set into training and test data is not useful here because there are not enough samples. I therefore opted for “cross-validation techniques”, more precisely leave-p-out cross-validation with p up to 3.
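A minimal sketch of this setup with scikit-learn follows; the kernel matrix and the labels are random placeholders here, since the real matrices come out of Aleph.

```python
# Sketch: detect comedies with an SVM on a precomputed kernel matrix,
# evaluated via leave-p-out cross-validation (p = 2 shown; the number
# of folds grows combinatorially with p).
import numpy as np
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
B = rng.random((37, 5))
K = B @ B.T                          # placeholder kernel matrix (PSD)
y = np.array([1] * 17 + [0] * 20)    # 17 comedies among 37 plays

clf = SVC(kernel="precomputed")
# scikit-learn slices precomputed kernel matrices on both axes per fold.
scores = cross_val_score(clf, K, y, cv=LeavePOut(p=2))
print(f"Mean accuracy: {scores.mean():.2f}")
```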

The results are as follows:

Kernel                                  Accuracy   Average ROC AUC
Persistence indicator functions, d=1    0.86       0.91
Persistence indicator functions, d=2    0.73       0.80
Persistence indicator functions, d=3    0.66       0.74
Wasserstein, d=1                        0.86       0.95
Wasserstein, d=2                        0.81       0.94
Wasserstein, d=3                        0.81       0.93

The entries in the third column refer to the area under the curve of the “receiver operating characteristic” (ROC AUC) and give an idea of how well the classifier will perform in general. These values are to be taken with a grain of salt, however, because there are few samples of either class. I selected accuracy as the performance measure because there are only two classes and they are not highly skewed. Given that the probability of a random play being a comedy is only about 0.46 (17 out of 37), it is nice to see that we are able to detect comedies rather well.

Nonetheless, the values in the table are somewhat surprising. First of all, from personal experience I would have expected a slightly better performance from the second Wasserstein distance: it is the one that is commonly used, because its local costs resemble the basic Euclidean distance, and it has the nice property of suppressing topological features with low persistence values. Yet the first Wasserstein distance, which does not suppress such features, performs at least as well here. It thus appears that features of low persistence are useful for classification purposes. Paul Bendich suspected something similar in his paper Persistent Homology Analysis of Brain Artery Trees:

That said, there is no guarantee of persistence being correlated with importance, just with reliability. Indeed, one of the findings in Section 5 is that dots of not-particularly-high persistence have the most distinguishing power in our specific application.

What now?

This was, and continues to be, a fun project for me, as it combines several interesting aspects. I look forward to trying out more kernels, such as the one proposed by Reininghaus et al. in A Stable Multi-Scale Kernel for Topological Machine Learning.

One of the results of this paper is that the Wasserstein distance does not give rise to a positive definite kernel, making it theoretically unsuitable for kernel methods. This is somewhat unfortunate, because it still appears to work well in practice. I sincerely hope that the persistence indicator function proves to be a viable candidate for kernel methods because it can be calculated very quickly and, in contrast to persistence diagrams, permits the definition of a unique mean function.

As soon as time permits, I shall write a more detailed article about the technical details of this experiment. In the meantime, please take a look at the GitHub repository that I created for this article. There, you will find the raw data along with some scripts to reproduce the values reported in this blog post.

Have fun and feel free to drop me a line if you are interested in these (or related) methods. I will be more than happy to have an interesting discussion.