# Shakespeare, topology, and machine learning

## Tags: projects, research

In this post, I want to briefly explain how to combine methods from
“topological data analysis” and “machine learning”.
Given a *Shakespearean social network*, I want to predict whether it is
likely to be a comedy or not.

# Preliminaries & terms

Let’s first discuss some terms: a *Shakespearean social network*
is a co-occurrence network generated from an annotated corpus of
Shakespeare’s plays. The basic idea is to calculate a graph from
the play in which characters that co-occur in a scene are connected by
an edge. The edge weight roughly indicates how many lines of text
a character has—the idea being that less important characters
should be connected by edges with smaller weights.

I already wrote two posts about these networks, with the first
one
talking about the general ideas behind these networks and providing some
basic statistics, and the second
one talking
about a more formal evaluation in the form of a short paper titled *‘Shall I compare thee to a
network?’—Visualizing the Topological Structure of Shakespeare’s
Plays*.

The short paper focuses more on *visualizing* structural similarities
between plays. Much has happened in the meantime, however. I started
working on Aleph, a library for
exploring persistent homology. Moreover, encouraged by previous
publications (namely, my two papers about analysing clustering
algorithms and dimensionality reduction
methods), I started to delve a little bit
deeper into machine learning.

In some sense, this post is a continuation of my work in that area.

# Topological data analysis & persistent homology

Since I want to keep this post brief (I plan on writing up a more
detailed explanation as a follow-up post), I merely focus on the basic
idea behind my analysis. Using *persistent homology*, a method from
topological data analysis, it is possible to measure the amount of
‘holes’ in a network. Yes, you read that correctly. While
many algorithms describe what is *there*, persistent homology focuses on
what is *missing*. This paradigm is highly successful and has been
applied in many different application domains so far—let me just
mention Gunnar Carlsson’s paper Topological pattern recognition
for point cloud data
as one particularly striking example of this.

At the heart of persistent homology is the concept of a *persistence
diagram*, a two-dimensional scatter plot that depicts scale information
about each hole. For every dimension of an input data set, we have
a separate persistence diagram. These diagrams are commonly taken as
a *feature descriptor* that summarizes a data set. Their appeal lies in
certain structural properties such as robustness. Furthermore, it is
possible to calculate the “Wasserstein distance” between them. This makes it possible to
compare two extremely complex objects, shapes, or spaces by a rather
simple algorithm!

# Distances & kernels

In the aforementioned short
paper, I used the Wasserstein distance between
persistence diagrams in order to obtain a low-dimensional embedding of
Shakespeare’s plays. Proximity in this embedding means that the
co-occurrence networks of two plays have a similar structure. Among
other things, one of the results of the short paper was that
comedies such as *A Midsummer Night’s Dream* appear to form
clusters, while tragedies such as *Hamlet* appear to be spread out more.

It is thus a natural question to ask whether it is possible to decide if a random Shakespeare play, described as a co-occurrence network, is a comedy or not. This seems like a perfect job for machine learning! Based on the successful usages of persistent homology in other contexts, it seemed natural to me to use it in order to assess the dissimilarity between the networks. There are other options out there for doing this—and it would be an interesting side-project to compare all of them—but seeing that most of my research centres on persistent homology one way or the other, I will stick with it.

In order to unleash the power of machine learning on these unsuspecting
networks, we require the definition of a suitable “kernel”. A kernel function permits us to use
supervised learning algorithms without ever having to specify
coordinates in the feature space. Given the fact that the feature space
is extremely hard to describe—notwithstanding current efforts to
introduce vector-based representations of persistence diagrams, such as
persistence
images—kernel
methods are a blessing. Thanks to
`scikit-learn`

, they can be used quite
easily.

There are multiple options for calculating kernels. Here, I experimented
a little bit with a simple Wasserstein kernel (obtained from
negating the Wasserstein distance values) as well as *persistence
indicator functions*( of which I hope to write more in a subsequent
post).

I am leaving out the generation process of the kernel matrices for now, because I plan on describing them in a subsequent blog post: my library for calculating persistent homology is still subject to change and I do not want to prematurely write about APIs that are not yet ready.

# Experiments & results

After generating the kernel matrices with
Aleph, we can use “support vector machines” to test how
easily we may detect comedies. Some caution is necessary, though: there
are only 37 plays, of which 17 are comedies. A standard splitting of the
data set into *training* and *test* data is thus not useful here because
there are not enough samples. I thus opted for “cross-validation techniques”, more precisely
*leave-p-out cross-validation*, with `p`

up to 3.

The results are as follows:

Kernel | Accuracy | Average ROC AUC |
---|---|---|

Persistence indicator functions, `d=1` |
0.86 |
0.91 |

Persistence indicator functions, `d=2` |
0.73 | 0.80 |

Persistence indicator functions, `d=3` |
0.66 | 0.74 |

Wasserstein, `d=1` |
0.86 |
0.95 |

Wasserstein, `d=2` |
0.81 | 0.94 |

Wasserstein, `d=3` |
0.81 | 0.93 |

The entries in the third column refer to the *area under the curve* of
the “receiver operating characteristic” and give an idea about how the classifier
will perform in general. These values are to be taken with a grain of
salt, however, because there are few samples of *both* classes.
I selected the *accuracy* performance measure because there are only two
classes and they are not highly skewed. Given that the probability of
a random play being a comedy is about `0.46`

, it is nice to see that we
are able to detect comedies rather well.

Nonetheless, the values in the table are somewhat surprising. First of
all, from personal experience I would have expected a slightly better
performance from the second Wasserstein distance—it is the one
that is commonly used because its local costs resemble the basic
Euclidean distance. This has the nice property of suppressing
topological features with low persistence values. It thus appears that
these features are useful for classification purposes. Paul Bendich suspected something similar in his paper *Persistent Homology Analysis of Brain Artery Trees*:

That said, there is no guarantee of persistence being correlated with importance, just with reliability. Indeed, one of the findings in Section 5 is that dots of not-particularly-high persistence have the most distinguishing power in our specific application.

# What now?

This was/is a fun project for me as it combines several interesting
aspects. I look forward to trying out more kernels, such as the one
proposed by Reininghaus et al. in *A Stable Multi-Scale Kernel for
Topological Machine Learning*.

One of the results of this paper was that the Wasserstein distance is
theoretically unsuitable as a kernel function. This is somewhat
unfortunate because it still appears to work. I sincerely hope that the
persistence indicator function proves to be a viable candidate for
kernel methods because it can be calculated very quickly and, in
contrast to persistence diagrams, permits the definition of a unique
*mean* function.

As soon as time permits, I shall write a more detailed article about the technical details of this experiment. Please take a look at the GitHub repository that I created for this article. Here, you will find the raw data along with some scripts to reproduce the values reported in this blog post.

Have fun and feel free to drop me a line if you are interested in these (or related) methods. I will be more than happy to have an interesting discussion.