Watching the Watchmen: Pitfalls in Graph Generative Model Evaluation

Tags: research


In our recent work Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions, soon to appear at the International Conference on Learning Representations (ICLR 2022), we1 are looking at a fundamental question: how do you evaluate a graph generative model, i.e. a model that, given a set of input graphs, for instance from a database of molecules,2 is able to generate ‘useful’ novel graphs following the same distribution? It turns out that arriving at a sound evaluation strategy is not as straightforward as you might think!

Why Is This Challenging?

Approaching this with some knowledge about other generative models—for instance for images—one might first wonder why this is challenging. For images, we know quite well how to evaluate and rank different models:

  1. Train your model.
  2. Use the Fréchet inception distance to evaluate the quality of the generated images.
  3. Rinse and repeat.

For graphs, however, this is a little bit more difficult: graphs are less structured than images, since the numbers of vertices and edges typically differ between graphs in real-world distributions. This is in stark contrast to images, which, while different in their pixel content, usually at least share the same dimensions. Plus, the grid structure of images makes per-pixel comparisons easy, whereas with graphs, we typically don’t know which vertices—if any—need to be matched.

(As a quick aside: while there is a metric on graphs, the Graph Edit Distance, it is neither computationally efficient nor designed for comparing two distributions. We will thus not be considering this distance here.)

A Simple Example

Take these three graphs as an example; are they from the same distribution or not?

Three graphs

After some deliberation, one could say that the leftmost graph is a little bit different from the remaining two since it ‘appears’ to be more complex. How can we make this notion of similarity more precise?

MMD to the Rescue!

Luckily, the problem of comparing two distributions of complex objects has already been tackled for us. One solution is to calculate their maximum mean discrepancy (MMD) based on a kernel function. Let’s back up at this point and briefly disentangle some of the terms involved: MMD3 was originally developed to compare two distributions in the statistical sense. While there are several metrics out there that solve a similar problem, MMD can compare distributions of any type of object (graphs, images, etc.). This is achieved by ‘outsourcing’ the actual comparison task to a separate function, a kernel.4 A kernel function $k$ takes two graphs and returns a scalar value that describes their similarity.5 With $k$ chosen, we can calculate the overall discrepancy between two distributions $X$ and $Y$, consisting of $n$ and $m$ objects, respectively, as follows:6

$$ \text{MMD}^2(X, Y) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j = 1}^{m} k(x_i, y_j). $$
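To make the estimator concrete, here is a minimal sketch in Python/NumPy, assuming the kernel values have already been collected into matrices; the function and variable names (`mmd2`, `K_xx`, and so on) are my own and not taken from our code base.

```python
import numpy as np


def mmd2(K_xx, K_yy, K_xy):
    """Unbiased MMD^2 estimate from precomputed kernel matrices.

    K_xx: (n, n) kernel values within sample X
    K_yy: (m, m) kernel values within sample Y
    K_xy: (n, m) kernel values between X and Y
    """
    n, m = K_xx.shape[0], K_yy.shape[0]
    # Dropping the diagonal implements the j != i condition in the equation above.
    term_x = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_y = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    term_xy = 2.0 * K_xy.sum() / (n * m)
    return term_x + term_y - term_xy
```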

Setting aside questions of computational efficiency for a moment, let’s first think about how to obtain a useful kernel $k$. The graph learning community has already come up with a nice recipe here, which I sketch in code right after the list:

  1. Given a graph, use a descriptor function to turn it into a vector in some $\mathbb{R}^d$. The simplest descriptor function is the degree distribution: if the maximum degree in our initial distribution can be bounded, we can turn every graph into a fixed-length count vector, i.e. a histogram of its vertex degrees.

  2. Use a Gaussian kernel, i.e. $k(x, y, \sigma) := \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, with $\sigma \in \mathbb{R}$, to assess the similarity of two ‘graph vectors’ $x$ and $y$.

  3. Calculate MMD using the equation above and use it to rank competing models, for instance, with lower values indicating that a model is closer to the original distribution.
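Here is a compact sketch of this recipe, assuming `networkx` graphs and reusing the `mmd2` helper from above; the Erdős–Rényi samples at the end are merely stand-ins for a ‘real’ and a ‘generated’ set of graphs.

```python
import networkx as nx
import numpy as np


def degree_histogram(G, max_degree):
    """Step 1: describe a graph by a count vector over vertex degrees."""
    counts = np.zeros(max_degree + 1)
    for _, degree in G.degree():
        counts[min(degree, max_degree)] += 1  # clip degrees above the bound
    return counts


def gaussian_kernel(x, y, sigma):
    """Step 2: Gaussian kernel between two descriptor vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))


def gram_matrix(xs, ys, sigma):
    """All pairwise kernel values between two lists of descriptor vectors."""
    return np.array([[gaussian_kernel(x, y, sigma) for y in ys] for x in xs])


# Step 3: plug the kernel matrices into mmd2() from the sketch above and use
# the resulting value to rank models (lower means closer to the original data).
X = [degree_histogram(nx.erdos_renyi_graph(20, 0.2), max_degree=19) for _ in range(50)]
Y = [degree_histogram(nx.erdos_renyi_graph(20, 0.4), max_degree=19) for _ in range(50)]
score = mmd2(gram_matrix(X, X, sigma=1.0),
             gram_matrix(Y, Y, sigma=1.0),
             gram_matrix(X, Y, sigma=1.0))
```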

Desiderata

While the above sounds like a good plan, we found that there is no consensus in the community about the various choices made in the recipe above! Notice that there are (at least) three different things to be considered, each coming with its own challenges:

  1. The choice of graph descriptor function.

  2. The choice of kernel function.

  3. The choice of hyperparameters, both for the kernel as well as for the descriptor function.

We argue that such choices should satisfy a set of desiderata, viz.:

  1. Expressivity. If we pick a function that assigns a graph to a random point in $\mathbb{R}^d$, it will be rather useless. A good descriptor function should make it possible to detect whether two graphs arise from the same distribution or not.

  2. Robustness. The ranking of models should be stable with respect to perturbations. Essentially, if the perturbation itself is small, the change in ranking should also be small.

  3. Efficiency. The comparison of different models should be fast and not take longer than their training.

With these desiderata serving as our pole star, we set out to analyse the state of the art, and we made some surprising discoveries along the way.

Evaluations: Not for the Unwary

While we uncovered quite a few hidden pitfalls, such as technical requirements for the kernel function, one of our main findings is that there is a dire need for a consensus on choosing kernel hyperparameters. Previous publications use a fixed set of parameters, for instance keeping the smoothing parameter $\sigma$ of the Gaussian kernel constant. This, however, can have unintended consequences when it comes to ranking models.

To demonstrate this, we calculated MMD values for three different distributions, corresponding to three different graph generative models. We only varied the number of bins for the histogram representation of a graph (think of this as the dimension of the representation) and $\sigma$, the smoothing parameter of the Gaussian kernel. This resulted in a matrix of different experimental configurations, and for each configuration, we colour-coded the best model (according to MMD). The resulting plot looks like this:

A heatmap of model ranks

As you can see, there are regions in parameter space in which the first model clearly dominates. However, this can change for larger values of $\sigma$. This alone would not be surprising, since we would expect that different parameters have different effects; a large $\sigma$ might decrease the expressivity since a large degree of smoothing is employed. However, the truly disconcerting part is that there are ‘flips’ in some regions: first, model A dominates, then model B, then model A again.

Overall, this is highly unstable and makes it hard to pick a suitable model.
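To give a more concrete picture of what such a sweep looks like in code, here is a rough sketch that reuses the helpers from above; it is illustrative only and not our exact experimental setup, and `rank_models` is a name I made up for this post.

```python
def rank_models(reference, candidates, bin_grid, sigma_grid):
    """Record the best model (lowest MMD) for every (bins, sigma) configuration.

    reference:  list of graphs from the original distribution
    candidates: dict mapping a model name to its list of generated graphs
    """
    best = {}
    for bins in bin_grid:
        # 'bins' plays the role of the histogram dimension: degrees 0..bins-1,
        # with everything above clipped into the last bin.
        ref_vecs = [degree_histogram(G, max_degree=bins - 1) for G in reference]
        for sigma in sigma_grid:
            K_ref = gram_matrix(ref_vecs, ref_vecs, sigma)
            scores = {}
            for name, graphs in candidates.items():
                vecs = [degree_histogram(G, max_degree=bins - 1) for G in graphs]
                scores[name] = mmd2(gram_matrix(vecs, vecs, sigma),
                                    K_ref,
                                    gram_matrix(vecs, ref_vecs, sigma))
            best[(bins, sigma)] = min(scores, key=scores.get)
    return best
```

Colour-coding `best[(bins, sigma)]` over the grid is what produces heatmaps like the one shown above.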

This is not the only issue we found. We also observed that the parameter choices used in practice are often not optimal, in the sense that they are made without accounting for the large changes in MMD values that different settings produce. Moreover, not only is the ranking with respect to parameter choices arbitrary; a similar experiment shows that rankings with respect to the choice of kernel are just as arbitrary.

Clearly, a more consistent strategy is required!

What Now?

Besides demonstrating that there is indeed a problem in graph generative model evaluation, we also provide some strategies to alleviate it. We achieve this by proposing to pick hyperparameters based on how stable the resulting MMD values are with respect to controlled perturbations.
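To give a flavour of what ‘stability under controlled perturbations’ could look like in code, here is an illustrative sketch; it is not the selection procedure from the paper, and the perturbation scheme (random edge deletions and insertions), the perturbation fractions, and the Spearman-correlation criterion are ad-hoc choices of mine. The helpers are the ones from the sketches above.

```python
import random

import networkx as nx
from scipy.stats import spearmanr


def perturb(G, fraction, seed=0):
    """Delete a fraction of edges and insert the same number of random new edges."""
    rng = random.Random(seed)
    H = G.copy()
    n_changes = int(fraction * H.number_of_edges())
    if n_changes == 0:
        return H
    H.remove_edges_from(rng.sample(list(H.edges()), n_changes))
    non_edges = list(nx.non_edges(H))
    H.add_edges_from(rng.sample(non_edges, min(n_changes, len(non_edges))))
    return H


def stability_score(reference, bins, sigma, fractions=(0.0, 0.1, 0.2, 0.4, 0.8)):
    """Check how monotonically MMD tracks the perturbation strength."""
    ref_vecs = [degree_histogram(G, max_degree=bins - 1) for G in reference]
    K_ref = gram_matrix(ref_vecs, ref_vecs, sigma)
    values = []
    for fraction in fractions:
        vecs = [degree_histogram(perturb(G, fraction), max_degree=bins - 1)
                for G in reference]
        values.append(mmd2(gram_matrix(vecs, vecs, sigma),
                           K_ref,
                           gram_matrix(vecs, ref_vecs, sigma)))
    corr, _ = spearmanr(fractions, values)
    return corr  # close to 1.0: MMD responds consistently to increasing perturbation
```

One would then prefer hyperparameter configurations with a high stability score, since an evaluation metric that does not react consistently to ever larger perturbations is unlikely to rank models reliably.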

Read the paper for more details or check out our GitHub repository for the code.

There are tons of interesting directions to go from here, the most relevant one being: what could be a good descriptor–kernel combination that is universally applicable and comes with provable guarantees on stability and expressivity?

I am excited to see where this journey takes us!


  1. Joint work with the incredibly talented Leslie O’Bray, Max Horn, and Karsten Borgwardt. ↩︎

  2. I picked this example because it’s rather intuitive. However, it needs to be said that for the particular use case of molecular graph generation, specialised evaluation metrics exist. We tackle a somewhat more generic problem here in that we do not make any assumptions about the ‘provenance’ of our graphs. ↩︎

  3. See A Kernel Two-Sample Test for an excellent description. ↩︎

  4. Kernel functions are fascinating, and I would be remiss if I did not briefly indulge in them here: essentially, a kernel—in the sense I am using it in this post—is intricately linked to an inner product of a Hilbert space. Kernels have all sorts of interesting applications in machine learning: you can use them for classification, visualisation, or, as we are doing here, for comparing two distributions. To paraphrase Fermat: I could explain so much more, but this post is too small to contain all the intriguing details. I might revisit this topic in a forthcoming post, though! ↩︎

  5. I am skipping over some details here for the sake of brevity. ↩︎

  6. Several variations of MMD exist. If $n = m$, i.e. for distributions of the same cardinality, an unbiased estimate is available. We depict the equation that is most often used in practice, but our code also implements some variations. ↩︎