As someone who enjoys watching the occasional film, I am often very much disappointed by bad or cheesy dialogues—Prometheus, I am looking at you. Likewise, I do not enjoy too many over-the-top action scenes, especially when they are too fast for the (read: my) eye to follow.

Motivation & setup

Thus, I started wondering about the relationship between dialogue and non-dialogue parts of a film. Were there any interesting patterns just waiting to be discovered? I used a very simple approach to shed some light on this question: I downloaded subtitles for a lot of films and used the total duration of the subtitles as the total amount of speech in a film. Although I took care not to use subtitles that are meant for the hearing-impaired, I cannot completely rule out that the subtitles contain additional information, such as background noises or songs in a film. A more detailed analysis should thus try to exclude these factors.

I will refer to the amount of speech in a film as its speechtime, while the total length of the film is its runtime.
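A minimal sketch of how such a speechtime total could be computed from one subtitle file. It assumes the subtitles are in SubRip (.srt) format, which is only a guess at the format, and the names `TIMING`, `to_seconds` and `speechtime_seconds` are hypothetical. Each cue contributes its end time minus its start time, so overlapping cues would be counted twice.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
# Assumption: SubRip-style timestamps; other subtitle formats differ.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four timestamp fields (as strings) to seconds."""
    return (int(hours) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis) / 1000.0)


def speechtime_seconds(srt_text):
    """Sum the on-screen duration of every cue in an SRT document."""
    total = 0.0
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        if end > start:  # skip malformed cues with non-positive duration
            total += end - start
    return total


sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nHello.\n\n"
    "2\n00:00:10,000 --> 00:00:12,000\nGoodbye.\n"
)
print(speechtime_seconds(sample))  # 2.5 s + 2.0 s of subtitles
```

Dividing the result by the runtime in seconds gives the fraction of the film covered by subtitles.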

Films of 2014

With the process outlined above, I analysed the Top 100 films of 2014, according to Rotten Tomatoes. First, let us take a look at the relation between speechtime and runtime. In the following scatterplot, you can hover over each dot in order to show the raw x and y data of a film:

We can see that there are some outlying points, with respect to the runtime. One, at the upper right part of the data, turns out to correspond to The Last of the Unjust, a film that is not only very long (over four hours), but also contains lots of dialogue (being an interview film). In general, the distance from the diagonal describes the compressibility of a film—the larger the distance, the less speech it contains. Put differently, films that are very distant from the diagonal contain more visual imagery than speech. In the films I selected, Manakamana is the one with the largest distance from the diagonal. I have been told that it is very slow-paced and visually appealing, giving viewers time to reflect. On the other extreme, there are films such as Life Itself and The Trip to Italy (and, surprisingly, Birdman), which consist almost exclusively of dialogues. According to my simple calculations, these films only have about 15 minutes of non-speech, which is pretty surprising.
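To make the "distance from the diagonal" concrete, here is a small sketch. It assumes both quantities are measured in minutes and that the diagonal is the line where speechtime equals runtime. The function names are hypothetical, and the text does not say whether the vertical or the perpendicular distance is meant. Both are computed below. They differ only by a constant factor of √2, so they rank films identically.

```python
import math


def non_speech_minutes(runtime_min, speechtime_min):
    """Vertical (or horizontal) gap to the diagonal: time without subtitles."""
    return runtime_min - speechtime_min


def distance_from_diagonal(runtime_min, speechtime_min):
    """Perpendicular distance from the point to the line speechtime = runtime."""
    return abs(runtime_min - speechtime_min) / math.sqrt(2)


# Hypothetical film: 100 minutes long with 85 minutes of subtitles.
print(non_speech_minutes(100, 85))
print(distance_from_diagonal(100, 85))
```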

Next, I was wondering how the speechtime is distributed. The scatterplot seems to indicate that there is a cluster of films of similar speechtime, but it is admittedly not the optimal visualization for this task. I thus created a histogram of the percentage of speech in a film, using bins of 5% each since I considered 10% steps to be too coarse. I additionally plotted the raw speechtime values using transparent circles:

There seems to be a denser region around the 30%–40% mark. This region includes, among others, X-Men: Days of Future Past and Only Lovers Left Alive. I have not yet seen either of them, but at least the X-Men franchise does not strike me as being very heavy on dialogue. The outliers I mentioned above are now easily visible. Manakamana is on the very left of the axis, containing only around 12% speech, while The Trip to Italy is almost entirely dialogue, with around 89% speech. The region above 80% contains Life Itself and The Internet's Own Boy, two documentaries. This is not as surprising as finding Birdman (again) and Coherence there, two fiction films that apparently favour speech over action.
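The binning behind such a histogram can be sketched as follows. This is an illustration in Python rather than the D3.js code used for the plots, and the function names are hypothetical. A film with exactly 100% speech is placed in the last bin.

```python
def speech_percentage(speechtime_min, runtime_min):
    """Share of the runtime covered by subtitles, in percent."""
    return 100.0 * speechtime_min / runtime_min


def histogram_counts(percentages, bin_width=5):
    """Count values per bin: [0, 5), [5, 10), ..., [95, 100]."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for p in percentages:
        # Clamp so that a value of exactly 100 falls into the last bin.
        index = min(int(p // bin_width), n_bins - 1)
        counts[index] += 1
    return counts


# 12 and 89 are the approximate extremes quoted above; 33 and 37 are
# made-up values standing in for films in the 30%-40% region.
counts = histogram_counts([12, 33, 37, 89])
for i, count in enumerate(counts):
    if count:
        print(f"{i * 5}%-{i * 5 + 5}%: {count}")
```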

What's next?

This simple analysis barely scratched the surface. To have more fun with this sort of data, more films are required. It would then be interesting to compare films of different decades. Are the films of the 1950s more dialogue-heavy than the films of the 1990s? Including the ratings might also be interesting. Is there a correlation between the amount of speech and the rating at IMDb, for example? For the films I selected here, this is not the case: Pearson's correlation coefficient is only around 0.1. This might be a typical case of selection bias, though, because I used a best-of list to select them.
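For reference, Pearson's correlation coefficient between two lists, such as speech percentages and ratings, can be computed as in the following sketch. The function name and the toy numbers are purely illustrative and are not the film data.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: a perfectly linear relationship gives a coefficient of 1.
print(pearson([1, 2, 3], [2, 4, 6]))
```

A value near 0.1, as reported above, means that a linear relationship explains almost none of the variation in the ratings.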

Data and code

The plots have been created with D3.js. I release the source code into the public domain. Ditto for the raw data. Note that it does not contain 100 films, but rather only 82, because I could not find any subtitles for the remaining 18.