Using tf–idf to analyse U.S. Presidential Inauguration Speeches

Tags: projects, research

Previously, I wrote about a brief analysis of U.S. Presidential Inauguration Speeches. I have since extended the analysis slightly using “tf–idf”.

For those of you who are unfamiliar with this technique: tf–idf (term frequency–inverse document frequency) is a statistical method for assessing how important a word is to one document within a corpus of documents. Without going into too much detail, a word scores highly for a given document if it occurs often in that document but only in few other documents of the corpus. The result is one high-dimensional vector per document, with one entry for every word in the common vocabulary of the corpus.
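To make the idea concrete, here is a minimal sketch of one common tf–idf variant, using only the Python standard library. The function names, the simple letters-only tokenizer, and the toy sentences are assumptions made for illustration; the script in the repository may compute the weights differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    Uses tf = count / document length and idf = log(N / document frequency),
    which is one common variant among several.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus for illustration only; these are not the actual speeches.
docs = [
    "a new breeze is blowing",
    "a new century is coming",
    "liberty and freedom for a nation",
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus, a word that occurs in every document (such as "a") receives a weight of zero, while a word that occurs in only one document (such as "breeze") receives the highest weight in that document.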

I picked the five most relevant words for every speech to obtain an extremely pithy summary of the words that set that speech apart from the others. Since there are 59 speeches in total, I first decided to do a small visualization of the last eight speeches only, starting with George H.W. Bush in 1989. Here are the results; the x-axis represents time, while the y-axis shows the five most important words, scaled by their relative frequency in the speech:
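Selecting the five most relevant words amounts to sorting the tf–idf weights of a speech and keeping the largest ones. The following sketch assumes the weights are stored in a dictionary that maps words to scores; the helper name `top_terms` and the example values are made up for illustration and are not taken from the actual analysis.

```python
def top_terms(weights, k=5):
    """Return the k highest-weighted (term, weight) pairs, largest first.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Made-up weights for illustration only; not results of the real analysis.
example_weights = {
    "breeze": 0.031,
    "don": 0.027,
    "new": 0.012,
    "word": 0.009,
    "door": 0.008,
    "hand": 0.004,
    "the": 0.0,
}

for term, weight in top_terms(example_weights):
    print(f"{term}\t{weight:.3f}")
```

Sorting the full vocabulary is perfectly adequate for a few dozen speeches; for much larger corpora, `heapq.nlargest` would avoid the full sort.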

A visualization of the relative importance of words in the last eight inauguration speeches of U.S. presidents

So, what can we see in this visualization? Here are some of my thoughts.

  • The word “don” in the speech of George H.W. Bush is a shortened form of “don’t”. The algorithm picked up on his frequent use of it in the speech, which contains beautiful imagery such as this:

For the first time in this century, for the first time in perhaps all history, man does not have to invent a system by which to live. We don’t have to talk late into the night about which form of government is better. We don’t have to wrest justice from the kings. We only have to summon it from within ourselves.

  • It is also interesting to note that the new breeze George H.W. Bush is talking about is detected as a unique feature of his speech.

  • The speeches of Bill Clinton allude to the change that started after the end of the Cold War, as well as the promises that arise in the new century to come.

  • The second speech of George W. Bush tries to rally Americans in the War on Terror. Liberty and freedom are part of the underlying theme, these of course being American ideals that are worth fighting for.

  • With Barack Obama, a sort of rebirth takes place. He speaks to the new generation, expressing his hope that America becomes a new nation, and reminds everyone that today, not tomorrow, is the day to address these challenges. In his second speech, the great journey towards equality is presented to the Americans, making it clear that change does not stop.

  • With Donald Trump, the narrative changes. The important words are now the dreams of people, such as the hope that they will find new jobs. It is interesting to note that the words America and American otherwise appear this prominently only in speeches given at a time of crisis or abrupt change (the Cold War or the War on Terror). Maybe Donald Trump is trying to invoke a connection to these events in his speech?
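A plausible explanation for the “don” token mentioned in the first point above is a tokenizer that splits words at apostrophes and discards single-character tokens. The sketch below assumes a token pattern of two or more word characters, which is also the default pattern of the scikit-learn vectorizers; whether tf_idf_analysis.py tokenizes the speeches in this way is an assumption.

```python
import re

# Hypothetical token pattern: runs of two or more word characters.
# The apostrophe is not a word character, so "don’t" is split into
# "don" and "t", and the single-letter "t" is then discarded.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lower-case the text and extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = "We don’t have to wrest justice from the kings."
tokens = tokenize(sentence)
print(tokens)
# ['we', 'don', 'have', 'to', 'wrest', 'justice', 'from', 'the', 'kings']
```

Under this assumption, every “don’t” in the speech is counted as an occurrence of “don”, which would explain why the token ranks so highly.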

These are only my random thoughts, but I find it fascinating that tf–idf is capable of picking up on these themes so reliably. Maybe a more competent person wants to comment on these findings? I look forward to any feedback you might have.

In the meantime, please enjoy a variant of the visualization above that contains all speeches of all presidents so far. You will have to scroll around a lot for this.

By the way, I have added the code to the GitHub repository. You may be particularly interested in tf_idf_analysis.py, which demonstrates the tf–idf analysis of all speeches. Moreover, I added a gnuplot script that demonstrates how to create the visualizations attached to this blog post.