Counting words in TeX documents under version control

Tags: programming, projects, visualization


Most of the time, PhD Comics hits too close to home. Recently, they ran a strip showing how a typical thesis document evolves over time. As expected, the comic shows that efforts are “doubly redoubled” when the deadline approaches. Naturally, I wondered how my own thesis (still under review) fares in that regard.

I keep my thesis under version control because I am a paranoid and cranky person when it comes to data security and backups. Consequently, I needed a script to reconstruct how much the thesis sources changed over time. Here’s what I came up with:
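The script itself lives in the gist mentioned at the end of this post and is not reproduced in this text. As a rough stand-in, here is a minimal Python sketch of the same idea. It assumes the thesis lives in a Git repository, that all sources end in `.tex`, and that a plain whitespace split is a good enough word count. The function names are made up for illustration and are not taken from the gist.

```python
import subprocess
import sys


def git(*args, repo="."):
    """Run a git command inside `repo` and return its standard output as text."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_words(text):
    """Count whitespace-separated tokens, like `wc -w`; LaTeX markup counts too."""
    return len(text.split())


def word_counts(repo="."):
    """Yield (commit date, total words in all .tex files), oldest commit first."""
    log = git("log", "--reverse", "--format=%H %cI", repo=repo)
    for line in log.splitlines():
        commit, date = line.split(" ", 1)
        # List every file in the commit's tree, NUL-separated to survive odd names.
        listing = git("ls-tree", "-r", "--name-only", "--full-tree", "-z",
                      commit, repo=repo)
        total = sum(
            count_words(git("show", f"{commit}:{path}", repo=repo))
            for path in listing.split("\0")
            if path.endswith(".tex")
        )
        yield date, total


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: wordcount_history.py REPO", file=sys.stderr)
    else:
        for date, total in word_counts(sys.argv[1]):
            print(date, total)
```

Each output line holds an ISO 8601 commit date and a word count, separated by a space, so the result can be redirected to a file and plotted as a time series. Reading every file at every commit is slow for long histories, but it keeps the sketch short.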

If I run this script and redirect its output to a file, I can use the venerable gnuplot to create a nice graph. This is what it looks like for my dissertation:

Word count in my thesis over time

Note that the script is unable to count “proper” words for now. Hence, changes to the LaTeX markup also increase the total word count. I guess the slope of my own efforts is not too similar to the one in the comic. At the same time, you can see that most of the content of the thesis was added in a comparatively short amount of time.
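One possible way to get closer to “proper” words is to strip the most common markup before counting. The following Python sketch is a heuristic and not part of the original script. The helper names `strip_latex` and `count_prose_words` are hypothetical, and the regular expressions only cover comments, a handful of commands whose arguments are not prose, command names, and grouping characters.

```python
import re


def strip_latex(text):
    """Roughly remove LaTeX markup so that mostly prose remains (heuristic)."""
    # Comments: everything after an unescaped percent sign.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # Commands whose braced argument is a name or key, not prose.
    text = re.sub(r"\\(begin|end|label|ref|cite)\{[^}]*\}", " ", text)
    # Remaining command names, with an optional star and optional [...] argument.
    text = re.sub(r"\\[A-Za-z]+\*?(\[[^\]]*\])?", " ", text)
    # Escaped single characters such as \% or \&.
    text = re.sub(r"\\.", " ", text)
    # Grouping, math, and alignment characters.
    return re.sub(r"[{}$&~^_]", " ", text)


def count_prose_words(text):
    """Count tokens that contain at least one letter after stripping markup."""
    tokens = strip_latex(text).split()
    return sum(1 for token in tokens if re.search(r"[^\W\d_]", token))
```

Arguments of commands such as `\section{...}` or `\emph{...}` are kept on purpose, since they are part of the visible text. Math content and less common commands are not handled, so the result is still only an approximation.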

Feel free to apply this to your own documents. The code is available as a gist on GitHub. This is my first real gist—thanks to Conrad for the suggestion!