A header analysis of C++ projects

Tags: programming, projects


Even though C++ is still my favourite programming language, I am always amazed by the sheer number of headers I have to include to get something done. Most of my source files could start with #include <vector>, because that is most likely what I am going to include anyway. Other headers do not get this treatment. In fact, I display a wanton disregard for many of them. For example, I have never consciously used the execution header in real-world code.

I thus started wondering how other projects fared in that regard. So I cloned several of the larger C++ repositories on GitHub. More precisely, I started with the following repositories:

Counting individual headers

In total, these projects comprise more than 2 million lines of code—a reasonably-sized sample, I would say. To figure out how these projects use headers, I first extracted all STL headers from all files and counted their occurrences. This resulted in the following histogram (the counts are relative):

Histogram of STL header occurrences
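The counting step can be sketched in a few lines of Python. This is a minimal sketch and not the code from the GitHub repository: the names STL_HEADERS, stl_headers_in, and relative_counts are made up for illustration, the whitelist is deliberately incomplete, and I am assuming that "relative" means each header's share of all counted includes, with every header counted at most once per file.

```python
import re
from collections import Counter

# Hypothetical, deliberately short whitelist; a real run would need the
# full list of standard library headers.
STL_HEADERS = {
    "algorithm", "functional", "future", "map", "memory",
    "regex", "stack", "string", "utility", "vector",
}

# Matches lines such as `#include <vector>` or `  #  include <string>`.
# Quoted includes (`#include "local.h"`) are ignored on purpose.
INCLUDE_RE = re.compile(r"^\s*#\s*include\s*<([^>]+)>", re.MULTILINE)


def stl_headers_in(source):
    """Return the set of whitelisted headers included by one source file."""
    return {name.strip() for name in INCLUDE_RE.findall(source)} & STL_HEADERS


def relative_counts(sources):
    """Count in how many files each header appears, normalised to sum to 1."""
    counts = Counter()
    for source in sources:
        counts.update(stl_headers_in(source))  # one vote per file
    total = sum(counts.values())
    if total == 0:
        return {}
    return {header: n / total for header, n in counts.most_common()}


# Toy input, not data from the analysed repositories.
sources = [
    "#include <vector>\n#include <string>\nint main() {}\n",
    '#include <vector>\n#include "local.h"\n',
    "  #  include <memory>\n#include <vector>\n#include <boost/foo.hpp>\n",
]
shares = relative_counts(sources)
```

Using a whitelist rather than "anything in angle brackets" keeps third-party headers such as Boost out of the histogram, at the cost of having to maintain the list.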

Pretty interesting, I would say. This is a nice long-tail distribution for which a few headers are used much more often than the rest. In fact, for these repositories, only four headers make up more than 50% of the usage:

  • vector
  • string
  • memory
  • utility

For vector and string, this is not surprising. Virtually every C++ programmer uses vector for almost anything, and the same goes for string. Similarly, memory is not so surprising, as it contains the different smart pointer classes, most prominently shared_ptr. The last one on the list, utility, was slightly unexpected to me. It contains things such as std::make_pair and std::move. At least the latter is needed by any class that does its own memory management.
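Finding the headers that together cross the 50% mark is a simple cumulative sum over the sorted counts. Again, this is only a sketch under assumptions: headers_covering is a hypothetical helper, and the counts below are invented toy numbers, not the figures from the analysed repositories.

```python
def headers_covering(counts, threshold=0.5):
    """Return the most frequent headers whose combined share first exceeds `threshold`."""
    total = sum(counts.values())
    covered = 0
    selected = []
    # Walk the headers from most to least frequent.
    for header, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        selected.append(header)
        covered += n
        if covered > threshold * total:
            break
    return selected


# Invented toy counts for illustration only.
toy_counts = {"vector": 5, "string": 3, "memory": 2, "regex": 1, "future": 1}
top = headers_covering(toy_counts)
```

In a long-tail distribution, the returned list stays short even when the number of distinct headers is large.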

At the tail of the distribution, the more exotic headers await. The stack header, for example, is not used very often in these projects, while the future header comes in dead last. I must confess that I have not used it in real-world projects so far because I did not yet have to deal with asynchronous operations. The lack of enthusiasm for the regex header is somewhat sad, but maybe this is to be expected in a language that does not really encourage the use of regular expressions? Also, C++ regular expressions are said to perform worse than their counterparts in other languages. To what extent the unfamiliarity of C++ programmers with regular expressions might contribute to this, I cannot say.

Counting pairs of headers

Let’s delve into another aspect of the headers. In my code, I noticed that some headers are almost always used together. For example, if there is an algorithm header, there is often also a functional header. Extending this to projects, I thought that it might be interesting to analyse the co-occurrence patterns of STL headers. To this end, I counted how often pairs of headers are included in the same file. This naturally gives rise to a co-occurrence matrix, in which rows and columns indicate headers, and each value indicates how often the two headers occur in the same file. If headers are sorted by their counts, this results in a beautiful picture:

Co-occurrence matrix of STL headers

This matrix tells us something about the universality of certain headers. The vector header, for example, co-occurs with almost every other header to some extent because vectors are such fundamental data types. The typeinfo header, on the other hand, is so specific that it only co-occurs with typeindex. In fact, the structure of the matrix, i.e. the many dark cells, indicates that many combinations are highly unlikely to occur “in the wild”.
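The pair counting behind such a matrix can be sketched as follows. As before, this is an illustrative sketch and not the code from the GitHub repository: cooccurrence is a hypothetical function, it expects one set of header names per source file, and the example input is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(files):
    """Build a symmetric co-occurrence matrix from per-file sets of header names.

    Returns (order, matrix), where `order` lists the headers from most to
    least frequent and matrix[i][j] is the number of files that include
    both order[i] and order[j].
    """
    singles = Counter()
    pairs = Counter()
    for headers in files:
        singles.update(headers)
        # Sorting makes (a, b) and (b, a) land on the same key.
        pairs.update(combinations(sorted(headers), 2))

    # Sort by count (descending); break ties alphabetically for stable output.
    order = [h for h, _ in sorted(singles.items(), key=lambda kv: (-kv[1], kv[0]))]
    index = {h: i for i, h in enumerate(order)}

    matrix = [[0] * len(order) for _ in order]
    for (a, b), n in pairs.items():
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return order, matrix


# Made-up example: three files and the headers each one includes.
files = [
    {"vector", "string"},
    {"vector", "queue", "thread"},
    {"vector", "string", "memory"},
]
order, matrix = cooccurrence(files)
```

The diagonal is left at zero here; filling it with the per-header counts instead would be an equally reasonable choice, depending on how the matrix is plotted.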

Some of the combinations really tell a story, though. For example, queue is used in conjunction with thread (possibly to implement patterns for multi-threaded environments), but also with stack (possibly to implement different traversal strategies of implicit or explicit graph structures in these projects). I also see another pattern from my own code, namely the pair unordered_map and unordered_set. I tend to require either both of them (the set for iteration and storage, the map for, well, associating more information with individual objects) or neither of them.

Conclusion

As a next step, it would be interesting to see whether the co-occurrence of certain headers makes it possible to guess the domain of a C++ program, just like certain pairs of words (I guess I should rather speak of bigrams here, to use the NLP term) are more indicative of certain genres of literature. Treating code like literature would certainly make for an interesting art project.

The code for this project is available on GitHub. You only have to supply the repositories for scanning.

Happy coding, until next time!

Update (2018-04-06): Changed the title because I was using the term meta-analysis incorrectly. Thanks, HN!