A header analysis of C++ projects
Tags: programming, projects
Even though C++ is still my favourite programming language, I am always
amazed by the sheer number of headers I have to include to get something
done. Most of my source files could contain #include <vector> at the
beginning, because this is most likely what I am going to include
anyway. Other headers are not treated this way. In fact, I am displaying
a wanton disregard for many of the other headers. For example, I have
never consciously used the execution header in real-world code.
I thus started wondering how other projects fared in that regard. So I cloned several of the larger C++ repositories on GitHub. More precisely, I started with the following repositories:
Counting individual headers
In total, these projects comprise more than 2 million lines of code—a reasonably-sized sample, I would say. To figure out how these projects use headers, I first extracted all STL headers from all files and counted their occurrences. This resulted in the following histogram (the counts are relative):
Pretty interesting, I would say. This is a nice long-tail distribution for which a few headers are used much more often than the rest. In fact, for these repositories, only four headers make up more than 50% of the usage:
vector
string
memory
utility
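The counting step described above could be sketched roughly as follows. This is only an illustration and not the code from the GitHub repository: the names count_headers, headers_covering, HeaderCounts, and kStandardHeaders are made up for this example, the whitelist of headers is deliberately incomplete, and the sketch assumes that only angle-bracket includes at the start of a line matter.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, deliberately incomplete whitelist; a real scan would list
// every header of the standard library.
const std::set<std::string> kStandardHeaders = {
    "algorithm", "functional", "memory", "string", "utility", "vector"};

using HeaderCounts = std::map<std::string, std::size_t>;

// Adds one count to `counts` for every whitelisted header included in `source`.
void count_headers(const std::string& source, HeaderCounts& counts) {
    // Matches lines such as `#include <vector>` or `#  include <memory>`.
    static const std::regex include_re(R"(^\s*#\s*include\s*<([^>]+)>)");
    std::istringstream lines(source);
    std::string line;
    std::smatch match;
    while (std::getline(lines, line)) {
        if (std::regex_search(line, match, include_re) &&
            kStandardHeaders.count(match[1].str()) > 0) {
            ++counts[match[1].str()];
        }
    }
}

// Returns how many of the most-used headers are needed before their
// combined count exceeds `share` (e.g. 0.5) of all counted includes.
std::size_t headers_covering(const HeaderCounts& counts, double share) {
    std::vector<std::size_t> sorted;
    std::size_t total = 0;
    for (const auto& entry : counts) {
        sorted.push_back(entry.second);
        total += entry.second;
    }
    std::sort(sorted.begin(), sorted.end(), std::greater<std::size_t>());
    std::size_t running = 0;
    std::size_t used = 0;
    for (std::size_t count : sorted) {
        if (running > share * total) break;
        running += count;
        ++used;
    }
    return used;
}
```

Calling count_headers once per source file and then headers_covering(counts, 0.5) yields the number of headers that together account for more than half of all counted includes. Dividing each count by the total would give relative counts like those in the histogram.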
For vector and string, this is not surprising. Virtually every C++
programmer uses vector for almost anything. The same goes for string.
Similarly, memory is not so surprising as it contains the different
smart pointer classes, most prominently shared_ptr.
The last one of the list, utility, was slightly unexpected for me. It
contains things such as std::make_pair and std::move. At least the
latter one is required for any class that does its own memory
management.
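To make the role of utility a bit more concrete, here is a small sketch of a class that manages its own memory. The Buffer class and the tag_buffer function are hypothetical and exist only for this illustration. The sketch also uses std::exchange, which lives in the same header and needs C++14 or later.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical class that owns a raw array, purely for illustration.
class Buffer {
public:
    explicit Buffer(std::size_t size) : data_(new int[size]()), size_(size) {}
    ~Buffer() { delete[] data_; }

    // Move constructor: take over the allocation and leave `other` empty.
    Buffer(Buffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    // Copying is disabled to keep the sketch short.
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    std::size_t size() const { return size_; }

private:
    int* data_;
    std::size_t size_;
};

// std::make_pair deduces the pair type from its arguments; std::move
// turns `buffer` into an rvalue so that it is moved, not copied.
std::pair<Buffer, int> tag_buffer(Buffer buffer, int tag) {
    return std::make_pair(std::move(buffer), tag);
}
```

Without std::move, the call to std::make_pair would try to copy the buffer, which this class does not allow.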
At the tail of the distribution, the more exotic headers await. The
stack header, for example, appears not to be used in these projects
too often, while the future header comes in dead last. I must confess
that I have not used it in real-world projects so far because I did not
yet have to deal with asynchronous operations. The lack of enthusiasm
for the regex header is somewhat sad but maybe this is to be expected in
a language that does not really encourage the use of regular
expressions? Also, C++ regular expressions are said to perform worse
than their counterparts in other languages. To what extent the
unfamiliarity of C++ programmers with regular expressions might
contribute to this, I cannot say.
Counting pairs of headers
Let’s delve into another aspect of the headers. In my code,
I noticed that some headers are almost always used together. For
example, if there is an algorithm header, there is often also
a functional header. Extending this to projects, I thought that it
might be interesting to analyse the co-occurrence patterns of STL
headers. To this end, I counted how often pairs of headers are
included in the same file. This naturally gives rise to a co-occurrence
matrix, in which rows and columns indicate headers, and the value
indicates how often those headers occur in the same file. If headers
are sorted by their counts, this results in a beautiful picture:
This matrix tells us something about the universality of certain
headers. The vector header, for example, co-occurs with almost
every other header to some extent because vectors are such fundamental
data types. The typeinfo header, on the other hand, is so very
specific that it only co-occurs with typeindex. In fact, the structure
of the matrix, i.e. the many dark cells, indicates that many
combinations are highly unlikely to occur “in the wild”.
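The pair counting behind such a matrix could look roughly like the sketch below. It is not the code from the repository: count_pairs and HeaderPair are names invented for this example, and the sketch assumes that the set of headers per file has already been extracted.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using HeaderPair = std::pair<std::string, std::string>;

// `files` holds, for every source file, the set of headers it includes.
// Returns how often each unordered pair of headers shares a file. Because
// std::set iterates in sorted order, every pair is stored exactly once,
// with the alphabetically smaller header first.
std::map<HeaderPair, std::size_t> count_pairs(
    const std::vector<std::set<std::string>>& files) {
    std::map<HeaderPair, std::size_t> pairs;
    for (const auto& headers : files) {
        for (auto first = headers.begin(); first != headers.end(); ++first) {
            for (auto second = std::next(first); second != headers.end();
                 ++second) {
                ++pairs[HeaderPair(*first, *second)];
            }
        }
    }
    return pairs;
}
```

The result is the upper triangle of the symmetric co-occurrence matrix. Pairs that never share a file have no entry at all, which corresponds to the dark cells.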
Some of the combinations really tell a story, though. For example,
queue is used in conjunction with thread (possibly to implement
patterns for multi-threaded environments), but also with stack
(possibly to implement different traversal strategies of implicit or
explicit graph structures in these projects). I also see another
pattern from my own code, namely the pair unordered_map and
unordered_set. I tend to require either both of them (the set for
iteration and storage, the map for, well, associating more information
with individual objects) or none of them.
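As an illustration of why queue and stack might show up in the same file, here is a sketch of two graph traversals that differ only in the container holding the pending nodes. Whether the scanned projects actually do this is a guess. The Graph alias and both function names are made up for this example.

```cpp
#include <cstddef>
#include <queue>
#include <stack>
#include <vector>

// Adjacency lists: graph[i] holds the neighbours of node i.
using Graph = std::vector<std::vector<std::size_t>>;

// Breadth-first traversal: a queue hands out the oldest pending node.
std::vector<std::size_t> breadth_first(const Graph& graph, std::size_t start) {
    std::vector<bool> seen(graph.size(), false);
    std::vector<std::size_t> order;
    std::queue<std::size_t> pending;
    pending.push(start);
    seen[start] = true;
    while (!pending.empty()) {
        const std::size_t node = pending.front();
        pending.pop();
        order.push_back(node);
        for (std::size_t next : graph[node]) {
            if (!seen[next]) {
                seen[next] = true;
                pending.push(next);
            }
        }
    }
    return order;
}

// Depth-first traversal: a stack hands out the newest pending node.
std::vector<std::size_t> depth_first(const Graph& graph, std::size_t start) {
    std::vector<bool> seen(graph.size(), false);
    std::vector<std::size_t> order;
    std::stack<std::size_t> pending;
    pending.push(start);
    while (!pending.empty()) {
        const std::size_t node = pending.top();
        pending.pop();
        if (seen[node]) continue;  // may have been pushed more than once
        seen[node] = true;
        order.push_back(node);
        for (std::size_t next : graph[node]) {
            if (!seen[next]) pending.push(next);
        }
    }
    return order;
}
```

Both functions assume that start is a valid index into the graph.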
Conclusion
As a next step, it would be interesting to see whether the co-occurrence of certain headers makes it possible to guess the domain of a C++ program, just like certain pairs of words (I guess I should rather speak of bigrams here, to use the NLP term) are more indicative of certain genres of literature. Treating code like literature would certainly make for an interesting art project.
The code for this project is available on GitHub. You only have to supply the repositories for scanning.
Happy coding, until next time!
Update (2018-04-06): Changed the title because I was using the term meta-analysis incorrectly. Thanks, HN!