Baby steps with libclang
: Walking an abstract syntax tree
Tags: programming, howtos
Recently, I wanted to parse some C++ code to collect some information about my coding style. As always with C++, this turned out to require more effort than I had originally anticipated. Apparently, C++ is so complex that you require more or less a compiler to parse it. The great folks at the LLVM project had already done the work—a stroke of luck for me.
So I set out to use clang
for parsing C++ code. The official
documentation lists multiple
options here. While LibTooling
, a C++ interface to clang
, would
arguably have been the more appropriate choice, I settled for
libclang
, a C interface. Its main advantage is its stability. The
LLVM developers are very reluctant to introduce breakages
here (although they have been known to happen). The disadvantage of
libclang
is that we do not get full control over the abstract syntax
tree (AST) of the program. I can live with that, I guess.
In this post, I only want to briefly introduce some basic concepts of
libclang
. I hope to show off more features in subsequent posts.
Dumping an AST
Suppose you have the following simple C++ snippet:
template <class T> bool f( T x )
{
return x % 2;
}
Just like any compiler, clang
creates an abstract syntax tree that
represents the code we want to compile. The syntax of this AST resembles
the written code very closely, which makes this a very interesting view
on your code.
To show the AST of the previous snippet, we can clang
with:
clang -Xclang -ast-dump -fsyntax-only foo.cc
This results in the following output:
TranslationUnitDecl 0x2fef1b0 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x2fef6f0 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
|-TypedefDecl 0x2fef750 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
|-TypedefDecl 0x2fefb10 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list '__va_list_tag [1]'
`-FunctionTemplateDecl 0x2fefdc0 <foo.cc:1:1, line:4:1> line:1:25 f
|-TemplateTypeParmDecl 0x2fefb60 <col:11, col:17> col:17 referenced class T
`-FunctionDecl 0x2fefd20 <col:20, line:4:1> line:1:25 f '_Bool (T)'
|-ParmVarDecl 0x2fefc20 <col:28, col:30> col:30 referenced x 'T'
`-CompoundStmt 0x2fefea0 <line:2:1, line:4:1>
`-ReturnStmt 0x2fefe80 <line:3:3, col:14>
`-BinaryOperator 0x2fefe58 <col:10, col:14> '<dependent type>' '%'
|-DeclRefExpr 0x2fefe10 <col:10> 'T' lvalue ParmVar 0x2fefc20 'x' 'T'
`-IntegerLiteral 0x2fefe38 <col:14> 'int' 2
The important thing to note is that we have a FunctionTemplateDecl
node that contains, among others, the body of the
function (starting from the CompoundStmt
node).
To familiarize ourselves with libclang
, let’s try to reproduce
parts of this AST ourselves. I propose writing a program that walks the
AST and prints out node names.
First, we need to get clang
to emit the AST in a binary format so that
we traverse it later:
clang++ -emit-ast foo.cc
This should result in a file foo.ast
.
Reading an AST with libclang
Understanding libclang
requires reading copious amounts of
documentation.
I cannot claim in good conscience that I have fully understood and
memorized every detail, so the code that follows may actually be very
stupid.
One of the core functions of libclang
is the creation of an index.
This data structure contains multiple translation units, e.g. source
files, that make up an executable. Given an index, a new translation
unit can be created. There are various ways of doing this. It it
possible to parse source code directly. For now, I will only present
a way of parsing a dumped AST. Last, we need a cursor, which is the
libclang
term for a thing that keeps track of where we are in the
AST.
The following snippet will create the required things for us and permit further parsing (for the hasty: scroll to the bottom for the complete example code):
#include <clang-c/Index.h>
int main( int argc, char** argv )
{
if( argc < 2 )
return -1;
CXIndex index = clang_createIndex( 0, 1 );
CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );
if( !tu )
return -1;
CXCursor rootCursor = clang_getTranslationUnitCursor( tu );
clang_disposeTranslationUnit( tu );
clang_disposeIndex( index );
return 0;
}
This program will attempt to load a translation unit from an .ast
file, get a cursor to the root node, and exit. The only noteworthy thing
are the two _dispose
calls at the end of the function. This is due to
libclang
being a C library—we actually need to think about
memory management here. Darn.
Visiting nodes
To make the example application marginally more useful, we need to make
use of the “visitor pattern”.
libclang
requires us to specify a sort of callback function that is
called whenever we reach a node in the AST.
The function needs to have the following signature:
CXChildVisitResult visitor( CXCursor cursor, CXCursor parent, CXClientData clientData )
The result of the function decides how the traversal is being continued.
Possibilities include CXChildVisit_Break
to stop traversal,
CXChildVisit_Continue
to continue traversing the siblings of the
current cursor without visiting its children, or CXChildVisit_Recurse
to visit the children of the current cursor first. cursor
refers to
the current position in the tree. parent
is the parent node in the
tree. Last, CXClientData
is a typedef for ‘void*
(a,
don’t you love C as much as I do?) and permits passing additional
information for each traversal. We will soon figure out a way of using
this to our advantage.
The main entry point for visiting nodes is the clang_visitChildren
function. Assuming that we have set up a function visitor
with the proper prototype, we may add the following lines after rootCursor
has been assigned:
unsigned int treeLevel = 0;
clang_visitChildren( rootCursor,
visitor,
&treeLevel );
This will call our visitor function with client data that initially
refers to the upper-most level in the tree. A potential implementation for visitor
could look like this:
CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
CXSourceLocation location = clang_getCursorLocation( cursor );
if( clang_Location_isFromMainFile( location ) == 0 )
return CXChildVisit_Continue;
CXCursorKind cursorKind = clang_getCursorKind( cursor );
unsigned int curLevel = *( reinterpret_cast<unsigned int*>( clientData ) );
unsigned int nextLevel = curLevel + 1;
std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";
clang_visitChildren( cursor,
visitor,
&nextLevel );
return CXChildVisit_Continue;
}
The visitor uses some interesting functions. First, we use
clang_getCursorLocation()
to get the location of the cursor with
respect to the source file. This permits mapping a cursor, i.e. a region
in a tree, to a region in the source file. We use it for a far simpler
purpose, though. By calling clang_Location_isFromMainFile()
, we can
check whether a node is part of the source file we are analysing or
not. This is extremely helpful for skipping over nodes that are not
interesting for us because they appear in the AST after an #include
,
for example. As you can see, we skip over the subtree rooted at the
current cursor when such a node has been reached.
The next interesting thing is the clang_getCursorKind
. This yields the
type of the cursor, giving us information about whether it refers to
a statement, a function declaration, or what have you. To see all
types (or rather kinds) that are available, please take a look at
the Index.h
file.
Having obtained information about the type of the cursor, we get
information about our current level, using a reinterpret_cast
. We then
call two auxiliary functions that give us the name and spelling of
the cursor (see below for whole code). By spelling, libclang
means the name of the entity the cursor refers to. When positioned at
the root node of a function declaration, the cursor spelling will be the
name of the function, for example. Not all entities have
a spelling—for some, we need to tokenize the node first. For
reasons of simplicity, I am going to cover this in a later post, though.
The last interesting thing is the continued traversal within the visitor
function. We call clang_visitChildren
on the current cursor, with the
updated level information. Thus, all children of the current cursor will
eventually be visited. Last, we return CXChildVisit_Continue
. This
ensures that we do not visit the children of a particular node twice.
The output of this simple program when being applied to foo.ast
is:
FunctionTemplate (f)
- TemplateTypeParameter (T)
- ParmDecl (x)
-- TypeRef (T)
- CompoundStmt ()
-- ReturnStmt ()
--- BinaryOperator ()
---- DeclRefExpr (x)
---- IntegerLiteral ()
Which is nice for just about 50 lines of code in total. As promised, not all entities have names here, but the ones that do seem to make sense.
The complete code
Here’s the complete code of this example:
#include <clang-c/Index.h>
#include <iostream>
#include <string>
std::string getCursorKindName( CXCursorKind cursorKind )
{
CXString kindName = clang_getCursorKindSpelling( cursorKind );
std::string result = clang_getCString( kindName );
clang_disposeString( kindName );
return result;
}
std::string getCursorSpelling( CXCursor cursor )
{
CXString cursorSpelling = clang_getCursorSpelling( cursor );
std::string result = clang_getCString( cursorSpelling );
clang_disposeString( cursorSpelling );
return result;
}
CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
CXSourceLocation location = clang_getCursorLocation( cursor );
if( clang_Location_isFromMainFile( location ) == 0 )
return CXChildVisit_Continue;
CXCursorKind cursorKind = clang_getCursorKind( cursor );
unsigned int curLevel = *( reinterpret_cast<unsigned int*>( clientData ) );
unsigned int nextLevel = curLevel + 1;
std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";
clang_visitChildren( cursor,
visitor,
&nextLevel );
return CXChildVisit_Continue;
}
int main( int argc, char** argv )
{
if( argc < 2 )
return -1;
CXIndex index = clang_createIndex( 0, 1 );
CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );
if( !tu )
return -1;
CXCursor rootCursor = clang_getTranslationUnitCursor( tu );
unsigned int treeLevel = 0;
clang_visitChildren( rootCursor, visitor, &treeLevel );
clang_disposeTranslationUnit( tu );
clang_disposeIndex( index );
return 0;
}
I am releasing the code into the public domain. Don’t forget to
link against libclang
when compiling it (subsequent posts will
provide a find module for CMake
). Should you consider this code
useful, it would give me enormous pleasure if you were to drop me an
e-mail.
To show that this would be worth your while, I am pointing out an
interesting part in the code above: In the getCursorKindName()
function, we use clang_getCursorKindSpelling()
to obtain the name of
the type the current cursor refers to. This is a CXString
, which needs
to be disposed later on with clang_disposeString()
so as not to cause
any memory leaks. To use the string as a regular C++ string, we need to
call the helper function clang_getCString
, which returns
a null-terminated C string (i.e. a const char*
, which we may
finally assign to an std::string
).
Wishing you a happy tree traversal! Maybe clang
hides some gifts in the AST…