Baby steps with libclang: Walking an abstract syntax tree

Tags: programming, howtos

Published on
« Previous post: YouCompleteMe and CMake — Next post: Ten noteworthy books I read in 2015 »

Recently, I wanted to parse some C++ code to collect some information about my coding style. As always with C++, this turned out to require more effort than I had originally anticipated. Apparently, C++ is so complex that you require more or less a compiler to parse it. The great folks at the LLVM project had already done the work—a stroke of luck for me.

So I set out to use clang for parsing C++ code. The official documentation lists multiple options here. While LibTooling, a C++ interface to clang, would arguably have been the more appropriate choice, I settled for libclang, a C interface. Its main advantage is its stability. The LLVM developers are very reluctant to introduce breakages here (although they have been known to happen). The disadvantage of libclang is that we do not get full control over the abstract syntax tree (AST) of the program. I can live with that, I guess.

In this post, I only want to briefly introduce some basic concepts of libclang. I hope to show off more features in subsequent posts.

Dumping an AST

Suppose you have the following simple C++ snippet:

template <class T> bool f( T x )
{
  return x % 2;
}

Just like any compiler, clang creates an abstract syntax tree that represents the code we want to compile. The syntax of this AST resembles the written code very closely, which makes this a very interesting view on your code.

To show the AST of the previous snippet, we can clang with:

clang -Xclang -ast-dump -fsyntax-only foo.cc

This results in the following output:

TranslationUnitDecl 0x2fef1b0 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x2fef6f0 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
|-TypedefDecl 0x2fef750 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
|-TypedefDecl 0x2fefb10 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list '__va_list_tag [1]'
`-FunctionTemplateDecl 0x2fefdc0 <foo.cc:1:1, line:4:1> line:1:25 f
  |-TemplateTypeParmDecl 0x2fefb60 <col:11, col:17> col:17 referenced class T
  `-FunctionDecl 0x2fefd20 <col:20, line:4:1> line:1:25 f '_Bool (T)'
    |-ParmVarDecl 0x2fefc20 <col:28, col:30> col:30 referenced x 'T'
    `-CompoundStmt 0x2fefea0 <line:2:1, line:4:1>
      `-ReturnStmt 0x2fefe80 <line:3:3, col:14>
        `-BinaryOperator 0x2fefe58 <col:10, col:14> '<dependent type>' '%'
          |-DeclRefExpr 0x2fefe10 <col:10> 'T' lvalue ParmVar 0x2fefc20 'x' 'T'
          `-IntegerLiteral 0x2fefe38 <col:14> 'int' 2

The important thing to note is that we have a FunctionTemplateDecl node that contains, among others, the body of the function (starting from the CompoundStmt node).

To familiarize ourselves with libclang, let’s try to reproduce parts of this AST ourselves. I propose writing a program that walks the AST and prints out node names.

First, we need to get clang to emit the AST in a binary format so that we traverse it later:

clang++ -emit-ast foo.cc

This should result in a file foo.ast.

Reading an AST with libclang

Understanding libclang requires reading copious amounts of documentation. I cannot claim in good conscience that I have fully understood and memorized every detail, so the code that follows may actually be very stupid.

One of the core functions of libclang is the creation of an index. This data structure contains multiple translation units, e.g. source files, that make up an executable. Given an index, a new translation unit can be created. There are various ways of doing this. It it possible to parse source code directly. For now, I will only present a way of parsing a dumped AST. Last, we need a cursor, which is the libclang term for a thing that keeps track of where we are in the AST.

The following snippet will create the required things for us and permit further parsing (for the hasty: scroll to the bottom for the complete example code):

#include <clang-c/Index.h>

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  CXIndex index        = clang_createIndex( 0, 1 );
  CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );

  if( !tu )
    return -1;
  
  CXCursor rootCursor  = clang_getTranslationUnitCursor( tu );

  clang_disposeTranslationUnit( tu );
  clang_disposeIndex( index );
  return 0;
}

This program will attempt to load a translation unit from an .ast file, get a cursor to the root node, and exit. The only noteworthy thing are the two _dispose calls at the end of the function. This is due to libclang being a C library—we actually need to think about memory management here. Darn.

Visiting nodes

To make the example application marginally more useful, we need to make use of the “visitor pattern”. libclang requires us to specify a sort of callback function that is called whenever we reach a node in the AST.

The function needs to have the following signature:

CXChildVisitResult visitor( CXCursor cursor, CXCursor parent, CXClientData clientData )

The result of the function decides how the traversal is being continued. Possibilities include CXChildVisit_Break to stop traversal, CXChildVisit_Continue to continue traversing the siblings of the current cursor without visiting its children, or CXChildVisit_Recurse to visit the children of the current cursor first. cursor refers to the current position in the tree. parent is the parent node in the tree. Last, CXClientData is a typedef for ‘void* (a, don’t you love C as much as I do?) and permits passing additional information for each traversal. We will soon figure out a way of using this to our advantage.

The main entry point for visiting nodes is the clang_visitChildren function. Assuming that we have set up a function visitor with the proper prototype, we may add the following lines after rootCursor has been assigned:

unsigned int treeLevel = 0;

clang_visitChildren( rootCursor,
                     visitor,
                     &treeLevel );

This will call our visitor function with client data that initially refers to the upper-most level in the tree. A potential implementation for visitor could look like this:

CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
  CXSourceLocation location = clang_getCursorLocation( cursor );
  if( clang_Location_isFromMainFile( location ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind cursorKind = clang_getCursorKind( cursor );

  unsigned int curLevel  = *( reinterpret_cast<unsigned int*>( clientData ) );
  unsigned int nextLevel = curLevel + 1;

  std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
  cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";

  clang_visitChildren( cursor,
                       visitor,
                       &nextLevel ); 

  return CXChildVisit_Continue;
}

The visitor uses some interesting functions. First, we use clang_getCursorLocation() to get the location of the cursor with respect to the source file. This permits mapping a cursor, i.e. a region in a tree, to a region in the source file. We use it for a far simpler purpose, though. By calling clang_Location_isFromMainFile(), we can check whether a node is part of the source file we are analysing or not. This is extremely helpful for skipping over nodes that are not interesting for us because they appear in the AST after an #include, for example. As you can see, we skip over the subtree rooted at the current cursor when such a node has been reached.

The next interesting thing is the clang_getCursorKind. This yields the type of the cursor, giving us information about whether it refers to a statement, a function declaration, or what have you. To see all types (or rather kinds) that are available, please take a look at the Index.h file.

Having obtained information about the type of the cursor, we get information about our current level, using a reinterpret_cast. We then call two auxiliary functions that give us the name and spelling of the cursor (see below for whole code). By spelling, libclang means the name of the entity the cursor refers to. When positioned at the root node of a function declaration, the cursor spelling will be the name of the function, for example. Not all entities have a spelling—for some, we need to tokenize the node first. For reasons of simplicity, I am going to cover this in a later post, though.

The last interesting thing is the continued traversal within the visitor function. We call clang_visitChildren on the current cursor, with the updated level information. Thus, all children of the current cursor will eventually be visited. Last, we return CXChildVisit_Continue. This ensures that we do not visit the children of a particular node twice.

The output of this simple program when being applied to foo.ast is:

 FunctionTemplate (f)
- TemplateTypeParameter (T)
- ParmDecl (x)
-- TypeRef (T)
- CompoundStmt ()
-- ReturnStmt ()
--- BinaryOperator ()
---- DeclRefExpr (x)
---- IntegerLiteral ()

Which is nice for just about 50 lines of code in total. As promised, not all entities have names here, but the ones that do seem to make sense.

The complete code

Here’s the complete code of this example:

#include <clang-c/Index.h>

#include <iostream>
#include <string>

std::string getCursorKindName( CXCursorKind cursorKind )
{
  CXString kindName  = clang_getCursorKindSpelling( cursorKind );
  std::string result = clang_getCString( kindName );

  clang_disposeString( kindName );
  return result;
}

std::string getCursorSpelling( CXCursor cursor )
{
  CXString cursorSpelling = clang_getCursorSpelling( cursor );
  std::string result      = clang_getCString( cursorSpelling );

  clang_disposeString( cursorSpelling );
  return result;
}

CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
  CXSourceLocation location = clang_getCursorLocation( cursor );
  if( clang_Location_isFromMainFile( location ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind cursorKind = clang_getCursorKind( cursor );

  unsigned int curLevel  = *( reinterpret_cast<unsigned int*>( clientData ) );
  unsigned int nextLevel = curLevel + 1;

  std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
  cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";

  clang_visitChildren( cursor,
                       visitor,
                       &nextLevel ); 

  return CXChildVisit_Continue;
}

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  CXIndex index        = clang_createIndex( 0, 1 );
  CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );

  if( !tu )
    return -1;

  CXCursor rootCursor  = clang_getTranslationUnitCursor( tu );

  unsigned int treeLevel = 0;

  clang_visitChildren( rootCursor, visitor, &treeLevel );

  clang_disposeTranslationUnit( tu );
  clang_disposeIndex( index );

  return 0;
}

I am releasing the code into the public domain. Don’t forget to link against libclang when compiling it (subsequent posts will provide a find module for CMake). Should you consider this code useful, it would give me enormous pleasure if you were to drop me an e-mail.

To show that this would be worth your while, I am pointing out an interesting part in the code above: In the getCursorKindName() function, we use clang_getCursorKindSpelling() to obtain the name of the type the current cursor refers to. This is a CXString, which needs to be disposed later on with clang_disposeString() so as not to cause any memory leaks. To use the string as a regular C++ string, we need to call the helper function clang_getCString, which returns a null-terminated C string (i.e. a const char*, which we may finally assign to an std::string).

Wishing you a happy tree traversal! Maybe clang hides some gifts in the AST…