Baby steps with libclang: Counting function extents

Tags: programming, howtos


In the previous instalment of this little series, I already explained how to walk an abstract syntax tree. Since this requires a specific call to clang beforehand, I want to extend the example to be able to parse code directly.

We will not encounter any new concepts for code parsing here, only some additional functions of libclang. The main entry point for parsing code directly is the clang_parseTranslationUnit() function. It requires a working compilation index (which we already encountered last time) and optionally takes a list of additional compiler arguments. These arguments turn out to be critical when trying to do sensible things with C++ code. Without, say, the proper include directories, clang cannot decide whether a series of tokens in the source code constitutes a type, for example.

Where to get compile arguments

The easiest way to obtain compilation parameters is to use a compilation database. Typically, this is a file called compile_commands.json that resides in the build directory of a software project. For each source file, it contains the complete call to the compiler, including all flags and other parameters. We can easily obtain such a compilation database if we specify

SET( CMAKE_EXPORT_COMPILE_COMMANDS ON )

in the main CMakeLists.txt file of our project (see my article about the YouCompleteMe engine and cmake in this very blog). Armed with this file, we can turn to libclang, which offers several functions for dealing with the database. The following snippet (again, see the bottom of this post for the complete code) will attempt to load a database from the current directory and count the number of compile commands stored for our source file:

#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>

// Somewhat later, in the main function:

CXCompilationDatabase_Error compilationDatabaseError;
CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
CXCompileCommands compileCommands         = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
unsigned int numCompileCommands           = clang_CompileCommands_getSize( compileCommands );
 

Let’s ignore the resolvedPath variable for the time being; it will be explained further below. Our next task is to get all these parameters into the clang_parseTranslationUnit() function. Unfortunately, the interface for this function is somewhat clunky (at least I am not aware of a better solution). We have to convert each argument of the compile command individually and pass the result as an array of C strings, i.e. a char**:

CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
unsigned int numArguments       = clang_CompileCommand_getNumArgs( compileCommand );
char** arguments                = new char*[ numArguments ];

for( unsigned int i = 0; i < numArguments; i++ )
{
  CXString argument       = clang_CompileCommand_getArg( compileCommand, i );
  std::string strArgument = clang_getCString( argument );
  arguments[i]            = new char[ strArgument.size() + 1 ];

  std::fill( arguments[i],
             arguments[i] + strArgument.size() + 1,
             0 );

  std::copy( strArgument.begin(), strArgument.end(),
             arguments[i] );

  clang_disposeString( argument );
}

translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );

for( unsigned int i = 0; i < numArguments; i++ )
  delete[] arguments[i];

delete[] arguments;

The salient point is the call to clang_parseTranslationUnit() in which all arguments obtained from the compilation database are used.
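
If the manual new[]/delete[] bookkeeping above feels error-prone, the same array of C strings can be assembled from standard containers. The following is only a sketch under assumptions: the helper name makeArgv is made up for illustration, and the arguments are assumed to have been copied into std::string objects already (for instance from clang_getCString()). The vector of pointers stays valid only as long as the vector of strings is neither modified nor destroyed.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of libclang): builds a pointer array that
// refers to the character data of the given strings. The strings own the
// memory, so nothing has to be deleted by hand.
std::vector<const char*> makeArgv( const std::vector<std::string>& arguments )
{
  std::vector<const char*> argv;
  argv.reserve( arguments.size() );

  for( const auto& argument : arguments )
    argv.push_back( argument.c_str() );

  return argv;
}
```

The result of argv.data() converts implicitly to the const char* const* that clang_parseTranslationUnit() takes, with argv.size() serving as the number of arguments.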

Counting function extents

Having a valid translation unit at hand, we can proceed as in the previous article by getting a cursor into the translation unit and visiting the syntax tree.

CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
clang_visitChildren( rootCursor, functionVisitor, nullptr );

Here, functionVisitor is a simple visitor that only reacts to function declarations (including definitions), class methods, and function templates:

CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name         = getCursorSpelling( cursor );

  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent           = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation   = clang_getRangeEnd( extent );

    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine   = 0, endColumn   = 0;

    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation,   nullptr, &endLine, &endColumn, nullptr );

    std::cout << "  " << name << ": " << endLine - startLine << "\n";
  }

  return CXChildVisit_Recurse;
}
 

This time, we always recursively visit all children of the current node because we might encounter functions nested in namespaces and the like. Apart from this, the visitor offers few surprises. We again use getCursorSpelling() to obtain the name of the function.

If we encounter a function (which we can decide by checking the kind of the cursor using the clang_getCursorKind() function), we get its extent within the source file. To this end, we call clang_getCursorExtent(), which results in a CXSourceRange. This is a type that specifies, well, a range in the source code, delimited by two source locations. The start and end location, respectively, are obtained using clang_getRangeStart() and clang_getRangeEnd(). Finally, we use clang_getSpellingLocation() to map the internal locations to external ones, in the form of a line and a column. We then print the name of the function and the number of source code lines it spans, computed as the difference between its end line and its start line. This includes comments and everything, so it is not a good measure of code complexity. As an introductory example of the power of libclang it should suffice, though.
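
Note that the difference endLine - startLine is one less than the number of lines a function actually occupies: a function that starts and ends on the same line yields 0. Should an inclusive count be preferred, a small helper does the trick. This is merely a sketch; the names FunctionExtent and lineCount are invented for this illustration and do not appear in libclang.

```cpp
#include <string>

// Hypothetical record for one function, filled with the values that
// clang_getSpellingLocation() reports for the start and end of its extent.
struct FunctionExtent
{
  std::string name;
  unsigned int startLine;
  unsigned int endLine;
};

// Number of lines occupied by the function, counting both the first and
// the last line. Returns 0 for an inverted (invalid) range.
unsigned int lineCount( const FunctionExtent& extent )
{
  if( extent.endLine < extent.startLine )
    return 0;

  return extent.endLine - extent.startLine + 1;
}
```

Inside the visitor, one would construct such a record from name, startLine, and endLine and print lineCount() instead of the raw difference.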

By the by: This example also demonstrates the care the libclang developers have taken when specifying their API. Being capable of mapping entities encountered during the parse process back to actual lines of code offers a great amount of flexibility for tool developers. This is really nice!

What about default arguments?

As a fall-back, if no compile commands are available, we can also specify our own includes. This is surprisingly painless, thanks to std::extent:

constexpr const char* defaultArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};

translationUnit = clang_parseTranslationUnit( index,
                                              resolvedPath.c_str(),
                                              defaultArguments,
                                              std::extent<decltype(defaultArguments)>::value,
                                              0,
                                              0,
                                              CXTranslationUnit_None );
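
For those who have not come across it: std::extent from <type_traits> yields the number of elements of an array type at compile time, so the argument count never has to be maintained by hand. A minimal sketch, in which the array name exampleArguments and the function countExampleArguments are purely illustrative:

```cpp
#include <cstddef>
#include <type_traits>

constexpr const char* exampleArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};

// decltype yields the array type `const char* const[3]`; std::extent
// extracts its first dimension.
constexpr std::size_t numExampleArguments
  = std::extent<decltype(exampleArguments)>::value;

static_assert( numExampleArguments == 3, "array holds three arguments" );

std::size_t countExampleArguments()
{
  return numExampleArguments;
}
```

Adding another flag to the array changes the count automatically, which is why no separate constant is needed in the call to clang_parseTranslationUnit().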

What about the mysterious resolved path?

By now, the resolvedPath variable has appeared multiple times and surely the suspense has kept you on the edge of your seat. Let me resolve the mystery for you:

#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif

std::string resolvePath( const char* path )
{
  std::string resolvedPath;

#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result          = realpath( path, resolvedPathRaw );

  if( result )
    resolvedPath = resolvedPathRaw;

  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif

  return resolvedPath;
}
 

We only need this function to permit the user to specify relative paths on the command line, because both the compilation database lookup and the translation unit parsing require absolute paths. The function above is nothing but a fancy wrapper for the realpath() function, which returns the canonicalized absolute path name. If the path cannot be resolved, the result is an empty string.
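
As an aside, on systems that follow POSIX.1-2008, realpath() may be called with a null pointer as its second argument, in which case it allocates a sufficiently large buffer itself. This avoids relying on PATH_MAX. The following variant is a sketch under that assumption; the names resolvePathAlloc and FreeDeleter are made up for illustration, and the function only targets Unix-like systems:

```cpp
#include <stdlib.h>

#include <memory>
#include <string>

// Releases memory obtained from malloc(), which is what realpath() uses
// when it allocates the result buffer itself.
struct FreeDeleter
{
  void operator()( char* pointer ) const
  {
    free( pointer );
  }
};

// Hypothetical variant of resolvePath(): lets realpath() allocate the
// result buffer (POSIX.1-2008 behaviour). Returns an empty string if the
// path cannot be resolved.
std::string resolvePathAlloc( const char* path )
{
  std::unique_ptr<char, FreeDeleter> resolved( realpath( path, nullptr ) );

  if( !resolved )
    return std::string();

  return std::string( resolved.get() );
}
```

Like the original function, this variant returns an empty string for a path that does not exist, so callers can treat both interchangeably.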

The complete code

This is what you have been waiting for:

#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>

#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif

#include <algorithm>
#include <iostream>
#include <string>
#include <type_traits>

std::string getCursorSpelling( CXCursor cursor )
{
  CXString cursorSpelling = clang_getCursorSpelling( cursor );
  std::string result      = clang_getCString( cursorSpelling );

  clang_disposeString( cursorSpelling );
  return result;
}

/* Auxiliary function for resolving a (relative) path into an absolute path */
std::string resolvePath( const char* path )
{
  std::string resolvedPath;

#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result          = realpath( path, resolvedPathRaw );

  if( result )
    resolvedPath = resolvedPathRaw;

  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif

  return resolvedPath;
}

CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name         = getCursorSpelling( cursor );

  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent           = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation   = clang_getRangeEnd( extent );

    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine   = 0, endColumn   = 0;

    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation,   nullptr, &endLine, &endColumn, nullptr );

    std::cout << "  " << name << ": " << endLine - startLine << "\n";
  }

  return CXChildVisit_Recurse;
}

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  auto resolvedPath = resolvePath( argv[1] );
  std::cerr << "Parsing " << resolvedPath << "...\n";

  CXCompilationDatabase_Error compilationDatabaseError;
  CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
  CXCompileCommands compileCommands         = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
  unsigned int numCompileCommands           = clang_CompileCommands_getSize( compileCommands );

  std::cerr << "Obtained " << numCompileCommands << " compile commands\n";

  CXIndex index = clang_createIndex( 0, 1 );
  CXTranslationUnit translationUnit;

  if( numCompileCommands == 0 )
  {
    constexpr const char* defaultArguments[] = {
      "-std=c++11",
      "-I/usr/include",
      "-I/usr/local/include"
    };

    translationUnit = clang_parseTranslationUnit( index,
                                                  resolvedPath.c_str(),
                                                  defaultArguments,
                                                  std::extent<decltype(defaultArguments)>::value,
                                                  0,
                                                  0,
                                                  CXTranslationUnit_None );

  }
  else
  {
    CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
    unsigned int numArguments       = clang_CompileCommand_getNumArgs( compileCommand );
    char** arguments                = new char*[ numArguments ];

    for( unsigned int i = 0; i < numArguments; i++ )
    {
      CXString argument       = clang_CompileCommand_getArg( compileCommand, i );
      std::string strArgument = clang_getCString( argument );
      arguments[i]            = new char[ strArgument.size() + 1 ];

      std::fill( arguments[i],
                 arguments[i] + strArgument.size() + 1,
                 0 );

      std::copy( strArgument.begin(), strArgument.end(),
                 arguments[i] );

      clang_disposeString( argument );
    }

    translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );

    for( unsigned int i = 0; i < numArguments; i++ )
      delete[] arguments[i];

    delete[] arguments;
  }

  CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
  clang_visitChildren( rootCursor, functionVisitor, nullptr );

  clang_disposeTranslationUnit( translationUnit );
  clang_disposeIndex( index );

  clang_CompileCommands_dispose( compileCommands );
  clang_CompilationDatabase_dispose( compilationDatabase );
  return 0;
}

Let me repeat myself here: I am releasing the code into the public domain. Don’t forget to link against libclang when compiling it (one of the subsequent posts is likely to provide a find module for CMake). Should you consider this code useful, it would give me enormous pleasure if you were to drop me an e-mail.

If I apply the sample program to its own source code, I get the following results:

Parsing [FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
Obtained 1 compile commands
[FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
  getCursorSpelling: 7
  resolvePath: 17
  functionVisitor: 24
  main: 74

May your code in 2016 be as easy to parse for you as this example!