Baby steps with libclang: Counting function extents
Tags: programming, howtos
In the previous instalment of this little series, I explained how to walk an abstract syntax tree.
Since this requires a separate call to clang beforehand, I want to extend the example so that it can
parse code directly.
We will not encounter any new concepts for code parsing here, only some additional functions of
libclang. The main entry point for parsing code directly is the clang_parseTranslationUnit()
function. It requires a working index (which we already encountered last time) and accepts an
optional list of additional compiler arguments. These arguments turn out to be critical when trying
to do sensible things with C++ code. Without, say, the proper include directories, clang cannot
decide whether a series of tokens in the source code constitutes a type, for example.
Where to get compile arguments
The easiest way to obtain compilation parameters is to use a compilation database. Typically, this
is a file called compile_commands.json that resides in the build directory of a software project.
For each source file, it contains the complete call to the compiler, including all flags and other
parameters. We can easily obtain such a compilation database if we specify
SET( CMAKE_EXPORT_COMPILE_COMMANDS ON )
in the main CMakeLists.txt file of our project (see my article about the YouCompleteMe engine and
cmake in this very blog). Armed with this file, libclang offers numerous functions for dealing with
the database. The following snippet (again, see the bottom of this post for the complete code) will
attempt to load a database from the current directory and count the number of compile commands
stored for a given source file:
#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>
// Somewhat later, in the main function:
CXCompilationDatabase_Error compilationDatabaseError;
CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
CXCompileCommands compileCommands = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
unsigned int numCompileCommands = clang_CompileCommands_getSize( compileCommands );
Let’s ignore the resolvedPath variable for the time being; it will be explained further below. Our
next task is to get all these parameters into the clang_parseTranslationUnit() function.
Unfortunately, the interface for this function is somewhat clunky (at least I am not aware of a
better solution). We have to convert each argument of the compile command individually and pass all
of them in the form of a two-dimensional char array:
CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
unsigned int numArguments = clang_CompileCommand_getNumArgs( compileCommand );
char** arguments = new char*[ numArguments ];
for( unsigned int i = 0; i < numArguments; i++ )
{
  CXString argument = clang_CompileCommand_getArg( compileCommand, i );
  std::string strArgument = clang_getCString( argument );
  arguments[i] = new char[ strArgument.size() + 1 ];
  std::fill( arguments[i],
             arguments[i] + strArgument.size() + 1,
             0 );
  std::copy( strArgument.begin(), strArgument.end(),
             arguments[i] );
  clang_disposeString( argument );
}
translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );
for( unsigned int i = 0; i < numArguments; i++ )
  delete[] arguments[i];
delete[] arguments;
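The salient point is the call to clang_parseTranslationUnit(), in which all arguments obtained from
the compilation database are used. The source file name is passed as 0 because it is already part
of the arguments stored in the database.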
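The manual new[]/delete[] bookkeeping above can be avoided. The following is a minimal sketch, not
part of the original program, that assumes the arguments have been collected in a
std::vector<std::string> and builds a second vector of const char* pointing into those strings.
Since clang_parseTranslationUnit() takes its arguments as const char* const*, the data() of that
pointer vector can be handed over directly. The helper name makeArgumentPointers is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of libclang): builds an array of C-string
// pointers that refer to the strings owned by `arguments`. The pointers stay
// valid only as long as `arguments` is neither modified nor destroyed.
std::vector<const char*> makeArgumentPointers( const std::vector<std::string>& arguments )
{
  std::vector<const char*> pointers;
  pointers.reserve( arguments.size() );

  for( const auto& argument : arguments )
    pointers.push_back( argument.c_str() );

  return pointers;
}
```

With this helper, each argument from the loop above would be appended to the string vector instead
of a freshly allocated buffer; pointers.data() and pointers.size() then take the place of arguments
and numArguments, and no delete[] is required.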
Counting function extents
Having a valid translation unit at hand, we can proceed as in the previous article by getting a cursor into the translation unit and visiting the syntax tree.
CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
clang_visitChildren( rootCursor, functionVisitor, nullptr );
Here, functionVisitor is a simple visitor that only reacts to functions, class methods, and
function templates:
CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;
  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name = getCursorSpelling( cursor );
  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation = clang_getRangeEnd( extent );
    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine = 0, endColumn = 0;
    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation, nullptr, &endLine, &endColumn, nullptr );
    std::cout << " " << name << ": " << endLine - startLine << "\n";
  }
  return CXChildVisit_Recurse;
}
This time, we always recursively visit all children of the current node because we might encounter
functions nested in namespaces and suchlike. Apart from this, the visitor offers few surprises. We
again use getCursorSpelling() to obtain the name of the function.
If we encounter a function (which we can decide by checking the kind of the cursor using the
clang_getCursorKind() function), we get its extent within the source file. To this end, we call
clang_getCursorExtent(), which results in a CXSourceRange. This is a type that specifies, well, a
range in the source code. The start and end location, respectively, are obtained using
clang_getRangeStart() and clang_getRangeEnd(). Finally, we use clang_getSpellingLocation() to map
the internal locations to external ones, in the form of a line and a column. We then print the name
of the function and the difference between its end line and its start line as a measure of the
number of source code lines it takes. This includes comments and everything, so it is not a good
measure of code complexity. As an introductory example of the power of libclang it should suffice,
though.
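If blank lines should not contribute to the count, the line range reported by libclang can be
post-processed. The following is a rough sketch under the assumption that the source text is
already available as a std::string; countNonBlankLines is a hypothetical helper, not a libclang
function, and it makes no attempt to skip comments.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines in the inclusive, 1-based range
// [startLine, endLine] of `source` that contain at least one character
// other than spaces, tabs, or carriage returns.
unsigned int countNonBlankLines( const std::string& source, unsigned int startLine, unsigned int endLine )
{
  std::istringstream stream( source );
  std::string line;
  unsigned int lineNumber = 0;
  unsigned int count      = 0;

  while( std::getline( stream, line ) )
  {
    ++lineNumber;

    if( lineNumber < startLine )
      continue;
    if( lineNumber > endLine )
      break;

    if( line.find_first_not_of( " \t\r" ) != std::string::npos )
      ++count;
  }

  return count;
}
```

The startLine and endLine values obtained from clang_getSpellingLocation() could be passed to this
helper unchanged, since both use 1-based line numbers.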
By the by: This example also demonstrates the care the libclang developers have taken when
specifying their API. Being capable of mapping entities encountered during the parse process back to
actual lines of code offers a great amount of flexibility for tool developers. This is really nice!
What about default arguments?
As a fall-back, if no compile commands are available, we can also specify our own includes. This is
surprisingly painless, thanks to std::extent:
constexpr const char* defaultArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};
translationUnit = clang_parseTranslationUnit( index,
                                              resolvedPath.c_str(),
                                              defaultArguments,
                                              std::extent<decltype(defaultArguments)>::value,
                                              0,
                                              0,
                                              CXTranslationUnit_None );
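std::extent, from <type_traits>, yields the number of elements along a dimension of an array type at
compile time, which is why the array above never needs a hand-maintained size. A small standalone
sketch (the names exampleArguments and numExampleArguments are made up for illustration):

```cpp
#include <cstddef>
#include <type_traits>

// Illustrative array; its size is deduced from the initializer list.
constexpr const char* exampleArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};

// std::extent reports the element count of the first array dimension.
constexpr std::size_t numExampleArguments = std::extent<decltype(exampleArguments)>::value;

// Checked at compile time: adding or removing an entry changes the count.
static_assert( numExampleArguments == 3, "The array holds three arguments" );
```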
What about the mysterious resolved path?
At this point, the resolvedPath variable has occurred multiple times, and surely the suspense has
kept you on the edge of your seat. Let me resolve the mystery for you:
#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif
std::string resolvePath( const char* path )
{
  std::string resolvedPath;
#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result = realpath( path, resolvedPathRaw );
  if( result )
    resolvedPath = resolvedPathRaw;
  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif
  return resolvedPath;
}
We only need this function to permit the user to specify relative paths on the command-line. For
the compilation database and the translation unit parsing, however, we require absolute paths. The
function above is nothing but a fancy wrapper for the realpath() function that returns the
canonicalized absolute path name.
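On compilers that support C++17, std::filesystem offers a portable alternative to realpath(). The
sketch below is a variant under that assumption, not the code used in this post:
resolvePathPortable is a hypothetical name, and it returns an empty string on failure, mirroring
the behaviour of the Unix branch above.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical portable variant of resolvePath(); requires C++17.
// Returns the canonical absolute path, or an empty string if the path does
// not exist or cannot be resolved.
std::string resolvePathPortable( const char* path )
{
  std::error_code error;
  std::filesystem::path resolved = std::filesystem::canonical( path, error );

  if( error )
    return std::string();

  return resolved.string();
}
```

Using the error_code overload keeps the function free of exceptions, so a missing file simply
results in an empty string that the caller can check.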
The complete code
This is what you have been waiting for:
#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>

#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif

#include <iostream>
#include <string>
#include <type_traits>

std::string getCursorSpelling( CXCursor cursor )
{
  CXString cursorSpelling = clang_getCursorSpelling( cursor );
  std::string result = clang_getCString( cursorSpelling );
  clang_disposeString( cursorSpelling );
  return result;
}

/* Auxiliary function for resolving a (relative) path into an absolute path */
std::string resolvePath( const char* path )
{
  std::string resolvedPath;
#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result = realpath( path, resolvedPathRaw );
  if( result )
    resolvedPath = resolvedPathRaw;
  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif
  return resolvedPath;
}

CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;
  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name = getCursorSpelling( cursor );
  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation = clang_getRangeEnd( extent );
    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine = 0, endColumn = 0;
    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation, nullptr, &endLine, &endColumn, nullptr );
    std::cout << " " << name << ": " << endLine - startLine << "\n";
  }
  return CXChildVisit_Recurse;
}

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;
  auto resolvedPath = resolvePath( argv[1] );
  std::cerr << "Parsing " << resolvedPath << "...\n";
  CXCompilationDatabase_Error compilationDatabaseError;
  CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
  CXCompileCommands compileCommands = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
  unsigned int numCompileCommands = clang_CompileCommands_getSize( compileCommands );
  std::cerr << "Obtained " << numCompileCommands << " compile commands\n";
  CXIndex index = clang_createIndex( 0, 1 );
  CXTranslationUnit translationUnit;
  if( numCompileCommands == 0 )
  {
    constexpr const char* defaultArguments[] = {
      "-std=c++11",
      "-I/usr/include",
      "-I/usr/local/include"
    };
    translationUnit = clang_parseTranslationUnit( index,
                                                  resolvedPath.c_str(),
                                                  defaultArguments,
                                                  std::extent<decltype(defaultArguments)>::value,
                                                  0,
                                                  0,
                                                  CXTranslationUnit_None );
  }
  else
  {
    CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
    unsigned int numArguments = clang_CompileCommand_getNumArgs( compileCommand );
    char** arguments = new char*[ numArguments ];
    for( unsigned int i = 0; i < numArguments; i++ )
    {
      CXString argument = clang_CompileCommand_getArg( compileCommand, i );
      std::string strArgument = clang_getCString( argument );
      arguments[i] = new char[ strArgument.size() + 1 ];
      std::fill( arguments[i],
                 arguments[i] + strArgument.size() + 1,
                 0 );
      std::copy( strArgument.begin(), strArgument.end(),
                 arguments[i] );
      clang_disposeString( argument );
    }
    translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );
    for( unsigned int i = 0; i < numArguments; i++ )
      delete[] arguments[i];
    delete[] arguments;
  }
  CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
  clang_visitChildren( rootCursor, functionVisitor, nullptr );
  clang_disposeTranslationUnit( translationUnit );
  clang_disposeIndex( index );
  clang_CompileCommands_dispose( compileCommands );
  clang_CompilationDatabase_dispose( compilationDatabase );
  return 0;
}
Let me repeat myself here: I am releasing the code into the public domain. Don’t forget to link
against libclang when compiling it (one of the subsequent posts is likely to provide a find module
for CMake). Should you consider this code useful, it would give me enormous pleasure if you were to
drop me an e-mail.
If I apply the sample program to its own source code, I get the following results:
Parsing [FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
Obtained 1 compile commands
[FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
getCursorSpelling: 7
resolvePath: 17
functionVisitor: 24
main: 74
May your code in 2016 be as easy to parse for you as this example!