Parsing C++ Files with libclang

快要瘦了的小林

Last modified on 2022-09-07 09:32:46

First published on 2022-09-05 11:37:32

Article link: https://blog.csdn.net/weixin_47358139/article/details/126701416

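This article explains how to parse C++ source code with the libclang library and shows an example of building an abstract syntax tree (AST) with libclang and traversing it. Using the development of Reflang, an open-source reflection framework, as background, the author explains why libclang was chosen over other parsing approaches. The article provides code examples showing how to obtain AST node information, such as node kinds and template arguments, through the Python API, and gives a classic example of parsing a specific C++ struct.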

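Preface

In this article I give a quick tutorial on using libclang. I started experimenting with libclang while implementing Reflang, an open-source reflection framework for C++.

Before going further, I recommend taking a brief look at one of my earlier blog posts: http://t.csdn.cn/9ADYr (an introduction to LLVM)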

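1. libclang?

Have you heard of Clang? It is an excellent compiler for C++ (and the other languages of the C family). Strictly speaking, it is not a complete compiler but the front end of the LLVM compiler.
A compiler has a hard problem to solve, and that problem splits into two parts:

Translating a programming language (C++ in our case, specifically header files) into some intermediate code. This part is called the front end, and it is exactly what Clang does.
Translating that intermediate code into machine code. This part is called the back end, and Clang uses LLVM for it.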

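The main function of the tool is as follows. Its input is a C++ header file containing the C++ class definitions to be exported. From this input, it automatically processes the class/struct implementations and a body of business-logic code generated from template files. Internally, the tool first parses the input C++ file and builds an abstract syntax tree (AST). It then traverses the tree to generate the required wrapper and export code. The Visitor design pattern is used for decoupling: the traversal logic of the syntax tree and the code generation logic for the different output files are kept separate and implemented independently.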
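
The separation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `Node`, `walk`, `HeaderGenerator`, `NameCollector`, and the emitted text are hypothetical names made up here and are not taken from the tool. In the real tool the nodes would come from libclang rather than being built by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an AST node (the real tool gets these from libclang)."""
    kind: str                      # e.g. "struct" or "field"
    name: str
    children: List["Node"] = field(default_factory=list)


def walk(node, visitor):
    """Traversal logic: a pre-order walk that knows nothing about code generation."""
    visitor.visit(node)
    for child in node.children:
        walk(child, visitor)


class HeaderGenerator:
    """One visitor: collects lines of a made-up export listing."""

    def __init__(self):
        self.lines = []

    def visit(self, node):
        if node.kind == "struct":
            self.lines.append(f"export struct {node.name}")
        elif node.kind == "field":
            self.lines.append(f"  export field {node.name}")


class NameCollector:
    """A second visitor that reuses the same traversal for a different output."""

    def __init__(self):
        self.names = []

    def visit(self, node):
        self.names.append(node.name)


# Hand-built tree standing in for a parsed header.
tree = Node("struct", "PointCloud", [Node("field", "label"), Node("field", "point2")])

generator = HeaderGenerator()
walk(tree, generator)
print("\n".join(generator.lines))

collector = NameCollector()
walk(tree, collector)
print(collector.names)
```

Adding another kind of output file then means writing another visitor class; `walk` stays unchanged.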

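C++ syntax is notorious for being complex and hard to parse. There are three options:

1. Write a parser by hand that analyzes a selected subset of C++ syntax. This costs a lot of time and effort, and mistakes are hard to avoid;
2. Use a parsing library such as pyparsing to help implement a simple C++ parser. This is only slightly better than writing one by hand, because the difficulty remains that C++ syntax is extraordinarily complex;
3. Use a real C++ compiler to do the parsing. For example, g++ can be made to output the parsed syntax tree as XML (as the GCC-XML extension does), which other programs can then consume. Many code generators work this way.

After weighing the options, I chose the third: using a real C++ compiler to help with parsing. Instead of the veteran g++, though, I picked a rising star: Clang.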

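2. libclang!

This tool uses Clang's official Python bindings to implement its C++ parsing. To demonstrate how Clang parses an input C++ header file, we created the simplest possible example header, shown below.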

#include <stdint.h>
#include <array>
#include <vector>
#include <string>
#include <map>

enum ErrorCode
{
     ErrorCode_OK = 0,
     ErrorCode_CONTROL_ERROR = 1000
};
namespace user1
{
namespace autogen_test
{
typedef std::array<int, 10>  Array10;
struct PointCloud
{
    std::string label;
    std::vector<double> point2;  
    std::vector<uint64_t> point4;
    std::vector<std::string> point3;
    std::array<double, 80000> point;
    std::array<uint64_t, 70000> point1;
    std::array<std::array<uint64_t, 3>, 100> drops;
    std::map<std::string, double> properties2;
};
}
}

Parsing gives us an abstract syntax tree (AST) that we can traverse and inspect.

3. A basic libclang parsing example (useful for debugging)

import clang.cindex
from clang.cindex import Config      # library configuration
from clang.cindex import Index       # main entry point of the API
from clang.cindex import CursorKind  # syntactic kind of an AST node (cursor)
from clang.cindex import TypeKind    # semantic kind of a node's type

# clang.cindex needs the libclang.so shared library, so configure its path first.
Config.set_library_file('/usr/lib/llvm-10/lib/libclang.so')

file_path = r"/home/root01/Xway_os/xway_os_ide_toolchain-1/utilities/lzl/hello.h"
# file_path = r"/home/root01/Xway_os/xway_os_ide_toolchain-1/utilities/lzl/hello.cpp"

# Inspect the abstract syntax tree (AST) of the header and use the information
# obtained to generate binding code. The script traverses the AST through cursors.
# A cursor points at one node of the AST and tells us what kind of node it is
# (for example a class/struct definition), what its children are (for example the
# member methods of the class/struct), and so on. The first cursor points at the
# root of the translation unit, that is, the file being parsed.
index = Index.create(excludeDecls=True)
params = ['-x', 'c++', '-std=c++14', '-D__CODE_GENERATOR__']
tu = index.parse(file_path, params,
                 options=clang.cindex.TranslationUnit.PARSE_SKIP_FUNCTION_BODIES)
AST_root_node = tu.cursor  # root cursor of the translation unit


def cpp_to_xway_type(type_node):
    # Placeholder: type conversion is not implemented in this example.
    pass


def filtrate(cursor):
    """Return the first top-level node declared in file_path or under include_dirs.

    Only the first match is returned; None is returned if nothing matches.
    """
    include_dirs = ['../external_type/']
    for node in cursor.get_children():
        location_file = str(node.location.file)
        if location_file == file_path or any(location_file.startswith(d) for d in include_dirs):
            return node
    return None


def preorder_travers_AST(cursor, level):
    """Pre-order traversal that prints each node's child count and spelling."""
    children = list(cursor.get_children())
    print(len(children), cursor.spelling)
    for node in children:
        # For a std::array field, the print below shows the element type and the
        # array length (total size divided by element size):
        # print(level + node.spelling, '(', node.type.get_template_argument_type(0).spelling, ")",
        #       "array_size:", int(node.type.get_size() / node.type.get_template_argument_type(0).get_size()),
        #       '(', node.type.get_template_argument_type(1).spelling, ")")
        cpp_to_xway_type(node.type)
        preorder_travers_AST(node, level + "├─")


first_node = filtrate(AST_root_node)
if first_node is None:
    print("no node found in", file_path)
else:
    print(first_node.spelling)
    preorder_travers_AST(first_node, level="└─")
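
The commented-out print in the traversal above derives the length of a std::array field by dividing the size of the whole type by the size of its element type. Here is a minimal sketch of that idea, assuming the field type is a std::array<T, N> with N > 0, that sizeof(std::array<T, N>) equals N * sizeof(T) (true of the common standard library implementations), and that libclang can compute both sizes. `std_array_length` is a helper name introduced here, and `FakeType` is a stand-in so the snippet runs without libclang; with real libclang you would pass the `node.type` of a field cursor.

```python
class FakeType:
    """Stand-in exposing the two clang.cindex.Type methods the helper uses."""

    def __init__(self, size, template_args=()):
        self._size = size
        self._template_args = list(template_args)

    def get_size(self):
        return self._size

    def get_template_argument_type(self, index):
        return self._template_args[index]


def std_array_length(array_type):
    """Return N for a std::array<T, N> type, or None if a size is unavailable."""
    element_type = array_type.get_template_argument_type(0)
    total_size = array_type.get_size()
    element_size = element_type.get_size()
    # libclang reports errors (e.g. incomplete or dependent types) as negative sizes.
    if total_size <= 0 or element_size <= 0:
        return None
    return total_size // element_size


# Mimics std::array<double, 80000>: 8-byte elements, 640000 bytes in total.
double_type = FakeType(8)
point_type = FakeType(8 * 80000, [double_type])
print(std_array_length(point_type))
```

The division is used because the second template argument of std::array is a value, not a type, so it cannot be read back as a type argument.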

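In libclang terminology, pointers into the AST are called cursors. A cursor can have a parent cursor and child cursors. It can also have related cursors (such as the default value of a parameter or the explicit value of an enumerator). The "entry point" cursor we will use is the one that represents the translation unit (TU); TU is a C++ term for a single file together with all the code it #includes. To get the TU's cursor we use the very descriptively named clang_getTranslationUnitCursor() (in the Python code above, this is tu.cursor). Once we have a cursor, we can inspect it or iterate over it. Every cursor has a Kind, which expresses what the cursor is. The Kind can be one of many options, defined by the CXCursorKind enumeration. Here are a few examples: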

/** \brief A C or C++ struct. */
CXCursor_StructDecl = 2,
/** \brief A C or C++ union. */
CXCursor_UnionDecl = 3,
/** \brief A C++ class. */
CXCursor_ClassDecl = 4,
/** \brief An enumeration. */
CXCursor_EnumDecl = 5,
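
In the Python bindings these kinds appear as members of CursorKind, for example CursorKind.STRUCT_DECL and CursorKind.FIELD_DECL. The sketch below collects the fields of every struct by checking each cursor's kind. It compares `cursor.kind.name` as a string so that it also runs on the hand-built stand-in nodes used here: `FakeCursor` is hypothetical and only mimics the `kind`, `spelling`, and `get_children()` members of a real cursor, and `struct_fields` is a helper name introduced for this example. With libclang you could pass `tu.cursor` instead and compare against CursorKind members directly.

```python
from types import SimpleNamespace


class FakeCursor:
    """Stand-in mimicking the parts of clang.cindex.Cursor used below."""

    def __init__(self, kind_name, spelling, children=()):
        self.kind = SimpleNamespace(name=kind_name)
        self.spelling = spelling
        self._children = list(children)

    def get_children(self):
        return iter(self._children)


def struct_fields(cursor, result=None):
    """Map each struct name to the list of its field names (recursive walk)."""
    if result is None:
        result = {}
    if cursor.kind.name == "STRUCT_DECL":
        result[cursor.spelling] = [
            child.spelling
            for child in cursor.get_children()
            if child.kind.name == "FIELD_DECL"
        ]
    for child in cursor.get_children():
        struct_fields(child, result)
    return result


# Hand-built tree loosely mirroring the example header from section 2.
root = FakeCursor("TRANSLATION_UNIT", "hello.h", [
    FakeCursor("ENUM_DECL", "ErrorCode"),
    FakeCursor("NAMESPACE", "user1", [
        FakeCursor("NAMESPACE", "autogen_test", [
            FakeCursor("STRUCT_DECL", "PointCloud", [
                FakeCursor("FIELD_DECL", "label"),
                FakeCursor("FIELD_DECL", "point2"),
            ]),
        ]),
    ]),
])

print(struct_fields(root))
```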

4. A classic example (taken from Stack Overflow)

How do we parse the following C++ code?

#include <vector>

struct outer_t
{
    std::vector<int> vec_of_ints;
};

We need the contents of each node and of node.type for this code. How do we get them?

import sys
import clang.cindex
clang.cindex.Config.set_library_file("/usr/lib/llvm-6.0/lib/libclang.so.1")

class Walker:
    def __init__(self, filename):
        self.filename = filename

    def walk(self, node):
        node_in_file =  bool(node.location.file and node.location.file.name == self.filename)
        if node_in_file:
            print(f"node.spelling = {node.spelling:14}, node.kind = {node.kind}")
            # -------- BEGIN modified section --------
            # Types are reached through node.type; they are not child nodes.
            node_type = node.type  # renamed to avoid shadowing the built-in type()
            if node_type is not None:
                ntargs = node_type.get_num_template_arguments()
                if ntargs > 0:
                    print(f"  type.spelling = {node_type.spelling}")
                    print(f"  type.get_num_template_arguments = {ntargs}")
            # -------- END modified section --------
        for child in node.get_children():
            self.walk(child)

filename = sys.argv[1]
index = clang.cindex.Index.create()
translation_unit = index.parse(filename)

root = translation_unit.cursor
walker = Walker(filename)
walker.walk(root)

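This does not handle every template case that can appear in code. I found the above through trial and error and by reading the source of the clang/cindex.py library module. In any case, an important thing to understand about the Clang AST (and almost every C/C++ AST) is that types are not nodes in the main syntax tree. Instead, a type is a semantic interpretation of certain nodes in that tree, so it sits somewhat off to the side. That is why types do not show up as nodes in the walk.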
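
Because types hang off nodes rather than living in the tree, nested template arguments have to be reached through the type object itself. Here is a minimal sketch: `describe_type` is a helper name introduced here, and `FakeType` is a stand-in mimicking the `spelling`, `kind`, `get_num_template_arguments()`, and `get_template_argument_type()` members of clang.cindex.Type so that the snippet runs without libclang. It assumes, as far as I can tell from libclang's behaviour, that a non-template type reports a non-positive argument count and that a non-type argument (such as the 3 in std::array<uint64_t, 3>) comes back as a type whose kind is INVALID.

```python
from types import SimpleNamespace


class FakeType:
    """Stand-in mimicking the parts of clang.cindex.Type used below."""

    def __init__(self, spelling, template_args=(), kind_name="RECORD"):
        self.spelling = spelling
        self.kind = SimpleNamespace(name=kind_name)
        self._template_args = list(template_args)

    def get_num_template_arguments(self):
        # Mimic libclang: -1 when the type is not a template specialization.
        return len(self._template_args) if self._template_args else -1

    def get_template_argument_type(self, index):
        return self._template_args[index]


def describe_type(type_obj, depth=0):
    """Return indented lines describing a type and its template type arguments."""
    lines = ["  " * depth + type_obj.spelling]
    count = type_obj.get_num_template_arguments()
    for i in range(max(count, 0)):
        arg = type_obj.get_template_argument_type(i)
        if arg.kind.name == "INVALID":
            # Value arguments (array lengths, etc.) are not types.
            lines.append("  " * (depth + 1) + "<non-type argument>")
        else:
            lines.extend(describe_type(arg, depth + 1))
    return lines


# Mimics the type of the `drops` field from the header in section 2.
u64 = FakeType("uint64_t", kind_name="TYPEDEF")
non_type = FakeType("", kind_name="INVALID")
inner = FakeType("std::array<uint64_t, 3>", [u64, non_type])
drops = FakeType("std::array<std::array<uint64_t, 3>, 100>", [inner, non_type])

print("\n".join(describe_type(drops)))
```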

When execution is paused at some point in a debugger, the following expressions can be evaluated to inspect a node:

node.kind

node.type.kind

node.spelling

node.type.spelling

[i.spelling for i in cursor.get_children()]

cursor.type.get_num_template_arguments()

node.type.get_num_template_arguments()

node.type.get_template_argument_type(0)

node.type.get_template_argument_type(1).get_size()

node.type.get_template_argument_type(0).get_template_argument_type(0)

The output is:

node.spelling = outer_t       , node.kind = CursorKind.STRUCT_DECL
node.spelling = vec_of_ints   , node.kind = CursorKind.FIELD_DECL
  type.spelling = std::vector<int>
  type.get_num_template_arguments = 1
node.spelling = std           , node.kind = CursorKind.NAMESPACE_REF
node.spelling = vector        , node.kind = CursorKind.TEMPLATE_REF