Clang Static Analyzer: A Summary

Latest recommended article published on 2025-04-24 05:59:35

电影旅行敲代码

Category: LLVM源码系列 (LLVM source code series)

Article link: https://blog.csdn.net/dashuniuniu/article/details/103669441

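This article takes a close look at how the Clang Static Analyzer (CSA) works, including the stages at which code can be analyzed (AST, IR, binary, and program runtime). It focuses on the particular strengths and limits of CSA working at the AST stage, contrasts it with Clang-Tidy, examines how symbolic execution, abstract interpretation, and dataflow analysis relate to CSA, and touches on how complex C++ language features are handled.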

Summary

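Note: I have not worked with CSA for a while and many things have changed, but the essentials have not moved much, so this is a second look. I will not go into fine detail here. Digging too deep into details makes you "overfit" and lose the ability to generalize to other knowledge.

What I list here holds nothing especially valuable for other people: the source code I can read, others can read too, and the papers I can find, others can find too. The only value of these notes is that they help me organize my own thinking.

Code analysis can be performed at any of the following stages, from source code to the running program: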

AST
IR
Binary
Program runtime

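Clang Static Analyzer sits at the AST stage, where all high-level semantic information is still available. In C++, for example, a lambda expression is essentially an anonymous class, and by the time it reaches LLVM IR its lambda semantics have been stripped away, so rules that depend on those semantics become hard to check. Analyzing at a later stage has its own benefits, though: the tool does not have to keep up with new language features and has fewer cases to handle. At program runtime, tools such as sanitizers obtain more precise information, but bugs are found later and the program itself is affected. It is like OLLVM, which obfuscates code but unavoidably costs performance.

Although CSA is built on the AST, it still works at a lower level than clang-tidy. clang-tidy uses AST matchers to match specific bug patterns, much like Semmle QL, except that Semmle QL targets multiple programming languages and is friendlier to users, whereas clang-tidy requires some knowledge of Clang.

From an implementation point of view, any tool built on the Clang AST inevitably involves ASTConsumer and its family of Handle*Decl() methods.

clang/lib/StaticAnalyzer/README.txt gives a fairly high-level introduction to CSA, for example this sentence: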

In a nutshell, the analyzer is basically a source code simulator that traces out possible paths of execution.

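A traditional virtual machine uses one big switch to evaluate instructions while updating the state it maintains. Clang Static Analyzer instead traverses the call graph plus the CFG (built on top of the AST) in topological order, uses a visitor to process each CFGElement, and likewise updates the state it maintains. The only difference is that a virtual machine is given its input, so the path is unique, whereas CSA has no given input and therefore uses a worklist to explore multiple different paths, with the ability to backtrack. Like most virtual machines, CSA is a textbook case of operational semantics: a store, an environment, and an evaluator. So the five keys to understanding CSA are RegionStore, Environment, ExprEngine, Assume*, and Checker.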
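The worklist idea can be sketched in a few lines. This is only a toy model under stated assumptions, not CSA's actual data structures: `Block`, `WorkItem`, and `explorePaths` are hypothetical names, conditions are plain strings, the CFG must be acyclic, and no feasibility check is done.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical toy CFG block: an optional branch condition plus successors.
// A successor index of -1 means "function exit".
struct Block {
  std::string cond;  // empty: unconditional edge to onTrue
  int onTrue;
  int onFalse;       // only used when cond is non-empty
};

// One pending unit of work: a program point plus the assumptions on the path.
struct WorkItem {
  int block;
  std::vector<std::string> assumptions;
};

// Explores every path of an acyclic toy CFG and returns the assumptions
// collected along each finished path. No feasibility check is performed.
std::vector<std::vector<std::string>>
explorePaths(const std::vector<Block> &cfg) {
  std::vector<std::vector<std::string>> finished;
  std::deque<WorkItem> worklist;
  worklist.push_back(WorkItem{0, {}});
  while (!worklist.empty()) {
    WorkItem item = worklist.front();
    worklist.pop_front();
    if (item.block < 0) {  // reached the exit: this path is complete
      finished.push_back(item.assumptions);
      continue;
    }
    const Block &b = cfg[item.block];
    if (b.cond.empty()) {
      worklist.push_back(WorkItem{b.onTrue, item.assumptions});
      continue;
    }
    // No concrete input decides the branch, so fork into both successors.
    WorkItem taken{b.onTrue, item.assumptions};
    taken.assumptions.push_back(b.cond);
    WorkItem notTaken{b.onFalse, item.assumptions};
    notTaken.assumptions.push_back("!(" + b.cond + ")");
    worklist.push_back(taken);
    worklist.push_back(notTaken);
  }
  return finished;
}
```

For a CFG with two consecutive branches, `explorePaths` returns four assumption lists, one per path. A virtual machine with concrete input would follow exactly one of them.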

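There are of course many more details, for example:

ExplodedGraph?
How are checkers registered?
How is IPA (inter-procedural analysis) handled?
And so on. The rest is all about handling the many complicated features of C++, especially newer language features, which need very careful treatment.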

Background (not directly related)

AnalysisConsumer

AnalysisConsumer is the interface that does the real work on the Clang AST. Depending on the situation, an ASTFrontendAction may correspond to one or more AnalysisConsumers.

RecursiveASTVisitor & StmtVisitor

RecursiveASTVisitor is the top-level tool for traversing the Clang AST. It can also handle statement-level nodes, but for those it is not as convenient as StmtVisitor.

Symbolic execution

wiki - symbolic execution

Here is Wikipedia's description of symbolic execution:

Symbolic execution(also symbolic evaluation) is a means of analyzing a program to determine what inputs cause each part of program to execute. An interpreter follows the program, assuming symbolic values for inputs rather than obtaining actual inputs as normal execution of the program would. It thus arrives at expressions in terms of those symbols for expressions and variables in the program, and constraints in terms of those symbols for the possible outcomes of each conditional branch.

Limitations:

Path explosion

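Executing every path symbolically does not scale to large programs
Infinite loops may be encountered
Ways to mitigate path explosion: heuristically prioritizing certain paths, parallel execution, and merging identical paths (CSA currently uses only the last one)

Program-dependent efficiency: the advantage of this path-by-path analysis over input-by-input dynamic analysis is not stable and depends on the program. (My understanding may be inaccurate: a program's inputs may be so concentrated that it spends 99% of its time on one particular path, yet the path-by-path approach spends the same effort on the 99% path as on the 1% path.)

Memory aliasing: it is hard for symbolic execution to identify aliases statically. In severe cases, for example when some memory is modified externally and we know nothing about it, we can only invalidate all memory locations.

Arrays: how to model them

Environment interactions: this is a limitation of every static analysis tool, and it concerns system calls and other library calls. In the code below, execution forks into two paths at the if (condition) branch. When each path reaches the fgets() call, a naive analysis would read the same file contents that were there before the branch, even though the fputs() call has already written new data.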

int main()
{
  FILE *fp = fopen("doc.txt", "r+");  /* fopen() requires a mode; "r+" is assumed */
  ...
  if (condition) {
    fputs("some data", fp);
  } else {
    fputs("some other data", fp);
  }
  ...
  data = fgets(..., fp);
}

There are three ways to address this:

Executing calls to the environment directly
Modeling the environment
Forking the entire system state
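Going back to path explosion, the "merge identical paths" mitigation can be illustrated with a toy experiment. This is a sketch under assumptions, not CSA code: `State` and `explore` are hypothetical, a state is just a (program counter, value of x) pair, and both arms of every branch assign the same value so that successor states coincide.

```cpp
#include <cstddef>
#include <deque>
#include <set>
#include <utility>

// Hypothetical toy state: (program counter, value of the single variable x).
using State = std::pair<int, int>;

// Explores `steps` consecutive two-way branches. Both arms of every branch
// execute "x = 1", so their successor states are identical.
// Returns how many states were taken off the worklist and processed.
std::size_t explore(int steps, bool mergeIdentical) {
  std::deque<State> worklist;
  std::set<State> seen;
  worklist.push_back(State(0, 0));
  seen.insert(State(0, 0));
  std::size_t processed = 0;
  while (!worklist.empty()) {
    State s = worklist.front();
    worklist.pop_front();
    ++processed;
    if (s.first == steps)
      continue;  // reached the exit
    for (int arm = 0; arm < 2; ++arm) {
      State next(s.first + 1, 1);  // either arm: x = 1
      // With merging, a state that was already generated is not re-explored.
      if (mergeIdentical && !seen.insert(next).second)
        continue;
      worklist.push_back(next);
    }
  }
  return processed;
}
```

With ten branches, the unmerged exploration processes a full binary tree of states, while the merged one processes one state per program point. Merging only helps when states really are identical, which is why it mitigates path explosion rather than removing it.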

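Abstract interpretation

(Self-taught notes, loosely written.)
I did some basic study of abstract interpretation in 《#64 Abstract Interpretation: Introduction & #66 Galois Connections - 课程笔记》 (course notes). Here again is Wikipedia's introduction to abstract interpretation (kudos to Wikipedia for the write-up).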

wiki - Abstract Interpretation

Abstract interpretation is a theory of sound approximation of the semantics of computer programs, based on monotonic functions over ordered sets, especially lattices. It can be viewed as a partial execution of a computer program which gains information about its semantics without performing all the calculations.

Program semantics are generally described using fixed points in the presence of loops or recursive procedures.

Widening and Narrowing Operators for Abstract Interpretation

Wikipedia's description of abstract interpretation is fairly high level; this paper gives a more concrete one.

This theory is based on two main key-concepts: the correspondence between concrete and abstract semantics through Galois connections/insertions, and the feasibility of a fixed point computation of the abstract semantics, through, the combination of widening operators (to get fast convergence) and narrowing operators (to improve the accuracy of the resulting analysis).

Galois connections/insertions link the concrete semantics to the abstract semantics
Fixed point computation
Widening ensures fast convergence
Narrowing improves precision
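The last three points can be made concrete with an interval analysis of a small loop. This is a minimal sketch under assumptions, not taken from the paper: the analyzed loop is a hypothetical `i = 0; while (i < 100) i = i + 1;`, the abstract domain is a plain `[lo, hi]` interval, and `widen`/`narrow` are the textbook interval operators.

```cpp
#include <algorithm>
#include <limits>

// Hypothetical abstract value: the interval [lo, hi] of possible values of i.
struct Interval {
  long lo, hi;
};

const long kInf = std::numeric_limits<long>::max();

Interval join(Interval a, Interval b) {
  return {std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Widening: a bound that is still growing jumps straight to infinity.
Interval widen(Interval old, Interval next) {
  return {next.lo < old.lo ? -kInf : old.lo, next.hi > old.hi ? kInf : old.hi};
}

// Narrowing: an infinite bound is replaced by the recomputed finite bound.
Interval narrow(Interval old, Interval next) {
  return {old.lo == -kInf ? next.lo : old.lo, old.hi == kInf ? next.hi : old.hi};
}

// Abstract transfer function for one trip around
//   i = 0; while (i < 100) i = i + 1;
// Takes the candidate invariant at the loop head and returns the next one.
Interval step(Interval head) {
  Interval body{head.lo, std::min(head.hi, 99L)};  // filter by i < 100
  Interval after{body.lo + 1, body.hi + 1};        // i = i + 1
  return join(Interval{0, 0}, after);              // merge with loop entry
}

// Fixed point computation: ascend with widening, then one narrowing step.
Interval analyzeLoopHead() {
  Interval x{0, 0};
  while (true) {
    Interval next = widen(x, step(x));
    if (next.lo == x.lo && next.hi == x.hi)
      break;  // post-fixed point reached
    x = next;
  }
  return narrow(x, step(x));
}
```

Without widening, the ascending iteration would need about a hundred rounds to stabilize. With widening it stabilizes at `[0, +inf]` after two rounds, and the narrowing step recovers the upper bound 100.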

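Personally, I think the step from concrete set to abstract set in abstract interpretation is a bit like a linear transformation or projection in linear algebra: high-dimensional data is reduced to a lower dimension, discarding the noise we do not care about.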

Symbolic Execution vs Abstract Interpretation

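We know that dataflow analysis is one application, one instance, of abstract interpretation. So is there any relationship between symbolic execution and abstract interpretation? Fundamentally, not much. Even so, it is worth thinking about how the two differ.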

Symbolic Execution is a case of Abstract Interpretation?

The main idea of symbolic execution is that, at an arbitrary point in execution, you can express the values of all variables as functions of the initial values.

The main idea of abstract interpretation is that you can systematically explore all executions of a program by a series of over-approximations.

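Although the two are different overall, they share the same ideas in a few places:

abstract states
joining and widening

That abstract interpretation uses abstract states goes without saying. Symbolic execution, too, makes trade-offs about statement semantics when it evaluates program statements, and that also counts as using abstract states.

Abstract interpretation merges abstract states at certain program points, so an abstract-interpretation run forms a DAG, while symbolic execution forms a tree. Some symbolic execution tools, CSA among them, merge identical states, so the result is also a DAG in the end, hence the ExplodedGraph.

Widening is the usual way abstract interpretation deals with infinite lattices, and it also makes the iteration converge quickly. The core idea is to reduce the "precision" of the abstract state so that, while remaining sound, the result stays as close as possible to the least fixed point. Symbolic execution has some simple widening-like ideas too, such as loop widening in CSA; see "Improved Loop Execution Modeling in the Clang Static Analyzer".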

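Dataflow analysis

Dataflow analysis actually has little to do with Clang Static Analyzer: one follows the abstract interpretation route and the other simulates execution; one computes a fixed point and the other runs each path to completion. It is still worth knowing, though.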

ExprEngine

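As the figure shows, ExprEngine is the core of path-sensitive analysis. Control reaches ExprEngine through HandleTranslationUnit() -> runAnalysisOnTranslationUnit() -> HandleDeclsCallGraph() -> ExecuteWorkList(). The data members in the green box of the figure are the core of ExprEngine: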

CoreEngine
ProgramStateManager
ExplodedGraph
SValBuilder
BugReporter
MemRegionManager
ConstraintManager

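ExprEngine's member functions fall into three groups:

The first group is process*(). These mainly handle data at the CFGElement level, such as branches, loops, and destructors and constructors, because some things have to be handled at this level rather than in the transfer functions for AST expressions one level down. This is roughly analogous to the control flow that a dataflow analysis has to handle.
The second group is Visit*(). These methods are the actual transfer functions. For a BinaryOperator, for example, we evaluate the operands and then apply the operator (an addition, say).
The third group is ExprEngine's helper methods, such as removeDead(), which have nothing to do with evaluation itself.

ExprEngine is the concrete analyzer, while the CoreEngine layer in the middle can be seen as a framework into which different analyzer engines can be plugged.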

CoreEngine

CoreEngine is an intermediate layer.

// CoreEngine.h
//
// This file defines a generic engine for intraprocedural, path-sensitive,
// dataflow analysis via graph reachability.

//===----------------------------------------------------------------------===//
// CoreEngine - Implements the core logic of the graph-reachability
// analysis. It traverses the CFG and generates the ExplodedGraph.
// Program "states" are treated as opaque void pointers.
//
// Note this engine only dispatches to transfer functions
// at the statement and block-level. The analyses themselves must implement
// any transfer function logic and sub-expression level (if any).
class CoreEngine{/**/};

ProgramStateManager

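ProgramStateManager manages program states. At its core are the mappings Stmt->SVal, Loc->SVal, and SVal->Constraints. For details on SVal, see "Clang Static Analyzer内存模型（二）.i：MemRegion与SVal" (Clang Static Analyzer memory model, part 2: MemRegion and SVal).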

// ProgramState - This class encapsulates:
// 
//  1. A mapping from expression to values (Environment)
//  2. A mapping from locations to values (Store)
//  3. Constraints on symbolic values (GenericDataMap)
//
// Together these represent the "abstract state" of a program.
// 
// ProgramState is intended to be used as a functional object; that is,
// once it is created and made "persistent" in a FoldingSet, its
// values will never change.
class ProgramState : public llvm::FoldingSetNode {
  //==---------------------------------------------------------------------==//
  // Constraints on values.
  //==---------------------------------------------------------------------==//
  //
  // Each ProgramState records constraints on symbolic values.  These constraints
  // are managed using the ConstraintManager associated with a ProgramStateManager.
  // As constraints gradually accrue on symbolic values, added constraints
  // may conflict and indicate that a state is infeasible (as no real values
  // could satisfy all the constraints).  This is the principal mechanism
  // for modeling path-sensitivity in ExprEngine/ProgramState.
  //
  // Various "assume" methods form the interface for adding constraints to
  // symbolic values.  A call to 'assume' indicates an assumption being placed
  // on one or more symbolic values.  'assume' methods take the following inputs:
  //
  //  (1) A ProgramState object representing the current state.
  //
  //  (2) The assumed constraint (which is specific to a given "assume" method).
  //
  //  (3) A binary value "Assumption" that indicates whether the constraint is
  //      assumed to be true or false.

  // The output of "assume*" is a new ProgramState object with the added constraints.
  // If no new state is feasible, NULL is returned.
  ProgramStateRef assume*(/* ... */);  // pseudocode: stands for the family of assume methods

  //==---------------------------------------------------------------------==//
  // Binding and retrieving values to/from the environment and symbolic store.
  //==---------------------------------------------------------------------==//

  /// Create a new state by binding the value 'V' to the statement 'S' in the
  /// state's environment.
  ProgramStateRef BindExpr(const Stmt *S, /* ... */);
  ProgramStateRef bindLoc(Loc location, SVal V, /* ... */);
  ProgramStateRef bindLoc(SVal location, SVal V, /* ... */);
};
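The "functional object" remark in the header comment is worth a small illustration: binding a value never changes an existing state, it produces a new one. The sketch below is a toy under assumptions, not the real class: `ToyState` keys locations by name, stores plain ints, and copies a `std::map` on every bind, which is only acceptable in a toy.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

class ToyState;
using ToyStateRef = std::shared_ptr<const ToyState>;

// Hypothetical immutable state: a map from location names to values.
class ToyState {
  std::map<std::string, int> store_;

public:
  ToyState() = default;
  explicit ToyState(std::map<std::string, int> s) : store_(std::move(s)) {}

  // Returns a NEW state with the extra binding; *this is never modified.
  ToyStateRef bindLoc(const std::string &loc, int value) const {
    std::map<std::string, int> copy = store_;
    copy[loc] = value;
    return std::make_shared<const ToyState>(std::move(copy));
  }

  // Returns the bound value, or `fallback` when loc has no binding.
  int getSVal(const std::string &loc, int fallback) const {
    auto it = store_.find(loc);
    return it == store_.end() ? fallback : it->second;
  }
};
```

Because old states stay valid after a bind, every node on every explored path can keep its own state, and two paths that fork from one node can both start from the same parent state without interfering with each other.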

ExplodedGraph

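The ExplodedGraph concept comes from the IFDS paper. It records the state of the whole symbolic execution and plays an important role in BugReporter, where it is used to "replay" or "tailor" the path of a bug report.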

//===---------------------------------------------------------------------===//
// 
// This file defines the template classes ExplodedNode and ExplodedGraph, 
// which represent a path-sensitive, intra-procedural "exploded graph."
// See "Precise interprocedural dataflow analysis via graph reachability" by Reps, Horwitz, and Sagiv for the definition of an exploded graph.
class ExplodedGraph {/**/};

SValBuilder

SValBuilder is the lowest-level, finest-grained object that evaluates program statements. Its member functions make clear what it is for.

//===---------------------------------------------------------------------===//
//
// This file defines SValBuilder, a class that defines the interface for
// "symbolic evaluators" which construct an SVal from an expression.
//===---------------------------------------------------------------------===//
class SValBuilder {
	SVal evalCast(SVal val, QualType castTy, QualType originalType);

	// Handles casts of type CK_IntegralCast.
	SVal evalIntegralCast(ProgramStateRef state, SVal val, QualType castTy,
						QualType originalType);
	
	virtual SVal evalMinus(NonLoc val) = 0;
	virtual SVal evalComplement(NonLoc val) = 0;

	// Create a new value which represents a binary expression with two non-
	// location operands.
	virtual SVal evalBinOpNN(ProgramStateRef state, BinaryOperator::Opcode op,
							NonLoc lhs, NonLoc rhs, QualType resultTy) = 0;
	// ...

	SVal evalBinOp(ProgramStateRef state, BinaryOperator::Opcode op,
					SVal lhs, SVal rhs, QualType type);

	SVal evalEQ(ProgramStateRef state, SVal lhs, SVal rhs);

	// ...
	
	// make SVals
	// make*(): pseudocode for the family of factory methods, e.g. makeIntVal()
};

class SimpleSValBuilder : public SValBuilder {/**/};

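Below SValBuilder there is an even lower-level BasicValueFactory, which supports simple arithmetic such as addition, subtraction, multiplication, division, and bitwise operations.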
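The division of labor can be pictured roughly like this: when both operands are concrete, the result is folded with ordinary arithmetic; otherwise a symbolic expression is built. The sketch is a toy under assumptions, not the real API: `ToySVal`, `makeInt`, `makeSym`, and this `evalBinOp` are hypothetical, and symbolic expressions are kept as plain strings.

```cpp
#include <string>

// Hypothetical toy value: either a concrete integer or a symbolic expression.
struct ToySVal {
  bool concrete;
  long value;        // meaningful only when concrete
  std::string expr;  // meaningful only when symbolic
};

ToySVal makeInt(long v) { return {true, v, ""}; }
ToySVal makeSym(const std::string &name) { return {false, 0, name}; }

std::string show(const ToySVal &v) {
  return v.concrete ? std::to_string(v.value) : v.expr;
}

// Folds concrete operands with plain arithmetic; anything else (including an
// operator this toy does not know) becomes a new symbolic expression.
ToySVal evalBinOp(char op, const ToySVal &lhs, const ToySVal &rhs) {
  if (lhs.concrete && rhs.concrete) {
    switch (op) {
    case '+': return makeInt(lhs.value + rhs.value);
    case '-': return makeInt(lhs.value - rhs.value);
    case '*': return makeInt(lhs.value * rhs.value);
    default: break;
    }
  }
  return makeSym("(" + show(lhs) + " " + op + " " + show(rhs) + ")");
}
```

For instance, `evalBinOp('*', makeInt(2), makeInt(3))` folds to a concrete value, while `evalBinOp('*', makeInt(2), makeSym("reg_b"))` stays symbolic, much like the `2 * b` entry in the environment dump later in this article.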

ConstraintManager

Strictly speaking, ConstraintManager is not a data member of ExprEngine; the constraints for each symbolic value are recorded in the ProgramState. Take the following code:

int divzero(int a, int b) {
	if (a > 2 * b) {
		return a / 0;
	}
	return a * b;
}

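The CFG is shown in the figure, and constraint solving is implemented by assume*. ExprEngine::evalEagerlyAssumeBinOpBifurcation() runs inside ExprEngine::Visit(), after the value of a > 2 * b has been evaluated.
[Figure: CFG of divzero()]
The logic of ExprEngine::evalEagerlyAssumeBinOpBifurcation() is shown below. Calling assume* explicitly is also very common in checkers, where checking StateTrue and StateFalse tells you whether a bug is triggered.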

void ExprEngine::evalEagerlyAssumeBinOpBifurcation(ExplodedNodeSet &Dst,
                                                   ExplodedNodeSet &Src,
                                                   const Expr *Ex) {
  // ... (the loop over the predecessor nodes Pred in Src is elided)
  SVal V = state->getSVal(Ex, Pred->getLocationContext());
  // ... (SEV is V viewed as a symbolic value; its declaration is elided)
  ProgramStateRef StateTrue, StateFalse;
  std::tie(StateTrue, StateFalse) = state->assume(*SEV);

  // First assume that the condition is true.
  if (StateTrue) {
    SVal Val = svalBuilder.makeIntVal(1U, Ex->getType());
    StateTrue = StateTrue->BindExpr(Ex, Pred->getLocationContext(), Val);
    Bldr.generateNode(Ex, Pred, StateTrue, tags.first);
  }

  // Next, assume that the condition is false.
  if (StateFalse) {
    SVal Val = svalBuilder.makeIntVal(0U, Ex->getType());
    StateFalse = StateFalse->BindExpr(Ex, Pred->getLocationContext(), Val);
    Bldr.generateNode(Ex, Pred, StateFalse, tags.second);
  }
}

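The actual constraint solving is implemented in RangedConstraintManager. In the end it comes down to checking whether the ranges of a symbolic value intersect. Because [MIN, MAX] differs from type to type, the range-based constraint solver is closely tied to data types.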
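The intersection idea can be sketched as follows. This is a toy under assumptions, not RangedConstraintManager itself: each symbol has a single `[lo, hi]` range of type `int`, and `Range`, `intersect`, and `assumeGreater` are hypothetical names.

```cpp
#include <algorithm>
#include <climits>
#include <utility>

// Hypothetical constraint on one int-typed symbol: lo <= sym <= hi.
struct Range {
  int lo, hi;
  bool empty() const { return lo > hi; }  // empty range: infeasible state
};

Range intersect(Range a, Range b) {
  return {std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
}

// Splits `current` on the condition "sym > bound".
// first  = range of sym if the condition holds,
// second = range of sym if it does not. An empty range means infeasible.
std::pair<Range, Range> assumeGreater(Range current, int bound) {
  Range ifTrue = bound == INT_MAX
                     ? Range{1, 0}  // nothing is greater than INT_MAX
                     : intersect(current, {bound + 1, INT_MAX});
  Range ifFalse = intersect(current, {INT_MIN, bound});
  return {ifTrue, ifFalse};
}
```

Starting from the full `int` range, assuming `x > 10` yields two feasible ranges. On the true branch, a later `x > 5` has an empty false range, which plays the role of the null state returned by `assume`: that path is infeasible and is not explored. The `INT_MIN`/`INT_MAX` bounds are where the dependence on the data type shows up.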

Range Based Constraint Solver

This is CSA's default solver; everything is reduced to arithmetic and comparisons on simple types.

class Base {
public:
	int a;
	int b;
};

class Derived : public Base {
public:
	int c;
};

int main() {
	Derived *d = new Derived;
	Base *b = d;
	if (b != d){
		return 10 / 0; // unreachable code
	}
	return 0;
}

Z3

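Z3 has been merged into CSA to reduce the false positive rate; see "[analyzer] Improved cmake configuration for Z3" and "SMT-Based Refutation of Spurious Bug Reports in the Clang Static Analyzer".

Z3 does not replace the range-based constraint solver, however; the official explanation is that it is too slow. Because CSA keeps the ExplodedGraph and can reconstruct the path to a report, Z3 is currently used to make a second pass over the BugReports and filter out false positives.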

Environment

Using the same code again, we capture a ProgramState just before a > 2 * b is assumed.

int divzero(int a, int b) {
	if (a > 2 * b) {
		return a / 0;
	}
	return a * b;
}

[Figure: ProgramState dump before the assume]
Before assuming a > 2 * b, the values of the variables and expressions we need are stored in the environment, as shown below.

a : &a
b: &b
b : reg_$1<int b>
2 * b : reg_$2<int b> * 2
a : reg_$0<int a>

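CSA garbage-collects the environment, which is one of the more complicated parts. I described it in detail in 《clang static analyzer源码分析（番外篇）：removeDead() - SVal、Symbol及Environment》; the content is a bit dated, but the overall picture is still right.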

The basic algorithm is pretty simple, though:
(1) Find out which expressions and variables are still live(LiveVariables). This is cached, per-function, context-insensitive information.
(2) Ask checkers which symbols are known to be in use, though potentially not live (checkLiveSymbols).
(3) Mark live any values associated with live expressions in the Environment. Remove all other bindings.
(4) Mark live any values accessible via the live regions in the Store. Remove all other bindings.
(5) Remove any constraints on dead symbols.
(6) Report dead symbols to the checkers, so that they can stop tracking information dependent on those symbols (checkDeadSymbols).
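Copyright notice: the passage above is from an original article by CSDN blogger 「电影旅行敲代码」, licensed under CC 4.0 BY-SA; reproductions must include the original source link and this notice.
Original link: https://blog.csdn.net/dashuniuniu/article/details/53173045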
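Steps (2) to (6) of the quoted algorithm can be sketched as a small mark-and-sweep. This is a toy under assumptions, not the real removeDead(): `GcState` and this `removeDead` are hypothetical, every map is keyed by plain strings, liveness (step 1) is passed in as precomputed sets, the state is mutated in place although the real ProgramState is immutable, and reachability through the store is not followed transitively.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical flattened program state.
struct GcState {
  std::map<std::string, std::string> environment;  // expression -> symbol
  std::map<std::string, std::string> store;        // region -> symbol
  std::map<std::string, std::string> constraints;  // symbol -> constraint
};

// Drops bindings of dead keys from `m` and marks the symbols of live keys.
static void sweep(std::map<std::string, std::string> &m,
                  const std::set<std::string> &liveKeys,
                  std::set<std::string> &liveSyms) {
  for (auto it = m.begin(); it != m.end();) {
    if (liveKeys.count(it->first)) {
      liveSyms.insert(it->second);
      ++it;
    } else {
      it = m.erase(it);
    }
  }
}

// Returns the dead symbols whose constraints were removed; a real analyzer
// would report dead symbols to the checkers (step 6).
std::set<std::string> removeDead(GcState &st,
                                 const std::set<std::string> &liveExprs,
                                 const std::set<std::string> &liveRegions,
                                 const std::set<std::string> &checkerLive) {
  std::set<std::string> liveSyms = checkerLive;  // step 2
  sweep(st.environment, liveExprs, liveSyms);    // step 3
  sweep(st.store, liveRegions, liveSyms);        // step 4
  std::set<std::string> dead;
  for (auto it = st.constraints.begin(); it != st.constraints.end();) {
    if (liveSyms.count(it->first)) {
      ++it;
    } else {  // step 5: constraints on dead symbols go away
      dead.insert(it->first);
      it = st.constraints.erase(it);
    }
  }
  return dead;
}
```

The order matters: constraints can only be dropped after both the environment and the store have been scanned, because a symbol stays live as long as either of them (or a checker) still refers to it.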

RegionStore

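I consider RegionStore the most complex part of CSA. I do not remember some of it clearly (which means I never fully understood it at the time), so this is only a rough description of how RegionStore works. RegionStoreManager inherits from StoreManager, which acts as the interface and defines what a store manager must be able to do. This mirrors the SValBuilder/SimpleSValBuilder and ConstraintManager/SimpleConstraintManager pairs.

Like ProgramState, Store is a functional object, which is key to understanding CSA. For details, see "clang static analyzer中的数据结构及内存分配策略 - ImmutableMap & ImmutableSet篇" (data structures and memory allocation strategy in Clang Static Analyzer: ImmutableMap & ImmutableSet).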

class StoreManager {
	virtual SVal getBinding(Store store, Loc loc, QualType T = QualType()) = 0;
	virtual Optional<SVal> getDefaultBinding(Store store, const MemRegion *R) = 0;
	virtual StoreRef Bind(Store store, Loc loc, SVal val) = 0;
	virtual StoreRef BindDefaultZero(Store store, const MemRegion *R) = 0;
	virtual StoreRef killBinding(Store ST, Loc L) = 0;
	// ...
	SVal evalDerivedToBase(SVal Derived, const CastExpr *Cast);
	SVal attemptDownCast(SVal Base, QualType DerivedPtrType, bool &Failed);
	const ElementRegion *GetElementZeroRegion(const SubRegion *R, QualType T);
	// ...
	virtual StoreRef removeDeadBindings(Store store, const StackFrameContext *LCtx, 
										SymbolReaper &SymReaper) = 0;
	virtual StoreRef invalidateRegions(Store store, ArrayRef<SVal> Values,...);
};

RegionStoreManager has several fairly complicated areas:

invalidate regions
remove dead bindings
How aggregate types and fields are handled
How derived-to-base casts are handled, and so on
LazyCompoundVal

On the design of the Store, CSA has a document, clang/docs/analyzer/developer-docs/RegionStore.rst, that covers several key design points.

CrossTU

summary based

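A. Sidorin implemented a summary-based inter-procedural analysis in "Summary-based inter-unit analysis for Clang Static Analyzer". If I understand correctly, its summary is a program state that contains symbols. Take $sym op $sym: without constraint information we know nothing about it, so ordinarily all we can do is create a fresh $sym to stand for $sym op $sym. The summary-based approach avoids drawing conclusions too early: it records $sym op $sym and actualizes it once context information is available.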
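The "record now, actualize later" idea can be illustrated with a tiny expression tree. This is a sketch of my reading of the idea, not of Sidorin's implementation: `SymExpr`, `symbol`, `binop`, and `actualize` are hypothetical names, and the caller context is simplified to concrete values, whereas a real actualization may substitute other symbolic values.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct SymExpr;
using SymExprRef = std::shared_ptr<const SymExpr>;

// Hypothetical summary expression: a leaf symbol or a binary operation.
struct SymExpr {
  char op;          // '\0' marks a leaf symbol
  std::string sym;  // symbol name, used only by leaves
  SymExprRef lhs, rhs;
};

SymExprRef symbol(const std::string &name) {
  return std::make_shared<const SymExpr>(SymExpr{'\0', name, nullptr, nullptr});
}

SymExprRef binop(char op, SymExprRef lhs, SymExprRef rhs) {
  return std::make_shared<const SymExpr>(SymExpr{op, "", lhs, rhs});
}

// Actualization: substitute the values known at a call site for the symbols
// recorded in the summary, then evaluate the deferred expression.
long actualize(const SymExprRef &e, const std::map<std::string, long> &actuals) {
  if (e->op == '\0')
    return actuals.at(e->sym);  // throws std::out_of_range if unknown
  long l = actualize(e->lhs, actuals);
  long r = actualize(e->rhs, actuals);
  switch (e->op) {
  case '+': return l + r;
  case '-': return l - r;
  case '*': return l * r;
  default: throw std::invalid_argument("unsupported operator");
  }
}
```

A summary for a hypothetical callee returning `a * b + a` is built once as `binop('+', binop('*', a, b), a)` and can then be actualized at each call site with that site's arguments, instead of being replaced by an opaque fresh symbol when the callee is analyzed.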

Gábor Horváth

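The other approach is the one Gábor Horváth proposed in "[analyzer] Support for naive cross translational unit analysis" and "Cross Translation Unit (CTU) Analysis". It serializes each TU's AST; when a cross-TU call is encountered, it deserializes the needed AST and then evaluates the CallExpr.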