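CodeQL Study Notes
These notes are based on the official CodeQL documentation: https://codeql.github.com/docs
QL tutorial
QL is a logic programming language. Its syntax resembles SQL, but it is used somewhat differently. This section focuses on syntax; environment setup, database selection, and similar steps are omitted.
QL queries
A simple query usually follows this template: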
from /* ... variable declarations ... */
where /* ... logical formulas ... */
select /* ... expressions ... */
For example, the following query returns 42:
from int x, int y
where x = 6 and y = 7
select x * y
A query can also return multiple results. For example, the following query computes all Pythagorean triples whose values lie between 1 and 10:
from int x, int y, int z
where x in [1..10] and y in [1..10] and z in [1..10] and
x*x + y*y = z*z
select x, y, z
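For readers coming from imperative languages, it can help to read the query above as an exhaustive search. The C sketch below is only an analogy, not how the QL evaluator actually works; the names triple and pythagorean_triples are made up for illustration.

```c
#include <stddef.h>

/* One result row of the query: a tuple (x, y, z). */
struct triple { int x, y, z; };

/*
 * Enumerates every (x, y, z) with each value in [1..limit] and
 * x*x + y*y == z*z, mirroring the from/where/select clauses.
 * Stores at most `cap` rows in `out` and returns the total number found.
 */
size_t pythagorean_triples(int limit, struct triple *out, size_t cap)
{
    size_t found = 0;
    for (int x = 1; x <= limit; x++)            /* from int x, x in [1..limit] */
        for (int y = 1; y <= limit; y++)        /* from int y, y in [1..limit] */
            for (int z = 1; z <= limit; z++)    /* from int z, z in [1..limit] */
                if (x * x + y * y == z * z) {   /* where x*x + y*y = z*z */
                    if (found < cap)
                        out[found] = (struct triple){ x, y, z };  /* select x, y, z */
                    found++;
                }
    return found;
}
```

With limit set to 10 the search finds four rows: (3, 4, 5), (4, 3, 5), (6, 8, 10), and (8, 6, 10).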
Classes can simplify a query. The following example uses the class SmallInt to represent the integers from 1 to 10:
class SmallInt extends int {
  SmallInt() { this in [1..10] }
  int square() { result = this*this }
}
from SmallInt x, SmallInt y, SmallInt z
where x.square() + y.square() = z.square()
select x, y, z
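CodeQL query examples
The CodeQL libraries help find security vulnerabilities in a codebase. To import the CodeQL library for a particular programming language, add import <language> at the start of the query.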
import python
from Function f
where count(f.getAnArg()) > 7
select f
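The from clause declares the variable f, which ranges over Python functions; the where clause keeps only functions with more than 7 arguments; the select clause lists the matching functions.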
import javascript
from Comment c
where c.getText().regexpMatch("(?si).*\\bTODO\\b.*")
select c
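The from clause declares the variable c, which ranges over JavaScript comments; the where clause keeps only comments that contain "TODO"; the select clause lists the matching comments.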
import java
from Parameter p
where not exists(p.getAnAccess())
select p
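The from clause declares the variable p, which ranges over Java parameters; the where clause keeps only parameters that are never accessed; the select clause lists the matching parameters.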
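CodeQL queries
CodeQL queries analyze code for issues related to security, correctness, maintainability, and readability. There are two main kinds of query: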
- Alert queries: queries that highlight issues in specific locations in your code
- Path queries: queries that describe the flow of information between a source and a sink in your code.
Basic query structure
/**
*
* Query metadata
*
*/
import /* ... CodeQL libraries or modules ... */
/* ... Optional, define CodeQL classes and predicates ... */
from /* ... variable declarations ... */
where /* ... logical formula ... */
select /* ... expressions ... */
Query files written in CodeQL have the extension .ql and contain a select clause. This section mainly covers alert queries; for path queries, see Creating path queries.
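Query metadata
Metadata describes the purpose of a query and specifies how its results are interpreted and displayed. Queries contributed to the open source repository, or used to analyze a database with the CodeQL CLI, must declare a query type (@kind). The @kind property determines how the results of the query are interpreted and displayed:
- Alert query metadata must contain @kind problem, which marks the results as simple alerts
- Path query metadata must contain @kind path-problem, which marks the results as alerts that document a sequence of code locations
- Diagnostic query metadata must contain @kind diagnostic, which marks the results as troubleshooting data about the extraction process
- Summary query metadata must contain @kind metric and @tags summary, which marks the results as summary metrics for the CodeQL database
For more about metadata, see Metadata for CodeQL queries.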
Import statements
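An alert query usually imports the standard library for the programming language of the project being analyzed; see CodeQL language guides.
Other libraries provide commonly used predicates, types, and analysis modules (for example data flow, control flow, and taint tracking). Path queries usually need to import a data flow library; see Creating path queries.
You can also define your own classes and predicates; see Defining a predicate and Defining a class.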
From clause
The from clause declares the variables used in the query. Each declaration has the form <type> <variable name>. For the available variable types, see types.
Where clause
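The where clause defines the logical conditions applied to the variables declared in the from clause. It uses aggregations, predicates, and formulas to restrict the values those variables can take.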
Select clause
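The structure of the select clause must match the @kind given in the metadata. For example, the select clause of an alert query (@kind problem) has the form:
select element, string
- element: the code element identified by the query; it determines where the alert is displayed
- string: a message explaining why the alert was raised; it may contain links and placeholders
A message can contain placeholders. Each $@ defines one placeholder, and each placeholder is followed by two further select columns: the link target and the link text. The following example finds Java classes that extend another class: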
/**
* @kind problem
*/
import java
from Class c, Class superclass
where superclass = c.getASupertype()
select c, "This class extends the class $@.", superclass, superclass.getName()
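In the results, the $@ placeholder in each message is replaced by the name of the superclass, shown as a link to that class.
For more information, see Defining the results of a query.
For the select clause structure of other query kinds, see select clause.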
Query help files
A query help file explains the purpose of a query to other users; see Query help files.
Providing locations in CodeQL queries
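To present results to users, an application must be able to extract location information from them. A QL class can provide location information through one of the following mechanisms: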
- Providing URLs
- Providing location information
- Using extracted location information
Data flow analysis
Data flow graph
The CodeQL data flow libraries distinguish two kinds of data flow:
- Local data flow: data flow within a single function
- Global data flow: data flow throughout the entire program (calculating data flow between functions and through object properties)
Creating path queries
Path query template:
/**
* ...
* @kind path-problem
* ...
*/
import <language>
// For some languages (Java/C++/Python) you need to explicitly import the data flow library, such as
// import semmle.code.java.dataflow.DataFlow
import DataFlow::PathGraph
...
from MyConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "<message>"
CodeQL for C and C++
CodeQL library for C and C++
cpp.qll imports all of the core CodeQL libraries for C/C++, so adding import cpp at the top of a query is enough.
For the commonly used Declaration, Statement, Expression, Type, and Preprocessor classes, see CodeQL library for C and C++.
Functions in C and C++
Find all static functions:
import cpp
from Function f
where f.isStatic()
select f, "This is a static function."
Find functions that are never called:
import cpp
from Function f
where not exists(FunctionCall fc | fc.getTarget() = f)
select f, "This function is never called."
Find functions that are never called and never referenced through a function pointer:
import cpp
from Function f
where not exists(FunctionCall fc | fc.getTarget() = f)
and not exists(FunctionAccess fa | fa.getTarget() = f)
select f, "This function is never called, or referenced with a function pointer."
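To see why the second condition matters, consider the C sketch below (all names are made up for illustration). The function twice is never the target of a direct call; it is only reached through a function pointer. Assuming that an indirect call through a pointer is not a FunctionCall whose target is twice, the first query would report twice as never called, whereas the refined query would not, because passing its name as an argument counts as a FunctionAccess.

```c
/* Hypothetical example: `twice` is only ever reached through a pointer. */
static int twice(int x)
{
    return 2 * x;
}

/* Calls whatever function `op` points to; this is not a direct call to `twice`. */
static int apply(int (*op)(int), int value)
{
    return op(value);
}

/* Passing `twice` as an argument references it without calling it directly. */
int apply_twice(int value)
{
    return apply(twice, value);
}
```

Here apply_twice(21) evaluates to 42, so twice is clearly live code even though no call expression names it.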
Find calls to sprintf that use a variable format string:
import cpp
from FunctionCall fc
where fc.getTarget().getQualifiedName() = "sprintf"
and not fc.getArgument(1) instanceof StringLiteral
select fc, "sprintf called with variable format string."
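The following C sketch shows the two shapes of call that the query distinguishes. The function names are hypothetical, and the caller is assumed to supply a buffer that is large enough. For sprintf the format string is the argument at index 1, because index 0 is the destination buffer.

```c
#include <stdio.h>

/* The format argument (index 1) is a string literal: not reported by the query. */
int format_fixed(char *out, int value)
{
    return sprintf(out, "%d", value);
}

/* The format argument (index 1) is a variable: this is the kind of call the query reports. */
int format_variable(char *out, const char *fmt, int value)
{
    return sprintf(out, fmt, value);
}
```

The second form is risky when fmt can be influenced by an attacker, which is why a purely syntactic check such as the one above is a reasonable first filter.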
Expressions, types, and statements in C and C++
C/C++ statement classes in CodeQL:
Stmt
  Loop
    WhileStmt
    ForStmt
    DoStmt
  ConditionalStmt
    IfStmt
    SwitchStmt
  TryStmt
  ExprStmt - expressions used as a statement; for example, an assignment
  Block - { } blocks containing more statements
Find assignments of 0 to an integer in the initialization of a for loop:
import cpp
from AssignExpr e, ForStmt f
// the assignment is in the 'for' loop initialization statement
where e.getEnclosingStmt() = f.getInitialization()
and e.getRValue().getValue().toInt() = 0
and e.getLValue().getType().getUnspecifiedType() instanceof IntegralType
select e, "Assigning the value 0 to an integer, inside a for loop initialization."
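The loop initialization is a statement (Stmt), not an expression (Expr): the assignment expression (AssignExpr) is wrapped in an ExprStmt. The query therefore uses Expr.getEnclosingStmt() to get the closest Stmt that encloses the expression. Type.getUnspecifiedType() resolves typedef types to their underlying type; for example, given typedef int myInt;, myInt is resolved to int.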
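A small C sketch of the code shape this query is aimed at (sum_below is a made-up name, and myInt is the typedef from the text). To my understanding, a declaration with an initializer is modeled as an initializer rather than as an AssignExpr, so only the assignment in the loop header below would match.

```c
typedef int myInt;   /* the typedef from the text: resolves to int */

/* Sums 0 + 1 + ... + (n - 1). */
int sum_below(int n)
{
    myInt i;         /* declared outside the loop header */
    int total = 0;   /* initializer in a declaration, not an assignment expression */

    /* `i = 0` is an assignment expression wrapped in an expression statement
       that serves as the loop initialization: the pattern the query targets. */
    for (i = 0; i < n; i++)
        total += i;

    return total;
}
```

Because i has the typedef type myInt, the query needs the typedef resolved before the instanceof IntegralType check can succeed.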
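Find assignments of 0 anywhere inside the body of a for loop (getParentStmt*() follows zero or more parent links, so assignments in nested statements are included):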
import cpp
from AssignExpr e, ForStmt f
// the assignment is in the for loop body
where e.getEnclosingStmt().getParentStmt*() = f.getStmt()
and e.getRValue().getValue().toInt() = 0
and e.getLValue().getType().getUnderlyingType() instanceof IntegralType
select e, "Assigning the value 0 to an integer, inside a for loop body."
Data flow analysis in C and C++
Local data flow
Using local data flow
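The local data flow library is in the DataFlow module, where the class Node represents any element that data can flow through. Nodes are divided into expression nodes (ExprNode) and parameter nodes (ParameterNode). The member predicates asExpr and asParameter map between data flow nodes and expressions/parameters: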
class Node {
  /** Gets the expression corresponding to this node, if any. */
  Expr asExpr() { ... }
  /** Gets the parameter corresponding to this node, if any. */
  Parameter asParameter() { ... }
  ...
}
Alternatively, use the predicates exprNode and parameterNode:
/**
* Gets the node corresponding to expression `e`.
*/
ExprNode exprNode(Expr e) { ... }
/**
* Gets the node corresponding to the value of parameter `p` at function entry.
*/
ParameterNode parameterNode(Parameter p) { ... }
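The predicate localFlowStep(Node nodeFrom, Node nodeTo) holds if there is an immediate data flow edge from nodeFrom to nodeTo. It can be applied recursively with the + and * operators; the predefined recursive predicate localFlow is equivalent to localFlowStep*. For example, the following formula holds if data flows from a parameter source to an expression sink in zero or more local steps: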
DataFlow::localFlow(DataFlow::parameterNode(source), DataFlow::exprNode(sink))
Using local taint tracking
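Local taint tracking extends local data flow by also including non-value-preserving flow steps. It is implemented in the TaintTracking module. Similarly, the predicate localTaintStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) holds if there is an immediate taint propagation edge from nodeFrom to nodeTo. It can also be applied recursively with + and *, or through the predefined recursive version localTaint.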
Examples
Use local data flow to find all expressions that may flow into the filename argument of fopen:
import cpp
import semmle.code.cpp.dataflow.DataFlow
from Function fopen, FunctionCall fc, Expr src
where fopen.hasQualifiedName("fopen")
and fc.getTarget() = fopen
and DataFlow::localFlow(DataFlow::exprNode(src), DataFlow::exprNode(fc.getArgument(0)))
select src
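Find a public parameter that is used to open a file (as written, the query matches any Parameter whose value flows into the filename argument):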
import cpp
import semmle.code.cpp.dataflow.DataFlow
from Function fopen, FunctionCall fc, Parameter p
where fopen.hasQualifiedName("fopen")
and fc.getTarget() = fopen
and DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(fc.getArgument(0)))
select p
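As an illustration of the code this query should match, here is a minimal C function in which a parameter reaches the first argument of fopen purely through local flow. The name open_for_reading is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: the parameter `path` reaches fopen's first argument
   through the local variable `name`, all inside one function. */
FILE *open_for_reading(const char *path)
{
    const char *name = path;   /* local flow step: path -> name */
    return fopen(name, "r");   /* name is argument 0 of fopen */
}
```

Since every step happens inside a single function, local data flow is sufficient here. If the copy into name were made by a separate helper function, the query above would be expected to miss it.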
Find calls to formatting functions whose format string is not hard-coded:
import cpp
import semmle.code.cpp.dataflow.DataFlow
import semmle.code.cpp.commons.Printf
from FormattingFunction format, FunctionCall call, Expr formatString
where call.getTarget() = format
  and call.getArgument(format.getFormatParameterIndex()) = formatString
  // no string literal flows locally into the format argument
  and not exists(DataFlow::Node source, DataFlow::Node sink |
    DataFlow::localFlow(source, sink) and
    source.asExpr() instanceof StringLiteral and
    sink.asExpr() = formatString
  )
select call, "Argument to " + format.getQualifiedName() + " isn't hard-coded."
Global data flow
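Global data flow tracks data through the entire program, but it is less precise than local data flow and typically needs considerably more time and memory.
Using global data flow
Newer CodeQL releases deprecate the class-based API shown here in favor of the module-based DataFlow::ConfigSig; these notes follow the older form. To use the global data flow library, extend the class DataFlow::Configuration: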
import semmle.code.cpp.dataflow.DataFlow
class MyDataFlowConfiguration extends DataFlow::Configuration {
  MyDataFlowConfiguration() { this = "MyDataFlowConfiguration" }
  override predicate isSource(DataFlow::Node source) {
    ...
  }
  override predicate isSink(DataFlow::Node sink) {
    ...
  }
}
The configuration defines the following predicates:
- isSource: defines where data may flow from
- isSink: defines where data may flow to
- isBarrier: optional, restricts the data flow
- isBarrierGuard: optional, restricts the data flow
- isAdditionalFlowStep: optional, adds additional flow steps
The data flow analysis is performed with the predicate hasFlow(DataFlow::Node source, DataFlow::Node sink):
from MyDataFlowConfiguration dataflow, DataFlow::Node source, DataFlow::Node sink
where dataflow.hasFlow(source, sink)
select source, "Data flow to $@.", sink, sink.toString()
Using global taint tracking
import semmle.code.cpp.dataflow.TaintTracking
class MyTaintTrackingConfiguration extends TaintTracking::Configuration {
  MyTaintTrackingConfiguration() { this = "MyTaintTrackingConfiguration" }
  override predicate isSource(DataFlow::Node source) {
    ...
  }
  override predicate isSink(DataFlow::Node sink) {
    ...
  }
}
The configuration defines the following predicates:
- isSource: defines where taint may flow from
- isSink: defines where taint may flow to
- isSanitizer: optional, restricts the taint flow
- isSanitizerGuard: optional, restricts the taint flow
- isAdditionalTaintStep: optional, adds additional taint steps
Examples
The following configuration tracks data flow from calls to getenv to the filename argument of fopen:
import cpp
import semmle.code.cpp.dataflow.DataFlow
class EnvironmentToFileConfiguration extends DataFlow::Configuration {
  EnvironmentToFileConfiguration() { this = "EnvironmentToFileConfiguration" }
  // source: any call to getenv
  override predicate isSource(DataFlow::Node source) {
    exists(Function getenv |
      source.asExpr().(FunctionCall).getTarget() = getenv and
      getenv.hasQualifiedName("getenv")
    )
  }
  // sink: the first argument of a call to fopen
  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall fc |
      sink.asExpr() = fc.getArgument(0) and
      fc.getTarget().hasQualifiedName("fopen")
    )
  }
}
from Expr getenv, Expr fopen, EnvironmentToFileConfiguration config
where config.hasFlow(DataFlow::exprNode(getenv), DataFlow::exprNode(fopen))
select fopen, "This 'fopen' uses data from $@.",
getenv, "call to 'getenv'"
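The C sketch below shows a flow that a configuration of this kind is meant to catch but that local data flow alone would miss, because the value crosses a function boundary. The function names and the environment variable CODEQL_NOTES_EXAMPLE_CONFIG are invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sink side: `path` reaches argument 0 of fopen. */
static FILE *open_path(const char *path)
{
    return fopen(path, "r");
}

/* Source side: the result of getenv crosses a function boundary before it
   reaches fopen. CODEQL_NOTES_EXAMPLE_CONFIG is a made-up variable name. */
FILE *open_config_from_env(void)
{
    const char *path = getenv("CODEQL_NOTES_EXAMPLE_CONFIG");
    if (path == NULL)
        return NULL;           /* variable not set: nothing to open */
    return open_path(path);    /* interprocedural flow: getenv -> open_path -> fopen */
}
```

The source (the getenv call) and the sink (the fopen argument) live in different functions, which is exactly the situation global data flow is designed for.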
Detecting a potential buffer overflow
An example of a buffer overflow: the malloc call does not reserve space for the null termination character.
void processString(const char *input)
{
    char *buffer = malloc(strlen(input));
    strcpy(buffer, input);
    ...
}
The following CodeQL query finds calls to malloc whose size argument is just strlen(string) (for a step-by-step explanation of how it is written, see Detecting a potential buffer overflow):
import cpp
class MallocCall extends FunctionCall
{
  MallocCall() { this.getTarget().hasGlobalName("malloc") }
  Expr getAllocatedSize() {
    if this.getArgument(0) instanceof VariableAccess then
      exists(LocalScopeVariable v, SsaDefinition ssaDef |
        result = ssaDef.getAnUltimateDefiningValue(v)
        and this.getArgument(0) = ssaDef.getAUse(v))
    else
      result = this.getArgument(0)
  }
}
from MallocCall malloc
where malloc.getAllocatedSize() instanceof StrlenCall
select malloc, "This allocation does not include space to null-terminate the string."
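The SSA library represents variables in static single assignment (SSA) form. In this form, each variable is assigned exactly once and every variable is defined before it is used. Using SSA variables simplifies the query, because much of the local data flow analysis has already been done.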
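For comparison, here is one way the allocation could be corrected. This is a sketch rather than the fix given in the documentation, and copy_string is a hypothetical name. Note that if the code read malloc(len) with len = strlen(input), the size argument would be a variable access, which is the case the SSA branch of getAllocatedSize is there to resolve.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of `input`, or NULL if the allocation fails.
   The caller owns the result and must free() it. */
char *copy_string(const char *input)
{
    size_t len = strlen(input);       /* characters, excluding the terminator */
    char *buffer = malloc(len + 1);   /* +1 for the terminating '\0' */
    if (buffer == NULL)
        return NULL;
    memcpy(buffer, input, len + 1);   /* copies the terminator as well */
    return buffer;
}
```

The size expression here is len + 1 rather than a bare strlen call or a variable that holds one, so the query above should no longer report the allocation.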
Using the guards library in C and C++
The guards library (semmle.code.cpp.controlflow.Guards) identifies conditional expressions that control the execution of other parts of a program; see Using the guards library in C and C++.
Using range analysis for C and C++
Range analysis (semmle.code.cpp.rangeanalysis.SimpleRangeAnalysis) determines upper and lower bounds for an expression, or whether an expression could overflow; see Using range analysis for C and C++.
Hash consing and value numbering
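Use semmle.code.cpp.valuenumbering.HashCons to identify expressions that are syntactically identical, and semmle.code.cpp.valuenumbering.GlobalValueNumbering to identify expressions that have the same value at run time; see Hash consing and value numbering.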
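The difference between the two notions of equality is easiest to see on a small example. The C function below is a made-up illustration; how the two libraries classify each expression is my reading of the descriptions above, not verified output.

```c
/* Hypothetical example contrasting syntactic and run-time equality. */
int compare_forms(int a, int b)
{
    int first = a + b;     /* (1) */
    int alias = first;     /* different syntax from (1), same run-time value */
    a = a + 1;
    int second = a + b;    /* same syntax as (1), but `a` has changed */

    /* alias - first is always 0; second - first is 1 when nothing overflows */
    return (alias - first) + (second - first);
}
```

Under that reading, hash consing would group the two a + b expressions together because they are written identically, while global value numbering would keep them apart and instead equate alias with first.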