Static analysis: baseline vs diff

cullen2012

Original article: https://habr.com/en/company/pvs-studio/blog/513952/

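This article examines two approaches to warning suppression in static analysis, the baseline approach and the approach based on VCS diff capabilities, and describes the suppression mechanisms along with their pros and cons. The baseline approach generates a suppression file so that existing warnings are ignored, while the diff approach uses the version control system to locate warnings on new code. The article covers implementation details of both, including collision handling and ways to improve accuracy.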

The purpose of this article is not to help with integration but rather to elaborate on the technicalities of the process: the exact implementations of warning suppression mechanisms and pros and cons of each approach.

Baseline, or the so-called suppress profile

This approach is known by various names: baseline file in Psalm and Android Lint, suppress base (or profile) in PVS-Studio, code smell baseline in detekt.

This file is generated by the linter when run on the project:

superlinter --create-baseline baseline.xml ./project

Inside, it stores all the warnings produced at the creation step.

When running a static analyzer with the baseline.xml file, all the warnings contained in it will be ignored:

superlinter --baseline baseline.xml ./project

A straightforward approach, where warnings are stored in full along with their line numbers, doesn't work well enough: adding new code at the beginning of a source file shifts the lines below it and therefore brings back all the warnings that were meant to stay hidden.

We typically seek to accomplish the following goals:

All warnings on new code must be issued
Warnings on existing code must be issued only if it was modified
(optional) Allow moving files or code fragments

What we can configure in this approach is which fields of the warning are included in its hash value (or "signature"). To avoid problems caused by shifting line numbers, leave the line number out of this list.

Here is an example list of fields that can form a warning signature:

Diagnostic name or code
Warning message
File name
Source line that triggers the warning

The more properties used, the lower the risk of collision, but also the higher the risk of getting an unexpected warning due to signature invalidation. If any of the specified properties changes, the warning will no longer be ignored.
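
A minimal sketch of such a signature, assuming a warning is available as a dictionary. The field names and the choice of SHA-1 are illustrative and not taken from any particular analyzer. The line number is present in the warning but deliberately left out of the hash:

```python
import hashlib

def warning_signature(warning):
    """Hash the identifying fields of a warning, ignoring its line number."""
    fields = (
        warning["diag"],
        warning["message"],
        warning["filename"],
        # Assumption of this sketch: indentation is stripped so that
        # re-indenting the line does not invalidate the signature.
        warning["source"].strip(),
    )
    payload = "\x00".join(fields).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

original = {"diag": "W104", "message": "call to die()",
            "filename": "foo.php", "line": 2, "source": "  die('test');"}
shifted = dict(original, line=22)                    # code was added above it
edited = dict(original, source="  die('changed');")  # the line itself changed
```

Here `shifted` keeps the signature of `original`, so the warning stays suppressed, while `edited` gets a new signature and the warning is shown again.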

Along with the warning-triggering line, PVS-Studio stores the line before it and the line after it. This helps identify the triggering line more reliably, but with this approach a warning may reappear after you modify a neighboring line.

Another – less obvious – property is the name of the function or method the warning was issued for. This helps reduce the number of collisions but renaming the function will cause a storm of warnings on it.

It was empirically found that using this property allows you to weaken the filename field and store only the base name rather than the full path. This enables moving files between directories without the signature getting invalidated. In languages like C#, PHP, or Java, where the file name usually reflects the class name, such moves make sense.
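
As an illustration of that trade-off, the sketch below builds a signature from the diagnostic code, the base name of the file, the containing function, and the source line. The dictionary layout and the directory names are hypothetical:

```python
import posixpath

def signature(warning):
    """Signature that survives moving a file but not renaming the function."""
    return (
        warning["diag"],
        posixpath.basename(warning["filename"]),  # base name only, no directory
        warning["function"],
        warning["source"].strip(),
    )

baselined = {"diag": "W104", "filename": "src/legacy/foo.php",
             "function": "legacy", "source": "die('test');"}
moved = dict(baselined, filename="src/core/foo.php")  # file moved to another directory
renamed = dict(baselined, function="shutdown")        # containing function renamed
```

The moved file still matches the baseline entry, whereas the renamed function produces a different signature and the warning comes back.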

A well-composed set of properties makes the baseline approach more effective.

Collisions in a baseline method

Suppose we have a diagnostic, W104, that detects calls to die in the source code.

The project under analysis has the file foo.php:

function legacy() {
  die('test');
}

The properties we use are {file name, diagnostic code, source line}.

When creating a baseline, the analyzer adds the call die('test') to its ignore base:

{
  "filename": "foo.php",
  "diag": "W104",
  "line": "die('test');"
}

Now, let's add some more code:

+ function newfunc() {
+   die('test');
+ }

  function legacy() {
    die('test');
  }

All properties of the new call die('test') are exactly the same as those forming the signature of the fragment to be ignored. That's what we call a collision: a coincidence of warning signatures for potentially different code fragments.

One way of solving this issue is to add an additional field to distinguish between the two calls – say, "name of containing function".

But what if the new die('test') call is added to the same function? Neighboring lines may be the same in both cases, so including the previous and next lines in the signature won't help.

This is where a per-signature collision counter comes in handy. It lets us know that we got two or more warnings where only one was expected inside a function; in that case, all warnings but the first must be shown.

With this solution, however, we lose some precision: you can't determine which line is newly added and which already existed. The analyzer will report the line that follows the ignored ones.
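
The counter can be sketched as follows, assuming the baseline stores how many times each signature was seen. The `signature` helper and the dictionary layout are hypothetical:

```python
from collections import Counter

def signature(warning):
    # Hypothetical signature: {file name, diagnostic code, function, source line}.
    return (warning["filename"], warning["diag"],
            warning["function"], warning["source"].strip())

def filter_with_counter(baseline_counts, warnings):
    """Suppress as many warnings per signature as the baseline recorded."""
    remaining = Counter(baseline_counts)
    shown = []
    for warning in warnings:  # warnings are expected in file order
        sig = signature(warning)
        if remaining[sig] > 0:
            remaining[sig] -= 1    # matches a baselined occurrence: suppress
        else:
            shown.append(warning)  # more occurrences than expected: report
    return shown

old = {"filename": "foo.php", "diag": "W104", "function": "legacy",
       "line": 2, "source": "die('test');"}
baseline = Counter([signature(old)])

# A second identical call now exists in the same function.
current = [dict(old, line=2), dict(old, line=3)]
shown = filter_with_counter(baseline, current)
```

Only the occurrence on line 3 is reported, whichever of the two calls was actually added, which is exactly the loss of precision described above.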

Approach based on diff capabilities of VCSs

The original goal was to have warnings issued only on "newly written" code. This is where version control systems can be useful.

The revgrep utility receives a flow of warnings at stdin, analyzes the git diff, and outputs only warnings produced for new lines.
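
The idea can be illustrated with a simplified sketch that is not revgrep's actual implementation: parse a unified diff to collect the numbers of added lines per file, then keep only the warnings that fall on those lines. The warning tuple layout `(filename, line, message)` is an assumption, and the parser ignores corner cases such as removed lines whose content starts with `-- `.

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def added_lines(diff_text):
    """Map each file name to the set of line numbers added in a unified diff."""
    added = {}
    current, lineno = None, 0
    for line in diff_text.splitlines():
        hunk = HUNK_RE.match(line)
        if line.startswith("+++ "):
            path = line[4:]
            current = path[2:] if path.startswith("b/") else path
            added.setdefault(current, set())
        elif line.startswith("--- "):
            continue                      # old file name: not needed
        elif hunk:
            lineno = int(hunk.group(1))   # first line of the hunk in the new file
        elif current is None:
            continue                      # header lines before the first file
        elif line.startswith("+"):
            added[current].add(lineno)
            lineno += 1
        elif line.startswith(" "):
            lineno += 1                   # context line: present in the new file
        # lines starting with "-" do not exist in the new file
    return added

def only_new(warnings, added):
    """Keep warnings whose (filename, line) was added by the diff."""
    return [w for w in warnings if w[1] in added.get(w[0], set())]

DIFF = "\n".join([
    "diff --git a/foo.php b/foo.php",
    "--- a/foo.php",
    "+++ b/foo.php",
    "@@ -1,3 +1,7 @@",
    "+function newfunc() {",
    "+  die('test');",
    "+}",
    "+",
    " function legacy() {",
    "   die('test');",
    " }",
])
warnings = [("foo.php", 2, "W104: call to die()"),
            ("foo.php", 6, "W104: call to die()")]
new_only = only_new(warnings, added_lines(DIFF))
```

With this input only the warning on line 2, inside the newly added function, survives the filter; the one in the untouched `legacy` function is dropped.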

golangci-lint employs revgrep's fork as a library, so it uses exactly the same algorithms for diff calculations.

If you take this path, you'll have to answer the following questions:

What commit range to choose for calculating the diff?
How are you going to process commits coming from the master branch (merge/rebase)?

Also keep in mind that sometimes you still want to get warnings outside the scope of the diff. For example, suppose you deleted a meme director class, MemeDirector. If that class was mentioned in any doc comments, you'd like the linter to tell you about that.

We need not only to get a correct set of affected lines but also to expand it so as to trace the side effects of the changes throughout the whole project.

The commit range can also vary. You probably wouldn't want to check only the last commit, because in that case you could push two commits at once: one that introduces the warnings and another, harmless one that is the only commit CI checks. Even if done unintentionally, this poses a risk of overlooking a critical defect. Also keep in mind that the previous commit may have come from the master branch, in which case it shouldn't be checked either.
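
One possible answer to both questions, given here only as a sketch, is to diff the branch against its merge base with the main branch. Git's three-dot form `git diff A...B` compares B with the common ancestor of A and B, so it covers every commit on the branch while leaving out changes that arrived from master. The default branch name `origin/master` is an assumption:

```python
def branch_diff_command(base="origin/master", head="HEAD"):
    """Build a git command that diffs `head` against its merge base with `base`."""
    # --unified=0 drops context lines, so the hunks contain changed lines only.
    return ["git", "diff", "--unified=0", f"{base}...{head}"]

command = branch_diff_command()
# The list can be passed to subprocess.run(command, capture_output=True, text=True)
# inside a repository to obtain the diff text.
```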

Diff mode in NoVerify

NoVerify has two working modes: diff and full diff.

The regular diff can find warnings on files affected by modifications within the specified commit range. It's fast, but it doesn't provide thorough analysis of dependencies, and so new warnings on unaffected files can't be found.

The full diff runs the analyzer twice: first on existing code and then on new code, with subsequent filtering of results. This is similar to generating a baseline file on the fly based on the ability to get the previous version of the code using git. As you would expect, execution time increases almost twofold in this mode.
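
The two-run scheme can be sketched with a toy analyzer; this is not NoVerify's code. The hypothetical `analyze` function flags `@see Name` doc references to classes that are not defined anywhere, and `full_diff` subtracts the warnings of the old tree from those of the new tree:

```python
import re
from collections import Counter

def analyze(files):
    """Toy analyzer: report doc references to classes that no file defines."""
    defined = set()
    for text in files.values():
        defined.update(re.findall(r"\bclass\s+(\w+)", text))
    warnings = []
    for name, text in sorted(files.items()):
        for ref in re.findall(r"@see\s+(\w+)", text):
            if ref not in defined:
                warnings.append((name, "undefinedClass", ref))
    return warnings

def full_diff(old_files, new_files):
    """Run the analyzer on both trees and keep warnings absent from the old one."""
    baseline = Counter(analyze(old_files))  # baseline built on the fly
    fresh = []
    for warning in analyze(new_files):
        if baseline[warning] > 0:
            baseline[warning] -= 1
        else:
            fresh.append(warning)
    return fresh

old_tree = {
    "director.php": "class MemeDirector {}",
    "usage.php": "/** @see MemeDirector */ function f() {}",
}
new_tree = {"usage.php": old_tree["usage.php"]}  # director.php was deleted
result = full_diff(old_tree, new_tree)
```

The warning lands in `usage.php`, a file the change never touched, which is the kind of side effect the regular diff mode cannot see.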

The initially suggested scenario was to run the faster analysis, in diff mode, on pre-push hooks so that feedback comes as soon as possible, and then run the full diff mode on CI agents. As a result, people would ask why issues were found on the agents when none were found locally. It's more convenient to have identical analysis processes so that passing a pre-push hook guarantees passing the linter's CI phase.

Full diff in one pass

We can build an analog of full diff without having to run the analysis twice.

Suppose we've got the following line in the diff:

- class Foo {

If we try to classify this line, we'll tag it as "Foo class deletion".

Each diagnostic that in any way depends on the class being present must issue a warning if this class got deleted.

Similarly, deleted variables (whether global or local) and functions are classified too, so we end up with a collection of facts about all the changes that we can classify.

Renames don't require additional processing. We view a rename as deleting the symbol with the old name and adding a symbol with the new one:

- class MemeManager {
+ class SeniorMemeManager {
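
A rough sketch of this classification step, assuming the diff lines are already available as strings. The regular expression handles only simple class and function declarations and is purely illustrative:

```python
import re

DECL_RE = re.compile(r"^([+-])\s*(class|function)\s+(\w+)")

def classify(diff_lines):
    """Turn diff lines into facts such as ('deleted', 'class', 'Foo')."""
    facts = []
    for line in diff_lines:
        match = DECL_RE.match(line)
        if match:
            action = "added" if match.group(1) == "+" else "deleted"
            facts.append((action, match.group(2), match.group(3)))
    return facts

facts = classify([
    "- class MemeManager {",
    "+ class SeniorMemeManager {",
    "+   // unrelated change",
])
# Names that diagnostics depending on a symbol's presence should re-check.
deleted_names = {name for action, _, name in facts if action == "deleted"}
```

The rename shows up as one "deleted" fact and one "added" fact, so any diagnostic that depends on `MemeManager` being present can be re-run against the places that still mention it.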

The biggest difficulty here is to correctly classify lines with changes and reveal all their dependencies without slowing down the algorithm to the speed of full diff with double traversal of the code base.

Conclusions

Baseline: a simple approach used in many analyzers. Its obvious downside is that you have to keep the baseline file somewhere and update it every now and then. The better chosen the set of properties forming the warning signature, the higher the accuracy.

Diff: simple in its basic implementation but complicated if you want to achieve the best possible result. Theoretically, this approach can provide the highest accuracy. Your customers won't be able to integrate the analyzer into their process unless they use a version control system.

baseline	diff
+ can be easily made powerful	+ doesn't require storing an ignore file
+ easy to implement and configure	+ easier to distinguish between new and existing code
− collisions must be resolved	− takes much effort to prepare properly

Hybrid approaches are also used: for example, you first take the baseline file and then resolve collisions and calculate line shifts using git.

Personally, I find the diff approach more elegant, but the one-pass implementation of full analysis may be too problematic and fragile.
