Machine Learning: Software Vulnerability Analysis

明远AI


Article link: https://blog.csdn.net/u011748542/article/details/84301934


Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey

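Abstract
Keywords
Introduction
Background: software vulnerability analysis and discovery
Machine learning and data mining techniques
- Hopes and concerns
- Categorization of approaches
Vulnerability prediction based on software metrics
Anomaly detection approaches
Vulnerable code pattern recognition
Other approaches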

Software vulnerability analysis methods based on machine learning and data mining
SEYED MOHAMMAD GHAFFARIAN and HAMID REZA SHAHRIARI,
Amirkabir University of Technology

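Abstract

Software security vulnerabilities are one of the critical issues in computer security, and over the past few decades many approaches have been proposed to mitigate the damage they cause. Machine learning and data mining techniques are among these approaches. The paper reviews work on software vulnerability analysis and discovery based on machine learning and data mining, discusses the strengths and weaknesses of each approach, and points out the challenges and some uncharted territory in the field.

Keywords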

Software vulnerability analysis, software vulnerability discovery, software security, machine-learning, data-mining, review, survey

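Introduction

Computer software is everywhere, and human life depends heavily on a wide variety of it. Software takes different forms and runs on different platforms, from simple applications on handheld mobile devices to complex distributed enterprise systems. It is built on many different technologies and produced by different methods, each with its own strengths and limitations. Within this vast industry, and in computer security in particular, software security vulnerabilities are an important problem. To quote industry experts: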

“In the context of software security, vulnerabilities are specific flaws or oversights in a piece of software that allow attackers to do something malicious: expose or alter sensitive information, disrupt or destroy a system, or take control of a computer system or program.” Dowd et al. (2007)

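Software vulnerabilities vary in severity depending on factors such as exploitation complexity and attack surface (Nayak et al. 2014). Over the past two decades, many incidents caused by software vulnerabilities have done serious damage to companies and individuals. A prominent example is the vulnerabilities in popular browser plugins, which threaten the security and privacy of millions of Internet users, for example Adobe Flash Player (US-CERT 2015; Adobe Security Bulletin 2015) and Oracle Java (US-CERT 2013). In addition, vulnerabilities in foundational open-source software threaten the security of thousands of companies worldwide and their customers, for example Heartbleed (Codenomicon 2014), ShellShock (Symantec Security Response 2014), and Apache Commons (Breen 2015). These examples are only a small fraction of the vulnerabilities reported each year.
In response, researchers in academia and the software industry have proposed many mitigation approaches. Shahriar and Zulkernine (2012) surveyed a broad range of approaches to security vulnerabilities published between 1994 and 2010, including testing, static analysis, and hybrid analysis, as well as secure programming, program transformation, and patching.
In the following years (from 2011 onward), however, the research community paid growing attention to another class of approaches, which apply techniques from data science and artificial intelligence (AI) to software vulnerability analysis and discovery. The survey by Shahriar and Zulkernine (2012) does not cover this class.
The paper surveys software vulnerability analysis and discovery approaches that use data mining and machine learning. It first defines the problem of software vulnerability analysis and discovery and briefly introduces the traditional approaches in the field. It then gives a short introduction to machine learning and data mining and the motivation for using them. Next it presents the works that apply these techniques to vulnerability analysis and discovery, categorizes them, and discusses their advantages and limitations. Finally, it discusses the challenges of the field and points out some unexplored areas to stimulate work in this emerging research area.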

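Background: software vulnerability analysis and discovery

Definition of a software vulnerability

First, definitions of a software security vulnerability from earlier work: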

“an instance of an error in the specification, development, or configuration of software such that its execution can violate the security policy.” (Krsul 1998)

“A software vulnerability is an instance of a mistake in the specification, development, or configuration of software such that its execution can violate the explicit or implicit security policy.” Ozment (2007)

“In the context of software security, vulnerabilities are specific flaws or oversights in a piece of software that allow attackers to do something malicious: expose or alter sensitive information, disrupt or destroy a system, or take control of a computer system or program.” Dowd et al. (2007)

IEEE Standard Glossary of Software Engineering Terminology (IEEE Standards 1990):

error: “the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition”.
fault: “an incorrect step, process, or data definition in a computer program”. Faults are also known as flaws or bugs.
failure: “the inability of a system or component to perform its required functions within specified performance requirements”.
mistake: “a human action that produces an incorrect result”.

A summary and clarification of the relation of these terms is to “distinguish between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)” (IEEE Standards 1990).

The definition used in the paper:

A software vulnerability is an instance of a flaw, caused by a mistake in the design, development, or configuration of software such that it can be exploited to violate some explicit or implicit security policy.

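The cause of a software vulnerability is a human mistake, and its manifestation is a flaw. Executing vulnerable software does not necessarily violate the security policy. A violation can occur only when some specific data (an exploit) or random data that happens to satisfy certain conditions reaches the flawed statement.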
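
The point that a flaw stays dormant until particular data reaches it can be shown with a toy example. This is a minimal sketch, not taken from the paper: `MEMORY`, `PUBLIC_LEN`, and `read_public` are hypothetical names, and Python's negative indexing stands in for an out-of-bounds read.

```python
PUBLIC_LEN = 5
MEMORY = b"hello" + b"K"  # five public bytes followed by one secret byte


def read_public(index: int) -> int:
    """Return one byte; the intended policy allows only MEMORY[0:PUBLIC_LEN]."""
    if index >= PUBLIC_LEN:
        raise IndexError("outside the public region")
    # Flaw: negative indexes are never rejected, so -1 wraps to the last byte.
    return MEMORY[index]


# Ordinary inputs exercise the flawed function without violating the policy.
benign = [read_public(i) for i in range(PUBLIC_LEN)]

# A crafted input reaches the flawed statement and reads the secret byte.
leaked = read_public(-1)
```

Every call with an index from 0 to 4 behaves correctly, so the flaw is invisible in normal use. Only the crafted index exposes it.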

“In general, software vulnerabilities can be thought of as a subset of the larger phenomenon of software bugs. Security vulnerabilities are bugs that pack an extra hidden surprise: A malicious user can leverage them to launch attacks against the software and supporting systems.” Dowd et al. (2007)

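Soundness, completeness, and undecidability

Program vulnerability analysis is the problem of deciding whether a given program contains known security vulnerabilities (with respect to a security policy). From the undecidability of Turing's halting problem and from Rice's theorem, it can be shown that many program analysis problems are undecidable in the general case (Landi 1992; Reps 2000). For practitioners, undecidability means that no solution to the problem is both sound and complete.
In mathematical logic, a proof system is sound if it never approves an invalid argument, and complete if it can approve every valid argument. It follows that a sound and complete proof system approves all valid arguments and rejects all invalid ones (Xie et al. 2005). In the context of software security, a vulnerability analysis system is sound if it never approves a vulnerable program (no missed vulnerabilities), and complete if it approves every secure program (no false vulnerabilities).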
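
To make the two properties concrete, the following minimal sketch checks an analyzer against a labeled sample of programs. The program names, labels, and the two trivial analyzers are hypothetical, and a check on a finite sample can only refute soundness or completeness, not prove them.

```python
def evaluate(analyzer, programs):
    """Check an analyzer on (program, is_vulnerable) pairs.

    `analyzer(program)` returns True when it approves the program as secure.
    """
    # Soundness is violated by any vulnerable program that gets approved.
    missed = [p for p, vulnerable in programs if vulnerable and analyzer(p)]
    # Completeness is violated by any secure program that gets rejected.
    false_alarms = [p for p, vulnerable in programs
                    if not vulnerable and not analyzer(p)]
    return {"sound_on_sample": not missed,
            "complete_on_sample": not false_alarms}


# Hypothetical labeled sample.
PROGRAMS = [("safe_a", False), ("safe_b", False), ("vuln_a", True)]


def reject_all(program):
    return False  # never approves anything: sound but useless


def approve_all(program):
    return True  # approves everything: complete but misses every vulnerability


strict = evaluate(reject_all, PROGRAMS)
lenient = evaluate(approve_all, PROGRAMS)
```

The two trivial analyzers show why either property alone is easy to get, while undecidability rules out having both in general.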

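Beyond vulnerability analysis, a more practical system is a program vulnerability discovery (or vulnerability reporting) system. Whereas a vulnerability analysis system approves or rejects the security of a given program (a binary output), a vulnerability discovery system reports more detailed information about each vulnerability it finds in the program (such as its type and location). Likewise, no vulnerability discovery system is both sound and complete.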

Traditional approaches

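Static analysis: analyzes the properties of a given program from its source code, without executing it, using generalizing abstractions.
Dynamic analysis: executes a given program on specific inputs and monitors its runtime behavior.
Hybrid analysis: combines static and dynamic analysis.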
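
As an illustration of the static side, the sketch below inspects source code without running it, using Python's standard `ast` module. The rule set `RISKY_CALLS` and the sample program are assumptions made for this example, and real static analyzers reason about far richer abstractions than a list of call names.

```python
import ast

RISKY_CALLS = {"eval", "exec"}  # assumption: an illustrative rule set

SAMPLE = '''
def handler(user_input):
    return eval(user_input)

def safe(x):
    return int(x)
'''


def find_risky_calls(source: str):
    """Return (line, name) for each direct call to a name in RISKY_CALLS."""
    tree = ast.parse(source)  # parse only; the program is never executed
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                findings.append((node.lineno, node.func.id))
    return sorted(findings)


findings = find_risky_calls(SAMPLE)
```

A dynamic analysis would instead call `handler` on chosen inputs and observe what happens at run time.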

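Among the many vulnerability discovery techniques, the following three are the most widely used in the software industry.

Software penetration testing
Fuzz testing
Static data-flow analysis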
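
Of the three, fuzz testing is the easiest to sketch in a few lines. The example below is a minimal black-box random fuzzer under stated assumptions: `parse_length_prefixed` is a hypothetical target with a deliberately planted flaw, and an unhandled exception stands in for a crash.

```python
import random


def parse_length_prefixed(data: bytes) -> bytes:
    """Toy target: the first byte declares the payload length."""
    if not data:
        return b""
    declared = data[0]
    payload = bytearray()
    for i in range(declared):
        # Flaw: the declared length is trusted without checking len(data).
        payload.append(data[1 + i])
    return bytes(payload)


def fuzz(target, trials=200, seed=0):
    """Feed random byte strings to `target` and record inputs that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        length = rng.randrange(0, 17)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except Exception as exc:  # any unhandled exception counts as a crash
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_length_prefixed)
```

Each recorded input is a concrete reproducer, which is the main practical appeal of fuzzing. The cost is that flaws guarded by narrow conditions may never be hit by purely random inputs.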

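Machine learning and data mining techniques

Beyond the traditional approaches above, a different class of approaches applies techniques from data science and artificial intelligence to software vulnerability analysis and discovery. This class has received considerable attention since 2011.
Machine learning methods from the AI field have proven effective across many application areas.

Hopes and concerns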

“In their early days, computer security and artificial intelligence didn’t seem to have much to say to each other . . . Security researchers aimed to fix the leaks in the plumbing of the computing infrastructure or design infrastructures they deemed leakproof . . . But the two fields have grown closer over the years, particularly where attacks have aimed to simulate legitimate behaviors . . . We might imagine systems that would have a degree of self-awareness about the data that they process. The notion of reflective systems (systems that can reference and modify their own behavior) has its origins in the AI community . . . Imagine a plumbing system that contained a system of smart pipes that could detect incipient leaks. A cyber-infrastructure that incorporated the analog of smart pipes would be of great interest.” Landwehr (2008)

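Categorization of approaches

Vulnerability prediction based on software metrics. A large body of work uses well-known software metrics as the feature set, builds a prediction model with machine learning, and then uses the model to assess the vulnerability status of software artifacts.
Anomaly detection approaches. These use unsupervised learning to automatically extract a model of normal behavior and detect vulnerabilities as deviations from it.
Vulnerable code pattern recognition. These use machine learning to extract patterns of vulnerable code segments from many vulnerable code samples, then use pattern matching to detect and locate vulnerabilities in source code.
Other approaches. Recent work that falls outside the categories above but still applies techniques from AI and data science to software vulnerability analysis and discovery.

Vulnerability prediction based on software metrics

Vulnerability prediction models apply data mining, machine learning, and statistical analysis to common software engineering metrics in order to predict vulnerable software artifacts (source code files, object-oriented classes, and so on). The main idea comes from the field of software quality and reliability assurance in software engineering, where fault prediction models have been applied in industry (Khoshgoftaar et al. 1997).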
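
The idea can be sketched as a small classifier trained on per-file metrics. Everything below is an assumption made for illustration: the two metrics (complexity and code churn, scaled to [0, 1]), the hand-made training rows, and the choice of logistic regression, which is one of many models such work could use.

```python
import math

# Hypothetical training data: (complexity, churn) scaled to [0, 1];
# label 1 means the file had a reported vulnerability.
TRAIN = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
    ((0.8, 0.7), 1), ((0.9, 0.9), 1), ((0.7, 0.8), 1),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, x):
    """Estimated probability that a file with metrics `x` is vulnerable."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(data, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    n_features = len(data[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            err = predict((weights, bias), x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


model = train(TRAIN)
risky = predict(model, (0.9, 0.8))  # complex, frequently changed file
calm = predict(model, (0.1, 0.1))   # simple, stable file
```

In practice the labels come from sources like those in the "Vulnerability info" column of Table 1, and a ranked list of files is used to prioritize review effort.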

Table 1. Summary of Recent Works on Vulnerability Prediction Models Based on Software Metrics

Paper	Metrics	Granularity	Within/Cross-project	Vulnerability info
(Zimmermann et al. 2010)	Code-churn, complexity, coverage, dependency, organizational	Binary modules	Within-project	Public advisories
(Meneely and Williams 2010)	Developer-activity	Source file	Within-project	Public advisories
(Doyle and Walden 2011)	Code complexity, Security Resources Indicator	Source file	Within-project	Tool-based detection
(Shin and Williams 2013)	Complexity, code-churn, fault-history	Source file	Within-project	Public advisories
(Shin and Williams 2011)	Code complexity, dependency network complexity, execution complexity	Source file	Within-project	Public advisories
(Shin et al. 2011)	Complexity, code-churn, developer-activity	Source file	Within-project	Public advisories
(Moshtari et al. 2013)	Unit complexity, coupling	Source file	Both	Self-developed detection framework
(Meneely et al. 2013)	Code-churn, developer-activity	Code commits	Within-project	Public advisories
(Bosu et al. 2014)	Developer-activity	Code commits	Within-project	Public advisories
(Perl et al. 2015)	Code-churn, developer-activity, GitHub meta-data	Code commits	Cross-project	Public advisories
(Walden et al. 2014)	Code complexity	Source file	Both	Public advisories
(Morrison et al. 2015)	Code-churn, complexity, coverage, dependency, organizational	Binary modules, source file	Within-project	Public advisories
(Younis et al. 2016)	Code complexity, information flow, functions, invocations	Functions	Within-project	Public advisories

Anomaly detection approaches

Table 2. Summary of Reviewed Works on Anomaly Detection for Vulnerability Discovery

Paper	Type	Approach	Within/Cross-project	Security focused
(Engler et al. 2001)	API usage pattern	Template-based rule extraction	Within	Yes
(Livshits and Zimmermann 2005)	API usage pattern	Association rule mining	Within	No
(Li and Zhou 2005)	API usage pattern	Frequent closed itemset mining	Within	No
(Wasylkowski et al. 2007)	API usage pattern	Frequent closed itemset mining	Within	No
(Acharya et al. 2007)	API usage pattern	Frequent partial-order itemset mining	Cross	No
(Chang et al. 2008)	Missing checks	Maximal frequent sub-graph mining	Within	No
(Thummalapenta and Xie 2009)	API usage pattern + Missing checks	Imbalanced frequent itemset mining	Cross	No
(Gruska et al. 2010)	API usage pattern	Frequent closed itemset mining	Cross	No
(Yamaguchi et al. 2013)	Missing checks	k-Nearest neighbors + bag-of-words	Within	Yes
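
The "API usage pattern" rows in Table 2 share one idea: learn which API calls normally go together, then flag code that deviates. The sketch below is a heavily simplified stand-in for the mining techniques listed there. It only considers pairwise rules of the form "calls A implies calls B", and the call sets, thresholds, and function names are all hypothetical.

```python
from itertools import permutations

# Hypothetical input: function name -> set of APIs it calls.
CALLS = {
    "f1": {"lock", "unlock", "read"},
    "f2": {"lock", "unlock", "write"},
    "f3": {"lock", "unlock"},
    "f4": {"lock", "unlock", "read"},
    "f5": {"lock", "write"},  # deviates: lock without unlock
}


def mine_rules(calls, min_support=3, min_confidence=0.75):
    """Return (a, b) pairs where callers of `a` usually also call `b`."""
    apis = set().union(*calls.values())
    rules = []
    for a, b in permutations(sorted(apis), 2):
        with_a = [f for f, used in calls.items() if a in used]
        with_both = [f for f in with_a if b in calls[f]]
        supported = len(with_both) >= min_support
        if supported and len(with_both) / len(with_a) >= min_confidence:
            rules.append((a, b))
    return rules


def find_anomalies(calls, rules):
    """Functions that call `a` but not `b` for some mined rule (a, b)."""
    return sorted((f, a, b) for a, b in rules
                  for f, used in calls.items() if a in used and b not in used)


rules = mine_rules(CALLS)
anomalies = find_anomalies(CALLS, rules)
```

No labels are needed, which matches the unsupervised nature of this category. The flagged function is only a candidate, since a deviation from common usage is not always a bug, let alone a vulnerability.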

Vulnerable code pattern recognition

Table 3. Summary of Reviewed Works on Vulnerable Code Pattern Recognition

Paper	Code Processing Approach	Learning Approach	Static/Hybrid	Source/Binary
(Yamaguchi et al. 2011, 2012)	Extracting AST with parser	Supervised (classification)	Static	Source
(Shar and Tan 2012, 2013)	Static data flow analysis	Supervised (classification)	Static	Source
(Shar et al. 2013, 2015)	Static program slicing and control flow analysis	Semi-supervised and supervised (classification)	Hybrid	Source
(Scandariato et al. 2014)	Bag-of-words extraction from program source text	Supervised (classification)	Static	Source
(Yamaguchi et al. 2014, 2015)	Extracting Code Property Graph	Unsupervised (clustering)	Static	Source
(Pang et al. 2015)	N-gram analysis on program source text	Supervised (classification)	Static	Source
(Grieco et al. 2015)	N-gram analysis on function call sequences	Supervised (classification)	Hybrid	Binary
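
Several rows in Table 3 combine a text representation of code with supervised classification. The sketch below pairs a bag-of-words over identifier tokens with a multinomial naive Bayes classifier. The classifier choice, the six C-like training snippets, and their labels are assumptions made for illustration, not a reproduction of any listed work.

```python
import math
import re
from collections import Counter

# Hypothetical labeled snippets.
TRAIN = [
    ("strcpy(dst, src);", "vulnerable"),
    ("gets(line);", "vulnerable"),
    ("sprintf(out, fmt, name);", "vulnerable"),
    ("strncpy(dst, src, n);", "safe"),
    ("fgets(line, n, stdin);", "safe"),
    ("snprintf(out, n, fmt, name);", "safe"),
]


def tokens(code: str):
    """Bag-of-words features: identifier-like tokens in the source text."""
    return re.findall(r"[A-Za-z_]\w*", code)


def train(samples):
    counts, docs = {}, Counter()
    for code, label in samples:
        counts.setdefault(label, Counter()).update(tokens(code))
        docs[label] += 1
    vocab = set().union(*counts.values())
    return counts, docs, vocab


def classify(model, code: str) -> str:
    """Pick the label with the highest naive Bayes log-score."""
    counts, docs, vocab = model
    best_label, best_score = None, -math.inf
    for label in sorted(counts):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for tok in tokens(code):
            if tok in vocab:  # tokens never seen in training are ignored
                # Laplace smoothing avoids zero probabilities.
                score += math.log((counts[label][tok] + 1)
                                  / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
first = classify(model, "strcpy(buf, input);")
second = classify(model, "snprintf(buf, n, fmt);")
```

Richer representations in the table, such as ASTs or code property graphs, replace the `tokens` step while the learn-then-match structure stays the same.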

Other approaches

Table 4. Summary of Reviewed Miscellaneous Approaches

Paper	Approach Summary
(Sparks et al. 2007)	Used Genetic Algorithm (GA) for intelligently guiding the input selection process of black-box fuzz testing
(Wijayasekara et al. 2012, 2014)	Used text mining (bag-of-words) on bug reports in open bug databases for identifying hidden impact bugs (HIBs)
(Alvares et al. 2013)	Used a hybrid of static data-flow analysis and computational intelligence (GA and FSS) techniques for discovering exploitable memory corruption vulnerabilities
(Medeiros et al. 2014)	Used classification techniques on the output of static tainted data-flow analysis for web application vulnerability discovery to identify false-positive reports
(Sadeghi et al. 2014)	Used a probabilistic rule ranking approach based on the information contained in categorized software repositories to improve the efficiency and scalability of static vulnerability analysis tools
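
The first row of Table 4 uses a genetic algorithm to guide input selection in fuzz testing. The sketch below shows the general idea only, under loud assumptions: it is a mutation-only evolutionary search with elitism (no crossover), the target is a hypothetical check against the header `MAGIC`, and the number of checks passed stands in for code coverage.

```python
import random

MAGIC = b"FUZZ"  # hypothetical header that the target checks byte by byte


def coverage(data: bytes) -> int:
    """Fitness: how many consecutive header checks the input passes."""
    depth = 0
    for i, expected in enumerate(MAGIC):
        if i < len(data) and data[i] == expected:
            depth += 1
        else:
            break
    return depth


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def evolve(generations=50, size=20, seed=0):
    rng = random.Random(seed)
    population = [bytes(rng.randrange(256) for _ in range(len(MAGIC)))
                  for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=coverage, reverse=True)
        history.append(coverage(population[0]))
        elite = population[: size // 2]  # the fitter half survives unchanged
        children = [mutate(rng.choice(elite), rng)
                    for _ in range(size - len(elite))]
        population = elite + children
    population.sort(key=coverage, reverse=True)
    history.append(coverage(population[0]))
    return population[0], history


best, history = evolve()
```

Because the elite is carried over unchanged, the best fitness never decreases from one generation to the next. Inputs that get deeper into the target are kept and refined, which is the advantage over purely random fuzzing.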
