《Multi-Modal Features Representation-Based Convolutional Neural Network Model for Malicious Website》

Multi-Modal Paper Series: Reading Notes Index



Article link

1.MEANING OF THE PAPER TITLE

A convolutional neural network model for malicious website detection based on multi-modal feature representation.

2.ABSTRACT

Web applications have proliferated across various business sectors, serving as essential tools for billions of users in their daily life activities. However, many of these applications are malicious, posing a major threat to Internet users, as they can steal sensitive information, install malware, and propagate spam. Detecting malicious websites by analyzing web content is ineffective due to the complexity of extracting representative features, the huge data volume, the evolving nature of malicious patterns, the stealthy nature of the attacks, and the limitations of traditional classifiers. Uniform Resource Locator (URL) features are static and can often provide immediate insights about a website without the need to load its content. However, leveraging solely lexical URL features proves insufficient, potentially leading to inaccurate classifications. This study proposes a multimodal representation approach that fuses textual and image-based features to enhance the performance of malicious website detection. Textual features facilitate the deep learning model's ability to understand and represent detailed semantic information related to attack patterns, while image features are effective in recognizing more general malicious patterns. In doing so, patterns that are hidden in textual format may be recognizable in image format. Two Convolutional Neural Network (CNN) models were constructed to extract the hidden features from both textual and image-represented features. The output layers of both models were combined and used as input for an artificial neural network classifier for decision-making. Results show the effectiveness of the proposed model when compared to other models. The overall performance in terms of Matthews Correlation Coefficient (MCC) was improved by 4.3%, while the false positive rate was reduced by 1.5%.

3.INDEX TERMS

Convolutional neural network, malicious URL detection, malicious website detection, multi-modal features representation, URL image representation.

4.INTRODUCTION

  1. According to the Siteefy website [1], there are over 1.11 billion websites in the world, and this number has been growing exponentially in recent years. Every day, around 252 thousand new websites are created. As of May 9, 2023, the number of web pages is estimated at more than 50 billion. Although most websites are created for good purposes, many of them are malicious [2]. Malicious websites are designed to harm users in some way, such as by stealing their personal information or installing malware on their computers. They can be used to spread malware, conduct phishing, spread spam, or conduct denial-of-service attacks [3]. According to Google's in-depth research, there are an estimated 12.8 million malicious websites on the internet [4]. Furthermore, as stated by the authors in [5], there are 18.5 million websites hosting malicious code. This number is constantly changing, as new malicious websites are created and old ones are taken down.
  2. Malicious website detection has been the subject of much research and many solutions were suggested [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23]. The blacklist is the most common solution used by many organizations [24]. However, it is slow to update, as malicious actors can easily bypass blacklists by creating new websites or simply changing the URLs of their websites. This makes it difficult for blacklist-based systems to keep up with the ever-changing landscape of malicious websites [25], [26].
  3. To address the limitations of blacklisting, many researchers have employed machine learning techniques to detect malicious websites. These techniques extract features from web content [27], [28], [29], scripts [15], [16], HTTP/s responses [29], [30], URLs [6], [7], [8], [9], [10], [11], [12], [13], [14], [31], [32], [33], domain names [25], [34], [35], network traffic data [34], [36], and digital certificates [26]. Many machine learning algorithms were used, such as support vector machines, decision trees, logistic regression, and random forests, to classify websites as malicious or benign [28], [32]. The effectiveness of machine learning methods depends on the choice of features [13], [14], [17], [18], [19], [20], [21], [22], [23]. However, extracting effective features is challenging due to the constant changing of malicious code, the use of obfuscation techniques by attackers, the huge volume of data that needs to be analyzed, and the complexity of today's attacks. Unfortunately, traditional machine learning is ineffective in extracting useful patterns for classification from huge and complex datasets, so effective feature engineering is required to improve detection performance.
  4. Deep learning models are effective in extracting representative features from huge and complex datasets. They can automatically extract effective features without the need for intensive manual feature engineering, as they can automatically learn features from webpage text data. Convolutional Neural Networks (CNN) [22], Recurrent Neural Networks (RNN) [23], and attention mechanisms are commonly reported methods for malicious website detection. Many deep learning models are constructed based on features extracted from the website's content. However, acquiring large and diverse datasets from website content for training deep learning models is challenging due to the dynamicity of web content, the use of anti-scraping mechanisms to detect and block automated scrapers, and the evolving nature of online threats. Some websites require user sessions and authentication to access content, so scraping them may involve simulating user interactions, including logging in. Websites frequently change their structure and layout, necessitating ongoing maintenance and updates to scraping scripts to ensure they continue to work correctly. Moreover, extracting webpage representative features from web content may be inefficient for resource-limited devices such as IoT devices. Although content-based features can be used for detecting many types of threats, relying on web content features is neither effective nor efficient for detecting advanced malicious websites.
  5. URL-based features seem to be a good alternative to web content features. Many researchers have compared the performance of models constructed using both types of features, and URL-based features consistently win. However, most existing studies rely solely on the lexical features extracted from URLs. Lexical features carry limited semantic information, which leads to sparse feature vectors. Some studies combine URL features with digital certificates to improve detection performance. Malicious websites often lack valid certificates or use self-signed certificates, making certificate analysis a useful indicator of trustworthiness. Analyzing digital certificates can reveal whether a website is employing encryption, which is a common practice among reputable sites. However, not all websites use digital certificates, and some may employ self-signed certificates or certificates issued by less reputable Certificate Authorities (CAs). Extracting relevant and meaningful features from certificates for machine learning models can be complex, and the selection of the right features is crucial for effective detection. In addition, digital certificates can be misconfigured, expired, and frequently changed, leading to high false alarms. To sum up, existing solutions for detecting malicious web applications through web content analysis often struggle due to complex feature extraction, massive data volumes, evolving attack patterns, and limitations of traditional classifiers. Relying solely on lexical URL features proves insufficient, potentially leading to inaccurate classifications.
  6. To address these challenges, this study proposes a novel multimodal representation approach that integrates textual and image-based features to enhance malicious website detection. This approach leverages the strengths of both modalities: textual features capture detailed semantic information related to attack patterns, and image features recognize broader malicious visual cues. Hidden patterns within textual content may become discernible through image analysis.
  7. The proposed approach employs two Convolutional Neural Networks (CNNs): one for textual features and another for image features. Their outputs are then combined and fed into an artificial neural network classifier for improved decision-making. Our results demonstrate the superiority of the proposed model compared to existing approaches. We achieve a 4.3% increase in Matthews Correlation Coefficient (MCC) and a 1.5% reduction in the false-positive rate, showcasing the effectiveness of our multimodal approach in accurately identifying malicious web applications.
  8. This study made the following contributions:
    (1) Integrating DNS-derived features with URL-based features enhances the comprehensiveness of malicious website detection. This synergy offers valuable contextual information regarding domain behavior and infrastructure, thereby fortifying the evaluation of website authenticity and security and contributing to a more robust and nuanced approach to identifying malicious websites.
    (2) The study introduces a multimodal representation approach that utilizes both textual and image-based features to represent a comprehensive feature set. Textual features facilitate the deep learning model's ability to understand and represent detailed semantic information related to attack patterns, while image features are effective in recognizing more general malicious patterns.
    (3) Two Convolutional Neural Network (CNN) models were designed and developed to extract hidden features from the textual and image representations.
    (4) An additional deep learning classifier was constructed to learn the relationships among the hidden features extracted by the CNN models. This approach advances the field by applying deep learning techniques to combine and leverage both textual and visual information for more effective malicious website detection.
  9. The paper is organized as follows. Section II reviews the relevant literature and Section III describes the proposed solution in detail. Section IV discusses the experimental design and Section V presents the results and discussion. Section VI concludes the paper and discusses the limitations and future work.

5.RELATED WORK

  1. There are three main approaches that have been suggested by researchers for malicious URL classification: blacklist, content-based, and URL-based [11], [32]. Many techniques were proposed to construct the detection classifiers, such as heuristic rules based on professional experience or machine learning techniques. However, effective malicious URL detection is still an open problem. The performance of recent malicious website detection solutions is influenced by the extracted features and the machine learning algorithms used for constructing the detection classifier. The authors in [32] presented an in-depth literature review that covers various machine learning-based techniques for detecting malicious URLs, considering aspects such as limitations, detection technologies, feature types, and datasets. Combining the right types of extracted features with deep learning techniques is a research trend in malicious website detection. Heuristic rules based on professional experience were widely used for constructing blacklists of malicious URLs, such as the Google Safe Browsing tool [37]. However, blacklist solutions are ineffective for malicious URL detection because the constantly evolving threats require frequent identification of the evolved threats and frequent updating of the database.
  2. Many researchers have used feature extraction techniques to extract features from website content to detect malicious content, and natural language processing has been commonly employed for representation. However, due to the evolving nature of attackers' techniques, malicious website content is complex, and such patterns become dynamic and stealthy, leading to poor detection accuracy. For example, in [38], the authors investigated how malicious websites employ various web spam techniques to evade detection. The aim is to provide an effective solution for detecting and combating malicious websites that utilize techniques like redirection spam, hidden-Iframe spam, and content-hiding spam. Accordingly, the study focuses on capturing screenshots of webpages from a user's perspective and using a Convolutional Neural Network for classification. However, the solution is limited to detecting spam techniques. Moreover, features that depend on screenshots of the loaded page may be risky to collect and incomplete due to the dynamic nature of websites.
  3. In [27], the authors collected features from the HTTP/s responses and applied various feature transformation and selection techniques for classification. However, these features are dynamic and subject to obfuscation using encoding and encryption mechanisms, which can render the detection classifier ineffective. Although machine learning algorithms were widely used for constructing the detection classifier, many researchers focused on deep learning techniques. Deep learning can accurately determine the similar patterns learned during training, resulting in effective classification. However, web content is very dynamic and may be encrypted or encoded to hide malicious patterns, posing a challenge in extracting effective features for classification.
  4. URL features, which are less dynamic, are promising for the accurate detection of malicious domains. This is because malicious domains are generated algorithmically while benign domains are created by humans. Thus, malicious URLs may contain more prominent features compared to features extracted from content, which can be obfuscated or encrypted to mislead the learning process. The authors in [38] focused on detecting malicious URLs that are generated algorithmically. They hypothesize that attackers or malicious bots are used to generate malicious URLs automatically; accordingly, those URLs may contain patterns that differ from those generated by humans. Similarly, the authors in [39] and [40] proposed solutions for detecting URLs that are generated using Domain Generation Algorithms (DGAs).
  5. The authors in [41] proposed a malicious website detection technique based on lexical and host-based features extracted from URLs. Results showed that URL features are more accurate compared to the other types of features. The authors in [26] proposed an adaptive segmentation mechanism to solve the maximum sequence length (MSL) limitation in deep learning. Webpage text, digital certificates, and Uniform Resource Locators (URLs) were used as the sources of the extracted features and used to construct the detection model using a Multi-Head Self-Attention and multi-channel text convolution (MCTC) network. However, relying on dynamic content features is challenging and can degrade classification performance. The study in [42] presented an approach to learning the uncertainties by employing deep Bayesian neural networks (DBNNs) to model the stochastic system dynamics. The authors in [43] presented a URL-embedding feature extraction algorithm based on an unsupervised learning technique (Huffman coding) to reduce the dimensionality of the feature vector. Although the algorithm shows better detection performance compared to existing feature extraction mechanisms, it was evaluated using a dataset with strong assumptions about the length and character distribution of the malicious URL samples.
  6. In [34], the authors proposed an anomaly detection model for detecting malicious domains. They utilized a Hidden Markov Model (HMM) with a probabilistic model to construct the normal profile of benign domains. In online operation, if a domain is suspicious, the Jensen–Shannon (JS) divergence is calculated between the suspicious domain and a subset of the benign domains, and if the JS divergence exceeds a specific threshold, the domain is flagged as malicious. The authors in [31] proposed a detection model called "deepBF", which combines Bloom Filters and deep learning techniques, aiming to improve accuracy and efficiency in identifying potentially harmful web addresses. An evolutionary convolutional neural network was used to construct the detection classifier. The authors in [33] compared the performance of several deep learning and traditional machine learning techniques for detecting malicious URLs. The BiLSTM classifier was reported as the best-performing classifier among those studied.
  7. The authors in [21] used a combination of different feature transformations to reduce the data volume and improve the learning process. Various linear and non-linear space transformation methods were used in the solution. Although feature transformation plays a significant role in improving classifiers constructed using traditional machine learning techniques, the total number of extracted features (62) does not seem very challenging if deep learning techniques are used for classification.
  8. The authors in [44] presented a solution for malicious URL detection using two-stage ensemble learning to address the growing concern of web-based attacks. The study leverages cyber-threat intelligence features from sources like Google web search and Whois websites to enhance detection accuracy. The two-stage ensemble approach, combining Random Forest and Multi-Layer Perceptron algorithms, results in an improvement in accuracy and a reduction in false positives when compared to traditional URL-based models. However, the study does not thoroughly examine the potential limitations of relying on external cyber threat intelligence sources, which may pose challenges in terms of comprehensiveness and timeliness, warranting further investigation.
  9. The authors in [45] proposed a curriculum-based multimodal masked transformer network (CMMTN) that combines BERT and ResNet to enhance text and image representations, addressing the assumption of having labeled posts for training the fake news detection model. The CMMTN aims to strengthen correlations between relevant information by masking irrelevant context between modalities. However, the solution proposed in the current study targets malicious website detection, which presents different challenges compared to fake news detection, as it involves different linguistic issues.
  10. The authors in [46] introduced a multi-modal hierarchical attention model (MMHAM) for phishing website detection, extracting features from URLs, textual information, and visual design. However, that study solely focuses on phishing website detection, limiting its generalizability to other types of malicious websites. The current study takes a broader approach to detect various kinds of malicious websites. Additionally, it incorporates semantic textual patterns, utilizing character embedding techniques to extract semantic features from textual data.
  11. The authors in [47] proposed a hybrid deep learning approach to combine visual and textual modalities for detecting incongruous hashtags in user-generated content. However, the study concentrates on extracting contradictions between textual and visual features, which differs from malicious website detection where both features represent the same aspects from different perspectives.
  12. To sum up, many approaches have been investigated for detecting malicious websites, and detection performance relies heavily on the features extracted and the design of the model. Web content features are highly dynamic and complex, making it challenging to construct an efficient and effective classifier. Regarding efficiency, the features must be rendered by a browser before the extraction process, which is risky and also consumes valuable memory and computational resources. Meanwhile, regarding effectiveness, such features can be manipulated, encrypted, or encoded in such a way as to hide malicious patterns, making it very difficult to extract meaningful features for effective learning. URL features are more effective and efficient due to their size and generation conditions. The features extracted from URLs are less complex and more stable compared to content-based features. Usually, malicious URLs are generated automatically using domain generation algorithms, and such URLs have different character distributions; that is, their features can be more distinguishable compared to human-generated ones. In addition, while features extracted from benign samples may be meaningful, malicious features usually contain meaningless terms, misspelled words, and randomly generated text. Benign URLs are more straightforward, while malicious URLs may contain multiple domains, longer lengths, and more hierarchical paths. Thus, features extracted from URLs contain more valuable patterns for machine learning classifiers. Features such as those extracted from domain certificates or domain name servers are important. Lexical features extracted from domain information, URLs, and HTTP/s header responses are also valuable. Feature representation plays an essential role in improving learning performance; however, few studies have focused on this issue. Many current detection models either rely on lexical features with statistical representations or depend on content-based features, which can result in low detection accuracy and high false alarms.

6.THE PROPOSED HF-CNN MODEL

The proposed model consists of four main phases as follows: features extraction phase, features representation phase, classifiers construction phase, and decision-making phase (See Figure 1). The output of each phase is used as input to the next phase. A detailed description of each phase is presented in the subsequent sections.
[Figure 1: The main phases of the proposed model]

A. PHASE 1: DATA COLLECTION PHASE

The dataset used in this study is available on the Kaggle website and can be downloaded from the following link: https://www.kaggle.com/datasets/sid321axn/maliciousurls-dataset?datasetId=1486586. Various types of URLs, including benign, phishing, malware, and defacement, were collected from different sources such as the ISCX-URL-2016 dataset, Faizan's GitHub repository, and the Malware Domain Blacklist dataset.
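For orientation, a minimal sketch of loading and inspecting this dataset is shown below. The file name malicious_phish.csv and the column names url and type are assumptions based on the Kaggle listing, not details stated in the text.

```python
import pandas as pd

# Load the Kaggle malicious-URLs dataset (file and column names are assumed:
# "malicious_phish.csv" with columns "url" and "type").
df = pd.read_csv("malicious_phish.csv")

# The "type" column is assumed to hold the four classes mentioned above:
# benign, defacement, phishing, and malware.
print(df["type"].value_counts())
```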

B. PHASE 2: FEATURES EXTRACTION

Two types of features were extracted: URL-based and DNS-based features. The textual content present in the URL is extracted using character-level n-grams to capture the patterns, structures, and information present in the text of URLs. N-grams are contiguous sequences of n characters within the text. The n-gram is a text analysis technique that breaks down text into smaller units, where 'n' represents the number of units (typically words or characters). For example, for the URL "https://www.example.com", if we consider 3-grams (trigrams), we would have the following n-gram vector: ["htt", "ttp", "tps", "ps:", "s:/", "://", "//w", "/ww", "www", "ww.", "w.e", ".ex", "exa", "xam", "amp", "mpl", "ple", "le.", "e.c", ".co", "com"]. Each element in the n-gram vector is a feature. In this study, n-grams ranging from 3 to 5 are used, so the feature vector can contain complete textual terms such as "http", "https", ".com", ".org", and so on. The DNS features are the information related to the DNS requests made when accessing these URLs; DNS requests may include domain names, IP addresses, and other metadata. Similar to the URL features, the DNS features were extracted and represented using n-grams.
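To make the trigram example concrete, here is a minimal sketch of character-level n-gram extraction; the helper name char_ngrams is ours, not the paper's.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """Return all contiguous character n-grams of text for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

url = "https://www.example.com"
print(char_ngrams(url, 3, 3)[:5])  # ['htt', 'ttp', 'tps', 'ps:', 's:/']
```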

C. PHASE 3: FEATURES REPRESENTATION

In this study, a multimodal representation approach employs textual and image-based features to represent the combined feature set. Textual features facilitate the deep learning model's ability to understand and represent detailed syntax information related to attack patterns, while image features are effective in recognizing more general malicious patterns.

1) TEXT REPRESENTATION

The URLs are converted to sequences of characters called tokens. N-grams in the range (1, 4) were used to enrich the features. A dictionary is then created from the unique tokens in the sequences, and a feature vector containing all the unique tokens is constructed. Each token is assigned an integer index; that is, the dictionary maps each token to a unique integer index. For example, if the token "www" is assigned index 3, it is the third token in order in the dictionary. The dictionary also contains the frequency of each token in the entire corpus. Thus, to convert a URL to a sequence, n-grams with a range of 1 to 4 are used to tokenize the URL at the character level, and each token is then mapped to its integer index in the dictionary. The sequence is post-padded based on the longest sequence in the dataset; for simplicity, the sequence length is set to 659 in this study. This sequence is used as input for the designed CNN input layer.
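A minimal sketch of this tokenize-index-pad pipeline is shown below. The helper names are ours, and the out-of-vocabulary handling (index 0, shared with padding) is an assumption the text does not specify.

```python
import numpy as np

def char_ngrams(text, n_min=1, n_max=4):
    """All contiguous character n-grams of text for n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def build_vocab(urls):
    """Map each unique token to an integer index; 0 is reserved for padding."""
    vocab = {}
    for url in urls:
        for tok in char_ngrams(url):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def url_to_sequence(url, vocab, max_len=659):
    """Convert a URL to a fixed-length integer sequence, post-padded with zeros."""
    seq = [vocab.get(tok, 0) for tok in char_ngrams(url)][:max_len]
    return np.array(seq + [0] * (max_len - len(seq)))
```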

2) IMAGE REPRESENTATION

URL information is treated as images: each URL is converted into a visual representation, where the characters in the URL are transformed into a 2D, image-like structure using character embedding. The resulting "images" represent the visual patterns within URLs. In this approach, each character in the URL is treated as a basic building block. The process of converting URLs into visual representations using character embedding consists of two steps. First, in the character-level representation step, the URL is broken down into its characters (letters, digits, symbols, etc.), and each character is considered a discrete element. Second, in the feature embedding step, character embedding, a technique commonly used in Natural Language Processing (NLP) to represent discrete characters or words as continuous vectors, generates a corresponding embedding vector for each character in the URL. These vectors are learned during the training process and capture semantic information about the characters. Character embedding allows the model to convert characters into numerical representations that retain information about their relationships and patterns. The pseudo-code in Algorithm 1 outlines the process of converting a URL into an image-like representation, starting by tokenizing the URL into an n-gram sequence, and then using a CNN for feature extraction.
Let the character set be C = {abcdefghijklmnopqrstuvwxyz 0123456789 − , ; . ! ? : ′ / \ | _ @ # % ^ & ∗ ~ ‘ + = ( ) [ ] { }}. The URL is converted to a series of characters, and each character is considered a feature. N-grams with a range between 2 and 4 were applied to extract more features from the URL and improve the representation. The n-gram features are merged into the URL character sets. Then, the term frequency $tf_i$ is calculated for each feature in the merged vector and stored in a corpus called C (see Algorithm 1, Line 8). The term frequency $tf_i$ is a local measure of term importance within a single document: it indicates how often a term appears in that document. The unique terms in the corpus were extracted and stored in a dictionary, and the inverse document frequency weight was calculated for each term in the dictionary as follows:

$$idf_i = \log\left(\frac{N}{df_i}\right)$$

where $df_i$ is the document frequency of term $i$ (the number of documents in which it appears) and $N$ is the total number of documents. While $tf_i$ is local, $idf_i$ measures the global importance of a term across the entire corpus: it quantifies how unique or common the term is. Next, for each feature in the corpus, the term frequency-inverse document frequency is calculated by multiplying the $tf_i$ and $idf_i$ values of the term in each document:

$$tf\_idf_i = tf_i \times idf_i$$

A term receives a high $tf\_idf_i$ score in a document if it appears frequently in that document but relatively rarely in the corpus as a whole. The $tf\_idf_i$ features are scaled using min-max normalization as follows:

$$tf\_idf_i' = \frac{tf\_idf_i - \min(tf\_idf)}{\max(tf\_idf) - \min(tf\_idf)}$$

Finally, the feature vector is created from the unique terms of the corpus. The maximum length of the feature vector is 4096 features, and each feature vector is reshaped into a 64 × 64 image, as follows.
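A minimal sketch of this pipeline is given below, using scikit-learn's TF-IDF implementation as a stand-in for the paper's Algorithm 1. Scikit-learn's smoothing and normalization details differ slightly from the plain equations above, so this is an approximation under stated assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import minmax_scale

def urls_to_images(urls, side=64):
    """Approximate the URL-to-image representation: character n-grams (2-4)
    -> TF-IDF scores -> min-max scaling -> vector of side*side features
    -> one 2D single-channel 'image' per URL."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4),
                          max_features=side * side)     # at most 4096 features
    X = vec.fit_transform(urls).toarray()
    X = minmax_scale(X)                                 # scale features to [0, 1]
    pad = side * side - X.shape[1]                      # zero-pad if the corpus
    if pad > 0:                                         # yields fewer n-grams
        X = np.hstack([X, np.zeros((X.shape[0], pad))])
    return X.reshape(-1, side, side, 1)
```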

The pseudocode in Algorithm 1 illustrates the proposed URL-to-image representation approach, and Figure 2 shows the output of the algorithm. Figure 3 shows the histograms of six randomly selected samples. As can be seen in Figures 2 and 3, benign websites have less intense features compared to defacement websites. Phishing websites look similar to benign websites, which can be explained by the attackers' purpose: in phishing websites, attackers try to look benign so they can harvest sensitive information or perform an attack.
[Algorithm 1: URL-to-image representation]
[Figure 2: Sample output of the URL-to-image representation algorithm]
[Figure 3: Histograms of six randomly selected samples]

D. PHASE 4: CNN MODELS CONSTRUCTION

Two CNN models were constructed: the first was trained on the image-representation features and the second on the textual features. A detailed description of the two models follows.

1) CNN MODEL FOR IMAGE

CNNs are typically used for image-related tasks, as they are effective at detecting patterns and features in 2D data. By applying convolutional layers to the image grids produced by Algorithm 1, the CNN learns to detect important patterns and features within the URL's character sequence. As shown in Figures 4(a) and (b), the proposed CNN model, called imgCNN, consists of nine layers, as follows.
[Figure 4: Architecture of the proposed imgCNN model]
The first layer is a convolutional layer with 32 filters/kernels, a kernel size of (3, 3), and ReLU activation. It processes the input data, resulting in feature maps of size (62, 62, 32). The second layer is a max-pooling layer with a pool size of (2, 2). It reduces the spatial dimensions of the feature maps by taking the maximum value in each 2×2 region, resulting in smaller feature maps; its output shape is (None, 31, 31, 32). The third layer is the second convolutional layer, with 64 filters, a kernel size of (3, 3), and ReLU activation. It further processes the feature maps from the previous layer; its output shape is (None, 29, 29, 64). The fourth layer is the second max-pooling layer with a pool size of (2, 2), further reducing the spatial dimensions. The fifth layer is the third convolutional layer, with 64 filters, a kernel size of (3, 3), and ReLU activation. The sixth layer flattens the 3D feature maps into a 1D vector, preparing them for the fully connected layers. The seventh layer is a fully connected layer with 64 units and ReLU activation.
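For concreteness, a minimal Keras sketch of the imgCNN architecture described above might look as follows. The text details only the first seven layers, so the final 4-class softmax output layer is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of imgCNN following the layer-by-layer description above.
img_cnn = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),                  # 64x64 URL "images"
    layers.Conv2D(32, (3, 3), activation="relu"),    # -> (62, 62, 32)
    layers.MaxPooling2D((2, 2)),                     # -> (31, 31, 32)
    layers.Conv2D(64, (3, 3), activation="relu"),    # -> (29, 29, 64)
    layers.MaxPooling2D((2, 2)),                     # -> (14, 14, 64)
    layers.Conv2D(64, (3, 3), activation="relu"),    # -> (12, 12, 64)
    layers.Flatten(),
    layers.Dense(64, activation="relu"),             # hidden features fused in Phase 5
    layers.Dense(4, activation="softmax"),           # assumed output layer
])
```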

2) CNN MODEL FOR TEXTUAL FEATURES

As shown in Figure 5, the proposed deep learning model for malicious URL classification using text representation (txtCNN) relies on a 1D Convolutional Neural Network (CNN). It commences with an embedding layer that translates the character-level inputs with n-gram features into continuous 32-dimensional vectors. Following this, a 1D convolutional layer of 128 filters and ReLU activation is applied to capture salient features in the text data. Max-pooling is subsequently employed for spatial reduction. The flattened output is then processed through a dense layer consisting of 128 units with ReLU activation. To mitigate overfitting, dropout with a rate of 0.5 is introduced. Finally, the model employs a softmax-based output layer to provide classification probabilities for the defined number of classes. This architecture excels at learning meaningful patterns in textual representations of URLs, facilitating the distinction between benign and malicious URLs.
[Figure 5: Architecture of the proposed txtCNN model]
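A minimal Keras sketch of txtCNN, following the description above, is shown below. The vocabulary size and the Conv1D kernel size (5) are assumptions, as neither is stated in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 50_000   # assumed; the study does not report the dictionary size
num_classes = 4

txt_cnn = keras.Sequential([
    keras.Input(shape=(659,)),                       # padded token sequences
    layers.Embedding(vocab_size, 32),                # 32-dimensional embeddings
    layers.Conv1D(128, 5, activation="relu"),        # kernel size 5 is an assumption
    layers.MaxPooling1D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])
```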
As the URL representation passes through the CNN, the network performs feature extraction. Features might include detecting specific character combinations, sequences, or other visual patterns within the URL. The CNN learns to recognize which patterns are indicative of certain URL categories, such as malicious or benign. The output from the CNN is then used as a feature representation of the URL. This feature representation, which captures visual patterns within the URL, can be passed to further layers in the neural network for classification.

E. PHASE 5: DECISION MAKING

The decision-making model is a sequential deep learning model designed to classify URLs as either benign or malicious based on integrated features from two separate models, one processing URL text representations and the other treating URLs as images. As shown in Figure 6, the model begins with an input layer, followed by densely connected layers with ReLU activation functions. These layers collectively enable the model to learn complex patterns and representations from both text and image data. The final output layer employs the softmax activation function to provide class probabilities for classification. The model is optimized using the Adam optimizer and trained to minimize categorical cross-entropy loss. Its architecture allows it to effectively fuse information from text and image representations, making informed decisions about the nature of URLs, and contributing to robust URL classification.
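A minimal sketch of this decision-making network is shown below. The hidden-layer widths are assumptions (the text does not state them), and the inputs are assumed to be the concatenated hidden-feature vectors produced by the two CNN sketches above (64-d from imgCNN and 128-d from txtCNN).

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 4   # assumed, matching the four URL categories in the dataset

# Sequential classifier over the concatenated CNN feature vectors
# (64-d from imgCNN + 128-d from txtCNN, per the sketches above).
decision_model = keras.Sequential([
    keras.Input(shape=(64 + 128,)),
    layers.Dense(128, activation="relu"),   # hidden-layer widths are assumptions
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
decision_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
# Training: decision_model.fit(np.hstack([img_feats, txt_feats]), labels, ...)
```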

7.PERFORMANCE EVALUATION

The dataset, the experimental procedures, and the performance evaluation are described in the following sub-sections.

A. SOURCES AND PREPROCESSING OF DATASETS

In this study, a popular and accessible dataset of malicious URLs was used. This dataset can be found in the Kaggle.com repository [48]. The dataset was sourced from well-established repositories frequently used by researchers specializing in the detection of malicious URLs, including Phishtank [39], [40] (accessible at https://phishtank.org/) and the URL dataset known as ISCX-URL-2016 [8] (available at https://www.unb.ca/cic/datasets/url-2016.html). The URLs within this dataset are either malicious or benign. The malicious URLs encompass a range of types, such as links to malware, web defacement, spam, phishing, and drive-by downloads. In this study, a sample of 50,000 URLs was randomly selected. Because some URLs are outdated, the validity of each URL was tested before it was included in the sample dataset: an HTTP/s request was initiated for each URL, and only URLs with a valid HTTP response were included in the sample. Figure 7 presents a summary of the quantity and types of URL samples present in the original dataset (right) and the selected sample (left).
[Figure 7: Quantity and types of URL samples in the original dataset (right) and the selected sample (left)]
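A minimal sketch of the validity filter described above might look as follows. The use of HEAD requests, the timeout, and the status-code cutoff are assumptions, as the text only says that a valid HTTP response was required; probing known-malicious URLs should only be done from an isolated environment.

```python
import requests

def is_live(url, timeout=5):
    """Return True if the URL answers with a valid HTTP response."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

# sample = [u for u in candidate_urls if is_live(u)]  # keep only live URLs
```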

B. EXPERIMENTAL PROCEDURES

In this study, the state-of-the-art deep learning-based solutions previously proposed for malicious URL detection were used to evaluate the proposed model. Additionally, a text-based CNN and an image-based CNN were developed to serve as baselines for evaluating the proposed model. Lexical URL-based features drawn from the existing literature [6], [9], [11], [12], [13], [18], [49] were also used in the comparison. In the subsequent section, we provide a detailed exposition of the results.

1) PERFORMANCE MEASURE

To assess the detection performance of the proposed model, we employed seven key performance metrics: overall accuracy, detection rate (recall), precision, F1 score, Matthews Correlation Coefficient (MCC), false-positive rate (FPR), and false-negative rate (FNR). These performance metrics are widely accepted and commonly utilized in the evaluation of malware detection solutions in the existing literature. The MCC measures the quality of binary classifications, particularly when dealing with imbalanced datasets. It takes into account true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to provide a balanced evaluation of a binary classification model. The performance measures used in this study were calculated based on the following equations.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad FPR = \frac{FP}{FP + TN}, \qquad FNR = \frac{FN}{FN + TP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Although the F-measure evaluates the overall performance of the model by measuring the balance between precision and recall, it does not consider true negatives, making it less informative for imbalanced datasets. The MCC is a more accurate measure because it is sensitive to class distribution and dataset size: it takes true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) into account in a balanced way, and therefore gives more insight into the performance of the model.
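For reference, a minimal sketch of computing these metrics with scikit-learn for the binary benign/malicious case is shown below; the helper name is ours.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

def report(y_true, y_pred):
    """Compute the seven reported metrics for binary labels (1 = malicious)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "fpr":       fp / (fp + tn),
        "fnr":       fn / (fn + tp),
    }
```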

8.RESULTS AND DISCUSSION

The classification results of the proposed HF-CNN and imgCNN compared to the related work models are listed in Table 1. It can be seen that the proposed HF-CNN is superior to all the other studied models. Compared with the baseline model txtCNN, the proposed HF-CNN shows improvements of 0.7%, 0.7%, 0.4%, and 0.6% in terms of Accuracy, Precision, F-Measure, and MCC, respectively. The False Positive Rate (FPR) and False Negative Rate (FNR) were reduced by 1.6% and 1.4%, respectively.
[Table 1: Classification results of the proposed HF-CNN and imgCNN compared with the related work models]
Figures 8-14 present the results of the proposed HF-CNN and imgCNN compared to the related work models in terms of Accuracy, Precision, Recall, F-Measure, MCC, FNR, and FPR, respectively. As can be seen in these figures, the CNN models outperform the other studied models. LSTM and DBN achieved lower performance than the other studied models because LSTM and DBN are designed for sequence modeling where there are clear dependencies between elements in a sequence. Malicious URL patterns, however, may not exhibit strong sequential dependencies, making LSTM and DBN less effective for URL classification. BiLSTM, however, achieved better performance than LSTM: LSTM is likely unable to capture the spatial correlation among the URL features, while BiLSTMs, with their bidirectional processing, can capture spatial context features. MCCNN and AMCCNN achieved comparably good performance relative to the proposed model (see Figures 11 and 12). Both MCCNN and AMCCNN employ CNNs to extract features from and classify the URLs, and CNN-based models can capture the spatial dependencies in the URL features. This also explains the improvement gained when the URLs are represented as images and the CNN model is used for classification. CNNs are designed for processing grid-like data, such as images, which have a clear spatial structure, and they are capable of capturing both local features (e.g., character-level patterns) and global features (e.g., overall URL structure) simultaneously. This flexibility allows them to identify malicious patterns at different scales within URLs.
[Figures 8-14: Accuracy, Precision, Recall, F-Measure, MCC, FNR, and FPR of the studied models]
Figures 13 and 14 show the results in terms of the false positive rate (FPR) and false negative rate (FNR). Both measures are important in the evaluation of malicious website detection models. As can be noticed in Figure 13, the proposed models HF-CNN and imgCNN achieved the lowest false positive rate, 3.49% for both models (see Table 1), while the DBN and LSTM models achieved 23.14% and 30.45%, respectively. CNN models are more effective in reducing the false positive rate due to their ability to capture the malicious patterns in the URL features. Traditional machine learning produced a high rate of false positives because such algorithms do not capture the complex sequential or spatial dependencies present in the URL-based features. Although most of the models achieved a false negative rate lower than 3%, such a percentage could still be dangerous for critical systems. Recent studies show that an average US internet user visits 130 web pages per day; that is, an average internet user may encounter 39 malicious websites per thousand visited URLs each day. The proposed model achieved a false negative rate of 0.48%; that is, 6.24 malicious websites might still be visited per thousand visited URLs.
The results show that URL-based features are a promising alternative to web content features. Researchers often assess model performance by comparing both sets of features, and URL-based features consistently outperform their counterparts. Nevertheless, the majority of existing studies primarily rely on lexical features extracted from URLs, which offer limited semantic information and result in sparse feature vectors. Some studies seek to enhance detection performance by combining URL features with digital certificates. Malicious websites frequently lack valid certificates or resort to self-signed certificates, rendering certificate analysis a valuable trustworthiness indicator. Evaluating digital certificates can unveil whether a website employs encryption, a common practice among reputable sites. However, not all websites employ digital certificates, and some may utilize self-signed certificates or certificates issued by less reputable Certificate Authorities (CAs). The extraction of relevant and meaningful features from certificates for machine learning models can be intricate, and the judicious selection of appropriate features is pivotal for effective detection. Furthermore, digital certificates can be susceptible to misconfiguration, expiration, and frequent changes, leading to an elevated rate of false alarms.

9.CONCLUSION AND FUTURE WORKS

In this study, a malicious website detection model called HF-CNN was designed and developed. The model integrates URL features with DNS features to enhance the comprehensiveness of identifying malicious websites. A multimodal representation approach that encompasses both textual and image-based characteristics was proposed to depict the combined feature set. Textual attributes enable the deep learning model to grasp and depict complex semantic details associated with attack patterns, while image attributes excel at recognizing broader malicious patterns. Two Convolutional Neural Network (CNN) models were constructed to extract hidden features from the textual and image representations; CNNs are capable of simultaneously capturing both local and global features. The results indicate that the proposed model outperforms the other related models. The overall performance in terms of F-measure and MCC was improved by 0.4% and 0.6%, respectively, compared with the baseline model txtCNN, while the False Positive Rate (FPR) and False Negative Rate (FNR) were reduced by 1.6% and 1.4%, respectively.
While the proposed model achieved a high detection performance of 98.88% in terms of F-measure, a considerable amount of error remains, as measured by the MCC score of 96.66%. The errors mostly result from unrepresented features in the URLs and DNS information. Therefore, relying solely on URLs, DNS information, or static features is not a wise approach to malicious website detection, as some benign domains that suffer from security vulnerabilities may become malicious due to injection attacks. It is therefore important to combine the URL-based features with other features, such as content features. However, content features are complex due to their high dynamicity and their usability by attackers to evade detection. As a result, further research is needed to propose effective and efficient mechanisms for acquiring web content.
Furthermore, employing an adaptive ensemble of classifiers designed to accommodate the dynamic nature of evolving threats could enhance detection performance. Each classifier within the ensemble is constructed based on a distinct set of features, providing versatility and robustness in addressing diverse threat scenarios.

10.ACKNOWLEDGMENT

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research work; project number (168/442). Further, the authors would like to extend their appreciation to Taibah University for its supervision support.

AF (Association Fusion) is an association-based multi-modal classification method. Multi-modal classification uses several different types of data (such as images, text, and audio) for a classification task. Traditional multi-modal classification methods typically extract features from each modality separately and then fuse those features to obtain the final result, whereas AF performs fusion by building associations between the modalities. Specifically, AF first extracts a feature vector from each modality. It then establishes associations between modalities by computing the correlation between them, using measures such as mutual information or the Pearson correlation coefficient. Next, AF adjusts the weight of each modality according to its association strength: the weight is proportional to the correlation, so modalities with higher correlation receive larger weights, and the importance of each modality is adjusted dynamically based on the associations in the data. Finally, AF produces the fused feature vector as the weighted combination of the per-modality features, and this vector can then be used for the classification task. Compared with traditional fusion methods, AF captures the association information between modalities more accurately, so the fused feature vector better reflects the characteristics of the whole multi-modal input and improves classification accuracy. In summary, AF is an association-based multi-modal fusion method that dynamically adjusts the weight of each modality according to the associations between the data, thereby improving multi-modal classification accuracy.