Study Notes on "Learned Bloom Filters in Adversarial Environments: A Malicious URL Detection Use-Case"

Abstract:

Abstract—Learned Bloom Filters (LBFs) have recently been proposed as an alternative to traditional Bloom filters that can reduce the amount of memory needed to achieve a target false positive probability when representing a given set of elements. LBFs rely on machine learning models combined with traditional Bloom filters. However, if LBFs are to be used as an alternative to Bloom filters, their security must also be considered. In this paper, the security of LBFs is studied for the first time and a vulnerability different from those of traditional Bloom filters is uncovered. In more detail, an attacker can easily create a set of elements that are not in the filter yet have a much larger false positive probability than the target for which the filter was designed. The constructed attack set can then be used, for example, to launch a denial of service (DoS) attack against the system that uses the LBF. A malicious URL case study is used to illustrate the proposed attacks and to show their effectiveness in increasing the false positive probability of LBFs. The dataset under consideration includes nearly 485K URLs, of which 16.47% are malicious. Unfortunately, mitigating this vulnerability does not seem to be straightforward.

Index Terms—Learned Bloom Filters; Machine Learning; Security.

Main Text:

I. INTRODUCTION

Introduced half a century ago, the Bloom filter [1] is today widely used to perform approximate membership checking in computing and communications [2]. For example, Bloom filters are used to efficiently implement multicast [3], to coordinate access in wireless networks [4], or to speed up packet processing [5]. In fact, Bloom filters are still an active area of research, and variants and optimizations continue to be proposed [6], [7].

In the original Bloom filter, the elements of the set are mapped to several positions in a bit array that are set to one when the element is inserted and checked when an element is tested for membership. Those positions are obtained by applying hash functions to the element, so there is no correlation among the positions set by any two elements, even if the elements are similar. This results in a memory footprint that is approximately proportional to the number of elements in the set. However, due to hash collisions, the Bloom filter can (with low probability) return a positive for an element that was never inserted. In contrast, the Bloom filter has no false negatives.
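
As a study aid, here is a minimal Python sketch of these mechanics (not from the paper); the salted SHA-256 hashing is an illustrative choice rather than what any particular implementation uses:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash positions per element."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, item: str) -> bool:
        # Can return True for an item never added (false positive),
        # but never False for an added item (no false negatives).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```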

The Learned Bloom Filter (LBF) has recently been proposed to reduce the memory needed to implement the filtering [8]. The main idea behind the LBF is to train a machine learning model using the set to be represented and a relevant number of negative elements, and then combine that model with a traditional Bloom filter to build the filter. The LBF can reduce the memory requirements by using the learned model to exploit the correlation among elements in the set. The LBF has since been generalized so that the learned model is combined with several Bloom filters, each used for a different output range of the learned model [9].

As Bloom filters are commonly used to speed up systems, they can be the target of Denial of Service (DoS) attacks that exploit algorithmic complexity [10]. Indeed, several attacks on Bloom filters have been described, for example to increase their false positive probability, which would limit their ability to filter elements and thus to speed up the system [11], [12], [13]. Attacks have also been proposed for other filters, such as the cuckoo filter, again with the aim of limiting its ability to filter elements [14].

Therefore, before considering the use of LBFs in high-availability or critical systems, their security should be analyzed. This is the purpose of this paper, which to the best of our knowledge presents the first study on the security of LBFs. The rest of the paper is organized as follows. Section II presents a brief overview of the Learned Bloom Filter and the adversarial models considered. Then, in Section III, two attacks on LBFs are presented, and they are evaluated in Section IV. Section V discusses mechanisms that could potentially be used to detect the attack. The paper ends in Section VI with the conclusions and a discussion of ideas to further explore the security of LBFs.

II. PRELIMINARIES

This section briefly describes the architecture of LBFs and then presents the adversarial models considered in the rest of the paper.

A. Learned Bloom Filters

As discussed in the introduction, a traditional Bloom filter uses a bit array and maps elements to k positions using hash functions. Those positions are set to one when the element is inserted and checked when performing a lookup. As more elements are inserted, more positions will store a one and the probability of returning a false positive increases.

The optimal configuration for a filter that has m bits and stores n elements achieves a false positive probability of approximately (1/2)^(ln 2 · m/n) [2]. For example, to achieve a 1% false positive probability, m/n has to be approximately 9.6, and the filter thus requires 9.6 bits per element stored.
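
The 9.6 bits-per-element figure can be checked by solving (1/2)^(ln 2 · m/n) = 0.01 for m/n; a quick Python check:

```python
import math

# Solve (1/2)^(ln 2 * m/n) = target_fpp for m/n.
target_fpp = 0.01
bits_per_element = math.log(target_fpp) / (math.log(0.5) * math.log(2))
print(round(bits_per_element, 2))  # ~9.58, i.e. roughly 9.6 bits per element
```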

The main idea behind the LBF is to use machine learning to train a model f(x) that produces values in the interval [0, 1] to classify elements as belonging to a set of interest or not. Elements in the set should be mapped to values close to one and elements not in the set to values close to zero. Then, the LBF is built by selecting a threshold T on the output of that model that achieves the desired false positive rate, and finally adding a classical approximate membership structure (such as a classical Bloom filter) to handle false negatives (elements in the set that have f(x) < T).

The overall approach is illustrated in Figure 1. First, the incoming element x is processed by the learned model to obtain f(x). If f(x) is larger than a threshold T, then x is classified as a positive. Otherwise, the element x goes to a Bloom filter that contains the elements of the set for which f(x) < T. The idea is that the learned model may be able to provide the desired false positive rate using a smaller memory footprint, and only the elements that are false negatives on the learned model need to be inserted in the Bloom filter. That is, if most elements in the set can be identified in the first stage (i.e. the ML model) and produce a value of f(x) > T, while elements not in the set produce in most cases values f(x) < T, then the LBF can possibly reduce the memory needed to implement the filter.
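
As a study note, the two-stage lookup of Figure 1 can be sketched as follows, reusing the BloomFilter sketch above; the learned model f is assumed to be any callable returning a score in [0, 1]:

```python
def lbf_contains(x: str, f, backup: BloomFilter, T: float) -> bool:
    """Two-stage LBF lookup: learned model first, backup Bloom filter second."""
    if f(x) >= T:
        return True            # classified as positive by the learned model
    # Elements of the set with f(x) < T were inserted into the backup filter,
    # so the LBF as a whole has no false negatives.
    return backup.contains(x)
```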

The overall false positive probability in the LBF is given by:
P(f(x) ≥ T) + (1 − P(f(x) ≥ T)) · Fp (1)

where P(f(x) ≥ T) is the probability that an element not in the set has a value of f(x) larger than or equal to T, and Fp is the false positive probability of the backup Bloom filter. In the design process, T is selected so that the target false positive probability can be achieved, and then the Bloom filter is designed for the required Fp. A limitation of LBFs is that adding new elements may require re-training the learned model.
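
As a quick worked example (the numbers are assumed, not taken from the paper), equation (1) can be solved for the Fp that the backup Bloom filter must provide:

```python
# Rearranging (1): Fp = (target - P) / (1 - P), with P = P(f(x) >= T).
P_model = 0.005   # assumed probability that a negative scores above T
target = 0.01     # desired overall false positive probability
Fp_backup = (target - P_model) / (1 - P_model)
print(round(Fp_backup, 5))  # ~0.00503: size the backup Bloom filter for this Fp
```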

The LBF can be generalized by having different Bloom filters at the output of the learned model, each one covering a range of the f(x) output values [9]. This enables further memory savings by better exploiting the information provided by the learned model. In the following, for the sake of simplicity, only the basic LBF is considered, but the vulnerability identified is also present in the generalized LBFs. Similarly, it seems that the proposed attacks can also be used against more sophisticated LBF designs.

B. Adversarial Models

Two adversarial models are considered for the attacker: black box and white box. In both models, the attacker has access to the filter to perform membership checks on the LBF and get the result. In some practical scenarios, the attacker may not have access to the result itself but may infer it from the system behaviour. For example, the response may be different for a negative than for a positive outcome. In the URL filtering case study considered here, a positive may result in the request being blocked or delayed. In the black box model, the attacker has no information on how the LBF is constructed; for example, he does not know which ML algorithm has been used to construct the LBF, the threshold T used, or the hash functions used in the companion Bloom filter. This model places weaker assumptions than those considered in previous attacks on Bloom filters, which in most cases assume some knowledge of the implementation details of the BFs [13]. Instead, in the white box model, the attacker knows how the filter is implemented and in particular the learned model. For example, the attacker knows which features are extracted from the elements and the algorithm used to build the classification function f(x).

III. GENERATING FALSE POSITIVES ON LBFS

As already discussed, the objective of the attack is to increase the false positive rate of the filter, thus reducing its ability to filter negatives, by creating an attack set that has a large false positive probability. On a traditional Bloom filter, this can be done by testing a large number of elements that are negatives and selecting those for which the filter returns a positive answer. To do this, the attacker must know a large set of negative elements. Additionally, building such a set of A elements would require testing on average A/FPR elements on the filter, where FPR is the filter's false positive rate. This may not be possible in practical settings for filters with low false positive rates, since such a large number of queries may itself be an indication of malicious activity that can be detected. Additionally, a traditional Bloom filter can be easily modified if the false positive rate increases suddenly. For example, a random number can be added to the elements before computing the hash functions; if the false positive rate increases, the filter can be rehashed by changing the random number so that the hash functions produce different values. This is not possible for the learned model in the case of the LBF.
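
The salted-rehash countermeasure just described could be sketched as follows, extending the BloomFilter sketch above and assuming the element set is kept available for rebuilding:

```python
import os

class SaltedBloomFilter(BloomFilter):
    """Bloom filter whose hashes are salted so it can be rehashed on demand."""

    def __init__(self, m_bits: int, k_hashes: int, elements):
        super().__init__(m_bits, k_hashes)
        self._elements = list(elements)   # kept only to allow rehashing
        self.salt = os.urandom(16).hex()
        for e in self._elements:
            self.add(self.salt + e)

    def query(self, item: str) -> bool:
        return self.contains(self.salt + item)

    def rehash(self) -> None:
        # Called when the observed false positive rate rises suddenly:
        # a new salt makes the attacker's collected false positives useless.
        self.bits = bytearray(len(self.bits))
        self.salt = os.urandom(16).hex()
        for e in self._elements:
            self.add(self.salt + e)
```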

A key observation is that in an LBF, the learned model can potentially be exploited to generate false positives more efficiently. In the following subsections, two procedures to do so are presented, one for a black box adversary and the other for a white box attacker.

A. Attack no. 1: Creating new Positives by mutating known Positives

For the LBF, the procedure to generate false positives for a black box adversary can also start by testing elements until a positive is found. Then, once an element x is known to be positive, the attacker can generate "mutations" or variations of that element x by changing a few bits only. For example, consider an element y that differs from x in only one bit. If x is a positive because f(x) > T, then intuitively y is also likely to be a positive, as it will look very similar to the machine learning model, which computes a continuous function. Thus, it is likely that f(y) > T as well. Therefore, by mutating a positive we can possibly generate a set of elements that are likely to be positives.

The proposed procedure is shown in Algorithm 1. The mutations that can be made depend on the nature of the elements. For example, if the elements are URLs, one character can be added, modified or removed to create a mutated element. The number of mutations that can be made by changing a single character will be large. Instead, in other cases, the elements have a fixed size, and thus adding or removing bits or characters is not possible and only changes to one or more bits of the element can be made. The nature of the elements in each dataset will determine what type of mutations can be made. This will be illustrated in the case study considered in the evaluation section.
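
Algorithm 1 itself is not reproduced in these notes, but the procedure could look roughly like the following sketch, where query_filter is an assumed oracle that returns the LBF's answer for an element:

```python
import string

def mutate_url(url: str):
    """Yield single-character deletions, substitutions and insertions."""
    alphabet = string.ascii_lowercase + string.digits
    for i in range(len(url)):
        yield url[:i] + url[i + 1:]              # delete one character
        for c in alphabet:
            yield url[:i] + c + url[i + 1:]      # substitute one character
            yield url[:i] + c + url[i:]          # insert one character

def build_attack_set(positive_url: str, query_filter, limit: int = 1000):
    """Collect mutations of a known positive that the filter still accepts."""
    attack_set = []
    for candidate in mutate_url(positive_url):
        if query_filter(candidate):
            attack_set.append(candidate)
            if len(attack_set) >= limit:
                break
    return attack_set
```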

The procedure can also be refined to detect when an element x is positive because f(x) > T. To do so, instead of generating and testing all the mutations, a subset of the mutations can be tested, and only when they produce a significant number of positives (indicating that the positive is due to f(x) > T) are the rest of the mutations tested. Similarly, the attacker can monitor which mutations are more likely to generate positives and use only those for the next elements.

An interesting observation is that the attack starts with a positive element x that may be a true or a false positive. Regardless of that, it is likely that mutations of that element that are positive will be false positives, as the number of possible mutations is very large and the number of elements stored in the filter is small. For some applications, identifying the mutations of a true positive as positives may be considered beneficial. That is, the learned model is generalizing the positive to cover other elements that, although not stored in the filter, could be considered positives. If that is the case, then the mutating attack can still be used by ensuring that the initial element x is a false positive. In the following section, this mutation attack is evaluated on a case study to better illustrate those issues and show its effectiveness.

Unfortunately, protecting the LBF against the generation of false positives does not seem to be straightforward, as the learned model cannot be easily changed, and even if it is modified, it would very likely produce similar results. In more detail, if x is a positive on one learned model, it seems it would have a large probability of being a positive on another learned model. The problem seems fundamental to machine learning and to how generalization is defined to be robust to noisy perturbations.

B. Attack no. 2: Creating False Positives from existing Negatives

When the attacker knows the implementation details of the learned model, more sophisticated attacks can be designed. For example, the attacker can run the learned model on different elements to identify the features that are most relevant for the classifier and the values that correspond to positives and negatives. Then, using that information and the learned model, elements that will create false positives can be generated.

In more detail, assuming for example that the filter processes URLs to detect malicious URLs, the attacker can start from a good URL and modify it to make the learned model believe it is a malicious URL. This can be done by removing the "s" in "https" or removing the "www", which are known to be very informative and discriminant for the classifier. This procedure is expected to transform a negative (good URL) into a positive (malicious URL) with some probability. This attack, unlike the first one, requires a detailed analysis of the elements in the set and of the features relevant for classification. This will also be illustrated in the case study presented in the next section.
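
A minimal sketch of such a transformation, under the assumption that these two tokens are the strong benign signals (the exact edits an attacker would choose depend on the trained model):

```python
def benign_to_suspicious(url: str) -> str:
    """Strip tokens the classifier associates with benign URLs."""
    transformed = url.replace("https://", "http://")  # drop the "s" in https
    transformed = transformed.replace("www.", "")     # drop the www keyword
    return transformed

# Example: "https://www.example.com/a" -> "http://example.com/a"
```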

IV. EVALUATION

To better illustrate the proposed attacks and show their feasibility, they have been implemented and tested using the malicious URL dataset used in [9]. This dataset contains 485,730 URLs (encoded as plain character strings), of which 16.47% are malicious (malware, or MW in what follows) and the rest are benign (goodware, or GW in what follows). The filter is used to identify malicious URLs. A traditional Bloom filter with a 1% false positive rate (FPR) is used as the baseline for comparison. This filter requires a memory of 767,997 bits in the optimal configuration, as discussed in Section II.

For the learned model, the same 17 features used in [9] are extracted from the URLs and used to train a random forest classifier, which is used to build the ML stage of the LBF. These features are listed below (a Python sketch of the extraction follows the list):
• count.https [binary, 0 or 1], set to 1 if the URL contains the https keyword.
• count.http [binary, 0 or 1], set to 1 if the URL contains the http keyword.
• count.www [binary, 0 or 1], set to 1 if the URL contains the www keyword.
• count.digits [numeric], number of digits in the URL string.
• hostname length [numeric], number of characters forming the hostname.
• path length [numeric], number of characters forming the path.
• fd length [numeric], number of characters forming the first directory.
• tld length [numeric], number of characters forming the Top Level Domain.
• count [numeric], number of times the character - appears in the URL string.
• count at [numeric], number of times the character @ appears in the URL string.
• count question [numeric], number of times the character ? appears in the URL string.
• count percentage [numeric], number of times the character % appears in the URL string.
• count dot [numeric], number of times the character . appears in the URL string.
• count equals [numeric], number of times the character = appears in the URL string.
• count dir [numeric], number of times the character / appears in the URL string.
• use of IP [binary, 0 or 1], whether or not the URL uses an IP address (X.Y.Z.T).
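
The sketch of the feature extraction promised above follows; feature names are normalised to illustrative Python identifiers (the hyphen counter is named count_dash here), and the exact tokenisation in [9] may differ:

```python
from urllib.parse import urlparse

def extract_features(url: str) -> dict:
    """Approximate the URL features listed above (an illustrative sketch)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    hostname = parsed.hostname or ""
    path = parsed.path or ""
    first_dir = path.split("/")[1] if "/" in path else ""
    tld = hostname.rsplit(".", 1)[-1] if "." in hostname else ""
    return {
        "count_https": int("https" in url),
        "count_http": int("http" in url),
        "count_www": int("www" in url),
        "count_digits": sum(ch.isdigit() for ch in url),
        "hostname_length": len(hostname),
        "path_length": len(path),
        "fd_length": len(first_dir),
        "tld_length": len(tld),
        "count_dash": url.count("-"),
        "count_at": url.count("@"),
        "count_question": url.count("?"),
        "count_percentage": url.count("%"),
        "count_dot": url.count("."),
        "count_equals": url.count("="),
        "count_dir": url.count("/"),
        "use_of_ip": int(hostname.replace(".", "").isdigit()
                         and hostname.count(".") == 3),
    }
```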


Several machine learning models have been tested to build the learned part of the LBF, in particular Decision Trees (DT) using both the rpart and C5.0 algorithm implementations in R, Gradient Boosted Machines (GBM), and Random Forests (RFs), as noted in Table I. The dataset is split into 5 folds, using four of them for training and one for testing, repeating the process five times to produce an estimation of the test error (i.e. 5-fold cross-validation), as specified in [16]. The open-source R programming language has been used to conduct all the experiments and the LBF design and evaluation.
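
The experiments were conducted in R; as a study note, an equivalent 5-fold cross-validation of the random forest stage could be sketched in Python as follows, where X and y are assumed to hold the 17-feature vectors and the MW/GW labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_rf(X, y) -> float:
    """Estimate test accuracy with 5-fold cross-validation."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    return scores.mean()
```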

As shown, both Random Forests and C5.0 decision trees produce the best results in terms of test accuracy and the kappa metric, obtained as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

kappa = (O − E) / (1 − E)

where TP, TN, FP, FN stand for True Positives, True Negatives, False Positives and False Negatives respectively, O stands for the Observed accuracy, and E stands for the Expected accuracy. The expected accuracy in this particular URL use case is 83.53%, since this is the score that a dummy classifier (a classifier which always provides the same answer: the majority class) would obtain. For instance, the RF accuracy (0.975 in test) is very close to perfection (100%), but it has to be compared against a dummy classifier which is 83.53% accurate. Thus, the difference between the Observed accuracy and the Expected accuracy is reflected by the kappa statistic (0.9 in this case), a very high value that reflects the goodness of our RF classifier. Figure 2 shows the probability score (probability of MW) assigned by the RF model to each data point in the dataset (green for GW, red for MW). As shown, most malicious URLs are scored close to one. This figure highlights the accuracy of the ML model (RF in this case) at separating the majority of goodware from malware data samples; however, we can observe that some GW data points obtain a high-risk RF score (close to unity), while some MW data points score a low-risk RF score (close to zero).
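
Given O and E, the kappa computation is a one-liner; a minimal helper with placeholder numbers (not the paper's exact figures):

```python
def kappa(observed: float, expected: float) -> float:
    """Kappa statistic from observed and expected accuracy."""
    return (observed - expected) / (1 - expected)

print(kappa(0.9, 0.5))  # 0.8: 90% accuracy against a 50%-accurate dummy baseline
```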

The random forest model requires 145,978 memory bits. A threshold of T = 0.86 was used with an auxiliary Bloom filter that contains 14,065 elements and uses 622,019 bits of memory, so that the entire LBF uses the same amount of memory as the baseline Bloom filter. The false positive rate obtained by the LBF is 0.15%, compared to 1% for the traditional Bloom filter that stores all the malicious URLs, i.e. 79,982 elements. This illustrates the benefit of the LBF in reducing the false positive rate for the same memory footprint (one order of magnitude smaller). The filter is intended to check URL requests and block the ones that are malicious, as shown in Figure 3. In addition to blocking the malicious requests, they would also typically be logged for further analysis [15]. The filter could, for example, be embedded in the browser so that requests are filtered locally at the source. In this scenario, a large number of false positives could overload the logging systems and also trigger false alarms that can be used as a distraction to cover other malicious activity.

A. Attack no. 1: Malware mutation attack

To test the mutation attack, a random URL that returns a positive in the filter is selected from the dataset. Then, mutations are generated by randomly permuting any two characters of the URL. This is intended to introduce a change as small as possible to the original element as an attacker would do. The process is run 1,000 times to measure its effectiveness for different candidate MW URLs.
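
This specific mutation and measurement loop could be sketched as follows, with query_filter again standing in for access to the LBF:

```python
import random

def swap_two_characters(url: str, rng: random.Random) -> str:
    """Swap two randomly chosen character positions of the URL."""
    i, j = rng.sample(range(len(url)), 2)
    chars = list(url)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def mutation_success_rate(url: str, query_filter, trials: int = 1000) -> float:
    """Fraction of random two-character swaps the filter still accepts."""
    rng = random.Random(0)
    hits = sum(query_filter(swap_two_characters(url, rng)) for _ in range(trials))
    return hits / trials
```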

(为了测试变异攻击,从数据集中选择一个在过滤器中返回阳性的随机URL。然后,通过随机排列URL的任意两个字符来生成突变。这是为了对原始元素进行尽可能小的更改,就像攻击者所做的那样。该过程运行1000次,以测量其对不同候选URL的有效性。)


The results show that a false positive is created in 92% of the cases when using a threshold of T = 0.86. To see if the mutation is effective for both false and true positives, the percentage has been measured separately depending on whether the initial element is a false or a true positive. When the initial element is a false positive (only 1.9% of the total positives), mutations are also positive in 78% of the cases. The results when the initial element is a true positive (98.1% of the positives) are similar, with 93% of the mutations resulting in a positive. The results are even better when a lower threshold is used, i.e. T = 0.58 as done in [9]: then a false positive is created in 99.3% of the cases.

Figure 4 shows the RF score of the original URL and its score after two characters are swapped at random, for T = 0.58. As observed, both the original and the modified URL have similar MW scores.

Therefore, even a very simple random mutation seems to be effective for various thresholds. This means that the attacker can potentially generate a large set of false positives with little effort and no understanding of the underlying URL format or structure.


B. Attack no. 2: Transforming benign URLs into malicious URLs

An attacker who knows how the LBF has been implemented can use that information to create false positives. In more detail, the 17 features can be analyzed to identify the ones that have a larger impact on the learned model's result. Then, the appropriate values can be selected to create false positives. For the case study considered, the features count.www, count.http and count.https, which are binary values (i.e. 1 if present, 0 otherwise), have been found to be the most relevant ones, as shown in Fig. 5. This figure shows the mean decrease in the Gini coefficient as a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.


Table II shows the number of cases where all three features are set to unity. As observed, when all three are set to 1, it is very likely that the classifier decides that the URL is good; the opposite may not be true.


To test this attack, we selected at random 1,000 good URLs with the three features set to one, and then we set the values of those features to all zeroes. With a threshold of T = 0.58, a false positive is obtained in 99.2% of the cases. However, when T = 0.86, a false positive is created in only 7.5% of the cases. These results show the feasibility of this second attack as well, although it seems to be more dependent on the threshold T used in the LBF.


Figure 6 shows the RF score of the original URL and its score after the https and www tokens are removed from the URL, for T = 0.58. As observed, most original GW data samples are transformed into MW data samples.


 

V. DISCUSSION

The LBF vulnerability uncovered in this paper shows the importance of considering LBF security. Further research should focus on proposing techniques to protect against the attack described in this paper and also on trying to identify new vulnerabilities of LBFs. In the first case, it seems that the attack can be detected by checking, for example, the percentage of element lookups that are positives. When that percentage is large, it is likely that the filter is under attack, since in normal operation the filter would see many lookups for negative elements. If the attack is detected, one option could be to swap to a traditional Bloom filter, but that may not be possible or efficient in practical settings. Another possible solution is adding a backup filter for the case f(x) > T. We may also use a few hash functions to double-check x even when its score is high. Though an extra backup filter costs more memory compared to the original LBF, it can significantly improve the robustness of the LBF against both attacks. Therefore, further work is needed to understand how the LBF can be effectively protected against the false positive generation attack presented in this paper. More generally, since LBFs are conceptually different from traditional Bloom filters, there may be other vulnerabilities introduced by the use of a learned model. Therefore, the security of LBFs needs to be carefully studied.
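
The detection idea based on the percentage of positive lookups could be sketched as a sliding-window monitor; the window size and alarm threshold below are assumptions that would need tuning per deployment:

```python
from collections import deque

class PositiveRateMonitor:
    """Flag a possible attack when the recent positive-lookup rate is too high."""

    def __init__(self, window: int = 10000, alarm_rate: float = 0.05):
        self.results = deque(maxlen=window)
        self.alarm_rate = alarm_rate

    def record(self, was_positive: bool) -> bool:
        """Record one lookup result; return True if the filter looks attacked."""
        self.results.append(was_positive)
        if len(self.results) < self.results.maxlen:
            return False     # not enough samples yet
        return sum(self.results) / len(self.results) > self.alarm_rate
```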


VI. CONCLUSIONS AND FUTURE WORK

This paper has for the first time considered the security of Learned Bloom Filters (LBFs). The analysis shows that a large number of false positive elements can easily be generated by an attacker that can perform queries on the filter. This set of false positives can then be used, for example, to launch a denial of service attack against the system using the filter. Unfortunately, and differently from traditional Bloom filters, protecting against this attack does not seem to be straightforward.

Most works related to LBFs focus on achieving a lower FPR and smaller memory usage. Our work is clear evidence of the importance of carefully balancing the tradeoff between better performance and robustness. More generally, a deep understanding of the security of LBFs is required before they can be used in critical systems. Therefore, further research on LBF security is needed.

ACKNOWLEDGMENT
The authors acknowledge the support of the ACHILLES project PID2019-104207RB-I00 and the Go2Edge network RED2018-102585-T funded by the Spanish Ministry of Science and Innovation and of the Madrid Community research project TAPIR-CM grant no. P2018/TCS-4496.
