Bloom Filter Optimization Algorithm Double Hashing: Original Paper Text (Part 2)

Latest recommended article published on 2024-10-13 16:52:53

卷王2048

Category: Bloom filters. Tags: algorithms

Original link: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

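Introduction

This paper is the one recommended in the code comments of the Bloom filter implementation in bloom.cc in the leveldb source. It analyzes an optimization for Bloom filters and gives detailed proofs, which makes it a rare and valuable read. Many of its formulas are useful for other writing on Bloom filters, so I have converted the original text and its many formulas into editable Markdown and LaTeX for easy citation. If you repost this, please credit both the paper and this post. Thank you!

Source of the original paper: original paper

Author of this post: CSDN account; personal space - AcWing

Because of the platform's length limit, the paper had to be split into several parts. Thank you for your understanding.

Continued from Bloom Filter Optimization Algorithm Double Hashing: Original Paper Text (Part 1)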

6. Rate of Convergence

In the previous sections, we identified a broad class of non-standard Bloom filter schemes that have the same asymptotic false positive probability as a standard Bloom filter. Unfortunately, these results are not particularly compelling in settings with very limited space, since it is reasonable to think that the rate of convergence in the conclusion of Theorem 4.1 might be fairly slow. This problem is compounded by the fact that Bloom filters are particularly attractive in applications where space is extremely limited (for example, see [3]), since they give a fairly small error rate while using only a small constant number of bits per item. Thus, with these applications in mind we provide a detailed analysis of the rate of convergence in Theorem 4.1. Before proceeding with the results, we introduce some useful notation. For functions $f(n)$ and $g(n)$, we use $f(n)\sim g(n)$ to denote that $\lim_{n\to\infty}f(n)/g(n)=1$. Similarly, we use $f(n)\lesssim g(n)$ to denote that $\limsup_{n\to\infty}f(n)/g(n)\leq1$ and $f(n)\gtrsim g(n)$ to denote that $\liminf_{n\to\infty}f(n)/g(n)\geq1$. We are now ready to prove the main technical result of this section.

Theorem 6.1. Under the same conditions as in Theorem 4.1,
$\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^{k}\sim n\epsilon(n),$
where
$\begin{aligned}\epsilon(n)&\stackrel{def}{=}\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\\&+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\\&+\sum_{j=2}^{k}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}.\end{aligned}$
Remark. This result is intuitively pleasing, since it says that the portion of the false positive probability represented by the asymptotic error term is essentially the probability that $\|C(x,z)\|>1$ for exactly one $x\in S$ and $z$'s other $k-\|C(x,z)\|$ hash locations are hit by the other elements of $S$ in the "asymptotic" filter (that is, in the limit as $n-1\to\infty$), which happens with probability $(1-\mathrm{e}^{-\lambda/k})^{k-\|C(x,z)\|}$. (This almost follows from Theorem 4.1. The difference is that now $z$ has only $k-\|C(x,z)\|$ hash locations, while the elements of $S-\{x\}$ each have $k$ hash locations; however, it should be clear from the proof of Theorem 4.1 that the limiting false positive probability in this case is $(1-\mathrm{e}^{-\lambda/k})^{k-\|C(x,z)\|}$.)
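The expression for $\epsilon(n)$ in Theorem 6.1 can be evaluated directly once the distribution of $\|C(x)\|$ is known. The following is a minimal sketch, assuming the caller supplies a list `size_probs` with `size_probs[j]` equal to $\mathbf{Pr}(\|C(x)\|=j)$ for $j=0,\ldots,k$; the function names and the sample values are illustrative and do not come from the paper.

```python
import math


def epsilon(size_probs, lam, k, n):
    """Evaluate eps(n) from Theorem 6.1.

    size_probs[j] is assumed to hold Pr(||C(x)|| = j) for j = 0..k.
    """
    if len(size_probs) != k + 1:
        raise ValueError("size_probs must have k + 1 entries")
    p = 1.0 - math.exp(-lam / k)  # the factor 1 - e^{-lambda/k} in the theorem
    eps = (size_probs[0] - 1.0 + lam / n) * p ** k
    eps += (size_probs[1] - lam / n) * p ** (k - 1)
    eps += sum(size_probs[j] * p ** (k - j) for j in range(2, k + 1))
    return eps


def error_term(size_probs, lam, k, n):
    """The asymptotic error term n * eps(n)."""
    return n * epsilon(size_probs, lam, k, n)


# Hypothetical example: Pr(||C(x)|| = 0) = 1 - lam/n and Pr(||C(x)|| = 1) = lam/n
# exactly, so every bracket in eps(n) vanishes.
k, lam, n = 4, 2.0, 1000
ideal = [1.0 - lam / n, lam / n, 0.0, 0.0, 0.0]
print(error_term(ideal, lam, k, n))
```

In this idealized case the error term is zero up to floating-point rounding; any probability mass on $\|C(x)\|\geq2$ makes it nonzero.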

Proof. We begin along the same lines as in the proof of Theorem 4.1. First, we adopt the convention introduced there that allows us to associate the elements of $H(z)$ (with multiplicity) with the elements of $[k]$. Next, for $i\in[k]$ and $x\in S$, we define $X_{i}(x)=1$ if $i\in C(x)$ and $X_{i}(x)=0$ otherwise, $X_{i}\stackrel{\mathrm{def}}{=}\sum_{x\in S}X_{i}(x)$, and $X\stackrel{\mathrm{def}}{=}(X_{0},\ldots,X_{k-1})$. Finally, we define $P\stackrel{\mathrm{def}}{=}(P_{0},\ldots,P_{k-1})$ to be a vector of $k$ independent $\operatorname{Po}(\lambda/k)$ random variables. Define
$\begin{aligned}&f(n)\stackrel{\mathrm{def}}{=}\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\\&g_{i}(n)\stackrel{\mathrm{def}}{=}\mathbf{Pr}(i\in C(x),\|C(x)\|=1)-\frac{\lambda}{kn}\quad\mathrm{for}\:i\in[k]\\&h_{T}(n)\stackrel{\mathrm{def}}{=}\mathbf{Pr}(C(x)=f_{H(z)}^{-1}(T))\quad\mathrm{for}\:T\subseteq[k]:|T|>1,\end{aligned}$
and note that they are all $o\left(1/n\right)$ by the hypotheses of the lemma. For $T\subseteq[k]$, we may now write

$\begin{aligned}\mathbf{Pr}\left(\bigcap_{i\in T}X_i=0\right)&=\prod_{x\in S}\mathbf{Pr}\left(\{i\in[k]:i\in C(x)\}\subseteq\overline{T}\right)\\&=\left(\mathbf{Pr}(\|C(x)\|=0)+\sum_{i\in\overline{T}}\mathbf{Pr}(i\in C(x),\|C(x)\|=1)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}\mathbf{Pr}(C(x)=f_{H(z)}^{-1}(T^{\prime}))\right)^n\\&=\left(1-\frac{\lambda|T|}{kn}+f(n)+\sum_{i\in\overline{T}}g_i(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n)\right)^n\\&\sim\exp\left[-\frac{\lambda|T|}{k}+nf(n)+\sum_{i\in\overline{T}}ng_i(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}nh_{T^{\prime}}(n)\right]\\&=\mathrm{e}^{-\frac{\lambda|T|}{k}}\left(\exp\left[nf(n)+\sum_{i\in\overline{T}}ng_i(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}nh_{T^{\prime}}(n)\right]\right)\\&\sim\mathrm{e}^{-\frac{\lambda|T|}{k}}\left(1+nf(n)+\sum_{i\in\overline{T}}ng_i(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}nh_{T^{\prime}}(n)\right),\end{aligned}$
where the first two steps are obvious, the third step follows from the definition of $f$, the $g_{i}$'s, and the $h_{T^{\prime}}$'s, and the fourth and sixth steps follow from the assumption that all of those functions are $o\left(1/n\right)$ (since $\mathrm{e}^{t(n)}\sim1+t(n)$ if $t(n)=o(1)$).

Thus, the inclusion/exclusion principle implies that

$\begin{aligned}\mathbf{Pr}(\mathcal{F})-\mathbf{Pr}(\forall i:P_{i}>0)&=-\left(\mathbf{Pr}(\exists i:X_{i}=0)-\mathbf{Pr}(\exists i:P_{i}=0)\right)\\&=-\sum_{\emptyset\subset T\subseteq[k]}(-1)^{|T|+1}\left(\mathbf{Pr}\left(\bigcap_{i\in T}X_{i}=0\right)-\mathbf{Pr}\left(\bigcap_{i\in T}P_{i}=0\right)\right)\\&=\sum_{\emptyset\subset T\subseteq[k]}(-1)^{|T|}\left(\mathbf{Pr}\left(\bigcap_{i\in T}X_{i}=0\right)-\mathrm{e}^{-\frac{\lambda|T|}{k}}\right)\\&\sim n\sum_{\emptyset\subset T\subseteq[k]}(-1)^{|T|}\mathrm{e}^{-\frac{\lambda|T|}{k}}\left(f(n)+\sum_{i\in\overline{T}}g_{i}(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n)\right).\end{aligned}$

To evaluate the sum on the last line, we write

$\begin{aligned}M&\stackrel{\mathrm{def}}{=}\sum_{\emptyset\subset T\subseteq[k]}(-1)^{|T|}\mathrm{e}^{-\frac{\lambda|T|}{k}}\left(f(n)+\sum_{i\in\overline{T}}g_{i}(n)+\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n)\right)\\&=\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}f(n)\\&+\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}\sum_{i\in\overline{T}}g_{i}(n)\\&+\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n),\end{aligned}$

and evaluate each term separately. First, we compute

$\begin{aligned}\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}f(n)&=f(n)\sum_{j=1}^{k}\binom{k}{j}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\\&=\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}-1\right).\end{aligned}$

Next, we see that

$\begin{aligned}\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}\sum_{i\in\overline{T}}g_{i}(n)&=\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{i\in[k]}g_{i}(n)\left|\left\{T\subseteq[k]:|T|=j,i\notin T\right\}\right|\\&=\left(\sum_{i\in[k]}g_{i}(n)\right)\sum_{j=1}^{k}\binom{k-1}{j}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\\&=\left(\sum_{i\in[k]}g_{i}(n)\right)\left(\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}-1\right)\\&=\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}-1\right),\end{aligned}$

where we have used the convention that $\binom{k-1}{k}=0$. Now, for the last term, we compute

$\begin{aligned}\sum_{T\subseteq[k]:|T|=j}\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n)&=\sum_{\ell=2}^{k-j}\sum_{T^{\prime}\subseteq[k]:|T^{\prime}|=\ell}h_{T^{\prime}}(n)\left|\left\{T\subseteq[k]:|T|=j,T^{\prime}\subseteq\overline{T}\right\}\right|\\&=\sum_{\ell=2}^{k-j}\binom{k-\ell}{j}\mathbf{Pr}(\|C(x)\|=\ell),\end{aligned}$

so

$\begin{aligned}\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{T\subseteq[k]:|T|=j}\sum_{T^{\prime}\subseteq\overline{T}:|T^{\prime}|>1}h_{T^{\prime}}(n)&=\sum_{j=1}^{k}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\sum_{\ell=2}^{k-j}\binom{k-\ell}{j}\mathbf{Pr}(\|C(x)\|=\ell)\\&=\sum_{j=1}^{k}\sum_{\ell=2}^{k-j}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\binom{k-\ell}{j}\mathbf{Pr}(\|C(x)\|=\ell)\\&=\sum_{j=1}^{k}\sum_{r=j}^{k-2}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\binom{r}{j}\mathbf{Pr}(\|C(x)\|=k-r)\\&=\sum_{r=1}^{k-2}\sum_{j=1}^{r}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\binom{r}{j}\mathbf{Pr}(\|C(x)\|=k-r)\\&=\sum_{r=1}^{k-2}\mathbf{Pr}(\|C(x)\|=k-r)\sum_{j=1}^{r}\binom{r}{j}\left(-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{j}\\&=\sum_{r=1}^{k-2}\mathbf{Pr}(\|C(x)\|=k-r)\left(\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{r}-1\right)\\&=\sum_{j=2}^{k-1}\mathbf{Pr}(\|C(x)\|=j)\left(\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}-1\right).\end{aligned}$

Adding the terms together gives
$\begin{aligned}M&=\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\\&+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\\&+\sum_{j=2}^{k-1}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}\\&-\left(\mathbf{Pr}(\|C(x)\|=0)+\mathbf{Pr}(\|C(x)\|=1)+\sum_{j=2}^{k-1}\mathbf{Pr}(\|C(x)\|=j)-1\right).\end{aligned}$
Of course,
$-\left(\mathbf{Pr}(\|C(x)\|=0)+\mathbf{Pr}(\|C(x)\|=1)+\sum_{j=2}^{k-1}\mathbf{Pr}(\|C(x)\|=j)-1\right)=\mathbf{Pr}(\|C(x)\|=k),$
so
$\begin{aligned}M&=\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\\&+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\\&+\sum_{j=2}^{k-1}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}\\&+\mathbf{Pr}(\|C(x)\|=k)\\&=\epsilon(n).\end{aligned}$

Since
$\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^{k}=\mathbf{Pr}(\mathcal{F})-\mathbf{Pr}(\forall i:P_{i}>0)\sim nM=n\epsilon(n),$
the result follows.
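The algebra that collapses the sum over subsets $T$ into $\epsilon(n)$ can be checked numerically for small $k$. The sketch below is only an illustration under a simplifying assumption: $C(x)$ is treated as a random subset of $[k]$ (that is, $f_{H(z)}^{-1}(T)$ is identified with $T$), and its distribution is drawn at random. The helper names are hypothetical. The identity $M=\epsilon(n)$ is purely algebraic, so it should hold for any such distribution and any $\lambda$, $n$.

```python
import itertools
import math
import random


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in itertools.combinations(items, size):
            yield frozenset(combo)


def random_distribution(k, rng):
    """Hypothetical input: a random distribution of C(x) over subsets of range(k)."""
    keys = list(subsets(range(k)))
    weights = [rng.random() for _ in keys]
    total = sum(weights)
    return {key: w / total for key, w in zip(keys, weights)}


def brute_force_M(pr, k, lam, n):
    """Sum over all nonempty T, following the definition of M term by term."""
    q = math.exp(-lam / k)
    f = pr[frozenset()] - 1.0 + lam / n
    g = [pr[frozenset([i])] - lam / (k * n) for i in range(k)]
    total = 0.0
    for T in subsets(range(k)):
        if not T:
            continue
        complement = [i for i in range(k) if i not in T]
        inner = f + sum(g[i] for i in complement)
        # h_{T'}(n) = Pr(C(x) = T') for each T' inside the complement with |T'| > 1
        inner += sum(pr[Tp] for Tp in subsets(complement) if len(Tp) > 1)
        total += (-q) ** len(T) * inner
    return total


def epsilon(pr, k, lam, n):
    """Closed form eps(n) from Theorem 6.1, using only the sizes of C(x)."""
    p = 1.0 - math.exp(-lam / k)
    size = [0.0] * (k + 1)
    for C, prob in pr.items():
        size[len(C)] += prob
    eps = (size[0] - 1.0 + lam / n) * p ** k
    eps += (size[1] - lam / n) * p ** (k - 1)
    eps += sum(size[j] * p ** (k - j) for j in range(2, k + 1))
    return eps


rng = random.Random(0)
pr = random_distribution(4, rng)
print(abs(brute_force_M(pr, 4, 2.0, 100) - epsilon(pr, 4, 2.0, 100)))
```

The printed difference should be at the level of floating-point rounding. The brute-force sum enumerates $2^{k}$ subsets for each $T$, so it is only practical for small $k$.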

Unfortunately, the schemes that we discuss in this paper are often too messy to apply Theorem 6.1 generally; the values $\mathbf{Pr}(\|C(x)\|=j)$ depend on the specifics of the hash functions being used. For example, whether the size of the range is prime or not affects $\mathbf{Pr}(\|C(x)\|=j)$. The result can be applied in cases to examine specific schemes; for example, in the partitioned scheme, when $m^{\prime}$ is prime, $\mathbf{Pr}(\|C(x)\|=j)=0$ for $j=2,\ldots,k-1$, and so the expression becomes easily computable. To achieve general results, we derive some simple bounds that are sufficient to draw some interesting conclusions.
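The effect of primality can be seen by brute force on a tiny example. This sketch assumes the partition scheme uses hash locations of the form $h_{1}(u)+ih_{2}(u) \bmod m^{\prime}$ in partition $i$ (the form is defined in Section 5.1, which is outside this part, so treat it as an assumption here) and that $h_{1}$, $h_{2}$ are uniform and independent. Writing $a=h_{1}(x)-h_{1}(z)$ and $b=h_{2}(x)-h_{2}(z)$, a collision in partition $i$ means $a+ib\equiv0 \pmod{m^{\prime}}$. The function name is illustrative.

```python
from collections import Counter


def collision_histogram(m_prime, k):
    """Count pairs (a, b) in Z_m' x Z_m' by the number of i in [k] with a + i*b = 0 (mod m')."""
    hist = Counter()
    for a in range(m_prime):
        for b in range(m_prime):
            collisions = sum((a + i * b) % m_prime == 0 for i in range(k))
            hist[collisions] += 1
    return hist


# Dividing each count by m_prime**2 gives Pr(||C(x)|| = j) under the assumptions above.
print(dict(collision_histogram(7, 3)))  # prime range size
print(dict(collision_histogram(6, 3)))  # composite range size
```

With $m^{\prime}=7$ and $k=3$, only the collision counts $0$, $1$, and $k$ occur, while the composite $m^{\prime}=6$ also produces pairs with exactly $2$ collisions.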

Lemma 6.1. Assume the same conditions as in Theorem 4.1. Furthermore, suppose that for $x\in S$, it is possible to define events $E_{0},\ldots,E_{\ell-1}$ such that
$\begin{aligned}&1.\:\mathbf{Pr}(\|C(x)\|\geq1)=\mathbf{Pr}\left(\bigcup_{i\in[\ell]}E_{i}\right)\\&2.\:\sum_{i\in[\ell]}\mathbf{Pr}(E_{i})=\lambda/n\\&3.\:\mathbf{Pr}(\|C(x)\|\geq2)\leq\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j}).\end{aligned}$
Then
$\begin{aligned}n\left[\mathbf{Pr}(\|C(x)\|=k)-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(1+\mathrm{e}^{-\frac{\lambda}{k}}\right)\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})\right]&\lesssim\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^{k}\\&\lesssim n\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j}).\end{aligned}$

Proof. As in Theorem 6.1, we define
$\begin{aligned}\epsilon(n)&\stackrel{\mathrm{def}}{=}\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\\&+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\\&+\sum_{j=2}^{k}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j},\end{aligned}$
so that
$\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^k\sim n\epsilon(n).$
Now,
$\begin{aligned}M&\stackrel{\mathrm{def}}{=}\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\\&=\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)+\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\right)\\&=\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(\left(\mathbf{Pr}(\|C(x)\|=0)+\mathbf{Pr}(\|C(x)\|=1)-1\right)-\mathrm{e}^{-\frac{\lambda}{k}}\left(\mathbf{Pr}(\|C(x)\|=0)-1+\frac{\lambda}{n}\right)\right)\\&=\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(-\mathbf{Pr}(\|C(x)\|\geq2)-\mathrm{e}^{-\frac{\lambda}{k}}\left(-\mathbf{Pr}(\|C(x)\|\geq2)-\mathbf{Pr}(\|C(x)\|=1)+\frac{\lambda}{n}\right)\right)\\&=-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\mathbf{Pr}(\|C(x)\|\geq2)+\mathrm{e}^{-\frac{\lambda}{k}}\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right).\end{aligned}$

In particular, we have that $M\leq0$ since

$\mathbf{Pr}(\|C(x)\|=1)\leq\mathbf{Pr}(\|C(x)\|\geq1)=\mathbf{Pr}\left(\bigcup_{i\in[\ell]}E_i\right)\leq\sum_{i\in[\ell]}\mathbf{Pr}(E_i)=\lambda/n.$

Therefore

$\epsilon(n)=M+\sum_{j=2}^{k}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}\leq\mathbf{Pr}(\|C(x)\|\geq2)\leq\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})$

establishing the upper bound in the lemma. For the lower bound, we note that

$\begin{aligned}\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}&=\mathbf{Pr}(\|C(x)\|\geq1)-\mathbf{Pr}(\|C(x)\|\geq2)-\frac{\lambda}{n}\\&=\mathbf{Pr}\left(\bigcup_{i\in[\ell]}E_{i}\right)-\mathbf{Pr}(\|C(x)\|\geq2)-\frac{\lambda}{n}\\&\geq\sum_{i\in[\ell]}\mathbf{Pr}(E_{i})-\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})-\mathbf{Pr}(\|C(x)\|\geq2)-\frac{\lambda}{n}\\&=-\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})-\mathbf{Pr}(\|C(x)\|\geq2)\\&\geq-2\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j}),\end{aligned}$

so

$\begin{aligned}M&=-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\mathbf{Pr}(\|C(x)\|\geq2)+\mathrm{e}^{-\frac{\lambda}{k}}\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(\mathbf{Pr}(\|C(x)\|=1)-\frac{\lambda}{n}\right)\\&\geq-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\mathbf{Pr}(\|C(x)\|\geq2)-2\mathrm{e}^{-\frac{\lambda}{k}}\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})\\&\geq-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k}\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})-2\mathrm{e}^{-\frac{\lambda}{k}}\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j})\\&=-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(1+\mathrm{e}^{-\frac{\lambda}{k}}\right)\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j}).\end{aligned}$

Therefore,

$\begin{aligned}\epsilon(n)&=\sum_{j=2}^{k}\mathbf{Pr}(\|C(x)\|=j)\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-j}+M\\&\geq\mathbf{Pr}(\|C(x)\|=k)-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(1+\mathrm{e}^{-\frac{\lambda}{k}}\right)\sum_{i<j\in[\ell]}\mathbf{Pr}(E_{i}\cap E_{j}),\end{aligned}$

completing the proof.
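The two inequalities on $\epsilon(n)$ established in this proof are non-asymptotic, so they can be sanity-checked on small examples. The sketch below assumes, for illustration only, that $C(x)$ is a random subset of $[k]$ and takes $E_{i}$ to be the event $i\in C(x)$ with $\ell=k$; condition 2 then fixes $\lambda/n=\sum_{i}\mathbf{Pr}(E_{i})$, and conditions 1 and 3 hold automatically. The function names and the sample distribution are hypothetical.

```python
import itertools
import math
import random


def lemma_6_1_sandwich(pr, k, n):
    """Return (lower, eps, upper) for a distribution `pr` of C(x) over subsets of range(k)."""
    # With E_i = {i in C(x)}, condition 2 pins down lambda/n = sum_i Pr(E_i) = E|C(x)|.
    lam = n * sum(prob * len(C) for C, prob in pr.items())
    q = math.exp(-lam / k)
    p = 1.0 - q
    size = [0.0] * (k + 1)
    for C, prob in pr.items():
        size[len(C)] += prob
    # sum_{i<j} Pr(E_i and E_j): each outcome C contributes one term per pair inside it.
    pair_sum = sum(prob * math.comb(len(C), 2) for C, prob in pr.items())
    eps = (size[0] - 1.0 + lam / n) * p ** k
    eps += (size[1] - lam / n) * p ** (k - 1)
    eps += sum(size[j] * p ** (k - j) for j in range(2, k + 1))
    lower = size[k] - p ** (k - 1) * (1.0 + q) * pair_sum
    return lower, eps, pair_sum


def sparse_distribution(k, delta, rng):
    """Hypothetical input: C(x) is empty with probability 1 - delta."""
    keys = [frozenset(c) for r in range(1, k + 1)
            for c in itertools.combinations(range(k), r)]
    weights = [rng.random() for _ in keys]
    total = sum(weights)
    pr = {key: delta * w / total for key, w in zip(keys, weights)}
    pr[frozenset()] = 1.0 - delta
    return pr


rng = random.Random(1)
lower, eps, upper = lemma_6_1_sandwich(sparse_distribution(5, 1e-3, rng), 5, 1000)
print(lower <= eps <= upper)
```

For $k=2$ under these assumptions the lower bound coincides with $\epsilon(n)$ exactly, so a small floating-point tolerance is needed when comparing in that case.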

Lemma 6.1 is easily applied to the schemes discussed in Sections 5.1 and 5.2.

Theorem 6.2. For the partition scheme discussed in Section 5.1,

$\frac{k^2}{c^2n}\left[1-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(1+\mathrm{e}^{-\frac{\lambda}{k}}\right)\frac{k^3}{2}\right]\lesssim\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^k\:\lesssim\frac{k^5}{2c^2n}$

Proof. We wish to apply Lemma 6.1. To this end, we fix $x\in S$, and for $i\in[k]$, we define $E_{i}$ to be the event that $i\in C(x)$ (once again, we use the convention introduced in the proof of Theorem 4.1 that allows us to associate the elements of $H(z)$ with the elements of $[k]$). Then

$\mathbf{Pr}(\|C(x)\|\ge1)=\mathbf{Pr}\left(\bigcup_{i\in[k]}E_i\right).$

Recall from the proof of Theorem 5.1 that the partition scheme satisfies the conditions of Theorem 4.1 with $\lambda=k^{2}/c$. Furthermore (as we saw in the proof of Theorem 5.1),

$\begin{aligned}\sum_{i\in[k]}\mathbf{Pr}(E_i)=\sum_{i\in[k]}\frac{1}{m'}=\frac{\lambda}{n}.\end{aligned}$

The proof of Theorem 5.1 also tells us that for $i\neq j\in[k]$,

$\mathbf{Pr}(E_i\cap E_j)\leq\frac{k}{(m')^2}=\frac{k^3}{c^2n^2},$

so

$\mathbf{Pr}(\|C(x)\|\ge2)\le\sum_{i<j\in[k]}\mathbf{Pr}(E_i\cap E_j)\le\frac{k^5}{2c^2n^2},$

where we have used the (obvious) fact that every $u\in U$ is assigned $k$ distinct hash locations in the partition scheme. Finally, we note that $\|C(x)\|=k$ if $h_{1}(x)=h_{1}(z)$ and $h_{2}(x)=h_{2}(z)$, so

$\mathbf{Pr}(\|C(x)\|=k)\geq\frac{1}{(m')^2}=\frac{k^2}{c^2n^2}.$

Plugging these bounds into the result from Lemma 6.1 gives the result.

Theorem 6.3. For the double hashing schemes discussed in Section 5.2,

$\frac{1}{c^{2}n}\left[1-\left(1-\mathrm{e}^{-\frac{\lambda}{k}}\right)^{k-1}\left(1+\mathrm{e}^{-\frac{\lambda}{k}}\right)\frac{k^{5}}{2}\right]\lesssim\mathbf{Pr}(\mathcal{F})-\left(1-\mathrm{e}^{-\lambda/k}\right)^{k}\:\lesssim\frac{k^{5}}{2c^{2}n}$

Proof. We wish to apply Lemma 6.1. First, recall from the proof of Theorem 5.2 that every double hashing scheme satisfies the conditions of Theorem 4.1 with $\lambda=k^{2}/c$. Now fix $x\in S$. We reintroduce some notation from the proof of Theorem 5.2. For $u\in U$ and $i\in[k]$, we define

$g_i(u)=h_1(u)+ih_2(u)+f(i)$

(where we continue to use the convention that all arithmetic involving the hash functions $h_{1}$ and $h_{2}$ is done modulo $m$). Proceeding, for $i,j\in[k]$, we define $E_{i,j}$ to be the event that $g_{j}(x)=g_{i}(z)$. Then

$\mathbf{Pr}(\|C(x)\|\ge1)=\mathbf{Pr}\left(\bigcup_{i,j\in[k]}E_{i,j}\right),$

and, as we saw in the proof of Theorem 5.2,

$\sum\limits_{i,j\in[k]}\mathbf{Pr}(E_{i,j})=\sum\limits_{i,j\in[k]}\mathbf{Pr}(g_j(x)=g_i(z))=\sum\limits_{i,j\in[k]}\frac{1}{m}=\frac{\lambda}{n}.$

Furthermore, fixing any ordering $<$ on $[k]^{2}$,

$\begin{aligned}\mathbf{Pr}(\|C(x)\|\geq2)&=\mathbf{Pr}(\exists i_{1},i_{2},j_{1},j_{2}\in[k]:\forall\ell\in\{1,2\},g_{j_{\ell}}(x)=g_{i_{\ell}}(z))\\&=\mathbf{Pr}\left(\bigcup_{(i_{1},j_{1})<(i_{2},j_{2})\in[k]^{2}}E_{i_{1},j_{1}}\cap E_{i_{2},j_{2}}\right)\\&\leq\sum_{(i_{1},j_{1})<(i_{2},j_{2})\in[k]^{2}}\mathbf{Pr}(E_{i_{1},j_{1}}\cap E_{i_{2},j_{2}}),\end{aligned}$

so the conditions of Lemma 6.1 are satisfied. To complete the proof, we note that for any distinct $(i_{1},j_{1}),(i_{2},j_{2})\in[k]^{2}$,

$\begin{aligned}\mathbf{Pr}(E_{i_{1},j_{1}}\cap E_{i_{2},j_{2}})&=\mathbf{Pr}(g_{j_{1}}(x)=g_{i_{1}}(z),g_{j_{2}}(x)=g_{i_{2}}(z))\\&\leq\frac{1}{m}\cdot\frac{k}{m}\\&=\frac{k}{c^{2}n^{2}},\end{aligned}$

where the computation in the second step was done in the proof of Theorem 5.2. Therefore

$\sum_{(i_{1},j_{1})<(i_{2},j_{2})\in[k]^{2}}\Pr(E_{i_{1},j_{1}}\cap E_{i_{2},j_{2}})\leq\sum_{(i_{1},j_{1})<(i_{2},j_{2})\in[k]^{2}}\frac{k}{c^{2}n^{2}}\leq\frac{k^{5}}{2c^{2}n^{2}}.$

Finally,

$\mathbf{Pr}(\|C(x)\|=k)\geq\mathbf{Pr}(h_1(x)=h_1(z),h_2(x)=h_2(z))=\frac{1}{m^2}=\frac{1}{c^2n^2}.$

Plugging these bounds into the result of Lemma 6.1 yields the result.
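To get a feel for the size of the bounds in Theorems 6.2 and 6.3, they can be evaluated for concrete parameters. The sketch below uses $\lambda=k^{2}/c$ as in both proofs; the function names and the sample values of $k$, $c$, and $n$ are illustrative assumptions, not values from the paper.

```python
import math


def _shared_factor(k, c):
    """(1 - e^{-lam/k})^{k-1} * (1 + e^{-lam/k}) with lam = k^2 / c."""
    lam = k * k / c
    q = math.exp(-lam / k)
    return (1.0 - q) ** (k - 1) * (1.0 + q)


def partition_bounds(k, c, n):
    """Asymptotic (lower, upper) bounds of Theorem 6.2 on Pr(F) - (1 - e^{-lam/k})^k."""
    s = _shared_factor(k, c)
    lower = k ** 2 / (c ** 2 * n) * (1.0 - s * k ** 3 / 2.0)
    upper = k ** 5 / (2.0 * c ** 2 * n)
    return lower, upper


def double_hashing_bounds(k, c, n):
    """Asymptotic (lower, upper) bounds of Theorem 6.3 on the same difference."""
    s = _shared_factor(k, c)
    lower = 1.0 / (c ** 2 * n) * (1.0 - s * k ** 5 / 2.0)
    upper = k ** 5 / (2.0 * c ** 2 * n)
    return lower, upper


# Hypothetical parameters: k = 5 hash functions, c = 8 bits per element, n = 10^6 elements.
print(partition_bounds(5, 8, 10 ** 6))
print(double_hashing_bounds(5, 8, 10 ** 6))
```

For these sample values both lower bounds evaluate to negative numbers, so only the shared upper bound $k^{5}/(2c^{2}n)$ is informative there.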

It remains to investigate whether the error term analyzed in Theorems 6.2 and 6.3 is negligible in practice. Recall that for all of the schemes considered so far, the asymptotic false positive probability is $(1-\exp[-k/c])^{k}$, the same as for a standard Bloom filter. We would like to minimize this probability. The easiest way to do this is to maximize $c$ given the application-specific constraints on the size of the filter, and then optimize $k$ subject to that value of $c$, which results in setting $k=c\ln2$ (this is a standard result for Bloom filters which is easily obtained using calculus; see, for example, [3]), yielding an asymptotic false positive probability of $2^{-c\ln2}$. Applying Theorems 6.2 and 6.3, we have that for all of the examined schemes, this setting of $k$ results in
$\mathbf{Pr}(\mathcal{F})-2^{-c\ln2}\lesssim\frac{(\ln2)^{5}}{2}\frac{c^{3}}{n}\quad\mathrm{as}\:n\to\infty.$
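The choice $k=c\ln2$ can be checked numerically against the asymptotic false positive probability $(1-\exp[-k/c])^{k}$. The sketch below is an illustration only; the function names and the search range for integer $k$ are assumptions made for the example.

```python
import math


def asymptotic_fp(k, c):
    """Asymptotic false positive probability (1 - e^{-k/c})^k."""
    return (1.0 - math.exp(-k / c)) ** k


def best_integer_k(c):
    """Integer k minimizing asymptotic_fp, searched over a generous hypothetical range."""
    return min(range(1, int(4 * c) + 2), key=lambda k: asymptotic_fp(k, c))


c = 8.0  # hypothetical number of bits per element
k_real = c * math.log(2)
print(k_real, best_integer_k(c))
print(asymptotic_fp(k_real, c), 2.0 ** (-c * math.log(2)))
```

At the real-valued optimum the two printed probabilities agree, since $1-\mathrm{e}^{-\ln2}=1/2$. In practice $k$ must be an integer, so the achieved probability is slightly larger than $2^{-c\ln2}$.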

We now give a heuristic argument that the above error term is negligible in practice. Suppose that the asymptotic inequality above held for every $n$, and not just in the limit as $n\to\infty$. Then for any $\epsilon>0$,
$\begin{aligned}\Pr(\mathcal{F})-2^{-c\ln2}\geq\epsilon2^{-c\ln2}&\Rightarrow\frac{(\ln2)^5}{2}\frac{c^3}{n}\geq\epsilon2^{-c\ln2}\\&\Rightarrow\frac{(\ln2)^5}{2}\frac{c^3}{n}\geq\epsilon2^{-c}\\&\Rightarrow2^{c+3\ln c}\geq\frac{2n\epsilon}{(\ln2)^5}\\&\Rightarrow2^{2c+1}\geq\frac{2n\epsilon}{(\ln2)^5}\\&\Rightarrow c\geq\frac{1}{2}\log_{2}\left(\frac{n\epsilon}{(\ln2)^{5}}\right).\end{aligned}$
The first step is the only non-rigorous step, and it follows from the assumption that the asymptotic inequality above holds for every $n$. The second step holds since $\ln2<1$, the third step is simple algebra, the fourth step follows from the fact that $3\ln c<c+1$ for all $c>0$, and the fifth step is also simple algebra. From this heuristic argument, we conclude that the asymptotic error term analyzed above is negligible unless $c\gtrsim\log_{2}n$. In these cases, however, it might be more appropriate to use a hash table or fingerprints rather than a Bloom filter (see, for example, [12, Section 5.5]).
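The heuristic can be made concrete by comparing the error bound $\frac{(\ln2)^{5}}{2}\frac{c^{3}}{n}$ with the asymptotic false positive probability $2^{-c\ln2}$, and by evaluating the final threshold on $c$. The following sketch uses hypothetical function names and sample values of $c$, $n$, and $\epsilon$, and it inherits the non-rigorous assumption that the asymptotic inequality holds for every $n$.

```python
import math


def relative_error_bound(c, n):
    """Heuristic bound on (Pr(F) - 2^{-c ln 2}) / 2^{-c ln 2} when k = c ln 2."""
    absolute = (math.log(2) ** 5 / 2.0) * c ** 3 / n
    return absolute / 2.0 ** (-c * math.log(2))


def heuristic_c_threshold(n, eps):
    """The value (1/2) * log2(n * eps / (ln 2)^5) from the last line of the derivation."""
    return 0.5 * math.log2(n * eps / math.log(2) ** 5)


# Hypothetical parameters: c = 8 bits per element, n = 10^6 elements, eps = 1%.
print(relative_error_bound(8, 10 ** 6))
print(heuristic_c_threshold(10 ** 6, 0.01))
```

Because $c^{3}$ grows while $2^{-c\ln2}$ shrinks, the relative bound increases quickly with $c$ for fixed $n$, which matches the conclusion that the error term only matters once $c$ is on the order of $\log_{2}n$.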