Reading notes on *Generalizing the Layering Method of Indyk and Woodruff: Recursive Sketches for Frequency-Based Vectors on Streams*
Key terms: frequency moment; stream; heavy element; $\alpha$-core. Notation: $[n] = \{1, 2, \cdots, n\}$.
Definition 1. Let $m, n$ be positive integers. A stream $D = D(n, m)$ is a sequence $[p_1, p_2, \cdots, p_m]$ of length $m$, where each $p_i \in \{1, 2, \cdots, n\}$. The frequency vector of $D$ is $[f_1, f_2, \cdots, f_n]$, where $f_i = \vert\{p_j \mid p_j = i, 1 \le j \le m\}\vert$.
Definition 2. The $k$-th frequency moment of a stream $D$ is $F_k(D) = \sum_{i=1}^{n} f_i^k$. Equivalently, $F_k(D)$ is the $k$-th power of the $L_k$ norm of the frequency vector of $D$.
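To make Definitions 1 and 2 concrete, here is a minimal sketch (function names are my own, not from the paper) that computes a stream's frequency vector and its $k$-th frequency moment:

```python
from collections import Counter

def frequency_vector(stream, n):
    """Frequency vector [f_1, ..., f_n] of a stream over the universe [n]."""
    counts = Counter(stream)
    return [counts.get(i, 0) for i in range(1, n + 1)]

def frequency_moment(stream, n, k):
    """k-th frequency moment F_k(D) = sum_i f_i^k."""
    return sum(f ** k for f in frequency_vector(stream, n))

# Example: D = [1, 2, 2, 3, 2] over [3] has frequency vector [1, 3, 1],
# so F_2(D) = 1^2 + 3^2 + 1^2 = 11.
D = [1, 2, 2, 3, 2]
print(frequency_vector(D, 3))    # [1, 3, 1]
print(frequency_moment(D, 3, 2)) # 11
```

Note that $F_1(D) = m$ (the stream length) and $F_0(D)$ counts the distinct elements with nonzero frequency, two standard special cases.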
Definition 3. Let $V = [v_1, v_2, \cdots, v_n]$ with $v_i \ge 0$ for all $i \in [n]$, and let $\vert V\vert = \sum_{i=1}^{n} v_i$. For $0 \le \alpha \le 1$, an entry $v_i$ is an $\alpha$-heavy element of $V$ if $v_i \ge \alpha\vert V\vert$. A set $S \subseteq [n]$ is an $\alpha$-core of $V$ if $i \in S$ for every $\alpha$-heavy element $v_i$ of $V$.
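A short illustration of Definition 3 (a sketch; the function name is my own). The set of all $\alpha$-heavy indices is the minimal $\alpha$-core; any superset of it is also an $\alpha$-core:

```python
def alpha_heavy_indices(v, alpha):
    """Indices i (1-based) with v_i >= alpha * |V|, where |V| = sum of entries.
    This is the minimal alpha-core of v; any superset is also an alpha-core."""
    total = sum(v)
    return {i + 1 for i, vi in enumerate(v) if vi >= alpha * total}

V = [8, 1, 1, 10]                     # |V| = 20
print(alpha_heavy_indices(V, 0.45))   # {4}: only v_4 = 10 >= 0.45 * 20 = 9
print(alpha_heavy_indices(V, 0.40))   # {1, 4}: both 8 and 10 are >= 8
```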
Lemma 1. Let $V \in \mathbb{R}^{n}$ be an $n$-dimensional vector and $S$ an $\alpha$-core of $V$. Let $H = (h_1, h_2, \cdots, h_n)$ be a random 0-1 vector with mutually independent components satisfying $P(h_i = 0) = P(h_i = 1) = \frac{1}{2}$ for all $i \in [n]$. Define
$$X = \sum_{i\in S} v_i + 2\sum_{i\notin S} h_i v_i.$$
Then $P(\vert X - \vert V\vert\vert \ge \epsilon \vert V\vert) \le \frac{\alpha}{\epsilon^2}$.
Proof. First,
$$E(X) = E\Big(\sum_{i\in S} v_i + 2\sum_{i\notin S} h_i v_i\Big) = \sum_{i\in S} v_i + 2\sum_{i\notin S} v_i E(h_i) = \sum_{i\in S} v_i + \sum_{i\notin S} v_i = \vert V\vert.$$
Meanwhile,
$$Var(X) = Var\Big(\sum_{i\in S} v_i + 2\sum_{i\notin S} h_i v_i\Big) = 4\sum_{i\notin S} v_i^2\, Var(h_i).$$
Since $Var(h_i) = \frac{1}{4}$, we get $Var(X) = \sum_{i\notin S} v_i^2$. Because $S$ is an $\alpha$-core, every $i \notin S$ satisfies $v_i < \alpha\vert V\vert$, so $Var(X) \le \alpha\vert V\vert \sum_{i\notin S} v_i \le \alpha\vert V\vert^2$. Chebyshev's inequality then gives $P(\vert X - \vert V\vert\vert \ge \epsilon\vert V\vert) \le \frac{Var(X)}{\epsilon^2\vert V\vert^2} \le \frac{\alpha}{\epsilon^2}$. $\blacksquare$
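The estimator $X$ of Lemma 1 can be checked empirically. The following Monte Carlo sketch (names and test vector are my own, chosen for illustration) samples $X$ many times and verifies that its empirical mean is close to $\vert V\vert$, as the unbiasedness step $E(X) = \vert V\vert$ predicts:

```python
import random

def sketch_estimate(v, core, rng):
    """One sample of X = sum_{i in S} v_i + 2 * sum_{i not in S} h_i * v_i,
    where core holds 1-based indices and each h_i is a fair 0/1 coin."""
    return sum(vi if (i + 1) in core else 2 * rng.randint(0, 1) * vi
               for i, vi in enumerate(v))

rng = random.Random(0)                 # fixed seed for reproducibility
V = [10, 3, 2, 2, 1, 1, 1]             # |V| = 20
S = {1}                                # contains every 0.5-heavy index of V
samples = [sketch_estimate(V, S, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))                  # empirically close to |V| = 20
```

Here $Var(X) = \sum_{i\notin S} v_i^2 = 20$, so individual samples fluctuate noticeably, but the mean of 20000 samples concentrates tightly around $\vert V\vert$.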