k-anonymity, l-diversity, and t-closeness
1. Preliminaries
1.1 Definition of an equivalence class
Equivalence class: an equivalence class of an anonymized table is a set of records that have the same values for the quasi-identifier (QI) attributes.
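1.2 Definition of a record
Record: a row in a relational database (relational data, also called multidimensional data). Each record corresponds to one individual.
1.3 Definition of attributes
Attribute: each record contains a number of attributes, which fall into three categories: explicit identifiers (EI), quasi-identifiers (QI), and sensitive data (SD).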
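To make the three definitions concrete, here is a minimal sketch of how records could be grouped into equivalence classes. The column names (`zip`, `age`, `disease`), the toy rows, and the helper `equivalence_classes` are all assumptions made for illustration; they do not come from any particular dataset or library.

```python
from collections import defaultdict

# Assumed quasi-identifier columns for this toy example.
QUASI_IDENTIFIERS = ("zip", "age")

# Hypothetical anonymized table: explicit identifiers are already removed,
# quasi-identifiers are generalized, and "disease" is the sensitive attribute.
table = [
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
]


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same tuple of quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return dict(classes)


classes = equivalence_classes(table, QUASI_IDENTIFIERS)
for key, rows in classes.items():
    print(key, len(rows))
```

Each dictionary key is one combination of QI values, and the list stored under it is the corresponding equivalence class.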
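1.4 Two kinds of disclosure: identity disclosure and attribute disclosure
- Identity disclosure: identity disclosure occurs when an individual is linked to a particular record in the released table. In other words, a specific record can be tied to a specific person.
- Attribute disclosure: attribute disclosure occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than it would be possible before the data release.
Identity disclosure usually leads to attribute disclosure: once an individual is identified, the attributes in that record are revealed as well. Attribute disclosure, on the other hand, does not necessarily lead to identity disclosure. Note also that even the disclosure of false attribute information can cause harm, because an observer may act on an incorrect belief about an individual.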
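2. k-anonymity
2.1 Definition of k-anonymity
A table satisfies k-anonymity if every equivalence class contains at least k records, so that each record is indistinguishable from at least k-1 other records with respect to the quasi-identifier attributes.
k-anonymity protects against identity disclosure, but it does not provide sufficient protection against attribute disclosure.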
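The definition translates directly into a check on the sizes of the equivalence classes. The following is a minimal sketch; the function name `is_k_anonymous` and the toy rows are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of QI values occurs at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in counts.values())


# Hypothetical released table with two equivalence classes of sizes 3 and 2.
table = [
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
]

print(is_k_anonymous(table, ("zip", "age"), k=2))
print(is_k_anonymous(table, ("zip", "age"), k=3))
```

Note that the check looks only at the quasi-identifiers. It never inspects the sensitive column, which is why k-anonymity alone cannot rule out attribute disclosure.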
2.2 Steps of k-anonymization
- Remove the explicit identifiers.
- Obscure the quasi-identifiers, typically through generalization and suppression.
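The two steps above can be sketched as follows. This is only one possible way to generalize: masking trailing ZIP digits and bucketing ages into fixed-width ranges are assumptions chosen for illustration, as are the column names and the helpers `generalize_zip`, `generalize_age`, and `anonymize`.

```python
def generalize_zip(zip_code, masked_digits):
    """Replace the last `masked_digits` characters of a ZIP code with '*'."""
    if masked_digits == 0:
        return zip_code
    return zip_code[:-masked_digits] + "*" * masked_digits


def generalize_age(age, width):
    """Map an exact age to a closed range of the given width, e.g. 27 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(rows, explicit_identifiers, masked_digits=2, age_width=10):
    """Step 1: drop explicit identifiers. Step 2: generalize the quasi-identifiers."""
    released = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in explicit_identifiers}
        out["zip"] = generalize_zip(row["zip"], masked_digits)
        out["age"] = generalize_age(row["age"], age_width)
        released.append(out)
    return released


# Hypothetical private table; the names are invented.
private_table = [
    {"name": "Bob", "zip": "47678", "age": 27, "disease": "heart disease"},
    {"name": "Dan", "zip": "47602", "age": 22, "disease": "heart disease"},
]

released = anonymize(private_table, explicit_identifiers={"name"})
print(released)
```

After this transformation both rows share the QI values `("476**", "20-29")`, so they fall into the same equivalence class.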
An example is shown in the figure below:
An example of a homogeneity attack:
Suppose Alice knows that Bob is a 27-year old man living in ZIP 47678 and Bob’s record is in the table. From Table 2, Alice can conclude that Bob corresponds to one of the first three records, and thus must have heart disease.
An example of a background knowledge attack:
Suppose that, by knowing Carl’s age and zip code, Alice can conclude that Carl corresponds to a record in the last equivalence class in Table 2. Furthermore, suppose that Alice knows that Carl has very low risk for heart disease. This background knowledge enables Alice to conclude that Carl most likely has cancer.
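The homogeneity attack succeeds whenever an equivalence class contains a single sensitive value. A minimal sketch of how such classes could be detected is shown below. The toy table is hypothetical (it is not the Table 2 referenced above), and `homogeneous_classes` is a name introduced here for illustration.

```python
from collections import defaultdict


def homogeneous_classes(rows, quasi_identifiers, sensitive):
    """Return {QI values: sensitive value} for classes with one sensitive value."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values[key].add(row[sensitive])
    return {key: next(iter(v)) for key, v in values.items() if len(v) == 1}


# Hypothetical 3-anonymous table.
table = [
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
]

leaks = homogeneous_classes(table, ("zip", "age"), "disease")
print(leaks)
```

Anyone known to fall in a class reported here has their sensitive value revealed, even though the table is 3-anonymous. The background knowledge attack is harder to detect mechanically, because it depends on what the attacker already knows.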
2.3 Generalization
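2.4 Suppression
Suppression is introduced to reduce the amount of generalization that is needed, so that the released data stays more precise. When a limited number of records (tuples with fewer than k occurrences, called outliers) would force a higher level of generalization, suppression is used to moderate the generalization process. As in the figure below, the records shown in italics can simply be replaced by *, because they are the ones that fail 2-anonymity and would otherwise force excessive generalization.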
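A minimal sketch of tuple-level suppression follows: rows whose QI combination occurs fewer than k times are treated as outliers and their QI values are replaced by `*`. The helper `suppress_outliers`, the column names, and the rows are assumptions for illustration only.

```python
from collections import Counter


def suppress_outliers(rows, quasi_identifiers, k):
    """Mask the QI values of rows whose QI combination occurs fewer than k times.

    Returns the released rows and the number of suppressed rows.
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    released, suppressed = [], 0
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        out = dict(row)
        if counts[key] < k:
            for q in quasi_identifiers:
                out[q] = "*"
            suppressed += 1
        released.append(out)
    return released, suppressed


# Hypothetical table: the last two rows are outliers for k = 2.
table = [
    {"race": "asian", "zip": "9413*"},
    {"race": "asian", "zip": "9413*"},
    {"race": "black", "zip": "9413*"},
    {"race": "black", "zip": "9413*"},
    {"race": "white", "zip": "9413*"},
    {"race": "white", "zip": "9414*"},
]

released, suppressed = suppress_outliers(table, ("race", "zip"), k=2)
print(suppressed)
```

One caveat of this sketch: the masked rows form a class of their own, so if fewer than k rows are masked that class is still too small. Removing the outlier rows from the release entirely avoids this.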
2.5 k-Minimal Generalization (with Suppression)
The application of generalization and suppression to a private table PT produces more general (less precise) and less complete (if some tuples are suppressed) tables that provide better protection of the respondents' identities. Finding a minimal generalization avoids generalizing or suppressing more than necessary. k-minimal generalization with suppression builds on the following definition of a distance vector.
Definition (Distance vector). Let $T_i(A_1,\ldots,A_n)$ and $T_j(A_1,\ldots,A_n)$ be two tables such that $T_j$ is a generalization of $T_i$. The distance vector of $T_j$ from $T_i$ is the vector $DV_{i,j} = [d_1,\ldots,d_n]$, where each $d_z$, $z = 1,\ldots,n$, is the length of the unique path between $\mathrm{dom}(A_z, T_i)$ and $\mathrm{dom}(A_z, T_j)$ in the domain generalization hierarchy $\mathrm{DGH}_{D_z}$.
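As a rough illustration of the definition, the sketch below represents each domain generalization hierarchy as a chain listed from the most specific domain to the most general one, so the path length between two domains is the difference of their positions. The hierarchy contents (`R0`, `R1`, `Z0`, `Z1`, `Z2`) and the function `distance_vector` are hypothetical.

```python
# Hypothetical hierarchies, each ordered from most specific to most general.
HIERARCHIES = {
    "race": ["R0", "R1"],
    "zip": ["Z0", "Z1", "Z2"],
}


def distance_vector(domains_i, domains_j, hierarchies):
    """Compute [d_1, ..., d_n] where d_z is the number of generalization steps
    from the domain of attribute z in T_i to its domain in T_j."""
    vector = []
    for attribute, chain in hierarchies.items():
        level_i = chain.index(domains_i[attribute])
        level_j = chain.index(domains_j[attribute])
        if level_j < level_i:
            raise ValueError(f"T_j is not a generalization of T_i on {attribute}")
        vector.append(level_j - level_i)
    return vector


t_i = {"race": "R0", "zip": "Z0"}  # domains of the original table
t_j = {"race": "R1", "zip": "Z1"}  # domains of a generalized table

print(distance_vector(t_i, t_j, HIERARCHIES))
```

Distance vectors are compared component by component, so `[1, 0]` and `[0, 1]` are incomparable. This is why more than one minimal generalization can exist for the same table.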
As Figure 7 above shows, the blank cells in the anonymized tables are the ones that were suppressed, and what remains satisfies k-anonymity. This raises a question: is it better to lose precision through generalization or to lose completeness through suppression? Samarati's answer is to define a threshold MaxSup, specifying the maximum number of tuples that can be suppressed. The definition of k-minimal generalization with suppression follows:
Intuitively, this definition states that a generalization $T_j$ is k-minimal iff it satisfies k-anonymity, it does not enforce more suppression than is allowed ($|T_i| - |T_j| \le \mathrm{MaxSup}$), and there does not exist another generalization satisfying these conditions with a distance vector smaller than that of $T_j$. As an example, consider Figure 1 below:
With MaxSup = 2, QI = {Race, ZIP}, and k = 2, there are two k-minimal generalizations with suppression, as shown in Figure 7.
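The suppression budget can be checked by counting how many tuples would have to be removed for the rest to be k-anonymous. The sketch below covers only the first two conditions of the definition (k-anonymity and the MaxSup bound), not minimality of the distance vector. The function name and the rows are assumptions for illustration and do not reproduce the figures referenced above.

```python
from collections import Counter


def satisfies_k_with_suppression(rows, quasi_identifiers, k, max_sup):
    """Return True if suppressing every tuple whose QI combination occurs fewer
    than k times needs at most `max_sup` suppressions."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    to_suppress = sum(size for size in counts.values() if size < k)
    return to_suppress <= max_sup


# Hypothetical generalized table with two outlier tuples for k = 2.
generalized = [
    {"race": "person", "zip": "94138"},
    {"race": "person", "zip": "94138"},
    {"race": "person", "zip": "94139"},
    {"race": "person", "zip": "94139"},
    {"race": "person", "zip": "94141"},
    {"race": "person", "zip": "94142"},
]

print(satisfies_k_with_suppression(generalized, ("race", "zip"), k=2, max_sup=2))
print(satisfies_k_with_suppression(generalized, ("race", "zip"), k=2, max_sup=1))
```

A search for k-minimal generalizations could apply this check to each candidate generalization and keep those whose distance vectors are not dominated by another passing candidate.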
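2.6 Classification of k-anonymity techniques
Generalization can be applied at the level of: (i) Attribute, where a whole column is generalized; (ii) Cell, where individual cells are generalized, at a much higher complexity; (iii) None.
Suppression can be applied at the level of: (i) Tuple, where suppression is performed at the level of a row; (ii) Attribute, where suppression is performed at the level of a column; (iii) Cell, where suppression is performed at the level of single cells; (iv) None.
This topic is too large to cover here; for details, see "Classification of k-Anonymity Techniques".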
3. l-diversity
3.1 Definition of l-diversity
An equivalence class is said to have l-diversity if there are at least l "well-represented" values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity. "Well-represented" can be interpreted in several ways:
- Distinct l-diversity: ensures there are at least l distinct values for the sensitive attribute in each equivalence class. It does not prevent probabilistic inference attacks: if one sensitive value appears much more frequently than the others in an equivalence class, an attacker can conclude with high probability that an individual in that class has that value.
- Entropy l-diversity: the entropy of an equivalence class $E$ is defined as
$$\mathrm{Entropy}(E) = -\sum_{s \in S} p(E,s) \log p(E,s)$$
where $S$ is the domain of the sensitive attribute, and $p(E,s)$ is the fraction of records in $E$ that have sensitive value $s$. A table is said to have entropy l-diversity if for every equivalence class $E$, $\mathrm{Entropy}(E) \ge \log l$.
- Recursive (c,l)-diversity: makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely. Let $m$ be the number of values in an equivalence class, and $r_i$, $1 \le i \le m$, be the number of times that the $i$-th most frequent sensitive value appears in an equivalence class $E$. Then $E$ is said to have recursive (c,l)-diversity if $r_1 < c\,(r_l + r_{l+1} + \cdots + r_m)$. A table is said to have recursive (