k-anonymity, l-diversity, and t-closeness

scuLVLV

Article link: https://blog.csdn.net/scuLVLV/article/details/71077689

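This article introduces three data privacy protection techniques in detail: k-anonymity, l-diversity, and t-closeness. k-anonymity prevents identity disclosure by ensuring that every equivalence class contains at least k records; l-diversity focuses on the diversity of the sensitive attribute in order to resist probabilistic inference attacks; t-closeness bounds the distance between the distribution of the sensitive attribute within an equivalence class and its distribution in the whole table, which gives finer-grained privacy protection. The article covers the definition, implementation steps, and strengths and weaknesses of each method, compares them with one another, and highlights the challenges and solutions that arise in data anonymization.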

1. Preliminaries

1.1 Definition of an equivalence class

Equivalence class: a set of records whose quasi-identifier (QI) attributes have the same values. We define an equivalence class of an anonymized table to be a set of records that have the same values for the quasi-identifiers.

1.2 Definition of a record

Record: a row in relational data (also called multidimensional data). Each record corresponds to one individual.

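1.3 Definition of attributes

Attribute: each record contains a number of attributes, which fall into three classes: explicit identifiers (EI), quasi-identifiers (QI), and sensitive data (SD).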

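1.4 Two kinds of disclosure: identity disclosure and attribute disclosure

Identity disclosure: Identity disclosure occurs when an individual is linked to a particular record in the released table. In other words, a specific record can be tied to a particular person.
Attribute disclosure: Attribute disclosure occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than it would be possible before the data release. In other words, the released information makes it possible to infer an individual's characteristics.

Identity disclosure usually leads to attribute disclosure: once an identity is determined, the attributes associated with it are known as well. Attribute disclosure, however, does not necessarily lead to identity disclosure. It is also worth noting that even the disclosure of incorrect attribute values may help an attacker infer an identity.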

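2. k-anonymity

2.1 Definition of k-anonymity

A table satisfies k-anonymity if every equivalence class contains at least k records, so that the k records in a class cannot be distinguished from one another by their quasi-identifier attributes.
k-anonymity is effective against identity disclosure, but it does not provide sufficient protection against attribute disclosure.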
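
As a minimal sketch of this definition, the following Python function groups records by their quasi-identifier values and checks that every group has at least k members. The table, the column names (`zip`, `age`, `disease`), and the function name `is_k_anonymous` are illustrative assumptions, not taken from the original article.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every equivalence class has at least k records."""
    # An equivalence class is identified by the tuple of QI values.
    class_sizes = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(size >= k for size in class_sizes.values())

# Hypothetical, already generalized table (values are made up for illustration).
TABLE = [
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
]

print(is_k_anonymous(TABLE, ["zip", "age"], 3))
print(is_k_anonymous(TABLE, ["zip", "age"], 4))
```

Both classes in this table have exactly three records, so the table is 3-anonymous but not 4-anonymous. Note that the check never looks at the sensitive column, which is why k-anonymity alone says nothing about attribute disclosure.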

2.2 Steps of k-anonymity

Remove the explicit identifiers.
Obfuscate the quasi-identifiers; the usual methods are generalization and suppression.
An example is shown in the figure below:

Example of a homogeneity attack:
Suppose Alice knows that Bob is a 27-year-old man living in ZIP 47678 and Bob’s record is in the table. From Table 2, Alice can conclude that Bob corresponds to one of the first three records, and thus must have heart disease.
Example of a background knowledge attack:
Suppose that, by knowing Carl’s age and zip code, Alice can conclude that Carl corresponds to a record in the last equivalence class in Table 2. Furthermore, suppose that Alice knows that Carl has very low risk for heart disease. This background knowledge enables Alice to conclude that Carl most likely has cancer.
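
The homogeneity attack can be made concrete with a small sketch that looks for equivalence classes in which every record carries the same sensitive value. The table below is only modeled on the quoted examples (the original Table 2 is not reproduced here, so the rows are assumptions), and `homogeneous_classes` is a hypothetical helper name.

```python
from collections import defaultdict

def homogeneous_classes(records, quasi_identifiers, sensitive):
    """Map each equivalence class with a single sensitive value to that value."""
    values_per_class = defaultdict(set)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifiers)
        values_per_class[key].add(record[sensitive])
    # A class is vulnerable when all of its records share one sensitive value.
    return {
        key: next(iter(values))
        for key, values in values_per_class.items()
        if len(values) == 1
    }

# Assumed 3-anonymous table, loosely following the quoted Bob and Carl examples.
TABLE = [
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
    {"zip": "476**", "age": "3*", "disease": "heart disease"},
    {"zip": "476**", "age": "3*", "disease": "cancer"},
    {"zip": "476**", "age": "3*", "disease": "cancer"},
]

print(homogeneous_classes(TABLE, ["zip", "age"], "disease"))
```

In this assumed table only the class (`476**`, `2*`) is homogeneous, which corresponds to Bob's situation. The last class is not homogeneous, yet it contains only two distinct diseases, so background knowledge that rules out heart disease leaves cancer as the only candidate, as in Carl's case.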

2.3 Generalization

[Figure: generalization]
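
Generalization replaces a value with a less specific one, for example by masking trailing digits of a ZIP code or reporting an age only by decade. The sketch below combines this with the removal of explicit identifiers described in 2.2. The hierarchy levels, the column names, and the functions `generalize_zip`, `generalize_age`, and `anonymize` are assumptions chosen for illustration.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a ZIP code with '*'."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level

def generalize_age(age, level):
    """Level 0 keeps the age, level 1 keeps the decade, level 2 hides it."""
    if level == 0:
        return str(age)
    if level == 1:
        return f"{age // 10}*"
    return "*"

def anonymize(records, explicit_identifiers, zip_level, age_level):
    """Drop explicit identifiers, then generalize the quasi-identifiers."""
    result = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in explicit_identifiers}
        row["zip"] = generalize_zip(row["zip"], zip_level)
        row["age"] = generalize_age(row["age"], age_level)
        result.append(row)
    return result

# Hypothetical raw record; the name is the explicit identifier.
RAW = [{"name": "Bob", "zip": "47678", "age": 27, "disease": "heart disease"}]
print(anonymize(RAW, {"name"}, zip_level=2, age_level=1))
```

Here the generalization is applied to a whole column at once (attribute-level generalization), which keeps every value of an attribute in the same domain of the hierarchy.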

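2.4 Suppression

Suppression is introduced to reduce the amount of generalization that is needed, so that the released data stays more precise. When a limited number of records (tuples with fewer than k occurrences, called outliers) would force a higher level of generalization, suppression is used to moderate the generalization process. In the figure below, the records shown in italics can simply be replaced by *, because they do not satisfy 2-anonymity and would otherwise force excessive generalization.
[Figure: suppression example]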
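
A minimal sketch of tuple-level suppression, assuming the table has already been generalized: any record whose equivalence class has fewer than k members is treated as an outlier and all of its values are replaced by `*`. The rows and the function name `suppress_outliers` are hypothetical.

```python
from collections import Counter

def suppress_outliers(records, quasi_identifiers, k):
    """Replace every value of an outlier record (class size < k) with '*'."""
    def class_key(record):
        return tuple(record[attr] for attr in quasi_identifiers)

    class_sizes = Counter(class_key(record) for record in records)
    result = []
    for record in records:
        if class_sizes[class_key(record)] < k:
            result.append({attr: "*" for attr in record})  # suppressed tuple
        else:
            result.append(dict(record))  # kept unchanged
    return result

# Made-up rows: the third record is the only one in its class.
ROWS = [
    {"race": "asian", "zip": "9413*"},
    {"race": "asian", "zip": "9413*"},
    {"race": "black", "zip": "9414*"},
]
print(suppress_outliers(ROWS, ["race", "zip"], k=2))
```

Suppressing the single outlier keeps the other two records at their current precision, whereas generalizing further to absorb it would coarsen every row.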

2.5 k-Minimal Generalization (with Suppression)

The application of generalization and suppression to a private table PT produces more general (less precise) and less complete (if some tuples are suppressed) tables that provide better protection of the respondents’ identities. Finding a k-minimal generalization avoids generalizing or suppressing more than necessary. k-minimal generalization with suppression builds on the following definition of a distance vector.

Definition (Distance vector). Let $T_{i}(A_{1},...,A_{n})$ and $T_{j}(A_{1},...,A_{n})$ be two tables such that $T_{j}$ is a generalization of $T_{i}$. The distance vector of $T_{j}$ from $T_{i}$ is the vector $DV_{i,j} = [d_{1},...,d_{n}]$, where each $d_{z}, z = 1,...,n$, is the length of the unique path between $dom(A_{z}, T_{i})$ and $dom(A_{z}, T_{j})$ in the domain generalization hierarchy $DGH_{D_{z}}$.
[Figure 7]

Figure 7 above shows that the blank cells left after anonymization are the suppressed tuples, and that what remains satisfies k-anonymity. This raises a question: is it better to lose precision through generalization or to lose completeness through suppression? Samarati's answer is to define a threshold $\mathsf{MaxSup}$, specifying the maximum number of tuples that can be suppressed. The definition of k-minimal generalization with suppression follows:
[Figure: definition of k-minimal generalization with suppression]
Intuitively, this definition states that a generalization $T_{j}$ is k-minimal iff it satisfies k-anonymity, it does not enforce more suppression than it is allowed ( $|T_{i}|-|T_{j}| \le \mathsf{MaxSup}$ ), and there does not exist another generalization satisfying these conditions with a distance vector smaller than that of $T_{j}$ . For example, consider Figure 1 below:
[Figure 1]
With MaxSup = 2, QI = {Race, ZIP}, and k = 2, there are two k-minimal generalizations with suppression, as shown in Figure 7.
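
The search for k-minimal generalizations can be sketched as a brute-force pass over all distance vectors. Everything concrete here is an assumption: the original Figure 1 is not reproduced, so the rows are invented, the Race hierarchy has a single level (`person`), the ZIP hierarchy masks one digit per level, and the function names are hypothetical. A vector is treated as smaller than another when it is less than or equal in every component and differs in at least one.

```python
from collections import Counter
from itertools import product

MAX_LEVELS = (1, 2)  # assumed hierarchy heights for (Race, ZIP)

def generalize(row, levels):
    race, zip_code = row
    race_level, zip_level = levels
    new_race = race if race_level == 0 else "person"
    new_zip = zip_code[: len(zip_code) - zip_level] + "*" * zip_level
    return (new_race, new_zip)

def tuples_to_suppress(rows, levels, k):
    """Count records that fall in classes smaller than k after generalizing."""
    sizes = Counter(generalize(row, levels) for row in rows)
    return sum(size for size in sizes.values() if size < k)

def dominates(a, b):
    """True if distance vector a is strictly smaller than b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def k_minimal_vectors(rows, k, max_sup):
    candidates = product(*(range(m + 1) for m in MAX_LEVELS))
    feasible = [dv for dv in candidates if tuples_to_suppress(rows, dv, k) <= max_sup]
    # Keep only feasible vectors that no other feasible vector dominates.
    return sorted(dv for dv in feasible if not any(dominates(o, dv) for o in feasible))

# Invented table with QI = (Race, ZIP).
ROWS = [
    ("asian", "94138"), ("asian", "94138"), ("asian", "94142"), ("asian", "94142"),
    ("black", "94138"), ("black", "94141"), ("black", "94142"),
    ("white", "94138"), ("white", "94139"), ("white", "94141"),
]
print(k_minimal_vectors(ROWS, k=2, max_sup=2))
```

For these invented rows, generalizing only Race (vector [1, 0]) leaves one outlier and generalizing only ZIP by one level (vector [0, 1]) leaves two, so both stay within MaxSup = 2 and neither is smaller than the other. Exhaustive enumeration is only practical for tiny hierarchies; it is used here purely to make the definition concrete.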

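2.6 Classification of k-anonymity techniques

[Figure: classification of k-anonymity techniques]
Generalization can be applied at the level of: (i) Attribute, where an entire column is generalized; (ii) Cell, whose complexity is too high; (iii) None.
Suppression can be applied at the level of: (i) Tuple, suppression is performed at the level of row; (ii) Attribute, suppression is performed at the level of column; (iii) Cell, suppression is performed at the level of single cells; (iv) None.
This topic is too large to cover here; for details, see the paper "Classification of k-Anonymity Techniques".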

3. l-diversity

3.1 Definition of l-diversity

An equivalence class is said to have l-diversity if there are at least l "well-represented" values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity. The term "well-represented" can be interpreted in several ways:

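Distinct l-diversity: ensure there are at least l distinct values for the sensitive attribute in each equivalence class. This does not prevent probabilistic inference attacks: if one sensitive (SD) value appears in an equivalence class much more frequently than the others, an attacker can easily gain information.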
Entropy l-diversity: the entropy of an equivalence class $E$ is defined as

$Entropy(E)=-\sum_{s \in S}p(E,s)\log p(E,s)$
where $S$ is the domain of the sensitive attribute, and $p(E, s)$ is the fraction of records in $E$ that have sensitive value $s$. A table is said to have entropy l-diversity if for every equivalence class $E$, $Entropy(E) \ge \log l$.
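
To illustrate the difference between the distinct and entropy variants, here is a small sketch that evaluates both on two equivalence classes. Each class is given simply as the list of its sensitive values. The data and the function names are assumptions, the natural logarithm is used, and a small tolerance guards the comparison against floating-point rounding.

```python
import math
from collections import Counter

def entropy(sensitive_values):
    """Entropy(E) = -sum over s of p(E, s) * log p(E, s)."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def has_distinct_l_diversity(classes, l):
    return all(len(set(values)) >= l for values in classes)

def has_entropy_l_diversity(classes, l, tol=1e-12):
    # tol is an implementation detail, not part of the definition.
    return all(entropy(values) >= math.log(l) - tol for values in classes)

# Hypothetical equivalence classes, listed by sensitive value only.
SKEWED = ["flu", "flu", "flu", "cancer"]
BALANCED = ["flu", "cancer", "heart disease"]

print(has_distinct_l_diversity([SKEWED, BALANCED], 2))
print(has_entropy_l_diversity([SKEWED, BALANCED], 2))
print(has_entropy_l_diversity([BALANCED], 2))
```

The skewed class has two distinct values, so it passes distinct 2-diversity, but its entropy is about 0.56, below $\log 2 \approx 0.69$, so it fails entropy 2-diversity. That is the kind of frequency imbalance the distinct variant cannot detect.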
Recursive $(c,l)$-diversity: Recursive $(c, l)$-diversity makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely. Let $m$ be the number of values in an equivalence class, and $r_{i}, 1 \le i \le m$ be the number of times that the $i^{th}$ most frequent sensitive value appears in an equivalence class $E$. Then $E$ is said to have recursive $(c, l)$-diversity if $r_{1} < c(r_{l} +r_{l+1} +...+r_{m})$. A table is said to have recursive (