Data Privacy: Principles and Practice, Condensed Notes (Part 1)

Article link: https://blog.csdn.net/scuLVLV/article/details/70624625

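This article introduces the basic concepts of data privacy, including the anonymization process, why data is shared, and how it can be protected. It focuses on the privacy challenges of different types of sensitive data, such as relational data, transaction data, longitudinal data, graph data, and time series data, and presents the corresponding anonymization techniques and strategies.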


1. Basic concepts

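1.1 Relational data, also called multidimensional data, is the most widely used data structure in enterprises. A family of data protection techniques exists for it, including randomization, generalization, k-anonymization, l-diversity, and t-closeness.
Enterprises do not use multidimensional data alone. They also use many other data structures, such as graphs, longitudinal data, sparse high-dimensional transaction data, time series data, spatiotemporal data, semistructured XML data, and big data. These are classed as complex data. Anonymization methods designed for multidimensional data cannot be applied directly to complex data.
PII: Personally Identifiable Information.
EI: Explicit identifiers, such as social security number, health insurance number, and name.
QI: Quasi-identifiers, such as geographic location, phone number, and email address, including some publicly available data. Quasi-identifiers play an important role in keeping data confidential.
SD: Sensitive data, such as salary, financial status, and health condition. This must not be disclosed.
NSD: Nonsensitive data.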
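The four attribute classes above can be made concrete by tagging each column of a table with its class. The following is a minimal sketch, assuming a hypothetical schema; the column names (`name`, `ssn`, `zip_code`, `email`, `salary`, `favorite_color`) and the helper functions are illustrative and do not come from the original notes.

```python
from enum import Enum


class AttrClass(Enum):
    EI = "explicit identifier"
    QI = "quasi-identifier"
    SD = "sensitive data"
    NSD = "nonsensitive data"


# Hypothetical schema: every column name here is an assumption for illustration.
SCHEMA = {
    "name": AttrClass.EI,
    "ssn": AttrClass.EI,
    "zip_code": AttrClass.QI,
    "email": AttrClass.QI,
    "salary": AttrClass.SD,
    "favorite_color": AttrClass.NSD,
}


def columns_of(schema, attr_class):
    """Return the column names tagged with the given class, in schema order."""
    return [col for col, cls in schema.items() if cls is attr_class]


def split_record(record, schema):
    """Group the fields of one record by attribute class."""
    groups = {cls: {} for cls in AttrClass}
    for col, value in record.items():
        groups[schema[col]][col] = value
    return groups
```

Tagging columns up front makes it explicit which fields each later protection step is meant to handle.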

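1.2 Anonymization is the process of separating personally identifiable information (PII) from sensitive data (SD).
1.3 Privacy versus anonymity: with privacy, we know who a person is but not the private facts associated with them; with anonymity, we know some private facts but cannot link them to a specific person. Stated in full: "Under the condition of privacy, we have knowledge of a person's identity, but not of an associated personal fact, whereas under the condition of anonymity, we have knowledge of a personal fact, but not of the associated person's identity."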

2. Anonymization scenarios

Anonymization has two steps: (1) data masking; (2) de-identification.
2.1 Data masking is a technique applied to systematically substitute, suppress, or scramble data that call out an individual, such as names, IDs, account numbers, SSNs, etc. In short, it obscures the original data.
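The three masking operations named above (substitute, suppress, scramble) can be sketched as small functions. This is only an illustration under assumed conventions: the function names, the `ID0001`-style pseudonym format, and the `*` fill character are hypothetical choices, not something the notes prescribe.

```python
import random


def suppress(value, keep_last=0, fill="*"):
    """Replace all but the last `keep_last` characters with `fill`."""
    text = str(value)
    if keep_last <= 0:
        return fill * len(text)
    return fill * (len(text) - keep_last) + text[-keep_last:]


def substitute(value, mapping, prefix="ID"):
    """Replace a value with a pseudonym, reusing the one already assigned to it.

    `mapping` is a dict owned by the caller; it keeps substitutions consistent
    so the same input always maps to the same pseudonym.
    """
    if value not in mapping:
        mapping[value] = f"{prefix}{len(mapping) + 1:04d}"
    return mapping[value]


def scramble(value, rng):
    """Shuffle the characters of a value using the supplied random generator."""
    chars = list(str(value))
    rng.shuffle(chars)
    return "".join(chars)


pseudonyms = {}
masked = {
    "name": substitute("Alice", pseudonyms),           # consistent pseudonym
    "ssn": suppress("123-45-6789", keep_last=4),        # only last 4 characters kept
    "account": scramble("ACC-9912", random.Random(0)),  # same characters, new order
}
```

Plain character scrambling like this is weak on its own, since the original characters are all still present; it is shown only to illustrate the idea.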

2.2 De-identification is applied to QIs, such as date of birth, gender, and zip code, which help in identifying a person.
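One common way to de-identify QIs is generalization, which section 1.1 lists among the protection techniques. The sketch below assumes that approach: it coarsens a date of birth into a year range and truncates a zip code. The bucket size, the number of digits kept, and all function and column names are illustrative assumptions.

```python
from datetime import date


def generalize_birth_date(birth_date, bucket_years=10):
    """Replace a date of birth with a year range such as '1980-1989'."""
    start = birth_date.year - birth_date.year % bucket_years
    return f"{start}-{start + bucket_years - 1}"


def generalize_zip(zip_code, keep_digits=3):
    """Keep the leading digits of a zip code and mask the rest with '*'."""
    zip_code = str(zip_code)
    return zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)


def de_identify(record, qi_rules):
    """Apply a generalization rule to each listed QI column; copy other columns."""
    return {
        col: qi_rules[col](value) if col in qi_rules else value
        for col, value in record.items()
    }


# Hypothetical record and rules, for illustration only.
record = {"birth_date": date(1985, 3, 14), "zip_code": "56001", "salary": 5000}
rules = {"birth_date": generalize_birth_date, "zip_code": generalize_zip}
result = de_identify(record, rules)
# result == {"birth_date": "1980-1989", "zip_code": "560**", "salary": 5000}
```

Coarser buckets make records harder to tell apart but reduce the analytical value of the data, so the bucket sizes are a trade-off rather than fixed values.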

Suppose the original database is as follows: