Data Security and Privacy: Key Study Notes

Part1: Internet Communication

1.1 Risk and Countermeasure

Risks: eavesdropping on data, manipulating data, impersonation
Countermeasure: the TLS protocol

1.2 TLS Protocol

1.2.1 Overview

1.2.1.1 HTTP (Hypertext Transfer Protocol)
  • "Normal" protocol for enabling Internet connections
  • Provides no security for the data
  • Everybody with access to the data can read and modify it
1.2.1.2 http ⇒ https = http + TLS
  • TLS: Transport Layer Security
  • TLS runs below HTTP (HTTPS = HTTP over TLS) and provides a secure Internet connection
1.2.1.3 TLS Overview

Establish Communication -> Check Identity -> Agree on common secret key -> Encrypted Communication
The last three steps are security-relevant.

1.2.2 Encryption

1.2.2.1 Symmetric Key Encryption

Most commonly used scheme: AES (Advanced Encryption Standard)

1.2.2.2 Formal Definition

A secret key encryption scheme is composed of three sets (Key space K, Message space M, Ciphertext space C) and three algorithms: Key generation Gen, which outputs a key k ∈ K; Encryption algorithm Enc: K × M → C; Decryption algorithm Dec: K × C → M
Correctness: Dec(k, Enc(k, m)) = m for all k ∈ K and m ∈ M
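A minimal sketch of how such a scheme is used, assuming the third-party `cryptography` package and its AES-GCM mode (an authenticated variant of AES):

```python
# Sketch of Gen / Enc / Dec using AES-GCM (assumes the `cryptography` package).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Gen: output a random key k from the key space K (here: 256-bit keys)
key = AESGCM.generate_key(bit_length=256)

# Enc: K x M -> C (AES-GCM additionally needs a fresh nonce per message)
nonce = os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, b"hello server", None)

# Dec: K x C -> M, correctness: Dec(k, Enc(k, m)) = m
plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)
assert plaintext == b"hello server"
```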

1.3 Establish a common secret key between the two parties

1.3.1 Key Agreement

1.3.1.1 Key Agreement Procedure

Goal: Establish a common secret key between the two parties

  • Over a public channel
  • No previous secrets

Two approaches

  • Public Key Encryption (not symmetric encryption), e.g., RSA
  • Key Agreement Protocol, e.g., Diffie-Hellman Protocol
1.3.1.2 Public Key Encryption

Public Key Encryption

1.3.1.3 Formal Definition

A public key encryption scheme is composed of three sets: Key space K, Message space M, and Ciphertext space C. Three algorithms: Key generation Gen: [Set of integers (this integer is a security parameter that determines the security level)] → K, which outputs a keypair (pk, sk), being a public key and a secret key; Encryption algorithm Enc: (pk, m) -> c; Decryption algorithm Dec: (sk, c) -> m
Correctness: For each message 𝑚 ∈ 𝑀 and each keypair (pk, s𝑘) ∈ 𝐾, it holds that
Dec (sk, Enc(pk, m)) = m

1.3.1.4 Key Agreement using Public Key Encryption

Key Agreement using Public Key Encryption
Why not encrypt the messages directly with the public key encryption scheme instead of combining the two schemes?
Because a symmetric key encryption scheme is much more efficient than a public key encryption scheme, so in the long run messages can be encrypted faster (see the sketch below).
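A minimal sketch of this hybrid idea, assuming the `cryptography` package: the slow public key scheme (RSA-OAEP) is used once to transport a fresh AES key, and all further data is protected with the fast symmetric scheme:

```python
# Hybrid sketch (assumes the `cryptography` package): RSA-OAEP transports an AES key.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Server: Gen() -> (pk, sk); pk is published (e.g., inside a certificate)
sk = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pk = sk.public_key()

# Client: choose a fresh symmetric key and send it encrypted under pk
sym_key = AESGCM.generate_key(bit_length=256)
wrapped_key = pk.encrypt(sym_key, oaep)

# Server: recover the symmetric key with sk
assert sk.decrypt(wrapped_key, oaep) == sym_key

# From now on, bulk data is protected with the fast symmetric scheme
nonce = os.urandom(12)
msg = AESGCM(sym_key).encrypt(nonce, b"bulk application data", None)
```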

1.3.1.5 Key Agreement Protocol

Key Agreement Protocol
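A toy sketch of the Diffie-Hellman idea with deliberately tiny, insecure numbers (real deployments use large prime-order groups or elliptic curves); both parties end up with the same secret without ever transmitting it:

```python
# Toy Diffie-Hellman sketch with deliberately tiny, insecure parameters.
p, g = 23, 5           # public: prime modulus and generator (illustration only)

a = 6                  # Alice's secret exponent
b = 15                 # Bob's secret exponent

A = pow(g, a, p)       # Alice sends A = g^a mod p
B = pow(g, b, p)       # Bob sends   B = g^b mod p

# Both sides compute the same shared secret g^(ab) mod p
assert pow(B, a, p) == pow(A, b, p)
shared_key = pow(B, a, p)
```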

1.3.2 Digital Signatures

1.3.2.1 Authenticity
1.3.2.2 Digital Signature

Digital Signature

1.3.2.3 Formal Definition

A digital signature scheme is composed of three sets: Key space K, Message space M, and Signature space S (which replaces the ciphertext space, since a digital signature scheme does not produce encrypted messages but signatures). Three algorithms: Key generation Gen: [Set of integers] → K, which outputs a keypair (pk, sk), being a public key and a secret key; Signature algorithm Sign: K × M → S, (sk, m) -> s; Verification algorithm Verify: K × M × S → {True, False}, (pk, m, s) -> {True, False} (decides whether, under this public key, the signature really belongs to the message)
Correctness: Verify(pk, m, Sign(sk, m)) = True
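A minimal sketch of Gen/Sign/Verify, assuming the `cryptography` package and Ed25519 signatures (the library signals a failed verification by raising an exception rather than returning False):

```python
# Sketch of a digital signature scheme (assumes the `cryptography` package, Ed25519).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

sk = Ed25519PrivateKey.generate()   # Gen: output keypair (pk, sk)
pk = sk.public_key()

message = b"certified public key data"
signature = sk.sign(message)        # Sign: (sk, m) -> s

try:
    pk.verify(signature, message)   # Verify: (pk, m, s) -> True/False
    print("valid")                  # correctness: Verify(pk, m, Sign(sk, m)) = True
except InvalidSignature:
    print("invalid")
```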

1.3.2.4 Security Goals
  • Authenticity
  • Non-repudiation
  • Integrity
    Note: Confidentiality is not a security goal
1.3.2.5 Certificate

Servers authenticate by providing a certificate.
A certificate is a document that contains information about a public key (value, duration, etc.) and is signed by a trusted third party, a Certificate Authority (CA).

1.4 Surveillance

1.5 NSA - Snowden Revelations

1.6 TCP/IP MODEL

1.6.1 TCP/IP

  • TCP/IP: transmission control protocol/internet protocol
  • family of internet protocols
  • often referred to as Internet protocol
  • Main goal: ensure that data packets arrive at their destination within a decentralized network

1.6.2 TCP/IP model

  • Layers: Application Layer -> Transport Layer -> Internet Layer -> Network Access Layer -> Media for data transfer
  • Approach: separate functions in the network are abstracted from their underlying structure
  • Advantage: flexibility to address the functions(layers) independently
  • Important: define interfaces between the different layers
  • Data exchange is handled by encapsulation

1.6.3 Encapsulation

Application Layer: data
Transport Layer: TCP header + data
Internet Layer: IP header + IP data(=TCP header + data)
Network Access Layer: Frame Header + Frame Data(=IP header + IP data) + Frame Footer
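A toy illustration of this encapsulation as nested byte strings; the header contents are placeholders, not real protocol formats:

```python
# Toy illustration of TCP/IP encapsulation; headers are placeholders, not real formats.
data = b"GET / HTTP/1.1 ..."                                # Application layer

segment = b"<TCP header>" + data                            # Transport layer
packet = b"<IP header>" + segment                           # Internet layer (IP data = TCP header + data)
frame = b"<frame header>" + packet + b"<frame footer>"      # Network access layer
```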

1.6.4 IP Address

  • Identifier of a party in the Internet
  • Locating a service/website
  • Users commonly use URLs (Uniform Resource Locator)
  • It’s located in the IP header

1.6.5 IP Packets Leak Information

TLS (Transport Layer Security) encrypts only at the transport layer and does not touch the Internet layer. Therefore the IP header added at the Internet layer (including the IP addresses it contains) is leaked.

1.7 IPsec

1.7.1 Overview

IP Security (IPsec)

  • Collection of protocols to provide security for a packet at the IP level
  • It leaves the selection of the encryption, authentication, and hashing methods to the user

Security Goals of IPsec

  1. Confidentiality
  2. Integrity
  3. Authenticity

IPsec Main Components

  1. Secret Key: Manual keys; Internet Key Exchange (IKE and IKEv2): negotiates protocols and algorithms, and generates the encryption and authentication keys to be used by IPsec (IKE can be used instead of manual keys)
  2. Security Protocols: Authentication Header (AH): used to authenticate, but not encrypt, IP traffic; Encapsulating Security Payload (ESP): authenticates and/or encrypts IP packets
  3. Modes:
    Transport mode: secures IP‘s payload (Inserts additional information + The header is left untouched + Payload may have changed)
    Tunnel mode: secures entire IP packet (Encapsulates header and
    payload + Prepends new header + Allows for more security)

1.7.2 Authentication Header AH Protocol

1.7.2.1 Overview

AH Protocol
Goals

  • Identifies the sender of the package
  • Ensures integrity of the payload
  • No encryption (confidentiality of data cannot be provided)

Overall Approach (roughly the same in both modes)

  1. Computes and inserts the so-called AH header (the AH header contains information that allows checking the authenticity of the payload)
  2. (Possibly) changes the IP header (and payload), depending on which mode is used

AH Header: Detailed Structure
Next Header – links IPsec headers
Payload length – length of whole AH packet
Reserved – for future use
Security Parameters Index – identifies security setting
Sequence Number – Counter to prevent replay attacks
Authentication Data – Integrity Check Value (ICV): the only field that is empty at the beginning and needs to be computed

1.7.2.2 In the Transport Mode

Step 1: Compute ICV
Integrity Check Value: HMAC value of [Original IP header and payload + Incomplete AH header + Using secret key]
Compute ICV in Transport Mode

Step 2: Creating Packet

  • Inserts the new AH header between the "old" IP header and the "old" payload
  • All fields are authenticated by the Integrity Check Value contained in the Authentication Data field
    Creating Packet in the Transport Mode
1.7.2.3 In the Tunnel Mode

Step 1: Compute ICV
Compute ICV in Tunnel Mode

  • Encapsulate header and payload into new payload
  • Prepend new header
  • Integrity Check Value: HMAC value of [New IP header and payload + Incomplete AH header + Using secret key]

Step 2: Creating Packet

  • Inserts new AH header between new IP header and new payload
  • All fields are authenticated by the Integrity Check Value contained in the authentication data field
  • Original header is encapsulated within the payload
    Creating Packet in Tunnel Mode
1.7.2.4 HMAC

H: hash function
MAC: message authentication code
Functionality

  • Compresses data to a certain length => fingerprint
  • Fingerprint authenticates data based on secret key

Secret key k=(k1, k2)
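A minimal sketch using Python's standard hmac module (HMAC-SHA256); the split k = (k1, k2) mentioned above corresponds to the inner and outer keys that HMAC derives internally:

```python
# HMAC sketch using the standard library (HMAC-SHA256).
import hashlib
import hmac
import os

key = os.urandom(32)                                   # shared secret key k
data = b"IP header || payload || AH header (ICV field zeroed)"

icv = hmac.new(key, data, hashlib.sha256).digest()     # fixed-length fingerprint

# Receiver recomputes the HMAC and compares in constant time
assert hmac.compare_digest(icv, hmac.new(key, data, hashlib.sha256).digest())
```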

1.7.3 Encapsulating Security Payload Protocol(ESP)

  • ESP encrypts data (encryption can be turned off)
  • Prepends the ESP header and appends the ESP trailer to the payload

ESP
Authentication in ESP
Authentication Data: HMAC value of [ESP header + (Encrypted) Payload + ESP trailer]
ESP – Modes

  • Transport mode
    Payload is encrypted + The header is left untouched.
    ESP-transport mode
  • Tunnel mode
    Both the original IP header and the payload are encrypted + A new IP header
    is prepended.
    ESP-tunnel mode

1.7.4 Hash Function

A hash function H: {0, 1}* -> {0, 1}^n
is a deterministic algorithm which maps bitstrings of arbitrary length to bitstrings of fixed length (n bits).
Typical length for n: 256, 512
H(x) is the fingerprint of x
Security Requirements

  • For a cryptographic hash function, we require that the hash value H(x) is characteristic for x.
  • In particular, even small changes in x should result in significant changes in H(x).
  • The output of H should be unpredictable.
  • In general, the following conditions should be met by a cryptographic hash function: Collision Resistance, Preimage Resistance, Second Preimage Resistance
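A small sketch with Python's hashlib illustrating the fixed output length and the avalanche effect (a one-character change produces a completely different fingerprint):

```python
# Hash function sketch: fixed-length fingerprints, small input change -> very different output.
import hashlib

h1 = hashlib.sha256(b"data security").hexdigest()
h2 = hashlib.sha256(b"data securitY").hexdigest()   # one character changed

print(len(h1) * 4)   # 256 bits of output, independent of input length
print(h1)
print(h2)            # looks completely unrelated to h1
```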

1.7.5 Message Authentication Code

MAC
A message authentication code (MAC) scheme is composed of three sets: Key space K (often {0,1}^n), Message space M, Tag space T
Three algorithms
Key generation Gen: [Set of integers] → K that outputs a key k ∈ K
Signature algorithm Sign: K × M → T, (k, m) -> t
Verification algorithm: Verify: K × M × T → {True, False} , (k, m, 𝑡) -> True/False
such that
All algorithms are efficient
For each message m ∈ 𝑀 and each k ∈ K, it holds that Verify(k,m,Sign(k,m))=True

Security: It should be difficult to generate a valid tuple (m,t) of message and MAC tag without knowing the secret key k
MACs can be seen as the symmetric key variant of digital signatures
Comparison to digital signatures:

  • Uses the same key for signing and verification
  • Less flexible (only the key holders can verify; with a digital signature's public key, anyone can verify)
  • More efficient

1.8 Virtual Private Network (VPN)

1.8.1 Private Network

Separate, closed network + private lines/connections
Examples: Ethernet LAN, Optical fibers directly connecting devices, Wifi with access control
Problem: Private network over large distance?

  • Example: company network
  • Example: university network

1.8.2 Virtual Private Network

A VPN is a virtual network, built on top of existing physical networks, that can provide a secure communications mechanism for data and other information transmitted between networks. Because a VPN can be used over existing networks, such as the Internet, it can facilitate the secure transfer of sensitive data across public networks.

Functionality

  • Authentication
  • Access Control
  • Confidentiality
  • Integrity

Structure
Connected via secure and private channels, called tunnels
VPN
Procedure
Example: User A in Network 1 wants to communicate securely and privately with user B in network 3
Steps:

  1. User A establishes secure communication channel with nearest gateway (G1)
  2. A sends data for B to G1
  3. G1 forwards data to G3, using the tunnel
  4. G3 establishes secure channel with B and securely forwards data from A to B

Security
All gateways are communicating with each other;
Users/servers communicate with individual gateways;
As the content of the communication is encrypted, one cannot deduce whether certain users/servers are all communicating
Possible Realizations

  1. IPsec & Tunnel Mode (Internet Layer)
    - Gateway encrypts original header and payload and prepends a new header for the destination gateway
    - Destination gateway decrypts the payload and forwards the original packet
  2. SSL-VPN (Transport Layer)
    - Kind of Remote-Access-VPN
    - Approach: Applications use HTTP browser/server as "gateways"
    - Advantages: Cannot be detected easily (as opposed to IPsec packets in tunnel mode) => blocking is more difficult; allows for more granular access control, e.g., for individual applications; no dedicated clients necessary
    - Disadvantages: Management is more involved; only supports browser-based applications

Problem

  • Privacy?
    • Service Provider knows everything
    • Some information need to be logged, e.g., for billing customers
    • What else is stored?
  • VPN Blocking
    • VPN traffic can be detected and be blocked
      Block certain ports
      Block IP addresses from known VPN servers
      Etc.
    • Some countries have either placed technological barriers that block VPNs or passed laws prohibiting the use of VPNs

1.9 Tor

Initially an abbreviation for The Onion Router
Goal: Ensure private Internet communication, i.e., hide communication partners

1.9.1 Tor Nodes

  • Servers that voluntarily participate in the Tor network
  • Communication via three Tor nodes, called a circuit

1.9.2 Steps

  • Choose Circuit
    Download current list of Tor nodes
    Select randomly three Tor nodes
  • Establish Secure Channel
    Get public keys of the three Tor nodes from Tor database
    Use these public keys to establish secure channel
  • Establish Circuit
    User establishes pairwise secret keys with each node
    The keys are used to encapsulate the data from user to server
    Each Tor node learns only the direct predecessor and direct successor
  • Sending Message
    Each message is encrypted with three layers (=onion)
    One layer = one key
    Each node can decrypt/remove the outermost layer (see the sketch below)
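A toy sketch of this onion layering with three symmetric keys, using Fernet from the `cryptography` package as a stand-in for the per-node keys (real Tor uses its own circuit cryptography):

```python
# Toy onion-routing sketch (assumes the `cryptography` package; not real Tor crypto).
from cryptography.fernet import Fernet

k1, k2, k3 = (Fernet.generate_key() for _ in range(3))   # pairwise keys with nodes 1..3

message = b"request to destination server"

# Client wraps the message in three layers: innermost for node 3, outermost for node 1
onion = Fernet(k1).encrypt(Fernet(k2).encrypt(Fernet(k3).encrypt(message)))

# Each node removes exactly one (the outermost) layer
layer2 = Fernet(k1).decrypt(onion)     # node 1
layer3 = Fernet(k2).decrypt(layer2)    # node 2
plain = Fernet(k3).decrypt(layer3)     # node 3 forwards the plaintext request
assert plain == message
```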

1.10 HTTP

HyperText Transfer Protocol (HTTP)

  • Protocol which allows for fetching of resources, such as HTML documents
  • Foundation of any data exchange in the Internet
  • Application layer (TCP/IP model) protocol

HTTP Protocol
Two parties: Client (user), Server
Two types of messages: Request, Response
Client sends requests
Server sends responses
Structure
HTTP Request
Request method + Header fields + Separator (empty line) + Message body (optional)
HTTP Response
Status code line + Header fields + Separator (empty line) + Message body (optional)
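An illustrative (hypothetical) request and response showing this structure (host and user agent names are made up):

```
GET /index.html HTTP/1.1          <- request method line
Host: www.example.org             <- header fields
User-Agent: ExampleBrowser/1.0
                                  <- separator (empty line), no message body

HTTP/1.1 200 OK                   <- status code line
Content-Type: text/html           <- header fields
                                  <- separator (empty line)
<html> ... </html>                <- message body
```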
User Agent
Denotes client software
Responsible for fetching/displaying Internet content
Examples: Web browser, Email readers
Sometimes, “user agent” also refers to the HTTP header field of the same name:

Part2: User Tracking

2.1 Cookies

2.1.1 Overview

2.1.1.1 Purpose

HTTP is stateless, which is inconvenient: e.g., are you currently logged in? Internet shopping: did you put something into the shopping cart? Only non-personalized results. -> Need to store state/information with cookies

2.1.1.2 Work Flow
  • When the client connects to the server for the first time, the server creates a cookie (Set-Cookie header field). Cookie = text string
  • The server sends the cookie to the client, who stores the cookie locally
  • Later on, the client sends the local cookie back to the server (Cookie header field). More precisely, the cookie is inserted into each future HTTP request
2.1.1.3 Structure of a Cookie

Text string starting with Set-Cookie or Cookie

  • Set-Cookie: Issued by server to initiate cookie
  • Cookie: Used in future requests sent by client

Concatenation of one or several av-pairs (an av-pair is a token with one or several values)
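Illustrative (hypothetical) header lines showing both directions; the av-pairs after the first one are attributes such as Max-Age or Path:

```
Set-Cookie: sessionid=a1b2c3; Max-Age=3600; Path=/     (issued by the server)
Cookie: sessionid=a1b2c3                               (sent back by the client in later requests)
```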

2.1.1.4 Pros and Cons

Pros

  • Convenience (Session handling, Login status)
  • Personalization (Recommendations)
  • Effective Advertising

Cons

  • Security (Stealing cookies, e.g., to hijack session)
  • Privacy (Browsers can be tracked)

2.1.2 Cookie Security

Security Problems

  • Ambient authority (cookies can be misused for authentication, since naming a resource and the authentication process are decoupled)
  • No encryption and integrity protection
  • Session stealing

Recommendations for Developers
Use cookies with care; use them within secure protocols such as TLS
Some recommendations

  • Don’t store sensitive data in cookies, unless you absolutely have to.
  • Use Session cookies if possible. Otherwise set a strict expiration.
  • Use the HttpOnly and the Secure flags of cookies.
  • Set the SameSite flag to prevent other websites from making requests that carry your cookies
  • Leave the Domain attribute empty to prevent subdomains from using the cookie
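A hypothetical Set-Cookie line that follows these recommendations (the flag names are standardized, the values are made up):

```
Set-Cookie: sessionid=a1b2c3; Max-Age=600; Secure; HttpOnly; SameSite=Strict; Path=/
```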

2.1.3 Cookie Privacy

Targeted Advertising
Websites are often free to browse, but the website provider still wants to monetize, e.g., by serving adverts to visitors.
Targeted Advertising: Display personalized advertisements. The better the user profile is known, the higher the price the website can demand ⇒ collect as much data about the user as possible.
RFC 6265: Privacy
Cookies are often criticized for letting servers track users. For example, a number of ‘web analytics’ companies use cookies to recognize when a user returns to a website or visits another website.
Although cookies are not the only mechanism servers can use to track users across HTTP requests, cookies facilitate tracking because they are persistent across user agent sessions and can be shared between hosts.

  • Third-Party Cookies
    • Websites that users did not visit can track them
  • User Controls
    • User should have control over cookies
  • Expiration Dates

Expiration Date
The cookie specification includes fields for
expiration date (expires-av)
maximum age (max-age-av)
The expiration date can be set to any time. Recommendation: servers should choose rather short expiration periods. In practice, however, many websites do not follow this recommendation and set very long expiration dates.

2.1.4 Third-Party Cookies

Some content such as news comes from the server directly (first party). But: other content such as advertisements comes from third parties
Today‘s typical website uses resources from different sources
Content is requested by user agent
Consequence: Receiving resource from host A (user initiated) triggers subsequent HTTP Requests to hosts B, C,… (third parties) to retrieve all referenced resources (user agent initiated)
Recall: HTTP Responses may instruct the user agent to store a cookie
This includes HTTP Responses from third parties
Problem: users cannot easily tell which third-party cookies have been stored

2.1.5 User Control

Browsers provide a variety of options to manage cookies
Users have full control: Deleting cookies, Forbidding cookies, Making exceptions, …
Directive 2009/136/EC changed the policy from opt-out to user consent for cookie storage. User consent is not required for cookies serving basic functions, e.g., keeping track of a user's form inputs or shopping cart, session handling, etc.
In theory: Users should give informed consent to websites for storing cookies (actually, any data stored in a user's browser)
In practice: Websites inform users about their use of cookies using an eat-or-die approach (and sometimes hampering usability): the consent dialog is made so cumbersome that many users simply click "Accept Cookies" to avoid the hassle.

2.1.6 Countermeasure

Do Not Track (DNT) HTTP Header
Do Not Track (DNT) HTTP Header indicating tracking unwanted

  • Not standardized, considered more a friendly request to the server
  • Probably ignored by most websites
  • Does not prevent cookies from being stored in general
  • Some browsers allow to disable cookies completely
  • Problem: Some websites use them for login/session management

Private Browsing Mode
  • Browser accepts cookies (and other data) for this session
  • Pretends to store them as asked
  • Closing browser removes all cookies (and other data)
  • Somewhat detectable by websites

Zombie Cookies
  • HTTP cookie that returns to life automatically after being deleted by the user
  • Realized by information stored somewhere else
  • Most common approach: super cookie
  • Removal not impossible, but requires more sophisticated strategy
  • JavaScript library evercookie by Samy Kamkar creates Zombie Cookies

Using these mechanisms, in particular private browsing mode, does help. However, it does not guarantee that tracking is not possible. Some techniques that are not prevented: super cookies and browser fingerprinting.

2.1.7 Super Cookies

Umbrella term. It refers to techniques for storing information in places different from HTTP cookies

Examples:

  • Tracking Header
  • Browser Extensions

Opposed to HTTP cookies, super cookies are harder to detect and to prevent.

Internet Service Provider (ISP)
Before a device can browse the Internet, it needs to connect to its Internet Service Provider (ISP). That is, it has to log in there and all traffic is routed through the ISP.
2.1.7.1 Tracking Header

ISP needs to identify device/user
Tracking header = HTTP header inserted by ISP
Allows other websites to identify user

2.1.7.2 Browser Extensions

HTTP cookies just one out of many cookie ‘jars’
Browsers‘ functionalities can be extended by add-ons or plugins, e.g., Adobe Flash Player, Microsoft Silverlight, Java,…
Browser plugins may implement their own cookie jars. So now there are Flash-cookies, Silverlight-cookies, Java-cookies,… It is possible to embed Java/Flash code into HTML

Flash = animation software to display animations on web sites
Flash cookie, aka local shared object (LSO)
Text file that is sent by a Web server to a client when the browser requests Flash content. Unlike HTTP cookies which are stored with the browser’s files, Flash cookies are stored in a separate Adobe file and may have to be managed and deleted separately through Adobe Flash player settings.

HTML5
HTML5 exposes Storage APIs to websites, providing servers with client-side session/key-value storage and ‘simple’ SQL databases
New features include storage API

  • Allows to locally store data similar to cookies but supports larger data
  • Two types of storage:
    • Session storage saves the data until the session expiry
    • Local storage is persistent storage

2.2 Fingerprint

2.2.1 Overview

Motivation
Cookie Makes browsing stateful and Allows for tracking
User has full control over the information locally stored by the browser. If cookies are deleted, the server no longer recognizes the browser and hence the user
Providers (and attackers) developed other means for recognizing the browser: extracting and storing fingerprints
Fingerprints
Biometric feature of (at least) every human being which is

  • Robust: usually does not change over time (manipulation excluded)
  • Unique: no known non-unique set of fingerprints of two individuals

Used in crime scene investigation as a forensic method for identification. Also used for authentication purposes, e.g., in smartphones, PCs, physical security measures

Need to identify properties of browser-device-user combinations that are similar to human fingerprints. Following properties are desired

  • Robust: hardly changing over time
  • Unique: different enough to be distinguishable
  • Efficient: checking has marginal impact only on client‘s resources (e.g., smartphone battery)
  • Compatible: maximal number of clients of interest support properties

Derive characteristic features from client systems
Use these to recognize devices at a later point in time
Example: IP address
Robust: basically stays the same for lifetime of device
Unique: IPv4 addresses are unique by default (exceptions apply)
Efficient: communicated to destination, necessary for connection
Compatible: obviously, yes
Exceptions (incomplete)

  • Device used by multiple users → only ‘group’ identification
  • Device behind ‘popular’ proxy → many users share same IP, no direct
    identification

2.2.2 Entropy

Entropy H(X) expresses the uncertainty about a random variable X.
It holds H(X) ≥ 0 with

  • H(X)=0 means no uncertainty
  • The higher H(X), the larger the uncertainty and the harder it gets to guess the value of X.

In the context of fingerprints, a variable X may represent one feature/characteristic
The higher H(X), the more does X tell about the owner of the fingerprint=> features with high entropy are better suited for fingerprints
Let X be a random variable which takes values in a finite set A according to some distribution, described by Pr[X = a] for a ∈ A. The entropy of X, denoted by H(X), is defined by
H(X) = − Σ_{a ∈ A} Pr[X = a] · log2(Pr[X = a])
Approximating Entropy
X is a random variable which takes values of a finite set A according to some distribution, described by Pr[𝑋 = a] for a ∈ A.
Problem: Often this set and/or distribution are not known, e.g., in case of fingerprints.
Approximating:

  • A = set of observed distinct values
  • For a ∈ A, we replace Pr[𝑋 = a] by the frequency of a. That is, how often does a occur divided by all observations.
  • Example: if a occurred half of the time, its frequency is 1/2
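A small sketch that approximates the entropy of an observed feature from its value frequencies and also computes the normalized entropy used below for comparing features (the observed browser names are made up):

```python
# Approximate entropy of a fingerprint feature from observed values.
import math
from collections import Counter

observations = ["Firefox", "Chrome", "Chrome", "Safari", "Chrome", "Firefox"]

counts = Counter(observations)
n = len(observations)
freqs = [c / n for c in counts.values()]             # frequency replaces Pr[X = a]

entropy = -sum(p * math.log2(p) for p in freqs)      # H(X) in bits
normalized = entropy / math.log2(len(counts))        # H(X) / log2(N), N = observed distinct values

print(round(entropy, 3), round(normalized, 3))
```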

2.2.3 Browser Fingerprint

To provide better services, websites require knowledge about browser and environment. Example: desktop view vs. mobile view
Idea: gather such information for generating a fingerprint of the browser
Example Attributes

  1. User Agent
    User Agent = identification string of requesting application
    Commonly send, e.g., for ensuring compatibility
  2. HTTP Accept Headers
    Describes the capabilities of the user agent

Entropy = measures the "uncertainty" of a value
Problem: entropy values are not directly comparable, e.g., the number of possible values (events) can differ
To make entropy values comparable, one can consider the normalized entropy
Definition: H(X)/ log2 N, with N being the number of all events.
Motivation: log2 𝑁 represents the “worst case”, i.e., the case of maximum entropy

Countermeasures
Tor Browser

  • Meant to allow for private surfing
  • Comes with a number of different security levels
  • The higher the security level, the more features are disabled
  • In general: aims for “standard” values used in the browser

Browser Configuration

  1. Disable JavaScript
    Most tracking features (canvas, list of plugins,…) require JavaScript enabled
    But, most modern website require it for functioning properly
    Use of add-ons like NoScript could be beneficial, but also give away information
  2. Spoofing of attributes
    Must be done carefully
    Inconsistencies might be detected and used as information
  3. This and more can be accomplished by appropriate browser extensions (amiunique.org)

2.2.4 Cross-Browser Fingerprint

Browser Fingerprinting: Ask browser about features of the browser. Examples: plugins, languages, fonts, …
Cross-Browser Fingerprinting: Ask browser to perform tasks that rely on OS and hardware. Examples: graphics card, CPU, …
Cross-Browser Fingerprinting
Old Features with Major Modifications

  1. Screen Resolution
    Problem: Some browsers change resolution value in proportion to zoom level
    New: Use other information, e.g., ratio between screen width and height
  2. Number of CPU virtual cores
    Capability information
    Problem: not supported by early versions of browser
    New: request increasing number of tasks until finishing time increases significantly => max. number of cores reached
  3. AudioContext
    Create and re-process audio signals
    Problem: wave in the frequency domain can differ from browser to browser on the same machine
    New: peak values and their corresponding frequencies are relatively stable
  4. List of Fonts
    Which fonts are supported
    Problem: So far, Flash plugin has been used but Flash is disappearing
    New: measure width and height of a certain string to determine the font type

New Features: Rendering Tasks
Let the browser render different types of graphics -> Concrete image depends on GPU

2.2.5 WLAN

WLAN = Wireless Local Area Network
Data transmission system designed to provide location independent network access
Usual structure: Access point (AP) regulates access to the network + devices can join the network via the AP
SSID = Service Set Identification
Identifies a particular wireless network
A client must set the same SSID as the one in that particular AP to join the network
Without SSID, the client won’t be able to select and join a wireless network
SSID can be hidden (but this does not provide additional security)
Beacon Frames
AP regularly sends beacon frame

  • Announces presence of a WLAN
  • Includes timestamp and SSID

Association process

  • Required for joining a WLAN
  • Device scans channels, listening for beacon frames
  • Device selects AP to associate with; initiates association protocol

MAC address = Media Access Control (MAC) address
Used to identify device
Length of 48 bits
Client MAC address is sent in plaintext, i.e., can be eavesdropped
Automatic periodic scanning for WLANs with unchanged MAC address enables tracking -> Countermeasure: Randomize MAC address (Windows 10 introduced automatic MAC address randomization)

2.2.6 WLAN: NIC-Fingerprinting

Network Interface Card (NIC)

  • Responsible for sending and receiving WLAN signals
  • NIC has unique Media Access Control (MAC) address of 48 bits
  • Randomizing MAC address can help against tracking

Attack Approach
NICs show individual behavior which can be utilized for identification

Attack properties

  • Passive & non-intrusive
  • No co-operation required
  • Exploits system/driver/device differences

Methods (the fingerprint is generated in one of the following ways)

  1. Waveform Characteristics
  2. Clock Skew/Timing Inference (time intervals between probe requests)
  3. Improved Timing Inference (Transmission Time (TT) and frame Inter-Arrival Time
    (IAT))

2.2.7 WLAN: Probe-Request-Fingerprinting

Probe Requests
Normally: the device passively scans channels, listening for beacon frames. Once a beacon frame is received, the network is displayed to the user and the association process can be started. Some time delay until the next beacon frame is sent.
To accelerate the connection process, the device can actively send messages (probe requests) to look for known APs. Probe requests may contain a list of all known SSIDs (also hidden ones) to limit responses to those known SSIDs. Problem: probe requests may allow for tracking
Tracking Without Using MAC Address: Relies on frame counters to connect
frames. Use Information Elements (IEs) of probe requests. IE is an optional feature, containing a variable list of elements with varying order, exploitable as useful information
Anonymity Sets
Set of devices sharing the same fingerprint value. The larger the set, the more anonymous
SSID-Fingerprint
List of SSIDs searched by a device
Idea: which SSIDs are known to device is characteristic for the device

2.2.8 Mobile Communication Fingerprinting

GSM Structure
In GSM, the user (usually) authenticates to the SIM, which then authenticates to the mobile network (via a base station). There is no network authentication towards the mobile station in GSM.
Tracking Method
1. IMSI

  • Any SIM contains a globally unique ID, the International Mobile Subscriber Identity (IMSI)
  • If this is exposed over the air, tracking and profiling will be easy
  • For privacy protection, IMSI should be sent as rarely as possible
  • Instead a Temporary Mobile Subscriber Identity (TMSI) is used
  • Network may instruct mobile station to send IMSI
  • Mobile device has globally unique International Mobile Equipment Identity (IMEI)

2. IMSI Catcher

  • Approach: Masquerades as base station towards mobile stations; Acts as mobile station towards network; Mobile stations tend to choose base stations with the best signal; Identity request sent to mobile station is answered with IMSI (and IMEI)
  • Properties: Does not require assistance from network provider; Used by law enforcement to track and locate persons of interest, to intercept calls and SMS
  • Problem: other mobile stations connect to fake base station as well

3. Stealth Ping

  • Approach: Network provider sends stealth ping to mobile station; Also known as silent SMS, Short Message Type 0; Special purpose text message, intended for special network provider purposes; Is not displayed or signalled to the user at all
  • Properties: Reveals metadata (including location) when received by the mobile station; Used by law enforcement for tracking and locating mobile stations; Requires cooperation of the network provider

Improvements in UMTS
Network also authenticates to mobile station
But: UMTS supports inter-operation with GSM

  • Downgrade attacks feasible
  • Jamming UMTS frequencies will trigger mobile stations to fall back to
    GSM

2.2.9 Printer-Fingerprinting

Machine Identification Code (MIC)
Governments demanded that (high-quality) counterfeit currency be forensically traceable
Machine Identification Code (MIC) – Digital watermark encoding identifiable
information on printouts. Also known as printer steganography
MIC automatically added to every printed page by most color laser printers
Traditionally realized by tiny yellow dots arranged in a grid, printed multiple times on the page to have complete information even when only fragments of the page are available
Yellow dots barely visible under normal circumstances

2.3 Motivation

Methods to identify users based on recognizing technical features of devices: IP address, system properties, traces like yellow dots
=> methods to identify and/or profile users based on data incurred by their behavior
Recall the GDPR (General Data Protection Regulation): regulation on the protection of natural persons with regard to the processing of personal data (explicitly mentioned: monitoring the behavior of users, profiling of persons, processing sensitive data). This requires identifying which data falls within these categories. It is extremely difficult (if not impossible) to decide for collected data whether they are subject to these regulations. Expectation: this will become more apparent/worse in the future.
The following examples are not relevant for the exam. However, you should

  • Be aware of this problem
  • Get used to this thinking (what data may be leaked and how could it be misused)
  • Develop own approaches

2.3.1 Smartphones

Sensors in Smartphones
Smartphones are equipped with several sensitive sensors. The behavior of a user impacts the sensor values. An attacker who gets access to these values, e.g., via a malicious app, can draw conclusions about the user.
Malicious app can read the sensors -> the user's behavior impacts the sensor values -> the attacker can learn information about the user through the sensors
Example: GPS (Global Positioning System), Camera, Microphone, Gyroscope and Accelerometer

2.3.2 Automobile Driver Fingerprinting

Low amount of data and few sensors sufficient for reliable identification of particular driver

2.3.3 Smart Meter

Electronic device recording (electric) energy consumption and communicating to electricity supplier for monitoring and billing digitally
Can record energy consumption with high accuracy
Not limited to electricity meters, also gas or water meters

2.3.4 Traffic Analysis

When and how much data is transmitted per device may also leak information
Attack steps: identify smart home devices from network traffic -> infer user activities (user activities result in changes of the traffic rate; this may be used by an adversary to infer activities)
Threat Model:

  • Passive network observer
  • Access to Internet traffic into and out of smart home
  • Adversary is not active and does not need to manipulate traffic
  • Packets are encrypted => adversary has only access to traffic rate and
    packet header metadata
  • Observation: almost all tested IoT (Internet of Things) devices use TLS/SSL encryption (transport layer)
  • Attacker can learn, i.e., can obtain a database of labeled traffic from smart home devices

Methods of identify devices from network traffic:

  1. Use MAC Address
  2. Using DNS(Domain Name System) Queries [Executed at Internet layer]
    A device will always send data to the same set of addresses -> results in the same set of DNS queries. DNS queries can be uniquely associated with devices
  3. Use traffic rates: Different devices exhibit different traffic pattern
    Example: traffic volume
    Other properties: number of packets, inter-packet intervals, etc.

Countermeasure

  1. Block traffic (For example: use Firewall. Problem: Many devices do not work without an active Internet connection)
  2. VPN (Better but still, certain common device combinations and user activity
    patterns can still leak. Possible reasons: Single device, Sparse activity, Dominating device)
  3. Shape Traffic (Padding or fragmenting packets to a constant size. Aim: fixed rate or schedule. Efficient and effective)

Traffic metadata analysis can allow inferring private in-home user activities, even if the network traffic is encrypted (at the transport layer). Blocking/routing traffic is not a reasonable countermeasure. Creating covert traffic can help but is more complicated and incurs additional network traffic.

2.3.5 Smart Heating

Smart homes: devices use sensors to collect data and to trigger actions
Prominent example: smart heating
Room climate data (temperature, relative humidity) are measured
Heating is controlled based on this data
Attack: Exploiting ‘inconspicuous’ data without users‘ consent/knowledge

2.3.6 Multimedia Content Identification

Early smart meters communicated without any security at all

  • Fake data could be sent to the online backend (also for billing)
  • Not anonymized (the ID was sent along with the data)

Detecting specific film material being watched

  • Assumption: TV power consumption directly influenced by brightness
    levels

Part3: Privacy and Anonymization

3.1 Data Collection and Analysis

Facebook’s Reason: Targeted Advertising
Goal: Users receive advertisement according to their profile. This eliminates wastage and increases effectiveness
Other reasons: Provide, personalize and improve our Products; Provide measurement, analytics, and other business services; Promote safety, integrity and security; Communicate with you; Research and innovate for social good

What kinds of information does Facebook collect?
Things you and others do and provide + Device Information + Information from partners

3.1.1 Risk

  1. Discrimination: Collected data may be (mis-) used to discriminate people
    Example:
    • Higher insurance fees
    • Ignoring job applications
    • Project FlySec: Access control on airport based on user data
  2. Social Engineering Attacks
    Attack: Influence other people (to do something)
    Rely on exploiting the human trust factor
    Can be extremely effective in social networks
    Example: Pretexting - Creating an invented scenario to engage a targeted victim in a manner that increases the chance the victim will divulge information or perform actions that would be unlikely in ordinary circumstances
  3. Account Impersonation/ Take Over
    Attack: Impersonate other persons
    Enable cyber criminal: Perform social engineering attacks; Spread malicious links/malware; Steal private data
  4. Social Botnets
    Botnet = ‘robot’ + ‘network’
    Cybercriminals take control of several users’ computers and organize them into network of ‘bots’ that can be managed remotely
    Example:
    • Attacks using botnets
      distributed denial of service attacks
      deploying large scale spam attack
    • Attacks using social botnets
      Crowd manipulation (social botnets have been used for manipulating public opinion)
  5. Malicious Social Media Applications
    API specifies how the third-party applications can interface within the social platform
    User has to grant permissions to the app
    The app has rights to view and copy information about the user

3.1.2 Cambridge Analytica Scandal

Cambridge Analytica: British political consulting company founded in 2014. It provided services for data mining, data brokerage, and data analysis. Goal: support political advertising in the context of electoral processes.
CA harvested the personal data of millions of Facebook users without their consent and used it for political purposes. Various political organizations used information from the data breach to attempt to influence public opinion.

3.1.3 General Data Protection Regulation (GDPR)

Grants users certain rights with respect to the collection and analysis of personal data

  • Transparency: It should be transparent that data is collected and to what extent it is processed. This needs to be explained in clear and plain language.
  • Access to Collected Personal Data: The user has the right to access and get a copy of the data a business has on him/her
  • „Right to be forgotten“: Users will have a clearly defined “right to be forgotten” (right to erasure), with clear safeguards
  • Automated Decision: Businesses will have to inform the user whether the decision is automated and give him/her a possibility to contest it

GDPR Fines - Regulation
Violators of GDPR may be fined up to €20 million, or up to 4% of the annual worldwide turnover of the preceding financial year, whichever is greater

Terminology
Personal Data: means any information relating to an identified or identifiable natural person (‘data subject’);
An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person
Identifiability
Types of Identifiability

  • Direct identifiability: Data is directly related to a person (knowledge of the data also means knowledge of the person), mostly via the name
  • Indirect identifiability: Inference of person via supplementary information
  • Distinction between direct/indirect identifiability is not always easy, but also not necessary (same legal consequences)

Explicitly mentioned in the definition: Identifiability by an identifier like a name, an identification number, location information or online alias

No knowledge of the name is necessary
Examples of identifiers: IP-address, email address, tax number, social security number
Also: Traits that are an expression of the physical, physiological, genetic, psychological, economical, cultural or social identity of a natural person. Examples: Profession, interests, physiological traits, fashion sense
But: Being able to connect the data to an identifier or trait alone is not sufficient – inference of the actual person must be possible
Requirements for identifiability
To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
Processing
GDPR regulates the processing of personal data ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;

Behaviour
The processing of personal data of data subjects who are in the Union by a controller or processor not established in the Union should also be subject to this Regulation when it is related to the monitoring of the behaviour of such data subjects in so far as their behaviour takes place within the Union. In order to determine whether a processing activity can be considered to monitor the behaviour of data subjects, it should be ascertained whether natural persons are tracked on the internet including potential subsequent use of personal data processing techniques which consist of profiling a natural person, particularly in order to take decisions concerning her or him or for analysing or predicting her or his personal preferences, behaviours and
attitudes
Special Categories of Personal Data
This explicitly includes:

  • Data about racial or ethnic origin
  • Data about political opinions
  • Data about religious or philosophical beliefs
  • Data about trade union membership
  • Genetic data
  • Biometric data for the unambiguous identification of a natural person
  • Health data
  • Data about a natural person’s sex life or sexual orientation

3.1.4 Basics

Model
We assume a party, the data collector, who collected some data about individuals

  • The data collector aims to make the data available to third parties, e.g., publishes it
  • Possible reasons: Third party is a separate service provider that analyzes the data on behalf of the data collector; Third party are statistical institutes/researchers who conduct research on the data…

Types of Data Releases
Tabular data (record = group of individuals)

  • Cross-tabulated values showing aggregate values
  • Example: counts of values
  • This type of data is the classical output of official statistics.

Microdata(record = individual)

  • Refers to a record that contains information related to a specific individual

Queryable databases

  • Interactive databases to which the user can submit statistical queries (sums, averages, etc.).
  • Common case: database contains microdata

Disclosure Risks

  1. Identity disclosure
    • Privacy viewed as anonymity
    • Attacker is able to associate a record in the released data set with the individual that originated it (re-identification)
    • Relevant for microdata
  2. Attribute disclosure
    • Privacy viewed as confidentiality
    • Attacker is able to determine the value of a confidential attribute of an individual with enough accuracy
    • Relevant for tabular data and microdata

Attributes Classification
Identifier attributes & Quasi-identifier attributes

  1. Identifier attributes
    • Provides unambiguous re-identification of the individual to which the
      record refers.
    • Examples: names, social security number, the passport number, etc.
  2. Quasi-identifier attributes
    • Unlike an identifier, a quasi-identifier attribute alone does not lead to record re-identification.
    • However, in combination with other quasi-identifier attributes, it may allow unambiguous re-identification of some individuals.
    • Example: Fingerprint = combination of different browser properties

Confidential/sensitive attributes & Non-Confidential attributes

  1. Confidential/sensitive attributes
    • Sensitive information on the individuals, e.g., salary, health condition, sex orientation, etc.).
    • Primary goal of microdata protection techniques is to prevent intruders from learning confidential information about a specific individual.
  2. Non-Confidential attributes
    • Attributes that do not belong to any of the previous categories.
    • That is, they do not contain sensitive information about individuals and cannot be used for record re-identification

Five Key Stages (protecting privacy when running a company)

  1. Assessment of need for confidentiality protection
  2. Determine key characteristics and uses of the data
  3. Disclosure risk definition and assessment
  4. Disclosure control methods
  5. Implementation

Terminology
Tabular database is composed of cells. A cell contains aggregated values. Values are contributed by respondents. Examples: sum of contributions, number of respondents. Usually, tabular data is derived from crossing microdata
Microdata: Collection of records where each record is composed of several variable
values

Variables and Disclosure Risk
Classification of variables (with respect to content)

  • Numerical variables: Variable values are numbers, e.g., age
  • Categorical variables: Variable values are from different categories, e.g., gender, region

Confidential variables

  • Refers to variables where value should be kept private
  • Example: value “yes” for variable “alcoholic addict”

Attribute Disclosure

  • Privacy viewed as confidentiality
  • Attacker aims to recover the value of a confidential variable/attribute of an individual with enough accuracy
  • Alternatively, an attacker aims to determine whether a certain person falls into a confidential category, e.g., belongs to "alcoholic addicts"
3.1.4.1 Frequency Count Table

Each cell-value represents the number of respondents that fall into that cell
Risks: As the cell values are the SUM of values, there shouldn’t be much of a problem. Or not?

Possible risk disclosure if too many respondents score on the same sensitive category
Consequence: Scores should be sufficiently spread over all categories

3.1.4.2 Magnitude Table

Each cell-value represents the sum of the score of respondents that fall into that cell
Risk
Goal: Cell values should not leak sensitive information about
single respondent
Leakage

  • Number of respondents may be known
  • What information is leaked about certain respondents?
  • In particular, what kind of information can a party deduce if it is one of the respondents?

Possible risk disclosure if too few respondents score on the same sensitive category and one or more dominating contributions
Consequence: Number and amount of contributions should be sufficiently spread

3.1.5 Sensitivity Measures

Statistical Disclosure Control (SDC)
Owner of the tabular database aims/has to release it
Question: How to avoid risk of disclosure?
Approach: Modify risky data before releasing it
Challenge: How to decide which data should be addressed? => Sensitivity measures
• Rule that decides for a cell whether it should be considered unsafe (i.e., subject to possible disclosure)

Sensitivity Measure - Definitions
A sensitivity measure S is a function that maps a cell to some numerical value.
Input to S can be any data that is known to data owner about X
Examples: the contributions xi, the number N(X) of respondents, etc.
A cell X is said to be sensitive if S(X)>0
Application: data owner has to address (at least) all cells that are considered to be sensitive

Informally, a leak happens if the value 𝑉(𝑋) of a cell 𝑋 based on confidential data can be (mis-)used by a respondent to the cell X or other knowledgeable party to narrowly estimate the contribution of another respondent to the cell.
Attacker knows: Number 𝑁(𝑋) of respondents; Cell value V(X); One or several contributions xi
Attacker aims to estimate one contribution or sum of some contributions

Rules
1. Threshold Rule
Intuition: The number of contributors to a cell 𝑋 should be above a certain threshold t
Formal definition: The cell is sensitive if N(X) < t
Sensitivity measure: S(X) := t − N(X)
2. Dominance Rule = (n,k)-rule
Intuition: A cell is considered to be sensitive if the largest n contributions in that cell amount to more than k% of the cell total
Formal definition: Let x1 ≥ x2 ≥ … denote the contributions to X in decreasing order. The cell is sensitive if x1 + … + xn > (k/100) · V(X)
Sensitivity measure: S(X) := x1 + … + xn − (k/100) · V(X)
3. p%-Rule
Intuition: A cell is sensitive if one respondent can estimate the contribution of another respondent within p% of its true value
Note: It is sufficient to restrict to the two largest contribution
Formal definition: The cell is sensitive if V(X) − x1 − x2 < (p/100) · x1, i.e., the respondent with the second-largest contribution x2 can estimate the largest contribution x1 within p% of its true value
Sensitivity measure: S(X) := (p/100) · x1 − (V(X) − x1 − x2)
4. p/q-Rule
Intuition: In the p%-rule, one considers R2 who aims to guess x1 and knows x2 but knows nothing about V’(X) := V(X) -x1-x2
Extension: R2 can additionally estimate V'(X) to within q% of its true value
Formal definition: The cell is sensitive if (q/100) · (V(X) − x1 − x2) < (p/100) · x1
Sensitivity measure: S(X) := (p/100) · x1 − (q/100) · (V(X) − x1 − x2)
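A minimal sketch of the threshold, dominance (n,k), and p% rules as sensitivity measures, following the formalizations above (a cell is sensitive if S(X) > 0); the example contributions are made up:

```python
# Sensitivity measures for a cell given its individual contributions (sketch).
def threshold_rule(contributions, t):
    # S(X) = t - N(X): sensitive if fewer than t respondents
    return t - len(contributions)

def dominance_rule(contributions, n, k):
    # S(X) = (sum of the n largest contributions) - k% of the cell total
    xs = sorted(contributions, reverse=True)
    return sum(xs[:n]) - (k / 100) * sum(xs)

def p_percent_rule(contributions, p):
    # S(X) = p% of x1 - (V(X) - x1 - x2): the second-largest respondent
    # could estimate x1 within p% if this is positive
    xs = sorted(contributions, reverse=True)
    remainder = sum(xs) - xs[0] - xs[1]
    return (p / 100) * xs[0] - remainder

cell = [90, 6, 2, 1, 1]   # one dominating contribution
print(threshold_rule(cell, t=3) > 0,      # False: 5 >= 3 respondents
      dominance_rule(cell, n=1, k=80) > 0,  # True: 90 > 80% of 100
      p_percent_rule(cell, p=10) > 0)       # True: remainder 4 < 10% of 90
```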

3.2 Protecting Tabular Data

Classification of data protection methods

  1. Non-perturbative: Original data remains unchanged. Examples: Recoding, cell suppression
  2. Perturbative: Modify table values. Examples: Controlled rounding, controlled tabular adjustment

3.2.1 Recoding

Non-perturbative method
Approach: aggregate or change some of the categorical variables that define the table
Goal: Resulting table should satisfy sensitivity rules, i.e., all cells are considered to be non-sensitive
Examples: Combine Categories or Change Categorical Variables

3.2.2 Controlled Rounding

Perturbative Method
Approach: Fix some positive integer, the base number b, and replace each cell value by a multiple of the base b

Rounding
Use some base b and randomly replace each value by a multiple of b
Obvious problems: Some rounded values differ strongly from the original values; the total values are incorrect

Controlled Rounding

  1. Additional measures to cope with the problems
    Each cell value V(X) is replaced by an adjacent multiple of 𝑏, that is either with ⌊𝑽(𝑿)/𝒃⌋⋅𝒃 or ⌈𝑽(𝑿)/𝒃⌉⋅𝒃
  2. After step 1, re-compute the total values (additivity property)

Zero-restricted: if V(X) is already a multiple of b, it remains unchanged
Choosing Base b: Problem on its own. The smaller b, the higher the accuracy (i.e., values are close to original values). The greater b, the higher the level of protection
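A minimal sketch of replacing individual cell values by an adjacent multiple of the base b in a zero-restricted way (the additivity step of re-computing totals is omitted):

```python
# Controlled rounding sketch: replace each value by an adjacent multiple of base b.
import math
import random

def round_to_adjacent_multiple(value, b):
    if value % b == 0:                      # zero-restricted: multiples of b stay unchanged
        return value
    lower = math.floor(value / b) * b
    upper = math.ceil(value / b) * b
    return random.choice([lower, upper])    # real methods choose so that totals stay additive

cells = [13, 27, 40, 5]
print([round_to_adjacent_multiple(v, b=5) for v in cells])
```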

Level of Protection
Given tabular data with rounded values V’(X), what does an attacker know about original data?
Without further information, at least for each cell X that
V'(X) − b ≤ V(X) ≤ V'(X) + b
Sometimes, further bounds on V(X) may be known, e.g., V(X)≥ 0
If the applied rounding was zero-restricted, the range for V(X) can be reduced to V’(X) − b < V(X) < V’(X) + b

3.2.3 Cell Suppression

Non-perturbative method
Approach: After sensitive cells have been identified, remove their values from the table
The sensitive cells are also called primary cells
The process of removing the values of the primary cells is called primary suppression. This usually also requires secondary suppression: a suppressed cell could otherwise be reconstructed from the row/column totals, so at least one more cell within the row/column of each primary cell must be suppressed. Two or more suppressions per row/column are necessary but may not be sufficient. Done properly, the missing values cannot be reconstructed anymore (without further information). Still, some information about the primary cell is leaked: an attacker can derive upper and lower bounds on its value.

Primary suppression alone is not sufficient
Challenge: which cells should be removed as well?

Constraints:

  1. Must not be possible to approximate primary cell values with certain accuracy
  2. Still: remaining data should remain useful (information loss should be minimal)

Different methods do exist

  • Finding optimal solution: possible, but time consuming
  • Approximate optimal solution: more efficient, but less effective
  • Subject of ongoing research

3.2.4 Controlled Table Adjustment (CTA)

CTA = minimum-distance controlled tabular adjustment
Perturbative Method

Approach:

  1. Replace cell values (sensitive and non-sensitive cells)
  2. Resulting table should meet protection level and
  3. be as close as possible to the original table

Close to original table?

  • Requires to specify the distance between two tables
  • Common choice: sum over all cells X of the difference between original values V(X) and modified values V’(X)

Preparation

  • Identifying the sensitive cells
  • Specifying safety ranges (i.e., new sensitive cell values should be outside of these)

Challenge
Adjust other cell values such that new table is maximally close to old table
Fortunately, problem instance is smaller compared to other approaches such as cell suppression.
Problem can be solved by mixed-integer linear programming (MILP) solvers

(Dis-)Advantages of CTA
Advantages
Faster compared to other approaches such as optimal cell suppression
No suppressed cell values ⇒ lower information loss possible

Disadvantages
Perturbative method ⇒ it is unknown which values are the original values
Difficult to keep consistency between (related) tables

3.3 Linked Tables

So far, we discussed disclosure risks and protection mechanisms for single tables
This is already challenging, but situation may be even worse
Attacker may have access to linked tables. Tables that are related to each other
Depending on the type of linkage and information, this may allow an attacker to deduce more information

Protecting linked tabular data is a problem on its own

  • Problem size increases
  • Preserving consistency
  • Linked table may not be known a priori

Some mechanisms can be extended to handle also linked data. Still, subject of ongoing research

3.4 Microdata

Refers to a record that contains information related to a specific individual
A microdata set X is a set of records, i.e., X = {r1, …, rn}. Each record represents an individual contained in X
A set A = {a1, …, am} of attributes. Attributes can be numerical (e.g., age) or categorical (e.g., disease).
Records are a collection of attribute values.
That is, each record r ∈ X has an attribute value assigned for each attribute.
For some record r ∈ X and attribute a ∈ A, we denote by r[a] the attribute value of r with respect to attribute a
Microdata sets are usually represented by a table where rows = records and columns = attributes

Attributes Classification
Identifier attributes

  • Provides unambiguous re-identification of the individual to which the record refers.
  • Examples: names, social security number, the passport number, etc.

Confidential/sensitive attributes

  • Sensitive information on the individuals, e.g., salary, health condition, sex orientation, etc.).
  • Primary goal of microdata protection techniques is to prevent intruders from learning confidential information about a specific individual.
  • However, usually represents the information of interest for the data analyst. Hence, should be preserved.

Quasi-Identifier attributes

  • Unlike an identifier, a quasi-identifier attribute alone does not lead to record
    re-identification. However, in combination with other quasi-identifier attributes, it may allow unambiguous re-identification of some individuals.
  • Examples: Browser fingerprinting

Non-Confidential attributes

  • Attributes that do not belong to any of the previous categories.
  • That is, they do not contain sensitive information about individuals and
    cannot be used for record re-identification

In our examples/discussions, we will assume that none of the attributes
in X are non-confidential.

Disclosure Risks
Usually two types of disclosure risks are considered
Identity disclosure

  • Privacy viewed as anonymity
  • Attacker is able to associate a record in the released data set with the individual that originated it (re-identification)

Attribute disclosure

  • Privacy viewed as confidentiality
  • Attacker is able to determine the value of a sensitive attribute of an individual with enough accuracy

Privacy is not equal to removing identifier attributes

3.4.1 k-Anonymity (microdata)

3.4.1.1 Quasi-Identifiers

Consider a microdata set X with attributes A
A quasi-identifier attribute is a non-sensitive attribute that can be used for re-identification (together with other quasi-identifier attributes)
Notation
Quasi-identifier QI ⊆ A: set of attributes whose combination can lead to re-identification
Set of all quasi-identifiers: QIx
For some (sub-)set A′ ⊆ A with A′ = {a1′ , … , al′} and record r ∈ X, we denote by r[A’] = {r[a’1] , … , r[a’l] }, i.e., the attribute values of r with respect to the attributes in A′

Re-identification of some individual:
Let r ∈ X be some record, re-identification of the individual represented by r means that r[QI] is unique for some quasi-identifier QI ∈ QIx
Idea: If r[QI] is not unique for any record r ∈ X and any quasi-identifier QI ∈ QIx, then re-identification should not be possible anymore

3.4.1.2 k-anonymous

X is said to be k-anonymous with respect to QI for some quasi-identifier QI ∈ QIx if each combination of values of the attributes in QI that appears in X is shared by k or more records.
Formally: For any r ∈ X there exist k − 1 other records r′1, … , r′k−1 ∈ X such that
r[QI] = r′1[QI] = ⋯ = r′k−1[QI]
X is said to be k-anonymous if it is k-anonymous with respect to QI for all QI ∈ QIx

k-Anonymity aims to prevent identity disclosure
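
To make the definition concrete, here is a minimal Python sketch that checks k-anonymity with respect to one quasi-identifier. The toy records and attribute names are made up for illustration; they are not from the lecture.

```python
from collections import Counter

def is_k_anonymous(records, qi, k):
    """Check k-anonymity of `records` (list of dicts) w.r.t. the quasi-identifier `qi`."""
    # Count how often each combination of QI values appears.
    counts = Counter(tuple(r[a] for a in qi) for r in records)
    # Every combination that appears must be shared by at least k records.
    return all(c >= k for c in counts.values())

# Hypothetical toy microdata set (attribute names are illustrative):
X = [
    {"zip": "68159", "age": "[20-30]", "disease": "flu"},
    {"zip": "68159", "age": "[20-30]", "disease": "cancer"},
    {"zip": "68161", "age": "[30-40]", "disease": "flu"},
    {"zip": "68161", "age": "[30-40]", "disease": "flu"},
]

print(is_k_anonymous(X, qi=["zip", "age"], k=2))  # True: every QI combination appears twice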

3.4.2 ℓ-Diversity (microdata)

Problems of k-Anonymity: no randomization; homogeneity attack
k-Anonymity: for each choice of quasi-identifier, the group of records sharing the same values is sufficiently large (≥ k)
Attribute disclosure may still happen if the number of distinct sensitive values within such a group is too small ⇒ we need a higher diversity of sensitive values

In the following, assume that the microdata set has one sensitive attribute.
Equivalence class: an equivalence class with respect to QI is a maximal set of records that share the same values on all attributes in QI; E[QI] denotes the set of all equivalence classes with respect to QI.

k-Anonymity with respect to QI ⇔ each equivalence class with respect to QI has size ≥ k, i.e., ∀C ∈ E[QI] : |C| ≥ k

ℓ-Diversity

  • Extension of k-anonymity
  • Additionally maintains the diversity of the sensitive attribute

Definition: An equivalence class is said to have ℓ-diversity if there are at least ℓ
well-represented values for the sensitive attribute.

A microdata set is said to have ℓ-diversity if every equivalence class has ℓ-diversity

Distinct ℓ-Diversity
Requirement: There are at least ℓ distinct values for the sensitive attribute in each equivalence class
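
A minimal sketch of the distinct ℓ-diversity check, using the same toy-record style as in the k-anonymity sketch above (attribute names are again hypothetical). It flags the homogeneity problem that k-anonymity alone misses.

```python
from collections import defaultdict

def has_distinct_l_diversity(records, qi, sensitive, l):
    """Distinct l-diversity: every equivalence class w.r.t. the quasi-identifier `qi`
    must contain at least l distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[a] for a in qi)].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

# Hypothetical 2-anonymous table that is NOT 2-diverse:
X = [
    {"zip": "68159", "age": "[20-30]", "disease": "flu"},
    {"zip": "68159", "age": "[20-30]", "disease": "cancer"},
    {"zip": "68161", "age": "[30-40]", "disease": "flu"},
    {"zip": "68161", "age": "[30-40]", "disease": "flu"},  # homogeneous class -> attribute disclosure
]
print(has_distinct_l_diversity(X, qi=["zip", "age"], sensitive="disease", l=2))  # False
```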

Entropy ℓ-Diversity
Requirement: In each equivalence class, the entropy of the distribution of the sensitive attribute values is at least log(ℓ)

Recursive (c, ℓ)-Diversity
Motivation: Recursive (c, ℓ)-diversity makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely.
Definition: Let r1 ≥ r2 ≥ ⋯ ≥ rm be the counts of the sensitive values in an equivalence class, sorted in decreasing order. The class satisfies recursive (c, ℓ)-diversity if r1 < c · (rℓ + rℓ+1 + ⋯ + rm)

3.4.3 t-Closeness (microdata)

k-Anonymity: Addresses identity disclosure; Requires that any non-empty equivalence class is sufficiently large
𝑙-Diversity: Addresses (sensitive) attribute disclosure; Requires that sensitive attribute values within any equivalence class are sufficiently diverse

Problem with ℓ-diversity: it is limited in its assumption of adversarial knowledge. That is, an attacker may see the microdata and have some background knowledge in addition
Shortcoming 1: ℓ-diversity may be difficult and unnecessary to achieve
Shortcoming 2: ℓ-diversity is insufficient to prevent attribute disclosure (Skewness Attack, Similarity Attack -> ℓ-diversity only helps if the values are semantically different => sometimes difficult to decide)

t-Closeness: Intuition

  • Privacy leak = information gain of an observer
  • Observer has prior belief and posterior belief about the sensitive attribute of an individual
  • Information gain can be represented as the difference between prior and posterior belief
  • Prior belief is modeled by knowing the distribution of the sensitive values in the whole table
  • Posterior belief is distribution in the equivalence class of the individual

An equivalence class is said to satisfy t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole data set is no more than a threshold t.
A data set is said to satisfy t-closeness if every equivalence class in it satisfies t -closeness.

Distance of distributions?

  • Definition is not restricted to any specific distance
  • Usual choice: Earth Mover’s Distance (EMD)
  • EMD is able to capture the semantic distance between values

3.4.4 Earth Movers Distance (EMD)

Assume two distributions over the same set of values. EMD measures the effort to transform one distribution into the other. The higher the effort, the higher the distance (and vice versa)
Transforming means to move probability mass (earth) from one value to the other
The more these values differ, the higher the effort
Depending on whether the values are numerical or categorical, different variants of EMD do exist
We discuss only the numerical case, i.e., where we have a total order on the values
Numerical distance = semantic similarity

EMD (figure)
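
For the numerical (ordered) case, the EMD reduces to a normalized sum of absolute cumulative differences. The sketch below assumes the ground distance between the i-th and j-th ordered value is |i − j| / (m − 1); the nine equally likely salary values are a hypothetical example, not data from the lecture.

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two distributions p, q over the same m ordered
    (numerical) values, with ground distance |i - j| / (m - 1). This equals the
    normalized sum of absolute cumulative differences."""
    assert len(p) == len(q)
    m = len(p)
    cum, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi      # probability mass that still has to be moved past this value
        total += abs(cum)
    return total / (m - 1)

# Distribution of a sensitive value (e.g., salary) in the whole table vs. in one
# equivalence class that only contains the three lowest values:
whole = [1 / 9] * 9
eq_class = [1 / 3, 1 / 3, 1 / 3, 0, 0, 0, 0, 0, 0]
print(round(emd_ordered(whole, eq_class), 3))  # 0.375
```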

3.4.5 Microdata Anonymization

To avoid disclosure, data collectors do not publish the
original microdata set X, but a modified version Y of it.
This data set Y is called the protected, anonymized, or
sanitized version of X

Two protection approaches

  • Synthetic data
  • Masking
3.4.5.1 Masking

Anonymized data Y generated by modifying the original records in X.

  • Quasi-identifier attributes: identity behind each record is masked (which yields anonymity).
  • Confidential attributes: values of the confidential data are masked (which yields confidentiality, even if the subject to whom the record corresponds might still be re-identifiable).

Two methods

  1. Perturbative masking
    Data is altered
    Changes should be such that the statistics computed on the perturbed data set do not differ significantly from the statistics that would be obtained on the original data set.
    Examples: noise addition, microaggregation, data/rank swapping
  2. Non-perturbative masking
    Do not alter data
    Examples: suppression
3.4.5.2 Synthetic data

Anonymized data Y generated by randomly sampling
simulated records.
Three steps

  1. Propose a (statistical) model
  2. Adjust the model to the original data set X
  3. Generate synthetic data by drawing from the model

Three types of synthetic data:

  1. Fully synthetic: every attribute has been synthesized
  2. Partially synthetic: only attribute values with high risk of disclosure are synthesized
  3. Hybrid: original data set is mixed with fully synthetic data set

Advantages
Fully synthetic data is considered to be a very safe approach. Risk of disclosure of the synthetic data can be reduced to analyzing the risk of disclosure of the information about the original data that the model incorporates. Usually this reduces to some statistical properties of the original data only => re-identification should be almost impossible
BUT: this is not fully correct. By coincidence, sampling could yield real values of individuals, and the model may give hints towards extreme values
Challenges
In case of partially synthetic or hybrid data, a risk of re-identification may still exist
Finding the model may be difficult
Selection of statistical properties?

3.4.5.3 Perturbative Masking

1. Noise-addition
Values in the original data set are masked by adding some noise, i.e., some random values drawn from a specific distribution
The expectation value of this distribution should be 0 => the average value of the original data and the average value of the masked data remain the same
Noise can be correlated or uncorrelated to the masked value
Some methods apply a transformation to the data after the noise has been added
2. Data swapping
Exchange values of confidential attributes among individual records
3. Rank swapping
Variant of data swapping
First, records in 𝑋 are ranked in ascending order with respect to an attribute a. Then each ranked value of 𝑎 is swapped with another ranked value randomly chosen within a restricted range
4. Microaggregation
Goal: records correspond to groups of k or more individuals, where no individual dominates (see also k-anonymity)
Basic idea: Group records into groups of size ≥ k each using a criterion of maximal similarity
For each attribute, the average value over each group is computed and is used to replace each of the original averaged values.

(1) Univariate aggregation: Microaggregate one attribute at a time. For each attribute, do
Sort records by this attribute -> Groups records of successive k values ->Replace in each group the original value by the average value of the group

(2) Multivariate aggregation: Aims to consider multiple attributes at a time
Approach 1: map multi-variate attribute data to a single attribute, e.g., by concatenation of values
Approach 2: Consider a distance metric that expresses the similarity of records according to several attributes
Problem: finding the optimal group partition (in terms of maximizing similarity) is an NP-hard problem
Often, heuristic algorithms are used
Example: Maximum Distance to Average Vector (MDAV)

Maximum Distance to Average Vector (MDAV)
Heuristic algorithm for multivariate fixed group size microaggregation on unprojected continuous data
Basic idea: find the two most distant records and groups records around them. Repeat step for remaining records
Details

  1. Compute the average record x̄ of all records in the data set.
  2. Consider the record xr that is most distant from the average record x̄ (using an appropriate distance metric).
  3. Find the most distant record xs from the record xr considered in the previous step.
  4. Form two groups around xr and xs, respectively. One group contains xr and the k − 1 records closest to xr and analogously for xs
  5. If more than 2k records remain, repeat for the remaining records. Otherwise, form a new group with those records and exit the algorithm.
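
The following is a minimal Python sketch of MDAV for numerical records, following the steps above. It additionally uses the common 3k threshold in the main loop so that the final group also contains at least k records; the (age, salary) data is made up for illustration.

```python
import math

def mdav(records, k):
    """Sketch of MDAV microaggregation for numerical records (lists of floats).
    Returns a partition of the records into groups of size >= k (assumes len(records) >= k)."""
    def centroid(rs):
        return [sum(col) / len(rs) for col in zip(*rs)]

    def dist(a, b):
        return math.dist(a, b)  # Euclidean distance

    remaining = list(records)
    groups = []
    while len(remaining) >= 3 * k:
        c = centroid(remaining)
        xr = max(remaining, key=lambda r: dist(r, c))    # most distant record from the average
        xs = max(remaining, key=lambda r: dist(r, xr))   # most distant record from xr
        for ref in (xr, xs):                             # group = ref plus its k-1 closest records
            remaining.sort(key=lambda r: dist(r, ref))
            groups.append(remaining[:k])
            remaining = remaining[k:]
    if len(remaining) >= 2 * k:                          # one more group around the farthest record ...
        c = centroid(remaining)
        xr = max(remaining, key=lambda r: dist(r, c))
        remaining.sort(key=lambda r: dist(r, xr))
        groups.append(remaining[:k])
        remaining = remaining[k:]
    if remaining:
        groups.append(remaining)                         # ... and the rest forms the final group
    return groups

def microaggregate(records, k):
    """Replace every record by the average record (centroid) of its group.
    Note: the output is grouped, i.e., the original record order is not preserved."""
    out = []
    for g in mdav(records, k):
        c = [sum(col) / len(g) for col in zip(*g)]
        out.extend([c] * len(g))
    return out

# Hypothetical (age, salary) records:
data = [[23, 50000.0], [25, 52000.0], [29, 60000.0], [30, 61000.0], [47, 90000.0], [51, 88000.0]]
for row in microaggregate(data, k=2):
    print(row)
```
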
3.4.5.4 Non-Perturbative Masking

Idea: unify concrete values into more generic values
Case 1: categorical values
Combine categories to form new, less specific categories
Example: replace “electrician” and “painter” by “craftsman”
Case 2: numerical (continuous) values
Map values into value ranges
Example: age “23” into “age [20…30]”
Top/bottom coding
Special case of generalization
Generalize extreme values, i.e., above/below certain thresholds
Example: age

(Local) Suppression
Certain values of individual attributes are suppressed (removed)
Example: remove values that are particularly characteristic/rare
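
A small Python sketch of both ideas: generalization with top coding, plus local suppression of rare values. The attribute names, the 10-year ranges, the 90+ threshold, and the minimum count are illustrative choices, not values from the lecture.

```python
from collections import Counter

def generalize_age(age: int) -> str:
    """Generalize a numerical age into a 10-year range; top-code extreme values."""
    if age >= 90:
        return "90+"                          # top coding of extreme values
    lo = (age // 10) * 10
    return f"[{lo}-{lo + 10})"

def suppress_rare(records, attribute, min_count=2):
    """Local suppression: remove values of `attribute` that occur fewer than `min_count` times."""
    counts = Counter(r[attribute] for r in records)
    for r in records:
        if counts[r[attribute]] < min_count:
            r[attribute] = None               # suppressed value
    return records

rows = [{"age": generalize_age(a), "job": j} for a, j in
        [(23, "electrician"), (27, "electrician"), (95, "teacher"), (31, "teacher"), (44, "pilot")]]
print(suppress_rare(rows, "job"))             # the single, characteristic "pilot" value is suppressed
```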

3.5 Queryable databases

Queryable databases: Interactive databases to which the user can submit statistical queries (sums, averages, etc.).
Common case: database contains microdata

3.5.1 Differential Privacy: Scenario

Scenario
Two types of parties: data holder (data owner, curator, …) and data analyst

  1. Data holder has control over some data D
  2. Data analyst has some algorithm A and aims to apply it to D

Two models

  1. Offline/non-interactive model (the model considered so far)
  2. Online/interactive model

Offline Model
Offline = analyst can run algorithm on its own (once data has been published)

  1. Data holder anonymizes data D to get D′
  2. Anonymized data D′ is handed to analyst
  3. Analyst applies A to D’ and learns the result

Online Model
Online = analyst needs to interact to apply algorithm

  1. Analyst hands its algorithm A to data holder
  2. Data holder modifies algorithm A to algorithm A’
  3. Data holder applies A’ to D and hands result to analyst

3.6 Differential Privacy

Applies to the online model. More precisely: differential privacy is a property of the algorithm A′
Algorithms that are differentially private have some useful properties
Definition
A randomized algorithm A is (ε, δ)-differentially private if for all datasets X1 and X2 that differ in at most one record and for every set S of outputs:
Pr[A(X1) ∈ S] ≤ e^ε · Pr[A(X2) ∈ S] + δ
Intuition: applying algorithm A to two such neighbouring datasets X1 and X2 yields the same result with high probability (when ε and δ are small)
Example: an algorithm whose output does not depend on its input is (0, 0)-differentially private
Applicability of Differential Privacy
Differentially private algorithms exhibit a number of useful properties: post-processing; group privacy; composition

Mechanisms do exist to make an algorithm differentially private: Laplace mechanism; Exponential mechanism

Used in practice: Google’s RAPPOR

3.6.1 Indistinguishability

General formalization
Property Π is fulfilled ⇔ the outputs of algorithm A for inputs [Input1] and [Input2] are the same with high probability
Intuition: If the outputs of A are the same, the inputs are indistinguishable for A

Example 1: Semantic Security
Standard security definition for encryption schemes
Encryption scheme is semantically secure ⇔ the output of an attacker A given the ciphertext and its output given only the length of the ciphertext are the same with high probability
If these are indistinguishable, an attacker learns nothing from a ciphertext besides its length. In particular, the attacker cannot deduce the plaintext

Example 2: Turing Test (1950)
One of the early approaches to define artificial intelligence: A machine exhibits intelligent behaviour ⇔ a human evaluator cannot distinguish between human and machine

A natural approach to defining privacy: Analyst knows no more about any individual after the analysis is completed
than she knew before the analysis was begun
This does not work for several reasons
(1) Two Left Feet
Assume that the analyst has some wrong belief about the population
Example: every person has two left feet
After seeing a dataset on the population, she knows that she was wrong ⇒ the analyst learnt something new. However, this does not violate the privacy of any individual
(2) Smoking Causes Cancer
Assume that the analyst learns from the analysis that smoking causes cancer
Consider now some person P, where P is not part of the database. The analyst knows that P is a heavy smoker
The analyst learns that P has an increased risk of getting cancer
⇒ The analyst learnt something new about an individual even though this individual was not part of the database

The concept of indistinguishability is promising for making precise that “certain information was not leaked”. Question: what kind of information?

3.6.2 Similarity of Dataset

(Figures: step-by-step construction of the differential privacy definition based on the similarity of datasets)

Coining Privacy?
Privacy Risks
(1) Identity disclosure: Identity of an individual is leaked
(2) Attribute disclosure: Sensitive attributes of an individual are leaked
In all cases: potential harm to an individual
However, we have seen examples where an analyst learns information about an individual even if this individual was not part of the database

Differential privacy uses a utilitarian definition of privacy
More precisely, it promises to protect individuals from any additional harm that they might face due to their data being in the database that they would not have faced had their data not been part of the database.
In other words: any individual P has no control over the remaining contents of the database. But: any individual can decide whether she/he allows her/his data to be included
Differential privacy aims to ensure that participation does not incur any disadvantage

3.6.3 Randomized Algorithm

Definition of Differential Privacy (figure)
Example: Randomized Response
Technique developed in the social sciences to collect statistical information about embarrassing or illegal behaviour, captured by having a property Π.
Assume that algorithm A reports the value of the attribute “Does property Π apply?”
Normally, individuals would refrain from answering this question (i.e., from adding their response to the database)
Aim of randomized response: ensure privacy through plausible deniability of any outcome
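
As an illustration, here is the classic coin-flipping variant of randomized response (answer truthfully with probability 1/2, otherwise answer uniformly at random), together with the unbiased estimator the data collector can use. This particular parameterization is one common choice, not necessarily the one from the slides.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """With probability 1/2 report the truth, otherwise report a uniformly random answer.
    Gives plausible deniability (this variant is ln(3)-differentially private)."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

def estimate_fraction(reports):
    """Unbiased estimate of the true fraction of 'yes' answers:
    E[reported yes] = 1/4 + p_true / 2  =>  p_true = 2 * observed - 1/2."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

# Hypothetical population in which 30% have the embarrassing property:
truth = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t) for t in truth]
print(round(estimate_fraction(reports), 3))   # close to 0.3
```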

Necessity for Randomness
Can we also use non-random algorithms A, i.e., deterministic algorithms? No!
Counterexample (figure)

3.6.4 Post-Processing

Differential privacy protects against post-processing
A data analyst, without additional knowledge about the private database, cannot compute a function of the output of a private algorithm A and make it less differentially private

Post-Processing Theorem: If A is (ε, δ)-differentially private and f is an arbitrary (randomized) function, then the composition f ∘ A is still (ε, δ)-differentially private (proof: figure)

Privacy Loss: for an output o, the privacy loss is ln( Pr[A(X1) = o] / Pr[A(X2) = o] ); viewed over the randomness of A, this defines the privacy loss random variable
Privacy Loss and Differential Privacy: ε-differential privacy bounds the absolute privacy loss by ε for every possible output (figures)

3.6.5 Group Privacy

An ε-differentially private algorithm also provides group privacy: for a group of k individuals (datasets differing in k records), the guarantee degrades to kε-differential privacy

3.6.6 Composition

The data holder runs a differentially private algorithm A on the data on behalf of the analyst
Differential privacy ensures protection for one algorithm, producing one result
It also ensures privacy under composition of algorithms

Composition = repeated use of algorithms
Case 1: Same database
  • Same mechanism applied multiple times
  • Modular construction of differentially private algorithms from arbitrary private building blocks
Case 2: Different databases
  • May contain information relating to the same individual

Composition Theorem: if A1 is (ε1, δ1)-differentially private and A2 is (ε2, δ2)-differentially private, then releasing the outputs of both is (ε1 + ε2, δ1 + δ2)-differentially private (proof and generalization to k mechanisms: figures)

Privacy Budget: the composition theorem suggests viewing ε as a budget; every differentially private query consumes part of it, and once the budget is exhausted, no further queries should be answered

3.6.7 Laplace mechanism

Common Approach
Making a function differentially private: Apply the original function f; Randomize the output (independent of value)
Challenges: Dealing with different types of outputs (numerical, categorical); Preserving utility of result

Laplace Mechanism - Intuition
Popular approach for realizing differential privacy with numerical outputs

  1. Compute the output of function f on the data: some numerical value f(x)
  2. Sample a random value ρ according to the Laplace distribution
  3. Output f(x) + ρ

Realizing differential privacy with numerical outputs
Add random value (noise) to function value f(x)
Intuition: The closer output to f(x), the more useful

Laplace Distribution
Lap(b): distribution over the real numbers with density (1/(2b)) · exp(−|z|/b); mean 0, and the scale b controls the spread

ℓ1-Sensitivity of a Function
Δf = maximum of ‖f(X1) − f(X2)‖₁ over all pairs of datasets X1, X2 that differ in a single record

Laplace Mechanism - Theorem
Adding noise drawn from Lap(Δf / ε) to (each coordinate of) f(X) yields an ε-differentially private mechanism
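
A minimal sketch of the Laplace mechanism for a counting query (ℓ1-sensitivity 1); the Laplace sampler uses inverse transform sampling so that no external libraries are needed, and the example count and ε are arbitrary.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Lap(0, scale) with density 1/(2*scale) * exp(-|z|/scale),
    via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """epsilon-differentially private release of a numerical query result:
    add Laplace noise with scale = (l1-sensitivity of the query) / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon)

# Counting queries ("how many records satisfy ...?") have l1-sensitivity 1,
# since adding or removing one record changes the count by at most 1.
true_count = 42
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
```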

Laplace Mechanism - Proof (figures)

3.6.8 Exponential mechanism

Makes a function differentially private: apply the original function f → randomize the output (independently of the value)
As opposed to the Laplace mechanism, the output can be non-numerical
Exponential Mechanism - Intuition
A utility function u(X, r) scores each possible output r for the dataset X; the mechanism samples an output r with probability proportional to exp(ε · u(X, r) / (2 · Δu)), so that high-utility outputs are exponentially more likely

Sensitivity
Δu = maximum change of u(X, r) over all outputs r and all pairs of datasets X1, X2 differing in a single record

Exponential Mechanism – Theorem (Sketch)
The exponential mechanism with utility u and sensitivity Δu is ε-differentially private
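
A minimal sketch of the exponential mechanism: candidates are weighted by exp(ε·u/(2Δu)) and one is sampled. The “most common disease” example and its counts are hypothetical (utility = count, which has sensitivity 1).

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity):
    """Sample a (possibly non-numerical) output r with probability
    proportional to exp(epsilon * u(r) / (2 * Delta_u))."""
    weights = [math.exp(epsilon * utility(r) / (2 * sensitivity)) for r in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Hypothetical example: privately choose the most common disease in a group.
# Utility of a candidate = its count; one record changes any count by at most 1.
counts = {"flu": 30, "cold": 28, "cancer": 2}
winner = exponential_mechanism(list(counts), utility=lambda d: counts[d],
                               epsilon=1.0, sensitivity=1.0)
print(winner)   # usually "flu" or "cold", rarely "cancer"
```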

3.6.9 RAPPOR

Use Case
Crowdsourcing statistics from end-user client software
The service provider frequently collects information from end-users
Example 1: Reporting on Windows Process Names
Example 2: Reporting on Chrome Homepages
Motivation: Identify unusual behaviour
Challenge: Preserve privacy

Workflow (figure)

Step 1: Compute Bloom Filter
Setup
Bitstring B = (b1, … , bk) with all bi set to 0
h hash functions Hash1, … , Hashh that map data into {1, … , k}
Computation
For each data value v, compute the indices ind_j = Hash_j(v) and set b_{ind_j} := 1
The resulting bitstring can be used as a kind of compact fingerprint of v
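
A small sketch of this step; deriving the h hash functions from SHA-256 with different salts is an illustrative choice, not the construction mandated by RAPPOR, and the filter size k = 16 is arbitrary.

```python
import hashlib

def bloom_filter(value: str, k: int = 16, h: int = 2) -> list:
    """Map a value into a k-bit Bloom filter using h hash functions
    (here: SHA-256 with different salts, an illustrative choice)."""
    bits = [0] * k
    for j in range(h):
        digest = hashlib.sha256(f"{j}:{value}".encode()).digest()
        index = int.from_bytes(digest, "big") % k
        bits[index] = 1
    return bits

print(bloom_filter("www.example.com"))
```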

Step 2: Permanent Randomized Response
Each bit of the Bloom filter is randomized once and the result is memoized: the bit is replaced by 1 or 0, each with probability f/2, and kept with probability 1 − f (the parameter f controls the long-term privacy guarantee)

Step 3: Instantaneous Randomized Response
For every report, a fresh randomization of the memoized bits is sent: a bit is reported as 1 with probability q if the memoized bit is 1, and with probability p if it is 0
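
A sketch of steps 2 and 3 using the parameter names f, p and q from the RAPPOR paper; the concrete values below are illustrative, and a real client would memoize the permanent response per value rather than recompute it.

```python
import random

def permanent_randomized_response(bits, f=0.5):
    """Step 2 (sketch): replace each Bloom-filter bit by 1 with probability f/2,
    by 0 with probability f/2, and keep it with probability 1 - f.
    Computed once per value and memoized ('permanent')."""
    prr = []
    for b in bits:
        u = random.random()
        if u < f / 2:
            prr.append(1)
        elif u < f:
            prr.append(0)
        else:
            prr.append(b)
    return prr

def instantaneous_randomized_response(prr_bits, p=0.5, q=0.75):
    """Step 3 (sketch): report each bit as 1 with probability q if the permanent
    bit is 1, and with probability p if it is 0. A fresh report per collection round."""
    return [int(random.random() < (q if b else p)) for b in prr_bits]

B = [0, 1, 0, 0, 1, 0, 0, 0]                        # Bloom filter from step 1
B_prime = permanent_randomized_response(B)           # memoized per value
report = instantaneous_randomized_response(B_prime)  # sent to the server
print(report)
```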

Part4: Cryptographic Techniques

4.1 Basic

Privacy Goals

  • Offline model: input privacy
  • Online model: output privacy

Alternative cryptographic approaches for ensuring input privacy

  • Multiparty Computation
  • Homomorphic Encryption

Advantages

  • Allow computing any algorithm on the original data without any data leakage
  • Extensions possible: multiple parties, algorithm privacy, …

Disadvantages

  • Higher effort (communication and/or computation)

Multi-Party Computation
Introduced a general notion of secure computation

Scenario

  • m parties P1, … , Pm and a common function f
  • Each party Pi has its own function input xi
  • The parties jointly compute y= f(x1, … , xm) and learn only the output y
  • In particular, Pi does not learn xj for i ≠ j
  • Note: The concept can be extended so that each party has its own individual output, i.e., (y1, … , ym) = f(x1, … , xm)

Scenario of Multi-Party Computation

Applications
Theory

  • Any function can be securely computed
  • Mechanisms are getting more and more efficient

Practice

  • Boston Wage Gap Studies (2017)
  • Secure Machine Learning
  • Sharemind (https://sharemind.cyber.ee/)

4.2 Model

4.2.1 Requirements

  1. Correctness: Each party learns the correct output of f(x1, … , xm)
  2. Security: Each party learns nothing except the output of f(x1, … , xm)
    Attacker model?
    Learning nothing?

4.2.2 Attacker Model

Attacker Model

4.2.3 Security Definition

Formal definition is based on the concept of indistinguishability
Recall: Differential privacy → the attacker learns nothing about a certain individual x ⇔ the attacker cannot distinguish between the database with and without the record of x
Similar approach for multi-party computation (MPC): Attacker learns nothing about other inputs ⇔ attacker cannot distinguish between ideal case and real case

4.2.3.1 Ideal Case

The parties hand their inputs to a trusted third party, which computes f and returns to each party only its output (figure)

4.2.3.2 Real Case

The parties run the actual protocol among themselves; there is no trusted third party (figure)

4.2.3.3 Simulation Paradigm

A protocol is secure if everything an attacker learns in the real case can also be computed (simulated) from its own input and output in the ideal case (figure)

4.3 Multiparty Computation Realization 1 - Oblivious Transfer

Oblivious Transfer (OT) is an essential building block for multiparty computation (MPC)
Actually, one can show that OT allows realizing MPC and vice versa
Many different realizations and improvements exist; in the following:

  • Generic description
  • One possible realization for OT
  • Simple (but inefficient) OT-based MPC

4.3.1 OT Functionality

In its basic 1-out-of-2 form: the sender has two messages m0 and m1, the receiver has a choice bit b; the receiver learns mb but nothing about the other message, and the sender learns nothing about b (figure)

4.3.2 Discrete Logarithm Problem

Given a cyclic group with generator g and a group element h = g^x, it is computationally hard to compute x (the discrete logarithm of h) (figure)

4.3.3 DLP-Based Oblivious Transfer

DLP-Based Oblivious Transfer

4.3.4 Simple OT-based MPC

Simple OT-based MPC (figures)

Advantages: Constant number of communication rounds
Disadvantages: Table size (and hence communication effort) is exponential in the number of input bits
Assume now a function f(x1, x2, x3, x4), where (x1, x2) is the input of P1 and (x3, x4) that of P2. The simple approach would require a table with 4 × 4 = 16 cells
Motivation (figures)

Problem: Representation of function f as lookup-table is highly inefficient

Yao’s garbled circuits

  • Follows similar approach
  • Uses Boolean circuits as representation

4.3.5 Yao’s Garbled Circuits (GC)

Boolean Circuits: Common representation for Boolean functions; Composition of logical gates NOT, AND, and XOR

Approach

  • Represent the Boolean function f by a Boolean circuit
  • For each gate, a lookup table of encryptions is created to run MPC as described before
  • Step by step, each gate is executed using MPC
  • Intermediate gates yield keys for the inputs to next gates
  • Final gates yield keys for the final outputs

Advantage: Runs in constant number of rounds (independent of structure of f)
Example

4.4 Multiparty Computation Realization 2 - GMW-Protocol

4.4.1 Basic

Similar approach as garbled circuits

  • Decompose function f into several sub-functions
  • Execute sub-functions step-by-step

Advantages: Extends naturally to more than 2 parties; Can work both on Boolean and arithmetic circuits
Note: we restrict to the 2-party case with Boolean circuits

Disadvantages: Number of communication rounds depends on circuit depth (≠ constant)

Core idea: parties split their inputs into shares and operate independently in parallel (as far as possible)

4.4.2 Value Sharing

Each party splits each of its input bits x into two random shares with x1 ⊕ x2 = x, keeps one share, and sends the other to the other party; a single share is uniformly random and reveals nothing about x (figure)

4.4.3 Workflow

Workflow

Computing NOT Gates
One party flips its share locally (since x = x1 ⊕ x2 implies ¬x = (¬x1) ⊕ x2); no communication is needed (figure)

Computing XOR Gates
Each party XORs its shares of the two input wires locally; no communication is needed (figure)

Computing AND Gates
AND gates cannot be computed locally; the parties interact (e.g., using oblivious transfer) to obtain shares of the product of the input wires (figure)

4.5 Shamir’s Secret Sharing

4.5.1 Secret Sharing

A secret s is split into n shares such that authorized subsets of shares allow reconstructing s, while unauthorized subsets reveal nothing about s (figure)

4.5.2 2-out-of-2 Secret Sharing Scheme

Choose a uniformly random value r and define the shares as r and s ⊕ r; both shares together yield s, while each single share is uniformly distributed and reveals nothing about s (figure)
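
A minimal sketch of such a scheme using XOR over fixed-length bitstrings: each share alone is uniformly random, both together reconstruct the secret.

```python
import secrets

def share(secret: int, bits: int = 32):
    """2-out-of-2 secret sharing: pick a random r and output the shares (r, secret XOR r)."""
    r = secrets.randbits(bits)
    return r, secret ^ r

def reconstruct(share1: int, share2: int) -> int:
    return share1 ^ share2

s1, s2 = share(0b101101)
print(bin(reconstruct(s1, s2)))   # 0b101101
```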

4.5.3 Finite Field

Field (in mathematics)
Algebraic structure
Informal: Set of elements, certain operations (addition, subtraction, multiplication, division)
Examples: set of rational numbers, set of real numbers
Finite fields
Field with finite number of elements
Example: ℤ𝑝 -> integers {0, … , 𝑝 − 1} for some prime 𝑝, integer addition and multiplication modulo p
Smallest field: GF(2)
Set {0,1} (one bit), operations XOR and AND

4.5.4 Polynomial Interpolation

A polynomial of degree t over a field is uniquely determined by t + 1 points; given t + 1 points, it can be reconstructed, e.g., by Lagrange interpolation (figure)

4.5.5 Shamir’s Secret Sharing - Share

To share a secret s among n parties with threshold t: choose a random polynomial f of degree t over a finite field with f(0) = s, and hand party i the share (i, f(i)) (figure)

4.5.6 Shamir’s Secret Sharing - Reconstruct

To reconstruct: any t + 1 parties pool their shares, interpolate the unique polynomial f of degree t through them, and read off the secret s = f(0); t or fewer shares reveal nothing about s (figure)
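
A compact sketch of both algorithms over the prime field Z_P (the prime below is an arbitrary choice): sharing hides the secret in the constant term of a random degree-t polynomial, and reconstruction uses Lagrange interpolation at 0.

```python
import random

P = 2**61 - 1   # a prime; all arithmetic is done in the finite field Z_P

def share(secret: int, n: int, t: int):
    """Hide the secret as the constant term of a random polynomial of degree t;
    party i receives the share (i, f(i))."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over Z_P."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = share(123456789, n=5, t=2)
print(reconstruct(shares[:3]))   # any t + 1 = 3 shares suffice: 123456789
```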

4.6 BGW-Protocol

Similar approach as the GMW protocol

  • Split input into shares
  • Expressing function by a circuit
  • Execute function step-by-step along the circuit

Differences

  • Uses an arithmetic circuit over a finite field F -> Operations: addition, multiplication with a constant, multiplication of values
  • Heavily relies on Shamir’s Secret Sharing scheme (SSS)

4.6.1 Workflow

Workflow

4.6.2 Computing Addition Gates

Each party adds its shares of the two inputs locally: if party Pi holds f(i) and g(i), then f(i) + g(i) is a valid share of the sum (the polynomial f + g still has degree t and free coefficient a + b); no communication is needed (figure)

4.6.3 Computing Multiplication-With-Constant Gates

Each party multiplies its share by the public constant locally; again, no communication is needed (figure)

4.6.4 Computing Multiplication Gates

Computing Multiplication Gates (figures): each party multiplies its two shares locally, which yields shares of a polynomial of degree 2t; a re-sharing/degree-reduction step then brings the degree back down to t

The protocol requires that 2t < n holds

  • This implies that multiplying the shares once results in a polynomial that can still be interpolated
  • However, further multiplications may not be possible anymore
  • The trick explained on the previous slides guarantees that after one multiplication, the degree of the polynomial is “pushed back” to t

4.7 Homomorphic Encryption

4.7.1 Basic

Outsourcing Effort
Amount of data and computation effort increases
One or several parties may outsource data and/or computation to some external party
Example: Cloud computing. May help to increase efficiency and save costs

Security
Encryption and access control protect against outside attackers. What if the server itself is corrupted?

Homomorphic Encryption
An encryption scheme that allows computing on ciphertexts: evaluating a function on encryptions yields an encryption of the function applied to the plaintexts (figure)

Applications

  • Consumer Privacy in Advertising
    • Recommender systems
    • User preferences not disclosed to service provider
  • Medical
    • Real-time health analysis
    • Nobody but the patient gets information
  • Financial Privacy
    • Stock price predictions without portfolio disclosure
  • Forensic Image Recognition
    • Find images from database in sensitive picture stream

Formal Definition
Formal Definition of homomorphic encryption

Work Flow
Work Flow of homomorphic encryption

Characterizations

  • Partially Homomorphic Encryption (PHE)
    Support the Eval function for one type of operation only, e.g., group operation or addition/multiplication
  • Somewhat Homomorphic Encryption (SWHE)
    Support the Eval function for only a limited number of operations or some limited circuits (e.g., branching programs)
  • Fully Homomorphic Encryption (FHE)
    Support the Eval function for arbitrary functions for an unlimited number of times over ciphertexts

4.7.2 Partially Homomorphic Encryption (PHE)

Group Homomorphic Encryption Schemes
So far, PHE usually refers to group homomorphic encryption schemes
Group homomorphic

  • Message space and ciphertext space are groups
  • Supported functions = group operation

Formal Definition
Formal Definition of PHE

Example: ElGamal Encryption Scheme (figure)
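
A toy sketch of ElGamal and its multiplicative homomorphism; the group parameters below are tiny and insecure, chosen only so the arithmetic is easy to follow.

```python
import random

# Toy parameters (illustrative only, not secure sizes):
q = 101                      # small prime modulus
g = 2                        # element of Z_q* used as generator

def keygen():
    sk = random.randrange(1, q - 1)
    pk = pow(g, sk, q)
    return pk, sk

def encrypt(pk, m):
    """ElGamal encryption: c = (g^r, m * pk^r); the message m is a group element."""
    r = random.randrange(1, q - 1)
    return pow(g, r, q), (m * pow(pk, r, q)) % q

def decrypt(sk, c):
    c1, c2 = c
    return (c2 * pow(c1, q - 1 - sk, q)) % q    # c2 * c1^(-sk)

def multiply(c, d):
    """Homomorphic property: component-wise multiplication of ciphertexts
    yields an encryption of the product of the plaintexts."""
    return (c[0] * d[0]) % q, (c[1] * d[1]) % q

pk, sk = keygen()
c5, c7 = encrypt(pk, 5), encrypt(pk, 7)
print(decrypt(sk, multiply(c5, c7)))   # 35
```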

Complete Characterization (figures, including the characterization applied to ElGamal)

4.7.3 Somewhat Homomorphic Encryption (SWHE)

Example: A Simple Secret Key SWHE (figures: construction, explanations, and homomorphic properties)

The Noise Problem (figure)
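
A minimal sketch of one well-known symmetric SWHE construction over the integers (a DGHV-style scheme); it may differ in detail from the scheme on the slides, but it shows both the homomorphic operations and the noise growth.

```python
import random

# Secret key: a large odd integer p. A bit m is encrypted as c = m + 2r + p*q,
# where r is small random noise and q is a large random integer.
# Decryption: (c mod p) mod 2 — correct as long as the noise term m + 2r stays below p/2.

def keygen(bits=31):
    return (1 << (bits - 1)) | random.getrandbits(bits - 1) | 1   # large random odd integer

def encrypt(p, m, noise_bits=4, q_bits=20):
    r = random.getrandbits(noise_bits)
    q = random.getrandbits(q_bits) + 1
    return m + 2 * r + p * q

def decrypt(p, c):
    return (c % p) % 2

p = keygen()
c0, c1 = encrypt(p, 0), encrypt(p, 1)
print(decrypt(p, c0 + c1))    # addition of ciphertexts -> XOR of plaintexts: 1
print(decrypt(p, c0 * c1))    # multiplication of ciphertexts -> AND of plaintexts: 0
# The noise term grows with every operation (it adds up on addition and multiplies
# on multiplication); once it exceeds p/2, decryption fails.
```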

Making it Public Key
(figure)

4.7.4 Fully Homomorphic Encryption (FHE)

Noise Problem
Somewhat homomorphic encryption schemes
Existing schemes: support all operations
But: fresh encryptions contain noise, and the noise grows with each operation

Question
If the noise could be reduced (without leaking the plaintext):
Somewhat homomorphic encryption scheme ⇒ Fully homomorphic encryption scheme

Craig Gentry presented in his PhD thesis the first concrete construction for realizing a fully homomorphic encryption scheme
Technique is called “bootstrapping” (also “recrypt”)

High level idea

  1. Noise: Decryption removes noise
  2. Homomorphism: We can evaluate certain functions on ciphertexts ⇒ run the decryption algorithm homomorphically on the ciphertext (using an encryption of the secret key), which yields a fresh ciphertext with reduced noise

4.7.5 Bootstrapping

Bootstrapping (figures)

Excerpted from Uni Mannheim lecture notes
