Differentially Private Release of Mixed-type Data via Latent Factor Models
LaTeX academic presentation slides: Differential Private Data Release for Mixed-type Data via Latent Factor Models
1. Content overview
- Differentially private release of mixed-type data via latent factor models.
2. Full LaTeX code
\documentclass[xcolor=svgnames, aspectratio=169]{beamer}
%\usecolortheme[named=CornflowerBlue]{structure}
\usetheme{Boadilla}
% Madrid
\usecolortheme{seahorse}
% dolphin
%\setbeamertemplate{itemize items}{\color{red}$\bullet$}
%\usepackage{time} % nonstandard package; \today works without it
\usepackage{graphicx} % epsfig is obsolete; graphicx alone suffices
\usepackage[T1]{fontenc} % european characters
\usepackage{amssymb,amsmath} % use mathematical symbols
\usepackage{palatino} % use palatino as the default font
\usepackage{multimedia}
\usepackage{subfigure}
\usepackage{mathrsfs}
%\usepackage{movie15}
\usepackage{color}
\usepackage{xcolor}
\title{Differential Private Data Release for Mixed-type Data via Latent Factor Models}
% \subtitle{} %副标题
\author{i\_chensihuo\_888}
\institute{Yunnan University} % COMMAND UNIQUE TO BEAMER
\date{\today}
% \date{March 1, 2023}
\begin{document}
%\textrm{Roman Family}
\rmfamily
%封面
\begin{frame}
\begin{figure}
\centering
\includegraphics[width=0.2\linewidth]{01.png}
\end{figure}
\titlepage
\end{frame}
%正文
\begin{frame}
\textbf{Synthetic Data via Factor Model}
\vspace{0.5cm}
We consider an original dataset $\boldsymbol{X}=\left(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n\right)^{\top} \in \mathbb{R}^{n \times p}$, where $\boldsymbol{x}_i \in \mathbb{R}^p$ is the $i$-th sample and correlations exist among the $p$-dimensional variables. We define a linear factor model based on the original dataset as follows:
\begin{equation}
\boldsymbol{X}=\mathbf{W} \boldsymbol{\Lambda}^{\top}+\mathbf{E}, \tag{1}
\end{equation}
where $\mathbf{W} \in \mathbb{R}^{n \times r}$ is a matrix of latent factors, $\boldsymbol{\Lambda} \in \mathbb{R}^{p \times r}$ is a factor loading matrix, $r$ is the number of factors, and $\mathbf{E} \in \mathbb{R}^{n \times p}$ is a random error matrix. Here, each column of $\mathbf{W}$ is an independent latent factor, and the factor loading matrix $\boldsymbol{\Lambda}$ captures the relationship of the original data and latent factors.
\end{frame}
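% Added illustrative frame; the dimensions below are hypothetical numbers chosen
% for exposition and do not come from the original slides.
\begin{frame}
\textbf{Illustration: Dimensions of the Factor Model}
\vspace{0.5cm}

For instance, with $n=1000$ samples, $p=50$ variables and $r=5$ factors, we have $\mathbf{W} \in \mathbb{R}^{1000 \times 5}$ and $\boldsymbol{\Lambda} \in \mathbb{R}^{50 \times 5}$, so the low-rank part $\mathbf{W} \boldsymbol{\Lambda}^{\top}$ is parameterized by $1000 \times 5+50 \times 5=5250$ numbers instead of the $1000 \times 50=50000$ entries of $\boldsymbol{X}$.
\end{frame}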
\begin{frame}
Once we obtain the estimators $\widehat{\mathbf{W}}$ and $\widehat{\boldsymbol{\Lambda}}$ of $\mathbf{W}$ and $\boldsymbol{\Lambda}$ in model (1), we can generate synthetic data approximating the original data via $\widehat{\boldsymbol{X}}=\widehat{\mathbf{W}} \widehat{\boldsymbol{\Lambda}}^{\top}$.
\end{frame}
\begin{frame}
The key to generating synthetic data via the above low-rank approximation is to estimate $\mathbf{W}$ and $\boldsymbol{\Lambda}$. Generally, they can be estimated by minimizing the following objective function:
\begin{equation}
Q(\boldsymbol{\Lambda}, \mathbf{W})=\sum_{i=1}^n \sum_{j=1}^p\left(x_{i j}-\mathbf{w}_i^{\top} \boldsymbol{\lambda}_j\right)^2=\operatorname{tr}\left\{\left(\boldsymbol{X}-\mathbf{W} \boldsymbol{\Lambda}^{\top}\right)^{\top}\left(\boldsymbol{X}-\mathbf{W} \boldsymbol{\Lambda}^{\top}\right)\right\}. \tag{2}
\end{equation}
For the identification of $\mathbf{W}$ and $\boldsymbol{\Lambda}$, normalization restrictions on $\mathbf{W}$ and $\boldsymbol{\Lambda}$ are needed in optimizing (2). Specifically, based on the framework of Bai (2003), we impose the requirements that $\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}=\mathbf{I}_r$ and $\mathbf{W}^{\top} \mathbf{W}$ is diagonal.
\end{frame}
\begin{frame}
Through the normalization $\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}=\mathbf{I}_r$ and concentrating out $\mathbf{W}$, solving (2) is equivalent to maximizing $\operatorname{tr}\left\{\boldsymbol{\Lambda}^{\top}\left(\boldsymbol{X}^{\top} \boldsymbol{X}\right) \boldsymbol{\Lambda}\right\}$. The solution is the estimated factor loading matrix $\widehat{\boldsymbol{\Lambda}}$ whose $j$-th column is the eigenvector $\boldsymbol{\mu}_j$ corresponding to the $j$-th largest eigenvalue $\nu_j$ of the matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$, that is, $\widehat{\boldsymbol{\Lambda}}=\boldsymbol{V}=\left(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \cdots, \boldsymbol{\mu}_r\right)$.
\vspace{0.5cm}
The corresponding factor matrix estimator is $\widehat{\mathbf{W}}=\boldsymbol{X} \widehat{\boldsymbol{\Lambda}}\left(\widehat{\boldsymbol{\Lambda}}^{\top} \widehat{\boldsymbol{\Lambda}}\right)^{-1}=\boldsymbol{X} \widehat{\boldsymbol{\Lambda}}$. Thus, synthetic data resembling the original data can be generated via $\widehat{\boldsymbol{X}}=\widehat{\mathbf{W}} \widehat{\boldsymbol{\Lambda}}^{\top}=\boldsymbol{X} \boldsymbol{V} \boldsymbol{V}^{\top}$.
\end{frame}
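% Added remark frame; the projection argument and the Eckart--Young connection are
% standard linear algebra facts, not claims taken from the original slides.
\begin{frame}
\textbf{Remark: A Projection Interpretation}
\vspace{0.5cm}

Since $\boldsymbol{V}^{\top} \boldsymbol{V}=\mathbf{I}_r$, the matrix $\boldsymbol{P}=\boldsymbol{V} \boldsymbol{V}^{\top}$ is an orthogonal projection:
$$
\boldsymbol{P}^2=\boldsymbol{V}\left(\boldsymbol{V}^{\top} \boldsymbol{V}\right) \boldsymbol{V}^{\top}=\boldsymbol{V} \boldsymbol{V}^{\top}=\boldsymbol{P}.
$$
Hence $\widehat{\boldsymbol{X}}=\boldsymbol{X} \boldsymbol{P}$ projects each sample onto the $r$-dimensional principal subspace, and by the Eckart--Young theorem it is the best rank-$r$ approximation of $\boldsymbol{X}$ in Frobenius norm.
\end{frame}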
\begin{frame}
We set the number of factors based on the cumulative information ratio, that is,
$$
r(c)=\min \left\{k: 1 \leq k<q, \ \frac{\sum_{j=1}^k \nu_j}{\sum_{j=1}^q \nu_j}>c\right\},
$$
where $q=\min \{n, p\}$ and $c \in(0,1)$ is a given threshold, e.g., $c=0.8$ or $0.9$.
\end{frame}
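% Added worked-example frame; the eigenvalues are hypothetical numbers for illustration.
\begin{frame}
\textbf{Example: Choosing the Number of Factors}
\vspace{0.5cm}

Suppose $q=3$ and the eigenvalues of $\boldsymbol{X}^{\top} \boldsymbol{X}$ are $\nu_1=6$, $\nu_2=3$ and $\nu_3=1$. With threshold $c=0.8$,
$$
\frac{\nu_1}{\nu_1+\nu_2+\nu_3}=\frac{6}{10}=0.6 \leq 0.8, \qquad \frac{\nu_1+\nu_2}{\nu_1+\nu_2+\nu_3}=\frac{9}{10}=0.9>0.8,
$$
so the smallest $k$ whose cumulative information ratio exceeds $c$ is $k=2$, i.e., $r(0.8)=2$.
\end{frame}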
\begin{frame}
Privacy leakage can still occur because the synthetic data $\widehat{\boldsymbol{X}}=\boldsymbol{X} \boldsymbol{V} \boldsymbol{V}^{\top}$ is a deterministic function of the original data and may therefore remain sensitive to it.
\end{frame}
\begin{frame}
\textbf{Differentially Private Synthetic Data via Factor Model}
\vspace{0.5cm}
Specifically, we construct a synthetic data generating model as follows:
$$
\widetilde{\boldsymbol{X}}=\widetilde{\mathbf{W}} \widetilde{\boldsymbol{\Lambda}}^{\top}=\left\{\boldsymbol{X} \boldsymbol{g}(\boldsymbol{V}+\mathbf{B})+\boldsymbol{C}\right\} \boldsymbol{g}(\boldsymbol{V}+\mathbf{B})^{\top},
$$
\begin{itemize}
\item The matrix $\boldsymbol{V}$ consists of the eigenvectors corresponding to the first $r$ largest eigenvalues of the matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$, in decreasing order.
\item The matrix $\boldsymbol{g}(\boldsymbol{V}+\mathbf{B})$ is the column-orthonormal matrix obtained from the QR decomposition of the matrix $\boldsymbol{V}+\mathbf{B}$.
\item The matrix $\mathbf{B}$ is a $p \times r$ noise matrix; each entry of the $i$-th column of $\mathbf{B}$ is drawn from a Laplace distribution $\operatorname{Lap}\left(2 \sqrt{p} / \epsilon_{1 i}\right)$ with zero location parameter and scale parameter $2 \sqrt{p} / \epsilon_{1 i}$, where $\epsilon_{1 i}=\omega_i \epsilon_1$, $\omega_i>0$ and $\sum_{i=1}^r \omega_i=1$. For the allocation of the privacy budget $\epsilon_1$, one can choose, for example, $\omega_i=\nu_i / \sum_{j=1}^r \nu_j$ or $\omega_i=1 / r$.
\item The matrix $\boldsymbol{C}$ is an $n \times r$ noise matrix, where each entry of the $r$-th column of $\boldsymbol{C}$ is drawn from $\operatorname{Lap}\left(2 / \epsilon_2\right)$ and all other entries are zero.
\end{itemize}
\end{frame}
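% Added worked-example frame; the budget and eigenvalues are hypothetical numbers
% illustrating the eigenvalue-proportional allocation described above.
\begin{frame}
\textbf{Example: Allocating the Privacy Budget $\epsilon_1$}
\vspace{0.5cm}

Take $r=2$, $\epsilon_1=0.9$ and leading eigenvalues $\nu_1=6$, $\nu_2=3$. The eigenvalue-proportional weights are $\omega_1=6 / 9=2 / 3$ and $\omega_2=3 / 9=1 / 3$, giving
$$
\epsilon_{11}=\omega_1 \epsilon_1=0.6, \qquad \epsilon_{12}=\omega_2 \epsilon_1=0.3 .
$$
The first (more informative) eigenvector thus receives the larger budget $\epsilon_{11}$, i.e., a smaller Laplace scale $2 \sqrt{p} / \epsilon_{11}$ and hence less perturbation.
\end{frame}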
\begin{frame}
\textbf{Algorithm 1: DP Synthetic Data Generation via Factor Model}\\[0.3cm]
\textbf{Input:} original data $\boldsymbol{X}$ ($n \times p$), privacy budget $\epsilon=\epsilon_1+\epsilon_2$ and the number of factors $r$;\\
\textbf{Output:} synthesized data $\widetilde{\boldsymbol{X}}$;
1: Calculate matrix $\boldsymbol{A}=\boldsymbol{X}^{\top} \boldsymbol{X}$ and $\boldsymbol{V}=\left(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \cdots, \boldsymbol{\mu}_r\right)$ with the columns being the eigenvectors corresponding to the first $r$ largest eigenvalues $\boldsymbol{\nu}=\left(\nu_1, \ldots, \nu_r\right)^{\top}$ of the matrix $\boldsymbol{A}$ in decreasing order;
2: Construct the perturbed eigenvector matrix $\tilde{\boldsymbol{V}}$ with privacy protection:
(i) Generate a $p \times r$ random matrix $\mathbf{B}$ and calculate the perturbed matrix $\boldsymbol{V}^*=\boldsymbol{V}+\mathbf{B}$, where $\mathbf{B}=\left(\mathbf{b}_1, \cdots, \mathbf{b}_r\right) \in \mathbb{R}^{p \times r}$, each entry of the vector $\mathbf{b}_i$ is from $\operatorname{Lap}\left(2 \sqrt{p} / \epsilon_{1 i}\right)$, $\epsilon_{1 i}=\omega_i \epsilon_1, \omega_i>0$ and $\sum_{i=1}^r \omega_i=1$;
(ii) Calculate the orthonormal matrix $\tilde{\boldsymbol{V}}$ from the QR decomposition of the matrix $\boldsymbol{V}^*$;
\end{frame}
\begin{frame}
3: Calculate the estimators of the factor matrix and factor loading matrix:
(I) Generate an $n \times r$ random matrix $\boldsymbol{C}$, where each entry of the $r$-th column of $\boldsymbol{C}$ is sampled from $\operatorname{Lap}\left(2 / \epsilon_2\right)$ and all other entries are zero;
(II) Calculate the factor loading matrix $\widetilde{\boldsymbol{\Lambda}}=\widetilde{\boldsymbol{V}}$ and the factor matrix $\widetilde{\mathbf{W}}=\boldsymbol{X} \widetilde{\boldsymbol{V}}+\boldsymbol{C}$;
4: Return: Synthesized data $\widetilde{\boldsymbol{X}}=\widetilde{\mathbf{W}} \widetilde{\boldsymbol{\Lambda}}^{\top}$.
\end{frame}
\begin{frame}
\textbf{Implementation for Mixed-type Data}
\vspace{0.5cm}
We consider an original dataset $\boldsymbol{X}=\left(\boldsymbol{x}_1, \cdots, \boldsymbol{x}_n\right)^{\top}$ with mixed-type data including continuous, ordinal and nominal categorical variables. Denote the $i$-th sample as
$$
\boldsymbol{x}_i=\left(y_{i, 1}, \ldots, y_{i, p_1}, z_{i, p_1+1}, \ldots, z_{i, p_1+p_2}, u_{i, p_1+p_2+1}, \ldots, u_{i, p_1+p_2+p_3}\right)^{\top},
$$
where the $y$'s, $z$'s and $u$'s denote the $p_1$ continuous, $p_2$ ordinal and $p_3$ nominal categorical variables, respectively.
\end{frame}
\begin{frame}
Based on the framework of Song et al. (2013), we construct an underlying vector
$$
\boldsymbol{x}_i^*=\left(y_{i, 1}^*, \ldots, y_{i, p_1}^*, z_{i, p_1+1}^*, \ldots, z_{i, p_1+p_2}^*, \mathbf{u}_{i, p_1+p_2+1}^{* \top}, \ldots, \mathbf{u}_{i, p_1+p_2+p_3}^{* \top}\right)^{\top}
$$
which is linked to the original vector data $\boldsymbol{x}_i$ as follows:
\begin{equation}
\left\{\begin{array}{l}
y_{i j}=h_{1 j}\left(y_{i j}^*\right), \quad j=1, \ldots, p_1, \\
z_{i j}=h_{2 j}\left(z_{i j}^*\right), \quad j=p_1+1, \ldots, p_1+p_2, \\
u_{i j}=h_{3 j}\left(\mathbf{u}_{i j}^*\right), \quad j=p_1+p_2+1, \ldots, p_1+p_2+p_3,
\end{array}\right. \tag{3}
\end{equation}
\end{frame}
\begin{frame}
where $h_{1 j}, h_{2 j}$ and $h_{3 j}$ 's correspond to identity, threshold and multinomial probit link functions, respectively. The identity link function $h_{1 j}(\cdot)$ keeps the continuous variables $y_{i j}$ invariant, that is, $y_{i j}=h_{1 j}\left(y_{i j}^*\right)=y_{i j}^*$.
\end{frame}
\begin{frame}
For ordinal variables $z_{i j}$ with integer values in $\left\{0,1, \ldots, M_j-1\right\}$, a threshold link function $h_{2 j}(\cdot)$ is defined as the following:
$$
z_{i j}=h_{2 j}\left(z_{i j}^*\right)=\sum_{m=0}^{M_j-1} m \cdot I\left(\tau_{j, m} \leq z_{i j}^*<\tau_{j, m+1}\right),
$$
where $I(\cdot)$ is an indicator function taking the value 1 if $\tau_{j, m} \leq z_{i j}^*<\tau_{j, m+1}$ and 0 otherwise. The set of thresholds can be estimated by converting the cumulative proportions of the observed data $z_{i j}$ to standard normal quantiles. That is, $\tau_{j, m}=\Phi^{-1}\left(f_{j, m}\right)$ for $m=1,2, \ldots, M_j-1$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal $N(0,1)$, and $f_{j, m}$ is the cumulative frequency of the categories with $z_{i j}<m$.
\end{frame}
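% Added worked-example frame; the category frequencies are hypothetical numbers.
\begin{frame}
\textbf{Example: Thresholds for an Ordinal Variable}
\vspace{0.5cm}

Suppose $z_{i j} \in\{0,1,2\}$ ($M_j=3$) with observed cumulative frequencies $f_{j, 1}=0.5$ and $f_{j, 2}=0.8$. Then
$$
\tau_{j, 1}=\Phi^{-1}(0.5)=0, \qquad \tau_{j, 2}=\Phi^{-1}(0.8) \approx 0.84,
$$
with $\tau_{j, 0}=-\infty$ and $\tau_{j, 3}=+\infty$. An underlying value $z_{i j}^*=0.3$ satisfies $\tau_{j, 1} \leq 0.3<\tau_{j, 2}$ and is therefore mapped to the category $z_{i j}=1$.
\end{frame}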
\begin{frame}
For nominal categorical variables $u_{i j}$ with $Q_j$ categories, we assume that $u_{i j}$ takes values from $\left\{0,1, \ldots, Q_j-1\right\}$. The $u_{i j}$ is modeled through $\mathbf{u}_{i j}^*=\left(u_{i j, 1}^*, \ldots, u_{i j, Q_j-1}^*\right)^{\top} \in \mathbb{R}^{Q_j-1}$ with a multinomial probit link function $h_{3 j}(\cdot)$ such that:
$$
u_{i j}=h_{3 j}\left(\mathbf{u}_{i j}^*\right)= \begin{cases}0, & \text { if } \max \left(\mathbf{u}_{i j}^*\right) \leq 0, \\ l, & \text { if } \max \left(\mathbf{u}_{i j}^*\right)=u_{i j, l}^*>0,\end{cases}
$$
where each element of $\mathbf{u}_{i j}^*$ is generated from a truncated standard normal distribution restricted to $(-\infty, 0)$ if $u_{i j}=0$; if $u_{i j}=l$, the elements of $\mathbf{u}_{i j}^*$ are drawn from a standard normal distribution such that $\max \left(\mathbf{u}_{i j}^*\right)=u_{i j, l}^*>0$.
\end{frame}
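% Added worked-example frame; the vectors below are hypothetical numbers.
\begin{frame}
\textbf{Example: Multinomial Probit Link}
\vspace{0.5cm}

Suppose $u_{i j}$ has $Q_j=3$ categories, so $\mathbf{u}_{i j}^* \in \mathbb{R}^2$. If $\mathbf{u}_{i j}^*=(-0.5,-0.1)^{\top}$, then $\max \left(\mathbf{u}_{i j}^*\right) \leq 0$ and hence $u_{i j}=0$. If $\mathbf{u}_{i j}^*=(-0.3,1.2)^{\top}$, then $\max \left(\mathbf{u}_{i j}^*\right)=u_{i j, 2}^*=1.2>0$ and hence $u_{i j}=2$.
\end{frame}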
\begin{frame}
\textbf{Algorithm 2: DP Synthetic Data Generation for Mixed-type Data}\\[0.3cm]
\textbf{Input:} original data $\boldsymbol{X}$, privacy budget $\epsilon$ and the number of factors $r$;\\
\textbf{Output:} synthesized data $\widetilde{\boldsymbol{X}}$;
1: Construct the link functions (3) based on $\boldsymbol{X}$ and generate continuous data $\boldsymbol{X}^*$ via the link functions (3);
2: Normalize each sample on the data $\boldsymbol{X}^*$, still denoted as $\boldsymbol{X}^*$;
3: Execute Algorithm 1 on the data $\boldsymbol{X}^*$ and obtain continuous synthetic data $\widetilde{\boldsymbol{X}}^*$;
4: Transform the data $\widetilde{\boldsymbol{X}}^*$ to mixed-type synthetic data $\widetilde{\boldsymbol{X}}$ via the link functions (3);
5: Return: Synthesized data $\widetilde{\boldsymbol{X}}$.
\end{frame}
% ---------------% --------------------------------% ---------------
%结束页
\begin{frame}
\frametitle{Thank You!}
\begin{center}
i\_chensihuo\_888 \\
% \*\*\*chen888cc@163.com\\
\end{center}
\nocite{*}
\bibliographystyle{apalike}
\bibliography{references}
\end{frame}
\end{document}
3. Communication and feedback
1. The LaTeX code compiles successfully once the image file name is changed.
2. If you need the complete files (e.g., PDF, images), please leave your email address in a comment.