[Paper Close Reading] Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results

Paper page: Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results - ScienceDirect

Full title: Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results

Code: https://github.com/xxlya/Fed_ABIDE

The English is typed entirely by hand! It is my summarizing and paraphrasing of the original paper, so unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments! This post is closer to personal notes, so read with caution.

Contents

1. Takeaways

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related work

2.3.1. Federated learning

2.3.2. Domain adaptation

2.4. Methods

2.4.1. Basic privacy-preserving federated learning setup

2.4.2. Boosting multi-site learning with domain adaptation

2.5. Experiments and results

2.5.1. Data

2.5.2. Federated training setup and hyper-parameters discussion

2.5.3. Comparisons with different strategies

2.5.4. Evaluate model from interpretation perspective

2.5.5. Limitation and discussion

2.6. Conclusion

3. Supplementary knowledge

3.1. Differential privacy

3.2. L1 sensitivity

4. Reference


1. Takeaways

(1)To be fair, the FL comparison table is really satisfying...

(2)Maybe it is just that the paper is fairly early, but the method feels like simply an MLP (6105-16-2) plus domain-adversarial learning with 2 losses plus a cross-entropy loss

(3)Using FC directly to represent biomarkers, seriously? Where does it show that these are actually learned? I feel the authors did not explain this in much detail

2. Section-by-section close reading

2.1. Abstract

        ①They propose two domain adaptation methods

2.2. Introduction

        ①Healthcare systems are unwilling to share data, partly out of concern about competitors poaching customers and losing patients (really?)

        ②Data distributions differ across sites:

2.3. Related work

2.3.1. Federated learning

        ①Two families of FL methods: a) only model parameters are shared, b) information is transferred between parties via encryption techniques. The authors adopt the first one

2.3.2. Domain adaptation

        ①Lists some domain adaptation methods by citation and points out that they do not use FL

2.4. Methods

2.4.1. Basic privacy-preserving federated learning setup

(1)Problem definition

        ①The data at the i-th site are denoted by the matrix D_i (it is fMRI data, hence a matrix)

        ②N sites \left \{ F_1,...,F_N \right \}, each institution owning its own private fMRI data

        ③The feature space is denoted by X (extracted fMRI features), the label space by Y (the diagnosis or phenotype to be predicted), and the sample ID space by I

        ④Data distribution:

X_i=X_j,Y_i=Y_j,I_i\neq I_j,\forall D_i,D_j,i\neq j

        ⑤FL:

(2)Privacy-preserving decentralized training

        ①Cross entropy loss in this FL:

\mathcal{L}_{ce}^n=-\sum_{n_i}\left[y_{n_i}\log\left(p_{n_i}\right)+\left(1-y_{n_i}\right)\log\left(1-p_{n_i}\right)\right]

where y_{n_i} denotes the label of the i-th subject in the n-th site, Y_n=\left \{ y_{n_1},...,y_{n_{\left | Y_n \right |}} \right \}, and p_{n_i} is the predicted probability

        ②Training process:
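As a concrete illustration of this training process, here is a minimal FedAvg-style sketch (my own code, not the released Fed_ABIDE implementation; the per-site DataLoaders and the round count are assumptions). Each site minimizes the cross-entropy loss above on its private data, and only model weights ever leave a site:

```python
# FedAvg-style sketch of the decentralized training described above.
# Assumptions (not from the paper's code): each site holds its own DataLoader,
# all sites share the same MLP architecture, and only weights are exchanged.
import copy
import torch
import torch.nn as nn

class SiteMLP(nn.Module):
    """The 6105-16-2 MLP mentioned later in Section 2.5.2."""
    def __init__(self, in_dim=6105, hidden=16, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, x):
        return self.net(x)

def local_update(model, loader, lr=1e-5, steps=60):
    """One round of local training at a single site (cross-entropy loss)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model.state_dict()

def federated_average(state_dicts):
    """Server side: average the weights uploaded by all sites."""
    avg = copy.deepcopy(state_dicts[0])
    for k in avg:
        avg[k] = torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
    return avg

# usage (loaders_per_site is a hypothetical list of per-site DataLoaders):
# global_model = SiteMLP()
# for rnd in range(50):
#     states = [local_update(copy.deepcopy(global_model), ld) for ld in loaders_per_site]
#     global_model.load_state_dict(federated_average(states))
```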

(3)Randomized mechanism for privacy protection

        ①For a deterministic real-valued function h:D\to\mathbb{R}^m, the L1 sensitivity s_h of h is:

s_h=\max_{\left\|D-D'\right\|_1=1}\|h\left(D\right)-h\left(D^{\prime}\right)\|_{1}

where \left \| D-D' \right \|_1=1 means D and D' differ in exactly one data point. (Why is this written with D and D' rather than something like D_i and D_j? Are these still the datasets of two sites? In standard differential privacy the notation refers to two datasets that differ in a single record, but I am not sure that is what the authors mean here. Looking up differential privacy online, everyone uses the D and D' notation, so I suspect the authors carried it over without adapting it to their setting.)

        ②They define h in their model to be the m weight parameters

        ③Differential privacy:

Pr\left[h\left(D\right)\in S\right]\leq e^{\epsilon}Pr\left[h\left(D^{\prime}\right)\in S\right]

or

Pr\left[h\left(D\right)\in S\right]\leq e^{\epsilon}Pr\left[h\left(D^{\prime}\right)\in S\right]+\delta

The authors never explain what is going on here... What is S? What is Pr? (In standard differential privacy, S is any subset of the possible outputs of h, and Pr is taken over the randomness of the mechanism. A paper need not read like popular science, but symbol definitions should at least be given...)

        ④Gaussian mechanism: add N\left(0,s_{h}^{2}\sigma^{2}\right) noise to h; this gives (\epsilon,\delta)-differential privacy provided \delta\geq\frac{4}{5}\mathrm{exp}\left(-\left(\sigma\epsilon\right)^{2}/2\right) and \epsilon< 1

        ⑤Laplace mechanism: (I originally inserted a few figures of the Laplace distribution here; you can see it is somewhat similar to the Gaussian, just pointier, it also has two parameters, and its formula looks a bit simpler than the Gaussian's:)

They employ the Laplace distribution with scale b:

Lap\left(b\right):=Lap\left(x|b\right)=\frac{1}{2b}\mathrm{exp}\left(-\frac{|x|}{b}\right)

with variance \sigma^{2}=2b^{2}. Lap(s_{h}/\epsilon) noise is added to h, and the resulting differential privacy is (\epsilon,0) (so the authors effectively use only one parameter~). Then, to simplify the discussion, they assume the sensitivity s_h is 1??? Can that simply be assumed? Isn't this the difference between two datasets? My guess is the authors mean that the difference in parameters between two sites is 1?
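A minimal sketch of how the two mechanisms might be applied to the weights a site uploads (not the authors' code; the weights are treated as a flat numpy array, and the sensitivity=1.0 default simply mirrors the simplification questioned above):

```python
# Hedged sketch of the randomized mechanisms: noise is added to the weight
# vector before it leaves the site. sensitivity=1.0 mirrors the paper's
# simplifying assumption discussed above; nothing here is the authors' code.
import numpy as np

def gaussian_mechanism(weights, sigma, sensitivity=1.0):
    """Add N(0, (sensitivity * sigma)^2) noise to every weight."""
    rng = np.random.default_rng()
    return weights + rng.normal(0.0, sensitivity * sigma, size=weights.shape)

def laplace_mechanism(weights, epsilon, sensitivity=1.0):
    """Add Lap(sensitivity / epsilon) noise, i.e. (epsilon, 0)-DP for that sensitivity."""
    rng = np.random.default_rng()
    return weights + rng.laplace(0.0, sensitivity / epsilon, size=weights.shape)

# usage with a hypothetical flattened weight vector w:
# w_noisy = laplace_mechanism(w, epsilon=0.5)
```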

2.4.2. Boosting multi-site learning with domain adaptation

(1)Mixture of experts (MoE) domain adaptation

        ①Experts: the individual deep learning models

        ②MoE: a trainable gating network used with feed-forward neural networks

        ③Domain adaptation strategies with FL:

        ④The final output of their network:

\hat{y}_i=a_i\left(x\right)y_G+\left(1-a_i\left(x\right)\right)y_P

where a_i\left ( x \right ) is the gating function in the MoE; they implement it with a non-linear layer a_{i}\left(x\right)=\sigma\left(\psi_{i}^{T}\cdot x+b_{i}\right), where \sigma is the sigmoid function and \psi _i and b_i are learnable weights (y_G and y_P are the predictions of the global shared model and the site's private local model, respectively)
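A minimal PyTorch sketch of this gated combination (the module name and the two "experts" are placeholders standing in for the global shared model and the site's private model; this is not the released code):

```python
# Sketch of the MoE combination y_hat = a(x) * y_G + (1 - a(x)) * y_P.
# MoEGate and the expert names are illustrative, not the paper's identifiers.
import torch
import torch.nn as nn

class MoEGate(nn.Module):
    def __init__(self, in_dim=6105):
        super().__init__()
        self.gate = nn.Linear(in_dim, 1)          # a(x) = sigmoid(psi^T x + b)

    def forward(self, x, y_global, y_private):
        a = torch.sigmoid(self.gate(x))           # per-sample gating value in (0, 1)
        return a * y_global + (1 - a) * y_private

# usage with two hypothetical expert models producing class scores:
# gate = MoEGate()
# y_hat = gate(x, global_expert(x), private_expert(x))
```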

(2)Adversarial domain alignment

        ①They trained a local feature extractor G_s for the source site D_s

        ②They also trained a local feature generator G_t for the target site D_t

        ③They align the distributions of D_s and D_t by training an adversarial domain discriminator D

        ④G_s and G_t aim to confuse D by adding noise (generating M\circ G_s(x^s) and M\circ G_t(x^t), where M denotes the noise generator), while D aims to identify the domain

        ⑤To discriminate domain:

\begin{aligned}\mathcal{L}_{advD}\left(\mathbf{X}^{S},\mathbf{X}^{T},G_{s},G_{t}\right)&=-\mathbb{E}_{x^{s}\sim\mathbf{X}^{S}}\left[\log D_{s}\left(G_{s}\left(x^{s}\right)\right)\right]\\&\quad-\mathbb{E}_{x^{t}\sim\mathbf{X}^{T}}\left[\log\left(1-D_{s}\left(M\circ G_{t}\left(x^{t}\right)\right)\right)\right]\end{aligned}

        ⑥The loss in the "second step"?? What is the second step? \mathcal{L}_{advD} stays fixed while \mathcal{L}_{advG} is updated? This term did not appear before, so how can it be called an "update"?

\begin{aligned}\mathcal{L}_{advG}\left(\mathbf{X}^{S},\mathbf{X}^{T},G_{s},G_{t}\right)&=-\mathbb{E}_{x^{s}\sim\mathbf{X}^{S}}\left[\log D_{s}\left(G_{s}\left(x^{s}\right)\right)\right]\\&\quad-\mathbb{E}_{x^{t}\sim\mathbf{X}^{T}}\left[\log D_{s}\left(M\circ G_{t}\left(x^{t}\right)\right)\right]\end{aligned}

        ⑦Algorithm (from line 8 of the algorithm it looks like the two losses in ⑤ and ⑥ are simply both used, but the authors do not give a particular explanation, and I am not sure where this was adapted from):
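Below is a rough sketch of how the two losses in ⑤ and ⑥ could alternate within one training step (this is my reading of the algorithm, with assumed MLP architectures and an assumed Gaussian stand-in for the noise generator M; it is not the released code, and G_s is kept frozen here, so only the target-side term of \mathcal{L}_{advG} actually produces gradients):

```python
# Rough sketch of the adversarial alignment step: D_s first learns to tell
# source from target features (L_advD), then G_t is updated to fool it (L_advG).
# Architectures, sizes and the Gaussian "noise generator" M are assumptions.
import torch
import torch.nn as nn

feat_dim = 16
bce = nn.BCELoss()
G_s = nn.Sequential(nn.Linear(6105, feat_dim), nn.ReLU())   # source feature extractor
G_t = nn.Sequential(nn.Linear(6105, feat_dim), nn.ReLU())   # target feature extractor
D_s = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())   # domain discriminator

opt_D = torch.optim.Adam(D_s.parameters(), lr=1e-5)
opt_G = torch.optim.Adam(G_t.parameters(), lr=1e-5)

def apply_noise(h):
    """Stand-in for the noise generator M applied to the target features."""
    return h + 0.01 * torch.randn_like(h)

def align_step(x_src, x_tgt):
    # step 1: update the discriminator with L_advD (source -> 1, target -> 0)
    opt_D.zero_grad()
    d_src = D_s(G_s(x_src).detach())
    d_tgt = D_s(apply_noise(G_t(x_tgt)).detach())
    loss_D = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))
    loss_D.backward()
    opt_D.step()
    # step 2: update the target extractor with L_advG (make D_s output 1 on target)
    opt_G.zero_grad()
    d_tgt = D_s(apply_noise(G_t(x_tgt)))
    loss_G = bce(d_tgt, torch.ones_like(d_tgt))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```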

(3)Evaluate model by interpreting biomarkers

        ①Gradient based method:

g_k^c=ReLU\left(\frac{\partial\hat{y}^c}{\partial x_k}\right)

where c \in \left \{ 0,...,C-1 \right \} denotes the class of interest, \hat{y}^c is the score of class c before the softmax, x_k is the k-th feature of the input, and g^c_k denotes the importance of feature k for classifying class c
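A minimal sketch of this importance score using plain input gradients (the paper uses guided back-propagation, which additionally zeroes out negative gradients at every ReLU in the backward pass; that masking is not reproduced here, and the model/input names are placeholders):

```python
# Sketch of the gradient-based importance score g_k^c = ReLU(d y_hat^c / d x_k).
# Plain input gradients only; guided back-propagation would also mask negative
# gradients at each ReLU during the backward pass.
import torch

def feature_importance(model, x, target_class):
    """ReLU of the gradient of the pre-softmax class score w.r.t. each input feature."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[:, target_class].sum()    # class score before softmax
    score.backward()
    return torch.relu(x.grad)                  # same shape as x: one value per feature

# usage with the SiteMLP sketch above and a hypothetical FC vector fc_vec:
# g_asd = feature_importance(global_model, fc_vec.unsqueeze(0), target_class=1)
```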

2.5. Experiments and results

2.5.1. Data

(1)Participants

        ①Sites chosen: the four largest, UM1, NYU, USM, and UCLA1, with 106, 175, 72, and 71 subjects respectively; after eliminating incomplete samples, 88, 167, 52, and 63 remain

        ②Atlas: Harvard-Oxford (HO) with 111 ROIs

        ③Sliding window: size 32 with stride 1, used to crop the original time series

        ④Sample statistics:

        ⑤Demographic data:

(2)Data preprocessing

        ①FC: Pearson correlation between the averaged ROI time series

        ②The Fisher transformation is applied to the FC matrix

        ③Only the upper triangle of the matrix is kept and flattened into a vector, which is later fed to the MLP (111*(111-1)/2 = 6105 dimensions)
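A small sketch of this preprocessing with a made-up array name (`roi_ts` is assumed to be the (time points × 111) matrix of mean ROI time series for one subject or sliding-window crop):

```python
# Sketch of the FC feature construction in ①-③: Pearson correlation of the
# mean ROI time series, Fisher z-transform, upper triangle flattened to 6105-d.
import numpy as np

def fc_features(roi_ts):
    fc = np.corrcoef(roi_ts.T)                  # (111, 111) Pearson correlation matrix
    fc = np.clip(fc, -0.999999, 0.999999)       # keep arctanh finite (diagonal is 1)
    fc = np.arctanh(fc)                         # Fisher z-transformation
    iu = np.triu_indices_from(fc, k=1)          # upper triangle, excluding the diagonal
    return fc[iu]                               # 111 * 110 / 2 = 6105-dimensional vector

# feats = fc_features(roi_ts)   # this vector is what gets fed to the MLP
```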

2.5.2. Federated training setup and hyper-parameters discussion

        ①MLP: 6105-16-2 (buddy, that is quite a dive in dimensionality)

        ②Cross validation: 5 fold

        ③They define m instances for each subject; if more than m/2 instances are tagged 'ASD', the subject is classified as ASD (where do these instances come from?? Isn't the MLP output 2-dimensional, i.e. a single probability, so how can there be multiple heads? Presumably each sliding-window crop from 2.5.1 is one instance and the per-instance predictions are majority-voted; see the sketch after this list)

        ④Learning rate: 1e-5, halved every 20 epochs, training stops at epoch 50

        ⑤Optimizer: Adam

        ⑥Steps per epoch: 60 (what is this supposed to be?)

        ⑦Batch size: 60

        ⑧Local updating within each epoch depends on the communication pace \tau (and what is this? Does it mean the local model syncs with the server after every \tau training steps? And does \tau need to be a divisor of 60? See the sketch after this list)

        ⑨\tau ablation:

no significant difference

          ⑩Accuracy when adding different amounts of noise under the Gaussian mechanism (L2 norm, \varepsilon_{n}{\sim}N(0,\alpha\sigma)):

the varied quantity is \alpha

        ⑪Adding Laplace noise \varepsilon_{n}\sim Lap\left(\alpha\sigma/\sqrt{2}\right) to the local weights and varying \alpha:
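To make the communication pace in ⑧ and the instance-level voting in ③ concrete, here is a sketch of my interpretation (not the released code); it reuses `local_update` and `federated_average` from the sketch in Section 2.4.1:

```python
# Sketch (my interpretation, not the released code) of (a) the communication
# pace tau: each site runs tau local steps, then the server averages weights,
# repeated until the 60 steps of an epoch are used up; and (b) the subject-level
# majority vote over sliding-window instances.
# Reuses local_update / federated_average from the sketch in Section 2.4.1.
import copy
import numpy as np

def train_epoch_with_pace(global_model, site_loaders, tau=5, steps_per_epoch=60):
    """Alternate tau local steps per site with one round of weight averaging."""
    for _ in range(steps_per_epoch // tau):
        states = [local_update(copy.deepcopy(global_model), loader, steps=tau)
                  for loader in site_loaders]
        global_model.load_state_dict(federated_average(states))
    return global_model

def subject_label(instance_probs_asd, threshold=0.5):
    """Call a subject ASD if more than half of its instances are predicted ASD."""
    votes = np.asarray(instance_probs_asd) > threshold
    return int(votes.sum() > len(votes) / 2)

# subject_label([0.7, 0.4, 0.9])  -> 1 (two of the three window instances voted ASD)
```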

2.5.3. Comparisons with different strategies

        ①Evaluation methods:

        ②Comparison result:

2.5.4. Evaluate model from interpretation perspective

(1)Aligned feature embedding

        ①Visualizing fully connected layer embedding:

(2)MoE Gating value

        ①Gate value in different sites:

(the gate value is a learnable parameter) (buddy, what kind of plot is this? Couldn't it be drawn in 3D? The bars in the back are all blocked)

(3)Neural patterns: Connectivity in the autistic brain

        ①They define "informativeness" as the difference in functional representation between the ASD and HC groups, and "robustness" as the consistency of a biomarker across the 4 sites

        ②They applied the guided back-propagation method to detect the robust biomarkers of HC under the federated model:

        ③ASD biomarker:

        ④Function correlation:

2.5.5. Limitation and discussion

        ①"Although, according to our empirical investigation, the communication pace that controls how frequently weight information is exchanged between the local and global models does not affect classification performance, we cannot conclude that the pace parameter is irrelevant"

        ②The sensitivity of a deep learning model is hard to define

2.6. Conclusion

        ~

3. Supplementary knowledge

3.1. Differential privacy

(1)Definition: Differential privacy is a mathematical framework for quantifying how well individual privacy is protected when data are published or processed by an algorithm. It introduces randomness so that, even if the output is released or analyzed, no particular individual's information can be identified. Concretely, differential privacy requires that for two datasets differing in only one data point (neighboring datasets), the query results have similar probability distributions, so that the presence or absence of any single data point cannot be inferred from observing the output.

(2)It basically feels like adding noise so that individual records become hard to identify. But I still wonder why transmitting neural-network parameters would allow anyone to recover a single person's information in the first place? (The standard Laplace-mechanism guarantee is derived briefly after this list.)

(3)Further reading: a CSDN blog post on the differences between global, local, and smooth sensitivity in differential privacy (全局敏感度,局部敏感度和平滑敏感度到底有什么区别?【差分隐私】)
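As a supplement, here is the standard one-line argument from the differential-privacy literature (not from this paper) for why adding coordinate-wise Lap(s_h/\epsilon) noise \eta to h gives (\epsilon,0)-differential privacy, which is the guarantee quoted in 2.4.1(3): for any fixed output z,

\frac{\Pr\left[h\left(D\right)+\eta=z\right]}{\Pr\left[h\left(D'\right)+\eta=z\right]}=\prod_{j=1}^{m}\exp\left(\frac{\epsilon\left(\left|h\left(D'\right)_j-z_j\right|-\left|h\left(D\right)_j-z_j\right|\right)}{s_h}\right)\leq\exp\left(\frac{\epsilon\left\|h\left(D\right)-h\left(D'\right)\right\|_1}{s_h}\right)\leq e^{\epsilon}

by the triangle inequality and the definition of s_h; summing over any output set S recovers the (\epsilon,0) guarantee.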

3.2. L1 sensitivity

(1)Definition: L1 sensitivity usually refers to how sensitive a parameter or system is to changes in its input, measured in the L1 norm (also known as the Manhattan distance, or the sum of absolute values). It quantifies how much the output changes, in the L1 norm, when the input changes slightly.

(2)Computation: for a single vector, the L1 norm is the sum of the absolute values of its elements; for two vectors, it is the sum of the absolute differences of their corresponding elements:
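\left\|x\right\|_1=\sum_k\left|x_k\right|,\qquad\left\|x-y\right\|_1=\sum_k\left|x_k-y_k\right|

A small worked example (standard, not from the paper): for the counting query h(D) = number of ASD subjects in D, adding or removing one subject changes the count by at most 1, so

s_h=\max_{\left\|D-D'\right\|_1=1}\left|h\left(D\right)-h\left(D'\right)\right|=1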

4. Reference

Li, X. et al. (2020) 'Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results', Medical Image Analysis, 65. doi: https://doi.org/10.1016/j.media.2020.101765
