NeurIPS 2019 Federated Learning Accepted Paper

Improving Federated Learning Personalization via Model Agnostic Meta Learning

Problems

FL applications generally face non-i.i.d. and unbalanced data across devices, which makes it challenging to ensure good performance on every device with a single FL-trained global model.

Contribution

  • The popular FL algorithm, Federated Averaging (FedAvg), can be interpreted as a meta-learning algorithm.

  • Careful fine-tuning can yield a global model with higher accuracy that is at the same time easier to personalize. However, optimizing solely for global model accuracy yields a weaker personalization result.

  • A model trained using a standard datacenter optimization method is much harder to personalize than one trained using Federated Averaging, supporting the first claim.

Existing Methods

Existing work on FL personalization directly takes the converged initial model and evaluates personalization via gradient descent.

Idea

We refer to a trained global model as the initial model, and the locally adapted model as the personalized model.

Objectives

  1. Improved Personalized Model: for a large majority of the clients.
  2. Solid Initial Model: some clients have limited or even no data for personalization.
  3. Fast Convergence: reach a high-quality model in a small number of training rounds.

Method

Definition:

For each client $i$, define its local loss function as $L_i(\theta)$.

Let $g_j^i$ be the gradient computed in the $j^{th}$ iteration during a local gradient-based optimization process. The FedSGD gradient is
$$g_{FedSGD} = \frac{-\beta}{T}\sum_{i=1}^T \frac{\partial L_i(\theta)}{\partial \theta} = \frac{1}{T} \sum_{i=1}^T g_1^i$$

After $K$ local steps, client $i$ holds
$$\theta_K^i = U_K^i(\theta) = \theta - \beta \sum_{j=1}^K g_j^i = \theta - \beta \sum_{j=1}^K \frac{\partial L_i(\theta_j)}{\partial \theta}$$

The Jacobian of this update is
$$\frac{\partial U_K^i(\theta)}{\partial \theta} = I - \beta \sum_{j=1}^K \frac{\partial^2 L_i(\theta_j)}{\partial \theta^2}$$

The MAML gradient is
$$g_{MAML} = \frac{\partial L_{MAML}}{\partial \theta} = \frac{1}{T} \sum_{i=1}^T \frac{\partial L_i(U_K^i(\theta))}{\partial \theta} = \frac{1}{T} \sum_{i=1}^T L_i'(U_K^i(\theta)) \left(I - \beta \sum_{j=1}^K \frac{\partial^2 L_i(\theta_j)}{\partial \theta^2}\right)$$

MAML requires computing second-order gradients, which can be computationally expensive and can create infeasible memory requirements.

$$g_{FedAvg} = \frac{1}{T}\sum_{i=1}^T \sum_{j=1}^K g_j^i = \frac{1}{T}\sum_{i=1}^T g_1^i + \sum_{j=1}^{K-1}\frac{1}{T}\sum_{i=1}^T g^i_{j+1} = g_{FedSGD} + \sum_{j=1}^{K-1} g_{FOMAML}(j)$$
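
The decomposition of $g_{FedAvg}$ can be checked numerically. The following is a minimal sketch, assuming a one-dimensional quadratic loss $L_i(\theta) = \frac{1}{2}(\theta - c_i)^2$ per client and treating $g_j^i$ as the raw gradient at local step $j$ (the step-size factor is left out). The function name `local_gradients` and the values in `centers` are hypothetical and chosen only for illustration.

```python
def local_gradients(theta, center, num_steps, lr):
    """Run num_steps of gradient descent on 0.5 * (theta - center)**2
    and return the gradient observed at every local step."""
    grads = []
    for _ in range(num_steps):
        g = theta - center      # dL/dtheta for the assumed quadratic loss
        grads.append(g)
        theta = theta - lr * g  # local update with step size lr (beta)
    return grads


def mean(values):
    return sum(values) / len(values)


centers = [-1.0, 0.5, 2.0]      # hypothetical per-client optima
theta0, K, beta = 0.0, 4, 0.1

per_client = [local_gradients(theta0, c, K, beta) for c in centers]

# Average over clients of the summed local gradients.
g_fedavg = mean([sum(g) for g in per_client])
# First local gradient only: the FedSGD term.
g_fedsgd = mean([g[0] for g in per_client])
# Gradients at steps 2..K: the first-order MAML terms.
g_fomaml = [mean([g[j] for g in per_client]) for j in range(1, K)]

print(g_fedavg, g_fedsgd + sum(g_fomaml))
```

With $K = 1$ the list `g_fomaml` is empty and FedAvg reduces to FedSGD, which matches the formula.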

  1. Run $FedAvg(E)$ with momentum SGD as the server optimizer and a relatively large $E$.

  2. Switch to $Reptile(K)$ with Adam as the server optimizer to fine-tune the initial model.

  3. Conduct personalization with the same client optimizer used during training.

Think Locally, Act Globally: Federated Learning with Local and Global Representations

Problem

  1. Efficiency

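    Having local models extract useful low-dimensional representations means the global model needs fewer parameters. This reduces the number of parameters and updates that must be communicated to and from the global model, and with it the communication bottleneck.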

  2. Heterogeneity

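    Real-world data are often heterogeneous (coming from different sources). New devices may contain data sources never observed during training, such as images from different domains or text in different styles on personalized mobile devices. Local representations allow new device data to be processed with specialized encoders depending on their source modality, instead of a single global model that might not generalize to new modalities and distributions. We show that our model learns personalized mood predictors from real-world private mobile data and better handles heterogeneous data never seen during training.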

  3. Fairness

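    Real-world data often contain sensitive attributes, and recent work has shown that these attributes can be recovered from data representations without access to the data itself. We show that local models can be modified to learn fair representations that obfuscate protected attributes such as race, age, and gender, which is crucial for preserving the privacy of on-device data.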

Idea

$(X_m, Y_m)$ represents the data on device $m$.

$H_m$ are the local representations learned via the local model $\ell_m(\cdot;\theta_m^\ell): x \rightarrow h$.

(Optional) auxiliary models $a_m(\cdot;\theta_m^a): h \rightarrow z$.

$g(\cdot;\theta^g): h \rightarrow y$ is the global model.

AGG is an aggregation function over local updates to the global model.
$$\theta_m^\ell \leftarrow \theta_m^\ell - \eta \nabla_{\theta_m^\ell} \mathcal{L}^g_m (\theta_m^\ell,\theta_m^g)$$

$$\theta_m^g \leftarrow \theta_m^g - \eta \nabla_{\theta_m^g} \mathcal{L}^g_m (\theta_m^\ell,\theta_m^g)$$

$$\theta^{g(t+1)} \leftarrow \sum_{m=1}^M \frac{N_m}{N}\theta_m^{g(t+1)}$$
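
The aggregation step can be sketched in a few lines. This is an illustration only, assuming parameters are plain Python lists and that only the per-device copies of the global parameters are sent to the server while the local parameters stay on the device. The name `aggregate_global` and the example numbers are hypothetical.

```python
def aggregate_global(device_params, device_sizes):
    """Weighted average of the per-device global parameters.

    device_params[m] is device m's copy of the global parameters after its
    local update, and device_sizes[m] is N_m, its number of examples.
    """
    total = sum(device_sizes)  # N
    dim = len(device_params[0])
    return [
        sum(n / total * params[d] for params, n in zip(device_params, device_sizes))
        for d in range(dim)
    ]


# Three devices with a 2-dimensional global parameter vector each.
theta_g_by_device = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
num_examples = [10, 30, 60]

theta_g_next = aggregate_global(theta_g_by_device, num_examples)
print(theta_g_next)
```

Devices with more data pull the average toward their own parameters, which is the role of the $N_m / N$ weights.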

Theoretical Analysis

  1. purely local models do not suffer from device variance but suffer from data variance

  2. the opposite holds true for purely global models

  3. having both local and global models achieves a balance between both desiderata

Goal: train a network $f_{\widehat{u}}$ with weights $\widehat{u} \in \mathbb{R}^d$.

To adapt this setting to federated learning, we assume that all devices share some underlying structure (e.g. natural syntactic and semantic structures in text) while also displaying personalization across users (e.g. personalized vocabularies and writing styles).

The global feature vector $v$ represents shared features across devices.

The local features $r_m$ represent differences across devices.

The labels on device $m$ are generated by a local teacher with weights:
$$u_m = v + r_m \in \mathbb{R}^d$$

Each $r_m \sim \mathcal{N}(0,\rho^2 I)$ is an independent draw from a $d$-dimensional Gaussian with covariance $\rho^2 I$.

$\rho^2$ represents device variance: with higher $\rho^2$, the local features differ more, representing more personalized targets across devices.
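
The teacher construction can be simulated directly. This is a minimal sketch that uses Python's standard `random` module; the names `sample_teachers` and `spread`, the seed, and the example vector `v` are assumptions made for illustration.

```python
import random


def sample_teachers(v, rho, num_devices, seed=0):
    """Draw u_m = v + r_m with r_m ~ N(0, rho^2 I) for each device."""
    rng = random.Random(seed)
    return [[v_k + rng.gauss(0.0, rho) for v_k in v] for _ in range(num_devices)]


def spread(teachers, v):
    """Mean squared distance of the teacher weights from the shared vector v."""
    total = sum((u_k - v_k) ** 2 for u in teachers for u_k, v_k in zip(u, v))
    return total / len(teachers)


v = [1.0, -1.0, 0.5]  # hypothetical shared feature vector
low = sample_teachers(v, rho=0.5, num_devices=50)
high = sample_teachers(v, rho=2.0, num_devices=50)

print(spread(low, v), spread(high, v))
```

A larger `rho` moves the teachers further from `v`, which corresponds to more personalized targets on each device.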

Private Federated Learning with Domain Adaptation

Problems

  • Introducing “noise” in the training process (inputs, parameters, or outputs) makes it difficult to determine whether any particular data point was used to train the model. While this noise ensures $\epsilon$-differential privacy for the data point, it can degrade the accuracy of model predictions.
  • There exists a large body of work on domain adaptation in non-FL systems. In domain adaptation, a model trained on a data set from a source domain is further refined to adapt to a data set from a different target domain.

Idea

$M_G$ is a general model with parameters $\Theta_G$, and $\widehat{y}_G = M_G(x,\Theta_G)$.

$M_G$ is shared between all parties and is trained on all data using FL with differentially private SGD (1), enabling each party to contribute to training the general model.
Let $M_{P_i}$ be a private model of party $i$, parameterized by $\Theta_{P_i}$, with $\widehat{y}_{P_i} = M_{P_i}(x,\Theta_{P_i})$.

$M_{P_i}$ could have a different architecture from $M_G$.

$$\widehat{y}_i = \alpha_i(x) M_G(x,\Theta_G) + (1-\alpha_i(x)) M_{P_i}(x,\Theta_{P_i})$$
$\alpha_i(x)$ is called a gating function in the MoE literature.
$$\alpha_i(x) = \sigma(w_i^T \cdot x + b_i)$$
The gating function learns whether to trust the general model or the private model more for a given input, so the private model $M_{P_i}$ needs to perform well only on the subset of points for which the general model fails.
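
The gated mixture can be sketched as follows. This is an illustration under simplifying assumptions: both models are plain Python callables that return a scalar, and the names `gate` and `mixed_prediction` as well as the example models are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(x, w, b):
    """Gating function alpha_i(x) = sigmoid(w_i^T x + b_i)."""
    return sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)


def mixed_prediction(x, general_model, private_model, w, b):
    """Blend the general and private predictions with weight alpha_i(x)."""
    alpha = gate(x, w, b)
    return alpha * general_model(x) + (1.0 - alpha) * private_model(x)


# Hypothetical stand-ins for M_G and M_{P_i}.
general_model = lambda x: 1.0
private_model = lambda x: 3.0

x = [0.2, -0.4]
print(mixed_prediction(x, general_model, private_model, w=[0.0, 0.0], b=0.0))
```

With zero gate weights and zero bias the gate outputs 0.5, so both models contribute equally. Training `w` and `b` per party lets the gate shift toward whichever model is more reliable for a given input.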

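This means that users with unusual domains have less influence on the general model, which may strengthen the general model. It can also provide more privacy for the users' data.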

Exploring private federated learning with laplacian smoothing

Problems

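  1. In fields like medical or financial research, sensitive data are collected by different parties, such as hospitals or banks, who are not willing to share their data with others.
  2. Merely decoupling model training from the need for direct access to the raw training data is still not enough to protect sensitive data, because information about it can be revealed by the trained model. An attacker may infer the presence of a particular record in the training set, or may even recover face images from the training set by attacking the released model.
  3. A major problem with differential privacy is the potentially significant loss of utility of the trained model. Recently, Laplacian smoothing (LS) has been shown to be a good choice for reducing variance and avoiding spurious minima in stochastic gradient descent (SGD), and it is therefore promising for improving utility in differentially private learning.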

Idea

In this paper, we apply a utility enhancement scheme based on Laplacian smoothing to differentially private federated learning (DP-Fed-LS), in which parameter aggregation with injected Gaussian noise is improved.
$$w^{k+1} = w^k - \eta A_{\sigma}^{-1} \nabla f_i(w^k)$$

$$A_{\sigma} = I + \sigma L$$
where $A_{\sigma}(i,i) = 1+2\sigma$ and $A_{\sigma}(i,i+1) = -\sigma$.
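
A small numerical sketch of the smoothed update follows. It assumes that $A_\sigma$ is symmetric with wrap-around (periodic) neighbours, so that $A_\sigma(i,i-1) = -\sigma$ as well; the notes above only list the diagonal and the $(i, i+1)$ entry, so this boundary handling is an assumption. The linear system is solved with plain Gaussian elimination for clarity, and all function names are hypothetical.

```python
def smoothing_matrix(dim, sigma):
    """A_sigma: 1 + 2*sigma on the diagonal, -sigma on both wrap-around neighbours."""
    A = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        A[i][i] += 1.0 + 2.0 * sigma
        A[i][(i + 1) % dim] += -sigma
        A[i][(i - 1) % dim] += -sigma
    return A


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ls_sgd_step(w, grad, lr, sigma):
    """One step w - lr * A_sigma^{-1} grad."""
    smoothed = solve(smoothing_matrix(len(w), sigma), grad)
    return [w_k - lr * s_k for w_k, s_k in zip(w, smoothed)]


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


noisy_grad = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]
smoothed = solve(smoothing_matrix(len(noisy_grad), 1.0), noisy_grad)
print(variance(noisy_grad), variance(smoothed))
```

Under these assumptions every row of $A_\sigma$ sums to one, so smoothing keeps the sum of the gradient entries while damping entry-to-entry fluctuations. Setting `sigma=0` recovers plain SGD.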

Real-World Image Datasets for Federated Learning

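Federated learning is a new machine learning paradigm that allows data parties to build machine learning models collaboratively while keeping their data secure and private. Although research efforts on federated learning have been growing over the past two years, most existing work still relies on existing public datasets and artificial partitions to simulate data federations, owing to the lack of high-quality labeled data generated by real-world edge applications. Consequently, progress on benchmarking and model evaluation for federated learning has lagged behind.

In this paper, we introduce a real-world image dataset. The dataset contains more than 900 images generated by 26 street cameras, covering 7 object categories annotated with detailed bounding boxes. The data distribution is non-IID and unbalanced, reflecting typical real-world federated learning scenarios. Based on this dataset, we implemented two mainstream object detection algorithms (YOLO and Faster R-CNN) and provide an extensive benchmark of model performance, efficiency, and communication in a federated learning setting.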

MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling

Problems

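Although a densely connected network topology ensures convergence in fewer iterations, each iteration incurs more communication time and delay, which results in longer overall training time.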

Idea

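MATCHA uses matching decomposition sampling of the base topology to parallelize information exchange between workers, significantly reducing communication delay.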

The proposed method works as follows. Given the base communication graph, we decompose it into disjoint subgraphs (in particular, matchings, in order to allow parallel communications). Then, at each communication round, we carefully sample a subset of these matchings to construct a sparse subgraph of the base topology. Worker nodes are synchronized only through the activated topology.
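
The two stages, decomposition and sampling, can be sketched as below. This is not the paper's exact procedure: the decomposition here is a simple greedy pass rather than an optimal edge colouring, and every matching is activated with the same fixed probability, whereas the method described above samples the matchings carefully. The function names and the example graph are hypothetical.

```python
import random


def greedy_matchings(edges):
    """Split edges into matchings: within a matching no node appears twice."""
    matchings = []
    for u, v in edges:
        for matching in matchings:
            if all(u not in e and v not in e for e in matching):
                matching.append((u, v))
                break
        else:
            # No existing matching can take this edge, so open a new one.
            matchings.append([(u, v)])
    return matchings


def sample_topology(matchings, activation_prob, rng):
    """Activate each matching independently and return the active edges."""
    active = []
    for matching in matchings:
        if rng.random() < activation_prob:
            active.extend(matching)
    return active


# Base graph: a 4-cycle with one diagonal.
base_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
matchings = greedy_matchings(base_edges)

rng = random.Random(0)
round_edges = sample_topology(matchings, activation_prob=0.5, rng=rng)
print(matchings, round_edges)
```

Edges inside one matching share no nodes, so all of them can communicate in parallel. The per-round delay then scales with the number of activated matchings rather than with the node degree.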

Benefit

  1. MATCHA provides a highly flexible communication scheme among nodes.
  2. MATCHA gets a 50x reduction in communication delay per iteration, and up to 5x reduction in wall-clock time to achieve the same training accuracy.

Active Federated Learning

Problems

Due to privacy concerns, users may not want to transmit data from their personal devices, making such centralized training impossible.

Federated Learning enables the training of models on this data, but transmission costs between the server and the client are high, and reducing these costs is important.

Idea

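In this paper, we introduce Active Federated Learning (AFL), which preferentially trains on users who are more beneficial to the model during that training iteration. Inspired by active learning, we propose using a value function that can be evaluated on the user's device and returns a valuation to the server indicating the likely utility of training on that user. The server collects these valuations and converts them into probabilities with which the next cohort of users is selected for training. Using a simple value function related to the loss that the user's data suffers under the current model, we reduce the number of training rounds required for the model to reach a given accuracy level by 20-70%.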

Sample algorithm

Input: client valuations $\{v_1, ..., v_K\}$, tuning parameters $\alpha_1, ..., \alpha_3$, number of clients per round $m$

Output: client indices $\{k_1, ..., k_m\}$

Sort users by $v_k$

For the $\alpha_1 K$ users with smallest $v_k$, set $v_k = -\infty$

for $k$ from $1$ to $K$ do

    $p_k \propto e^{\alpha_2 v_k}$

end

Sample $(1 - \alpha_3)m$ users according to their $p_k$, producing set $S'$

Sample $\alpha_3 m$ from the remaining users uniformly at random, producing set $S''$

return $S = S' \cup S''$
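
The sampling procedure above can be sketched in Python. Several details are assumptions rather than part of the pseudocode: $\alpha_1 K$ is truncated to an integer, $(1 - \alpha_3)m$ is rounded, $\alpha_2 \ge 0$, $m \le K$, and the weighted draw is done without replacement one user at a time. The function name `afl_sample` and the example valuations are hypothetical.

```python
import math
import random


def afl_sample(valuations, alpha1, alpha2, alpha3, m, rng):
    """Pick m client indices: most by valuation-based probability, the rest uniformly."""
    K = len(valuations)
    order = sorted(range(K), key=lambda k: valuations[k])
    dropped = set(order[:int(alpha1 * K)])  # valuations treated as -infinity

    # Unnormalised p_k; subtracting the maximum keeps exp() from overflowing.
    v_max = max(valuations)
    weights = [
        0.0 if k in dropped else math.exp(alpha2 * (valuations[k] - v_max))
        for k in range(K)
    ]

    num_weighted = int(round((1 - alpha3) * m))
    candidates = [k for k in range(K) if weights[k] > 0.0]
    chosen = []
    for _ in range(min(num_weighted, len(candidates))):
        pick = rng.choices(candidates, weights=[weights[k] for k in candidates])[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement

    # Fill the remaining slots uniformly from everyone not yet chosen.
    remaining = [k for k in range(K) if k not in chosen]
    chosen += rng.sample(remaining, m - len(chosen))
    return chosen


valuations = [0.2, 1.5, 0.7, 3.0, 0.1, 2.2, 0.9, 1.1, 0.4, 2.8]
cohort = afl_sample(valuations, alpha1=0.3, alpha2=1.0, alpha3=0.25, m=4,
                    rng=random.Random(0))
print(cohort)
```

The uniform share controlled by `alpha3` means that low-valuation users can still be selected occasionally, which gives them a chance to refresh their valuations.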

Loss valuation

$$v_k = \frac{1}{\sqrt{n_k}} l(x_k, y_k; w)$$

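This valuation is already computed during model training, and it increases as the model performs worse on the client's data. Moreover, it mimics common resampling techniques when the required structure is present in the data. If there is extreme class imbalance and weak separation between classes, data points from the minority class will have significantly higher loss than those from the majority class. Users with more minority-class data are therefore preferred, which mimics resampling of the minority class. Similarly, if noise depends on the distance to the classification boundary, using the loss replicates margin-based resampling techniques. Finally, if all data points are equally valuable, users with more data receive higher valuations. Most importantly, these adaptations to the data do not require the practitioner to know which specific structure is being exploited. This is especially important in the federated setting, where information about the data is limited.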

Conclusion

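In this paper we proposed Active Federated Learning (AFL), the first user cohort selection technique for FL that actively adapts to the state of the model and the data on each client. This adaptation allows us to train models with 20% to 70% fewer iterations for the same performance. Providing formal privacy guarantees is essential future work, but there are many other interesting extensions.

The experiments were conducted under simplified conditions that do not account for many of the problems federated learning faces in practice, some of which AFL may be able to mitigate. For example, clients may have different rates of availability for training. This availability may be correlated with the data on the client, which would bias our models if left uncorrected. A reliability-aware AFL could reduce this bias by increasing the rate of training on unreliable users. Another challenge is that clients continually collect (and possibly lose) data, and in many cases the distribution may be non-stationary. Maintaining the benefits of AFL may require a principled approach to ensuring that no user goes too long without refreshing their valuation. Finally, our experiments and analysis focused on the classification setting, but loss-based value functions can be used for any supervised problem, and understanding AFL with more complex models would be an interesting research direction.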

Overcoming forgetting in federated learning on non-IID data

Abstract

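We tackle the problem of federated learning in the non-i.i.d. case, in which local models drift apart and inhibit learning. Building on an analogy with lifelong learning, we adapt a solution for catastrophic forgetting to federated learning. We add a penalty term to the loss function that compels all local models to converge to a shared optimum. We show that this can be done efficiently in terms of communication (without adding further privacy risks) and that it scales with the number of nodes in the distributed setting. Our experiments show that this method outperforms competing methods for image recognition on the MNIST dataset.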

Problems

Federated Learning poses three challenges that make it different from traditional distributed learning.

The first one is the number of computing stations, which can be in the hundreds of millions.

The second is much slower communication compared to the inter-cluster communication found in data centers.

The third difference, on which we focus in this work, is the highly non-i.i.d. manner in which the data may be distributed among the devices.

Overcoming Forgetting in Sequential Lifelong Learning and in Federated Learning

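There is a deep parallel between the federated learning problem and another fundamental machine learning problem called lifelong learning (and the related multi-task learning).

In lifelong learning, the challenge is to learn task A and then continue on to learn task B using the same model, but without "forgetting", that is, without severely hurting performance on task A. More generally, the challenge is to learn tasks A1, A2, ... in sequence without forgetting previously learned tasks for which examples are no longer presented.

Thus, apart from learning tasks serially rather than in parallel, lifelong learning sees each task only once, whereas federated learning has no such restriction.

Beyond these differences, however, the two paradigms share a common main challenge: how to learn a task without disturbing different tasks learned on the same model.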

Elastic Weight Consolidation (EWC)

EWC aims to prevent catastrophic forgetting when moving from learning task A to learning task B.

The idea is to identify the coordinates in the network parameters $\theta$ that are the most informative for task A, and then, while task B is being learned, penalize the learner for changing these parameters.

The basic assumption is that deep neural networks are over-parameterized enough that there is a good chance of finding an optimal solution $\theta_B^*$ to task B in the neighborhood of the previously learned $\theta_A^*$.

$$\widetilde{L}(\theta) = L_B(\theta) + \lambda(\theta-\theta_A^*)^T \mathrm{diag}(\mathcal{I}^*_A)(\theta - \theta_A^*)$$
This is due to:
$$\log p(\theta | D_A \text{ and } D_B) = \log p(D_B|\theta) + \log p(\theta|D_A) - \log p(D_B)$$
We also get a non-Bayesian interpretation:
$$\widetilde{L}(\theta) \approx L_B(\theta) + \frac{1}{2}(\theta-\theta_A^*)^T H_{L_A} (\theta - \theta_A^*) \approx L_B(\theta) + L_A(\theta)$$
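
The EWC objective is easy to write down once the diagonal importance weights are available. The sketch below assumes parameters are plain lists and that the diagonal of $\mathcal{I}^*_A$ (the Fisher information in EWC) has already been estimated. The name `ewc_loss`, the toy task-B loss, and the numbers are hypothetical.

```python
def ewc_loss(theta, task_b_loss, theta_a_star, fisher_diag, lam):
    """Task-B loss plus the quadratic penalty for moving away from theta_A*."""
    penalty = sum(
        f * (t - a) ** 2 for f, t, a in zip(fisher_diag, theta, theta_a_star)
    )
    return task_b_loss(theta) + lam * penalty


# Toy task B: squared distance to the point (1, 1).
def task_b_loss(theta):
    return sum((t - 1.0) ** 2 for t in theta)


theta_a_star = [0.0, 0.0]
fisher_diag = [10.0, 0.0]  # coordinate 0 matters for task A, coordinate 1 does not

move_important = ewc_loss([1.0, 0.0], task_b_loss, theta_a_star, fisher_diag, lam=1.0)
move_free = ewc_loss([0.0, 1.0], task_b_loss, theta_a_star, fisher_diag, lam=1.0)
print(move_important, move_free)
```

Both candidate points have the same task-B loss, but moving the coordinate that is important for task A is penalized, so the optimizer is steered toward changing the unimportant one.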

FedCurv

At round $t$, each node $s \in S$ optimizes:
$$\widetilde{L}_{t,s}(\theta) = L_s(\theta) + \lambda \sum_{j \in S \backslash s} (\theta - \widehat{\theta}_{t-1,j})^T \mathrm{diag}(\widehat{\mathcal{I}}_{t-1,j})(\theta - \widehat{\theta}_{t-1,j})$$
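
The FedCurv objective extends the single-task penalty to a sum over all other nodes. The following sketch assumes each node's previous-round parameters and diagonal importance estimates are available in dictionaries keyed by node id; the name `fedcurv_loss`, the dictionary layout, and the toy values are hypothetical.

```python
def fedcurv_loss(theta, local_loss, node, prev_thetas, prev_fishers, lam):
    """Local loss of `node` plus penalties toward every other node's last model."""
    penalty = 0.0
    for j, theta_j in prev_thetas.items():
        if j == node:
            continue  # the sum runs over S \ s
        penalty += sum(
            f * (t - p) ** 2 for f, t, p in zip(prev_fishers[j], theta, theta_j)
        )
    return local_loss(theta) + lam * penalty


# Toy local loss for node "a": squared distance to the point (2, 2).
def local_loss(theta):
    return sum((t - 2.0) ** 2 for t in theta)


prev_thetas = {"a": [2.0, 2.0], "b": [0.0, 0.0], "c": [1.0, 1.0]}
prev_fishers = {"a": [1.0, 1.0], "b": [3.0, 0.0], "c": [0.0, 2.0]}

value = fedcurv_loss([2.0, 2.0], local_loss, "a", prev_thetas, prev_fishers, lam=0.5)
print(value)
```

At its own previous optimum, node "a" has zero local loss, and the remaining value comes entirely from the penalties toward nodes "b" and "c". This is the term that pulls the local models toward a shared optimum.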
