A Close Reading of the Federated Learning Paper 🚀
Recently, driven by the needs of my research project, I read a paper from start to finish, word by word, for the first time: the founding paper of federated learning, Communication-Efficient Learning of Deep Networks from Decentralized Data.
I translated every sentence, using GPT-4o to assist and adding my own understanding to arrive at the final version, which I record here. If you spot any mistranslations or misunderstandings, feel free to leave a comment and discuss.
Related article in this series: Reproducing the federated learning experiments (MNIST IID experiment, PyTorch)
Federated Learning is a distributed machine learning technique that lets multiple devices or nodes collaboratively train a model without sharing their raw data. The data stays on the local devices, and each node shares only model parameters or gradients rather than sensitive information. In this way federated learning protects user privacy and reduces data transfer, which makes it especially suitable for scenarios with decentralized data and strict privacy requirements, such as personalized services on smartphones, medical data analysis, and forecasting tasks in smart grids.
Paper: https://arxiv.org/abs/1602.05629
Title
Communication-Efficient Learning of Deep Networks from Decentralized Data
In other words: training deep networks on data that stays decentralized, while keeping communication efficient.
Abstract
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device.
Today's mobile devices can access a wealth of data suitable for training models, and such models can in turn greatly improve the user experience on the device.
For example, language models can improve speech recognition and text entry, and image models can automatically select good photos.
For example, language models can improve speech recognition and text entry, and image models can automatically pick out good photos.
However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches.
However, this data is often highly private, very large, or both, which can make it impractical to log it to a data center and train there with conventional approaches.
We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates.
We advocate an alternative: leave the training data distributed on the mobile devices, and learn a shared model by aggregating the updates computed locally on each device.
We term this decentralized approach Federated Learning.
We call this decentralized approach Federated Learning.
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets.
We present a practical method for federated learning of deep networks based on iterative model averaging, and carry out an extensive empirical evaluation covering five model architectures and four datasets.
These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting.
These experiments show that the approach is robust to unbalanced and non-IID data distributions, which are a defining characteristic of this setting.
Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
Communication cost is the principal constraint; compared with synchronized stochastic gradient descent, the required number of communication rounds is reduced by 10–100×.
Synchronized stochastic gradient descent here refers to a scheme in which every node communicates after each mini-batch of training, which requires an enormous amount of communication.
Introduction
Increasingly, phones and tablets are the primary computing devices for many people.
More and more people use phones and tablets as their primary computing devices.
The powerful sensors on these devices (including cameras, microphones, and GPS), combined with the fact they are frequently carried, means they have access to an unprecedented amount of data, much of it private in nature.
These devices carry powerful sensors (cameras, microphones, GPS) and are with their users almost constantly, so they have access to an unprecedented amount of data, much of it private in nature.
Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications
Models learned from such data promise to greatly improve usability by powering more intelligent applications,
but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location.
but the sensitive nature of the data means there are risks and responsibilities in storing it in a centralized location.
We investigate a learning technique that allows users to collectively reap the benefits of shared models trained from this rich data, without the need to centrally store it
We study a learning technique that lets users collectively benefit from shared models trained on this rich data, without having to store the data centrally.
We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices
We call our approach Federated Learning because the learning task is solved by a loose federation of participating devices
(which we refer to as clients) which are coordinated by a central server
(referred to as clients), which are coordinated by a central server.
Each client has a local training dataset which is never uploaded to the server
Each client has a local training dataset that is never uploaded to the server.
Instead, each client computes an update to the current global model maintained by the server, and only this update is communicated.
Instead, each client computes an update to the current global model maintained by the server, and only that update is communicated.
This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on privacy of consumer data
This is a direct application of the principle of focused collection, or data minimization, proposed in the 2012 White House report on consumer data privacy.
Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.
Since these updates exist only to improve the current model, there is no reason to store them once they have been applied.
A principal advantage of this approach is the decoupling of model training from the need for direct access to the raw training data.
A principal advantage of federated learning is that it decouples model training from the need for direct access to the raw training data.
Clearly, some trust of the server coordinating the training is still required.
Of course, some trust in the server that coordinates the training is still required.
However, for applications where the training objective can be specified on the basis of data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud.
However, for applications whose training objective can be specified from the data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to the device alone, rather than the device plus the cloud.
Our primary contributions
1) the identification of the problem of training on decentralized data from mobile devices as an important research direction;
1) Identifying the problem of training on decentralized data from mobile devices as an important research direction;
2) the selection of a straightforward and practical algorithm that can be applied to this setting;
2) selecting a straightforward and practical algorithm that can be applied in this setting;
3) an extensive empirical evaluation of the proposed approach.
3) carrying out an extensive empirical evaluation of the proposed approach.
More concretely, we introduce the FederatedAveraging algorithm.
More concretely, we introduce the FederatedAveraging algorithm,
which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging.
which combines local stochastic gradient descent (SGD) run on each client with a server that averages the resulting model parameters.
We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.
We run extensive experiments with this algorithm, showing that it is robust to unbalanced and non-IID data distributions and that it can reduce the number of communication rounds needed to train a deep network on decentralized data by orders of magnitude.
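To make the algorithm concrete, here is a minimal PyTorch-style sketch of one FederatedAveraging round that I put together while reading. The helper names (`local_update`, `federated_averaging`), the `client_loaders` dictionary, and the hyper-parameter values are my own illustrative choices, not code from the paper.

```python
import copy
import random
from torch import nn, optim

def local_update(global_model, dataloader, epochs=1, lr=0.1):
    """Client side: run a few epochs of plain SGD on the local data
    and return the updated weights (the data itself never leaves the client)."""
    model = copy.deepcopy(global_model)
    opt = optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in dataloader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def federated_averaging(global_model, client_loaders, C=0.1):
    """Server side: sample a fraction C of clients, collect their locally
    trained weights, and average them weighted by local dataset size."""
    m = max(1, int(C * len(client_loaders)))
    selected = random.sample(list(client_loaders), m)
    updates = {k: local_update(global_model, client_loaders[k]) for k in selected}
    sizes = {k: len(client_loaders[k].dataset) for k in selected}
    total = sum(sizes.values())
    new_state = copy.deepcopy(next(iter(updates.values())))
    for name in new_state:
        new_state[name] = sum(
            updates[k][name].float() * (sizes[k] / total) for k in selected
        )
    global_model.load_state_dict(new_state)
    return global_model
```

The key point is that only model weights travel to the server, and each client's contribution is weighted by how much local data it holds.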
Federated Learning properties
Ideal problems for federated learning have the following properties:
1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center.
1) Compared with the proxy data typically available in a data center, training on real-world data from mobile devices has a clear advantage.
2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle)
2) The data is privacy sensitive, or large relative to the size of the model, so it is preferable not to log it to the data center purely for the purpose of model training (in keeping with the focused collection principle).
3) For supervised tasks, labels on the data can be inferred naturally from user interaction.
3) For supervised learning tasks, the labels can be inferred naturally from the user's interactions.
Many models that power intelligent behavior on mobile devices fit the above criteria.
Many of the models that power intelligent behavior on mobile devices fit these criteria.
As two examples, we consider image classification
Two examples: the first is image classification,
for example predicting which photos are most likely to be viewed multiple times in the future, or shared
for example predicting which photos are most likely to be viewed multiple times in the future, or shared;
and language models, which can be used to improve voice recognition and text entry on touch-screen keyboards by improving decoding, next-word-prediction, and even predicting whole replies
and the second is language models, which can improve voice recognition and text entry on touch-screen keyboards through better decoding, next-word prediction, and even prediction of whole replies.
The potential training data for both these tasks (all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.) can be privacy sensitive.
The potential training data for both tasks (all the photos a user takes, and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.) can be privacy sensitive.
The distributions from which these examples are drawn are also likely to differ substantially from easily available proxy datasets
The distributions these examples are drawn from are also likely to differ substantially from easily available proxy datasets:
the use of language in chat and text messages is generally much different than standard language corpora, e.g., Wikipedia and other web documents
the language used in chat and text messages is generally quite different from standard language corpora such as Wikipedia and other web documents;
the photos people take on their phone are likely quite different than typical Flickr photos.
and the photos people take on their phones are likely quite different from typical Flickr photos.
And finally, the labels for these problems are directly available
Finally, the labels for these problems are directly available:
entered text is self-labeled for learning a language model
the text a user enters is itself the label for learning a language model,
and photo labels can be defined by natural user interaction with their photo app (which photos are deleted, shared, or viewed).
and photo labels can be defined by the user's natural interaction with the photo app (which photos are deleted, shared, or viewed).
Privacy
Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data.
Compared with data center training on persisted data, federated learning has distinct privacy advantages: even holding only an "anonymized" dataset can still put user privacy at risk once it is joined with other data.
In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates).
In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, how strong the privacy benefit is depends on the content of those updates).
The updates themselves can (and should) be ephemeral.
The updates themselves can (and should) be ephemeral.
They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less.
By the data processing inequality, the updates can never contain more information than the raw training data, and in general they contain much less.
Further, the source of the updates is not needed by the aggregation algorithm
Furthermore, the aggregation algorithm does not need to know where an update came from,
so updates can be transmitted without identifying meta-data over a mix network such as Tor [7] or via a trusted third party.
so updates can be transmitted without identifying metadata over a mix network such as Tor [7], or via a trusted third party.
We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.
At the end of the paper we briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy.
Federated Optimization
We refer to the optimization problem implicit in federated learning as federated optimization
We call the optimization problem implicit in federated learning federated optimization,
drawing a connection (and contrast) to distributed optimization.
drawing a connection (and contrast) to distributed optimization.
Federated optimization has several key properties that differentiate it from a typical distributed optimization problem
Federated optimization has several key properties that distinguish it from a typical distributed optimization problem.
Non-IID The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user’s local dataset will not be representative of the population distribution.
Non-IID: the training data on a given client is typically based on how a particular user uses their mobile device, so any one user's local dataset will not be representative of the population distribution.
Unbalanced Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
Unbalanced: similarly, some users use a service or app much more heavily than others, leading to widely varying amounts of local training data.
Massively distributed We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
Massively distributed: we expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
Limited communication Mobile devices are frequently offline or on slow or expensive connections.
Limited communication: mobile devices are frequently offline, or on slow or expensive connections.
In this work, our emphasis is on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints.
In this work we focus on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints.
A deployed federated optimization system must also address a myriad of practical issues
A deployed federated optimization system must also deal with a host of practical issues:
client datasets that change as data is added and deleted
client datasets that change as data is added and deleted;
client availability that correlates with the local data distribution in complex ways (e.g., phones from speakers of American English will likely be plugged in at different times than speakers of British English)
client availability that correlates with the local data distribution in complex ways (e.g., phones of American-English speakers will likely be plugged in at different times than phones of British-English speakers);
and clients that never respond or send corrupted updates
and clients that never respond, or that send corrupted updates.
These issues are beyond the scope of the current work; instead, we use a controlled environment that is suitable for experiments
These issues are beyond the scope of this work; instead, we use a controlled environment suitable for experiments,
but still addresses the key issues of client availability and unbalanced and non-IID data.
one that still addresses the key issues of client availability and of unbalanced, non-IID data.
We assume a synchronous update scheme that proceeds in rounds of communication.
That is, we assume a synchronous update scheme that proceeds in rounds of communication.
There is a fixed set of $K$ clients, each with a fixed local dataset.
In other words, there is a fixed set of $K$ clients, each of which holds a fixed local dataset.
At the beginning of each round, a random fraction $C$ of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters).
At the start of each round a random fraction $C$ of the clients is selected, and the server sends the current global algorithm state (for example the current model parameters) to each of them.
We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point.
We select only a fraction of the clients for efficiency, since our experiments show diminishing returns from adding more clients beyond a certain point.
Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server.
Each selected client then performs local computation based on the global state and its local dataset, and sends its update back to the server.
The server then applies these updates to its global state, and the process repeats.
The server then applies these updates to its global state, and the process repeats.
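This round structure maps directly onto a simple driver loop. The sketch below reuses the hypothetical `federated_averaging` helper from the earlier snippet; the round count and client fraction are illustrative values I chose, not numbers from the paper.

```python
# Hypothetical driver for the synchronous scheme described above:
# each communication round samples a fraction C of clients, trains them
# locally on their own data, and folds the weighted average of their
# updated weights back into the global model.
num_rounds = 100  # number of communication rounds (illustrative)
for t in range(num_rounds):
    global_model = federated_averaging(global_model, client_loaders, C=0.1)
```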
While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form
That is, although we focus on non-convex neural network objectives, the algorithm we consider applies to any finite-sum objective of the form below.

$$\min_{w \in \mathbb{R}^d} f(w) \quad \text{where} \quad f(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$

Here $w$ denotes the model weights (parameters); $w \in \mathbb{R}^d$ means $w$ is a $d$-dimensional real-valued vector.
$f(w)$ is the objective function of the optimization problem; the goal is to find the $w$ that makes $f(w)$ as small as possible.
$f(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ says that $f(w)$ is defined as the average of the values of the $n$ per-example functions $f_i(w)$.
For a machine learning problem, we typically take $f_i(w) = \ell(x_i, y_i; w)$, the loss of the prediction on example $(x_i, y_i)$ made with model parameters $w$.
That is, for a machine learning task, $f_i(w) = \ell(x_i, y_i; w)$ is the loss incurred when a model with parameters $w$ makes a prediction on the single example $(x_i, y_i)$.
We assume there are $K$ clients over which the data is partitioned,
that is, the full dataset is partitioned across $K$ clients,
with $\mathcal{P}_k$ the set of indexes of data points on client $k$, and $n_k = |\mathcal{P}_k|$.
$\mathcal{P}_k$ is the set of indices of the individual data samples held by client $k$, and $n_k$ denotes the number of data points on client $k$.
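Combining these definitions, the objective can be rewritten client by client; this is the form the rest of the paper works with, and I reproduce it here only to close the loop (the notation follows the definitions above):

$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \qquad \text{where} \qquad F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w)$$

In words: the global objective is a weighted average of per-client objectives $F_k$, with each client weighted by its share $n_k/n$ of the data, which is exactly the weighting used in the FederatedAveraging sketch earlier.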