联邦学习论文精读🚀
最近由于课题需要,我人生第一次从头到尾、完完全全、一字不落地精读了一篇论文,就是这篇联邦学习的始祖论文:Communication-Efficient Learning of Deep Networks from Decentralized Data
,然后自己也给每一句话都进行了逐字的翻译:先通过GPT-4o辅助翻译,再加上自己的理解,得到最终的翻译版本,在这里记录一下,如有翻译错误或理解有问题的地方,欢迎留言探讨。
系列文章:联邦学习实验复现—MNIST IID实验 pytorch
联邦学习(Federated Learning)是一种分布式机器学习技术,它允许多个设备或节点在不共享原始数据的情况下协作训练模型。数据保留在本地设备上,各节点仅共享模型参数或梯度,而非敏感信息。联邦学习通过这种方式保护用户隐私,减少数据传输,尤其适用于数据分散、隐私要求高的场景,如智能手机中的个性化服务、医疗数据分析以及智能电网的预测任务等。
论文地址:https://arxiv.org/abs/1602.05629
Title 题目
Communication-Efficient Learning of Deep Networks from Decentralized Data
基于分散数据的深度网络通信高效学习
Abstract 摘要
-
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device.
当今的移动通信设备已经可以接触大量适合用于训练模型的数据,这些数据可以极大的提升用户在使用设备时的使用体验。 -
For example, language models can improve speech recognition and text entry, and image models can automatically select good photos.
例如,语言模型可以改善语音识别和文字输入的效果,图像模型可以自动挑选好的照片。 -
However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches.
然而,这些数据经常是非常隐私的,或数量巨大,或者两者皆有的,由于有这两个因素,所以把数据都上传到数据中心并使用传统方法进行训练是可能受到一些阻碍的。 -
We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates.
我们倡导一种替代方案:将训练数据保留在各个移动设备上,通过聚合本地计算出的更新来学习一个共享模型。 -
We term this decentralized approach Federated Learning.
我们将这种去中心化的方法称为联邦学习(Federated Learning)。 -
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets.
我们提出了一种基于迭代模型平均的深度网络联邦学习实用方法,并考虑了5种模型架构和4个数据集,开展了广泛的实证评估。 -
These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting.
这些实验证明,该方法在面对不平衡和非独立同分布(non-IID)的数据分布时是鲁棒的,而这两者正是这种场景下数据的典型特征。 -
Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
通信成本是重要的制约因素,我们和同步随机梯度下降的方法相比,所需的通信轮次减少了10-100倍。
同步随机梯度下降,指的是每一个节点在完成一个mini-batch训练之后都要进行通信,需要消耗巨量的通信资源。
Introduction 引言
-
Increasingly, phones and tablets are the primary computing devices for many people.
越来越多的人将手机和平板作为主要的计算设备。 -
The powerful sensors on these devices (including cameras, microphones, and GPS), combined with the fact they are frequently carried, means they have access to an unprecedented amount of data, much of it private in nature.
这些设备上强大的传感器(包括摄像头、麦克风和GPS),加上它们经常被随身携带,意味着它们可以接触到前所未有的海量数据,其中大部分本质上是私人数据。 -
Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications
在这些数据上学习得到的模型,有望通过支持更智能的应用,极大地提升可用性。 -
but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location.
但是数据的敏感性意味着,将其存储在集中位置存在风险和责任。 -
We investigate a learning technique that allows users to collectively reap the benefits of shared models trained from this rich data, without the need to centrally store it
我们研究的这种技术允许用户可以受益于这些丰富的私人数据的共享模型,而且也不需要在公共的服务器上存储他。 -
We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices
我们将这种方法称为联邦学习,因为学习任务是由参与的设备组成的一个松散联邦共同解决的。 -
(which we refer to as clients) which are coordinated by a central server
这些参与的设备称为客户端(clients),它们由一个中央服务器统筹协调。 -
Each client has a local training dataset which is never uploaded to the server
每个客户端都有一个本地的训练集,而且这个数据集永远不会上传到服务器。 -
Instead, each client computes an update to the current global model maintained by the server,and only this update is communicated。
相反,每个客户端都会针对服务器维护的当前全局模型计算一次更新,并且只有这个更新会被传输。 -
This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on privacy of consumer data
这种方式也是对2012年白宫消费者数据隐私报告提出的集中收集或数据最小化原则的直接应用。 -
Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.
由于这些更新专门用于改进当前的模型,因此一旦应用了这些更新,就没理由再存储他们。 -
A principal advantage of this approach is the decoupling of model training from the need for direct access to the raw training data.
联邦学习的一个主要优点是将训练模型与直接访问的原始训练数据的需求解耦。 -
Clearly, some trust of the server coordinating the training is still required.
当然,对于负责协调训练的中心服务器,仍然需要一定程度的信任。 -
However, for applications where the training objective can be specified on the basis of data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud.
然而,对于那些训练目标可以基于每个客户端上可用数据来指定的应用,联邦学习通过把攻击面限制在设备本身(而不是设备加云端),可以显著降低隐私和安全风险。
Our primary contributions 我们的主要贡献
-
1) the identification of the problem of training on decentralized data from mobile devices as an important research direction;
将各个移动设备上的分散的数据集进行分布式训练,作为一个重要的研究方向。 -
2) the selection of a straightforward and practical algorithm that can be applied to this setting;
选择了一个可以应用于这种场景的简单而实用的算法; -
3) an extensive empirical evaluation of the proposed approach.
对所提出的方法进行广泛的验证和评估。 -
More concretely, we introduce the FederatedAveraging algorithm.
更具体的说,我们引入了联邦平均法。 -
which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging.
联邦平均法将每个客户端上执行的本地随机梯度下降(SGD)与在服务器端执行的模型平均相结合。 -
We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.
我们对该算法进行了广泛的实验,证明它对不平衡和非独立同分布的数据分布是鲁棒的,并且可以将在分散数据上训练深度网络所需的通信轮次减少几个数量级。
Federated Learning properties 联邦学习适用属性
-
Federated Learning Ideal problems for federated learning have the following properties 适用于联邦学习的理想问题应该有以下几个属性
-
1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center.
与中心服务器提供的代理数据相比,使用来自移动设备的真实数据进行训练就有更明显的优势。 -
2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle)
这些数据是隐私敏感的,或者(与模型的大小相比)规模很大,所以最好不要纯粹出于训练模型的目的将其记录到数据中心(这也符合集中收集原则)。 -
3)For supervised tasks, labels on the data can be inferred naturally from user interaction.
对于有监督学习任务,数据的标签可以从用户的交互中自然推断出来。 -
Many models that power intelligent behavior on mobile devices fit the above criteria.
许多支持移动设备上智能行为的模型都符合上述标准。 -
As two examples, we consider image classification
这里可以举两个例子例如图像分类。 -
for example predicting which photos are most likely to be viewed multiple times in the future, or shared
例如,预测哪些照片将来最有可能被多次查看或者共享。 -
and language models, which can be used to improve voice recognition and text entry on touch-screen keyboards by improving decoding, next-word-prediction, and even predicting whole replies
以及语言模型,可以用于通过改进解码器,来完成下一个单词的预测或者预测整个回复,来改进键盘输入和语音识别。 -
The potential training data for both these tasks (all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.) can be privacy sensitive.
这两项任务的潜在的训练数据(用户拍摄的所有照片和他们在移动键盘上键入的所有内容、包括密码,URL)等,都是隐私且敏感的。 -
The distributions from which these examples are drawn are also likely to differ substantially from easily available proxy datasets
从这些例子中提取出的数据的分布与更容易获取的代理数据集中的数据分布不相同。 -
the use of language in chat and text messages is generally much different than standard language corpora, e.g., Wikipedia and other web documents
聊天和短信中的使用的语言,通常与一些标准的语料库,(例如,维基百科,或者其他的网络文档)有很大的不同。 -
the photos people take on their phone are likely quite different than typical Flickr photos.
人们用手机拍摄的照片可能与典型的Flickr照片大不相同。 -
And finally, the labels for these problems are directly available
最后,这些问题的标签是可以直接获得的。 -
entered text is self-labeled for learning a language model
输入的文本本身就带有标签,可直接用于训练语言模型。 -
and photo labels can be defined by natural user interaction with their photo app(which photos are deleted, shared, or viewed).
而照片的标签可以由用户与其照片应用的自然交互来定义(哪些照片被删除、分享或查看)。
Privacy 隐私问题
-
Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data.
与在数据中心对持久化数据进行训练相比,联邦学习具有明显的隐私优势:即使只持有一个"匿名化"的数据集,也可能因为与其他数据的关联(join)而使用户隐私面临风险。 -
In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates).
相比之下,为联邦学习传输的信息是改进特定模型所需的最小更新(当然,隐私保护的强度取决于更新的内容) -
The updates themselves can (and should) be ephemeral.
更新本身可以(并且应该)是短暂的。 -
They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less.
根据数据处理不等式,这些更新内容包含的有效信息不可能大于原始的训练数据,而且一般而言要小的多。 -
Further, the source of the updates is not needed by the aggregation algorithm
此外联邦算法不需要知道更新的来源 -
so updates can be transmitted without identifying meta-data over a mix network such as Tor [7] or via a trusted third party.
因此,可以通过例如Tor的这种混合网络,或者受信任的第三方传输更新,而无需识别性元数据。 -
We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.
-
在论文的最后,我们简要讨论了将联邦学习与安全多方计算和差分隐私相结合的可能性。
Federated Optimization 联合优化
-
We refer to the optimization problem implicit in federated learning as federated optimization
我们将联邦学习中隐含的优化问题称为联邦优化(federated optimization), -
drawing a connection (and contrast) to distributed optimization.
与分布式优化建立联系和对比。 -
Federated optimization has several key properties that differentiate it from a typical distributed optimization problem
联合优化有几个关键属性,这些属性将与典型的分布式优化问题区分开来。 -
Non-IID The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user’s local dataset will not be representative of the population distribution.
非独立同分布 客户端上的训练数据通常基于特定用户对移动设备的使用情况,因此任何特定用户的本地数据集都不能代表总体分布。 -
Unbalanced Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
-
不均衡 同样,一些用户会比其他用户更频繁地使用某项服务或应用程序,从而导致本地训练数据量各不相同。
-
Massively distributed We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
大规模分布 我们预计参与优化的客户端的数量将远远大于每个客户端平均的示例数量。 -
Limited communication Mobile devices are frequently offline or on slow or expensive connections.
通信受限 移动设备经常离线,或者处于缓慢、昂贵的网络连接中。 -
In this work, our emphasis is on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints.
在这项工作中,我们的重点是优化问题的非独立同分布和不平衡这两个属性,以及通信约束的关键性质。 -
A deployed federated optimization system must also address a myriad of practical issues
部署联邦学习还必须要解决无数的实际问题。 -
client datasets that change as data is added and deleted
随着数据的添加和删除而变化的客户端的数据集 -
client availability that correlates with the local data distribution in complex ways (e.g., phones from speakers of American English will likely be plugged in at different times than speakers of British English)
客户端的可用性与本地数据的分布方式以复杂的方式相关。(例如美国英语用户,和英国英语用户可能在不同的时间为手机充电) -
and clients that never respond or send corrupted updates
以及一些从不响应或者发送损坏的更新的客户端。 -
These issues are beyond the scope of the current work; instead, we use a controlled environment that is suitable for experiments
这些问题超出了当前的工作的范围,我们使用的是合适的受控的实验环境。 -
but still addresses the key issues of client availability and unbalanced and non-IID data.
但我们依然解决了客户端可用性和不平衡的非独立同分布的数据等关键的问题。 -
We assume a synchronous update scheme that proceeds in rounds of communication.
我们假定一个同步更新方案,该方案在多轮通信中进行 -
There is a fixed set of $K$ clients, each with a fixed local dataset.
这里有一组固定的 $K$ 个客户端,每个客户端都有一个固定的本地数据集。 -
At the beginning of each round, a random fraction $C$ of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters).
在每一轮开始时,随机选择比例为 $C$ 的客户端,服务器将当前的全局算法状态(例如当前的模型参数)发送给这些客户端中的每一个。 -
We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point.
我们只选择一小部分客户端参与训练来提高效率,因为我们发现添加的客户端的数量超过某个节点之后,训练效果会带来递减的回报。 -
Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server.
然后,每个被选中的客户端根据全局状态及其本地数据集执行本地计算,并将更新发送给服务器。 -
The server then applies these updates to its global state, and the process repeats.
然后这个服务器应用这些更新到他的全局的状态,然后再重复这个过程。 -
While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form
尽管我们专注于非凸的神经网络目标,我们所考虑的算法适用于以下形式的任何有限和(finite-sum)目标:

$$\min_{w \in \mathbb{R}^d} f(w) \quad \text{where} \quad f(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$

$w$ 表示模型的权重(参数),$w \in \mathbb{R}^d$ 表示 $w$ 是位于实数空间 $\mathbb{R}$ 中、维度为 $d$ 的向量。
$f(w)$ 为优化问题的目标函数,目标是找到最优的 $w$ 使得 $f(w)$ 的值最小。
$f(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ 表示 $f(w)$ 定义为 $n$ 个 $f_i(w)$ 的平均值。
-
For a machine learning problem, we typically take $f_i(w)=\ell(x_i,y_i;w)$, the loss of the prediction on example $(x_i,y_i)$ made with model parameters $w$.
对于一个机器学习问题,我们通常取 $f_i(w)=\ell(x_i,y_i;w)$,即参数为 $w$ 的模型在样本 $(x_i,y_i)$ 上的预测损失。 -
We assume there are K clients over which the data is partitioned.
我们假设全部数据被划分到 $K$ 个客户端上。 -
with $\mathcal{P}_k$ the set of indexes of data points on client $k$, with $n_k = |\mathcal{P}_k|$.
$\mathcal{P}_k$ 是第 $k$ 个客户端上所有数据样本的索引组成的集合,$n_k$ 表示客户端 $k$ 上训练样本的数量。

$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \quad \text{where} \quad F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w).$$

这里的目标函数被分成了两级:where 后面的公式表示每个客户端上计算的平均损失 $F_k(w)$,左侧表示按各客户端样本量加权平均后的总体损失。
-
If the partition $\mathcal{P}_k$ was formed by distributing the training examples over the clients uniformly at random, then we would have $\mathbb{E}_{\mathcal{P}_k}[F_k(w)] = f(w)$, where the expectation is over the set of examples assigned to a fixed client $k$. This is the IID assumption typically made by distributed optimization algorithms; we refer to the case where this does not hold (that is, $F_k$ could be an arbitrarily bad approximation to $f$) as the non-IID setting.
如果数据分区 $\mathcal{P}_k$ 是通过将训练样本均匀随机地分配到各个客户端得到的,那么我们会有 $\mathbb{E}_{\mathcal{P}_k}[F_k(w)] = f(w)$,其中期望是针对分配给固定客户端 $k$ 的样本集合取的。这是分布式优化算法中通常做出的独立同分布(IID)假设;当这种假设不成立时(也就是说,$F_k$ 可能是 $f$ 的任意差的近似),我们称这种情况为非独立同分布(non-IID)设置。 -
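为了更直观地理解上面这个按客户端加权的目标函数,下面给出一个极简的 numpy 验证示意(非论文原始代码,数据、模型和损失函数都是随意假设的玩具例子),说明 $f(w)=\sum_k \frac{n_k}{n}F_k(w)$ 与直接对全部样本取平均在数值上一致:

```python
import numpy as np

rng = np.random.default_rng(0)

# 随意构造的数据:n 个一维样本 (x_i, y_i),模型为 y ≈ w*x,损失取平方误差
n, K = 1000, 10
x, y = rng.normal(size=n), rng.normal(size=n)
w = 0.5

def f_i(w, xi, yi):
    return (w * xi - yi) ** 2

# 把样本索引划分成大小不等的 K 份,模拟 P_k
idx = rng.permutation(n)
split_points = np.sort(rng.choice(np.arange(1, n), size=K - 1, replace=False))
P = np.split(idx, split_points)

# 全局目标 f(w) = (1/n) * sum_i f_i(w)
f_global = np.mean([f_i(w, x[i], y[i]) for i in range(n)])

# 按客户端分解:f(w) = sum_k (n_k/n) * F_k(w),其中 F_k 是客户端 k 上的平均损失
F_k = [np.mean([f_i(w, x[i], y[i]) for i in Pk]) for Pk in P]
f_decomposed = sum(len(Pk) / n * Fk for Pk, Fk in zip(P, F_k))

print(np.isclose(f_global, f_decomposed))  # True:两种写法数值一致
```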
In data center optimization, communication costs are relatively small, and computational costs dominate, with much of the recent emphasis being on using GPUs to lower these costs.
在数据中心优化中,通信成本相对较小,计算成本占主导地位,最近的大部分重点是使用GPU来降低这些成本。 -
In contrast, in federated optimization communication costs dominate — we will typically be limited by an upload bandwidth of 1 MB/s or less.
相比之下,在联邦优化中通信成本占据主导地位,我们通常会受到1MB/s或更低的上传带宽的限制。 -
Further, clients will typically only volunteer to participate in the optimization when they are charged, plugged-in, and on an unmetered wi-fi connection.
此外,客户端往往只有在插着电源充电且连接不计流量的Wi-Fi时,才会自愿参与优化。 -
Further, we expect each client will only participate in a small number of update rounds per day.
而且,我们预计每个客户端每天只会参与少量的轮次更新。 -
On the other hand, since any single on-device dataset is small compared to the total dataset size, and modern smartphones have relatively fast processors (including GPUs), computation becomes essentially free compared to communication costs for many model types.
另一方面,任何单个设备上的数据集都很小,并且现代智能手机具有相对较快的处理器,包括GPU,因此与许多模型类型的通信成本相比,计算基本上是免费的。 -
Thus, our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model.
因此,我们的目标是使用额外的计算来减少训练模型所需的通信轮数。 -
There are two primary ways we can add computation:
我们主要可以通过两种方式增加计算量: -
①increased parallelism, where we use more clients working independently between each communication round;
①增加并行性,我们在每轮通信之间使用更多的客户端 -
②increased computation on each client, where rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation between each communication round.
②增加每个客户端上的计算量:每个客户端在每轮通信之间不再只执行梯度计算这样的简单计算,而是执行更复杂的计算。 -
We investigate both of these approaches, but the speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used.
我们对这两种方法进行了研究,但我们获得的加速效果主要是通过在每个客户端上增加更多的计算量,而不是通过客户端之间的并行化,前提是客户端的并行度达到了一定的最低水平。
Related Work 之前的相关工作
-
Distributed training by iteratively averaging locally trained models has been studied by McDonald et al. [28] for the perceptron and Povey et al. for speech recognition DNNs.
通过迭代平均本地训练的模型来进行分布式训练,McDonald 等人 [28] 对感知机进行了研究,Povey 等人则针对语音识别的深度神经网络(DNNs)进行了研究。 -
Zhang et al. [42] studies an asyn-chronous approach with “soft” averaging.
Zhang et al. [42] 使用了一种软平均的异步方法. -
These works only consider the cluster / data center setting (at most 16 workers,wall-clock time based on fast networks), and do not consider datasets that are unbalanced and non-IID.
上述的研究只考虑了集群/数据中心,并且最多只用16个工作者,且是基于快速网络的实际时间,没有考虑到数据不平衡和非独立同分布的样本。 -
properties that are essential to the federated learning setting.
而上述的这些属性对联邦学习至关重要。 -
We adapt this style of algorithm to the federated setting and perform the appropriate empirical evaluation, which asks different questions than those relevant in the data center setting, and requires different methodology.
我们将这种算法风格调整并应用到联邦学习环境中,进行了适当的实证评估。这些评估提出了与数据中心环境不同的问题,并且需要不同的方法来解决。 -
Using similar motivation to ours, Neverova et al. [29] also discusses the advantages of keeping sensitive user data on device.
出于与我们相似的动机,Neverova 等人 [29] 也讨论了将敏感用户数据保存在设备上的优点。 -
The work of Shokri and Shmatikov [35] is related in several ways: they focus on training deep networks, empha size the importance of privacy, and address communication costs by only sharing a subset of the parameters during each round of communication; however, they also do not consider unbalanced and non-IID data, and the empirical evaluation is limited.
Shokri 和 Shmatikov [35] 的工作在多个方面与我们的研究相关:他们专注于训练深度网络,强调了隐私的重要性,并通过在每轮通信中仅共享一部分参数来解决通信成本的问题;然而,他们也没有考虑数据不平衡和非独立同分布(non-IID)的情况,而且实证评估也比较有限。 -
One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst-case, the global model produced is no better than training a model on a single client.
我们考虑的(参数化)算法家族的一个端点是简单的“一次性平均”(one-shot averaging)。在这种方法中,每个客户端在其本地数据上求解使(可能经过正则化的)损失最小化的模型,然后将这些模型进行平均,产生最终的全局模型。这种方法在具有独立同分布(IID)数据的凸优化情况下已经被广泛研究,并且已知在最坏情况下,所产生的全局模型并不比在单个客户端上训练的模型更好。
FederatedAveraging Algorithm 联邦学习平均算法
-
The recent multitude of successful applications of deep learning have almost exclusively relied on variants of stochastic gradient descent (SGD) for optimization.
最近深度学习的大量的成功应用几乎完全依赖于随机梯度下降 -
in fact,many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods [16].
事实上,许多进展可以被理解为调整模型的结构(因此也调整了损失函数),使其更适合通过简单的基于梯度的方法进行优化 [16] -
Thus, it is natural that we build algorithms for federated optimization by starting from SGD.
因此使用随机梯度下降法来构建联邦学习是很自然的。 -
SGD can be applied naively to the federated optimization problem, where a single batch gradient calculation (say on a randomly selected client) is done per round of communication.
随机梯度下降法,可以直接应用于联邦学习的优化问题,其中每一轮通信,只进行一次的批量梯度计算。 -
This approach is computationally efficient, but requires very large numbers of rounds of training to produce good models.
这个方法具有较高的计算效率,但是依然要求庞大的训练轮数才能训练出一个好的模型。 -
(e.g., even using an advanced approach like batch normalization, Ioffe and Szegedy [21] trained MNIST for 50000 steps on minibatches of size 60).
例如,即使使用了 batch normalization 这样的先进方法,Ioffe 和 Szegedy [21] 在 MNIST 上也用大小为 60 的 mini-batch 训练了 50000 步。 -
We consider this baseline in our CIFAR-10 experiments.
我们在 CIFAR-10 实验中也把这种方法作为基线进行比较。 -
In the federated setting, there is little cost in wall-clock time to involving more clients, and so for our baseline we use large-batch synchronous SGD;
在联邦学习的环境中,让更多的客户端参与在实际耗时(挂钟时间)上几乎没有成本,因此,作为我们的基本用法,我们使用大批量的同步随机梯度下降法(SGD) -
experiments by Chen et al. [8] show this approach is state-of-the-art in the data center setting, where it outperforms asynchronous approaches.
Chen 等人 [8] 的实验表明,这种方法在数据中心环境中是最先进的,且其性能优于异步方法。 -
To apply this approach in the federated setting, we select a C C C fraction of clients on each round, and compute the gradient of the loss over all the data held by these clients.
为了在联邦学习场景中应用这种方法,我们在每一轮选择比例为 $C$ 的客户端,并在这些客户端持有的全部数据上计算损失的梯度。 -
Thus, $C$ controls the global batch size, with $C = 1$ corresponding to full-batch (non-stochastic) gradient descent.
因此,$C$ 控制全局 batch size,当 $C=1$ 时就相当于全批量(非随机)梯度下降(即选择全部客户端参与更新)。 -
A typical implementation of FedSGD with $C=1$ and a fixed learning rate $\eta$ has each client $k$ compute $g_k = \nabla F_k(w_t)$, the average gradient on its local data at the current model $w_t$, and the central server aggregates these gradients and applies the update $w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k$, since $\sum_{k=1}^{K} \frac{n_k}{n} g_k = \nabla f(w_t)$. An equivalent update is given by $\forall k,\ w_{t+1}^k \leftarrow w_t - \eta g_k$ and then $w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}$.
在一个典型的 $C=1$(所有客户端都参与)且学习率固定为 $\eta$ 的 FedSGD 实现中,每个客户端 $k$ 计算 $g_k = \nabla F_k(w_t)$,即在当前模型 $w_t$ 下基于其本地数据得到的平均梯度;中央服务器聚合这些梯度并应用更新 $w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k$,因为 $\sum_{k=1}^{K} \frac{n_k}{n} g_k = \nabla f(w_t)$。其中 $n_k$ 表示客户端 $k$ 的样本数量,$n$ 是所有客户端的样本总量,$\eta$ 为学习率。一个等价的更新方式是:每个客户端先在本地更新 $w_{t+1}^k \leftarrow w_t - \eta g_k$,然后中央服务器再加权聚合 $w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}$。 -
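下面用几行 numpy 验证上面"先聚合梯度再更新"和"各自走一步再加权平均模型"这两种写法确实等价(梯度、样本数等都是随意生成的示意数据,并非论文实验):

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, eta = 5, 3, 0.1
n_k = rng.integers(10, 100, size=K)          # 各客户端样本数(随意假设)
n = n_k.sum()
w_t = rng.normal(size=d)                     # 当前全局模型参数
g = [rng.normal(size=d) for _ in range(K)]   # g_k:客户端 k 在 w_t 处的本地平均梯度(用随机数代替)

# 方式一:服务器聚合梯度后做一步更新  w_{t+1} = w_t - eta * sum_k (n_k/n) g_k
w_next_grad_avg = w_t - eta * sum(n_k[k] / n * g[k] for k in range(K))

# 方式二:每个客户端先本地走一步 w_{t+1}^k = w_t - eta * g_k,服务器再按样本量加权平均模型
w_k = [w_t - eta * g[k] for k in range(K)]
w_next_model_avg = sum(n_k[k] / n * w_k[k] for k in range(K))

print(np.allclose(w_next_grad_avg, w_next_model_avg))  # True:两种更新方式等价
```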
That is, each client locally takes one step of gradient descent on the current model using its local data, and the server then takes a weighted average of the resulting models.
也就是说,每个客户端使用其本地数据再当前模型上本地执行一个梯度下降步骤,然后服务器获取结果模型的加权平均值。 -
We can add more computation to each client by iterating the local update $w^k \leftarrow w^k - \eta \nabla F_k(w^k)$ multiple times before the averaging step.
我们可以让每个客户端在平均步骤之前多次迭代本地更新 $w^k \leftarrow w^k - \eta \nabla F_k(w^k)$,从而增加每个客户端的计算量。 -
We term this approach FederatedAveraging or FedAvg)
我们将这种方法称为联邦平均算法(FederatedAveraging,简称 FedAvg)。 -
The amount of computation is controlled by three key parameters: C, the fraction of clients that perform computation on each round; E, then number of training passes each client makes over its local dataset on each round; and B, the local minibatch size used for the client updates.
算法的计算量由三个关键参数控制
C:每轮训练参与训练的客户端比例。
E:每个客户端在每轮中对其本地数据集完整遍历(训练)的次数。
B:是客户端本地更新的批量大小。 -
We write B = ∞ to indicate that the full local dataset is treated as a single minibatch.
我们用 B = ∞ 表示将整个本地数据集作为一个 mini-batch 来处理。 -
Thus, at one endpoint of this algorithm family, we can take B = ∞ and E = 1 which corresponds exactly to FedSGD.
因此,在这个算法家族的一个端点上,取 B = ∞ 且 E = 1,恰好对应 FedSGD 算法。 -
For a client with $n_k$ local examples, the number of local updates per round is given by $u_k = E\frac{n_k}{B}$. Complete pseudo-code is given in Algorithm 1.
对于一个拥有 $n_k$ 个本地样本的客户端,每轮的本地更新次数为 $u_k = E\frac{n_k}{B}$,完整的伪代码在算法1中给出。 -
For general non-convex objectives, averaging models in parameter space could produce an arbitrarily bad model.
对于一般的非凸目标,在参数空间中对模型进行平均,可能产生一个非常糟糕的模型。
-
Figure 1: The loss on the full MNIST training set for models generated by averaging the parameters of two models $w$ and $w'$ using $\theta w + (1-\theta) w'$ for 50 evenly spaced values $\theta \in [-0.2, 1.2]$. The models $w$ and $w'$ were trained using SGD on different small datasets. For the left plot, $w$ and $w'$ were initialized using different random seeds; for the right plot, a shared seed was used. Note the different y-axis scales. The horizontal line gives the best loss achieved by $w$ or $w'$ (which were quite close), corresponding to the vertical lines at $\theta = 0$ and $\theta = 1$. With shared initialization, averaging the models produces a significant reduction in the loss on the total training set (much better than the loss of either parent model).
图 1:将两个模型 $w$ 和 $w'$ 的参数按 $\theta w + (1-\theta) w'$ 进行平均所得到的模型,在完整 MNIST 训练集上的损失,其中 $\theta \in [-0.2, 1.2]$ 取 50 个等间隔值。模型 $w$ 和 $w'$ 是在不同的小数据集上用 SGD 训练的。左图中 $w$ 和 $w'$ 使用了不同的随机种子初始化;右图中使用了相同的随机种子。注意两幅图的 y 轴刻度不同。水平线表示 $w$ 或 $w'$ 单独达到的最佳损失(两者非常接近),对应于 $\theta = 0$ 和 $\theta = 1$ 处的垂直线。当使用相同的初始化时,对模型取平均可以显著降低整个训练集上的损失(比任一父模型的损失都要好)。 -
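图 1 这类插值曲线的做法可以用下面的 PyTorch 代码示意(非论文原始脚本,model、两份 state_dict 以及 data_loader 都需要读者用自己的训练结果代入):

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, theta):
    # 按 theta*w + (1-theta)*w' 对两份参数逐个张量插值
    return {k: theta * sd_a[k] + (1 - theta) * sd_b[k] for k in sd_a}

@torch.no_grad()
def loss_along_interpolation(model, sd_a, sd_b, loss_fn, data_loader, thetas):
    losses = []
    for theta in thetas:
        model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, float(theta)))
        total, count = 0.0, 0
        for x, y in data_loader:
            total += loss_fn(model(x), y).item() * len(y)
            count += len(y)
        losses.append(total / count)   # 该 theta 下在整个训练集上的平均损失
    return losses

# 用法示意:
# thetas = torch.linspace(-0.2, 1.2, 50)
# curve = loss_along_interpolation(model, w, w_prime,
#                                  torch.nn.functional.cross_entropy, train_loader, thetas)
```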
Following the approach of Goodfellow et al. [17], we see exactly this bad behavior when we average two MNIST digit-recognition models trained from different initial conditions (Figure 1, left).
按照 Goodfellow 等人 [17] 的做法,当我们对两个从不同初始条件训练出来的 MNIST 数字识别模型取参数平均时,恰好观察到了这种糟糕的行为(图1,左)。 -
For this figure, the parent models $w$ and $w'$ were each trained on non-overlapping IID samples of 600 examples from the MNIST training set.
对于这张图,父模型 $w$ 和 $w'$ 分别在 MNIST 训练集中互不重叠的、各含 600 个样本的 IID 子集上训练。 -
Training was via SGD with a fixed learning rate of 0.1 for 240 updates on minibatches of size 50 (or E = 20 passes over the mini-datasets of size 600).
训练使用 SGD,固定学习率为 0.1,在大小为 50 的 mini-batch 上进行 240 次更新(即对大小为 600 的小数据集进行 E = 20 次完整遍历)。 -
This is approximately the amount of training where the models begin to overfit their local datasets.
这大约是模型开始对其本地数据集过拟合时的训练量。 -
Recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized NNs are surprisingly well behaved and in particular less prone to bad local minima than previously thought [11, 17, 9].
最近的研究表明,在实践中,充分过参数化的神经网络的损失曲面表现得出奇地好,特别是比以前认为的更不容易陷入糟糕的局部极小值 [11, 17, 9]。 -
And indeed, when we start two models from the same random initialization and then again train each independently on a different subset of the data (as described above), we find that naive parameter averaging works surprisingly well.
确实,我们分别初始化了两个参数一样的模型,然后分别在不同的子数据集上进行训练,我们发现平均参数之后的模型训练效果出奇的好。 -
While Figure 1 starts from a random initialization, note a shared starting model $w_t$ is used for each round of FedAvg, and so the same intuition applies.
虽然图 1 是从随机初始化开始的,但注意 FedAvg 的每一轮都使用同一个共享的起始模型 $w_t$,因此同样的直觉也适用。
联邦学习平均算法伪代码✨
算法1 FederatedAveraging(联邦平均算法):一共有 $K$ 个客户端,用 $k$ 索引;$B$ 是本地 mini-batch 大小,$E$ 是本地训练轮数,$\eta$ 是学习率。

中心服务器执行:
  初始化权重 $w_0$
  for 每轮 $t = 1, 2, \dots$ do
    $m \leftarrow \max(C \times K, 1)$
    $S_t \leftarrow$(随机选择 $m$ 个客户端构成一个客户端集合)
    for 每个客户端 $k \in S_t$ 并行地 do
      $w_{t+1}^k \leftarrow \text{ClientUpdate}(k, w_t)$   // 在第 $k$ 个客户端上训练并返回该客户端训练好的权重
    $m_t \leftarrow \sum_{k \in S_t} n_k$   // 获得被选中客户端上的样本总数
    $w_{t+1} \leftarrow \sum_{k \in S_t} \frac{n_k}{m_t} w_{t+1}^k$   // 按各客户端样本量占比(客户端 $k$ 的样本数 / $m_t$)加权求和,更新出 $w_{t+1}$

$\text{ClientUpdate}(k, w)$:   // 在第 $k$ 个客户端上运行
  $\mathcal{B} \leftarrow$(以批量大小 $B$ 切分客户端上的数据 $\mathcal{P}_k$)
  for 本地训练轮次 $i$ 从 1 到 $E$ do
    for 每个批量 $b \in \mathcal{B}$ do
      $w \leftarrow w - \eta \nabla \ell(w; b)$
  return $w$ 给中心服务器
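下面是与上述伪代码对应的一个 PyTorch 简化实现示意(并非论文官方代码;`fedavg`、`client_update` 等函数名为笔者自拟,模型结构和各客户端的 `client_loaders` 需读者自行准备,这里只展示服务器循环与 ClientUpdate 的骨架):

```python
import copy
import random
import torch

def client_update(model, w, loader, epochs, lr, device="cpu"):
    # 对应伪代码中的 ClientUpdate(k, w):从全局权重 w 出发,在本地数据上训练 E 轮
    model.load_state_dict(w)
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

def fedavg(model, client_loaders, rounds, C=0.1, E=5, lr=0.1, device="cpu"):
    # client_loaders:长度为 K 的列表,第 k 个元素是客户端 k 本地数据的 DataLoader(需自行构造)
    K = len(client_loaders)
    global_w = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    for t in range(rounds):
        m = max(int(C * K), 1)
        S_t = random.sample(range(K), m)                       # 随机选择 m 个客户端
        n_k = [len(client_loaders[k].dataset) for k in S_t]
        m_t = sum(n_k)
        local_ws = [client_update(copy.deepcopy(model), global_w,
                                  client_loaders[k], E, lr, device) for k in S_t]
        # 按样本量加权平均各客户端返回的权重,得到新的全局权重
        global_w = {key: sum(nk / m_t * w_loc[key] for nk, w_loc in zip(n_k, local_ws))
                    for key in global_w}
    model.load_state_dict(global_w)
    return model
```

这个骨架对应的是按被选中客户端样本量 $n_k/m_t$ 加权的版本,与上面伪代码一致;不含掉线处理、安全聚合等实际部署时需要考虑的问题。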
Experimental Results 实验结果
-
We are motivated by both image classification and language modeling tasks where good models can greatly enhance the usability of mobile devices.
我们的研究动机来自图像分类和语言建模这两类任务,好的模型可以极大地提升移动设备的易用性。 -
For each of these tasks we first picked a proxy dataset of modest enough size that we could thoroughly investigate the hyperparameters of the FedAvg algorithm.
对于每个任务,我们首先选择一个规模适中(足够小)的代理数据集,以便能够充分研究 FedAvg 算法的超参数。 -
While each individual training run is relatively small, we trained over 2000 individual models for these experiments.
虽然每次单独的训练规模都相对较小,但为了这些实验,我们总共训练了超过 2000 个模型。 -
We then present results on the benchmark CIFAR-10 image classification task.
然后我们展示了给予CIFAR-10 的图像训练结果。 -
Finally, to demonstrate the effectiveness of FedAvg on a real-world problem with a natural partitioning of the data over clients, we evaluate on a large language modeling task.
最后,为了证明 FedAvg 在数据天然按客户端划分的真实问题上的有效性,我们在一个大规模语言建模任务上进行了评估。 -
Our initial study includes three model families on two datasets.
我们的初步研究包括在两个数据集上的三类模型。 -
The first two are for the MNIST digit recognition task [26]:
前两个模型用于 MNIST 手写数字识别任务 [26]: -
A simple multilayer-perceptron with 2-hidden layers with 200 units each using ReLu activations (199,210 total parameters), which we refer to as the MNIST 2NN.
一个简单的多层感知机,有 2 个各含 200 个神经元的隐藏层,使用 ReLU 激活(共 199,210 个参数),我们称之为 MNIST 2NN。 -
A CNN with two 5×5 convolution layers (the first with 32 channels, the second with 64, each followed with 2×2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer (1,663,370 total parameters).
一个 CNN,有两个 5×5 的卷积层(第一个有 32 个通道,第二个有 64 个通道,每个卷积层后面接一个 2×2 的最大池化层),然后是一个有 512 个神经元、使用 ReLU 激活的全连接层,最后是一个 softmax 输出层(共 1,663,370 个参数)。 -
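按上述描述,这个 CNN 的一种可能的 PyTorch 实现如下(论文没有给出 padding 等细节,这里假设卷积使用 same padding;在该假设下参数总数恰为 1,663,370,与论文一致):

```python
import torch
import torch.nn as nn

class MnistCNN(nn.Module):
    # 两个 5x5 卷积(32/64 通道,各接 2x2 最大池化)、512 单元全连接 + ReLU、10 类输出
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, 10),   # softmax 由交叉熵损失在训练时内部处理
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(sum(p.numel() for p in MnistCNN().parameters()))  # 1663370
```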
To study federated optimization, we also need to specify how the data is distributed over the clients.
为了研究联邦优化,我们需要取明确, 这个数据是如何分布到客户端上的。 -
IID, where the data is shuffled, and then partitioned into 100 clients each receiving 600 examples,
IID 独立同分布数据集,将数据打乱,然后随机分配到100个客户端上,每个客户端上600个训练样本。 -
Non-IID, where we first sort the data by digit label, divide it into 200 shards of size 300, and assign each of 100 clients 2 shards.
非独立同分布:我们先按数字标签对数据排序,划分成 200 个大小为 300 的分片(shard),然后给 100 个客户端每个分配 2 个分片。 -
This is a pathological non-IID partition of the data as most clients will only have examples of two digits
这是一个有些病理状态的非独立同分布的数据,每个客户端只有2个数字的数据, -
letting us explore the degree to which our algorithms will break on highly non-IID data.
让我们探索算法在非独立同分布数据集上的崩溃程度。 -
Both of these partitions are balanced, however.
这两种划分方式划分到每一个客户端上的数据量都是一样的。 -
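上面两种划分方式可以用下面的示意代码实现(以 MNIST 训练集的标签数组为输入;非论文原始实现,函数名为笔者自拟,仅帮助理解):

```python
import numpy as np

def iid_partition(num_examples, num_clients=100, rng=np.random.default_rng(0)):
    # IID:打乱后均分,60000 个样本分到 100 个客户端,每个客户端 600 个
    idx = rng.permutation(num_examples)
    return np.array_split(idx, num_clients)

def pathological_non_iid_partition(labels, num_clients=100, shards_per_client=2,
                                   rng=np.random.default_rng(0)):
    # Non-IID:按标签排序 -> 切成 200 个大小为 300 的分片 -> 每个客户端随机分到 2 个分片
    sorted_idx = np.argsort(labels, kind="stable")
    num_shards = num_clients * shards_per_client
    shards = np.array_split(sorted_idx, num_shards)
    shard_order = rng.permutation(num_shards)
    return [np.concatenate([shards[s] for s in
                            shard_order[c * shards_per_client:(c + 1) * shards_per_client]])
            for c in range(num_clients)]

# labels 取 MNIST 训练集的 60000 个标签时,大多数客户端将只包含两种数字的样本
```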
For language modeling, we built a dataset from The Complete Works of William Shakespeare
对于语言建模任务,我们用莎士比亚全集构建了一个数据集。 -
We construct a client dataset for each speaking role in each play with at least two lines. This produced a dataset with 1146 clients.
我们为每部剧中每个台词不少于两行的角色分别构建一个客户端数据集,这样共产生了 1146 个客户端。 -
For each client, we split the data into a set of training lines (the first 80% of lines for the role), and test lines (the last 20%, rounded up to at least one line).
对于每个客户端,我们将数据划分为训练行(该角色前 80% 的台词)和测试行(后 20%,向上取整至至少一行)。 -
The resulting dataset has 3,564,579 characters in the training set, and 870,014 characters6 in the test set.
最终这个数据集有3564579个字符在训练集,有870014个字符在测试集。 -
This data is substantially unbalanced, with many roles having only a few lines, and a few with a large number of lines.
这个数据集是非常不平衡的很多角色都只有几行的数据,然后有一些则有大量的数据。 -
Further, observe the test set is not a random sample of lines, but is temporally separated by the chronology of each play.
这里还要指出,测试集并不是对台词的随机抽样,而是按每部剧的时间顺序在时间上分开的。 -
Using an identical train/test split, we also form a balanced and IID version of the dataset, also with 1146 clients.
使用相同的训练集和测试集划分,我们还形成了一个平衡的独立同分布的1146个客户端的数据集。 -
On this data we train a stacked character-level LSTM language model, which after reading each character in a line, predicts the next character .
在这个数据上,我们训练了一个堆叠的字符级 LSTM 语言模型,它在读入一行中的每个字符之后,预测下一个字符。 -
The model takes a series of characters as input and embeds each of these into a learned 8 dimensional space.
该模型以一串字符作为输入,并将每个字符嵌入到一个可学习的 8 维空间中。 -
The embedded characters are then processed through 2 LSTM layers, each with 256 nodes.
嵌入的字符经过一个两层的LSTM,每个LSTM层都有256个节点。 -
Finally the output of the second LSTM layer is sent to a softmax output layer with one node per character.
最后,第二个 LSTM 层的输出被送入一个 softmax 输出层,每个字符对应一个输出节点。 -
The full model has 866,578 parameters, and we trained using an unroll length of 80 characters.
整个模型有 866,578 个参数,我们训练时使用 80 个字符的展开长度(unroll length)。 -
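这个字符级 LSTM 的结构可以用下面的 PyTorch 代码示意(非论文原始实现;字符表大小 vocab_size 论文此处未给出具体数值,这里作为参数留给读者,莎士比亚语料大致是几十个字符):

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    # 8 维字符嵌入 -> 两层 256 单元的 LSTM -> 每个字符一个输出节点(softmax 由损失函数处理)
    def __init__(self, vocab_size, embed_dim=8, hidden=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        # x: (batch, seq_len=80) 的字符索引;返回每个位置上预测下一个字符的 logits
        emb = self.embed(x)
        h, state = self.lstm(emb, state)
        return self.out(h), state

# 训练时把每行文本切成长度为 80 的片段,目标序列是输入向右平移一位的字符
```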
SGD is sensitive to the tuning of the learning-rate parameter η
SGD 随机梯度下降算法对 学习率参数的变化很敏感。 -
The results reported here are based on training over a sufficiently wide grid of learning rates (typically 11-13 values for $\eta$ on a multiplicative grid of resolution $10^{\frac{1}{3}}$ or $10^{\frac{1}{6}}$).
这里报告的结果都是在一个足够宽的学习率网格上训练得到的(通常在分辨率为 $10^{\frac{1}{3}}$ 或 $10^{\frac{1}{6}}$ 的乘法网格上为 $\eta$ 取 11-13 个值)。 -
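所谓"分辨率为 $10^{1/3}$ 的乘法网格",可以用下面几行代码示意(函数名和中心值为笔者自拟的例子,仅说明网格的构造方式):

```python
import numpy as np

def lr_grid(center, num_values=13, resolution=1/3):
    # 以 center 为中心、在对数刻度上等距取值的乘法网格,相邻两个候选学习率相差 10^(1/3) 倍
    exponents = (np.arange(num_values) - num_values // 2) * resolution
    return center * 10.0 ** exponents

print(lr_grid(0.1))   # 例如:以 0.1 为中心的 13 个候选学习率
```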
We checked to ensure the best learning rates were in the middle of our grids, and that there was not a significant difference between the best learning rates.
我们进行了检查,确保最佳学习率位于网格的中间位置,并且各个最佳学习率之间没有显著差异。 -
Unless otherwise noted, we plot metrics for the best performing rate selected individually for each x-axis value.
除非另有说明,我们为每个 x 轴取值单独选择表现最好的学习率,并绘制对应的指标。 -
We find that the optimal learning rates do not vary too much as a function of the other parameters.
我们发现在改变其他参数的条件下,最优学习率并没有很大的不同。
Increasing parallelism 增加并行性
-
We first experiment with the client fraction $C$.
我们首先对客户端比例 $C$ 进行实验。 -
Table 1 shows the impact of varying $C$ for both MNIST models.
表 1 显示了不同的 $C$ 对两种 MNIST 模型的影响。 -
We report the number of communication rounds necessary to achieve a target test-set accuracy.
我们记录了到达目标测试集准确率需要的通信轮数。 -
To compute this, we construct a learning curve for each combination of parameter settings
为了获得这个通信轮数,我们设定了一条组合参数的学习曲线。 -
optimizing $\eta$ as described above and then making each curve monotonically improving by taking the best value of test-set accuracy achieved over all prior rounds.
即按前述方式优化 $\eta$,然后取此前所有轮次中达到的最佳测试集准确率,使每条曲线都单调改进。 -
We then calculate the number of rounds where the curve crosses the target accuracy,
我们随后通过线性插值法 -
using linear interpolation between the discrete points forming the curve.
在形成曲线的离散点之间计算曲线与目标准确率交叉的轮数 -
This is perhaps best understood by reference to Figure 2, where the gray lines show the targets.
这一点可以参考图2,其中灰线表示目标。 -
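这个"单调化曲线 + 线性插值求达到目标准确率的轮数"的计算过程,可以用下面的示意函数实现(函数名为笔者自拟,仅演示论文描述的统计方法):

```python
import numpy as np

def rounds_to_target(accuracies, target):
    # accuracies[t] 为第 t+1 轮通信后的测试集准确率
    best_so_far = np.maximum.accumulate(np.asarray(accuracies, dtype=float))  # 单调化
    rounds = np.arange(1, len(best_so_far) + 1)
    if best_so_far[-1] < target:
        return None                                  # 始终未达到目标,对应表中的 "—"
    j = int(np.argmax(best_so_far >= target))        # 第一个达到目标的离散点
    if j == 0:
        return 1.0
    # 在相邻两个离散点之间做线性插值,得到与目标准确率相交处的轮数
    return float(np.interp(target, best_so_far[j - 1:j + 1], rounds[j - 1:j + 1]))

print(rounds_to_target([0.90, 0.95, 0.96, 0.98], target=0.97))  # 3.5
```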
With B = ∞ (for MNIST processing all 600 client examples as a single batch per round),
当 B = ∞ 时(对于 MNIST,每轮将所有600个客户端示例作为一个批次处理) -
there is only a small advantage in increasing the client fraction.
在提升客户端比例时,只有一点小用。 -
Using the smaller batch size B = 10 shows a significant improvement in using C ≥ 0.1, especially in the non-IID case.
使用较小的批量 B = 10 时,取 C ≥ 0.1 带来了显著改进,尤其是在非独立同分布的情形下。 -
Based on these results, for most of the remainder of our experiments we fix C = 0.1.which strikes a good balance between computational efficiency and convergence rate.
基于这些结果,在接下来的大部分实验中我们固定 C = 0.1,这个取值在计算效率和收敛速度之间取得了良好的平衡。 -
Comparing the number of rounds for the B = ∞ and B = 10 columns in Table 1 shows a dramatic speedup, which we investigate next.
比较表中 B = ∞ and B = 10 的这两列,可以看到B = 10 有显著的加速,在接下来的研究中我们详细讨论这一问题。
Table 1 (超参数C 实验)实验结果✨
表 1:验证客户端比例 $C$ 的实验。MNIST 2NN 设置 $E=1$,CNN 设置 $E=5$。$C=0$ 表示每轮只用 1 个客户端进行训练;由于我们一共使用 100 个客户端,各列分别对应每轮随机选择 1、10、20、50 和 100 个客户端。每个表项是测试集达到指定准确率所需的通信轮数(MNIST 2NN 为测试集准确率达到 97%,CNN 为达到 99%),以及相对 $C=0$ 的加速比;表中的(—)表示没有在规定时间内达到指定准确率。
Increasing computation per client 增加每个客户端的计算量
-
In this section, we fix C=0.1, and add more computation per client on each round, either decreasing B,increasing E,or both
在这一部分,我们固定 C=0.1,并在每轮中为每个客户端增加更多计算量:减小 B、增大 E,或两者同时进行。 -
Figure 2 demonstrates that adding more local SGD updates per round can produce a dramatic decrease in communication costs,
图2中表示了,每轮在本地增加SGD更新,可以极大的减少通信成本。 -
and Table2 quantifies these speedups.
并且表2中统计这些速度的提升。 -
The expected number of updates per client per round is $u = \left(\frac{\mathbb{E}[n_k]}{B}\right) E = \frac{nE}{KB}$, where the expectation is over the draw of a random client $k$.
每个客户端每轮的期望更新次数为 $u = \left(\frac{\mathbb{E}[n_k]}{B}\right) E = \frac{nE}{KB}$,其中期望是对随机抽取的客户端 $k$ 取的。 -
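用前面 MNIST 的设置代入这个公式,可以直观看到各行对应的 $u$ 值(这里取 n=60000、K=100,即每个客户端 600 个样本;B=∞ 在该设置下相当于 B=600):

```python
def expected_updates_per_round(n, K, E, B):
    # u = (E[n_k]/B) * E = nE/(KB):随机抽中的客户端每轮期望执行的本地更新(mini-batch 步)数
    return n * E / (K * B)

print(expected_updates_per_round(60000, 100, E=5, B=10))   # 300.0
print(expected_updates_per_round(60000, 100, E=1, B=600))  # 1.0,即 FedSGD(B=∞、E=1)
```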
We order the rows in each section of Table 2 by this statistic.
我们针对这项统计量,在表中进行了排序。 -
We see the increasing u by varying both E and B is effective.
可以看到,通过同时改变 E 和 B 来增大 u 是有效的。 -
As long as B is large enough to take full advantage of available parallelism on the client hardware
如果B足够大,可以非常充分的利用好在客户端硬件上的并行性。 -
there is essentially no cost in computation time for lowering it, and so in practice this should be the first parameter tuned.
再降低 B 在计算时间上基本没有额外代价,因此在实践中应该首先调节这个参数。
Table2 (超参数 E B)实验结果✨
表 2:FedAvg 算法达到目标准确率所需的通信轮数;FedSGD 对应 $E=1$、$B=\infty$ 的情形。表中 $u$ 列按公式 $u=\frac{nE}{KB}$ 计算。
-
For the IID partition of the MINIST data, using more computation per client decreases the number of rounds to reach the target accuracy by 35× for CNN and 46× for the 2NN.
对于IID 的MINIST 数据集来说, 在每个客户端上增加更多的计算,可以显著减少达到目标准确率的通信轮数,对于CNN来说减少了35倍,对于2NN来说减少了46倍。 -
(see Table 4 in Appendix A for details for the 2NN)
2NN 模型的数据见附件中的Table 4.
-
The speedups for the pathologically partitioned non-IID data are smaller, but still substantial (2.8-3.7×)
对于非均匀数据划分的非IID数据来说,提升比较小,但是实际上也有2.8到3.7倍的提升。 -
It is impressive that averaging provides any advantage(Vs. actually diverging) when we naively average the parameters of models trained on entirely different pairs of digits.
令我们印象深刻的是,当我们简单地对训练于完全不同数字对上的模型参数进行平均时,竟然还能取得某种优势(而不是实际上导致发散) -
Thus, we view this as strong evidence for robustness of this approach.
因此我们认为这是这种方法具有鲁棒性的强有力的证据。 -
The unbalanced and non-IId distribution of the Shakespeare is much more representative of the kind of data distribution we expect for real-world applications.
莎士比亚数据集的不平衡和非独立同分布的分布,更能代表我们在真实应用中所预期的数据分布。 -
Encouragingly, for this problem learning on the non-IID and unbalanced data is actually much easier(a 95 × speedup vs 13× for the balanced IID data)
令人鼓舞的是,对这个问题而言,在非 IID 且不平衡的数据上学习实际上要容易得多(加速 95 倍,而平衡的 IID 数据只有 13 倍)。 -
we conjecture this is largely due to the fact some roles have relatively large local datasets,which makes increased local training particularly valuable
我们推测,这原因是,一些角色有较大的数据集,这使得增加本地训练特别有价值。 -
For all three model classes, FedAvg converges to a higher level of test-set accuracy than the baseline FedSGD models.
对于这三种模型的训练,FedAvg在测试集上的准确率都高于FedSGD模型。 -
This trend continues even if the lines are extended beyong the plotted ranges.
即使将曲线延伸到绘图范围之外,这一趋势仍然持续。 -
For example, for the CNN the B=∞,E=1 FedSGD model eventually reaches 99.22% accuracy after 1200 rounds.
举例来说对于CNN模型, B=∞,E=1 时,FedSGD 模型在训练了1200轮之后,达到的是99.22%的准确率。 -
while the B=10, E=20 FedAvg model reaches an accuracy of 99.44% after 300 rounds.
而 B=10、E=20 的 FedAvg 模型在 300 轮后就达到了 99.44% 的准确率。 -
We conjecture that in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout.
我们推测,除了降低通信成本之外,模型平均还产生了与 dropout 类似的正则化收益。 -
We are primarily concerned with generalization performance, but FedAvg is effective at optimizing the training loss as well, even beyong the point where test-set accuracy plateaus.
我们主要关心的是模型的泛化性能(generalization performance),但联邦平均算法(FedAvg)在优化训练损失(training loss)方面也很有效,甚至在测试集准确率达到平稳点之后依然能继续降低训练损失。 -
We observed similar behavior for all three model classes, and present plots for the MINIST CNN in Figure 6.
我们发现了对于这三种模型来说都有这种情况。 -
Can we over-optimize on the client datasets?
我们会不会在客户端数据集上过度优化? -
The current model parameters only influence the optimization performed in each ClientUpdata via initialization.
当前的模型参数仅通过初始化来影响每次 ClientUpdate 中执行的优化。 -
Thus,as E → ∞, at least for a convex problem eventually the initial conditions should be irrelevant, and the global minimum would be reached regardless of initialization.
因此,当 E → ∞ 时,至少对于凸问题而言,初始条件最终应该无关紧要,无论如何初始化都会达到全局最小值。 -
Even for a non-convex problem, one might conjecture the algorithm would converge to the same local minimum as long as the initialization was in the same basin.
即使对于非凸问题,也可以推测,只要初始化位于同一个盆地(basin)内,算法就会收敛到同一个局部极小值。 -
That is, we would expect that while one round of averaging might produce a reasonable model, additional rounds of communication (and averaging) would not produce further improvements.
也就是说,我们会预期:虽然一轮平均可能产生一个合理的模型,但额外的通信(和平均)轮次不会带来进一步的改进。 -
Figure 3 shows the impact of large E during initial training on the Shakespeare LSTM problem.
图片3显示了,更大的E对于莎士比亚LSTM问题的影响, -
Indeed, for very large number of local epochs, FedAvg can plateau or diverge.
确实,当本地训练轮数(epoch)非常大时,FedAvg 可能会停滞甚至发散。 -
it may be useful to decay the amount of local computation per round (moving to smaller E or larger B) in the same way decaying learning rates can be useful.
这一结果表明,就像衰减学习率一样,随着训练的进行衰减每轮的本地计算量(使用更小的 E 或更大的 B)可能是有益的。 -
Figure 8 in Appendix A gives the analogous experiment for the MNIST CNN。
在Figure8,中我们进行了类似的实验, -
Interestingly, for this model we see no significant degradation in the convergence rate for large values of E.
有趣的是,我们看到使用大值的E的收敛速度没有显著的下降。 -
However, we see slightly better performance for E=1 versus E=5 for the large-scale language modeling task described below.
然而我们看到在大语言模型任务上,E=1的效果,要稍微好于E=5。
CIFAR experiments
-
We also ran experiments on the CIFAR-10 dataset to further validate FedAvg.
我们又通过在CIFAR-10上进行实验,来进一步验证FedAvg的有效性。 -
The dataset consists of 10 classes of 32×32 images with three RGB channels.
这个数据集包含10类,大小为 32×32 的RGB三通道的图像。 -
There are 50000 training examples and 10000 testing examples, which we partitioned into 100 clients each containing 500 training and 100 testing examples;
这个数据集有50000个训练集,10000个测试集,我们划分出了100个客户端,每个客户端包括500个训练样本和100个测试样本。 -
since there isn’t a natural user partitioning of this data, we considered the balanced and IID setting.
由于该数据没有天然的按用户划分方式,我们考虑平衡且 IID 的设置。 -
The model architecture was taken from the TensorFlow tutorial
这个模型结构取自TensorFlow的教程。 -
which consists of two convolutional layers followed by two fully connected layers and then a linear transformation layer to produce logits, for a total of about $10^6$ parameters.
该模型包含两个卷积层,后面接两个全连接层,然后是一个产生 logits 的线性变换层,总共大约有 $10^6$ 个参数。 -
表 3:达到不同测试集准确率目标所需的通信轮数及相对基线 SGD 的加速比。基线 SGD 使用大小为 100 的 mini-batch;FedSGD 和 FedAvg 均使用 C=0.1,FedAvg 另取 E=5、B=50。
-
Note that state-of-the-art approaches have achieved a test accuracy of 96.5% for CIFAR;
目前最优的方法,已经将CIFAR数据集的准确率提升到了96.5% -
nevertheless ,the standard model we use is sufficient for our needs, as our goal is to evaluate our optimization method, not achieve the best possible accuracy on this task.
然而,我们使用的标准模型足以满足需求,因为我们的目标是评估我们的优化方法,而不是在这个任务上达到最佳准确率。 -
The images are preprocessed as part of the training input pipeline, which consists of cropping the images to 24×24,randomly flipping left-right and adjusting the contrast, brightness and whitening.
图像作为训练输入流水线的一部分进行预处理,包括将图像裁剪为 24×24、随机左右翻转,以及调整对比度、亮度并进行白化(whitening)。 -
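这套预处理在 PyTorch 下可以用 torchvision 近似实现如下(这不是论文使用的 TensorFlow 原始流水线;"白化"在这里用逐图标准化来近似,抖动幅度也是笔者随意假设的):

```python
import torchvision.transforms as T

# 训练集:随机裁剪到 24x24、随机左右翻转、随机调整亮度/对比度,最后逐图标准化近似白化
train_transform = T.Compose([
    T.RandomCrop(24),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.25, contrast=(0.2, 1.8)),
    T.ToTensor(),
    T.Lambda(lambda x: (x - x.mean()) / x.std().clamp(min=1e-6)),
])

# 测试集:只做中心裁剪和同样的标准化
test_transform = T.Compose([
    T.CenterCrop(24),
    T.ToTensor(),
    T.Lambda(lambda x: (x - x.mean()) / x.std().clamp(min=1e-6)),
])
```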
For these experiments, we considered an additional base line, standard SGD training on the full training set using mini-batches of size 100.
在这些实验中,我们考虑了一个额外的基线模型,标准的SGD训练,在整个训练集上,然后mini-batch大小设置为100。 -
We achieved an 86% test accuracy after 197500 mini-batch updates(each mini-batch update requires a communication round in the federated setting.)
经过 197,500 次 mini-batch 更新后,我们达到了 86% 的测试准确率(在联邦设置中,每次 mini-batch 更新都需要一轮通信)。 -
FedAvg achieves a similar test accuracy of 85% after only 2000 communication rounds.
联邦学习平均算法在2000次的通信后就达到了大约85%的准确率。 -
For all algorithms, we tuned a learning-rate. Table 3 gives the number of communication rounds for baseline SGD, FedSGD, and FedAvg to reach three different accuracy targets, and Figure 4 gives the learning-rate curves for FedAvg versus FedSGD.
对于所有算法,我们都调节了学习率。表 3 给出了基线 SGD、FedSGD 和 FedAvg 达到三个不同准确率目标所需的通信轮数,图 4 给出了 FedAvg 与 FedSGD 的学习率曲线。 -
By running experiments with minibatches of size B=50 for both SGD and FedAvg, we can also look at accuracy as a function of the number of such minibatch gradient calculations.
通过在 SGD 和 FedAvg 中都使用大小为 B=50 的 mini-batch 进行实验,我们还可以考察准确率如何随这种 mini-batch 梯度计算的次数而变化。 -
However as Figure 9 in the appendix show, for modest values of C and E, FedAvg makes a similar amount of progress per minibatch computation.
然而,如附录中的图 9 所示,对于适中的 C 和 E 取值,FedAvg 在每次 mini-batch 计算上取得的进展与 SGD 相近。 -
Further, we see that both standard SGD and FedAvg with only one client per round demonstrate significant oscillations in accuracy, whereas averaging over more clients smooths this out.
此外,我们发现标准 SGD 和每轮只用一个客户端的 FedAvg 的准确率都表现出明显的振荡,而对更多客户端取平均则能使波动变得平滑。
Large-scale LSTM experiments 大规模的LSTM实验
-
We ran experiments on a large-scale next-word prediction task to demonstrate the effectiveness of our approach on a real-world problem.
我们在一个大规模的下一个词预测任务上进行了实验,以展示我们的方法在真实问题上的有效性。 -
Our training dataset consists of 10 million public posts from a large social network.
我们的训练数据集由来自一个大型社交网络的 1000 万条公开帖子组成。 -
We grouped the posts by author, for a total of over 500,000 clients.
我们按照作者,对帖子进行分组,有一共超过500,000个客户端。 -
This dataset is a realistic proxy for the type of text entry data that would be present on a user’s mobile device.
此数据集是用户移动设备上可能出现的文本输入数据类型的真实代理。 -
We limited each client dataset to at most 5000 words and report accuracy (the fraction of the data where the highest predicted probability was on the correct next word, out of 10000 possibilites) on a test set of 1e5 posts from different (non-training) authors.
我们将每个客户端的数据集限制为最多 5000 个词,并在由来自不同(非训练)作者的 $10^5$ 条帖子组成的测试集上报告准确率(即在 10000 个可能的词中,模型给出最高预测概率的词恰为正确下一个词的比例)。 -
Our model is a 256 node LSTM on a vocabulary of 10000 words.
我们的模型是一个 256 节点的 LSTM,词表大小为 10000 个词。 -
The input and output embeddings for each word were of dimension 192.
输入和输出的词嵌入向量的大小为192。 -
and co-trained with the model; there are 4,950,544 parameters in all. We used an unroll of 10 words.
并与模型一起联合训练;总共有 4,950,544 个参数。我们使用 10 个词的展开长度(unroll)。 -
These experiments required significant computational resources and so we did not explore hyper-parameters as thoroughtly;
这些实验需要大量的计算资源,因此我们没有对超参数进行非常深入的探索。 -
all runs trained on 200 clients per round;
每轮训练选200个客户端。 -
FedAvg used B=8 and E=1.
FedAvg 使用 B=8、E=1。 -
We explored a variety of learning rates for FedAvg and the baseline FedSGD.
我们探索了各种学习率在FedAvg和FedSGD两个算法上。 -
Figure 5 shows monotonic learning curves for the best learning rates.
图5显示了最佳学习率的单调学习曲线。 -
FedSGD with $\eta=18$ required 820 rounds to reach 10.5% accuracy.
FedSGD 在学习率 $\eta=18$ 下需要 820 轮才能达到 10.5% 的准确率。 -
while FedAvg with $\eta=9$ reached an accuracy of 10.5% in only 35 communication rounds (23× fewer than FedSGD).
而 FedAvg 在学习率 $\eta=9$ 下,仅用 35 轮通信就达到了 10.5% 的准确率(比 FedSGD 少 23 倍)。 -
See Figure 10 in Appendix A. This figure also includes results for E=5, which performed slightly worse than E=1.
见附录 A 中的图 10。该图还包含 E=5 的结果,其性能略差于 E=1。
Conclusions and Future Work 结论和未来的工作
-
Our experiments show that federated learning can be made practical, as FedAvg trains high-quality models using relatively few rounds of communication, as demonstrated by results on a variety of model architectures:
我们的实验表明,联邦学习是切实可行的:FedAvg 只需相对较少的通信轮次即可训练出高质量的模型,这一点由多种模型架构上的结果所证明: -
a multi-layer perceptron, two different convolutional NNs, a two-layer character LSTM, and a large-scale word-level LSTM.
一个多层感知器、两个不同的卷积 NN、一个两层字符 LSTM 和一个大规模词级 LSTM。 -
While federated learning offers many practical privacy benefits, providing stronger guarantees via differential privacy, secure multi-party computation, or their combination is an interesting direction for future work.
虽然联邦学习提供了许多实际的隐私优势,但通过差分隐私 [14, 13, 1]、安全多方计算 [18] 或两者的组合来提供更强的保证,是未来工作中一个有趣的方向。 -
Note that both classes of techniques apply most naturally to synchronous algorithms like FedAvg
注意,这两类技术都最自然地适用于像 FedAvg 这样的同步算法。
附件:Fig2-Fig10
结束
这篇文章我自己精读就花了差不多4个完整白天的工作时间,也是第一次彻底把每一个概念都读懂、完完全全地精读一篇论文。把它分享出来,希望能为自己课题组的同学,以及之后做联邦学习的老师和同学们提供一些帮助。