Google — Federated Learning
📚 Paper: "Communication-Efficient Learning of Deep Networks from Decentralized Data"
I recently studied this paper, which introduced Federated Learning. These notes are mainly my understanding and reorganization of the original text, and I hope they help others who are getting started with federated learning.
⚠️ I am also new to FL, so some parts may be inaccurate or mistaken; I welcome discussion so we can all improve together~
-
1. Research Background
Mobile devices hold a wealth of useful data; models trained on it can greatly improve the user experience. However, this data is often privacy sensitive or very large, so it cannot simply be uploaded to the data center and used to train models with conventional approaches.
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches.
-
2. Main Contributions
Proposes an alternative approach to model training, Federated Learning, which:
- leaves the training data distributed on the mobile devices;
- learns a shared model by aggregating locally-computed updates;
We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.
-
Presents a practical federated learning algorithm based on iterative model averaging;
- robust to unbalanced and non-IID data distributions;
- communication cost is the principal constraint; the required communication rounds are reduced by 10–100× compared with synchronized stochastic gradient descent.
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
-
Introduces the FederatedAveraging algorithm; robust to unbalanced and non-IID data distributions; reduces the rounds of communication needed to train.
More concretely, we introduce the FederatedAveraging algorithm, which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.
-
3. Federated Learning
- Ideal problems for FL have data with the following properties (criteria):
- Training on real-world data from mobile devices provides a distinct advantage over training on the proxy data available in a data center;
- The data is privacy sensitive or large in size, so it is preferable not to log it to the data center purely for the purpose of model training;
- For supervised tasks, labels on the data can be inferred naturally from user interaction.
Federated Learning Ideal problems for federated learning have the following properties: 1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center. 2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle). 3) For supervised tasks, labels on the data can be inferred naturally from user interaction.
- Many intelligent applications on mobile devices have data that meets the criteria above:
- image classification: predicting which photos are most likely to be shared.
- language models: used to improve voice recognition and text entry on touch-screen keyboards.
The data is sensitive: users' photos or the text they type on the keyboard;
The distribution of this data also differs substantially from what proxy datasets provide, so it better reflects individual users and carries a distinct advantage;
Labels are also directly available: for example, entered text is self-labeled for language modeling, and photos can be labeled through natural user interaction (deleting, sharing, viewing).
Both tasks are also well suited to learning with neural networks: for image classification, feed-forward deep networks, in particular convolutional networks (LeCun et al., 1998; Krizhevsky et al., 2012); for language modeling, recurrent neural networks, in particular LSTMs (Hochreiter & Schmidhuber, 1997; Kim et al., 2015).
The potential training data for both these tasks (all the photos a user takes and everything they type) can be privacy sensitive. The distributions from which these examples are drawn are also likely to differ substantially from easily available proxy datasets: the use of language in chat and text messages is generally much different than standard language corpora, e.g., Wikipedia and other web documents; the photos people take on their phone are likely quite different than typical Flickr photos. And finally, the labels for these problems are directly available: entered text is self-labeled for learning a language model, and photo labels can be defined by natural user interaction with their photo app (which photos are deleted, shared, or viewed).
-
4. Privacy
- The information transmitted in FL is the minimal update necessary to improve a particular model (the strength of the privacy benefit depends on the content of the updates);
- The updates themselves are ephemeral and never contain more information than the raw training data, and generally contain much less;
- The aggregation algorithm does not need the source of the updates (it does not need to know which user they came from), so updates can be transmitted without identifying metadata over a mix network (e.g., Tor) or via a trusted third party;
- At the end of the paper, the possibility of combining federated learning with secure multiparty computation and differential privacy is briefly discussed.
Privacy Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data (Sweeney, 2000). In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates.) The updates themselves can (and should) be ephemeral. They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. Further, the source of the updates is not needed by the aggregation algorithm, so updates can be transmitted without identifying meta-data over a mix network such as Tor (Chaum, 1981) or via a trusted third party. We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.
-
5. Federated Optimization
- Federated optimization: the optimization problem implicit in federated learning (in contrast with classical distributed optimization)
- Key properties of federated optimization, compared with a typical distributed optimization problem:
- Non-IID user data: any particular user's local dataset is not representative of the population distribution;
- Unbalanced data volume: some users use the service far more than others, so the amount of local training data varies widely;
- Massively distributed: the number of clients participating in an optimization is much larger than the average number of examples per client;
- Limited communication: mobile devices are frequently offline or on slow or expensive connections.
Non-IID The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user’s local dataset will not be representative of the population distribution.
Unbalanced Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
Massively distributed We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
Limited communication Mobile devices are frequently offline or on slow or expensive connections.
⚠️ Key point: the paper focuses on the non-IID and unbalanced properties of federated optimization, as well as the critical nature of the communication constraints (a small simulation of such a client data partition is sketched right below).
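To make the non-IID and unbalanced properties concrete, here is a minimal sketch of how one might partition a labeled dataset across simulated clients, similar in spirit to the pathological non-IID MNIST partition used in the paper's experiments (sort by label, cut into shards, give each client a couple of shards). The function names and the lognormal draw of per-client sizes are my own illustrative assumptions, not the paper's code.

```python
import numpy as np

def pathological_partition(labels, num_clients=100, shards_per_client=2, seed=0):
    """Sort examples by label, cut them into shards, and give each client a few
    shards, so every client only sees a small number of classes (non-IID).
    `labels` is a 1-D array of integer class labels."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                      # group example indices by class
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    clients = []
    for k in range(num_clients):
        ids = shard_ids[k * shards_per_client:(k + 1) * shards_per_client]
        clients.append(np.concatenate([shards[s] for s in ids]))
    return clients                                  # clients[k] = indices held by client k

def unbalanced_sizes(num_clients=100, mean=600, seed=0):
    """Draw very uneven per-client dataset sizes (lognormal), illustrating the
    'unbalanced' property; the lognormal choice is an assumption for illustration."""
    rng = np.random.default_rng(seed)
    sizes = rng.lognormal(mean=np.log(mean), sigma=1.0, size=num_clients)
    return sizes.astype(int) + 1

if __name__ == "__main__":
    labels = np.random.default_rng(0).integers(0, 10, size=60_000)  # stand-in for MNIST labels
    clients = pathological_partition(labels)
    print("classes seen by client 0:", np.unique(labels[clients[0]]))
    print("example unbalanced sizes:", unbalanced_sizes()[:5])
```

With two shards per client, each simulated client sees examples from only a handful of classes, which is exactly the non-IID situation the paper stresses.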
- A deployed federated optimization system must also address many practical issues:
- Client datasets that change as data is added and deleted;
- Client availability that correlates with the local data distribution in complex ways;
- Clients that never respond, or that send corrupted updates.
In this work, our emphasis is on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints. A deployed federated optimization system must also address a myriad of practical issues: client datasets that change as data is added and deleted; client availability that correlates with the local data distribution in complex ways; and clients that never respond or send corrupted updates.
Note: these practical issues are beyond the scope of the paper; instead, it uses a controlled environment suitable for experiments, while still addressing the key issues of client availability and unbalanced, non-IID data.
These issues are beyond the scope of the current work; instead, we use a controlled environment that is suitable for experiments, but still address the key issues of client availability and unbalanced and non-IID data.
-
6. Optimization Method (Basic Procedure & Objective Formulation)
- Execution procedure:
Assumptions: a synchronous update scheme that proceeds in rounds of communication; a fixed set of K clients, each with a fixed local dataset;
- At the beginning of each round, a random fraction C of clients is selected (this should be a fraction of the client population, C ≤ 1);
- The server then sends the current global algorithm state (e.g., the current model parameters) to each of these clients;
- Each client then performs local computation based on the global state and its local dataset, and sends an update to the server;
- Finally, the server applies these updates to its global state, and the process repeats (see the sketch after the quoted passage below).
We assume a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). Each client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.
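As an illustration of one such communication round, here is a minimal sketch of FederatedAveraging-style training. The weighted model averaging by local dataset size n_k follows the paper's description; everything else (the names client_update and federated_averaging_round, the least-squares model, and all hyperparameters) is an assumption made purely for illustration, not the paper's code.

```python
import numpy as np

def client_update(w, X, y, epochs=1, batch_size=10, lr=0.1):
    """Local SGD on one client's data, starting from the global weights w.
    The least-squares loss here is an illustrative choice, not the paper's model."""
    w = w.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

def federated_averaging_round(w_global, clients, C=0.1, seed=0):
    """One synchronous round: select a C-fraction of clients, run local updates,
    then average the returned weights, weighted by local dataset size n_k."""
    rng = np.random.default_rng(seed)
    m = max(1, int(C * len(clients)))
    selected = rng.choice(len(clients), size=m, replace=False)
    n_total = sum(len(clients[k][1]) for k in selected)
    w_new = np.zeros_like(w_global)
    for k in selected:
        X_k, y_k = clients[k]
        w_k = client_update(w_global, X_k, y_k)
        w_new += (len(y_k) / n_total) * w_k
    return w_new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    clients = []
    for _ in range(20):                                   # 20 simulated clients
        n_k = int(rng.integers(20, 200))                  # unbalanced local dataset sizes
        X = rng.normal(size=(n_k, 2))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=n_k)))
    w = np.zeros(2)
    for r in range(10):                                   # 10 communication rounds
        w = federated_averaging_round(w, clients, C=0.5, seed=r)
    print("learned weights:", w)                          # should approach true_w
```

The key design point is that only model weights (or weight updates) ever leave a client; the raw local data (X_k, y_k) stays on the device.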
- The objective function for non-convex neural networks
While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form:
$$\min_{w \in \mathbb{R}^{d}} f(w), \quad \text{where} \quad f(w) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_{i}(w)$$
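For reference, the paper then specializes this finite-sum objective to the federated setting: with the data partitioned over K clients, where $\mathcal{P}_k$ is the set of example indices held by client k and $n_k = |\mathcal{P}_k|$, the objective is rewritten as

$$f(w) = \sum_{k=1}^{K} \frac{n_{k}}{n} F_{k}(w), \quad \text{where} \quad F_{k}(w) = \frac{1}{n_{k}} \sum_{i \in \mathcal{P}_{k}} f_{i}(w)$$

Here $f_i(w) = \ell(x_i, y_i; w)$ is the loss of the prediction on example $(x_i, y_i)$ made with model parameters $w$.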
-