2023-10 Decentralized Federated Averaging (writing-study notes)

Original paper: Decentralized Federated Averaging

Authors: Tao Sun, Dongsheng Li, and Bao Wang

Abstract

Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training with an enormous number of clients. In FedAvg, clients keep their data locally for privacy protection; a central parameter server is used to communicate between clients. This central server distributes the parameters to each client and collects the updated parameters from clients.

Introduces FedAvg: what kind of algorithm it is and its basic working mode.

FedAvg is mostly studied in centralized fashions, requiring massive communications between the central server and clients, which leads to possible channel blocking. Moreover, attacking the central server can break the whole system’s privacy.

Points out the problems with FedAvg: communication cost and vulnerability to attacks.

Indeed, decentralization can significantly reduce the communication of the busiest node (the central one) because all nodes only communicate with their neighbors.

Introduces the advantages of the decentralized approach.

To this end, in this paper, we study the decentralized FedAvg with momentum (DFedAvgM), implemented on clients that are connected by an undirected graph. In DFedAvgM, all clients perform stochastic gradient descent with momentum and communicate with their neighbors only. To further reduce the communication cost, we also consider the quantized DFedAvgM. The proposed algorithm involves the mixing matrix, momentum, client training with multiple local iterations, and quantization, introducing extra items in the Lyapunov analysis. Thus, the analysis of this paper is much more challenging than previous decentralized (momentum) SGD or FedAvg.

States the highlights of this paper and points out its differences from, and advantages over, similar work.

We prove convergence of the (quantized) DFedAvgM under trivial assumptions; the convergence rate can be improved to sublinear when the loss function satisfies the PL property. Numerically, we find that the proposed algorithm outperforms FedAvg in both convergence speed and communication cost.

Assumptions and results.

1 Introduction

Federated learning (FL) is a privacy-preserving distributed machine learning (ML) paradigm [1]. In FL, a central server connects with enormous clients (e.g., mobile phones, pads, etc.); the clients keep their data without sharing it with the server. In each communication round, clients receive the current global model from the server, and a small portion of clients are selected to update the global model by running stochastic gradient descent (SGD) [2] for multiple iterations using local data. The central server then aggregates these updated parameters to obtain the updated global model. The above learning algorithm is known as federated averaging (FedAvg) [1]. In particular, if the clients are homogeneous, FedAvg is equivalent to the local SGD [3]. FedAvg involves multiple local SGD updates and one aggregation by the server in each communication round, which significantly reduces the communication cost between server and clients compared to the conventional distributed training with one local SGD update and one communication.

Introduces the relationship among FL, FedAvg, and local SGD; local SGD saves communication cost.
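To make the round structure just described concrete, here is a minimal runnable sketch of one FedAvg training loop on toy least-squares clients. All names (`Client`, `local_sgd`, `fedavg_round`) and the toy data are my own illustration, not the authors' code.

```python
import numpy as np

class Client:
    """Toy client holding a private least-squares problem; stands in for local data."""
    def __init__(self, rng, dim=5, n=50):
        self.A = rng.normal(size=(n, dim))
        self.b = self.A @ rng.normal(size=dim) + 0.1 * rng.normal(size=n)

    def stochastic_grad(self, x, rng, batch=8):
        idx = rng.choice(len(self.b), size=batch, replace=False)
        A, b = self.A[idx], self.b[idx]
        return A.T @ (A @ x - b) / batch

def local_sgd(x, client, rng, lr=0.05, local_iters=10):
    """Multiple local SGD steps starting from the current global model."""
    x = x.copy()
    for _ in range(local_iters):
        x -= lr * client.stochastic_grad(x, rng)
    return x

def fedavg_round(x_global, clients, rng, frac=0.5):
    """One FedAvg round: sample a fraction of clients, run local SGD on each,
    and let the server average the returned models."""
    m = max(1, int(frac * len(clients)))
    selected = rng.choice(len(clients), size=m, replace=False)
    updates = [local_sgd(x_global, clients[i], rng) for i in selected]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
clients = [Client(rng) for _ in range(10)]
x = np.zeros(5)
for _ in range(20):          # 20 communication rounds
    x = fedavg_round(x, clients, rng)
```

The point of the structure is that communication happens once per round, while each selected client performs many local updates in between.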

1.1 Motivation

In FL applications, large companies and government organizations usually play the role of the central server. On the one hand, since the number of clients in FL is massive, the communication cost between the server and clients (the busiest communication in the centralized system) can be a bottleneck because all clients are connected with the central server [4]. On the other hand, the updated models collected from clients encode the private information of the local data; adversaries can attack the central server to break the privacy of the whole system, which leaves privacy as a serious concern. To this end, decentralized federated learning has been proposed [5], [6], where all clients are connected with an undirected graph, a.k.a. overlay network. Decentralized FL (DFL) replaces the server-clients communication in FL with clients-clients communication, or peer-to-peer communication.

Pain points of FL: 1. communication cost; 2. security. DFL is introduced to address these pain points, and its essence is summarized: the server-client communication in FL is replaced with client-client, i.e., peer-to-peer, communication.

Compared with centralized federated learning, decentralized federated learning enjoys several advantages: 1) DFL significantly reduces the communication cost of the busiest node in FL, namely the central server, since in centralized FL all nodes are connected to the central server, whereas in the decentralized case all nodes only communicate with their neighbors. One of the simplest decentralized FL cases uses a ring graph to connect all clients, in which each node connects to only two topological neighbors. 2) DFL is more robust to clients' failures than centralized FL. Centralized FL stops working if the central server breaks down, while decentralized FL can still work even if several clients fail. Thus, the decentralized scheme is more robust to potential clients' failures. 3) DFL is more resilient to potential privacy attacks than FL. Privacy is another primary concern of federated learning since the central server is also exposed to adversarial attacks. Notice that the central server contains all clients' information in FedAvg; if someone successfully attacks the central server, all information may be divulged. In the decentralized case, all clients only communicate with their neighbors, so only part of the information will be leaked if some clients are attacked. As confirmed numerically in our paper, DFL is more robust to potential privacy attacks, e.g., membership inference attacks.

Compares the advantages and disadvantages of DFL versus centralized FL (CFL):

1. DFL significantly reduces communication cost: DFL lowers the communication cost of centralized FL's busiest node (the central server), because in centralized FL all nodes connect to the central server, whereas in the decentralized case every node only communicates with its neighbors. One of the simplest decentralized FL cases uses a ring graph to connect all clients, where each node connects to only two topological neighbors.

2. DFL is more robust: centralized FL stops if the central server breaks down, whereas decentralized FL can keep working even if several clients fail.

3. DFL is better at defending against attacks: privacy is another major concern in federated learning because the central server is also exposed to adversarial attacks. Note that in FedAvg the central server holds all clients' information; if someone successfully attacks it, all information may be leaked. In the decentralized case, clients only communicate with their neighbors, so only part of the information is leaked if some clients are attacked. As confirmed numerically in the paper, DFL is more robust to potential privacy attacks, e.g., membership inference attacks.

In this paper, we consider two crucial issues about decentralized FL: 1) Although there is no expensive communication between server and clients in decentralized FL, the communication between local clients can be costly when the size of the ML model is large. Therefore, it is crucial to ask: can we reduce the client-client communication cost in DFL systems? 2) Momentum is a well-established acceleration technique for SGD [7]. It is natural to ask: can we use SGD with momentum to improve the training of ML models in decentralized FL with theoretical convergence guarantees?

States the two scientific questions this paper targets: 1. Can we reduce the client-client communication cost in DFL systems? (When the model is large, even node-to-node communication is expensive.)

2. Can we use SGD with momentum to improve the training of ML models in decentralized FL with theoretical convergence guarantees? (Momentum is a well-established acceleration technique for SGD.)

1.2 Other Related Work and Novelty

We briefly review three lines of work that are most related to this paper, i.e., federated learning, decentralized training, and decentralized federated learning.

Overview of the three lines of related work: FL, decentralized training, and DFL.

Federated Learning. Many variants of FedAvg have been developed with theoretical guarantees. In [8], the authors use the momentum method for local clients training in FedAvg. The authors of [9] propose the adaptive FedAvg, whose central parameter server uses the adaptive learning rate to aggregate local models. Lazy and quantized gradients are used to reduce communications [10], [11]. In the paper [12], the authors propose a Newton-type scheme for federated learning. The federated learning method has been applied to Internet of Things (IoT) research [13]. The convergence analysis of FedAvg on heterogeneous data is discussed in [14], [15], [16]. More details and applications of federated learning can be found in [17], [18]. Recent advances and some open problems in FL are available in the survey papers [19], [20].

Federated learning: momentum methods, adaptive learning rates, lazy and quantized gradients, Newton-type schemes, IoT devices, convergence analysis on heterogeneous data, applications, surveys.

Decentralized Training. Decentralized algorithms were originally developed to calculate the mean of data that are stored over multiple sensors [21], [22], [23], [24]. Decentralized (sub)gradient descent (DGD), one of the simplest and most efficient decentralized algorithms, has been studied in [25], [26], [27], [28], [29]. In DGD, the convexity assumption is unnecessary [30], which makes DGD useful for nonconvex optimization. A provably convergent decentralized SGD (DSGD) is proposed in [4], [31], [32]. The paper [31] provides the complexity result of a decentralized stochastic algorithm. In [32], the authors design a decentralized stochastic algorithm with dual information and provide a theoretical convergence guarantee. The authors of [4] prove that DSGD outperforms SGD in communication efficiency. Asynchronous DSGD is analyzed in [33]. DGD with momentum is proposed in [34], [35]. Quantized DSGD has been proposed in [36].

Decentralized training: decentralized algorithms were originally motivated by computing the mean of data stored on multiple sensors. DGD: decentralized gradient descent requires no convexity assumption. Theoretical work on DSGD: complexity, improvements with convergence guarantees, a proof that DSGD outperforms SGD in communication efficiency, asynchronous DSGD, momentum, quantization.

Decentralized Federated Learning. Decentralized FL is a learning paradigm of choice when the edge devices do not trust the central server in protecting their privacy [18]. The authors in [37] propose a novel FL framework without a central server for medical applications, and the new method offers a highly dynamic peer-to-peer environment. The papers [5], [6] consider training an ML model with a connected overlay network whose nodes take a Bayesian-like approach by introducing a prior of the parameter space.

Decentralized federated learning: decentralized FL is the learning paradigm of choice when edge devices do not trust the central server to protect their privacy. The authors of [37] propose a novel server-free FL framework for medical applications, and the new method offers a highly dynamic peer-to-peer environment. The papers [5], [6] consider training an ML model over a connected overlay network whose nodes take a Bayesian-like approach by introducing a prior over the parameter space.

Compared with existing works on FedAvg [1], [8], [9], [10], [11], this paper uses a decentralized framework to enhance the robustness of FL to node failures and privacy attacks. In contrast to decentralized training [5], [6], [37], all nodes perform multiple local iterations rather than only one and employ momentum in our algorithm. To further reduce the communication costs, we use the quantization technique. These new algorithms are much more complicated than FedAvg or DSGD, and their convergence analysis is significantly more challenging than analyzing FedAvg and DSGD. We present the detailed convergence results of the proposed algorithms under convex, nonconvex, and PL conditions. From a practical viewpoint, decentralized FL enjoys communication efficiency and fast convergence; we summarize the advantages of decentralized FL over FL in Table 1. Moreover, we present a sufficient condition that reveals when quantization enjoys communication efficiency and convergence tradeoffs.

Compared with conventional FedAvg, this paper uses a decentralized framework to improve robustness to node failures and privacy attacks. Compared with decentralized training methods, all nodes perform multiple local iterations instead of one and use momentum. Quantization is also introduced to further reduce communication cost. These new algorithms are far more complicated than FedAvg or DSGD, and their convergence analysis is correspondingly more challenging. Detailed convergence results are given under convex, nonconvex, and PL conditions. In addition, a sufficient condition is presented that reveals when quantization gives a favorable tradeoff between communication efficiency and convergence.

1.3 Contributions

We propose decentralized FedAvg with momentum (DFedAvgM) to improve training machine learning models in a DFL fashion. To further reduce the communication cost between clients, we also integrate model quantization, i.e., quantize the local machine learning models before communication, with DFedAvgM. Our contributions in this paper are threefold, as elaborated below.

Proposes DFedAvgM to improve DFL and introduces quantization to further reduce communication cost. The contributions fall into three parts.

Algorithmically, we extend FedAvg to the decentralized setting, where all clients are connected by an undirected graph. We motivate DFedAvgM from the DSGD algorithm. In particular, we use SGD with momentum to train ML models on each client. To reduce the communication cost between clients, we further introduce a quantized version of DFedAvgM, in which each client will send and receive a quantized model.

 Theoretically, we prove the convergence of (quantized) DFedAvgM. Our theoretical results show that the convergence rate of (quantized) DFedAvgM is not inferior to that of SGD or DSGD. More specifically, we show that the convergence rates of both DFedAvgM and quantized DFedAvgM depend on the local training and the graph that connects all clients. Besides the convergence results under nonconvex assumptions, we also establish their convergence guarantee under the Polyak-Lojasiewicz (PL) condition, which has been widely studied in nonconvex optimization. Under the PL condition, we establish a faster convergence rate for (quantized) DFedAvgM. Furthermore, we present a sufficient condition to guarantee reducing communication costs.

Empirically, we perform extensive numerical experiments on training deep neural networks (DNNs) on various datasets in both IID and Non-IID settings. Our results show the effectiveness of (quantized) DFedAvgM for training ML models, saving communication costs, and protecting membership privacy of training data.

Algorithm: propose DFedAvgM plus quantization.

Theory: convergence proofs (convex, nonconvex, PL).

Experiments: empirical validation.

1.4 Organization of the Paper

1.5 Notation

2 Problem Formulation and Assumptions

Defines the optimization problem:

An undirected graph establishes the structure of the decentralized system. (A standard way to write both is sketched below.)
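The formula itself did not survive in these notes. The following LaTeX block is my reconstruction of the standard decentralized finite-sum formulation over m clients and of the usual mixing-matrix description of the undirected communication graph; the paper's exact notation may differ.

```latex
% Problem (1): decentralized finite-sum objective over m clients (reconstructed standard form)
\[
\min_{x\in\mathbb{R}^d}\; f(x) := \frac{1}{m}\sum_{i=1}^{m} f_i(x),
\qquad f_i(x) := \mathbb{E}_{\xi\sim\mathcal{D}_i}\, F_i(x;\xi).
\]
% Communication structure: an undirected connected graph with a symmetric,
% doubly stochastic mixing matrix W supported on its edges
\[
\mathcal{G}=(\mathcal{V},\mathcal{E}),\quad |\mathcal{V}|=m,\qquad
W=[w_{ij}]\in\mathbb{R}^{m\times m},\quad W=W^{\top},\quad W\mathbf{1}=\mathbf{1},\quad
w_{ij}>0 \iff (i,j)\in\mathcal{E}\ \text{or}\ i=j.
\]
```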

 

Assumptions (standard forms are sketched after this list):

1. The function f is differentiable and L-smooth.

2. The gradient noise is bounded.

3. The gradients are bounded.

4. The PL property.
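The notes only name the assumptions. Their usual mathematical forms are written out below; this is my reconstruction, and the paper's constants and exact statements may differ (e.g., whether they apply to each local f_i or to the global f).

```latex
% 1. L-smoothness
\[ \|\nabla f_i(x)-\nabla f_i(y)\| \le L\,\|x-y\| \quad \forall\, x,y. \]
% 2. Bounded gradient noise (variance of the stochastic gradient)
\[ \mathbb{E}_{\xi}\big\|\nabla F_i(x;\xi)-\nabla f_i(x)\big\|^2 \le \sigma^2. \]
% 3. Bounded stochastic gradients
\[ \mathbb{E}_{\xi}\big\|\nabla F_i(x;\xi)\big\|^2 \le B^2. \]
% 4. Polyak-Lojasiewicz (PL) property of f
\[ \tfrac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,\big(f(x)-f^{\star}\big). \]
```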

3 Decentralized Federated Averaging

3.1 DFedAvgM

Briefly describes the decentralized training process (a one-step formula follows the list):

1. Client i holds an approximate copy of the parameters and computes an unbiased estimate of the gradient;

2. Client i updates its local parameters to the average over all neighboring nodes;

3. The client performs a gradient-descent step on the updated parameters.
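Written out, one round of this naive scheme takes roughly the following form; this is my reconstruction, and the exact ordering of the averaging and the gradient step may differ from the paper.

```latex
% Naive decentralized step: average with neighbors, then one stochastic gradient step
\[
x_i^{k+1} \;=\; \sum_{j\in\mathcal{N}(i)\cup\{i\}} w_{ij}\, x_j^{k}
\;-\; \eta\,\nabla F_i\big(x_i^{k};\,\xi_i^{k}\big),
\]
```

where \(\mathcal{N}(i)\) is the neighbor set of client \(i\), \(w_{ij}\) are the mixing weights, and \(\eta\) is the learning rate.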

A figure illustrates the shortcoming: each local training iteration requires a communication step. Hence the naive decentralized training algorithm above differs from FedAvg, which performs multiple local training steps before each communication. We therefore have to slightly modify the scheme of the decentralized algorithm. For simplicity, we consider modifying DSGD to motivate our decentralized FedAvg algorithm.

When DSGD is applied to solve problem (1), it yields the following update (essentially the step written out above):
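DFedAvgM modifies this DSGD step so that, between two rounds of neighbor averaging, each client runs several local SGD-with-momentum iterations instead of a single gradient step, as described in the abstract. Below is a minimal runnable sketch of one such round; the function and variable names, the toy quadratic clients, and the exact ordering of mixing and local training are my own simplifications, not the authors' implementation.

```python
import numpy as np

def dfedavgm_round(X, W, grads, rng, lr=0.05, beta=0.9, local_iters=10):
    """One DFedAvgM round (sketch).

    X     : (m, d) array whose row i is client i's current model.
    W     : (m, m) doubly stochastic mixing matrix of the undirected graph.
    grads : list of m callables; grads[i](x, rng) returns a stochastic gradient.
    Each client first averages with its neighbors (one communication), then runs
    `local_iters` steps of SGD with heavy-ball momentum on its own data.
    """
    X_mixed = W @ X                       # neighbor averaging
    X_new = np.empty_like(X)
    for i in range(X.shape[0]):
        x = X_mixed[i].copy()
        v = np.zeros_like(x)
        for _ in range(local_iters):      # multiple local iterations, no communication
            v = beta * v + grads[i](x, rng)
            x = x - lr * v
        X_new[i] = x
    return X_new

# Toy usage: m clients on a ring graph, each with a private quadratic objective.
m, d = 8, 5
rng = np.random.default_rng(0)
targets = [rng.normal(size=d) for _ in range(m)]
grads = [lambda x, r, t=t: (x - t) + 0.1 * r.normal(size=d) for t in targets]

W = np.zeros((m, m))                      # ring graph: each node mixes with two neighbors
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.25

X = np.zeros((m, d))
for _ in range(50):                       # 50 communication rounds
    X = dfedavgm_round(X, W, grads, rng)
```

Communication per round is one exchange of models with graph neighbors (the `W @ X` line); all other work is local.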

3.2 Quantization

Stochastic quantization (a sketch of a standard construction follows):
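Only the heading survives in these notes. A common form of unbiased stochastic quantization rounds each coordinate onto a uniform grid at random so that E[Q(x)] = x; the sketch below is a generic QSGD-style construction under my own choices (max-norm scaling, `num_levels` levels), not necessarily the exact quantizer used in the paper.

```python
import numpy as np

def stochastic_quantize(x, num_levels=16, rng=None):
    """Unbiased stochastic quantization: map each |x_i| to one of `num_levels`
    uniform levels of [0, max|x|], rounding up or down at random so that E[Q(x)] = x."""
    rng = rng or np.random.default_rng()
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return np.zeros_like(x)
    y = np.abs(x) / scale * num_levels        # position of |x_i| on the level grid
    lower = np.floor(y)
    prob_up = y - lower                       # round up with probability equal to the remainder
    levels = lower + (rng.random(x.shape) < prob_up)
    return np.sign(x) * levels / num_levels * scale

# Transmitting Q(x) only requires the signs, the integer levels, and one scalar scale,
# which is much cheaper than sending full-precision entries.
rng = np.random.default_rng(0)
x = rng.normal(size=6)
print(x)
print(stochastic_quantize(x, num_levels=4, rng=rng))
```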

4 Convergence Analysis

Points out how the algorithm differs from conventional algorithms and explains the difficulties in the proofs.

 
