BRIDGE: Byzantine-resilient Decentralized Gradient Descent
Authors: Cheng Fang, Zhixiong Yang, Waheed U. Bajwa
Abstract—Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multi-agent and Internet-of-Things systems) or computational/privacy reasons (e.g., large-scale machine learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems.
In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.

I. INTRODUCTION

One of the fundamental tasks of machine learning (ML) is to learn a model using training data that minimizes the statistical risk [1]. A typical technique that accomplishes this task is empirical risk minimization (ERM) of a loss function [2]–[6]. Under the ERM framework, an ML model is learned by an optimization algorithm that tries to minimize the average loss with respect to the training data that are assumed available at a single location. In many recent applications of ML, however, training data tend to be geographically distributed; examples include the multi-agent and Internet-of-Things systems, smart grids, sensor networks, etc. In several other recent applications of ML, the training data cannot be gathered at a single machine due to either the massive scale of data and/or privacy concerns; examples in this case include the social network data, smartphone data, healthcare data, etc. The applications in both such cases require that the ML model be learned using training data that are distributed over a network.

[Footnote: C. Fang (cf446@soe.rutgers.edu) and W.U. Bajwa (waheed.bajwa@rutgers.edu) are with the Department of Electrical and Computer Engineering, Rutgers University–New Brunswick, NJ 08854. Z. Yang (zhixiong.yang@bluedanube.com) completed this work as part of his PhD dissertation at Rutgers University; he is now a Machine Learning Systems Engineer at Blue Danube Systems. This work is supported in part by the National Science Foundation under awards CCF-1453073, CCF-1907658, and OAC-1940074, and by the Army Research Office under award W911NF2110301.]
When the ML/optimization algorithm in such applications requires a central coordinating server connected to all the nodes in the network, the resulting framework is often referred to as distributed learning [7]. Practical constraints many times also require an application to accomplish the learning tasks without a central server [8], in which case the resulting framework is referred to as decentralized learning. The focus of this paper is on decentralized learning, with a particular emphasis on characterizing the sample complexity of the decentralized learning algorithm—i.e., the rate, as a function of the number of training data samples, at which the ERM solution approaches the Bayes optimal solution in a decentralized setting [5], [8]. While decentralized learning has a rich history, a significant fraction of that work has focused on the faultless setting [9]–[13]. But real-world decentralized systems are bound to undergo failures because of malfunctioning equipment, cyberattacks, and so on [14]. And when the failures happen and go undetected, the learning algorithms designed for faultless networks break down [7], [15]. Among the different types of failures in the network, the so-called Byzantine failure [16] is considered the most general, as it allows the faulty/compromised nodes to arbitrarily deviate from the agreed-upon protocol [14]. Byzantine failures are the hardest to safeguard against and can easily jeopardize the ability of the network to reach consensus [17], [18]. Moreover, it has been shown in [15] that a single Byzantine node with a simple strategy can lead to the failure of a decentralized learning algorithm. The overarching goal of this paper is to develop and (algorithmically and statistically) analyze an efficient decentralized learning algorithm that is provably resilient against Byzantine failures in decentralized settings with respect to both convex and nonconvex loss functions.
A. Relationship to prior works

Although the model of Byzantine failure was brought up decades ago, it has attracted the attention of ML researchers only very recently. Motivated by applications in large-scale machine learning [8], much of that work has focused solely on the distributed learning setup such as the parameter–server setting [19] and the federated learning setting [20]. A necessarily incomplete list of these works, most of which have developed and analyzed Byzantine-resilient distributed learning approaches from the perspective of stochastic gradient descent, includes [21]–[42]. Nonetheless, translating the algorithmic and analytical insights from the distributed learning setups to the decentralized ones, which lack central coordinating servers, is a nontrivial endeavor. As such, despite the plethora of work on Byzantine-resilient distributed learning, the problem of Byzantine-resilient decentralized learning—with the exception of a handful of works discussed in the following—largely remains unexplored in the literature. In terms of decentralized learning in general, there are three broad classes of iterative algorithms that can be utilized for decentralized training purposes. The first of these classes corresponds to first-order methods such as distributed gradient descent (DGD) and its (stochastic) variants [43]–[46]. The iterative methods in this class have low (local) computational complexity, which makes them particularly well suited for large-scale problems. The second class of algorithms involves the use of augmented Lagrangian-based methods [47]–[49], which require each node in the network to locally solve an optimization subproblem. The third class of algorithms includes second-order methods [50], [51], which typically have high computational and/or communications cost.
Although the decentralized learning methods within these three classes of algorithms have their own sets of strengths and weaknesses, all of these traditional works assume faultless operations within the decentralized network. Within the context of Byzantine failures in decentralized systems, some of the first works focused on the problem of Byzantine-resilient averaging consensus [52], [53]. These works were then leveraged to develop theory and algorithms for Byzantine-resilient decentralized learning for the case of scalar-valued models [15], [54]. But neither of these works is applicable to the general vector-valued ML framework being considered in this paper. In parallel, some researchers have also developed Byzantine-resilient decentralized learning methods for some specific vector-valued problems that include the decentralized support vector machine [55] and decentralized estimation [56]–[58]. Similar to the classical ML framework, however, there is a need to develop and algorithmically/statistically analyze Byzantine-resilient decentralized learning methods for vector-valued models for general—rather than specialized—loss functions, which can be broadly divided into the two classes of convex and nonconvex loss functions. The first work in the literature that tackled this problem is [59], which developed a decentralized coordinate-descent-based learning algorithm termed ByRDiE and established its resilience to Byzantine failures in the case of a loss function that is given by the sum of a convex differentiable function and a strictly convex and smooth regularizer. The analysis in [59] also provided rates for algorithmic convergence as well as statistical convergence (i.e., sample complexity) of ByRDiE. One of the limitations of [59] is its exclusive focus on convex loss functions for the purposes of analysis.
More importantly, however, the coordinate-descent nature of ByRDiE makes it slow and inefficient for learning of large-scale models. Let d denote the number of parameters in the ML model being trained (e.g., the number of weights in a deep neural network). One iteration of ByRDiE then requires updating the d coordinates of the model in d network-wide collaborative steps, each one of which requires a computation of the local d-dimensional gradient at each node in the network. In the case of large-scale models such as deep neural networks with tens or hundreds of thousands of parameters, the local computation costs as well as the network-wide coordination and communications overhead of such an approach can be prohibitive for many applications. By contrast, since the algorithmic developments in this paper are based on the gradient-descent method, the resulting computational framework is highly efficient and scalable in a decentralized setting. And while the algorithmic and statistical convergence results derived here match those for ByRDiE in the case of convex loss functions, the proposed framework is fundamentally different from ByRDiE and therefore necessitates its own theoretical analysis. We conclude by noting that some additional works [60]–[63] relevant to the topic of Byzantine-resilient decentralized learning have appeared during the course of revising this paper, which are being discussed here for the sake of completeness. It is worth reminding the reader, however, that the work in this paper predates these recent efforts. Equally important, none of these works provide statistical convergence rates for the proposed methods. Additionally, the work in [60] only focuses on convex loss functions and it does not provide any convergence rates. Further, the ability of the proposed algorithm to defend against a large number of Byzantine nodes severely diminishes with an increase in the problem dimension.
In contrast, the authors in [61] focus on Byzantine-resilient decentralized learning in the presence of non-uniformly distributed data and time-varying networks. The focus in that work is also only on convex loss functions, and the performance of the proposed algorithm is worse than that of the approach advocated in this work for static networks and uniformly distributed data. Next, an algorithm termed MOZI is proposed in [62], with the focus once again being on convex loss functions. The resilience of MOZI, however, requires an aggressive two-step 'filtering' operation, which limits the maximum number of Byzantine nodes that can be handled by the algorithm. The analysis in [62] also makes the unrealistic assumption that the faulty nodes always send messages that are 'outliers' relative to those of the regular nodes. Finally, the only paper in the literature that has investigated Byzantine-resilient decentralized learning for nonconvex loss functions is [63]. The authors in this work have introduced three methods, among which the so-called ICwTM method is effectively a variant of our approach. The ICwTM algorithm, however, has at least twice the communications overhead of our approach, since it requires the neighbors to exchange both their local models and local gradients. In addition, [63] requires the nodes to have the same initialization and it does not bring out the dependence of the network topology on the learning problem.

Remark 1. While this paper was in review, a related work [68] appeared on a preprint server; this recent work on Byzantine-resilient decentralized learning, in contrast to our paper, studies a more general class of nonconvex loss functions and also allows the distribution of training data at the regular nodes to be heterogeneous. However, in addition to the fact that [68] significantly postdates our work, the main result in [68] relies on clairvoyant knowledge of several network-wide parameters, including the subset of Byzantine nodes within the network, and also requires the maximum 'cumulative mixing weight' associated with the Byzantine nodes to be impractically small (e.g., even in the case of a fully connected network, the cumulative weight must be no greater than 9.76 × 10^−5).

TABLE I: Comparison of BRIDGE with different vector-valued decentralized learning/optimization methods in the literature.

Algorithm                     | Nonconvex | Byzantine failures | Algorithmic convergence rate | Statistical convergence rate
ByRDiE [59]                   |     ×     |         √          |              √               |              √
Kuwaranancharoen et al. [60]  |     ×     |         √          |              ×               |              ×
Peng and Ling [61]            |     ×     |         √          |              √               |              ×
MOZI [62]                     |     ×     |         √          |              √               |              ×
ICwTM [63]                    |     √     |         √          |              √               |              ×
DGD [45]                      |     ×     |         ×          |              √               |              ×
NEXT [64]                     |     √     |         ×          |              ×               |              ×
Nonconvex DGD [65]            |     √     |         ×          |              √               |              ×
D-GET [66]                    |     √     |         ×          |              √               |              √
GT-SARAH [67]                 |     √     |         ×          |              √               |              √
BRIDGE (this paper)           |     √     |         √          |              √               |              √

B. Our contributions

One of the main contributions of this paper is the introduction of an efficient and scalable algorithmic framework for Byzantine-resilient decentralized learning. The proposed framework, termed Byzantine-resilient decentralized gradient descent (BRIDGE), overcomes the computational and communications overhead associated with the one-coordinate-at-a-time update pattern of ByRDiE through its use of gradient descent-style updates. Specifically, the network nodes locally compute the d-dimensional gradient (and exchange the local d-dimensional model) only once in each iteration of the BRIDGE framework, as opposed to the d computations of the d-dimensional gradient in each iteration of ByRDiE. The BRIDGE framework therefore has significantly less local computational cost due to fewer gradient computations, and it also has smaller network-wide coordination and communications overhead due to fewer exchanges of node-to-node messages.
Note that BRIDGE is being referred to as a framework since it allows for multiple variants of a single algorithm depending on the choice of the screening method used within the algorithm for resilience purposes; see Section III for further details. Another main contribution of this paper is the analysis of one of the variants of BRIDGE, termed BRIDGE-T, for resilience against Byzantine failures in the network. The analysis enables us to provide both algorithmic convergence rates and statistical convergence rates for BRIDGE-T for certain classes of convex and nonconvex loss functions, with the rates derived for the convex setting matching those for ByRDiE [59]. The final main contribution of this paper is the reporting of large-scale numerical results on the MNIST [69] and CIFAR-10 [70] datasets for both convex and nonconvex decentralized learning problems in the presence of Byzantine failures. The reported results, which include both independent and identically distributed (i.i.d.) and non-i.i.d. datasets within the network, highlight the benefits of the BRIDGE framework and validate our theoretical findings. In summary, and to the best of our knowledge, BRIDGE is the first Byzantine-resilient decentralized learning algorithm that is scalable, has results for a class of nonconvex learning problems, and provides rates for both algorithmic and statistical convergence. We also refer the reader to Table I, which compares BRIDGE with recent works in both faultless and faulty vector-valued decentralized optimization/learning settings. Additional relevant works not appearing in this table include [15], [54], since they limit themselves to scalar-valued problems, and [68], since it substantially postdates our work.
Further, [15], [54] neither study nonconvex loss functions nor derive statistical convergence rates, while the main result in [68]—despite the generality of its problem setup—is significantly restrictive, as discussed in Remark 1.

C. Notation and organization

The following notation is used throughout the rest of the paper. We denote scalars with regular-faced lowercase and uppercase letters (e.g., a and A), vectors with bold-faced lowercase letters (e.g., a), and matrices with bold-faced uppercase letters (e.g., A). All vectors are taken to be column vectors, while [a]_k and [A]_{ij} denote the k-th element of vector a and the (i, j)-th element of matrix A, respectively. We use ‖a‖ to denote the ℓ2-norm of a, 1 to denote the vector of all ones, and I to denote the identity matrix, while (·)^T denotes the transpose operation. Given two matrices A and B, the notation A ⪰ B signifies that A − B is a positive semidefinite matrix. We also use ⟨a_1, a_2⟩ to denote the inner product between two vectors. For a given vector a and nonnegative constant γ, we denote the ℓ2-ball of radius γ centered around a as B(a, γ) := {a′ : ‖a − a′‖ ≤ γ}. Finally, given a set, |·| denotes its cardinality, while we use the notation G(J, E) to denote a graph with the set of nodes J and edges E.

The rest of this paper is organized as follows. Section II provides a mathematical formulation of the Byzantine-resilient decentralized learning problem, along with a formal definition of a Byzantine node and various assumptions on the loss function. Section III introduces the BRIDGE framework and discusses different variants of the BRIDGE algorithm. Section IV provides theoretical guarantees for the BRIDGE-T algorithm for certain classes of convex and nonconvex loss functions, which include guarantees for network-wide consensus among the nonfaulty nodes and statistical convergence.
Section V reports results corresponding to numerical experiments on the MNIST and CIFAR-10 datasets for both convex and nonconvex learning problems, establishing the usefulness of the BRIDGE framework for Byzantine-resilient decentralized learning. We conclude the paper in Section VI, while the appendices contain the proofs of the main lemmas and theorems.

II. PROBLEM FORMULATION

A. Preliminaries

Let (w, z) ↦ f(w, z) be a non-negative-valued (and possibly regularized) loss function that maps the tuple of a model w and a data sample z to the corresponding loss f(w, z). Without loss of much generality, we assume the model w in this paper to be a parametric one, i.e., w ∈ R^d (e.g., d could be the number of parameters in a deep neural network). The data sample z, on the other hand, corresponds to a random variable on some probability space (Ω, F, P), i.e., z is F-measurable and has been drawn from the sample space Ω according to the probability law P. The holy grail in machine learning (ML) is to obtain an optimal model w* that minimizes the expected loss, termed the statistical risk [5], [6], i.e.,

    w* ∈ arg min_{w ∈ R^d} E_P[f(w, z)].    (1)

A model w* that satisfies (1) is termed a statistical risk minimizer (also known as a Bayes optimal model). In the real world, however, one seldom has access to the distribution of z, which precludes the use of the statistical risk E_P[f(w, z)] in any computations. Instead, a common approach utilized in ML is to leverage a collection Z := {z_n}_{n=1}^N of data samples that have been drawn according to the law P and solve an empirical variant of (1) as follows [5], [6]:

    w*_ERM ∈ arg min_{w ∈ R^d} (1/N) Σ_{n=1}^N f(w, z_n).    (2)

This approach, which is termed empirical risk minimization (ERM), typically relies on an optimization algorithm to solve for w*_ERM.
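As a concrete (and entirely illustrative) instance of solving the ERM problem (2), the sketch below runs full-batch gradient descent on a ridge-regression loss; the synthetic dataset, dimensions, regularization weight, and step size are assumptions made for the example, not quantities from the paper:

```python
import numpy as np

def erm_gradient_descent(Z, loss_grad, d, steps=2000, lr=0.1):
    """Approximate the ERM solution of (2) by full-batch gradient descent:
    at each step, descend along the average of the per-sample gradients."""
    w = np.zeros(d)
    for _ in range(steps):
        g = np.mean([loss_grad(w, z) for z in Z], axis=0)
        w -= lr * g
    return w

# Illustrative loss: f(w, z) = 0.5*(x^T w - y)^2 + 0.5*lam*||w||^2 with z = (x, y).
rng = np.random.default_rng(0)
d, N, lam = 3, 200, 0.1
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)
Z = list(zip(X, y))

def ridge_grad(w, z):
    x, t = z
    return (x @ w - t) * x + lam * w

w_erm = erm_gradient_descent(Z, ridge_grad, d)
```

Since this particular loss is quadratic, `w_erm` can be checked against the closed-form regularized least-squares solution, which is how one would sanity-test such a solver.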
The resulting solution ŵ, from the perspective of an ML practitioner, must satisfy two criteria: (i) it should have fast algorithmic convergence, measured in terms of the algorithmic convergence rate, to a fixed point (often taken to be a stationary point of (2) in centralized settings); and (ii) it should have fast statistical convergence, often specified in terms of the sample complexity (number of samples), to a statistical risk minimizer. Our focus in this paper, in contrast to several related prior works (cf. Table I), is on both the algorithmic and the statistical convergence of the ERM solution. The final set of results in this case rely on a number of assumptions on the loss function f(w, z), stated below.

Assumption 1 (Bounded and Lipschitz gradients). The loss function f(w, z) is differentiable in the first argument P-almost surely (a.s.) and the gradient of f(w, z) with respect to the first argument, denoted as ∇f(w, z), is bounded and L0-Lipschitz a.s., i.e.,

    ∀w ∈ R^d, ‖∇f(w, z)‖ ≤ L a.s., and
    ∀w_1, w_2 ∈ R^d, ‖∇f(w_1, z) − ∇f(w_2, z)‖ ≤ L0 ‖w_1 − w_2‖ a.s.

[Footnote: Unless specified otherwise, all almost sure statements in the paper are to be understood with respect to the probability law P.]

In the literature, functions with L0-Lipschitz gradients are also referred to as L0-smooth functions. Assumption 1 implies the loss function is itself a.s. L-Lipschitz continuous [71], i.e., ∀w_1, w_2 ∈ R^d, |f(w_1, z) − f(w_2, z)| ≤ L ‖w_1 − w_2‖ a.s.

Assumption 2 (Bounded training loss). The loss function is a.s. bounded over the training samples, i.e., there exists a constant C such that sup_{w ∈ R^d, z ∈ Z} f(w, z) ≤ C < ∞ a.s.

The analysis carried out in this paper considers two different classes of loss functions, namely, convex functions and nonconvex functions. In the case of analysis for the convex loss functions, we make the following assumption.
Assumption 3 (Strong convexity). The loss function f(w, z) is a.s. λ-strongly convex in the first argument, i.e., ∀w_1, w_2 ∈ R^d,

    f(w_1, z) ≥ f(w_2, z) + ⟨∇f(w_2, z), w_1 − w_2⟩ + (λ/2) ‖w_1 − w_2‖² a.s.

Note that the Lipschitz gradients assumption can be relaxed to Lipschitz subgradients in the case of strongly convex loss functions. Some examples of loss functions that satisfy Assumptions 1 and 3 arise in ridge regression, elastic net regression, ℓ2-regularized logistic regression, and ℓ2-regularized training of support vector machines when the optimization variable w is constrained to belong to a bounded set in R^d. Our discussion in the sequel shows that w indeed remains bounded for the algorithms in consideration, justifying the usage of Assumptions 1 and 3. And while Assumption 2, as currently stated, would not be satisfied for the aforementioned regularized problems, the analysis in the paper only requires boundedness of the data-dependent term(s) of the loss function over the finite set of training data. For the sake of compactness of notation, however, we refrain from expressing the loss function as the sum of two terms, with the implicit understanding that Assumption 2 is concerned only with the data-dependent component of f(·, ·). Finally, in contrast to Assumption 3, we make the following assumption in relation to the analysis of the class of nonconvex loss functions.

Assumption 3′ (Local strong convexity). The loss function f(w, z) is nonconvex and a.s. twice differentiable in the first argument. Next, let ∇²F(w) denote the Hessian of the statistical risk F(w) := E_P[f(w, z)] and let W*_s denote the set of all first-order stationary points of F(w), i.e., W*_s := {w ∈ R^d : ∇F(w) = 0}.
Then, for any w*_s ∈ W*_s, the statistical risk is locally λ-strongly convex in a sufficiently large neighborhood of w*_s, i.e., there exist positive constants λ and β such that ∀w ∈ B(w*_s, β), ∇²F(w) ⪰ λI.

It is straightforward to see that the local strong convexity of the statistical risk does not imply the global strong convexity of either the statistical risk or the loss function. Assumptions similar to Assumption 3′ are nowadays routinely used for analysis of nonconvex optimization problems in machine learning; see, e.g., [72]–[75]. In particular, Assumption 3′ along with the proof techniques utilized in this work allow the theoretical results to be applicable to a broader class of functions in which the strong convexity is preserved only locally.

Remark 2. The convergence guarantees for nonconvex loss functions in this paper are local in the sense that they hold as long as the BRIDGE iterates are initialized within a sufficiently small neighborhood of a stationary point (cf. Section IV). Such local convergence guarantees are typical of many results in nonconvex optimization (see, e.g., [76]–[78]), but they do not imply that an iterative algorithm requires knowledge of the stationary point.

B. System model for decentralized learning

Consider a network of M nodes (devices, machines, etc.), expressed as a directed, static, and connected graph G(J, E) in which the set J := {1, …, M} represents nodes in the network and the set of edges E represents communication links between the nodes. Specifically, (j, i) ∈ E if and only if node i can directly receive messages from node j and vice versa. We also define the neighborhood N_j of node j as the set of nodes that can send messages to it, i.e., N_j := {i ∈ J : (i, j) ∈ E}. We assume each node j has access to a local training dataset Z_j := {z_n^j}_{n=1}^{|Z_j|}.
For simplicity of exposition, we assume the cardinalities of the local training sets to be the same, as the generalization of our results to the case of the Z_j's not being same sized is trivial. Collectively, therefore, the network has a total of MN samples that could be utilized for learning purposes. In order to obtain an estimate of the statistical risk minimizer w* (cf. (1)) in this decentralized setting, one would ideally like to solve the following ERM problem:

    min_{w ∈ R^d} (1/(MN)) Σ_{j=1}^M Σ_{n=1}^N f(w, z_n^j) = min_{w ∈ R^d} (1/M) Σ_{j=1}^M f_j(w),    (3)

where we have used f_j(w) := (1/N) Σ_{n=1}^N f(w, z_n^j) to denote the local empirical risk associated with the j-th node. In particular, it is well known that the minimizer of (3) will statistically converge with high probability to w* in the case of a strictly convex loss function [1]. The problem in (3), however, necessitates bringing together of the data at a single location; as such, it cannot be practically solved in its current form in the decentralized setting. Instead, we assume each node j maintains a local version w_j of the desired global model and the nodes collaborate among themselves to solve the following decentralized ERM problem:

    min_{w_1, …, w_M} (1/M) Σ_{j=1}^M f_j(w_j)  subject to  ∀i, j, w_i = w_j.    (4)

Traditional decentralized learning algorithms proceed iteratively to solve this decentralized ERM problem [9]–[13], [47], [79]. This is typically accomplished through each node j engaging in two tasks during each iteration: update the local variable w_j according to some (local and data-dependent) rule g_j(·), and broadcast some summary of its local information to the nodes that have node j in their respective neighborhoods.
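The decomposition in (3) can be verified numerically: with equal-sized local datasets, the average of the local empirical risks f_j equals the centralized empirical risk. The sketch below does so with an illustrative squared loss and synthetic data (all names and sizes are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, d = 4, 50, 2          # illustrative: M = 4 nodes, N = 50 samples each
w = rng.normal(size=d)      # an arbitrary model at which to evaluate both risks

# Each node j holds a local dataset Z_j of N samples z = (x, y).
datasets = [(rng.normal(size=(N, d)), rng.normal(size=N)) for _ in range(M)]

def f(w, x, y):
    """Illustrative per-sample squared loss f(w, z) with z = (x, y)."""
    return 0.5 * (x @ w - y) ** 2

def local_risk(w, Zj):
    """Local empirical risk f_j(w) = (1/N) * sum_n f(w, z_n^j)."""
    Xj, yj = Zj
    return np.mean([f(w, Xj[n], yj[n]) for n in range(N)])

# (3): the centralized empirical risk over all M*N samples equals
# the average of the M local empirical risks.
centralized = np.mean([f(w, x, t) for Xj, yj in datasets for x, t in zip(Xj, yj)])
decentralized = np.mean([local_risk(w, Zj) for Zj in datasets])
assert np.isclose(centralized, decentralized)
```

Note that this equality relies on the equal-cardinality assumption stated above; with unequal |Z_j| the outer average would need to be weighted by |Z_j|.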
C. Byzantine-resilient decentralized learning

While decentralized learning is well understood in the case of faultless networks, the main assumption in this paper is that some of the network nodes can arbitrarily deviate from their intended behavior during the iterative process. Such deviations could be caused by malfunctioning equipment, cyberattacks, etc. We model the deviations of the faulty nodes as a Byzantine failure, which is formally defined as follows [14], [16].

Definition 1 (Byzantine node). A node j ∈ J is said to have undergone a Byzantine failure if, during any iteration of decentralized learning, it either updates its local variable w_j using an update rule g̃_j(·) ≠ g_j(·) or it broadcasts some information other than the intended summary of its local information to the nodes in its vicinity.

Throughout the remainder of this paper, we use R ⊆ J and B ⊂ J to denote the sets of nonfaulty and Byzantine nodes in the network, respectively. In addition, we use r to denote the cardinality of the set R and assume that the number of Byzantine nodes is upper bounded by an integer b. Thus, we have 0 ≤ |B| ≤ b and r ≥ M − b. In addition, without loss of generality, we label the nonfaulty nodes from 1 to r within our analysis, i.e., R := {1, …, r}. Under this assumption of Byzantine failures in the network, it is straightforward to see that the decentralized ERM problem as stated in (4) cannot be solved. Rather, the best one could hope for is to solve an ERM problem that is restricted to the set of nonfaulty nodes, i.e.,

    min_{w_j : j ∈ R} (1/r) Σ_{j ∈ R} f_j(w_j)  subject to  ∀i, j ∈ R, w_i = w_j,    (5)

except that the set R is unknown to an algorithm and therefore traditional decentralized learning algorithms cannot be utilized for this purpose.
Consequently, the main goal in this paper is threefold: (i) develop a decentralized learning algorithm that can provably solve some variant of the decentralized ERM problem (4); (ii) establish that the resulting solution statistically converges to the statistical risk minimizer (Assumption 3) or a stationary point of the statistical risk (Assumption 3′); and (iii) characterize the sample complexity of the solution, i.e., the statistical rate of convergence as a function of the number of samples, rN, associated with the nonfaulty nodes. In order to accomplish the stated goal of this paper, we need to make one additional assumption concerning the topology of the network. This assumption, which is common in the literature on Byzantine resilience within decentralized networks [15], [59], requires definitions of the notions of a source component of a graph and a reduced graph, G_red(b), of G.

Definition 2 (Source component). A source component of a graph is any subset of graph nodes such that each node in the subset has a directed path to every other node in the graph.

Definition 3 (Reduced graph). A subgraph G_red(b) of G is called a reduced graph with parameter b if it is generated from G by (i) removing all Byzantine nodes along with all their incoming and outgoing edges from G, and (ii) additionally removing b incoming edges from each nonfaulty node.

Assumption 4 (Sufficient network connectivity). The decentralized network is assumed to be sufficiently connected in the sense that all reduced graphs G_red(b) of the underlying graph G(J, E) contain at least one source component of cardinality greater than or equal to (b + 1).

We conclude by expanding further on Assumption 4, which concerns the redundancy of information flow within the network.
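For very small graphs, Definitions 2 and 3 and Assumption 4 can be checked directly by brute force: enumerate every candidate Byzantine set of size b, every way of additionally dropping (up to) b incoming edges per remaining node, and test whether at least b + 1 nodes can still reach the whole remaining graph. The sketch below is an illustrative exponential-time checker, not an efficient certification procedure (which, as noted in the text, remains an open problem); the example graph and the value of b are assumptions:

```python
from itertools import combinations, product

def reachable(adj, start):
    """Set of nodes reachable from `start` via directed edges (DFS)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def satisfies_assumption_4(nodes, edges, b):
    """Brute-force check of Assumption 4 (exponential cost; tiny graphs only).
    `edges` is a set of directed (sender, receiver) pairs."""
    for byz in combinations(nodes, b):                # candidate Byzantine sets
        regular = [v for v in nodes if v not in byz]
        # incoming edges at each regular node from other regular nodes
        inc = {r: [(s, rr) for (s, rr) in edges if rr == r and s in regular]
               for r in regular}
        # all ways of additionally removing b incoming edges per regular node
        per_node = [list(combinations(inc[r], min(b, len(inc[r]))))
                    for r in regular]
        for removal in product(*per_node):
            removed = {e for grp in removal for e in grp}
            adj = {s: [r for (ss, r) in edges
                       if ss == s and r in regular and (ss, r) not in removed]
                   for s in regular}
            # Definition 2: a source component is a set of nodes each of which
            # has a directed path to every node of the (reduced) graph.
            sources = [v for v in regular if reachable(adj, v) == set(regular)]
            if len(sources) < b + 1:
                return False
    return True

# Illustrative check: the complete bidirectional graph on 5 nodes with b = 1.
nodes = list(range(5))
edges = {(i, j) for i in nodes for j in nodes if i != j}
ok = satisfies_assumption_4(nodes, edges, 1)
```

A directed 4-node ring, by contrast, fails this check for b = 1, since removing one node and one incoming edge per survivor disconnects the remaining path.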
In words, this assumption ensures that each nonfaulty node can continue to receive information from a few other nonfaulty nodes even after a certain number of edges have been removed from every nonfaulty node. And while efficient certification of this assumption remains an open problem, there is an understanding of the generation of graphs that satisfy this assumption [52]. In addition, we have empirically observed that Assumption 4 is often satisfied in Erdős–Rényi graphs as long as the degree of the least connected node is larger than 2b. This is also the approach we take while generating graphs for our numerical experiments.

Remark 3. In the finite sample regime, in which each (nonfaulty) node has only a finite number of training data samples, the local empirical risk f_j(w) at every node will be different due to the randomness of the data samples, regardless of whether the training data across the nodes are i.i.d. or non-i.i.d. While this makes the formulation in (5) similar to the one in [54] and [60] for scalar-valued and vector-valued Byzantine-resilient decentralized optimization, respectively, the fundamental difference between the statistical learning framework of this work and the optimization-only framework in [54], [60] is the intrinsic focus on sample complexity in statistical learning. In particular, the sample complexity results in Section IV-B also help characterize the gap between local-only learning, in which every regular node learns its own model using its own training data, and decentralized learning, in which the regular nodes collaborate to learn a common model.

III. BYZANTINE-RESILIENT DECENTRALIZED GRADIENT DESCENT

In the faultless case, the decentralized ERM problem (4) can be solved, perhaps to one of its stationary points, using any one of the distributed/decentralized optimization methods in the literature [43]–[46], [64]–[67].
The prototypical distributed gradient descent (DGD) method [43] with decreasing step size, for instance, accomplishes this for (strongly) convex loss functions by letting each node in iteration (t + 1) update its local variable w_j(t) as

$$w_j(t+1) = \sum_{i \in \mathcal{N}_j \cup \{j\}} a_{ji}\, w_i(t) - \rho(t)\,\nabla f_j(w_j(t)), \qquad (6)$$

where 0 ≤ a_{ji} ≤ 1 is the weighting that node j applies to the local variable w_i(t) that it receives from node i, and {ρ(t)} denotes a positive sequence of step sizes that satisfies ρ(t+1) ≤ ρ(t), ρ(t) → 0 as t → ∞, $\sum_{t=0}^{\infty}\rho(t) = \infty$, and $\sum_{t=0}^{\infty}\rho^2(t) < \infty$. One choice for such a sequence is $\rho(t) = \frac{1}{\lambda(t_0 + t)}$ for some t_0, which ensures that a network-wide consensus is reached among all nodes, i.e., ∀i, j, w_i(t) → w_j(t) as t → ∞, and all local variables converge to the decentralized (and thus the centralized) ERM solution.

Traditional distributed/decentralized optimization methods, however, fail to reach a stationary point of the decentralized ERM problem (4) (or its restricted variant (5)) in the presence of a single Byzantine failure in the network [27]–[29], [54], [80]–[82]. To overcome this shortcoming of the traditional approaches as well as improve on the limitations of existing works on Byzantine-resilient decentralized learning (cf.

Algorithm 1 The BRIDGE Framework
Input: Local datasets Z_j, maximum number of Byzantine nodes b, step-size sequence {ρ(t)}_{t=0}^∞, and maximum number of iterations t_max
1: Initialize: t ← 0 and w_j(0), ∀j ∈ R
2: for t = 0, 1, ..., t_max − 1 do
3:   Broadcast w_j(t), ∀j ∈ R
4:   Receive w_i(t) at each node j ∈ R from every i ∈ N_j ⊂ (R ∪ B)
5:   y_j(t) ← screen({w_i(t)}_{i ∈ N_j ∪ {j}}), ∀j ∈ R
6:   w_j(t+1) ← y_j(t) − ρ(t) ∇f_j(w_j(t)), ∀j ∈ R
7: end for
Output: w_j(t_max), ∀j ∈ R
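As a concrete illustration, the DGD update (6) together with a valid diminishing step-size sequence can be sketched in a few lines of Python. This is a minimal sketch of our own, not code from the paper; the names `dgd_step` and `step_size` and their default constants are illustrative.

```python
import numpy as np

def step_size(t, lam=1.0, t0=10):
    """One valid choice rho(t) = 1 / (lam * (t0 + t)): it is decreasing,
    vanishes as t -> infinity, is non-summable, and is square-summable."""
    return 1.0 / (lam * (t0 + t))

def dgd_step(w_j, received, weights, grad_fn, rho):
    """One DGD iteration at node j per (6): form the weighted average of
    the iterates in N_j together with j's own iterate (the weights a_ji
    sum to one), then take a local gradient step on f_j."""
    mixed = sum(weights[i] * received[i] for i in weights)
    return mixed - rho * grad_fn(w_j)
```

For example, with two nodes holding 1.0 and 3.0, uniform weights, and the gradient of f_j(w) = w²/2, a single step from w_j = 1 with ρ = 0.1 mixes to 2.0 and then moves to 1.9.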
Sections I-A and I-B), we introduce an algorithmic framework termed Byzantine-resilient decentralized gradient descent (BRIDGE). The BRIDGE framework, which is listed in Algorithm 1, is a gradient descent-based approach whose main update step (Step 6 in Algorithm 1) is similar to the DGD update (6). The main difference between the BRIDGE framework and DGD is that each node j ∈ R in BRIDGE screens the incoming messages from its neighboring nodes for potentially malicious content (Step 5 in Algorithm 1) before updating its local variable w_j(t). Note, however, that BRIDGE does not permanently label any nodes as malicious, which also allows it to manage any transitory Byzantine failures in a graceful manner. While this makes BRIDGE similar to the ByRDiE algorithm [59], the fundamental advantage of BRIDGE over ByRDiE is its scalability, which comes from the fact that it eschews the one-coordinate-at-a-time update of the local variables in ByRDiE in favor of one update of the entire vector w_j(t) in each iteration.

In terms of other details, the BRIDGE framework is input with the maximum number of Byzantine nodes b that need to be tolerated, a decreasing step-size sequence {ρ(t)}, and the maximum number of gradient descent iterations t_max. Next, the local variable at each nonfaulty node j in the network is initialized at w_j(0). Afterward, within every iteration (t + 1) of the framework, each node j ∈ R broadcasts w_j(t) as well as receives w_i(t) from every node i ∈ N_j. This is followed by every node j ∈ R screening the received w_i(t)'s for any malicious information and then updating the local variable w_j(t). In the following, we discuss four different variants of the screening rule (Step 5 in Algorithm 1), each one of which in turn gives rise to a different realization of the BRIDGE framework.
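To make Algorithm 1 concrete, the BRIDGE loop can be sketched in Python as below, simulated over the nonfaulty nodes only; the identifiers (`bridge`, `neighbors_of`, etc.) are our own, and `screen` is a placeholder into which any of the four screening rules discussed next can be plugged.

```python
import numpy as np

def bridge(w0, local_grads, neighbors_of, screen, rho, t_max):
    """Minimal sketch of Algorithm 1 restricted to the nonfaulty nodes.
    `w0` maps node id -> initial iterate, `local_grads` maps node id ->
    gradient of f_j, `neighbors_of` maps node id -> list of neighbor ids,
    and `screen` implements Step 5 of Algorithm 1."""
    w = dict(w0)
    for t in range(t_max):
        broadcast = {j: w[j].copy() for j in w}                  # Step 3
        for j in w:
            recvd = [broadcast[i] for i in neighbors_of[j]]      # Step 4
            y = screen(recvd + [broadcast[j]])                   # Step 5
            w[j] = y - rho(t) * local_grads[j](w[j])             # Step 6
    return w
```

With `screen` set to a plain coordinate-wise mean this reduces to DGD with uniform weights, which is the non-robust baseline the screening rules are designed to harden.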
The motivation for these screening rules comes from the literature on robust statistics [83], with all these rules appearing in some form within the literature on robust averaging consensus [52], [53] and robust distributed learning [21], [27]–[29]. The challenge here of course, as discussed in Section I, is that these prior works on robust averaging consensus and distributed learning do not translate into equivalent results for Byzantine-resilient decentralized learning.

Variant   | Screening                | Min. neighborhood size | Avg. computational complexity
BRIDGE-T  | coordinate-wise          | 2b + 1                 | O(nd), where n := max_j |N_j|
BRIDGE-M  | coordinate-wise          | 1                      | O(nd)
BRIDGE-K  | vector                   | b + 3                  | O(n²d)
BRIDGE-B  | vector + coordinate-wise | max(4b, 3b + 2) + 1    | O(n²d)

TABLE II: Comparison between the four different variants of the BRIDGE framework.

The BRIDGE-T variant of the BRIDGE framework uses the coordinate-wise trimmed mean as the screening rule. A similar screening principle has been utilized within distributed frameworks [28] and decentralized frameworks [15], [54], [59]. The coordinate-wise trimmed-mean screening within BRIDGE-T filters the b largest and the b smallest values in each coordinate of the local variables w_i(t) received from the neighborhood of node j and uses an average of the remaining values for the update of w_j(t). Specifically, for any iteration index t, BRIDGE-T finds the following three sets for each coordinate k ∈ {1, ..., d} in parallel:

$$\underline{\mathcal{N}}_j^k(t) := \arg\min_{\mathcal{X} : \mathcal{X} \subset \mathcal{N}_j, |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k, \qquad (7)$$

$$\overline{\mathcal{N}}_j^k(t) := \arg\max_{\mathcal{X} : \mathcal{X} \subset \mathcal{N}_j, |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k, \quad \text{and} \qquad (8)$$

$$\mathcal{C}_j^k(t) := \mathcal{N}_j \setminus \left\{ \underline{\mathcal{N}}_j^k(t) \cup \overline{\mathcal{N}}_j^k(t) \right\}. \qquad (9)$$

Afterward, the screening routine outputs a combined and filtered vector y_j(t) whose k-th element is given by

$$[y_j(t)]_k = \frac{1}{|\mathcal{N}_j| - 2b + 1} \sum_{i \in \mathcal{C}_j^k(t) \cup \{j\}} [w_i(t)]_k. \qquad (10)$$
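The trimmed-mean screening (7)–(10) can be sketched as follows, assuming the neighbor iterates are stacked into a |N_j| × d array; the function name and array layout are our own choices.

```python
import numpy as np

def trimmed_mean_screen(w_neighbors, w_self, b):
    """Coordinate-wise trimmed-mean screening of BRIDGE-T, eqs. (7)-(10):
    per coordinate, drop the b largest and b smallest neighbor values and
    average the survivors together with the node's own value."""
    W = np.stack(w_neighbors)                # |N_j| x d neighbor iterates
    W_sorted = np.sort(W, axis=0)            # sort each coordinate separately
    survivors = W_sorted[b:W.shape[0] - b]   # the set C_j^k, per coordinate
    # average over the |N_j| - 2b survivors plus the node's own iterate
    return (survivors.sum(axis=0) + w_self) / (W.shape[0] - 2 * b + 1)
```

For instance, with neighbor values 0, 1, 2, 100 in some coordinate and b = 1, the extremes 0 and 100 are trimmed, so a single Byzantine outlier cannot drag the average.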
Notice that BRIDGE-T requires each node to have at least 2b + 1 neighbors. Also, note that the elements from different neighbors may survive the screening at different coordinates. Therefore, the average within BRIDGE-T is not taken over vectors; rather, the calculation of y_j(t) has to be carried out in a coordinate-wise manner.

The BRIDGE-M variant uses the coordinate-wise median as the screening rule, with a similar screening idea having been utilized within distributed frameworks [28]. Similar to BRIDGE-T, BRIDGE-M is also a coordinate-wise screening procedure, in which the k-th element of the combined and filtered output y_j(t) takes the form

$$[y_j(t)]_k = \mathrm{median}\,\{[w_i(t)]_k\}_{i \in \mathcal{N}_j \cup \{j\}}. \qquad (11)$$

Notice that, unlike BRIDGE-T, the coordinate-wise median screening within BRIDGE-M neither requires explicit knowledge of b nor imposes an explicit constraint on the minimum number of neighbors of each node.

The BRIDGE-K variant uses the Krum function as the screening rule, which is similar to the screening principle that has been employed within distributed frameworks [21]. In terms of specifics, the Krum screening for the decentralized framework can be described as follows. Given i, h ∈ N_j ∪ {j}, write h ∼ i if w_h(t) is one of the |N_j| − b − 2 vectors with the smallest Euclidean distance, expressed as ‖w_h(t) − w_i(t)‖, from w_i(t). The Krum-based screening at node j then finds the neighbor index i*_j(t) as

$$i_j^*(t) = \arg\min_{i \in \mathcal{N}_j} \sum_{h \in \mathcal{N}_j \cup \{j\} : h \sim i} \|w_h(t) - w_i(t)\|, \qquad (12)$$

and outputs the (combined and filtered) vector y_j(t) as y_j(t) = w_{i*_j}(t). Unlike BRIDGE-T and BRIDGE-M, BRIDGE-K utilizes vector-valued operations for screening, resulting in the surviving vector y_j(t) being taken entirely from one neighbor of each node. The Krum screening rule within BRIDGE-K requires the neighborhood of every node to be larger than b + 2.
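A minimal sketch of the Krum screening (12); here the iterates of N_j ∪ {j} are passed as a dict keyed by node id, with `j` marking the node's own entry. Names and data layout are illustrative, not from the paper.

```python
import numpy as np

def krum_screen(w_dict, j, b):
    """Krum screening of BRIDGE-K, eq. (12): score each neighbor i by the
    sum of distances to its |N_j| - b - 2 nearest vectors in N_j u {j},
    and output the lowest-scoring neighbor's iterate unchanged.
    Requires |N_j| > b + 2 so that at least one distance is summed."""
    ids = list(w_dict)                 # neighbor ids plus j itself
    n_closest = len(ids) - 1 - b - 2   # |N_j| - b - 2 (ids include j)
    scores = {}
    for i in ids:
        if i == j:
            continue                   # candidates range over N_j only
        dists = sorted(np.linalg.norm(w_dict[h] - w_dict[i])
                       for h in ids if h != i)
        scores[i] = sum(dists[:n_closest])
    i_star = min(scores, key=scores.get)
    return w_dict[i_star]
```

In a toy neighborhood where one neighbor sits far from the rest, its score is dominated by the large distances and it is never selected.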
Note that since the Krum function requires the pairwise distances of all nodes within the neighborhood of every node, BRIDGE-K has high computational complexity in comparison to BRIDGE-T and BRIDGE-M.

Last but not least, the BRIDGE-B variant—inspired by a similar screening procedure within distributed frameworks [27]—uses a combination of Krum and coordinate-wise trimmed mean as the screening rule. Specifically, the screening within BRIDGE-B involves first selecting |N_j| − 2b neighbors of the j-th node by recursively finding an index i*_j(t) ∈ N_j using (12), removing the selected node from the neighborhood, finding a new index from N_j \ {i*_j(t)} using (12) again, and repeating this Krum-based process |N_j| − 2b times. Next, coordinate-wise trimmed mean-based screening, as described within BRIDGE-T, is applied to the received w_i(t)'s of the |N_j| − 2b neighbors of node j that survive the first-stage Krum-based screening. Intuitively, the Krum-based vector screening first guarantees that the surviving neighbors have the closest w_i(t)'s in terms of the Euclidean distance, while the coordinate-wise trimmed-mean screening afterward guarantees that each coordinate of the combined and filtered vector y_j(t) only includes the "inlier" values. The cost of this two-step screening procedure includes high computational complexity due to the use of the Krum function and the stricter requirement that the neighborhood of each node be larger than max(4b, 3b + 2).

We conclude by providing a comparison in Table II between the four different variants of the BRIDGE framework. Note that both BRIDGE-T and BRIDGE-B reduce to DGD in the case of b = 0; this, however, is not the case for BRIDGE-M and BRIDGE-K. It is also worth noting that additional variants of BRIDGE can be obtained through further combinations of the different screening rules and/or incorporation of additional ideas from the literature on robust statistics.
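The two-stage BRIDGE-B screening can be sketched as follows. This self-contained version re-implements the Krum score inline and, as a simplification of our own, floors the shrinking closest-neighbor count at one in the later Krum rounds, which the text above does not spell out.

```python
import numpy as np

def bridge_b_screen(w_dict, j, b):
    """Two-stage BRIDGE-B screening sketch: (i) run the Krum rule (12)
    |N_j| - 2b times, each time keeping the winner and removing it from
    the candidate pool; (ii) apply coordinate-wise trimmed-mean averaging
    to the survivors together with node j's own iterate."""
    pool = {i: w for i, w in w_dict.items() if i != j}
    n_keep = len(pool) - 2 * b
    kept = []
    for _ in range(n_keep):                         # stage (i): recursive Krum
        ref = dict(pool)
        ref[j] = w_dict[j]
        n_closest = max(len(ref) - 1 - b - 2, 1)    # floor of 1 (our choice)
        scores = {}
        for i in pool:
            d = sorted(np.linalg.norm(ref[h] - ref[i]) for h in ref if h != i)
            scores[i] = sum(d[:n_closest])
        i_star = min(scores, key=scores.get)
        kept.append(pool.pop(i_star))
    W = np.sort(np.stack(kept), axis=0)             # stage (ii): trimmed mean
    survivors = W[b:len(kept) - b]
    return (survivors.sum(axis=0) + w_dict[j]) / (len(kept) - 2 * b + 1)
```

With b = 1 this needs at least max(4b, 3b + 2) + 1 = 5 neighbors, matching the minimum neighborhood size for BRIDGE-B in Table II.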
Nonetheless, each variant of the BRIDGE framework requires its own theoretical analysis for convergence guarantees. In this paper, we limit ourselves to BRIDGE-T for this purpose and provide the corresponding guarantees in the next section.

IV. BRIDGE-T: CONVERGENCE GUARANTEES

In this section, we derive the algorithmic and statistical convergence guarantees for BRIDGE-T for both convex and nonconvex loss functions. While these results match those for ByRDiE for convex loss functions, they cannot be obtained directly from [59] since the filtered vector y_j(t) in BRIDGE-T corresponds to an iteration-dependent amalgamation of the iterates {w_i(t)} in the neighborhood of node j (cf. (10)). Our statistical convergence guarantees require the training data to be independent and identically distributed (i.i.d.) among all the nodes. Additionally, let w* denote the unique global minimizer of the statistical risk in the case of the strongly convex loss function, while it denotes one of the first-order stationary points of the statistical risk in the case of the nonconvex loss function. We assume there exists a positive constant Γ such that for all j ∈ R and all t, ‖w_j(t) − w*‖ ≤ Γ. Note that this Γ can be arbitrarily large and so the above assumption, which is again needed for derivation of the statistical rate of convergence, is a mild one (also, see Lemma 4 and its accompanying discussion).

Broadly speaking, the analysis for our algorithmic and statistical convergence guarantees proceeds as follows. First, we define a "consensus" vector v(t) and establish in Section IV-A that ∀j ∈ R, w_j(t) → v(t) as t → ∞ for an appropriate choice of the step-size sequence ρ(t) that satisfies ρ(t+1) ≤ ρ(t), ρ(t) → 0 as t → ∞, $\sum_{t=0}^{\infty}\rho(t) = \infty$, and $\sum_{t=0}^{\infty}\rho^2(t) < \infty$.
In particular, we work with the step size $\rho(t) = \frac{1}{\lambda(t_0 + t)}$, $t_0 \geq \frac{L}{\lambda}$, where L is the Lipschitz constant of the loss function; see, e.g., Assumption 1. This consensus analysis relies on the condition that the w_j(0)'s for all j ∈ R are initialized such that v(0) ∈ B(w*, Γ). One such choice of initialization is to initialize all w_j(0)'s within B(w*, Γ). (Note that in terms of our analysis for the nonconvex loss function, which is provided in Section IV-B2, we will impose additional constraints on the initialization.) Afterward, we define two vectors u(t+1) and x(t+1) in Section IV-B that correspond to gradient descent updates of the consensus vector v(t) using, respectively, a convex combination of gradients of the local loss functions evaluated at v(t) and the gradient of the statistical risk evaluated at v(t). Finally, we define the following collection of distances: a_1(t) := ‖x(t+1) − w*‖, a_2(t) := ‖u(t+1) − x(t+1)‖, a_3(t) := ‖v(t+1) − u(t+1)‖, and a_4(t+1) := max_{j∈R} ‖w_j(t+1) − v(t+1)‖. It is straightforward to see that ‖v(t+1) − w*‖ ≤ a_1(t) + a_2(t) + a_3(t), and we establish in Section IV-B that a_1(t) + a_2(t) + a_3(t) → 0 with high probability as well as a_4(t+1) → 0, using Assumptions 1, 2, 3, and 4 in the convex setting and Assumptions 1, 2, 3′, and 4 in the nonconvex setting, thereby completing our proof of optimality for both convex and nonconvex loss functions.

A. Consensus guarantees

Let us pick an arbitrary index k ∈ {1, ..., d} and define a vector Ω(t) ∈ R^r whose respective elements correspond to the k-th element of the iterates w_j(t) of the nonfaulty nodes, i.e., ∀j ∈ R, [Ω(t)]_j = [w_j(t)]_k.
Note that Ω(t) as well as most of the variables in our discussion in this section depend on the index k; however, since k is arbitrary, we drop this explicit dependence on k in many instances for simplicity of notation. We first show that the BRIDGE-T update at the nonfaulty nodes in the k-th coordinate can be expressed in a form that only involves the nonfaulty nodes. Specifically, we write

$$\Omega(t+1) = Y(t)\,\Omega(t) - \rho(t)\,g(t), \qquad (13)$$

where the vector g(t) is defined as [g(t)]_j = [∇f_j(w_j(t))]_k, j ∈ R. The formulation of the matrix Y(t) in this expression is as follows. Let N_j^r denote the nonfaulty nodes in the neighborhood of node j, i.e., N_j^r := R ∩ N_j. The set of Byzantine neighbors of node j can then be defined as N_j^b := N_j \ N_j^r. To make the rest of the expressions clearer, we drop the iteration index t for the remainder of this discussion, even though the variables are still t-dependent. Let us now define the notation b* := |B| as the actual (unknown) number of Byzantine nodes in the network, b_j^k as the number of Byzantine nodes remaining in the filtered set C_j^k, and q_j^k := b − b* + b_j^k. Since b − b* ≥ 0 by assumption and b_j^k ≥ 0 by definition, notice that only one of two cases can happen during each iteration for every coordinate k: (i) q_j^k > 0 or (ii) q_j^k = 0.

For case (i), we either have b − b* > 0 or b_j^k > 0 or both. These conditions correspond to the scenario in which node j filters out more than b regular nodes from its neighborhood. Thus, we know that $\underline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r \neq \emptyset$. Likewise, it follows that $\overline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r \neq \emptyset$. Then there exist $m'_j \in \underline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r$ and $m''_j \in \overline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r$ satisfying $[w_{m'_j}]_k \leq [w_i]_k \leq [w_{m''_j}]_k$ for any i ∈ C_j^k. Thus, for every i ∈ C_j^k ∩ N_j^b, there exists θ_i ∈ (0, 1) satisfying $[w_i]_k = \theta_i [w_{m'_j}]_k + (1 - \theta_i)[w_{m''_j}]_k$.
Consequently, the elements of the matrix Y can be written as

$$[Y]_{ji} = \begin{cases} \dfrac{1}{2(|\mathcal{N}_j| - 2b + 1)}, & i \in \mathcal{N}_j^r \cap \mathcal{C}_j^k, \\[4pt] \dfrac{1}{|\mathcal{N}_j| - 2b + 1}, & i = j, \\[4pt] \dfrac{\sum_{i' \in \mathcal{N}_j^b \cap \mathcal{C}_j^k} \theta_{i'}}{q_j^k(|\mathcal{N}_j| - 2b + 1)} + \dfrac{\sum_{i' \in \mathcal{N}_j^r \cap \mathcal{C}_j^k} \theta_{i'}}{q_j^k(|\mathcal{N}_j| - 2b + 1)}, & i \in \underline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r, \\[4pt] \dfrac{\sum_{i' \in \mathcal{N}_j^b \cap \mathcal{C}_j^k} (1 - \theta_{i'})}{q_j^k(|\mathcal{N}_j| - 2b + 1)} + \dfrac{\sum_{i' \in \mathcal{N}_j^r \cap \mathcal{C}_j^k} (1 - \theta_{i'})}{q_j^k(|\mathcal{N}_j| - 2b + 1)}, & i \in \overline{\mathcal{N}}_j^k \cap \mathcal{N}_j^r, \\[4pt] 0, & \text{otherwise}. \end{cases} \qquad (14)$$

For case (ii), we must have that b − b* = 0 and b_j^k = 0. Thus, all the filtered nodes in C_j^k would be regular nodes in this case. Therefore, we can describe Y in this case as

$$[Y]_{ji} = \begin{cases} \dfrac{1}{|\mathcal{N}_j| - 2b + 1}, & i \in \{j\} \cup \mathcal{C}_j^k, \\[4pt] 0, & \text{otherwise}. \end{cases} \qquad (15)$$

Combining the expressions of Y in the two cases above allows us to express the update in (13) exclusively in terms of information from the nonfaulty nodes.

Next, we define ψ to be the total number of reduced graphs that can be generated from G, the parameter ν as ν := ψr, and the maximum neighborhood size of the nonfaulty nodes as N_max := max_{j∈R} |N_j|. Further, we define a transition matrix Φ(t, t_0) from some index t_0 ≤ t to t, i.e.,

$$\Phi(t, t_0) := Y(t)\,Y(t-1)\cdots Y(t_0). \qquad (16)$$

Then it follows from [84, Lemma 4] that if Assumption 4 is satisfied, then

$$\lim_{t\to\infty} \Phi(t, t_0) = \mathbf{1}\,\alpha^T(t_0), \qquad (17)$$

where the vector α(t_0) ∈ R^r satisfies [α(t_0)]_j ≥ 0 and $\sum_{j=1}^r [\alpha(t_0)]_j = 1$. In particular, we have [84, Theorem 3]

$$\big|[\Phi(t, t_0)]_{ji} - [\alpha(t_0)]_i\big| \leq \mu^{\frac{t - t_0 + 1}{\nu}}, \qquad (18)$$

where μ ∈ (0, 1) is defined as $\mu := 1 - \frac{1}{(2N_{\max} - 2b + 1)^{\nu}}$. Next, it follows from (13) and the definition of Φ(t, t_0) that

$$\begin{aligned} \Omega(t) &= Y(t-1)\,\Omega(t-1) - \rho(t-1)\,g(t-1) \\ &= Y(t-1)Y(t-2)\cdots Y(0)\,\Omega(0) - \sum_{\tau=0}^{t-1} Y(t-1)Y(t-2)\cdots Y(\tau+1)\,\rho(\tau)\,g(\tau) \\ &= \Phi(t-1, 0)\,\Omega(0) - \sum_{\tau=0}^{t-1} \Phi(t-1, \tau+1)\,\rho(\tau)\,g(\tau). \end{aligned} \qquad (19)$$
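The limit (17)–(18) can be illustrated numerically: a long left-multiplied product of row-stochastic matrices with strictly positive entries collapses to a rank-one matrix 1·αᵀ, so every row carries the same weight vector α. The dense random matrices below are a stand-in of our own for the sparse, structured Y(t)'s, so this only demonstrates the mechanism, not the theorem itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_row_stochastic(r):
    # strictly positive entries guarantee a contraction in each product
    Y = rng.random((r, r)) + 0.1
    return Y / Y.sum(axis=1, keepdims=True)

# Phi(t, 0) = Y(t) Y(t-1) ... Y(0): new factors multiply on the left
Phi = np.eye(4)
for _ in range(300):
    Phi = random_row_stochastic(4) @ Phi

# all rows of Phi now (numerically) agree: Phi -> 1 * alpha^T as in (17)
row_spread = np.abs(Phi - Phi[0]).max()
```

The product stays row-stochastic throughout, and the gap between rows shrinks geometrically, mirroring the μ^((t−t₀+1)/ν) rate in (18).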
Now, similar to [84, Convergence Analysis of Algorithm 1], suppose all nodes stop computing their local gradients after iteration t so that g(τ) = 0 when τ > t. Note that this is without loss of generality when we let t approach infinity, as we recover BRIDGE-T in that case. Further, let T ≥ 0 be an integer and define a vector v̄(t) as follows:

$$\begin{aligned} \bar{v}(t) &= \lim_{T\to\infty} \Omega(t + T + 1) \\ &= \lim_{T\to\infty} \Phi(t+T, 0)\,\Omega(0) - \lim_{T\to\infty} \sum_{\tau=0}^{t+T} \Phi(t+T, \tau+1)\,\rho(\tau)\,g(\tau) \\ &= \mathbf{1}\,\alpha^T(0)\,\Omega(0) - \sum_{\tau=0}^{t-1} \mathbf{1}\,\alpha^T(\tau+1)\,\rho(\tau)\,g(\tau). \end{aligned} \qquad (20)$$

Notice that v̄(t) is a constant vector and we define a scalar-valued sequence v(t) to be any one of its elements. We now show that [w_j(t)]_k → v(t). Indeed, we have from (20) that

$$v(t) = \sum_{i=1}^r [\alpha(0)]_i [w_i(0)]_k - \sum_{\tau=0}^{t-1} \rho(\tau) \sum_{i=1}^r [\alpha(\tau+1)]_i [\nabla f_i(w_i(\tau))]_k. \qquad (21)$$

Also recall from the update of [w_j(t)]_k that

$$[w_j(t)]_k = \sum_{i=1}^r [\Phi(t-1, 0)]_{ji} [w_i(0)]_k - \sum_{\tau=0}^{t-1} \rho(\tau) \sum_{i=1}^r [\Phi(t-1, \tau+1)]_{ji} [\nabla f_i(w_i(\tau))]_k.$$

From Assumption 1 and the initialization of the w_j(t)'s, there exist two scalars C_w and L such that ∀j ∈ R, |[w_j(0)]_k| ≤ C_w and |[∇f_j(w_j)]_k| ≤ L. Therefore, we have

$$\begin{aligned} \big|[w_j(t)]_k - v(t)\big| &\leq \Big|\sum_{i=1}^r \big([\Phi(t-1, 0)]_{ji} - [\alpha(0)]_i\big)[w_i(0)]_k\Big| \\ &\quad + \Big|\sum_{\tau=0}^{t-1} \rho(\tau) \sum_{i=1}^r \big([\Phi(t-1, \tau+1)]_{ji} - [\alpha(\tau+1)]_i\big)[\nabla f_i(w_i(\tau))]_k\Big| \\ &\leq r C_w\,\mu^{\frac{t}{\nu}} + r L \sum_{\tau=0}^{t} \rho(\tau)\,\mu^{\frac{t-\tau+1}{\nu}} \;\xrightarrow{t\to\infty}\; 0. \end{aligned} \qquad (22)$$

Here, the fact that the second term in the second inequality of (22) converges to zero follows from our assumptions on the decreasing step-size sequence along with [84, Lemma 6]. Finally, recall that the vector-valued iterate updates in BRIDGE-T can be thought of as individual updates of the d coordinates in parallel.
Therefore, since the coordinate k was arbitrarily picked, we have proven that BRIDGE-T achieves consensus among the nonfaulty nodes for both convex and nonconvex loss functions, as summarized in the following.

Theorem 1. Define a vector v(t) ∈ R^d as one whose k-th entry [v(t)]_k is given by the right-hand side of (21). If Assumptions 1 and 4 are satisfied, then the gap between w_j(t), ∀j ∈ R, and v(t) goes to 0 as t → ∞, i.e.,

$$\lim_{t\to\infty} a_4(t) = \lim_{t\to\infty} \max_{j\in\mathcal{R}} \|w_j(t) - v(t)\| \leq \lim_{t\to\infty} \Big[\sqrt{d}\,r C_w\,\mu^{\frac{t}{\nu}} + \sqrt{d}\,r L \sum_{\tau=0}^{t} \rho(\tau)\,\mu^{\frac{t-\tau+1}{\nu}}\Big] = 0. \qquad (23)$$

We conclude our discussion with a couple of remarks. First, note that Theorem 1 has been obtained without needing Assumptions 2 and 3/3′. Thus, BRIDGE-T guarantees consensus among the nonfaulty nodes for both convex and nonconvex loss functions under a general set of assumptions. Second, notice that the second term in (23) is a sum of t terms and it converges to 0 at a slower rate than the first term. Among all the sub-terms in this sum, the last one, ρ(t)μ^(1/ν), converges to zero at the slowest rate. Thus, the rate at which BRIDGE-T achieves consensus is determined by this sub-term and is given by O(√d ρ(t)). In particular, if we choose ρ(t) to be O(1/t), the rate of consensus for BRIDGE-T is O(√d/t).

B. Statistical optimality guarantees

While Theorem 1 guarantees consensus among the nonfaulty nodes by providing an upper bound on the distance a_4(t), this result alone cannot be used to characterize the gap between the iterates {w_j(t)}_{j∈R} and the global minimizer (resp., first-order stationary point) w* of the statistical risk for the convex loss function (resp., nonconvex loss function). We accomplish the goal of establishing the statistical optimality by providing bounds on the remaining three distances a_1(t), a_2(t), and a_3(t) described in the beginning of the section.
We start with a bound on a_3(t). To this end, let v(t) be as defined in Theorem 1 and notice from (20) that

$$v(t+1) = v(t) - \rho(t)\,g_1(t), \qquad (24)$$

where g_1(t) has its k-th entry defined in terms of gradients of the local loss functions evaluated at the local iterates as $[g_1(t)]_k = \sum_{j=1}^r [\alpha^k(t)]_j [\nabla f_j(w_j(t))]_k$. Here, α^k(t) is an element of the probability simplex associated with the k-th coordinate of the consensus vector v(t), as described earlier. Next, define another vector g_2(t) whose k-th entry is defined in terms of gradients of the local loss functions evaluated at the consensus vector as $[g_2(t)]_k = \sum_{j=1}^r [\alpha^k(t)]_j [\nabla f_j(v(t))]_k$. Further, define a new sequence u(t+1) as

$$u(t+1) = v(t) - \rho(t)\,g_2(t). \qquad (25)$$

It then follows from (22), Assumption 1, and some algebraic manipulations that

$$a_3(t) = \|v(t+1) - u(t+1)\| = \rho(t)\,\|g_2(t) - g_1(t)\| \leq \rho(t)\,L' \max_{j\in\mathcal{R}} \|v(t) - w_j(t)\| = \rho(t)\,L'\,a_4(t). \qquad (26)$$

We next turn our attention to a bound on a_2(t), which is necessarily going to be probabilistic in nature because of its dependence on the training samples, and define a new sequence x(t) as

$$x(t+1) = v(t) - \rho(t)\,\nabla \mathbb{E}_P[f(v(t), z)]. \qquad (27)$$

Trivially, we have

$$a_2(t) = \|u(t+1) - x(t+1)\| = \rho(t)\,\|g_2(t) - \nabla\mathbb{E}_P[f(v(t), z)]\|. \qquad (28)$$

The following lemma, whose proof is provided in Appendix A, now bounds a_2(t) by establishing that g_2(t) converges in probability to ∇E_P[f(v(t), z)].

Lemma 1. Suppose Assumptions 1 and 2 are satisfied and the training data are i.i.d.
Then, fixing any δ ∈ (0, 1), we have with probability at least 1 − δ that

$$a_2(t) \leq \rho(t)\,\sup_t \|g_2(t) - \nabla\mathbb{E}_P[f(v(t), z)]\| = O\left(\sqrt{\frac{d\,\|\alpha_m\|^2 \log\frac{2}{\delta}}{N}}\;\rho(t)\right), \qquad (29)$$

where the vector α_m ∈ R^r is a problem-dependent (unknown) vector defined in Appendix A and satisfies [α_m]_j ≥ 0 and $\sum_{j=1}^r [\alpha_m]_j = 1$.

Notice that while the bounds for a_2(t), a_3(t), and a_4(t) have been obtained for both the convex and nonconvex loss functions in the same manner, the bound for a_1(t) = ‖x(t+1) − w*‖ in the nonconvex setting does require the aid of Assumption 3′, as opposed to Assumption 3 in the convex setting. This leads to separate proofs of the final statistical optimality results under Assumption 3 and Assumption 3′.

1) Statistical optimality for the convex case: Notice that x(t+1) is obtained from v(t) by taking a regular gradient descent step, with step size ρ(t), with respect to the gradient ∇E_P[f(v(t), z)] of the statistical risk. Under Assumptions 1 and 3, therefore, it follows from our understanding of the behavior of gradient descent iterations that [85, Chapter 2.1.5]

$$a_1(t) = \|x(t+1) - w^*\| \leq (1 - L'\rho(t))\,\|v(t) - w^*\| \leq (1 - \lambda\rho(t))\,\|v(t) - w^*\|, \qquad (30)$$

where the last inequality holds because λ ≤ L'. We then have the following bound:

$$\|v(t+1) - w^*\| \leq (1 - \lambda\rho(t))\,\|v(t) - w^*\| + a_2(t) + a_3(t). \qquad (31)$$

Notice that (31) only provides the relationship between steps t and t+1. In order to bound the distance ‖v(t+1) − w*‖ in terms of the initial distance ‖v(0) − w*‖, we can recursively make use of (31) to arrive at the following lemma.

Lemma 2. Suppose Assumptions 1, 2, 3, and 4 are satisfied and the training data are i.i.d.
Then, fixing any δ ∈ (0, 1), an upper bound on ‖v(t+1) − w*‖ holds with probability at least 1 − δ:

$$\|v(t+1) - w^*\| \leq \frac{t_0}{t + t_0}\,C_1 + \frac{C_2(N)}{\lambda} + \frac{C_3}{t + t_0} + C_4\,\frac{1}{t + t_0}\left(\frac{1}{t_0} + \frac{1}{1 + t_0} + \frac{1}{2 + t_0} + \cdots + \frac{1}{t + t_0}\right), \qquad (32)$$

where $C_1 = \|v(0) - w^*\|$, $C_2(N) = O\Big(\sqrt{\frac{d\|\alpha_m\|^2 \log\frac{2}{\delta}}{N}}\Big)$, $C_3 = \frac{\sqrt{d}\,L' r C_w}{\lambda(1 - \mu^{1/\nu})}$, and $C_4 = \frac{\sqrt{d}\,L L' r\,\mu^{1/\nu}\,t_0}{\lambda^2(1 - \mu^{1/\nu})}$.

The proof of Lemma 2 is provided in Appendix B. Lemma 2 establishes that ‖v(t+1) − w*‖ can be upper bounded by a sum of terms that can each be made arbitrarily small for sufficiently large t and N. Since max_{j∈R} ‖w_j(t) − v(t)‖ can also be made arbitrarily small when t is sufficiently large, we can therefore bound ‖w_j(t+1) − w*‖, j ∈ R, using the bounds on ‖v(t+1) − w*‖ and max_{j∈R} ‖w_j(t+1) − v(t+1)‖ to arrive at the following lemma.

Lemma 3. Suppose Assumptions 1, 2, 3, and 4 are satisfied and the training data are i.i.d. Then, fixing any δ ∈ (0, 1) and any $\epsilon > \frac{C_2(N)}{\lambda} > 0$, we can always find a t_1 such that for all t ≥ t_1 and j ∈ R, with probability at least 1 − δ, ‖w_j(t+1) − w*‖ ≤ ε.

The proof of Lemma 3 is provided in Appendix C. Notice that given a sufficiently large N, C_2(N) is arbitrarily small, which means that the iterates of the non-Byzantine nodes can be made arbitrarily close to w*. We are now ready to state the main result concerning the statistical convergence of BRIDGE-T at the nonfaulty nodes to the global statistical risk minimizer w*.

Theorem 2. Suppose Assumptions 1, 2, 3, and 4 are satisfied and the training data are i.i.d. Then the iterates of BRIDGE-T converge sublinearly in t to the minimum of the global statistical risk at each nonfaulty node.
In particular, given any $\epsilon > \frac{\epsilon''}{\lambda} > 0$, ∀j ∈ R, with probability at least 1 − δ and for large enough t,

$$\|w_j(t+1) - w^*\| \leq \epsilon, \qquad (33)$$

where

$$\delta = 2\exp\left(-\frac{4 r N \epsilon''^2}{16 L^2 r d \|\alpha_m\|^2 + \epsilon''^2} + r \log\frac{12 L \sqrt{rd}}{\epsilon''} + d \log\frac{12 L' \beta \sqrt{d}}{\epsilon''}\right)$$

and $\epsilon'' = C_2(N) = O\Big(\sqrt{\frac{d\|\alpha_m\|^2 \log\frac{2}{\delta}}{N}}\Big)$.

The proof of Theorem 2 is provided in Appendix D. Note that when N → ∞ and when ρ(t) is chosen as an O(1/t) sequence, (33) leads to a sublinear convergence rate, as shown in Appendix D. Thus, both the algorithmic and statistical convergence rates derived for BRIDGE-T match the existing Byzantine-resilient rates in the decentralized setting [59]. In terms of the statistical guarantees, recall that when there is no failure in the network, non-resilient gradient descent algorithms such as DGD typically achieve a statistical learning rate of $O\big(\sqrt{1/(MN)}\big)$, while if each node runs a centralized algorithm with its given N samples, the learning rate is $O\big(\sqrt{1/N}\big)$. BRIDGE-T achieves a statistical learning rate of $O\big(\sqrt{\|\alpha_m\|^2/N}\big)$, which lies between the rate of local (centralized, N-sample) learning and that of fault-free DGD. In particular, compared to local learning, BRIDGE-T reduces the sample complexity by a factor of ‖α_m‖² for each node by cooperating over a network, but it cannot approach the fault-free rate. This shows the trade-off between sample complexity and robustness.

2) Statistical optimality for the nonconvex case: A general challenge for optimization methods in the nonconvex setting is the presence of multiple stationary points within the landscape of the loss function. Distributed frameworks in general, and potential Byzantine failures within the network in particular, make this an even more challenging problem. We overcome this challenge by making use of Assumption 3′ (local strong convexity) and aiming for local convergence guarantees.
In terms of specifics, recall the positive constant β from Assumption 3′ that describes the region of local strong convexity around a stationary point w*, let β_1 ≤ β be another positive constant that will be defined shortly, and pick any β′ ∈ (0, β − β_1]. Then our local convergence guarantees are based on the assumption that ∀j ∈ R, the w_j(0)'s are initialized such that v(0) ∈ B(w*, β′), with one such choice of initialization being that the w_j(0)'s at the nonfaulty nodes are initialized within B(w*, β′). In particular, assuming β ≥ Γ for the Γ defined in the beginning of Section IV, we obtain the following lemma, which establishes the boundedness of the iterates w_j(t) for any j ∈ R and all t, characterizes the relationship between β, β′, and β_1, and helps us understand how large β need be as a function of the different parameters.

Lemma 4. Suppose Assumptions 1, 2, 3′, and 4 are satisfied and the training data are i.i.d. Then, with the initialization described above, for β ≥ max{Γ, β_1} and β_1 defined as

$$\beta_1 := \frac{C_2(N)}{\lambda} + \frac{C_3}{t_0} + \frac{C_4}{t_0^2} + C_5,$$

the w_j(t)'s will never escape from B(w*, β) for all j ∈ R and all t. Here, the constants C_2, C_3, and C_4 are as defined in Lemma 2, while the constant

$$C_5 := \sqrt{d}\,r C_w\,\mu^{\frac{1}{\nu}} + \sqrt{d}\,r L\,\frac{1}{1 - \mu^{\frac{1}{\nu}}}\left[\frac{1}{\lambda t_0}\,\mu^{\frac{1}{\nu}} + \frac{1}{\lambda(t_0 + 1)}\right].$$

Lemma 4 is proved in Appendix E. In terms of the implications of this lemma, it first and foremost helps justify the assumption of bounded iterates of the nonfaulty nodes stated in the beginning of Section IV. More importantly, however, notice that the constraint β ≥ Γ and the assumption that ∀j ∈ R and all t, ‖w_j(t) − w*‖ ≤ Γ have the potential to make Assumption 3′ meaningless for large-enough Γ.
But Lemma 4 effectively characterizes the extent of $\Gamma$: choosing $\Gamma$ to be $C_1 + C_2(N)/\lambda + C_3/t_0 + C_4/t_0^2 + C_5$ is sufficient to guarantee that all iterates of the nonfaulty nodes remain within $\mathcal{B}(\mathbf{w}^*, \Gamma) \subseteq \mathcal{B}(\mathbf{w}^*, \beta)$. We conclude with our main result for the nonconvex case, which mirrors that for convex loss functions.

Theorem 3. Suppose Assumptions 1, 2, 3′, and 4 are satisfied and the training data are i.i.d. Then, with the earlier described initialization within $\mathcal{B}(\mathbf{w}^*, \beta')$, the iterates of BRIDGE-T converge sublinearly in $t$ to the stationary point $\mathbf{w}^*$ of the statistical risk at each nonfaulty node. In particular, given any $\epsilon > \epsilon''/\lambda > 0$, $\forall j \in \mathcal{R}$, with probability at least $1-\delta$ and for large enough $t$,
$$\|\mathbf{w}_j(t+1) - \mathbf{w}^*\| \le \epsilon, \qquad (34)$$
where
$$\delta = 2\exp\left(-\frac{4rN\epsilon''^2}{16L^2 rd\|\alpha_m\|^2 + \epsilon''^2} + r\log\frac{12L\sqrt{rd}}{\epsilon''} + d\log\frac{12L'\beta\sqrt{d}}{\epsilon''}\right)$$
and $\epsilon'' = C_2(N) = O\big(\sqrt{d\|\alpha_m\|^2\log(2/\delta)/N}\big)$.

The proof of Theorem 3 is provided in Appendix F. It can be seen from the appendix that the proof in the convex setting maps to the nonconvex one without much additional work. The reason is the new proof technique used in the convex case, in place of the one utilized in [86], which guarantees that BRIDGE-T converges to a local stationary point for the nonconvex functions described in Assumption 3′.

Remark 4. The convergence rates derived for BRIDGE-T are a function of the dimension $d$. Such dimension dependence is typical of many results in (centralized) statistical learning theory [87], [88]. While there is an ongoing effort to obtain dimension-independent rates in statistical learning [88], [89], we leave an investigation of this within the context of Byzantine-resilient decentralized learning for future work.

V. NUMERICAL RESULTS

The numerical experiments are separated into three parts.
In the first part, we run experiments on the MNIST and CIFAR-10 datasets using a linear classifier with squared hinge loss, a case that fully satisfies all our assumptions for the theoretical guarantees in the convex setting. In the second part, we run experiments on the MNIST and CIFAR-10 datasets using a convolutional neural network. By showing that BRIDGE works in this general nonconvex loss function setting, we establish that BRIDGE indeed works for loss functions that satisfy Assumption 3′. In the third part, we run experiments on the MNIST dataset with non-i.i.d. distributions of data across the agents. The purpose of all the experiments is to provide numerical validation of our theoretical results and to demonstrate the usefulness of our Byzantine-resilient technique on a broad scope (convex, nonconvex, non-i.i.d. data) of machine learning problems.

A. Linear classifier on MNIST and CIFAR-10

The first set of experiments is performed to demonstrate two facts: BRIDGE can maintain good performance under Byzantine attacks while classic decentralized learning methods fail; and, compared to an existing Byzantine-resilient method, ByRDiE [59], BRIDGE is more efficient in terms of communications cost. We choose one of the most well-understood machine learning tools, the linear classifier with squared hinge loss, to learn the model for this purpose. Note that showing BRIDGE works in this strictly convex and Lipschitz loss function setting means that BRIDGE also works for strongly convex loss functions with bounded Lipschitz gradients.

Fig. 1: Comparison between DGD and BRIDGE-T, -M, -K, -B in the faultless setting for a convex loss function and the MNIST dataset, where b is set to be b = 1 for BRIDGE.

The MNIST dataset is a set of 60,000 training images and 10,000 test images of handwritten digits from '0' to '9'. Each image is converted to a 784-dimensional vector, and we distribute the 60,000 images equally among 50 nodes.
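For concreteness, the per-node loss used in this first set of experiments can be sketched as follows. This is a minimal illustration only: the regularization weight `lam` (which makes the loss strongly convex) and all variable names are our assumptions, not taken from the paper's experimental setup.

```python
import numpy as np

def squared_hinge_grad(w, X, y, lam=1e-3):
    """Gradient of an L2-regularized squared hinge loss,
    (1/n) * sum_i max(0, 1 - y_i <w, x_i>)^2 + (lam/2) * ||w||^2,
    for a binary +/-1 label vector y; "one vs. all" training runs one
    such binary problem per digit class."""
    margins = 1.0 - y * (X @ w)          # shape (n,)
    active = np.maximum(margins, 0.0)    # only violated margins contribute
    return -(2.0 / len(y)) * (X.T @ (active * y)) + lam * w
```

Each node would evaluate this gradient on its local shard of the training data (1,200 samples per node in the setup above) before the screening step combines iterates across neighbors.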
The CIFAR-10 dataset is a set of 50,000 training images and 10,000 test images from 10 different classes. Each image is converted to a 3072-dimensional vector, and we distribute the 50,000 images equally among 50 nodes. Then, unless stated otherwise, we connect each pair of nodes with probability p = 0.5. Some of the nodes are randomly picked to be Byzantine nodes, which broadcast random vectors to all their neighbors during each iteration. The parameter b for BRIDGE is set to b = 1 in the faultless setting, while it is set equal to |B| in the faulty setting. Once a random network is generated and the Byzantine nodes are randomly placed, we check for each variant of BRIDGE whether the minimum neighborhood-size condition listed in Table II for its execution is satisfied before running that variant. The classifiers are trained using the "one vs. all" strategy. We run five sets of experiments, with the first four on the MNIST dataset and the last one on the CIFAR-10 dataset: (i) classic distributed gradient descent (DGD) and BRIDGE-T, -M, -K, -B with no Byzantine nodes; (ii) classic DGD and BRIDGE-T, -M, -K, -B with 2 and 4 Byzantine nodes; (iii) BRIDGE-T, -M, and -K with 6, 12, 18, and 24 Byzantine nodes and varying probabilities of connection (p = 0.5, 0.75, and 1); (iv) ByRDiE and BRIDGE-T with 2 Byzantine nodes; and (v) BRIDGE-T, -M, and -K with 0, 2, 4, and 6 Byzantine nodes. The performance is evaluated by two metrics: classification accuracy on the 10,000 test images and whether consensus is achieved. When comparing ByRDiE and BRIDGE, we compare accuracy with respect to the number of communication iterations, defined as the number of scalar-valued pieces of information exchanged among the neighboring nodes.
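To make the screening step concrete, here is a minimal sketch of coordinate-wise trimmed-mean screening in the spirit of BRIDGE-T. The function name, array layout, and the exact way the node's own iterate enters the average are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def bridge_t_screen(own, received, b):
    """Coordinate-wise trimmed-mean screening (a sketch of the BRIDGE-T idea):
    at every coordinate, discard the b largest and b smallest received values,
    then average the survivors together with the node's own iterate.

    own:      (d,) this node's current iterate
    received: (n, d) iterates received from its n neighbors
    b:        number of values screened out on each side, per coordinate
    """
    n = received.shape[0]
    assert n > 2 * b, "a node needs more than 2b neighbors to screen"
    srt = np.sort(received, axis=0)   # each coordinate sorted independently
    kept = srt[b:n - b, :]            # drop b extremes on both sides
    return np.vstack([kept, own[None, :]]).mean(axis=0)
```

With b at least the number of Byzantine neighbors, every received value that survives the per-coordinate trimming is bracketed by values from nonfaulty nodes, which is the intuition behind the minimum neighborhood-size condition in Table II.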
As we can see from Figure 1, despite using b = 1, all BRIDGE methods except BRIDGE-B perform as well as DGD in the faultless setting, with ∼88% average accuracy, while BRIDGE-B performs slightly worse at 83% average accuracy (final accuracy: DGD = 87.8%, BRIDGE-T = 87.6%, -M = 87.3%, -K = 86.9%, and -B = 83.1%). Note that these accuracy figures match the state-of-the-art results for the MNIST dataset using a linear classifier [69]. We attribute the superiority of BRIDGE-T over -M, -K, and -B to its ability to retain information from a wider set of neighbors after the screening in each iteration.

Fig. 2: Comparison between DGD and BRIDGE-T, -M, -K, -B with two and four Byzantine nodes for a convex loss function with the MNIST dataset.

Next, we conclude from Figure 2 that DGD fails in the faulty setting when |B| = 2 and produces an even worse accuracy when |B| = 4. However, BRIDGE-T, -M, -K, and -B with b = |B| are able to learn relatively good models in these faulty settings. The next set of results, in Figure 3, highlights the robustness of the BRIDGE framework to a larger number of Byzantine nodes for varying levels of network connectivity, ranging from p = 0.5 to p = 1. We exclude BRIDGE-B in this figure since it fails to run for a majority of the randomly generated networks because of its stringent minimum neighborhood-size condition (cf. Table II). The results in this figure reaffirm our findings from Figure 2 that the BRIDGE framework is extremely resilient to Byzantine attacks on the network. In particular, we see that BRIDGE-T (when it satisfies the minimum neighborhood-size condition) and -M perform very similarly in the face of a large number of Byzantine nodes in the network, while BRIDGE-K is a close third in performance.

Fig. 3: Comparison between BRIDGE-T, -M, and -K for different numbers of Byzantine nodes and varying levels of network connectivity (convex loss and MNIST dataset).

Another observation from this figure is the robustness of BRIDGE-M for loosely connected Erdős–Rényi networks and a large number of Byzantine attacks; indeed, BRIDGE-M is the only variant that can be run in the case of b = 24 and p = 0.5, since it always satisfies the minimum neighborhood-size condition even when close to 50% of the nodes in the network are Byzantine. In contrast, BRIDGE-T could not be run when b ≥ 18 (resp., b = 24) and p = 0.5 (resp., p = 0.75), while BRIDGE-K could not be run when b = 24 and p = 0.5.

Fig. 4: Comparing BRIDGE-T with ByRDiE in the presence of two Byzantine nodes (convex loss and MNIST dataset).

We next compare BRIDGE-T and ByRDiE in Figure 4. Both BRIDGE-T and ByRDiE are resilient to two Byzantine nodes, but since ByRDiE is based on a coordinate-wise screening method, the time it takes to reach the final optimal solution is thousands of times longer than for BRIDGE-T. Indeed, since nodes within the BRIDGE framework compute the local gradients and communicate with their neighbors only once per iteration, BRIDGE yields massive savings in computation and communications costs in comparison with ByRDiE. This difference in computation and communications costs is even more pronounced in higher-dimensional tasks; we therefore do not compare against ByRDiE in the rest of our experiments.

Fig. 5: Performance of BRIDGE-T, -M, and -K with zero, two, four, and six Byzantine nodes for a convex loss function with the CIFAR-10 dataset.

Last but not least, Figure 5 highlights the robustness of the BRIDGE framework on the higher-dimensional CIFAR-10 dataset for the case of a linear classifier. It can be seen from this figure that the earlier conclusions concerning the different variants of BRIDGE still hold, with BRIDGE-T approaching the state-of-the-art accuracy of ∼39% [90] for a linear classifier. We conclude by noting that BRIDGE-B is excluded here, and in the following CIFAR-10 experiments, because of its relatively high computational complexity on higher-dimensional data.

B. Convolutional neural network on MNIST and CIFAR-10

In the theoretical analysis, we gave local convergence guarantees for BRIDGE-T in the nonconvex setting. In this set of experiments, we numerically show that BRIDGE indeed performs well in the nonconvex case. We train a convolutional neural network (CNN) on the MNIST and CIFAR-10 datasets for this purpose, with the model comprising two convolutional layers followed by two fully connected layers. Each convolutional layer is followed by max pooling and a ReLU activation, while the output layer uses a softmax activation. We construct a network with 50 nodes in which each pair of nodes has probability 0.5 of being directly connected. Each node has access to 1,200 samples randomly picked from the training set. We randomly choose two or four of the nodes to be Byzantine nodes, which broadcast random vectors to all their neighbors during each iteration for all screening methods. We again use the classification accuracy on the MNIST and CIFAR-10 test sets, averaged over all nonfaulty nodes, as the performance metric. In this experiment, we cannot compare with ByRDiE because the learning model is high dimensional, which makes ByRDiE infeasible in this setting.

Fig. 6: Comparison between DGD and BRIDGE-T, -M, -K, and -B in the faultless setting for a nonconvex loss function with the MNIST dataset.

As we can see from Figure 6, all BRIDGE methods perform as well as DGD in the faultless setting, with 92% to 95% average accuracy. In Figure 7, we see that DGD fails in the faulty setting when b = 2 and b = 4, but BRIDGE-T, -M, -K, and -B are able to learn relatively good models in these cases.
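The paper does not specify kernel sizes for this CNN, so as an illustration of how the two conv/pool stages determine the input size of the first fully connected layer, the following assumes 5×5 kernels and 2×2 max pooling on 28×28 MNIST images (both assumptions are ours, purely for illustration):

```python
def conv_out(n, k, stride=1, pad=0):
    """Spatial size after a convolution layer (integer floor)."""
    return (n + 2 * pad - k) // stride + 1

def pool_out(n, k=2, stride=2):
    """Spatial size after a max-pooling layer."""
    return (n - k) // stride + 1

n = 28                          # MNIST images are 28x28
n = pool_out(conv_out(n, 5))    # conv1 (assumed 5x5) + 2x2 max pool
n = pool_out(conv_out(n, 5))    # conv2 (assumed 5x5) + 2x2 max pool
print(n)  # 4: the first fully connected layer sees 4 * 4 * (conv2 channels) inputs
```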
The final set of results for the CIFAR-10 dataset, obtained using BRIDGE-T, -M, and -K for the case of zero, two, and four Byzantine nodes, is presented in Figure 8. The top-left quadrant of this figure is reserved for the accuracy of the centralized solution for the chosen CNN architecture.² It can be seen that both BRIDGE-T and -M remain resilient to Byzantine attacks and achieve accuracy similar to the centralized solution. However, BRIDGE-K gets stuck in a suboptimal critical point of the loss landscape in the case of two and four Byzantine nodes.

² Note that a better CIFAR-10 accuracy can be obtained through fine tuning of the CNN architecture and the step-size sequence.

Fig. 7: Comparison between DGD and BRIDGE-T, -M, -K, and -B with two and four Byzantine nodes for a nonconvex loss with the MNIST dataset.

Fig. 8: Performance comparison of BRIDGE-T, -M, and -K with zero, two, and four Byzantine nodes for a nonconvex loss with the CIFAR-10 dataset.

C. Non-i.i.d. data distribution on MNIST

In the theoretical analysis section, we gave convergence guarantees for BRIDGE-T in both convex and nonconvex settings. However, the main results are based on an independent and identically distributed (i.i.d.) dataset. In this section, we compare our method to the one proposed in [61], which we term "Byzantine-robust decentralized stochastic optimization" (BRDSO) in the following discussion, based on the terminology used in [61]. We compare BRIDGE-T with BRDSO [61] in the following non-i.i.d. settings.

Extreme non-i.i.d. setting: We group the dataset by label and distribute all the samples labelled "0" to 5 agents, all the samples labelled "1" to another 5 agents, and so on. We can see from Figure 9 that when the number of Byzantine nodes is 0 or 2, the accuracies of both algorithms are as good as in the i.i.d. case, while when the number of Byzantine nodes is 4, there is an accuracy drop of about 9 percentage points for both algorithms due to the non-i.i.d. distribution of data. In this extreme non-i.i.d. setting with four Byzantine agents, the worst-case scenario occurs when all the Byzantine nodes are assigned samples of the same label: 80 percent of the samples of one label then do not contribute to the training process, which causes both algorithms to underperform relative to the i.i.d. setting.

Fig. 9: Comparing BRIDGE-T with BRDSO [61] in the extreme non-i.i.d. setting (convex loss and MNIST dataset).

Moderate non-i.i.d. setting: We group the dataset by label and distribute the samples associated with each label evenly to 10 agents, so that every agent receives two sets of differently labelled data. As we can see from Figure 10, both algorithms perform as well as in the i.i.d. setting in the presence of two or four Byzantine nodes. We conclude that, for distributions closer to i.i.d., the impact of Byzantine nodes in the non-i.i.d. setting is smaller.

Fig. 10: Comparing BRIDGE-T with BRDSO [61] in the moderate non-i.i.d. setting (convex loss and MNIST dataset).

VI. CONCLUSION

This paper introduced a new decentralized machine learning framework called Byzantine-resilient decentralized gradient descent (BRIDGE). The framework is designed to solve machine learning problems when the training set is distributed over a decentralized network in the presence of Byzantine failures. Both theoretical and experimental results were used to show that the framework performs well while tolerating Byzantine attacks. One variant of the framework was shown to converge sublinearly to the global minimum in the convex setting and to a first-order stationary point in the nonconvex setting. In addition, statistical convergence rates were also provided for this variant. Future work aims to improve the framework to tolerate more Byzantine agents within the network with faster convergence rates, and to deal with more general nonconvex objective functions under either i.i.d. or non-i.i.d. distributions of the dataset. In addition, an exploration of the number of Byzantine nodes that the framework can tolerate under different kinds of non-i.i.d. settings will also be undertaken in future work.

APPENDIX

A. Proof of Lemma 1

First, we drop $P$ and $\mathbf{z}$ from the notation of the statistical risk for convenience and observe that for any dimension $k$ we have
$$\mathbb{E}[g_2(t)]_k = \mathbb{E}\Big[\sum_{j=1}^{r} [\alpha_k(t)]_j [\nabla f_j(\mathbf{v}(t))]_k\Big] = \mathbb{E}[\nabla f(\mathbf{v}(t))]_k.$$
Since $k$ is arbitrary, it then follows that
$$\mathbb{E}[g_2(t)] = \mathbb{E}[\nabla f(\mathbf{v}(t))]. \qquad (35)$$
In the definition of $g_2(t)$, notice that $\mathbf{v}(t)$ depends on $t$ and $\alpha_k(t)$ depends on both $t$ and $k$. We therefore need to show that the convergence of $g_2(t)$ to $\mathbb{E}[\nabla f(\mathbf{v}(t))]$ is uniform over all $\mathbf{v}(t)$ and $\alpha_k(t)$. We fix an arbitrary coordinate $k$ and drop the index $k$ for the rest of this section for simplicity. We next define a vector $h(t)$ as
$$h(t) := [[\nabla f_j(\mathbf{v}(t))] : j \in \mathcal{R}]$$
and note that $g_2(t) = \langle \alpha(t), h(t) \rangle$. Since the training data are i.i.d., $h(t)$ has identically distributed elements. We therefore have from Hoeffding's inequality [91] that for any $\epsilon_0 \in (0,1)$:
$$P\big(|\langle \alpha(t), h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \ge \epsilon_0\big) \le 2\exp\Big(-\frac{2N\epsilon_0^2}{L^2\|\alpha(t)\|^2}\Big). \qquad (36)$$
Further, since the $r$-dimensional vector $\alpha(t)$ is an arbitrary element of the standard simplex, defined as
$$\Delta := \Big\{q \in \mathbb{R}^r : \sum_{j=1}^r [q]_j = 1 \ \text{and} \ \forall j,\ [q]_j \ge 0\Big\}, \qquad (37)$$
the probability bound in (36) also holds for any $q \in \Delta$, i.e.,
$$P\big(|\langle q, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \ge \epsilon_0\big) \le 2\exp\Big(-\frac{2N\epsilon_0^2}{L^2\|q\|^2}\Big). \qquad (38)$$
We now define the set $S_\alpha := \{\alpha_k(t)\}_{t,k=1}^{\infty,d}$.
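As a quick numerical illustration of the concentration behind (36), the following sketch replaces the bounded gradients by i.i.d. draws in $[-L, L]$ with mean zero and assumes the uniform weight vector on the simplex (both are our simplifications for illustration):

```python
import random

def deviation(N, r=5, L=1.0, seed=1):
    """Empirical |<alpha, h> - expectation| when each of r regular nodes
    averages N i.i.d. samples bounded in [-L, L] with true mean 0, and alpha
    is the uniform simplex vector; the deviation shrinks at the O(L/sqrt(N))
    scale that the Hoeffding bound predicts."""
    rng = random.Random(seed)
    node_means = [sum(rng.uniform(-L, L) for _ in range(N)) / N
                  for _ in range(r)]
    return abs(sum(node_means) / r)   # <alpha, h> with alpha = (1/r, ..., 1/r)

print(deviation(100), deviation(10_000))
```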
Our next goal is to leverage (38) and derive a probability bound similar to (36) that holds uniformly for all $q \in S_\alpha$. To this end, let
$$C_\xi := \{c_1, \ldots, c_{d_\xi}\} \subset \Delta \quad \text{s.t.} \quad S_\alpha \subseteq \bigcup_{q=1}^{d_\xi} \mathcal{B}(c_q, \xi) \qquad (39)$$
denote a $\xi$-covering of $S_\alpha$ in terms of the $\ell_2$ norm, and define $\bar{c} := \arg\max_{c \in C_\xi} \|c\|$. Then, from (38) and the union bound,
$$P\Big(\sup_{c \in C_\xi} |\langle c, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \ge \epsilon_0\Big) \le 2 d_\xi \exp\Big(-\frac{2N\epsilon_0^2}{L^2\|\bar{c}\|^2}\Big). \qquad (40)$$
In addition, we have
$$\sup_{q \in S_\alpha} |\langle q, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \overset{(a)}{\le} \sup_{q \in S_\alpha, c \in C_\xi} \|q - c\|\,\|h(t)\| + \sup_{c \in C_\xi} |\langle c, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]|, \qquad (41)$$
where (a) is due to the triangle and Cauchy–Schwarz inequalities. Trivially, $\sup_{q \in S_\alpha, c \in C_\xi} \|q - c\| \le \xi$ from the definition of $C_\xi$, while $\|h(t)\| \le \sqrt{r}\,L$ from the definition of $h(t)$ and Assumption 1. Combining (40) and (41), we get
$$P\Big(\sup_{q \in S_\alpha} |\langle q, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \ge \epsilon_0 + \sqrt{r}\,\xi L\Big) \le 2 d_\xi \exp\Big(-\frac{2N\epsilon_0^2}{L^2\|\bar{c}\|^2}\Big). \qquad (42)$$
We now define $\alpha_m := \arg\max_{q \in S_\alpha} \|q\|$. It can then be shown from the definitions of $C_\xi$ and $\bar{c}$ that
$$\|\bar{c}\|^2 \le 2(\|\alpha_m\|^2 + \xi^2). \qquad (43)$$
Therefore, fixing $\epsilon_0 \in (0,1)$ and defining $\epsilon' := 2\epsilon_0$ and $\xi := \epsilon'/(2L\sqrt{r})$, we have from (42) and (43) that
$$P\Big(\sup_{q \in S_\alpha} |\langle q, h(t)\rangle - \mathbb{E}[\nabla f(\mathbf{v}(t))]| \ge \epsilon'\Big) \le 2 d_\xi \exp\Big(-\frac{4rN\epsilon'^2}{4L^2 r\|\alpha_m\|^2 + \epsilon'^2}\Big). \qquad (44)$$
Note that (44) is derived for a fixed but arbitrary $k$. Extending this to the entire vector gives us, for any $\mathbf{v}(t)$,
$$P\big(\|g_2(t) - \mathbb{E}[\nabla f(\mathbf{v}(t))]\| \ge \sqrt{d}\,\epsilon'\big) \le 2 d_\xi \exp\Big(-\frac{4rN\epsilon'^2}{4L^2 r\|\alpha_m\|^2 + \epsilon'^2}\Big). \qquad (45)$$
To obtain the desired uniform bound, we next need to remove the dependence on $\mathbf{v}(t)$ in (45). Here we drop $t$ from $\mathbf{v}(t)$ for notational simplicity and write $g_2(t)$ as $g_2(\mathbf{v})$ to make the dependence of $g_2$ on $\mathbf{v}$ explicit. Notice from our discussion in the beginning of Section IV and the analysis in Section IV-A that $\mathbf{v}(t) \in V := \{\mathbf{v} : \|\mathbf{v}\| \le \Gamma'\}$ for some $\Gamma'$ and all $t$.
We then define $E_\zeta := \{e_1, \ldots, e_{m_\zeta}\} \subset V$ to be a $\zeta$-covering of $V$ in terms of the $\ell_2$ norm. It then follows from (45) that
$$P\Big(\sup_{e \in E_\zeta} \|g_2(e) - \mathbb{E}[\nabla f(e)]\| \ge \sqrt{d}\,\epsilon'\Big) \le 2 d_\xi m_\zeta \exp\Big(-\frac{4rN\epsilon'^2}{4L^2 r\|\alpha_m\|^2 + \epsilon'^2}\Big). \qquad (46)$$
Similar to (41), we can also write
$$\sup_{\mathbf{v} \in V} \|g_2(\mathbf{v}) - \mathbb{E}[\nabla f(\mathbf{v})]\| \le \sup_{e \in E_\zeta} \|g_2(e) - \mathbb{E}[\nabla f(e)]\| + \sup_{e \in E_\zeta, \mathbf{v} \in V} \big(\|g_2(\mathbf{v}) - g_2(e)\| + \|\mathbb{E}[\nabla f(e)] - \mathbb{E}[\nabla f(\mathbf{v})]\|\big). \qquad (47)$$
Further, Assumption 1 and the definition of the set $E_\zeta$ imply
$$\sup_{e \in E_\zeta, \mathbf{v} \in V} \|g_2(\mathbf{v}) - g_2(e)\| \le L'\zeta. \qquad (48)$$
We now define $\epsilon'' := 2\epsilon'\sqrt{d}$ and $\zeta := \epsilon''/(4L')$. We then obtain the following from (45)–(48):
$$P\Big(\sup_{\mathbf{v} \in V} \|g_2(\mathbf{v}) - \mathbb{E}[\nabla f(\mathbf{v})]\| \ge \epsilon''\Big) \le 2 d_\xi m_\zeta \exp\Big(-\frac{4rN\epsilon''^2}{16L^2 rd\|\alpha_m\|^2 + \epsilon''^2}\Big). \qquad (49)$$
Since $\mathbf{v}(t) \in V$ for all $t$, we then have
$$P\Big(\sup_t \|g_2(\mathbf{v}(t)) - \mathbb{E}[\nabla f(\mathbf{v}(t))]\| \ge \epsilon''\Big) \le 2 d_\xi m_\zeta \exp\Big(-\frac{4rN\epsilon''^2}{16L^2 rd\|\alpha_m\|^2 + \epsilon''^2}\Big). \qquad (50)$$
The proof now follows from (50) and the following facts about the covering numbers of the sets $S_\alpha$ and $V$: (1) since $S_\alpha$ is a subset of $\Delta$, which can be circumscribed by a sphere in $\mathbb{R}^{r-1}$ of radius $\sqrt{(r-1)/r} < 1$, we can upper bound $d_\xi$ by $\big(\frac{12L\sqrt{rd}}{\epsilon''}\big)^r$ [92]; and (2) since $V \subset \mathbb{R}^d$ can be circumscribed by a sphere in $\mathbb{R}^d$ of radius $\Gamma'\sqrt{d}$, we can upper bound $m_\zeta$ by $\big(\frac{12L'\Gamma'\sqrt{d}}{\epsilon''}\big)^d$. Then, for any $\epsilon'' \in (0,1)$, we have
$$\sup_t \|g_2(\mathbf{v}(t)) - \mathbb{E}[\nabla f(\mathbf{v}(t))]\| < \epsilon'' \qquad (51)$$
with probability exceeding
$$1 - 2\exp\Big(-\frac{4rN\epsilon''^2}{16L^2 rd\|\alpha_m\|^2 + \epsilon''^2} + r\log\frac{12L\sqrt{rd}}{\epsilon''} + d\log\frac{12L'\Gamma'\sqrt{d}}{\epsilon''}\Big). \qquad (52)$$
Equivalently, we have with probability at least $1-\delta$ that
$$\sup_t \|g_2(\mathbf{v}(t)) - \mathbb{E}[\nabla f(\mathbf{v}(t))]\| < O\Big(\sqrt{\frac{4L^2 d\|\alpha_m\|^2\log(2/\delta)}{N}}\Big),$$
where $\delta = 2\exp\big(-\frac{4rN\epsilon''^2}{16L^2 rd\|\alpha_m\|^2 + \epsilon''^2} + r\log\frac{12L\sqrt{rd}}{\epsilon''} + d\log\frac{12L'\Gamma'\sqrt{d}}{\epsilon''}\big)$.

B. Proof of Lemma 2

All statements in this proof are probabilistic, holding with probability at least $1-\delta$, with $\delta$ as given in Appendix A.
In particular, the discussion should be understood as implicitly conditioned on this high-probability event. From (31), applying the same bound recursively down to $\mathbf{v}(0)$, we get
$$\|\mathbf{v}(t+1) - \mathbf{w}^*\| \le \Big(1 - \frac{1}{t+t_0}\Big)\|\mathbf{v}(t) - \mathbf{w}^*\| + a_2(t) + a_3(t) \le \cdots \le \prod_{\tau=0}^{t}\Big(1 - \frac{1}{\tau+t_0}\Big)\|\mathbf{v}(0) - \mathbf{w}^*\| + a_2(t) + a_3(t) + \sum_{\tau=0}^{t-1}\big[a_2(\tau) + a_3(\tau)\big]\prod_{z=\tau+1}^{t}\Big(1 - \frac{1}{z+t_0}\Big). \qquad (53)$$
The first term in (53) can be simplified as
$$\prod_{\tau=0}^{t}\Big(1 - \frac{1}{\tau+t_0}\Big)\|\mathbf{v}(0) - \mathbf{w}^*\| = \Big(1 - \frac{1}{t_0}\Big)\Big(1 - \frac{1}{1+t_0}\Big)\cdots\Big(1 - \frac{1}{t+t_0}\Big)\|\mathbf{v}(0) - \mathbf{w}^*\| = \frac{t_0-1}{t+t_0}\,C_1, \qquad (54)$$
where $C_1 = \|\mathbf{v}(0) - \mathbf{w}^*\|$. The remaining terms in (53) can be simplified by substituting $a_2(\tau) = C_2(N)\rho(\tau)$ and $a_3(\tau) = \frac{L'}{\lambda(\tau+t_0)}a_4(\tau)$ and telescoping each product $\prod_{z=\tau+1}^{t}\big(1-\frac{1}{z+t_0}\big)$ to $\frac{\tau+t_0}{t+t_0}$, which yields
$$a_2(t) + a_3(t) + \sum_{\tau=0}^{t-1}\big[a_2(\tau) + a_3(\tau)\big]\prod_{z=\tau+1}^{t}\Big(1 - \frac{1}{z+t_0}\Big) \le \frac{t}{\lambda(t+t_0)}C_2(N) + \frac{L'}{\lambda}\,\frac{1}{t_0+t}\big[a_4(0) + \cdots + a_4(t)\big], \qquad (55)$$
where $C_2(N) = O\big(\sqrt{d\|\alpha_m\|^2\log(2/\delta)/N}\big)$.
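The telescoping product used in (54) can be sanity-checked numerically; the values of t0 and t below are arbitrary illustrative choices:

```python
from math import prod

t0, t = 5, 40
# prod_{tau=0}^{t} (1 - 1/(tau + t0)) telescopes to (t0 - 1)/(t + t0)
lhs = prod(1.0 - 1.0 / (tau + t0) for tau in range(t + 1))
rhs = (t0 - 1) / (t + t0)
print(abs(lhs - rhs) < 1e-12)  # True
```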
The term $a_4(0) + \cdots + a_4(t)$ in (55) can be further simplified using (22): grouping the geometric factors $\mu_\nu^\tau$ and the step sizes $\rho(\tau) = \frac{1}{\lambda(\tau+t_0)}$, we obtain
$$a_4(0) + \cdots + a_4(t) \le \sqrt{dr}\,C_w\,\frac{1-\mu_\nu^t}{1-\mu_\nu} + \frac{\sqrt{dr}\,L}{\lambda}\,\frac{\mu_\nu(1-\mu_\nu^{t+1})}{1-\mu_\nu}\Big(\frac{1}{t_0} + \frac{1}{t_0+1} + \cdots + \frac{1}{t_0+t}\Big). \qquad (56)$$
Plugging (56) into (55), we finish the simplification of the remaining terms as
$$a_2(t) + a_3(t) + \sum_{\tau=0}^{t-1}\big[a_2(\tau) + a_3(\tau)\big]\prod_{z=\tau+1}^{t}\Big(1 - \frac{1}{z+t_0}\Big) \le \frac{C_2(N)}{\lambda} + \frac{L'}{\lambda(t_0+t)}\,\frac{\sqrt{dr}\,C_w}{1-\mu_\nu} + \frac{\sqrt{dr}\,L L'\,\mu_\nu}{\lambda^2(1-\mu_\nu)}\,\frac{1}{t+t_0}\Big(\frac{1}{t_0} + \frac{1}{1+t_0} + \cdots + \frac{1}{t+t_0}\Big). \qquad (57)$$
Finally, we express $\|\mathbf{v}(t+1) - \mathbf{w}^*\|$ using (54) and (57) as
$$\|\mathbf{v}(t+1) - \mathbf{w}^*\| \le \frac{t_0}{t+t_0}C_1 + \frac{C_2(N)}{\lambda} + \frac{C_3}{t+t_0} + \frac{C_4}{t+t_0}\Big(\frac{1}{t_0} + \frac{1}{1+t_0} + \cdots + \frac{1}{t+t_0}\Big), \qquad (58)$$
where $C_3 = \frac{\sqrt{dr}\,C_w L'}{\lambda(1-\mu_\nu)}$ and $C_4 = \frac{\sqrt{dr}\,L L'\,\mu_\nu}{\lambda^2(1-\mu_\nu)}$.

C. Proof of Lemma 3

Similar to the proof of Lemma 2, the statements in this appendix are also conditioned on the high-probability event described in Appendix A. From Theorem 1 and (58) in the proof of Lemma 2, we conclude for all $j \in \mathcal{R}$ that
$$\|\mathbf{w}_j(t+1) - \mathbf{w}^*\| \le \frac{t_0}{t+t_0}C_1 + \frac{C_2(N)}{\lambda} + \frac{C_3}{t+t_0} + \frac{C_4}{t+t_0}\Big(\frac{1}{t_0} + \frac{1}{1+t_0} + \cdots + \frac{1}{t+t_0}\Big) + a_4(t+1). \qquad (59)$$
Next, we upper bound $a_4(t+1)$, first when $t$ is an even number:
$$\begin{aligned} a_4(t+1) &\le \sqrt{dr}\,C_w\mu_\nu^{t+1} + \sqrt{dr}\,L\big[\rho(0)\mu_\nu^{t+1} + \rho(1)\mu_\nu^{t} + \cdots + \rho(t-1)\mu_\nu^{2} + \rho(t)\mu_\nu + \rho(t+1)\big] \\ &\le \sqrt{dr}\,C_w\mu_\nu^{t+1} + \sqrt{dr}\,L\big[\rho(0)\mu_\nu^{t+1} + \cdots + \rho(0)\mu_\nu^{\frac{t}{2}+1}\big] + \sqrt{dr}\,L\big[\rho\big(\tfrac{t}{2}+1\big)\mu_\nu^{\frac{t}{2}} + \cdots + \rho\big(\tfrac{t}{2}+1\big)\mu_\nu + \rho\big(\tfrac{t}{2}+1\big)\big] \\ &= \sqrt{dr}\,C_w\mu_\nu^{t+1} + \sqrt{dr}\,L\rho(0)\,\mu_\nu^{\frac{t+2}{2}}\,\frac{1-\mu_\nu^{t/2}}{1-\mu_\nu} + \sqrt{dr}\,L\rho\big(\tfrac{t}{2}+1\big)\frac{1-\mu_\nu^{t/2}}{1-\mu_\nu} \\ &\le \sqrt{dr}\,C_w\mu_\nu^{t+1} + \frac{\sqrt{dr}\,L}{1-\mu_\nu}\Big(\rho(0)\mu_\nu^{\frac{t+2}{2}} + \rho\big(\tfrac{t}{2}+1\big)\Big). \end{aligned} \qquad (60)$$
When $t$ is an odd number, we have
$$\sqrt{dr}\,L\big[\rho(0)\mu_\nu^{t+1} + \cdots + \rho(t+1)\big] \le \sqrt{dr}\,L\big[\rho(0)\mu_\nu^{t+1} + \cdots + \rho(0)\mu_\nu^{\frac{t+1}{2}}\big] + \sqrt{dr}\,L\big[\rho\big(\tfrac{t+1}{2}\big)\mu_\nu^{\frac{t-1}{2}} + \cdots + \rho\big(\tfrac{t+1}{2}\big)\mu_\nu + \rho\big(\tfrac{t+1}{2}\big)\big], \qquad (61)$$
and the remaining steps are similar to (60), so we omit them. Plugging either (60) or (61) into (59), we have for all $j \in \mathcal{R}$:
$$\|\mathbf{w}_j(t+1) - \mathbf{w}^*\| \le \frac{t_0}{t+t_0}C_1 + \frac{C_2(N)}{\lambda} + \frac{C_3}{t+t_0} + \frac{C_4}{t+t_0}\Big(\frac{1}{t_0} + \frac{1}{1+t_0} + \cdots + \frac{1}{t+t_0}\Big) + \sqrt{dr}\,C_w\mu_\nu^{t+1} + \frac{\sqrt{dr}\,L}{1-\mu_\nu}\Big(\rho(0)\mu_\nu^{\frac{t+2}{2}} + \rho\big(\tfrac{t}{2}+1\big)\Big). \qquad (62)$$
From (62) we see that all the terms except $\frac{C_2(N)}{\lambda}$ are monotonically decreasing in $t$. Thus, given any $\epsilon > \frac{C_2(N)}{\lambda} > 0$, we can find a $t_1$ such that for all $t \ge t_1$, with probability at least $1-\delta$, $\|\mathbf{w}_j(t+1) - \mathbf{w}^*\| \le \epsilon$.

D. Proof of Theorem 2

Lemma 3 establishes the convergence of $\mathbf{w}_j(t+1)$ to $\mathbf{w}^*$ for all $j \in \mathcal{R}$ with probability at least $1-\delta$. Next, we derive the rate of convergence. From (59), since $a_4(t)$ converges at a rate of $O(1/t)$, as mentioned in Theorem 1, the slowest-converging term in the bound on $\|\mathbf{w}_j(t+1) - \mathbf{w}^*\|$ is $\frac{C_4}{t+t_0}\big(\frac{1}{t_0} + \frac{1}{1+t_0} + \cdots + \frac{1}{t+t_0}\big)$. Therefore, using the harmonic-series approximation, we conclude that the convergence rate is $O\big(\frac{\log t}{t}\big)$.
The above shows that BRIDGE-T converges to the minimum of the global statistical risk at a sublinear rate, which completes the proof of Theorem 2.

E. Proof of Lemma 4

Under the stated assumptions of the lemma, as well as the initialization for the nonconvex loss function, it is straightforward to see that (62) also holds in the nonconvex setting. In particular, the upper bound in (62) on $\|\mathbf{w}_j(t+1) - \mathbf{w}^*\|$ for all $j \in \mathcal{R}$ monotonically decreases with $t$. Thus, $\|\mathbf{w}_j(t+1) - \mathbf{w}^*\|$ for all $j \in \mathcal{R}$ and all $t$ can be upper bounded by the bound on $\|\mathbf{w}_j(1) - \mathbf{w}^*\|$, which is $C_1 + C_2(N)/\lambda + C_3/t_0 + C_4/t_0^2 + C_5$. The proof now follows from the fact that $C_1 := \|\mathbf{v}(0) - \mathbf{w}^*\| \le \beta'$ by virtue of the initialization.

F. Proof of Theorem 3

The local strong convexity of the loss function implies that $f(\mathbf{w}, \mathbf{z})$ can be treated as $\lambda$-strongly convex when restricted to the ball $\mathcal{B}(\mathbf{w}^*, \beta)$. Therefore, Assumption 3′, the $\Gamma$-boundedness of the iterates stated in the beginning of Section IV and in Lemma 4, and the constraint $\beta \ge \Gamma$ imply that the proof of this theorem is a straightforward replication of the proof of Theorem 2 for the convex case.

REFERENCES

[1] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. New York, NY: Springer-Verlag, 1999.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[3] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, "Supervised machine learning: A review of classification techniques," Emerging Artificial Intell. Applicat. Comput. Eng., vol. 160, pp. 3–24, 2007.
[4] Y. Bengio, "Learning deep architectures for AI," Found. and Trends Mach. Learning, vol. 2, no. 1, pp. 1–127, 2009.
[5] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed. Cambridge, MA: MIT Press, 2018.
[6] R. M. Golden, Statistical Machine Learning: A Unified Framework. Boca Raton, FL: Chapman and Hall/CRC, 2020.
[7] Z. Yang, A. Gang, and W. U. Bajwa, "Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 146–159, May 2020.
[8] M. Nokleby, H. Raja, and W. U. Bajwa, "Scaling-up distributed processing of data streams for machine learning," Proceedings of the IEEE, vol. 108, no. 11, pp. 1984–2012, 2020.
[9] J. B. Predd, S. B. Kulkarni, and H. V. Poor, "Distributed learning in wireless sensor networks," IEEE Signal Process. Mag., vol. 23, no. 4, pp. 56–69, 2006.
[10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. and Trends Mach. Learning, vol. 3, no. 1, pp. 1–122, 2011.
[11] A. H. Sayed, "Adaptation, learning, and optimization over networks," Found. and Trends Mach. Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[12] A. Nedić, A. Olshevsky, and M. G. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
[13] T. Sun, D. Li, and B. Wang, "Decentralized federated averaging," arXiv preprint arXiv:2104.11375, 2021.
[14] K. Driscoll, B. Hall, H. Sivencrona, and P. Zumsteq, "Byzantine fault tolerance, from theory to reality," in Proc. Int. Conf. Computer Safety, Reliability, and Security (SAFECOMP'03), 2003, pp. 235–248.
[15] L. Su and N. H. Vaidya, "Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms," in Proc. ACM Symp. Principles of Distributed Computing, 2016, pp. 425–434.
[16] L. Lamport, R. Shostak, and M. Pease, "The Byzantine generals problem," ACM Trans. Programming Languages and Syst., vol. 4, no. 3, pp. 382–401, 1982.
[17] P. Dutta, R. Guerraoui, and M. Vukolic, "Best-case complexity of asynchronous Byzantine consensus," EPFL/IC/200499, Tech. Rep., 2005.
[18] J. Sousa and A. Bessani, "From Byzantine consensus to BFT state machine replication: A latency-optimal transformation," in Proc. 9th Euro. Dependable Computing Conf. (EDCC'12), 2012, pp. 37–48.
[19] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Symp. Operating Systems Design and Implementation (OSDI'14), Broomfield, CO, Oct. 2014, pp. 583–598.
[20] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in Proc. NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[21] P. Blanchard, R. Guerraoui, and J. Stainer, "Machine learning with adversaries: Byzantine tolerant gradient descent," in Proc. Advances in Neural Inf. Process. Syst., 2017, pp. 118–128.
[22] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, "DRACO: Byzantine-resilient distributed training via redundant gradients," in Proc. 35th Intl. Conf. Machine Learning (ICML), 2018, pp. 903–912.
[23] X. Cao and L. Lai, "Robust distributed gradient descent with arbitrary number of Byzantine attackers," in Proc. IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP'19), 2018, pp. 6373–6377.
[24] L. Su and J. Xu, "Securing distributed machine learning in high dimensions," arXiv preprint, 2018.
[25] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, "Defending against saddle point attack in Byzantine-robust distributed learning," in Proc. 36th Intl. Conf. Machine Learning, Jun. 2019, pp. 7074–7084.
[26] G. Damaskinos, E. E. Mhamdi, R. Guerraoui, R. Patra, and M. Taziki, "Asynchronous Byzantine machine learning (the case of SGD)," in Proc. 35th Int. Conf. Machine Learning, 2018, pp. 1145–1154.
[27] E. E. Mhamdi, R. Guerraoui, and S. Rouault, "The hidden vulnerability of distributed learning in Byzantium," in Proc. 35th Int. Conf. Machine Learning, 2018, pp. 3521–3530.
[28] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," in Proc. 35th Intl. Conf. Machine Learning, Jul. 2018, pp. 5650–5659.
[29] D. Alistarh, Z. Allen-Zhu, and J. Li, "Byzantine stochastic gradient descent," in Proc. Advances in Neural Information Processing Systems, 2018, pp. 4618–4628.
[30] C. Xie, O. Koyejo, and I. Gupta, "Zeno: Byzantine-suspicious stochastic gradient descent," arXiv preprint, 2018.
[31] ——, "Generalized Byzantine-tolerant SGD," arXiv preprint arXiv:1802.10116, 2018.
[32] ——, "Phocas: Dimensional Byzantine-resilient stochastic gradient descent," arXiv preprint, 2018.
[33] X. Chen, T. Chen, H. Sun, S. Wu, and M. Hong, "Distributed training with heterogeneous data: Bridging median- and mean-based algorithms," in Proc. Advances in Neural Information Processing Systems, 2020, pp. 21616–21626.
[34] S. Rajput, H. Wang, Z. Charles, and D. Papailiopoulos, "DETOX: A redundancy-based framework for faster and more robust gradient aggregation," in Proc. Advances in Neural Information Processing Systems, 2019.
[35] L. Li, W. Xu, T. Chen, G. Giannakis, and Q. Ling, "RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets," in Proc. AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1544–1551.
[36] R. Jin, X. He, and H. Dai, "Distributed Byzantine tolerant stochastic gradient descent in the era of big data," in Proc. IEEE Intl. Conf. Communications (ICC), 2019, pp. 1–6.
[37] F. Lin, Q. Ling, and Z. Xiong, "Byzantine-resilient distributed large-scale matrix completion," in Proc. IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP'19), 2019, pp. 8167–8171.
[38] A. Ghosh, J. Hong, D. Yin, and K. Ramchandran, "Robust federated learning in a heterogeneous environment," arXiv preprint arXiv:1906.06629, 2019.
[39] D. Data, L. Song, and S. N. Diggavi, "Data encoding for Byzantine-resilient distributed optimization," IEEE Transactions on Information Theory, vol. 67, no. 2, pp. 1117–1140, 2021.
[40] E. M. E. Mhamdi, R. Guerraoui, A. Guirguis, and S. Rouault, "SGD: Decentralized Byzantine resilience," arXiv preprint arXiv:1905.03853, 2019.
[41] C. Xie, S. Koyejo, and I. Gupta, "Zeno++: Robust fully asynchronous SGD," in Proc. 37th Intl. Conf. Machine Learning, Jul. 2020, pp. 10495–10503.
[42] E.-M. El-Mhamdi, R. Guerraoui, and S. Rouault, "Fast and robust distributed learning in high dimension," arXiv preprint, 2019.
[43] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[44] S. S. Ram, A. Nedić, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory and Appl., vol. 147, no. 3, pp. 516–545, 2010.
[45] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, 2015.
[46] S. Pu and A. Nedić, "Distributed stochastic gradient tracking methods," Mathematical Programming, vol. 187, pp. 409–457, 2021.
[47] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," J. Mach. Learning Research, vol. 11, pp. 1663–1707, 2010.
[48] J. F. Mota, J. M. Xavier, P. M. Aquiar, and M. Puschel, "D-ADMM: A communication-efficient distributed algorithm for separable optimization," IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2718–2723, 2013.
[49] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Trans. Signal Process., vol. 62, no.
7, pp. 1750–1761, 2014. [50] A. Mokhtari, W . Shi, Q. Ling, and A. Ribeiro, “ A decentralized second-order method with exact linear conver gence rate for consensus optimization, ” IEEE T rans. Signal Inf. Pr ocess. Netw . , vol. 2, no. 4, pp. 507–522, 2016. 20 [51] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton distributed optimization methods, ” IEEE T rans. Signal Pr ocess. , vol. 65, no. 1, pp. 146–161, 2017. [52] H. J. LeBlanc, H. Zhang, X. Koutsoukos, and S. Sundaram, “Resilient asymptotic consensus in robust networks, ” IEEE J. Sel. Areas in Com- mun. , vol. 31, no. 4, pp. 766–781, 2013. [53] N. H. V aidya, L. Tseng, and G. Liang, “Iterati ve Byzantine vector consensus in incomplete graphs, ” in Pr oc. 15th Int. Conf. Distributed Computing and Networking , 2014, pp. 14–28. [54] S. Sundaram and B. Gharesifard, “Distributed optimization under ad- versarial nodes, ” IEEE T rans. Autom. Control , vol. 64, no. 3, pp. 1063– 1076, 2019. [55] Z. Y ang and W . U. Bajwa, “RD-SVM: A resilient distributed support vector machine, ” in Proc. IEEE Int. Conf. Acoust. Speech and Signal Pr ocess. (ICASSP’16) , 2016, pp. 2444–2448. [56] W . Xu, Z. Li, and Q. Ling, “Robust decentralized dynamic optimization at presence of malfunctioning agents, ” Signal Pr ocess. , vol. 153, pp. 24–33, 2018. [57] A. Mitra, J. Richards, S. Bagchi, and S. Sundaram, “Resilient distributed state estimation with mobile agents: Overcoming Byzantine adversaries, communication losses, and intermittent measurements, ” Autonomous Robots , vol. 43, no. 3, pp. 743–768, 2019. [58] L. Su and S. Shahrampour, “Finite-time guarantees for Byzantine- resilient distrib uted state estimation with noisy measurements, ” IEEE T ransactions on Automatic Contr ol , vol. 65, no. 9, pp. 3758–3771, 2020. [59] Z. Y ang and W . U. Bajwa, “ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning, ” IEEE T rans. Signal Inf. Pr ocess. Netw . , vol. 5, no. 4, pp. 611–627, Dec. 2019. [60] K. 
Kuwaranancharoen, L. Xin, and S. Sundaram, “Byzantine-resilient distributed optimization of multi-dimensional functions, ” in Proc. Amer- ican Control Confer ence (ACC) , 2020, pp. 4399–4404. [61] J. Peng, W . Li, and Q. Ling, “Byzantine-robust decentralized stochastic optimization ov er static and time-varying networks, ” Signal Pr ocessing , vol. 183, p. 108020, 2021. [62] S. Guo, T . Zhang, X. Xie, L. Ma, T . Xiang, and Y . Liu, “T owards Byzantine-resilient learning in decentralized systems, ” arXiv pr eprint arXiv:2002.08569 , 2020. [63] E.-M. El-Mhamdi, R. Guerraoui, A. Guirguis, L. Hoang, and S. Rouault, “Collaborativ e learning as an agreement problem, ” arXiv preprint arXiv:2008.00742v3 , 2020. [64] P . D. Lorenzo and G. Scutari, “Next: In-network nonconv ex optimiza- tion, ” IEEE T ransactions on Signal and Information Pr ocessing over Networks , vol. 2, pp. 120–136, 2016. [65] J. Zeng and W . Y in, “On noncon vex decentralized gradient descent, ” IEEE T ransactions on Signal Pr ocessing , vol. 66, pp. 2834–2848, 2018. [66] H. Sun, S. Lu, and M. Hong, “Improving the sample and communication complexity for decentralized non-conv ex optimization: Joint gradient estimation and tracking, ” in Pr oc. 37th Intl. Conf. Machine Learning , Jul. 2020, pp. 9217–9228. [67] R. Xin, U. A. Khan, and S. Kar, “Fast decentralized non-con vex finite- sum optimization with recursive variance reduction, ” arXiv preprint arXiv:2008.07428 , 2020. [68] L. He, S. P . Karimireddy , and M. Jaggi, “Byzantine-robust decentralized learning via self-centered clipping, ” arXiv preprint , 2022. [69] Y . Lecun, L. Bottou, Y . Bengio, and P . Haffner , “Gradient-based learning applied to document recognition, ” Pr oceedings of the IEEE , vol. 86, no. 11, pp. 2278–2324, 1998. [70] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images, ” 2009. [71] H. H. Sohrab, Basic Real Analysis , 2nd ed. New Y ork, NY : Springer, 2003. [72] J. Sun, Q. Qu, and J. 
Wright, “When are nonconv ex problems not scary?” arXiv preprint arXiv:1510.06096 , 2015. [73] P . Jain and P . Kar, “Non-con vex optimization for machine learning, ” F oundations and T r ends in Machine Learning , vol. 10, no. 3-4, p. 142–336, 2017. [74] B. Y onel and B. Y azici, “ A deterministic theory for exact non-con vex phase retriev al, ” IEEE T ransactions on Signal Processing , vol. 68, pp. 4612–4626, 2020. [75] Y . Zhou, H. Zhang, and Y . Liang, “Geometrical properties and accel- erated gradient solvers of non-conv ex phase retrieval, ” in Pr oc. 54th Annu. Allerton Conf. Communication, Control, and Computing , 2016, pp. 331–335. [76] T . Ypma, “Local conv ergence of inexact Newton methods, ” SIAM Journal on Numerical Analysis , vol. 21, no. 3, pp. 583–590, 1984. [77] P . Ochs, “Local conv ergence of the heavy-ball method and iPiano for non-con vex optimization, ” Journal of Optimization Theory and Applications , vol. 177, no. 1, pp. 153–180, 2018. [78] S. Bock and M. W eiß, “ A proof of local con vergence for the Adam optimizer , ” in Proc. International Joint Conference on Neural Networks (IJCNN) . IEEE, 2019, pp. 1–8. [79] J. C. Duchi, A. Agarwal, and M. J. W ainwright, “Dual averaging for distributed optimization: Conver gence analysis and network scaling, ” IEEE T rans. Autom. control , vol. 57, no. 3, pp. 592–606, 2012. [80] L. Su and N. V aidya, “Multi-agent optimization in the presence of Byzantine adversaries: Fundamental limits, ” in Pr oc. American Contr ol Confer ence (ACC) , 2016, pp. 7183–7188. [81] H. Y ang, X. zhong Zhang, M. Fang, and J. Liu, “Byzantine-resilient stochastic gradient descent for distributed learning: A Lipschitz-inspired coordinate-wise median approach, ” Pr oc. IEEE Conference on Decision and Control (CDC) , pp. 5832–5837, 2019. [82] D. Data and S. Diggavi, “Byzantine-resilient high-dimensional SGD with local iterations on heterogeneous data, ” in Pr oc. 38th Intl. Conf. Machine Learning , Jul. 2021, pp. 2478–2488. 
[83] P . J. Huber, Robust Statistics . Berlin, Heidelberg: Springer , 2011. [84] L. Su and N. H. V aidya, “Fault-tolerant distributed optimization (part IV): Constrained optimization with arbitrary directed networks, ” arXiv pr eprint arXiv:1511.01821 , 2015. [85] Y . Nesterov , Intr oductory Lectures on Con vex Optimization , ser. Applied optimization; v . 87. Springer US, 2004. [86] Z. Y ang and W . U. Bajwa, “BRIDGE: Byzantine-resilient decentralized gradient descent, ” arXiv preprint , Aug. 2019. [Online]. A vailable: https://arxiv .org/abs/1908.08098v1 [87] S. Mei, Y . Bai, and A. Montanari, “The landscape of empirical risk for noncon vex losses, ” The Annals of Statistics , vol. 46, no. 6A, pp. 2747 – 2774, 2018. [88] D. Davis and D. Drusvyatskiy , “Graphical con vergence of subgradients in noncon vex optimization and learning, ” Mathematics of Operations Resear ch , vol. 47, no. 1, pp. 209–231, 2021. [89] D. J. Foster , A. Sekhari, and K. Sridharan, “Uniform conv ergence of gradients for non-con vex learning and optimization, ” in Pr oc. Advances in Neural Information Processing Systems , vol. 31. Curran Associates, Inc., 2018. [90] T . V . J. Le and N. Gopee, “Classifying CIF AR-10 images using unsupervised feature & ensemble learning, ” Dec 2016. [Online]. A vailable: https://trucvietle.me/files/601- report.pdf [91] W . Hoeffding, “Probability inequalities for sums of bounded random variables, ” J . American Stat. Assoc. , vol. 58, no. 301, pp. 13–30, 1963. [92] J. V erger -Gaugry , “Covering a ball with smaller equal balls in R n , ” Discr ete & Computational Geometry , vol. 33, no. 1, pp. 143–155, 2005.