Parallelization

A major challenge in Federated Learning is handling Byzantine machines, whose behavior can be completely arbitrary. This can happen due to software or hardware crashes, poor communication links between the local machines and the center machine, stalled computations, or even coordinated or malicious attacks by a third party.

A second challenge comes from the loss landscape. In order to fit complex machine learning models, one is often required to find local minima of a non-convex loss f(.), rather than mere critical points, which may include many saddle points. Training deep neural networks and other high-capacity learning architectures are examples where finding local minima is crucial. It has been shown that many stationary points of these problems are in fact saddle points that lie far away from any local minimum, so designing an efficient algorithm that escapes saddle points is of considerable interest. Moreover, saddle points can lead to highly sub-optimal solutions in many problems of interest; this effect is amplified in high dimensions and becomes a main bottleneck in training deep neural networks. At the same time, recent work shows that for many non-convex problems it suffices to find a local minimum: in many problems of interest, all local minima are global minima (e.g., dictionary learning, phase retrieval, matrix sensing and completion, and some neural networks), and for more general neural networks it is often argued that local minima are as good as global minima.

The problem of saddle point avoidance in non-convex optimization has received considerable attention in the past few years. In this work, we consider a variant of the well-known cubic-regularized Newton algorithm of Nesterov and Polyak, namely FEDerated CUbic REgularized Newton (FED-CURE), which efficiently escapes the saddle points of a non-convex function by adding an appropriately chosen regularization, thereby pushing the Hessian towards a positive semi-definite matrix. The primary motivation behind this choice is the faster convergence rate compared to first-order methods, which is crucial for communication efficiency in applications like federated learning.
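For reference, the classical Nesterov–Polyak step minimizes a cubic-regularized second-order model of f at the current iterate; the local auxiliary problems in FED-CURE are built around a subproblem of this form (shown here in sketch form, without the federated modifications discussed below):

```latex
% Cubic-regularized Newton subproblem of Nesterov and Polyak:
% x_t is the current iterate, g_t = \nabla f(x_t), H_t = \nabla^2 f(x_t),
% and M > 0 upper-bounds the Lipschitz constant of the Hessian.
\[
  h_t = \arg\min_{h \in \mathbb{R}^d}
        \langle g_t, h \rangle
        + \tfrac{1}{2}\, h^{\top} H_t\, h
        + \tfrac{M}{6}\, \lVert h \rVert^{3},
  \qquad
  x_{t+1} = x_t + h_t .
\]
```

The cubic penalty discourages long steps in directions of negative curvature unless they decrease the model substantially, which is what allows the method to move away from saddle points.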

Figure 1. (a) Plot of the function value under different initializations, showing that the algorithm escapes the saddle point with function value 0. (b, c, d) Comparison of our algorithm with ByzantinePGD (Yin et al., 2019) in terms of the total number of iterations.

FED-CURE escapes saddle points efficiently and converges at a rate of T^(-2/3), which is faster than first-order methods (which converge at a rate of T^(-1/2)). Moreover, this rate matches that of the centralized scheme of Nesterov and Polyak, so we lose nothing in terms of convergence rate by making the algorithm distributed.
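In sketch form, and omitting constants and problem-dependent factors, the quoted rates can be read as guarantees on how fast the gradient norm shrinks over T iterations:

```latex
% Reading of the rates quoted above (constants omitted): the cubic-regularized
% scheme drives the gradient norm down at rate T^{-2/3}, versus T^{-1/2} for
% gradient-descent-type first-order methods.
\[
  \min_{1 \le t \le T} \lVert \nabla f(x_t) \rVert = O\!\left(T^{-2/3}\right)
  \quad \text{vs.} \quad
  \min_{1 \le t \le T} \lVert \nabla f(x_t) \rVert = O\!\left(T^{-1/2}\right).
\]
```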

In FED-CURE, the center machine aggregates the solutions of the local machines. We emphasize that, unlike gradient aggregation, aggregating the solutions of local optimization problems is a highly non-linear operation, as evidenced even by much simpler second-order optimization algorithms such as GIANT (Wang et al., 2018). Hence, it is non-trivial to extend the centralized cubic-regularized algorithm to a distributed one. Moreover, the cubic-regularized subproblem lacks a closed-form solution, unlike the Hessian-based second-order update or the gradient-based first-order update. The analysis of FED-CURE is therefore carried out by leveraging the first- and second-order stationarity conditions of the auxiliary function solved on each local machine.
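The following Python sketch illustrates the round structure described above under simplifying assumptions: each worker approximately solves its local cubic subproblem (here by gradient descent on the cubic model) and the center simply averages the returned solutions. The function names, the inner solver, and the plain-mean aggregation are illustrative choices rather than the exact FED-CURE procedure; the compression and Byzantine filtering discussed next are omitted here.

```python
import numpy as np

def solve_cubic_subproblem(grad, hess, M, steps=200, lr=1e-2):
    """Approximately minimize <grad, h> + 0.5 h^T hess h + (M/6)||h||^3
    by gradient descent on the cubic model (one of several possible solvers)."""
    h = np.zeros_like(grad)
    for _ in range(steps):
        # Gradient of the cubic model: grad + hess h + (M/2)||h|| h
        model_grad = grad + hess @ h + 0.5 * M * np.linalg.norm(h) * h
        h -= lr * model_grad
    return h

def local_update(x, local_grad_fn, local_hess_fn, M):
    """Worker-side step: build the local cubic model at x and solve it."""
    g = local_grad_fn(x)   # gradient on this worker's data
    H = local_hess_fn(x)   # Hessian on this worker's data
    return solve_cubic_subproblem(g, H, M)

def fed_cure_round(x, workers, M):
    """Center-side step: collect local solutions and aggregate (plain mean here)."""
    local_solutions = [local_update(x, w["grad"], w["hess"], M) for w in workers]
    return x + np.mean(local_solutions, axis=0)
```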

Along with saddle point avoidance, we simultaneously address the issues of (i) communication efficiency and (ii) Byzantine resilience, by using an approximate compressor and a norm-based thresholding scheme, respectively. The major technical challenge is handling all of these issues jointly rather than in isolation, as sketched below.
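As a hedged illustration of these two ingredients, the sketch below uses a top-k sparsifier as one common example of an approximate compressor, and a simple norm-based thresholding rule that discards the largest-norm local solutions before averaging; the actual compressor, threshold, and filtering rule used in FED-CURE may differ.

```python
import numpy as np

def top_k_compress(v, k):
    """Example approximate compressor: keep the k largest-magnitude
    coordinates of v and zero out the rest (workers send the compressed v)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def norm_thresholded_mean(updates, trim_fraction=0.2):
    """Norm-based thresholding at the center: sort the received local
    solutions by Euclidean norm, drop the largest trim_fraction (suspected
    Byzantine), and average the remaining ones."""
    updates = sorted(updates, key=np.linalg.norm)
    keep = max(1, int(np.ceil((1.0 - trim_fraction) * len(updates))))
    return np.mean(updates[:keep], axis=0)
```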

Team Members

Arya Mazumdar¹

1. UC San Diego

Publications

FL-ICML 2023