
signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Bernstein, Jeremy and Zhao, Jiawei and Azizzadenesheli, Kamyar and Anandkumar, Anima (2018) signSGD with Majority Vote is Communication Efficient And Fault Tolerant. (Unpublished) http://resolver.caltech.edu/CaltechAUTHORS:20190327-085803968

PDF - Submitted Version (1041 kB). See Usage Policy.

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:20190327-085803968

Abstract

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses 32× less communication per iteration than full-precision distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large-batch and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that, unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time to train ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.
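The aggregation rule in the abstract can be illustrated compactly. Below is a minimal PyTorch sketch of one majority-vote update, assuming each worker has already computed a stochastic gradient for a shared parameter vector; the function and variable names (majority_vote_step, worker_gradients, learning_rate) are illustrative, not taken from the authors' released code.

    # Minimal sketch of a signSGD-with-majority-vote step.
    # Assumption: worker_gradients is a list of gradient tensors,
    # one per worker, all for the same parameter vector.
    import torch

    def majority_vote_step(params, worker_gradients, learning_rate=1e-3):
        """One signSGD update aggregated by majority vote.

        Each worker sends only sign(g): 1 bit per coordinate, hence
        the 32x communication saving over 32-bit float gradients.
        The server sums the sign vectors and takes the sign of the
        sum (a coordinate-wise majority vote), then the voted sign
        vector is used as the descent direction.
        """
        # Workers: compress each gradient to its elementwise sign.
        sign_votes = [torch.sign(g) for g in worker_gradients]
        # Server: majority vote = sign of the summed sign vectors.
        vote = torch.sign(torch.stack(sign_votes).sum(dim=0))
        # Step along the voted sign direction.
        return params - learning_rate * vote

    # Example usage with 5 hypothetical workers:
    params = torch.zeros(4)
    grads = [torch.randn(4) for _ in range(5)]
    params = majority_vote_step(params, grads)

Because each worker contributes at most ±1 per coordinate, a faulty or adversarial worker can shift the tally by only one vote, which is the intuition behind the robustness to up to 50% adversarial workers claimed above.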


Item Type: Report or Paper (Discussion Paper)
Related URLs:
URL                               URL Type    Description
http://arxiv.org/abs/1810.05291  arXiv       Discussion Paper
Additional Information: JB was primary contributor for theory. JZ was primary contributor for large-scale experiments. We would like to thank Yu-Xiang Wang, Alexander Sergeev, Soumith Chintala, Pieter Noordhuis, Hongyi Wang, Scott Sievert and El Mahdi El Mhamdi for useful discussions. KA is supported in part by NSF Career Award CCF-1254106. AA is supported in part by a Microsoft Faculty Fellowship, Google Faculty Award, Adobe Grant, NSF Career Award CCF-1254106, and AFOSR YIP FA9550-15-1-0221.
Funders:
Funding Agency                                    Grant Number
NSF                                               CCF-1254106
Microsoft Faculty Fellowship                      UNSPECIFIED
Google Faculty Research Award                     UNSPECIFIED
Adobe                                             UNSPECIFIED
Air Force Office of Scientific Research (AFOSR)   FA9550-15-1-0221
Record Number: CaltechAUTHORS:20190327-085803968
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20190327-085803968
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 94178
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 28 Mar 2019 14:31
Last Modified: 28 Mar 2019 14:31
