Published October 2015 | Version public
Journal Article

Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale

Abstract

As clusters continue to grow in size and complexity, providing scalable and predictable performance is an increasingly important challenge. A crucial roadblock to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. Speculative execution has been widely adopted to mitigate the impact of stragglers. However, speculation mechanisms are designed and operated independently of job scheduling, even though scheduling a speculative copy of a task directly affects the resources available to other jobs. In this work, we present Hopper, a speculation-aware job scheduler, i.e., one that integrates the tradeoffs associated with speculation into job scheduling decisions. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that coordinating scheduling and speculation yields improvements of 50% (66%) over state-of-the-art centralized (decentralized) schedulers and speculation strategies.

Additional Information

© 2015 ACM. We would like to thank Michael Chien-Chun Hung, Shivaram Venkataraman, Masoud Moshref, Niangjun Chen, Qiuyu Peng, and Changhong Zhao for their insightful discussions. We would also like to thank the anonymous reviewers and our shepherd, Lixin Gao, for their thoughtful suggestions. This work was supported in part by the National Science Foundation (NSF) under grants CNS-1319820 and CNS-1423505.

Additional details

Identifiers

Eprint ID
72788
DOI
10.1145/2829988.2787481
Resolver ID
CaltechAUTHORS:20161213-150643278

Funding

NSF
CNS-1319820
NSF
CNS-1423505

Dates

Created
2016-12-13
Created from EPrint's datestamp field
Updated
2021-11-11
Created from EPrint's last_modified field