Scheduling for Today’s Computer Systems:
Bridging Theory and Practice
Adam Wierman
2007
CMU-CS-07-126
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Thesis Committee:
Mor Harchol-Balter, chair
John Lafferty
Bruce Maggs
Alan Scheller-Wolf
Ward Whitt
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
© 2007 Adam Wierman
This research was supported, in part, by an NSF Graduate Research Fellowship and a Siebel Scholar award. Additional funding
was provided by a grant from the Pittsburgh Digital Greenhouse and NSF grants CCR-0133077, CCR-0311383, and CCR-9457766.
The views and conclusions contained in this document are those of the author, and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government, or any other entity.
Keywords: scheduling; queueing; fairness; tail behavior; response time; classification; multiserver; SRPT; LAS; FB; PS; FSP; PSJF; PLCFS; SJF; SMART; FOOLISH; protective; symmetric; closed system; M/G/1; web servers; routers; wireless; access points; databases
This thesis is dedicated to
my wife Sonya,
whose love and encouragement made this thesis possible,
and my parents,
who have always supported me in everything I’ve attempted.
Abstract
Scheduling is a fundamental technique for improving performance in computer systems. From web servers
to routers to operating systems, how the bottleneck device is scheduled has an enormous impact on the
performance of the system as a whole. Given the immense literature studying scheduling, it is tempting to
think that scheduling is already well understood. But modern computer system designs have
highlighted a number of disconnects between traditional analytic results and the needs of system designers.
In particular, the idealized policies, metrics, and models used by analytic researchers do not match the
policies, metrics, and scenarios that appear in real systems.
The goal of this thesis is to take a step towards modernizing the theory of scheduling in order to provide
results that apply to today’s computer systems, and thus ease the burden on system designers. To accom-
plish this goal, we provide new results that help to bridge each of the disconnects mentioned above. We will
move beyond the study of idealized policies by introducing a new analytic framework where the focus is on
scheduling heuristics and techniques rather than individual policies. By moving beyond the study of individ-
ual policies, our results apply to the complex hybrid policies that are often used in practice. For example, our
results enable designers to understand how the policies that favor small job sizes are affected by the fact that
real systems only have estimates of job sizes. In addition, we move beyond the study of mean response time
and provide results characterizing the distribution of response time and the fairness of scheduling policies.
These results allow us to understand how scheduling affects QoS guarantees and whether favoring small
job sizes results in large job sizes being treated unfairly. Finally, we move beyond the simplified models
traditionally used in scheduling research and provide results characterizing the effectiveness of scheduling
in multiserver systems and when users are interactive. These results allow us to answer questions about how to design multiserver systems and how to choose a workload generator when evaluating new scheduling designs.
Thesis Committee
Mor Harchol-Balter (Chair)
Computer Science Department
Carnegie Mellon University
John Lafferty
Computer Science Department and Machine Learning Department
Carnegie Mellon University
Bruce Maggs
Computer Science Department
Carnegie Mellon University
Alan Scheller-Wolf
Tepper School of Business
Carnegie Mellon University
Ward Whitt
Department of Industrial Engineering and Operations Research
Columbia University
Acknowledgements
I have spent nearly ten years at Carnegie Mellon, working through the undergraduate and graduate programs,
and I consider myself extremely lucky to have had the opportunity to interact with such an amazing group of
scholars and students. Carnegie Mellon is an exceptional environment and I can only hope that I will serve
as a worthy representative of my alma mater throughout my future career.
Computer Science at Carnegie Mellon provides an amazingly supportive environment for its students,
so I have many people that I wish to thank. First, I would like to thank my advisor, Mor Harchol-Balter.
There is no aspect of research in which she did not provide excellent guidance. She has taught me how to
take stock of the big picture, not only when presenting my work, but also as a guide for new directions.
Her advice also extended beyond research to career planning and teaching, and I am extremely grateful for
the time she spent working with me on these areas. Finally, Mor’s boundless and infectious energy is an
enormous inspiration. There were many times during my graduate career when I came into a meeting with
Mor frustrated about a problem or stuck on a proof, but I had a new excitement about the work by the end
of the meeting (even when we had not actually made any concrete progress).
Next, I would like to thank Alan Scheller-Wolf, who in many ways acted as an advisor for the operations
research half of my work. His guidance, both about research and beyond, has always been thoughtful and
insightful. I also want to thank the other members of my thesis committee: John Lafferty, Bruce Maggs,
and Ward Whitt, who each provided valuable feedback that helped me look at aspects of the thesis in a new
light. In the same breath, I would like to thank Anupam Gupta for his advice over the years on research,
teaching, and life.
I have had an amazing group of coauthors over my graduate career, and each of them deserves my thanks.
I would especially like to express my gratitude to Takayuki Osogami. Through research collaborations,
presentation critiques, career advice, and more, Taka played a key role in my graduate career. And, equally
importantly, grad school just would not have been as fun without him around. Bianca Schroeder also de-
serves a big thank you, not only for her role in our research collaborations, but also for her critiques of
my presentations and for her help with my job search. Collaborations with Misja Nuyens, Bert Zwart, and
Jorgen Olsen were amazingly fruitful given that we were hardly ever on the same continent. Thank you
for taking the extra effort to collaborate via email. Onno Boxma and Ivo Adan gave me the opportunity to
spend half a year working at the EURANDOM institute in Eindhoven, which not only led to a number of
interesting collaborations, but also provided me a wonderful opportunity to experience a different culture
first hand, in a way that a vacation can never provide. Also, thank you to the many graduate students and
post-docs at EURANDOM who welcomed me and made me feel at home, especially Erik Winands, Marcel
van Vuuren, and Maak Holmes.
I have been lucky to have the support of many generous people throughout my graduate career. I would
like to give a special thank you to everyone who attended SQUALL over the years, especially Varun Gupta,
Paul Enders, and David McWherter. I would also like to thank the support staff in the department, who smoothly took care of all the administrative issues so that I could focus on my research, especially Charlotte Yano, Sharon Burks, and Catherine Copetas.
Of course, life did not end at the walls of Wean Hall, and I would like to acknowledge Brian Knudsen,
Helen Lafferty, Mike Rossi, and Rob Reeder, who have been great friends and have kept me sane over the
years. Also, I’d like to thank everyone who has helped with the Random Distance Run, especially the new
RDR guru, Gaurav Veda.
Finally, I would like to thank my family. I would not be where I am today without the support and
love of my parents, and this thesis would not be what it is without the encouragement and patience of my
wife Sonya. She has truly been a partner in my research and writing, letting me vent my complaints and
frustrations and keeping me on task (and off the golf course) when there was nothing I wanted to do less
than work on writing the thesis. She is my inspiration in research and life, and I am excited to find out what
lies ahead for us.
Table of Contents

Abstract v
Thesis Committee vii
Acknowledgements ix
Table of Contents xv
List of Figures xviii
List of Tables xix

I Motivation and Background 1

1 Introduction 3
1.1 Scheduling success stories . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The essence of a scheduling success story . . . . . . . . . . . . . . 6
1.3 Choosing a scheduling policy . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 What traditional theory says . . . . . . . . . . . . . . . . . . 7
1.3.2 What happens in practice . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Gaps between theory and practice . . . . . . . . . . . . . . . . 12
1.4 Bridging the gaps between theory and practice . . . . . . . . . . . . 13
1.5 An overview of the thesis . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Synopsis of Part I: Motivation and Background . . . . . . . . . 16
1.5.2 Synopsis of Part II: Moving beyond idealized policies . . . . . 16
1.5.3 Synopsis of Part III: Moving beyond mean response time . . . . . 18
1.5.4 Synopsis of Part IV: Moving beyond the M/GI/1 . . . . . . . . . 20
1.5.5 Synopsis of Part V: Further discussion and conclusion . . . . . 23

2 The basic model of the thesis 25
2.1 An overview of the model . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Performance metrics of interest . . . . . . . . . . . . . . . . . . . 28
2.3 Commonly used notation . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Basic mathematical notation . . . . . . . . . . . . . . . . . . 29
2.3.2 Queueing-specific notation . . . . . . . . . . . . . . . . . . . 30
2.4 Commonly used distributions . . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Phase-type distributions . . . . . . . . . . . . . . . . . . . . 32
2.4.2 Heavy-tailed distributions . . . . . . . . . . . . . . . . . . . 35

3 An introduction to common policies 43
3.1 Simple policies . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.1 First-Come-First-Served (FCFS) and the stationary workload . . . 44
3.1.2 Preemptive-Last-Come-First-Served (PLCFS) and busy periods . . . 46
3.1.3 Non-preemptive blind scheduling . . . . . . . . . . . . . . . . 49
3.1.4 Processor-Sharing (PS) . . . . . . . . . . . . . . . . . . . . . 51
3.2 Priority-based policies . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Notation for priority-based policies . . . . . . . . . . . . . . 56
3.2.2 Non-preemptive priority queues . . . . . . . . . . . . . . . . . 57
3.2.3 Preemptive priority queues . . . . . . . . . . . . . . . . . . . 63
3.2.4 Shortest-Remaining-Processing-Time-First (SRPT) . . . . . . . . 70
3.2.5 Foreground-Background scheduling (FB) . . . . . . . . . . . . . 83
3.2.6 Other priority based policies . . . . . . . . . . . . . . . . . 92
3.3 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 93
II Scheduling Classifications: Moving Beyond Idealized Policies 95

4 Classification via scheduling heuristics 99
4.1 The class of SMART policies . . . . . . . . . . . . . . . . . . . . . 101
4.1.1 Defining SMART scheduling . . . . . . . . . . . . . . . . . . . 101
4.1.2 Examples of SMART policies . . . . . . . . . . . . . . . . . . . 103
4.1.3 Policies excluded from SMART . . . . . . . . . . . . . . . . . . 105
4.1.4 Bounding response times for SMART policies . . . . . . . . . . . 105
4.2 Generalizing the SMART class . . . . . . . . . . . . . . . . . . . . 113
4.2.1 Defining SMARTε . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2.2 Examples of SMARTε policies . . . . . . . . . . . . . . . . . . 115
4.2.3 Bounding response times for SMARTε policies . . . . . . . . . . 116
4.3 The class of FOOLISH policies . . . . . . . . . . . . . . . . . . . . 121
4.3.1 Defining FOOLISH scheduling . . . . . . . . . . . . . . . . . . 122
4.3.2 Examples of FOOLISH policies . . . . . . . . . . . . . . . . . . 123
4.3.3 Bounding response times for FOOLISH policies . . . . . . . . . . 123
4.4 The class of SYMMETRIC policies . . . . . . . . . . . . . . . . . . . 125
4.4.1 Defining SYMMETRIC scheduling . . . . . . . . . . . . . . . . . 126
4.4.2 Examples of SYMMETRIC scheduling . . . . . . . . . . . . . . . . 126
4.4.3 Bounding response times for SYMMETRIC policies . . . . . . . . . 127
4.5 The class of PROTECTIVE scheduling . . . . . . . . . . . . . . . . . 129
4.5.1 Fair-Sojourn-Protocol (FSP) . . . . . . . . . . . . . . . . . . 129
4.5.2 Defining PROTECTIVE scheduling . . . . . . . . . . . . . . . . . 132
4.5.3 Bounding response times for PROTECTIVE policies . . . . . . . . 133
4.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 135

5 Classification via scheduling techniques 139
5.1 The class of preemptive size based policies . . . . . . . . . . . . . 141
5.1.1 Defining a class of preemptive size based policies . . . . . . . 141
5.1.2 Bounding response times for preemptive size based policies . . . 141
5.2 The class of remaining size based policies . . . . . . . . . . . . . 143
5.2.1 Defining a class of remaining size based policies . . . . . . . 143
5.2.2 Bounding response times for remaining size based policies . . . 143
5.3 The class of age based policies . . . . . . . . . . . . . . . . . . . 145
5.3.1 Defining a class of age based policies . . . . . . . . . . . . . 145
5.3.2 Bounding response times for age based policies . . . . . . . . . 146
5.4 The class of non-preemptive policies . . . . . . . . . . . . . . . . 147
5.4.1 Defining classes of non-preemptive policies . . . . . . . . . . 147
5.4.2 Bounding response times for non-preemptive policies . . . . . . 148
5.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 149

III Diverse Metrics: Moving Beyond Mean Response Time 153
6 The distribution of response time 157
6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.2 The response time tail under individual policies . . . . . . . . . . 161
6.2.1 FCFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2.2 SRPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.2.3 PS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.2.4 FB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.2.5 LCFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.3 The response time tail under scheduling classifications . . . . . . . 170
6.3.1 The class of non-preemptive policies . . . . . . . . . . . . . . 170
6.3.2 The SMART class . . . . . . . . . . . . . . . . . . . . . . . . 171
6.3.3 The FOOLISH class . . . . . . . . . . . . . . . . . . . . . . . 179
6.4 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 181

7 Fairness 183
7.1 Proportional fairness in expectation . . . . . . . . . . . . . . . . 185
7.1.1 Defining proportional fairness in expectation . . . . . . . . . 186
7.1.2 The proportional fairness of individual policies . . . . . . . . 187
7.1.3 The proportional fairness of scheduling classifications . . . . 194
7.2 Proportional fairness to large jobs . . . . . . . . . . . . . . . . . 200
7.2.1 Asymptotic behavior of slowdown . . . . . . . . . . . . . . . . 201
7.2.2 Scaling response times . . . . . . . . . . . . . . . . . . . . . 205
7.3 A unified framework for proportional fairness . . . . . . . . . . . . 209
7.4 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.4.1 Defining predictability . . . . . . . . . . . . . . . . . . . . 212
7.4.2 The predictability of individual policies . . . . . . . . . . . 213
7.4.3 The predictability of scheduling classifications . . . . . . . . 223
7.5 Temporal Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.5.1 Defining politeness . . . . . . . . . . . . . . . . . . . . . . 231
7.5.2 The politeness of individual policies . . . . . . . . . . . . . 232
7.5.3 The politeness of scheduling classifications . . . . . . . . . . 236
7.6 Hybrid fairness metrics . . . . . . . . . . . . . . . . . . . . . . . 241
7.6.1 Order Fairness . . . . . . . . . . . . . . . . . . . . . . . . . 241
7.6.2 RAQFM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.6.3 Discrimination Frequency . . . . . . . . . . . . . . . . . . . . 243
7.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 244

IV Broader Models: Moving Beyond the M/GI/1 247
8 The impact of interactive users 251
8.1 Defining closed, open, and partly-open systems . . . . . . . . . . . 253
8.2 Comparison methodology . . . . . . . . . . . . . . . . . . . . . . . 255
8.3 Real-world case studies . . . . . . . . . . . . . . . . . . . . . . . 256
8.3.1 Static web content . . . . . . . . . . . . . . . . . . . . . . . 256
8.3.2 E-commerce site . . . . . . . . . . . . . . . . . . . . . . . . 260
8.3.3 Auctioning web site . . . . . . . . . . . . . . . . . . . . . . 261
8.3.4 Supercomputing center . . . . . . . . . . . . . . . . . . . . . 262
8.3.5 Study of WAN effects . . . . . . . . . . . . . . . . . . . . . . 264
8.4 Open versus closed systems . . . . . . . . . . . . . . . . . . . . . 264
8.4.1 FCFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
8.4.2 The impact of scheduling . . . . . . . . . . . . . . . . . . . . 267
8.5 Partly-open systems . . . . . . . . . . . . . . . . . . . . . . . . . 269
8.6 Choosing a system model . . . . . . . . . . . . . . . . . . . . . . . 271
8.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 273

9 The impact of multiserver architectures 275
9.1 Prior work analyzing multiserver priority queues . . . . . . . . . . 277
9.2 Analyzing the M/PH/k with m priority classes . . . . . . . . . . . . 278
9.2.1 Exponential job sizes and two priority classes . . . . . . . . . 278
9.2.2 Exponential job sizes and m priority classes . . . . . . . . . . 282
9.2.3 The M/PH/k with m priority classes . . . . . . . . . . . . . . . 285
9.2.4 Computing higher moments of response time . . . . . . . . . . . 286
9.2.5 A computationally efficient approximation . . . . . . . . . . . 287
9.3 Numerical validation and results . . . . . . . . . . . . . . . . . . 287
9.4 The impact of prioritization in an M/PH/k . . . . . . . . . . . . . . 288
9.4.1 The effect of the number of servers . . . . . . . . . . . . . . 289
9.4.2 The effect of “smart” prioritization . . . . . . . . . . . . . . 292
9.4.3 The effect of priority aggregation . . . . . . . . . . . . . . . 293
9.5 Designing multiserver systems . . . . . . . . . . . . . . . . . . . . 294
9.5.1 Prior work . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
9.5.2 How many servers are best in an FCFS system . . . . . . . . . . 295
9.5.3 How many servers are best in a dual priority system . . . . . . 296
9.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 304

V Impact and future directions 307

10 Conclusion 309
10.1 Lessons and surprises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
10.2 The impact for system design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
10.3 The impact for theoretical scheduling research . . . . . . . . . . . . . . . . . . . . . . . . . 315
10.4 Further directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Bibliography 319
Afterword 335
About the Author 337
List of Figures

1.1 An illustration of a single server queue. . . . . . . . . . . . . . . 7
1.2 An illustration of the impact of scheduling on mean response time. . 8
1.3 An overview of the classifications studied in this thesis. . . . . . 17
1.4 A comparison of the proportional fairness and temporal fairness classifications . . . 21
1.5 Illustrations of the closed and partly-open system models. . . . . . 22
2.1 Simple examples of phase-type distributions. . . . . . . . . . . . . 33
3.1 An illustration of how to view PS as a branching process . . . . . . 53
3.2 An illustration of the near insensitivity of PSJF . . . . . . . . . . 68
3.3 An illustration of the mean response time under PSJF . . . . . . . . 69
3.4 An illustration of the near insensitivity of SRPT . . . . . . . . . . 74
3.5 An illustration of the mean response time under SRPT . . . . . . . . 78
3.6 An illustration of the effect of job size variability on the mean response time of FB . . . 86
3.7 An illustration of the mean response time under FB . . . . . . . . . 89
3.8 A summary of the mean response time under common scheduling policies . . . 94
4.1 An overview of the classifications studied in this thesis. . . . . . 100
4.2 An illustration of the priority structure enforced by the SMART Bias Property . . . 102
4.3 An example illustrating that the SMART definition only enforces a partial ordering . . . 103
4.4 An illustration of bounds on response times under SMART policies . . 106
4.5 An illustration of the priority structure enforced by the SMARTε Bias Property . . . 114
4.6 The tradeoff between the accuracy of job size estimates and mean response time. . . . 121
4.7 An illustration of the priority structure enforced by the FOOLISH Bias Property . . . 122
4.8 An illustration of bounds on response times under FOOLISH policies . 123
4.9 An illustration of the response time under SYMMETRIC policies . . . 127
4.10 A comparison of the mean response times of FSP, PS, and SRPT . . . . 132
4.11 An illustration of bounds on response times under PROTECTIVE policies . . . 134
4.12 An illustration of bounds on conditional mean response time under scheduling heuristics . . . 136
4.13 An illustration of bounds on the overall response time under scheduling heuristics . . . 136
5.1 An overview of the classifications studied in this thesis. . . . . . 140
5.2 An illustration of bounds on response times under preemptive size based policies . . . 142
5.3 An illustration of bounds on response times under remaining size based policies . . . 144
5.4 An illustration of bounds on response times under age based policies . . . 146
5.5 An illustration of bounds on response times under non-preemptive policies . . . 148
5.6 Illustration of bounds on conditional mean response time under scheduling techniques . . . 150
5.7 Illustration of bounds on overall mean response time under scheduling techniques . . . 150
7.1 A detail of the proportional fairness classification . . . . . . . . 188
7.2 An illustration of the behavior of common policies with respect to proportional fairness . . . 189
7.3 A detail of the predictability classification . . . . . . . . . . . . 214
7.4 The behavior of both preemptive and non-preemptive policies with respect to predictability . . . 216
7.5 An illustration of the politeness of common policies . . . . . . . . 233
7.6 A detail of the politeness classification . . . . . . . . . . . . . . 237
7.7 A comparison of the proportional fairness, predictability, and politeness classifications . . . 245
8.1 Illustrations of the closed, open, and partly-open system models. . . 252
8.2 Implementation results comparing open, closed, and partly-open models. . . . 257
8.3 Flow of data in Linux with SRPT-like scheduling. . . . . . . . . . . 259
8.4 The effect of WAN conditions on open and closed systems. . . . . . . 263
8.5 A comparison of open and closed systems under FCFS . . . . . . . . . 265
8.6 The effect of job size variability, MPL, and think time on load in a closed system. . . . 266
8.7 Simulation results showing the effectiveness of scheduling in closed and open systems. . . . 268
8.8 Model and implementation-based results for the partly-open system. . 270
8.9 Statistics for 3 representative web traces illustrating the effect of varying the timeout threshold. . . . 273
9.1 An overview of Dimensionality Reduction in the case of an M/M/2 dual priority queue. . . . 279
9.2 The Markov chain resulting from applying DR to an M/M/3 dual priority queue. . . . 281
9.3 An overview of applying RDR in the case of 3 priority classes. . . . 283
9.4 An illustration of how to calculate the busy period distributions used by RDR. . . . 286
9.5 Numerical validation of RDR and RDR-A in the M/M/2 and M/PH/2 queues. . . . 289
9.6 Numerical validation of RDR-A in the M/PH/2 queue. . . . . . . . . . 290
9.7 A contrast of per-class mean response times under 1, 2, and 4 server systems. . . . 291
9.8 Error in predicting mean delay using the BB approximation. . . . . . 291
9.9 An illustration of the effect of “smart” and “foolish” scheduling in multiserver systems. . . . 292
9.10 An illustration of the impact of aggregating priority classes in multiserver systems. . . . 293
9.11 How many servers are best in an M/PH/k/FCFS queue. . . . . . . . . . 295
9.12 How many servers are best when the two priority classes have the same mean job size? . . . 297
9.13 How many servers are best when the high priority jobs have a smaller mean job size? . . . 299
9.14 How many servers are best when the high priority class has a larger mean job size? . . . 301
9.15 Mean response time as a function of the number of servers, which range from 1 to 10. . . . 303
List of Tables

2.1 An introduction to the most common scheduling disciplines discussed in the thesis. . . . 27
2.2 An overview of common notation used in the thesis. . . . . . . . . . 29
2.3 An overview of common performance metrics in the thesis. . . . . . . 30
3.1 A summary of the busy period variations studied in this thesis . . . 57
8.1 A summary of the system models underlying web-related workload generators. . . . 254
8.2 Summary statistics for the trace used in the static web case study. . 259
8.3 Summary statistics for the trace used in the e-commerce case study. . 260
8.4 Summary statistics for the trace used in the auctioning case study. . 261
8.5 Summary statistics for the trace used in the supercomputing case study. . . . 262
8.6 A summary of web traces. . . . . . . . . . . . . . . . . . . . . . . 272
8.7 A summary of the expected number of visits per session in web traces. . . . 273
PART I
Motivation and Background

CHAPTER 1
Introduction
Scheduling policies are implicitly (or explicitly) used everywhere that a resource needs to be allocated.
Whenever there is contention for a limited resource, a queue builds, and a scheduling policy is used to
determine the order in which the resource is allocated to satisfy requests.
This happens almost everywhere we venture in our daily lives. From restaurants and supermarkets to banks and amusement parks, we queue for service in a variety of ways. In many convenience stores there is a single cash register where people line up and are then served in First Come First Served (FCFS) order. In large supermarkets, there are many registers and some are dedicated to serving only customers with a small number of items. In restaurants, on the other hand, everyone gets a little bit of service all of the time, which can be viewed as a form of Processor Sharing (PS). In fact, even deciding the order in which you will do the things on your to-do list can be thought of as a scheduling policy. Do you want to finish the most urgent tasks first, i.e. use Earliest Deadline First (EDF), or work on the tasks that you can finish the quickest, i.e. use Shortest Remaining Processing Time (SRPT)?
In addition to these everyday examples, scheduling policies are fundamental components of a variety of
modern computer systems. Applications such as web servers, routers, operating systems, supercomputers,
databases, and disks all face a constant barrage of requests and need to determine how best to allocate
resources to users. In all these cases, queues of service requests build and a scheduling policy (discipline) is
used to decide the order of service.
Wherever scheduling policies are used, they can have a dramatic impact on system “performance.” In
particular, at a high level, requests experience delay as a result of waiting in queues for service at a limited
resource; thus how requests are scheduled at this resource is a fundamental determinant of delay. In fact, the
delay experienced by requests can differ by orders of magnitude across scheduling policies.
This means that scheduling is especially important in computer systems, because users of computer
systems are extremely demanding and unforgiving. In our daily lives, we are willing to accept some delay
while we queue for service, but, in computer systems, users demand service that is both instantaneous and
predictable. For example, web users become dissatisfied if response times for requests exceed 5 seconds and
view delays of greater than 10 seconds as intolerable [65]. Further, meeting the high expectations of users in
computer systems is crucial because it is often effortless for users to switch to a competitor’s product. If we
again consider the example of a web user, the competition is always “just a click away.” Thus, a fundamental
design goal of computer systems today is to minimize the response times of users, i.e. the time between the moment a user makes a request and the moment the request is completed.
Scheduling is a fundamental tool for minimizing response times (reducing delays) and, as a result, the
study of scheduling policies has a long history including a vast literature of analytic results. However, in
recent years, the field has been going through a resurgence. This resurgence is a result of a variety of
“scheduling success stories” in computer systems. In particular, at all levels of computer systems, designers
have dramatically reduced user response times by making small changes to the scheduling policy used at
the bottleneck resource. We will provide an overview of a few examples of success stories in Section 1.1,
where we will see that the essence of these scheduling success stories is very simple (Section 1.2). In
particular, in all the examples, system designers identify the bottleneck resource in a system, determine how
it is scheduled, and then design a new scheduling policy in order to improve performance. This third step is
really the defining aspect of the success story, and is the focus of this thesis.
In Section 1.3 we will begin to explore the issues involved in designing a new scheduling policy for a
computer system. This task is not easy – there is an enormous variety of scheduling policies from which
to choose. As a result, computer system designers are often guided by analytical results about scheduling
policies, and we discuss what traditional analytical results about scheduling suggest for computer system
design in Section 1.3.1. However, we will see in Sections 1.3.2 and 1.3.3 that there are many gaps between
the analytical results and what happens in practice. Because of these disconnects between theory and prac-
tice, traditional analytic results do not apply for the policies that system designers use in practice. This fact
is problematic because analytic results can be an important tool for system design.
The goal of this thesis is to develop a modernized theory of scheduling that can provide analytic results
that apply to today’s computer systems
, easing the burden on system designers. To accomplish this goal,
we will provide new results that help to bridge the disconnect between the analytical results and the needs
of practitioners. We detail our approach for bridging these disconnects in Section 1.4 and then we conclude
the introduction by providing an overview of the thesis in Section 1.5.
1.1 Scheduling success stories
Across applications, computer system designers have been suggesting new designs for the scheduling policies at the core of systems. This increasing focus has led to scheduling success stories at all levels of computer systems. In web servers [97, 183], routers [180, 181], wireless networks [103, 137], peer-to-peer
systems [179], operating systems [74], databases [139, 140], and beyond, researchers have made simple
changes to the scheduling policies used at the bottleneck resources in computer systems that have led to
dramatic improvements in user response times. Not only do these new designs result in improved response
times, they do so without requiring the purchase of additional resources.
Example: Static Web Servers
The scheduling success story that is perhaps the easiest to explain is that of web servers. If we
consider a web server that serves primarily static requests, its operation is very simple at a high
level. Static requests are typically of the form “get me a file.” In order to fulfill such a request, the
web server must retrieve the file and then send the file over the outgoing link. Typically the amount of
bandwidth at the web server is the bottleneck device since purchasing more bandwidth is much more
expensive than upgrading the disks or CPUs at the web server [143, 59]. Even a modest web server
can saturate a T3 or 100Mbps Ethernet connection. Thus, much of the delay experienced by requests
for files is a result of queueing for bandwidth.
In standard web server designs, such as Apache [226] and Flash [169] servers, the bandwidth is
allocated by cycling through the queued files, giving each a small slice of service. Specifically, each
connection between a client and the web server has a corresponding socket buffer into which the web
server writes the contents of the requested file. The sockets are then drained in a cyclic manner where
a handful of packets from each socket are sent before moving to the next socket. This behavior is
typically modeled using the Processor Sharing (PS) scheduling policy, which gives an equal share of
the service capacity to each job in the queue at all times.
Now comes the success story. Harchol-Balter et al. [97] have recently achieved dramatic re-
ductions in user response times at static web servers by adjusting the scheduling policy used in web
servers. They modified the way sockets are drained in order to implement a version of
SRPT
and
found that not only were response times much smaller [97], but also the performance in periods of
overload was improved [204] and the response times of large files did not suffer as a result of the
bias towards small files [26]. Further, following the initial work of Harchol-Balter et al., other researchers have gone on to design improvements to the scheduling policy, thus providing even more dramatic improvements over standard web server designs [183, 132, 131, 88]. We will talk in more
detail about the actual policies in these designs later in the chapter.
∎
Example: Network Edge Routers
From a user-level perspective, a user is sending a sequence of packets, i.e. a flow, on a path through a
number of routers in the network. Thus, a router must share resources between a number of competing
flows. Typically, one of these routers is the bottleneck link along the path and contributes the majority
of the network delay. Oftentimes, this bottleneck router is the edge router, i.e. the router on the edge of the core of the network. Further, the bottleneck resource of the bottleneck router is most frequently the outgoing bandwidth of the router [180], since it is much more expensive to overprovision
bandwidth than it is to overprovision other resources. (Though, the complexity of scheduling policies
used at routers is limited due to the fact that routers must make scheduling decisions very quickly so
as to operate at line speed.) Thus, a key determinant of the delay experienced on a network path is
how the bandwidth at the bottleneck router in the network is scheduled.
In standard router designs, variants of fair queueing are commonly used to allocate bandwidth
to packets from competing flows at a router. These policies guarantee that the average bandwidth
given to each flow through the router is approximately equal. This means that PS and variants such as Generalized PS (GPS) and Discriminatory PS (DPS) serve as a good model of the scheduling policy at the router [251, 36, 37].
Many techniques have been applied to try to reduce the delay in routers, including admission
control, active queue management, and others. However, recently, Rai et al. [180] were able to
dramatically reduce response times of flows by implementing a simple scheduling change in routers
– they implement a policy that gives priority to flows with the least attained service, i.e. the flow that
has had the fewest packets sent so far. With only this small change to the way flows are scheduled, Rai et al. were able to reduce response times by an order of magnitude [180, 181]. We will provide more
details about the policy they implement later in the chapter.
∎
Example: Wireless Access Points
Wireless networks offer a number of interesting challenges when compared with traditional wired
networks. One of the key challenges is that the wireless channel is a shared resource among all the
users of a given access point. In order to achieve reliable transfers, users must reserve exclusive
access to the shared channel. As a result of this and other concerns, wireless networks are severely
bandwidth-limited, and the wireless link itself is the bottleneck resource. Thus, a key determinant of
user response times is how the wireless channel is scheduled.
Allocation of the shared channel is performed in a centralized manner by the network access
point. Typically, the access point polls clients to give them channel access grants, traditionally granting access in a round-robin manner so as to guarantee fairness among competing users. So, again, PS is a good model of the scheduling policy being used at a standard wireless access point.
But, there are many reasons why this allocation strategy is inefficient and many recent designs
have dramatically reduced response times in wireless networks using simple changes to the scheduling
policies used in wireless access points [103, 137]. These recent designs again apply variants of the SRPT policy to schedule user requests. We will provide more details about the specific variants they
implement later in the chapter.
∎
We could easily continue to list other recent scheduling success stories in applications such as operating
systems [74], databases [139, 140], peer-to-peer systems [179], and beyond; however, from these examples
we can already see that changing the scheduling policy at the bottleneck device in computer systems can lead
to dramatic performance gains at the system-level. Further, from these examples, we can already observe
that there are many similarities between these success stories.
1.2 The essence of a scheduling success story
Though every system is different and each of the scheduling success stories we just described has its own
nuances, the essence of these scheduling success stories is the same across systems. There is a consistent
three-step design process that is followed.
1. The first step is to determine the bottleneck resource of the system. Identifying the bottleneck resource
is one of the fundamental steps in system design. Knowing what resource is the bottleneck allows
designers to focus on the part of the system responsible for the majority of delay in the system.
This resource is exactly where scheduling changes will have the most dramatic effect on system-level
performance. We saw that in static web servers the bottleneck is typically the limited bandwidth
that the server has bought from its ISP [143, 59]. Similarly, in network routers and wireless access
points bandwidth is again typically the bottleneck [180, 251]. In operating system scheduling, the
bottleneck is typically the CPU [74], while in databases the bottleneck can either be the CPU or the
database locks depending on the workload [139].
2. Once the bottleneck device is known, the next step is to understand how the bottleneck device is cur-
rently scheduled. This knowledge helps to determine what performance improvements are possible.
It turns out that, across computer systems, the status-quo is typically to use a very simple scheduling
policy to schedule the bottleneck device, often either First Come First Served (FCFS) or Processor Sharing (PS). In particular, as we saw in our examples, it is most common that the bottleneck device is scheduled using a form of PS. This happens because most systems time-share, giving each request a small slice of service and cycling through the requests in a round-robin fashion. For example, PS is a good approximation of the way web servers and routers (at a flow level) allocate bandwidth [97]. Additionally, operating systems tend to use variants of PS to schedule jobs at the CPU.

Figure 1.1: An illustration of a single server queue. New arrivals join the queue and await service at the server.
3. After the bottleneck has been identified and it has been determined how the bottleneck is being sched-
uled, the last step in a “scheduling success story” is to design and implement an improved scheduling
discipline for the bottleneck resource. However, the details of the improved policy are very much
application dependent. This step is the defining aspect of the scheduling success story. After the first
two steps, system designers know which resource is the bottleneck and how it is being allocated, but
the question is then, what is the best, or at least an improved, design for a new scheduling policy?
1.3 Choosing a scheduling policy
The defining aspect in recent scheduling success stories is the decision of which scheduling policy to im-
plement at the bottleneck resource. This is not an easy decision – there are an infinite variety of possible
scheduling disciplines to choose from. As a result, though it is possible to develop a new design through an
ad hoc process of tuning and testing new proposals, performance modeling is often a key tool in developing
a new scheduling discipline for the bottleneck device. In particular, by considering a single server queue (as
illustrated in Figure 1.1) as a model of the bottleneck resource, designers can make use of a vast literature of
analytic results about scheduling in order to better predict the performance of new design proposals. So, all
a system designer needs to do after determining which scheduling discipline is used by the bottleneck device
is to pick up one of the many books on scheduling [61, 120, 121, 177] and look for an improved policy.
Well, at least it seems that easy. In reality, there are many gaps between what the traditional analytic
results provide and what system designers need in practice. These gaps mean that the results proven in
theory do not end up applying to the systems that are built in practice.
1.3.1 What traditional theory says
The study of scheduling has a long history, including an extremely diverse set of application areas and
models. The classical case of minimizing response times at a single server with a single queue is the case
that is most relevant to our scheduling success stories. In any scheduling book, one of the first results that
Figure 1.2: Mean response time, E[T], is shown as a function of load, ρ, under SRPT, PS, and FCFS in an M/GI/1 queue. The service distribution has mean 1 and is (a) Deterministic, (b) Exponential, and (c) Weibull with a variance of 10. (In panel (a) the FCFS and SRPT curves coincide; in panel (b) the FCFS and PS curves coincide.)
is presented is that a simple, greedy policy is optimal in the single server model. In particular, Shortest
Remaining Processing Time (SRPT) minimizes both the mean queue length and the mean response time regardless of the arrival sequence or job sizes [203, 202]. SRPT works by devoting the full service capacity
to the job with the smallest remaining size. Thus, it greedily works on the job that can be completed the
quickest.
However, simply knowing that SRPT minimizes the mean response time tells us nothing about how much improvement can be gained from using SRPT instead of other common policies like PS or FCFS. Therefore, we also need to understand the quantitative comparison of response times under SRPT, PS, and FCFS under practical workload assumptions.
To provide such a comparison, the traditional model that is used is a single server queue with an infinite
buffer (see Figure 1.1) where the interarrival times of requests are assumed to be independent and identically
distributed (i.i.d.) random variables, the single server works at a constant speed, and the service times
(processing times, job sizes) of arrivals are assumed to be i.i.d. Moreover, the sequences of interarrival times
and service times are assumed to be independent. This model is one of the most basic queueing models,
and is referred to as the GI/GI/1 queue. The first GI indicates that the arrival process is a sequence of
generally-distributed, independent random variables; the second GI indicates that the service requirements
are generally-distributed, independent random variables; and the 1 indicates that there is 1 server. Most
typically, scheduling policies are studied in the simpler M/GI/1 model, where the M stands for “Markovian”
and indicates that the arrival process is Poisson, i.e. has exponentially distributed interarrival times.
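The M/GI/1 FCFS queue just described is straightforward to simulate. As a quick illustrative sketch (not part of the thesis itself; the load and distributions below are arbitrary choices), FCFS waiting times follow Lindley's recursion, W_{k+1} = max(W_k + S_k - A_{k+1}, 0), where S_k is the k-th job size and A_{k+1} the next interarrival time:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 0.8, 1_000_000
inter = rng.exponential(1 / lam, n)   # Poisson arrivals (the "M")
sizes = rng.exponential(1.0, n)       # exponential job sizes, mean 1

# Lindley's recursion for FCFS waiting times:
# W_{k+1} = max(W_k + S_k - A_{k+1}, 0)
w = np.zeros(n)
for k in range(n - 1):
    w[k + 1] = max(w[k] + sizes[k] - inter[k + 1], 0.0)

# Response time = waiting time + own service time; theory gives
# E[T] = 1/(mu - lambda) = 5 for this M/M/1 at load 0.8.
print((w + sizes).mean())
```

Swapping in a heavy-tailed distribution for `sizes` (e.g. a Weibull) reproduces the GI service assumption discussed above.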
A huge variety of scheduling policies have been studied in the M/GI/1 setting, and there are a number
of excellent books that summarize the important results [61, 120, 121, 177]. We will survey many of the
results in Chapter 3. However, let us now provide a simple comparison of SRPT, PS, and FCFS in order
to illustrate the performance gains that are possible.
To provide this comparison we will use an M/GI/1 queue with three different service distributions. In
Figure 1.2, we show a comparison between SRPT, PS, and FCFS under (a) a deterministic distribution
where all jobs have size 1, (b) an exponential distribution, which is a common distribution because it is
analytically tractable, and (c) a high-variance Weibull distribution. Figure 1.2 illustrates that SRPT provides an enormous improvement in mean response time in some settings; however, the improvement depends very
much on properties of the workload. In particular, the improvement depends very strongly on the variability
of service demands (job sizes) and the system load (utilization). Using SRPT instead of PS or FCFS
provides only limited improvement in mean response time when job sizes are not very variable or if the
system load is low. However, the improvement is dramatic if job sizes are highly variable or the system is
moderately or highly loaded. The reason for this is simple. First, if the system is lightly loaded, then the queue lengths are small regardless of the scheduling policy used, so reordering jobs in the queue can only have a limited impact on response times. However, if the system is moderately or heavily loaded, queue lengths
can be very large, and reordering jobs in the queue can have a big impact. Second, variability is important
because when job sizes are highly variable, one large job can wreak havoc under FCFS. In particular, under FCFS, many small jobs can get stuck behind a large job in the queue. Further, one large job can have a negative effect on PS because a large job will stay in the system for a long time and, thus, limit the service capacity devoted to other jobs. In contrast, under SRPT, small jobs are unaffected by large jobs because they bypass larger jobs in the queue.
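These trends can be checked numerically against classical M/G/1 results: the Pollaczek-Khinchine formula for FCFS, the insensitivity result E[T] = E[S]/(1-ρ) for PS, and Schrage and Miller's formula for SRPT. The sketch below (grid size and cutoff are arbitrary numerical choices, not from the thesis) evaluates all three for a given job-size density:

```python
import numpy as np

def mg1_mean_response_times(lam, pdf, x_max=60.0, n=200_000):
    """E[T] in an M/G/1 queue under FCFS, PS, and SRPT, computed by
    discretizing the job-size density `pdf` on (0, x_max]."""
    x = np.linspace(1e-6, x_max, n)
    f = pdf(x)
    dx = x[1] - x[0]
    F = np.cumsum(f) * dx                       # job-size CDF
    ES = np.sum(x * f) * dx                     # E[S]
    ES2 = np.sum(x ** 2 * f) * dx               # E[S^2]
    rho = lam * ES                              # system load
    # FCFS: Pollaczek-Khinchine formula
    t_fcfs = ES + lam * ES2 / (2 * (1 - rho))
    # PS: depends on the distribution only through its mean
    t_ps = ES / (1 - rho)
    # SRPT: Schrage and Miller's formula, integrated numerically
    rho_x = lam * np.cumsum(x * f) * dx         # load from jobs of size <= x
    m2_x = np.cumsum(x ** 2 * f) * dx           # truncated second moment
    wait = lam * (m2_x + x ** 2 * (1 - F)) / (2 * (1 - rho_x) ** 2)
    resid = np.cumsum(dx / (1 - rho_x))         # residence time of a size-x job
    t_srpt = np.sum((wait + resid) * f) * dx
    return t_fcfs, t_ps, t_srpt

# Exponential job sizes with mean 1 at load rho = 0.8, as in Figure 1.2(b):
t_fcfs, t_ps, t_srpt = mg1_mean_response_times(0.8, lambda x: np.exp(-x))
print(t_fcfs, t_ps, t_srpt)  # FCFS and PS agree (both 5); SRPT is smaller
```

Replacing the density with a high-variance Weibull makes the FCFS curve blow up while SRPT stays small, matching the qualitative picture in Figure 1.2(c).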
To summarize, the key observation from Figure 1.2 is that SRPT can provide enormous improvements, but only under certain system loads and job size distributions. Thus, in order to understand whether SRPT
will provide significant improvements for a given application, it is important to first understand the system
load the application will experience and the distribution of service demands.
Let us start with the system load. Though it is not uncommon for designers to try to overprovision com-
puter systems, the unpredictability and burstiness of computer system workloads means that it is common to
experience extended periods of moderate-to-heavy load. For example, in web applications, surges in traffic
as a result of special promotions, an abrupt increase in a site’s popularity, or many other reasons, result in
extended periods of high load. As a result, even well-provisioned systems spend a significant amount of
time running in moderate or high-load, and maybe even in overload. These periods of high load are typi-
cally some of the most important times to provide users responsive service, and so scheduling policies for
computer applications need to be designed with the high load periods in mind.
Next, let us consider the service demand distributions experienced by computer applications. Over the
last decade, there has been an explosion of workload characterization research in the computer system com-
munity. This research has led to the growing realization that heavy-tailed and highly variable distributions,
such as the Weibull and Pareto, are everywhere in computer system workloads [16, 62, 209, 173, 175].
Examples can be found in UNIX process lifetimes [92, 68], web file sizes [174, 62], and the number of
embedded files in web sites [16, 29].
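A quick experiment illustrates the degree of variability involved (the Weibull shape parameter below is illustrative, not fitted to any of the cited traces): the squared coefficient of variation, C² = Var[S]/E[S]², equals 1 for the exponential distribution but is an order of magnitude larger for a heavy-tailed Weibull.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def squared_cv(samples):
    """Squared coefficient of variation: Var[S] / E[S]^2."""
    return samples.var() / samples.mean() ** 2

exp_sizes = rng.exponential(1.0, n)   # C^2 = 1 for the exponential
weib_sizes = rng.weibull(0.4, n)      # shape < 1 gives a heavy tail

print(squared_cv(exp_sizes))          # close to 1
print(squared_cv(weib_sizes))         # roughly an order of magnitude larger
```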
So, in designing scheduling policies for computer applications, the most important workload setting
to consider is a highly loaded server with highly variable job sizes. This is exactly the setting where the
performance improvements of SRPT are most dramatic, so SRPT is the clear choice for use in computer systems according to traditional theoretical results.
1.3.2 What happens in practice
We just saw that theoretical results motivate the use of SRPT in computer systems – SRPT minimizes the mean response time and provides dramatic improvements over common policies like PS under practical workloads. Thus, it would seem that one should always use SRPT scheduling, regardless of the application that is being considered.
But, the picture painted by the theoretical results is a bit deceiving. In particular, the reality of computer
systems is far more complex than the simple models and idealized scheduling policies one finds in the books
on scheduling.
To illustrate this briefly, notice the setting that was used in Figure 1.2 to analyze the performance improvements obtained by SRPT: Figure 1.2 shows the mean response time in an M/GI/1 queue using SRPT
scheduling. Though we have seen that the assumption of highly variable Weibull service demands is fairly
realistic, it is easy to find fault with each of the other assumptions. Real systems cannot implement pure SRPT; real arrival processes are not Poisson; real systems care about more than mean response time; and
real systems do not always use a single server. To drive this point home, let us consider a few applications
in detail.
Example: Static Web Servers
We saw earlier that the bandwidth purchased from the ISP is typically the bottleneck in web servers
that serve primarily static content. Further, we saw that bandwidth is allocated to requests according
to PS scheduling. Additionally, the workload in web applications is typically highly variable and
web applications must be designed with high load periods in mind. Thus, web servers seem to be a
perfect place to use SRPT.
SRPT is indeed the motivation for a number of recent web server designs [97, 183, 132, 131, 87,
88]. However, many complications of real systems prohibit these new designs from using pure SRPT. In particular, the proposed designs use the remaining sizes of the files being served in order
to prioritize. However, these remaining sizes are only estimates of the remaining service demand of a
request because the network delays, which are not known exactly, also affect the service demands.
As a result, many proposals do not use only the remaining sizes of the files, but also attempt to estimate
the propagation delay to the users making the requests [183, 132, 131]. Another complication is the
fact that implementations have tended to use only 5-10 priority levels instead of using a continuum of possible priority levels (as pure SRPT does) as a result of the overheads associated with maintaining
priorities and switching between jobs [97, 183]. As a result of these, and other, adjustments to SRPT, the policies that are actually implemented may perform significantly worse than pure SRPT, and the magnitude of the differences is not understood.
Not only are there many reasons why pure SRPT cannot be implemented in web servers, there are a number of reasons why designers do not want to use pure SRPT. A primary reason is that
mean response time, although important, is not the only performance measure of interest. Designers
also need to provide fairness and QoS guarantees. Further, it is often important to provide service
differentiation between high priority customers (who have paid for improved service) and standard
customers. As a result of these competing performance metrics, many design suggestions for web
servers have used hybrids of SRPT and variants of PS [87, 88].
In addition to adjustments to the policy that is implemented, there are many complexities of the
practical workloads that are not accounted for by the traditional M/GI/1 queue. First of all, web
users are interactive. When using a web site a user will click on a link and then wait for the response
before clicking on the next link. Therefore, the arrival process is dependent on the departure process,
unlike in the M/GI/1. Further, users are impatient, and will click on the refresh button if the response
time for a certain page is too long. The effect of this is to abandon requests that are in the queue and
have already received service, thus wasting bandwidth. This is an issue that is ignored by the M/GI/1
model.
∎
Example: Network Edge Routers
We saw that the bandwidth is typically the bottleneck in edge routers and that the way a standard
router allocates bandwidth can be viewed, at a flow level, as PS. Like the case of web servers, routers
are typically highly loaded and flow sizes tend to be highly variable. Thus, routers again seem like a
perfect place to apply SRPT scheduling.
However, it is impossible to apply SRPT in routers because the sizes/lengths of flows (i.e. the
number of packets in the flows) are unknown a priori. It is not even possible to estimate the lengths
of flows accurately; all that is known about a flow is how much service it has received so far. But, it
turns out that the amount of service received so far provides some indication of the remaining length
of the flow. In particular, if a flow has received a large amount of service already, it is likely to require
an even larger amount of service in order to complete [209]. Using this information, designers have
proposed policies that give priority to the flows that have received the least service so far, e.g. variants
of the Foreground-Background (FB) policy [180, 181] and the Multi-level Processor Sharing (MLPS)
policy [6, 5]. However, such proposals have been shown to starve large flows, e.g. streaming videos,
of service in addition to increasing jitter when compared with standard router designs [181]. Thus,
hybrid designs combining aspects of PS with FB and MLPS have been proposed [181].
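The intuition behind these least-attained-service policies can be checked directly. For a heavy-tailed distribution such as the Pareto (the parameters below are illustrative, not drawn from any measured flow-size data), the expected remaining size E[S - a | S > a] grows with the attained service a, so the flows served longest are expected to need the most additional service:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, xm = 2.5, 1.0                                   # illustrative Pareto parameters
sizes = xm * rng.random(2_000_000) ** (-1.0 / alpha)   # inverse-CDF Pareto samples

means = {}
for a in (1.0, 5.0, 25.0):
    remaining = sizes[sizes > a] - a
    means[a] = remaining.mean()
    # For Pareto(alpha), E[S - a | S > a] = a / (alpha - 1): it grows with a.
    print(a, means[a])
```

Contrast this with the exponential distribution, whose memorylessness makes the expected remaining size constant in a; it is precisely the heavy tail that makes attained service informative.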
Not only is it impossible to apply SRPT in routers, it is obvious that the M/GI/1 queue is an
overly simplistic model of router workloads. Like in web servers, users of routers are interactive and
impatient. When using a web site a user will click on a link and then wait for the response before
clicking on the next link. Further, if a request is delayed too long a user will abandon the request,
e.g. by hitting the refresh button in her browser. The effect of these abandonments is actually quite
dramatic: 20% of network traffic has been shown to correspond to aborted transfers [252]. Another
important aspect of network traffic that is ignored in the M/GI/1 model is the fact that workload
characteristics tend to be time-varying, e.g. time of day effects.
∎
Example: Wireless Access Points
We saw that the wireless link is the bottleneck resource in wireless networks and that PS serves as a good model of the way a traditional access point typically allocates the wireless link bandwidth.
Further, we described that recently suggested designs have been able to dramatically improve performance using variants of SRPT scheduling. But, for many reasons, these variants are far from pure SRPT.
The main reason that pure SRPT is not implemented is that the channel conditions are variable.
In a multiuser wireless network, at any given point there are likely to be users that have “good”
channel conditions, which allow data to be sent at a high rate, and users that have “bad” channel
conditions, which limit data transfer speeds. If one ignores the channel conditions and uses pure SRPT, the throughput of the network as a whole will be much lower than if one is opportunistic when scheduling requests. As a result, the designs that are implemented tend to be hybrids of SRPT where both the channel conditions and the remaining size of requests are taken into account [103, 137].
However, note that the fact that channel conditions vary over time means that the remaining sizes
used to schedule are not exact.
In addition to variable channel conditions, wireless networks are subject to a number of other
complexities that are not accounted for in traditional theoretical models. In particular, power man-
agement is a fundamental design constraint and needs to play a role in wireless link scheduling.
Further, multi-channel network designs are increasingly being used, and these networks are better
modeled with a multiserver queue than with a single server queue. Finally, many of the complexities
of network traffic that we have discussed for web servers and network routers also play a role in
wireless networks, e.g. interactive and impatient users.
∎
We could have easily continued to list other examples such as disks, databases, and supercomputing centers; however, the above examples are enough to make the point that there are many aspects of real systems that differ from the traditional theoretical models. Further, the result of these differences is that pure SRPT, the “optimal” policy according to traditional theoretical results, is never used in practice. Thus,
the theoretical results we just described do not immediately apply to real system designs.
1.3.3 Gaps between theory and practice
The previous two sections have illustrated a mismatch between the traditional theoretical research on schedul-
ing and the use of scheduling in modern computer system designs. To summarize a few of the differences
that we saw, notice that we repeatedly observed that real computer systems can never implement the pure,
idealized SRPT policy that is optimal in theory. Two of the reasons for this are that (i) it is rare that real
implementations know exact remaining sizes and (ii) real implementations must be adjusted to account for
the overheads associated with preemption. Not only is it unrealistic to consider pure SRPT, it is unrealistic to assume a Poisson arrival process since, in reality, users are interactive: users typically must wait to
receive one request before making another, thus the arrival process is dependent on the completion process.
Similarly, it is increasingly unrealistic to consider only a single server queue, since server farm and
multi-core architectures are now prevalent. Finally, considering mean response time as the only performance
metric is also unrealistic. In real systems, mean response time is certainly important, but it is also important
to be “fair” and to provide QoS guarantees (among a long list of other metrics).
This laundry list of differences is really only the tip of the iceberg. However, from this list and the
applications we looked at in detail, three themes emerge. Though these three themes are not
all-inclusive, they cover a wide range of gaps between traditional theoretical results and the needs of system
designers.
The idealized policies studied traditionally in theory cannot be used in practice.
For example, pure SRPT is never implemented in practice. Instead, the policies that are implemented
use estimates of remaining size, use only 5-10 priority levels, or are hybrids of SRPT and PS-type
policies. None of these variants of SRPT will provide response times as small as those under pure
SRPT, but traditional theoretical results do not provide any information about how much
performance will suffer.
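As a rough illustration of the kind of variant that does get implemented, the sketch below collapses remaining-size estimates into a handful of log-spaced priority levels. This is a toy, not any particular server's design: the function names, the number of levels, and the bucketing rule are all our own assumptions.

```python
import math

def priority_bucket(remaining_size, num_levels=5, max_size=1e6):
    """Map an (estimated) remaining size to one of a few priority levels.

    Implementations often use 5-10 coarse levels rather than exact
    remaining sizes; log-spaced cutoffs are an illustrative choice.
    """
    if remaining_size <= 1:
        return 0
    # Level i covers sizes in [max_size**(i/k), max_size**((i+1)/k)).
    level = int(num_levels * math.log(remaining_size) / math.log(max_size))
    return min(level, num_levels - 1)

def pick_next(jobs, num_levels=5):
    """Serve the job in the lowest-numbered level; FCFS within a level.

    `jobs` is a list of (arrival_order, estimated_remaining_size) pairs.
    """
    return min(jobs, key=lambda j: (priority_bucket(j[1], num_levels), j[0]))
```

Unlike pure SRPT, two jobs whose remaining sizes fall in the same bucket are served in arrival order, so how far this variant's response times fall short of pure SRPT's depends on how the buckets are chosen.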
Many performance measures that are important in practice are not studied in theory.
Mean response time is typically the focus of theoretical scheduling research, but in practice QoS
and fairness metrics are also important, as are power management, reliability, and many other
performance measures. Once these other measures are considered, SRPT is no longer
the clear choice. Worries pervade that SRPT is unfair to large job sizes because of its bias toward small
jobs. Similarly, worries about providing good QoS guarantees for large job sizes are common.
Traditional theoretical results cannot be used to address such worries.
The traditional, simplified theoretical models include many unrealistic assumptions.
The M/GI/1 model is at the heart of a majority of research studying the performance of scheduling
policies, but both the M (Markovian arrivals) and the 1 (single server) are often unrealistic.¹ For
example, real arrival processes tend to be bursty, and real users tend to be interactive and impatient.
Further, many modern system designs make use of multiserver architectures, e.g. server farms and
multi-core processors. Though SRPT is optimal in the M/GI/1 setting, once one considers interactive,
impatient users and multiserver settings, SRPT is no longer the optimal policy for mean response
time. Further, the performance of SRPT in these more complex settings has not been studied in the
traditional theoretical literature.
Each of these themes brings into question the usefulness of traditional theoretical results for modern
computer system design. Traditional results suggest that SRPT is optimal and can provide dramatic
improvements in response time, but these results apply only in simplified settings to pure SRPT. Once
realistic settings and policies are considered, the performance improvements that come from using SRPT
will be much less dramatic, and the exact degree of improvement is not understood. Further, results about
how SRPT performs for other metrics of interest are simply not available; thus it is not easy to dismiss
worries about, for example, providing QoS guarantees and fairness under SRPT. The bottom line is that
the traditional analytic results about scheduling provide limited help for system designers because of the
many gaps between theoretical models and real system designs.
1.4 Bridging the gaps between theory and practice
The fact that traditional analytic results do not apply to real system designs is a problem because analytic
results can (and should) be invaluable to system designers during the development process. Without analytic
results, designers need to test every new design proposal using extensive experiments over a wide array
of settings, which can be prohibitive during early stages of design. Further, even after such testing, designers
are left without performance guarantees for the new designs.
However, analytic results could potentially provide provable guarantees to guide the design process. For
example, if (as in the case of web servers) estimates of remaining sizes must be used instead of the true
values, it is important to know how the accuracy of the estimates will affect response times under the policy.
Understanding the impact of the accuracy of the estimates is key because there is often an overhead involved
with making estimates and the amount of overhead is dependent on the accuracy needed for the estimates.
A good example of this tradeoff is the task of estimating network delays for files being sent by a web server.
This is one example of how analytic results can aid the design process, but there are many others. Analytic
results are also important when deciding what level of QoS guarantees can be provided. Analytic results can
help in determining the appropriate number of priority levels to use in order to balance between minimizing
the overheads and maximizing performance. Analytic results are also fundamental to capacity planning of
computer systems. We could list many other examples here, but the main point is simple: without analytic
results, the task of a system designer is made much more difficult. Further, this difficulty is magnified by
the enormous variety of possible scheduling policies from which to choose.
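As a toy illustration of why the accuracy of size estimates matters, the sketch below schedules a batch of jobs shortest-estimate-first and compares the mean flow time achieved with exact sizes against that achieved with noisy estimates. This is our own construction, not an analysis from this thesis, and the lognormal noise model is an arbitrary assumption.

```python
import random

def mean_flow_time(sizes, estimates):
    """Serve a batch (all jobs present at time 0) shortest-estimate-first
    and return the mean flow time. With exact estimates this is SJF on a
    batch, which is optimal; noise reorders jobs and inflates the mean."""
    order = sorted(range(len(sizes)), key=lambda i: estimates[i])
    clock, total = 0.0, 0.0
    for i in order:
        clock += sizes[i]
        total += clock
    return total / len(sizes)

rng = random.Random(0)
sizes = [rng.expovariate(1.0) for _ in range(500)]
exact = mean_flow_time(sizes, sizes)
# Multiplicative lognormal noise on each size estimate.
noisy = mean_flow_time(sizes, [s * rng.lognormvariate(0, 1.0) for s in sizes])
```

Since shortest-job-first is optimal for a batch, `exact` lower-bounds `noisy`; how quickly the gap grows with the noise level is exactly the kind of question a designer needs answered before paying the overhead for more accurate estimates.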
¹ Of course, one could argue that the GI (generally distributed, independent job sizes) is also unrealistic, but we feel this is less of an issue than either the M or the 1.
The goal of this thesis is to provide analytic results that apply to today’s computer systems. Our results will
bridge most (but not all) of the gaps between theory and practice that we have described so far. In particular,
we will take steps towards resolving the issues in each of the three themes we described in the previous
section.
Moving beyond idealized policies
We have seen that the idealized policies studied in theory are not used in practice and, instead, a
wide variety of variants of these policies are used. Thus, traditional analytic results are inadequate
for system designers. One natural approach to remedy this situation is to study each of the variants
that are used in practice directly. However, using such an approach, it would be impossible to keep up
with the new designs being developed across all levels of computer systems. So, we need a different
approach.
The approach we propose in this thesis is to move beyond the study of individual, idealized
policies. Instead, we will formalize many common scheduling heuristics and techniques into classi-
fications of scheduling policies, and then prove bounds on the performance of scheduling policies in
each of these classifications. For example, SRPT is characterized by the fact that it uses the scheduling
technique of “prioritizing based on remaining sizes” to apply the heuristic of “prioritizing small
jobs.” So, instead of studying SRPT, we will define and analyze a class of policies that prioritizes
based on remaining sizes and a class of policies that prioritizes small jobs.
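The classification idea can be made concrete by viewing each policy as a priority function over a job's state. The representation below is our illustration, not the formal definitions used later in the thesis, and the preemption details that distinguish these policies in a running system are omitted.

```python
# Each policy assigns an index to a job; the job with the smallest index runs.
# A classification is then a set of such functions sharing a property,
# e.g. "biased toward small jobs."
POLICIES = {
    "SRPT": lambda size, remaining: remaining,         # smallest remaining size first
    "PSJF": lambda size, remaining: size,              # smallest original size first
    "LAS":  lambda size, remaining: size - remaining,  # least attained service first
}

def schedule(jobs, policy):
    """Return the job the given policy would serve next.

    `jobs` is a list of dicts with 'size' and 'remaining' keys.
    """
    rank = POLICIES[policy]
    return min(jobs, key=lambda j: rank(j["size"], j["remaining"]))
```

SRPT and PSJF both prioritize small jobs but differ in whether they use remaining or original size, which is precisely the kind of distinction the technique/heuristic classifications separate.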
This new style of scheduling research is motivated by the fact that, though the idealized policies
studied in theory are not used in practice, the policies that are implemented in practice tend to be
motivated by the theoretical policies. Thus, real system designs tend to apply the same heuristics
and techniques found in the idealized policies. As a result, this new style of scheduling research
has both practical and theoretical benefits. On the practical side, the scheduling classifications we
define include, not only idealized policies like
SRPT
, but also the variants of these idealized policies
that are actually used in practice. Thus, by providing results about scheduling classifications we
are eliminating the need to analyze, one-by-one, each individual policy implemented in computer
systems. On the theoretical side, analyzing scheduling techniques and heuristics adds structure to the
space of scheduling policies that cannot be attained through the analysis of individual policies alone.
Moving beyond mean response time
Though mean response time is an important metric for computer systems, system designs must do
more than simply provide small response times. We have seen that there are a wide variety of other
performance measures that are also important. It is not enough for a new design to provide improved
mean response times; it must also guarantee fairness, provide QoS guarantees, and do many other
things. But, unfortunately, the performance of scheduling policies with respect to many of these
measures is not understood. In order to begin to bridge this gap, in this thesis we will focus on two
measures of broad importance: the distribution of response time and fairness.
Extending our understanding of scheduling policies beyond the mean response time to the dis-
tribution of response times is essential for applicability in modern computer systems because users
can become even more frustrated by highly variable service than by having large response times on
average [65, 256]. For example, reducing the jitter in streaming applications is at least as important as
reducing the response time of the flow. Further, as we have discussed, it is increasingly important to
provide QoS guarantees, and knowledge about the response time distribution of scheduling policies
is fundamental to this task. In this thesis, we not only provide many new results characterizing the
response time distribution under scheduling policies; we also provide the first analytic results studying
the response time distribution under scheduling classifications.
In addition, extending our discussion beyond mean response time to the fairness of scheduling
policies is essential to the applicability of our results in real systems. One of the fundamental worries
about designs that are motivated by SRPT is that large job sizes will have unfairly large and variable
response times as a result of the priority given to small job sizes [31, 211, 216, 224]. Such worries are
difficult to address because of the amorphous nature of “fairness,” and thus there is no analytic work
characterizing the fairness of scheduling policies. In this thesis, we introduce a variety of novel mea-
sures of fairness that are motivated by computer applications. In addition, we provide the first analysis
of the fairness of scheduling policies. Not only that, we extend our analysis to handle classifications
of scheduling policies as well.
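One simple illustration of a size-conditional fairness measure: bin completed jobs by size and compare the mean slowdown (response time divided by size) across bins. The equal-population binning below is a simplifying assumption of ours, not one of the fairness measures defined in this thesis.

```python
def mean_slowdown_by_size(completions, num_bins=3):
    """Report mean slowdown per size bin for completed jobs.

    `completions` is a list of (size, response_time) pairs. If the
    largest-size bin shows a much higher mean slowdown, the policy is
    arguably treating large jobs unfairly.
    """
    sizes = sorted(s for s, _ in completions)
    # Equal-population size cutoffs.
    cuts = [sizes[(i + 1) * len(sizes) // num_bins - 1] for i in range(num_bins)]
    bins = [[] for _ in range(num_bins)]
    for size, resp in completions:
        b = next(i for i, c in enumerate(cuts) if size <= c)
        bins[b].append(resp / size)
    return [sum(b) / len(b) if b else None for b in bins]
```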
Moving beyond the M/GI/1
Due to the difficulty of analyzing scheduling policies, they have traditionally been studied
only in fairly simple models, primarily the M/GI/1 queue. Though this model allows for general job
sizes, as we have discussed, the assumptions of Poisson arrivals and a single server are often overly
restrictive. We will move beyond these restrictive assumptions and study scheduling in settings where
the arrivals are generated by interactive users and in settings where the system uses a multiserver
architecture.
One fundamental difference between arrivals to real systems and Poisson arrivals is that real users
are interactive. That is, they must wait for their previous request to complete before submitting a
new request. This interactivity introduces dependencies between the arrival and completion processes
that are not present in the M/GI/1 model. We will characterize the impact of these dependencies on
the performance of scheduling policies. We will illustrate that if the dependencies are strong enough,
the effectiveness of scheduling can be limited, but that in many practical settings scheduling is still
very beneficial.
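The dependence of arrivals on completions can be seen in a tiny closed-system simulation like the one below: each user resubmits only after its previous request finishes plus an exponential think time. The single FCFS server and all parameter values are illustrative assumptions.

```python
import heapq
import random

def closed_system_mean_response(num_users=10, think=5.0, mean_size=1.0,
                                num_jobs=20000, seed=1):
    """Mean response time in a closed system with interactive users.

    Unlike a Poisson (open) arrival process, each user's next arrival
    occurs only after its previous request completes, so the arrival
    process depends on the completion process.
    """
    rng = random.Random(seed)
    # Pending submission events: (time, user).
    events = [(rng.expovariate(1 / think), u) for u in range(num_users)]
    heapq.heapify(events)
    server_free, total_resp, done = 0.0, 0.0, 0
    while done < num_jobs:
        t, user = heapq.heappop(events)       # earliest submission
        finish = max(t, server_free) + rng.expovariate(1 / mean_size)
        server_free = finish
        total_resp += finish - t
        done += 1
        # The user thinks, then submits again -- only after completion.
        heapq.heappush(events, (finish + rng.expovariate(1 / think), user))
    return total_resp / done
```

With one user there is no queueing at all; adding users increases contention and mean response time, but the arrival rate adjusts to completions rather than being fixed in advance, which is what bounds how much any scheduler can help.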
Another important difference between real systems and the M/GI/1 is that real systems are in-
creasingly using multiserver architectures. The reason for this is that buying a single fast server is
much more expensive than buying a large number of slower servers. The use of multiserver architec-
tures has a huge effect on the impact of scheduling. For instance, in multiserver architectures SRPT
is not optimal (in general) for mean response time. Further, while in a single-server system a single large job
can block the server if small jobs are not given priority, in a multiserver system small jobs can by-
pass a single large job even without being given high priority. This intuition suggests that scheduling
may not be as effective in multiserver settings. However, we will illustrate that clever scheduling in
multiserver systems can still be very beneficial. Further, we will present results about how schedul-
ing affects the design of multiserver systems, i.e. how scheduling impacts the number of servers that
should be used.
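The bypass effect is easy to see in a toy central-queue FCFS simulation (a sketch with made-up job sizes, not a model from this thesis): with one server a 100-unit job blocks two 1-unit jobs, while with two servers the small jobs slip past it without any priorities.

```python
import heapq

def fcfs_response_times(arrivals, sizes, num_servers):
    """Response times under central-queue FCFS with identical servers.

    `arrivals` must be sorted; each job goes to the earliest-free server.
    """
    free = [0.0] * num_servers            # next-free time of each server
    heapq.heapify(free)
    resp = []
    for a, s in zip(arrivals, sizes):
        start = max(a, heapq.heappop(free))
        heapq.heappush(free, start + s)
        resp.append(start + s - a)
    return resp
```

Here `fcfs_response_times([0, 0, 0], [100, 1, 1], 1)` gives `[100, 101, 102]`, while with `num_servers=2` it gives `[100, 1, 2]`: the small jobs bypass the large one even though FCFS gives them no priority.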