Learning-Augmented Energy-Aware List Scheduling for Precedence-Constrained Tasks

YU SU, California Institute of Technology, Pasadena, United States
VIVEK ANAND, Georgia Institute of Technology, Atlanta, United States
JANNIE YU, California Institute of Technology, Pasadena, United States
JIAN TAN, Alibaba Inc., Sunnyvale, United States
ADAM WIERMAN, California Institute of Technology, Pasadena, United States
We study the problem of scheduling precedence-constrained tasks to balance between performance and energy consumption. We consider a system with multiple servers capable of speed scaling and seek to schedule precedence-constrained tasks to minimize a linear combination of performance and energy consumption. Inspired by the single-server setting, we propose the concept of pseudo-size for individual tasks, which is a measure of the externalities of a task in the precedence graph and is learned from historical workload data. We then propose a two-stage scheduling framework that uses a learned pseudo-size approximation and achieves a provable approximation bound on the linear combination of performance and energy consumption for both makespan and total weighted completion time, where the quality of the bound depends on the approximation quality of the pseudo-sizes. We show experimentally that learning-based approaches consistently perform near optimally.
CCS Concepts: • Theory of computation → Scheduling algorithms; • Computer systems organization → Cloud computing;
Additional Key Words and Phrases: Energy-Aware Scheduling, Precedence-Constrained Tasks, Learning-Augmented Algorithms
ACM Reference Format:
Yu Su, Vivek Anand, Jannie Yu, Jian Tan, and Adam Wierman. 2024. Learning-Augmented Energy-Aware List Scheduling for Precedence-Constrained Tasks. ACM Trans. Model. Perform. Eval. Comput. Syst. 9, 4, Article 13 (September 2024), 24 pages. https://doi.org/10.1145/3680278
1 Introduction
This article seeks to develop energy-aware scheduling policies for precedence-constrained tasks
that arise in modern machine learning platforms. The problem of how to optimally schedule a job
made up of tasks with precedence constraints has been studied for decades. The initial work on
this scheduling problem arose in the context of scheduling jobs on multi-processor systems [17].
Funding for the publications listed above has come in large part from the NSF through CNS-2146814, CPS-2136197, CNS-2106403, NGSDI-2105648, AitF-1637598, and CNS-1518941, which supported Jannie Yu, Yu Su, and Adam Wierman.
Authors’ Contact Information: Yu Su, California Institute of Technology, Pasadena, California, United States; e-mail: suyu@
caltech.edu; Vivek Anand, Georgia Institute of Technology, Atlanta, Georgia, United States; e-mail: vivekanand@gatech.
edu; Jannie Yu, California Institute of Technology, Pasadena, California, United States; e-mail: jannie@caltech.edu; Jian Tan,
Alibaba Inc., Sunnyvale, California, United States; e-mail: j.tan@alibaba-inc.com; Adam Wierman, California Institute of
Technology, Pasadena, California, United States; e-mail: adamw@caltech.edu.
This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2024 Copyright held by the owner/author(s).
ACM 2376-3639/2024/09-ART13
https://doi.org/10.1145/3680278
ACM Trans. Model. Perform. Eval. Comput. Syst., Vol. 9, No. 4, Article 13. Publication date: September 2024.
13:2
Y. Su et al.
Today this problem attracts attention due to the prominence of large-scale, general-purpose machine learning platforms, e.g., Google's TensorFlow [1], Facebook's PyTorch [49], and Microsoft's Azure Machine Learning (AzureML) [13]. Machine learning jobs on the cloud are often expressed as directed acyclic graphs (DAGs) of precedence-constrained tasks, and how these jobs are scheduled to run on clusters of machines on the cloud is crucial to the performance of the system [1]. Another timely example of scheduling precedence-constrained tasks is the parallelization of training and evaluation of large, complex neural networks in heterogeneous clusters that consist of central processing units (CPUs), graphical processing units (GPUs), tensor processing units (TPUs), and so forth. This device placement problem has attracted considerable attention in recent years [24, 28, 46, 66].
Traditionally, computational efficiency has been the only focus of works studying how to schedule precedence-constrained tasks; e.g., the goal is to complete the tasks as soon as possible given a fixed set of heterogeneous machines. The most common metric in the literature is total weighted completion time, i.e., a weighted average of the completion times of tasks. Mean response time is a special case of total weighted completion time (via assigning equal weights to all tasks), as is makespan (via adding a dummy node of weight 1 as the final task, with all other tasks assigned weight 0). For these performance measures, significant progress has been made in recent years. New results have emerged providing policies with poly-logarithmic approximation ratios in increasingly general settings, including settings focused on makespan and total weighted completion time, settings with heterogeneous related machines, and settings with uniform and machine-dependent communication times [20, 21, 40–42, 59].
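These reductions can be checked on a toy example (task names and completion times here are made up for illustration):

```python
# Toy illustration (hypothetical values): makespan and mean response time as
# special cases of total weighted completion time.
completion = {"a": 3.0, "b": 5.0, "c": 4.0}  # completion times C_j
n = len(completion)

# Mean response time: equal weights 1/n for every task.
mean_response = sum((1.0 / n) * C for C in completion.values())

# Makespan: add a dummy task "d" of size 0 and weight 1 that succeeds every
# other task, so C_d = max_j C_j; all real tasks get weight 0.
C_d = max(completion.values())
makespan_as_twct = 0.0 * sum(completion.values()) + 1.0 * C_d

assert mean_response == 4.0
assert makespan_as_twct == 5.0 == max(completion.values())
```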
However, the increasing scale of machine learning jobs has brought questions about the energy usage of such jobs to the forefront. Today, the emissions of training an Artificial Intelligence (AI) model can be as high as five times the lifetime emissions of a car [58]. The computation required for deep learning has been doubling every 3.4 months, resulting in a 300,000x increase from 2012 to 2018 [4, 54]. Indeed, the energy cost for an individual data center is on the order of millions of dollars, and it has become a significant portion of operating costs for cloud platforms [30]. However, there is an inherent conflict between boosting performance and reducing energy consumption; i.e., a larger power budget generally allows higher performance in practice. Thus, it is urgent to study how to efficiently schedule machine learning jobs with both performance and energy consumption in mind. Balancing these performance measures and energy usage is crucial to the industry as well as to societal goals of making cloud computing carbon neutral [26].
There has been considerable progress toward understanding how to schedule to balance performance and energy measures in single- and multi-server settings without dependencies between tasks [3, 14, 63, 64]. A focus within this line of work is the question of co-designing the scheduling of tasks and the speed scaling of servers. While there has been progress in studying speed scaling in simple scheduling problems, the question of how to balance energy usage with traditional performance metrics, such as total weighted completion time, in more complex settings with dependencies among tasks is a challenging open question. In fact, scheduling precedence-constrained tasks is NP-hard even when power consumption is ignored; i.e., both partitioning the jobs across machines and scheduling the jobs among a group of machines are NP-hard [37]. Further, speed scaling adds considerable difficulty to the question of how to schedule tasks optimally to balance energy and performance. When speed scaling is not considered, the relaxed version of the scheduling problem can be formulated as a mixed integer linear program (MILP), and many off-the-shelf solvers work well for these problems at relatively small scale in practice. In the presence of speed scaling, however, solving for the optimal schedule even for small-scale problems becomes far more complex and computationally expensive in practice because the optimization is no longer linear.
Given the hardness of the problem, a natural question is:
can we design a joint scheduler and speed
scaling policy for precedence-constrained tasks that is provably near-optimal for a linear combination
of performance and energy consumption?
Contributions. In this article, we present the first learning-augmented algorithm for scheduling precedence-constrained tasks on multiple machines to balance performance and energy. Our results provide provable guarantees for both makespan and total weighted completion time.
Our algorithm, Learning-augmented Energy-aware List Scheduling (Algorithm 2), uses list scheduling in concert with a learned estimate of the so-called "pseudo-sizes" of tasks. Often the biggest challenge in designing a learning-augmented algorithm is determining what quantity to learn. Identifying the pseudo-size as the quantity to learn is the key novel idea underlying our algorithm. We introduce the pseudo-size of a task as a measure of the externalities of the task on other tasks that have yet to complete (see Section 3.1 for details). Intuitively, if many other tasks have precedence constraints that depend on the current task, then its externalities are high and it is worth investing more energy to increase its speed and finish it quickly.
Our main results provide performance guarantees that depend on the quality of the learned approximation of the pseudo-size (Theorem 4.1 for total weighted completion time and Theorem 4.2 for makespan). If pseudo-size estimates are perfect, these results match the best results possible, and they further provide bounds on the degradation of performance as pseudo-size estimate quality drops.
Finally, we present simple and effective approaches for learning pseudo-size approximations and demonstrate the performance of these approaches on real-world workloads of jobs with precedence-constrained tasks. Our workloads cover a wide variety of compute-intensive and data-intensive workloads from bioinformatics, astronomy, seismology, and so forth. Our results show that gradient descent-based learning is near optimal across a wide variety of workloads, never performing more than 7% worse than our upper bound on the optimal cost.¹ We also highlight that a less computationally intensive approach based on linear regression can perform well in many scenarios and that, when precedence constraints have simple structures, even a deterministic, non-learning-based estimation of pseudo-sizes can be effective.
Related Literature. In recent years, due to the successful deployment of machine learning algorithms in widely varying scenarios, the design and optimization of large-scale machine learning platforms has attracted extensive attention from both the academic and industry communities. One of the fundamental challenges is to balance system performance and energy usage while scheduling machine learning jobs with precedence constraints.
Significant progress has recently been made toward the goal of maximizing performance while scheduling precedence-constrained tasks if energy concerns are ignored. Under the related machines model, i.e., $Q \mid prec \mid \sum_j \omega_j C_j$, a Speed-based List Scheduling algorithm was proposed to obtain an $O(\log m)$-approximation [16]. Subsequently, an improvement to $O(\log m / \log\log m)$ was made in 2017 by [40] for both objectives: makespan and total weighted completion time. If communication delays are assumed to be fixed (uniform), two groups of researchers independently made progress toward a multiplicative approximation algorithm with logarithmic guarantees by adopting different approaches [20, 21, 42]. When it comes to incorporating non-uniform communication delays for the first time, i.e., $Q \mid prec, c_{i,j} \mid \sum_j \omega_j C_j$, Generalized Earliest Time First (GETF) was proposed in [59] to achieve a worst-case bound for both makespan and total weighted completion time, and it reduces to the state-of-the-art results in the case of zero communication delays.
¹Computing the true optimal cost is intractable for the scale of real-world jobs. See Section 6.1.2.
There has been a flurry of work studying power management if precedence constraints are ignored, including settings with sleep states [19, 23, 50], settings that consider speed scaling, settings with single servers [6–8, 12, 64], and settings with multiple servers [3, 14, 27]. A comparison of different energy conservation schemes has received considerable attention as well [29, 63]. Further, diverse performance measures have been considered in the literature, including deadline feasibility [33, 39, 64], flow time [2, 10, 57], and so forth. We refer to [32] for a comprehensive survey of related scheduling problems.
When both performance and energy goals are considered for tasks with precedence constraints, much less is known. The closest work to ours is [51], which considers the problem of scheduling precedence-constrained tasks to minimize the makespan subject to an energy constraint and obtains a poly-logarithmic approximation algorithm by reducing the problem to $Q \mid prec \mid C_{\max}$. However, their technique does not apply to our setting for two reasons. First, we consider a general objective, a linear combination of performance and energy consumption, instead of focusing on the constrained budget problem. Second, total weighted completion time is a more general choice for the performance measure, of which makespan is a special case. A variety of papers build on [51]; e.g., [5] solves the same problem as [51] but improves the competitive ratio. Other papers include [35] and [55], which give empirical heuristics for the objective of makespan plus energy, without providing theoretical guarantees. There are a variety of other such works that design heuristics to balance both performance and energy, e.g., [11, 35, 36, 43–45, 56, 60–62, 65]. However, as we focus on algorithms with provable guarantees, we do not dive into the details of these heuristics. Thus, at this point, it remains unknown whether it is possible to design a scheduler for precedence-constrained tasks that provably minimizes a combination of performance and energy measures in general settings.
Our work takes a learning-augmented approach, using machine-learned predictions to overcome the challenges associated with scheduling precedence-constrained jobs to minimize performance and energy costs. The area of learning-augmented algorithms has received considerable attention in recent years, as machine-learned advice is an increasingly powerful tool for algorithms to exploit. The field began with a focus on algorithms that use predictions to improve running times through techniques such as warm starts or predicting the internal state of algorithms; see, e.g., [22, 25, 31] and the references therein. Our work falls into this category and is the first learning-augmented algorithm designed to balance energy and performance for precedence-constrained tasks. Another independent line of work considers future predictions in online decision-making tasks. Examples of this approach include [15, 47, 52, 53] and the references therein. Within this line of work, a related recent paper [34] studies the value of future predictions in the online problem of scheduling precedence-constrained jobs to minimize completion time (without consideration of energy). This line of work differs from how we use machine-learned advice in this article. We focus on predictions of the internal state of the algorithm to achieve a good approximation efficiently. In this context, our goal is to (1) understand the right parameters to predict in order to design a robust algorithm that performs well when predictions are accurate but does not suffer badly if predictions are poor; (2) provide bounds on performance depending on the error of the predictions; and (3) show that the selected parameters can be learned from historical data.
2 Model
We study the problem of scheduling a job made of a set $V$ of $n$ tasks on a system consisting of a set $M$ of $m$ machines. The tasks form a DAG $G = (V, E)$, in which each node $j$ represents a task and an edge $(j', j)$ between task $j$ and task $j'$ represents a precedence constraint. We use node and task interchangeably, as convenient. Precedence constraints are denoted by a partial order $\prec$ between the two nodes of any edge, where $j' \prec j$ means that task $j$ can only be scheduled after task $j'$ completes.
Let $p_j$ represent the processing size of task $j$. If task $j$ is assigned to run at a speed $s_j$ on machine $i$, then it would take $p_j / s_j$ time units to run on machine $i$.
For simplicity, we assume that the DAG is connected. This is without loss of generality because, otherwise, the DAG can be viewed as multiple DAGs and the same results can be applied to each individual connected component. Consequently, our results trivially apply to the case of multiple jobs. Additionally, our model assumes that each machine can process at most one task at a time, i.e., there is no time-sharing, and the machines are assumed to be non-preemptive, i.e., once a task starts on a machine, the scheduler must wait for the task to complete before scheduling any new task to this machine. This is a natural assumption in many settings, as interrupting a task and transferring it to another machine can cause significant processing overhead and communication delays due to data locality. These assumptions are standard in the related literature, e.g., [20, 51, 59].
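The connected-components reduction above can be sketched in a few lines (a minimal illustration in plain Python; the function name and integer node encoding are our own):

```python
from collections import defaultdict

def weakly_connected_components(n, edges):
    """Split a DAG on nodes 0..n-1 into weakly connected components,
    so a scheduler can treat each component as an independent DAG."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # ignore edge direction for connectivity
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(comp))
    return components

# Two independent chains: 0 -> 1 and 2 -> 3 -> 4.
print(weakly_connected_components(5, [(0, 1), (2, 3), (3, 4)]))
# [[0, 1], [2, 3, 4]]
```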
Performance and Energy Metrics. The objective function we consider is $T + \lambda E$, a linear combination of a performance measure $T$ and an energy/power usage $E$. The system operator can emphasize either performance or energy as desired via the choice of weight $\lambda$. For this work, we adopt total weighted completion time as the performance measure, though our results apply to both makespan and mean response time as well. For the energy metric $E$, our focus is on total energy usage, which is the sum of the energy usage of all tasks in the DAG, i.e., $E = \sum_{j \in V} e(j)$.
Speed Scaling. The scheduler we consider has the ability to scale the speeds of servers in order to trade off performance and energy. In our model, a server chooses a speed $s_j$ for task $j$. For a task running at speed $s_j$, its energy consumption $e(j)$ is modeled as the product of the instantaneous power $f(s_j)$ and the running time $t_j$ of that task, i.e., $e(j) = f(s_j) \cdot t_j$. A common form of the power function in the literature is a polynomial, i.e., $f(s) = s^\alpha$, where $\alpha > 1$ [7, 9, 10]. A quadratic form is most common, so we focus on that in this work, though our main results apply more generally. Note that for any convex choice of instantaneous power function, given any optimal schedule, the server always runs at a constant speed during the execution of a single task; otherwise, it would be possible to adopt the average speed for running the task without any sacrifice in the performance measure while potentially conserving more energy.
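As a numeric sanity check of this model (illustrative values only, with the quadratic power function f(s) = s^2 discussed above):

```python
# Energy model sketch: a task of size p run at constant speed s takes p/s time
# and uses e = f(s) * (p/s) = s**(alpha - 1) * p energy, with f(s) = s**alpha.

def energy(speed, work, alpha=2.0):
    """Energy to process `work` units at constant `speed` with f(s) = s**alpha."""
    time = work / speed
    return speed**alpha * time  # equals speed**(alpha - 1) * work

p, t = 2.0, 2.0                  # task size and a fixed total running time
constant = energy(p / t, p)      # constant speed s = p/t over the whole task

# Split the same task: first half of the work at speed 2, the rest slow enough
# that the total time is still t. Convexity of f makes this strictly worse.
fast_time = 1.0 / 2.0
slow_speed = 1.0 / (t - fast_time)
varying = energy(2.0, 1.0) + energy(slow_speed, 1.0)

assert varying > constant        # constant speed conserves energy
print(constant, varying)         # constant = 2.0, varying ≈ 2.67
```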
The Optimization Problem. We can now formally define the joint scheduling and speed scaling problem via an optimization formulation. Let $x_{i,j}$ be a binary decision variable that indicates whether task $j$ is assigned to machine $i$, i.e.,

$$x_{i,j} = \begin{cases} 1, & \text{if task } j \text{ runs on machine } i, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$
Let $C_j$ denote the completion time of task $j$ and $s_j$ denote the assigned speed of task $j$. The scheduling problem (with total weighted completion time as the performance measure) can be formulated as follows:

$$\min_{x_{i,j},\,C_j,\,s_j,\,T,\,E} \quad T + \lambda E$$

$$x_{i,j} \in \{0, 1\} \qquad \forall i, j \tag{2a}$$

$$\sum_i x_{i,j} = 1 \qquad \forall j \tag{2b}$$

$$C_{j'} + \frac{p_j}{s_j} \le C_j \qquad \forall j' \prec j \tag{2c}$$

$$\sum_j \omega_j C_j \le T \tag{2d}$$
$$\sum_j f(s_j)\,\frac{p_j}{s_j} = E \tag{2e}$$

$$\sum_i x_{i,j}\, x_{i,j'} = q_{j,j'} \qquad \forall j, j' \text{ if } j \ne j' \tag{2f}$$

$$C_{j'} \le C_j - \frac{p_j}{s_j} \;\text{ or }\; C_j \le C_{j'} - \frac{p_{j'}}{s_{j'}} \qquad \text{if } q_{j,j'} = 1. \tag{2g}$$
The optimal value of this problem is denoted as OPT. Constraint (2b) requires every task to be scheduled on some machine. Constraint (2c) guarantees that for any predecessor-successor pair, the successor task will not start until the predecessor completes. In Constraint (2d), $T$ represents the total weighted completion time; by assigning appropriate weights, it can be reduced to other performance measures, such as makespan and mean response time. In Constraint (2e), $E$ is the sum over tasks of the product of the power per unit time and the running time. Constraint (2f) guarantees that $q_{j,j'} = 1$ if task $j$ and task $j'$ are assigned to the same machine. No machine may process more than one task at a time, as enforced by Constraint (2g). By adding an auxiliary binary variable $b_{j,j'}$, we can rewrite Constraint (2g) as follows:
$$q_{j,j'}\left(C_j - C_{j'} + \frac{p_{j'}}{s_{j'}}\right) \le b_{j,j'}\left(T - C_{j'} + \frac{p_{j'}}{s_{j'}}\right) \qquad \forall j, j' \tag{3a}$$

$$b_{j,j'}\, C_{j'} \le \left(C_j - \frac{p_j}{s_j}\right) q_{j,j'} \qquad \forall j, j' \tag{3b}$$

$$b_{j,j'} \le q_{j,j'} \qquad \forall j, j' \tag{3c}$$

$$b_{j,j'} \in \{0, 1\} \qquad \forall j, j'. \tag{3d}$$
When task $j$ and task $j'$ are assigned to run on the same machine, i.e., $q_{j,j'} = 1$, the auxiliary variable represents the ordering of these two tasks, i.e., $b_{j,j'} = 0$ if $j \prec j'$.
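To make the formulation concrete, the following sketch checks a candidate solution against Constraints (2b)-(2g) and evaluates the objective T + λE (a hypothetical helper for a toy instance, not part of our algorithm; the equivalent big-M form (3a)-(3d) is omitted here for brevity):

```python
import itertools

def evaluate(assignment, speeds, completion, sizes, weights, edges, lam=1.0, alpha=2.0):
    """Check a candidate schedule against Constraints (2b)-(2g) and return
    the objective T + lam * E; raises AssertionError if infeasible.

    assignment: task -> machine (one machine per task, which encodes (2b))
    speeds, completion, sizes, weights: task -> s_j, C_j, p_j, w_j
    edges: precedence pairs (j_prev, j) meaning j_prev precedes j
    """
    # (2c): a task starts (C_j - p_j/s_j) no earlier than each predecessor ends.
    for j_prev, j in edges:
        assert completion[j_prev] <= completion[j] - sizes[j] / speeds[j]
    # (2f)/(2g): tasks sharing a machine must not overlap in time.
    for j, k in itertools.combinations(assignment, 2):
        if assignment[j] == assignment[k]:  # q_{j,k} = 1
            j_start = completion[j] - sizes[j] / speeds[j]
            k_start = completion[k] - sizes[k] / speeds[k]
            assert completion[j] <= k_start or completion[k] <= j_start
    # (2d): total weighted completion time; (2e): total energy with f(s) = s**alpha.
    T = sum(weights[j] * completion[j] for j in assignment)
    E = sum(speeds[j] ** alpha * sizes[j] / speeds[j] for j in assignment)
    return T + lam * E

# Toy chain a -> b on one machine, both tasks at speed 1.
obj = evaluate(
    assignment={"a": 0, "b": 0},
    speeds={"a": 1.0, "b": 1.0},
    completion={"a": 1.0, "b": 2.0},
    sizes={"a": 1.0, "b": 1.0},
    weights={"a": 1.0, "b": 1.0},
    edges=[("a", "b")],
)
print(obj)  # T = 1 + 2 = 3, E = 1 + 1 = 2, objective = 5.0
```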
We emphasize that there are other possible formulations of the optimization problem. For example, we could construct a different formulation with a time-indexed program, though it would remain nonlinear due to the speed variables. Thus, we adopt the above formulation for simplicity. Because scheduling precedence-constrained tasks on multiple servers is NP-hard, we focus on approximation algorithms in this article.
3 Algorithm Design
Inspired by the single-server case, we introduce the notion of pseudo-size to quantify the importance of tasks in the DAG. Intuitively, a task that many other tasks depend on via precedence constraints should be prioritized to run at a fast speed, while a task that few other tasks depend on can be set to run at a slow speed. This intuition is simple, but determining how many other tasks depend on a given task is complex, since it depends on the task's children in the DAG, the "width" of the DAG at the task's position, and the sizes of the various tasks in the DAG. Our approximation algorithm uses a pseudo-size approximation to capture these factors in combination with an approximation algorithm for the case of identical machines without energy concerns, e.g., [48], to produce a schedule that balances between performance and energy goals.
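The identical-machines list-scheduling building block can be sketched as follows (a simplified Graham-style variant with unit speeds; the task names, priority-list input, and tie-breaking are illustrative, not the exact algorithm of [48]):

```python
def list_schedule(priority, sizes, preds, m):
    """Greedy list scheduling of precedence-constrained tasks on m identical
    machines: repeatedly start the highest-priority ready task as early as
    possible (all speeds fixed at 1; no energy term in this sketch)."""
    completion = {}
    machine_free = [0.0] * m
    done = set()
    while len(done) < len(priority):
        # highest-priority task whose predecessors have all been scheduled
        j = next(t for t in priority
                 if t not in done and all(p in done for p in preds.get(t, [])))
        ready = max((completion[p] for p in preds.get(j, [])), default=0.0)
        i = min(range(m), key=lambda k: max(machine_free[k], ready))
        start = max(machine_free[i], ready)
        completion[j] = start + sizes[j]
        machine_free[i] = completion[j]
        done.add(j)
    return completion

# Diamond DAG a -> {b, c} -> d, unit sizes, two machines.
C = list_schedule(
    priority=["a", "b", "c", "d"],
    sizes={"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0},
    preds={"b": ["a"], "c": ["a"], "d": ["b", "c"]},
    m=2,
)
print(C)  # {'a': 1.0, 'b': 2.0, 'c': 2.0, 'd': 3.0}
```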
The pseudo-size of tasks that we use depends on various features of the precedence graph, such as the degree of nodes, the number of children, and so forth, as well as the given number of machines. In practice, experts can extract these features and feed them to an off-the-shelf learning algorithm to obtain an approximation, which is then used by the approximation algorithm.
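For intuition, one crude, deterministic proxy for such features is the number of descendants of each task (a hypothetical sketch; our actual pseudo-size definition is given in Section 3.1):

```python
def descendant_counts(succ):
    """A crude, non-learned pseudo-size proxy: for each task, count the tasks
    that transitively depend on it in the DAG (its descendants)."""
    memo = {}
    def descendants(j):
        if j not in memo:
            ds = set()
            for c in succ.get(j, []):
                ds.add(c)
                ds |= descendants(c)
            memo[j] = ds
        return memo[j]
    return {j: len(descendants(j)) for j in succ}

# Diamond DAG: a feeds b and c, which both feed d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(descendant_counts(succ))  # {'a': 3, 'b': 1, 'c': 1, 'd': 0}
```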
The introduction of pseudo-size is critical in two ways. First, conditional on an approximation
of pseudo-size, we are able to reduce the general problem, for which it is hard to find an optimal