Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers

We consider the load balancing problem in large-scale heterogeneous systems with multiple dispatchers. We introduce a general framework called Local-Estimation-Driven (LED). Under this framework, each dispatcher keeps local (possibly outdated) estimates of queue lengths for all the servers, and the dispatching decision is made purely based on these local estimates. The local estimates are updated via infrequent communications between dispatchers and servers. We derive sufficient conditions for LED policies to achieve throughput optimality and delay optimality in heavy-traffic, respectively. These conditions directly imply delay optimality for many previous local-memory based policies in heavy traffic. Moreover, the results enable us to design new delay optimal policies for heterogeneous systems with multiple dispatchers. Finally, the heavy-traffic delay optimality of the LED framework directly resolves a recent open problem on how to design optimal load balancing schemes using delayed information.


Introduction
Load balancing, which is responsible for dispatching jobs on parallel servers, has attracted significant interest in recent years. This is motivated by the challenges associated with efficiently dispatching jobs in large-scale data centers and cloud applications, which are rapidly increasing in size. A good load balancing policy not only ensures high throughput by maximizing server utilization, but improves the user experience by minimizing delay.
There have been numerous load balancing policies proposed in the literature. The most straightforward one is Join-Shortest-Queue (JSQ), which has been shown to enjoy optimal delay in both non-asymptotic (for homogeneous servers) and asymptotic regimes [22,5,4]. However, it is difficult to implement in today's large-scale data centers due to the large message overhead between the dispatcher and servers. As a result, alternative load balancing policies with low message overhead have been proposed. For example, the Power-of-d policy [12] has been shown to achieve optimal average delay in heavy traffic with only 2d messages per arrival [10]. Another common load balancing policy is the pull-based Join-Idle-Queue (JIQ) [9,16], which has been shown to outperform the Power-of-d policy using less overhead. However, both Power-of-d and JIQ mainly achieve good performance for systems with homogeneous servers. Recently, some works consider heterogeneous servers and propose flexible and low message overhead policies that achieve optimal delay in heavy traffic [29,27]. However, only a single dispatcher is considered in these works. Theoretical analysis of load balancing with multiple dispatchers has mainly focused on the JIQ policy so far [13,17], which has a poor performance in heavy traffic and is even generally unstable for heterogeneous systems [29].
Note that heterogeneous systems with multiple dispatchers are now almost the default scenarios in today's cloud infrastructures. On one hand, the heterogeneity comes from the usage of multiple generations of CPUs and various types of devices [6]. On the other hand, with the massive amount of data, a scalable cloud infrastructure needs multiple dispatchers to increase both throughput and robustness [15].
Under Power-of-d, the dispatcher only needs to sample d ≥ 2 servers and sends arrivals to the server with the shortest queue length among the d samples. This simple policy has been shown to enjoy a doubly exponential decay rate in response time in the large-system asymptotic regime [12] and achieve optimal delay in heavy traffic for homogeneous servers [3,10]. Another low-message-overhead policy is JIQ (or pull-based policy) [9,16], under which arrivals are sent to one of the idle servers, if there are any, and to a randomly selected server otherwise. Compared to JSQ and Power-of-d, JIQ has the nice property of zero dispatching delay since each arrival can be instantaneously routed rather than waiting for feedback from servers. Moreover, JIQ has been shown to outperform Power-of-d with even smaller message overhead (at most one per job). In particular, under JIQ, arriving jobs achieve asymptotically zero waiting time in the large-system regime while Power-of-d does not. An even stronger result suggests that, in the Halfin-Whitt asymptotic regime, JIQ achieves the same delay performance as JSQ [14]. Nevertheless, the performance of JIQ drops substantially in heavy traffic with a finite number of servers, even for homogeneous servers. In fact, it is not heavy-traffic delay optimal in this case [29]. Motivated by this, recent works have proposed alternative pull-based policies that not only enjoy all the nice features of JIQ but also achieve optimal delay in heavy traffic [29,27]. However, these studies only consider the case of a single dispatcher.
Compared to the large literature on the single dispatcher case, there are only a few works for the scenario of multiple dispatchers, and they mainly focus on the JIQ policy. In particular, [13] presents a new large-system asymptotic analysis of JIQ without the simplifying assumptions in [9]. The property of asymptotically zero waiting time of JIQ was generalized to the case of multiple dispatchers in [17]. However, the results for JIQ in [9,13,17] all assume that the loads at various dispatchers are strictly equal. Without this assumption, [19] shows that the waiting time under JIQ no longer vanishes in the large-system regime and two enhanced JIQ schemes are proposed. As mentioned earlier, although JIQ is a scalable choice for the multiple-dispatcher case, it is not delay optimal in heavy traffic for homogeneous servers and not even generally stable for heterogeneous systems.
The case of heterogeneous systems with multiple dispatchers has received very little attention from the theoretical community so far. To the best of our knowledge, the framework proposed in [1] is the first attempt to study efficient load balancing schemes with a theoretical guarantee for the scenario of heterogeneous systems with multiple dispatchers. In particular, under the proposed Loosely-Shortest-Queue (LSQ) framework, each dispatcher independently keeps its own local view of server queue lengths and routes jobs to the shortest among them. Communication is used only to update the local views and make sure that they are not too far from the real queue lengths. The main contributions of [1] are the sufficient conditions for any LSQ policy to achieve strong stability with low message overhead. Additionally, extensive simulations have been used to demonstrate its appeal. Nevertheless, a theoretical guarantee on the delay performance of LSQ policies remains an important unsolved question.
It is worth pointing out that the idea of using local memory to hold possibly old information for load balancing was also explored in two recent works [2,20]. As we discuss later, these two proposed policies are in our LED framework. Both works only consider a single dispatcher and homogeneous servers, which is also a special case of our model. Further, their analysis focuses on the large-system asymptotic regime where the number of servers goes to infinity, while our analysis deals with a finite number of servers.

System Model and Preliminaries
This section describes the system model and assumptions considered in this paper. Then, several necessary preliminaries are presented.

System model
We consider a discrete-time (i.e., time-slotted) load balancing system consisting of M dispatchers and N possibly heterogeneous servers. Each server maintains an infinite-capacity FIFO queue. At each dispatcher, there is a local memory, through which the dispatcher can have some (possibly delayed) information about the system states. In each time-slot, each dispatcher routes its new incoming tasks to one of the servers, immediately upon arrival. Once a task joins a queue, it remains in that queue until its service is completed. Each server is assumed to be work conserving, i.e., a server is idle if and only if its corresponding queue is empty.

Arrivals
Let $A^m(t)$ denote the number of exogenous tasks that arrive at dispatcher m at the beginning of time-slot t. We assume that $A_\Sigma(t) = \sum_{m=1}^{M} A^m(t)$ is an integer-valued random variable, which is i.i.d. across time-slots. The mean and variance of $A_\Sigma(t)$ are denoted by $\lambda_\Sigma$ and $\sigma_\Sigma^2$, respectively. We further assume that there is a positive probability that $A_\Sigma(t)$ is zero. The allocation of total arriving tasks among the M dispatchers is allowed to use any arbitrary policy that is independent of system states. Note that, in contrast to previous works on multiple dispatchers [9,13,17], we do not require that the loads at all dispatchers are equal. We assume that there is a strictly positive probability for tasks to arrive at each dispatcher at any time-slot t. That is, there exists a strictly positive constant $p_0$ such that
$$P\left(A^m(t) > 0\right) \ge p_0 \quad \text{for all } (m, t) \in \mathcal{M} \times \mathbb{N},$$
where $\mathcal{M} = \{1, 2, \ldots, M\}$. Moreover, we assume that $A^m(t)$ is i.i.d. across time-slots with mean arrival rate denoted by $\lambda_m$. We further let $A^m_n(t)$ denote the number of new arrivals at server n from dispatcher m at the beginning of time-slot t. Let $A_n(t) = \sum_{m=1}^{M} A^m_n(t)$ be the total number of arriving tasks at server n at the beginning of time-slot t.

Service
Let $S_n(t)$ denote the amount of service that server n offers for queue n in time-slot t. That is, $S_n(t)$ is the maximum number of tasks that can be completed by server n at time-slot t. We assume that $S_n(t)$ is an integer-valued random variable, which is i.i.d. across time-slots. We also assume that $S_n(t)$ is independent across different servers as well as of the arrival process. The mean and variance of $S_n(t)$ are denoted as $\mu_n$ and $\nu_n^2$, respectively. Let $\mu_\Sigma \triangleq \sum_{n=1}^{N} \mu_n$ and $\nu_\Sigma^2 \triangleq \sum_{n=1}^{N} \nu_n^2$ denote the mean and variance of the hypothetical total service process $S_\Sigma(t) \triangleq \sum_{n=1}^{N} S_n(t)$. Let $\epsilon = \mu_\Sigma - \lambda_\Sigma$ characterize the distance between the arrival rate and the boundary of the capacity region.

Queue Dynamics
Let $Q_n(t)$ be the queue length of server n at the beginning of time-slot t. Let $A_n(t)$ denote the number of tasks routed to queue n at the beginning of time-slot t according to the dispatching decision. Then the evolution of the length of queue n is given by
$$Q_n(t+1) = Q_n(t) + A_n(t) - S_n(t) + U_n(t),$$
where $U_n(t) = \max\{S_n(t) - Q_n(t) - A_n(t), 0\}$ is the unused service due to an empty queue. We do not assume any specific distribution for arrival and service processes. Moreover, in contrast to previous works [29,4], we do not require that both arrival and service processes have a finite support. Instead, we only need the condition that their distributions are light-tailed. More specifically, we assume that for each n
$$E\left[e^{\theta_1 A_\Sigma(t)}\right] \le D_1 \quad \text{and} \quad E\left[e^{\theta_2 S_n(t)}\right] \le D_2,$$
where the constants $\theta_1 > 0$, $\theta_2 > 0$, $D_1 < \infty$ and $D_2 < \infty$ are all independent of $\epsilon$.
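As a minimal sketch of the queue dynamics above, the following Python function advances a single queue by one time-slot. The function name and its integer arguments are illustrative assumptions rather than notation from the model; the update applied is the recursion for $Q_n(t+1)$ with the unused service $U_n(t)$ as defined in the text.

```python
def step_queue(q, arrivals, service):
    """Advance one queue by one time-slot.

    q        -- queue length Q_n(t) at the beginning of the slot
    arrivals -- tasks A_n(t) routed to this queue in the slot
    service  -- offered service S_n(t) in the slot
    Returns (Q_n(t+1), U_n(t)).
    """
    # Unused service: the part of the offered service that finds no task.
    unused = max(service - q - arrivals, 0)
    # Q_n(t+1) = Q_n(t) + A_n(t) - S_n(t) + U_n(t)
    return q + arrivals - service + unused, unused
```

Because the unused service exactly cancels any negative remainder, the update is equivalent to taking the maximum of $Q_n(t) + A_n(t) - S_n(t)$ and zero.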

Local-Estimation-Driven (LED) framework
We are interested in the case that the local memory at each dispatcher m stores an estimate of the queue length for each server n. In particular, we let $\tilde{Q}^m_n(t)$ be the local estimate of the queue length for server n from dispatcher m at the beginning of time-slot t (before any arrivals and departures). More specifically, we introduce the following framework for load balancing. The definition of LED is broad, and it includes a variety of classical load balancing policies. For example, it can be seen to include the LSQ policy studied in [1], by choosing the dispatching strategy to be that new arrivals at each dispatcher are dispatched to the queue with the shortest local estimate. Moreover, it also includes two recent local-memory based policies in [2,20] that are developed for the case of a single dispatcher and homogeneous servers.
To study LED, we model the system as a discrete-time Markov chain $\{Z(t) = (Q(t), m(t)), t \ge 0\}$ with state space $\mathcal{Z}$, using the queue length vector $Q(t)$ together with the memory state $m(t) \triangleq (\tilde{Q}^1(t), \tilde{Q}^2(t), \ldots, \tilde{Q}^M(t))$. We consider a set of load balancing systems $\{Z^{(\epsilon)}(t), t \ge 0\}$ parameterized by $\epsilon$ such that the mean arrival rate of the total exogenous arrival process $\{A_\Sigma^{(\epsilon)}(t), t \ge 0\}$ is $\lambda_\Sigma^{(\epsilon)} = \mu_\Sigma - \epsilon$. Note that the parameter $\epsilon$ characterizes the distance between the arrival rate and the boundary of the capacity region. We are interested in the throughput performance and the steady-state delay performance in the heavy-traffic regime under any LED policy. A load balancing system is stable if the Markov chain $\{Z(t), t \ge 0\}$ is positive recurrent, and $Z = (Q, m)$ denotes the random vector whose distribution is the same as the steady-state distribution of $\{Z(t), t \ge 0\}$. We have the following definition.
Definition 2 (Throughput Optimality). A load balancing policy is said to be throughput optimal if for any arrival rate within the capacity region, i.e., for any $\epsilon > 0$, the system is positive recurrent and all the moments of $Q^{(\epsilon)}$ are finite.
Note that this is a stronger definition of throughput optimality than that in [1,21,25] because, besides the positive recurrence, it also requires all the moments to be finite in steady state for any arrival rate within the capacity region.
To characterize the steady-state average delay performance in the heavy-traffic regime when $\epsilon$ approaches zero, by Little's law, it is sufficient to focus on the summation of all the queue lengths. First, recall the following fundamental lower bound on the expected sum queue lengths in a load balancing system under any throughput optimal policy [4]. Note that this result was originally proved with the assumption of finite support on the service process (Lemma 5 in [4]), which can be generalized to service processes with light-tailed distributions with a careful analysis of the unused service, see our proof of Lemma 6.
Lemma 1. Given any throughput optimal policy and assuming that $(\sigma_\Sigma^{(\epsilon)})^2$ converges to a constant $\sigma_\Sigma^2$ as $\epsilon$ decreases to zero, then
$$\liminf_{\epsilon \downarrow 0} \; \epsilon \, E\left[\sum_{n=1}^{N} Q_n^{(\epsilon)}\right] \ge \frac{\zeta}{2}, \tag{4}$$
where $\zeta \triangleq \sigma_\Sigma^2 + \nu_\Sigma^2$.
The right-hand side of Eq. (4) is the heavy-traffic limit of a hypothesized single-server system with arrival process $A_\Sigma^{(\epsilon)}(t)$ and service process $\sum_{n=1}^{N} S_n(t)$ for all $t \ge 0$. This hypothetical single-server queueing system is often called the resource-pooled system. Since a task cannot be moved from one queue to another in the load balancing system, it is easy to see that the expected sum queue lengths of the load balancing system is larger than the expected queue length in the resource-pooled system. However, if a policy achieves the lower bound in Eq. (4) in the heavy-traffic limit, then by Little's law this policy achieves the minimum average delay of the system in steady-state, and is thus said to be heavy-traffic delay optimal, see [4,10,21,24,25,29].
Definition 3 (Heavy-traffic Delay Optimality in Steady-state). A load balancing scheme is said to be heavy-traffic delay optimal in steady-state if the steady-state queue length vector $Q^{(\epsilon)}$ satisfies
$$\limsup_{\epsilon \downarrow 0} \; \epsilon \, E\left[\sum_{n=1}^{N} Q_n^{(\epsilon)}\right] \le \frac{\zeta}{2},$$
where $\zeta$ is defined in Lemma 1.

Dispatching Preference
In order to provide a unified way to specify the dispatching strategy in LED, we first introduce a concept called dispatching preference. In particular, let $P^m_n(t)$ be the probability that new arrivals at dispatcher m are dispatched to server n at time-slot t. We define $\beta^m_n(t) \triangleq P^m_n(t) - \mu_n / \mu_\Sigma$, which is the difference between the probability that server n is chosen under a particular dispatching strategy and under random routing (weighted by service rate). Then, we have the following definition.
Definition 4 (Dispatching preference). Fix a dispatcher m, and let $\sigma_t(\cdot)$ be a permutation of $(1, 2, \ldots, N)$ that satisfies
$$\tilde{Q}^m_{\sigma_t(1)}(t) \le \tilde{Q}^m_{\sigma_t(2)}(t) \le \cdots \le \tilde{Q}^m_{\sigma_t(N)}(t).$$
The dispatching preference at dispatcher m is an N-dimensional vector denoted by $\Delta^m(t)$, the nth component of which is given by $\Delta^m_n(t) \triangleq \beta^m_{\sigma_t(n)}(t)$.
In words, the dispatching preference at a dispatcher m specifies how servers with different local estimates are preferred in a unified way such that it is independent of the actual values of local estimates. It only depends on the relative order of local estimates. More specifically, fix a dispatcher m. By definition, the weighted random routing strategy has no preference for any server, so $\Delta^m_n(t) = 0$ for any n. On the other hand, if new arrivals are always dispatched to the server with the shortest local estimate (e.g., the LSQ policy), we have $\Delta^m_1(t) > 0$ and $\Delta^m_n(t) < 0$ for all $2 \le n \le N$. Thus, we can see that a positive value for $\Delta^m_n(t)$ means that the dispatching strategy has a preference for the server with the nth shortest local estimate. This observation directly motivates the following two definitions.
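The dispatching preference of Definition 4 can be computed directly from the dispatching probabilities, the service rates and the local estimates. The sketch below is illustrative only: the function name and the list-based inputs are assumptions, and ties among local estimates are broken by server index, which is one of several valid choices of the permutation.

```python
def dispatching_preference(probs, mu, estimates):
    """Return the dispatching preference vector for one dispatcher.

    probs     -- probs[n] is the probability of routing to server n
    mu        -- mu[n] is the mean service rate of server n
    estimates -- estimates[n] is the dispatcher's local estimate for server n
    """
    mu_total = sum(mu)
    # beta_n: routing probability minus the weighted-random-routing probability
    beta = [p - m / mu_total for p, m in zip(probs, mu)]
    # Permutation that sorts servers by increasing local estimate
    # (Python's sort is stable, so ties are broken by server index).
    order = sorted(range(len(estimates)), key=lambda n: estimates[n])
    # n-th component is the beta of the server with the n-th shortest estimate
    return [beta[n] for n in order]
```

For instance, with service rates (1, 1, 2), local estimates (5, 2, 7) and all arrivals sent to the second server, the function returns [0.75, -0.25, -0.5]: a positive preference for the shortest estimate and negative preferences for the other two, with the components summing to zero.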
Definition 5 (Tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be tilted if there exists a $k \in \{2, 3, \ldots, N\}$ such that for all t, $\Delta^m_n(t) \ge 0$ for all $n < k$ and $\Delta^m_n(t) \le 0$ for all $n \ge k$.
Definition 6 (δ-tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be δ-tilted if for all t (i) it is a tilted dispatching strategy and (ii) there exists a positive constant δ such that $\Delta^m_1(t) \ge \delta$ and $\Delta^m_N(t) \le -\delta$.
Remark 1. Note that similar definitions were first provided in [29] for the case of a single dispatcher with up-to-date information. Based on these definitions, sufficient conditions were presented for throughput and heavy-traffic optimality. However, these conditions cannot be directly applied to our model due to the following two major challenges. One is that, in our model, each dispatcher only has access to outdated information. The other is that each dispatcher has no idea of the arrivals at the servers coming from other dispatchers, since there is no communication between them. To handle these challenges, we have to develop new techniques.
We end this section by providing intuitions behind the two definitions. To start, it can be seen easily that $\sum_{n=1}^{N} \Delta^m_n(t) = 0$ for all m and t via the definition of dispatching preference. Roughly speaking, a tilted dispatching strategy means that compared to (weighted) random routing (which does not have any preference), the probabilities of choosing servers with shorter local estimates (those ranked before the threshold index k) are increased, and, as a result, the probabilities of choosing servers with longer local estimates are reduced. This is the reason why we call it tilted, since more preference is given to queues with shorter local estimates. Therefore, a tilted dispatching strategy can be viewed as a strategy that is at least as 'good' as (weighted) random routing. On the other hand, a δ-tilted dispatching strategy can be viewed as a strategy that is strictly better than (weighted) random routing. The reason is that, besides being tilted, it also requires a strictly positive preference for the server with the shortest local estimate.
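The two definitions can be checked mechanically on a given preference vector. The following sketch is an illustration under stated assumptions: the function names and the numerical tolerance are hypothetical, and the split is read as non-negative entries for ranks before k and non-positive entries from rank k onward, which is the reading under which always joining the shortest local estimate qualifies as tilted. It tests a single preference vector, whereas the definitions require the property for all t.

```python
def is_tilted(delta, tol=1e-12):
    """Check the tilted property for one preference vector (1-indexed ranks)."""
    n_servers = len(delta)
    for k in range(2, n_servers + 1):
        head_ok = all(d >= -tol for d in delta[:k - 1])  # ranks 1 .. k-1
        tail_ok = all(d <= tol for d in delta[k - 1:])   # ranks k .. N
        if head_ok and tail_ok:
            return True
    return False


def is_delta_tilted(delta, delta_min, tol=1e-12):
    """Tilted, plus preference at least delta_min for the shortest estimate
    and at most -delta_min for the longest one."""
    return (is_tilted(delta, tol)
            and delta[0] >= delta_min - tol
            and delta[-1] <= -delta_min + tol)
```

Weighted random routing (all entries zero) passes `is_tilted` but fails `is_delta_tilted` for any positive `delta_min`, matching the intuition that it is the boundary case with no preference.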

Main Results
In this section, we first present the sufficient conditions for LED policies to be throughput optimal and heavy-traffic delay optimal. Then, we explore several example policies within the LED framework to demonstrate its flexibility in designing new load balancing schemes.

Sufficient Conditions
Let us begin with the sufficient conditions for LED policies to be throughput optimal. In particular, we specify conditions for the dispatching strategy and update strategy that guarantee throughput optimality.
To state the theorem, we need the following notation. Let $I^m_n(t)$ be an indicator function which equals 1 if and only if the local estimate of server n's queue length at dispatcher m gets updated, i.e., the estimated queue length $\tilde{Q}^m_n(t)$ is set to the actual queue length $Q_n(t)$ at the end of time-slot t.
Theorem 1. Consider an LED policy. Suppose the dispatching strategy at each dispatcher is tilted and the update strategy can guarantee the condition that there exists a positive constant p such that
$$P\left(I^m_n(t) = 1 \mid Z(t) = Z\right) \ge p$$
holds for all Z and $(m, n, t) \in \mathcal{M} \times \mathcal{N} \times \mathbb{N}$, where $\mathcal{N} = \{1, 2, \ldots, N\}$. Then, this policy is throughput optimal, i.e., the system under this policy is positive recurrent with all the moments being bounded for any $\epsilon > 0$.
Proof. See Section 5.1.
Note that this theorem directly implies that LSQ is not only strongly stable but also enables the system to have all the moments bounded in steady-state. Moreover, it suggests that any dispatching strategy that is as good as (weighted) random routing is sufficient to guarantee throughput optimality. Further, the update probability can be a function of the traffic load. Now, we turn to presenting the sufficient conditions for LED policies to be delay optimal in heavy traffic. In order to achieve delay optimality, we need stronger conditions on both the dispatching strategy and the update strategy.
Theorem 2. Consider an LED policy. Suppose the dispatching strategy at each dispatcher is δ-tilted with a uniform lower bound δ > 0 being independent of $\epsilon$. Suppose the update strategy can guarantee that there exists a positive constant p (independent of $\epsilon$) such that
$$P\left(I^m_n(t) = 1 \mid Z(t) = Z\right) \ge p$$
holds for all Z and $(m, n, t) \in \mathcal{M} \times \mathcal{N} \times \mathbb{N}$, where $\mathcal{N} = \{1, 2, \ldots, N\}$. Then, this policy is heavy-traffic delay optimal.
Proof. See Section 5.2.
This theorem not only establishes a delay performance guarantee for many previous local-memory based policies (e.g., LSQ in [1], low-message policies in [2,20]), but also provides us with the flexibility to design new delay optimal load balancing policies for different scenarios with heterogeneous servers and multiple dispatchers, as discussed in the next section. More importantly, our results directly suggest that it is possible to use only delayed information to achieve delay optimality, which resolves one of the open problems listed in [8].
High-level proof idea. We end this section by providing drift-based intuitions behind the technical proofs. In particular, let us consider two queueing systems: a local-estimation system and the actual system (i.e., queue lengths at servers). Throughput optimality requires the actual system to have a drift towards the origin. First, by definition, a tilted dispatching strategy provides an equivalent drift on the local-estimation system that is towards the origin. Then, the condition on the update strategy guarantees that the local-estimation system is not too far away from the actual system. Hence, the actual system also has a drift towards the origin. Heavy-traffic delay optimality not only requires a drift towards the origin, but also needs a drift towards the line on which all the queue lengths are equal. First, by the definition of a δ-tilted dispatching strategy, there is a drift towards the line on which all the local estimates are equal within a given dispatcher. Then, by the condition on the update strategy, the drift on the local-estimation system can be transferred to a drift on the actual system, and hence delay optimality follows. Note that, in the current proof, in order to make this 'drift-transfer' process valid, we impose the condition that both δ and p are independent of $\epsilon$. This condition is not necessarily required: both of them could possibly be particular functions of $\epsilon$, as in [26]. This relaxation could be an interesting future research direction.

Examples
To illustrate the applications of Theorems 1 and 2, in this section, we introduce examples of LED policies that are both throughput optimal and heavy-traffic delay optimal. The flexibility provided by our sufficient conditions not only allows us to include previous policies as special cases, but enables us to design new flexible policies.
Let us first introduce some typical δ-tilted dispatching strategies.
Example 1 (Local-Join-Shortest-Queue (L-JSQ)). At the beginning of each time-slot t, the dispatcher forwards its arrivals to the server with the shortest local estimate, with ties broken arbitrarily. That is, for dispatcher m, the chosen server is $i^* \in \arg\min_n \{\tilde{Q}^m_n(t)\}$. This dispatching strategy is the same as that in the LSQ policy in [1]. By the definition of dispatching preference, we can see that under L-JSQ, $\Delta^m_1(t) = 1 - \mu_{\sigma_t(1)}/\mu_\Sigma > 0$ and $\Delta^m_n(t) = -\mu_{\sigma_t(n)}/\mu_\Sigma < 0$ for all $2 \le n \le N$. Hence, it is δ-tilted even for heterogeneous servers with $\delta = \mu_{\min}/\mu_\Sigma$, where $\mu_{\min} = \min_n \mu_n$.
Instead of always joining the server with the shortest local estimate, it is also possible to join a server whose queue length is below a threshold while satisfying the condition of δ-tilted dispatching preference.
Example 2 (Local-Join-Below-Average (L-JBA)). At the beginning of each time-slot t, the dispatcher forwards its arrivals to a randomly chosen server whose local estimate is below or equal to the average local queue length estimate. That is, consider dispatcher m with the average local estimate being $\bar{Q}^m(t) = \frac{1}{N} \sum_{n=1}^{N} \tilde{Q}^m_n(t)$. Let $\mathcal{A} \triangleq \{n : \tilde{Q}^m_n(t) \le \bar{Q}^m(t)\}$. Then, for each $i \in \mathcal{A}$, $P^m_i(t) = \mu_i / \sum_{n \in \mathcal{A}} \mu_n$, and for $i \notin \mathcal{A}$, $P^m_i(t) = 0$. It can be easily shown from the definition that L-JBA is also δ-tilted. Note that, compared to L-JSQ, in the heterogeneous case, it needs the dispatcher to know the service rate of each server, which can be easily obtained by the update strategies introduced next. This strategy is more flexible than L-JSQ since it does not require new arrivals to be only sent to the server with the shortest local estimate, which could be used in the scenarios with data locality. Moreover, some randomness in the dispatching strategy is also useful, as discussed in the next section.
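A minimal sketch of the L-JSQ dispatching decision at one dispatcher follows. The function name and the `rng` parameter are illustrative assumptions; ties are broken uniformly at random here, which is one admissible instance of breaking ties arbitrarily.

```python
import random


def l_jsq(estimates, rng=random):
    """Pick the server with the shortest local estimate.

    estimates -- estimates[n] is this dispatcher's local estimate for server n
    rng       -- source of randomness used only to break ties
    """
    shortest = min(estimates)
    candidates = [n for n, q in enumerate(estimates) if q == shortest]
    return rng.choice(candidates)
```

The decision reads only the dispatcher's own memory, so no message is exchanged with the servers at dispatch time.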
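The L-JBA routing probabilities can be sketched as follows. The function name and list-based inputs are assumptions made for illustration; the computation follows the rule in Example 2, namely service-rate-weighted random routing restricted to servers whose local estimate does not exceed the average.

```python
def l_jba_probabilities(estimates, mu):
    """Return the routing probabilities of L-JBA for one dispatcher.

    estimates -- estimates[n] is the local estimate for server n
    mu        -- mu[n] is the mean service rate of server n
    """
    average = sum(estimates) / len(estimates)
    # Servers whose local estimate is at or below the average.
    below = [n for n, q in enumerate(estimates) if q <= average]
    mu_below = sum(mu[n] for n in below)
    probs = [0.0] * len(estimates)
    for n in below:
        probs[n] = mu[n] / mu_below
    return probs
```

The candidate set is never empty, since the smallest local estimate cannot exceed the average. A server can then be drawn with, for example, `random.choices(range(len(probs)), weights=probs)`.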
Further, it is possible to generalize many previous heavy-traffic delay optimal policies into the LED framework. For example, we can directly apply the Power-of-d policy as our dispatching strategy.
Example 3 (Local-Power-of-d (L-Pod)). At the beginning of each time-slot t, the dispatcher randomly chooses d ≥ 2 servers and sends arrivals to the server that has the shortest local estimation among the d servers.
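The sketch below illustrates L-Pod and the probability that the server with the kth shortest local estimate is selected. It assumes that the d servers are sampled uniformly without replacement and, for the closed-form probabilities, that the local estimates are distinct; both function names are hypothetical.

```python
import random
from math import comb


def l_pod(estimates, d, rng=random):
    """Sample d distinct servers and return the one with the shortest estimate."""
    sampled = rng.sample(range(len(estimates)), d)
    return min(sampled, key=lambda n: estimates[n])


def l_pod_rank_probabilities(n_servers, d):
    """Probability that the server ranked k (k = 1..N) is chosen.

    Rank k is chosen when it is sampled together with d - 1 servers that all
    have longer estimates, i.e. C(N - k, d - 1) out of C(N, d) samples.
    """
    total = comb(n_servers, d)
    return [comb(n_servers - k, d - 1) / total for k in range(1, n_servers + 1)]
```

Under these assumptions the shortest-ranked server is chosen with probability d/N and the longest-ranked server is never chosen when d ≥ 2. Subtracting the uniform probability 1/N for homogeneous servers gives the first and last components of the dispatching preference.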
It can be easily shown that L-Pod is tilted for homogeneous servers. Moreover, for a given m, we have $\Delta^m_1(t) = \frac{d-1}{N}$ and $\Delta^m_N(t) = -\frac{1}{N}$, and hence it is δ-tilted with $\delta = \frac{1}{N}$.
Now, let us turn to discussing update strategies that satisfy the condition in Theorem 2. In particular, the update strategy can either be push-based (dispatcher samples servers) or pull-based (servers report to dispatchers). It has been shown in [1] that even for d = 1, the push-update strategy is guaranteed to satisfy the condition in Theorem 2.
Definition 8 (Pull-Update). At the end of each time-slot, for each server n, if there are completed tasks, then the server will pick a dispatcher m uniformly at random and then abide by one of the following two rules:
• If the server becomes idle (i.e., no tasks), it sends (n, 0) to dispatcher m.
• If not, it sends $(n, Q_n)$ to dispatcher m with probability $\bar{p}$.
It has been shown in [1] that for any $\bar{p} > 0$, the pull-update strategy is guaranteed to satisfy the condition in Theorem 2.
Now, having introduced both the dispatching strategy and the update strategy, we can combine them to obtain different LED policies that are delay optimal in heavy traffic. For example, we have L-JSQ-Push, L-JSQ-Pull, L-JBA-Push, L-JBA-Pull for heterogeneous servers, as well as L-Pod-Push and L-Pod-Pull for homogeneous servers.
We end this section by summarizing the contributions of the LED framework. (i) It covers previous policies. L-JSQ-Push (with $\bar{p} = 1$) and L-JSQ-Pull are the same as the LSQ policies considered in [1], which include the policies developed in both [2] and [20] as special cases. Thus, by Theorems 1 and 2, all these policies are throughput and heavy-traffic delay optimal. (ii) It allows randomness in dispatching. The randomness introduced in L-JBA and L-Pod is helpful when dealing with the scenario with an extremely low budget on the message overhead, as discussed next. (iii) It enables trade-offs between memory and message overhead. That is, if each dispatcher directly uses the traditional Power-of-d without any memory, then at least 4 messages per arrival are needed to guarantee delay optimality in heavy traffic. In contrast, in both L-Pod-Push and L-Pod-Pull, the worst-case message overhead is just 1 per arrival. In addition, the message overhead can be further reduced by choosing a smaller value of $\bar{p}$ in the update strategy.
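The pull-update rule can be sketched as below. This is an illustration under stated assumptions: the function name, the nested-list layout `local_estimates[m][n]`, the `completed` vector and the name `p_bar` for the reporting probability are all hypothetical, and the function is meant to be called once at the end of each time-slot.

```python
import random


def pull_update(queues, completed, local_estimates, p_bar, rng=random):
    """Apply one round of pull-based updates; return the number of messages.

    queues          -- queues[n] is the actual queue length at the end of the slot
    completed       -- completed[n] is the number of tasks server n finished
    local_estimates -- local_estimates[m][n], modified in place
    p_bar           -- reporting probability for non-idle servers
    """
    messages = 0
    n_dispatchers = len(local_estimates)
    for n, q in enumerate(queues):
        if completed[n] == 0:
            continue  # only servers that completed tasks may report
        m = rng.randrange(n_dispatchers)  # dispatcher picked uniformly at random
        # An idle server always reports; otherwise report with probability p_bar.
        if q == 0 or rng.random() < p_bar:
            local_estimates[m][n] = q
            messages += 1
    return messages
```

Each server sends at most one message per slot, and only after completing work, which is why the message overhead stays bounded by the number of completed tasks.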

Discussion
Before moving to the proofs, we would like to discuss key features and insights about LED, and point out possible refinements on LED.

Key features of LED
In this section, we highlight the key features of the LED framework, including low message overhead, zero dispatching delay, low computational complexity and appealing performance across various loads.
Low message overhead. It should be noted that the communication overhead occurs only during the update phase in LED policies. For the push-update strategy, the number of messages per arrival is at most 2d (d can even be one). For the pull-update strategy, the number of messages per arrival is at most 1. In contrast, JSQ needs 2N messages per arrival and Power-of-d needs at least 4 messages per arrival. Although JIQ has a worst-case message overhead comparable to that of LED policies, it is not stable for heterogeneous servers.
Zero dispatching delay. Another key feature of all LED policies is that there is zero dispatching delay. That is, the dispatcher can immediately route its new arrivals to the chosen server since the decision is made purely based on its local estimations. Moreover, the communication between dispatchers and servers happens only after the decision is made. This is in contrast to typical push-based policies like JSQ and Power-of-d, under which the dispatcher has to wait for the response of sampled servers to make its dispatching decision, resulting in a non-zero dispatching delay.
Low computational complexity. In order to implement LED policies, each dispatcher has to keep an array of size N for its local estimates. Such a space requirement is negligible in a modern cluster. Further, the operations required by dispatching strategies of LED policies are very efficient. For example, in order to find the server with the minimal local estimate in L-JSQ, we can keep the array in a min-heap data structure. For L-JBA, we can calculate the average by using an efficient running average algorithm. The simple L-Pod only needs random number generators.
Appealing performance across loads. Although the theoretical delay optimality for the LED framework holds in the heavy-traffic asymptotic regime, the family of LED policies includes efficient policies that significantly outperform alternative low-message overhead policies with the same (or even smaller) amount of communication. For example, if the dispatching strategy adopts L-JSQ in LED, then it reduces to the LSQ policy proposed in [1], which appears to enjoy good performance over a wide range of traffic loads in different scenarios, as shown there via extensive simulations.
As mentioned earlier, the class of heavy-traffic delay optimal LED policies is broad and includes flexible choices of different dispatching and update strategies based on different application scenarios. The actual delay performance (except in the heavy-load scenario) varies with the particular choice of dispatching strategy or update strategy under different scenarios. Thus, it is not possible to pick one particular LED policy that fits every circumstance, and doing so is not the focus of this paper. Instead, we present some insights about the LED framework in the following, which could serve as guidance on the choice or design of new LED policies.

Useful insights from LED
The main trait of the LED framework is that only local, possibly delayed and inaccurate information, is used for making the dispatching decision. In the following, we present two useful insights about the use of inaccurate delayed information for load balancing.
Inaccurate information can improve performance. A big problem for load balancing with multiple dispatchers is herd behavior, which means that arrivals at different dispatchers join the same server. This often leads to a poor delay performance in practice [18]. For example, JSQ used in the case of multiple dispatchers leads to a serious herd behavior since all the dispatchers will route arrivals to the single shortest queue. In contrast, under the LED framework, each dispatcher may believe that a different queue is the shortest according to its own local estimates because these estimates are inaccurate and delayed. Thus, jobs at different dispatchers are sent to different queues that may not have the actual shortest length but still have relatively small queue lengths. This intuition is illustrated by Fig. 1. In particular, we consider a set up with 10 dispatchers and 100 heterogeneous servers. All the LED policies are configured to have the same average message overheads as Power-of-2. It can be seen that the LED policies are not only stable but achieve a much better performance compared to JSQ, which suffers from the herd behavior in the multiple-dispatcher case.
Randomness is useful for heavily-delayed information. As mentioned earlier, the LED framework makes it possible to explore load balancing with extremely low message overhead by choosing a small value of p in the update strategy. As a result, the local information at each dispatcher is updated only after a long time interval. In this case, a deterministic dispatching strategy (e.g., L-JSQ) would again incur herd behavior (even with a single dispatcher), since all the arrivals during the long update interval join the same queue. This is another motivation for considering L-JBA and L-Pod, which naturally introduce a certain level of randomness and hence help avoid herd behavior, as suggested by [11].
To illustrate this insight, we consider a setup with 10 dispatchers and 100 homogeneous servers. We compare the delay performance of L-JSQ-Push, L-Pod-Push and L-JBA-Push with the update probability set to p = 0.01 and d = 2. As shown in Fig. 2, both L-JBA-Push and L-Pod-Push outperform L-JSQ-Push, which suffers from herd behavior because of heavily-delayed information.
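The difference between a deterministic and a randomized dispatching rule between two updates can be sketched as follows. This is an illustration under simplifying assumptions, not the experiment of Fig. 2: a single dispatcher holds a hypothetical estimate vector that stays frozen for the whole interval, `l_jsq` and `l_pod` are our own minimal stand-ins for the L-JSQ and L-Pod dispatching rules, and the seed and arrival count are arbitrary.

```python
import random
from collections import Counter


def l_jsq(estimates, rng):
    """Deterministic rule: always join the shortest locally estimated queue."""
    return min(range(len(estimates)), key=lambda n: estimates[n])


def l_pod(estimates, rng, d=2):
    """Randomized rule: sample d distinct servers, join the shortest estimate among them."""
    sampled = rng.sample(range(len(estimates)), d)
    return min(sampled, key=lambda n: estimates[n])


def route_between_updates(policy, estimates, num_arrivals, seed=0):
    """Route arrivals while the estimate vector stays frozen (assumed: no update arrives)."""
    rng = random.Random(seed)
    return Counter(policy(estimates, rng) for _ in range(num_arrivals))


# Hypothetical stale estimates for 10 servers (all distinct, for simplicity).
stale = [5, 3, 8, 1, 7, 2, 9, 4, 6, 10]

jsq_counts = route_between_updates(l_jsq, stale, 200)
pod_counts = route_between_updates(l_pod, stale, 200)
print("L-JSQ:", dict(jsq_counts))
print("L-Pod:", dict(pod_counts))
```

Under these assumptions the deterministic rule sends every arrival in the interval to the one server with the smallest estimate, while the power-of-d rule spreads arrivals over several servers and never selects the server with the largest estimate.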

Refinements on LED
Our main results suggest that there is a large class of heavy-traffic delay optimal LED policies. On the one hand, this provides the flexibility to tailor the policy design to different application scenarios through different choices of dispatching and update strategies. On the other hand, it also suggests the need for refinements on LED beyond delay optimality in heavy traffic. To this end, we introduce two possible directions for refinements.
Degree of queue imbalance. As introduced in [28], the degree of queue imbalance is a refined metric to further distinguish heavy-traffic delay optimal policies. The idea is that, instead of looking at the average queue length (and hence average delay), the degree of queue imbalance measures the expected difference in queue lengths among the servers. By following the proof of Proposition 5.6 in [28], we can establish that the degree of queue imbalance of all heavy-traffic delay optimal LED policies is $O\left(\frac{1}{\delta^2 p^4}\right)$. Thus, even though by Theorem 2 any positive δ and p are sufficient for delay optimality in heavy traffic, a dispatching strategy with a smaller δ or an update strategy with a smaller p could affect the performance in practice.
Other asymptotic regimes. In this paper, we focus on the heavy-traffic asymptotic regime where the number of servers is fixed and the load approaches one. As mentioned before, there are also other asymptotic regimes in the analysis of load balancing schemes. One possible direction is to extend the fluid-limit techniques for the large-system regime in [20] to the case of multiple dispatchers and heterogeneous servers. Another regime is the many-server heavy-traffic regime (e.g., the Halfin-Whitt regime), which strikes a balance between the heavy-traffic regime and the large-system regime. Studying LED in such a regime is another interesting direction for future work.
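To get a feel for how strongly the imbalance bound depends on p compared with δ, the following sketch evaluates only the scaling factor 1/(δ² p⁴), taking the bound above to have that form. The constant hidden in the O(·) is unknown and omitted, so only ratios are meaningful; the function name and the parameter values are ours.

```python
def imbalance_scale(delta, p):
    """Scaling factor 1 / (delta^2 * p^4); the O(.) constant is deliberately omitted."""
    return 1.0 / (delta ** 2 * p ** 4)


# Halving the update probability p inflates the factor by 2^4 = 16.
ratio_p = imbalance_scale(0.5, 0.01) / imbalance_scale(0.5, 0.02)

# Halving the tilt parameter delta inflates it by only 2^2 = 4.
ratio_delta = imbalance_scale(0.25, 0.1) / imbalance_scale(0.5, 0.1)

print(ratio_p, ratio_delta)
```

This suggests that, as far as this bound is concerned, saving messages by lowering p is considerably more costly than using a less aggressively tilted dispatching strategy.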

Proofs
In this paper, we extend the Lyapunov drift-based approach developed in [4] to allow for unbounded supports of the arrival and service processes. In particular, we replace the finiteness condition on the drift in [4] by a stochastic dominance condition, stated as (C2) in Lemma 2. As proved in [7], this weaker condition, combined with a negative drift condition, still guarantees finite moment bounds. Besides weakening the condition, we also replace the one-step drift with a T-step drift. Formally, we use the following lemma to derive bounded moments in steady state.

Lemma 2.
Consider an irreducible, aperiodic and positive recurrent Markov chain $\{X(t), t \ge 0\}$ over a countable state space $\mathcal{X}$, which converges in distribution to $\overline{X}$, and suppose $V : \mathcal{X} \to \mathbb{R}_+$ is a Lyapunov function. We define the T time slot drift of V at X as
$$\Delta V(X) \triangleq \left[V(X(t_0 + T)) - V(X(t_0))\right] I(X(t_0) = X),$$
where I(.) is the indicator function. Suppose for some positive finite integer T, the T time slot drift of V satisfies the following conditions:
• (C1) There exists an $\eta > 0$ and a $\kappa < \infty$ such that for any $t_0 = 1, 2, \ldots$ and for all $X \in \mathcal{X}$ with $V(X) \ge \kappa$, $E[\Delta V(X) \mid X(t_0) = X] \le -\eta$.
• (C2) $|\Delta V(X)| \prec W$ for all $t_0$ and all $X \in \mathcal{X}$, and $E[e^{\theta W}] = D$ is finite for some $\theta > 0$.
Then $\{V(X(t)), t \ge 0\}$ converges in distribution to a random variable $\overline{V}$ for which there exists a $\theta^* > 0$ and a $C^* < \infty$ such that $E[e^{\theta^* \overline{V}}] \le C^*$, which directly implies that all the moments of $\overline{V}$ exist and are finite.
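As an illustration of what a negative T-step drift condition such as (C1) looks like, the sketch below computes the exact T-step drift of V(x) = x for a toy single queue. This toy chain is our own assumption and is not the chain Z(t) analyzed in the paper: arrivals and services are Bernoulli with hypothetical rates `lam` and `mu`, and the queue evolves as max(x + A − S, 0).

```python
def t_step_drift(x, T, lam, mu):
    """Exact E[V(X(T)) - V(X(0)) | X(0) = x] for V(x) = x on a toy Bernoulli queue."""
    dist = {x: 1.0}  # distribution of the queue length, started at x
    for _ in range(T):
        nxt = {}
        for q, pr in dist.items():
            for a, pa in ((1, lam), (0, 1.0 - lam)):      # Bernoulli arrival
                for s, ps in ((1, mu), (0, 1.0 - mu)):    # Bernoulli offered service
                    q_new = max(q + a - s, 0)
                    nxt[q_new] = nxt.get(q_new, 0.0) + pr * pa * ps
        dist = nxt
    return sum(q * pr for q, pr in dist.items()) - x


# Far from the boundary (x >= T) the drift is T * (lam - mu) < 0 when lam < mu.
drift_far = t_step_drift(10, 5, 0.3, 0.5)

# At the boundary the drift can be positive: here it equals lam * (1 - mu).
drift_at_zero = t_step_drift(0, 1, 0.3, 0.5)

print(drift_far, drift_at_zero)
```

In this toy case a negative drift holds for all states with V(x) ≥ T, with η = T(µ − λ), and the drift is bounded in absolute value by T, so a bounded dominating variable exists trivially. The paper's setting needs the stochastic dominance version because arrivals and services may be unbounded.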

Proof of Theorem 1
To start with, let us first show that the Markov chain $\{Z(t) = (Q(t), m(t)), t \ge 0\}$ with $m(t) \triangleq (\widetilde{Q}^1(t), \widetilde{Q}^2(t), \ldots, \widetilde{Q}^M(t))$ is irreducible and aperiodic. Let the initial state be $Z(0) = (Q(0), m(0)) = (0_{1 \times N}, 0_{1 \times MN})$ and let the state space $\mathcal{Z}$ consist of all the states that can be reached from the initial state. For any state Z, the queue length vector Q can reach the initial state with a positive probability, since the event that there are no exogenous arrivals and all the offered service is at least one during each time-slot happens with positive probability under our assumptions. Moreover, under the condition for the update strategy given by Eq. (5), the event that Q remains at the initial state while all $\widetilde{Q}^m$ reach the initial state happens with a positive probability. Therefore, any state in the state space can reach the initial state, and hence the Markov chain is irreducible. The aperiodicity of the Markov chain comes from the fact that the transition probability from the initial state to itself is positive.
In order to show positive recurrence, we adopt the Foster-Lyapunov theorem. In particular, we consider the following Lyapunov function W (Z(t)) = Q(t) , and in the rest of the proof we use W(t) as an abbreviation of W(Z(t)). Let $X_n^m(t) \triangleq |Q_n(t) - \widetilde{Q}_n^m(t)|$. The conditional mean drift of W(t), defined as $D(Z(t_0)) \triangleq E[W(t_0 + T) - W(t_0) \mid Z(t_0)]$, can be decomposed as follows
where
Let us first consider the term $D_{X_n^m}(t_0)$. Note that for all $t_0$, m and n E [X m
where (a) follows from the condition in Eq. (5) and $\mu_{\max} = \max_n \mu_n$. Then, we have (the time reference $t_0$ is dropped for simplicity)
where (a) follows from Eq. (8).
Let us turn to consider the term $D_Q(t_0)$. By the queue dynamics in Eq. (2),
where (a) follows from the facts that $Q_n(t) + A_n(t) - S_n(t) + U_n(t) = \max(Q_n(t) + A_n(t) - S_n(t), 0)$ for any $t \ge 0$, and $(\max(a, 0))^2 \le a^2$ for any $a \in \mathbb{R}$; (b) holds by our assumption of light-tailed distributions for the total arrival process and each service process in Eq. (3). In particular, the second moments of the total arrival process and of the service process of each server are finite, and hence there exists a finite upper bound K that is independent of the load parameter $\epsilon$.
Now, let us continue to work on Eq. (10). In particular, we have
For Eq. (11), we have
where (a) follows from the definition of $\beta_n^m(t)$. Then, it can be further simplified as follows.
where in (a), $p_m$ is the probability that arrivals are allocated to dispatcher m (equivalently, the fraction of the total arrivals that are allocated to dispatcher m).
Combining Eqs. (11), (12) and (13), yields
We are going to handle each term one by one. To upper bound the term $T_1$, we use the following result on $X_n^m(t) = |Q_n(t) - \widetilde{Q}_n^m(t)|$.
Lemma 3. Under the condition given by Eq. (5), for any $t_0$ and $Z(t_0)$, there exists a finite $T_1$ independent of $\epsilon$ and a finite constant L that is only a function of p and $\mu_\Sigma$, such that for all $T \ge T_1$
holds for all m and n.
Proof. See Appendix A.
By using Lemma 3 with $T \ge T_1$, we have
For $T_2$, we have
where (a) comes from the definition of the dispatching preference vector $\Delta^m(t)$; (b) holds since dispatching preference is tilted and Q m σt (1)
where $\mu_{\min} = \min_n \mu_n$. Now, combining Eqs. (14), (15) and (16), yields
Substituting the result above back into Eq. (10), yields
Now, we are ready to substitute Eq. (9) and Eq. (17) back into Eq. (7). As a result, we have
where in (a) $\xi = \min\left(\frac{2\epsilon\mu_{\min}}{\mu_\Sigma}, p\right)$ and $K_1 \triangleq 2\lambda_\Sigma M N L T + K T + \lambda_\Sigma + \mu_{\max}$. Pick any $\alpha > 0$ and let
Then, B is a finite subset. For any $Z \in B^c$, $D(Z) \le -\alpha$, and for any $Z \in B$, $D(Z) \le K_1$. By the Foster-Lyapunov theorem, we have established positive recurrence.
Having shown that the Markov chain $\{Z(t), t \ge 0\}$ is ergodic, we are left with the task of showing that all the moments are finite in steady state. In order to do so, we use Lemma 2. In particular, we choose the Lyapunov function as $V(Z^{(\epsilon)}) = \|Q^{(\epsilon)}\|$ and then verify the two conditions. In the following, the superscript $(\epsilon)$ is omitted for ease of notation. To verify condition (C2), we have
where (a) holds since $|\|x\| - \|y\|| \le \|x - y\|$ for each x, y in $\mathbb{R}^N$; (b) follows from the triangle inequality and the fact that $U_n(t) \le S_n(t)$ for all t and n. Then, by our assumptions of light-tailed distributions for both total arrival and service processes, there exists a random variable W such that $|\Delta V(X)| \prec W$ for all $t_0$ and all $X \in \mathcal{X}$, and $E[e^{\theta W}] = D$ is finite for some $\theta > 0$, which verifies (C2).
For (C1), we have
where (a) follows from the fact that $f(x) = \sqrt{x}$ is concave; (b) comes from Eq. (17). Thus, condition (C1) is valid and hence the proof of Theorem 1 is complete.
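The bound on the queue-length drift in the proof above relies on two elementary facts about the queue update: adding the unused service turns the update into a max with zero, and the square of that max is at most the square of its argument. The following sketch checks both on small integer inputs; `queue_step` is a hypothetical helper for a single server, with unused service defined as the amount of offered service that finds no work.

```python
def queue_step(q, a, s):
    """One slot of a single queue: returns (next queue length, unused service)."""
    q_next = max(q + a - s, 0)
    u = q_next - (q + a - s)  # unused service, so that q + a - s + u == q_next
    return q_next, u


# Exhaustive check over a small grid of queue lengths, arrivals and services.
all_ok = True
for q in range(6):
    for a in range(6):
        for s in range(6):
            q_next, u = queue_step(q, a, s)
            all_ok &= q + a - s + u == max(q + a - s, 0)   # queue update identity
            all_ok &= q_next ** 2 <= (q + a - s) ** 2      # (max(x, 0))^2 <= x^2
            all_ok &= 0 <= u <= s                          # unused service <= offered service
            all_ok &= q_next * u == 0                      # service is unused only if the queue empties

print(all_ok)
```

The last property checked, that unused service is positive only when the next queue length is zero, is not needed for the drift bound here but is the usual reason unused service terms can be controlled in heavy traffic.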

Proof of Theorem 2
In order to prove the result, we need two intermediate results. The first is state-space collapse, as stated in Proposition 1, which is the key ingredient for establishing heavy-traffic delay optimality. Roughly speaking, it means that the multi-dimensional space for the queue length vector reduces to one dimension, in the sense that the deviation from the line (on which all the queue lengths are equal) is bounded by a constant independent of $\epsilon$. The second intermediate result is concerned with unused service. Based on these two intermediate results, we can prove heavy-traffic delay optimality. We omit the time reference $t_0$ for simplicity when necessary.
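The geometric object behind state-space collapse is the decomposition of the queue length vector into its projection onto the line of equal queue lengths and the perpendicular remainder. The sketch below computes that decomposition for a hypothetical queue vector; the helper names are ours, and the code only illustrates the definition, not the collapse result itself.

```python
import math


def decompose(q):
    """Split q into its projection onto the all-equal line and the perpendicular part."""
    avg = sum(q) / len(q)
    q_par = [avg] * len(q)          # projection: every component equals the average
    q_perp = [x - avg for x in q]   # deviation of each queue from the average
    return q_par, q_perp


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


q = [3, 1, 2]  # hypothetical queue lengths
q_par, q_perp = decompose(q)

# The perpendicular part sums to zero and is controlled by the max-min gap.
gap_bound = math.sqrt(len(q)) * (max(q) - min(q))
print(q_par, q_perp, norm(q_perp), gap_bound)
```

State-space collapse says that, in steady state, the moments of the norm of the perpendicular part stay bounded by constants that do not grow as the load approaches one, even though the total queue length does.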
where the inequality (a) follows from the fact that $|\|x\| - \|y\|| \le \|x - y\|$ holds for any $x, y \in \mathbb{R}^N$; inequality (b) follows from the triangle inequality; (c) holds due to the non-expansive property of projection onto a convex set; (d) follows from Eq. (18). Then, by our assumptions of light-tailed distributions for both total arrival and service processes, there exists a random variable W such that $|\Delta V_\perp(X)| \prec W$ for all $t_0$ and all $X \in \mathcal{X}$, and $E[e^{\theta W}] = D$ is finite for some $\theta > 0$, which verifies (C2).
Let us turn to condition (C1). By the proof of Lemma 3.6 in [29], it suffices to establish the following result in order to verify (C1). That is, there exist $T > 0$, $K_2 \ge 0$ and $\eta > 0$ that are all independent of $\epsilon$, such that for all $t_0$ and $Z \in \mathcal{Z}$
holds for all $\epsilon \in (0, \epsilon_0)$. Note that
where (a) follows from the tower property of conditional expectation and the fact that A(t) is independent of $Z(t_0)$ given Z(t). Moreover, $Q_{\perp,n}(t)$ denotes the nth component of the vector $Q_\perp(t)$. Now let us first focus on Eq. (21).
Combining the result above with Eq. (22), yields
Note that by definition $Q_{\perp,n}(t) = Q_n(t) - Q_{avg}(t)$, in which $Q_{avg}(t)$ is the average queue length among the N queues at the beginning of time-slot t. Moreover, $Q_{\perp,n}(t)$ can be written as
for all m and t, in which Q m (t)
Our main task now is to upper bound each term above. Let us start with Eq. (26). In particular, we can bound it by using the following result.
Lemma 4. There exist finite positive constants η and C such that
For Eqs. (27) and (28), we can bound both of them by using the result in Lemma 3. In particular, for Eq. (27), we have
For Eq. (28), we have
We have obtained bounds for Eqs. (26), (27) and (28). Let us turn to focus on Eq. (24), which can be upper bounded by the following result.
Lemma 5. For any $t_0$ and Z,
where $K_3$ is a finite constant independent of $\epsilon$.
Proof. See Appendix C.
Now, we are ready to bound the left-hand side of Eq. (20) by using the bounds for both Eq. (23) and Eq. (24). In particular, we have
where (a) follows from $K_2 = C + 2\mu_\Sigma M N L T + K_3$, which is independent of $\epsilon$. Hence, this verifies condition (C1) with $\eta = \frac{\mu_\Sigma \delta p^2}{2\sqrt{N}}$, which is also independent of $\epsilon$. Combined with condition (C2), we have finished the proof of Proposition 1.
Having proved the state-space collapse result, we turn to prove another intermediate result regarding unused service, as stated in the following lemma. In words, this lemma says that in heavy traffic unused service tends to be zero.
To see this, we consider the Lyapunov function $W_1(Z(t)) = \|Q(t)\|_1$. Since LED is throughput optimal with all the moments being finite, the mean drift of $W_1(Z(t))$ in steady state is zero. Then, we have
which directly implies the result in Eq. (32). Now let us fix $n \in \mathcal{N}$. For any $t \ge 0$ and constant $S'$, we have
$$U_n^2(t) \le U_n(t) S_n(t) = U_n(t) S_n(t) I(S_n(t) \le S') + U_n(t) S_n(t) I(S_n(t) > S') \le U_n(t) S' + S_n^2(t) I(S_n(t) > S').$$
In steady state, we have
$$E[U_n^2] \le S' E[U_n] + E[S_n^2(0) I(S_n(0) > S')] \overset{(a)}{\le} \epsilon S' + E[S_n^2(0) I(S_n(0) > S')] \overset{(b)}{\le} \epsilon S' + \beta,$$
where (a) follows from the fact that $E[\|U^{(\epsilon)}\|_1] = \epsilon$ and the service process is i.i.d.; in (b), we choose $S'$ such that $E[S_n^2(0) I(S_n(0) > S')] \le \beta$, which is possible by the exponential decay rate of $S_n(0)$ under the light-tailed assumption. Thus, we have $\lim_{\epsilon \downarrow 0} E[U_n^2] \le \beta$ for any $\beta > 0$. Hence, we have $\lim_{\epsilon \downarrow 0} E[U_n^2] = 0$ for each n, which directly implies our result.
Now, we are prepared to show that under the conditions in Theorem 2, the system achieves optimal delay in heavy traffic. More specifically, by Lemma 3 in [26], we need only verify the following condition:
$$\lim_{\epsilon \downarrow 0} E\left[\|Q^{(\epsilon)}(t+1)\|_1 \|U^{(\epsilon)}(t)\|_1\right] = 0.$$
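The truncation step used above to bound the second moment of the unused service is a pointwise inequality, valid whenever the unused service does not exceed the offered service. A quick sanity check, under the assumption of non-negative integer values and with a hypothetical helper name:

```python
def truncation_bound_holds(u, s, s_cap):
    """Check u^2 <= u * s_cap + s^2 * 1{s > s_cap} for unused service u <= offered service s."""
    if not 0 <= u <= s:
        raise ValueError("unused service must satisfy 0 <= u <= s")
    lhs = u * u
    rhs = u * s_cap + (s * s if s > s_cap else 0)
    return lhs <= rhs


# Exhaustive check on a small grid of services, unused services and truncation levels.
holds_everywhere = all(
    truncation_bound_holds(u, s, s_cap)
    for s in range(20)
    for u in range(s + 1)
    for s_cap in range(12)
)
print(holds_everywhere)
```

The two cases mirror the proof: when the service is at most the truncation level, u² ≤ u·s ≤ u·S′; when it exceeds the level, u² ≤ s². Taking expectations then leaves a term of order ε·S′ and a tail term that the light-tailed assumption makes arbitrarily small.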
Let us define B . We can bound it as follows.
where the equality (a) comes from the property Q

Conclusion
We have introduced the Local-Estimation-Driven (LED) framework for load balancing policies in possibly heterogeneous systems with multiple dispatchers. Under this framework, each dispatcher keeps local and possibly outdated estimates of the queue lengths for all the servers, and makes its dispatching decision based only on these local estimates. Communication between dispatchers and servers is used only to update the local estimates. We have established sufficient conditions for LED policies to achieve both throughput optimality and delay optimality in heavy traffic. These sufficient conditions not only establish delay optimality for many previous local-memory based policies, but also enable us to tailor the design of new delay optimal policies to different application requirements. The heavy-traffic delay optimality of LED policies also resolves a recent open problem on the development of load balancing schemes that only have access to delayed information. In future work, it will be interesting to investigate the LED framework in other asymptotic regimes, e.g., the large-system regime and the many-server heavy-traffic regime.
where (a) comes from the bound in Claim 1; (b) holds since, given $Z(t_0)$, $Q_{\min}(t_0 + 1)$ is independent of the event $I^m_{\max}(t_0) = 1$, $I^m_{\min}(t_0 + 1) = 1$; (c) holds by the bound in Claim 1 again. Thus, combining the bounds for Eqs. (38) and (39), yields
in which the last inequality follows from the fact that $\|Q_\perp(t_0)\| \le \sqrt{N}(Q_{\max}(t_0) - Q_{\min}(t_0))$ and $M_1 = \mu_\Sigma$ with $\delta \le 1$. Hence, the proof of Lemma 4 is complete.

C Proof of Lemma 5
First, note that by Eq. (19), we have
where (a) follows from Jensen's inequality for concave functions; in (b), $C_2$ is a finite constant independent of $\epsilon$, which holds by our light-tailed assumption. Now, by using the result above, we have
where (a) is true since $\|x\|_1 \le \sqrt{N}\|x\|$ for any $x \in \mathbb{R}^N$. Hence, the proof is complete.