Download and Access Trade-offs in Lagrange Coded Computing

Lagrange Coded Computing (LCC) is a recently proposed technique for resilient, secure, and private computation of arbitrary polynomials in distributed environments. By mapping such computations to composition of polynomials, LCC allows the master node to complete the computation by accessing a minimal number of workers and downloading all of their content, thus providing resiliency to the remaining stragglers. However, in the most common case in which the number of stragglers is less than in the worst case scenario, much of the computational power of the system remains unexploited. To amend this issue, in this paper we expand LCC by studying a fundamental trade-off between download and access, and present two contributions. In the first contribution, it is shown that without any modification to the encoding process, the master can decode the computations by accessing a larger number of nodes, however downloading less information from each node in comparison with LCC (i.e., trading access for download). This scheme relies on decoding a particular polynomial in the ideal that is generated by the polynomials of interest, a technique we call Ideal Decoding. This new scheme also improves LCC in the sense that for systems with adversaries, the overall downloaded bandwidth is smaller than in LCC. In the second contribution we study a real-time model of this trade-off, in which the data from the workers is downloaded sequentially. By clustering nodes of similar delays and encoding the function with Universally Decodable Matrices, the master can decode once sufficient data is downloaded from every cluster, regardless of the internal delays within that cluster. This allows the master to utilize the partial work that is done by stragglers, rather than to ignore it, a feature that most past works in coded computing are lacking.


I. INTRODUCTION
The immensity of contemporary datasets no longer allows computations to be done on a single machine, and distributed computations are inevitable. Since most users cannot afford to maintain a network of servers (or workers), burdensome computations are often outsourced to third party cloud services. However, this approach opens a Pandora's box of resiliency, security, and privacy issues. First, it was demonstrated in the past (e.g., [16]) that a fraction of the servers, referred to as stragglers, can be 5 to 8 times slower than the average, and hence computation tasks that rely on successful completion of all subtasks are destined to be delayed considerably. Second, many computations are highly susceptible to adversaries, or Byzantine workers, that might attempt to alter the result of the computation for their benefit [2]. Third, privacy infringement is a major concern in the information age, and hence privacy-preserving computation protocols are essential.
The term Coded Computing broadly refers to a family of techniques that utilize coding to inject computation redundancy in order to alleviate the various issues that arise in distributed computations. Over the past few years, Coded Computing has seen a tremendous success in providing elegant solutions to the aforementioned issues in various tasks of interest, such as gradient coding (e.g., [5], [6], [10]), matrix multiplication (e.g., [3], [4], [17]), and bandwidth reduction in iterative algorithms (e.g., [7]). More recently, Lagrange Coded Computing (LCC) has been proposed in [18] as a universal data encoding technique that can simultaneously alleviate the issues of resiliency, security, and privacy for arbitrary multivariate polynomial computations, hence expanding coded computing to new domains.
In LCC, the dataset is encoded by evaluations of the well-known Lagrange polynomial, and each codeword symbol is stored on a different worker in the distributed system. Then, the workers apply the multivariate polynomial of interest on their encoded data, as if no coding is taking place, and return the computation results back to the master. By viewing the computation as a composition of a multivariate polynomial (the computation that is to be executed), and a univariate one (the encoding Lagrange polynomial), the task of finalizing the computation in the presence of stragglers and adversaries boils down to polynomial interpolation with errors and erasures. Then, the master finalizes the computation by evaluating the interpolated polynomial at appropriately chosen points. Being fundamental to our current contribution, the LCC scheme is described in greater detail in Subsection II-A.
However, LCC allows no flexibility in terms of download-access trade-off. That is, the master performs the computation by accessing a minimum number of workers, and downloading their data in its entirety. As a result, in every scenario with fewer than the maximum number of stragglers, some non-stragglers remain idle during the download process, and the communication bottleneck intensifies due to unexploited parallel links between these non-stragglers and the master. Moreover, LCC considers every worker as being either a straggler or a non-straggler, and the partial work that is done by stragglers is ignored. In this paper we improve LCC by addressing these aspects, the static and the dynamic, of the download-access trade-off.
In our first contribution, it is shown that with no further changes in the encoding phase, the decoding phase can be flexible in terms of the number of workers that are accessed, and the number of symbols that are downloaded from each of them. This is done by having the server perform extra linear computations; these computations turn multiple low-degree polynomial evaluations (the computation results) into a smaller number of high-degree polynomial evaluations. These high-degree polynomials lie in the ideal that is generated by the lower-degree ones, and hence we term this technique Ideal Decoding. More importantly, the surprising corollary of this part of the paper is that the overall download bandwidth can be reduced for systems with adversaries, when compared to ordinary LCC.
In our second contribution, we consider a dynamic model of the access-download trade-off, where the master has continuous access to all servers, and the data arrives sequentially. By encoding the polynomial itself with Universally Decodable Matrices (UDMs), a previously defined notion, we match the amount of download from each server to the naturally occurring delays in the system. Specifically, we cluster the workers in the system according to the expected computation times, and have workers in the same cluster operate on the same encoded data. By allowing the functions that are applied in each cluster to differ, it is shown that the decoding can be completed once sufficient information has arrived from each cluster, regardless of the internal delays within that cluster.
This paper is structured as follows. Preliminary background on LCC and UDMs is given in Section II, and our contributions are formally stated in Section III. The former extension of LCC is given in Section IV, and the latter in Section V.

II. PRELIMINARIES
We use the standard notation $[N]$ for the set $\{1, \ldots, N\}$, denote the underlying field by $\mathbb{F}$ (footnote 1), and denote the composition operation between polynomials by $\circ$.
We consider a system with a master node and $N$ workers, in which a dataset $X = (x_1, \ldots, x_K)$, with $x_k \in \mathbb{F}^{M \times 1}$ for every $k \in [K]$, is coded as $\tilde{x}_1, \ldots, \tilde{x}_N$, and each codeword symbol $\tilde{x}_n$ is stored in one of the $N$ workers. The master node is interested in the computation results $\{y_k \triangleq f(x_k)\}_{k \in [K]}$, where $f = (f_1, \ldots, f_L)$ is a multivariate polynomial of degree $G$. To achieve this, $f$ is applied by the workers on their stored data, and the results of the computation $\{\tilde{y}_n = f(\tilde{x}_n)\}_{n \in [N]}$ on the codeword symbols are transmitted back to the master. Many tasks which are studied in coded computation fall under this framework, including matrix multiplication, and gradient coding whenever the loss function is a polynomial, or is approximated by one.
For integers $A$ and $S$, a coding scheme is said to be $S$-resilient and $A$-secure if the master is capable of extracting $\{y_k\}_{k \in [K]}$, even if up to $S$ workers fail to respond in a timely manner, and up to $A$ workers reply with purposefully erroneous data. In addition, for an integer $T$ the scheme is called $T$-private if every set of $T$ colluding workers remains information-theoretically oblivious to the content of $X$, i.e., if $I(\{\tilde{x}_t\}_{t \in \mathcal{T}}; X) = 0$ for every $\mathcal{T} \subseteq [N]$ of size at most $T$, where $I$ denotes mutual information, and $X$ is seen as chosen uniformly at random.
In Section V, it is further assumed that the results of the computation on the coded data $\tilde{y}_n = f(\tilde{x}_n) = (f_1(\tilde{x}_n), \ldots, f_L(\tilde{x}_n))$, from every worker $n \in [N]$, arrive at the master sequentially. That is, $f_1(\tilde{x}_n)$ arrives, followed by $f_2(\tilde{x}_n)$, and so on. In addition, we allow the polynomial $f$ itself to be coded, and the encoding can potentially differ from one worker to another. That is, each worker $n \in [N]$ corresponds to $L$ polynomials $h_{n,1}, \ldots, h_{n,L}$, each of which is a linear combination of the polynomials $\{f_\ell\}_{\ell \in [L]}$.

A. Lagrange Coded Computing
Lagrange Coded Computing [18] follows the outline that is described above, and achieves resiliency, security, and privacy guarantees that are known to be optimal in many cases. LCC relies heavily on the Lagrange polynomial, as follows.
Given the data matrix $X$, fix $K$ distinct elements $\beta = (\beta_1, \ldots, \beta_K)$ and additional $N$ distinct elements $\alpha = (\alpha_1, \ldots, \alpha_N)$ in $\mathbb{F}$. By using the well-known Lagrange interpolation formula, define $u = u_{X,\beta}$ as the lowest degree polynomial such that $u(\beta_k) = x_k$ for every $k \in [K]$. Notice that $u$ is in fact a vector of polynomials, but we refer to it as a polynomial for simplicity. It is well known that the degree of $u$ (or more precisely, the degree of every component of $u$) is at most $K - 1$. Then, in the storage phase, the polynomial $u$ is evaluated at $\alpha_n$ and the evaluation is sent to worker $n$, i.e., $\tilde{x}_n = u(\alpha_n)$ for every $n \in [N]$.
In the computation phase, every worker applies the polynomial $f$ on its stored data, and sends the results back to the master. Since $\tilde{x}_n = u(\alpha_n)$, it follows that $f(\tilde{x}_n)$ is an evaluation at $\alpha_n$ of the univariate polynomial $f \circ u$, whose degree is at most $G(K - 1)$. Hence, since $u(\beta_k) = x_k$ for every $k \in [K]$, it follows that the results of the computation $\{y_k\}_{k \in [K]}$ can be obtained by decoding the coefficients of $f \circ u$, and evaluating it at $\beta_1, \ldots, \beta_K$ (see Figure 1).
Footnote 1: Our techniques operate over any large enough finite field or any infinite one.
Moreover, whenever there exists a privacy requirement (i.e., when $T > 0$ and $\mathbb{F}$ is finite), the data matrix $X$ is padded with $T$ random vectors $t_1, \ldots, t_T$, so that $u(\beta_{K+t}) = t_t$ for every $t \in [T]$. Then, the encoding is performed by evaluating $u$ at the points of $\alpha$, and we have the following theorem.
Theorem 1. [18] Lagrange Coded Computing provides an $S$-resilient, $A$-secure, and $T$-private scheme for computing $\{y_k\}_{k \in [K]}$ for any polynomial $f$, as long as $N \ge (K + T - 1)G + S + 2A + 1$.
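The storage, computation, and decoding phases described above can be illustrated by the following minimal sketch. It is not the implementation of [18]: the prime field size, the toy polynomial `f`, the scalar data points ($M = L = 1$), and the helper name `lagrange_eval` are all assumptions made for illustration, and the sketch covers only the straggler case $T = A = 0$.

```python
# Minimal LCC sketch over the prime field GF(P); all parameters are illustrative.
P = 97  # assumed field size: any prime larger than the number of evaluation points

def lagrange_eval(xs, ys, x, p=P):
    """Evaluate at x the lowest-degree polynomial through the points (xs[i], ys[i])."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def f(x, p=P):
    """Toy computation of degree G = 2 on a scalar data point (hypothetical)."""
    return (x * x + 3 * x + 1) % p

K, G, S = 3, 2, 2                        # data points, degree of f, stragglers
N = (K - 1) * G + S + 1                  # bound of Theorem 1 with T = A = 0
data = [5, 11, 42]                       # x_1, ..., x_K
betas = list(range(1, K + 1))            # beta_1, ..., beta_K
alphas = list(range(K + 1, K + N + 1))   # alpha_1, ..., alpha_N

# Storage phase: worker n stores u(alpha_n).
coded = [lagrange_eval(betas, data, a) for a in alphas]
# Computation phase: every worker applies f to its coded symbol.
results = [f(c) for c in coded]
# Decoding: ignore S stragglers; G(K-1)+1 evaluations of f∘u remain,
# which determine f∘u and hence its values at beta_1, ..., beta_K.
alive = list(range(S, N))
xs = [alphas[n] for n in alive]
ys = [results[n] for n in alive]
decoded = [lagrange_eval(xs, ys, b) for b in betas]
```

Here `decoded` coincides with `[f(x) for x in data]` because $f \circ u$ has degree at most $G(K-1) = 4$ and five of its evaluations are available. Handling adversaries would require replacing the plain interpolation with Reed-Solomon error correction, which this sketch omits.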
Remark 1. LCC has additional applications in obtaining another aspect of information-theoretic privacy. In the so-called function-privacy, the identity of the polynomial $f$ should be kept private from sets of colluding servers. This problem, which is also known as Private Computation [13], is a generalization of the well-studied Private Information Retrieval problem, and is studied in [9].

B. Universally Decodable Matrices
Universally Decodable Matrices (UDMs) have been studied in the past for various applications, such as slow-fading channels [14], and decoding of scalar codes in the presence of stragglers [8]. They are tightly connected to various previously defined notions, such as m-codes [11], and their corresponding metric was thoroughly studied in [1].
Definition 1. For integers $L$ and $P$, matrices $B_1, \ldots, B_P \in \mathbb{F}^{L \times L}$ are called UDMs if for all nonnegative integers $n_1, \ldots, n_P$ that sum to $L$, the following matrix is invertible: [...]
For example, the following matrices are UDMs for $L = 4$, $P = 3$, and $\mathbb{F} = GF(2)$: [...] where $I$ is the identity matrix, and $J$ is a matrix whose antidiagonal is all 1's, and the remaining entries are 0.
We focus our attention on the following construction of UDMs, that requires $|\mathbb{F}| \ge P - 1$. [...]
A crucial ingredient of the proof of Theorem 2 is the following proposition, which utilizes the notion of Hasse derivative. For a nonnegative integer $n$, the $n$'th Hasse derivative of a polynomial $\zeta(x) = \sum_{i} \zeta_i x^i$ is $\zeta^{(n)}(x) \triangleq \sum_{i \ge n} \binom{i}{n} \zeta_i x^{i-n}$. [...]
In what follows, we let $H \triangleq G(K + T - 1) + 1$.

III. OUR CONTRIBUTIONS
Theorem 3. In Lagrange Coded Computing, it is possible to complete the computation by downloading $L/R$ symbols from any set of $RH + 2A$ workers, for every rational $R = R_e / R_d > 1$ such that $R_e \mid L$, $R_d \mid H$, and $N \ge RH + 2A$.
It will be clear in the sequel that the requirements $R_e \mid L$ and $R_d \mid H$ are merely for convenience, and can be alleviated at the price of rounding operations. Theorem 3 is proved by using a technique we term Ideal Decoding. In this technique, every server $n \in [N]$ linearly combines the results $\{\tilde{y}_{n,\ell}\}_{\ell \in [L]}$, together with powers of $\alpha_n$, to produce evaluations of certain polynomials $\{g_i\}_{i \in [L/R]}$, which lie in the ideal which is generated by $\{f_\ell \circ u\}_{\ell \in [L]}$ in the ring of univariate polynomials over $\mathbb{F}$. These $g_i$'s are interpolated by the master from their evaluations, and the original $\{f_\ell \circ u\}_{\ell \in [L]}$ are obtained by computing some polynomial combinations of the $g_i$'s. We emphasize that this final polynomial computation can be done by a combination of shifts, additions, and negations of field elements, and does not require polynomial multiplications. Having obtained $\{f_\ell \circ u\}_{\ell \in [L]}$, the master finalizes the computation as in ordinary LCC.
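To make the technique concrete, the following sketch simulates Ideal Decoding in the simplest setting of an integer reduction factor $R$ and no adversaries ($A = 0$). The packing used here, $g_i(x) = \sum_{j=1}^{R} x^{(j-1)H} r_{(i-1)R+j}(x)$ with $r_\ell = f_\ell \circ u$, is the one implied by the coefficient-block description in Section IV; the field size, the toy coefficients of the $r_\ell$'s, and the helper names are assumptions made for illustration only.

```python
P = 97  # assumed prime field size

def poly_eval(coeffs, x, p=P):
    """Horner evaluation; coeffs[d] is the coefficient of x**d."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def interpolate(xs, ys, p=P):
    """Coefficients of the unique polynomial of degree < len(xs) through the
    points, via Gauss-Jordan elimination on the Vandermonde system modulo p."""
    n = len(xs)
    rows = [[pow(x, d, p) for d in range(n)] + [y % p] for x, y in zip(xs, ys)]
    for col in range(n):
        piv = next(r for r in range(col, n) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], -1, p)
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % p for a, b in zip(rows[r], rows[col])]
    return [rows[d][n] for d in range(n)]

H, L, R = 3, 4, 2   # deg(r_l) <= H - 1, L polynomials, integer reduction factor
# Toy stand-ins for r_l = f_l ∘ u (hypothetical coefficients).
r = [[(7 * l + 3 * d + 1) % P for d in range(H)] for l in range(L)]
alphas = list(range(1, R * H + 1))   # the R*H accessed workers

def worker_reply(alpha):
    """Worker side: combine the L results into L/R symbols using powers of alpha."""
    y = [poly_eval(r_l, alpha) for r_l in r]
    return [sum(pow(alpha, j * H, P) * y[i * R + j] for j in range(R)) % P
            for i in range(L // R)]

replies = [worker_reply(a) for a in alphas]

# Master side: interpolate each g_i and read the r_l's off its coefficient blocks.
recovered = []
for i in range(L // R):
    g = interpolate(alphas, [rep[i] for rep in replies])
    recovered += [g[j * H:(j + 1) * H] for j in range(R)]
```

In this sketch each worker sends $L/R = 2$ symbols instead of $L = 4$, and the master accesses $RH = 6$ workers instead of $H = 3$; `recovered` equals `r` since the coefficient blocks of each $g_i$ do not overlap.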
A surprising corollary of Theorem 3 is that for systems with adversaries, the overall download of our suggested scheme outperforms that of ordinary LCC (see Remark 2).
While the scheme of Theorem 3 enables the user to download fewer symbols from every worker than in ordinary LCC, computing these symbols requires the computation of all functions $f_i$ on the coded data. Furthermore, the reduction factor $R$ must be known a priori, and hence, this scheme is not suitable to handle run-time delays in the system.
To amend these issues, in the second part of this paper (Section V), we consider systems in which the workers are arranged in clusters. Then, the data is encoded by an LCC scheme whose code length is the number of clusters (rather than the number of workers), and all servers in a cluster store the same codeword symbol (we refer to such systems as clustered LCC). By encoding $f$ with UDMs, it is shown that the computation can be completed by downloading $L$ elements from each cluster, regardless of their exact source within the cluster. This scheme enables stronger straggler tolerance, in the sense that it exploits the partial work that is done by the stragglers, in a way that can accommodate any possible combination of delays within each cluster. Further, in cases where the number of adversaries per cluster is known, we have the following.
Theorem 5. In clustered LCC with at most $A_i$ adversaries in each cluster $i$, it is possible to complete the computation by downloading $(2A_i + 1)L$ sequentially arriving symbols from cluster $i$ for each one of $H$ clusters.

IV. ACHIEVING A DOWNLOAD-ACCESS TRADE-OFF
In this section we prove Theorem 3 by introducing extra linear computations at the workers. Let $R > 1$ be the required reduction factor, which is known to all workers, and for now assume that it is an integer (fractional reduction factors will be treated in the sequel). In addition, for every $\ell \in [L]$ denote $f_\ell \circ u$ by $r_\ell$. Following the storage phase and the computation phase, every server $n \in [N]$ contains $\{\tilde{y}_{n,\ell}\}_{\ell \in [L]}$, and computes $g_{i,n} \triangleq \sum_{j=1}^{R} \alpha_n^{(j-1)H} \tilde{y}_{n,(i-1)R+j}$ for every $i \in [L/R]$.
Since $\tilde{y}_{n,\ell} = r_\ell(\alpha_n)$ and $\deg(r_\ell) \le H - 1$ for every $\ell \in [L]$, it follows that each server $n \in [N]$ now holds $L/R$ evaluations at $\alpha_n$ of the polynomials $\{g_i\}_{i=1}^{L/R}$, each of which is of degree at most $RH - 1$. Hence, having received the responses from any set of at least $RH + 2A$ servers, the user is able to obtain the coefficients of all $g_i$'s by Reed-Solomon decoding. Now, it is readily verified that for every $i \in [L/R]$ the first $H$ coefficients of $g_i$ coincide with those of $r_{(i-1)R+1}$, the next $H$ with those of $r_{(i-1)R+2}$, and so on. Hence, all the polynomials $\{r_\ell\}_{\ell \in [L]}$ can be found, and the scheme is finalized by evaluating them at $\beta_1, \ldots, \beta_K$.
It is apparent from the simple case of an integer $R$ that the gist of the extra linear computations by the workers is to obtain polynomial evaluations of some higher degree $g_i$'s in the ideal which is generated by the $r_i$'s. Then, after obtaining the coefficients of the $g_i$'s, the coefficients of the $r_i$'s can be trivially extracted. Now, let $R$ be fractional. In this case, one must choose different polynomials $\{g_i\}_{i \in [L/R]}$ judiciously, so that this extraction is still possible. In what follows, the polynomials $g_i$ are defined anew so that some overlap exists between the coefficients of the $r_i$'s in them. Then, after the interpolation by the master, this overlap is resolved by performing a polynomial combination of the $g_i$'s. We begin with an illustrative example.
Example 1. [...] $g_{4,n} \triangleq \tilde{y}_{n,9} + \alpha_n^3 \tilde{y}_{n,10} + \alpha_n^7 \tilde{y}_{n,11}$. It is readily verified that for every $i$ and $n$, the value $g_{i,n}$ is an evaluation at $\alpha_n$ of a polynomial $g_i$, whose degree is at most 10. Hence, all $g_i$'s can be extracted from $RH = 11$ workers. Then, the master computes [...] Since $\deg(r_i) \le 3$ for all $i$, the coefficients of $r_1, r_2, r_4, r_5, r_7, r_8, r_{10}$, and $r_{11}$ in the above expression do not overlap, and can therefore be extracted. Then, $r_3$ can be found from $g_1$, $r_1$, and $r_2$; $r_6$ from $g_2$, $r_3$, $r_4$, and $r_5$; and $r_9$ from $g_3$, $r_6$, $r_7$, and $r_8$, or alternatively from $g_4$, $r_{10}$, and $r_{11}$. To obtain $r_{12}, \ldots, r_{22}$, we define $g_5, \ldots, g_8$ similarly by using $r_{12}, \ldots, r_{22}$, and conclude the scheme by evaluating all $r_i$'s. Overall, we have accessed 11 workers and downloaded 8 elements from each, instead of accessing 4 workers and downloading 22 elements from each.
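The non-overlap claim of the example can be checked numerically. In the sketch below, only $g_4$ is taken from the text; the packings of $g_1, g_2, g_3$ and the alternating combination $g_1 - x^7 g_2 + x^{14} g_3 - x^{21} g_4$ are assumptions, chosen as one reading that is consistent with $H = 4$, $RH = 11$, the stated degree bound of 10, and the stated list of non-overlapping $r_i$'s. The field size and the random coefficients are likewise illustrative.

```python
import random

P = 101   # assumed prime field size
H = 4     # deg(r_i) <= 3
random.seed(0)
r = {i: [random.randrange(P) for _ in range(H)] for i in range(1, 12)}  # r_1..r_11

def shifted_sum(terms, length):
    """Coefficients of the sum of x**shift * r_idx over (shift, idx) pairs, mod P."""
    out = [0] * length
    for shift, idx in terms:
        for d, c in enumerate(r[idx]):
            out[shift + d] = (out[shift + d] + c) % P
    return out

# Assumed packing; only the entry for g_4 appears in the text.
packing = {
    1: [(0, 1), (4, 2), (7, 3)],
    2: [(0, 3), (1, 4), (5, 5), (7, 6)],
    3: [(0, 6), (2, 7), (6, 8), (7, 9)],
    4: [(0, 9), (3, 10), (7, 11)],
}
g = {i: shifted_sum(t, 11) for i, t in packing.items()}  # each of degree <= 10

# Assumed combination: g_1 - x^7 g_2 + x^14 g_3 - x^21 g_4.
combo = [0] * 32
for i in range(1, 5):
    sign = 1 if i % 2 else -1
    for d, c in enumerate(g[i]):
        combo[7 * (i - 1) + d] = (combo[7 * (i - 1) + d] + sign * c) % P

# The eight surviving r_i's occupy consecutive, disjoint blocks of H coefficients.
slots = [1, 2, 4, 5, 7, 8, 10, 11]
sign_of = {1: 1, 2: 1, 4: -1, 5: -1, 7: 1, 8: 1, 10: -1, 11: -1}
extracted = {idx: [sign_of[idx] * c % P for c in combo[4 * k:4 * k + 4]]
             for k, idx in enumerate(slots)}

# r_3 is then peeled off g_1 using the already known r_2.
r3 = [(g[1][7] - extracted[2][3]) % P] + g[1][8:11]
```

Under these assumptions the shared terms $x^7 r_3$, $x^{14} r_6$, and $x^{21} r_9$ cancel, the remaining eight polynomials are read off directly, and $r_3$ follows from $g_1$ by subtracting the single overlapping coefficient of $r_2$; $r_6$ and $r_9$ can be handled in the same manner.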
In general, for $\ell \in [L]$ and $h \in [H]$, define $g^{(\ell,h,n)}$ as [...] Finally, let $\ell_1 \triangleq 1$, and for $i \in \{2, \ldots, L/R\}$ define [...] Following the computation of $\tilde{y}_{n,\ell}$ for every $\ell \in [L]$, every server $n$ computes $\{g^{(\ell_i,h_i,n)}\}_{i=1}^{L/R}$ and sends the results to the master. It follows from (2) that for every $i \in [L/R]$, the expression $g^{(\ell_i,h_i,n)}$ is an evaluation at $\alpha_n$ of [...] and $\deg(g_i) \le HR - 1$. Hence, all polynomials $g_i(x)$ can be interpolated from the responses of $HR + 2A$ workers. It remains to show how the coefficients of the $r_i$'s can be found from the $g_i$'s.
We show how certain $r_i$'s are extracted from $\{g_i(x)\}_{i=1}^{i'}$, where $i'$ is an integer such that $h_2, \ldots, h_{i'} \ne H$ and $h_{i'+1} = H$ (this $i'$ clearly exists, and it is at most $R_d + 1$). The remaining $r_i$'s are extracted similarly. Given $g_1, \ldots, g_{i'}$, the master computes the following sum, in which the last term $x^{H(R-1)} r_{\ell_i + j_i + 1}$ of $g_i$ cancels out the first term $r_{\ell_{i+1}}$ of $g_{i+1}$, for every $i \in [i' - 1]$: [...]
Therefore, to show that all $r_i$'s in the above expression can be extracted, it is shown that the monomial degrees in the above sums, as are the ones in the first and the last summand, do not overlap. Since $(\ell_1, h_1) = (1, H)$ and $\deg(r_i) \le H - 1$ for all $i$, it follows that $r_{\ell_1}$ and $x^{h_1} r_{\ell_1 + 1}$ do not share a common monomial, and hence the first sum does not overlap $r_{\ell_1}$. For $k \in [i' - 1]$, in order to show that the $k$'th sum does not share a common monomial with the $(k+1)$'th sum, we must show that [...] which readily follows from the definitions (see Lemma 7 in the appendix). Finally, to show that the last sum (3) does not share a common monomial with the last summand (4), we ought to show that [...] and this inequality follows from the definitions as well (see Lemma 8 in the appendix). Thus, we have obtained the coefficients of all involved $r_i$'s, except for $r_{\ell_2}, r_{\ell_3}, \ldots, r_{\ell_{i'}}$, that were canceled out in (4). However, $r_{\ell_2}$ can be extracted from $g_1$ and $r_{\ell_1}, \ldots, r_{\ell_2 - 1}$; $r_{\ell_3}$ can be extracted from $g_2$ and $r_{\ell_2}, \ldots, r_{\ell_3 - 1}$; and so on.
Remark 2. In cases where $A > 0$, the download bandwidth of the suggested scheme strictly outperforms the one in ordinary LCC. In LCC, the user downloads $L$ symbols from each one of $H + 2A$ workers, $L(H + 2A)$ symbols overall. In our scheme however, the user downloads $L/R$ symbols from $RH + 2A$ workers, $L(H + 2A/R)$ symbols overall.
For instance, if Example 1 is accompanied by $A = 1$ adversary, then the $g_i$'s are interpolated by accessing 13 workers, and downloading 8 symbols from each, a total of 104 symbols. In ordinary LCC, the 22 polynomials $r_i$ are interpolated by accessing 6 workers, and downloading 22 symbols from each, an overall download of 132 symbols.
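The counts in Remark 2 are simple to tabulate. The following sketch computes the number of accessed workers, the symbols downloaded per worker, and the overall download for both schemes; the function names are hypothetical, and the divisibility check assumes that $R$ is given in lowest terms.

```python
from fractions import Fraction

def lcc_download(L, H, A):
    """Ordinary LCC: L symbols from each of H + 2A workers.
    Returns (workers accessed, symbols per worker, total symbols)."""
    workers = H + 2 * A
    return workers, L, workers * L

def ideal_decoding_download(L, H, A, R):
    """Scheme of Theorem 3: L/R symbols from each of RH + 2A workers,
    for a rational R = R_e / R_d with R_e | L and R_d | H."""
    R = Fraction(R)
    per_worker = Fraction(L) / R
    workers = R * H + 2 * A
    if per_worker.denominator != 1 or workers.denominator != 1:
        raise ValueError("R must satisfy R_e | L and R_d | H")
    return int(workers), int(per_worker), int(workers * per_worker)

# Parameters of Example 1 with A = 1 adversary.
lcc = lcc_download(22, 4, 1)
ours = ideal_decoding_download(22, 4, 1, Fraction(11, 4))
```

For these parameters, `lcc` is `(6, 22, 132)` and `ours` is `(13, 8, 104)`, matching the figures above. With $A = 0$ both totals equal $LH = 88$, which reflects that the saving of $2AL(1 - 1/R)$ symbols appears only in the presence of adversaries.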

V. UTILIZING PARTIAL WORK BY UDMS
In this section we prove Theorem 4 and Theorem 5 by adding a layer of encoding. That is, we encode the polynomial f itself by using UDMs, and apply the encoded polynomials on a partially replicated Lagrange code. To exploit the full potential of the scheme in this section, one must possess some knowledge regarding the expected computation power of the workers in the system; this assumption aligns well with contemporary cloud services, in which performance guarantees of workers are given as a function of their cost. The approaches towards proving both theorems are similar. However, in one the error correction is performed between the clusters, and in the other, within each cluster. We begin by proving Theorem 5, and then proceed to show that Theorem 4 is an easy corollary of it.
We partition the $N$ workers into $C$ different clusters $\mathcal{C}_1, \ldots, \mathcal{C}_C$ of varying sizes $P_1, \ldots, P_C$, respectively. Broadly speaking, one should group together slow workers into large clusters, and fast workers into small ones. In addition, for every $i \in [C]$ assume that there are at most $A_i$ adversaries in $\mathcal{C}_i$.
First, the data matrix $X$ is encoded by using a Lagrange code of length $C$, producing codeword symbols $\tilde{x}_1, \ldots, \tilde{x}_C$. For every $i \in [C]$, the codeword symbol $\tilde{x}_i$ is replicated $P_i$ times and each copy is stored on a different server in $\mathcal{C}_i$. Second, in the computation phase, every server computes $L$ functions on its stored codeword symbol. These $L$ functions are linear combinations of the polynomials $\{f_i\}_{i=1}^{L}$, and are unique to each server. The precise functions $h_{i,1}, \ldots, h_{i,L}$ of worker $i$ can either be agreed upon in advance, be transmitted by the master to the worker incrementally or together, or be computed at the worker after receiving the polynomials $f_i$.
For $i \in [C]$ identify the workers in $\mathcal{C}_i$ by the integers $1, 2, \ldots, P_i$, and let $B_1, \ldots, B_{P_i}$ be $L \times L$ UDMs over $\mathbb{F}$. The $L$ functions of server $j$ in $\mathcal{C}_i$ are $(h_{j,1}, \ldots, h_{j,L}) \triangleq (f_1, \ldots, f_L) \cdot B_j$.
Then, each server $j$ computes $h_{j,1}(\tilde{x}_i)$, transmits the result to the master, continues to compute and transmit $h_{j,2}(\tilde{x}_i)$, and so on. In what follows, it is shown that once at least $(2A_i + 1)L$ responses are received from cluster $i$ for at least $H$ clusters, regardless of their particular source within each cluster, the master is capable of finalizing the computation. The decoding process operates in two steps. In one, the true value of $\tilde{y}_i = f(\tilde{x}_i)$ is extracted from the partial responses of every server in $\mathcal{C}_i$. Then, these error free results are given to a decoding algorithm for LCC of length $C$, which finalizes the computation. Hence, we focus on the decoding process at the cluster level, which is identical in all clusters.
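As a small illustration of why the source of the responses within a cluster does not matter, the sketch below considers an adversary-free cluster of $P = 2$ servers that uses the pair $B_1 = I$, $B_2 = J$ from the example of Section II-B. It assumes that the relevant prefix in Definition 1 is a prefix of columns, which matches the convention $(h_{j,1}, \ldots, h_{j,L}) = (f_1, \ldots, f_L) \cdot B_j$ above; the field size and helper names are illustrative.

```python
P = 97   # assumed prime field size
L = 4    # number of functions per server

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def antidiagonal(n):
    return [[int(i + j == n - 1) for j in range(n)] for i in range(n)]

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p), computed by Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        piv = next((r for r in range(rank, len(rows)) if rows[r][col] % p), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [v * inv % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col] % p:
                fct = rows[r][col]
                rows[r] = [(a - fct * b) % p for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def received_columns(B, counts):
    """Columns seen by the master: the first counts[j] columns of each B[j]."""
    cols = []
    for Bj, u in zip(B, counts):
        cols += [[Bj[row][c] for row in range(L)] for c in range(u)]
    return cols

B = [identity(L), antidiagonal(L)]
# Every split (u_1, u_2) with u_1 + u_2 = L must yield an invertible system.
decodable = {u1: rank_mod_p(received_columns(B, (u1, L - u1))) == L
             for u1 in range(L + 1)}
```

With this pair, server 1 reports $f_1(\tilde{x}_i), f_2(\tilde{x}_i), \ldots$ while server 2 reports $f_L(\tilde{x}_i), f_{L-1}(\tilde{x}_i), \ldots$, so any $L$ responses in total determine $f(\tilde{x}_i)$, whichever server happens to be faster. Larger clusters and adversaries require the general construction and the error correction argument discussed next.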
For a given cluster $\mathcal{C}_i$ and $j \in [P_i]$, let $u_j$ be the number of responses that were obtained from worker $j$ up to a given point in time, and notice that $0 \le u_j \le L$ for every $j$. Thus, the response from $\mathcal{C}_i$ can be written as [...] The outline of the proof is as follows, and complete details can be found in the appendix.
Proof sketch. It suffices to show that no two distinct codewords $(w_1, \ldots, w_N)$ and $(w'_1, \ldots, w'_N)$ can be made equal by an addition of an error vector which results from the presence of $A_i$ adversaries in the cluster. Thanks to linearity, this is equivalent to obtaining the zero codeword by encoding $\{\tilde{y}_{i,\ell}\}_{\ell \in [L]}$ that are not all zero, and introducing $2A_i$ adversaries. If such a scenario is possible, one uses Proposition 1 to get that $\sum_{j=0}^{L-1} \tilde{y}_{i,j+1} x^j$ is the zero polynomial, a contradiction.
Therefore, once at least $(2A_i + 1)L$ responses have arrived at the master from each one of at least $H$ clusters $\mathcal{C}_i$, the master obtains the respective $f(\tilde{x}_i) = (f_1(\tilde{x}_i), \ldots, f_L(\tilde{x}_i))$, and the computation can be finalized by LCC decoding.
Now, to prove Theorem 4, observe that if there are no adversaries in a given cluster $\mathcal{C}_i$, then $L$ responses from it suffice to obtain $\tilde{y}_i$. Furthermore, if there are at most $A$ adversaries overall in the system, in the worst case there will be at most one adversary in each cluster, and hence the master may potentially fail to produce $\tilde{y}_i$ in at most $A$ clusters. Therefore, having obtained $\tilde{y}_i$ from at least $H + 2A$ clusters $i$, at most $A$ of which are potentially erroneous, the master can apply LCC decoding, and the theorem follows.