Optimal job fragmentation
It was recently discovered that on an unreliable server, the job completion time distribution function (df) can be heavy-tailed (HT) even when the job size df is light-tailed (LT) [1, 5]. The key to this phenomenon is the RESTART feature: if a job is interrupted during processing, the entire job must restart from the beginning, i.e., the partially completed work is lost. A standard mechanism for reducing the job completion time in an unreliable service environment is checkpointing [3, 4, 6]. We view checkpointing as a job fragmentation operation, in which the server processes one fragment of the job at a time. If the server becomes unavailable, say due to failure, then only the work corresponding to the fragment being processed at the time of failure is lost. In this paper, we are motivated by the question: Can fragmentation 'lighten' the tail of the completion time df? In Section 3, we provide sufficient conditions on the fragmentation policy that give rise to an LT completion time whenever the job size df is LT. We then characterize the optimal fragmentation policy, which seeks to minimize the expected job completion time. This policy requires a priori knowledge of the job size. We then describe a sub-optimal fragmentation policy that is blind to the job size and is provably very close to optimal. We also describe the asymptotic tail behavior of the job completion time df under both policies. Assuming the server unavailability periods are LT, both policies produce LT completion times when the job size df is LT. When the job size df is regularly varying, the job completion time under both policies is regularly varying with the same degree; this is the lightest possible asymptotic tail behavior (in the degree sense).
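The effect of fragmentation on the expected completion time can be illustrated with a minimal sketch. It is not the policy derived in the paper, and it rests on assumptions that the abstract does not state: server availability periods are exponential with rate `fail_rate`, unavailability periods have mean `mean_downtime`, the job is split into equal-size fragments, and each fragment carries a hypothetical fixed checkpoint cost `overhead`. Under these assumptions, a fragment of length s that must restart on every failure has expected completion time (e^{fail_rate * s} - 1)(1/fail_rate + mean_downtime). All function and parameter names below are illustrative.

```python
import math


def expected_completion_time(job_size, n_fragments, fail_rate,
                             mean_downtime, overhead=0.0):
    """Expected completion time of a job split into equal fragments.

    Assumes exponential availability periods (rate fail_rate), so each
    fragment behaves like an independent RESTART job. The overhead
    argument is a hypothetical per-fragment checkpoint cost.
    """
    fragment = job_size / n_fragments + overhead
    # Expected time to finish one fragment under RESTART.
    per_fragment = (math.exp(fail_rate * fragment) - 1.0) * (
        1.0 / fail_rate + mean_downtime
    )
    return n_fragments * per_fragment


def best_fragment_count(job_size, fail_rate, mean_downtime, overhead,
                        max_fragments=1000):
    """Brute-force search for the fragment count with the smallest mean."""
    return min(
        range(1, max_fragments + 1),
        key=lambda n: expected_completion_time(
            job_size, n, fail_rate, mean_downtime, overhead
        ),
    )
```

For a job of size 10 with `fail_rate = 1`, no downtime, and no overhead, a single fragment gives an expected completion time of e^10 - 1 (about 22025), while ten fragments give 10(e - 1) (about 17.2). With a positive overhead the mean is no longer monotone in the number of fragments, which is why the sketch searches for a finite fragment count.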
© 2009 ACM.