Ziv, Avi and Bruck, Jehoshua (1994) Analysis of Checkpointing Schemes for Multiprocessor Systems. California Institute of Technology . (Unpublished) http://resolver.caltech.edu/CaltechPARADISE:1994.ETR003
PDF (Adobe PDF (2.7MB))
See Usage Policy.
See Usage Policy.
Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechPARADISE:1994.ETR003
Parallel computing systems provide hardware redundancy that helps to achieve low cost fault- tolerance. Fault-tolerance is achieved, in those systems, by duplicating the task into more than one processor, and comparing the states of the processors at checkpoints. Many schemes that achieve fault tolerance exist, and most of them use checkpointing to reduce the time spent retrying a task. Performance evaluation for most of the schemes either relies on simulation results, or uses a simplified fault model. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes for fault-tolerance. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results match well the values we obtained using a simulation program. We compare the average task completion time and total work of four checkpointing schemes, TMR, DMR-B-2, DMR-F-1 and RFCS. We show that generally increasing the number of processors reduces the average completion time, but increases the total work done by the processors. Namely, the TMR scheme, which uses three processors, is the quickest but does the most work, while the DMR-B-2 scheme, which uses only two processors, is the slowest of the four schemes but does the least work. However, in cases where there is a big difference between the time it takes to perform different operations, those results can change. For example, when we assume that the schemes are implemented on workstations connected by a LAN and the time to move data between workstations is relatively long, the DMR-B-2 scheme can become quicker than the TMR scheme.
|Item Type:||Report or Paper (Technical Report)|
|Group:||Parallel and Distributed Systems Group|
|Usage Policy:||You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.|
|Deposited By:||Imported from CaltechPARADISE|
|Deposited On:||04 Sep 2002|
|Last Modified:||26 Dec 2012 13:52|
Repository Staff Only: item control page