CaltechAUTHORS
  A Caltech Library Service

Analysis of Checkpointing Schemes for Multiprocessor Systems

Ziv, Avi and Bruck, Jehoshua (1994) Analysis of Checkpointing Schemes for Multiprocessor Systems. California Institute of Technology . (Unpublished) http://resolver.caltech.edu/CaltechPARADISE:1994.ETR003

[img]
Preview
PDF (Adobe PDF (2.7MB))
See Usage Policy.

2656Kb
[img]
Preview
Postscript
See Usage Policy.

351Kb

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechPARADISE:1994.ETR003

Abstract

Parallel computing systems provide hardware redundancy that helps to achieve low cost fault- tolerance. Fault-tolerance is achieved, in those systems, by duplicating the task into more than one processor, and comparing the states of the processors at checkpoints. Many schemes that achieve fault tolerance exist, and most of them use checkpointing to reduce the time spent retrying a task. Performance evaluation for most of the schemes either relies on simulation results, or uses a simplified fault model. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes for fault-tolerance. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results match well the values we obtained using a simulation program. We compare the average task completion time and total work of four checkpointing schemes, TMR, DMR-B-2, DMR-F-1 and RFCS. We show that generally increasing the number of processors reduces the average completion time, but increases the total work done by the processors. Namely, the TMR scheme, which uses three processors, is the quickest but does the most work, while the DMR-B-2 scheme, which uses only two processors, is the slowest of the four schemes but does the least work. However, in cases where there is a big difference between the time it takes to perform different operations, those results can change. For example, when we assume that the schemes are implemented on workstations connected by a LAN and the time to move data between workstations is relatively long, the DMR-B-2 scheme can become quicker than the TMR scheme.


Item Type:Report or Paper (Technical Report)
Related URLs:
URLURL TypeDescription
http://www.paradise.caltech.edu/papers/etr003.psPublisherUNSPECIFIED
Group:Parallel and Distributed Systems Group
Record Number:CaltechPARADISE:1994.ETR003
Persistent URL:http://resolver.caltech.edu/CaltechPARADISE:1994.ETR003
Usage Policy:You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.
ID Code:26071
Collection:CaltechPARADISE
Deposited By: Imported from CaltechPARADISE
Deposited On:04 Sep 2002
Last Modified:26 Dec 2012 13:52

Repository Staff Only: item control page