
Performance optimization of checkpointing schemes with task duplication

Ziv, Avi and Bruck, Jehoshua (1997) Performance optimization of checkpointing schemes with task duplication. IEEE Transactions on Computers, 46 (12). pp. 1381-1386. ISSN 0018-9340. http://resolver.caltech.edu/CaltechAUTHORS:ZIVieeetc97a

Abstract

In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults by comparing the processors' states at checkpoints, and reducing fault recovery time by supplying a safe point to roll back to. In this paper, we show that, by tuning the checkpointing schemes to a given architecture, a significant reduction in the execution time can be achieved. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations efficiently and improve the performance of checkpointing schemes. Our results show that, in some cases, using compare and store checkpoints can reduce the overhead of DMR checkpointing schemes by as much as 30 percent.
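The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the analysis from the paper; it is a minimal illustration under assumed simplifications: the task is split into equal intervals, the duplicated processes are compared after every interval, the state is stored only after every `store_every`-th verified interval, and a fault is always detected by the next comparison. The function name `run_time` and all parameter names are hypothetical.

```python
def run_time(n_intervals, store_every, t_cmp, t_store, faulty_steps,
             t_interval=1.0):
    """Total time to finish a task under a toy duplicated-execution model.

    Assumptions (illustrative only, not taken from the paper):
      - both replicas run in lockstep, so one clock is enough;
      - a compare-checkpoint (cost t_cmp) follows every interval;
      - a store-checkpoint (cost t_store) follows every `store_every`-th
        verified interval, and the final one;
      - `faulty_steps` holds the 1-based indices of executed intervals
        (re-executions included) during which a fault occurs.
    """
    time = 0.0
    done = 0         # intervals completed and verified so far
    last_stored = 0  # rollback target: last interval with a stored state
    step = 0         # executed intervals, counting re-executions
    while done < n_intervals:
        step += 1
        time += t_interval + t_cmp      # run one interval, then compare
        if step in faulty_steps:
            done = last_stored          # mismatch: roll back to stored state
            continue
        done += 1
        if done % store_every == 0 or done == n_intervals:
            time += t_store             # save a new rollback point
            last_stored = done
    return time
```

Calling `run_time(4, 2, 0.25, 0.5, {2})` and `run_time(4, 1, 0.25, 0.5, {2})` with the same fault compares storing every second interval against storing every interval; varying `t_cmp` and `t_store` shows how the better spacing depends on the relative cost of the two operations, which is the kind of architecture-specific tuning the abstract refers to.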


Item Type: Article
Additional Information: © Copyright 1997 IEEE. Reprinted with permission. Manuscript revised 13 Aug. 1997. The research reported in this paper was supported in part by U.S. National Science Foundation Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, by a grant from the IBM Almaden Research Center, San Jose, California, and by a grant from the AT&T Foundation. This research was performed, in part, using the CSCC parallel computer system operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by Caltech.
Subject Keywords: Fault-tolerant computing, checkpointing, task duplication, parallel computing, performance optimization
Record Number: CaltechAUTHORS:ZIVieeetc97a
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:ZIVieeetc97a
Alternative URL: http://dx.doi.org/10.1109/12.641939
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 9693
Collection: CaltechAUTHORS
Deposited By: Archive Administrator
Deposited On: 03 Mar 2008
Last Modified: 26 Dec 2012 09:51