CaltechAUTHORS
  A Caltech Library Service

Performance Optimization of Checkpointing Schemes with Task Duplication

Ziv, Avi and Bruck, Jehoshua (1994) Performance Optimization of Checkpointing Schemes with Task Duplication. California Institute of Technology. (Unpublished) http://resolver.caltech.edu/CaltechPARADISE:1994.ETR004



Abstract

Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the redundancy in hardware and software resources. In these systems, checkpointing serves two purposes: it helps detect faults by comparing the processors' states at checkpoints, and it reduces fault recovery time by supplying a safe point to roll back to. The efficiency of checkpointing schemes is influenced by the time it takes to perform the comparisons and to store the states. Because each checkpoint both stores and compares states, and these two operations have conflicting objectives regarding how often they should be performed, the performance of current checkpointing schemes is limited. In this paper we show that tuning the checkpointing scheme to a given architecture can significantly reduce execution time. We present both analytical results and experimental results obtained on a cluster of workstations and a parallel computer. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, the comparison and storage operations can each be used efficiently, which improves the performance of checkpointing schemes. As a particular example of this approach, we analyzed the DMR checkpointing scheme with store-checkpoints and compare-checkpoints on two types of architectures: one where the comparison time is much higher than the store time (such as a cluster of workstations connected by a LAN) and one where the store time is much higher than the comparison time (such as the Intel Paragon supercomputer). We have implemented a prototype of the new DMR schemes and run it on workstations connected by a LAN and on the Intel Paragon supercomputer. The experimental results match the analytical results and show that in some cases the overhead of the DMR checkpointing schemes on both architectures can be reduced by as much as 40%.
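To make the distinction between the two checkpoint types concrete, the following is a minimal toy simulation, not the authors' model or prototype. It covers only the case where storing is costlier than comparing, and it assumes (as a simplification of this sketch) that a state is stored only right after a successful comparison, so a rollback always lands on a verified state. The function names, parameters, unit-time work steps, and per-step fault probability are all hypothetical.

```python
import random


def run_dmr(work, compare_every, store_every, t_compare, t_store, p_fault, rng):
    """Return the elapsed time of one duplicated task in a toy DMR model.

    Assumptions of this sketch (not taken from the report):
    - both replicas advance in lockstep, one work unit per unit of time;
    - a fault corrupts one replica with probability p_fault per work unit;
    - a compare-checkpoint always detects a corrupted replica;
    - store-checkpoints are taken only at compare-checkpoints that matched.
    """
    if store_every % compare_every != 0:
        raise ValueError("store_every must be a multiple of compare_every")
    if not 0.0 <= p_fault < 1.0:
        raise ValueError("p_fault must be in [0, 1)")

    done = 0           # work units completed since the start of the task
    saved = 0          # progress captured by the last store-checkpoint
    corrupted = False  # True once the replicas' states have diverged
    elapsed = 0.0

    while done < work:
        done += 1
        elapsed += 1.0
        if rng.random() < p_fault:
            corrupted = True

        at_end = done == work
        if done % compare_every == 0 or at_end:
            elapsed += t_compare          # compare-checkpoint: detect faults
            if corrupted:
                done = saved              # roll back to the last stored state
                corrupted = False
                continue
            if done % store_every == 0 and not at_end:
                elapsed += t_store        # store-checkpoint: bound recovery
                saved = done
    return elapsed


def mean_time(runs, seed, **params):
    """Average run_dmr over several runs that share one seeded generator."""
    rng = random.Random(seed)
    return sum(run_dmr(rng=rng, **params) for _ in range(runs)) / runs
```

Setting `compare_every` equal to `store_every` reproduces a single-checkpoint-type scheme in this toy model, while a smaller `compare_every` adds cheap compare-checkpoints between the expensive store-checkpoints. Calling `mean_time` with both settings and a nonzero `p_fault` gives a rough feel for the trade-off, though the numbers say nothing about the architectures measured in the report.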


Item Type: Report or Paper (Technical Report)
Related URLs:
URL | URL Type | Description
http://www.paradise.caltech.edu/papers/etr004.ps | Publisher | UNSPECIFIED
Group: Parallel and Distributed Systems Group
Record Number: CaltechPARADISE:1994.ETR004
Persistent URL: http://resolver.caltech.edu/CaltechPARADISE:1994.ETR004
Usage Policy: You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.
ID Code: 26070
Collection: CaltechPARADISE
Deposited By: Imported from CaltechPARADISE
Deposited On: 04 Sep 2002
Last Modified: 26 Dec 2012 13:52
