
Efficient checkpointing over local area networks

Ziv, Avi and Bruck, Jehoshua (1994) Efficient checkpointing over local area networks. In: Fault-Tolerant Parallel and Distributed Systems, College Station, TX, 12-14 June 1994. IEEE , Piscataway, NJ, pp. 30-35. ISBN 0818668075 http://resolver.caltech.edu/CaltechAUTHORS:ZIVftpds94

Abstract

Parallel and distributed computing on clusters of workstations is becoming very popular because it provides a cost-effective route to high-performance computing. In these systems, the bandwidth of the communication subsystem (using Ethernet technology) is about an order of magnitude smaller than the bandwidth of the storage subsystem. Hence, storing a state in a checkpoint is much more efficient than comparing states over the network. In this paper we present a novel checkpointing approach that enables efficient performance over local area networks. The main idea is to use two types of checkpoints: compare-checkpoints (where the states of the redundant processes are compared to detect faults) and store-checkpoints (where the state is only stored). The store-checkpoints reduce the rollback needed after a fault is detected, without performing many unnecessary comparisons. As a particular example of this approach, we analyzed the DMR checkpointing scheme with store-checkpoints. Our main result is that the execution-time overhead can be significantly reduced when store-checkpoints are introduced. We have implemented a prototype of the new DMR scheme and run it on workstations connected by a LAN. The experimental results we obtained match the analytical results and show that in some cases the overhead of DMR checkpointing schemes over LANs can be improved by as much as 20%.
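
The interplay of the two checkpoint types can be illustrated with a small deterministic model. The following is only a sketch under stated assumptions, not the analysis or prototype from the paper: the function name `dmr_run_time`, its parameters, the fixed fault schedule, and the linear backward scan used to locate the last matching store-checkpoint are all hypothetical choices made for illustration.

```python
def dmr_run_time(n_intervals, k, work, t_store, t_compare, faulty, use_store=True):
    """Total time for two redundant processes to finish n_intervals intervals.

    Hypothetical model (not taken from the paper):
      - each interval costs `work` time units;
      - a compare-checkpoint (cost t_compare) follows every k-th interval
        and the final interval; faults are detected only there;
      - with use_store=True, a store-checkpoint (cost t_store) follows
        every other interval;
      - `faulty` holds 0-based interval indices whose FIRST execution is
        corrupted in one of the two processes.
    """
    remaining = set(faulty)   # copy, so the caller's set is not mutated
    time = 0
    i = 0                     # index of the next interval to execute
    last_good = 0             # latest position known to be fault-free
    first_bad = None          # earliest corrupted interval since last_good

    while i < n_intervals:
        time += work
        if i in remaining:
            remaining.remove(i)
            if first_bad is None:
                first_bad = i
        i += 1

        if i % k == 0 or i == n_intervals:
            # Compare-checkpoint: states are exchanged over the network.
            time += t_compare
            if first_bad is None:
                last_good = i
                continue
            # Fault detected. Default: roll back to the last verified state.
            rollback = last_good
            if use_store:
                # Assumed search: compare stored states from newest to
                # oldest until a matching pair is found.
                for j in range(i - 1, last_good, -1):
                    time += t_compare
                    if j <= first_bad:   # state before the faulty interval
                        rollback = j
                        break
            i = rollback
            last_good = rollback
            first_bad = None
        elif use_store:
            # Store-checkpoint: state written locally, no comparison.
            time += t_store

    return time


# One fault late in an 8-interval task, with comparison 10x costlier than storing.
with_store = dmr_run_time(8, 8, work=10, t_store=1, t_compare=10, faulty={6})
without_store = dmr_run_time(8, 8, work=10, t_store=1, t_compare=10,
                             faulty={6}, use_store=False)
print(with_store, without_store)
```

In this toy schedule the store-checkpoints shorten the rollback from eight intervals to two. The benefit depends entirely on the assumed costs and on where the fault falls, so the numbers should not be read as reproducing the paper's results.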


Item Type: Book Section
Additional Information: © 1994 IEEE. Reprinted with Permission. The research reported in this paper was supported in part by the NSF Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, by a grant from the IBM Almaden Research Center, San Jose, California, and by a grant from the AT&T Foundation.
Subject Keywords: fault tolerant computing; local area networks; Ethernet technology; checkpointing; communication subsystem; compare-checkpoints; high performance computing; rollback; storage subsystem; store-checkpoints
Record Number: CaltechAUTHORS:ZIVftpds94
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:ZIVftpds94
Alternative URL: http://dx.doi.org/10.1109/FTPDS.1994.494471
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 9930
Collection: CaltechAUTHORS
Deposited By: Kristin Buxton
Deposited On: 27 Mar 2008
Last Modified: 26 Dec 2012 09:54