Ziv, Avi and Bruck, Jehoshua (1994) Efficient checkpointing over local area networks. In: Fault-Tolerant Parallel and Distributed Systems, College Station, TX, 12-14 June 1994. IEEE, Piscataway, NJ, pp. 30-35. ISBN 0818668075 http://resolver.caltech.edu/CaltechAUTHORS:ZIVftpds94
Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:ZIVftpds94
Parallel and distributed computing on clusters of workstations is becoming very popular because it provides a cost-effective way to achieve high-performance computing. In these systems, the bandwidth of the communication subsystem (using Ethernet technology) is about an order of magnitude lower than the bandwidth of the storage subsystem. Hence, storing a state in a checkpoint is much more efficient than comparing states over the network. In this paper we present a novel checkpointing approach that enables efficient performance over local area networks. The main idea is to use two types of checkpoints: compare-checkpoints (where the states of the redundant processes are compared to detect faults) and store-checkpoints (where the state is only stored). The store-checkpoints reduce the rollback needed after a fault is detected, without performing many unnecessary comparisons. As a particular example of this approach, we analyzed the DMR checkpointing scheme with store-checkpoints. Our main result is that the execution-time overhead can be significantly reduced when store-checkpoints are introduced. We have implemented a prototype of the new DMR scheme and run it on workstations connected by a LAN. The experimental results we obtained match the analytical results and show that in some cases the overhead of DMR checkpointing schemes over LANs can be improved by as much as 20%.
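The interplay of the two checkpoint types described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the function names (`step`, `run_dmr`), the toy computation, the single injected transient fault, and the rollback rule (return to the latest store-checkpoint at which both replicas still agree) are all assumptions made for illustration. The sketch counts only the periodic comparisons and ignores the cost of locating the last matching store-checkpoint.

```python
MODULUS = 1_000_003  # arbitrary modulus for the toy computation (assumption)


def step(state: int, i: int) -> int:
    """Toy deterministic computation standing in for one unit of real work."""
    return (state * 31 + i) % MODULUS


def run_dmr(total_steps, store_every, compare_every, fault_step=None):
    """Simulate two redundant replicas with store- and compare-checkpoints.

    Returns (final_state, steps_executed, comparisons).
    Hypothetical model: at most one transient fault, hitting replica B
    right after step `fault_step`.
    """
    assert compare_every % store_every == 0
    a = b = 0
    stored = {0: (a, b)}  # store-checkpoints: step -> (state_a, state_b)
    i = executed = comparisons = 0
    fault_pending = fault_step is not None

    while i < total_steps:
        i += 1
        a, b = step(a, i), step(b, i)
        executed += 1

        if fault_pending and i == fault_step:
            b = (b + 1) % MODULUS  # corrupt replica B exactly once
            fault_pending = False

        if i % store_every == 0:
            stored[i] = (a, b)  # cheap: state is only stored, not compared

        if i % compare_every == 0 or i == total_steps:
            comparisons += 1  # expensive: states cross the network
            if a != b:
                # Fault detected: roll back to the latest store-checkpoint
                # at which the two replicas still agree.
                good = max(s for s, (sa, sb) in stored.items() if sa == sb)
                stored = {s: v for s, v in stored.items() if s <= good}
                a, b = stored[good]
                i = good
            else:
                # States verified: older checkpoints are no longer needed.
                stored = {i: (a, b)}

    return a, executed, comparisons


if __name__ == "__main__":
    # Compare every 6 steps; store every 6 steps (no extra store-checkpoints)
    # versus every 2 steps, with a fault injected after step 5.
    print(run_dmr(12, 6, 6, fault_step=5)[1:])
    print(run_dmr(12, 2, 6, fault_step=5)[1:])
```

In this toy setting, both runs perform the same number of comparisons, but the run with intermediate store-checkpoints re-executes fewer steps after the fault is detected, which is the effect the abstract attributes to store-checkpoints.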
|Item Type:||Book Section|
|Additional Information:||© 1994 IEEE. Reprinted with Permission. The research reported in this paper was supported in part by the NSF Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, by a grant from the IBM Almaden Research Center, San Jose, California, and by a grant from the AT&T Foundation.|
|Subject Keywords:||fault tolerant computing; local area networks; Ethernet technology; checkpointing; communication subsystem; compare-checkpoints; high performance computing; rollback; storage subsystem; store-checkpoints|
|Usage Policy:||No commercial reproduction, distribution, display or performance rights in this work are provided.|
|Deposited By:||Kristin Buxton|
|Deposited On:||27 Mar 2008|
|Last Modified:||26 Dec 2012 09:54|