Ziv, Avi and Bruck, Jehoshua (1994) Performance Optimization of Checkpointing Schemes with Task Duplication. California Institute of Technology . (Unpublished) http://resolver.caltech.edu/CaltechPARADISE:1994.ETR004
PDF (Adobe PDF (2.1MB))
See Usage Policy.
See Usage Policy.
Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechPARADISE:1994.ETR004
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the redundancy in hardware and software resources. In these systems, checkpointing serves two purposes: it helps in detecting faults by comparing the processors states at checkpoints, and it facilitates the reduction of fault recovery time by supplying a safe point to rollback to. The efficiency of checkpointing schemes is influenced by the time it takes to perform the comparisons and to store the states. The fact that checkpoints consist of both storing of states and comparison between states, with conflicting objectives regarding the frequency of those operations, limits the performance of current checkpointing schemes. In this paper we show that by tuning the checkpointing schemes to a given architecture, a significant reduction in the execution time can be achieved. We will present both analytical results and experimental results that were obtained on a cluster of workstations and a parallel computer. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, we can use both the comparison and storage operations in an efficient way and improve the performance of checkpointing schemes. As a particular example of this approach we analyzed the DMR checkpointing scheme with store and compare checkpoints on two types of architectures, one where the comparison time is much higher than the store time (like a cluster of workstations connected by a LAN) and one where the store time is much higher than the comparison time (like the Intel Paragon supercomputer). We have implemented a prototype of the new DMR schemes and run it on workstations connected by a LAN and on the Intel Paragon supercomputer. The experimental results we obtained match the analytical results and show that in some cases the overhead of the DMR checkpointing schemes on both architectures can be improved by as much as 40%.
|Item Type:||Report or Paper (Technical Report)|
|Group:||Parallel and Distributed Systems Group|
|Usage Policy:||You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.|
|Deposited By:||Imported from CaltechPARADISE|
|Deposited On:||04 Sep 2002|
|Last Modified:||26 Dec 2012 13:52|
Repository Staff Only: item control page