Performance Optimization of Checkpointing Schemes with Task Duplication
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the redundancy in hardware and software resources. In these systems, checkpointing serves two purposes: it helps detect faults by comparing the processors' states at checkpoints, and it reduces fault recovery time by supplying a safe point to roll back to. The efficiency of checkpointing schemes is influenced by the time it takes to perform the comparisons and to store the states. Because each checkpoint combines storing states with comparing them, and the two operations have conflicting objectives regarding how frequently they should be performed, the performance of current checkpointing schemes is limited. In this paper we show that tuning the checkpointing scheme to a given architecture can significantly reduce the execution time. We present both analytical results and experimental results obtained on a cluster of workstations and on a parallel computer. The main idea is to use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (storing the states to reduce recovery time). With two types of checkpoints, the comparison and storage operations can each be used efficiently, which improves the performance of checkpointing schemes. As a particular example of this approach, we analyzed the DMR checkpointing scheme with store-checkpoints and compare-checkpoints on two types of architectures: one where the comparison time is much higher than the store time (such as a cluster of workstations connected by a LAN) and one where the store time is much higher than the comparison time (such as the Intel Paragon supercomputer). We implemented a prototype of the new DMR schemes and ran it on workstations connected by a LAN and on the Intel Paragon supercomputer. The experimental results match the analytical results and show that in some cases the overhead of the DMR checkpointing schemes on both architectures can be reduced by as much as 40%.
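
To make the distinction between the two checkpoint types concrete, the following is a minimal simulation sketch of the case where comparison is expensive relative to storage. It is not the paper's analytical model or its prototype: the function name `simulate_dmr`, the parameter names, the per-interval fault probability, and the backward scan over stored states used to locate a rollback point are all assumptions made for illustration.

```python
import random


def simulate_dmr(n, t_work, t_store, t_compare, k, p_fault, rng):
    """Return the total execution time of a task split into n intervals.

    Illustrative cost model (an assumption, not the paper's):
    - two redundant processes run in lockstep, so time is counted once;
    - a store-checkpoint (cost t_store) follows every interval;
    - a compare-checkpoint (cost t_compare) follows every k-th interval;
    - a fault in an interval makes the two states differ from then on.
    """
    time = 0.0
    done = 0  # intervals whose results are known to be fault-free
    while done < n:
        block = min(k, n - done)
        first_bad = None  # index of the first faulty interval in this block
        for i in range(block):
            time += t_work + t_store
            if first_bad is None and rng.random() < p_fault:
                first_bad = i
        time += t_compare  # compare-checkpoint at the end of the block
        if first_bad is None:
            done += block
        else:
            # Mismatch detected: scan the stored states backward, one extra
            # comparison per inspected state, to find the last matching one.
            inspected = min(block - 1, block - first_bad)
            time += t_compare * inspected
            done += first_bad  # roll back to the last good store-checkpoint
    return time


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 2, 3, 4, 6):
        runs = [simulate_dmr(12, 1.0, 0.1, 1.0, k, 0.05, rng) for _ in range(2000)]
        print(f"compare every {k} intervals: mean time {sum(runs) / len(runs):.2f}")
```

Under these assumptions, spacing the compare-checkpoints further apart saves comparison cost in fault-free runs, while the cheap store-checkpoints in between keep the amount of re-executed work small after a fault. The opposite architecture (expensive stores, cheap comparisons) would swap the roles of the two checkpoint types.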