Analysis of Checkpointing Schemes for Multiprocessor Systems
Parallel computing systems provide hardware redundancy that helps to achieve low cost fault- tolerance. Fault-tolerance is achieved, in those systems, by duplicating the task into more than one processor, and comparing the states of the processors at checkpoints. Many schemes that achieve fault tolerance exist, and most of them use checkpointing to reduce the time spent retrying a task. Performance evaluation for most of the schemes either relies on simulation results, or uses a simplified fault model. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes for fault-tolerance. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results match well the values we obtained using a simulation program. We compare the average task completion time and total work of four checkpointing schemes, TMR, DMR-B-2, DMR-F-1 and RFCS. We show that generally increasing the number of processors reduces the average completion time, but increases the total work done by the processors. Namely, the TMR scheme, which uses three processors, is the quickest but does the most work, while the DMR-B-2 scheme, which uses only two processors, is the slowest of the four schemes but does the least work. However, in cases where there is a big difference between the time it takes to perform different operations, those results can change. For example, when we assume that the schemes are implemented on workstations connected by a LAN and the time to move data between workstations is relatively long, the DMR-B-2 scheme can become quicker than the TMR scheme.