Computing in the RAIN: a reliable array of independent nodes

Bohossian, Vasken and Fan, Chenggong C. and LeMahieu, Paul S. and Riedel, Marc D. and Xu, Lihao and Bruck, Jehoshua (2001) Computing in the RAIN: a reliable array of independent nodes. IEEE Transactions on Parallel and Distributed Systems, 12 (2). pp. 99-114. ISSN 1045-9219. http://resolver.caltech.edu/CaltechAUTHORS:BOHieeetpds01

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:BOHieeetpds01

Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.


Item Type: Article
Additional Information: © Copyright 2001 IEEE. Reprinted with permission. Manuscript received 1 Mar. 2000; revised 1 Aug. 2000; accepted 15 Aug. 2000. This work was supported in part by a US National Science Foundation Young Investigator Award (CCR-9457811), by a Sloan Research Fellowship, by an IBM Partnership Award, and by DARPA through an agreement with NASA/OSAT.
Subject Keywords: Distributed computing, scalable architectures, interconnection networks, fault tolerance, data storage, cluster computing
Record Number: CaltechAUTHORS:BOHieeetpds01
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:BOHieeetpds01
Alternative URL: http://dx.doi.org/10.1109/71.910866
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 5359
Collection: CaltechAUTHORS
Deposited By: Archive Administrator
Deposited On: 13 Oct 2006
Last Modified: 25 Feb 2014 21:59
