
Computing in the RAIN: A Reliable Array of Independent Nodes

Bohossian, Vasken and Fan, Charles C. and LeMahieu, Paul S. and Riedel, Marc D. and Xu, Lihao and Bruck, Jehoshua (2000) Computing in the RAIN: A Reliable Array of Independent Nodes. In: Parallel and distributed processing: 15 IPDPS 2000 workshops, Cancun, Mexico, May 1-5, 2000, proceedings. Lecture Notes in Computer Science. No. 1800. Springer, Berlin, Heidelberg, pp. 1204-1213. ISBN 9783540674429. https://resolver.caltech.edu/CaltechAUTHORS:20190828-102317828

Full text is not posted in this repository. Consult Related URLs below.

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20190828-102317828

Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to RAINfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.


Item Type: Book Section
Related URLs:
URL | URL Type | Description
https://doi.org/10.1007/3-540-45591-4_167 | DOI | Article
https://resolver.caltech.edu/CaltechAUTHORS:BOHieeetpds01 | Related Item | Journal Article
https://rdcu.be/b32Xb | Publisher | Free ReadCube access
ORCID:
Author | ORCID
Riedel, Marc D. | 0000-0002-3318-346X
Bruck, Jehoshua | 0000-0001-8474-0812
Additional Information: © Springer-Verlag Berlin Heidelberg 2000. Supported in part by an NSF Young Investigator Award (CCR-9457811), by a Sloan Research Fellowship, by an IBM Partnership Award and by DARPA through an agreement with NASA/OSAT.
Funders:
Funding Agency | Grant Number
NSF | CCR-9457811
Alfred P. Sloan Foundation | UNSPECIFIED
IBM | UNSPECIFIED
Defense Advanced Research Projects Agency (DARPA) | UNSPECIFIED
NASA/JPL/Caltech | UNSPECIFIED
Subject Keywords: Fault Tolerance; Link Failure; Communication Layer; Fault Management; Consistent History
Series Name: Lecture Notes in Computer Science
Issue or Number: 1800
Record Number: CaltechAUTHORS:20190828-102317828
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20190828-102317828
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 98300
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 28 Aug 2019 20:07
Last Modified: 08 May 2020 21:09
