Computing in the RAIN: A Reliable Array of Independent Nodes

Bohossian, Vasken and Fan, Charles C. and LeMahieu, Paul S. and Riedel, Marc D. and Xu, Lihao and Bruck, Jehoshua (1999) Computing in the RAIN: A Reliable Array of Independent Nodes. California Institute of Technology. (Unpublished) https://resolver.caltech.edu/CaltechPARADISE:1999.ETR029

Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to RAINfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.


Item Type: Report or Paper (Technical Report)
Related URLs:
  URL: http://www.paradise.caltech.edu/papers/etr029.ps
  URL Type: Publisher
  Description: UNSPECIFIED
ORCID:
  Riedel, Marc D.: 0000-0002-3318-346X
  Bruck, Jehoshua: 0000-0001-8474-0812
Group: Parallel and Distributed Systems Group
Record Number: CaltechPARADISE:1999.ETR029
Persistent URL: https://resolver.caltech.edu/CaltechPARADISE:1999.ETR029
Usage Policy: You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.
ID Code: 26045
Collection: CaltechPARADISE
Deposited By: Imported from CaltechPARADISE
Deposited On: 03 Sep 2002
Last Modified: 22 Nov 2019 09:58
