Distributed Caching for Processing Raw Arrays

Zhao, Weijie and Rusu, Florin and Dong, Bin and Wu, Kesheng and Ho, Anna Y. Q. and Nugent, Peter (2018) Distributed Caching for Processing Raw Arrays. In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management (SSDBM '18). ACM, New York, NY, Art. No. 22. ISBN 978-1-4503-6505-5. http://resolver.caltech.edu/CaltechAUTHORS:20180709-162157896

PDF - Submitted Version (531Kb)
See Usage Policy.

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:20180709-162157896

Abstract

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. In the first stage, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, the framework determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms that the proposed framework outperforms existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.


Item Type: Book Section
Related URLs:
  URL | URL Type | Description
  https://doi.org/10.1145/3221269.3221295 | DOI | Article
  https://arxiv.org/abs/1803.06089 | arXiv | Discussion Paper
ORCID:
  Author | ORCID
  Ho, Anna Y. Q. | 0000-0002-9017-3567
  Nugent, Peter | 0000-0002-3389-0586
Additional Information: © 2018 ACM. This work is supported by a U.S. Department of Energy Early Career Award (DOE Career).
Funders:
  Funding Agency | Grant Number
  Department of Energy (DOE) | UNSPECIFIED
Record Number: CaltechAUTHORS:20180709-162157896
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20180709-162157896
Official Citation: Weijie Zhao, Florin Rusu, Bin Dong, Kesheng Wu, Anna Y. Q. Ho, and Peter Nugent. 2018. Distributed caching for processing raw arrays. In Proceedings of the 30th International Conference on Scientific and Statistical Database Management (SSDBM '18). ACM, New York, NY, USA, Article 22, 12 pages. DOI: https://doi.org/10.1145/3221269.3221295
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 87672
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 10 Jul 2018 14:40
Last Modified: 10 Jul 2018 14:40