CaltechAUTHORS
  A Caltech Library Service

Optimizing Workflow Data Footprint

Singh, Gurmeet and Vahi, Karan and Ramakrishnan, Arun and Mehta, Gaurang and Deelman, Ewa and Zhao, Henan and Sakellariou, Rizos and Blackburn, Kent and Brown, Duncan and Fairhurst, Stephen and Meyers, David and Berriman, G. Bruce and Good, John and Katz, Daniel S. (2007) Optimizing Workflow Data Footprint. Scientific Programming, 15 (4). pp. 249-268. ISSN 1058-9244. http://resolver.caltech.edu/CaltechAUTHORS:20180427-155708572

PDF - Published Version (1739Kb). Creative Commons Attribution.

Abstract

In this paper we examine the issue of optimizing disk usage and scheduling large-scale scientific workflows onto distributed resources where the workflows are data-intensive, requiring large amounts of data storage, and the resources have limited storage. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime when they are no longer needed, and we demonstrate that workflows may have to be restructured to reduce their overall data footprint. We show the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application and an astronomy application, Montage, running on a large-scale production grid, the Open Science Grid. We show that although reducing the data footprint of Montage by 48% can be achieved with dynamic data cleanup techniques, LIGO Scientific Collaboration workflows require additional restructuring to achieve a 56% reduction in data space usage. We also examine the cost of the workflow restructuring in terms of the application's runtime.
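
The runtime cleanup idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `peak_footprint`, the three-task chain, and the file sizes are all hypothetical, and the sketch assumes tasks run one at a time in a fixed order. It compares the peak disk usage of a workflow when every file is kept until the end against the peak when each file is deleted as soon as its last consumer has finished.

```python
from collections import defaultdict


def peak_footprint(tasks, sizes, cleanup):
    """Return the peak storage used when running `tasks` in the given order.

    tasks:   list of (name, inputs, outputs), already in a valid execution order
    sizes:   dict mapping file name -> size (arbitrary units)
    cleanup: if True, delete a file once its last consumer has finished
    """
    # Count how many tasks still need each file as an input.
    remaining = defaultdict(int)
    for _, inputs, _ in tasks:
        for f in set(inputs):
            remaining[f] += 1

    # Files that no task produces are workflow inputs, staged in up front.
    produced = {f for _, _, outputs in tasks for f in outputs}
    staged = {f for _, inputs, _ in tasks for f in inputs if f not in produced}
    used = sum(sizes[f] for f in staged)
    peak = used

    for _, inputs, outputs in tasks:
        # Outputs are written while the inputs are still on disk.
        used += sum(sizes[f] for f in outputs)
        peak = max(peak, used)
        if cleanup:
            for f in set(inputs):
                remaining[f] -= 1
                if remaining[f] == 0:  # no later task reads this file
                    used -= sizes[f]
    return peak


# Hypothetical three-task chain: a -> t1 -> b -> t2 -> c -> t3 -> d
tasks = [
    ("t1", ["a"], ["b"]),
    ("t2", ["b"], ["c"]),
    ("t3", ["c"], ["d"]),
]
sizes = {"a": 10, "b": 10, "c": 10, "d": 10}

print(peak_footprint(tasks, sizes, cleanup=False))  # 40
print(peak_footprint(tasks, sizes, cleanup=True))   # 20
```

In this toy chain, cleanup halves the peak because only one input and one output ever coexist on disk. The figure is a property of the made-up example and says nothing about the Montage or LIGO measurements. Files that no task consumes (here `d`) are treated as final outputs and are never deleted.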


Item Type: Article
Related URLs:
URL | URL Type | Description
http://dx.doi.org/10.1155/2007/701609 | DOI | Article
ORCID:
Author | ORCID
Brown, Duncan | 0000-0002-9180-5765
Additional Information: © 2007 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This work was supported by the National Science Foundation under grant CNS-0615412. R. Sakellariou and H. Zhao would like to acknowledge partial support from the EU-funded CoreGrid Network of Excellence (grant FP6-004265) and the UK EPSRC grant GR/S67654/01. The authors also thank the Open Science Grid for resources used for the motivation of this work. K. Blackburn and D. Meyers were supported by the LIGO Laboratory and NSF grants PHY-0107417 and PHY-0326281. The work of D. Brown was supported by the LIGO Laboratory and NSF grant PHY-0601459. The work of S. Fairhurst was supported by the LIGO Laboratory and NSF grants PHY-0326281 and PHY-0200852. LIGO was constructed by the California Institute of Technology and the Massachusetts Institute of Technology and operates under cooperative agreement PHY-0107417. This paper has been assigned LIGO Document Number LIGO-P070017-00-Z. Montage was supported by the NASA Earth Sciences Technology Office Computing Technologies (ESTOCT) program under Cooperative Agreement Notice NCC 5-6261. This research was done using resources provided by the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy’s Office of Science.
Group: Infrared Processing and Analysis Center (IPAC), LIGO
Funders:
Funding Agency | Grant Number
NSF | CNS-0615412
CoreGrid Network of Excellence | FP6-004265
Engineering and Physical Sciences Research Council (EPSRC) | GR/S67654/01
LIGO Laboratory | UNSPECIFIED
NSF | PHY-0107417
NSF | PHY-0326281
NSF | PHY-0601459
NSF | PHY-0326281
NSF | PHY-0200852
NSF | PHY-0107417
NASA | NCC 5-6261
Department of Energy (DOE) | UNSPECIFIED
Other Numbering System:
Other Numbering System Name | Other Numbering System ID
LIGO Document | P070017-00-Z
Record Number: CaltechAUTHORS:20180427-155708572
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20180427-155708572
Official Citation: Gurmeet Singh, Karan Vahi, Arun Ramakrishnan, et al., “Optimizing Workflow Data Footprint,” Scientific Programming, vol. 15, no. 4, pp. 249-268, 2007. doi:10.1155/2007/701609
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 86108
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 30 Apr 2018 16:18
Last Modified: 30 Apr 2018 16:18
