
MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

Kim, Minji and Zhang, Xiejia and Ligo, Jonathan G. and Farnoud, Farzad and Veeravalli, Venugopal V. and Milenkovic, Olgica (2016) MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinformatics, 17 . Art. No. 94. ISSN 1471-2105. PMCID PMC4759986.

PDF - Published Version
Creative Commons Attribution.

PDF (Additional file 1: Comparison between Kraken and MetaPhyler) - Supplemental Material
Creative Commons Attribution.

PDF (Additional file 2: Datasets used for testing MetaCRAM) - Supplemental Material
Creative Commons Attribution.

PDF (Additional file 3: Software instruction) - Supplemental Material
Creative Commons Attribution.

PDF (Additional file 4: Outcome of MetaCRAM) - Supplemental Material
Creative Commons Attribution.




Background: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and in human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to produce large files, typically between 1 and 10 GB. This creates challenges in analyzing, transferring and storing metagenomic data. To overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for processing and lossless compression of metagenomic reads in FASTA and FASTQ format.

Results: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain by maintaining taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2 to 13 percent of the original raw metagenomic file sizes.

Conclusions: We described the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel, and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline.

Availability: The MetaCRAM software is freely available online. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a VirtualBox setup on a machine with 4 GB of RAM is provided so that users can run a simple demonstration.
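The abstract mentions integer compression of read-reference positions without giving details. As a rough illustration only, the following sketch delta-encodes sorted read start positions and codes each gap with an Elias-gamma code, a generic textbook integer code. It is not taken from the MetaCRAM software; the function names and the sample positions are hypothetical, and the actual codes and fitted distributions used by the authors are described in the paper itself.

```python
from typing import List


def elias_gamma_encode(n: int) -> str:
    """Return the Elias-gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for positive integers only")
    binary = bin(n)[2:]                      # binary form, always starts with '1'
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the value


def encode_positions(positions: List[int]) -> str:
    """Delta-encode sorted, non-negative start positions and gamma-code each gap.

    Gaps are shifted by +1 so that a zero gap (two reads starting at the
    same reference position) is still a valid positive integer.
    """
    bits = []
    previous = 0
    for pos in sorted(positions):
        bits.append(elias_gamma_encode(pos - previous + 1))
        previous = pos
    return "".join(bits)


def decode_positions(bits: str) -> List[int]:
    """Invert encode_positions and recover the sorted positions."""
    positions = []
    previous = 0
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        value = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        previous += value - 1                # undo the +1 shift
        positions.append(previous)
    return positions


if __name__ == "__main__":
    starts = [5, 12, 12, 40, 1000]           # hypothetical read start positions
    encoded = encode_positions(starts)
    assert decode_positions(encoded) == sorted(starts)
    print(len(encoded), "bits, versus", 32 * len(starts), "bits as fixed 32-bit integers")
```

Small gaps receive short codewords, so densely covered reference regions cost few bits per read. The round-trip check in the example reflects the lossless requirement stated in the abstract.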

Item Type: Article
Related URLs:
National Center for Biotechnology Information Sequence Read Archive (five links)
PubMed Central: Article
Related Item: MetaCRAM software
ORCID: Farnoud, Farzad, 0000-0002-8684-4487
Additional Information: © 2016 Kim et al. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

Received: 5 October 2015; Accepted: 2 February 2016; Published: 19 February 2016.

Acknowledgements: This work was supported by National Science Foundation grants CCF 0809895, CCF 1218764, IOS 1339388 and CSoI-CCF 0939370, the National Institutes of Health U01 BD2K for Targeted Software Development grant U01 CA198943-02, and the National Science Foundation Graduate Research Fellowship Program under Grant Number DGE-1144245. The authors also thank Amin Emad for useful discussions in the early stage of the project.

Authors' Contributions: MK, XZ, JL, FF, VV and OM contributed to the theoretical development of the algorithmic method. MK, XZ and JL implemented the algorithms, while MK, XZ and FF tested them on a number of datasets. MK also co-wrote the paper and suggested a number of components in the execution pipeline. OM conceived the work, proposed the compression architecture and wrote parts of the paper. All authors read and approved the final manuscript.

The authors declare that they have no competing interests.
Funding Agency                      Grant Number
NSF                                 CCF 0809895
NSF                                 CCF 1218764
NSF                                 IOS 1339388
NSF                                 CCF 0939370
NIH                                 U01 CA198943-02
NSF Graduate Research Fellowship    DGE-1144245
Subject Keywords: Metagenomics; Genomic Compression; Parallel Algorithms
PubMed Central ID: PMC4759986
Record Number: CaltechAUTHORS:20160226-085751361
Official Citation: MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. Minji Kim, Xiejia Zhang, Jonathan G. Ligo, Farzad Farnoud, Venugopal V. Veeravalli and Olgica Milenkovic. BMC Bioinformatics 2016 17:94. DOI: 10.1186/s12859-016-0932-x
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 64794
Deposited By: Melissa Ray
Deposited On: 29 Feb 2016 19:30
Last Modified: 09 Mar 2020 13:18
