A Caltech Library Service

Fast Statistical Alignment

Bradley, Robert K. and Roberts, Adam and Smoot, Michael and Juvekar, Sudeep and Do, Jaeyoung and Dewey, Colin and Holmes, Ian and Pachter, Lior (2009) Fast Statistical Alignment. PLOS Computational Biology, 5 (5). Art. No. e1000392. ISSN 1553-7358. PMCID PMC2684580.

[img] PDF - Published Version
Creative Commons Attribution.

[img] PDF - Supplemental Material
Creative Commons Attribution.


Use this Persistent URL to link to this item:


We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at, and the source code is available at

Item Type:Article
Related URLs:
URLURL TypeDescription CentralArticle
Pachter, Lior0000-0002-9164-6231
Additional Information:© 2009 Bradley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Received: October 22, 2008; Accepted: April 20, 2009; Published: May 29, 2009. Ian Holmes was partially supported by NIH/NHGRI grant 1R01GM076705. Lior Pachter was partially supported by NSF CAREER award CCF 03-47992. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. FSA borrows heavily from previous work, both in its code base and its intellectual foundations. Ideas. The distance-based approach to multiple alignment was proposed in [13],[14]. This included the idea of modifying the accuracy criterion suggested [53] and [54] to include gaps and the demonstration that the resulting modified expected accuracy could be used to control the expected sensitivity and specificity. Furthermore, [13],[14] introduced the sequence annealing approach to building multiple alignments, via the description of alignments using partially ordered sets [31],[55],[56]. The graph-based approach to alignment was formalized by [57] and these results were used in the DIALIGN program [58]. The query-specific learning method for re-estimating alignment parameters on the fly was inspired by [15] and conversations with Joseph Bradley about query-specific structure learning of graphical models. The iterative refinement technique is based on ideas in [59]. The FSA algorithm was parallelized using a modification of the approach in MW [60]. The coloring in the GUI according to posterior probabilities of alignment is inspired by the AU viewer of BAli-Phy [9]. Software. The sequence annealing implementation in FSA is based on Ariel Schwartz's AMAP program [14], which implements the Pearce-Kelly algorithm [61]. FSA's query-specific learning uses code created with Gerton Lunter's HMMoC compiler for HMMs [15]. The “aligner” example distributed with HMMoC, which implements a learning procedure for gap parameters, was an inspiration for FSA's learning strategies. FSA's banding code is taken directly from the “aligner” example. FSA's sequence and alignment representation code was inspired by similar code in the dart library [62]. Several Perl packages distributed with FSA are derived from packages of the same name in dart. Author Contributions: Wrote the paper: RKB CD LP. Led the development of FSA, wrote most of the code base, and developed the query-specific learning method: RKB. Redesigned the sequence annealing algorithm, constituted the core development team, and managed the project: RKB CD LP. Developed the GUI: AR. Developed a preliminary version of the GUI: MS. Developed the iterative refinement technique: SJ. Developed the parellelization and database modes: JD CD. Provided advice on the dart library, including its algorithms, programming and software components: IH. Created the FSA webserver: LP. The authors have declared that no competing interests exist.
Funding AgencyGrant Number
Issue or Number:5
PubMed Central ID:PMC2684580
Record Number:CaltechAUTHORS:20170306-135830452
Persistent URL:
Official Citation:Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, et al. (2009) Fast Statistical Alignment. PLoS Comput Biol 5(5): e1000392. doi:10.1371/journal.pcbi.1000392
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:74800
Deposited By: George Porter
Deposited On:06 Mar 2017 22:48
Last Modified:24 Feb 2020 10:30

Repository Staff Only: item control page