CaltechAUTHORS
  A Caltech Library Service

PPM performance with BWT complexity: a fast and effective data compression algorithm

Effros, Michelle (2000) PPM performance with BWT complexity: a fast and effective data compression algorithm. Proceedings of the IEEE, 88 (11). pp. 1703-1712. ISSN 0018-9219. http://resolver.caltech.edu/CaltechAUTHORS:EFFprocieee00

[img]
Preview
PDF
See Usage Policy.

169Kb

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:EFFprocieee00

Abstract

This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the prediction by partial mapping (PPM) algorithms and the low complexity of codes based on the Burrows Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM*, requires worst case O(n2) computational complexity. Like PPM*, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM*. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM* (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel-Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance-2.12 bpc with PPMZ9-listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ9, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively.


Item Type:Article
Additional Information:© Copyright 2000 IEEE. Reprinted with permission. Manuscript received February 15, 2000; revised August 8, 2000. This work was partially supported by NSF Award CCR-9909026.
Subject Keywords:Burrows Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression, source code design, source coding theory, transforms for source coding
Record Number:CaltechAUTHORS:EFFprocieee00
Persistent URL:http://resolver.caltech.edu/CaltechAUTHORS:EFFprocieee00
Alternative URL:http://dx.doi.org/10.1109/5.892706
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:6246
Collection:CaltechAUTHORS
Deposited By: Archive Administrator
Deposited On:29 Nov 2006
Last Modified:26 Dec 2012 09:19

Repository Staff Only: item control page