Block Sorting treats a body of text as a large block of strings. The strings can be sorted using a reversible algorithm of some kind, yielding a more compressible dataset. The best known example of this technique is the Burrows-Wheeler Transform, or BWT. The Burrows-Wheeler Transform refers to a reversible transformation that can make a given text more compressible. Once this transform has been applied, the text can be compressed with a combination of Move to Front encoding followed by an entropy encoder.
bsdtar, libarchive
Libarchive is a programming library that can create and read several different streaming archive formats, including most popular tar variants and the POSIX cpio format. The bsdtar program is an implementation of tar(1) that is built on top of libarchive. It started as a test harness, but is quickly moving toward becoming a candidate system tar for FreeBSD
http://people.freebsd.org/~kientzle/libarchive/
12Ghosts Zip
This package includes 12Zip and 12Zip2. The first version uses Zip compatible compression, and the second uses a BWT variant.
Version 7.0 of the package is shipping as of May, 2004
http://www.12ghosts.com/ghosts/zip.htm
BWT in Matlab
Imran Akthar’s implementation of the BWT transform in Matlab. Free.
http://www.geocities.com/imran_akthar/files/bwt_matlab_code.zip
Embedded BWT Compression
Tom has posted his source code for embedded BWT compression. Basically, he’s trying to pull it off with low amounts of RAM.
http://tom.iahu.ca:8080/bwt.html
BAR
BAR Archiver is a free, cross-platform file compression and archiving tool written in Java. It outperforms most of the popular file
compression tools available as commercial utilities.
Compression is based on Burrows Wheeler Transformation and modified Huffman entropy coder. The compression ratio is in par
with most of the commercial grade compression utilities.
http://fermatjen.tripod.com/bar/
Full-Text Searching & the Burrows-Wheeler Transform
In this article, I examine an indexing method that lets you find any character sequence in the source text in time only proportional to the sequence length using a structure that can compress the entire source text and index into less space than the text alone. This technique is exceptionally fast at detecting and counting occurrences of any string in the source text.
http://www.ddj.com/documents/s=8944/ddj0312d/0312d.htm?temp=jiRiV3fTQc
WinBZip2
WinBZip2 is an small freeware utility for working with files produced by bzip2 utility (usually it’s files with .bz2 extension) and also create such files. Main features include the compressing, decompressing and testing the integrity of bzip2 files.
http://www.irnis.net/soft/winbzip2/
Transform
Transform is a BWT compressor written by Michael Bone. It supports a number of very interesting features, such as automatic Base 64 representation and image output. Shareware.
Version 1.02 shipped in October, 2003.
http://users.senet.com.au/~mjbone/
zipstream, bzip2stream: iostream wrappers for the zlib and bzip2 libraries
This article describes STL-compliant iostream implementations that compress and decompress using the deflate and bzip2 algorithms. It makes it really easy to use compressed streams in your C++ app.
Updated July, 2003, to fix a gzip header problem.
Updated August, 2003 to fix a couple of minor problems.
http://www.codeproject.com/vcpp/stl/zipstream.asp
UHBC 1.0
UHBC is a blocksorting compressor optimized for high compression ratios. The program uses recent research results by S. Deorowicz and J. Abel for improved second stage processing after the BWT. Some extensions and sophisticated modeling provide top compression ratios, giving the best known result for Calgary Corpus (2.208 bps average). UHBC is free for non-commercial use. No source available.
Reader Grebnov I. had this to say: Compression is very good, but too slow + no source code..
ftp://ftp.elf.stuba.sk/pub/pc/pack/uhbc10.zip
GRZip
GRZip - is a high-performance file compressor based on Burrows-Wheeler Transform and Advanced Weighted Frequency Counting. It uses The Block-Sorting Lossless Data Compression Algorithm, which has received considerable attention over recent years for both its simplicity and effectiveness. This implementation has compression rate of 2.364 bps on the Calgary Corpus. The site has Windows, Linux, and Dos executables along with source.
http://magics.h10.ru/main.php?page=GRZip&folder=projects
Text Preprocessing for Data Compression
by Jürgen Abel and William Teahan. This paper looks at a few different techniques for preprocessing data before performing text compression, and compares the gains achieved when combining the preprocessors with PPM, BWT, and LZ algorithms.
http://www.data-compression.info/JuergenAbel/Preprints/Preprint_Text_Preprocessing.pdf
Xceed Streaming Compression Library
The Xceed Streaming Compression Library is a high-performance “raw” compression library. It offers the ability to compress and decompress streaming data, buffers, strings or single files and supports multiple compressed data formats. Unlike the Xceed Zip Compression Library, this ultralight library doesn’t offer Zip file handling capabilities.
http://www.xceedsoft.com/products/Stream/index.htm
UFUP
UFUP advertises itself as a “decompression utility that handles common Unix package formats such as bzip2, gzip, and tar.”
http://sourceforge.net/projects/ufup/
ABC - Advanced Blocksorting Compressor
ABC is a free data compression program based on the Burrows-Wheeler transformation. The source code is free for academic, research and educational use as depicted in the Abel Public License (APL). The program is developed in DELPHI as a command line program just like GZIP.
Update: Jürgen has released the source code for ABC at long last! The Delphi source is available for download from the web site and can be used under his own APL.
http://www.data-compression.info/ABC/
Grundlagen des Burrows-Wheeler-Kompressionsalgorithmus
This is another German preprint of Jürgen Abel describing the principles of the Burrows-Wheeler Compression Algorithm. An
implementation of a BWT based compressor with a compression rate of 2.25 bps on the Calgary Corpus is presented. The paper will be
published in the German journal “Informatik - Forschung und Entwicklung”, Springer-Verlag Heidelberg, Association for Informatics
(Gesellschaft für Informatik eV.)
http://www.data-compression.info/JuergenAbel/Preprints/Preprint_Grundlagen_BWCA.pdf
Michael Walden’s Compression Pointers
A comprehensive set of compression pointers. Unfortunately, Michael is using some sort of software that makes bookmarking into his index impossible. So instead, you must link to the main page, shown here, and locate the links to “Data Compression”. Under that he has links to General Resources, Software, and Theory.
http://www.users.voicenet.com/~mwalden/
Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages
This preprint of Jürgen Abel gives a short introduction into the BWCA field and proposes several improvements for the WFC stage and the IF stage.
It further introduces a new RLE scheme for bypassing the run length
information around the WFC stage. The paper is the basis of the
BWCA program ABC and the revealed approach achieves a
compression rate of 2.238 bps, which is the best result for a pure
BWCA without any preprocessing before the BWT to date (March 2003).
http://www.data-compression.info/JuergenAbel/Preprints/Preprint_After_BWT_Stages.pdf
Verlustlose Datenkompression auf Grundlage der Burrows-Wheeler-Transformation
This is a preprint of a paper by Jürgen Abel describing the functionality
of a basic but quite fast Burrows-Wheeler compressor. Jürgen reports that this will be published in PIK, a German-language journal on Communications and Information Theory.
http://www.data-compression.info/JuergenAbel/Preprints/Preprint_Verlustlose_Datenkompressi
on_BWT.pdf
Burrows-Wheeler Transformation / Block Sorting (BWT)
Jürgen Abel has done an enormous amount of research on the Burrows-Wheeler Transform, and has published the results on his web site. On this page you will find:
- A summary of this compression technique.
- Links to over 70 online papers.
- Links to at least that many people involved in BWT research or development.
- Extensive links to BWT source code.
This web page may now be the definitive source of information for this field.
http://www.data-compression.info/Algorithms/BWT
Bzip2 classes
Gilad Novik created a pair of classes to compress and decompress data using the bzip2 format.
http://www.codeproject.com/useritems/BZ2Class.asp
Compressia
Compressia is a BWT-based archive utility with particularly high compression ratios. On Calgary, Canterbury, ACT-Text and ACT-Exe it surpasses all other BWT utilities. On Canterbury corpus it also surpasses all PPM utilities. Beta version 1.0 is available on the web site as of February, 2003.
DataCompression.info reader Juan L. said It is by far the best I’ve tried.
BZip2 for Java
Aftex Software makes a Java version of Bzip2. This includes input and output stream classes, which can be used in other Java applications. The program has both a GUI and command line interface.
http://www.aftexsw.com/bzip.html
bwtzip: A Linear-Time Portable Research-Grade Universal Data Compressor
bwtzip is an ongoing project, distributed under the GNU General Public License, to implement a Burrows-Wheeler compressor in standard, portable C++. It is research-grade in that it is highly modularized and abstracted, so that it is simple to swap out parts of the compressor without affecting anything else. This makes it easy to experiment with different algorithms at different stages of compression.
Looks like Steven T. Lavavej released a new version of bwtzip in early February, 2003. A wide variety of improvements, most of them in implementation - not visible to the end user. A description of recent changes is found here
http://stl.caltech.edu/bwtzip.html
Jürgen Abel
Jürgen is the proprietor of
www.data-compression.info,
an excellent resource for developers and researchers. Jürgen has a good supply of links to papers, conferences, books, etc. on the site, as well as executables and source for ABC, a freeware BWT compressor he wote in Delphi.
http://www.data-compression.info/JuergenAbel/
BWTCoder: Industrial strength BWT compression
This is a preliminary shot at creating an open source BWT compression engine. Things look very preliminary at this point with just a couple of files available for download and not much message traffic.
http://sourceforge.net/projects/bwtcoder/
szip homepage
Szip is a freeware portable general purpose lossless compression program. It has a high speed and compression, but high memory demands (up to 20MB) too. The compression is done using a variant of blocksorting, which explains its rather high memory requirements.
Update: Michael Schindler has at long last posted the source code for szip.
http://www.compressconsult.com/szip/
Embedded BWT Compression
An implementation of BWT designed with the goal of minimizing memory usage. Source code and a documentation page.
http://www.iahu.ca:8080/bwt.html
Karl Malbrain - BWT Source
Karl has created a complete BWT package, and has posted the source on this site. He also has an adaptation of N. Jesper Larsson’s
Burrows-Wheeler Suffix Sorting for your perusal.
http://www.geocities.com/malbrain
Sax.net Streaming Compression
Sax.net Streaming Compression helps you keep your data small and fast. Use high-performance compression and data compression code, using a class library that was designed from the ground up for integration with the Microsoft .NET framework.
In addition to being able to specify whether to prefer speed over size, Sax.net Compression offers you a choice of two compression algorithms: Industry-standard Deflate (ZIP) compression, and the newer Burrows-Wheeler (BZip2) transform, which generates especially great results when compressing XML data.
http://www.sax.net/dotnet/compression/