data compression link collection


Coding refers to techniques used to encode tokens or symbols. Two of the best known coding algorithms are Huffman Coding and Arithmetic Coding. Coding algorithms are effective at compressing data when they use fewer bits for high probability symbols and more bits for low probability symbols.

Arithmetic Coding revealed - A guided tour from theory to praxis

An updated and translated version of our German paper “Proseminar Datenkompression - Arithmetische Kodierung” from 2001. To the best of our knowledge, it is the first comprehensive paper that describes the whole way from the basic principles of AC up to a simple implementation, fully documented with C++ source code.

* * * * *

Posted in November 6th, 2004

Huffman Coding Class

This version of file encoder and decoder program is based on the Huffman coding method. It explicitly demonstrates the details of the files during the encoding and decoding. The algorithm is encapsulated in a class En_Decode in standard C++.

* *      

Posted in June 6th, 2004

Basic Compression Library

Marcus Geelnard has created a batch of compression routines that you can plug and ply into your programs at will. Marcus is using the wonderfully open zlib license, which means thare are just about no reason you can’t use this code. The 1.0.5 added an LZ77 codec to the RLE, Huffman, and Rice coders

Satisfied user Todd W said: I needed a simple set of compression routines for use in an embedded system. I must be able to store a fair amount of information in a small EEPROM as a generic database. The Huffman coder works very well in the application and has met my needs exactly! Very nice!

* * * * *

Posted in May 14th, 2004

Michael Dipperstein’s Airthmetic Coding Page

The third in Michael’s collection of pages explaining lossless compression algorithms. A nice tutorial accompanied by ANSI C source.

* * * * *

Posted in May 2nd, 2004

Five Cents on Arithmetic Encoding

Aliaksei Sanko makes a few improvements to the code in the original 1987 CACM article. His sample includes a templated producer and consumer.


Posted in May 1st, 2004

Compression and Encryption Sources

Links to a variety of lossless coders, includes source for Huffman, arithmetic, LZSS, and other compressors.

* * * * *

Posted in May 1st, 2004

Adaptive Huffman Compression

C# implementation of adaptive Huffman coding. Implements both the FGK and Vitter algorithm variations. Compression provided through two public classes, AdaptiveHuffmanProvider and AdaptiveHuffmanStream. Good compression ratios for text-based data

* * * * *

Posted in April 25th, 2004


BAR Archiver is a free, cross-platform file compression and archiving tool written in Java. It outperforms most of the popular file
compression tools available as commercial utilities.
Compression is based on Burrows Wheeler Transformation and modified Huffman entropy coder. The compression ratio is in par
with most of the commercial grade compression utilities.


Posted in April 11th, 2004

libhuffman - Huffman encoder/decoder library

libhuffman is a Huffman encoder/decoder library and a command line interface to the library. The encoder is a 2 pass encoder. The first pass generates a huffman tree and the second pass encodes the file. The decoder is one pass and uses a huffman code table at the beginning of the compressed file to decode the file.

Beta 3 shipped in October, 2003.

* * * *  

Posted in October 27th, 2003

PAQ4 Archiver

The latest in the series of multi-model compressors from Matt Mahoney. This improves on PAQ3n’s remarkable Calgary corpus performance by an additional 12K, at some expense in speed. Takes a whopping 84MB at runtime!


Posted in October 26th, 2003

Michael Dipperstein’s Huffman Code Page

Michael Dipperstein describes his personal quest for understanding an implementation of Huffman coding. Full source included.

The page was updated with new source December, 2002.

* * * * *

Posted in October 18th, 2003


Matt Mahoney says that with recent improvements by Serge Osnach, PAQ3N does better on the Calgary Corpus than any other open source compressor.

* * * * *

Posted in October 18th, 2003

RF Coding

A paper by Eduardo Enrique González Rodriguez that describes a proposed new method of entropy encoding. Eduardo overcomes some of the problems faced by Huffman coding in certain circumstances, such as a very small alphabet.


Posted in October 5th, 2003

RF Coder Source Code

Published in Source Code, Coding

The source code to accompany Eduardo Enrique Gonzalez Rodriguez’s artice on RF coding, his proposed new entropy encoder. (Note that this archive contains his paper as well, so you don’t need to download both.)


Posted in October 5th, 2003

Binary Encoding

A survey paper that covers the current roster of popular encoding algorithms, such as Base64, Base32, Hex, etc.


Posted in October 1st, 2003


A Huffman file compressor for *NIX platforms. The author admits that the compression is perhaps not the best, but says it’s going to be just right for embedded programming because of its miniscule footprint and high speed.

Version 1.01 was shipping in June, 2003.

* * *    

Posted in June 12th, 2003

Self-similar Huffman Trees with Extreme Guessing Propertie

An article by John Pliam. Needs a better summary, but I don’t quite get it, other than the fact that there is some cryptographic thinking going on here.


Posted in May 4th, 2003

C Library to search over compressed texts

A team from the University of Pisa in Italy have created a set of two libraries to demonstrate the ability to perform searches in compressed text. The code is packaged as two LGPL libraries: HuffwLib and CGrepLib, and sample programs are available on the site for demonstration purposes.

* * * * *

Posted in April 15th, 2003

K-ary Huffman Encoding

The definition from the NIST Dictionary of Algorithms and Data Structures.


Posted in April 5th, 2003

Prefix Code

The definition from the NIST Dictionary of Algorithms and Data Structures.

* * * * *

Posted in April 5th, 2003

Shannon-Fano Coding

The definition from the NIST Dictionary of Algorithms and Data Structures.


Posted in April 5th, 2003

Arithmetic Coding

Published in Links, Arithmetic Coding

The NIST page on arithmetic coding from their Dictionary of Algorithms and Data Structures. A terse definition and a couple of links.


Posted in March 21st, 2003

Contributions à la compression de données

Steven Pigeon’s Ph. D Thesis from the University of Montreal. Proposes a new set of universal codes, which he dubs taboo codes, as well as new optimization algorithms for (Start, Step, Stop) codes. Plus lossy variations on LZW.


Posted in March 17th, 2003

Canonical Huffman Coder Construction

The first time I ever heard the phrase Canonical Huffman Coder was in reference to the technique used in PKZip to store Huffman tables. I don’t know where the technique originated, but it is basically a way to construct a Huffman table so that the actual codes don’t have to be stored when storing the table. This makes for some nice space savings when compared to a first-pass naïve implementation. (Like the ones I’ve done in the past.)

It turns out that somebody named Gareth was attempting to implement this code but was having a bit of trouble. His post to comp.compression brought out some useful help from some of the newsgroup regulars, and did a lot to shed light on the topic, and includes a reference to the paper that actually gave birth to the concept.

* * * *  

Posted in March 13th, 2003

A Memory-Efficient Huffman Adaptive Coding Algorithm for Very Large Sets of Symbols

This paper describes an adaptive Huffman scheme that is designed to work with large numbers of symbols - at least that’s what the abstract says. The web page indicates that there are executables in the archive, but all I see is a Word document.


Posted in March 10th, 2003

Steven Pigeon’s Phasing In Codes

Coding integers of an arbitrary length is an interesting problem. Steven Pigeon discusses a way to code integers in a very efficient manner, which approaches the optimal Log2(n) value.


Posted in March 7th, 2003

Arithmetic Coding Revisited: A New Floating Point Technique

Anand Jain discusses his new technique for arithmetic coding. An executable file is included, but no source code.


Posted in February 22nd, 2003

Compression via Arithmetic Coding in Java

Bob Carpenter has created a nice Java package that implements a PPM/arithmetic coding compression system. This page includes links to the source code, javadocs, and a fair amount of tutorial material. Very complete!

* * * * *

Posted in December 11th, 2002

bwtzip: A Linear-Time Portable Research-Grade Universal Data Compressor

bwtzip is an ongoing project, distributed under the GNU General Public License, to implement a Burrows-Wheeler compressor in standard, portable C++. It is research-grade in that it is highly modularized and abstracted, so that it is simple to swap out parts of the compressor without affecting anything else. This makes it easy to experiment with different algorithms at different stages of compression.

Looks like Steven T. Lavavej released a new version of bwtzip in early February, 2003. A wide variety of improvements, most of them in implementation - not visible to the end user. A description of recent changes is found here


Posted in December 11th, 2002

McKee’s Directed Acyclic Graph Compression

Will McKee has released this as freeware - includes complete source to a string substitution compressor. From the description it sounds as though it’s variant on LZSS, but I’ll defer to anyone willing to do a real analysis.


Posted in December 9th, 2002