Data-Compression.org

data compression link collection

Suffix Trees

Suffix trees are a data structure that makes it convenient to do string matching against an entire data set in O(N) time. This is really wonderful, but creaing the suffix tree isn’t always that easy.


Adaptive Huffman Compression

C# implementation of adaptive Huffman coding. Implements both the FGK and Vitter algorithm variations. Compression provided through two public classes, AdaptiveHuffmanProvider and AdaptiveHuffmanStream. Good compression ratios for text-based data

http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=c8bc181b-5ddf-4969-
aeca-a508374f1282

* * * * *

Posted in April 25th, 2004

SuffixTree

An Open Source Suffx Tree implementation. Looks as though this might be oriented towards teaching applications, as it is an interactive application, not just an implementation of the data structure.

http://bioinformatics.rit.edu/~tex/publications/suffixtree/

         

Posted in September 7th, 2003

ANSI C implementation of a Suffix Tree

A C implementation of a Suffix Tree. This project also includes a Perl module that can use the C code.

http://cs.haifa.ac.il/~shlomo/suffix_tree/

*        

Posted in September 7th, 2003

Extended Application of Suffix Trees to Data Compression

By Jesper Larsson, Published in Proceedings of the IEEE Data Compression Conference 1996. Hey Jesper, how about putting a PDF version on line?

http://www.larsson.dogma.net/dccpaper.pdf

* * * * *

Posted in April 26th, 2003

The Context Trees of Block Sorting Compression

By Jesper Larsson. Published in Proceedings of the IEEE Data Compression Conference 1998, PDF version of the paper.

http://www.larsson.dogma.net/lev3.pdf

* * * * *

Posted in April 26th, 2003

Attack of the Mutant Suffix Trees

By Jesper Larsson. Published in Proceedings of the IEEE Data Compression Conference 1998, a PDF version of the paper.

http://www.larsson.dogma.net/lic.pdf

* * * *  

Posted in April 26th, 2003

Notes on Suffix Sorting

by Jesper Larsson. Technical report. LU-CS-TR:98-204, LUNFD6/(NFCS-3134)/1-6/(1998). A PDF version of the paper.

http://www.larsson.dogma.net/tr204.pdf

         

Posted in April 26th, 2003

Suffix Trees on Words

By Jesper Larsson, Arne Andersson and Kurt Swanson. Technical report. LU-CS-TR:95-158, LUNFD6/(NFCS-3107)/1-14/(1995). Early version of a paper published in Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching 1996; to appear in Algotrithmica

http://www.larsson.dogma.net/words-alg.pdf

* * * * *

Posted in April 26th, 2003

Jesper Larsson’s Research Links

Published in Links, Suffix Trees

I am a graduate student (doktorand) in algorithms and data structures at the Department of Computer Science at Lund University. Special interests are text searching and data compression.

I use the links on this page to papers, but don’t index the page itself

http://www.larsson.dogma.net/research.html

         

Posted in April 26th, 2003

Jesper Larsson

Jesper Larsson spent a fair amount of his years in academia
studying Suffix Trees. His home page has links to his thesis and a few other papers on suffix trees and other string matching/Data Compression topics.

http://www.larsson.dogma.net

         

Posted in April 23rd, 2003

Suffix Tree

The definition from the NIST Dictionary of Algorithms and Data Structures.

http://www.nist.gov/dads/HTML/suffixtree.html

* * *    

Posted in April 5th, 2003

Suffix Array

The definition from the NIST Dictionary of Algorithms and Data Structures.

http://www.nist.gov/dads/HTML/suffixarray.html

         

Posted in April 5th, 2003

bwtzip: A Linear-Time Portable Research-Grade Universal Data Compressor

bwtzip is an ongoing project, distributed under the GNU General Public License, to implement a Burrows-Wheeler compressor in standard, portable C++. It is research-grade in that it is highly modularized and abstracted, so that it is simple to swap out parts of the compressor without affecting anything else. This makes it easy to experiment with different algorithms at different stages of compression.

Looks like Steven T. Lavavej released a new version of bwtzip in early February, 2003. A wide variety of improvements, most of them in implementation - not visible to the end user. A description of recent changes is found here

http://stl.caltech.edu/bwtzip.html

         

Posted in December 11th, 2002

From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction

1997, Robert Giegerich, Stefan Kurtz. We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most recent algorithm, Ukkonen’s online construction, to explain its historic predecessors. The submitter of this paper indicates that it has user-friendly terminology, always welcome in Journal papers.

http://citeseer.nj.nec.com/giegerich97from.html

         

Posted in November 16th, 2002

Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching

R. Grossi and J. S. Vitter. “Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching,” Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC ‘00), Portland, OR, May 2000, 397-406.

http://www.cs.duke.edu/~jsv/Papers/catalog/node79.html

         

Posted in April 8th, 2002

Suffix Trees

A short and sweet tutorial on suffix trees. What they are and how to construct them.

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix.html

* * * * *

Posted in December 5th, 2000

Opportunistic data structures with applications

Two papers in which it is show how to combine the BWT with the suffix array
data structure, in order to build a sort of compressed suffix array.
In the first paper it is proven that
the space occupancy of the compressed suffix array can be bounded in
terms of the entropy of the input string. In the second paper it is
proposed and tested a practical implementation of this data structure.

The first paper will appear in the Proc. of 41st IEEE Symposium on
Foundations of Computer Science; the second one in the Proc. of the 12th
SIAM-ACM Symposium on Discrete Algorithms.

http://www.mfn.unipmn.it/~manzini/papers/focs00.html

* * * * *

Posted in September 23rd, 2000

Strmat

A collection of C programs that do string matching and pattern discovery. This appears to be free code by D. Gusfield, who also has a book called “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”.

One DCL reader commented The strmat package is wonderful.

http://www.cs.ucdavis.edu/~gusfield/strmat.html

* * * *  

Posted in January 23rd, 2000

Doug McIlroy’s source code page

Links to a documented implementation of a suffix sort. This may not be a compression topic per se, but suffix trees are useful for compressing data.

http://cm.bell-labs.com/cm/cs/who/doug/source.html

* * * * *

Posted in November 7th, 1999

Notes on suffix tree construction

Notes on suffix tree construction, some course notes for COMP 612: Graduate Seminar in Compiler Construction, includes some pointers to important papers.

DataCompression.info user David D. had this complaint: None of the links work on this page, all there is is a short paragraph on suffix trees. I have to agree, it’s a rather strange page.

http://www.cs.rice.edu/cgi-bin/texis/webinator/mysearch/+JwwBm+eNpxnwww0q_qc51hhoLaBGnnnrm
wwmeOmww/context.html

*        

Posted in November 7th, 1999