Compress files using Huffman coding.
Compresses the input file.
$ huffman [input-file] -o [output-file]Decompresses the input file.
$ huffman [input-file] -d -o [output-file]- [input-file]: The file to be processed (required).
- -o [output-file]: Specifies the output file (required).
- -d: Enables decompression mode (default is compression).
In 1952 David A. Huffman created an algorithm used for lossless data compression [@huffman1952method], building upon the advances of Claude Shannon's work on information theory. It has been proven to be optimal for symbol-by-symbol coding with a known input probability distribution. The algorithm assigns variable-length codes to input characters, and the shorter codes are assigned to more frequently occurring characters.
Note
The program has been designed to fulfill the following requirements:
Write a program that uses Huffman coding (classic or dynamic) to compress files. This project will allow you to practice working with binary data and using tree-like data structures. To handle binary data, you can use the bytestring package.\
We decided that our take on the implementation will provide a simple command-line interface.
$ wget https://pl.wikipedia.org/wiki/Polska -O polska.txt
$ sha256sum polska.txt
645e668ee...9c530 polska.txt
$ stack run -- polska.txt -o polska.txt.huff
polska.txt (1271161 bytes) -> polska.txt.huff (887871 bytes) [69.85%]
$ ls -lah polska.txt polska.txt.huff
-rw-r--r-- 1 rafisto rafisto 1.3M Jun 9 14:14 polska.txt
-rw-r--r-- 1 rafisto rafisto 868K Jun 9 23:27 polska.txt.huff
$ stack run -- polska.txt.huff -d -o polska.out.txt
polska.txt.huff (887871 bytes) -> polska.out.txt (1271161 bytes) [143.17%]
$ sha256sum polska.out.txt
645e668ee...9c530 polska.out.txtAs described in [@wikihuffman], can be summarized in the following steps:
- Count the frequency of each symbol in the input data.
- Create a leaf node for each symbol and add it to the priority queue.
- While there is more than one node in the queue:
- Remove the two nodes of highest priority (lowest probability) from the queue.
- Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
- Add the new node to the queue.
- The remaining node is the root node, and the tree is complete.
A set of codes
In any stream of bits, defined as
The project uses Stack as a build tool, which allows for simple dependency management and building.
$ stack buildAfterwards, the executable can be run with the following command:
$ stack exec huffman -- [input-file] -o [output-file]To make it available globally, we can use stack install.
$ stack install
$ huffman
Usage: huffman <input-file> [-d] -o <output-file>
$ huffman src/Main.hs -o Main.hs.huff
src/Main.hs (1887 bytes) -> Main.hs.huff (1411 bytes) [74.77%]The Huffman coding algorithm with its presented implementation in Haskell has shown to be an effective way to compress text files. We can assume a compression ratio of around 70% for large text files, with the decompression process being lossless and efficient.
The current implementation can be extended in several ways.
- Optimization of the priority queue, library containers provides a more efficient implementation.
- As proposed by [@knuth1985dynamic], dynamic Huffman coding allows data to be encoded without knowing the frequency distribution beforehand, by maintaining a dynamic tree structure that adapts to the input data as it is processed.
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes." Proceedings of the IRE, 40(9), 1098–1101. [@huffman1952method]
- Knuth, D. E. (1985). "Dynamic huffman coding." Journal of Algorithms, 6(2), 163–180. [@knuth1985dynamic]
- Wikipedia contributors. "Huffman coding — Wikipedia, The Free Encyclopedia." 2025. [Online; accessed 9-June-2025]. [@wikipedia_huffman]