
Compression algorithm

Markus Gutschke, gutschk@math.uni-muenster.de

1 July 1997


Some notes about the compression algorithm

The compressor shrinks its input to about 60% of the original size on average, which is on par with "gzip"; it seems that you cannot do much better when compressing compiled binaries. This means that the break-even point for using compressed images is reached once the uncompressed size approaches 1.5kB. We can stuff more than 12kB into an 8kB EPROM and more than 25kB into a 16kB EPROM. As there is only 32kB of RAM for both the uncompressed image and its BSS area, 32kB EPROMs will hardly ever be required.
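
A quick check of the capacity figures, assuming the 60% ratio holds across the board:

  $12\,\mathrm{kB} \times 0.6 \approx 7.2\,\mathrm{kB} < 8\,\mathrm{kB}$
  $25\,\mathrm{kB} \times 0.6 = 15\,\mathrm{kB} < 16\,\mathrm{kB}$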

The compression algorithm uses a 4kB ring buffer for buffering the uncompressed data. Before compression starts, the ring buffer is filled with spaces (ASCII character 0x20). The algorithm tries to find repeated input sequences of a maximum length of 60 bytes. All 256 different input bytes plus the 58 (60 minus a threshold of 2) possible repeat lengths form a set of 314 symbols. These symbols are adaptively Huffman encoded: the algorithm starts out with a Huffman tree that assigns equal code lengths to each of the 314 symbols (slightly favoring the repeat symbols over the symbols for regular input characters), and the tree is updated whenever the frequency of any of the symbols changes. Frequency counts are kept in 16-bit words until the total number of compressed codes reaches 2^15; then all frequency counts are halved (rounding up).

For unrepeated characters (symbols 0..255) the Huffman code is simply written to the output stream. For repeated sequences, the Huffman code denoting the length of the sequence is written out, and then the index into the ring buffer is computed. From this index, the algorithm derives the offset relative to the current position in the ring buffer. For typical input data one would expect short to medium offsets to be more frequent than very short or very long ones, so the 12-bit offset value (for a 4kB buffer) is statically Huffman encoded using a precomputed Huffman tree that favors the offset values deemed more frequent. The Huffman-encoded offset is written to the output stream directly after the code that gives the length of the repeated sequence.
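
The following is a minimal sketch of that outer loop in C, using a naive linear match search and simple stub routines in place of the two Huffman coders; the names (emit_symbol, find_match, and so on) are illustrative and do not correspond to the actual C example code.

  #include <stdio.h>
  #include <string.h>

  #define RING_SIZE  4096    /* ring buffer for uncompressed data */
  #define MAX_MATCH    60    /* longest repeated sequence         */
  #define THRESHOLD     2    /* shorter matches stay literal      */

  static unsigned char ring[RING_SIZE];

  /* Stand-ins for the real output routines: the actual code feeds */
  /* symbols 0..313 through the adaptive Huffman coder and offsets */
  /* through the static 12-bit offset tree.                        */
  static void emit_symbol(int sym) { printf("symbol %d\n", sym); }
  static void emit_offset(int off) { printf("offset %d\n", off); }

  /* Naive search for the longest match of in[i..] in the ring     */
  /* buffer; matches that would read bytes not yet written to the  */
  /* buffer are skipped here for simplicity.                       */
  static int find_match(const unsigned char *in, int i, int len,
                        int pos, int *best_off)
  {
      int lookahead = (len - i < MAX_MATCH) ? len - i : MAX_MATCH;
      int best = 0;

      for (int off = 1; off < RING_SIZE; off++) {
          int limit = (lookahead < off) ? lookahead : off;
          int start = (pos - off + RING_SIZE) % RING_SIZE;
          int l = 0;

          while (l < limit && ring[(start + l) % RING_SIZE] == in[i + l])
              l++;
          if (l > best) {
              best = l;
              *best_off = off;
          }
      }
      return best;
  }

  void compress(const unsigned char *in, int len)
  {
      int pos = 0;

      memset(ring, ' ', sizeof ring);   /* buffer starts out as spaces */

      for (int i = 0; i < len; ) {
          int off = 0;
          int match = find_match(in, i, len, pos, &off);

          if (match <= THRESHOLD) {
              emit_symbol(in[i]);       /* literal byte: symbols 0..255 */
              match = 1;
          } else {
              /* lengths 3..60 map onto symbols 256..313, followed by   */
              /* the offset back into the ring buffer                   */
              emit_symbol(256 + match - THRESHOLD - 1);
              emit_offset(off);
          }

          while (match-- > 0) {         /* slide the consumed bytes     */
              ring[pos] = in[i++];      /* into the ring buffer         */
              pos = (pos + 1) % RING_SIZE;
          }
      }
  }

A real implementation would use a faster match search than the linear scan above; the sketch is only meant to show how literals, lengths and offsets are framed.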

This algorithm, as implemented in the C example code, looks very good and its operating parameters are already well optimized. This also explains why it achieves compression ratios comparable with "gzip". Depending on the input data, it sometimes excels considerably beyond what "gzip -9" does, but this does not appear to be typical. The algorithm does have some flaws, such as the limited buffer sizes, the adaptive Huffman tree, which takes very long to adjust if the distribution of the input characters changes suddenly, and the static Huffman tree for encoding offsets into the buffer. The slow adaptation of the Huffman tree is partially counteracted by artificially limiting the frequency counts to 16-bit precision, but this does not come into play until 2^15 codes, i.e. at least 32kB of uncompressed data, have been encoded, so it has no impact on our use for "etherboot", because the BOOT Prom does not support uncompressed data of more than 32kB (cf. doc/spec.doc).
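
The rescaling of the frequency counts mentioned above is what keeps the counters within 16 bits and gives newer statistics more weight. A minimal sketch of that step, assuming a running total as the trigger (the actual code may derive the trigger from the root node's count instead, and the names used here are illustrative):

  #define NUM_SYMBOLS 314
  #define MAX_FREQ    0x8000U              /* 2^15                      */

  static unsigned short freq[NUM_SYMBOLS]; /* 16-bit frequency counts   */
  static unsigned long  total;             /* running sum of all counts */

  /* Called once for every symbol that gets encoded. */
  static void count_symbol(int symbol)
  {
      if (total >= MAX_FREQ) {
          total = 0;
          for (int i = 0; i < NUM_SYMBOLS; i++) {
              freq[i] = (unsigned short)((freq[i] + 1) / 2);  /* halve,     */
              total  += freq[i];                              /* round up   */
          }
          /* ...the adaptive Huffman tree would be rebuilt from the     */
          /* rescaled counts at this point...                           */
      }
      freq[symbol]++;
      total++;
  }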

Nonetheless, these problems do not seem to affect the compression of compiled programs very much. Mixing object code with English text would not work too well, though, and the algorithm should be reset in between. Actually, we might gain a little improvement if the text and data segments were compressed individually, but I have not experimented with this option yet.

