
Putting the squeeze on data

In work that could make it easier to handle huge data sets, MIT researchers improve data compression's fidelity without sacrificing speed
Caption:
Although compressing preexisting image files is not a typical application of linear compression, it provides a vivid means of comparing algorithms. A source image (top) was compressed using two linear-compression algorithms. When the images were decompressed, the older algorithm (middle) provided a slightly more faithful reconstruction of the original — but it took more than 100 times as long to execute as the new MIT algorithm (bottom).

Data compression is one of the fundamental research areas in computer science, letting information systems do more with less. It’s the reason the iPod nano can hold thousands of songs instead of hundreds, and it’s what keeps transmitted images from choking the Internet. If every digital file is a string of bits — zeroes and ones — then compression is a way to represent the same information with fewer bits.

Most compression techniques trade space for time: while the compressed file takes up less memory, it has to be decoded before its contents are intelligible. In applications where memory is in short supply but data needs constant updating, it can be prohibitively time consuming to keep decompressing a file, modifying it, and then recompressing it. As a result, such applications — monitoring Internet traffic, for instance, or looking for patterns in huge collections of scientific data — often use a type of compression called linear compression. With linear compression, a computer program can modify the data in a compressed file without first decoding it.
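The property is easiest to see as arithmetic: because the compression step is a linear map (in effect, a matrix multiplication), compressing an update and adding it to the existing summary gives the same result as recompressing the modified data from scratch. The short Python sketch below illustrates the idea; the random matrix and the dimensions are purely illustrative and are not the construction used in the MIT work.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000   # size of the (mostly sparse) data vector
m = 400      # size of the compressed summary ("sketch")

# Illustrative random measurement matrix; practical schemes use
# carefully structured (often sparse) matrices instead.
A = rng.standard_normal((m, n))

x = np.zeros(n)
x[[7, 42, 99]] = [5.0, -3.0, 8.0]   # a sparse data vector

sketch = A @ x                       # the compressed representation

# Update the data without ever decompressing:
# adding A @ delta to the sketch equals re-sketching x + delta.
delta = np.zeros(n)
delta[42] = 2.0                      # e.g., one more count for item 42
sketch += A @ delta

assert np.allclose(sketch, A @ (x + delta))
```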

Last year, Associate Professor Piotr Indyk of MIT's Computer Science and Artificial Intelligence Laboratory and his graduate student Radu Berinde introduced two different versions of a new linear-compression algorithm that perform as well as any yet invented — and for some applications, better. Both versions of the algorithm, however, had limitations: under certain extreme conditions, they’d just stop working. But this fall, at the Allerton Conference on Communication, Control, and Computing hosted by the University of Illinois at Urbana-Champaign, Indyk and Berinde presented a new version of the algorithm that combined the advantages of its predecessors and overcame their drawbacks.

“The two previous algorithms each had their own faults,” says Deanna Needell, a postdoc at Stanford who helped develop one of the other leading linear-compression algorithms. “And this tends to be the case in many situations in this area, that if you have two algorithms, one is good at one thing and the other is good at another. And [the new algorithm] sort of merged the two benefits. It’s like, here’s this algorithm that does both of the good things.”

Some compression techniques, like the zip algorithm commonly used for Internet downloads, are what’s called “lossless”: when you unzip the zipped version of a file, you recover every bit of the original. Other compression techniques are “lossy”: the MP3 version of a song, for example, takes up about a tenth as much space as the CD version, but it irreversibly discards a lot of subtle audio data.

Linear compression is lossy: expanding the compressed file doesn’t give you all the data in the original. But for many applications, that doesn’t matter. Take, for instance, Internet traffic monitoring. Packets of data traveling over the Internet pass through a succession of special-purpose computers called routers; each router examines the packet’s ultimate destination and tells it where to go next. There’s no way a router could store information about all the packets that pass through it in the course of a day, but with linear compression, it can store an approximation. Decoding the data can still disclose what Indyk calls the “heavy hitters” — the sites that are sending and receiving the most packets — which is what most researchers are interested in. In other applications, the heavy hitters might be the members of a large population whose blood tests positive for a disease, or the concentrations of particular molecules in a chemical sample.
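As a concrete, if simplified, illustration of the heavy-hitters use case, the sketch below uses a Count-Min sketch, a standard linear sketch from the streaming literature rather than the MIT algorithm, to flag the one destination that dominates a simulated packet stream. The class, parameters, and threshold are all illustrative.

```python
import numpy as np

class CountMinSketch:
    """A simple linear sketch for approximate frequency counts
    (illustrates the heavy-hitters idea; not the MIT algorithm)."""

    def __init__(self, width=256, depth=4, seed=0):
        self.width, self.depth = width, depth
        rng = np.random.default_rng(seed)
        # One random salt per row; each row hashes items independently.
        self.salts = rng.integers(1, 2**31 - 1, size=depth)
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _buckets(self, item):
        return [hash((int(s), item)) % self.width for s in self.salts]

    def add(self, item, count=1):
        for row, col in enumerate(self._buckets(item)):
            self.table[row, col] += count

    def estimate(self, item):
        # With non-negative updates, the estimate never undercounts.
        return min(self.table[row, col]
                   for row, col in enumerate(self._buckets(item)))

# A router-like stream in which 'site-3' is the heavy hitter.
cms = CountMinSketch()
stream = ['site-3'] * 900 + [f'site-{i}' for i in range(100)]
for packet_dest in stream:
    cms.add(packet_dest)

candidates = {f'site-{i}' for i in range(100)}
heavy = [s for s in candidates if cms.estimate(s) > 0.1 * len(stream)]
print(heavy)   # expected: ['site-3']
```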

Going the distance

According to Indyk, there are three principal criteria for evaluating the performance of a linear-compression algorithm. One is the degree of compression: how much smaller the compressed file is than the uncompressed data. The second is recovery time: how long it takes to decode the compressed data. (Indyk says that some of the early linear-compression algorithms would take “hours or even days” to reconstruct an image captured by a one-megapixel camera.) And since linear compression is lossy, the third is how accurately the algorithm can reconstruct the original file.

In the last seven years, Indyk says, the field has progressed to the point where linear-compression algorithms can perform well along any two of those three parameters at the expense of the third. Indyk and Berinde chose to trade some fidelity in reconstruction for efficient extraction and good compression. Indeed, Indyk and some of his other students have recently demonstrated that there’s a mathematical limit to how much space savings linear compression can afford — and his and Berinde’s algorithm reaches it.
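The article does not state the limit itself, but in the sparse-recovery literature the bound in question is typically expressed in terms of the sketch length m needed to recover the k largest entries of an n-dimensional data vector; the formula below is that standard statement, not a figure quoted by the researchers.

```latex
% Standard sparse-recovery bound on sketch length
% (assumed form; the article does not give the formula).
m \;=\; \Theta\!\left(k \log \frac{n}{k}\right)
```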

The insight behind the MIT researchers’ algorithm is fairly technical, but Indyk tried to explain it in layman’s terms. If you take two very different files — strings of ones and zeroes — of similar size, “the difference between them has a geometric interpretation,” Indyk explains. That is, there’s a way to mathematically describe the difference between the files in terms of distance: one file can be thought of as being close to or far away from the other.
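In mathematical terms, if the two files are written as vectors x and y, their difference is measured by a norm of x - y, and which norm is chosen turns out to matter in what follows. The two standard choices are shown below; this notation is supplied for clarity and does not appear in the article.

```latex
% Two standard ways to measure the "distance" between files x and y:
d_1(x, y) = \|x - y\|_1 = \sum_i |x_i - y_i|, \qquad
d_2(x, y) = \|x - y\|_2 = \Big(\sum_i (x_i - y_i)^2\Big)^{1/2}
```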

With linear compression, there’s generally a trade-off between how fast the compression algorithm is and how much of the original file can be recovered. Slower but more accurate algorithms tend to preserve the geometric distance between files: if two uncompressed files are far apart, the compressed versions will be, too. With faster algorithms, on the other hand, the compressed files tend to be much closer to each other than the source files were.
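For the slower but more accurate algorithms, this distance preservation is usually formalized, though not in the article itself, as the restricted isometry property of the compression matrix A, stated in the Euclidean norm:

```latex
% Restricted isometry property (l2 version), an assumed standard
% formulation rather than one quoted in the article:
(1 - \delta)\,\|x\|_2 \;\le\; \|Ax\|_2 \;\le\; (1 + \delta)\,\|x\|_2
\quad \text{for every } k\text{-sparse } x .
```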

Indyk and Berinde found a way to analyze the difference between compressed files using a different mathematical notion of geometric distance; under that analysis, some fast compression algorithms still preserve the distance between files. By taking advantage of this new perspective, the researchers were able to devise a decompression algorithm that recovers much more information from the original file without sacrificing any speed.
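The "different notion of geometric distance" corresponds, in the related published work on sparse measurement matrices, to the sum-of-absolute-values (l1) norm: sparse 0/1 matrices that distort Euclidean distances too much can still approximately preserve l1 distances. A common statement of this l1 analogue of the isometry property, drawn from the sparse-recovery literature rather than from the article, is:

```latex
% l1 restricted isometry for a sparse 0/1 matrix A with d ones per
% column (formulation from the literature, not from the article):
(1 - \delta)\, d\, \|x\|_1 \;\le\; \|Ax\|_1 \;\le\; d\, \|x\|_1
\quad \text{for every } k\text{-sparse } x .
```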

Work like Indyk and Berinde’s holds out the hope that soon, linear-compression algorithms will no longer need to sacrifice performance along one of the three parameters that Indyk mentioned — compression, time and accuracy. “I think we’re actually pretty close to that,” says Needell. “We’re pretty much to the end: we’re almost there.” She adds, however, that “there’s plenty of other directions that the field will go in.” For instance, she says, many of the mathematical techniques that linear-compression algorithms rely on could be adapted to improve the “recommendation engines” on web sites like Netflix or Amazon, which try to predict which books or movies a customer might like on the basis of prior history.
