
As someone who's never really read that much on compression stuff, I have absolutely zero clue what this visualisation is actually showing me.

That's compounded by the lack of a legend. What do the different shades of blue and purple tell me? What is orange?

For example, on a given text an orange block shows something like x4<-135. The x4 seems to indicate that the first 4 binary values for the block are important, but I can't figure out what that 135 is referencing (I assume it's some pointer to a value?)





It is a backreference, the main way of dealing with full or partial repetitions in the LZ77 algorithm. It literally means: copy 4 characters starting 135 characters back in the output. Note that the copied range can overlap the characters being written, so x10<-1 is valid and means: repeat the last character 10 times.
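To make the overlap case concrete, here is a minimal Python sketch of how a decoder might apply a backreference. The function name `copy_backref` is made up for illustration; it is not part of zlib or any real library. The point is that copying one byte at a time lets the copy read bytes it has just written.

```python
def copy_backref(out: bytearray, length: int, offset: int) -> None:
    """Append `length` bytes, each read `offset` bytes behind the current end."""
    if not 1 <= offset <= len(out):
        raise ValueError("offset points before the start of the output")
    for _ in range(length):
        # One byte at a time, so the copy can reuse bytes it just wrote.
        out.append(out[-offset])


buf = bytearray(b"ab")
copy_backref(buf, 10, 1)    # x10<-1: repeat the last byte ten times
print(buf.decode())         # "a" followed by eleven "b"s

buf2 = bytearray(b"xyz")
copy_backref(buf2, 6, 3)    # x6<-3: the copy wraps around its own output
print(buf2.decode())        # "xyz" three times in a row
```

A bulk slice copy would not work here when the length exceeds the offset, which is why the loop goes byte by byte.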

Using this example paragraph, at compression level 1 or higher (copy with the quotation symbols):

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”

The red bit at the beginning is the zlib header and its parameters. This basically tells the decoder the format of the data coming up: the compression method, the window size, etc.
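If you want to poke at that header yourself, this rough sketch pulls apart the first two bytes of a stream produced by Python's standard `zlib` module, following the field layout in RFC 1950. The sample input is arbitrary, and the variable names are mine.

```python
import zlib

data = zlib.compress(b"hello hello hello", 6)
cmf, flg = data[0], data[1]

method = cmf & 0x0F                 # 8 means DEFLATE
window_bits = (cmf >> 4) + 8        # log2 of the LZ77 window size
has_preset_dict = bool(flg & 0x20)  # FDICT flag
level_hint = flg >> 6               # 0..3, informational only

# RFC 1950: the two header bytes, read as one big-endian number,
# must be a multiple of 31.
header_ok = (cmf * 256 + flg) % 31 == 0

print(method, window_bits, has_preset_dict, level_hint, header_ok)
```

Note that nothing in these two bytes records the length of the uncompressed data.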

The following grey section is the Huffman coding tables - more common characters in the input are encoded in fewer bits. This is what later tells the decoder that 000 means 'e' and 1110110 means 'I'.
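As a rough illustration of "common characters get fewer bits", here is a textbook Huffman construction in Python that only computes code lengths. It is a simplification, not what zlib actually does: real Deflate uses canonical, length-limited codes over a combined literal/length alphabet, plus a separate table for distances. `huffman_code_lengths` is a name invented for this sketch.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {byte value: code length in bits} for a plain Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every symbol under the merged node sits one level deeper
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


sample = b"it was the best of times, it was the worst of times"
lengths = huffman_code_lengths(sample)
for sym, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(repr(chr(sym)), bits)
```

The space and 't' end up near the top of the printed list with short codes, while one-off letters like 'b' get the longest ones.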

Getting into the content now - this is where the decoder can start emitting the uncompressed text. The first 3 purple characters are the three UTF-8 bytes of the fancy opening quote - because they're rare in this text, they're each encoded as 6 or 7 bits. Because they take a lot of bits, this website shows them in a purple color, as well as physically wider. The nearby 't' is encoded in 4 bits, 0110, and is shown in a bluer color.
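The reason one visible character turns into three symbols is that the compressor works on bytes, and the curly quote (U+201C) takes three bytes in UTF-8. A quick check in Python:

```python
quote = "\u201c"                 # the left double quotation mark
encoded = quote.encode("utf-8")

print(len(encoded))              # 3
print(encoded.hex(" "))          # e2 80 9c
```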

The orange bits you've mentioned are back references - "x10 <- 26" here means "go back 26 characters in what you've decoded, and then copy 10 characters again." In this way, we can represent "t was the " in only 12 bits, because we've seen it previously.
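You can check the "x10 <- 26" numbers by hand against the start of the example paragraph. This sketch just measures the distance between the first two occurrences of the repeated phrase; it does not run the compressor, so it shows where the numbers come from rather than proving what zlib will emit.

```python
text = "\u201cIt was the best of times, it was the worst of times".encode("utf-8")
phrase = b"t was the "

first = text.find(phrase)
second = text.find(phrase, first + 1)
offset = second - first          # how far back the earlier copy starts

print(len(phrase), offset)       # 10 26

# Copying 10 bytes from 26 bytes back reproduces the phrase.
copied = text[second - offset : second - offset + len(phrase)]
print(copied == phrase)          # True
```

The 'i' just before the match is emitted as a literal, because the first sentence starts with a capital 'I'.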

The grey at the end is a special "end of stream" marker, followed by a red checksum which allows decoders to make sure there wasn't any corruption in the input.
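For a zlib stream that trailing checksum is an Adler-32 of the uncompressed data, stored big-endian in the last four bytes (per RFC 1950). A small sketch using Python's standard `zlib` module, with an arbitrary sample input:

```python
import zlib

raw = b"it was the best of times, it was the worst of times"
stream = zlib.compress(raw)

stored = int.from_bytes(stream[-4:], "big")
print(stored == zlib.adler32(raw))   # True

# Flip one bit in the checksum and the decoder should refuse the stream.
corrupted = stream[:-1] + bytes([stream[-1] ^ 0x01])
detected = False
try:
    zlib.decompress(corrupted)
except zlib.error as exc:
    detected = True
    print("rejected:", exc)
```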

I think that's everything. Further reading: https://en.wikipedia.org/wiki/Zlib https://en.wikipedia.org/wiki/Deflate https://en.wikipedia.org/wiki/Huffman_coding


Thank you! I appreciate the explanation

Happy to help :) I think compression algorithms are super cool, and zlib is a nice example of how just two simple techniques (Huffman coding and dictionary compression) can combine to usefully compress nearly any real-world data.

Newer compression algorithms like zstd, brotli and lz4 basically just use these same methods in different ways (lz4 keeps only the dictionary half and skips entropy coding entirely). There are also newer alternatives to Huffman coding, like Asymmetric Numeral Systems and arithmetic coding, but fundamentally they're the same concept.



