The trouble with this is that, as a software author, it doesn't really matter if it takes 70 seconds instead of 33 to install my software. 70 seconds is fast enough, for someone who's already decided to start downloading something as involved as Meteor; even if it took one second it wouldn't get me more users. And it would have to take over 5-10 minutes before I start losing users.
On the other hand, having to deal with support requests from users who don't have any decompressor other than gzip will cost me both users and my time. Some complicated "download this one if you have xz" or "here's how to install xz-utils on Debian, on RHEL, on ..." will definitely cost me users, compared to "if you're on a UNIXish system, run this command".
From a pure programming point of view, sure, xz is better. But there's nothing convincing me to make the engineering decision to adopt it. The practical benefits are unnoticeable, and the practical downsides are concrete.
There are downsides which most young fellas around here don't really appreciate. Hands up, how many of you remember bzip? Not bz2, but bzip, the original one, mostly seen as "version 0.21"?
Well, at the time it was released, people were making much the same arguments (with kittens). It compressed so much better than gzip, no reason to use the obsolete gzip format and tools, etc. And some of us jumped on the hype bandwagon and started recompressing our data, only to find out afterwards that bzip2 is now the new thing and the format is not only obsolete, but also patent-encumbered and in general needs to be phased out.
From a long-term perspective I'm fine with gzip. At least I know that I'll be able to open my data in 10 years time, which is not the case with bzip-0.21. The jury is still out on "xz", in my opinion.
Agreed, and I think people do not fully appreciate it, because it doesn't play out as a single safety question the way you describe. The question is not "is X safe and Y unsafe"; each gets asked on its own. Is X safe, and will it still work on old systems or in ten years? Yes, 100%. Is Y safe, and will it still work in ten years? Well, some old systems might have issues, and the patents might still exist, and...
If you're a developer, gzip is simply the default option. It's not the best, but it's good enough and it's safe.
I had to explain the same thing to an engineering team the other day. There was a push to switch from a "fast but ok" compression algorithm to a "faster and better" compression algorithm. This seemed like win-win, but I explained that:
* The faster compression made a difference of about 100 milliseconds to a user experience lasting minutes.
* The better compression made a difference of about 1 second to most users.
* The change of compression algorithm would take time away from engineering teams, and ultimately introduce bugs.
So in the end it was sidelined until other fundamental changes (file format etc) made it able to be coat-tailed into production.
Same story as yours: don't fix what ain't broke. Tallest working radio antenna in the world -> longest lying broken antenna in the world.
Never seen that in my life until now - what does it do? Just unzip the file to /dev/null? What's the purpose? Does the verbose flag show you what's inside, but the /dev/null means it's not written to disk while unzipping?
Actually, the files are decompressed to the current directory; it's just the output of the verbose flag that goes to /dev/null. Which makes it even more senseless.
Exactly. I've seen people who always do `tar xzvf` and have no idea removing the `v` is the correct way to make it not print the name of every file in the archive.
You didn't use to be able to omit the 'z' switch. You had to specify 'z' or 'j' depending on whether you wanted gzip or bzip2 decompression. It's a somewhat recent (sometime in the last 15 years, I think) change to "tar" to make it just detect the compression algorithm.
Isn't it better though to omit the -v switch and do `ls *` and/or `tree` afterwards? That gives you the same information but structured so it's much easier to understand.
The advantage of -v is that you can see what is being extracted as it happens. This is useful if you have a tarball with thousands of small files, as otherwise it's hard to tell whether tar has got stuck or there are just a lot of files.
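If all you want is a sign of life, there's a quieter middle ground - a hedged sketch, assuming GNU tar with checkpoint support (the archive name is hypothetical):

# Print every extracted file name (noisy but reassuring):
$ tar -xvf huge.tar.gz
# GNU tar can instead report progress every N records:
$ tar -xf huge.tar.gz --checkpoint=10000 --checkpoint-action=echo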
$ tar -xf foo.tar.xz
tar: End of archive volume 1 reached
tar: input compressed with xz
$ tar -xf foo.tar.gz
tar: End of archive volume 1 reached
tar: input compressed with gzip; use the -z option to decompress it
It tells you how it's compressed and how to decompress it if it knows how. OpenBSD's tar doesn't support xz so it can't help there, but does support gzip so it suggests using -z.
Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.
>Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.
How is that a feature? The user's explicitly asking for this.
This feature reminds me of vim, which suggests closing with ":quit" when you press C-x C-c (i.e. the keychord to close emacs). It knows full well what you want to do and even has special code to handle it, but then insists on handing you more work.
Vim suggests closing with ":quit" when you hit C-c; the C-x is irrelevant.
Upon receiving a C-c, it does not know full well what the user wants to do.
When vim receives a C-c from you (or someone who just stumbled into vim and doesn't know how to exit) the user wants to exit.
When vim receives a C-c from me, it's because I meant to kill the process I spawned from vim, and it ended before the key was pressed. I very much do not want it to quit on me at that point.
`tar -xf` is not "explicitly asking" for gzip. `tar -zxf` is "explicitly asking" for gzip.
I don't really care what vim does, that's a different argument. There have been many vulnerabilities in gzip, and in tar implementations that let untrusted input choose how it gets parsed, those vulnerabilities might as well be in tar itself.
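For what it's worth, a minimal sketch of keeping that choice out of the input's hands, assuming GNU tar (archive name hypothetical) - either name the decompressor yourself, or run it as a separate process:

# Explicitly request xz; tar doesn't sniff the stream to pick a parser:
$ tar -xJf foo.tar.xz
# Or keep the decompressor entirely outside tar:
$ xz -dc foo.tar.xz | tar -xf -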
MP3s with a decent bit rate are as good as it gets. Of course something like ogg back in the napster days would have been fantastic, but MP3 at 320 Kbps is fine for anyone who doesn't pay $1000 a meter for speaker wire.
If I recall correctly, there are some patents still alive in the U.S. until 2017 covering MP3 encoders, requiring a license to be purchased per copy distributed.
The workaround that the parent is talking about is usually "get LAME from a different distributor", which is still done by Audacity and others.
For those of us that sample from songs that we buy, WAVs are a bit easier to work with because the DAW doesn't have to spend time converting it. That said, since most of my tracks these days are using either the 48k or 96k sample rate, it still needs to be converted from 44.1 :)
But can you tell the difference between a variable-encoded 320 Kbps MP3 using a modern encoder and a wav file? I have some reasonable equipment and I most definitely can't.
When I DJed I used a mix of FLAC files and 320 Kbps CBR and high level VBR. On performance equipment I could tell VBR was not holding up. There is also some quality loss that you encounter when slowing down MP3s that is not present for FLAC or WAV, especially when kept in key, but for the most part that is only audible beyond the 8-15% range, and it was not common to alter the tempo that much for me. I ended up settling mostly on FLAC when I can get it and 320 CBR otherwise. I don't think I ever heard the difference.
Can you hear the difference between 320 Kbps CBR and VBR? I have to say I have never tried slowing down the music to try and hear the difference so it is possible under these conditions that it might make a difference.
The degree of difference depends on the kind of music you listen to. Live recordings of acoustic ensembles in airy cathedrals -- in that case you can tell the difference. On tracks that have a highly produced studio sound, where everything is an electronic instrument -- not going to be much of a difference.
I tried doing tests like this and I could not find any recordings where I could tell the difference at 160 kbps VBR. I'm not saying that it is impossible, but the conditions must be pretty rare and the difference very minor - compared to the massive degradation that comes from room effects it amounts to nothing.
> compared to the massive degradation that comes from room effects it amounts to nothing.
Truth.
Chamber music in an echo-y cathedral. With bad encoding, you can hear a noticeable difference in the length of time the reverberations are audible, and the timbre of those reverberations can be quite different. With lots of acoustic music, the "accidental beauty" produced by such effects can be quite important.
Finding this convinced me to re-encode my music collection in 320kbps MP3 for anything high quality, and algorithmically chosen variable bitrates for lower quality recordings -- usually around 160 kbps. That was quite a number of years ago, though. I'd probably use another format today.
That's not true. MP3 simply never gets transparent and you can notice with 5 dollar in-ears. And people in general notice. This leads to absurdities such as bitrates of 320 kbps, even though these do not sound significantly better than 128 kbps and are still not transparent.
On the other hand, 128 kbps AAC is transparent for almost any input. AAC is supported about everywhere mp3 is. The quality alone should be convincing. The smaller size makes the continued usage of mp3 IMHO insane.
OTOH "the scene" still does MPEG-2 releases I think.
I have listened to a lot of MP3 at different bit rates, and with modern encoders and variable bit rates I can't tell the difference between anything above 160 kbps - most of the time it is hard to tell the difference between 128 kbps and anything higher. Really, at 320 kbps you are entering the realm of fantasy if you think you can hear any difference.
I absolutely heard a difference between 320 and everything below. You can tell me I didn't, but I did. There is a world of difference between 160Kbps and 256, and 128 is a lot worse. If you can't hear it, I understand, but the blame isn't the algorithm -- it is your equipment, your song selection, or your ears.
This is not true. It is trivial for almost anyone to distinguish 320kbps mp3 from uncompressed audio, with built-in DACs and $5 headphones, with as little as 5 minutes of training.
You're also describing everyone else's experience:
> Really at 320kps you are entering the realm of fantasy if you think you can hear any difference.
It depends on the encoder, the track, your equipment, and how good you are at picking out artifacts. Some people do surprisingly well in double-blind tests, though I doubt anyone can do it all the time on every sample.
This is why ABX testing is so big in lossy audio circles. People can and do demonstrate their ability to distinguish between lossy and lossless encodings with certain samples in double-blind tests, at all sorts of bitrates. I've done it myself occasionally.
That people have been doing this for many years is one of the big reasons modern encoders are so good - they've needed tonnes of careful tuning to get to this point.
You are making some bold statements about the general transparency of different audio formats that contradict pretty much everything I've read about this topic so far. Hence, I'd like to learn more, do you have any links that you would recommend?
Well, try it yourself :). Make sure to make it a blind test with the help of somebody else. Ideally such things would be subject to scientific studies. But these are kind of expensive and nobody cares for mp3 anyway. I'm not aware of any recent ones.
Hydrogenaudio listening tests [1] are studies by volunteers, but they focus on non-transparent compression. Anyway, they also illustrate how bad mp3 is.
I actually tried this last year, and found out the hard way after reencoding my mp3 collection to vbr opus at around half the bitrate (I did some light quality testing to make sure it was of similar fidelity, of course you lose some quality going lossy -> lossy) that either opus-enc or gstreamer at the time would produce choppy broken audio.
And it was reproducible on all my computers. I couldn't use my opus collection at all because either the encoder was broken or the playback was broken.
I need to do it again at some point, when I have 8 hours to transcode everything and try again. See if they've fixed it.
Using xz for Linux builds of your software might make sense though, or would do so in the future. Recent releases of Fedora and RHEL already use xz to compress their RPM packages.
Debian/Ubuntu dpkg supports compressing with xz too -- and it's a hard dependency of dpkg at least as far back as the precise (12.04) LTS release. So I'd say the majority of Linux users already have access to xz.
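As a rough illustration (the package directory name is hypothetical, and this assumes a dpkg new enough to know about xz), choosing xz for a .deb is just a flag to dpkg-deb:

# Build a package whose data member is xz-compressed:
$ dpkg-deb -Zxz --build mypkg-1.0 mypkg-1.0.deb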
I just started typing 'xz compression' into Google to learn more about it, and it offered 'xz compression not available'. And some more queries that indicate it's not quite ubiquitous.
It's good to be informed about the capabilities of xz. I will keep using gzip, but consider xz in situations where the size or time matters. I might not care about 100 megs versus 50 very much, but I will about two gigabytes versus one.
Is it likely that a user has gzip on a system but not tar itself?
From the article:
What about tooling?
OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
It does, and should have since at least 10.6 (I can find references to 10.6's tar being built on libarchive, and that's one of libarchive's headline features; 10.5 predates libarchive so it may not have supported that).
You can easily apply the same argument to xz here, by introducing something rarer with an even better compression ratio (e.g. zpaq6+). Now xz isn't the best at anything either.
But despite zpaq being public domain, few people have heard of it and the debian package is ancient, and so the ubiquity argument really does count for something after all.
> So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.
Two points:
(1) It's very, very easy for the best solution to a problem not to simultaneously be the best along any single dimension. If you see a spectrum where each dimension has a unique #1 and all the #2s are the same thing, that #2 solution is pretty likely to be the best of all the solutions. Your hypothetical example does actually make a compelling argument that bzip2 is useless, but that's not because it doesn't come in #1 anywhere; it's because it comes in behind xz everywhere. (Except ubiquity, but that's likely to change pretty quickly in the face of total obsolescence.)
(2) lzop, in your example, is technically "the best at something". But that something is compression and decompression speed, and if your only goal is to optimize those you can do much better by not using lzop (0 milliseconds to compress and decompress!). So that's actually a terrible hypothetical result for lzop.
Heck, zero compression easily wins three of your four categories.
No, even when speed matters, sometimes lz4 is the best answer. I wrote a data sync that worked over a 100 Mbps WAN, and using lz4 on the serialised data transferred far faster than the raw data. And it's not just on the network: you can often process data faster too (especially on spinning disk), since the reduction in disk I/O can in some cases actually make the processing faster.
Being second-best on ratio and ubiquity is still pretty handy for serving files. It's compress-once, decompress on somebody else's machine, so neither of those matter. Ratio saves you money and ubiquity means people can actually use the file.
> It's compress-once, decompress on somebody else's machine, so neither of those matter.
Last week, there was a drive mount that was filling up; the rate was roughly 30Gb/hr. The contents of that mount were used by the web application. Deletion was not an option. Something that compressed quickly was needed. And on the retrieval end, when the web app needs to do decompression, seconds matter.
I found lz4 to be the best for general purpose analysis, it increased the throughput of my processing 10x compared to bz2. Then if you're working with very large files you can use the splittable version of lz4, 4mc, which also works as a Hadoop InputFormat. I just wish they would switch the Common Crawl archives to lz4.
I should probably mention the compression ratio was slightly worse than bz2 (maybe 15% larger archive) but for the 10x increase in throughput I didn't really mind that much. I could actually analyze my data from my laptop!
If I'm actually doing something with my data, gzip -1 beats out lz4 for streaming, as gzip -1 can usually keep up with the slower of the in/out sides, and gzip -1 gets a higher compression ratio than lz4 and compresses faster (but doesn't decompress faster) than lz4hc.
No you're correct, gzip -1 outperformed lz4 in my test in compression ratio. I don't know why I typed "30% compression" instead of "compression ratio of 30%." Sorry about that.
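A minimal sketch of the kind of streaming use gzip -1 is being praised for above (the database, host, and file names are hypothetical):

# gzip -1 keeps up with the slower end of the pipe while still shrinking the stream:
$ pg_dump mydb | gzip -1 | ssh backup-host 'cat > mydb.sql.gz'
# Roughly the lz4 equivalent, for comparison:
$ pg_dump mydb | lz4 -c | ssh backup-host 'cat > mydb.sql.lz4'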
Last time I checked, lz4 did not have streaming decompression support in its Python lib. It will be a problem for larger files like Common Crawl if you are not planning to pre-decompress before processing.
It's not a problem for me since I mostly use Java. However, you can probably just pipe in your data from the lz4 CLI then use that InputStream for whatever python parser you're using and you should be fine.
The biggest problem is using a parser that can do 600MB/s streaming parsing. If you use a command line parser don't try jq even with gnu parallel.
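A minimal sketch of the piping approach suggested a couple of comments up (the file and parser names are hypothetical):

# Decompress lz4 to stdout and stream it straight into whatever parser you use:
$ lz4 -dc crawl-segment.jsonl.lz4 | python parse_records.py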
Being the best at something does not necessarily make it the best choice for most situations. This is trivially shown through this example. Assume that for each of the four measured aspects there is a program that is the best at it, but in the other three aspects it is orders of magnitude worse than the best. Now consider another program which is best at nothing, but is 95% of the way to the best in every aspect. It's never best in any aspect, but it's clearly a good choice for many, if not most, situations.
One of the great things about gz archives is that the --rsyncable flag can be used to create archives that can be rsynced efficiently if they change only slightly, such as sqldumps and logfiles. Basically the file is cut into a bunch of chunks, and each chunk is compressed independently of the other chunks. xz doesn't seem to have an equivalent feature because the standard implementation isn't deterministic[1].
Changing from one compression format to another seems harmless, but it always pays to think carefully about the implications.
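A hedged sketch of the --rsyncable workflow described above (whether the flag is available depends on your gzip build; the database and host names are hypothetical):

# Produce a gzip stream whose blocks realign after small changes:
$ mysqldump mydb | gzip --rsyncable > mydb.sql.gz
# Later dumps then transfer mostly-unchanged archives cheaply:
$ rsync -av mydb.sql.gz backup-host:/backups/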
There are many more concerns to address than just compression ratio. Even the ratio one is questionable, because some people have really fast networks but we all have basically the same speed of computers. So a 4x CPU time and memory pressure penalty may be much worse on a system than a 2x stream size increase. Another use case is a tiny VM instance: half a gigabyte of RAM is not actually present in every machine today. Embedded, too.
Another way compression formats can win you much more than a 2x space reduction is by supporting random access within their contained files. Gzip sort of supports this if you work hard at it. Xz and bzip2 appear similar (though the details are different). I achieved a 50x speedup with this in real applications, and discussed it a bit here: http://stackoverflow.com/questions/429987/compression-format...
And you are right for embedded! .xz just doesn't work there.
I've also found that on the faster systems, for different uses of mine, when I want the compression to take as little time as possible and the total round-trip time matters (compression and decompression), gzip -1 gives the best resulting size for the reasonably short time I want to spend.
I've come across quite a lot of firmware on embedded Linux devices that uses LZMA (the xz compression algorithm) to compress the kernel, u-boot, and/or filesystems. One memory optimisation for these, as they are typically being decompressed straight into RAM, is for the decompressor to refer to its output as the dictionary rather than building a separate one, as would be the case in decompressing to the network or disk.
If it takes you 60 seconds to download as gz and 50 as xz, the decompression needs to take less than 10 seconds more for it to be comparable, and you've got to be sure that your end users have enough memory and sufficient processing power to throw at the task.
He didn't mention the biggest difference between gzip and xz - ram usage. At maximum compression, you need 674 MiB free to make a .xz file, and 65 MiB to decompress it again. That's not much on most modern systems, but it's quite a lot on smaller embedded systems.
Admittedly, in most cases, that isn't much excuse though.
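When memory is the constraint, xz itself can be asked to stay inside a budget - a hedged sketch, assuming a reasonably recent xz (the file name and limits are arbitrary):

# Lower presets need far less RAM than -9:
$ xz -2 rootfs.img
# Or cap compression memory explicitly and let xz scale its settings down:
$ xz --memlimit-compress=100MiB rootfs.img
# Show the limits currently in effect:
$ xz --info-memory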
It can also lead to disaster on a web server when linux decides to OOM kill a critical part of the infrastructure like the database server or memcached. Then you can get a cascading problem of services failing, all because of a careless unzip statement. (I've been there.)
Summary: Compatibility and decompression speed is more important than compression ratios for many use cases. Gzip is nearly universal, where lz4, xz, and parallel bzip2 are not.
The challenge of sharing internet-wide scan data has unearthed a few issues with creating and processing large datasets.
The IC12 project[1] used zpaq, which ended up compressing to almost half the size of gzip. The downside is that it took nearly two weeks and 16 cores to convert the zpaq data to a format other tools could use.
The Critical.IO project[2] used pbzip2, which worked amazingly well, except when processing the data with Java-based tool chains (Hadoop, etc). The Java BZ2 libraries had trouble with the parallel version of bzip2.
We chose gzip with Project Sonar[3], and although the compression isn't great, it was widely compatible with the tools people used to crunch the data, and we get parallel compression/decompression via pigz.
In the latest example, the Censys.io[4] project switched to LZ4 and threw data processing compatibility to the wind (in favor of bandwidth and a hosted search engine).
For anyone looking to stop making compromises, I recommend pixz. It's binary compatible with xz, and is better at compression speed, decompression speed, and ratio than both gzip and xz on multicore systems. I've adopted it in production to great benefit.
Totally agree with this. As someone with a commit bit to the project, as well as a long-time user, I'd like to second the recommendation. Pixz is a terrific parallel XZ compression/expansion tool. I find it indispensable for logs and database backups. Link: https://github.com/vasi/pixz
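A hedged sketch of using pixz in place of a plain xz pipe (directory and file names hypothetical):

# Compress a tree using all cores, producing an xz-compatible archive:
$ tar -cf - logs/ | pixz > logs.tpxz
# Decompress (plain xz -d would also work, just single-threaded):
$ pixz -d < logs.tpxz | tar -xf -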
Being a windows user these days, I am getting kinda frustrated with how anemic everyone is at even trying to google for 20s to find the windows solution.
7zip is the program you want to handle most everything, with both gui and command line options: http://www.7-zip.org/
Given how radically MS is trying to reform itself to be an open-source friendly company and how ineffectually inoffensive they've been the last 5 years, can we at least try and throw them a bone or two?
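For the record, a hedged sketch of the .tar.xz dance with the 7-Zip command line (file names hypothetical): 7z unwraps one layer at a time, so the first command leaves some.tar behind and the second unpacks it.

7z x some.tar.xz
7z x some.tar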
The article is not talking about Windows, so folks here aren't either. Why are you surprised that folks are uninterested in Windows?
I've preferred Mac systems for longer than most of the HN crowd has been alive, so I understand what it's like to feel ignored and in the minority. For years, Mac users were treated as pariahs. The tables have turned, and as someone who has been in your situation, I should have great empathy for your predicament.
And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.
Perhaps Microsoft will someday be worthy of forgiveness, either from the perspective of morality (e.g., Mozilla) or product excellence (e.g., Apple). Until that day, Microsoft will continue to reap what they have sown, given no more attention than they have earned.
> And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.
I feel like Apple has forgotten what was important, created then ruined a market, and lost everything that made it interesting (long before Steve Jobs passed away, by the way). Which cuts all the more deeply because back in the early 2000s they were walking the walk and taking a lot from NeXT's culture of developer friendliness. I grew up deeply invested in Macs and NeXT, which makes the realization painful, but... Apple wants to annihilate maker culture as it monetizes its platform. It's also stopped caring about design on a grand scale, instead appealing to very shallow notions of "visual simplicity".
That's all gone now, and they're consequently useless to me. I'd rather patronize a company currently doing the right thing after a troubled past than pretend a previously aligned company was still there.
It should be very telling that Apple AND Google's flagship hardware announcement of 2015 was something that Microsoft has been doing for years.
And if Microsoft suddenly goes evil again? Fuck them, I'll drop them and move somewhere else. Not Linux, unless the distros pull their act together, but I'm sure a competitor will emerge. Or I'll make one.
I know your feeling too, thanks for accepting I feel differently. And sometimes I ask myself "What the hell am I doing with this Surface book?" I won't pretend I don't have doubts.
It's sort of a rough time for devs right now even as we enjoy unprecedented prosperity and recognition. Big businesses are attempting to monetize and control every aspect of developers.
"...decades of misdeeds..."
Which they benefited handsomely from and were never sufficiently punished.
I'm with you, my trust of MS is still pretty close to nil.
GP seems to be talking about this line in the article:
> Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
I do think the author dropped the ball in not doing basic research for the Windows platform. His point is that people should switch from one compression tool to another. If people on Windows were unable to compress or decompress such files, that would be a huge problem for his argument.
It can invoke arbitrary executables as well, which makes it exactly as interoperable as Bash.
Powershell can run on Linux, too. I've even met a few people who quietly prefer it.
So what, exactly, were you referring to? Shell choices are like editor choices: arbitrary and largely equivalent and without any real meaning or impact on a developer's productivity.
The context was: someone had difficulty finding a "tar -xf" equivalent for Windows. I pointed out that it would be nice if Microsoft included tar and other basic utilities in their OS releases. With an out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc. The Windows way of doing things is totally different from *nix culture (OS X, Linux, etc). In that context powershell is not "interoperable" (maybe bad wording from me).
> The context was: someone had difficulties to find "tar -xf" equivalent for Windows.
Right and putting that in google, "tar equivalent for windows", immediately nets 5 useful results. You can use tar, or a windows command line variant of 7z or tar, or a gui.
> With out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc.
On an out-of-the-box Linux machine, you generally can't do a lot of things either. It seems particularly ironic that in a discussion about how we shouldn't be using old UNIX tools just because they're entrenched, you then call for compatibility.
> Windows-way of doing things is totally different than *nix culture (OS X, Linux, etc).
Stupid legacy path limits not included, Powershell is in my experience just a superior way to do things. I should maybe restart my blog to talk about that.
But even if we ignore Powershell and windows, your statement is divisive within the Linux community. MANY people prefer shells on Linux that don't adhere to the bash legacy. TCSH and CSH are very popular, to this day. Are they 'not interoperable'?
Everyone's got a big chip on their shoulder about how development tooling "should be." One of the things I've come to realize is how arbitrary, unnecessary, and useless these mores are. They just hold us back.
No. I'm not. But minimal distros are the primary surface area Linux exposes for many people these days. The desktop userbase is (justifiably) almost non-existent, and most core cloud distros don't even come loaded with curl by default. It's even more extreme as you work with docker.
I really don't see what that has to do with Microsoft. 7-zip is GPL and cross platform. In fact - I just checked - a command line 7-zip is available by default on my Linux Mint install.
Right, but an xz-supporting compression tool is shipped with many Linux distros, whereas you do need to grab a tool to support xz on any version of Windows.
But you're right, 7z is an under-appreciated format.
I have had awful experiences with 7-zip. Recently I was unable to decompress a multi-file rar, it kept saying it was corrupted. I downloaded it several times until I realised it was actually fine, because winrar could decompress it.
I sorta view RAR files as the problem. Only WinRAR ever gets them really right. I dunno why that is, but I also dunno why anyone keeps using RAR. It's something of a joke in the windows dev community.
When I owned an Amiga they kept on changing the archive format to find a better one that saved space.
They had arc, pak, zip, zoo, warp, lharc, and every Amiga BBS I got on used a different archive format. Everyone had a different opinion on which archive format compressed things in the best way.
I think eventually they decided on lharc when they started to put PD and shareware files on the Internet.
Tar.gz is used because there are instructions for it everywhere and it seems like a majority of free and open source projects archive in it. It is a more popular format than the others right now. Might be because it is an older format and has had more ports done.
But I really like 7zip, it seems to compress smaller archives, before 7Zip I used to use RAR but WinRAR wasn't open source and 7Zip is so I switched.
With high speed Internet it doesn't seem to matter much anymore unless the file is over a gigabyte in size. Even then BitTorrent can be used to download the large files. I think BitTorrent has some sort of compression included with it if I am not mistaken, to compress packets to smaller sizes over the torrent network and then restore them when the client downloads them. That is if compression is turned on and both clients support it.
> When I owned an Amiga they kept on changing the archive format to find a better one that saved space.
It happened on DOS too: ZIP, ARJ, RAR, ...
That was back on the days of floppy disks (which usually had at most 1440 KiB) and small hard disks (a few tens of megabytes). Even a few kilobytes could make a huge difference.
As storage and transfer speeds grew, "wasting" a few kilobytes is no longer that much of an issue, and other considerations like compatibility become more important. Furthermore, many new file formats have their own internal compression, so compressing them again gains almost nothing regardless of the compression algorithm.
The reason both ZIP and GZIP became ubiquitous is, IMO, that the compression algorithm both use (DEFLATE) was released as guaranteed to be patent-free, back in a time where IIRC most of the alternatives were either patented or compressed worse. As a consequence, everything that needed a lossless compression method chose DEFLATE (examples: the HTTP protocol, the PNG file format, and so on).
LHA and LZX were the popular pair when my Amiga days came to an end. Being a commercial product, the latter kind of occupied a position similar to the one RAR does on Windows.
Microsoft ended up adopting LZX for things like CAB and CHM files.
* MikTeX (windows TeX/LaTeX) system started using it circa 2007.
* TexLive switched, iirc, circa 2008.
* ArchLinux started circa 2010.
* Linux kernel 2.6.0, iirc, circa 2011(?)
* Gnome switched circa 2011.
There are others but I can't remember. It's fairly common now.
OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
I had no idea tar could autodetect compression when extracting. (I wonder if this is GNU tar only, or whether the OSX default tar can do it too?) I've been typing `tar zx` or `tar jx` for too long.
I highly recommend using atool[1], and never worrying about extracting archives again. It's a wrapper around basically every compression/archive tool in remotely common use.
Bonus: it decompresses to a safely-named subdirectory, but moves the contents of that subdirectory back to the current directory if the archive contained exactly one file. Highly convenient without any risk of accidentally expanding 1000 files into the current directory.
After creating this macro, I've basically never had to care about how to decompress/unarchive anything.
# 'x' for 'eXpand'
alias xx='command atool -x'
# use
% cd $UNPACK_DIR # (optional) (can be the PARENT dir)
% xx foo.zip # or .tar.{gz,xz} or whatever
foo.zip: extracted to `foo' (multiple files in root)
% cd foo/
% ls | wc -l
3
atool actually has many other useful features, but it's worth it just for the extractor.
For OS X and MacPorts I use the p7zip package, which installs the 7z command. It understands .zip, .rar, of course .7z, and probably other formats. I wrote a simple Automator script to use that command from Finder and it works just fine (actually .zip and .tar.gz formats are supported by the OS X archive tool, but .rar is not, and I often have to deal with that format).
Old doesn't mean bad. Sometimes, it means "finished".
About the only thing that a tool like this would need to be updated for at this point is support for a new compressor. (That 2012 release mainly added support for plzip.)
As far as I know, different zip formats are not auto-detected. However, it does (optionally) use file(1) to detect the file format, which can be overridden with the 'path_file' option, so a hack may be possible?
Wow, really? Honestly I've never used the z or j flags, I could not even tell you what they do. I use `tar caf whatever.(txz|tgz|tar.lz|tar.lzo) /path/to/files` to create, and just `tar xf whatever` to extract.
Best thing about libarchive/bsdtar is, it also handles zip, rar, cpio, iso files and many others.
So basically bsdtar xf is what I'm using to extract almost every archive.
My tests with brotli suggest that it is overrated - it is slow and has poor compression ratios compared to xz. It confuses me why it is being pushed so hard...
Brotli does not work well for bhouston's use case, so his original wish stands, and your helpful suggestion that he should be able to use Brotli in the near future unfortunately does not fulfill it.
Yeah, the example given by GP involves large binary streams. Brotli was designed for small text documents with lots of English words in them, as we often see on the web.
Where does it say that Brotli is for small English text documents? I didn't see anything like that in the draft spec or the Google blog post.
The spec doesn't say much on the subject, but has this item in the Purpose section: "Compresses data with a compression ratio comparable to the best currently available general-purpose compression methods and in particular considerably better than the gzip program"
Yes, it has that optimization for short data (though it's not restricted to English), but the PR and specs say it's still meant to be a general purpose compressor. And it does very well on most types of large data.
It's a real problem in a compressor proposed for general purpose use when it's shown that a naturally occurring major class of data has this bad performance.
Decompression of xz is slow and quite memory intense. I'd argue that the light memory footprint of gz decompression is better suited to the web, particularly mobile where you need to balance battery v bandwidth.
Compressing xz is relatively slow, and would be expensive for web servers. Maybe something like LZO would be better? (Which OpenVPN uses to compress data in transit.)
I think this is one of those things where the author is pretty much 100% right and it just won't happen. Habits are hard to break and in many cases, the negatives just don't impose a high enough cost to matter.
There are times when I do seriously look for the optimum way to do things like this and then there's most of the time I just want to spend brain cycles on more important problems.
I believe that the biggest driver of using old-school ZIP or GZIP is the fact that everyone knows that everything can decompress these formats. And in a modern world of terabyte disks in every laptop, multicore multi-Ghz CPUs, and megabit bandwidth, it isn't worth the effort of using a format that saves an additional 20% on compressed size at the cost of someone not being able to decompress it.
That's only really much good if you're in the business of archiving things like that. For most people, source trees are ad-hoc downloads for patch fixes, oddball platform compiles, etc. And then the universality of gzip is better than any marginal space savings from xz.
Isn't that mostly just because of how tar is designed? It's a concatenation of individual files with headers, so you have to decompress the whole thing to get a file list anyway. At which point you might as well save the decompressed tar in temporary.
It's such a shame that so many slow-moving 'enterprises' still have RHEL 6 servers; it's so incredibly outdated - not only does it limit what they can do, but it negatively affects people's impressions of Linux.
For me, as a Python user, I've found that gzip is currently the only compression format that allows streaming compression/decompression. I don't want to have to store hundreds of gigabytes of data and THEN compress it, rather than compressing it right during file generation. I haven't found any other compression lib that supports this out of the box.
I generally use gzip for everything because it's everywhere and good enough, but xz and bzip also support streaming, in fact anything that tar compresses does afaik.
> dd if=/dev/urandom bs=1M count=5 | gzip > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.438033 s, 12.0 MB/s
> dd if=/dev/urandom bs=1M count=5 | xz > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.52744 s, 3.4 MB/s
> dd if=/dev/urandom bs=1M count=5 | bzip2 > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.804324 s, 6.5 MB/s
The algorithms support streaming, but that doesn't mean the implementations in (in this case) Python libraries necessarily expose it. Although I can't understand why they wouldn't; presumably they just wrap the same C libraries everybody else uses, and a streaming interface wrapping another stream (i.e. file, socket) should feel very natural and easy.
Decompressing takes 4 times as long? I wonder if that is slow enough to create a bottleneck in processing. Not everyone uses compression for purely archival purposes. In the genomics field, most sequencing data are gzipped to save disk space. And most programs used to process the sequencing data can take in the gzipped files directly.
geezip is fun to say. Until there's a catchy name for "crosszip"/xz/whatever, I think we're preaching to the wrong choir. There's a human element in toolchains. Address it.
it's a shame that algorithm improvements would necessitate a shift away from the name "gzip". It would be better if the intent to compress/decompress was orthogonal to the features of the implementation (compression ratio, speed, split-ability, etc...)
The article misses (but the comments here touch on) that all compression algorithms have a built in obsolescence, even the fancy shiny xz.
It's not the algorithms per se that go obsolete, but their use in specific cases, until all are diminished. Whether lossy or lossless, eventually other technological advancements render them unnecessary.
And it seems that the strongest algorithm is usually the earliest to be widely adopted; these are almost never toppled.
Just like .gz, look at MP3 or JPEG -- 'better' alternatives exist, but the next widely adopted step will be to eliminate that compression entirely. The first radio station playout systems were hardware MPEG audio compression, and the next most widespread step was to uncompressed WAVs. Even video pipelines based on uncompressed frames are becoming more widespread. Eventually the complexity and unpredictability of compression is shunned for simplicity.
Read the gzip docs and the focus is around compression of text source code, a key use case at the time but barely considered these days -- tar.gz source archives exist almost only out of habit; they could just as well be tar.
Media codecs are a little different because there is a significant cost to replacing hardware that only supports the old standard. It seems to me that AAC is becoming pretty ubiquitous and probably will be the go-to standard for the next 10-20 years. We are also at a point where the vast majority of users aren't going to notice the difference between a 100kbps MP3 vs AAC stream, which is going to be less than 1% of an he stream, so there is little incentive to innovate by the players with the deepest pockets. Until network capacity is free (or at least cheap compared to the cost of CPE), pipelines based on uncompressed media are not going to be a thing outside of the production end of things.
For source tarballs, though, there is basically zero cost to switching, since you can download a new compressor in under a minute and should be able to assume that your users are pretty sophisticated. The incentives are similar to media in that the cost of transfer has to be weighed against the cost of CPE, except that since users supply the CPE the cost is effectively zero, and compression will probably always make sense.
gzip is fast, gzip -1 is even faster, gzip has low memory requirements, gzip is widely adopted. Those are the reasons gzip is still being used, and why gzip has a future. I.e. the gzip "ecosystem" is rich and useful, despite not being the best compressor in terms of compressed size.
P.S. There are gzip-compatible implementations with tiny per-connection encoding memory requirements (< 1KB).
tl;dr: xz compresses better but is significantly slower. This isn't the deepest analysis of potential tradeoffs you might be able to find.
A few reasons why gzip is still useful to have around:
* Speed is critical for many applications, and so size can take a backseat when performance is critical or resources are low.
* gzip is basically guaranteed to be available everywhere in utility and library forms.
* Download speeds vary and so the faster your pipe, the less the archive size factor will matter, and the faster-worse compression might win out in other comparisons.
* xz doesn't compress every type of data this much better than gzip. I've dealt with scenarios where the difference is consistently less than 2%, and the extra time xz spends is actually a tremendous waste.
Sure, for package downloads where xz files will be significantly smaller it makes sense to save the bandwidth, time and storage space. But it's not 100% cut and dry.
Ahh, kids. "Let's all start adopting this new thing that has existed for a few years that's slower and uses more ram because we're wasting literally tens of megabytes all the time!"
There's also lzip, which apparently uses a similar compression algorithm as xz but is apparently built with partial recovery of corrupted archives in mind (so more useful for long-term archival or backup storage). It's made by the same guy who made ddrescue.
The only people who care about compression ratios are:
(1) People who still use 56k modems to download content
(2) People who host extremely popular downloads and who want to minimize their outbound bandwidth bills
If you're not one of these two, you almost certainly care more about compatibility and compression time than compression ratio. gzip continues to win on both those fronts, and it explains why it's still the most popular compression format other than ZIP (which is a better choice than gzip if you frequently need to extract a single file from a compressed archive).
Until there's a compression tool released that can compress at wire speed (like gzip) and has a significantly better compression ratio, don't expect the landscape to change much.
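A small illustration of the single-file point (archive and member names hypothetical): ZIP's central directory lets you pull one member out directly, while a .tar.gz has to be decompressed from the start until the member turns up.

# Stream a single member straight out of a zip:
$ unzip -p release.zip docs/CHANGELOG > CHANGELOG
# The tar.gz equivalent still decompresses the stream up to that member:
$ tar -xzf release.tar.gz docs/CHANGELOG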
> We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.
The 'xzgrep' script (and related xzdiff, xzless, xzmore scripts) are part of the standard xz package, though they are an optional feature so YMMV between distros.
~ $ xx /usr/portage/distfiles/xz-5.0.8.tar.gz
~ $ cd xz-5.0.8/
~/xz-5.0.8 $ ./configure --help | grep -A1 scripts
--disable-scripts do not install the scripts xzdiff, xzgrep, xzless,
xzmore, and their symlinks
zgrep simply decompresses all the data and feeds it into regular grep. If the data is indexed in some way, it is possible to do better by not having to look at all the data exhaustively.
It's possible to compress and index a file at the same time, gaining both a size and speed advantage over the original. For example: https://en.wikipedia.org/wiki/FM-index
What about availability? I often find myself having to download and compile (de)compressing software because the authors of some other software I need decided to ship it in something other than the standard (.tar.gz), which is available in basically all *nix boxes.
The Weissman score is less than worthless, it "isn't even wrong". This comes up in most HN discussions about compression algorithms and I'm waiting for it to go away.
^ Yes. Clearly people can't take jokes here, already downvoted.
Edit: Even the above got downvoted and I get reminded again that HN is full of stuck up, overly sensitive and unhumorous engineers. With the exception of this being a great place to find interesting things to read the community aspect is uninviting and unforgiving.
But trivia is still worthy of knowing about, however tired and overdone, or dated and played-out. Just like, if it were XKCD yesterday, it was The Far Side twenty years ago.
Not many single-letter options were available. Tar is kind of like ls that way. At least it's easy to remember for those of us who already learned to use lowercase-j for bzip2.