The trouble with this is that, as a software author, it doesn't really matter if it takes 70 seconds instead of 33 to install my software. 70 seconds is fast enough, for someone who's already decided to start downloading something as involved as Meteor; even if it took one second it wouldn't get me more users. And it would have to take over 5-10 minutes before I start losing users.
On the other hand, having to deal with support requests from users who don't have any decompressor other than gzip will cost me both users and my time. Some complicated "download this one if you have xz" or "here's how to install xz-utils on Debian, on RHEL, on ..." will definitely cost me users, compared to "if you're on a UNIXish system, run this command".
From a pure programming point of view, sure, xz is better. But there's nothing convincing me to make the engineering decision to adopt it. The practical benefits are unnoticeable, and the practical downsides are concrete.
There are downsides which most young fellas around here don't really appreciate. Hands up, how many of you remember bzip? Not bz2, but bzip, the original one, mostly seen as "version 0.21"?
Well, at the time it was released, people were making much the same arguments (with kittens). It compressed so much better than gzip, no reason to use the obsolete gzip format and tools, etc. And some of us jumped on the hype bandwagon and started recompressing our data, only to find out afterwards that bzip2 is now the new thing and the format is not only obsolete, but also patent-encumbered and in general needs to be phased out.
From a long-term perspective I'm fine with gzip. At least I know that I'll be able to open my data in 10 years time, which is not the case with bzip-0.21. The jury is still out on "xz", in my opinion.
Agreed, and I think people do not fully appreciate it, because it doesn't play out as a single safety question the way you describe. The question is not "is X safe and Y unsafe"; each gets asked on its own. Is X safe, and will it still work on old systems or in ten years? Yes, 100%. Is Y safe, and will it still work in ten years? Well, some old systems might have issues, and the patents might still exist, and...
If you're a developer, gzip is simply the default option. It's not the best, but it's good enough and it's safe.
I had to explain the same thing to an engineering team the other day. There was a push to switch from a "fast but ok" compression algorithm to a "faster and better" compression algorithm. This seemed like win-win, but I explained that:
* The faster compression made a difference of about 100 milliseconds to a user experience lasting minutes.
* The better compression made a difference of about 1 second to most users.
* The change of compression algorithm would take time away from engineering teams, and ultimately introduce bugs.
So in the end it was sidelined until other fundamental changes (file format etc) made it able to be coat-tailed into production.
Same story as yours: don't fix what ain't broke. Tallest working radio antenna in the world -> longest lying broken antenna in the world.
Never seen that in my life until now - what does it do? Just unzip the file to /dev/null? What's the purpose? Does the verbose flag show you what's inside, but the /dev/null means it's not written to disk while unzipping?
Actually, the files are decompressed to the current directory; it's just the output of the verbose flag that goes to /dev/null. Which makes it even more senseless.
Exactly. I've seen people who always do `tar xzvf` and have no idea removing the `v` is the correct way to make it not print the name of every file in the archive.
You didn't use to be able to omit the 'z' switch. You had to specify 'z' or 'j' depending on whether you wanted gzip or bzip2 decompression. It's a somewhat recent (sometime in the last 15 years, I think) change to "tar" to make it just detect the compression algorithm.
Isn't it better though to omit the -v switch and do `ls *` and/or `tree` afterwards? That gives you the same information but structured so it's much easier to understand.
The advantage of -v is that you can see what is being extracted as it happens. This is useful if you have a tarball with thousands of small files, as otherwise it's hard to tell whether tar has got stuck or there are just a lot of files.
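If all you want is a sign of life, there's a quieter middle ground - a hedged sketch, assuming GNU tar with checkpoint support (the archive name is hypothetical):

# Print every extracted file name (noisy but reassuring):
$ tar -xvf huge.tar.gz
# GNU tar can instead report progress every N records:
$ tar -xf huge.tar.gz --checkpoint=10000 --checkpoint-action=echo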
$ tar -xf foo.tar.xz
tar: End of archive volume 1 reached
tar: input compressed with xz
$ tar -xf foo.tar.gz
tar: End of archive volume 1 reached
tar: input compressed with gzip; use the -z option to decompress it
It tells you how it's compressed and how to decompress it if it knows how. OpenBSD's tar doesn't support xz so it can't help there, but does support gzip so it suggests using -z.
Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.
>Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.
How is that a feature? The user's explicitly asking for this.
This feature reminds me of vim, which suggests closing with ":quit" when you press C-x C-c (i.e. the keychord to close emacs). It knows full well what you want to do and even has special code to handle it, but then insists on handing you more work.
Vim suggests closing with ":quit" when you hit C-c; the C-x is irrelevant.
Upon receiving a C-c, it does not know full well what the user wants to do.
When vim receives a C-c from you (or someone who just stumbled into vim and doesn't know how to exit) the user wants to exit.
When vim receives a C-c from me, it's because I meant to kill the process I spawned from vim, and it ended before the key was pressed. I very much do not want it to quit on me at that point.
`tar -xf` is not "explicitly asking" for gzip. `tar -zxf` is "explicitly asking" for gzip.
I don't really care what vim does, that's a different argument. There have been many vulnerabilities in gzip, and in tar implementations that let untrusted input choose how it gets parsed, those vulnerabilities might as well be in tar itself.
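For what it's worth, a minimal sketch of keeping that choice out of the input's hands, assuming GNU tar (archive name hypothetical) - either name the decompressor yourself, or run it as a separate process:

# Explicitly request xz; tar doesn't sniff the stream to pick a parser:
$ tar -xJf foo.tar.xz
# Or keep the decompressor entirely outside tar:
$ xz -dc foo.tar.xz | tar -xf -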
MP3s with a decent bit rate are as good as it gets. Of course something like ogg back in the napster days would have been fantastic, but MP3 at 320 Kbps is fine for anyone who doesn't pay $1000 a meter for speaker wire.
If I recall correctly, there are some patents still alive in the U.S. until 2017 covering MP3 encoders, requiring a license to be purchased per copy distributed.
The workaround that the parent is talking about is usually "get LAME from a different distributor", which is still done by Audacity and others.
For those of us that sample from songs that we buy, WAVs are a bit easier to work with because the DAW doesn't have to spend time converting it. That said, since most of my tracks these days are using either the 48k or 96k sample rate, it still needs to be converted from 44.1 :)
But can you tell the difference between a variable-encoded 320 Kbps MP3 using a modern encoder and a wav file? I have some reasonable equipment and I most definitely can't.
When I DJed I used a mix of FLAC files and 320 Kbps CBR and high level VBR. On performance equipment I could tell VBR was not holding up. There is also some quality loss that you encounter when slowing down MP3s that is not present for FLAC or WAV, especially when kept in key, but for the most part that is only audible beyond the 8-15% range, and it was not common to alter the tempo that much for me. I ended up settling mostly on FLAC when I can get it and 320 CBR otherwise. I don't think I ever heard the difference.
Can you hear the difference between 320 Kbps CBR and VBR? I have to say I have never tried slowing down the music to try and hear the difference so it is possible under these conditions that it might make a difference.
The degree of difference depends on the kind of music you listen to. Live recordings of acoustic ensembles in airy cathedrals -- in that case you can tell the difference. On tracks that have a highly produced studio sound, where everything is an electronic instrument -- not going to be much of a difference.
I tried doing tests like this and I could not find any recordings where I could tell the difference at 160 kbps VBR. I'm not saying that it is impossible, but the conditions must be pretty rare and the difference very minor - compared to the massive degradation that comes from room effects it amounts to nothing.
> compared to the massive degradation that comes from room effects it amounts to nothing.
Truth.
Chamber music in an echo-y cathedral. With bad encoding, you can hear a noticeable difference in the length of time the reverberations are audible, and the timbre of those reverberations can be quite different. With lots of acoustic music, the "accidental beauty" produced by such effects can be quite important.
Finding this convinced me to re-encode my music collection in 320kbps MP3 for anything high quality, and algorithmically chosen variable bitrates for lower quality recordings -- usually around 160 kbps. That was quite a number of years ago, though. I'd probably use another format today.
That's not true. MP3 simply never gets transparent and you can notice with 5 dollar in-ears. And people in general notice. This leads to absurdities such as bitrates of 320 kbps, even though these do not sound significantly better than 128 kbps and are still not transparent.
On the other hand, 128 kbps AAC is transparent for almost any input. AAC is supported about everywhere mp3 is. The quality alone should be convincing. The smaller size makes the continued usage of mp3 IMHO insane.
OTOH "the scene" still does MPEG-2 releases I think.
I have listened to a lot of MP3 at different bit rates, and with modern encoders and variable bit rates I can't tell the difference between anything above 160 kbps - most of the time it is hard to tell the difference between 128 kbps and anything higher. Really, at 320 kbps you are entering the realm of fantasy if you think you can hear any difference.
I absolutely heard a difference between 320 and everything below. You can tell me I didn't, but I did. There is a world of difference between 160Kbps and 256, and 128 is a lot worse. If you can't hear it, I understand, but the blame isn't the algorithm -- it is your equipment, your song selection, or your ears.
This is not true. It is trivial for almost anyone to distinguish 320kbps mp3 from uncompressed audio, with built-in DACs and $5 headphones, with as little as 5 minutes of training.
You're also describing everyone else's experience:
> Really at 320kps you are entering the realm of fantasy if you think you can hear any difference.
It depends on the encoder, the track, your equipment, and how good you are at picking out artifacts. Some people do surprisingly well in double-blind tests, though I doubt anyone can do it all the time on every sample.
This is why ABX testing is so big in lossy audio circles. People can and do demonstrate their ability to distinguish between lossy and lossless encodings with certain samples in double-blind tests, at all sorts of bitrates. I've done it myself occasionally.
That people have been doing this for many years is one of the big reasons modern encoders are so good - they've needed tonnes of careful tuning to get to this point.
You are making some bold statements about the general transparency of different audio formats that contradict pretty much everything I've read about this topic so far. Hence, I'd like to learn more, do you have any links that you would recommend?
Well, try it yourself :). Make sure to make it a blind test with the help of somebody else. Ideally such things would be subject to scientific studies. But these are kind of expensive and nobody cares for mp3 anyway. I'm not aware of any recent ones.
Hydrogenaudio listening tests [1] are studies by volunteers, but they focus on non-transparent compression. Anyway, they also illustrate how bad mp3 is.
I actually tried this last year, and found out the hard way after reencoding my mp3 collection to vbr opus at around half the bitrate (I did some light quality testing to make sure it was of similar fidelity, of course you lose some quality going lossy -> lossy) that either opus-enc or gstreamer at the time would produce choppy broken audio.
And it was reproducible on all my computers. I couldn't use my opus collection at all because either the encoder was broken or the playback was broken.
I need to do it again at some point, when I have 8 hours to transcode everything and try again. See if they've fixed it.
Using xz for Linux builds of your software might make sense though, or would do so in the future. Recent releases of Fedora and RHEL already use xz to compress their RPM packages.
Debian/Ubuntu dpkg supports compressing with xz too -- and it's a hard dependency of dpkg at least as far back as the precise (12.04) LTS release. So I'd say the majority of Linux users already have access to xz.
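As a rough illustration (the package directory name is hypothetical, and this assumes a dpkg new enough to know about xz), choosing xz for a .deb is just a flag to dpkg-deb:

# Build a package whose data member is xz-compressed:
$ dpkg-deb -Zxz --build mypkg-1.0 mypkg-1.0.deb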
I just started typing 'xz compression' into Google to learn more about it, and it offered 'xz compression not available'. And some more queries that indicate it's not quite ubiquitous.
It's good to be informed about the capabilities of xz. I will keep using gzip, but consider xz in situations where the size or time matters. I might not care about 100 megs versus 50 very much, but I will about two gigabytes versus one.
Is it likely that a user has gzip on a system but not tar itself?
From the article:
What about tooling?
OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
It does, and should have since at least 10.6 (I can find references to 10.6's tar being built on libarchive, and that's one of libarchive's headline features; 10.5 predates libarchive so it may not have supported that).
You can easily apply the same argument to xz here, by introducing something rarer with an even better compression ratio (e.g. zpaq6+). Now xz isn't the best at anything either.
But despite zpaq being public domain, few people have heard of it and the debian package is ancient, and so the ubiquity argument really does count for something after all.
> So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.
Two points:
(1) It's very, very easy for the best solution to a problem not to simultaneously be the best along any single dimension. If you see a spectrum where each dimension has a unique #1 and all the #2s are the same thing, that #2 solution is pretty likely to be the best of all the solutions. Your hypothetical example does actually make a compelling argument that bzip2 is useless, but that's not because it doesn't come in #1 anywhere; it's because it comes in behind xz everywhere. (Except ubiquity, but that's likely to change pretty quickly in the face of total obsolescence.)
(2) lzop, in your example, is technically "the best at something". But that something is compression and decompression speed, and if your only goal is to optimize those you can do much better by not using lzop (0 milliseconds to compress and decompress!). So that's actually a terrible hypothetical result for lzop.
Heck, zero compression easily wins three of your four categories.
No, even when speed matters, sometimes lz4 is the best answer. I wrote a data sync that worked over a 100 Mbps WAN, and using lz4 on the serialised data transferred far faster than the raw data. And it's not just on the network: you can often process data faster too (especially on spinning disk), since the reduction in disk I/O can in some cases actually make the processing faster.
Being second-best on ratio and ubiquity is still pretty handy for serving files. It's compress-once, decompress on somebody else's machine, so neither of those matter. Ratio saves you money and ubiquity means people can actually use the file.
> It's compress-once, decompress on somebody else's machine, so neither of those matter.
Last week, there was a drive mount that was filling up; the rate was roughly 30Gb/hr. The contents of that mount were used by the web application. Deletion was not an option. Something that compressed quickly was needed. And on the retrieval end, when the web app needs to do decompression, seconds matter.
I found lz4 to be the best for general purpose analysis, it increased the throughput of my processing 10x compared to bz2. Then if you're working with very large files you can use the splittable version of lz4, 4mc, which also works as a Hadoop InputFormat. I just wish they would switch the Common Crawl archives to lz4.
I should probably mention the compression ratio was slightly worse than bz2 (maybe 15% larger archive) but for the 10x increase in throughput I didn't really mind that much. I could actually analyze my data from my laptop!
If I'm actually doing something with my data, gzip -1 beats out lz4 for streaming, as gzip -1 can usually keep up with the slower of the in/out sides, and gzip -1 gets a higher compression ratio than lz4 and compresses faster (but doesn't decompress faster) than lz4hc.
No you're correct, gzip -1 outperformed lz4 in my test in compression ratio. I don't know why I typed "30% compression" instead of "compression ratio of 30%." Sorry about that.
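A minimal sketch of the kind of streaming use gzip -1 is being praised for above (the database, host, and file names are hypothetical):

# gzip -1 keeps up with the slower end of the pipe while still shrinking the stream:
$ pg_dump mydb | gzip -1 | ssh backup-host 'cat > mydb.sql.gz'
# Roughly the lz4 equivalent, for comparison:
$ pg_dump mydb | lz4 -c | ssh backup-host 'cat > mydb.sql.lz4'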
Last time I checked, lz4 did not have streaming decompression support in its Python lib. It will be a problem for larger files like Common Crawl if you are not planning to pre-decompress before processing.
It's not a problem for me since I mostly use Java. However, you can probably just pipe in your data from the lz4 CLI then use that InputStream for whatever python parser you're using and you should be fine.
The biggest problem is using a parser that can do 600MB/s streaming parsing. If you use a command line parser don't try jq even with gnu parallel.
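A minimal sketch of the piping approach suggested a couple of comments up (the file and parser names are hypothetical):

# Decompress lz4 to stdout and stream it straight into whatever parser you use:
$ lz4 -dc crawl-segment.jsonl.lz4 | python parse_records.py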
Being the best at something does not necessarily make it the best choice for most situations. This is trivially shown through this example. Assume that for each of the four measured aspects there is a program that is the best at it, but in the other three aspects it is orders of magnitude worse than the best. Now consider another program which is best at nothing, but is 95% of the way to the best in every aspect. It's never best in any aspect, but it's clearly a good choice for many, if not most, situations.
One of the great things about gz archives is that the --rsyncable flag can be used to create archives that can be rsynced efficiently if they change only slightly, such as sqldumps and logfiles. Basically the file is cut into a bunch of chunks, and each chunk is compressed independently of the other chunks. xz doesn't seem to have an equivalent feature because the standard implementation isn't deterministic[1].
Changing from one compression format to another seems harmless, but it always pays to think carefully about the implications.
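A hedged sketch of the --rsyncable workflow described above (whether the flag is available depends on your gzip build; the database and host names are hypothetical):

# Produce a gzip stream whose blocks realign after small changes:
$ mysqldump mydb | gzip --rsyncable > mydb.sql.gz
# Later dumps then transfer mostly-unchanged archives cheaply:
$ rsync -av mydb.sql.gz backup-host:/backups/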
There are many more concerns to address than just compression ratio. Even the ratio one is questionable, because some people have really fast networks but we all have basically the same speed of computers. So a 4x CPU time and memory pressure penalty may be much worse on a system than a 2x stream size increase. Another use case is a tiny VM instance: half a gigabyte of RAM is not actually present in every machine today. Embedded, too.
Another way compression formats can win you much more than a 2x space reduction is by supporting random access within their contained files. Gzip sort of supports this if you work hard at it. Xz and bzip2 appear similar (though the details are different). I achieved a 50x speedup with this in real applications, and discussed it a bit here: http://stackoverflow.com/questions/429987/compression-format...
And you are right for embedded! .xz just doesn't work there.
I've also found that on the faster systems, for different uses of mine, when I want the compression to take as little time as possible and the total round-trip time matters (compression and decompression), gzip -1 gives the best resulting size for the reasonably short time I want to spend.
I've come across quite a lot of firmware on embedded Linux devices that uses LZMA (the xz compression algorithm) to compress the kernel, u-boot, and/or filesystems. One memory optimisation for these, as they are typically being decompressed straight into RAM, is for the decompressor to refer to its output as the dictionary rather than building a separate one, as would be the case in decompressing to the network or disk.
If it takes you 60 seconds to download as gz and 50 as xz, the decompression needs to take less than 10 seconds more for it to be comparable, and you've got to be sure that your end users have enough memory and sufficient processing power to throw at the task.
He didn't mention the biggest difference between gzip and xz - ram usage. At maximum compression, you need 674 MiB free to make a .xz file, and 65 MiB to decompress it again. That's not much on most modern systems, but it's quite a lot on smaller embedded systems.
Admittedly, in most cases, that isn't much excuse though.
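When memory is the constraint, xz itself can be asked to stay inside a budget - a hedged sketch, assuming a reasonably recent xz (the file name and limits are arbitrary):

# Lower presets need far less RAM than -9:
$ xz -2 rootfs.img
# Or cap compression memory explicitly and let xz scale its settings down:
$ xz --memlimit-compress=100MiB rootfs.img
# Show the limits currently in effect:
$ xz --info-memory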
It can also lead to disaster on a web server when linux decides to OOM kill a critical part of the infrastructure like the database server or memcached. Then you can get a cascading problem of services failing, all because of a careless unzip statement. (I've been there.)
Summary: Compatibility and decompression speed is more important than compression ratios for many use cases. Gzip is nearly universal, where lz4, xz, and parallel bzip2 are not.
The challenge of sharing internet-wide scan data has unearthed a few issues with creating and processing large datasets.
The IC12 project[1] used zpaq, which ended up compressing to almost half the size of gzip. The downside is that it took nearly two weeks and 16 cores to convert the zpaq data to a format other tools could use.
The Critical.IO project[2] used pbzip2, which worked amazingly well, except when processing the data with Java-based tool chains (Hadoop, etc). The Java BZ2 libraries had trouble with the parallel version of bzip2.
We chose gzip with Project Sonar[3], and although the compression isn't great, it was widely compatible with the tools people used to crunch the data, and we get parallel compression/decompression via pigz.
In the latest example, the Censys.io[4] project switched to LZ4 and threw data processing compatibility to the wind (in favor of bandwidth and a hosted search engine).
For anyone looking to stop making compromises, I recommend pixz. It's binary compatible with xz, and is better at compression speed, decompression speed, and ratio than both gzip and xz on multicore systems. I've adopted it in production to great benefit.
Totally agree with this. As someone with a commit bit to the project, as well as a long-time user, I'd like to second the recommendation. Pixz is a terrific parallel XZ compression/expansion tool. I find it indispensable for logs and database backups. Link: https://github.com/vasi/pixz
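A hedged sketch of using pixz in place of a plain xz pipe (directory and file names hypothetical):

# Compress a tree using all cores, producing an xz-compatible archive:
$ tar -cf - logs/ | pixz > logs.tpxz
# Decompress (plain xz -d would also work, just single-threaded):
$ pixz -d < logs.tpxz | tar -xf -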
Being a windows user these days, I am getting kinda frustrated with how anemic everyone is at even trying to google for 20s to find the windows solution.
7zip is the program you want to handle most everything, with both gui and command line options: http://www.7-zip.org/
Given how radically MS is trying to reform itself to be an open-source friendly company and how ineffectually inoffensive they've been the last 5 years, can we at least try and throw them a bone or two?
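For the record, a hedged sketch of the .tar.xz dance with the 7-Zip command line (file names hypothetical): 7z unwraps one layer at a time, so the first command leaves some.tar behind and the second unpacks it.

7z x some.tar.xz
7z x some.tar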
The article is not talking about Windows, so folks here aren't either. Why are you surprised that folks are uninterested in Windows?
I've preferred Mac systems for longer than most of the HN crowd has been alive, so I understand what it's like to feel ignored and in the minority. For years, Mac users were treated as pariahs. The tables have turned, and as someone who has been in your situation, I should have great empathy for your predicament.
And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.
Perhaps Microsoft will someday be worthy of forgiveness, either from the perspective of morality (e.g., Mozilla) or product excellence (e.g., Apple). Until that day, Microsoft will continue to reap what they have sown, given no more attention than they have earned.
> And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.
I feel like Apple has forgotten what was important, created then ruined a market, and lost everything that made it interesting (long before Steve Jobs passed away, by the way). Which cuts all the more deeply because back in the early 2000s they were walking the walk and taking a lot from NeXT's culture of developer friendliness. I grew up deeply invested in Macs and NeXT, which makes the realization painful, but... Apple wants to annihilate maker culture as it monetizes its platform. It's also stopped caring about design on a grand scale, instead appealing to very shallow notions of "visual simplicity".
That's all gone now, and they're consequently useless to me. I'd rather patronize a company currently doing the right thing after a troubled past than pretend a previously aligned company was still there.
It should be very telling that Apple AND Google's flagship hardware announcement of 2015 was something that Microsoft has been doing for years.
And if Microsoft suddenly goes evil again? Fuck them, I'll drop them and move somewhere else. Not Linux, unless the distros pull their act together, but I'm sure a competitor will emerge. Or I'll make one.
I know your feeling too, thanks for accepting I feel differently. And sometimes I ask myself "What the hell am I doing with this Surface book?" I won't pretend I don't have doubts.
It's sort of a rough time for devs right now even as we enjoy unprecedented prosperity and recognition. Big businesses are attempting to monetize and control every aspect of developers.
"...decades of misdeeds..."
Which they benefited handsomely from and were never sufficiently punished.
I'm with you, my trust of MS is still pretty close to nil.
GP seems to be talking about this line in the article:
> Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
I do think the author dropped the ball in not doing basic research for the Windows platform. His point is that people should switch from one compression tool to another. If people on Windows were unable to compress or decompress such files, that would be a huge problem for his argument.
It can invoke arbitrary executables as well, which makes it exactly as interoperable as Bash.
Powershell can run on Linux, too. I've even met a few people who quietly prefer it.
So what, exactly, were you referring to? Shell choices are like editor choices: arbitrary and largely equivalent and without any real meaning or impact on a developer's productivity.
The context was: someone had difficulty finding a "tar -xf" equivalent for Windows. I pointed out that it would be nice if Microsoft included tar and other basic utilities in their OS releases. With an out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc. The Windows way of doing things is totally different from *nix culture (OS X, Linux, etc). In that context powershell is not "interoperable" (maybe bad wording from me).
> The context was: someone had difficulties to find "tar -xf" equivalent for Windows.
Right and putting that in google, "tar equivalent for windows", immediately nets 5 useful results. You can use tar, or a windows command line variant of 7z or tar, or a gui.
> With out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc.
On an out-of-the-box Linux machine, you generally can't do a lot of things either. It seems particularly ironic that in a discussion about how we shouldn't be using old UNIX tools just because they're entrenched, you then call for compatibility.
> Windows-way of doing things is totally different than *nix culture (OS X, Linux, etc).
Stupid legacy path limits not included, Powershell is in my experience just a superior way to do things. I should maybe restart my blog to talk about that.
But even if we ignore Powershell and windows, your statement is divisive within the Linux community. MANY people prefer shells on Linux that don't adhere to the bash legacy. TCSH and CSH are very popular, to this day. Are they 'not interoperable'?
Everyone's got a big chip on their shoulder about how development tooling "should be." One of the things I've come to realize is how arbitrary, unnecessary, and useless these mores are. They just hold us back.
No. I'm not. But minimal distros are the primary surface area Linux exposes for many people these days. The desktop userbase is (justifiably) almost non-existent, and most core cloud distros don't even come loaded with curl by default. It's even more extreme as you work with docker.
I really don't see what that has to do with Microsoft. 7-zip is GPL and cross platform. In fact - I just checked - a command line 7-zip is available by default on my Linux Mint install.
Right, but an xz-supporting compression tool is shipped with many Linux distros, whereas you do need to grab a tool to support xz on any version of Windows.
But you're right, 7z is an under-appreciated format.
I have had awful experiences with 7-zip. Recently I was unable to decompress a multi-file rar, it kept saying it was corrupted. I downloaded it several times until I realised it was actually fine, because winrar could decompress it.
I sorta view RAR files as the problem. Only WinRAR ever gets them really right. I dunno why that is, but I also dunno why anyone keeps using RAR. It's something of a joke in the windows dev community.
When I owned an Amiga they kept on changing the archive format to find a better one that saved space.
They had arc, pak, zip, zoo, warp, lharc, and every Amiga BBS I got on used a different archive format. Everyone had a different opinion on which archive format compressed things in the best way.
I think eventually they decided on lharc when they started to put PD and shareware files on the Internet.
Tar.gz is used because there are instructions for it everywhere and it seems like a majority of free and open source projects archive in it. It is a more popular format than the others right now. Might be because it is an older format and has had more ports done.
But I really like 7zip, it seems to compress smaller archives, before 7Zip I used to use RAR but WinRAR wasn't open source and 7Zip is so I switched.
With high speed Internet it doesn't seem to matter much anymore unless the file is over a gigabyte in size. Even then BitTorrent can be used to download the large files. I think BitTorrent has some sort of compression included with it if I am not mistaken, to compress packets to smaller sizes over the torrent network and then restore them when the client downloads them. That is if compression is turned on and both clients support it.
> When I owned an Amiga they kept on changing the archive format to find a better one that saved space.
It happened on DOS too: ZIP, ARJ, RAR, ...
That was back on the days of floppy disks (which usually had at most 1440 KiB) and small hard disks (a few tens of megabytes). Even a few kilobytes could make a huge difference.
As storage and transfer speeds grew, "wasting" a few kilobytes is no longer that much of an issue, and other considerations like compatibility become more important. Furthermore, many new file formats have their own internal compression, so compressing them again gains almost nothing regardless of the compression algorithm.
The reason both ZIP and GZIP became ubiquitous is, IMO, that the compression algorithm both use (DEFLATE) was released as guaranteed to be patent-free, back in a time where IIRC most of the alternatives were either patented or compressed worse. As a consequence, everything that needed a lossless compression method chose DEFLATE (examples: the HTTP protocol, the PNG file format, and so on).
LHA and LZX were the popular pair when my Amiga days came to an end. Being a commercial product, the latter kind of occupied a position similar to the one RAR does on Windows.
Microsoft ended up adopting LZX for things like CAB and CHM files.
* MikTeX (windows TeX/LaTeX) system started using it circa 2007.
* TexLive switched, iirc, circa 2008.
* ArchLinux started circa 2010.
* Linux kernel 2.6.0, iirc, circa 2011(?)
* Gnome switched circa 2011.
There are others but I can't remember. It's fairly common now.
OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
I had no idea tar could autodetect compression when extracting. (I wonder if this is GNU tar only, or whether the OSX default tar can do it too?) I've been typing `tar zx` or `tar jx` for too long.
I highly recommend using atool[1], and never worrying about extracting archives again. It's a wrapper around basically every compression/archive tool in remotely common use.
Bonus: it decompresses to a safely-named subdirectory, but moves the contents of that subdirectory back to the current directory if the archive contained exactly one file. Highly convenient without any risk of accidentally expanding 1000 files into the current directory.
After creating this macro, I've basically never had to care about how to decompress/unarchive anything.
# 'x' for 'eXpand'
alias xx='command atool -x'
# use
% cd $UNPACK_DIR # (optional) (can be the PARENT dir)
% xx foo.zip # or .tar.{gz,xz} or whatever
foo.zip: extracted to `foo' (multiple files in root)
% cd foo/
% ls | wc -l
3
atool actually has many other useful features, but it's worth it just for the extractor.
For OS X and MacPorts I use the p7zip package, which installs the 7z command. It understands .zip, .rar, of course .7z, and probably other formats. I wrote a simple Automator script to use that command from Finder and it works just fine (actually .zip and .tar.gz formats are supported by the OS X archive tool, but .rar is not, and I often have to deal with that format).
Old doesn't mean bad. Sometimes, it means "finished".
About the only thing that a tool like this would need to be updated for at this point is support for a new compressor. (That 2012 release mainly added support for plzip.)
As far as I know, different zip formats are not auto-detected. However, it does (optionally) use file(1) to detect the file format, which can be overridden with the 'path_file' option, so a hack may be possible?
Wow, really? Honestly I've never used the z or j flags, I could not even tell you what they do. I use `tar caf whatever.(txz|tgz|tar.lz|tar.lzo) /path/to/files` to create, and just `tar xf whatever` to extract.
Best thing about libarchive/bsdtar is, it also handles zip, rar, cpio, iso files and many others.
So basically bsdtar xf is what I'm using to extract almost every archive.
My tests with brotli suggest that it is overrated - it is slow and has poor compression ratios compared to xz. It confuses me why it is being pushed so hard...
Brotli does not work well for bhouston's use case, so his original wish stands, and your helpful suggestion that he should be able to use Brotli in the near future unfortunately does not fulfill it.
Yeah, the example given by GP involves large binary streams. Brotli was designed for small text documents with lots of English words in them, as we often see on the web.
Where does it say that Brotli is for small English text documents? I didn't see anything like that in the draft spec or the Google blog post.
The spec doesn't say much on the subject, but has this item in the Purpose section: "Compresses data with a compression ratio comparable to the best currently available general-purpose compression methods and in particular considerably better than the gzip program"
Yes, it has that optimization for short data (though it's not restricted to English), but the PR and specs say it's still meant to be a general purpose compressor. And it does very well on most types of large data.
It's a real problem in a compressor proposed for general purpose use when it's shown that a naturally occurring major class of data has this bad performance.
Decompression of xz is slow and quite memory intense. I'd argue that the light memory footprint of gz decompression is better suited to the web, particularly mobile where you need to balance battery v bandwidth.
Compressing xz is relatively slow, and would be expensive for web servers. Maybe something like LZO would be better? (Which OpenVPN uses to compress data in transit.)
I think this is one of those things where the author is pretty much 100% right and it just won't happen. Habits are hard to break and in many cases, the negatives just don't impose a high enough cost to matter.
There are times when I do seriously look for the optimum way to do things like this and then there's most of the time I just want to spend brain cycles on more important problems.
I believe that the biggest driver of using old-school ZIP or GZIP is the fact that everyone knows that everything can decompress these formats. And in a modern world of terabyte disks in every laptop, multicore multi-Ghz CPUs, and megabit bandwidth, it isn't worth the effort of using a format that saves an additional 20% on compressed size at the cost of someone not being able to decompress it.
That's only really much good if you're in the business of archiving things like that. For most people, source trees are ad-hoc downloads for patch fixes, oddball platform compiles, etc. And then the universality of gzip is better than any marginal space savings from xz.
Isn't that mostly just because of how tar is designed? It's a concatenation of individual files with headers, so you have to decompress the whole thing to get a file list anyway. At which point you might as well save the decompressed tar in temporary.
It's such a shame that so many slow-moving 'enterprises' still have RHEL 6 servers; it's so incredibly outdated - not only does it limit what they can do, but it negatively affects people's impressions of Linux.
For me, as a Python user, I've found that gzip is currently the only compression format that allows streaming compression/decompression. I don't want to have to store hundreds of gigabytes of data and THEN compress it, rather than compressing it right during file generation. I haven't found any other compression lib that supports this out of the box.
I generally use gzip for everything because it's everywhere and good enough, but xz and bzip also support streaming, in fact anything that tar compresses does afaik.
> dd if=/dev/urandom bs=1M count=5 | gzip > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.438033 s, 12.0 MB/s
> dd if=/dev/urandom bs=1M count=5 | xz > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.52744 s, 3.4 MB/s
> dd if=/dev/urandom bs=1M count=5 | bzip2 > test
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.804324 s, 6.5 MB/s
The algorithms support streaming, but that doesn't mean the implementations in (in this case) Python libraries necessarily expose it. Although I can't understand why they wouldn't; presumably they just wrap the same C libraries everybody else uses, and a streaming interface wrapping another stream (i.e. file, socket) should feel very natural and easy.
Decompressing takes 4 times as long? I wonder if that is slow enough to create a bottleneck in processing. Not everyone uses compression for purely archival purposes. In the genomics field, most sequencing data are gzipped to save disk space. And most programs used to process the sequencing data can take in the gzipped files directly.
geezip is fun to say. Until there's a catchy name for "crosszip"/xz/whatever, I think we're preaching to the wrong choir. There's a human element in toolchains. Address it.
it's a shame that algorithm improvements would necessitate a shift away from the name "gzip". It would be better if the intent to compress/decompress was orthogonal to the features of the implementation (compression ratio, speed, split-ability, etc...)
The article misses (but the comments here touch on) that all compression algorithms have a built in obsolescence, even the fancy shiny xz.
It's not the algorithms per se that go obsolete, but their use in specific cases, until all are diminished. Whether lossy or lossless, eventually other technological advancements render them unnecessary.
And it seems that the strongest algorithm is usually the earliest to be widely adopted; these are almost never toppled.
Just like .gz, look at MP3 or JPEG -- 'better' alternatives exist, but the next widely adopted step will be to eliminate that compression entirely. The first radio station playout systems were hardware MPEG audio compression, and the next most widespread step was to uncompressed WAVs. Even video pipelines based on uncompressed frames are becoming more widespread. Eventually the complexity and unpredictability of compression is shunned for simplicity.
Read the gzip docs and the focus is around compression of text source code, a key use case at the time but barely considered these days -- tar.gz source archives exist almost only out of habit; they could just as well be tar.
Media codecs are a little different because there is a significant cost to replacing hardware that only supports the old standard. It seems to me that AAC is becoming pretty ubiquitous and probably will be the go-to standard for the next 10-20 years. We are also at a point where the vast majority of users aren't going to notice the difference between a 100kbps MP3 vs AAC stream, which is going to be less than 1% of an he stream, so there is little incentive to innovate by the players with the deepest pockets. Until network capacity is free (or at least cheap compared to the cost of CPE), pipelines based on uncompressed media are not going to be a thing outside of the production end of things.
For source tarballs, though, there is basically zero cost to switching, since you can download a new compressor in under a minute and should be able to assume that your users are pretty sophisticated. The incentives are similar to media in that the cost of transfer has to be weighed against the cost of CPE, except that since users supply the CPE the cost is effectively zero, and compression will probably always make sense.
gzip is fast, gzip -1 is even faster, gzip has low memory requirements, gzip is widely adopted. Those are the reasons gzip is still being used, and why gzip has a future. I.e. the gzip "ecosystem" is rich and useful, despite not being the best compressor in terms of compressed size.
P.S. There are gzip-compatible implementations with tiny per-connection encoding memory requirements (< 1KB).
tl;dr: xz compresses better but is significantly slower. This isn't the deepest analysis of potential tradeoffs you might be able to find.
A few reasons why gzip is still useful to have around:
* Speed is critical for many applications, and so size can take a backseat when performance is critical or resources are low.
* gzip is basically guaranteed to be available everywhere in utility and library forms.
* Download speeds vary and so the faster your pipe, the less the archive size factor will matter, and the faster-worse compression might win out in other comparisons.
* xz doesn't compress every type of data this much better than gzip. I've dealt with scenarios where the difference is consistently less than 2%, and the extra time xz spends is actually a tremendous waste.
Sure, for package downloads where xz files will be significantly smaller it makes sense to save the bandwidth, time and storage space. But it's not 100% cut and dry.
Ahh, kids. "Let's all start adopting this new thing that has existed for a few years that's slower and uses more ram because we're wasting literally tens of megabytes all the time!"
There's also lzip, which apparently uses a similar compression algorithm as xz but is apparently built with partial recovery of corrupted archives in mind (so more useful for long-term archival or backup storage). It's made by the same guy who made ddrescue.
The only people who care about compression ratios are:
(1) People who still use 56k modems to download content
(2) People who host extremely popular downloads and who want to minimize their outbound bandwidth bills
If you're not one of these two, you almost certainly care more about compatibility and compression time than compression ratio. gzip continues to win on both those fronts, and it explains why it's still the most popular compression format other than ZIP (which is a better choice than gzip if you frequently need to extract a single file from a compressed archive).
Until there's a compression tool released that can compress at wire speed (like gzip) and has a significantly better compression ratio, don't expect the landscape to change much.
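A small illustration of the single-file point (archive and member names hypothetical): ZIP's central directory lets you pull one member out directly, while a .tar.gz has to be decompressed from the start until the member turns up.

# Stream a single member straight out of a zip:
$ unzip -p release.zip docs/CHANGELOG > CHANGELOG
# The tar.gz equivalent still decompresses the stream up to that member:
$ tar -xzf release.tar.gz docs/CHANGELOG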
> We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.
The 'xzgrep' script (and related xzdiff, xzless, xzmore scripts) are part of the standard xz package, though they are an optional feature so YMMV between distros.
~ $ xx /usr/portage/distfiles/xz-5.0.8.tar.gz
~ $ cd xz-5.0.8/
~/xz-5.0.8 $ ./configure --help | grep -A1 scripts
--disable-scripts do not install the scripts xzdiff, xzgrep, xzless,
xzmore, and their symlinks
zgrep simply decompresses all the data and feeds it into regular grep. If the data is indexed in some way, it is possible to do better by not having to look at all the data exhaustively.
It's possible to compress and index a file at the same time, gaining both a size and speed advantage over the original. For example: https://en.wikipedia.org/wiki/FM-index
What about availability? I often find myself having to download and compile (de)compressing software because the authors of some other software I need decided to ship it in something other than the standard (.tar.gz), which is available in basically all *nix boxes.
The Weissman score is less than worthless, it "isn't even wrong". This comes up in most HN discussions about compression algorithms and I'm waiting for it to go away.
^ Yes. Clearly people can't take jokes here, already downvoted.
Edit: Even the above got downvoted and I get reminded again that HN is full of stuck up, overly sensitive and unhumorous engineers. With the exception of this being a great place to find interesting things to read the community aspect is uninviting and unforgiving.
But trivia is still worthy of knowing about, however tired and overdone, or dated and played-out. Just like, if it were XKCD yesterday, it was The Far Side twenty years ago.
Not many single-letter options were available. Tar is kind of like ls that way. At least it's easy to remember for those of us who already learned to use lowercase-j for bzip2.