Reminds me of when Apple started providing "smaller size updates" to OS X. I was curious about the details since my doctorate had touched on the topic, so I worked my contacts (I had a few in Apple engineering from the FreeBSD / OS X relationship) and after a few months I got back an answer: "We're using a tool called bsdiff, are you familiar with it?" I was indeed, since I was the author of said tool.
(Just to be clear, there was no license violation involved in this case; just a lack of awareness of the provenance of the open source software they were using.)
While I'm not the author of anything, I did on one occasion share Russ Cox's articles on regexes with a fellow developer, only for that developer to reply "that guy is making a mountain out of a molehill, just use re2".
For anybody who's lost: Russ Cox is the original author of re2, a fast C++ library implementing regular expressions that are guaranteed to run in linear time.
This is still relevant today, too. The last several JavaScript vulns that people at my company have had to upgrade around were caused by accidentally quadratic regexes in JavaScript. One poor library[1] was attempting to match a header whose grammar is the whoppingly complex,
1#token
(this is in the HTTP spec's notation: it means 1 or more `token`s, comma-separated, with optional whitespace around the comma) and hit the quadratic behavior by trying to split the incoming values with,
/ *, */
I was shocked that this wasn't compiled to a DFA. (I checked, too: my JS exhibited the behavior in the bug report.)
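To make the failure mode concrete, here's a rough sketch in Python (whose standard re module is also a backtracking engine, so it shows the same shape of behavior; the exact timings are purely illustrative):

    # Splitting on / *, */ goes quadratic on a backtracking engine when the
    # input is a long run of spaces with no comma: at every start position,
    # ' *' greedily eats the rest of the string and then backtracks one space
    # at a time looking for ',' -- O(n) work per position, O(n^2) overall.
    import re
    import time

    pattern = re.compile(r' *, *')

    for n in (10_000, 20_000, 40_000):
        evil = ' ' * n  # pathological header value: all spaces, no comma
        start = time.perf_counter()
        pattern.split(evil)
        print(f"n={n:>6}  {time.perf_counter() - start:.3f}s")
        # the time should roughly quadruple each time n doubles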
This is, I also think, another reason why "simple" text protocols are not really so simple. The grammar above is "trivial", and yet, this is the end result. I don't feel like the library is particularly at fault: I doubt I would have caught this in code review.
>> "that guy is making a mountain out of a molehill, just use re2"
That's an odd thing about the tech world: it's accessible. As you get better in different areas, you are actually more and more likely to make contact with important people (big names? people who did important stuff?). This can creep up on you if you're not aware of what level you're operating at. It can be a small world.
Also, people who say so-and-so company (usually Google) is hard to contact for support, or that they require expensive support contracts before they'll talk to you, have likely never tried sending email to the appropriate mailing list for the product.
It's amazing how often doing this completely bypasses any corporate first-line-support structure in the way, and just puts the email right into the inbox of the line engineers working directly on the product. It's also amazing how quickly those line engineers reply. (It's as if they treat "replying to random messages on the product mailing list" as their highest-priority job. Or maybe it's just that they're technical people, and my questions are usually very nerd-snipe-y, and get them hooked.)
It's a great concept in theory, but in practice... find me the mailing list for Google Photos. Or Google Keep. These are two Google products that I use daily (including paying for one!)
Well, yeah, there does have to be a public mailing list.
My point was that there are often public mailing lists, where engineers with real engineering problems could discuss those problems with the engineers responsible for the product/service; and yet the engineer with the problem nevertheless doesn't even think of using the mailing list to reach out, but instead decides to go through regular customer-service support channels to get their problem solved.
I've also gotten serious problems (as in, the service we're paying real money for is totally broken) solved by contacting a friend who worked at $BIG_COMPANY and having the friend escalate internally.
The point is that I shouldn't have to bypass the official channels that way. These organisations are operating at the level of ad-hoc individual heroics, which is the lowest tier in terms of organisational maturity. In a start-up where everyone has to do everything and no-one has worked anything out yet, that's completely understandable. In a many-billions-of-dollars business with enough influence that someone's quality of life or the viability of some other business could be profoundly affected if the giant screwed up, we should be demanding better by now.
What I've seen so far is that companies in general don't get better at this as they grow. They add process on top of process, and each level makes tackling 90 percent of support requests more efficient (for them), but the more difficult requests just don't make it to the person who could help.
That is certainly one pattern we see repeating, but I don't think it's what is happening with many of the big tech companies. They're opting for the alternative where tackling 90% of support requests is highly efficient because those requests are simply routed to /dev/null.
In one sense, it's hard to blame them. After all, if no-one who matters to their revenue stream is actually going to change behaviour because of that dismissive policy, it saves them all the overheads of providing useful support and costs them practically nothing. It's just good business, right?
What is strange is how they've got away with it for so long and most people still don't seem to be switching to alternatives, even as the tech giants casually squash them without even noticing. At some point around here, the words "competition" and "regulation" enter the room.
>Well, yeah, there does have to be a public mailing list.
Imo, this is not scalable or sustainable, and mailing lists are not a replacement for adequate customer support.
The only reason sending emails directly to mailing lists for specific Google products works is precisely because those mailing lists are not public and not flooded with bajillions of emails from the general public. So those who send the emails are already somewhat pre-screened in a way, because if you know that mailing list email address in the first place, you are very unlikely to send something like "my cousin couldn't remember password to their google photos account, can you fix this please". That's why everything there ends up being read and addressed. If those mailing lists were public, then they would be just as useless and ineffective as the current customer support routes currently are for Google.
TL;DR: mailing lists for specific products are a nifty workaround for the time being, but they aren't a good, sustainable solution for shitty customer support. Making those mailing lists public would not only fail to solve the problem, it would make them just as ineffective as the current customer support routes. There is no "one weird trick" to solve the customer support adequacy issues with Google; it has to be an actual customer support solution, one that won't be easy and will take time.
I'm confused about what you mean about "public." I'm just a regular guy with no connections to Google, other than being a GCP customer. I found the mailing list addresses for each GCP service listed directly in the support documentation. Literally anyone who has GCP problems would end up finding those addresses, if 1. they clicked on the "help" button and went through the workflow presented, and 2. didn't first pay for extended white-glove support and then immediately reach for it for any problem that came up.
By my thinking, that's a "public" mailing list. They're not hiding it from you. The opposite, really — they're trying to get everyone to know and use it, by making it free to any GCP customer, while the actual CSR kind of support requires paying for a subscription to a higher support tier. The mailing list, presented in Google Groups format, is literally what GCP calls their "support forum." It's supposed to take on all comers, including dumb customer asks.
I think the disconnect here is that Google engineers are much more likely to answer the low volume of technical "nerd-snipey" questions from other developers than the high volume of non-technical questions they'd get from the general public for something like Gmail.
Lately, I've seen a few open source projects go this way: GitHub issues; that gets unwieldy, so they create a Discord; that descends into chaos, so they nominate a community volunteer; issues are then filtered by that volunteer's preferences; etc.
I like Discord, especially in the early days, when you can reach out to the principal dev, etc. But it soon seems like they either disappear to get work done (good) or spend all their time on it (bad). Either way you end up with chaos.
Our company has a Discord, but we employ a professional Community Manager for it. When community members bring issues up:
1. they're encouraged to do so in public, so that other community members can help if possible, and/or so bots can reply with suggested FAQ answers;
2. the Community Manager will answer with the company line for questions the company has set answers to (e.g. "when are you releasing X?" or "why is [abusive DoS-like pattern of requests to your service] not working?");
3. otherwise, if the Community Manager knows the answer for sure off the top of their head, they'll give the answer;
4. and if not, the Community Manager relays the question to an engineer in our Slack, where we either have an answer off the top of our heads, or we file it as an issue.
Seems to work just fine for us so far.
Some of the engineers are also sometimes in the Discord (and we're all registered to it), but other than the Community Manager, it's not our job to be in there.
I've more than once had one of the core developers of Elixir or Phoenix answer a question almost right after asking it in the Slack or IRC channel. I often felt a bit embarrassed to take up their time considering how 'basic' these questions were.
I've had similar experiences in other language/framework communities. It's amazing how helpful some of these very productive people can be to random chat visitors :)!
The trick is to remember that they're almost certainly working on that stuff because they enjoy making users happy, so they're also doing support for the same reason.
I do feel a little bit embarrassed if it turns out they're reading the docs to me, but I feel embarrassed about that whether it's an expert or a fellow n00b ;)
It depends on your issue. We got good support emailing with the TF Lite team on a neural net bug. I think if you’re interacting with open source in a value-add way, Google support is often quite good. If you’re looking for support for integrating for sales or classic customer support it can be terrible to non-existent.
> If you’re looking for support for integrating for sales or classic customer support it can be terrible to non-existent.
> Maybe it's just that they're technical people, and my questions are usually very nerd-snipe-y, and get them hooked.
Integrating sales or classic customer support is boring.
I mean, I get that it pays the bills, but when I've got a million priorities, boring work that I don't really get credit for goes to the bottom of the pile.
no. there is a big difference between having support and having someone that is passionate about something helping you out.
support should be there and should be available for everything from the simplest issues to the most complicated things about A PRODUCT.
you will not get much traction if you ask those same things of the expert people on a mailing list.
I guess I've never needed "support" in that sense.
I almost always solve problems with the products/services we use myself — up to and including forking the vendor's codebase to fix their shit for them — because it's almost always the fastest way to do things. I've already been working with their product for a while, and I already know exactly what my own problem is. Provided I also know the language their code is written in, that translates to being able to code a patch myself, faster than I can get someone on their end to comprehend the problem I'm having.
That applies up until the point where there's a problem surface that's just plain inaccessible to me (i.e. the inside of a proprietary mobile app or SaaS service), at which point I have to reach out to tell them that it's broken / missing something on their end. (And even then, if I have a spare hour and access to the offending binary, I'll reverse-engineer it a bit to see if I can hotpatch it while waiting for them to get back to me.)
I suppose, for people who don't think this way, there can be value in "support." But IMHO there's more value in just hiring some DevOps engineers who do think that way. Then all the easy "support" requests get handled in-house, and so you'll only ever need the kind of "support" that involves direct bug reports to the engineers from the vendor who built the thing.
you are by definition a power user. if your product is for power users that’s fine.
if your product is targeted to everyone but only power users can figure it out when there is an issue... well you have a problem.
also, being able to figure something out != you should figure it out. your time is limited and the complexity of remembering all those things that you figured out (even if you have the time) will quickly overwhelm you. unless it’s literally your job to support the product you should care about the interface of the product and what guarantees it makes
re: hiring devops engineers. i’m sorry, what? if my email suddenly does not work I’m supposed to hire a devops engineer now?
I mean, if I know that some service is using e.g. Redis under the covers, and the problem is in Redis itself, then submitting a patch upstream to Redis; waiting for it to get upstreamed; and then telling the cloud host to update their Redis version to solve the problem — is usually a pretty reliable path.
But otherwise, like I said, that's when "the problem surface is inaccessible."
I read about a similar example this week. Some news orgs filed FOIA requests for Dr. Anthony Fauci's email and I was surprised at how many regular people just emailed him and got a response.
Apparently the guy answers about 1000 emails per day.
Google interviewing lore is that there once was a candidate who was asked if they were familiar with MapReduce and replied "MapReduce? Is that like Hadoop?"
Reportedly, this was also a major factor in Google's strategy shift to open-source a lot of their infrastructure (gRPC, Bazel, TensorFlow, LevelDB, etc.).
Like the featured article, HN threads - then and now - are pretty low-key with regard to participant introductions. Usernames are just text, with no flair or qualifiers, so the focus is on the speaker's content itself. `libria` is free to converse with `cperciva` as an equal peer, which is nice, because in most other IRL forums I'd be acutely aware that's not the case ;)
I like the part where someone calls cperciva's idea bad, he says what's bad about it, and then the founder of Dropbox replies saying they're just starting and it sounds like they're in the same space.
I don't agree with that phrasing. A lot of people on that thread seemed sure cperciva was some arrogant dickhead bound for failure, but tarsnap is going a lot stronger than many of them are. Also people in the thread were amazingly rude to him, while he seemed pretty polite to me.
I sent an email to a corporate Steam email address, asking whether I’d be allowed to post a screenshot from Half-Life in a computer graphics shading thesis. I ended up with a response from Gabe Newell, CEO, shortly after, and from another engineer who invented skeletal animation, all excited to talk about it with some random kid.
I first encountered bsdiff when working on reverse engineering Blizzard MPQs (their proprietary packaging format, long abandoned now in favour of other stuff). They started using it for mpq patches some time in … 2007 I want to say? I was 16, had just started to code.
10-ish years later, I'm doing other stuff but still working with Blizzard-related tooling sometimes. I was talking to a Battle.net engineer about their latest-and-greatest game update protocols. He tells me they're thinking of adopting this great thing called bsdiff for the next version. I giggled a bit.
I don't understand what bsdiff does, or is. I am a software developer and I frankly have no clue what I would ever use bsdiff for! I've read what it does (libraries for building and applying patches to binary files) and still don't really have a sense for what the purpose of this tool is.
What are some real life use cases for it? When does a developer need such a tool?
Implementing software updates where you don't want to ship the entire binary again, only the diff, would be one. In some video games the assets are also packed into massive binaries, so you don't want to ship gigabytes of data because you replaced one icon. Sadly, many games do this anyway nowadays.
There are other solutions to this problem that the game industry uses. Binary diff patching is slow, incremental, involves large diffs and has the possibility of corruption. It was used back in the mid 90's (RTPatch was the big name), but really isn't used anymore because of the drawbacks.
Games frequently use an override directory or file. The patch contains only the files that have changed and is loaded after the main index and replaces the entries in the index with the updated ones. This is the most common way of doing a patch if it's not just overwriting the original files.
Some games load their file as a virtual filesystem and then the patch just replaces the entries in the virtual store with new ones. Guild Wars 2 works this way. This is only common in MMOs though.
Games also use this because it's a straightforward way to almost guarantee a physical ordering of the files in the VFS, which is/was a common optimization strategy in the days of CDs and hard drives (profile what order the game needs files, then put them in the archive in exactly that order = tada, loading 4000 files behaves like a sequential read).
Another reason is that certain operating systems originating in the state of Washington have performance problems when you access small files or directories containing many files.
Circa 1995: we (my company) used RTPatch, as the users at the time were floppy-based via post but enjoyed BBSes (and Prodigy, etc.), since the nature of the software/industry made it a social community. We could upload small RTPatch-based updates and bugfixes to our tiny company BBS, and users could dial in and download a patch a lot faster than waiting for a floppy in the mail (besides avoiding the usual corrupt floppies that plagued the tech).
Fun fact: this is not a new concept. Doom used overrides with its WAD file format. Mod authors could release their mod files, replacing or adding content, without stealing the game's level files.
There may be prior art to that, but as a young coder that was the first time I’d seen it
Yes, overrides. I've heard talks on this at conferences with a couple big publishers. A lot of effort is put into it but obviously if we were distributing OS security updates like Apple it would be a whole different ballgame.
To my knowledge, most developers have gone back to binary patching for obfuscation purposes. Bethesda does this now (and ID, by extension), as well as many other developers I've seen.
For at least the Nintendo Switch (not sure about other modern consoles), the digital distribution infrastructure is built in terms of overlay packfiles. Games, updates, and DLC on disk are all single-file archives / filesystem images. The OS, when launching a game, mounts the game + its updates + its DLC together as a transparent overlay filesystem. The game just sees a unified representation of its newest version, with whatever DLC has been installed, sitting under (IIRC) /title.
I wouldn't be surprised if the other consoles also do things this way. It's a very sensible way to manage updates — especially when a game is running off of physical media but the updates are held in local storage. It also means there's no point where the update gets "merged in" to the base image, which means updates can be an atomic thing — either you have the whole update file downloaded + sig-checked (and thus it gets added to the overlay-list at boot) or you don't.
And, if all the consoles are doing it, I wouldn't be surprised if studios that do a lot of work on console don't just use that update strategy even on PC, for uniformity of QA, rather than for "obfuscation."
Games are directories/packfiles containing many individual files, mostly binary art assets, plus one executable that takes up a negligible proportion of the total size. When binary art assets in the directory/packfile are updated between versions, they don't really "change" in the sense that a source-code file might be changed in a git commit; instead, they get replaced. (I.e. every file change is essentially a 100% change.)
The "binary diff patching" you're talking about the game industry using, was just the result of xor-ing the old and new packfiles, and then RLE-encoding the result (so areas that were "the same" were then represented by an RLE symbol saying "run of zeros, length N"). For the particular choices being made, this is indeed much less bandwidth-efficient than just sending a new packfile containing the new assets, and then overlay-mounting the new packfile over the old packfile.
bsdiff isn't for directories full of files that get 100% rewritten on update. (There's already a pretty good solution to that — tar's differential archives, esp. as automated by a program like http://tardiff.sourceforge.net/tardiff-help.html .)
Instead, bsdiff is for updates to executable binaries themselves (think Chrome updates), or to disk images containing mostly executable binaries + library code (think OS sealed-base-image updates — like CoreOS; or, as mentioned above, macOS as of Catalina + APFS.)
In these cases, almost all the files that change, change partially rather than fully, often with very small changes. The patches can be much smaller if they're done on the level of, e.g., individual compiled functions that have changed within a library, rather than on the level of the entire library. (Also, more modern algorithms than xor+RLE can be used — and bsdiff uses them — but even xor+RLE would be a win here, given the shape of the data.)
There's also Google's Courgette (https://www.chromium.org/developers/design-documents/softwar...), which goes further in optimizing for this specific problem domain (diffing executable binaries) by having the diff tool understand the structure/format of executables well enough to be able to create efficient patches for when functions are inserted, deleted, moved around, or updated such that their emitted code changes size — in other words, at times when the object code gets rearranged and jumps/pointers must be updated.
The goal of tools like bsdiff or Courgette isn't to reduce an update from 1GB to 200MB for ~10k customers. The goal is to reduce an update from 10MB to 50KB for 100 million customers. At those scales, you really don't want to be sending even a 10MB file if you can at all help it. The server time required to crunch out the patch is more than paid off by your peering-bandwidth savings.
XOR+RLE is almost useless for binaries, because almost any change will cause instructions to be added or deleted, offsetting the entire binary after the first change, so the xor never lines back up. On top of that, these changes cause changes in addresses in the first part of the binary, so you end up with a zillion similar-looking xor deltas in the first part of the file that won't compress well with RLE.
In fact, if you use smarter compression than RLE, I wouldn't be surprised if the update was larger than the original binaries after the xor, as an offset xor will likely increase the chaos (entropy) in the file, making it compress worse than the original.
bsdiff was specifically designed to intelligently handle these situations, which is why it works.
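A quick way to convince yourself of this (a rough sketch; random bytes stand in for a compiled binary here, just to show the shift effect):

    # One inserted byte shifts everything after it, so the xor delta is
    # non-zero (and RLE-incompressible) for essentially the whole file.
    import os

    old = os.urandom(1 << 20)                  # stand-in for a compiled binary
    new = old[:100] + b"\x90" + old[100:]      # "insert one instruction" at offset 100

    xored = bytes(a ^ b for a, b in zip(old, new))
    changed = sum(1 for b in xored if b != 0)
    print(f"{changed / len(xored):.1%} of bytes differ after a 1-byte insertion")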
Just tested it on Chromium from my package server (90.0.4430.72 vs 90.0.4430.212):
Pedantic point: it's not "almost useless for binaries." It's almost useless for compiled, PIC binaries in modern executable formats like PE or ELF that allow for lots of load-time address-space rearrangement.
XOR-and-RLE works well for binaries from non-HLL languages (assembler, mostly) where — due mostly to early assemblers' lack of support for forward-referencing subroutine labels from the data section — subroutines tend to ossify into having defined address-space positions.
You can observe this in the fact that IPS-patchfile representations (which, while produced by a different algorithm, are basically equivalent to XOR-and-RLE in their results) of the deltas between different versions/releases of old game ROMs written in assembly are actually rather small relative to the sizes of the ROM images themselves. The v1.1 ROMs are almost always byte-for-byte identical in ROM-image layout to the v1.0 versions, except for where (presumably) explicit changes were made in the assembler source code. Translated releases are the same (sometimes, but not always, because they were actually done by the localization team bit-twiddling the original ROM, since they didn't have access to the original team's assembly code).
(This is also why archives that contain all the various versions/releases of a given game ROM, are highly compressible using generic compressors like LZMA.)
bsdiff is pretty similar to RTPatch, which is what the game industry used in the past. I'm unaware of what you're describing ever being used in practice, especially among large game houses.
That said, patches aren't really downloaded as standalone patches anymore because of Steam distribution. The way Steam handles it is documented, and if you're interested, it's available here: https://partner.steamgames.com/doc/sdk/uploading#Building_Ef...
But as an overview, Steam splits files into 1MB chunks and only downloads the 1MB chunks that have changed. The 1MB chunks are compressed in transit. Steam also dedups the 1MB chunks. I would assume that this works fine to manage the tradeoffs between size and efficiency.
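In the same spirit, here's a sketch of the general idea (not Steam's actual manifest format): split files into fixed-size chunks, identify each chunk by its hash, and only fetch the chunks the client doesn't already have.

    import hashlib

    CHUNK = 1 << 20  # 1 MiB

    def chunk_digests(path: str) -> list[str]:
        digests = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                digests.append(hashlib.sha1(chunk).hexdigest())
        return digests

    def chunks_to_fetch(have: list[str], want: list[str]) -> set[str]:
        # chunks already present locally (in any position) are deduped away
        return set(want) - set(have)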
One of the reasons games do this is that the data is compressed, so a "patch" might be indistinguishable from a real update.
Also, as a dev, you have no idea what version your users are updating _from_. You need to generate some number of patches, one for every version users could be updating from, and figure out whether in any of those cases you should just have them download the whole thing again anyway.
The simplest option is to generate patches for recent versions, where "recent" can be years in the past. It is a linear operation, but you only run it on release, so it probably isn't a huge cost. You can also use some heuristics, such as: if the diff is >20% of the file, just stop and force users still on that version to do a full update.
A second option is using zsync[1]. zsync is basically a precomputed rolling checksum. The client can download this manifest and they download just the parts of the file they need. This way you don't care about the source, if there is any similarity they can save resources.
And of course these can be combined. Generate exact deltas for recent versions and a zsync manifest for fallback.
Side note: One nice thing about zsync is that the actual download happens from the original file using range requests. This is nice for caching as a proxy only needs to cache the new data once. Is there a diff tool that generates a similar manifest for exact diffs? So instead of storing the new data in the delta file it just references ranges of the new file.
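For reference, the core trick behind that precomputed manifest is a rolling weak checksum, so the client can cheaply test every byte offset of its local file against the manifest's block sums. A toy version of the rsync/zsync-style sums (a real implementation pairs this with a strong hash per block to rule out collisions):

    MOD = 65521

    def weak_sum(block: bytes) -> tuple[int, int]:
        a = sum(block) % MOD
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
        return a, b

    def roll(a: int, b: int, out_byte: int, in_byte: int, blocklen: int):
        # slide the window one byte without re-reading the whole block
        a = (a - out_byte + in_byte) % MOD
        b = (b - blocklen * out_byte + a) % MOD
        return a, b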
We usually don’t compress the data on disk; decompression would make loading and file access slower.
Instead, we just pack the uncompressed files together (frequently using normal zip in a no-compression mode) so that we can avoid needing to ask the OS to open and close files for us or examine the contents of a directory, both of which can be kind of startlingly slow (by video game standards) on some common OSes. We will generally cache the directory data from the zip file and just use that rather than going to disk.
(of course, the whole download/patch would all be compressed for network transfer, but files would then be decompressed during the installation process)
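A minimal sketch of that "zip as an uncompressed packfile" idea (Python's zipfile here, purely for illustration; real engines use their own pack formats): store members with ZIP_STORED, parse the central directory once, and serve all asset reads from the single open archive.

    import zipfile

    def build_pack(pack_path: str, assets: dict[str, bytes]) -> None:
        with zipfile.ZipFile(pack_path, "w", compression=zipfile.ZIP_STORED) as z:
            for name, data in assets.items():
                z.writestr(name, data)       # packed, not compressed

    class Pack:
        def __init__(self, pack_path: str):
            # the archive's directory is read once and kept in memory
            self._zip = zipfile.ZipFile(pack_path)

        def read(self, name: str) -> bytes:
            return self._zip.read(name)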
On the Switch the reads are so slow that the fastest loading requires at least mild compression, despite the weak CPU. At least it did when I was testing packaging for my latest Switch release.
The PS4 also did the compressed-packages-by-default thing, if I remember right. The upside there being ample CPU for decompression, such that leaving data uncompressed was never the fastest option.
I could see that keeping one big file would still be advantageous in that environment, too. An fopen on each of a set of small files, plus read, plus close, over and over, does add up in CPU time and memory slack, whereas treating it all as one giant packed backing store has speed advantages, at a cost of dev time. Even if you are compressing it as well it could be an advantage, but I would expect there to be some spot where it crosses over from being an advantage to a disadvantage; I suspect that would be with small files/objects. And that would just be for speed: you may have to optimize for size in some cases. In the end it is usually better to build some test code and find out. Sometimes the results are what you expect; sometimes you are spending time on things that do not matter anymore but were a big deal 20 years ago.
Oh, definitely! I haven’t worked on anything targeting a console past the PS3 generation, and it completely slipped my mind that the latest-gen consoles are architected specifically for streaming compressed data.
On the Windows/Mac/Linux title I’m working on now, I definitely measure a sizeable improvement to performance when loading from an uncompressed zip rather than from a compressed one. But even that could be down to the particular set of libraries I’m using to handle it.
> We usually don’t compress the data on disk; decompression would make loading and file access slower.
Did you actually benchmark this? It probably makes sense in your head, but on any vaguely modern hardware it's very unlikely to actually be true because of how exponential the memory hierarchy is.
Console hardware tends to have fast processors & cache but extremely slow RAM. Benchmarking a console's memory vs cache access tends to be one of the first things a team of principal game devs do and that information becomes bible for their titles.
IIRC in a bunch of scenarios compression makes loading and file access faster, as you're I/O limited and it's quicker to read less data and decompress. You do need to choose simple/quick/not-that-much-compressing compression algorithms for that.
I have worked on a number of embedded products which ran from a compressed root file system on eMMC. The overhead was a wash, because RAM is so much faster than eMMC: what you spent in decompression time was covered by reduced eMMC access time.
If you know a user is on version 3 and need to update to version 5, then why not just send out all the patches between 3 and 5? Why do you need to generate a new patch for each pair of versions.
It feels a bit egregious when I have to download a 100MB update just because a few characters were buffed or nerfed. More involved changes end up being over 1GB.
Because it's not just version 3 to version 5, it's version 3 to version 84.
Not all versions are made equal either - one might be a character buff, another might reorder assets in the "big huge binary blob file" for performance improvements. At a certain point, rather than downloading 30MB per update for 25 versions and applying each incrementally (remember that you have to do them in order too), just download the full 1GB once and overwrite the whole thing.
Microsoft made sure in Windows 10 that it's almost unusable without an SSD, so your big binary blob file has random r/w access anyway.
Most backup software has been able to do good binary deltas of arbitrary data for decades. Even dumb checkpointing resolves the problem of downloading 25 versions - you download the latest checkpoint and the deltas from there.
Don't excuse poor design and programming: when you know a file structure, creating a differential update should be a short task. With a tiny bit of algorithmic knowledge you could even optimize the process to only download the needed assets inside of your big binary blob - if an asset was changed 7 times during the last 25 versions, you only need to download the last one.
I'd personally like to see a company put a little thought into innovating how they store data on disk so patches can be quickly applied like with git while also not requiring a full source recompilation.
It can get worse - some cheap and badly designed Android phones download updates for every month from when you first buy the phone until the current month, so maybe 10+ updates, and they aren't deltas (diffs) but full images. Ridiculous on so many levels.
It’s because they only tested updates from one version to the next, and not every version to every newer version.
It is a complete image, but phones today have nontrivial state that may be a problem - e.g. your baseband processor might have its own rom with its own update protocol, which changed between image 2 and image 7, so image 10 after image 1 will be unable to update the baseband.
If it's a cheap phone, I'd rather they do something brute-force but reliable than try to be clever when they know they don't have the budget to QA it.
I honestly consider that a pretty reasonable trade-off.
> One of the reasons games do this is the data is compressed, so a "patch" might be indistinguishable from a real update.
Does this happen with more advanced compression algorithms? I've rsynced zip files of different versions of internal software and the diff was always much, much smaller than the entire package.
Zip files have all the metadata in a footer rather than a header. As a result, compressed files can be added and overwritten by appending to the file without disturbing already compressed data. Additionally, the "deflate" compression likely does not span across files, so files that did not change from version to version would have a similar compressed byte sequence, regardless of the order they were added to the archive.
I'd argue that zip is a relatively simple compressed archive format. Its simplicity is its charm and the reason it's so popular. More space-efficient algorithms would be less likely to be "patchable", as there would be less redundancy / structure in the compressed representation to exploit (the best compression would have properties similar to random data).
> Additionally, the "deflate" compression likely does not span across files
Clarification: .zip (unlike .tar.gz, for example, or "solid" .7z) compresses each file separately; that has nothing to do with the compression algorithm used. In addition, DEFLATE, the LZ77-based compression which is by far the most commonly used in .zip (and also by gzip), has a window size of 32kB (uncompressed). So yes, even if you used DEFLATE on a solid stream (e.g. zipped a .tar archive), it couldn't remove any cross-file redundancy once it has gone past the first 32kB of each file.
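You can see the 32kB window limit directly with a small experiment (zlib here as a stand-in for .zip's DEFLATE; exact sizes will vary):

    import os, zlib

    for size in (16 * 1024, 64 * 1024):
        blob = os.urandom(size)               # incompressible on its own
        two_streams = len(zlib.compress(blob)) + len(zlib.compress(blob))
        one_stream = len(zlib.compress(blob + blob))
        print(size, two_streams, one_stream)
    # For the 16kB blob, the single stream is roughly half the size: the second
    # copy sits inside the 32kB window and becomes back-references. For the
    # 64kB blob, the duplicate is out of reach and barely helps at all.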
On the other hand, the High Voltage SID Collection (HVSC) distributes a zipped zip:
Each file is 1k-20k, and there are 40,000 or so of them. But they are catalogued in directories 3-4 levels deep, so if you just zip them, the metadata takes up 30% or so of the zip.
But the metadata does compress very well, so they zip it again.
zstd strikes a nice balance here. It can inject markers in the bytestream to make it "rsync friendly", but one could just as well say "binary diff friendly".
zstd itself also has the (pretty new) ability to use a shared file as shorthand during compression. What that means in practice is that diffs can be REALLY tiny if you have the previous archive download.
In general yes. After the first difference the compressed streams will be basically random compared to each other. However there are numerous things that may avoid this.
For zip files each individual file is compressed independently. So unchanged files and prefixes don't need to be resent, even if once a file changes the entire tail end of it needs to be resent.
Sometimes compression algorithms "reset" periodically, for example the `gzip --rsyncable` patch. This basically resets the compression stream so that a change will only affect part of the compressed file. This does have a cost in terms of compressed size, because the compressor can't deduplicate across resets. However, if the resets are infrequent you can maintain fairly good delta transfer with little space overhead.
Additionally some delta transfer tools detect common compression and decompress the file "in transfer", performing the delta checks on the original file.
Kind of an aside from your question, but binary patches not being much smaller than the full thing might happen more with modern games?
Hearsay, but from what I've heard, modern games may ship multiple copies of some assets with different levels or features so they can be loaded as a sequential read off the disk. While a block-oriented compression algorithm might sync up more reliably, if you're packing 200MB of assets for a level and they're all compressed as one stream to take advantage of the fact that they'll be read sequentially, a change 25MB in could still mean shipping ~175MB of changes.
So far, in most of the world, even platter disks (which have really poor performance with modern Windows) are faster than the network. That means you can download a description of the difference and reorder the file locally much faster than downloading it. Yes, it needs the file to be made in a way that is update-friendly - most current compression algorithms can be configured like that. Yes, compression will be slightly lower, but you will save on both download size AND disk space, because right now most patches require you to download the patch and then apply it, requiring twice the space. If you have a 5% larger asset file but can patch it with a few memcpy calls on a mapped file, it is a win in every way imaginable.
It is just really poor programming, nothing more. And it's everywhere. If `find source >/dev/null` takes 6 seconds, there is no reason for Gradle to take 2 minutes on a rebuild. If the dev is used to that, why would they even think about patch optimisation?
Windows game devs just traditionally didn't treat things with such granularity as you might find in a *nix environment where every little thing is a file. Content is then managed as larger blobs and you have a database of offsets or mappings.
It is very unlikely that it was "continuous" compression anyway. Continuous (solid) archives give up the random-file-access property, and games require exactly that for assets. You can’t decompress a few GB on average just to fetch a sprite of a cat.
The reason games (and software in general) do full downloads instead of binary patches is purely overdefensive and/or stupid. Store software could just check checksums after a patch and re-download only if they fail.
Speaking from experience, AAA games have quite a bit of architecture behind them that can date back a decade or two. So you end up with some tradeoffs. The code may be well-tuned, resource efficient, and mostly crash proof, but some elements can be a bit dated relative to the size and scale of modern assets.
Ahh, thank you for the explanation. That's an awesome tool!
I actually could really see using it, now that I understand what it does.
I've worked on some firmware projects where we did OTA updates and were guilty of shipping the entire binary. Luckily, even the entire binary was rather small, but still it would have been very cool to be able to create a diff and ship only the diff!
The really cool TL;DR here is that Courgette "disassembles" the binary before diffing, basically turning internal references into symbolic references. This way, adding an extra instruction to a function won't affect all of the relative addresses in the surrounding code.
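A toy illustration of why that helps (not Courgette's actual representation, just the raw-offset vs. symbolic-label contrast):

    import difflib

    def edits(a, b):
        sm = difflib.SequenceMatcher(a=a, b=b)
        return sum(1 for tag, *_ in sm.get_opcodes() if tag != "equal")

    # Raw encoding: the jump target is a baked-in relative offset, so inserting
    # one instruction also forces the offset itself to be rewritten.
    old_raw = ["nop", "jmp +3", "nop", "nop", "ret"]
    new_raw = ["nop", "jmp +4", "nop", "nop", "nop", "ret"]

    # Symbolic encoding: the jump refers to a label, so only the insertion shows up.
    old_sym = ["nop", "jmp L_ret", "nop", "nop", "L_ret: ret"]
    new_sym = ["nop", "jmp L_ret", "nop", "nop", "nop", "L_ret: ret"]

    print(edits(old_raw, new_raw), edits(old_sym, new_sym))  # 2 edits vs. 1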
If you have a game where you have large packs of files (like a 2GB textures.pak), and in a patch you want to add 2 small things, you can just ship the difference between the old and new pack file, and skip transferring the rest.
They do this, but not as a binary diff. The new files are just shipped as an override of the old files, loaded after them. It's substantially faster than applying a binary diff, and file sizes really aren't a concern for the most part.
The game industry really hasn't used binary diffs and patching since the 90s when they used RTPatch.
> What are some real life use cases for it? When does a developer need such a tool?
Incrementally updating binary files, e.g. assets or executables, instead of having to re-send the entire thing on every update: think games, browsers, package managers, …
The standard diff/patch utilities are text-based and even when they do extend to binary data their algorithms and heuristics tend to be biased towards textual contents.
Bsdiff was built specifically with an eye towards executables.
Years ago, I interviewed a candidate for a role on my team. As usual, one of the ways I break the ice with candidates is to get them to talk "war stories".
The team he'd worked on had produced a tool that was only ever intended to be used by the team to solve a particular problem they had. It contained proprietary code.
Unknown to the team, word had spread about the tool, and others had started to use it, including solutions architects, who started shipping it to customers, who absolutely loved it.
That'd be fine except one of the core libraries it used was GPLv3 licensed, and there was non-open source proprietary code used in the tool.
The nightmare scenario he found himself in was having to rapidly re-architect the tool around a non-GPLv3-licensed library, without breaking any functionality, all the while having regular sync-up meetings with a furious CEO and Legal department (who, to be clear, were mad about the situation, not at this particular developer or his team, who weren't to blame).
I worked in support and came up with a quick script to check for a very specific issue. It was super simple and really just applied to one customer to find one bug they encountered. It really didn't do much more than look through a bunch of counters and a bunch of if statements ...
Next thing I knew someone had copied it and started running as a rule on every data set / customer they could get their hands on, and of course it was false positive city.
Finally, after lots of emails where I would just type "Don't use that script, it doesn't work," some engineer wandered up the stairs to support to talk to me.
They were fielding escalation after escalation for these false positives. Support would run the script, see some flags, and turn their brain off and escalate. Management was so scared of this bug / issue that they would do the same.
So he tells me to use a trick he used with a similar situation.
I announced a new and improved script and management militantly demanded everyone use it, and that all escalation using the 'outdated ' script would be rejected.
The new script just identified if it was the right customer for that script and set a bit if it wasn't. If that bit was set the engineer knew immediately they could ignore the output and would say that their analysis didn't find the problem in question and advised some basic troubleshooting next steps (copy and pasted mostly).
Support soon realized that just running that script wasn't getting them much more than a few minutes of breathing room away from the case. Management realized this too, saw all these next steps coming back, and the focus switched to 'hey, we should do these next steps all the time too'.
That was also one of the ways I started to understand how the engineering team worked and really helped start a good relationship with them.
> having to rapidly re-architect the tool around a non-GPLv3 licensed library
... or just go with it and have it be open source? The old version is already open and free for anyone to request the code of. No rush at that point, you can withhold updates for a little while while you rearchitect this or take the situation as it is and have the next few bugfix releases also fall under GPL until you get around to replacing the core component (iff one insists that the future additions must absolutely be proprietary).
Quickly removing the code doesn't change the previously released versions' license.
The license of the previous code didn't magically become the GPL altogether; instead, it by default became un-distributable. They were required to (a) stop distribution of the existing code, since it at best had no clear license, and (b) if they wanted, remedy the license going forward by clearly making it GPL or rewriting the dependency. Or even reach out to the library author and ask for an LGPL or other alternative - there is sometimes (often?) some flexibility there.
The built-in conflict resolution in the GPL is no-distribution.
Depends who had rights to the proprietary code. If it was someone else's (for example, if the internal version had used both GPLed code and some proprietary third-party library), open sourcing the whole thing just might not be possible.
> Quickly removing the code doesn't change the previously released versions' license.
But would $BIG_CORP publish source on request for a proprietary product just because they built one version with a GPL library by mistake and later fixed it? Has this been successful, ever?
Depends on the composition of the CORP. If it is composed of developers with strong respect for the OSS or FS community, they might not stand for non-disclosure. OTOH, plenty of orgs are made up mainly of folks without a strong opinion either way who would just follow executive direction.
How is that? The company added code to a GPL project; that means it is a derivative work and also comes with software freedoms, or at least that's how the story reads to me, since there is no mention of other claims or parties in the mix. That means the company owns the copyright to the added code and is free to comply with the contract (license).
> ... since there is no mention of other claims or parties to the mix [like company A]. That means the company [B] owns the copyright to the added code and is free to comply with the contract (license).
I'm also not sure whether the confidentiality clause would weigh heavier than a 'must provide source on request' clause, perhaps it could be resolved by not distributing the part that's covered by the confidentiality clause since another standalone library is clearly not a derivative work of the GPL-licensed library. Then only B has to distribute what they made for everyone's benefit.
> one of the core libraries it used was GPLv3 licensed
it's still untested in the real world whether GPLv3 "taints" derivative work that broadly.
Would they have had to open source the entire solution, or just the changes made to the library? Nobody knows, and there's a lot of FUD to promote BSD licenses instead of sitting down and properly defining the limits in a practical way.
My stories are very minor. I did my PhD in a reasonably well regarded mechanical engineering lab. My area is experimental fluid mechanics. I ended up writing a lot of Matlab code while there and even worked with a spin out company from the lab in the biotech sector for a while.
I'd get a lot of other students coming to me for coding help. Most just wanted me to do their job for them and I was too naive to say no. One wanted to count cells in a microfluidic device using image processing. I sat down with them for a couple of hours and walked them through a few methods they could look into to get started collecting all the examples in a script. Basic stuff so they wouldn't feel overwhelmed. A few months later I see he published my simple introduction as a paper with zero modifications. He had the good grace to at least thank me in the acknowledgments.
Several years later, while working at a $BIG_TECH lab, we interviewed a candidate from my old lab. They presented their work and had performed some data analysis of thermal camera images. It turns out they were using a script I'd written there, and it was still actively used to work with the thermal camera. Nobody ever modified or improved the code — many engineering students are terrified of code. I was annoyed because there was no interest in further developing the code; I don't think they even read it or understood it.
While at the spinout company I developed a software tool for the PCR optofluidics platform that was being developed. It was considerably faster and more robust than the hacked-together script they were using before, and it had a user-friendly UI that I built with feedback from the biologists on the team. A few years later the founder and one of their new students published a paper documenting their amazing tool without any reference or acknowledgment whatsoever. That one pissed me off.
There is a lot of ignorance around code authorship and respect for the developer in physical sciences research. Like I said, many are terrified by code but don't value the time and expertise it requires; once they have it they no longer think about its maintenance or acknowledging the author.
> I'd get a lot of other students coming to me for coding help. Most just wanted me to do their job for them and I was too naive to say no. One wanted to count cells in a microfluidic device using image processing. I sat down with them for a couple of hours and walked them through a few methods they could look into to get started collecting all the examples in a script. Basic stuff so they wouldn't feel overwhelmed. A few months later I see he published my simple introduction as a paper with zero modifications. He had the good grace to at least thank me in the acknowledgments.
That can't be the whole story, surely? You verbally suggested a couple of possibilities for what might work and wrote down a couple of lines of code, and then I imagine the student tried out all of those possibilities and reported what did and didn't work? I mean, what academic journal would want to publish half-working examples with unstudied properties?
Mind you, I agree about your broader point that (especially in academia) a lot of people don't really understand and respect code authorship.
IMO, academia is not better or worse -- often they do apply their citation culture to code.
The overall problem is that people tend to think only in terms of first order consequences. A little copying here and there might have minimal financial or reputational risk. But the second order consequences of that becoming the example and the norm for the junior ranks and the next generation causes larger organizational risk. So got to consider the bigger picture and nip the ethical lapses immediately.
I’ve checked the paper again and there is also some basic CFD analysis performed by a third author.
He had some sample images he had taken and I used them to demonstrate the basic code. The figures in the paper are the ones I generated in my sample code.
I had a friend who worked at Compaq on tools that did desktop imaging. There was one tool he used that showed a big splash screen naming the author when it started. Everyone was in awe of that guy, so he used the cachet of people knowing his name to move on to bigger and better roles. My friend decided to see if he could replicate that success, and when he wrote the next version of the tool he too had a big splash screen with his name up front. Sure enough it worked, and (for a certain subset of "everyone") everyone at Compaq knew his name, and he too was able to move on to what he considered a better role.
There were a couple of downsides; he got calls for years after asking for help with the tool, and one boss seemed envious which led to other issues.
I think it is very common for code to outlast its authors in academia for non-CS/EE fields, where programming is seen more as a means to an end rather than its own pursuit. For example, my wife in her Psych PhD inherited a giant hairball of Matlab code for interfacing with an eye tracker and scripting various experiments. She made her own modifications, and I helped too. The last we heard, the code is still being used. There are likely dozens of copies of it under various names, with sections commented out or added to the end.
I'm sorry to hear about you not getting credit; that's inexcusable, in the same way that not crediting the researchers who did the work in a paper or book (I have heard this story too often) is inexcusable.
Just remembered another recent one. A couple of years after I had left, my former employer closed the regional site. I saw the writing on the wall, I guess. A few colleagues from there have set up a new consultancy company; they’re going to build demos as a service, apparently. Several of the showcases on their website are my work.
A long time ago (early 2000s) I wrote a handful of tools to ease physical to virtual (P2V) migration for Windows Server. I was a systems administrator in the UK working for a large American firm.
I wrote the tools in my own time and unlike in other countries my employer had no ownership of them, just to clear that up at the start. I did use these tools at work but never developed them on company time or resources. They were released as open source.
Fast forward 6 months, and we had a meeting with a virtualisation consultancy trying to sell us some tools to assist in a wider P2V programme. We had a sales guy and a tech guy visit to show us their stuff. After half an hour of them talking up all they could offer, the tech guy fired up a tool and I instantly recognised it as my tool, but with some rebranding.
My manager looked at me slightly confused as he recognised it too. I let them continue for a few minutes to properly confirm my suspicions then mentioned that this tool was in fact a tool we already use. They were very confused until I loaded up my version on a remote system to show them.
Needless to say, the rest of the presentation was extremely awkward. I believe these two gentlemen genuinely thought these were their own tools, developed in-house. It turned out several of "their" tools were in fact rebranded versions of mine.
I would love to say there was some kind of exciting conclusion but in reality all that happened was they were clearly spooked by this as their/my tools were removed and never included again in their P2V toolkit.
I suspect the moment they left the meeting with us they called to report what happened and rather than risk me following up (not sure how I could do that to be perfectly honest, it was FOSS after all just not used the proper way) they decided it would be safer to just pull these non-critical tools. They were just "nice to haves" anyway.
My boss excitedly shared the experience with the team and we had a good laugh about it and about how the sales guy went from Mr Confident to stammering and stressed in a matter of seconds.
Recognition - yes, but money - why? It was open source.
This meeting does not sound at all awkward to me if I was the sales person. To me this sounds like how lots of companies work with open source to make a business.
They take some open source tools (maybe they built them themselves, maybe not), package them, and maybe sell some cloud service around them or simply just support.
If I was the sales person I would be delighted to meet the person who wrote a part of the package.
The only reason to be embarrassed is if it did not contain the correct attribution and recognition.
Edit: the fact that it happened in the early 2000s could have made it embarrassing, though. Many things have changed since then...
Yes, the tools were all open source. This was in 2004, and I just zipped up a compiled exe, the src directory, and a license.txt (GPL2) and put it on my website. Similar to NirSoft, which is what inspired me.
The reason behind it being awkward was for half an hour the sales guy had talked up how they were the only company developing tools like this, etc, etc. How the 'big players like VMware don't care about these pain points admins have to deal with' or words to that effect (which was true and why I made the tools in the first place).
Then the moment he finished the sales pitch and I see these 'one of a kind tools' they have been developing I respond with "That's my tool, see..."
I believe the sales guy, at least, believed all the tools he was talking about were made in-house. Open source wasn't a widely understood thing back then, with Microsoft talking about Linux and open source being a "cancer" and such. It really knocked him off course, as I guess he had never been in that situation before (I doubt many have?).
As for following up with the company: I did nothing. I was young, and while I knew what they did was wrong (they literally removed my name, the link to my website, etc. and put in their company name, but changed no functionality of the tools as far as I could see, and certainly no source code was available!), I didn't have the confidence (or desire, tbh) to chase up on some little tools I made to learn and to make my life a little easier at work. The company disappeared (I don't know why) sometime around 2010, iirc.
Honestly, it sounds like they might have been willing to purchase a license from you to redistribute the code. Had you reached out, you may have received a few bucks and proper attribution for your work (and your work would have reached a broader audience).
Possibly. Of course they could have contacted me first rather than modifying my program to pass it off as theirs.
I mean it was GPL2 so they could have just used it as is.
Of course this was the early 2000s where many people saw "open source" and felt like it meant they could just do whatever they want. I bet they never entertained the idea I would ever find out let alone be sitting in a sales pitch :)
> I've learned ... instead to simply say "I have a lot of experience with that technology" and leave it at that.
Shows Brendan's maturity.
I am not sure what should be the appropriate reaction or corrective measure in these situations. We should talk more about handling these unfair situations.
Someone else can become more successful building on top of one's open source project. On a resume, a top contributor and a minor contributor to an open source project might carry the same weight depending on how it's presented, making the situation unfair for a person dedicatedly working on a single project (quality) versus a minor contributor to multiple projects (quantity).
But deleting names and credits is wrong. An acknowledgement from the benefitting person (if not the recognition/reward) has a far more positive impact on a career than justifying to others that your work was stolen.
It was a bit strange to read some of the initial negative comments. I see Brendan being a sport. I would argue that reading the story as a report against unknown persons at Sun makes more sense. I don't see much sense in blaming the victim. And, in my opinion, the VIP had a good run, but he isn't the bad guy here.
Thanks. There was a time when many observability products were adding latency heat maps, and at one conference expo floor there were three companies with latency heat maps on their screen at the same time, pitching them as a flagship feature. If I walked near them they'd start trying to explain them to me, and I never figured out an appropriate response. If I said "hey, great to see you added them, I invented these back at Sun" I'd get funny looks.
I think it's a small world, and everything is software, so the chance you'll bump into someone who wrote software you are using I think is pretty high. I was once trying to get my head around Andi Kleen's pmu-tools, and I had the github repo open in my browser on my laptop I was carrying, when the guy sitting next to me on a bus says he's Andi Kleen. (Ok, it was a bus taking Linux conference attendees to an event, not a random bus, but I still found it remarkable timing -- I was studying pmu-tools at that exact time!)
Still, it must be quite rewarding to know that everyone, no matter how big, is using your tools. Before I knew anything about open source, I was somewhat surprised to see that even the giant that is Apple had open source licenses on their iPod. I assumed that Apple had enough resources to develop all their own software, but no, they, just like everyone else, pick off-the-shelf software.
Thank you for sharing some of what you've learned with everyone in everything that you've published. I've been reading the latest edition of your systems performance book the past few weeks and it is amazing. Your work is pretty awe-inspiring.
> If I said "hey, great to see you added them, I invented these back at Sun" I'd get funny looks.
I don't understand. What kind of funny looks were they? Disbelief? Distrust? Fear of your mental health? Realization of having been lied to by their bosses (oops it wasn't really an internal tool)?
Also, what was the impact of those funny looks? How did they make you feel? Were there any longer-term consequences of telling them you wrote the thing?
Disbelief and suspicion. And fear of my mental health I guess: What's wrong with this person?
Maybe I just don't look or dress or sound like what one would expect. But there's context here too: at the time, these things were flagship features and on the booth monitors, and the booth staff were explaining the virtues of these features to everyone they met. They were making a big deal of it at the time, so maybe that made it even more unbelievable that the inventor would wander by at that moment.
Now imagine what would happen if companies had a thanks page along with the other boilerplate pages (contact us, about us) on their website. If you're making millions from a thing, thank the original person for that thing. (I put thanks pages at the end of my slide decks, it's not hard.) These interactions would go a lot better -- "my name is on your company website" -- and could lead to fruitful discussions and collaboration instead of weird looks.
Years back I was at a deep learning conference and was reading Andrej Karpathy's blog during one of the talks. Demis Hassabis had come in slightly late and sat in the last free seat, which happened to be next to me.
He leaned over, asked if I liked the blog, and (slightly proudly, if I remember correctly) mentioned that DeepMind had hired Andrej for an internship starting soon.
> I am not sure what should be the appropriate reaction or corrective measure in these situations. We should talk more about handling these unfair situations.
Start recording; have them, a big multinational with a massive legal department, admit to violating and stripping a license from source code. Then sue them. They should know better, and they're making billions off of other people's work. That in itself is fair enough, if the license permits it, but removing the license is crossing the line.
Oh it needs to be redressed and some knuckles soundly rapped, maybe someone even fired depending on the situation, but suing is a last, last, last resort. "WARNING: Do not feed the Lawyers".
Most jurisdictions in the US are one-party-consent. I think the tech crowd tends to have a skewed perception of recording consent rules because California happens to be one of the relatively few two-party-consent states, but it's the exception rather than the rule.
Laws around recording typically also cover cases where an outside person, who isn't a party in the conversation, is recording. The idea is that there are three possibilities: all parties in the conversation consent to recording, one of the parties consents (almost certainly the person who wants the recording), and none of the parties consent (ie, someone is spying on the conversation). One-party consent is legal in a variety of countries and regions of countries, while zero-party consent is illegal pretty much everywhere I know.
> Some places only require one party (you the recorder) to consent.
That's what you said, and it's not true: without consent from at least one party being recorded, or being in a situation where recording is expected (TV etc.), it's illegal pretty much everywhere.
And I wrote 'most countries', which is a hint that there are some others. There is even a country where singing in the shower is forbidden; in MOST others it's not.
Sounds a bit more complicated than what you think it is:
>But the reality is that it is normally against the law to record a phone call without the other person’s consent.
>In fact, ‘covertly’ (secretly) using a listening device such as a mobile phone or digital recorder and publishing or otherwise distributing that material can amount to a criminal offence.
Recording private conversations:
>The laws only apply to ‘private conversations’, which is one where the parties may reasonably assume that they don’t want to be overheard by others.
>One of the exceptions to the prohibition against recording and/or publishing or distributing records of private conversations is where police officers have obtained what’s known as a ‘surveillance device warrant’ – also known as a ‘wire tap’ – which allows for the recorded material to be used for investigations and tendered in court provided, of course, that the material is relevant to the proceedings at hand.
Between jurisdictions:
>It is legal in all jurisdictions to record a phone call if ALL PARTIES to the phone call consent.
Yep. Here's a list of one-party recording consent states from [1]:
Alabama, Alaska, Arizona, Arkansas, Colorado, District of Columbia, Georgia, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Michigan, Minnesota, Mississippi, Missouri, Montana*, Nebraska, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Virginia, West Virginia, Wisconsin, Wyoming
Consent by one party of the conversation. If you initiate a recording of a conversation, you yourself have plainly consented to recording it.
Note that I believe (and, IANAL) that if at least one party to the conversation resides in a "two-party consent" jurisdiction, you will need the consent from all such parties.
I didn't read maturity from this. I read timidness, conflict aversion, a lack of standing up for oneself. Someone touring the world making hundreds of thousands of dollars, demoing your own software and claiming it's their own? Violating your license, the foundation of OSS? I would have spelled it out as clearly as possible, including the legal implications, spelled out my assumption that this person was claiming they had worked hard on these tools when instead they did minimal stealing, and either talked about legal follow-up action or financial follow-up action. This is a time when anger, frustration, and being stern are justified.
>I am not sure what should be the appropriate reaction or corrective measure in these situations.
When it happens internally, i.e. I catch someone doing it, then either it is a "first time offense" of the clueless, or it is the act of an unethical person who will be unrepentant. For the clueless, it might be that they undervalue themselves, and therefore undervalue assigning credit. The unethical person, however, understands what they are doing and is simply untrustworthy. They will also likely have a lawyer, because they've done this before. So it can be pricey to get rid of them, but get rid of them you must. They are poison to your team.
A colleague of mine was attending a local tech conference just before Covid hit. I believe it was aimed at newcomers and various companies tried to show off exciting tech that interns/new coders could potentially work on.
She couldn't believe what she saw. A major govt.-backed logging company (which does do a lot of dev work themselves) was showing off one of our projects as theirs! We were using depth cameras to estimate the volume of wood loaded on a truck as it drives through a gate. They even used screenshots that I had made!
Now, they were involved in the project. But they were basically clients of our client. They provided us with a place to test the system as their trucks ran through. They didn't own the software, let alone do any work on it. Why they would present it as something interns could potentially work on is beyond me.
I would imagine they would say to an intern once they joined, "yeah you won't work on THAT project, but here's some other thing for you to do". It's like the estate agency putting a listing of a nice place on a website and then when it comes to viewing the place they say "well that one was just taken but let me show you something similar".
Maybe they wanted to show something that the interns could try to replicate. I mean, if everything your in-house dev team is working on is CRUD applications built on 90s tech, that's not very exciting.
I had a similar, smaller scale of that happen, in a sort of reverse direction a few years ago. My director dropped a resume on my desk for someone coming from a company I had worked at ten years prior, thinking I might have met them (it was a small organization). I didn't recognize the name, but skimmed through the resume quickly, and their primary claim on their work history was something I invented just six months before leaving the organization. I couldn't believe it. What were the odds of that resume ending on my desk? Basically zero, but what a huge mistake. And they didn't claim they had maintained and extended it, they claimed they had invented it! He didn't get called in for an interview.
Having conducted technical interviews at FAANG companies for over 10 years, I've gotten to the point where I never believe anything anyone claims on their resumes unless I can independently verify them. I also make it a point to ask probing questions about where they got the idea of "inventing" what they did and what alternatives to "making a brand new thing" they considered at the time.
However, one of my favorite questions to ask is, "What don't you like about what you invented?" A true creator of something is always acutely aware of its flaws. They're unsettled about its shortcomings and want to do better. In describing the flaws they demonstrate deep insight into the problem space, and they explain expertly how things don't go perfectly in certain corner cases or how the code could have been better organized.
The poser will almost always struggle to come up with criticisms of what they "invented" and try to pass it off as a spectacular feat of engineering brilliance.
Your favourite question has some cultural gaps, as in many countries people downplay weaknesses and flaws in interview settings. It's why a lot of weakness questions are often ineffective. Unless you are acutely aware of when a person is doing this BECAUSE it's an interview, you're going to get some answers that might lead you to reject good candidates.
I've made a lot of foul-ups in the course of my career. When people ask me about mistakes I've made and what I've learned from them during interviews it's generally an easy question to answer because I have this overstuffed mental file folder of examples.
I can't speak for the US but, in the UK, don't misrepresent your work in a job interview. I can't say you'll never miss out on an offer by being honest, though I don't believe I ever have, but would you really want to work for people who'd prefer you to lie or misrepresent mistakes you've made than be open and truthful about them?
To me that's something of a red flag: it's at least indicative of a culture where mistakes are likely to be covered up, leading to a lack of reflection, learning and improvement... and also quite possibly storing up bigger problems for later.
(FWIW, I started as a developer and am now the CTO of a mid-sized multinational market research and insight company. This is nowhere near as grand as it might sound, and isn't meant to be a boast, but hopefully illustrates that being honest doesn't appear to have done my career any long-term harm. Some things that have, if not derailed my career, caused me to take some fairly substantial detours: (i) taking things too personally, (ii) placing too much weight on others' assessments of me, (iii) and I say this as somebody who is wary of people who change jobs too often, but... staying in a job way past the point where there was anything else I could learn/give/progress. I am, of course, but a single data point.)
> Honesty is something that never goes down well in an interview when it comes to being critical. (At least in the US)
I wouldn't paint all tech companies in the U.S. with such broad strokes. In the interview loop for the job I have now in the U.S., every single interviewer asked me a question about "how things could have gone better." I talked about mistakes I made, lessons learned, and how I could do better next time.
I am told the feedback from that loop was across-the-board "outstanding."
Amazon thinks otherwise. This is their Earn Trust leadership principle:
"Leaders listen attentively, speak candidly, and treat others respectfully. They are vocally self-critical, even when doing so is awkward or embarrassing. Leaders do not believe their or their team’s body odor smells of perfume. They benchmark themselves and their teams against the best." [1]
There is a difference between being self-critical once employed (where I agree it's a useful practice) and being self-critical during an interview (which is often viewed as the process of selling yourself in order to get a job).
Amazon's "principles" are also just that: corporate spiel. 100% of Amazon employees, including Jeff Bezos, would fail if they were actually tested against their own dogma.
I thought this was going in the direction that I experienced. I had a recruiter try to recruit me for the position that I was already in (we were expanding the team, not replacing me).
Great article. Gives me flashbacks to the time someone sent me a link to a newly-released version of Minecraft, and it turned out to be using my own voxel engine. :D
(No credit of course, and the marketing copy around it made it sound like it was all their own code. Welcome to open source I guess!)
Annoyingly there's not much to tell. It was a minimal web version, released as a one-off marketing thing and then never updated, and Mojang didn't reply when I tried to get in touch (via contacts at Microsoft). So I never really found out anything about it.
The game is still live though: https://classic.minecraft.net - Originally it supported multiplayer, but that stopped working the day after the game launched.
> Minecraft Classic - official game from Mojang (I'm as surprised as you are)
lol. Microsoft should just release all the source for this since Minecraft isn't exactly state of the art these days. The brand is nearly all of the value.
Almost been there. I was once hired and given some sources to work on by a company, and those were the same exact sources I wrote at another company years before, although my name and all recognizable comments were stripped, save for a few almost invisible traces I left like my initials paired with reserved words to make them appear like directives, pragmas, etc.
The guy who had given me the "new" sources was without any doubt responsible, since he had been PM at the old company with access to everything, but he almost certainly had no right to use them outside the old company. It took me like 10 seconds to recognize them, as filenames, function names, network usage, variables and structures were all the same, so it went like: "Hah, no problems, I recall this very well, in fact there should be my name somewhere... oh rats, it got deleted somehow, no probs however, consider it done" :=).
Consider GP's position: they wrote some code for their old employer, and therefore had no copyright over it (I hate capitalism). Point is, they had very little stake, beyond personal pride and ethics.
So now they see that their code has been stolen from the previous company, and they're being asked to work on it. What will they do? Refuse to use the code and risk being fired for being too slow? Report the crime and risk being accused of attempting to cover their own tracks? Talk to HR and be hated by their closest bosses?
I see no way to really win here. At least, not reliably.
Yes, but that code wasn't mine either, as I wrote it at another company as their product. When I added my "hidden" traces I did that with no intention to claim ownership, but rather to leave a signature just in case. Technically the first company should have sued him, but I can't know the details for sure; the old project was dead by then, and he could have purchased the sources legally, although I doubt that.
I honestly don't know; I certainly couldn't foresee that some years later I would be called by others to work on the same sources. The work environment was really good and I wished the project had lasted longer, but alas, all good things must come to an end. Anyway, I had no explicit reason to leave traces, other than maybe some odd reasoning after listening to a ("toxic", as some of us realized later) colleague talking about bad experiences in other places and signing his sources in a similar way, but to me it was mainly a "Kilroy was here" thing.
I personally wouldn't work on code that I knew was stolen from a previous employer, especially if it's code I worked on directly. Ethics aside, if you ever get caught you are personally going to be subjected to entirely too many questions, and it's the kind of thing that could follow you around for the rest of your career even if you really did nothing wrong.
Thankfully the "DTrace expert" didn't turn out to be Bryan Cantrill. He's a big fan of Scott McNealy's engineering principles for Sun Microsystems, which includes "Don't Cheat", and the behaviour in this story very much seems not in the spirit of that principle. I wonder if this bit of 'cheating' ended up getting someone at Sun in trouble.
It definitely (obviously?) wasn't me. I had introduced Brendan to DTrace in 2004 (over e-mail) when I discovered his earlier psio work[0] (which was based on TNF, a tracing facility that predated DTrace and was better than nothing -- but not by much). I knew that he was going to be very excited to learn about DTrace, and it wasn't long before he was delighting us with some really creative scripts. (His shellsnoop.d[1], in particular, really got people's attention as to what one could do with DTrace!)
While he and I had corresponded a bunch, ironically the first time I met Brendan was in 2005, it was in Sydney, and I was on a tour of Australia talking about Solaris 10 -- but it followed this incident by several months (if I recall correctly). I was excited to meet Brendan, and he was excited to meet me -- especially so after his poor experience several months prior!
Given that Brendan and Bryan worked together closely and seemed to be very fond of each other, and the article mentions that "the VIP may not have known", I think it's safe to assume it was not Bryan.
Because if it had been Bryan, Brendan would have been more sure.
To the extent that the two of them have disagreed about things, they don't seem to have any issue saying so publicly, referring to the other by name: https://news.ycombinator.com/item?id=16382456
An analogous experience: I once chaired an American Bar Association committee that developed and published "fair and balanced" model software license provisions, extensively annotated. Twice in the following year, in working on client deals, lawyers for The Other Side proposed draft license agreements that were indisputably copied wholesale from the model provisions — but with certain fair-play features omitted.
I had a vaguely similar experience. I was working on virtual reality race car simulators that used sophisticated custom designed and built motion platforms. I had several responsibilities, the first was the software that controlled the motion platform. When it was done, and I proceeded to the sound effects system, my boss turned over my motion platform code to his good friend who he had brought on to the project with no interviews. A month later in a status meeting, the new guy said that he had to replace about 20% of my code to fix defects. I was curious, looked, and saw that there were zero code changes, but the comment blocks were changed replacing my name with his. Kind of sad for him. He literally had done no work.
There's a long industry history of stealing code, but I'm surprised to hear this story of Sun allegedly doing it.
My impression is that the reason for the stealing usually makes sense. For example, a key library that's hard to write is just copied into the source tree, ignoring licensing. Or an appliance developer didn't want to deal with licensing for Linux or BusyBox. Or an individual developer in over their head quietly copies code.
The time I heard an explainable incident happened with my code, was in mid/late-'90s. An acquaintance, who'd offered to be one of the testers for an unreleased Java desktop application I wrote, then reportedly ran it through a decompiler, and passed it off as his own code, in a demo to investors. He later acknowledged doing this, and said he'd send me a Sun (ha) workstation as compensation. I declined.
Then there are incidents for which the reason isn't obvious, like the one from the article. I speculate that sometimes the explanation might simply be that the perpetrator wasn't quite right in the head at the time, like in some famous cases of journalism fabrication.
An inexplicable one involving my code was when an open source developer took a substantial and novel package that I wrote, stripped out my name and license notices, including from the main file, and posted the package with themself identified as the author. There were also a couple of other incidents with that person that made it seem like they hadn't yet learned how to play well with others, in engineering or open source. I asked a mutual acquaintance, in confidence, what was going on with that person. The acquaintance checked, and was also baffled. In that case, I suppose that maybe the perpetrator was going through a difficult time, and not thinking clearly. Or maybe it was a combination of unlikely accidents that looked worse than the intent was (which happens).
In the article's story of the Sun incident, I'm a little surprised that (speculating) an engineer could do this despite all the other people in engineering who might be in a position to notice something funny going on. And Sun had been the dot in some dotcoms by 2005, so presumably they had some strong engineering processes around what goes into product.
Maybe the demo was something put together by a systems engineer, working as part of a small marketing/sales team, rather than under an engineering organization, so a lot fewer engineers were aware of it?
You are very compassionate towards people just literally stealing code. My experience is that a lot of folks just don't care about licenses: they steal and rebrand by any means necessary, I guess to accumulate reputation.
I wrote a few silly scripts in my lifetime, all small-time stuff (so trivial it never got me hired as a dev anywhere). No matter whether I marked things as GPL or BSD, I often found them copied on GitHub with my copyright notices stripped. In the case of the BSD license, that's literally the only thing you can't do!
In some cases it was utterly blatant: cloning from my own GitHub repo and then replacing authorship notices in the very first commit. I mean, come on guys, at least try to be smart; you can add your name just fine... When I politely asked to, y'know, respect license terms and reinstate the notices, some people just took down their repo rather than doing that. I bet they then re-uploaded it somewhere else.
I've seen blatant cases as well. There was a github repo where someone was sharing their notes on performance engineering, and there were tickets from people thanking them for their amazing work. Except these notes were literally copy-n-pasted pages from my systems performance book. It's weird to find all these messages from people thanking someone else for the work I did.
When I contacted them to ask how they thought it was ok to republish my work like this (I was giving them a chance to explain), they just took it all down.
I was in shock, and a bit furious at being ripped off, but didn't feel there was a lot I could do about it. I edited this bit out (I was trimming the post):
"You might wonder why I didn't talk about this first case publicly at the time. I'd already informed them of the problem privately and how to fix it, so there wasn't more to say. Also, Sun was the number one employer in town, and to be publicly critical could be career ending."
I don't know what happened to the US developer. But I didn't think it made sense to burn bridges with Sun about it. I was more worried about giving people a bad experience with DTrace by running older versions of my software.
> Is making and using multiple copies within one organization or company “distribution”?
> No, in that case the organization is just making the copies for itself. As a consequence, a company or other organization can develop a modified version and install that version through its own facilities, without giving the staff permission to release that modified version to outsiders.
> However, when the organization transfers copies to other organizations or individuals, that is distribution. In particular, providing copies to contractors for use off-site is distribution.
Please consult sources before making a strong statement.
> No, GPL FAQ specifically addresses internal distribution
Which couldn't be more irrelevant, because
a) This software doesn't appear to be licensed under the GPL.
b) The question here is not whether or not it's distribution, it's if it's copyright infringement. Making copies of copyrighted material is illegal (up to fair use and licenses) regardless of whether or not you distribute them.
c) The GPL also requires that you maintain the license header when you are making copies (GPLv2 term 1: "You may copy [...] the Program's source code as you receive it [...] provided that you [...] keep intact all the notices that refer to this License [...]"; other terms and conditions do apply, hence the ellipses).
d) The license (and the law, and court rulings) is the source material, not GNU's non-legally-binding FAQ.
TFA does say the code was under GPLv2 and CDDL. I would say it's entirely within your rights to distribute it in accordance with what the GPL FAQ considers okay, which includes internal distribution however you want. The license header is required when distributing externally.
Oh weird, I read over that line. That removes one of my bullet points but doesn't change the result. I agree you can distribute it internally, i.e. you can pass around the hard drive with the single unmodified copy. You can't make copies of it without the header, regardless of whether you keep it internal.
> Does the GPL require that source code of modified versions be posted to the public?
> The GPL does not require you to release your modified version, or any part of it. You are free to make modifications and use them privately, without ever releasing them. This applies to organizations (including companies), too; an organization can make a modified version and use it internally without ever releasing it outside the organization.
You're not free to make copies without the license. You are free to make copies with the license, because that's what the license granted you permission to do. You aren't free to do so without preserving the license, because you don't have a license to do that and that's copyright infringement.
You aren't required to publicly release derivative works or their source, simply because the license grants you the ability to produce derivative works (provided you keep the copyright information intact) without publicly releasing them. You are required to keep the copyright information intact, because the license does not grant you permission to produce derivative works if you do not.
The requirement to keep the copyright information attached is simply not connected to your choice to distribute it or not. It's a requirement any time you make a copy or make a derivative work, even if you're only making that copy or derivative work for yourself.
That was my understanding. If removing the license is a breach of its terms, so the GPL no longer applies, then it falls back to default copyright law. And no one would argue copying MS Windows 'internally' is fine.
In the sense that previously one computer was running the software at a time, and now 1000 computers are running it because I copied it onto all of their disks/memory banks/CPUs.
I still control all 1000 computers, but I'm doing something I fundamentally cannot do with a single copy.
The question is a bit murky, not because it's unclear I'm making copies, but because some copying is fair use. For example to execute a program I need to copy it from the hard drive to ram, and (pieces of it) from the ram to the CPU. That's generally considered to be legal, despite being copying, approximately because it's necessary to use the program. The exact line here is not well defined.
That’s a pretty interesting thought experiment. How likely is it to burn bridges with a multibillion dollar corp by taking them to court to protect your IP? They might even respect you for it. Most likely the people working there will move on eventually and especially if they are responsible for IP theft that results in a heavy legal bill. Companies sue each other over ip all the time and then still do business with each other. I’m also assuming OP’s bridges with the persons directly responsible are already burnt. So which bridges exactly are being protected?
I don't know much about law, but I wonder what the ballpark out-of-pocket legal costs might look like for an individual going after a large corporation with an IP suit... i.e. how likely is it the corporation could win simply by dragging out proceedings and exhausting your resources / burying you with legal fees?
You don't need to get into legal proceedings, if the facts are clear.
If you can prove it's your invention, and they're distributing it without your permission, they won't want to waste money defending a suit.
You may be a little guy, without enough dosh to launch a suit; but you could sell your interest to $COPYRIGHT_TROLL, then they'd be in trouble. So just ask them nicely to propose a settlement.
It's not like war, or anything; tech-company legal departments are there for just this kind of thing. Unless there are arseholes involved, it can all be friendly and business-like.
You're right, but it kind of renders copyright law another elitist construct. I wonder if there are lawyers or organisations who take up these types of cases to counter the behaviour you have outlined. Would the EFF, for example, find it in conflict with their mission?
Wouldn’t you have made significant bank if you had sued? To the extent that you never had to work for a paycheck ever? Just curious if a lawyer consult might have been useful.
Brendan has "made more bank" since then than he could have made in that one suit or since that suit if he'd filed it. Brendan's a pretty big shot nowadays and is almost certainly very well compensated, but when you demonstrate that you're too happy to sue you tend to get discriminated against, so if he had sued... it might have ended his career.
Sorry, should have said "tech" employer, and this is referring to Sun and its ecosystem. A few years earlier I gave a talk to a local University about the job market and asked the comp sci students to have a showing of hands as to whether they thought they'd work on Windows or Solaris when they graduated. Most hands went for Windows. Then I showed the local job statistics: Sun Solaris was number one.
How fantastically different to my experience. I graduated 2010 in Australia, and we were just entering into the pre-cloud age. A lot of us expected to either move to SV or just work at local consultancies for banks or insurance companies (which I did). 10 years earlier, I would have gone to work at Sun!
I come from Melbourne and know that there were several rather large multinational engineering centers in Melbourne around this time period. Ericsson, NEC and Robert Bosch come to mind.
I thought Solaris was being completely eclipsed by Linux.
Even Windows is surprising, because hardly anything except gaming and limited local clients runs on Windows. Granted, some people choose to develop on Windows.
Recently I was struggling to understand MotionLayout in Android and how to make it do what I wanted for my app. I had searched all over the internet reading blogs and forum posts trying to figure out what I was doing wrong. I ended up watching one particularly helpful presentation video (I think it was from a DroidCon) given by this guy named Jason Pearson. It helped a lot, (watched various parts of the presentation multiple times) but eventually I still got myself stuck again. I ended up dropping a question out on StackOverflow in desperation just to see if anyone might be able to point me in the right direction. Well, fast forward a bit and someone responds to my question with a really great answer that helps me understand what I was missing. Happened to look at the user who wrote the answer and saw his name was Jason Pearson... Then I had to double-check and sure enough, it was the same guy!
So, shout out to Mr. Pearson for a great presentation and a really helpful SO answer!
And I thought it was fun when a colleague who publicly belittled me on a somewhat regular basis for scripting (“we're not programmers!”) triumphantly showed a blog post he’d found to solve a problem he was having.
A blog post on my blog.
In fairness to him, it was still on a domain that didn’t identify me at all, though the “About” page had my name on it.
He then proceeded to escalate up the chain of command to get me in trouble for publishing company secrets on said blog.
I promised to only write it outside of work hours and not to directly copy any code (scripts, one-liners) I wrote at work, and that was the end of that.
There were other less-antagonistic instances of others in this large company finding solutions on my blog, which says something about the value to the company of posting generic stuff like this on publicly-searchable platforms instead of an internal-only SharePoint or Confluence instance, and definitely instead of the PDFs our senior management insisted on for all documentation.
I worked at a company that was interfacing with the AOL Instant Messenger network in the early 2000s. Finally the business people got a deal and AOL made us sign NDAs to get access to the official API docs. We were excited because we were using the docs that one of the open source clients had produced from their reverse engineering efforts.
AOL's docs were the same ones from the open source project, with the GPL message intact.
What does an NDA-encumbered GPL document mean in practice? You're allowed to spread the document, but not tell anyone where you got it from?
I once received a library under a modified Apache 2.0 license, where all the conditions under section 4 (attribution, etc.) were just inverted for a given number of years (effectively an NDA), after which the normal Apache 2.0 would apply. Which worked because it came straight from the original copyright owners, but here I assume AOL weren't.
It's a small world. Many, many years ago, I wrote a little help desk application for a one man ISP. The only requirement? "I have to be able to respond to support requests using WAP". This was, of course, before smart phones were a thing. Typing long replies on a phone with only a numpad was not pleasant. So I built something with a lot of prehashed replies and parts of replies, which could be accessed by entering a couple of digits (with some very rudimentary context awareness).
Fast forward 10 years or so, I'm consulting at a company. When I see one of their support engineers using a tool that looked vaguely familiar. Turns out the company had acquired the one-man ISP and had continued developing the little helpdesk application I'd made for WAP all those years ago.
I was fortunate enough to happen to be live at LISA13 where AFAIK, Brendan gave his first public flamegraphs demo. I'm glad I saw it. I love learning about areas (performance engineering) where I'm weaker. Great demos! I've tried to keep my personal live demos at a high quality too. Can't wait to see what he comes up with next!
Flame graphs completely blew my mind, but most people I talk to about them just don't seem to get it.
I'm a generalist, but I've been thinking lately if performance engineering is something I should be specialising in. I'd love to hear any advice from those in the field.
Specializing in anything is good if you have 5+ years of exp IMO. It's always good to stand out from the crowd of generic developers.
The other plus is, if you frame yourself as a specialist in X, it might be easier to explain not knowing Y. You can't simply know everything, especially if you invest heavily in specializing in other things.
Personally I've been a web performance (i.e. mostly JavaScript/HTML) guy lately and it's been fun. There are a million React devs in the world, but only a few dozen web perf guys with a web presence (Twitter / blogs), and I know almost all of them by name at this point.
Of course it depends what you want to do. Typically it's big corps who look for specialists. Small companies prefer generalists.
My problem with picking a niche is that ones that are easy to learn from a book are too crowded to be considered niches, and the rest can only be learned by doing in real production environments. Now there’s one where I have a foot in, and it’s one that I enjoy.
Yep, exactly, it's difficult to learn in isolation. One must try their luck applying to jobs they're underqualified for. And then seize the opportunities once in :)
Can someone elaborate on why he takes a weird detour in the middle of the post to discourage forks? Is there some particular issue with supporting bpf tooling forks?
As someone who has spent a lot of time in open source, forks are not a problem, they are indicative of a problem.
Don’t cry when people fork and go a different direction, try to figure out why and see if you’re willing to change the project to accommodate them. Dropping chastising “more wood behind fewer arrows” platitudes is pointless when half of the wood wants to break off in a different direction anyway.
Yes, there can be good reasons to fork (especially after making a fair effort to have things fixed), and bad reasons.
But also yes: There are two particular issues with bpf tooling forks. 1) They look deceptively simple, but those that are kprobes-based are really kernel-specific and brittle, and need ongoing maintenance to match the latest changes in the kernel. One ftrace(/kprobe) tool I wrote has already been ported a bunch of times, and I know it doesn't always work and one day I'll go fix it -- but how do I get all the ports updated? No one porting it has noticed it has a problem, and so the same problem is just getting duplicated and duplicated. Which is also issue 2) Unlike lots of other software, when observability tools become broken it may not be obvious at all! Imagine a tool prints a throughput that captures 90% of activity and no longer 100% (because there's now a fast-path taking 10%). So the numbers for some deep kernel activity are now off by 10%. It's hard to spot, and that increases the risk people keep deploying their old broken ports without realizing there's a problem.
> They look deceptively simple, but those that are kprobes-based are really kernel-specific and brittle, and need ongoing maintenance to match the latest changes in the kernel
It seems like there is a missing formal interface here if this is so brittle, no? If it’s hitting a bunch of internal kernel stuff shouldn’t this stuff just live with the kernel itself?
The formal interface is tracepoints. So tracepoints in theory aren't brittle (they are best-effort stable) and don't need so much expert maintenance (which is mostly the case). In theory, someone could port tracepoint-based tools and almost never need maintenance.
But kprobes is basically exposing raw kernel code that the kernel engineers bashed out with no idea that anyone might trace it. And they can change it from one minor release to another. And change it in unobvious ways: add a new codepath somewhere that takes some of the traffic, so gee, it seems like my tool still works but the numbers are a bit lower. Or maybe I measured queue latency and now there are two queues but the tool is only tracing the first one, or now there are no queues so my tool blows up as it can't find the functions to trace (that's actually preferable, since it's obvious that something needs fixing!).
I really don't like using kprobes if it can be avoided (instead use tracepoints, /proc, netlink, etc). But sometimes it's solve the problem with kprobes or not at all.
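To make the contrast concrete, here are two hypothetical bpftrace one-liners (these are illustrative examples, not tools from this thread; the probe names are just examples, and the args-> syntax is from older bpftrace releases, so details may vary by version):

    # Tracepoint: part of the kernel's best-effort stable tracing interface,
    # so this should keep working across kernel versions.
    bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

    # kprobe: instruments an internal kernel function (vfs_open here); its name,
    # arguments, and the code paths that call it can change in any kernel release.
    bpftrace -e 'kprobe:vfs_open { @opens[comm] = count(); }'

The first tends to fail loudly if anything changes (the tracepoint either exists or it doesn't); the second can silently undercount if a new code path stops going through the probed function.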
Now, normally such code-specific, brittle things should indeed live with the code like you say, so normally I'd think about putting the tools in the kernel code. But we don't want to add so much user space to the kernel, and it also opens the door as to whether these should actually be tracepoints instead (which begins long discussions: maintainers don't want to be on the hook to maintain stable tracepoints if they aren't totally needed).
Another scenario where the tools should ship with the code base would be user space applications. E.g., if someone wrote a bunch of low-level tracing tools for the Cassandra database that used uprobes and were code specific, then they would be too niche for bcc, and would probably be best living in the Cassandra code base itself.
Thanks Brendan for creating the bpfcc-tools! I'm using it in magicmake [1], which is a tool to automatically find missing packages when compiling, based on file path accesses.
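I haven't looked at how magicmake implements this, but as a rough sketch of the general idea (hypothetical, not magicmake's code), watching a build for open() calls that fail with ENOENT shows which file paths the compile is missing. The bpftrace script below follows the same enter/exit pattern as the stock opensnoop tool; the args-> syntax is from older bpftrace releases:

    // hypothetical sketch: report file paths that failed to open during a build
    tracepoint:syscalls:sys_enter_openat
    {
        @fname[tid] = args->filename;
    }

    // args->ret == -2 means the open failed with -ENOENT: the path didn't exist
    tracepoint:syscalls:sys_exit_openat
    /@fname[tid] != 0 && args->ret == -2/
    {
        printf("%s missing %s\n", comm, str(@fname[tid]));
    }

    tracepoint:syscalls:sys_exit_openat
    {
        delete(@fname[tid]);
    }

A package search over the reported paths could then suggest which packages provide them.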
> As someone who has spent a lot of time in open source, forks are not a problem, they are indicative of a problem.
Forks because the main project is not responding to a need are not a problem. Forks just because some project or product believes it gives them more control and they don't interface with the original project usefully are a sort of a problem. Those aren't necessarily started because of a community need, but some business need or perceived business need which may have nothing to do with the reality.
I think that's the situation he's asking to avoid here, as he mentions observability products. He's saying please don't fork just to have it under your own repo and for no other reason, when it can be developed jointly.
Why? If someone wants more control their interests already aren’t aligned with the community. Developer resources aren’t just some faucet easily redirected.
It’s worse to have people trying to jam shit into the main project if they don’t actually care about the main project. That’s how you get contributions that are huge hacks and require more work to review and iterate on than is worth it to the community.
If some company forks for an observability product and doesn’t contribute back, they clearly don’t want to. Don’t try to force them.
git is an incredible tool for managing forks and this anti-fork mindset is right out of old school open source culture from the early 2000s.
Demanding people contribute to your project instead of forking just sounds like demands to pay homage more than anything.
> That’s how you get contributions that are huge hacks and require more work to review and iterate on than is worth it to the community.
And yet the author here is saying please work together. Presumably he or the community is okay with picking apart those huge hacks. I'm not sure why you should care what that community is willing to accept, unless you're part of it.
> If some company forks for an observability product and doesn’t contribute back, they clearly don’t want to. Don’t try to force them.
Who's forcing anyone? Did you not notice the "please" he starts that statement with?
> Demanding people contribute to your project instead of forking just sounds like demands to pay homage more than anything.
That's just a straw man, nobody demanded anything, and I'm not sure why you'd even try to insinuate he was when he literally says please.
If I had to paraphrase the part of the post you're critiquing, I would do so as "If you're using these tools, don't feel like you have to keep any stuff you do separate. We'd like to build something better for everyone, so feel free to contribute to the project rather than keeping it separate, and we can all build on each other's work." That's pretty standard open source ideals IMO.
I appreciate the discussion as I want to do a follow on post about this in more detail. It doesn't do the BPF community any good to have people running old broken versions of my tools (just as back in 2005, Sun was selling DTrace by distributing my half-finished socketsnoop.d). Now, some people are going to port anyway regardless of what's good for BPF. But if I can reach developers who may be on the fence about it, I can explain the pros of just building upon the existing tools. To start with, you get updates for free by kernel engineers at Facebook, Netflix, etc. Also, by pushing fixes back you'll find people will try to help you in return for when you need it (some startups, like Cilium, have done so much for BPF that I'd happily help them with anything they need (provided there isn't a clash with my current employment)).
> If I had to paraphrase the part of the post you're critiquing, I would do so as "If you're using these tools, don't feel like you have keep any stuff you do separate. We'd like to build something better for everyone, so feel free to contribute to the project rather than keeping it separate and we can all build on each other's work." That's pretty standard open source ideals IMO.
Sure, that’s just a completely different writing of what’s actually there. The original post is an instruction not to do something. Your interpretation is a much more open “pull requests accepted”.
His reply clarifies that he doesn't want broken shit out there, so your reading is incorrect. He does want to discourage forks, because they are subtle to get correct, and he doesn't want shit out there sullying BPF's reputation.
The criticism is of forks that are done in order to re-brand and re-sell a project, and then never contribute anything back to the original, and likely, never pull improvements from the originally forked project (which can include critical bug fixes) after that.
Serious question, how does that hurt the original project? This has never been a concern in the projects I’ve worked on. Getting mad every time some developer writes code that didn’t go to your project is a good way to be mad all of the time.
> Someone sells you BPF observability that's buggy and dissapointing, and you avoid BPF in the future. So in that case it's hurt the BPF community.
Ok, so someone sells me closed source tooling that uses BPF and it’s buggy and disappointing, so I avoid BPF in the future. This association problem exists regardless of closed source vs a fork vs an old release.
> Funding goes to a project that gives nothing back, instead of funding a project that does. Again, it hurts the community as funding isn't infinite.
This only hurts if it’s a closed/hidden fork. This also presumes that the fork isn’t just stripping out tons of upstream features (I’ve had to do this for clients because “security”).
My view is that forks of open source (GPL in particular) should be encouraged and done in public. Iterating on central projects only is just so slow and stifling.
Maybe your projects have very little churn so lots of people experimenting in parallel doesn’t make sense, but this has been my experience with larger infrastructure projects. Let a thousand flowers bloom.
Obviously it hurts if your project gets bad rep from an old unmaintained fork. If "everyone knows $your_project is broken, don't use it" it will spill back on you.
Somewhat like when PowerShell on Windows had aliases for wget and curl that were just shell aliases for a simple downloader, making some scripts work without those two installed, but also confusing anyone who wanted to do anything beyond "get one single file from a URL".
I worked with a tech from a software vendor who had a link to one of my Server Fault answers in their setup docs. We got to that bit of the setup and I mentioned that I wrote the answer his doc was referencing. I quite enjoyed it.
I never got a chance to see it again. I ended up spending more energy with another Sun project that wanted to switch the CDDL license to the GPL, only because they wanted to avoid going through the legal approval procedure to ship mixed-licensed products. Maybe I'll share that story one day: it was another eye-opening experience.
TL;DR: DTrace uses a minimalistic C dialect called D to write its probes, hence the name.
In order to trace arbitrary code, it has to inject these into the code site that you want inspected.
If you could just inject arbitrary C, you'd get the issue of potentially adding probes which change the behaviour of your code under test to a degree where new bugs/behaviours are introduced or old bugs/behaviours are masked.
DTrace solves this by using a C subset which helps you avoid such unwitting changes, by not including loops or other operations which could change the memory or timing behaviour of the existing system.
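As a rough illustration (a made-up example, not one of the scripts from the article), a complete D program is just probe descriptions, optional predicates, and action statements; there are no loops or user-defined functions to perturb the system being traced:

    #!/usr/sbin/dtrace -s
    /* print every file a shell opens; the predicate takes the place of an if statement */
    syscall::open*:entry
    /execname == "bash"/
    {
        printf("%s opened %s\n", execname, copyinstr(arg0));
    }

DTrace can check that a program like this is safe to load precisely because the language rules out unbounded execution.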
As an American, I found this and the other cultural anecdotes very interesting. We definitely do a lot of obnoxious things, or at least things that can seem obnoxious to other cultures. No arguments there. We should be better. =)
> To an Australian, introductions in the US can sound boastful, but they can also be useful as a quick way to share one's specialties.
But this one puzzles me a bit. Typically, we talk up the people we're introducing - I'm not sure I've ever heard anybody talk themselves up during an introduction!
"Sally, this is Bob. Bob's been doing some really cool stuff with XYZ lately. Sally, I know you have too!"
(At which point Sally and Bob often politely insist that no, they're nothing special at XYZ)
I've always thought of this as gracious and not boastful. I definitely agree it would be obnoxious to talk one's self up!
I recently had an experience kind of like this. I run a website about a kind of niche topic. I got a call from someone at a FAANG one day working in a similar space who wanted to chat about it. I get these calls pretty regularly and agreed, always happy to chat about it. While she was showing me some of what they were working on it started to look very familiar, and I realized that a lot of the ideas had been copied from my site. In some cases they had literally copy/pasted my content. I mentioned this to her and she got uncomfortable and admitted to something like "yeah we took a lot of inspiration from your work." Frankly I don't care that much, I put the info out there to be used and use it they did! But attribution (or an acquisition offer) would have been nice.
That's not true. I was there, and I was there at the time. Sun had a lot of problems, but software was not an afterthought, at least not within the Solaris org.
They also (in my limited experience) treated sales as an afterthought. It's the only time I've contacted a company to inquire about a product and had them actively dissuade me from buying the product (IAM).
Could anyone please shed some light on the "low-key" introduction issue mentioned in the article?
I've worked quite a few years in IT, and never, during any interview or meeting, have I been introduced as anything more than just an engineer. This must be a cultural gap, no doubt, but I'd feel weird if someone detailed my career in front of other participants. Of course I have nothing against filling in some details myself, but only if applicable in a given situation. Truth be told, I've never worked in Australia or the US, but I did some work in two EU countries and in Japan and, as said, never encountered a detailed introduction.
Great story. One of my open source bioinformatics tools was ripped off, with name AND GPL license stripped off, by a person working at the EMBL/EBI (European Molecular Biology Laboratory / European Bioinformatics Institute; a prestigious scientific institution).
Management, when alerted, made it right, but I think the point of my story is that this is perhaps more common than anyone realizes.
Wait but what happened? Did he do something about it after, like contact Sun saying "Stop! You have violated the law!" or whatever? If not, why on earth not?
What's the point of publishing with a copyleft license if you aren't going to do anything when someone literally walks into your office and says "we at Big Corp are selling your work without any attribution?"
I was hosting a research talk given by quasi-famous professor at a biotech startup that I worked at.
Quasi-famous prof was describing a gene (gene "xyz") being used as a tool in his lab, "but the specifics of what gene xyz is and what it does are not important. It's just a gene we use in these assays...."
Me: Do you know who has 2 thumbs and discovered xyz? This guy.
Any non-Americans want to chime in on what is a heavy American accent? I’m imagining heavy southern accent, but maybe this is something that can only be heard by non-Americans?
I'd say most people hear divergence from their local accent, and consider theirs to be (at least subconsciously) a normal accent. This results in English people hearing my accent as a bit Irish, and Irish people hearing the English parts. Neither acknowledge fully the shared parts. In truth, I am a bit of both.
So I'd say a strong American accent to an Australian is the most divergent one, not one that somebody in US might consider strong.
Living in upstate New York I also frequently heard "oh, you have an accent" to which I always tried to explain "so do you", and several times I got the response that their accent was either neutral or closer to generic English.
It is amazing the subjective differences in how people experience accents, and how they feel about their own.
I'm born and raised in Göteborg (Gothenburg), Sweden. My parents are originally from Karlskoga, 270km NE from here. Göteborg has a very distinct accent and I always get told by others that I have a very bland Swedish with no hint of that accent.
But one time I was at a wedding in Karlskoga and talked to someone I hadn't met before. I opened my mouth and managed to get half a sentence out before he interrupted with "Oh! You're from Göteborg!"
Same experience for me. I am from the Venice area in Italy, which is famous for a very strong and somewhat funny accent and dialect. However, my family is from central Italy. This resulted in me being considered to have a Southern, definitely non-local accent in Venice, and being considered Venetian by everyone else in Italy. Probably the truth is somewhere in the middle.
There was a month or so after I moved back to Germany from the Netherlands where my own accent when speaking German was super noticeable to me. Otherwise I don't even hear my own accent when speaking English even though it is undeniable there when I listen to recordings of me.
It's like your mouth was full of water - in contrast to British Received Pronunciation, which is like a mouth full of down feather. Also compare to "Hollywood Soviet English", which is like a dry mouth, but the tongue is filled with helium so it floats up, and then 'r' sounds like this: https://upload.wikimedia.org/wikipedia/commons/c/ce/Alveolar....
I have absolutely zero ability to take your descriptions and convert them into an understanding of what you're trying to convey. Like what could this possibly mean???
Speak through your nose and lean heavily on the R sound wherever you find it.
Peter Sellers called Americans "The Herns" [1]:
> Various American characters with the surname Hern or Hearn, often used for narration, outrageous announcements or parody sales pitches. The Goons referred to Americans as "herns", possibly because saying "hern hern hern...." sounded American to them, possibly because Sellers once said that a decent American accent could be developed simply by saying it in between sentences.
As a non-native English speaker I absolutely hate English accents where consonants absolutely disappear for no good reason.
Personally I'd rather have Americans "lean heavily on the R" than act like the letter doesn't exist (rhotic vs non-rhotic). I think it's another factor why American English is more popular than British English (besides the huge economic factor, the US economy being 5x the UK one), since their pronunciation is clearer and more explicit.
>Rhoticity in English is the pronunciation of the historical rhotic consonant /r/ in all contexts by speakers of certain varieties of English. The presence or absence of rhoticity is one of the most prominent distinctions by which varieties of English can be classified. In rhotic varieties, the historical English /r/ sound is preserved in all pronunciation contexts. In non-rhotic varieties, speakers no longer pronounce /r/ in postvocalic environments—that is, when it is immediately after a vowel and not followed by another vowel. For example, in isolation, a rhotic English speaker pronounces the words hard and butter as /ˈhɑːrd/ and /ˈbʌtər/, whereas a non-rhotic speaker "drops" or "deletes" the /r/ sound, pronouncing them as /ˈhɑːd/ and /ˈbʌtə/. When an r is at the end of a word but the next word begins with a vowel, as in the phrase "better apples", most non-rhotic speakers will pronounce the /r/ in that position (the linking R), since it is followed by a vowel in this case. (Not all non-rhotic varieties use the linking R; for example, it is absent in non-rhotic varieties of Southern American English.)
>The rhotic varieties of English include the dialects of South West England, Scotland, Ireland, and most of the United States and Canada. The non-rhotic varieties include most of the dialects of modern England, Wales, Australia, New Zealand, and South Africa. In some varieties, such as those of some parts of the southern and northeastern United States, rhoticity is a sociolinguistic variable: postvocalic r is deleted depending on an array of social factors such as the speaker's age, social class, ethnicity, or the degree of formality of the speech event.
My brothers and I did the ring road in Iceland a few years ago. One night we were eating dinner at a restaurant in a small village in North Iceland. Our server had a perfect North USA dialect of English. He sounded to us just like an American. We asked him if he had lived in the USA. He said he had never left his village and had never even been to Reykjavík. We asked him how he came to speak American English so perfectly and he said he learned it from watching movies and TV shows.
I'd include the US music industry in that. Songs are *super* important for learning a language, if you're constantly surrounded by songs of a certain language.
Plus... American multinationals are somewhat close behind. If you want to get a good, well paying job at an American multinational, you have to speak English at least a bit, and you have to know it well if you want to move up the ladder.
Recently we were doing interviews for a job. We had one gentleman from India applying for the position. We have hundreds of people with Indian English accents at our company, so it’s not like we are unfamiliar with the dialect. But this person… none of us could understand more than one word in ten from him. We persevered with the interview, each of us assuming that it was just us and everyone else could understand him. It took me nearly an hour to realize that he was pronouncing API as “ape-ee”.
I like the British accent for its aesthetics, but from a pragmatic point of view it's not even a contest (from the perspective of a non-native speaker), I agree.
It's not just the r's (in fact, some British accents are rhotic - around their South West, if I'm not mistaken?), there are all these glottal stops and whatnot.
But, from my observations at least, there are also big discrepancies related to social class. When I moved to the UK, I had no problem whatsoever talking to, say, a local librarian - but a plumber would be nearly impossible to understand for me, in the first months at least. I didn't really experience it in the US, certainly not to such an extent.
The thing is that UK dialects are virtually unknown outside the country; they can be very different from Received Pronunciation, but they diverge in ways that are still fundamentally predictable for a native English speaker (unless you wander into Scotland or Ireland).
American accents are more familiar because of Hollywood, so they tend to be less surprising; and likely because a lot of them were actually developed by people who learned English as a second language, they are often exaggerated in effect, very clear, and actually more regular (particularly on names, where UK "rules" are anything but).
This said, "deep south" US accents, when pushed hard, can become as inscrutable as certain UK dialects.
What does that even mean? There's nothing clearer or more explicit about either. Maybe you mean that it more closely matches the orthography?
> than act like the letter doesn't exist (rhotic vs non-rhotic).
Every language changes. Nobody is acting like "the letter doesn't exist", just occasionally that phoneme has changed or dropped in their dialect. Even in non-rhotic accents, an r in the orthography can indicate a change in vowel quality.
> I think it's another factor why American English is more popular than British English (besides the huge economic factor, the US economy being 5x the UK one),
I think the greater population, and the fact that Hollywood content has embedded itself globally, has resulted in more exposure to American content (of which there is simply more). Any dialect will sound clearer to you if you're exposed to it more often than others.
That's funny. As a non-native (German) English speaker, British pronunciation seems far clearer to me than American - barring strong regional accents, of course; maybe that is what you had in mind.
Yeah, but Received Pronunciation is basically an artificial creation, only a certain percentage of Brits actually use it.
Regular Brits use their regional accents, yes. And those can be much, much harder to understand than your average American accent. For precisely the same reason non-rhotic accents can cause issues, those regional accents tend to eat up sounds and sometimes entire syllables.
Imagine a British accent. Now imagine that that is what’s normal; everyone in your country speaks it. Suddenly American sounds very different.
Specifically in my limited experience stuff like pronouncing t’s like d’s and soft back-in-the-throat r’s. “Budder” vs. “buttah”, for example. American vowels tend to sound larger as well in my experience.
Well, that's a specific regional American accent, in the US called a Southern accent. Probably the only strong/widespread regional accent really left in the states (you could also argue for AAVE/ebonics but it's not regional in the same way).
It's a bit like saying that someone from rural Bavaria who speaks with a strong Bavarian accent has a heavy German accent while speaking German.
Yes, but I'd bet that the Bavarian accent in particular is considered by foreigners as "heavy German". Bavarian culture (Oktoberfest) is considered "typical German".
Really rhotic r's, relaxed prosody, tongue flaps instead of a t between vowels, indistinct schwas on unstressed syllables, etc.
I also perceive Canadians to have "heavier" North American accents than Americans do. Some Canadians speak as if from the backs of their throats, with leaden vowels and really round r's. They also often overcorrect /a/ to /æ/ so "drama" becomes "dramma" (like the first two syllables of "Dramamine"). And of course there was William Shatner's famous "sabotadge"...
Americans can identify accents from within the US. Any such accent would be identified by foreigners at least as American, and probably as coming from a coast, the central US, or the south.
I would read that as an accent that isn’t a softer New England accent, but also not one that non-Americans would so easily place more specifically (Southern, New York, maybe one or two others?)
I almost had a heart attack last week when I noticed that a library I've used for work over the last year wasn't open source, but rather source-available.
Thankfully, my employer had some licenses for the library without my knowledge, but it ain't fun to break licenses at work, especially when you don't notice until months later.
Definitely not -- see my earlier comment[0], but when I met Brendan in person later that year (and he relayed this incident to me), I didn't even recognize the name. Certainly, it wasn't someone who should have been claiming to be a DTrace expert! And honestly, by that time, any actual DTrace expert inside of Sun definitely knew of Brendan -- and likely vice versa.
there were many people inside Sun that were not Bryan Cantrill. In fact, almost all of them were not Bryan Cantrill. But there can only be one VIP and it can only be Bryan?
I'd love to hear stories from the other side. Behind each of these "someone else taking credit" war stories is the engineer or product owner at the other company who decided this was an ethical course of action. How did you justify this to yourself and to your boss, and what was the plan if/when you got caught? Doesn't your company have an internal process for vetting the licenses of software you use? It seems like these cases are failures at multiple levels in the company. It would make an interesting post-mortem. Use a throwaway account if you like!
I'm not sure if this is just my personal feeling, but I would say that stealing intellectual property was sort of common at that time. Open source was not widely known, knowledge was scarce, communities were just ramping up, and really anyone with a lack of principles could pretty much steal anything and get away with it.
It happened to me a few times with online content I wrote. Essentially tutorials, articles, etc. around Open Source. Once my own company sent me a newsletter which contained one of my articles signed by another employee from a different place. It felt pretty weird.
The article says the story was from 2005. Open source was very much widely known at that time. Linux was 13 years old by then, and Sun themselves open sourced both DTrace and Solaris that same year.
At the time, Sun were pushing it heavily as one of the two salvations for Solaris, so I'm not surprised.
I have to say that my limited experience in dealing with Sun as a customer mirrors Brendan's comments around a remarkable arrogance, and it probably played no small part in their downfall.
> At the time, Sun were pushing it heavily as one of the two salvations for Solaris, so I'm not surprised.
Yeah, I remember that. I kept thinking "this tool might be nice for C wizards but it does nothing for my day-to-day experience as smalltime Linux user / admin". The other big thing was ZFS, which was interesting, but they were extremely uncooperative with the license, basically ensuring it would never make it big.
This has me imagining a struggling engineer at Sun, trying to live up to the arrogance and falling short, who finds Brendan's work and decides to save his career by passing it off as his own. Then some high-level executive decides to make it one of the straws that will save the company...
This is a made-up story, so you are free to make up your own ending as to whether this was the world-travelling VIP, and if so, whether he had some partial or complete flashback on hearing Brendan's name. One thing we can be sure of: whoever 'carelessly' stripped the copyright notice out of Brendan's code had seen his name before (and it was, perhaps, the part of the code he was most familiar with!)
What rights does a developer have in these cases? Can you get some compensation/damages for the license/copyright violation even if you were giving away the software originally?
Can you get more money if they violate the OSS license if you offer the software under a commercial license as well?
It makes me sad that the way a developer is expected to react when they discover that the Valuable Thing that they gave away for free is being sold by a Big Company for Big Money is to try to ensure that the Valuable Things they give away in the future remain free.
I'd much prefer to see the people who build Valuable Things show more interest in capturing some of that value.
There's this overwhelming narrative revolving around Open Source that makes it seem shameful to profit from your work. It's maddening to watch. There's no reason we as developers need to be the low man on the totem pole getting tread on by business people. We just set ourselves up that way and socially punish anybody who doesn't.
If that's what you took from the article, then you got completely the wrong end of the stick. The problem was the removal of the author's attribution and illegal relicensing. He says himself he was glad when Apple later included his tools in macOS with correct attribution and licensing.
The problem he should have noticed was that the Sun was selling the code he wrote for hundreds of thousands of dollars and not passing any of that on to him.
Step one shouldn't have been to worry about putting his header comment back in place and getting them the latest version of his code to sell to their customers. It should have been negotiating a redistribution license for his code if they wanted to continue selling it.
No, the problem was the licensing change and removal of attribution. I would strongly suggest reading what Brendan says in the article.
On this topic, I work for Red Hat where we made $3.4 billion in revenues in the last published year (before being acquired), making exclusively open source software which you can download yourself for no cost.
This works in the other direction: how many developers are willing to pay for their dependencies? Would things like npm even exist if you had to pay invoices for every bit of code loaded?
Semi-related... I TA'ed a Comp Sci class a long time ago (back when they had to submit their code as print-outs). I read through the print-outs and noticed that the "look" of one of them seemed oddly familiar (blocks, line lengths, indenting etc). I went back through the others and found another one that was almost exactly the same.
Took it to the Prof and we agreed the ones that copied got 0 on the project, and the ones who allowed the copying got 50% of their mark. Honestly, they got off quite easy.
This reminds me of how the developer of MINIX, Andrew S. Tanenbaum, found that Intel put his operating system on millions of machines only after reading about it in the media: "I guess that makes MINIX the most widely used computer operating system in the world, even more than Windows, Linux, or MacOS. And I didn't even know until I read a press report about it."
Once I worked at company X; while there I saw a library for universally accessing things in the OS. It was written by our senior architect.
A year later I worked at another company, where there was a high-strung developer boasting about his own library. He had even put it up on a webpage - this was before GitHub. It was the same library from the company above, so I ratted him out to the developer who actually wrote it. I usually feel queasy about ratting, but for some reason I didn't feel the slightest bit of dissonance.
> It was something I and my consulting colleagues had run into before: The belief at Sun that only Sun could make good use of its own technologies, and anything created outside of Sun was trash.
Yep. See Sun's response to the Linux SPARC maintainers' technical critique of Solaris for SPARC. Meanwhile at Red Hat we started hiring all the ex-Sun people who had been pushing for x86 internally at Sun and gotten frustrated with the flip-flops (RHEL 3 on Xeon already demolished Solaris/SPARC).
A lot of us at Sun in the Solaris org had no love for SPARC. Sun made a very typical and terrible mistake: it sat on its laurels. Sun created some awesome things then tried to lock-in and milk customers. Sun (and later Oracle) did this with SPARC and J2ME, among others. All the products they did this with are as good as dead today in terms of market share -- surprise!
Vendor lock-in is not great for the customer, but when you try to milk the customers, you end up risking taking them to the FYO point [0], and that is catastrophic for the vendor, and the vendor never sees it coming and can't help themselves.
Sitting on your laurels is not good. Don't do it. Innovate. Then innovate some more. Then never stop innovating.
[0] Let me google that for ya: https://www.google.com/search?q=fyo+point
>The belief at Sun that only Sun could make good use of its own technologies, and anything created outside of Sun was trash.
As a former Sun employee, I can tell you that's true: people at Sun wouldn't even look at competing technologies because they were sure there was nothing useful to be learned from them.
That's why Bill Gates and Anders Hejlsberg were able to screw Sun so badly by examining and copying the good things about Java, and making something much better: C# and CLR.
While Sun totally failed to learn from any of the good things that Microsoft or Apple or anyone else did.
>We are better off with all the wood behind one arrow.
Nice reference to the old Sun slogan and 1999 April Fools Day prank, in which Sun employees put an enormous arrow through Scott McNealy's office.
All your wood notwithstanding, it also helps to choose an arrow that isn't flawed and doesn't totally miss the target. For what it's worth, Scott McNealy also put all his wood behind another arrow named Donald Trump.
Scott McNealy has long been one of Trump's few friends in Silicon Valley
Something I should be aware of: in Go programming, they sometimes encourage you to just copy some code instead of adding another dependency. However, if you copy code from a codebase with a certain license then, if everything is above board, you should include the license as well for that bit of code.
I mean in practice it's tiny utilities that I'm too lazy to reimplement in the exact same way, but still, it's something to keep in mind.
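As a rough illustration of what that can look like in practice (everything below is hypothetical: the package name, the imaginary upstream project, and the helper are made up for the example), the copied file keeps the upstream copyright notice and points at a copy of the full license text:

    // Package strutil vendors one tiny helper rather than pulling in a whole
    // dependency. Hypothetical example, not code from a real project.
    //
    // The function below is imagined as copied from an MIT-licensed upstream
    // repo, so its notice travels with it and the full license text is kept
    // alongside (e.g. in a file named LICENSE.strutil):
    //
    //   Copyright (c) 2020 Example Author
    //   Released under the MIT License; see LICENSE.strutil for the full text.
    package strutil

    import "strings"

    // SplitAndTrim splits s on sep, trims whitespace from each piece, and
    // drops empty entries. (Copied verbatim from the imaginary upstream.)
    func SplitAndTrim(s, sep string) []string {
        var out []string
        for _, piece := range strings.Split(s, sep) {
            if t := strings.TrimSpace(piece); t != "" {
                out = append(out, t)
            }
        }
        return out
    }

Whether you keep the notice inline like this or add an entry to a top-level NOTICE/THIRD-PARTY file is a matter of taste; the point is that the attribution and license terms travel with the copied code.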
Someone downloaded the code for GANs (generative adversarial networks) from Github, generated (sampled) a painting using it, slapped the GAN objective function as a signature, and sold the painting for over $400k!
Are we being fools? Do we only get ahead in the world if, instead of spending our precious time building things, we use it to cannibalize other people's work?
While I don't believe in karma, I do believe there are consequences to misbehavior, even if one is not caught:
The temerity and lack of ethics to appropriate a project that someone else has written and claim it as one's own leaks into other areas of one's life. That kind of behavior is not isolated to just open-source software. Perhaps the world-weary thief of the OP took his experience as a lesson and changed his ways, but he likely continued bumbling around, behaving dishonestly, losing the respect of his peers along the way. Perhaps even that of family and friends. Perhaps he miscounts the points in a boardgame, or cheats on his spouse, but it won't be limited to this.
Contrast that to the life and career arc of the OP himself.
So, no, the fools are those who steal, caught or not.
Once, someone tried to refute my argument using a post by a ghost user from github.
That ghost user was an older account of mine. Their interpretation of the post was wrong though.
I reacted to the situation by laughing slightly but didn't even bother explaining why. Because of the poor tone, I decided to let that person enjoy being wrong.
A very similar thing happened to me when I was pitching Gmail on anti-spam and realized they were running a forked version of Vipul’s Razor and claiming it was their own. They never contributed anything back... it’s unfortunately not uncommon.
As much as I think RMS himself is an utter embarrassment of a human being, stories like this are why I also believe the GPL is one of the most important contributions to computing, ever.
MIT/BSD-style licenses are practically begging large billion-dollar corporations to rip off your work wholesale and use it to generate profits while contributing only the occasional patch or two. I used to see this phenomenon on HN, where every time some distro had a new release the BSD folks would be in here reminding us that BSD runs Netflix and routers and Playstations and won't we please just donate? As if Sony and Netflix value these projects enough to use them for critical infrastructure but not enough to keep them financially solvent.
(The GPL is of course not a panacea; as TFA demonstrates, Sun would have got away with this, possibly forever, had the author not made his serendipitous discovery)
In this specific case, the files were under the CDDL, which is copyleft. The only thing Sun did that was not allowed by the license was to remove the copyright notice. Had the scripts been under the GPL, nothing would have been different. Anyone can sell GPL software, as long as they provide access to the source code and allow further modification and redistribution.
I think the GPL and FSF etc lost momentum with their GPLv3 push. I know that was the case for me. I really liked GPL (v2). Then GPLv3 came out and I was like huh? After that I became much more open to the lighter versions of things MIT / BSD.
GPLv3 is not really workable in terms of preserving developer freedom to do what they want with code (as long as they share their code back).
Obviously the powers that be disagreed and the GPL has been "upgraded" to v3, but I was never impressed, and I'm now much happier to contribute to MIT/BSD-licensed products (which do allow you, as the developer, to do what you want with the code).
I think the FSF appeared at a critical juncture and that the tools they built, that were totally free, helped to fix the direction that software was going. With the rise of Linux as the working GNU kernel an entire generation (or two) became aware and appreciative of open source which then went on to become the backbone of most of the internet and now those companies are some of (if not the) largest contributors to open source.
With that said you’re right, with GPLv3 they moved the fight into new domains that aren’t as obvious or really “as big of a deal” to a majority of people. Also with the rise of things like JavaScript GNU, under Stallman, became the old man crying at the children. Stallman hated the rise of “non-trivial” JavaScript and refuses to work with proprietary code so GNU could never have developed e.g. React or Tensorflow. We now have a new generation of open source tooling that was developed more in spite of the FSF than by it.
The current environment will in turn spark a new generation of tooling down the line that is even more removed from the likes of Stallman; he once responded to a request I sent about working on a JS library advertised on the GNU website by telling me only that I should call it F/LOSS instead of open source (and nothing else).
So while I respect the FSF for the work they did in creating what we have today, the time where they were leading the fight to save software is long passed. They won in some ways and the world is better for it, but they lost in others and I’m not sure the world is that worse because of it. Having trade secrets isn’t completely a bad thing as it allows competition and different implementations, though that part is simply my opinion I suppose.
I remember RMS saying that the license was based on these freedoms he enumerated, but after releasing software under GPL v2 he found out he had to explicitly add the freedom to actually RUN the software.
Remember that the idea of the GPL is the authors preserve and propagate the rights for the USERS.
He also once said "proprietary software subjugates people", which I thought was sort of over-the-top to say, but over time I think software in this era of dark patterns and privacy has unfortunately become very obvious.
The irony is the GPLv3 does more to address some of the problems we have with open source than most other licenses, but was rejected by many due to a lack of concern for these issues.
"The auto-update clause is optional" -> Yes, I confirm this is true.
The sample license header here [1] says:
> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
That is only a suggested header.
From Linux kernel here [2], you can see:
> under the terms of the GNU General Public License version 2 only
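To make the "optional auto-update" point concrete, here is a minimal sketch (hypothetical file headers written as Go comments; not the kernel's actual header text beyond the wording quoted above) of how a project keeps or drops the "any later version" clause:

    // Variant A: the FSF's suggested notice. The "(at your option) any later
    // version" wording is what lets recipients move the code to GPLv3 or later.
    //
    //   ...under the terms of the GNU General Public License as published by
    //   the Free Software Foundation; either version 2 of the License, or
    //   (at your option) any later version.
    //
    // Variant B: a "version 2 only" notice, in the spirit of the Linux kernel's
    // wording, which omits that option so the code stays under GPLv2 terms.
    //
    //   ...under the terms of the GNU General Public License, version 2 only,
    //   as published by the Free Software Foundation.
    package licensing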
Please try to refrain from personal attacks on people. It’s very disrespectful, especially when someone has clearly dedicated so much of their time to public works.
Where does this expectation come from that a rich organization or person using open source software should be paying for it? That’s against the entire spirit of the open source license in the first place.
From the same place taxes come, or more generally, quite a lot of social obligations. It's the expectation that, if you're benefiting from commons, you should also contribute something back to the commons.
A company using open source software to make money is making money off commons. Makes sense they should feel obliged to contribute something back, and since they have the surplus of the best form of contribution - money - it's reasonable to expect them to donate some of it.
Open source isn't a commons. That's your major mistake. Using it does not deprive others of anything or wear anything down.
> it's reasonable to expect them to donate some of it.
No, that’s actually quite ridiculous. “Reasonable” implies some level of reasoning behind it. There is no “reasonable” proposal of how much money should be given when it’s against the very spirit of the license to expect payment based on usage.
What percentage of profit should an individual or corporation contribute? Give me a concrete calculation of software usage and how much should go back to the project. Is it measured as the percentage of clock cycles spent executing that code across all of an entity's compute?
Presumably the IRS should also give a cut of all tax revenue collected by the US to the open source projects it uses too, right? If not, your beef seems to be purely with private enterprise being successful more than any fairness based billing.
The only great thing I get from the story: at least a multi-billion-dollar company thought your tools were so amazing that they were worth paying their own VIP to take on a world tour.
And no wonder Sun failed to compete. A cultural and management failure.
It's frustrating at times; at other times, it's a breath of fresh air to deal with engineers who just tell you what they're doing, without the overblown sales pitch.
Why do you find it cringe? In NZ I think it's even more true than Australia. It's a tiny, isolated country with a small population that isn't the hub of anything really.
Aussie and NZ compared to the US are completely different worlds in my eyes.
See, at least you realize you are a tiny country with a small population, and your ego has adapted to that.
I am French and we are a country with a great past - but not looking ahead. We still assume that we radiate across the world and that our voice counts.
It does not, and this has to be clear. There are countries that can influence events, but ours is not one of them.
A typical example is how angry we were after Lukashenko's recent act of piracy (Belarus forced an EU plane flying from one EU country to another to land while over its territory, in order to arrest an opponent of theirs). Our president said that there would be consequences, and there have been none.
This is not to spit on my country, but sometimes egos are much bigger than the reality. We are in good company there.
I am British but have lived in [redacted] for a while now and I feel the same as you about [redacted].
If Brexit showed us anything, it is that the UK is nowhere near as important as they like to think they are. Watching the UK and [redacted] at loggerheads has been a mix of frustration and amusement for someone like myself, clearly caught in the crossfire.
As an "outsider" in [redacted] my biggest complaint is the staunch opposition to change. Any change. As you say they simply cannot look ahead. It is as if they only know how to live in their past glories rather than working towards future ones.
The UK is sort of the opposite, but in a terribly executed manner. They have dreams of the future but do everything possible to make those dreams harder to achieve, due to the arrogance that they can "do it alone". Harking back to "the good old days of the Empire" and "Blitz spirit!" as if the Blitz was some wonderful time (wtf?).
The sad thing is the [redacted] could learn a lot from each other, but it seems both sides are too myopic to do so.
> I am French and we are a country with a great past - but not looking ahead.
I feel the same as you about both the British and the French.
Add India to the list.
The right wing in India is obsessed with our past and, worse, desperate to associate everything about it with "Hindu religion" or "Hindu culture", despite the huge influence of Buddhism, Islam and the imperialists (largely the British, who finally got the upper hand on our sub-continent).
"We were the richest country in the world till we were looted by Muslims and Christians."
"We had brilliant Hindu brahmin scientists who excelled in mathematics, medicine and astronomy / astrology.
"Look at these huge ancient temples built with extraordinary artistry that have survived for centuries."
These, and so on, are offered as proof of our "great past" that they believe should automatically earn us the respect of the world.
In their obsession with the past, they totally disregard the achievements of modern, independent India, just because our freedom movement was led by people like Gandhi and Nehru who opposed their idea of a theocratic-fascist state, and instead chose to create a secular state that treated everyone as an equal and gave every citizen equal rights.
For a country that won its independence in a non-violent manner from the most powerful empire in the world, that has lifted millions of its citizens out of poverty and is today self-sufficient in agriculture (one of the largest producers in the world), and that is one of the few countries with an active and self-sufficient space, nuclear and defence program, we have really made a lot of strides.
But for the right, India is not "respected" by the world because anti-Hindus chose secularism; and since "we Hindus" don't respect our "Hinduness", the world also ignores Hindu cultural achievement and denies us our true place.
Have you seen the coverage of the interview of Roman Protasevich? The visible injuries on his wrist and his recently adjusted attitude are alarming.
https://www.bbc.com/news/world-europe-57353413
TV/film - is there something new even remotely popular across the world that's French?
It was the case up to the 2000's, I think, but I can't even remember the name of a new French director since then. The last one I remember is Luc Besson. Kind of similar story for actors/actresses, are there some major French stars popular across the world?
I feel most of the influences are leftovers from a different era, folks like Depardieu.
> TV/film - is there something new even remotely popular across the world that's French?
Indian here who doesn't know French - just finished watching all four seasons of Dix Pour Cent (Call My Agent) and really enjoyed it. I also watched Lupin (after learning that it stars Omar Sy who I loved in the movie The Intouchables).
I think there are lower expectations of Aus/NZ because they're relatively small and out-of-the-way countries (Maybe less so for Aussie), so it's surprising how often they seem to excel in things.
The obvious examples that come to mind would be sports. NZ excels in Sailing, Cricket and Rugby against much larger countries.
Though it could be that in general, Aus/NZ have the same skill distribution as other countries, but they just stand out more because there are lower expectations or highly skilled people are rarer overall due to lower populations.
> The obvious examples that come to mind would be sports. NZ excels in Sailing, Cricket and Rugby against much larger countries.
Those are niche sports, very popular in that part of the world. They're not major sports, though.
Cricket is sort of a major sport, but it's also very culturally concentrated. Outside of UK & some former UK colonies, almost nobody plays it/watches it.
Does New Zealand have any famous footballers, basketball players, athletes, etc.?
It's not a tiny country... it's about average for country size and when you consider that it's an archipelago, the area of the world it controls through territorials waters and such is basically continent sized. New Zealand is actually smack-bang (#75) in the middle of the list: https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
2. Large countries: anything below 1 up to probably about 500k sqkm.
3. Middle of the pack countries: everything from 500k down to about 100-200k sqkm.
4. Small countries. Everything below 100k sqkm (200k sqkm if you want to stretch it out).
The categories are somewhat fluid since for example Indonesia is not continent sized in landmass, but it's an archipelago that does stretch over the area of an entire continent, when you consider it end-to-end and include its territorial waters. Plus having a very high population for your group also moves you up. Germany is average in size but in population it's a large country. Same for Japan.
Have you ever been to the UK? That sort of mindset is widespread. In fact, there is basically an obligation that the country must, at all times, fight wildly above its level.
OK, maybe I'm too old or experienced with this type of thing to really enjoy this article. The author may honestly have been sincere as he wrote it, but I felt it was a bit overblown, coming from the negative feelings of being brushed off and, rightfully so, of being upset that his code was stolen. It could also just be that there are some cultural misunderstandings at play.
He mentioned this "VIP" is a "Developer and dtrace expert". But reading that and the other details, I think this is probably not the reality and maybe was communicated incorrectly to him. I really doubt this guy was a "VIP" as he says.
My guess is this "VIP" was actually a pretty normal member on the dtrace project, could be a little senior and got the opportunity to go around and talk about it. I am sure they had a team somewhere who put together most of the software, maybe he was involved a little bit, but probably he was just as confused as everyone else about using that open source software - he probably knew enough to teach it, and how it worked, but so many people work on these type of projects, unless they sent the lead engineer he probably didn't know it deeply except enough to evangelize and teach how it works.
He mentions being slighted by this guy a lot, saying things like "He wasn't impressed", "gave me a look like he didn't really believe me", etc. This might be true, but I suspect it's coming from his negative interpretation of the situation. This guy had just traveled all the way around the world, was super exhausted, and was possibly honestly confused about what was going on - I certainly have been in that situation before.
The author also mentions he felt it odd that he (the author) was producing more dtrace tools than Sun was. This almost sounds a bit like indirect boasting. Large companies are slow. A dedicated passionate developer who is working alone or with a small team will always run laps around huge companies. This isn't odd at all. Companies often get distracted, can't focus on what's important, or decide not to do what is important for a product due to other business reasons.
In fact, as he found out, some engineer somewhere just ripped off his stuff because it was faster and easier for them. Sun's team was not professional at all, possibly even breaking the law, which I think is the point of the article, but the descriptions of the DTrace guy whose job was to show DTrace around the world lessened my enjoyment of it.
I have said this elsewhere on this thread, but just to reemphasize: the person that Brendan met had absolutely nothing to do with DTrace -- to the point that when he told this story to me, I didn't even recognize the name. (And can't now remember it.) The DTrace team was very small (there were three of us), and the community of early DTrace users inside of Sun -- the earliest folks who could rightfully call themselves DTrace experts -- can be seen in the acknowledgements section of our 2004 USENIX paper.[0]
I am not saying it's factually incorrect. My point is that it includes a lot of Gregg's personal feelings (and maybe he was informed incorrectly about the situation), and I'm just not sold that the guy who was assigned to show off DTrace was the bad guy here.
In the article I included my guess about the real cause for this: Sun's assumption that any good work had to be from a Sun employee. I'd guess the sequence of events was:
- DTrace is the new hotness, we need it in our UI.
- Everyone's using Brendan's tools, let's add them (so far, so good).
- Oh, why do they say copyright Brendan? He made a mistake: Sun employees should be putting copyright Sun on them. (THIS is the mistake, as I wasn't a Sun employee).
- I'll just delete his name and stick copyright Sun on them all.
- Developer gets picked to go do a world tour (and may genuinely not know what happened).
As for how I was treated: I guessed why in the article as well, the low-key introduction as is the norm in Australia.
As for how you were treated - I don't think the low-key introduction can be fully blamed. The VIP should have known that smart people exist in various places around the world, and sooner or later one does bump into them. When you meet someone knowing absolutely nothing about them (and an introduction doesn't count), and then they start talking intelligently about a topic, then you have one data point (that they have talked intelligently), and you should draw an appropriate conclusion from that. It sounds like the VIP had serious preconception issues.
I worked for a government research lab and it was the same, only work coming from inside the lab was respected and contractor work was looked down upon.
That's really interesting, since you were so close to Sun they actually thought you were a Sun employee!
The example in the article is an exception, but companies (including Sun) are generally very careful about using open source, and using it without attribution would be the exception, not the rule.
Large companies are slow, indeed. In the late 90s I wrote a few operating system plugins (nss_ldap, pam_ldap, GSS SASL plugin for the Netscape directory server) which were eventually obsoleted by native Solaris equivalents. The Sun versions were on the whole better engineered, if less flexible, because their OS team had a depth of experience that I didn't have at the time.
How is the information in this blog related to the Wikipedia article on DTrace [1], where one can read "DTrace is ... originally created by Sun Microsystems" and "Original author(s): Bryan Cantrill, Adam Leventhal, Mike Shapiro (Sun Microsystems)"?
Quite simply, it is not. The author is not claiming to have written DTrace, but rather, tools that made use of DTrace. From the introduction: "Sun Microsystems had just released DTrace" and "I was busy writing and publishing advanced performance tools using DTrace".
Sun developed DTrace the kernel building blocks, Brendan Gregg became an expert on it and made scripts that actually did useful stuff with them. The VIP was selling a GUI around Brendan Gregg’s scripts.
Brendan was the most amazing and prolific user of DTrace, from very early on. Brendan did not create DTrace, but in a sense he "made DTrace" what it is. And not just DTrace, but eBPF.
What do you mean? Brendan didn't create DTrace, he created DTraceToolkit and this VIP took his work and presented it as his own new DTrace-based product.