Who remembers Ken Thompson's "Reflections on Trusting Trust"?
The norm today is auto-updating, pre-built software.
This places a ton of trust in the publisher. Even for open-source, well-vetted software, we all collectively cross our fingers and hope that whoever is building these binaries and running the servers that disseminate them, is honest and good at security.
So far this has mostly worked out due to altruism (for open-source maintainers) and self-interest (companies do not want to attack their own users). But the failure modes are very serious.
I predict that everyone's imagination on this topic will expand once there's a big enough incident in the news. Say some package manager gets compromised, nobody finds out, and 6mo later every computer on earth running `postgres:latest` from docker hub gets ransomwared.
There are only two ways around this:
- Build from source. This will always be a deeply niche thing to do. It's slow, inconvenient, and inaccessible except to nerds.
- Reproducible builds.
Reproducible builds are way more important than is currently widely appreciated.
I'm grateful to the nixos team for beating a trail thru the jungle here. Retrofitting reproducibility onto a big software project that grew without it is hard work.
> I'm grateful to the nixos team for beating a trail thru the jungle here. Retrofitting reproducibility onto a big software project that grew without it is hard work.
Actually, it's the Debian guys who pushed reproducible builds hard in the early days. They upstreamed the necessary changes and also spread the concept itself. This is a two-decade-long community effort.
In turn, NixOS is mostly just wrapping those projects with its own tooling, literally a cherry on top. NixOS is disproportionately credited here.
I think both efforts have been important and have benefitted each other. Nix has always had purity/reproducibility as tenets, but indeed it was Debian that got serious about it on a bit-for-bit basis, with changes to the compilers, tools like diffoscope, etc. The broader awareness and feasibility of reproducible builds then made it possible for Nix to finally realise the original design goal of a content-addressed rather than input-addressed store, where you don't need to actually sign your binary cache, but rather just sign a mapping between input hashes and content hashes.
Of course, yes— that was what I was saying. But the theory with content-addressability is that unlike a conventional distro where the binaries must all be built and then archived and distributed centrally, Nix could do things like age-out the cache and only archive the hashes, and a third party could later offer a rebuild-on-demand service where the binaries that come out of it are known to be identical to those which were originally signed. A similar guarantee is super useful when it comes to things like debug symbols.
There are new challenges with new packages, too. In the last 5 years a lot of Rust packages have entered the tree, for example: a new compiler to tackle reproducibility with (and it's not trivial, even if upstream has worked on it a lot).
No, rust leaks the path to the source code on the build machine. This path likely does not even exist on the execution machine, so there's absolutely no good reason for this leakage. It is very nonstandard.
It is really, really annoying that the Rust team is not taking this problem seriously.
I don't think this is correct. Most compilers include the path to the source code on the build machine in the debug info, and it's a common problem for reproducible builds. This is not a rust-specific issue.
Obviously the binary can't contain paths from the execution machine because it doesn't know what the execution machine will be at compile time, and the source code isn't stored on the execution machine anyway. The point of including the source path in the debug info is for the developer to locate the code responsible if there's a crash.
But is it only on debug builds? Or are release builds affected? Because if it’s the latter, that’s a big issue. But for the former, does it really matter?
At least in openSUSE, we always build with gcc -g and then later strip debug symbols into separate debuginfo files. This leaves a unique hash in the original file and that makes them vary if the build path changes.
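For what it's worth, both toolchains have flags to remap the embedded build directory to a fixed string; a minimal sketch (the /home/user/src path is just an illustration):

    # map the absolute build directory to a fixed prefix in debug info and embedded paths
    gcc -g -ffile-prefix-map=/home/user/src=/build -o demo demo.c
    rustc -g --remap-path-prefix /home/user/src=/build -o demo demo.rs

    # for cargo projects the same rustc flag can be passed via RUSTFLAGS
    RUSTFLAGS="--remap-path-prefix $PWD=/build" cargo build --release

I believe several distros already wire flags like these into their default build flags for exactly this reason.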
> Build farms are also important for release management - the production of software releases - which must be an automatic process to ensure reproducibility of releases, which is in turn important for software maintenance and support.
Eelco's thesis (from 2006) also has this as the first bullet-point in its conclusion:
> The purely functional deployment model implemented in Nix and the cryptographic hashing scheme of the Nix store in particular give us important features that are lacking in most deployment systems, such as complete dependencies, complete deployment, side-by-side deployment, atomic upgrades and rollbacks, transparent source/binary deployment and reproducibility (see Section 1.5).
That's somewhat uncharitable. patchelf, for example, is one tool developed by NixOS which is widely used for reproducible build efforts. (although I don't know concretely if Debian uses it today)
patchelf is not really widely used for solving reproducible-builds issues. It's made for rewriting RPATHs, which is essential for NixOS, but not something you would see in other distributions except when someone needs to work around poor upstream decisions.
tl;dr: Debian's work is very important here, but NixOS' reproducibility aims are more general than Debian's and began more than 8 years earlier
Despite the fact that Debian (as a project) has shouldered far more of the work with upstream projects to make bit-identical reproducibility possible at build time, Debian (as a distro) doesn't have a design that makes this kind of reproducibility as feasible, practical, or robust at the level of a whole system or disk image in the way that NixOS has achieved here. To quote the Debian project itself[0]:
> Reproducible builds of Debian as a whole is still not a reality, though individual reproducible builds of packages are possible and being done. So while we are making very good progress, it is a stretch to say that Debian is reproducible.
Beyond the fact that some packages still have issues upstream and the basic technical problem of versioning (i.e., apt fetching binaries from online archives in a stateful way) Debian additionally struggles with an extremely heterogeneous and manual process of acquiring and uploading source packages[1]. Debian doesn't even have the resources to construct a disk image where the version of every package is pinned, short of archiving all the binaries (which is how they do ‘reproducible’ ISO production now[2]). But pulling down all of the pre-built binaries for your distro isn't really ‘reproduction’ in the same sense as ‘reproduction’ in Debian's (package-level) reproducibility project.
Some points of comparison
• NixOS always fixes the whole dependency tree
• Debian requires a ‘snapshot’ repository to fix a dependency tree
• most NixOS packages are updated through automatic tools and all the build recipes are stored under version control in one place
• Debian packages can be updated any way that suits their maintainers, and the build recipes/rules can be stored anywhere (it's the maintainer's job to keep them in version control if they want, then upload them to Debian repositories as source packages)
• Nix (transparently!) caches both build outputs and package sources, which means
◦ if the original source tarballs (e.g., on GitHub or SourceForge) are unavailable, Nix won't even notice, as long as it can pull them from the ‘binary’ cache
◦ if there is no cache of the build outputs, Nix will automatically fall back to fetching and unpacking the sources from the upstream mirror
• Debian's technical and community relationships to upstream source code are both less robust
◦ Debian requires manual management (creating and uploading) of complete source code archives in their own format[1]
◦ sometimes Debian infrastructure can't even reproduce upstream source code from their own archives[3]
◦ if Debian's source archives are unavailable for a package, there is just no way to build it (since source package archives also contain the build instructions, dependency metadata, etc.)
Actually reproducing a NixOS image is less manual and can be done without relying on any online Nix/NixOS-specific infrastructure, and this is a real advancement over what's possible with binary distros like Debian. (Some other binary distros, like openSUSE, also have centralized version control for package definitions.)
One way to conceptualize the qualitative differences in reproducibility outlined above is by examining the ways that Nix strengthens Debian's definition of reproducibility[4], which reads:
> A build is reproducible if given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts.
For Nix, the build instructions can simply encode all of what Debian calls the ‘relevant attributes of the build environment’:
> Relevant attributes of the build environment would usually include dependencies and their versions, build configuration flags and environment variables as far as they are used by the build system (eg. the locale). It is preferable to reduce this set of attributes.
And similarly, for NixOS, the acquisition of source code is folded into the build instructions and the ‘build environment’ (i.e., caches being available or GitHub not being down). So every Nix package that is reproducible at all is reproducible in a more general way than a reproducible Debian package.
And NixOS/Nix have had to do real work to make their systems reproducible in ways that Debian is not. Unlike much of Debian's work, its benefits can't really be shared with distros of a different design, but the converse is sometimes true as well. For example, Debian's work on rooting out non-determinism in package post-install hooks[5] is useless (and unnecessary) for NixOS, Guix, and Distri, since their packages don't have post-install hooks.
There are also lots of little ways that issues Debian has worked on either reflect the relative weakness of this notion of reproducibility (e.g., ‘All relevant information about the build environment should either be defined as part of the development process or recorded during the build process.’[6] is a way of saying ‘the build environment should be reproducible or merely documented’) or overcoming challenges that systems designed with reproducibility in mind from the start simply don't face.
At the same time, the Reproducible Builds website refers to publications[7] by former Nix developers who directly cite the original Nix paper from 2004, whereas Debian's effort didn't begin in earnest until 2013.[8]
Compared to the Nix community, Debian is huge. And they've leveraged their collective expertise and considerable volunteer force to do a ton of work toward reproducible builds which has benefited reproducibility for everyone, including NixOS. Doubtless every remotely attentive member of the Nix community is grateful for that work, which a small community like Nix's could hardly have taken up on its own. But Nix has been attacking reproducibility issues at a different level (reproducing build environments, source code, and whole systems (in terms of behavior, if not bits)) in a meaningful way since long before Debian's reproducible builds effort got going. And some of those efforts have informed the wider reproducible builds effort, just like some of Debian's efforts have not been applicable to every project in the F/OSS community which is interested in reproducible builds.
So: let's praise Debian loudly and often for their work here and be clear that NixOS' reproducibility couldn't be where it is today without that work... but let's also be clear that Nix/NixOS absolutely has blazed some trails in the territory of reproducibility— a terrain that both communities are still mapping out together. :)
Reproducibility is necessary, but unfortunately not sufficient, to stop a "Trusting Trust" attack. Nixpkgs still relies on a bootstrap tarball containing e.g. gcc and binutils, so theoretically such an attack could trace its lineage back to the original bootstrap tarball, if it was built with a compromised toolchain.
Indeed, and with the work done by Guix and the Reproducible Builds project we do have a real-world example of diverse double compilation which is not just a toy example utilizing the GNU Mes C compiler.
Projects like GNU Mes are part of the Bootstrappable Builds effort[0]. Another great achievement in that area is the live-bootstrap project, which has automated a build pipeline that goes from a minimal binary seed up to tinycc then gcc 4 and beyond.[1]
I feel the need to point out that the "Bootstrappable Builds" project is a working group from the Reproducible Builds project that was interested in the next step beyond reproducing binaries. Obviously this project has seen the most effort from Guix :)
The GNU Mes C experiment mentioned above was also conducted during the 2019 Reproducible Builds summit in Marrakesh.
In principle, diverse double-compiling merely increases the number of compilers the adversary needs to subvert. There are obvious practical concerns, of course, but frankly this raises the bar less than maintaining the backdoor across future versions of the same compiler did in the first place, since at least backdooring multiple contemporary compilers doesn't rely on guessing, well ahead of time, what change future people are going to make.
Critically, it shouldn't be taken as a demonstration that the toolchain is trustworthy unless you trust whoever's picking the compilers! This kind of ruins approaches based on having any particular outside organization certify certain compilers as "trusted".
It is an uphill effort for an adversary to actually do this. While theoretically a very well-informed adversary might get it right the first time, human adversaries are unlikely to, and their resources are large but far from infinite.
Your entire effort is potentially brought down by someone making a change in a way you didn't expect and someone going "huh, that's funny..."
Quite frankly, I'm surprised that this hasn't come up multiple times in the course of getting to NixOS etc. The attacks are easy to hide and hard to attribute.
Programs built by different compilers aren't generally bit-for-bit identical, e.g. after `gcc -o a run-of-the-mill.c` and `clang -o b run-of-the-mill.c` we shouldn't expect `cmp a b` to come back empty.
However, the behaviour of programs built by different compilers should be the same. Run-of-the-mill programs could use this as part of a test suite, for example; but diverse double compilation goes a step further:
We build compiler A using several different compilers X, Y, Z; then use those binaries A-built-with-X, A-built-with-Y, A-built-with-Z to compile A. The binaries A-built-with-(A-built-with-X), A-built-with-(A-built-with-Y), A-built-with-(A-built-with-Z) should all be identical. Hence for 'fully countering trusting trust through diverse double-compiling', we must compile compilers: https://dwheeler.com/trusting-trust/
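A rough sketch of that check in shell (the build commands and compiler names are hypothetical placeholders):

    # stage 1: build compiler A's sources with three unrelated compilers X, Y, Z
    ./build-A.sh --cc=X --out=A_X
    ./build-A.sh --cc=Y --out=A_Y
    ./build-A.sh --cc=Z --out=A_Z

    # stage 2: rebuild A from the same sources with each stage-1 binary
    ./build-A.sh --cc=./A_X --out=A2_X
    ./build-A.sh --cc=./A_Y --out=A2_Y
    ./build-A.sh --cc=./A_Z --out=A2_Z

    # if A's build is deterministic, the stage-2 outputs must be bit-identical
    sha256sum A2_X A2_Y A2_Z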
Actually, being able to build projects from GitHub much more easily is the sole reason why I'm currently using Arch as my main OS.
Building a project is just a shell script with a couple of defined functions. Quite literally.
I really admire NixOS's philosophy of pushing the boundaries as a distro where everything, including configurations and modifications, can be done in a reproducible manner. They're basically trying to automate the review process down the line, which is absurdly complex as a challenge.
And given that stability and desktop integration improve over time, I really think that Nix has the potential to be the base for easily forkable distributions. Building a live/bootable distro will be so much easier, as everything is just a set of configuration files anyway.
This is a slightly different thing. Nix and NixOS are trying to solve multiple things, and that's why it might be a bit confusing.
Many people don't realize this, but if you and I each grab, for example, the mentioned project from GitHub and compile it on our own machines, we get different files (they'll work the same, but they won't be exactly the same).
Even if we use the same dependencies, we will still get different files, because maybe you used a slightly different version of the compiler, or maybe those dependencies were themselves compiled with different dependencies or compilers. Maybe the project inserts a date while building, or pulls in some file. There are a million ways we could end up with different files.
The goal here is to get bit-for-bit identical files, and it's like a Holy Grail in this area. NixOS just appears to have achieved that, and all packages that come with the system are now fully reproducible.
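Nix even has a built-in way to test this for a single package on your own machine; a minimal sketch (hello is just an example attribute):

    # build a package, then rebuild it from scratch and compare the outputs bit for bit;
    # --check fails loudly if the second build differs from the first
    nix-build '<nixpkgs>' -A hello
    nix-build '<nixpkgs>' -A hello --check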
A rich source of non-reproducibility is non-determinism introduced by parallel building.
Preserving parallel execution, but arriving at deterministic outputs, is an interesting and ongoing challenge. With a rich mathematical structure, too.
> and 6mo later every computer ... gets ransomwared.
I'm really surprised such an attack hasn't happened already. It seems so trivial for a determined attacker to take over an opensource project (plenty of very popular projects have just a single jaded maintainer).
The malicious compiler could inject an extra timed event into the main loop for the time the attack is scheduled to begin, but only if it's >3 hours away, which simply retrieves a URL and executes whatever is received.
Detecting this by chance is highly unlikely - because to find it, someone would have to have their clock set months ahead, be running the software for many days, and be monitoring the network.
That code is probably only a few hundred bytes, so it probably won't be noticed in any disassembly, and is only executed once, so probably won't show up in debugging sessions or cpu profiling.
It just baffles me that this hasn't been done already!
> I'm really surprised such an attack hasn't happened already.
If you count npm packages, this has happened quite a few times already. People (who don't understand security very well) seem to be migrating to Python now.
Unless you are going to be the equivalent of a full-time maintainer doing code review for every piece of software you use, you need to trust other software maintainers, reproducible builds or not. Considering this is Linux, and not even Linus can deeply review every change in just the kernel anymore, that philosophy can't apply to meaningfully large software like NixOS.
That's too black-and-white. Being able to reproduce stuff makes some kinds of attacks entirely uninteresting, because malicious changes can be traced back, which is what many types of attackers do not want. Debian, or the Linux kernel, for example, are not fool-proof, but both are in practice quite safe to work with.
Who are you going to trace it back to if not the maintainer anyway? If the delivery method, then why is the delivery of the source from the maintainer inherently any safer?
No, it is not always the maintainer. Imagine you download a binary software package via HTTPS. In theory, the integrity of the download is protected by the server certificate. However, it is possible that certificates get hacked, get stolen, or that nation states force CAs to give out back doors. In that case, your download could have been changed on the fly with arbitrary alterations. Reproducible builds make it possible to detect such changes.
Same as when you download the source instead of the binary and see that it reproducibly builds the backdoored binary. And at this point we're back to "Build from source. This will always be a deeply niche thing to do. It's slow, inconvenient, and inaccessible except to nerds." anyway.
It's not that reproducible builds provide zero value; it's that they don't truly solve the trust problem as initially stated. They also have non-security value to boot, which is often understated compared to the security value, IMO.
I guess reproducible builds solve some of the problems in the same way TLS/SSL solves some of the problems.
Most of the world is happy enough with the soft guarantee of: “This is _probably_ your bank’s real website. Unless a nation state is misusing their control over state owned certificate authorities, or GlobalSign or LetsEncrypt or whoever has been p0wned.”
Expecting binary black and white solutions to trust problems isn’t all that useful, in my opinion. Often providing 50% more “value” in trust compared to the status quo is extremely valuable in the bigger picture.
Reproducible builds solve many security problems, for sure, but the problems they solve in no way help you if the maintainer is not altruistic or is bad at security, as originally stated. They help tell you whether the maintainer's toolchain was compromised, and they do it AFTER the payload is delivered, with a payload you built yourself rather than one made by the maintainer anyway. They don't even tell you the transport/hosting wasn't compromised, unless you can somehow get a copy of the source used to compile from somewhere other than the maintainer directly, since the transport/hosting for the source they maintain could be compromised as well.
Solving that singular attack vector in the delivery chain does nothing to remove the need to trust the altruism and self-interest of maintainers. A good thing™? Absolutely, along with the other non-security benefits, but it has nothing to do with needing to trust maintainers or be in the niche that reviews source code when automatic updates come along, as originally sold.
> but the problems they solve in no way help you if the maintainer is not altruistic or is bad at security, as originally stated.
That same edge case applies to your bank too. Pinned TLS certs or pre-shared keys might help against "BadGuys(tm)", but you're still screwed if your bank decides to keep your money. (s/bank/online crypto wallet/ for real-world examples there...)
The question isn't whether they're perfect, nor whether they prevent everything. But they do help a person who suspects something is up rule certain things in and out, which increases the chances that the weak link can be found and eliminated.
If you have a fair suspicion that something is up and you discover that when you compile reproduceable-package you get a different output than when you download a prebuilt reproduceable-package, you've now got something to work with.
Your observation that they don't truly solve the trust problem is true. But it's somewhat beside the point. It is better to be better off.
Reproducible builds still help a lot with security. For example, they let you shift build latency around.
Eg suppose you have a software package X, available both as a binary and in source.
With reproducible builds, you can start distributing the binary to your fleet of computers, while at the same time you are kicking off the build process yourself.
If the result of your own build is the same as the binary you got, you can give the command to start using it. (Otherwise, you quarantine the downloaded binary, and ring some alarm bells.)
Similarly, you can re-build some random sample of packages locally, just to double-check, and report.
If most debian users were to do something like that, any tampering with the debian repositories would be quickly detected.
(Having a few malicious users wouldn't hurt this strategy much; they can only insert noise into the system, not give you false answers that you trust.)
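A rough sketch of that workflow (the URL and file names are placeholders):

    # start pushing the vendor's binary to the fleet, but don't activate it yet
    curl -O https://example.org/X-1.2.3.bin

    # meanwhile, build the same release from source yourself
    ./build-X-from-source.sh --version 1.2.3 --out X-local.bin

    # bit-for-bit comparison decides whether the fleet may switch over
    if cmp -s X-1.2.3.bin X-local.bin; then
        echo "match: tell the fleet to activate the update"
    else
        echo "MISMATCH: quarantine the download and ring the alarm bells"
    fi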
Even if the original attack happened upstream, if the upstreamed piece of software was pinned via git, then it'd be trivial to bisect the upstream project to find the culprit.
This is great if you are looking at attributing blame. Not so great if you are trying to prevent all the world's computers from getting owned...
I'd imagine that if I were looking at causing worldwide chaos, I'd love nothing better than getting into the toolchain in a way that I could later utilise on a widespread basis.
At that point I would have achieved my aims and if that means I've burnt a few people along the way, so be it, I'm a bad guy, the damage has been done, the objective met.
You can't solve this problem without having a full history of code to inspect (unless you are decompiling); reproducibility is the first step and bootstrappability is the second. Then we refine the toolchains and review processes to ensure high-impact code is properly scrutinized.
What we can't do is throw our hands up and say anyone who compromises the toolchain deep enough is just allowed to win. It will happen at some point if we don't put the right barriers in place.
It's the first step of a long journey, but it is a step we should be taking.
https://github.com/fosslinux/live-bootstrap is another approach, bootstrapping from a tiny binary seed that you could hand-assemble and type in as hex. But it doesn't address the dependency on the underlying OS being trustworthy.
There is stage0 by Jeremiah Orians that is designed to be able to bootstrap on hardware that can be built from transistors. Currently it mostly runs in a small VM process that is somewhat harder to subvert.
Reproducibility is what allows you to rely on other maintainers' reviews. Without reproducibility, you can't be certain that what you're running has been audited at all.
It's true that no single person can audit their entire dependency tree. But many eyes make all bugs shallow.
No. I can review 0.1% of the code and verify that it compiles correctly and then let another 999 people review their own portion. It only takes one person to find a bit of malicious code, we don’t all need to review every single line.
You are misunderstanding what I am saying. I am saying that it only takes one person who finds a vulnerability to disclose it, to a first approximation. Realistically it’s probably closer to 2-3 since the first might be working for the NSA, the CCP, etc. I am making no arguments about what amount of effort it takes to find a vulnerability, just talking about how not every single user of a piece of code needs to verify it.
That only works if you coordinate. With even more people, you can pick randomly and be relatively sure you've read it all, but I posit that 1) you don't pick randomly, you pick a part that is accessible or interesting to you (and therefore probably others) and 2) reading code locally is not sufficient to find bugs or backdoors in the whole.
I actually wonder if it’s possible to write code at such a macro level as to obfuscate, say, a keylogger in a huge codebase such that reviewing just a single module/unit would not reveal that something bad is going on.
Depends on how complicated the project itself is. A simple structure with the bare minimum of side-effects (think, functional programming) would make this effort harder.
Supply chain attacks are definitely important to deal with, but defense-in-depth saves us in the end. Even if a postgres container is backdoored, if the admins put postgres by itself in a network with no ingress or egress except the webserver querying it, an attack on the database itself would be very difficult. If on the other hand, the database is run on untrusted networks, and sensitive data kept on it... yeah, they're boned.
In the case of a supply chain attack, you don't even need ingress or egress.
Say the postgres binary or image is set to encrypt the data on a certain date. Then it asks you to pay X ZEC to a shielded address to get your decryption key. This would work even if the actual database was airgapped.
Building from source doesn't have to be inaccessible, if the build tooling around it is strong. Modern compiled languages like Go (or modern toolchains on legacy languages like vcpkg) have a convention of building everything possible from source.
So at least for software libraries, building from source is definitely viable. For end-user applications it's another story though; I doubt we will ever be at a point where building your own browser from source makes sense...
Building from source also doesn’t buy you very much, if you haven’t inspected/audited the source.
The upthread hypothetical of a compromised package manager equally applies to a compromised source repo.
_Maybe_ you always check the hashes? _Maybe_ you always get the hashes from a different place to the code? _Maybe_ the hypothetical attacker couldn’t replace both the code you download and the hash you use to check it?
(And as Ken pointed out decades ago, maybe the attacker didn’t fuck with your compiler so you had lost before you even started.)
>The norm today is auto-updating, pre-built software.
Only if you define "norm" as what's prevalent in consumer electronics and phones. Certainly, if you go by numbers, it's more common than anything else.
That's not due to choice, though, it's because of the desires of corporations for ever more extensive control of their revenue streams.
If the package maintainer's build pipeline is compromised (eg. Solarwinds), you are unlikely to be affected if you build from reviewed source yourself.
I genuinely believe in spending resources on issues where the ROI is positive.
So far, exploits on FOSS kind of prove the point: not everyone is using Gentoo and reading every line of code in their emerged packages, let alone adopting similar computing models.
Now if we are talking about driving the whole industry to a place where security bugs, caused by using languages like C, where code reviews cannot save us unless they are done by ISO C language lawyers and experts in UB compiler optimizations, are punished as heavily as construction companies are punished for a fallen bridge, then that would be interesting.
> I genuinely believe in spending resources on issues where the ROI is positive.
How are you measuring the ROI of security efforts inside an OSS distro like debian or nixos? The effort in such orgs is freely given, so nobody knows how much it costs. And how would you calculate the return on attacks that have been prevented? Even if an attack wasn't prevented you don't know how much it cost, and you might not even know if it happened (or if it happened due to a lapse in debian.)
>So far, exploits on FOSS kind of prove the point: not everyone is using Gentoo and reading every line of code in their emerged packages, let alone adopting similar computing models.
Reproducible builds attempt to mitigate a very specific type of attack, not all attacks in general. That is, they focus on a specific threat model and counter that, nothing else. They're not a cure for cancer either.
>Now if we are talking about driving the whole industry to a place where security bugs, caused by using languages like C, where code reviews cannot save us unless they are done by ISO C language lawyers and experts in UB compiler optimizations, are punished as heavily as construction companies are punished for a fallen bridge, then that would be interesting.
This is just a word salad of red herrings. Different people can work on different stuff at the same time.
> I can't come up with a single benefit to security from reproducible builds.
It is a means of detecting a compromised supply chain. If people rebuilding a distro cannot get the same hash as the binaries shipped by the distributor, then the distributor's infrastructure has likely been compromised.
How does this work in practice? The distro is owned, so where are you getting the hash from? I mean, specifically, what does the attacker have control of, and how does a reproducible build help me stop them?
The idea is that multiple independent builders build the same distro. You expect all of them to have the same final hash.
This doesn't help against the sources being owned, but it helps about build machines being owned.
Accountability for source integrity is in theory provided by the source control system. Accountability for the build machine integrity can be provided by reproducible builds.
To answer your specific questions: The attacker has access to the distro's build servers and is packaging and shipping altered binaries that do not correspond to the sources but instead contain added malware.
Reproducible builds allow third parties to also build binaries from the same sources and once multiple third parties achieve consensus about the build output, it becomes apparent that the distro's build infrastructure could be compromised.
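Concretely, each rebuilder only needs to publish a hash; a sketch with hypothetical paths and scripts:

    # an independent rebuilder: build the package from the distro's published source
    ./rebuild-from-source.sh foo-1.0 --out rebuilt/foo-1.0.pkg
    sha256sum rebuilt/foo-1.0.pkg        # publish this hash somewhere others can read it

    # anyone can then compare the distro's shipped binary against the rebuilders' hashes
    sha256sum official/foo-1.0.pkg

    # and on a mismatch, diffoscope shows exactly where the two artifacts diverge
    diffoscope official/foo-1.0.pkg rebuilt/foo-1.0.pkg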
OK so a build machine is owned and we have a sort of consensus for trusted builders, and if there's a consensus mismatch we know something's up.
I suppose that's reasonable. Sounds like reproducible builds are a big step towards that, though clearly this requires quite a lot of infrastructure support beyond just that.
This is great! The one fly in the ointment, pardon, is that Nix is a bit lax about trusting proprietary and binary-only stuff. It would be great if there were a FLOSS-only core system for NixOS which would be fully transparent.
It's the pragmatic thing. I wouldn't use NixOS if I wasn't able to use it on a modern 16-core desktop. I don't think there's a performant, 100% FLOSS-compatible computer that wouldn't make me want to gouge my eyes out with a rusty spoon when building stuff for ARM.
Talos has 44-core/176-thread server options that can take 2 TB of DDR4 and are FSF-certified. The board firmware is also open and has reproducible builds.
Talos has 8-core desktop options as well; this is just an example of how far you can take FLOSS hardware. Not that I consider a 16-core x86 desktop "consumer-grade" in the first place (speaking as a 5950X owner).
Probably not fit for replacing Grandma's budget PC but then again grandma probably isn't worried about the ARM cross compile performance of their machine running NixOS either.
Thanks, I was legitimately unaware of this option. That does smash my argument, but I'm not likely to be using a system like that anytime soon due to cost concerns mostly.
And it’s not just hardware, there is a useful limit on purity of licenses. In many cases only proprietary programs can do the work at all, or orders of magnitudes better.
I don’t have the resources to audit every component of my system. I favour enterprise distros who audit code which ends up in their repos and avoid pip, npm, etc. but there are some glaring trade offs on both productivity and scalability.
The problem is unmaintainability; I can’t imagine it’d be easier for medium-sized teams where security isn’t a priority, either.
Not just time, IME. Also: 1. highly resource-intensive, e.g., you cannot compile on small-form-factor computers (it's easier for me to compile a kernel than a "modern" browser), and 2. brittle.
Unfortunately, it's easy to break a lot of builds by things such as deciding not to install to /usr/local, or by building on a Mac. Pushing publishers to practices that aid reproducible builds would help both sides.
I'd love to try building NetBSD, btw, I must try that!
This is a big deal. Congratulations to all involved.
In software, complexity naturally increases over time, and dependencies and interactions between components become impossible to reason about. Eventually this complexity causes the software to collapse under its own weight.
Truly reproducible builds (such as NixOS and Nixpkgs) provide us with islands of "determinism" which can be taken as true invariants. This enables us to build more systems and software on top of deterministic foundations that can be reproduced by others.
This reproducibility also enables powerful things like decentralized / distributed trust. Different third-parties can build the same software and compare the results. If they differ, it could indicate one of the sources has been compromised. See Trustix https://github.com/tweag/trustix
I don't see a single comment doubting the value of reproducibility, so I'll be the resident skeptic :)
I think build reproducibility is a cargo cult. The website says reproducibility can reduce the risk of developers being threatened or bribed to backdoor their software, but that is just ridiculous. Developers have a perfect method for making their own software malicious: bugdoors. A bugdoor (bug + backdoor) is a deliberately introduced "vulnerability" that the vendor can "exploit" when they want backdoor access. If the bug is ever discovered you simply issue a patch and say it was a mistake, it's perfectly deniable. It's not unusual for major vendors to patch critical vulnerabilities every month, there is zero penalty for doing this.
The existence of bugdoors means you have to trust the vendor who provided the source code, there is no way around this.
You have to trust the developer, but in theory, reproducible builds could be used to convince yourself their build server hasn't been hacked. This isn't really necessary or useful, you can already produce a trustworthy binary by just building the source code yourself. You still have to trust the vendor to keep hackers off everything else though!
Okay, but building software is tedious, and for some reason you are particularly concerned about build servers being hacked. Perhaps you will nominate a dozen different organizations that will all build the code, and make this a consensus system. If they all agree, then you can be sure enough that the binaries were built with a trustworthy toolchain. A modest improvement in theory, but that introduces a whole bunch of new crazy problems.
You can't just pick one or two consensus servers, because then an attacker can stop you getting updates by compromising any one of them. You will have to do something like choose a lot of servers, and only require 51% to agree.
Now, imagine a contentious update like adopting a cryptocurrency fork, or switching to systemd (haha). If the server operators rebel, they can effectively veto a change the vendor wants to make. Perhaps vendors will implement a killswitch that allows them to have the final say, or perhaps they operate all the consensus build servers themselves.
The problem is now you've either just replaced build servers with killswitches, or just replicated the same potentially-compromised buildserver.
I wrote a blog post about this a while ago, although I should update it at some point.
Most people here are debating you on the security angle, but in the case of Nix (and Guix) there is another important angle - reproducible builds make a content-addressed store possible.
In Nix, the store is traditionally addressed by the hash of the derivation (the recipe that builds the package). For example, `lr96h...` in a store path like `/nix/store/lr96h...-coreutils-<version>`
is the hash of the (normalized) derivation that was used to build coreutils. Since the derivation includes build inputs, changing either the derivation for coreutils itself or one of its inputs (dependencies) results in a different hash and a rebuild of coreutils.
This also means that if somebody changes the derivation of coreutils, every package that depends on coreutils will be rebuilt, even if this change does not result in a different output path (compiled package).
This is being addressed by the new work on the content-addressed Nix store (although content addressing was already discussed in Eelco Dolstra's PhD thesis about Nix). In the content-addressed store, the hash in the path, such as the one above, is a hash of the output path (the built package), rather than a hash of the normalized derivation. This means that if the derivation of coreutils is changed in such a way that it does not change the output path, none of the packages that depend on coreutils are rebuilt.
However, this only works reliably with reproducible builds, because if there is non-determinism in the build, how do you know whether the output path changed as a result of changing a derivation or as a result of uninteresting non-determinism? (The output hash would change in both cases.)
Where the dependency chain is long, this substantially reduces build work during development too.
I'd guess that more than half of the invocations of gcc done by Make for example end up producing the exact same bit for bit output as some previous invocation.
I would point out that is literally what ccache (and Google goma) does, but doesn't require deterministic builds. Instead, it records hashes of preprocessed input and compiler commandlines.
They don't make any security claims about this, it's just for speeding up builds.
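Roughly, assuming a Makefile that honours $CC:

    # ccache keys its cache on a hash of the preprocessed translation unit plus the
    # compiler identity and command line, so it never needs bit-identical compiler output
    export CC="ccache gcc"
    make clean && make    # first build populates the cache
    make clean && make    # second build is served almost entirely from the cache
    ccache -s             # show hit/miss statistics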
What we currently do --- hashing inputs --- is the same as the ccache way. We just don't sandbox at that granularity yet.
What we want is to hash outputs. Say I replace 1 + 2 with 0 + 3. That will cause ccache to rebuild, but we don't want downstream stuff to also be rebuilt. C linking within a package is nicely parallelizable, but in the general case there are longer dependency chains, and that's where this sort of thing starts to matter.
Another non-security angle: doesn't computer science also face a kind of replicability crisis related to the ability to acquire and compile source code associated with some published papers? Reproducible builds directly address that.
And it seems like even when that problem is resolved for the empirical component of computer science, bit-identical reproducibility could be valuable in case binaries are never submitted or distributed. This NixOS release is in a way a benchmark for how far we can currently get on a 'useful' system with that kind of reproducibility.
I don't really have any complaints about using deterministic builds for non-security reasons, but the number one claim most proponents make is that it somehow prevents backdoors. Literally the first claim on reproducible-builds.org is that build determinism will prevent threats of violence and blackmail.
Honestly I think the biggest benefit of reproducibility is just debuggability. We both check out the same git repo and build it, we can later hash the binary and compare the hashes to know we're running the exact same code.
On security, if you really care about compromised build servers you might as well just build from source yourself. I think reproducibility might matter most in systems where sideloading is hard/impossible, like app stores, but I'm not familiar with the current state of the art in terms of iOS reproducible builds and checking them.
Reproducibility is an option to mitigate backdoors and incentivize developers to operate openly. It's no panacea, but it makes a lot of sense in open-source projects where individual actors are going to represent your largest threat vector. That way, it becomes a lot harder to push an infected blob to main, even if it still is technically possible. Hashes are also "technically pointless", but we still implement them liberally to quickly account for data integrity.
> Reproducibility is technically pointless, because you still have to trust the developer, and they can still add backdoors.
Builder != developer - and with reproducible builds, you no longer need to trust the builder. CI is commonly used for the final distributable builds and you can't always trust the CI server. Even if you do, many rely on third-party things like docker images - if the base build image gets compromised, code could trivially be injected into builds running on it, and without reproducible builds, that would not be detectable.
As a developer, it would be quite reassuring to build my binary (which I already do for testing) and compare the hash with the one from the CI server to confirm nothing has been tampered with. As a bonus, distro maintainers who have their own CI can also check against my hashes to verify their build systems aren't doing something fishy (malicious or otherwise).
> As a developer, it would be quite reassuring to build my binary (which I already do for testing) and compare the hash with the one from the CI server to confirm nothing has been tampered with.
That makes sense! However, this is not a good argument for reproducible builds, because you can already do that today.
You already have to build a trusted binary locally for testing right? You're dreaming of being able to compare that against the untrusted binary so that you can make sure it's a trusted binary too - but you already have a trusted binary!
Okay - but it's a hassle, you don't want to have to do that, right? Too bad - reproducible builds only work if someone reproduces them. You're still going to have to replicate it somewhere you trust, so you gained practically nothing.
> With the reproducible build, you can start using the untrusted binary while you are still building your trusted one.
That's not how it works, you have to reproduce it before it becomes trusted.
> You can also have ten people on the internet verify the untrusted binary.
Sure, then we have to build a complex consensus system that introduces a bunch of unsolved problems. My opinion is that this just isn't worth it, there is practically nothing to gain and it's really really hard.
> That's not how it works, you have to reproduce it before it becomes trusted.
Eh, there's stuff you can do with software before you trust it. Eg you can start pressing the CDs or distributing the data to your servers. Just don't execute it, yet.
> Sure, then we have to build a complex consensus system that introduces a bunch of unsolved problems. My opinion is that this just isn't worth it, there is practically nothing to gain and it's really really hard.
It's the same informal system that keeps eg debian or the Linux kernel secure currently:
People don't do kernel reviews themselves. They just use the official kernel, and when someone finds a bug (or spots otherwise bad code), they notify the community.
Similar with reproducible builds: most normal people will just use the builds from their distro's server, but independent people can do 'reviews' by running builds.
If ever a build doesn't reproduce, that'll be a loud failure. People will complain and investigate.
Reproducible builds in this scenario don't protect you from untrusted code upfront, but they make sure you'll know when you have been attacked.
> People don't do kernel reviews themselves. They just use the official kernel, and when someone finds a bug (or spots otherwise bad code), they notify the community.
There's a big difference here. When a vulnerability is found in the Linux kernel, that doesn't mean that you were compromised.
If a build was found to be malicious, then you definitely were compromised and it's little solace that it was discovered after the fact. This is why package managers check the deb/rpm signature before installing the software, not after.
Reproducibility means you don't have to worry that the developer might have a backdoored toolchain (which also means that they can't pretend that a malicious toolchain added the malicious code without their knowledge).
A talented developer might still be able to create a bugdoor which gets past code review, but that takes more effort and skill than just putting the malicious code into a local checkout and then saying "How did that get there?".
I think the workflow you're proposing is to take some trusted source code, then compile it to make a trusted binary. Now compare the trusted binary to the untrusted binary provided by the vendor - If they're the same - then it must have been made by an uncompromised toolchain.
That does require reproducible builds, but here is how to do it without reproducible builds:
Take the trusted source code, then compile it to make a trusted binary. Now put the untrusted binary in the trash, cause you already have a trusted binary :)
Is it technically pointless if you view it as a check on your own build, rather than a check on the work of others?
You are obviously familiar with Bazel/Blaze etc. Wouldn't reproducibility be necessary for those systems to work well most of the time? I can think of exceptions (like PGO), but it seems useful to produce at least some binaries this way. Also covered in this: https://security.googleblog.com/2021/06/introducing-slsa-end...
> Is it technically pointless if you view it as a check on your own build, rather than a check on the work of others?
That depends, I think it's difficult and mostly still pointless. I wrote about this a bit in the blog post I linked to. It's a big trade off, for questionable benefit.
> Wouldn't reproducibility be necessary for those systems to work well most of the time?
Yes, there are definitely some good non-security reasons to want deterministic builds. My gripe is only with the security arguments, like claims it can reduce threats of violence against developers (!?!).
> What isn’t clear is what benefit the reproducibility provides. The only way to verify that the untrusted binary is bit-for-bit identical to the binary that would be produced by building the source code, is to produce your own trusted binary first and then compare it. At that point you already have a trusted binary you can use, so what value did reproducible builds provide?
That's not the interesting case. The interesting case is when the untrusted binary doesn't match the binary produced by building the source code. Assuming that the untrusted binary has been signed by its build system, you now have proof that the build system is misbehaving. And that proof can be distributed and reproduced by everyone else.
Once Debian is fully reproducible, I expect several organizations (universities, other Linux distribution vendors, governments, etc) to silently rebuild every single Debian package, and compare the result with the Debian binaries; if they find any mismatch, they can announce it publicly (with proof), so that the whole world (starting with the Debian project itself) will know that there's something wrong. This does not need any complicated consensus mechanism.
> More often, attackers want signing keys so they can sign their own binaries, steal proprietary source code, inject malicious code into source code tarballs, or malicious patches into source repositories.
In Debian, compromising the build server is not enough to inject malicious code into source code tarballs or patches, since the source code is also signed by the package maintainer. Unexpected changes on which maintainer signed the source code for a given package could be flagged as suspicious.
The only attack left from that list, at least for Debian, would be for the attacker to sign their own top-level Release file (on Debian, individual packages are not signed, instead a file containing the hash of a file containing the hash of the package is what is signed). But the attacker cannot distribute the resulting compromised packages to everyone, since those who rebuild and compare every package would notice it not matching the corresponding source code, and warn everyone else.
> I expect several organizations (universities, other Linux distribution vendors, governments, etc) to silently rebuild every single Debian package, and compare the result with the Debian binaries
This has been happening for many years. A lot of large companies that care about security and maintainability sign big contracts with tech companies that often include indemnification.
>Developers have a perfect method for making their own software malicious: bugdoors.
I think rather than malicious developers the focus is on malicious build machines. How many things are built solely via CI these days, on machines that nobody has ever seen, using docker images that nobody has validated?
It's much easier to imagine a malicious provider (as in Sourceforge bundling in adware) than malicious developers, I think.
But yes, you're right that reproducible builds don't remove the need to trust the source.
>You have to trust the developer, but in theory, reproducible builds could be used to convince yourself their build server hasn't been hacked. This isn't really necessary or useful, you can already produce a trustworthy binary by just building the source code yourself.
This is pretty much all false though - not only the "just" part, as setting up a proper build environment is pretty non-trivial for many projects, and building everything from source is a task only the most dedicated Gentoomen would take up; you can also think of reproducible builds as a "litmus test". If you can, with reasonable accuracy, check whether a build machine is compromised at any time, you have a much greater base on which to trust it and its outputs. The benefits of having build machines probably shouldn't need explaining.
>You can't just pick one or two consensus servers, because then an attacker can stop you getting updates by compromising any one of them. You will have to do something like choose a lot of servers, and only require 51% to agree.
>...
>The problem is now you've either just replaced build servers with killswitches, or just replicated the same potentially-compromised buildserver.
I really don't understand this argument; compromised infrastructure probably shouldn't be a regular occurrence, and even if so, automated killswitches seem like the vastly more preferable option, no?
> I really don't understand this argument; compromised infrastructure probably shouldn't be a regular occurrence, and even if so, automated killswitches seem like the vastly more preferable option, no?
I'm pointing out how complex implementing reproducible builds is. It introduces a bunch of really hard unsolved problems that people are very handwavy about.
Who will do the reproducing? You say that users won't be able to do it. That makes sense, because if they could, then reproducible builds would be useless! However, you also say they will be able to check if a build server is compromised at any time. In order for both of those claims to be true we will have to design and build a complex consensus system operated by mutually untrusted volunteers. That's really hard, and seems like it provides a pretty negligible benefit.
IIUC Reproducible Builds guarantees that source is turned into an artifact in a consistent and unchanging way. So as long as the source doesn't change neither will the build.
If you're saying "reproducible builds are reproducible", then that is obviously true, but the question is what is the benefit?
Some people claim that the benefit is that there will be less incentive to threaten developers with violence, and I'm saying that's nonsense. If you cut through the nonsense, there are some modest claims that are true, but doing reproducible builds properly is very complicated and the benefit is negligible.
> If the server operators rebel, they can effectively veto a change the vendor wants to make.
How often do you think there will be a change so controversial that teams who have volunteered to secure the update system will start effectively carrying out a Denial of Service attack against all the users of that distro?
We also have to imagine that these malicious attestation nodes can easily be ignored by users just updating a config file, so the only thing the node operators could achieve by boycotting the attestation process is temporarily inconveniencing people who used to rely on them (which is not a great return on investment for the reputation they burn in doing this).
I don't know what reputation damage will happen, they're just third parties compiling code. There is no reputational damage for operating a malicious tor exit relay, why would this be different?
As I understand it, Tor does have a way of detecting whether an exit node is failing to connect users to their intended destination. (With TLS enforced, the only thing a malicious exit node could do is prevent valid connections).
In any case, I don't think anyone is proposing that the attestation nodes be run by random anonymous people on the internet. It would make more sense to have half a dozen or so teams running these nodes, with each team being known and trusted by the distro in question.
I'm not sure what the costs/requirements would be for running one of these nodes, but it might be possible for distros to each run a node dedicated to building each other's distros (or at least the packages that are pushed as security updates to stable releases).
Alternatively, individual developers that already work on a distro can offer to build packages on their own machines and contribute signed hashes to a log maintained by the distro itself.
This means everyone building NixOS will get the exact same binary, meaning you can now trust any source for it because you can verify the hash.
It’s a huge win compared to the current default distribution model of “just trust these 30 american entities that the software does what they say it does”.
This smaller bootstrap seed thing is a different problem from reproducible builds. nixpkgs does still have a pretty big initial TCB (aka. stage0) compared to Guix. But as far as I can tell NixOS has the upper hand in terms of how much can be built reproducibly (aka. the output hash matches across separate builds).
There's an issue for this[0]. Currently Nixpkgs relies on a 130 MB (!) uncompressed tarball, which is pretty big compared to Guix. It would be amazing to get it down to something like less than 1 KB with live-bootstrap.
Also, the way Nixpkgs is architected lets us experiment with more unusual ideas like a uutils-based stdenv[1] instead of GNU coreutils.
Bootstrapping from a very small binary core (I think 512 bytes) with an initial C compiler written in Scheme also has the advantage that the system can easily be ported to different hardware. Which is one major strength of the GNU projects and tools.
Not necessarily. Usually these very small cores end up being more architecture-specific than a stage0 consisting of gcc plus some other core packages. A good illustration of this is that Guix's work on bootstrap seed reduction has so far mostly been applied to i686/amd64, and not (at least not fully) to the other architectures they support.
Does this still matter if you can work your way up to a cross compiler, though? Do you actually need to go all the way down to ‘native’ hex monitors for a bunch of architectures or whatever?
For some reason, many compilers and build scripts have traditionally been written in a way that's not referentially transparent (a pure function from input to output). Unnecessary information like the time of the build, absolute path names of sources and intermediate files, usernames and hostnames often would find their way into build outputs. Compiling the same source on different machines or at different times would yield different results.
Reproducible builds avoid all this and always produce the same outputs given the same inputs. There's no good reason (that I can think of) why this shouldn't have been the case all along, but for a long time I guess it just wasn't seen as a priority.
The benefit of reproducible builds is that it's possible to verify that a distributed binary was definitely compiled from known source files and hasn't been tampered with, because you can recompile the program yourself and check that the result matches the binary distribution.
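A minimal sketch of that check (paths are placeholders; in practice you'd compare against a hash the vendor publishes or signs rather than downloading the whole binary):

```python
# Rebuild locally, hash both files, and compare: with a reproducible build the
# two digests must match bit for bit.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

official = sha256_of("downloads/foo-1.2.3-x86_64.tar.gz")    # vendor-distributed binary
rebuilt  = sha256_of("local-build/foo-1.2.3-x86_64.tar.gz")  # built from the same sources
print("match" if official == rebuilt else "mismatch: tampering or nondeterminism")
```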
> There's no good reason (that I can think of) why this shouldn't have been the case all along
Well, it's not like developers consciously thought "How can I make my build process as non-deterministic as possible?", it's just that by the time people started to become aware of the benefits of reproducibility, various forms of non-determinism had already crept in.
For example, someone writing an archiving tool would be completely right to think it is a useful feature to store the creation date of the archive in the archive's metadata. The idea that a user might want to force this value to instead be some fixed constant would only occur to someone later when they noticed that their packages were non-reproducible because of this.
But you're right; if the goal had been thought of from the start, there's no reason why every build tool wouldn't have supported this.
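For illustration, this is the kind of knob archiving code ends up needing. A small Python `tarfile` sketch that follows the `SOURCE_DATE_EPOCH` convention used by reproducible-builds tooling (the function names are made up):

```python
# Build a tar archive whose metadata does not depend on when or where it was
# built: timestamps are clamped to SOURCE_DATE_EPOCH, ownership is normalised,
# and members are added in sorted order.
import os
import tarfile

def normalise(info: tarfile.TarInfo) -> tarfile.TarInfo:
    info.mtime = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info

def make_archive(out_path: str, src_dir: str) -> None:
    # Plain tar on purpose: gzip would embed its own timestamp and need the same fix.
    with tarfile.open(out_path, "w") as tar:
        for name in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, name), arcname=name, filter=normalise)
```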
> The benefit of reproducible builds is that it's possible to verify that a distributed binary was definitely compiled from known source files and hasn't been tampered with, because you can recompile the program yourself and check that the result matches the binary distribution.
It's not just security. If a hash of the input sources maps directly to a hash of the output binaries, then you can automatically cache build artefacts by hash and get huge speedups when compiling stuff from scratch.
This was the primary motivation for Nix, since Nix does a whole lot of building from scratch and caching.
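A toy sketch of that idea (the cache location and function names are invented for illustration): hash the inputs, and if an artifact already exists under that key, skip the build entirely.

```python
# Input-addressed build cache: the key is a hash over all source files, so an
# unchanged input set maps straight to an already-built artifact.
import hashlib
import os
import shutil

CACHE_DIR = "/var/cache/toy-builds"  # illustrative location

def input_hash(source_files: list) -> str:
    h = hashlib.sha256()
    for path in sorted(source_files):        # stable order keeps the key deterministic
        h.update(path.encode() + b"\0")
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

def build_cached(source_files: list, do_build) -> str:
    key = input_hash(source_files)
    cached = os.path.join(CACHE_DIR, key)
    if not os.path.exists(cached):
        os.makedirs(CACHE_DIR, exist_ok=True)
        shutil.copy(do_build(source_files), cached)  # do_build returns the artifact path
    return cached
```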
>- Nix tooling was created 15 years ago exactly for this; Nix is made to make packages bit-for-bit rebuildable from scratch.
I don't think this is accurate?
Nix is about reproducing system behaviour, largely by capturing the dependency graph and replaying the build. But this doesn't entail bit-for-bit identical binaries. It very much sits in the same group as Docker and similar technologies. This is also how I read the original thesis from Eelco[0].
And well, claims like this always rub me the wrong way, since NixOS only really started using the term "reproducible builds" after Debian began its effort around 2015-2016[1], and only started its own reproducible-builds effort later. It also muddies the language, since people now talk about "reproducible builds" in terms of system behavior as well as bit-for-bit identical builds. The result has been that people talk about "verifiable builds" instead.
> There's no good reason (that I can think of) why this shouldn't have been the case all along
Determinism can decrease performance dramatically. Like concatenating items (say, object files into a library) in order is clearly more expensive in both time & space than processing them out of order. One requires you to store everything in memory and then sort them before you start doing any work, whereas the other one lets you do your work in a streaming fashion. Enforcing determinism can turn an O(1)-space/O(n)-time algorithm into an O(n)-space/O(n log n)-time one, increasing latency and decreasing throughput. You wouldn't take a performance hit like that without a good reason to justify it.
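The tradeoff being described, in miniature (a hypothetical concatenation step, not any particular linker or archiver):

```python
# Streaming vs deterministic concatenation: the first handles each item as it
# arrives; the second has to collect everything and sort before writing.
from typing import Iterable, Tuple

def concat_streaming(chunks: Iterable[bytes], out) -> None:
    for chunk in chunks:                       # O(1) extra space, arrival order
        out.write(chunk)

def concat_deterministic(named_chunks: Iterable[Tuple[str, bytes]], out) -> None:
    for _name, chunk in sorted(named_chunks):  # buffers and sorts by name first
        out.write(chunk)
```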
Being bit-for-bit reproducible means you could do fun things like distribute packages as just sources and a big blob of signatures, and you can still run only signed binaries.
The GCC developers in particular were hostile to such efforts for a long time, IIRC. (This is a non-trivial issue because randomized data structures exist and can be a good idea to use: treaps, universal hashes, etc. I’d guess it also pays for compiler heuristics to be randomized sometimes. Incremental compilation is much harder to achieve when you require bit-for-bit identical output. Even just stripping your compile paths from debug info is not entirely straightforward.)
The security benefit of things like stack canaries rests on them being random and not known beforehand, I guess. Otherwise stack-smashing malware could know to avoid them.
Wait, how is that relevant? Nothing says stack canaries have to use the same RNG as the main program, let alone the same seed, and there are cases such as this one where they probably shouldn’t, so it makes sense to separate them.
> Presumably, incremental compilation is only for development. For release, you would do a clean build, which would be reproducible.
I’d say that’s exactly the wrong approach: given how hard incremental anything is, it would make sense to insist on bit-exact output and then fuzz the everliving crap out of it until bit-exactness was reached. (The GCC maintainers do not agree.) But yes, you could do that. It’s not impossible to do reproducible builds with GCC 4.7 or whatever, it’s just intensely unpleasant, especially as a distro maintainer faced with yet another bespoke build system. (Saying that with all the self-awareness of a person making their own build system.)
> Just use the same paths.
I mean, sure, but then you have to build and debug in a chroot and waste half a day of your life figuring out how to do that and just generally feel stupid. And your debug info is still useless to anybody not using the exact same setup. Can’t we just embed relative paths instead, or even arbitrary prefixes if the code is coming from more than one place? In recent GCC versions we can: just chuck the right incantation into CPPFLAGS and you’re golden.
All of this is not really difficult except insofar as getting a large and complicated program to do anything is difficult. (Stares in the direction of the 17-year-old Firefox bug for XDG basedir support.) That’s why I said it wasn’t a GCC problem so much as a maintainer attitude problem.
GCC used to attempt certain optimizations (or more generally, choose different code-generation strategies) only if there was plenty of memory available. We discovered this in the course of designing Google's internal build system, which prizes reproducibility.
The code has to be changed so that things like system-specific paths, time of compilation, hardware, etc., don’t cause the compiled program to be unique to that computer (meaning compiling the same code on a different computer would give you a file that still works but has a different MD5 hash).
By being able to reproduce the file completely, down to identical MD5 hashes, you know you have the same file the creator has, and know with certainty that the file has not been tampered with.
The software doesn't suddenly become incompatible with CPU-specific optimisations (or many other compiler flags that change its output), but if you do so, you won't be able to reproduce the distribution binaries. Distributions don't enable CPU-specific optimisations anyway, since they want to be usable on more than one CPU model.
No, just that you need to avoid naively conflating the machine that is doing the compilation with the one that optimization is being performed for.
Concretely, you would need to keep track of and reproduce e.g. the `-march` flag value as part of your build input. If you wanted to optimize for multiple architectures, that would mean separate builds or a larger binary with function multi-versioning.
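In cache-key terms, that just means the flags get hashed alongside the sources; a toy sketch with illustrative flag and hash values:

```python
# Compiler flags become part of the build input, so builds for different -march
# values are tracked as distinct (but each individually reproducible) artifacts.
import hashlib

def build_key(source_hash: str, flags: list) -> str:
    h = hashlib.sha256(source_hash.encode())
    for flag in flags:                       # flags are inputs too
        h.update(b"\0" + flag.encode())
    return h.hexdigest()

generic = build_key("3f785a", ["-O2"])
tuned   = build_key("3f785a", ["-O2", "-march=znver3"])
assert generic != tuned  # same sources, different target, different artifact
```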
Nixpkgs contains the build/patch instructions for every package in NixOS.
If you want to compile any piece of software available in Nixpkgs, you can override its attributes (the inputs used to build it).
You can trivially have an operating system almost identical to your colleague's install, but override just one package to enable optimisations for a certain CPU. This does, however, mean you lose the transparent binary cache that you could otherwise use.
Exactly this method is used to configure the entire OS install! Your OS install is just another package that has some custom inputs set.
Likely it means that with the same input arguments the end result is bit-by-bit identical. (As I understand it, the problems were hard-to-control output elements. It was not enough to set the same args, set the same time, and use the same path and filesystem, because some things happened at different speeds, so they ended up happening at different relative elapsed times, so the outputs contained different timestamps, etc.)
There's a lot of problems with reproducible builds. Filesystem paths, timestamps, deterministic build order to say the least. This is a pretty great achievement and I'm looking forward to a non-minimal stable ISO.
I really liked how easy it was to create a custom ISO when I installed Nix. For once I had Dvorak as the default keyboard from the outset, neovim for editing, and the proprietary WiFi drivers I needed all from a minimal config file and `nix build`.
Some of the issues were really difficult to tackle, like the linux kernel generating random hashes.
The last mile was done by removing the use of ruby (which uses some random tmp directories) from the final image. Asciidoctor (ruby) was replaced with asciidoc (python).
Debian has been a major driver in making many pieces of software reproducible across every distribution; that Debian maintainers so often submit patches upstream and work directly to solve these issues is a big reason for this.
In other words: the work Debian has done absolutely set the stage for this to happen, and it would have taken much longer without them.
In general, Debian aims to upstream the changes they make to software. That allows all other distributions, including Nix, to profit from their work making software reproducible.
The trick with nvidia on Linux is to not expect that they will ever work on anything. If you want to be sure that stuff works, either don't buy Nvidia or use Windows.
I'm not familiar with the market the Jetson is in and what purposes it serves. From a quick Google, it seems to build boards for machine learning? If that's true, I'm pretty sure Google and Intel have products in that space, and I'm sure there's other brands I don't know of.
If Nvidia has its own distribution, it might well work for as long as it's willing to maintain the software because then they can tune their open source stuff to make it work with their proprietary drivers, the same way Apple is hiding their tensorflow code. I still would be hesitant to rely on Nvidia in that case given their history.
Google and Intel's solutions are just as proprietary, with the downside that almost nobody uses them, so bugs, performance, supported tooling, community, and support windows are often much worse. It's not even clear their solutions actually offer better performance in general, given this. (And if you think proprietary Nvidia software packages are infuriating messes, wait until you try Intel proprietary software.) All that said, how you feel about their history of Linux support is basically irrelevant, and they'll continue to dominate because of it.
The ability to recreate a binary image from the same set of source files, and to get a binary identical to the package-provided one.
This is a useful way of ensuring that nothing is amiss at compile/link time.
Today’s GNU toolchain clutters the interior of binary files with random hash values, full file paths (that you couldn’t easily recreate), and random tmpfile directories.
The idea is to make it easier to verify a binary, compare it with an earlier-built-but-same-source binary, or reverse engineer it (and catch unexpected changes in code).
Recently, President Biden put out an executive order that mandates that NIST et al work out, over the next year, an SBOM/supply chain mandate for software used by Federal departments.
That's going to require the equivalent of "chain of custody" attestations along the entire build chain.
Along with SOC and PCI/DSS and other standards, this is going to require companies and developers to adopt NixOS-type immutable environments.
Unfortunately, I don't think this is going to be the outcome. We're more likely to end up with "Here is the list of filenames, subcomponents, and associated hashes" as opposed to requiring NixOS style environments. Vendors to the subcontractors will likely be required to provide the same list of filename/subcomponent/hashes, a far cry from repeatable builds.
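A crude sketch of what generating that kind of list could look like (real SBOMs would normally use a standard format such as SPDX or CycloneDX; the fields and the `dist/` path here are simplified placeholders):

```python
# Walk a build output directory and emit the flat "filenames and hashes" style
# of inventory: one entry per file with a relative path and a sha256 digest.
import hashlib
import json
import os

def inventory(root: str) -> list:
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
            entries.append({"path": os.path.relpath(path, root), "sha256": digest})
    return entries

print(json.dumps(inventory("dist/"), indent=2))
```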
I really want to adopt Nix and NixOS for my systems but the cost of wrapping packages is just a little too high for me right now (or perhaps I'm out of date and a new cool tool that does it automatically is out). IMHO, a dependency-graph-based build system that builds a hermetically sealed transitive closure of an app's dependencies that can be plopped into a rootfs via Nix [0] is far superior, security-wise, to the traditional practice of writing Dockerfiles.
Hm, this seems like a lower level set of tools that can be composed into something a bit more user-friendly (one of my personal complaints with Nix as well, despite being a big fan of the concept and overall execution. Nothing too steep that can't be learned eventually, but the curve exists). I'm wondering if there would be an audience for a higher level abstraction on top of Nix, or if one already exists.