
This is great!

I'd like to see a formal container security grade that works like:

  1) Curate a list of all known (container) exploits
  2) Run each exploit in environments of increasing security: plain Unix permissions, a chroot/jail, Docker, and a full emulator
  3) The percentage of prevented exploits would be the score, from 0-100%
Under this scheme, I'd expect naive attempts at containerization with permissions and jails to score around 0%, while Docker might be above 50% and Microsandbox could potentially reach 100%.
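As a sketch, the grading scheme above could be harnessed like this (the exploit labels, the environment definitions, and the way `run_exploit` decides outcomes are all hypothetical placeholders):

```python
# Hypothetical sketch of the proposed grade: run every known exploit
# against each environment and report the percentage it prevents.

def run_exploit(exploit, environment):
    """Placeholder: a real harness would launch `exploit` inside
    `environment` and check for escape indicators."""
    return exploit in environment["prevents"]  # True = contained

def security_grade(exploits, environment):
    prevented = sum(run_exploit(e, environment) for e in exploits)
    return 100 * prevented / len(exploits)

exploits = ["dirty-cow", "runc-escape", "procfs-leak", "symlink-race"]

environments = {
    "unix-permissions": {"prevents": set()},           # expect ~0%
    "docker":           {"prevents": {"dirty-cow", "symlink-race"}},
    "emulator":         {"prevents": set(exploits)},   # expect 100%
}

for name, env in environments.items():
    print(f"{name}: {security_grade(exploits, env):.0f}%")
```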

This might satisfy some of our intuition around questions like "why not just use a jail?". Also the containers could run on a site on the open web as honeypots with cash or crypto prizes for pwning them to "prove" which containers achieve 100%.

We might also need to redefine what "secure" means, since exploits like Rowhammer and Spectre may make nearly all conventional and cloud computing insecure. Or maybe it's a moving target, like how 64 bit encryption might have once been considered secure but now we need 128 bit or higher.
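For a sense of why the target moves, some back-of-the-envelope arithmetic on keyspace sizes (the guess rate below is an assumption for illustration, not a measured figure):

```python
# Brute-force time for 64-bit vs 128-bit keys, assuming a hypothetical
# attacker testing 10^12 keys per second.
RATE = 10**12                        # keys per second (assumed)
SECONDS_PER_YEAR = 365 * 24 * 3600

for bits in (64, 128):
    years = 2**bits / RATE / SECONDS_PER_YEAR
    print(f"{bits}-bit keyspace: ~{years:.3g} years to exhaust")
```

At that (generous) rate a 64-bit keyspace falls in under a year, while a 128-bit keyspace takes on the order of 10^19 years.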

Edit: the motivation behind this would be to find a container that's 100% secure without emulation, for performance and cost-savings benefits, as well as gaining insights into how to secure operating systems by containerizing their various services.






The issue, at least with multitenant workloads, isn't "container vulnerabilities" as such; it's that standard containers are premised on sharing a kernel, which makes every kernel LPE a potential container escape --- there's a long history of those bugs, and they're only rarely flagged as "container escapes"; it's just sort of understood that a kernel LPE is going to break containers.

> it's just sort of understood that a kernel LPE is going to break containers.

I think it's generally understood that any sort of kernel LPE can potentially (and therefore is generally assumed to) break all security boundaries on the local machine, since the kernel contains no internal security boundaries. That includes not only containers but also everything else, such as user separation, hardware virtualization controlled by the local kernel, and kernel-private secrets.


A large proportion of LPE vulnerabilities are in the nature of "perform a syscall to pass specially crafted data to the kernel and trigger a kernel bug". For containers, the kernel is the host kernel and now the host is compromised. For VMs, the kernel is the guest kernel and now the guest is compromised, but not the host. That's a much narrower compromise and in security models where root on the guest is already expected to be attacker-controlled, isn't even a vulnerability.

VM sandbox escape is just "perform a hypercall/trap to pass specially crafted data to the hypervisor and trigger a hypervisor bug". For virtual machines, the hypervisor is the privileged host and now the host is compromised.

There is no inherent advantage to virtualization, the only thing that matters is the security and robustness of the privileged host.

The only reason there is any advantage in common use is that the Linux Kernel is a security abomination designed for default-shared/allow services that people are now trying to kludge into providing multiplexed services. But even that advantage is minor in comparison to modern, commonplace threat actors who can spend millions to tens of millions of dollars finding security vulnerabilities in core functions and services.

You need privileged manager code in which a highly skilled team of 10, given 3 years to pound on it, cannot find any vulnerabilities, just to reach the minimum bar of being secure against prevailing threat actors, let alone near-future ones.


The syscall interface has a lot more attack surface than the hypercall interface. If you want to run existing applications, you have to implement the existing syscall interface.

The advantage to virtualization is that the syscall interface is being implemented by the guest kernel at a lower privilege level instead of the host kernel at a higher privilege level.


If this were true, it would be easy to support the claim with evidence. What were the last three Linux LPEs that could be used in a realistic scenario (an attacker with shell, root, full control of guest kernel) to compromise a KVM host? There are dozens of published LPEs every year, so this should be easy for you.

You know that is a nonsensical request. Why would a Linux LPE result in a guest to host escape?

That is like asking for the last 3 iMessage RCEs that could be directly used to get a kernel compromise. You obviously leverage the RCE to get code execution in the unprivileged context, then chain it with an LPE or an unprivileged-to-privileged kernel escape. The RCE is very likely to be unrelated to the LPE, and the two can likely even be mixed and matched if the RCE is good enough. You could do both simultaneously, and I guess some such bugs might exist, but that is generally a poor, much harder strategy.

In this case the Linux kernel LPE would only get you code execution in the unprivileged guest, which you then need to chain with an unprivileged-to-privileged hypervisor escape.

Are you claiming that hypervisors or VMM systems are unhackable? That is an extraordinary claim that demands extraordinary evidence. Otherwise you agree there are VM escapes that can be chained with code execution in the guest, which is my entire point.

Your security depends on the quality of your isolation boundary and there is no reason to believe the same class of people who gave us the awful security of the Linux Kernel are going to turn around and solve the same problem they failed to solve by calling it a hypervisor.


Is it possible we're just talking past each other? I read you to be claiming that guest->host escapes were straightforward in the Linux kernel security model (they are not). If we just agree, then we agree, and we should just chalk this up to message board ambiguity.

Yes, what they just said here. ^^ ^^

> hardware virtualization controlled by the local kernel

In some architectures, kernel LPE does not break platform (L0/EL2) virtualization, https://news.ycombinator.com/item?id=44141164

  L0/EL2  L1/EL1

  pKVM    KVM
  AX      Hyper-V / Xen / ESX

Most Linux kernel LPEs --- in fact, the overwhelming majority of them --- don't threaten KVM hosts when exploited in KVM guests.

is there anything good written up on this?

I don't think so? It's not complicated. Most LPEs get you the local kernel. The KVM security model assumes an untrusted local (guest) kernel. To compromise KVM, they either need to be fundamental architectural flaws (rare) or bugs in KVM itself (also rare).

You cannot build a secure container runtime (against malicious containers) because underlying it is the Linux kernel.

The only way to make Linux containers a meaningful sandbox is to drastically restrict the syscall API surface available to the sandboxee, which quickly reduces its value. It's no longer a "generic platform that you can throw any workload onto" but instead a bespoke thing that needs to be tuned and reconfigured for every usecase.
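As an illustration of how bespoke that tuning gets, here is a sketch that emits a deny-by-default, Docker-style seccomp profile; the allowlist is an arbitrary illustrative subset, not a working policy:

```python
import json

# Sketch: build a deny-by-default Docker-style seccomp profile.
# The allowlist below is an arbitrary illustrative subset; a real
# profile needs dozens more syscalls and per-workload tuning.
ALLOWED = ["read", "write", "open", "close", "mmap", "exit_group"]

profile = {
    "defaultAction": "SCMP_ACT_ERRNO",     # deny everything by default
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {"names": ALLOWED, "action": "SCMP_ACT_ALLOW"},
    ],
}

print(json.dumps(profile, indent=2))
```

Applying it would look something like `docker run --security-opt seccomp=profile.json ...`, and in practice the allowlist has to be re-derived for every workload.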

This is why you need virtualization. Until we have a properly hardened and memory safe OS, it's the only way. And if we do build such an OS it's unclear to me whether it will be faster than running MicroVMs on a Linux host.


One can definitely build a container runtime that uses virtualization to protect the host.

For example, there is Kata Containers:

https://katacontainers.io/

This can be used with regular `podman` by just changing the container runtime, so there's not even a need for any extra tooling.

In theory you could shove the container runtime into something like k8s


> container runtime that uses virtualization to protect the host

True, by "container" I really meant "shared-kernel container".

> In theory you could shove the container runtime into something like k8s

Yeah this is actually supported by k8s.

Whether that means it's actually reasonable to run completely untrusted workloads on your own cluster is another question. But it definitely seems like a really good defense-in-depth feature.


> ... drastically restrict the syscall API surface available to the sandboxee, which quickly reduces its value ...

Depends, I guess: Android has had quite a bit of success with seccomp-bpf and an Android-specific flavour of SELinux [0]

> Until we have a properly hardened and memory safe OS ... faster than running MicroVMs on a Linux host.

Andy Tanenbaum might say that microkernels would do just as well.

[0] https://youtu.be/WxbOq8IGEiE


You also have gVisor, which runs all syscalls through a userspace kernel written in Go that's supposedly safe enough for Google.

gVisor uses virtualization

> Android

Exactly. Android pulls this off by being extremely constrained. It's dramatically less flexible than an OCI runtime. If you wanna run a random unenlightened workload on it you're probably gonna have a hard time.

> Micro Kernels would do just as well.

Yea this goes in the right direction. In the end, a lot of kernel work I look at is basically about trying to retrofit the benefits of microkernels onto Linux.

Saying "we should just use an actual microkernel" is a bit like "Russia and Ukraine should just make peace" IMO though.


You cannot build a secure virtualization runtime because underlying it is the VMM. Until you have a secure VMM you are subject to precisely the same class of problems plaguing container runtimes.

The only meaningful difference is that Linux containers target partitioning Linux kernel services, a shared-by-default/default-allow environment that was never designed for and has never achieved meaningful security. The number of vulnerabilities resulting from "whoopsie, we forgot to partition shared service 123" would be hilarious if it were not a complete lapse of security engineering in a product people are convinced is adequate for security-critical applications.

Present a vulnerability assessment demonstrating that a team of 10 with 3 years' time (~10-30 M$, comparable to many commercially motivated single-victim attacks these days) can find no vulnerabilities in your deployment, or a formal proof of security and correctness; otherwise we should stick with the default assumption that software is easily hacked instead of the extraordinary claim that demands extraordinary evidence.


> You cannot build a secure virtualization runtime because underlying it is the VMM

There are VMMs (e.g. pKVM in upstream Linux) with a small line count that are isolated by silicon support for nested virtualization. This can be found on recent Google Pixel phones/tablets, with strong isolation of an untrusted Debian Arm Linux "Terminal" VM.

A similar architecture was shipped a decade ago by Bromium and now runs on millions of HP business laptops, including hypervisor isolation of firmware: "Hypervisor Security : Lessons Learned — Ian Pratt, Bromium — Platform Security Summit 2018", https://www.youtube.com/watch?v=bNVe2y34dnM

Christian Slater, HP cybersecurity ("Wolf") edutainment on nested virt hypervisor in printers, https://www.youtube.com/watch?v=DjMSq3n3Gqs


> silicon support for nested virtualization

Is there any guarantee that this "silicon support" is any safer than the software? Once we break the software abstraction down far enough it's all just configuring hardware. Conversely, once you start baking significant complexity into hardware (such as strong security boundaries) it would seem like hardware would be subject to exactly the same bugs as software would, except it will be hard to update of course.


> Is there any guarantee that this "silicon support" is any safer than the software?

Safety and security claims are only meaningful in the context of threat models. As described in the Xen/uXen/AX video, pKVM and AWS Nitro security talks, one goal is to reduce the size, function and complexity of open-source code running at the highest processor privilege levels [1], minimizing dependency on closed firmware/SMM/TrustZone. Nitro moved some functions (e.g. I/O virtualization) to separate processors, e.g. SmartNIC/DPU. Apple used an Arm T2 secure enclave processor for encryption and some I/O paths, when their main processor was still x86. OCP Caliptra RoT requires OSS firmware signed by both the OEM and hyperscaler customer. It's a never-ending process of reducing attack surface, prioritized by business context.

> hardware would be subject to exactly the same bugs as software would, except it will be hard to update of course

Some "hardware" functions can be updated via microcode, which has been used to mitigate speculative execution vulnerabilities, at the cost of performance.

[1] https://en.wikipedia.org/wiki/Protection_ring

[2] https://en.wikipedia.org/wiki/Transient_execution_CPU_vulner...


While VMs do have an attack surface, it is vastly different from that of containers, which, as you pointed out, are not really a security system but simply namespaces.

seccomp, capabilities, SELinux, AppArmor, etc. can help harden containers, but most of the popular containers don't even drop root for services, and I was one of the people who tried to get Docker/Moby etc. to let you disable the privileged flag... which they refused to do.

While some CRIs make this easier, any agent that can spin up a container should be considered a super user.

With the docker --privileged flag I could read the host's root volume or even install EFI BIOS files just using mknod etc., walking /sys to find the major/minor numbers.

Namespaces are useful in a comprehensive security plan, but as you mentioned, they are not jails.

It is true that both VMs and containers have attack surfaces, but the size of the attack surface on containers is much larger.


I see your point but even if your VMM is a zillion lines of C++ with emulated devices there are opportunities to secure it that don't exist with a shared-monolithic-kernel container runtime.

You can create security boundaries around (and even within!) the VMM. You can make it so an escape into the VMM process has only minimal value, by sandboxing the VMM aggressively.

Plus you can absolutely escape the model of C++ emulating devices. Ideally I think VMMs should do almost nothing but manage VF passthroughs. Of course then we shift a lot of the problem onto the inevitably completely broken device firmware but again there are more ways to mitigate that than kernel bugs.


Could you elaborate on how you could secure those architectures better? It's unclear to me how being in device firmware or being a VMM provides you with any further abilities. Surely you still have the same fundamental problem of being a shared resource.

Intuitively there are differences. The Linux kernel is fucking huge, and anything that could bake the "shared resources" down to less than the entire kernel would be easier to verify, but that would also be true for an entirely software based abstraction inside the kernel.

In a way it's the whole micro kernel discussion again.


When you escape a container generally you can do whatever the kernel can do. There is no further security boundary.

If you escape into a VMM, you can do whatever the VMM can do, and you can build a system where it cannot do much more than the VM guest itself. By the time the guest boots, the process containing the vCPU threads has already lost all its interesting privileges and has no credentials of value.

Similar with device passthrough. It's not very interesting if the device you're passing through ultimately has unchecked access to PCIe, but if you have a proper IOMMU set up, it should be possible to have a system where pwning the device firmware is just a small step rather than an immediate escalation to root-equivalent. (I should say, I don't know if this system actually exists today, I just know it's possible.)

With a VMM escape your next step is usually to exploit the kernel. But if you sandbox the VMM properly there is very limited kernel attack surface available to it.

So yeah you're right it's similar to the microkernel discussion. You could develop these properties for a shared-kernel container runtime... By making it a microkernel.

It's just that isn't a path with any next steps in the real world. The road from Docker to a secure VM platform is rich with reasonable incremental steps forward (virtualization is an essential step but it's still just one of many). The road from Docker to a microkernel is... Rewrite your entire platform and every workload!


> It's just that isn't a path with any next steps in the real world.

It appears we find ourselves at the Theory/Praxis intersection once again.

> The road from Docker to a secure VM platform is rich with reasonable incremental steps forward

The reason it seems so reasonable is that it's well trodden. There were an infinity of VM platforms before Docker, and they were all discarded for pretty well-known engineering reasons, mostly to do with performance, but also for being difficult for developers to reason about. I have no doubt that there's still dialogue worth having between those two approaches, but cgroups isn't a "failed" VM security boundary any more than Linux is a failed microkernel. It never aimed to be a VM-like security boundary.


Importantly, I'd like to see the configurations of the machines. There's a lot you can do to Docker or systemd spawns that greatly varies the security level. This would really help show what needs to be done and which configurations lead to which risks.

Basically I'd love to see a giant ablation study.
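A sketch of what enumerating such an ablation grid could look like; the hardening knobs here are hypothetical labels, and a real harness would run each configuration against an exploit suite:

```python
from itertools import product

# Sketch: enumerate every on/off combination of a few hardening knobs
# (an "ablation grid"). Each combination would then be run against the
# exploit suite; here we just list the configurations.
KNOBS = ["user-namespaces", "seccomp-profile",
         "drop-capabilities", "no-new-privileges"]

configs = []
for values in product([False, True], repeat=len(KNOBS)):
    configs.append(dict(zip(KNOBS, values)))

print(f"{len(configs)} configurations to test")
for cfg in configs[:3]:
    print(cfg)
```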


In a way, containers already run as honeypots with cash or crypto prizes: it's called production, and plenty of people are looking for holes day and night. While this setup sounds like a nice idea conceptually, the monetary incentives it could offer would surely be minuscule compared to real targets.


