The issue, at least with multitenant workloads, isn't "container vulnerabilities" as such; it's that standard containers are premised on sharing a kernel, which makes every kernel LPE a potential container escape. There's a long history of those bugs, and they're only rarely flagged as "container escapes"; it's just sort of understood that a kernel LPE is going to break containers.
> it's just sort of understood that a kernel LPE is going to break containers.
I think it's generally understood that any sort of kernel LPE can potentially lead to (and is therefore generally treated as) breaking all security boundaries on the local machine, since the kernel contains no internal security boundaries. That includes containers, but also everything else, such as user separation, hardware virtualization controlled by the local kernel, and kernel-private secrets.
A large proportion of LPE vulnerabilities are of the form "perform a syscall to pass specially crafted data to the kernel and trigger a kernel bug". For containers, the kernel is the host kernel and now the host is compromised. For VMs, the kernel is the guest kernel and now the guest is compromised, but not the host. That's a much narrower compromise and, in security models where root on the guest is already expected to be attacker-controlled, isn't even a vulnerability.
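To make the distinction concrete, here's a toy model in Python. It isn't an exploit and it doesn't touch any real kernel; the `Workload` type, the isolation labels, and the function names are all made up for illustration. It just encodes the claim above: a syscall-triggered LPE gets you whichever kernel served the syscall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    isolation: str  # "container" or "vm" (labels are assumptions of this sketch)


def kernel_serving_syscalls(workload: Workload) -> str:
    """Return which kernel handles the workload's syscalls."""
    if workload.isolation == "container":
        return "host kernel"   # containers share the host kernel
    if workload.isolation == "vm":
        return "guest kernel"  # a VM's syscalls stop at its own kernel
    raise ValueError(f"unknown isolation: {workload.isolation!r}")


def compromised_by_kernel_lpe(workload: Workload) -> str:
    """A syscall-triggered LPE compromises the kernel that served the syscall."""
    return kernel_serving_syscalls(workload)


if __name__ == "__main__":
    for w in (Workload("tenant-a", "container"), Workload("tenant-b", "vm")):
        print(f"{w.name} ({w.isolation}): LPE yields the {compromised_by_kernel_lpe(w)}")
```

The model deliberately says nothing about what happens after the guest kernel falls; that's a separate boundary with its own bugs.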
VM sandbox escape is just "perform a hypercall/trap to pass specially crafted data to the hypervisor and trigger a hypervisor bug". For virtual machines, the hypervisor is the privileged host and now the host is compromised.
There is no inherent advantage to virtualization; the only thing that matters is the security and robustness of the privileged host.
The only reason there is any advantage in common use is that the Linux Kernel is a security abomination designed for default-shared/allow services that people are now trying to kludge into providing multiplexed services. But even that advantage is minor in comparison to modern, commonplace threat actors who can spend millions to tens of millions of dollars finding security vulnerabilities in core functions and services.
To reach the minimum bar for security against prevailing threat actors, let alone near-future ones, you need privileged manager code in which a highly skilled team of 10, given 3 years to pound on it, cannot find any vulnerabilities.
The syscall interface has a lot more attack surface than the hypercall interface. If you want to run existing applications, you have to implement the existing syscall interface.
The advantage to virtualization is that the syscall interface is being implemented by the guest kernel at a lower privilege level instead of the host kernel at a higher privilege level.
If this were true, it would be easy to support the claim with evidence. What were the last three Linux LPEs that could be used in a realistic scenario (an attacker with shell, root, full control of guest kernel) to compromise a KVM host? There are dozens of published LPEs every year, so this should be easy for you.
You know that is a nonsensical request. Why would a Linux LPE result in a guest to host escape?
That is like asking for the last 3 iMessage RCEs that could be directly used to get a kernel compromise. You obviously leverage the RCE to get code execution in the unprivileged context, then chain it with an LPE or an unprivileged-to-privileged kernel escape. The RCE is very likely to be unrelated to the LPE, and the two can likely even be mixed and matched if the RCE is good enough. You could do both simultaneously, and I guess some such exploits might exist, but that is just generally a poor, much harder strategy.
In this case the Linux kernel LPE would only get you code execution in the unprivileged guest, which you then need to chain with an unprivileged-to-privileged hypervisor escape.
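The chaining argument can be sketched as a reachability problem. This is a toy model, not a description of any real exploit: the privilege-level names and the three exploit tuples are hypothetical placeholders, and each tuple stands for "one working bug that crosses one boundary".

```python
from collections import deque


def reachable_levels(start, exploits):
    """Breadth-first search over privilege levels.

    `exploits` is an iterable of (from_level, to_level) pairs, each standing
    for one working exploit that crosses one boundary.
    """
    exploits = list(exploits)  # allow generators; we iterate more than once
    seen = {start}
    queue = deque([start])
    while queue:
        level = queue.popleft()
        for src, dst in exploits:
            if src == level and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


# Hypothetical exploit inventory (labels are assumptions of this sketch).
GUEST_LPE = ("guest user", "guest kernel")   # kernel LPE inside a VM
VM_ESCAPE = ("guest kernel", "host")         # hypervisor/VMM bug
CONTAINER_LPE = ("container user", "host")   # kernel LPE with a shared kernel

if __name__ == "__main__":
    print(sorted(reachable_levels("guest user", [GUEST_LPE])))
    print(sorted(reachable_levels("guest user", [GUEST_LPE, VM_ESCAPE])))
    print(sorted(reachable_levels("container user", [CONTAINER_LPE])))
```

In this model the VM case needs two independent edges to reach the host and the container case needs one. Whether the second edge is cheap or expensive to find is exactly what the thread is arguing about, and the model takes no position on it.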
Are you claiming that hypervisors or VMM systems are unhackable? That is an extraordinary claim that demands extraordinary evidence. Otherwise, you agree there are VM escapes that can be chained with code execution in the guest, which is my entire point.
Your security depends on the quality of your isolation boundary and there is no reason to believe the same class of people who gave us the awful security of the Linux Kernel are going to turn around and solve the same problem they failed to solve by calling it a hypervisor.
Is it possible we're just talking past each other? I read you to be claiming that guest->host escapes were straightforward in the Linux kernel security model (they are not). If we just agree, then we agree, and we should just chalk this up to message board ambiguity.
I don't think so? It's not complicated. Most LPEs get you the local kernel. The KVM security model assumes an untrusted local (guest) kernel. To compromise a KVM host, you need either a fundamental architectural flaw (rare) or a bug in KVM itself (also rare).