Letting the containers out of containment
I have written a lot about *Containing the Containers*, e.g. *Are Docker containers really secure?* and *Bringing new security features to Docker*. However, what if you want to ship a container that needs to have access to the host system or other containers? Well, let's talk about removing all the security! Safely?
Packaging Model
I envision a world where lots of software gets shipped in image format. In other words, the application brings all of the content needed to do its job with it, including shared libraries and specific versions of python, ruby, glibc ... There are two big benefits to this. First: the application always has the same runtime environment, meaning packages can be installed on the host without affecting the application. Second: the application can be installed without breaking any other applications or the host.
Enter container hosts, like Project Atomic, which keep the OS minimal and ship all of the software as containers. Which, in the abstract, makes perfect sense. However, if you want to install debugging tools, monitoring tools, management tools, etc., you should also ship them as container images.
The first thing you usually (but not always) have to do to get this to work is turn off or turn down the security.
docker run --privileged ...
The --privileged option turns off almost all of the security used to confine one container from others and from the host.
You can get finer-grained control than this by using --cap-add and --cap-drop to modify the Linux capabilities given to a container (see Capabilities, a short intro and Capability-based security for an overview). You can also modify the SELinux type that a container will run with using --security-opt label:type:TYPE_T.
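For example, a container that only needs to configure network interfaces could drop every capability except NET_ADMIN. This is just a sketch; the fedora image and the ip command are placeholders, and they are not part of the examples later in this post.
sudo docker run --rm --cap-drop=ALL --cap-add=NET_ADMIN fedora ip link show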
Super Privileged Container (SPC)
A proposal I have been knocking around for a while now is the idea of a Super Privileged Container (SPC).
I define an SPC as a container that runs with security turned off (--privileged) and turns off one or more of the namespaces or "volume mounts in" parts of the host OS into the container. This means it is exposed to more of the Host OS. In the most privileged version, the SPC will use ONLY the MOUNT (newns) namespace. It should be able to run without the PID, NET, IPC or UTS namespaces, as well as future namespaces.
I think it would still need to use the MNT namespace in order to bring its own userspace, but you could bring parts of the OS or all of "/" into the container using volume mounts.
The current docker CLI can do almost all of this if you include my --ipc=host patch to disable the IPC namespace, which looks like it will get merged soon. The only namespace we cannot currently disable is the PID namespace.
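Once that patch is in, launching a fairly complete SPC with today's CLI might look something like this. This is just a sketch; rhel7 is a placeholder base image.
sudo docker run -ti --privileged --net=host --ipc=host -v /:/host rhel7 /bin/sh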
Let's look at a few use cases for running SPCs.
Examples
Libvirt in a container
We want to be able to run Virtual Machines on a Project Atomic system, but we don't want to install all of the code required to run libvirt and qemu into the host OS. The libvirt application needs a lot of access to the host. Libvirt needs to be able to store its images on the host system, but in certain cases its images are stored as device images. Libvirt also needs to communicate with the host's systemd using dbus to set up cgroups. And, finally, it needs to be able to use SELinux to set up different labels for sVirt. Brent Baude and Scott Collier wrote a blog on how they were able to get libvirtd to run within a docker container.
This is the command they used to start their container.
sudo docker run --rm --privileged --net=host -ti -e 'container=docker' -v /proc/modules:/proc/modules -v /var/lib/libvirt/:/var/lib/libvirt/ -v /sys/fs/cgroup:/sys/fs/cgroup:rw libvirtd
They needed to use --net=host in order to allow libvirt to manage the network on the host to set up its virtual machines. They also needed to expose the VMs from the host via /var/lib/libvirt. Finally, they wanted to allow libvirt to manage the cgroup file system to put its VMs under cgroup control.
One thing they missed is that they did not mount /sys/fs/selinux into the container. This would tell libselinux within the container that SELinux is enabled, and libvirtd would then be able to launch its VMs with sVirt separation.
In order to get it to work with the host's /dev directory, I would have volume mounted /dev into the container, e.g. -v /dev:/dev. I would have also allowed libvirt to communicate with systemd using dbus by adding -v /run:/run.
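Putting those additions together with their command above, the run command might look something like this (an untested sketch):
sudo docker run --rm --privileged --net=host -ti -e 'container=docker' -v /proc/modules:/proc/modules -v /var/lib/libvirt/:/var/lib/libvirt/ -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /sys/fs/selinux:/sys/fs/selinux -v /dev:/dev -v /run:/run libvirtd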
While this may seem like just an exercise in "how many turtles can I stack?", there are potential, real benefits to be gained from using docker. For example, a project like libvirtd brings with it lots of user space tools that we don't necessarily want to add to the Atomic host.
However, the big, unsolved downside to running libvirt within a container is that, if the admin shuts down the container, it will also kill all of the VMs within the container. If we could eliminate the PID namespace, we could potentially fix this problem.
A container that needs to load Kernel Modules
Several packages want to ship custom kernel modules that are not included in the Host OS. Currently, they ship these modules in an RPM package and then load them when the application starts. There is no reason that you could not do this within a privileged container, as long as the custom kernel module works with the current kernel. If your application could run as non-privileged, other than loading the kernel module, it would probably be best to ship the container as two different images, or run the same image with different commands. For example:
sudo docker run --rm --privileged foobar /sbin/modprobe PATHTO/foobar-kmod
sudo docker run -d foobar
A host management application like Cockpit
Cockpit manages a Host OS and needs access to pretty much the entire system. I have been playing around with different ways you might build an SPC for managing the host.
One idea I experimented with was mounting the host's "/" onto the container's "/". Imagine if you could execute
sudo docker run -v /:/ rhel7 sh
Then bind mount your userspace application onto /opt/apps/myapp; the app would have to get its shared libraries and content from subdirectories of /opt/apps/myapp.
Sadly, I believe this will not work, or would be so fragile that it might cause more problems than it is worth.
It does not seem that gcc/glibc support a mechanism for having their shared libraries in one location while other applications have shared libraries in other directories. /etc/ld.so.cache causes too many problems.
I believe applications that want to manage the host file system will have to know they are running in a container, or at least realize that the host's / is mounted in a sub-directory.
However, you could run a container like the following to expose the Host to the container.
sudo docker run -ti --privileged -d --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /run:/run rhel7-cockpit
Mounting Volumes
Cockpit would need to be coded to realize that, if the $sysimage environment variable is set, it should prepend $sysimage to all paths involving the host. Another option would be to standardize on a path; for example, all SPCs mount the host image at /host or /sysimage.
Then Cockpit could see if the environment variable container=docker or container_uuid=ID was set and prefix /sysimage (or /host, not that I am biased toward one of the options :) ) onto all of its host content.
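Here is a minimal sketch of what that check might look like inside the container, assuming the $sysimage variable from the run command above; the passwd file is just an example path.
# Hypothetical sketch: choose the prefix for paths on the Host
if [ "$container" = "docker" ] && [ -n "$sysimage" ]; then
    host_prefix="$sysimage"    # e.g. /host, set via -e sysimage=/host
else
    host_prefix=""             # running directly on the Host
fi
cat "${host_prefix}/etc/passwd"    # reads the Host's passwd file in either case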
The example above mounts the Host's /dev onto the container's /dev, which allows Cockpit to manage the Host devices. Processes on the Host would also be able to use these devices.
I would also mount the Host's /run on the container's /run, which allows processes within the container to communicate with any service that puts a FIFO file or socket into /run; specifically, /run/dbus/system_bus_socket, which would allow the Cockpit instance running inside the container to use dbus to communicate with all of the dbus services, including systemd.
We might also want to mount /sys on /sys. This would allow processes within the container to manage kernel file systems like SELinux or cgroups.
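For example, adding -v /sys:/sys to the command above would give the container that view of the Host's kernel file systems (again, just a sketch):
sudo docker run -ti --privileged -d --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /run:/run -v /sys:/sys rhel7-cockpit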
Eliminate namespaces
--net=host eliminates the NET and UTS namespaces. This allows processes within the container to see and use the Host's network.
I have a github pull request patch that is about to be merged which will support --ipc=host. This allows the Cockpit instance to share IPC with the Host system, if that is required. Lots of large projects, e.g. databases, rely on IPC to communicate and run a lot faster if they can use shared memory and semaphores.
The only thing we don't have yet is --pid=host, which would allow Cockpit to see the Host's /proc as /proc. I have been talking about this with the upstream docker project, and the only thing that is difficult to add is the ability to kill all processes within the container. We could do this by freezing the processes within the container (docker pause) and then sending all of them SIGKILL.
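Until docker grows that ability, a rough manual approximation of "freeze, then kill" might look like this (a sketch, assuming a container named mycontainer and that the PID is the second column of docker top output):
sudo docker pause mycontainer
sudo docker top mycontainer | awk 'NR>1 {print $2}' | xargs -r sudo kill -9
sudo docker unpause mycontainer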
The nice thing about this is you are still in the docker framework, i.e. docker ps would be able to show your container running.
CoreOS has a neat shell script hack, toolbox.
Toolbox uses systemd-nspawn and a docker image. They pack their applications (gdb and strace) in a docker image. Toolbox uses docker to pull the image and then untars the exported container onto disk:
docker pull "${TOOLBOX_DOCKER_IMAGE}:${TOOLBOX_DOCKER_TAG}"
docker run --name=${machinename} "${TOOLBOX_DOCKER_IMAGE}:${TOOLBOX_DOCKER_TAG}" /bin/true
docker export ${machinename} | sudo tar -x -C "${machinepath}" -f -
and then executes systemd-nspawn to map the mnt namespace and mount "/" onto /media/root:
sudo systemd-nspawn -D "${machinepath}" --share-system --bind=/:/media/root --bind=/usr:/media/root/usr --user="${TOOLBOX_USER}" "$@"
The advantage of their method is they don't have separate PID namespaces, meaning ps -el will show all processes on the system. As mentioned above, we need to get this functionality into docker.
The toolbox solution from CoreOS does NOT get listed in docker ps commands and is not treated the same as other docker images/containers.
Execute a command in the host namespace
Say you want to execute useradd, but you want to make sure that it happens in the Host namespace, so that the SELinux labels are created correctly, the auditing goes to the Host, and, most importantly, you change the Host's /etc/passwd and /etc/shadow.
sudo docker run -ti --privileged -d --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /run:/run rhel7 /bin/sh
sudo nsenter --mount=$sysimage/proc/1/ns/mnt -- /sbin/adduser testuser
Note: This requires that nsenter be inside of your image. Since the Host's / is mounted on /host, the Host's /proc is available under $sysimage/proc, so the init process's mount namespace is at $sysimage/proc/1/ns/mnt.
You could execute many shell commands using this method.
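For example, to query the Host's RPM database from inside the SPC (again assuming the /host mount and $sysimage variable shown above):
sudo nsenter --mount=$sysimage/proc/1/ns/mnt -- rpm -q kernel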
Conclusion
Containers can be used to run Host and container management tools. Having the ability to volume mount into the container and turn off namespaces makes this possible.