LXC containers without CAP_SYS_ADMIN under Debian Jessie

published by Christian Seiler on Sat, 04/11/2015 - 14:44

Containers can be an alternative to virtualization in some cases. They require fewer resources (because the host's resources are reused) and are typically faster than virtual machines. Also, some things cannot be done as easily with virtualization. However, they do not provide the same security guarantees. And while there have been some bugs in the past that could allow attackers to escape virtual machines to access the host, and there may still be bugs in current virtualization software, in general it is much harder to escape virtual machines than it is to escape containers. The most promising development in making containers more secure is a concept called 'user namespaces' in current Linux kernels, which allows the container's users to be mapped to different 'real' users outside of it, so any action root may take inside a container will be considered an action taken by a regular user outside of it. Unfortunately, there is currently still a high usability barrier surrounding them. This article is not about these types of containers.

If one does not use user namespaces, containers are by default not even remotely secure. If the root user inside a container is really root, they can just mount any of the kernel's pseudo-filesystems (proc, sysfs, debugfs, etc.) and directly modify settings. It is unlikely that a skilled attacker would be unable to find a way to escape the container this way, and even if not, denial of service is always an option (rebooting the system by using /proc/sysrq-trigger, for example).

Some aspects of the root user's power can be mitigated by use of capabilities. Capabilities under Linux are based on a withdrawn POSIX draft, and they partition the special kernel-level permissions that the root user has and a normal user doesn't (note that file access rights that root has simply by virtue of owning most of the filesystem don't count here). For example, only a process with the CAP_SYS_MODULE capability is allowed to load modules into the kernel, regardless of its user id (it doesn't even have to be the root user).

How capabilities work in detail is not trivial to understand, but there is one important concept called the 'capability bounding set'. It is the list of capabilities that a process and all its child processes will ever be able to acquire. For example, a setuid root program will typically acquire all available capabilities (regardless of the capabilities of the process calling it), but it can only acquire those that are in the capability bounding set. So if a capability is removed from the bounding set of a process, it can never be acquired again by that process or any descendant process. If that process happens to be the init process inside the container, no process inside the container will be able to gain that capability. Thus, if e.g. CAP_SYS_MODULE is removed from the capability bounding set of the initial process at container startup, no process inside the container will ever be able to load or unload a kernel module. (One can always remove additional capabilities from the bounding set, but never add to it.) With capabilities, root inside a container can therefore be further restricted, and LXC supports dropping arbitrary capabilities via its configuration.
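
The bounding set of a process can be inspected directly: the kernel exposes it as the hexadecimal CapBnd mask in /proc/<pid>/status. The following is only a minimal sketch. The function names cap_bnd and has_cap_bit are made up for this illustration, and the bit number used in the example (16 for CAP_SYS_MODULE) is the value from linux/capability.h.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of LXC or libcap.

# Print the CapBnd hex mask from status-formatted text on stdin.
cap_bnd() {
    awk '$1 == "CapBnd:" { print $2 }'
}

# has_cap_bit HEXMASK BIT: succeed if bit number BIT is set in HEXMASK.
has_cap_bit() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example use (bit 16 is CAP_SYS_MODULE):
#   mask=$(cap_bnd < /proc/self/status)
#   has_cap_bit "$mask" 16 && echo "CAP_SYS_MODULE can still be acquired"
```

Since the bounding set is inherited, running this against /proc/1/status inside a container shows what any process in that container can ever acquire.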

Warning: everything described in this article should be considered as part of a defence in depth strategy. Never give a person root access in a container configured as described in this article that you wouldn't also give root access on the host running the container. Do not use this for virtual hosting or similar. The things described here will make it a lot harder to escape a container, but do not assume it will be impossible. Also note that for containers in general, a kernel-level vulnerability always affects the entire host (since containers don't run their own kernel, unlike virtual machines).

Notice: this article is a work in progress, i.e. there might be ways to improve upon the things discussed here. If you can think of any improvements, I'd like to hear about them.

General considerations

This article will describe how to set up Jessie containers on a Jessie host. Both the host and the containers will run systemd. If the host is still running Wheezy (and therefore sysvinit), a follow-up article discusses how to achieve that.

This article will only use software available directly in Debian Jessie without any third-party packages.

On the host, the following packages should be installed:

lxc
debootstrap

This article assumes you know how to set up LXC networking according to your own preferences. The example configurations shown here will assume there is a bridge br0 and that the containers have network connectivity via veth pairs, but this is just a detail.

Creating the container

The canonical way of setting up a container with LXC is to use the lxc-create tool, which calls different template scripts to automatically set up a container. This article doesn't follow that path, mostly because the template is still geared toward sysvinit systems and changes things around inside the container. These changes (with a small exception) are not necessary for systemd containers (systemd allows you to boot a normal system in a container without modifying it). And while most of the changes the template makes are to sysvinit-related things, it also changes some things related to systemd in ways that are not optimal. That said, the template does work, so it does provide an easy way to set up a rootfs.

Throughout this article, the container's rootfs will be placed in the path /srv/lxc/lxc1 (non-standard w.r.t. LXC) and the container's configuration will reside in /var/lib/lxc/lxc1/config (standard LXC path). The container's name will be lxc1 for this example. Where to place the rootfs is a matter of personal preference.

In order to create the container, one needs to debootstrap it:

debootstrap --arch=amd64 jessie /srv/lxc/lxc1

This will create a fresh Debian installation at the rootfs path. After bootstrapping the container, a few simple details need to be configured:

/srv/lxc/lxc1/etc/hostname: place the hostname in there.
/srv/lxc/lxc1/etc/network/interfaces: the container described here will also lack the capability to perform its own network configuration; that part will be handled by LXC directly. However, Debian's networking scripts should see that there is actually a network available, so the contents of the file should read:
```
# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto eth0
iface eth0 inet manual
```
The first couple of lines are from Debian itself (and should remain there), but the important part is the inet manual configuration of the eth0 interface. This tells Debian's scripts that something outside of their scope will take care of the configuration of the network interface, but that it should assume it will be available.
/srv/lxc/lxc1/etc/resolv.conf: the nameserver should be properly configured here. debootstrap copies the host's configuration by default, so typically this does not need to be modified, but it should be checked.
The FUSE pseudo-filesystem inside the container should be masked, so systemd doesn't attempt to mount it automatically. This is strictly speaking not necessary, but will get rid of a stupid error message at boot:
ln -s /dev/null /srv/lxc/lxc1/etc/systemd/system/sys-fs-fuse-connections.mount
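
The manual steps above are easy to forget when setting up several containers. The following is a minimal sketch of a check that reports which of them are still missing in a rootfs. The function name check_rootfs is hypothetical, and the checks only cover the four items listed in this section.

```shell
#!/bin/sh
# Hypothetical helper: report which of the manual steps are still missing
# in a rootfs. Prints one line per problem; succeeds if none were found.
check_rootfs() {
    rootfs=$1
    problems=0
    [ -s "$rootfs/etc/hostname" ] || {
        echo "missing or empty: etc/hostname"; problems=1; }
    grep -q '^iface eth0 inet manual' "$rootfs/etc/network/interfaces" 2>/dev/null || {
        echo "eth0 is not set to 'inet manual'"; problems=1; }
    [ -s "$rootfs/etc/resolv.conf" ] || {
        echo "missing or empty: etc/resolv.conf"; problems=1; }
    mask="$rootfs/etc/systemd/system/sys-fs-fuse-connections.mount"
    [ "$(readlink "$mask" 2>/dev/null)" = /dev/null ] || {
        echo "FUSE mount unit is not masked"; problems=1; }
    return "$problems"
}

# Example use:
#   check_rootfs /srv/lxc/lxc1 && echo "rootfs looks complete"
```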

Finally, LXC supports pseudo-virtual terminals (lxc-console, lxc.tty setting). Newer LXC versions (1.1, not part of Jessie) are able to tell systemd automatically which ones have been generated, but when using the version shipping with Jessie, they still need to be activated manually. In this example, 4 consoles are used. To activate them, the container-getty@.service must be instantiated. systemd's default service expects the name of the device node beneath /dev/pts within the container as instance name; however, LXC before 1.1 doesn't use the PTS instance of the container for its consoles, so the device nodes will not lie within that directory. Rather, the configuration below will use lxc.devttydir = lxc, so the nodes will be beneath /dev/lxc. However, going up to the parent directory works for the instance names (although the getty greeting looks slightly awkward). Therefore, the following commands should be used to activate the consoles:

for i in 1 2 3 4 ; do
  ln -s /lib/systemd/system/container-getty@.service \
        /srv/lxc/lxc1/etc/systemd/system/multi-user.target.wants/container-getty\@..-lxc-tty$i.service
done

If you don't want to use these consoles (lxc.tty = 0), then this step can be skipped. If you want a different number of consoles, change both the setting in LXC's configuration file and the number of instances you create here.
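
To keep the two numbers in sync, the console count can be read from the LXC configuration and the symlinks created from it. This is a sketch under the assumptions of this article (lxc.devttydir = lxc, hence the ..-lxc-tty instance names). The function names tty_count, getty_instances and enable_gettys are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# tty_count FILE: print the value of the last lxc.tty line in FILE.
tty_count() {
    sed -n 's/^[[:space:]]*lxc\.tty[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# getty_instances COUNT: print the unit names needed for COUNT consoles.
getty_instances() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "container-getty@..-lxc-tty$i.service"
        i=$((i + 1))
    done
}

# enable_gettys ROOTFS COUNT: create the symlinks inside ROOTFS.
enable_gettys() {
    wants="$1/etc/systemd/system/multi-user.target.wants"
    mkdir -p "$wants"
    getty_instances "$2" | while read -r unit; do
        ln -sf /lib/systemd/system/container-getty@.service "$wants/$unit"
    done
}

# Example use:
#   enable_gettys /srv/lxc/lxc1 "$(tty_count /etc/lxc/include/jessie-systemd.conf)"
```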

Note that this uses container-getty@.service and not getty@.service. This is by design, and getty@.service should not be used for anything but gettys running on real VTs - systemd provides the former service for a reason. (The LXC templates currently get this wrong.)

Configuring LXC

First of all, LXC needs to be told that it should consider all cgroups when bind-mounting the cgroup hierarchy into a container (see below for explanations). To do so, create the configuration file /etc/lxc/lxc.conf and add the line

lxc.cgroup.use = @all

See the lxc.system.conf(5) manpage for details.

For the specific container, since no LXC template was used, the LXC configuration has to be done manually. LXC 1.0 can process includes in configuration files, so one can create a global file for options common to all containers of a specific type, and a local configuration file for the container-specific options. The global options may be stored under e.g. /etc/lxc/include/jessie-systemd.conf:

# Common configuration for Jessie systemd LXC containers
# (so that individual container configuration files are easier to read)

# Default pivot location
lxc.pivotdir = lxc_putold

# Properly shutdown the container with lxc-stop
lxc.haltsignal = SIGRTMIN+4

# Default mount entries
# systemd as init in container: we need to premount everything so that systemd
# will work without CAP_SYS_ADMIN.
# (note that for cgroup:mixed to work, we also have /etc/lxc/lxc.conf to
# make sure we include all cgroups!)
lxc.mount.auto = proc:mixed sys:ro cgroup:mixed
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0

# Default console settings
lxc.tty = 4
lxc.devttydir = lxc
lxc.pts = 1024

# Default capabilities
#   (note that audit_control is required for systemd-logind)
lxc.cap.drop = mac_admin mac_override net_admin sys_admin sys_module sys_rawio sys_time syslog

# Default cgroup limits
lxc.cgroup.devices.deny = a
## Allow any mknod (but not using the node)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
## /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
## full
lxc.cgroup.devices.allow = c 1:7 rwm
## consoles
lxc.cgroup.devices.allow = c 5:0 rwm
## /dev/{,u}random
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
## /dev/pts/*
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm
## rtc
lxc.cgroup.devices.allow = c 254:0 rm
## hpet
lxc.cgroup.devices.allow = c 10:228 rm

# Blacklist some syscalls which are not safe in privileged
# containers (part of LXC package)
lxc.seccomp = /usr/share/lxc/config/common.seccomp

# Needed for systemd
lxc.autodev = 1
lxc.kmsg = 0

The local configuration file /var/lib/lxc/lxc1/config can look like this, for example:

lxc.include = /etc/lxc/include/jessie-systemd.conf

lxc.utsname = lxc1
lxc.rootfs  = /srv/lxc/lxc1
lxc.arch    = amd64

lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.veth.pair = l-lxc1
lxc.network.hwaddr = 00:16:3e:3e:11:68
lxc.network.ipv4 = 192.168.0.11/24
lxc.network.ipv4.gateway = 192.168.0.1

Explanations

In the following the choice of configuration options will be explained in detail. First of all, the container-specific configuration should be self-explanatory:

lxc.utsname = lxc1 - the container will not be able to set its own host name (capability will be dropped), so LXC has to do it for the container
lxc.rootfs = /srv/lxc/lxc1 - where the rootfs lies (optional)
lxc.arch = amd64 - the container's architecture, see the lxc.container.conf(5) manpage for possible values, should match with the architecture specified in debootstrap
lxc.network.type = veth - create a network device in the container via the veth pair mechanism (see the kernel documentation for details)
lxc.network.flags = up - the network interface should be up (the container won't be allowed to change that)
lxc.network.link = br0 - add the host's pair device to the bridge br0
lxc.network.name = eth0 - name inside the container
lxc.network.veth.pair = l-lxc1 - name of the pair device outside of the container, chosen here to include the container's name for clarity, so that the administrator on the host can easily grasp the meaning of this network device (note that interface names can be at most 15 characters long, since the kernel's 16-byte limit includes the terminating NUL, so this scheme might be problematic with long host names)
lxc.network.hwaddr = 00:16:3e:3e:11:68 - MAC address of the device inside the container (outside the container LXC will automatically choose a random MAC address that will not cause problems with bridges). The first 3 bytes (00:16:3e) are registered to Xen, which is also a virtualization solution for Linux; other tools such as Libvirt also use this prefix for their devices. The last three bytes should be chosen randomly (don't reuse the ones in this example).
lxc.network.ipv4 = 192.168.0.11/24 - the IP address of the container, together with its netmask
lxc.network.ipv4.gateway = 192.168.0.1 - the default gateway. For bridged containers (such as this example) this should be the same as the host's gateway (and not the host itself)

Of course, LXC also supports more complicated network configurations (e.g. multiple network devices), but for simple containers, this is sufficient.
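
Choosing the last three bytes of the MAC address at random can be done with standard tools. The following is a minimal sketch: the function name random_mac is hypothetical, and it simply reads three bytes from /dev/urandom and appends them to the 00:16:3e prefix used in this article.

```shell
#!/bin/sh
# Hypothetical helper: print a MAC address with the 00:16:3e prefix and
# three random trailing bytes read from /dev/urandom.
random_mac() {
    od -An -N3 -tx1 /dev/urandom |
        awk '{ printf "00:16:3e:%s:%s:%s\n", $1, $2, $3 }'
}

# Example use:
#   echo "lxc.network.hwaddr = $(random_mac)"
```

The function does not check for collisions with other containers on the same bridge; with only a handful of containers that is unlikely, but it remains the administrator's job to verify.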

The global configuration is much more interesting. First, the most important line is

lxc.cap.drop = mac_admin mac_override net_admin sys_admin sys_module sys_rawio sys_time syslog

It tells LXC to drop the following capabilities from the bounding set:

CAP_MAC_ADMIN: administrate the Mandatory Access Controls
CAP_MAC_OVERRIDE: override Mandatory Access Controls
CAP_NET_ADMIN: modify network configuration
CAP_SYS_ADMIN: this is the big one[tm] - a lot of special abilities in the kernel depend on this, for example the ability to mount filesystems
CAP_SYS_MODULE: load/unload kernel modules
CAP_SYS_RAWIO: access devices in a raw manner
CAP_SYS_TIME: modify the kernel's internal clock
CAP_SYSLOG: the name is confusing. It controls whether a process may control the kernel's (global) internal log buffer, i.e. what dmesg returns. Reading data from dmesg will still work, however, unless you turn on the kernel.dmesg_restrict sysctl on the host; then only processes with CAP_SYSLOG (or, for backwards compatibility, CAP_SYS_ADMIN) can read it.
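
Whether the drop actually took effect can be checked from inside a running container by comparing the CapBnd mask of its init process with the list above. This is a sketch only: the function names cap_bit and still_present are made up, and the bit numbers are the ones from linux/capability.h for the eight capabilities dropped here.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Bit numbers (linux/capability.h) of the capabilities dropped above.
cap_bit() {
    case $1 in
        net_admin)    echo 12 ;;
        sys_module)   echo 16 ;;
        sys_rawio)    echo 17 ;;
        sys_admin)    echo 21 ;;
        sys_time)     echo 25 ;;
        mac_override) echo 32 ;;
        mac_admin)    echo 33 ;;
        syslog)       echo 34 ;;
        *) echo "unknown capability: $1" >&2; return 1 ;;
    esac
}

# still_present HEXMASK NAME...: print every listed capability that is
# still set in HEXMASK (e.g. the CapBnd value of the container's init).
still_present() {
    mask=$1; shift
    for name in "$@"; do
        bit=$(cap_bit "$name") || return 1
        if [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
    done
}

# Example use inside the container (should print nothing):
#   mask=$(awk '$1 == "CapBnd:" { print $2 }' /proc/1/status)
#   still_present "$mask" mac_admin mac_override net_admin sys_admin \
#       sys_module sys_rawio sys_time syslog
```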

Further limits are imposed by using device cgroups, which regulate which devices a container can access. Except for special circumstances, only pseudo-devices (such as /dev/null) should be allowed. Note that if the container doesn't drop CAP_SYS_ADMIN, device cgroups only protect against accidental usage of devices, not malicious usage. There are two obvious ways in which a container may escape device cgroups if root inside has CAP_SYS_ADMIN: it could either directly whitelist additional devices (the check is against that capability) or it could move itself to another cgroup - and if the cgroup filesystem is not mounted in a way that makes this directly possible, a malicious user with CAP_SYS_ADMIN could easily change that. (Note that unprivileged containers are not affected by this, because device cgroups require the capability in the initial user namespace, not the container's user namespace.)

The policy for device cgroups here is relatively restrictive: disallow everything and then selectively whitelist things. mknod is whitelisted globally, since scripts might fail if that is not available (although it is strictly speaking not required for the use of containers). The devices that are allowed are: /dev/null, /dev/zero, /dev/full (reads return zeros like /dev/zero, but writes always fail with ENOSPC), /dev/tty (a process's controlling terminal), /dev/random, /dev/urandom, /dev[/pts]/ptmx, /dev/pts/*, /dev/rtc (read only) and /dev/hpet (read only).
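
If additional devices need to be whitelisted, their type and major:minor numbers have to be looked up first. The following is a minimal sketch that prints them in the notation used by lxc.cgroup.devices.allow. The function name dev_numbers is hypothetical, and the sketch assumes GNU stat (as shipped in Debian's coreutils), whose %t and %T formats print the numbers in hexadecimal.

```shell
#!/bin/sh
# Hypothetical helper: print the type and major:minor numbers of a device
# node in the notation used by lxc.cgroup.devices.allow.
# Assumes GNU stat; %t and %T are the major and minor numbers in hex.
dev_numbers() {
    if [ -c "$1" ]; then
        kind=c
    elif [ -b "$1" ]; then
        kind=b
    else
        echo "$1: not a device node" >&2
        return 1
    fi
    set -- $(stat -L -c '%t %T' "$1")
    echo "$kind $((0x$1)):$((0x$2))"
}

# Example use on the host:
#   echo "lxc.cgroup.devices.allow = $(dev_numbers /dev/null) rwm"
```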

In addition to device cgroups, seccomp filters are used to restrict certain system calls into the kernel that don't make sense from within containers. This uses the default filter shipped with Debian's LXC package, /usr/share/lxc/config/common.seccomp.

Since CAP_SYS_ADMIN is missing in the container, the container cannot (re)mount any filesystem from within, so LXC has to set up all filesystems by itself. Handling of /dev/pts is integrated into LXC anyway, and the other filesystems are configured as follows:

lxc.autodev = 1
This mounts an empty devtmpfs on /dev and creates necessary device nodes there.
lxc.mount.auto = proc:mixed sys:ro cgroup:mixed
This takes care of automatically mounting most required pseudo-filesystems. This includes:
- Mount /proc with a new procfs instance, but remount /proc/sysrq-trigger and /proc/sys read-only.
- Mount /sys read-only.
- Mount /sys/fs/cgroup hierarchy in such a way that the container can't escape its own hierarchy, but has full access to its own.
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
Mount the filesystem required for use of POSIX shared memory segments.
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
Mount the /run filesystem.
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
Mount the /run/lock filesystem as a separate tmpfs.
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
Mount /run/user as a separate tmpfs, so that users can't DOS the system. See below for details.
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
The pseudo-filesystem required for POSIX message queues. (IPC mechanism) If you are sure no software inside the container needs this, this could also be dropped. (And then you need to do systemctl mask dev-mqueue.mount inside the container to get rid of error messages.)
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0
The pseudo-filesystem required for huge page support. If you are sure no software inside the container needs this, this could also be dropped. (And then you need to do systemctl mask dev-hugepages.mount inside the container to get rid of error messages.)
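
After the first boot it is worth checking that everything listed above really is mounted inside the container, since the container cannot fix this itself. The following is a sketch that looks up a mount point in a /proc/mounts-style table. The function name mount_state is made up, and the optional second argument exists only so that the function can be tried against a saved copy of the table.

```shell
#!/bin/sh
# Hypothetical helper: look up MOUNTPOINT in a /proc/mounts-style table
# (default: /proc/mounts) and print "<fstype> ro" or "<fstype> rw" for
# the last matching entry. Fails if the mount point is not listed.
mount_state() {
    awk -v mp="$1" '
        $2 == mp {
            fs = $3; st = "rw"; found = 1
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") st = "ro"
        }
        END { if (!found) exit 1; print fs, st }
    ' "${2:-/proc/mounts}"
}

# Example use inside the container:
#   for mp in /proc /sys /dev/shm /run /run/lock /run/user \
#             /dev/mqueue /dev/hugepages; do
#       mount_state "$mp" >/dev/null || echo "$mp is not mounted"
#   done
```

With the configuration from this article, /sys should be reported as read-only.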

Conditionalization of services

systemd supports starting units (services, mounts, ...) only if a certain condition is fulfilled. This is the key feature that is used to make sure that some services (such as udev) are not started when run inside a container. For this reason, a minimal Debian installation should boot cleanly with the above configuration. For example:

systemd-udevd.service has ConditionPathIsReadWrite=/sys to make udev only run if /sys is mounted read-write. The configuration above mounts /sys read-only.
sys-kernel-debug.mount has ConditionCapability=CAP_SYS_RAWIO so that debugfs is mounted on /sys/kernel/debug only if CAP_SYS_RAWIO is available. That capability is dropped in the configuration above.
Services may have ConditionVirtualization=!container to make them run only on bare metal and in VMs, but not inside containers.

This logic can ensure that a clean boot can occur inside containers - as compared to sysvinit containers, where the policy was just to ignore errors that occurred at boot time because of the lack of capabilities. systemctl --failed on a freshly booted container should not show any failed units.

Note that services in non-core packages might not have been adapted properly in this way.
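
To find out whether a given unit has been adapted, one can look at the conditions it declares. The following is a minimal sketch that prints the Condition*= lines of a unit file. The function name unit_conditions is hypothetical, and the sketch only reads the file given to it, so drop-ins are not taken into account.

```shell
#!/bin/sh
# Hypothetical helper: print the Condition*= lines of a unit file, so it
# is easy to see under which circumstances systemd will skip the unit.
unit_conditions() {
    grep -E '^Condition[A-Za-z]+=' "$1"
}

# Example use inside the container:
#   unit_conditions /lib/systemd/system/systemd-udevd.service
```

The function fails (non-zero exit status) if the unit declares no conditions at all, which makes it usable in a loop that lists unconditional units.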

Problems with syslog

systemd brings along with it the so-called journal to collect log messages. Debian's systemd is configured in such a way that by default it will just keep the journal in memory, but forward all messages to a standard syslog daemon. Unfortunately, because of the way systemd parallelizes boot, the syslog daemon is only started relatively late in the boot process. The kernel's socket buffer only allows queuing of 11 messages by default, which causes log messages to be dropped (and under somewhat higher load, messages might also be dropped during operation, even when syslog is running). This issue is not specific to containers; it has been tracked under Debian Bug #762700. For Jessie, the issue has been addressed by increasing the maximum number of messages that may be queued to 513, see the bug report and also systemd-setup-dgram-qlen.service. This setting is per network namespace, however, so new network namespaces will still start with the default value of 11 messages. This can be mitigated by updating the (global or per-container) LXC configuration to run a hook script that sets this up within the network namespace of the container (before privileges are dropped):

# Set max_dgram_qlen for the container (/proc/sys is r/o inside!),
# to make journal -> syslog forwarding work properly.
lxc.hook.mount = /usr/local/lib/lxc/lxc-set-max-dgram-qlen

The hook script could look something like this:

#!/bin/sh

set -e
umask 022

[ -z "$LXC_NAME" ] && { echo "$0: Cannot execute mount hook from non-lxc environment." >&2; exit 1; }
[ -z "$LXC_ROOTFS_MOUNT" ] && { echo "$0: \$LXC_ROOTFS_MOUNT not set. Cannot continue." >&2; exit 1; }
[ -d "$LXC_ROOTFS_MOUNT" ] || { echo "$0: \$LXC_ROOTFS_MOUNT not a directory. Cannot continue." >&2; exit 1; }

# NOTE: this is the host's /proc, but /proc/sys/net uses the netns
#       of the process accessing it
#
# Don't fail if this isn't successful, better boot a system with
# degraded logging than no system at all...
# (the braces make sure that a failure to open the file is silenced too)
{ echo 512 > /proc/sys/net/unix/max_dgram_qlen; } 2>/dev/null || :

This will also perform the same setting for each container.
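
Whether the hook did its job can be verified from inside the container, where /proc/sys is read-only but still readable. This is a sketch only: the function name dgram_qlen_ok is made up, and its optional second argument exists purely to make the function testable against an ordinary file.

```shell
#!/bin/sh
# Hypothetical check: succeed if the queue length visible in the current
# network namespace is at least WANTED (default 512).
dgram_qlen_ok() {
    wanted=${1:-512}
    file=${2:-/proc/sys/net/unix/max_dgram_qlen}
    current=$(cat "$file" 2>/dev/null) || return 1
    [ "$current" -ge "$wanted" ]
}

# Example use inside the container:
#   dgram_qlen_ok || echo "max_dgram_qlen was not raised, check the hook"
```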

The proper long-term solution would be for syslog implementations to pull from the journal, which will then make sure that no message can ever be lost. But no syslog daemon in Jessie supports this yet (too old or support not compiled in).

Security trade-offs

Some modern security features in systemd require CAP_SYS_ADMIN to work properly. Having the container run without that capability will not allow the usage of these features. This is a trade-off: while dropping CAP_SYS_ADMIN from a container makes it much, much harder for root inside a container to break out into the host, it reduces barriers inside the container. The following features are affected:

logind would typically mount an individual tmpfs for each user, but since mounting requires CAP_SYS_ADMIN, this cannot work. logind will then just chown + chmod the users' runtime directories. If they are on the same tmpfs as /run itself, users could in principle DOS the container by using up all the space in /run, so the configuration above makes /run/user a separate tmpfs instance to mitigate that somewhat. Users might still DOS each other, but not the entire container. Note: this only works with systemd version 215-15 or later, which is not in Jessie yet, but will likely be part of the Jessie stable release. In older versions, logind will simply complain that it can't mount and has no fallback. (Logins will still work, but XDG_RUNTIME_DIR will not be properly set up.)
PrivateTmp= and PrivateDevices= use mount namespaces, which can't be set up without CAP_SYS_ADMIN.
PrivateNetwork= uses network namespaces, which can't be set up without CAP_SYS_ADMIN.
ProtectSystem=, ProtectHome=, ReadOnlyDirectories=, ReadWriteDirectories= and InaccessibleDirectories= use mount namespaces, which can't be set up without CAP_SYS_ADMIN.
SystemCallFilter= and RestrictAddressFamilies= will only work if NoNewPrivileges=yes is also set. (Seccomp requires CAP_SYS_ADMIN otherwise.)
IOSchedulingClass=realtime won't work, since it requires CAP_SYS_ADMIN.
Possibly other settings.

In each case, one should carefully evaluate the trade-off between not having CAP_SYS_ADMIN, i.e. securing the host against the container, and securing the container against individual services. Since these configuration options of systemd can also be considered part of a defence in depth strategy, this is something that has to be decided on a case-by-case basis.
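
To get an overview of which installed services are affected, the unit files can be searched for the options named above. The following is a minimal sketch. The function name find_affected_units is hypothetical, the option list covers only the mount- and network-namespace settings from this section, and units that set an option explicitly to "no" are listed as well.

```shell
#!/bin/sh
# Hypothetical helper: list unit files below DIR that set one of the
# namespace-related options named in this section to a non-empty value.
find_affected_units() {
    grep -rlE '^(PrivateTmp|PrivateDevices|PrivateNetwork|ProtectSystem|ProtectHome|ReadOnlyDirectories|ReadWriteDirectories|InaccessibleDirectories)=.' "$1"
}

# Example use against the container's rootfs from the host:
#   find_affected_units /srv/lxc/lxc1/lib/systemd/system
```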

Dealing with services that use these features

Services that use the features mentioned above will fail to start in containers without CAP_SYS_ADMIN. The easiest way to disable these features is to use drop-ins to reset these options. For example, if a service x.service sets PrivateTmp=yes, one can create a drop-in /etc/systemd/system/x.service.d/container_wo_CAP_SYS_ADMIN.conf with the following contents:

[Service]
# Disable PrivateTmp= because we are run in a container w/o CAP_SYS_ADMIN,
# where this setting doesn't work.
PrivateTmp=no

Settings that take lists can be reset by setting them to empty, e.g.

[Service]
# Disable ReadOnlyDirectories= because we are run in a container w/o CAP_SYS_ADMIN,
# where this setting doesn't work.
ReadOnlyDirectories=
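
When several services need such overrides, writing the drop-ins by hand gets tedious. The following is a sketch of a small generator. The function name write_dropin is made up, and unlike the file name used above it appends the setting's name to the file name, so that several overrides for one unit don't overwrite each other.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that overrides one [Service]
# setting of UNIT below ROOT (use the container's rootfs path when
# running on the host). An empty VALUE resets list-type settings.
write_dropin() {
    root=$1
    unit=$2
    setting=$3
    value=$4
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir"
    cat > "$dir/container_wo_CAP_SYS_ADMIN-$setting.conf" <<EOF
[Service]
# Disable $setting= because we are run in a container w/o CAP_SYS_ADMIN,
# where this setting doesn't work.
$setting=$value
EOF
}

# Example use:
#   write_dropin /srv/lxc/lxc1 x.service PrivateTmp no
#   write_dropin /srv/lxc/lxc1 x.service ReadOnlyDirectories ""
```

If the container is already running, systemd inside it has to reload its configuration (systemctl daemon-reload) before the drop-ins take effect.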

Summary

It is possible to have containers without CAP_SYS_ADMIN running Jessie and systemd, but there are some trade-offs to make.

Things in this article can always be improved, so please send me feedback.

Tags: debian, jessie, lxc, systemd
