LXC containers without CAP_SYS_ADMIN under Debian Jessie
Containers can be an alternative to virtualization in some cases. They require fewer resources (because the host's resources are reused) and are typically faster than virtual machines. Also, some things cannot be done as easily with virtualization. However, they do not provide the same security guarantees. While there have been bugs in the past that allowed attackers to escape virtual machines and access the host, and there may still be bugs in current virtualization software, in general it is much harder to escape a virtual machine than it is to escape a container. The most promising development in making containers more secure is a concept called 'user namespaces' in current Linux kernels, which allows a container's users to be mapped to different 'real' users outside of it, so any action root takes inside a container will be considered an action taken by a regular user outside of it. Unfortunately, there is currently still a high usability barrier surrounding them. This article is not about that type of container.
If one does not use user namespaces, containers are by default not even remotely secure. If the root user in a container is really root, they can just mount any of the kernel's pseudo-filesystems (proc, sysfs, debugfs, etc.) and directly modify settings. It is unlikely that a skilled attacker would fail to find a way to escape the container by such means, and even if they did, denial of service is always an option (rebooting the system via /proc/sysrq-trigger, for example).

Some aspects of the root user's power can be mitigated by use of the so-called 'capability bounding set'. Capabilities under Linux are based on a withdrawn POSIX draft, and they partition the special kernel-level permissions that the root user has and a normal user doesn't (file access rights that root has simply by virtue of owning most of the filesystem don't count here). For example, only a process with the CAP_SYS_MODULE capability is allowed to load modules into the kernel, regardless of its user id (it doesn't even have to be the root user). How capabilities work in detail is not trivial to understand, but there is one important concept called the 'capability bounding set': the list of capabilities that a process and all of its child processes will ever be able to acquire. For example, a setuid root program will typically acquire all available capabilities (regardless of the capabilities of the calling process), but only those that are in the capability bounding set. So once a capability is removed from the bounding set of a process, it can never be reacquired by that process or any descendant process. If that process happens to be the init process inside a container, no process inside the container will be able to gain that capability. Thus, by removing e.g. CAP_SYS_MODULE from the capability bounding set of the initial process at container startup, no process inside the container will ever be able to load or unload a kernel module. (One can always remove additional capabilities from the bounding set, but never add to it.) With capabilities, root inside a container can therefore be further restricted, and LXC supports dropping arbitrary capabilities via its configuration.
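As a small illustration of the bounding-set mechanism, a process's bounding set can be inspected via /proc. The following sketch (plain POSIX shell, Linux-only; it assumes CAP_SYS_ADMIN is capability number 21, as defined in the kernel's capability.h) decodes whether that capability is still present for the current shell:

```shell
# Read the capability bounding set mask of the current process and
# test bit 21 (CAP_SYS_ADMIN). In a container set up as described in
# this article, this reports that the capability has been dropped.
capbnd=$(awk '/^CapBnd:/ { print $2 }' /proc/self/status)
if [ $(( (0x$capbnd >> 21) & 1 )) -eq 1 ]; then
    echo "CAP_SYS_ADMIN is in the bounding set"
else
    echo "CAP_SYS_ADMIN has been dropped from the bounding set"
fi
```

Tools like capsh from the libcap package can decode the whole mask, but the raw /proc view works everywhere and has no dependencies.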
Notice: this article is considered to be a work in progress, i.e. there may be ways to improve upon the things discussed here. If you can think of any improvements, I'd like to hear about them.
General considerations
This article will describe how to set up Jessie containers on a Jessie host. Both the host and the containers will run systemd. If the host is still running Wheezy (and therefore sysvinit), a follow-up article discusses how to achieve that.
This article will only use software available directly in Debian Jessie without any third-party packages.
On the host, the following packages should be installed:
- lxc
- debootstrap
This article assumes you know how to set up LXC networking according to your own preferences. The example configurations shown here will assume there is a bridge br0 and that the containers have network connectivity via veth pairs, but this is just a detail.
Creating the container
The canonical way of setting up a container with LXC is to use the lxc-create tool, which calls different template scripts to automatically set up a container. This article doesn't follow that path, mostly because the template is still geared toward sysvinit systems and changes things around inside the container. These changes (with a small exception) are not necessary for systemd containers (systemd allows you to boot a normal systemd in a container without changes to the normal system). And while most of the changes it makes concern sysvinit-related things, it does change some systemd-related things that are actually not optimal. That said, the template does work, so it provides an easy way to set up a rootfs.
Throughout this article, the container's rootfs will be placed at /srv/lxc/lxc1 (non-standard w.r.t. LXC) and the container's configuration will reside in /var/lib/lxc/lxc1/config (the standard LXC path). The container's name will be lxc1 for this example. Where to place the rootfs is a matter of personal preference.
In order to create the container, one needs to debootstrap it:
debootstrap --arch=amd64 jessie /srv/lxc/lxc1
This will create a fresh Debian installation at the rootfs path. After bootstrapping the container, a few trivial configuration details need to be taken care of:

- /srv/lxc/lxc1/etc/hostname: place the hostname in there.
- /srv/lxc/lxc1/etc/network/interfaces: the container described here will also lack the capability to perform its own network configuration; that part will be handled by LXC directly. However, Debian's networking scripts should see that there is actually a network available, so the contents of the file should read:

      # interfaces(5) file used by ifup(8) and ifdown(8)
      # Include files from /etc/network/interfaces.d:
      source-directory /etc/network/interfaces.d

      auto eth0
      iface eth0 inet manual

  The first couple of lines are from Debian itself (and should remain there), but the important part is the inet manual configuration of the eth0 interface. This tells Debian's scripts that something outside their scope will take care of configuring the network interface, but that they should assume it will be available.
- /srv/lxc/lxc1/etc/resolv.conf: the nameserver should be properly configured here. debootstrap copies the host's configuration by default, so typically this does not need to be modified, but it should be checked.
- The FUSE pseudo-filesystem inside the container should be masked, so systemd doesn't attempt to mount it automatically. This is strictly speaking not necessary, but will get rid of a stupid error message at boot:

      ln -s /dev/null /srv/lxc/lxc1/etc/systemd/system/sys-fs-fuse-connections.mount
Finally, LXC supports pseudo-virtual terminals (lxc-console, the lxc.tty setting). Newer LXC versions (1.1, not part of Jessie) are able to tell systemd automatically which ones have been generated, but with the version shipping in Jessie, they still need to be activated manually. In this example, 4 consoles are used. To activate them, container-getty@.service must be instantiated. systemd's default service expects the name of the device node beneath /dev/pts within the container as the instance name; however, LXC before 1.1 doesn't use the PTS instance of the container for its consoles, so the device nodes will not lie within that directory. Rather, the configuration below will use lxc.devttydir = lxc, so the nodes will be beneath /dev/lxc. However, going up to the parent directory works for the instance names (although the getty greeting looks slightly awkward). Therefore, the following commands should be used to activate the consoles:

for i in 1 2 3 4 ; do
    ln -s /lib/systemd/system/container-getty@.service \
        /srv/lxc/lxc1/etc/systemd/system/multi-user.target.wants/container-getty\@..-lxc-tty$i.service
done
If you don't want to use these consoles (lxc.tty = 0), this step can be skipped. If you want a different number of consoles, change both the setting in LXC's configuration file and the number of instances you create here.
Note that this uses container-getty@.service and not getty@.service. This is by design: getty@.service should not be used for anything but gettys running on real VTs - systemd provides the former service for a reason. (The LXC templates currently get this wrong.)
Configuring LXC
First of all, LXC needs to be told to consider all cgroups when bind-mounting the cgroup hierarchy into a container (see below for explanations). Therefore, one needs to create the configuration file /etc/lxc/lxc.conf and add the line
lxc.cgroup.use = @all
See the lxc.system.conf(5) manpage for details.
For the specific container, since no LXC template was used, the LXC configuration has to be done manually. LXC 1.0 can process includes in configuration files, so one can create a global file for options common to all containers of a specific type, and a local configuration file for the container-specific options. The global options may be stored under e.g. /etc/lxc/include/jessie-systemd.conf:
# Common configuration for Jessie systemd LXC containers
# (so that individual container configuration files are easier to read)

# Default pivot location
lxc.pivotdir = lxc_putold

# Properly shutdown the container with lxc-stop
lxc.haltsignal = SIGRTMIN+4

# Default mount entries
# systemd as init in container: we need to premount everything so that systemd
# will work without CAP_SYS_ADMIN.
# (note that for cgroup:mixed to work, we also have /etc/lxc/lxc.conf to
# make sure we include all cgroups!)
lxc.mount.auto = proc:mixed sys:ro cgroup:mixed
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0

# Default console settings
lxc.tty = 4
lxc.devttydir = lxc
lxc.pts = 1024

# Default capabilities
# (note that audit_control is required for systemd-logind)
lxc.cap.drop = mac_admin mac_override net_admin sys_admin sys_module sys_rawio sys_time syslog

# Default cgroup limits
lxc.cgroup.devices.deny = a
## Allow any mknod (but not using the node)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
## /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
## full
lxc.cgroup.devices.allow = c 1:7 rwm
## consoles
lxc.cgroup.devices.allow = c 5:0 rwm
## /dev/{,u}random
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
## /dev/pts/*
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm
## rtc
lxc.cgroup.devices.allow = c 254:0 rm
## hpet
lxc.cgroup.devices.allow = c 10:228 rm

# Blacklist some syscalls which are not safe in privileged
# containers (part of LXC package)
lxc.seccomp = /usr/share/lxc/config/common.seccomp

# Needed for systemd
lxc.autodev = 1
lxc.kmsg = 0
The local configuration file /var/lib/lxc/lxc1/config can look like this, for example:
lxc.include = /etc/lxc/include/jessie-systemd.conf

lxc.utsname = lxc1
lxc.rootfs = /srv/lxc/lxc1
lxc.arch = amd64

lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.veth.pair = l-lxc1
lxc.network.hwaddr = 00:16:3e:3e:11:68
lxc.network.ipv4 = 192.168.0.11/24
lxc.network.ipv4.gateway = 192.168.0.1
Explanations
In the following, the choice of configuration options will be explained in detail. First of all, the container-specific configuration should be self-explanatory:
- lxc.utsname = lxc1 - the container will not be able to set its own host name (the capability will be dropped), so LXC has to do it for the container
- lxc.rootfs = /srv/lxc/lxc1 - where the rootfs lies (optional)
- lxc.arch = amd64 - the container's architecture; see the lxc.container.conf(5) manpage for possible values, should match the architecture specified for debootstrap
- lxc.network.type = veth - create a network device in the container via the veth pair mechanism (see the kernel documentation for details)
- lxc.network.flags = up - the network interface should be up (the container won't be allowed to change that)
- lxc.network.link = br0 - add the host's pair device to the bridge br0
- lxc.network.name = eth0 - name inside the container
- lxc.network.veth.pair = l-lxc1 - name of the pair device outside of the container, chosen here to include the container's name for clarity, so that the administrator on the host can easily grasp the meaning of this network device (note that interface names can be at most 16 characters long, so this scheme might be problematic with long host names)
- lxc.network.hwaddr = 00:16:3e:3e:11:68 - MAC address of the device inside the container (outside the container LXC will automatically choose a random MAC address that will not cause problems with bridges); the first 3 bytes (00:16:3e) are registered to Xen, which is also a virtualization solution for Linux, and other tools such as libvirt also use them for their devices; the last three bytes should be chosen randomly (don't reuse the one in this example)
- lxc.network.ipv4 = 192.168.0.11/24 - the IP address of the container, together with its netmask
- lxc.network.ipv4.gateway = 192.168.0.1 - the default gateway; for bridged containers (such as this example) this should be the same as the host's gateway (and not the host itself)
Of course, LXC also supports more complicated network configurations (and e.g. multiple network devices), but for simple containers, this is sufficient.
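To pick the random last three bytes of the container's MAC address, something along these lines can be used (a sketch; any method that produces three random bytes works just as well):

```shell
# Generate a MAC address in the Xen OUI (00:16:3e) with three random
# trailing bytes, suitable for the lxc.network.hwaddr setting.
printf '00:16:3e:%02x:%02x:%02x\n' \
    $(od -An -N3 -tu1 /dev/urandom)
```

The result can be pasted directly into the container's configuration file.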
The global configuration is much more interesting. First, the most important line is
lxc.cap.drop = mac_admin mac_override net_admin sys_admin sys_module sys_rawio sys_time syslog
It tells LXC to drop the following capabilities from the bounding set:
- CAP_MAC_ADMIN: administrate Mandatory Access Control
- CAP_MAC_OVERRIDE: override Mandatory Access Control
- CAP_NET_ADMIN: modify the network configuration
- CAP_SYS_ADMIN: this is the big one[tm] - a lot of special abilities in the kernel depend on this, for example the ability to mount filesystems
- CAP_SYS_MODULE: load/unload kernel modules
- CAP_SYS_RAWIO: access devices in a raw manner
- CAP_SYS_TIME: modify the kernel's internal clock
- CAP_SYSLOG: the name is confusing - it controls whether a process may control the kernel's (global) internal log buffer, i.e. what dmesg returns. Reading data from dmesg will still work, however, unless the kernel.dmesg_restrict sysctl is turned on on the host; then reading also requires this capability.
Further limits are imposed by using device cgroups, which can regulate which devices containers have access to. Except for special circumstances, in general only pseudo-devices (such as /dev/null) should be allowed. Note that if the container didn't drop CAP_SYS_ADMIN, device cgroups would only be a safeguard against accidental usage of devices, not malicious usage. There are two obvious ways in which a container may escape device cgroups if root inside has CAP_SYS_ADMIN: it could either directly whitelist additional devices (the check is against that capability), or it could move itself to another cgroup - and if the cgroup filesystem is not mounted in a way that makes that directly possible, with CAP_SYS_ADMIN a malicious user could easily change that. (Note that unprivileged containers are not affected by this, because device cgroups require the capability in the initial user namespace, not the container's user namespace.)
The policy for device cgroups here is relatively restrictive: disallow everything and then selectively whitelist things. mknod is whitelisted globally, since scripts might fail if it is not available (although it is strictly speaking not required for the use of containers). The devices that are allowed are: /dev/null, /dev/zero, /dev/full (reads like /dev/zero, but writes fail with ENOSPC), /dev/tty (a process's currently active terminal), /dev/random, /dev/urandom, /dev[/pts]/ptmx, /dev/pts/*, /dev/rtc (read-only) and /dev/hpet (read-only).
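The major:minor numbers in the whitelist above can be cross-checked against existing device nodes with stat(1); for example, /dev/null corresponds to the "c 1:3" entry:

```shell
# Print the device type and the major:minor numbers of a node.
# With GNU stat, %t and %T are the major and minor numbers in hex;
# for /dev/null this yields "character special file 1:3".
stat -c '%F %t:%T' /dev/null
```

The same command run against any other node in the container's /dev quickly shows whether it is covered by one of the lxc.cgroup.devices.allow lines.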
In addition to device cgroups, seccomp filters are used to restrict certain system calls into the kernel that don't make sense from within containers; this uses the default /usr/share/lxc/config/common.seccomp shipped with Debian's LXC.
Since CAP_SYS_ADMIN is missing in the container, the container cannot (re)mount any filesystem from within, so LXC has to set up all filesystems by itself. Handling of /dev/pts is integrated into LXC anyway, and the other filesystems are configured as follows:
- lxc.autodev = 1
  This mounts a fresh tmpfs on /dev and creates the necessary device nodes there.
- lxc.mount.auto = proc:mixed sys:ro cgroup:mixed
  This takes care of automatically mounting most required pseudo-filesystems. This includes:
  - Mount /proc with a new procfs instance, but remount /proc/sysrq-trigger and /proc/sys read-only.
  - Mount /sys read-only.
  - Mount the /sys/fs/cgroup hierarchy in such a way that the container can't escape its own cgroup, but has full access to it.
- lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
  Mount the filesystem required for the use of POSIX shared memory segments.
- lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
  Mount the /run filesystem.
- lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
  Mount the /run/lock filesystem as a separate tmpfs.
- lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
  Mount /run/user as a separate tmpfs, so that users can't DOS the system. See below for details.
- lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
  The pseudo-filesystem required for POSIX message queues (an IPC mechanism). If you are sure no software inside the container needs this, it could also be dropped. (You then need to run systemctl mask dev-mqueue.mount inside the container to get rid of error messages.)
- lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0
  The pseudo-filesystem required for huge page support. If you are sure no software inside the container needs this, it could also be dropped. (You then need to run systemctl mask dev-hugepages.mount inside the container to get rid of error messages.)
Conditionalization of services
systemd supports starting units (services, mounts, ...) only if a certain condition is fulfilled. This is the key feature used to make sure that some services (such as udev) are not started when running inside a container, and it is the reason a minimal Debian installation should boot cleanly with the above configuration. For example:
- systemd-udevd.service has ConditionPathIsReadWrite=/sys to make udev run only if /sys is mounted read-write. The configuration above mounts /sys read-only.
- sys-kernel-debug.mount has ConditionCapability=CAP_SYS_RAWIO to make debugfs be mounted on /sys/kernel/debug only if CAP_SYS_RAWIO is available. That capability is dropped in the configuration above.
- Services may have ConditionVirtualization=!container to make them run only on bare metal and in VMs, but not inside containers.
This logic can ensure a clean boot inside containers - as compared to sysvinit containers, where the policy was just to ignore errors that occurred at boot time because of the lack of capabilities. systemctl --failed on a freshly booted container should not show any failed units.
Note that services in non-core packages might not have been adapted properly in this way.
Problems with syslog
systemd brings along the so-called journal to collect log messages. Debian's systemd is configured in such a way that by default it will just keep the journal in memory and forward all messages to a standard syslog daemon. Unfortunately, because of the way systemd parallelizes boot, the syslog daemon is only started relatively late in the boot process. The kernel's socket buffer only allows 11 messages to be queued by default, which causes log messages to be dropped (and under somewhat higher load it may also cause log messages to be dropped during normal operation, even when syslog is running). This issue is not specific to containers; it has been tracked as Debian Bug #762700. For Jessie, the issue has been addressed by increasing the maximum number of messages that may be queued to 513, see the bug report and also systemd-setup-dgram-qlen.service. This setting is per network namespace, however, so new network namespaces will still inherit the default value of 11 messages. This can be mitigated by updating the (global or per-container) LXC configuration to run a hook script that sets this up within the network namespace of the container (before privileges are dropped):
# Set max_dgram_qlen for the container (/proc/sys is r/o inside!),
# to make journal -> syslog forwarding work properly.
lxc.hook.mount = /usr/local/lib/lxc/lxc-set-max-dgram-qlen
The hook script could look something like this:
#!/bin/sh

set -e
umask 022

[ -z "$LXC_NAME" ] && { echo "$0: Cannot execute mount hook from non-lxc environment." >&2; exit 1; }
[ -z "$LXC_ROOTFS_MOUNT" ] && { echo "$0: \$LXC_ROOTFS_MOUNT not set. Cannot continue." >&2; exit 1; }
[ -d "$LXC_ROOTFS_MOUNT" ] || { echo "$0: \$LXC_ROOTFS_MOUNT not a directory. Cannot continue." >&2; exit 1; }

# NOTE: this is the host's /proc, but /proc/sys/net uses the netns
# of the process accessing it
#
# Don't fail if this isn't successful, better boot a system with
# degraded logging than no system at all...
echo 512 > /proc/sys/net/unix/max_dgram_qlen 2>/dev/null || :
This way, the same setting is applied to each container.
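Whether the hook had the intended effect can be checked from a shell inside the relevant network namespace by simply reading the sysctl (the value is per network namespace):

```shell
# Show the unix datagram queue length limit for the current network
# namespace; after the hook ran, this should report 512 inside the
# container, while a fresh namespace shows the small kernel default.
cat /proc/sys/net/unix/max_dgram_qlen
```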
The proper long-term solution would be for syslog implementations to pull messages from the journal, which would ensure that no message is ever lost. But no syslog daemon in Jessie supports this yet (either too old or support not compiled in).
Security trade-offs
Some modern security features in systemd require CAP_SYS_ADMIN to work properly. Running the container without that capability means these features cannot be used. This is a trade-off: while dropping CAP_SYS_ADMIN from a container makes it much, much harder for root inside the container to break out into the host, it reduces barriers inside the container. The following features are affected:
- logind would typically mount an individual tmpfs for each user, but since mounting requires CAP_SYS_ADMIN, this cannot work. logind will then just chown+chmod the users' runtime directories. If they are on the same tmpfs as /run itself, users could in principle DOS the container by using up all the space in /run, so the configuration above makes /run/user a separate tmpfs instance to mitigate that somewhat. Users might still DOS each other, but not the entire container. Note: this will only work as of systemd version 215-15, which is not in Jessie yet, but will likely be part of the Jessie stable release. In older versions, logind will simply complain that it can't mount and has no fallback. (Logins will still work, but XDG_RUNTIME_DIR will not be properly set up.)
- PrivateTmp= and PrivateDevices= use mount namespaces, which can't be set up without CAP_SYS_ADMIN.
- PrivateNetwork= uses network namespaces, which can't be set up without CAP_SYS_ADMIN.
- ProtectSystem=, ProtectHome=, ReadOnlyDirectories=, ReadWriteDirectories= and InaccessibleDirectories= use mount namespaces, which can't be set up without CAP_SYS_ADMIN.
- SystemCallFilter= and RestrictAddressFamilies= will only work if NoNewPrivileges=yes is also set. (Seccomp requires CAP_SYS_ADMIN otherwise.)
- IOSchedulingClass=realtime won't work; it requires CAP_SYS_ADMIN.
- Possibly other settings.
In each case, one should carefully evaluate the trade-off between not having CAP_SYS_ADMIN, i.e. securing the host against the container, and securing the container against individual services. Since these systemd configuration options can also be considered part of a defense-in-depth strategy, this has to be decided on a case-by-case basis.
Dealing with services that use these features
Services that use the features mentioned above will fail to start in containers without CAP_SYS_ADMIN. The easiest way to disable these features is to use drop-ins that reset the options. For example, if a service x.service sets PrivateTmp=yes, one can create a drop-in /etc/systemd/system/x.service.d/container_wo_CAP_SYS_ADMIN.conf with the following contents:
[Service]
# Disable PrivateTmp= because we are run in a container w/o CAP_SYS_ADMIN,
# where this setting doesn't work.
PrivateTmp=no
Settings that take lists can be reset by setting them to empty, e.g.
[Service]
# Disable ReadOnlyDirectories= because we are run in a container w/o CAP_SYS_ADMIN,
# where this setting doesn't work.
ReadOnlyDirectories=
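Creating such a drop-in from the host can be scripted. The sketch below writes into a scratch directory so it is safe to run as-is; substitute the container's rootfs (/srv/lxc/lxc1 in this article) for the scratch directory and the real service name for the hypothetical x.service:

```shell
# Create a drop-in that neutralizes PrivateTmp= for a hypothetical
# x.service. rootfs points at a scratch directory here; in practice,
# use the container's real rootfs (or "/" when run inside it).
rootfs=$(mktemp -d)
dropindir="$rootfs/etc/systemd/system/x.service.d"
mkdir -p "$dropindir"
cat > "$dropindir/container_wo_CAP_SYS_ADMIN.conf" <<'EOF'
[Service]
# PrivateTmp= needs mount namespaces and thus CAP_SYS_ADMIN.
PrivateTmp=no
EOF
```

Inside a running container, a systemctl daemon-reload afterwards makes systemd pick up the drop-in.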
Summary
It is possible to have containers without CAP_SYS_ADMIN running Jessie and systemd, but there are some trade-offs to make.
Things in this article can always be improved, so please send me feedback.