Proxmox LXC, Systemd, and Linux Capabilities

Debian in a Proxmox LXC container works almost flawlessly: only a few systemd utility units fail to start. Instead of masking those units, we can use Linux capabilities to get the same result.

Linux capabilities?

In classic UNIX systems, there are two categories of processes: privileged and unprivileged. Privileged processes, also known as superuser or root, have an effective user ID of 0 and bypass all kernel permission checks. On the other hand, unprivileged processes have a nonzero effective UID, and they are subject to full permission checking based on their credentials, including the effective UID, effective GID, and supplementary group list.

Starting with kernel 2.2, however, Linux divides the privileges traditionally associated with the superuser into distinct units called capabilities, which can be independently enabled and disabled.

With capabilities, developers can assign specific permissions to individual threads/processes rather than granting all privileges to the entire application. This separation allows for more fine-grained access control and helps to prevent potential security breaches that could result from the over-assignment of permissions.

There are three ways a process can get capabilities: it can inherit them from its parent process; they can be assigned to the thread/process directly; or they can be set on an executable on disk, in which case the program has those capabilities when executed.

An example of a standard Linux utility that uses capabilities is ping:

$ sudo getcap `which ping`
/usr/bin/ping cap_net_raw=ep

Why is that? Because ping uses a raw socket to send ICMP echo request packets. On Linux, raw sockets can be opened only by privileged processes or by processes with the CAP_NET_RAW capability. The ep suffix means the capability is placed in two of the file's capability sets: Effective (e) and Permitted (p).
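To make the flag letters concrete, here is a minimal sketch that spells out one clause of getcap's output. The function name describe_cap_flags is hypothetical, and it only handles a single clause with one '=' or '+' operator, not the full syntax accepted by cap_from_text(3).

```shell
#!/bin/sh
# describe_cap_flags SPEC  (hypothetical helper, for illustration only)
# SPEC is one clause of getcap's text output, e.g. "cap_net_raw=ep".
describe_cap_flags() {
    spec=$1
    name=${spec%%[=+]*}    # text before the first '=' or '+'
    flags=${spec##*[=+]}   # text after the last '=' or '+'
    sets=""
    # Each letter names one capability set.
    case $flags in *e*) sets="$sets effective" ;; esac
    case $flags in *p*) sets="$sets permitted" ;; esac
    case $flags in *i*) sets="$sets inheritable" ;; esac
    printf '%s:%s\n' "$name" "$sets"
}

describe_cap_flags "cap_net_raw=ep"   # prints "cap_net_raw: effective permitted"
```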

The list of capabilities, along with other information, is in the man page:

man 7 capabilities

What is wrong with LXC/Proxmox and Debian?

If you create an unprivileged Debian 11-based LXC container in Proxmox, you will find that some units fail to start:

$ sudo systemctl is-system-running
$ sudo systemctl | grep failed
* sys-kernel-config.mount              loaded failed failed    Kernel Configuration File System
* systemd-journald-audit.socket        loaded failed failed    Journal Audit Socket

One solution is to mask these units to make systemd happy. However, the units are (correctly) configured with ConditionCapability, so systemd does not start them at all when the required capability is unavailable in the whole container. To list the capabilities available in the container, run capsh --print.

$ sudo systemctl cat sys-kernel-config.mount | grep ^ConditionCapability
$ sudo systemctl cat systemd-journald-audit.socket | grep ^ConditionCapability
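If capsh is not installed, the same check can be sketched by hand: the kernel exposes each capability set as a hexadecimal bitmask in /proc/<pid>/status (the CapBnd line is the bounding set), and each capability is one bit. The helper name has_cap is hypothetical; the bit numbers 13, 17, and 37 for CAP_NET_RAW, CAP_SYS_RAWIO, and CAP_AUDIT_READ come from linux/capability.h. This assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# has_cap MASK_HEX BIT  (hypothetical helper, for illustration only)
# Succeeds (exit status 0) if bit number BIT is set in the hex mask.
has_cap() {
    mask=$((0x$1))
    [ $(( (mask >> $2) & 1 )) -eq 1 ]
}

# 1ffffffffff has bits 0..40 set; 1dffffdffff is the same mask
# with bit 17 (CAP_SYS_RAWIO) and bit 37 (CAP_AUDIT_READ) cleared.
if has_cap 1ffffffffff 17; then echo "full mask: sys_rawio present"; fi
if ! has_cap 1dffffdffff 17; then echo "reduced mask: sys_rawio dropped"; fi

# Possible use against the running system (not executed here):
#   has_cap "$(awk '/^CapBnd:/ {print $2}' /proc/1/status)" 37 && echo "audit_read present"
```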

Capability names are usually written either in uppercase with the CAP_ prefix or in lowercase without it. The capabilities required by the two units are CAP_SYS_RAWIO (sys-kernel-config.mount) and CAP_AUDIT_READ (systemd-journald-audit.socket).

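Since these capabilities are usually unused in the container, why don’t we drop them? “Dropping a capability” means it will no longer be available in the container, which makes sense when nothing in the container uses it. Once the capabilities are gone, the ConditionCapability checks are not met and systemd skips the two units instead of reporting them as failed.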

To make Proxmox drop the capabilities when the container starts, add this line to the configuration file of your unprivileged container (/etc/pve/lxc/<CTID>.conf on the Proxmox host, where <CTID> is the container ID):

lxc.cap.drop: "sys_rawio audit_read"

The value of that key is a (space-separated) list of capabilities, in lower case, without the CAP_ prefix.
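Since systemd prints the names in the CAP_ form, a small conversion helper can save some typing. This is only a sketch; the function name to_lxc_caps is hypothetical and not part of Proxmox or LXC.

```shell
#!/bin/sh
# to_lxc_caps NAME...  (hypothetical helper, for illustration only)
# Turns names such as CAP_SYS_RAWIO into the lowercase, prefix-less,
# space-separated form used as the value of lxc.cap.drop.
to_lxc_caps() {
    printf '%s\n' "$*" \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/^cap_//; s/ cap_/ /g'
}

to_lxc_caps CAP_SYS_RAWIO CAP_AUDIT_READ   # prints "sys_rawio audit_read"
```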

You can drop other capabilities too. The capabilities(7) man page describes all available capabilities; if you are securing a container, it is worth reading.