Linux capabilities explained, and capabilities in containers

ImSEten

Article link: https://blog.csdn.net/weixin_42152531/article/details/120543324

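This article takes a close look at the Linux capability mechanism. It describes what each capability set (Permitted, Effective, Inheritable, and so on) does, and uses examples to show how switching users inside a process and setting file capabilities affect those sets. It then turns to Docker: how capabilities change once userns-remap is enabled, and how cap-drop and cap-add restrict a container's capabilities. Finally, it discusses how the no-new-privileges option limits privilege escalation inside a container.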

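1. Capability overview

Many articles already cover this part, so it is not explained at length here; search for more background if you need it.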

capabilities(7), Linux manual page (the authoritative reference)
Linux Capabilities tutorial: concepts (by 米开朗基杨)
Linux Capabilities tutorial: basic hands-on (by 米开朗基杨)
Linux Capabilities tutorial: advanced hands-on (by 米开朗基杨)
Linux capability explained (by 弥敦道人, CSDN)

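Before Linux kernel 2.2, permission checks divided processes into two classes: privileged processes (euid=0) and unprivileged processes. A privileged process (typically a program with the suid bit set) gets full root privileges over the system.

Linux kernel 2.2 introduced the capabilities mechanism, which splits root's privileges into finer-grained units. If a process is not privileged and its effective ID is not root, the kernel checks the process's capabilities to decide whether it may perform a privileged operation.

Run man capabilities to see the individual capabilities.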

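Linux has five capability sets.

Permitted: the upper limit on the capabilities the process may put into its effective set (and add to its inheritable set). Abbreviated as uppercase P below.
Effective: the capabilities currently in effect (the set the kernel actually checks). Abbreviated as uppercase E below.
Inheritable: the capabilities that can be inherited across exec. Abbreviated as uppercase I below.
Bounding: the bounding set, which limits what exec can grant. Abbreviated as uppercase B below.
Ambient: the ambient set. Abbreviated as uppercase A below.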

1.1 Viewing the current user's capabilities

Look at the Cap fields in /proc/$$/status.

Regular user

ubuntu@ubuntu-standard-pc:~$ cat /proc/$$/status | grep Cap
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000

root user

root@ubuntu-standard-pc:~# cat /proc/$$/status | grep Cap
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000

CapInh corresponds to I above
CapPrm corresponds to P above
CapEff corresponds to E above
CapBnd corresponds to B above
CapAmb corresponds to A above

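1.2 Process capabilities

Below, pP, pI, pE, pB, and pA denote a process's P, I, E, B, and A sets.

First create a process: sleep, for 100 seconds, running in the background (the trailing & means run in the background).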

ubuntu@ubuntu-standard-pc:~$ sleep 100 &
[1] 1968

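The process's pid is 1968. Check its status (in /proc/<pid>/status) and pick out the capability fields.
/proc/<pid>/status records the state of the process with that pid, including its capabilities.

If you do not know the pid, list all processes with ps -ef and filter the output with grep.
In this example: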

ubuntu@ubuntu-standard-pc:~$ ps -ef | head -1; ps -ef | grep sleep
UID        PID  PPID  C STIME TTY          TIME CMD
root      1595  1638  0 10:35 ?        00:00:00 sleep 60
1030775+  1968  1896  0 10:35 pts/1    00:00:00 sleep 100
1030775+  2065 59302  0 10:35 ?        00:00:00 sleep 5
1030775+  2175  1896  0 10:35 pts/1    00:00:00 grep --color=auto sleep

head -1 prints the header row, that is, the UID PID PPID C STIME TTY TIME CMD line.

ubuntu@ubuntu-standard-pc:~$ cat /proc/1968/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

Only B is populated for this process; every other set is empty. This matches the user's own capabilities. The reason is explained below (it is not a plain copy).
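The grep above can also be done from a program. The sketch below parses the Cap fields out of the text of a status file; parseCapLines is a name invented here, and reading /proc/self/status in main is just one convenient input (any /proc/<pid>/status content works).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCapLines extracts the CapInh/CapPrm/CapEff/CapBnd/CapAmb masks from
// the text of a /proc/<pid>/status file.
func parseCapLines(status string) (map[string]uint64, error) {
	caps := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) != 2 || !strings.HasPrefix(parts[0], "Cap") {
			continue
		}
		mask, err := strconv.ParseUint(strings.TrimSpace(parts[1]), 16, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %s: %w", parts[0], err)
		}
		caps[parts[0]] = mask
	}
	return caps, sc.Err()
}

func main() {
	// Inspect the calling process itself; another pid's file works the same way.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	caps, err := parseCapLines(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, k := range []string{"CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"} {
		fmt.Printf("%s: %016x\n", k, caps[k])
	}
}
```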

Now look at the root user.

ubuntu@ubuntu-standard-pc:~$ sudo -i
root@ubuntu-standard-pc:~# cat /proc/$$/status | grep Cap
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000

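For root, only I and A are empty; the other sets are all populated. This matches root's own capabilities. Again, the reason is explained below (and again it is not a plain copy).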

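1.3 Switching users inside a process (calling setuid and setgid)

When a process switches user while it is running (its code calls the setuid and setgid system calls), its capabilities change accordingly.
Kernel source browsing (worth bookmarking)

The kernel handles this with the following code:

Location in the kernel tree: /security/commoncap.c, line 1087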

static inline void cap_emulate_setxuid(struct cred *new, const struct cred *old)
{
	kuid_t root_uid = make_kuid(old->user_ns, 0);

	if ((uid_eq(old->uid, root_uid) ||
	     uid_eq(old->euid, root_uid) ||
	     uid_eq(old->suid, root_uid)) 				// these three checks: the process's old user was root
	     &&
	    (!uid_eq(new->uid, root_uid) &&
	     !uid_eq(new->euid, root_uid) &&
	     !uid_eq(new->suid, root_uid))) 			// these three checks: the process's new user is not root
	     {
		if (!issecure(SECURE_KEEP_CAPS)) {			// if the KEEP_CAPS flag is not set, clear the P and E sets
			cap_clear(new->cap_permitted);
			cap_clear(new->cap_effective);
		}

		/*
		 * Pre-ambient programs expect setresuid to nonroot followed
		 * by exec to drop capabilities.  We should make sure that
		 * this remains the case.
		 */
		cap_clear(new->cap_ambient);				// A is cleared whether or not KEEP_CAPS is set
	}
	if (uid_eq(old->euid, root_uid) && !uid_eq(new->euid, root_uid))
		cap_clear(new->cap_effective);				// euid was root and is now non-root: clear E
	if (!uid_eq(old->euid, root_uid) && uid_eq(new->euid, root_uid))
		new->cap_effective = new->cap_permitted;	// euid was non-root and is now root: E = P
}

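In summary, the kernel code above does the following:

If the process was root and switches to a non-root user without the KEEP_CAPS flag set, the E and P sets are cleared.
If the KEEP_CAPS flag is set, the P set is kept.

In short, whenever a process switches from root to a regular user, E is cleared; whether P is cleared depends on the KEEP_CAPS flag.

1.3.1 Testing the kernel behavior

The examples in this article are written in Go.

Source file: setid.go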

package main

import (
	"fmt"
	"syscall"
	"time"
)

//SetKeepCaps sets the keep-capabilities flag (PR_SET_KEEPCAPS = 1)
func SetKeepCaps() error {
	if _, _, err := syscall.RawSyscall(syscall.SYS_PRCTL, syscall.PR_SET_KEEPCAPS, 1, 0); err != 0 {
		return err
	}

	return nil
}

//ClearKeepCaps clears the keep-capabilities flag (PR_SET_KEEPCAPS = 0)
func ClearKeepCaps() error {
	if _, _, err := syscall.RawSyscall(syscall.SYS_PRCTL, syscall.PR_SET_KEEPCAPS, 0, 0); err != 0 {
		return err
	}

	return nil
}

func main() {

	fmt.Println("Hello world!")
	fmt.Println("before set, the uid is ", syscall.Getuid())
	fmt.Println("before set, the gid is ", syscall.Getgid())
	fmt.Println("before set, the effective uid is ", syscall.Geteuid())

	fmt.Println("|***********************************|")

	if err := SetKeepCaps(); err != nil {
		fmt.Println(err)
		return
	} else {
		fmt.Println("*     secessfully set keep caps     *")
	}
	fmt.Println("|***********************************|")

	syscall.Setgid(1000)
	syscall.Setuid(1000)
	//syscall.Setgid(0)
	//syscall.Setuid(0)
	fmt.Println("after set, the uid is ", syscall.Getuid())
	fmt.Println("after set, the gid is ", syscall.Getgid())
	fmt.Println("after set, the effective uid is ", syscall.Geteuid())

	// if err := ClearKeepCaps(); err != nil {
	// 	return
	// }
	// fmt.Println("after Clear, the uid is ", syscall.Getuid())
	// fmt.Println("after Clear, the gid is ", syscall.Getgid())
	time.Sleep(100 * time.Second)
}

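What the code does:

First, it sets the KEEP_CAPS flag.
It then calls the setgid and setuid system calls to switch the process from root to a regular user.
Finally it sleeps for 100 s, during which you can find the process with ps and inspect its capabilities.

Usage:

# bash command
ubuntu@ubuntu-standard-pc:~/codes/go/capability$ go build setid.go

go build produces an executable named setid, with no extension.

Then run setid as root; the setid executable switches from root to uid 1000.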

ubuntu@ubuntu-standard-pc:~/codes/go/capability$ sudo ./setid
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000

The program runs correctly. Press Ctrl+C to quit, then run it again in the background.

ubuntu@ubuntu-standard-pc:~/codes/go/capability$ sudo ./setid &
[1] 3778
ubuntu@ubuntu-standard-pc:~/codes/go/capability$ Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000

After it starts, press Enter once more to get the prompt back.

The program's pid is 3778; look up its capabilities (Cap) in /proc/3778/status.

ubuntu@ubuntu-standard-pc:~/codes/go/capability$ cat /proc/3778/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000

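After switching from root to a regular user, the program has capabilities only in P and B; the kernel cleared E, as the kernel code says.

These are the process's capabilities: what remains is pP and pB.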

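User switch	KEEP_CAPS set	Sets before the switch	Sets after the switch
root->root	yes/no	E, I, P, B, A	E, I, P, B, A (nothing cleared)
root->regular	yes	E, I, P, B, A	I, P, B (E and A cleared)
root->regular	no	E, I, P, B, A	I, B (E, P, and A cleared)
regular->root	yes/no	E, I, P, B, A	E, I, P, B, A (E=P)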
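The rows of this table can be checked against a small user-space model of cap_emulate_setxuid. This is only a sketch for reasoning about the branches: the creds type and the emulateSetxuid function are hypothetical, masks are plain uint64 values, and the caller is assumed to pass next as a copy of old with the new UIDs filled in.

```go
package main

import "fmt"

// creds is a simplified stand-in for the kernel's struct cred.
type creds struct {
	UID, EUID, SUID               uint32
	Permitted, Effective, Ambient uint64
}

// emulateSetxuid mirrors the branches of cap_emulate_setxuid.
// keepCaps models the SECURE_KEEP_CAPS securebit.
func emulateSetxuid(old, next creds, keepCaps bool) creds {
	wasRoot := old.UID == 0 || old.EUID == 0 || old.SUID == 0
	noRootLeft := next.UID != 0 && next.EUID != 0 && next.SUID != 0
	if wasRoot && noRootLeft {
		if !keepCaps {
			next.Permitted = 0
			next.Effective = 0
		}
		next.Ambient = 0 // cleared regardless of keepCaps
	}
	if old.EUID == 0 && next.EUID != 0 {
		next.Effective = 0 // euid left root: E is always cleared
	}
	if old.EUID != 0 && next.EUID == 0 {
		next.Effective = next.Permitted // euid became root: E = P
	}
	return next
}

func main() {
	full := uint64(0x1fffffffff)
	old := creds{Permitted: full, Effective: full} // uid, euid, suid all 0
	next := old
	next.UID, next.EUID, next.SUID = 1000, 1000, 1000

	for _, keep := range []bool{true, false} {
		got := emulateSetxuid(old, next, keep)
		fmt.Printf("keepCaps=%v P=%016x E=%016x\n", keep, got.Permitted, got.Effective)
	}
}
```

With keepCaps set, the model keeps P and clears E, which is the same shape as the CapPrm and CapEff values seen for the setid process.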

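1.4 File capabilities

Files have only the E, I, and P sets; they have no A or B set!

1.4.1 Viewing a file's capabilities

Below, fI, fP, and fE denote a file's I, P, and E sets.

Every file can carry capabilities too. They determine which sensitive operations a user may perform when executing the file, so normally it is the capabilities of executables that matter.

For example, our shell is an executable located at /bin/bash. We can look at its capabilities.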

ubuntu@ubuntu-standard-pc:~$ getcap /bin/bash
ubuntu@ubuntu-standard-pc:~$

The file's capabilities are empty.

Now the setid executable from earlier:

ubuntu@ubuntu-standard-pc:~/codes/go/capability$ getcap setid
ubuntu@ubuntu-standard-pc:~/codes/go/capability$

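The setid executable's fI, fE, and fP are empty as well.

Important: the file capabilities that getcap shows are the ones that apply to regular (non-root) users!

For the root user, the kernel treats the file capability sets as fully enabled: fE, fI, and fP all count as 1. The 1 here is the value used in the calculations later in this article; which capabilities root actually ends up with still depends on the bounding set B (cat /proc/$$/status | grep CapBnd).

The official explanation for the root case: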

   1. If the real or effective user ID of the process is 0 (root),
      then the file inheritable and permitted sets are ignored;
      instead they are notionally considered to be all ones (i.e.,
      all capabilities enabled).  (There is one exception to this
      behavior, described below in Set-user-ID-root programs that
      have file capabilities.)

   2. If the effective user ID of the process is 0 (root) or the
      file effective bit is in fact enabled, then the file effective
      bit is notionally defined to be one (enabled).

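1.4.2 Granting capabilities to a file

Take the setid executable from above. Its file capabilities (for regular users) are empty, so let us grant it some.

Use the setcap command to grant capabilities.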

root@ubuntu-standard-pc:/home/ubuntu/codes/go/capability# setcap CAP_SYS_ADMIN+eip setid
root@ubuntu-standard-pc:/home/ubuntu/codes/go/capability# getcap setid
setid = cap_sys_admin+eip

The +eip in the command (=eip also works) adds cap_sys_admin to the fE set, to the fI set, and to the fP set.

The grant succeeded: the setid executable now has cap_sys_admin in its E, I, and P sets.

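1.5 Capabilities when a process creates a child process

Capabilities can change when a process creates a child process and runs a new program in it.

A fork() call does not change capabilities: the child inherits the parent's capabilities in full.

An exec() call does change them, according to the following formulas:

If the new program runs as root, the rules are: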

       p'P = pI | pB

       p'E = p'P
       p'I = pI
       p'B = pB

If the new program runs as a regular user, the rules are:

       p'A = (file is privileged) ? 0 : pA

       p'P = (pI & fI) | (fP & pB) | p'A
       
       p'E = fE ? p'P : p'A

       p'I = pI

       p'B = pB
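The two sets of formulas translate almost line for line into bit operations. The sketch below is an illustration under assumptions, not kernel code: procCaps, fileCaps, and the function names are invented here, the SetID field stands for the set-user-ID or set-group-ID mode bit, and "file is privileged" is approximated as "has any file capability or SetID".

```go
package main

import "fmt"

// procCaps holds a process's five capability sets as bit masks.
type procCaps struct {
	P, E, I, B, A uint64
}

// fileCaps holds a file's capability sets.
type fileCaps struct {
	P, I  uint64
	E     bool // the file effective "set" is a single bit
	SetID bool // assumption of this sketch: set-user-ID or set-group-ID bit
}

// privileged is a simplified reading of "file is privileged".
func (f fileCaps) privileged() bool {
	return f.P != 0 || f.I != 0 || f.E || f.SetID
}

// execAsRegularUser applies the regular-user formulas.
func execAsRegularUser(p procCaps, f fileCaps) procCaps {
	var n procCaps
	if !f.privileged() {
		n.A = p.A
	}
	n.P = (p.I & f.I) | (f.P & p.B) | n.A
	if f.E {
		n.E = n.P
	} else {
		n.E = n.A
	}
	n.I = p.I
	n.B = p.B
	return n
}

// execAsRoot applies the root formulas. They do not mention A,
// so this sketch leaves it at zero.
func execAsRoot(p procCaps) procCaps {
	newP := p.I | p.B
	return procCaps{P: newP, E: newP, I: p.I, B: p.B}
}

func main() {
	shell := procCaps{B: 0x1ffffffffff} // only B populated, like a regular user's shell
	fmt.Printf("regular user, plain file: %+v\n", execAsRegularUser(shell, fileCaps{}))
	fmt.Printf("root:                     %+v\n", execAsRoot(shell))
}
```

This also accounts for the earlier observations: a regular user's sleep ends up with only B, while a root shell ends up with P = E = B.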

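Capabilities in Docker

runc starts a Docker container as follows:

The runc init process is started as root.
pB is set; at this point pB is already Docker's default capability set, while pE, pP, and pI are still the original capabilities and pA is empty by default.
The KEEP_CAPS flag is set so that pP is preserved.
setuid and setgid are called. The process goes from root to a regular user and drops capabilities: only pB, pI, and pP remain, and pE is empty.
Now running as a regular user, the process sets pB, pI, pP, and pE again, so all of these sets are populated.
The regular user calls exec(), which drops capabilities according to the regular-user rules.

The relevant code: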

func finalizeNamespace(config *initConfig) error {
	// Ensure that all unwanted fds we may have accidentally
	// inherited are marked close-on-exec so they stay out of the
	// container
	if err := utils.CloseExecFrom(config.PassedFilesCount + 3); err != nil {
		return err
	}

	capabilities := config.Config.Capabilities
	if config.Capabilities != nil {
		capabilities = config.Capabilities
	}
	w, err := newCapWhitelist(capabilities)
	if err != nil {
		return err
	}
	// drop capabilities in bounding set before changing user
	if err := w.dropBoundingSet(); err != nil {
		return err
	}
	// preserve existing capabilities while we change users
	if err := system.SetKeepCaps(); err != nil {
		return err
	}
	if err := setupUser(config); err != nil {
		return err
	}
	if err := system.ClearKeepCaps(); err != nil {
		return err
	}
	// drop all other capabilities
	if err := w.drop(); err != nil {
		return err
	}
	if config.Cwd != "" {
		if err := syscall.Chdir(config.Cwd); err != nil {
			return fmt.Errorf("chdir to cwd (%q) set in config.json failed: %v", config.Cwd, err)
		}
	}
	return nil
}

The code that sets the capabilities:

func (c *capsV3) Set(which CapType, caps ...Cap) {
	for _, what := range caps {
		var i uint
		if what > 31 {
			i = uint(what) >> 5
			what %= 32
		}

		if which&EFFECTIVE != 0 {
			c.data[i].effective |= 1 << uint(what)
		}
		if which&PERMITTED != 0 {
			c.data[i].permitted |= 1 << uint(what)
		}
		if which&INHERITABLE != 0 {
			c.data[i].inheritable |= 1 << uint(what)
		}
		if which&BOUNDING != 0 {
			c.bounds[i] |= 1 << uint(what)
		}
	}
}

runc's capability setup neither sets nor clears the A set, so A stays empty.

The top-level runc flow:

func (l *linuxSetnsInit) Init() error {
	if !l.config.Config.NoNewKeyring {
		// do not inherit the parent's session keyring
		if _, err := keys.JoinSessionKeyring(l.getSessionRingName()); err != nil {
			return err
		}
	}
	if l.config.NoNewPrivileges {
		if err := system.Prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
			return err
		}
	}
	if l.config.Config.Seccomp != nil {
		if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
			return err
		}
	}
	if err := finalizeNamespace(l.config); err != nil {
		return err
	}
	if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
		return err
	}
	if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil {
		return err
	}
	// close the statedir fd before exec because the kernel resets dumpable in the wrong order
	// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
	syscall.Close(l.stateDirFD)
	return system.Execv(l.config.Args[0], l.config.Args[0:], os.Environ())
}

After execv is called, capabilities are dropped.
The calculation:

       p'A = (file is privileged) ? 0 : pA

       p'P = (pI & fI) | (fP & pB) | p'A
       
       p'E = fE ? p'P : p'A

       p'I = pI

       p'B = pB

Because all the file (f) capabilities are 0 and pA is also 0, p'A = 0.
p'A = 0
p'P = 0
p'E = 0
p'I = pI
p'B = pB

2. Docker with userns-remap enabled

2.1 Root user inside the container

First create the user on the host:

groupadd -g 10000 dockeruser
useradd -u 10000 -g dockeruser -d /home/dockeruser -m dockeruser

Enable userns-remap:

#vim /etc/docker/daemon.json
{
	...
	"userns-remap":"dockeruser",
	...
}

systemctl stop docker
systemctl daemon-reload
systemctl start docker

After userns-remap is enabled:

The Dockerfile:

FROM centos
ADD setid .
RUN setcap cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip setid

docker build

root@ubuntu-standard-pc:~# docker build -t centos:host-root-origin .

docker run

root@ubuntu-standard-pc:~# docker run -it --name centos-host-root-origin centos:host-root-origin /bin/bash

2.1.1 Capabilities inside the container

[root@e72c2e81500e /]# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=

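The capabilities are the same as without userns-remap, and the container's /root directory is accessible.

2.1.2 Capabilities on the host

Find the container process's pid on the host: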

ubuntu@ubuntu-standard-pc:~$ ps -ef | grep e72c2e81500e
root        4253       1  0 23:12 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id e72c2e81500ec485c5664216ead98eaf5e7b7fd71b4521d4748ea6e87dbac2a3 -address /run/containerd/containerd.sock
ubuntu      4342    4334  0 23:13 pts/2    00:00:00 grep --color=auto e72c2e81500e
ubuntu@ubuntu-standard-pc:~$ ps -ef | grep 4253
root        4253       1  0 23:12 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id e72c2e81500ec485c5664216ead98eaf5e7b7fd71b4521d4748ea6e87dbac2a3 -address /run/containerd/containerd.sock
165536      4274    4253  0 23:12 pts/0    00:00:00 /bin/bash
ubuntu      4344    4334  0 23:13 pts/2    00:00:00 grep --color=auto 4253

Check the container process's capabilities on the host:

ubuntu@ubuntu-standard-pc:~$ cat /proc/4274/status | grep Cap
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
ubuntu@ubuntu-standard-pc:~$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

On the host, the capabilities match what is seen inside the container.

2.2 Regular user inside the container

Dockerfile

FROM centos
ADD setid .
ADD helloworld .
ADD setid-chmod .
ADD setrootid .
RUN chmod +s setid-chmod
RUN chmod +s setrootid
RUN groupadd -g 20000 dockercentos
RUN useradd -u 20000 -g dockercentos -d /home/dockercentos -m dockercentos
USER dockercentos

docker build

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-usr# docker build -t centos:host-usr-chmod .
Sending build context to Docker daemon  7.139MB
Step 1/10 : FROM centos
 ---> 5d0da3dc9764
Step 2/10 : ADD setid .
 ---> Using cache
 ---> 01aae451ee4a
Step 3/10 : ADD helloworld .
 ---> Using cache
 ---> 91b5dc55ce84
Step 4/10 : ADD setid-chmod .
 ---> Using cache
 ---> a2b67950e1b2
Step 5/10 : ADD setrootid .
 ---> Using cache
 ---> af10aae2cff3
Step 6/10 : RUN chmod +s setid-chmod
 ---> Using cache
 ---> 62b534a30c89
Step 7/10 : RUN chmod +s setrootid
 ---> Using cache
 ---> 9afa21fd8b32
Step 8/10 : RUN groupadd -g 20000 dockercentos
 ---> Running in d233ce98d0f0
Removing intermediate container d233ce98d0f0
 ---> a29952a344b4
Step 9/10 : RUN useradd -u 20000 -g dockercentos -d /home/dockercentos -m dockercentos
 ---> Running in 9bee1a5b9ad6
Removing intermediate container 9bee1a5b9ad6
 ---> 3ccf8e7c7b7d
Step 10/10 : USER dockercentos
 ---> Running in 05feed2c8819
Removing intermediate container 05feed2c8819
 ---> 21b170f459fb
Successfully built 21b170f459fb
Successfully tagged centos:host-usr-chmod

docker run

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-usr# docker run -it --name centos-df-usr centos:host-usr-chmod /bin/bash

2.2.1 Capabilities inside the container

[dockercentos@b4b1eccfcd6c /]$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=20000(dockercentos)
gid=20000(dockercentos)
groups=

2.2.2 Capabilities on the host

ubuntu@ubuntu-standard-pc:~$ ps -ef | grep b4b1eccfcd6c
root       12587       1  0 00:57 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id b4b1eccfcd6ce1daadbe5fb6059cb9a6fea631f81eaaf0b3ae97ba839e41f64b -address /run/containerd/containerd.sock
ubuntu     12656    8418  0 00:58 pts/2    00:00:00 grep --color=auto b4b1eccfcd6c
ubuntu@ubuntu-standard-pc:~$ ps -ef | grep 12587
root       12587       1  0 00:57 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id b4b1eccfcd6ce1daadbe5fb6059cb9a6fea631f81eaaf0b3ae97ba839e41f64b -address /run/containerd/containerd.sock
185536     12609   12587  0 00:57 pts/0    00:00:00 /bin/bash
ubuntu     12658    8418  0 00:58 pts/2    00:00:00 grep --color=auto 12587

Check the capabilities on the host:

ubuntu@ubuntu-standard-pc:~$ cat /proc/12609/status | grep Cap
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000

When the container runs as a regular user, the process has capabilities only in I and B on the host.
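The UID columns in the two host-side ps listings (165536 for the container's root, 185536 for container UID 20000) are consistent with a single contiguous user-namespace mapping. The sketch below shows that arithmetic. The base of 165536 is read off the ps output; the range size of 65536 and the idMap type are assumptions of this example, not values confirmed by the article.

```go
package main

import (
	"errors"
	"fmt"
)

// idMap describes one contiguous user-namespace ID mapping.
type idMap struct {
	ContainerStart uint32
	HostStart      uint32
	Size           uint32
}

// toHost translates an ID inside the container to the ID seen on the host.
func (m idMap) toHost(containerID uint32) (uint32, error) {
	if containerID < m.ContainerStart || containerID-m.ContainerStart >= m.Size {
		return 0, errors.New("id outside the mapped range")
	}
	return m.HostStart + (containerID - m.ContainerStart), nil
}

func main() {
	// HostStart comes from the ps output; Size is an assumed range length.
	m := idMap{ContainerStart: 0, HostStart: 165536, Size: 65536}
	for _, id := range []uint32{0, 20000} {
		host, err := m.toHost(id)
		if err != nil {
			fmt.Println(id, "->", err)
			continue
		}
		fmt.Println("container uid", id, "-> host uid", host)
	}
}
```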

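2.3 chmod in containers

If a file inside the container is granted a capability in the Dockerfile but the container itself does not have that capability, the file cannot be executed.
For example: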

Dockerfile

FROM centos
ADD setid .
ADD helloworld .
RUN setcap cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip setid
RUN setcap cap_sys_admin+eip helloworld

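The helloworld file has cap_sys_admin, which is not among the container's default capabilities.
The capabilities set on the setid file equal the container's default capability set.

Build the Docker image: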

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root-cap# docker build -t centos:host-root-captest .
Sending build context to Docker daemon  3.562MB
Step 1/5 : FROM centos
 ---> 5d0da3dc9764
Step 2/5 : ADD setid .
 ---> Using cache
 ---> fff6dba319f3
Step 3/5 : ADD helloworld .
 ---> Using cache
 ---> e22e26214e9d
Step 4/5 : RUN setcap cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip setid
 ---> Running in 2094b4999f5f
Removing intermediate container 2094b4999f5f
 ---> 44ce5251d8b7
Step 5/5 : RUN setcap cap_sys_admin+eip helloworld
 ---> Running in 871c82fa3e98
Removing intermediate container 871c82fa3e98
 ---> d2bfa0175e88
Successfully built d2bfa0175e88
Successfully tagged centos:host-root-captest

Run the image:

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root-cap# docker run -it --rm centos:host-root-captest /bin/bash

[root@339688eb874d /]# ls
bin  dev  etc  helloworld  home  lib  lib64  lost+found  media	mnt  opt  proc	root  run  sbin  setid	srv  sys  tmp  usr  var
[root@339688eb874d /]# ./helloworld 
bash: ./helloworld: Operation not permitted
[root@339688eb874d /]# ./setid
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000

helloworld is not permitted to run, while setid is.
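The "Operation not permitted" result matches the safety check described in capabilities(7): when a file has its effective bit set and the process would not obtain the file's full permitted set, execve fails with EPERM. The following sketch models only that check; execAllowed is a hypothetical helper, and 0xa80425fb is the container mask shown in section 2.1.2.

```go
package main

import "fmt"

const (
	capSysAdmin       = uint64(1) << 21
	dockerDefaultCaps = uint64(0xa80425fb) // container mask from section 2.1.2
)

// execAllowed reports whether execve would pass the file-capability check.
// It fails when fE is set and some capability in fP is missing from the
// new permitted set computed from pI, fI, fP, and pB.
func execAllowed(pI, pB, fI, fP uint64, fE bool) bool {
	newP := (pI & fI) | (fP & pB)
	missing := fP &^ newP
	return !(fE && missing != 0)
}

func main() {
	// helloworld: cap_sys_admin+eip, which the container's sets do not contain.
	fmt.Println("helloworld:", execAllowed(dockerDefaultCaps, dockerDefaultCaps, capSysAdmin, capSysAdmin, true))
	// setid: exactly the container's default set, +eip.
	fmt.Println("setid:     ", execAllowed(dockerDefaultCaps, dockerDefaultCaps, dockerDefaultCaps, dockerDefaultCaps, true))
}
```

Running the container with --cap-add for the missing capability widens pB, which is what lets the check pass.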

2.3.1 Assigning capabilities with --cap-drop and --cap-add

2.3.1.1 Root user inside the container

The Dockerfile:

FROM centos
ADD setid .
ADD helloworld .
ADD setid-chmod .
ADD setrootid .
RUN chmod +s setid-chmod
RUN chmod +s setrootid

setrootid is built from setid.go with the setuid and setgid arguments changed to 0.

docker build

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root# docker build -t centos:host-chmod .
Sending build context to Docker daemon  7.139MB
Step 1/7 : FROM centos
 ---> 5d0da3dc9764
Step 2/7 : ADD setid .
 ---> 01aae451ee4a
Step 3/7 : ADD helloworld .
 ---> 91b5dc55ce84
Step 4/7 : ADD setid-chmod .
 ---> a2b67950e1b2
Step 5/7 : ADD setrootid .
 ---> af10aae2cff3
Step 6/7 : RUN chmod +s setid-chmod
 ---> Running in 5cd11d90e4ee
Removing intermediate container 5cd11d90e4ee
 ---> 62b534a30c89
Step 7/7 : RUN chmod +s setrootid
 ---> Running in 39a322c185dd
Removing intermediate container 39a322c185dd
 ---> 9afa21fd8b32
Successfully built 9afa21fd8b32
Successfully tagged centos:host-chmod

2.3.1.1.1 Without no-new-privileges

Run the container:

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root# docker run -it --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} --rm centos:host-chmod /bin/bash
[root@07b6e17cb6d7 /]# ls
bin  etc	 home  lib64	   media  opt	root  sbin   setid-chmod  srv  tmp  var
dev  helloworld  lib   lost+found  mnt	  proc	run   setid  setrootid	  sys  usr
[root@07b6e17cb6d7 /]# ./setid
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000
^C
[root@07b6e17cb6d7 /]# ./setid-chmod 
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000
^C
[root@07b6e17cb6d7 /]# ./setrootid 
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  0
after set, the gid is  0
after set, the effective uid is  0

setuid, setgid, and similar calls work. Inside the container, the effective uid before setuid is 0, the root user.

2.3.1.1.2 With no-new-privileges

docker run

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root# docker run -it --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} --rm --security-opt=no-new-privileges centos:host-chmod /bin/bash
[root@1c3a94e2c741 /]# ls
bin  etc	 home  lib64	   media  opt	root  sbin   setid-chmod  srv  tmp  var
dev  helloworld  lib   lost+found  mnt	  proc	run   setid  setrootid	  sys  usr
[root@1c3a94e2c741 /]# ./setid
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000
^C
[root@1c3a94e2c741 /]# ./setid-chmod 
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000
^C
[root@1c3a94e2c741 /]# ./setrootid 
Hello world!
before set, the uid is  0
before set, the gid is  0
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  0
after set, the gid is  0
after set, the effective uid is  0

The result is the same as without no-new-privileges; the euid is still 0 (root).

2.3.1.2 Regular user inside the container

2.3.1.2.1 Without no-new-privileges

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root# docker run -it --rm --user 10000:10000 --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} centos:host-chmod /bin/bash
bash-4.4$ ./setid
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  10000
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  10000
after set, the gid is  10000
after set, the effective uid is  10000
^C
bash-4.4$ ./setid-chmod 
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  1000
after set, the gid is  1000
after set, the effective uid is  1000
^C
bash-4.4$ ./setrootid 
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  0
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  0
after set, the gid is  0
after set, the effective uid is  0

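When both the uid and gid for entering the container are specified and neither is root, the container runs as a fully unprivileged user. The setid executable has not been escalated with chmod, so it lacks the setuid and setgid capabilities and cannot perform setuid or setgid.
The setid-chmod executable was escalated with chmod +s in the Dockerfile, so it runs with root privileges (euid 0) and can perform setuid and setgid. Inside the container it runs as root.
A file escalated with chmod +s can call setuid and setgid to switch to the root user, as setrootid does.

2.3.1.2.2 With no-new-privileges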

docker run

root@ubuntu-standard-pc:/home/ubuntu/docker/Dockerfiles/centos-root# docker run -it --user 10000:10000 --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} --rm --security-opt=no-new-privileges centos:host-chmod /bin/bash
bash-4.4$ ls
bin  etc	 home  lib64	   media  opt	root  sbin   setid-chmod  srv  tmp  var
dev  helloworld  lib   lost+found  mnt	  proc	run   setid  setrootid	  sys  usr
bash-4.4$ ./setid
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  10000
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  10000
after set, the gid is  10000
after set, the effective uid is  10000
^C
bash-4.4$ ./setid-chmod 
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  10000
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  10000
after set, the gid is  10000
after set, the effective uid is  10000
^C
bash-4.4$ ./setrootid 
Hello world!
before set, the uid is  10000
before set, the gid is  10000
before set, the effective uid is  10000
|***********************************|
*     secessfully set keep caps     *
|***********************************|
after set, the uid is  10000
after set, the gid is  10000
after set, the effective uid is  10000

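With no-new-privileges enabled, chmod-based escalation can no longer switch a regular user's process to root.
When both the uid and gid for entering the container are specified and neither is root, the container runs as a fully unprivileged user with no setuid or setgid capability, so setuid and setgid fail.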
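The difference between the two runs comes down to whether execve honors the set-user-ID bit. A deliberately small model of that decision is sketched below; euidAfterExec is a made-up helper, and it ignores other factors such as nosuid mounts.

```go
package main

import "fmt"

// euidAfterExec models the effective UID after execve of a file owned by
// fileOwner. With no_new_privs set, the set-user-ID bit is not honored.
func euidAfterExec(callerEUID, fileOwner uint32, setuidBit, noNewPrivs bool) uint32 {
	if setuidBit && !noNewPrivs {
		return fileOwner
	}
	return callerEUID
}

func main() {
	const caller, root = 10000, 0
	fmt.Println("setid,       no flag:           euid", euidAfterExec(caller, root, false, false))
	fmt.Println("setid-chmod, no flag:           euid", euidAfterExec(caller, root, true, false))
	fmt.Println("setid-chmod, no-new-privileges: euid", euidAfterExec(caller, root, true, true))
}
```

The three cases line up with the "before set, the effective uid is" lines in the transcripts: 10000, 0, and 10000.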

docker run -it --name centos-host-runroot-nonewprivileges --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} --security-opt=no-new-privileges centos:host-root-origin /bin/bash

[root@LIN-29076BB8489 centos-root]# docker run -it --name centos-host-root-nonewprivileges --cap-drop all --cap-add={cap_setgid,cap_setuid,cap_setfcap} --security-opt=no-new-privileges centos:host-root-origin /bin/bash
[root@3215a191e737 /]# capsh --print
Current: = cap_setgid,cap_setuid,cap_setfcap+eip
Bounding set =cap_setgid,cap_setuid,cap_sys_admin,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=

Check the container's capabilities on the host:

[root@LIN-29076BB8489 centos-org]# ps -ef | grep 3215a191e737
root      8080     1  0 17:11 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3215a191e73770dd26b393ba99acf183d0380d717da6e2c68e167f554c57a418 -address /run/containerd/containerd.sock
root      9644 58947  0 17:13 pts/1    00:00:00 grep --color=auto 3215a191e737
[root@LIN-29076BB8489 centos-org]# ps -ef | grep 8080
root      8080     1  0 17:11 ?        00:00:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3215a191e73770dd26b393ba99acf183d0380d717da6e2c68e167f554c57a418 -address /run/containerd/containerd.sock
100000    8100  8080  0 17:11 pts/0    00:00:00 /bin/bash
root      9697 58947  0 17:13 pts/1    00:00:00 grep --color=auto 8080

8100 is the pid of the container's main process, /bin/bash.

[root@LIN-29076BB8489 centos-org]# cat /proc/8100/status | grep Cap
CapInh: 00000000800000c0
CapPrm: 00000000800000c0
CapEff: 00000000800000c0
CapBnd: 00000000800000c0
CapAmb: 0000000000000000
[root@LIN-29076BB8489 centos-org]# capsh --decode=00000000800000c0
0x00000000800000c0=cap_setgid,cap_setuid,cap_setfcap

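On the host, the container's main process has only the limited capabilities it was granted.

3. Summary

3.1 Capability changes during Docker container startup

If the container runs as a regular user, its capabilities are:
p'A = 0
p'P = 0
p'E = 0
p'I = pI
p'B = pB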

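3.2 Restricting a container's capabilities

There are three ways to restrict a container's capabilities.

Enable userns-remap in Docker, which maps the root user inside the container to a regular user on the host.
Run the container as a regular user. Docker calls setuid when starting the container, which drops capabilities immediately: only the I set remains, and the process has no capabilities in effect on the host.
Use cap-drop all together with cap-add, so the container has only the few capabilities named by cap-add.

Enabling userns-remap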

#vim /etc/docker/daemon.json
{
	...	
	"userns-remap":"username",
	...
}

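Enabling userns-remap maps users inside the container to the specified host user. When that host user is not root, the root user inside the container maps to a regular user on the host.

Without cap-drop, the container has Docker's default capabilities. When the container runs as root, the E, I, P, and B sets are populated.

3.2.1 Enabling userns-remap

userns-remap enabled	Effect
yes	The container is a `regular user` on the host. To use particular host-side capabilities, add them to the container with cap-add; otherwise the container has only Docker's default capabilities
no	The container is the `root user` on the host, with root's privileges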

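3.2.2 Restricting the user inside the container

User inside the container	Effect
regular user	Inside the container and on the host, only the `I` set is populated (besides the bounding set B). Any operation that needs a capability requires granting it to the program in the Dockerfile beforehand
root user	Inside the container and on the host, the `E, I, P, B` sets are populated. An operation that needs a capability only requires the matching cap-add; no grant in the Dockerfile is needed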

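3.2.3 Using cap-add and cap-drop

Typically cap-drop=all removes Docker's default capabilities, then cap-add adds the ones you choose.

cap-drop and cap-add used	Effect
yes	The container has only the capabilities added by cap-add after cap-drop removed the rest; nothing else
no	The container has Docker's default capabilities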
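The effect of cap-drop all followed by cap-add can be pictured as building a mask from scratch. The sketch below does that for a handful of names; capBits covers only the capabilities that appear in this article, the bit numbers follow capabilities(7), and maskFromAdds is a hypothetical helper rather than anything Docker exposes.

```go
package main

import (
	"fmt"
	"strings"
)

// capBits maps capability names to bit numbers (subset used in this article).
var capBits = map[string]uint{
	"cap_chown": 0, "cap_dac_override": 1, "cap_fowner": 3, "cap_fsetid": 4,
	"cap_kill": 5, "cap_setgid": 6, "cap_setuid": 7, "cap_setpcap": 8,
	"cap_net_bind_service": 10, "cap_net_raw": 13, "cap_sys_chroot": 18,
	"cap_sys_admin": 21, "cap_mknod": 27, "cap_audit_write": 29, "cap_setfcap": 31,
}

// maskFromAdds starts from an empty set (cap-drop all) and sets one bit per
// added capability name.
func maskFromAdds(adds []string) (uint64, error) {
	var mask uint64
	for _, name := range adds {
		bit, ok := capBits[strings.ToLower(strings.TrimSpace(name))]
		if !ok {
			return 0, fmt.Errorf("unknown capability %q", name)
		}
		mask |= uint64(1) << bit
	}
	return mask, nil
}

func main() {
	mask, err := maskFromAdds([]string{"cap_setgid", "cap_setuid", "cap_setfcap"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%016x\n", mask)
}
```

For the three capabilities used in the docker run commands, the result is the same 00000000800000c0 mask that /proc/<pid>/status reported on the host.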

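--security-opt=no-new-privileges used	Effect
yes	Regular-user processes inside the container that were escalated with chmod `(uid != 0, euid = 0)` are restricted and cannot perform `setuid` and similar operations. `No effect for the root user`
no	Regular-user processes inside the container that were escalated with chmod `(uid != 0, euid = 0)` can perform `setuid` and similar operations, switch their uid to 0, and obtain root's full privileges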

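4 Recommendations

4.1 Guiding principle
The capabilities a container holds must be a superset of (>=) the capabilities the program requires; otherwise the program cannot run.

4.2 Recommended option 1
Enable userns-remap, run as root inside the container, use cap-drop all with cap-add={required capabilities}, and set no-new-privileges.

Pros: no setcap is needed in the Dockerfile, so the image has one layer fewer.

Cons: the container user is root, which holds the E, I, and P sets on the host and can perform some operations that need capabilities.

4.3 Recommended option 2
Enable userns-remap, run as a regular user inside the container, use cap-drop all with cap-add={required capabilities}, run setcap in the Dockerfile, and set no-new-privileges.

Pros: the container user is a regular user with only the I set on the host; a process inside the container cannot perform operations that need capabilities unless its executable was given them with setcap in the Dockerfile.

Cons: setcap is required in the Dockerfile, so the image has one more layer, and small privilege-escalation helper programs cannot be used to escalate.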