Kubernetes Container Security: Using Seccomp to Restrict Container Process System Calls

Latest recommended article published on 2024-07-29 15:03:17

富士康质检员张全蛋

Article link: https://blog.csdn.net/qq_34556414/article/details/118484819

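This article describes how seccomp is used in Docker and Kubernetes, how it strengthens container security, and how restricting system calls reduces the attack surface. The default profile provides basic protection, and stricter policies can be tailored to specific needs. In Kubernetes, seccomp became a GA feature in version 1.19 and can be configured through the Pod security context. By monitoring and auditing system calls, seccomp profiles can be created and tuned to keep containers running securely.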

Today we are excited to announce that Kubernetes 1.22, the second release of 2021, has officially arrived!

Default profiles for seccomp

An alpha feature for default seccomp profiles has been added to the kubelet, along with a new command line flag and configuration. When in use, this new feature provides cluster-wide seccomp defaults, using the RuntimeDefault seccomp profile rather than Unconfined by default. This enhances the default security of the Kubernetes Deployment. Security administrators will now sleep better knowing that workloads are more secure by default. To learn more about the feature, please refer to the official seccomp tutorial.

What is `seccomp`, Anyway?

If you've been working with Docker or Kubernetes for a while, you might have heard the term seccomp, but chances are you haven't really looked deeper into this obscure tool, right?

The simplest and easiest to understand definition of seccomp is probably a "firewall for syscalls". seccomp is essentially a mechanism to restrict the system calls that a process may make, so in the same way one might block packets coming from some IPs, one can also block a process from making certain system calls to the kernel.

That's cool and all, but how does that help us make the system more secure? The Linux kernel has a lot of syscalls (a few hundred), but most of them are not needed by any given process. If a process gets compromised and is tricked into using some of these syscalls, though, it can lead to serious security issues for the whole system. So, restricting which syscalls a process can make greatly reduces the attack surface of the kernel.

Now, if you're running any decently up-to-date version of Docker (1.10 or higher), then you're already using seccomp. You can check that using docker info or by looking in /boot/config-*:

~ $ docker info
...
 Security Options:
  apparmor
  seccomp
   Profile: default  # default seccomp profile applied by default
...

~ $ grep SECCOMP /boot/config-$(uname -r)
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y

[root@master ~]# docker info
 Security Options:
  seccomp
   Profile: default

Well, that's great! We already have seccomp, everything is secure and we don't need to mess with any of this, right? Well, not quite - there are quite a few reasons you might want to get your hands dirty and dig deeper into seccomp...

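Seccomp (secure computing mode) is a Linux kernel security feature that restricts which system calls an application process is allowed to make.

A container is really just a process running on the host and sharing the host kernel. If every container can make any system call, a compromised container can easily bypass container isolation, change permissions on the host, or break into the host itself.

Seccomp can be used to restrict the system calls a container makes, which effectively reduces the attack surface.

It is built into common Linux distributions such as CentOS and Ubuntu.

Using seccomp in Kubernetes: seccomp was introduced in Kubernetes 1.3 and became GA in 1.19, so it can be configured in two ways:

• Before 1.19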

annotations:
  seccomp.security.alpha.kubernetes.io/pod: "localhost/<profile>"

• 1.19 and later

apiVersion: v1
kind: Pod
metadata:
  name: hello-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: <profile> # profile file name on the node running the Pod; default directory: /var/lib/kubelet/seccomp
  containers:
...

These are all host-level features: the profile is read from a file on the host and then applied to the container.

Why Bother Creating One?

The default seccomp profile in Docker may often be "good enough" and if you have no experience with it, then "improving" or otherwise customizing it might not be the best use of your time. There are however some reasons that might give you enough motivation to change the default seccomp profile:

From time to time a security vulnerability is found. This is inevitable, and further restricting the default seccomp profile might lower the chance of such a vulnerability affecting your applications. One such incident was CVE-2016-0728, which allowed a user to escalate privileges using the keyctl() syscall. After this incident, this syscall is now blocked by the default seccomp profile, but there might be more such exploits in the future...

Another reason to restrict the seccomp profile is that if a certain subsystem has a security bug, attackers won't be able to exploit it from your containers because the related syscalls have been blocked. This subsystem can be a dependency, a library or some part of the system over which you have no control. You also probably have no way to fix the issue in that subsystem, so to prevent it from being exploited you need to block it at a different layer, e.g. in the seccomp profile.

Potential exploits and security breaches become even more important when you're dealing with PII data or when building mission-critical applications, such as software used in health care or power grid. In these cases any little security improvement counts and customizing seccomp profile might be a worthwhile effort.

Besides all these specific reasons, limiting syscalls that can be used by your containers limits the attack vectors that can be used, e.g. backdoors in Docker upstream images or exploitable bugs in applications.

Getting the basics out of the way

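A basic seccomp profile is built from the following pieces:

• defaultAction: the action applied to any system call that is not listed in the syscalls section

• names: the system call names inside a syscalls entry; several can be listed, one per line

• SCMP_ACT_ERRNO: an action value that blocks the system call

A basic seccomp profile has three key elements: the defaultAction, the architectures (or archMap) and the syscalls: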

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "sched_yield",
                "futex",
                "write",
                "mmap",
                "exit_group",
                "madvise",
                "rt_sigprocmask",
                "getpid",
                "gettid",
                "tgkill",
                "rt_sigaction",
                "read",
                "getpgrp"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}

The defaultAction defines what will happen by default to any system call that is not listed in the syscalls section. To make it simple, let's focus on the two main values you will use: SCMP_ACT_ERRNO, which blocks the execution of the system call, and SCMP_ACT_ALLOW, which does what it says on the tin.

The architectures element defines which architectures you are targeting. This is important because the actual filter applied at kernel level is based on system call IDs and not on the names you define in your profile. The container runtime translates the names into IDs just before applying the filter. This matters because a system call may have a different ID depending on the architecture it runs on. For example, the system call recvfrom, which is used to receive information from a socket, has ID 45 on x86_64 while it has ID 517 in the x32 ABI. Lists of all system calls for x86_64 are available online.

syscalls is where you list all the system calls and the action associated with them. For example, you can create a whitelist by setting the defaultAction to SCMP_ACT_ERRNO and the action within your syscalls section to SCMP_ACT_ALLOW. This way you are whitelisting all the calls you enumerated and blocking everything else. For a blacklist approach, swap the values of defaultAction and action.
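To make the interplay of defaultAction and syscalls concrete, here is a minimal Python sketch of the decision a profile encodes. It is only an illustration: the real filter is compiled to BPF and evaluated in the kernel on syscall IDs, and this sketch ignores architectures and argument rules. The helper name action_for and the two tiny profiles are hypothetical.

```python
def action_for(profile, syscall_name):
    """Return the action a profile assigns to a syscall name.

    Simplified model: ignores architectures and per-argument rules.
    """
    for rule in profile.get("syscalls", []):
        if syscall_name in rule.get("names", []):
            return rule["action"]
    # Anything not listed falls back to the profile-wide default.
    return profile["defaultAction"]


# Whitelist: deny by default, allow only what is listed.
whitelist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}

# Blacklist: the same structure with the two actions swapped.
blacklist = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [{"names": ["chmod"], "action": "SCMP_ACT_ERRNO"}],
}

print(action_for(whitelist, "read"))   # SCMP_ACT_ALLOW (listed)
print(action_for(whitelist, "chmod"))  # SCMP_ACT_ERRNO (falls to default)
print(action_for(blacklist, "chmod"))  # SCMP_ACT_ERRNO (listed)
print(action_for(blacklist, "read"))   # SCMP_ACT_ALLOW (falls to default)
```

The only difference between the two styles is which action sits in defaultAction and which sits in the rule, which is why a whitelist fails closed for syscalls you forgot about while a blacklist fails open.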

Customizing a Profile

Now, if we have established that we have a good reason to change the seccomp profile, how would we actually go about doing that? One way to write a seccomp filter is to use the Berkeley Packet Filter (BPF) language. Using this language isn't really simple or convenient. Luckily we don't have to use it - instead we can write JSON that is compiled into a filter with the help of libseccomp. A minimal example of such a profile could look like this:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": [
                "accept",
                "chown",
                "kill",
                "mmap",
                ...
            ],
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        }
    ]
}

Generally it's preferred to use a whitelist instead of a blacklist, that is, to explicitly list the allowed syscalls and forbid all others. That's exactly what the above profile does - by default, it uses the action SCMP_ACT_ERRNO, which makes every unlisted syscall fail with an error (reported as Operation not permitted in the examples in this article). For the ones we do want to allow, we list their names and specify the SCMP_ACT_ALLOW action.

These JSON profiles can use quite a few options and can become very complex, so the one above is trimmed down to the bare minimum. To see what a real profile looks like, you can check out Docker's default profile.

Now that we understand seccomp profiles a little better, let's play with Docker. Before we try applying any custom profiles though, let's first experiment a little bit and override the seccomp defaults:

# Run without seccomp profile:
~ $ docker run --rm -it --security-opt seccomp=unconfined alpine sh
/ # reboot  # Works, oops
# Run with default seccomp profile
~ $ docker container run --rm -it alpine sh
/ # reboot  # Doesn't work

Above we can see what happens if we disable seccomp completely - any syscall is available, which allows the user in the container to - among other things - reboot the host machine. The --security-opt seccomp=... option is used in this example to disable seccomp. It's also the way to apply custom profiles, so let's build and apply our own custom profile.

It's not a good idea to start from scratch, so we will modify Docker's existing default profile (referenced above) and restrict it a bit by removing the chmod, fchmod and fchmodat syscalls, which effectively denies permission to use the chmod command. This might not be the most reasonable change or "improvement", but for demonstration purposes it works just fine:

# Run with custom seccomp profile (Disallowing chmod)
~ $ docker container run --rm -it --security-opt seccomp=no-chmod.json alpine sh
/ # whoami
root
/ # chmod 777 -R /etc
chmod: /etc: Operation not permitted

In this snippet we load the custom profile from no-chmod.json and try to chmod whole /etc. We can see what this causes in container - any attempt at running chmod results in Operation not permitted.
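The contents of no-chmod.json are not shown in this article. As a rough sketch of how such a file could be derived, the Python snippet below drops the three chmod-related names from a whitelist profile. The function name remove_syscalls is hypothetical, and the base dictionary is only a tiny stand-in for Docker's default profile, which is far longer.

```python
import copy
import json


def remove_syscalls(profile, blocked):
    """Return a copy of profile with the given names dropped from every rule."""
    result = copy.deepcopy(profile)
    for rule in result.get("syscalls", []):
        rule["names"] = [n for n in rule.get("names", []) if n not in blocked]
    # Drop rules that no longer name any syscall.
    result["syscalls"] = [r for r in result.get("syscalls", []) if r["names"]]
    return result


# Tiny stand-in for Docker's default profile (the real one is much longer).
base = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["chmod", "fchmod", "fchmodat", "chown", "read"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

no_chmod = remove_syscalls(base, {"chmod", "fchmod", "fchmodat"})
print(json.dumps(no_chmod, indent=4))
```

With the real profile, base would instead be loaded with json.load from a local copy of the default profile, and the result written to no-chmod.json. Because the profile is a whitelist, removing a name is enough to block the syscall.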

With this simple "no chmod" profile it was easy to pick which syscalls should be blocked. In general, however, it's not so simple to decide what can or cannot be blocked. You can make an educated guess, but you risk both blocking too much and not blocking enough: blocking syscalls your application needs to operate correctly, while also missing syscalls that can and should be blocked.

The only feasible way to choose which syscalls to block is syscall tracing. One way to trace syscalls is to use strace:

~ $ strace -c -f -S name chmod 2>&1 1>/dev/null | tail -n +3 | head -n -2 | awk '{print $(NF)}'
syscall
----------------
access
arch_prctl
brk
close
execve
fstat
mmap
mprotect
munmap
openat
pread64
read
write

The command above gives us a list of all syscalls used by a command (chmod in this case), from which we can choose what to block or allow. If this strace command is a little too complicated for you, strace -c ls would also work; it outputs some extra info such as time and number of calls. We obviously can't block all of these, because that would render our container unusable, so it's best to look up each of them, for example in a syscall table.
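As a sketch of how a traced list like the one above could feed into a profile, the Python snippet below parses that output and wraps the names in a whitelist. The helper names names_from_trace and whitelist_profile are hypothetical. A trace of a single command is only a starting point: the resulting list is likely too narrow to run a real container on its own.

```python
import json

# Output of the strace pipeline above, pasted as-is.
TRACE_OUTPUT = """\
syscall
----------------
access
arch_prctl
brk
close
execve
fstat
mmap
mprotect
munmap
openat
pread64
read
write
"""


def names_from_trace(text):
    """Extract syscall names, skipping the header and separator lines."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "syscall" or set(line) == {"-"}:
            continue
        names.append(line)
    return names


def whitelist_profile(names):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": sorted(set(names)), "action": "SCMP_ACT_ALLOW"}],
    }


profile = whitelist_profile(names_from_trace(TRACE_OUTPUT))
print(json.dumps(profile, indent=4))
```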

What About Kubernetes?

In the title and intro I also mentioned Kubernetes, yet we have only talked about Docker so far. So, what's the situation around seccomp in the Kubernetes world? Well, unfortunately, in Kubernetes seccomp is not used by default, and therefore syscalls are not filtered, except for a few really dangerous ones. So, with Docker you could get away with not configuring any profile and just using the defaults, but in Kubernetes there's really nothing preventing some security issues from being exploited.

How do we "fix" that? That depends on the version of Kubernetes you're using - for versions before 1.19, you will need to apply annotations to your pods. This would look something like this:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/some-profile.json
spec:
    ...

For 1.19 (which we will focus on here) and later, seccomp profiles are a GA feature and you can use the seccompProfile section in the securityContext of a pod. The definition of such a pod would look like this:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/some-profile.json
  containers:
    ...

Now we know how to solve it on Kubernetes, so it's time for a little demonstration. For that we will need a cluster - here I'm going to use KinD (Kubernetes in Docker) to set up a minimal local cluster. Additionally we will also need all the seccomp profiles before we spin up the cluster as these need to be available on the cluster nodes. So, the definition for cluster itself is as follows:

apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
  - hostPath: "./profiles"
    containerPath: "/var/lib/kubelet/seccomp/profiles"

This defines a single-node cluster which mounts the local profiles directory into the node at /var/lib/kubelet/seccomp/profiles. And what do we put in this profiles directory? For the purposes of this example, we will use 3 profiles: Docker's default profile, the "no chmod" profile used previously, and additionally a so-called "audit"/"complain" profile.

We will start with the audit profile. This one doesn't allow or block any syscalls but rather just logs them to syslog when they are used by some command or program. This can be very useful both for debugging and exploring the behavior of an application and for finding syscalls that can or cannot be blocked. The definition of this profile is really just one line and looks like this:

# ./profiles/audit.json
{
    "defaultAction": "SCMP_ACT_LOG"
}

One more thing we will need is - obviously - a pod. This pod will have single Ubuntu container, which runs ls to trigger some syscalls and then sleeps, so it doesn't terminate and restart:

# ./pods/audit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-seccomp
  labels:
    app: audit-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: ubuntu
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "ls /; while true; do sleep 30; done;" ]
    securityContext:
      allowPrivilegeEscalation: false

With that out of the way, let's build the cluster and apply the first profile:

~ $ tree .
.
├── kind.yaml
├── pods
│   ├── audit.yaml
│   ├── default.yaml
│   └── no-chmod.yaml
└── profiles
    ├── audit.json
    └── no-chmod.json

~ $ kind create cluster --image kindest/node:v1.19.4 --config=kind.yaml
~ $ kubectl apply -f pods/audit.yaml
~ $ tail /var/log/syslog

Nov 25 19:38:18 kernel: [461698.749294] audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000    # access
Nov 25 19:38:18 kernel: [461698.749306] audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000   # openat
Nov 25 19:38:18 kernel: [461698.749315] audit: ... syscall=5 compat=0 ip=0x7ff8f8412c99 code=0x7ffc0000     # fstat
Nov 25 19:38:18 kernel: [461698.749317] audit: ... syscall=9 compat=0 ip=0x7ff8f84130e6 code=0x7ffc0000     # mmap
Nov 25 19:38:18 kernel: [461698.749323] audit: ... syscall=3 compat=0 ip=0x7ff8f8412d8b code=0x7ffc0000     # close

We perform the above example in a directory containing our KinD cluster definition, the pods directory and the profiles directory. We first use the kind command to create the cluster from the definition. We then apply the audit-seccomp pod, which uses the audit.json profile. Finally we inspect syslog to see the audit messages logged for the audit-seccomp pod. Each of these messages contains syscall=..., which specifies the syscall ID (for the x86_64 architecture); the corresponding name is given in the comment at the end of each line.
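Translating the IDs by hand gets tedious quickly. The following Python sketch shows one way the lookup could be automated. The mapping only covers the five x86_64 IDs that appear in the excerpt above, and the function name syscalls_in_log is hypothetical; a real tool would need the full syscall table for the node's architecture.

```python
import re

# x86_64 syscall numbers for the five calls seen in the log excerpt above.
# A real tool would need the full table for the target architecture.
X86_64_SYSCALLS = {3: "close", 5: "fstat", 9: "mmap", 21: "access", 257: "openat"}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def syscalls_in_log(lines):
    """Return the distinct syscall names found in audit log lines, in order."""
    seen = []
    for line in lines:
        match = SYSCALL_RE.search(line)
        if not match:
            continue
        number = int(match.group(1))
        name = X86_64_SYSCALLS.get(number, f"unknown({number})")
        if name not in seen:
            seen.append(name)
    return seen


log = [
    "kernel: [461698.749294] audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000",
    "kernel: [461698.749306] audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000",
    "kernel: [461698.749315] audit: ... syscall=5 compat=0 ip=0x7ff8f8412c99 code=0x7ffc0000",
]
print(syscalls_in_log(log))  # ['access', 'openat', 'fstat']
```

The resulting list of names is the raw material for a whitelist: anything the application logs under the audit profile is something a stricter profile would have to allow.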

Now that we have confirmed that seccomp works as expected, we can apply a real profile (Docker's default). To do that we will use a different pod:

# pods/default.yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-seccomp
  labels:
    app: default-seccomp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test-container
    image: r.j3ss.co/amicontained
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "amicontained" ]
    securityContext:
      allowPrivilegeEscalation: false

We made a few changes here. Namely, we changed the seccompProfile section, where we now specify the RuntimeDefault type, and we also changed the image to amicontained, a container introspection tool that will tell us which syscalls are blocked, as well as some other interesting security info.

After applying this pod, we can see the following in the logs:

~ $ kubectl apply -f pods/default.yaml
~ $ kubectl logs default-seccomp
Container Runtime: docker
Has Namespaces:
 pid: true
 user: false
AppArmor Profile: unconfined
Capabilities:
 BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (61):
 PTRACE SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE
    DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY
    REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
    USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE

This shows that Docker's default profile blocks the above 61 syscalls. If we didn't include the seccompProfile section with the RuntimeDefault type, it would be just 22 (you can test for yourself if you don't trust me on this 😉). In my opinion this is a great improvement in security for very little actual effort.

If we decide that the default is not good enough or that we need to make some modifications, we can deploy our custom profile. Here we will demonstrate that with the "no chmod" profile and the following pod:

# pods/no-chmod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-chmod-seccomp
  labels:
    app: no-chmod-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/no-chmod.json
  containers:
  - name: test-container
    image: ubuntu
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "touch test; chmod +x test; while true; do sleep 30; done;" ]
    securityContext:
      allowPrivilegeEscalation: false

This pod is very similar to the "audit" pod shown previously. We just switched the localhostProfile to point to a different file and changed the container args to include a chmod command, so that we can confirm that our modified seccomp profile works as expected:

~ $ kubectl apply -f pods/no-chmod.yaml
~ $ kubectl logs no-chmod-seccomp
chmod: changing permissions of 'test': Operation not permitted

The logs show the expected result - the seccomp profile blocks the attempt to chmod the test file and returns Operation not permitted. This shows that it's quite straightforward to adjust the profile to our liking or needs.

Be careful though when modifying these kinds of default profiles - if you end up blocking a bit too much and your pod/container can't start, the pod will be in Error state, but you will see nothing in logs and nothing useful in events in kubectl describe pod ..., so bear that in mind when debugging seccomp related issues.

Example: preventing a container from using chmod

By default, a container can change the permissions of any file or directory:

[root@master ~]# kubectl exec -it web-server -c sidecar -- sh
/ # chmod 777 /etc/hosts

[root@k8s-node1 ~]# mkdir /var/lib/kubelet/seccomp
[root@k8s-node1 ~]# vim  /var/lib/kubelet/seccomp/chmod.json
[root@k8s-node1 ~]# cat /var/lib/kubelet/seccomp/chmod.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": [
        "chmod"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

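Everything a container does ultimately goes through system calls, so system calls are what needs to be restricted. One approach is to allow all system calls first and then blacklist the ones that must not be used.

The other approach is to deny all system calls first and then add a whitelist.

The profile file must exist on the node where the pod runs.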

[root@k8s-master ~]# cat pod.yaml 
apiVersion: v1
kind: Pod
metadata: 
  name: hello-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: chmod.json
  containers:
  - image: busybox
    name: hello
    command:
      - sleep
      - 24h 

[root@k8s-master ~]# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
hello-seccomp            1/1     Running   0          42s

[root@k8s-master ~]# kubectl exec -it hello-seccomp -- sh
/ # chmod 777 /etc/hosts
chmod: /etc/hosts: Operation not permitted

If the profile is changed to the following:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": [
        "chmod",
        "mkdir",
        "rmdir"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

[root@k8s-master ~]# kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP               NODE        NOMINATED NODE   READINESS GATES
hello-seccomp            1/1     Running   0          2m12s   10.244.36.86     k8s-node1   <none>           <none>

[root@k8s-master ~]# kubectl exec -it hello-seccomp --  sh
/ # mkdir test
mkdir: can't create directory 'test': Operation not permitted
/ # rm -rf /home/
rm: can't remove '/home': Operation not permitted

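Most container runtimes provide a default set of allowed and disallowed system calls. These defaults are easy to apply in Kubernetes, either with the runtime/default annotation or by setting the seccomp type in the security context of a Pod or container to RuntimeDefault.

Documentation for Docker's default profile: https://docs.docker.com/engine/security/seccomp/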

      type: RuntimeDefault

With this type there is no need to name a profile, so the line localhostProfile: chmod.json can be removed.
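The difference between the two profile types can be summarized in a short Python sketch that builds the Pod manifest as a dictionary. The function name seccomp_pod is hypothetical; the field names and values are the ones used in the manifests in this article.

```python
import json


def seccomp_pod(name, image, localhost_profile=None):
    """Build a Pod manifest with a pod-level seccompProfile.

    With localhost_profile set, the profile is read from the node's seccomp
    directory; otherwise the container runtime's default profile is used.
    """
    if localhost_profile is not None:
        seccomp = {"type": "Localhost", "localhostProfile": localhost_profile}
    else:
        seccomp = {"type": "RuntimeDefault"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {"seccompProfile": seccomp},
            "containers": [
                {"name": "hello", "image": image, "command": ["sleep", "24h"]}
            ],
        },
    }


# Custom profile stored on the node, as in the chmod example.
print(json.dumps(seccomp_pod("hello-seccomp", "busybox", "chmod.json"), indent=2))
# Runtime default: no profile file is referenced at all.
print(json.dumps(seccomp_pod("hello-seccomp", "busybox"), indent=2))
```

kubectl accepts JSON manifests as well as YAML, so either output can be saved to a file and applied with kubectl apply -f.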

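System calls determine what a command can do, so seccomp can be used to control whether a given command is able to run: block the system calls a command depends on and the command can no longer execute. For example, an attacker who breaks into your system may rely on all kinds of tools, such as curl or wget; if these are restricted, escalating to the host becomes much harder.

The hard part is writing these policies.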

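Seccomp security profiles for Docker

Secure computing mode (seccomp) is a Linux kernel feature. You can use it to restrict the actions available within the container. The seccomp() system call operates on the seccomp state of the calling process. You can use this feature to restrict your application's access.

This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check whether your kernel supports seccomp: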

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y

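Passing a profile for a container

The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile is published alongside the Docker documentation.

In effect, the profile is a whitelist which denies access to system calls by default and then whitelists specific system calls. The profile works by defining a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. The effect of SCMP_ACT_ERRNO is to cause a Permission Denied error. Next, the profile defines a specific list of system calls which are fully allowed, because their action is overridden to be SCMP_ACT_ALLOW. Finally, some specific rules apply to individual system calls, such as personality and others, to allow variants of those system calls with specific arguments.

seccomp is instrumental for running Docker containers with least privilege. It is not recommended to change the default seccomp profile.

When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a policy: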

$ docker run --rm \
             -it \
             --security-opt seccomp=/path/to/seccomp/profile.json \
             hello-world

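Significant syscalls blocked by the default profile

Docker's default seccomp profile is a whitelist which specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the whitelist. The table includes the reason each syscall is blocked rather than whitelisted.

Syscall	Description
`acct`	Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`.
`add_key`	Prevent containers from using the kernel keyring, which is not namespaced.
`bpf`	Deny loading potentially persistent bpf programs into the kernel, already gated by `CAP_SYS_ADMIN`.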

..................................................................

Running without the default seccomp profile

You can pass unconfined to run a container without the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami

Conclusion

Even after reading this article, modifying or creating your own seccomp profile might not be one of your top priorities. It is however important to be aware of this powerful tool and to be able to use it when needed - for example in Kubernetes, where it's not enforced by default, which can easily become a big security problem. For this reason I would recommend that you - at the very least - enable the "audit" profile, so you can monitor the syscalls being used and use that information to later create your own profile or validate that the default will work for your applications.

Also, if you take away anything from this article, it should probably be that seccomp is an important security layer and that you should never run your containers unconfined.