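Today's topic is security. What should we do when Python code contains harmful content? The usual technique is sandboxing. This post starts with some of the underlying infrastructure.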
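- How do you call the Python interpreter from C++ to execute Python code? (Part 1)
- How do you call the Python interpreter from C++ to execute Python code? (Part 2)
- How do you call the Python interpreter from C++ to execute Python code? (Part 3)
- How do you call the Python interpreter from C++ to execute Python code? (Part 4)
- How do you call the Python interpreter from C++ to execute Python code? (Part 5)
- How do you call the Python interpreter from C++ to execute Python code? (Part 6)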
Infrastructure: seccomp
seccomp is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a “secure” state where it cannot make any system calls except exit, sigreturn, read and write to already-open file descriptors.
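To make the "one-way transition" concrete, here is a minimal sketch of strict mode. It is only an illustration: the helper name `run_in_strict_mode` and its return-code convention are my own invention, and the experiment runs in a forked child so that the calling process stays unrestricted.

```c
#include <assert.h>
#include <signal.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: forks a child, switches it to seccomp strict mode,
 * and reports how the child ended.
 * try_forbidden: 0 = only use allowed system calls, 1 = also call getppid.
 * Returns the child's exit code (99 = strict mode unavailable), the negated
 * signal number if the child was killed, or -1000 if fork/waitpid failed.
 */
int run_in_strict_mode(int try_forbidden)
{
    assert(try_forbidden == 0 || try_forbidden == 1);

    pid_t pid = fork();
    if (pid < 0)
        return -1000;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(99); /* still unrestricted here, so _exit() is fine */

        static const char msg[] = "inside strict mode\n";
        ssize_t n = write(STDOUT_FILENO, msg, sizeof(msg) - 1); /* allowed */
        (void)n;

        if (try_forbidden)
            syscall(SYS_getppid); /* not allowed: the kernel sends SIGKILL */

        /* glibc's _exit() uses exit_group, which strict mode forbids,
         * so issue the plain exit system call directly. */
        syscall(SYS_exit, 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1000;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1000;
}
```

On a kernel with seccomp enabled, I would expect `run_in_strict_mode(0)` to return 0 and `run_in_strict_mode(1)` to return `-SIGKILL`. Strict mode is clearly too coarse for running an interpreter, which is the motivation for the filter mode described next.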
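Seccomp BPF stands for SECure COMPuting with filters. The motivation: the operating system exposes hundreds of system calls to user space, but most applications only need a subset of them. Seccomp BPF provides a filter interface for describing which system calls an application is allowed to use.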
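A filter is installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog), where prog points to a struct sock_fprog that defines the filter. Berkeley Packet Filter (BPF) has been used for socket filtering for many years and is very expressive, so the system call filter uses the BPF format as well (a kernel design choice).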
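BPF is interesting in that it has its own instruction set for writing filter programs. As an example (borrowed from an external source), the following program forbids the execve system call: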
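#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* Load the 4 bytes at offset 0 of struct seccomp_data (the system call number) into the accumulator. */
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, 0),
        /* Is it 59 (execve on x86_64)? If so, fall through; otherwise skip the next instruction. */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 59, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),  /* execve: kill the caller */
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), /* anything else: allow */
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])), /* number of instructions */
        .filter = filter,                                            /* pointer to the instruction array */
    };
    /* NO_NEW_PRIVS must be set before an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl");
        return 1;
    }
    return 0; /* from this point on, any execve() would be fatal */
}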
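Killing the caller is not the only possible verdict. A filter can also make the system call fail with an error code, which is often easier to debug. The sketch below is my own variation, not code from the referenced article: the helper names `deny_with_eperm` and `getppid_is_blocked` are hypothetical, and getppid is used as the target only because it is harmless to block. A production filter would normally also check `seccomp_data.arch` first; that is left out to keep the sketch short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: installs a filter in the calling process that makes
 * system call `nr` fail with EPERM and allows everything else.
 * Returns 0 on success, -1 on failure. */
static int deny_with_eperm(long nr)
{
    assert(nr >= 0);
    struct sock_filter filter[] = {
        /* Load seccomp_data.nr (the system call number). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Equal to nr: fall through; otherwise skip one instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 1 if getppid failed with EPERM inside a filtered child,
 * 0 if it still worked, -1 if the filter could not be installed. */
int getppid_is_blocked(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_with_eperm(SYS_getppid) != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

The filter is installed in a forked child because, like strict mode, it cannot be removed once it is in place.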
Summary: with seccomp-bpf we can place custom constraints on system calls and strike a balance between security and functionality.
For more discussion of seccomp-bpf, this article is very good: https://xz.aliyun.com/t/11480
Infrastructure: Linux Namespaces
Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, host-names, user IDs, file names, some names associated with network access, and inter-process communication.
Namespaces are a fundamental aspect of containers in Linux.
The following summary, taken from another article, covers a fairly complete set of namespace resources:
For more on namespace concepts, see the wiki.
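One easy way to see which namespaces a process belongs to is to read the symbolic links under /proc/self/ns. The sketch below assumes procfs is mounted at /proc. The function names `read_namespace_id` and `print_namespace_ids` are made up for this post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads the namespace identifier of the calling process
 * for one namespace type (e.g. "pid"), such as "pid:[4026531836]".
 * Returns the length of the identifier, or -1 on failure. */
ssize_t read_namespace_id(const char *ns_type, char *buf, size_t buf_size)
{
    assert(ns_type != NULL && buf != NULL && buf_size > 1);

    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", ns_type);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    ssize_t len = readlink(path, buf, buf_size - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0'; /* readlink() does not terminate the string */
    return len;
}

/* Prints one line per namespace type; types this kernel lacks are skipped. */
void print_namespace_ids(void)
{
    static const char *const types[] = {
        "mnt", "pid", "net", "ipc", "uts", "user", "cgroup", "time"
    };
    char id[64];
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        if (read_namespace_id(types[i], id, sizeof(id)) > 0)
            printf("%-6s -> %s\n", types[i], id);
    }
}
```

Two processes are in the same namespace of a given type exactly when these identifiers match, so running `print_namespace_ids()` inside and outside a jail is a quick way to check which namespaces were actually unshared.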
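Building nsjail requires bison 3.0+.
Note in particular: the protobuf library version must match the version of the protoc binary; otherwise you will run into all kinds of link errors.
Build command: PATH=/.vos/.dep_cache/7d6d26725ac1e91bc824e1be337cf31e/bin/:/share/nsjail/bison/bin/:$PATH make -j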
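On my system, clone(flags=CLONE_NEWUSER) is not supported yet, so I have to pass --disable_clone_newuser to drop this flag. That option requires euid==0, which is why the command below is run with sudo.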
[xiaochu.yh ~/tools/nsjail] (master) $sudo LD_LIBRARY_PATH=/.vos/.dep_cache/7d6d26725ac1e91bc824e1be337cf31e/var/usr/local/gcc-5.2.0/lib64/ nsjail -Mr --chroot / -R /tmp/ --user 99999 --group 99999 --disable_clone_newuser -- /bin/sh -i
[I][2023-03-07T21:56:30+0800] Mode: STANDALONE_RERUN
[I][2023-03-07T21:56:30+0800] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/sh', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:false, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2023-03-07T21:56:30+0800] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2023-03-07T21:56:30+0800] Mount: '/proc' flags:MS_RDONLY type:'proc' options:'' dir:true
[I][2023-03-07T21:56:30+0800] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[I][2023-03-07T21:56:30+0800] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2023-03-07T21:56:30+0800][1] initNs():223 prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL): Invalid argument
[I][2023-03-07T21:56:30+0800] Executing '/bin/sh' for '[STANDALONE MODE]'
sh: cannot set terminal process group (-1): Inappropriate ioctl for device
sh: no job control in this shell
sh-4.2$ ls
bin boot data dev etc home lib lib64 lost+found media mnt ob opt proc root run sbin share srv sys tmp u01 usr var
sh-4.2$ ps wuax
99999 1 0.1 0.0 13760 1744 ? SNs 21:59 0:00 /bin/sh -i
99999 2 0.0 0.0 49556 1716 ? RN 21:59 0:00 ps wuax
sh-4.2$ id
uid=99999 gid=99999 groups=99999
sh-4.2$ echo "abc" > /tmp/abc.txt
sh: /tmp/abc.txt: Read-only file system
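Before reaching for --disable_clone_newuser, it can be useful to check whether the current system allows creating user namespaces at all. The following is a rough probe, not something nsjail provides: the function name `user_namespace_available` is hypothetical. It uses unshare instead of clone for brevity, on the assumption that both are subject to the same kernel restrictions.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_NEWUSER */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if a child process could create a new user
 * namespace, 0 if the kernel refused, -1 if the probe itself failed. */
int user_namespace_available(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Done in the child so the caller's own namespaces stay untouched. */
        long rc = syscall(SYS_unshare, CLONE_NEWUSER);
        _exit(rc == 0 ? 0 : 1);
    }

    assert(pid > 0); /* parent from here on */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

A result of 0 would be consistent with what I saw above, although the probe does not tell you why the kernel refused (old kernel, sysctl setting, container policy, and so on).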
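The session shows the jail at work: the shell runs as PID 1 in its own PID namespace with uid/gid 99999, and writing to /tmp fails because the file system is mounted read-only. nsjail's options are as follows: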
Usage: nsjail [options] -- path_to_command [args]
--help|-h
Help plz..
--mode|-M VALUE
Execution mode (default: 'o' [MODE_STANDALONE_ONCE]):
l: Wait for connections on a TCP port (specified with --port) [MODE_LISTEN_TCP]
o: Launch a single process on the console using clone/execve [MODE_STANDALONE_ONCE]
e: Launch a single process on the console using execve [MODE_STANDALONE_EXECVE]
r: Launch a single process on the console with clone/execve, keep doing it forever [MODE_STANDALONE_RERUN]
--config|-C VALUE
Configuration file in the config.proto ProtoBuf format (see configs/ directory for examples)
--exec_file|-x VALUE
File to exec (default: argv[0])
--execute_fd
Use execveat() to execute a file-descriptor instead of executing the binary path. In such case argv[0]/exec_file denotes a file path before mount namespacing
--chroot|-c VALUE
Directory containing / of the jail (default: none)
--no_pivotroot
When creating a mount namespace, use mount(MS_MOVE) and chroot rather than pivot_root. Useful when pivot_root is disallowed (e.g. initramfs). Note: escapable in some configurations
--rw
Mount chroot dir (/) R/W (default: R/O)
--user|-u VALUE
Username/uid of processes inside the jail (default: your current uid). You can also use inside_ns_uid:outside_ns_uid:count convention here. Can be specified multiple times
--group|-g VALUE
Groupname/gid of processes inside the jail (default: your current gid). You can also use inside_ns_gid:global_ns_gid:count convention here. Can be specified multiple times
--hostname|-H VALUE
UTS name (hostname) of the jail (default: 'NSJAIL')
--cwd|-D VALUE
Directory in the namespace the process will run (default: '/')
--port|-p VALUE
TCP port to bind to (enables MODE_LISTEN_TCP) (default: 0)
--bindhost VALUE
IP address to bind the port to (only in [MODE_LISTEN_TCP]), (default: '::')
--max_conns VALUE
Maximum number of connections across all IPs (only in [MODE_LISTEN_TCP]), (default: 0 (unlimited))
--max_conns_per_ip|-i VALUE
Maximum number of connections per one IP (only in [MODE_LISTEN_TCP]), (default: 0 (unlimited))
--log|-l VALUE
Log file (default: use log_fd)
--log_fd|-L VALUE
Log FD (default: 2)
--time_limit|-t VALUE
Maximum time that a jail can exist, in seconds (default: 600)
--max_cpus VALUE
Maximum number of CPUs a single jailed process can use (default: 0 'no limit')
--daemon|-d
Daemonize after start
--verbose|-v
Verbose output
--quiet|-q
Log warning and more important messages only
--really_quiet|-Q
Log fatal messages only
--keep_env|-e
Pass all environment variables to the child process (default: all envars are cleared)
--env|-E VALUE
Additional environment variable (can be used multiple times). If the envar doesn't contain '=' (e.g. just the 'DISPLAY' string), the current envar value will be used
--keep_caps
Don't drop any capabilities
--cap VALUE
Retain this capability, e.g. CAP_PTRACE (can be specified multiple times)
--silent
Redirect child process' fd:0/1/2 to /dev/null
--stderr_to_null
Redirect child process' fd:2 (STDERR_FILENO) to /dev/null
--skip_setsid
Don't call setsid(), allows for terminal signal handling in the sandboxed process. Dangerous
--pass_fd VALUE
Don't close this FD before executing the child process (can be specified multiple times), by default: 0/1/2 are kept open
--disable_no_new_privs
Don't set the prctl(NO_NEW_PRIVS, 1) (DANGEROUS)
--rlimit_as VALUE
RLIMIT_AS in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 4096)
--rlimit_core VALUE
RLIMIT_CORE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 0)
--rlimit_cpu VALUE
RLIMIT_CPU, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 600)
--rlimit_fsize VALUE
RLIMIT_FSIZE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 1)
--rlimit_nofile VALUE
RLIMIT_NOFILE, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 32)
--rlimit_nproc VALUE
RLIMIT_NPROC, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
--rlimit_stack VALUE
RLIMIT_STACK in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
--rlimit_memlock VALUE
RLIMIT_MEMLOCK in KB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
--rlimit_rtprio VALUE
RLIMIT_RTPRIO, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
--rlimit_msgqueue VALUE
RLIMIT_MSGQUEUE in bytes, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
--disable_rlimits
Disable all rlimits, default to limits set by parent
--disable_clone_newnet|-N
Don't use CLONE_NEWNET. Enable global networking inside the jail
--disable_clone_newuser
Don't use CLONE_NEWUSER. Requires euid==0
--disable_clone_newcgroup
Don't use CLONE_NEWCGROUP. Might be required for kernel versions < 4.6
--enable_clone_newtime
Use CLONE_NEWTIME. Supported with kernel versions >= 5.3
--uid_mapping|-U VALUE
Add a custom uid mapping of the form inside_uid:outside_uid:count. Setting this requires newuidmap (set-uid) to be present
--gid_mapping|-G VALUE
Add a custom gid mapping of the form inside_gid:outside_gid:count. Setting this requires newgidmap (set-uid) to be present
--bindmount_ro|-R VALUE
List of mountpoints to be mounted --bind (ro) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
--bindmount|-B VALUE
List of mountpoints to be mounted --bind (rw) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
--tmpfsmount|-T VALUE
List of mountpoints to be mounted as tmpfs (R/W) inside the container. Can be specified multiple times. Supports 'dest' syntax. Alternatively, use '-m none:dest:tmpfs:size=8388608'
--mount|-m VALUE
Arbitrary mount, format src:dst:fs_type:options
--symlink|-s VALUE
Symlink, format src:dst
--disable_proc
Disable mounting procfs in the jail
--proc_path VALUE
Path used to mount procfs (default: '/proc')
--proc_rw
Is procfs mounted as R/W (default: R/O)
--seccomp_policy|-P VALUE
Path to file containing seccomp-bpf policy (see kafel/)
--seccomp_string VALUE
String with kafel seccomp-bpf policy (see kafel/)
--seccomp_log
Use SECCOMP_FILTER_FLAG_LOG. Log all actions except SECCOMP_RET_ALLOW. Supported since kernel version 4.14
--nice_level VALUE
Set jailed process niceness (-20 is highest priority, 19 is lowest). By default, set to 19
--cgroup_mem_max VALUE
Maximum number of bytes to use in the group (default: '0' - disabled)
--cgroup_mem_memsw_max VALUE
Maximum number of memory+swap bytes to use (default: '0' - disabled)
--cgroup_mem_swap_max VALUE
Maximum number of swap bytes to use (default: '-1' - disabled)
--cgroup_mem_mount VALUE
Location of memory cgroup FS (default: '/sys/fs/cgroup/memory')
--cgroup_mem_parent VALUE
Which pre-existing memory cgroup to use as a parent (default: 'NSJAIL')
--cgroup_pids_max VALUE
Maximum number of pids in a cgroup (default: '0' - disabled)
--cgroup_pids_mount VALUE
Location of pids cgroup FS (default: '/sys/fs/cgroup/pids')
--cgroup_pids_parent VALUE
Which pre-existing pids cgroup to use as a parent (default: 'NSJAIL')
--cgroup_net_cls_classid VALUE
Class identifier of network packets in the group (default: '0' - disabled)
--cgroup_net_cls_mount VALUE
Location of net_cls cgroup FS (default: '/sys/fs/cgroup/net_cls')
--cgroup_net_cls_parent VALUE
Which pre-existing net_cls cgroup to use as a parent (default: 'NSJAIL')
--cgroup_cpu_ms_per_sec VALUE
Number of milliseconds of CPU time per second that the process group can use (default: '0' - no limit)
--cgroup_cpu_mount VALUE
Location of cpu cgroup FS (default: '/sys/fs/cgroup/cpu')
--cgroup_cpu_parent VALUE
Which pre-existing cpu cgroup to use as a parent (default: 'NSJAIL')
--cgroupv2_mount VALUE
Location of cgroupv2 directory (default: '/sys/fs/cgroup')
--use_cgroupv2
Use cgroup v2
--detect_cgroupv2
Use cgroupv2, if it is available. (Specify instead of use_cgroupv2)
--iface_no_lo
Don't bring the 'lo' interface up
--iface_own VALUE
Move this existing network interface into the new NET namespace. Can be specified multiple times
--macvlan_iface|-I VALUE
Interface which will be cloned (MACVLAN) and put inside the subprocess' namespace as 'vs'
--macvlan_vs_ip VALUE
IP of the 'vs' interface (e.g. "")
--macvlan_vs_nm VALUE
Netmask of the 'vs' interface (e.g. "")
--macvlan_vs_gw VALUE
Default GW for the 'vs' interface (e.g. "")
--macvlan_vs_ma VALUE
MAC-address of the 'vs' interface (e.g. "ba:ad:ba:be:45:00")
--macvlan_vs_mo VALUE
Mode of the 'vs' interface. Can be either 'private', 'vepa', 'bridge' or 'passthru' (default: 'private')
--disable_tsc
Disable rdtsc and rdtscp instructions. WARNING: To make it effective, you also need to forbid `prctl(PR_SET_TSC, PR_TSC_ENABLE, ...)` in seccomp rules! (x86 and x86_64 only). Dynamic binaries produced by GCC seem to rely on RDTSC, but static ones should work.
--forward_signals
Forward fatal signals to the child process instead of always using SIGKILL.
Examples:
Wait on a port 31337 for connections, and run /bin/sh
nsjail -Ml --port 31337 --chroot / -- /bin/sh -i
Re-run echo command as a sub-process
nsjail -Mr --chroot / -- /bin/echo "ABC"
Run echo command once only, as a sub-process
nsjail -Mo --chroot / -- /bin/echo "ABC"
Execute echo command directly, without a supervising process
nsjail -Me --chroot / --disable_proc -- /bin/echo "ABC"
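The --rlimit_* options in the help text above are named after the resource limits of setrlimit(2). To get a feeling for what such a limit does, here is a small sketch that applies RLIMIT_NOFILE in a forked child and checks that the kernel enforces it. This is only an illustration of the underlying mechanism, not nsjail's implementation. The function name `nofile_limit_is_enforced` is hypothetical, and the value 32 in the usage note is simply the --rlimit_nofile default shown in the help.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: limits a child to max_fds open file descriptors and
 * lets it open /dev/null until the kernel refuses.
 * Returns 1 if open() failed with EMFILE as expected, 0 if the child managed
 * to open more files than allowed, -1 on any other failure. */
int nofile_limit_is_enforced(rlim_t max_fds)
{
    assert(max_fds > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit lim = { .rlim_cur = max_fds, .rlim_max = max_fds };
        if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
            _exit(2);
        /* At most max_fds descriptors can be open at once, so one of these
         * max_fds + 1 attempts has to fail if the limit is enforced. */
        for (rlim_t i = 0; i <= max_fds; i++) {
            if (open("/dev/null", O_RDONLY) < 0)
                _exit(errno == EMFILE ? 0 : 2);
        }
        _exit(1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

Calling `nofile_limit_is_enforced(32)` mirrors the default of --rlimit_nofile. The limit is set in the child only, because lowering the hard limit cannot be undone by an unprivileged process.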