cgroups

云满笔记

Last modified: 2023-09-06 21:02:40

First published: 2023-04-13 17:34:50

Article link: https://blog.csdn.net/wan212000/article/details/130135923

1. cgroups

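1.1. Introduction

cgroups is a mechanism provided by the Linux kernel for limiting the resources used by a single process or a group of processes. It allows fine-grained control over resources such as CPU and memory. Docker, the increasingly popular lightweight container engine, relies on the resource-limiting capability of cgroups to control CPU, memory, and other resources.

Developers can also use this fine-grained control directly to limit the resource usage of one process or a group of processes. For example, on an eight-core server that hosts both a front-end web service and a back-end compute module, cgroups can restrict the web server to six of the cores and leave the remaining two to the compute module.

This article describes the principles and usage of cgroups from four angles:

The concepts and principles of cgroups
The concepts and principles of the cgroups filesystem
How to use cgroups
Practical examples of cgroups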

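1.2. Concepts and principles

1.2.1. cgroups subsystems

cgroups stands for control groups. cgroups defines one subsystem for each type of resource it can control. Typical subsystems are:

cpu subsystem: mainly limits the CPU usage of processes.
cpuacct subsystem: reports CPU usage statistics for the processes in a cgroup.
cpuset subsystem: assigns dedicated CPU nodes or memory nodes to the processes in a cgroup.
memory subsystem: limits the memory usage of processes.
blkio subsystem: limits the block device I/O of processes.
devices subsystem: controls which devices processes can access.
net_cls subsystem: tags the network packets of processes in a cgroup so that the tc (traffic control) module can then act on those packets.
freezer subsystem: suspends or resumes the processes in a cgroup.
ns subsystem: lets processes in different cgroups use different namespaces.

Each of these subsystems works together with other kernel modules to enforce its control. CPU limits are enforced by the process scheduler according to the configuration of the cpu subsystem; memory limits are enforced by the memory manager according to the configuration of the memory subsystem; and control of network packets requires the Traffic Control subsystem. This article does not discuss how the kernel uses each subsystem to enforce limits. It focuses instead on how the kernel organizes the cgroups resource-limit configuration, how the kernel associates that configuration with processes, and how the kernel exposes cgroups to user space through the cgroups filesystem.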

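2. Using Cgroup

Cgroup (Control Group) is a resource management and limiting mechanism provided by the Linux kernel. It groups processes and applies resource limits, priority adjustments, and similar operations to the processes in each group. This part covers the basics of Cgroup in seven sections: introduction, installation and mounting, basic operations, everyday memory and CPU limits, other uses, nested Cgroups, and things to watch out for.

2.1. Introduction

The main goal of Cgroup is to provide a unified interface for managing system resources such as CPU, memory, and disk I/O. Its core components are:

Cgroup filesystem: stores the configuration and state data of Cgroups.
Subsystem: implements the management and limiting of one specific resource, for example the CPU subsystem or the memory subsystem.
Control group: groups processes; each control group can be associated with one or more subsystems.

The /sys/fs/cgroup directory lists the subsystem resource controllers:

cpu: limits the CPU usage of a cgroup
cpuacct: reports the CPU usage of a cgroup
cpuset: binds a cgroup to specific CPUs and NUMA nodes
freezer: suspends and resumes all processes in a cgroup
memory: reports and limits the memory usage of a cgroup (process memory, kernel memory, swap)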

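2.2. Installation and mounting

2.2.1. Installing the Cgroup tools

On Debian/Ubuntu, install the Cgroup tools with the following command (some systems ship them preinstalled):

sudo apt-get install cgroup-tools

On RHEL/CentOS, install them with:

sudo yum install libcgroup-tools

2.2.2. Mounting the Cgroup filesystem

Create a mount point and mount the Cgroup filesystem (on most modern distributions /sys/fs/cgroup is already mounted at boot, in which case this step can be skipped):

sudo mkdir -p /sys/fs/cgroup

sudo mount -t cgroup -o none,name=cgroup cgroup /sys/fs/cgroup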

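2.3. Basic operations

2.3.1. Mounting and creating a CGROUP tree

Create a cgroup tree attached to all subsystems and mount it under /sys/fs/cgroup (xxx is the name of the cgroup tree):

sudo mount -t cgroup xxx /sys/fs/cgroup

A tree can also be created without attaching any subsystem, and it can be mounted in any other directory. For example, to mount a tree named demo at ~/test_aa/demo (run from ~/test_aa):

sudo mount -t cgroup -o none,name=demo demo ./demo/

cgroup.procs: the IDs of all processes in the current cgroup

tasks: the IDs of all threads in the current cgroup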

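2.3.2. Creating a child Cgroup

Create a child cgroup by creating a directory inside the cgroup tree.

2.3.3. Adding a process to a Cgroup

To add a process to the child cgroup:

Find the PID of the current shell; here it is 4150.

Write this PID into cgroup.procs.

Check that it was added successfully.

Child processes of that process join the cgroup automatically. Run top in the same shell and look at cgroup.procs again: the PID of top is now listed as well.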
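The steps above (create a child directory, write a PID into cgroup.procs, read it back) can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name add_pid_to_cgroup and the child name child1 are made up for this example, and the caller is assumed to have write access to the cgroup tree (typically root).

```shell
# add_pid_to_cgroup PARENT NAME PID   (hypothetical helper, not part of any tool)
# Creates the child cgroup PARENT/NAME with a plain mkdir, moves PID into it
# by writing the PID to the child's cgroup.procs, then checks the result.
add_pid_to_cgroup() {
    local parent="$1"
    local name="$2"
    local pid="$3"
    local child="$parent/$name"

    mkdir -p "$child" || return 1
    echo "$pid" > "$child/cgroup.procs" || return 1

    # The PID should now be listed, one PID per line, in cgroup.procs.
    grep -qx "$pid" "$child/cgroup.procs"
}

# Example, as root, on the demo tree mounted at ./demo:
#   add_pid_to_cgroup ./demo child1 "$$" && cat ./demo/child1/cgroup.procs
```

Passing "$$" adds the current shell, so every command started from that shell afterwards lands in the same cgroup.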

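2.3.4. Checking Cgroup status

cat /proc/{pid}/cgroup, where {pid} is the PID of the process to query.

The reported path is relative to the mount point of the hierarchy, which is under /sys/fs/cgroup by default.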
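Each line of /proc/{pid}/cgroup has the form hierarchy-ID:controller-list:path. As a rough sketch, the path for one controller can be extracted with awk; the function name cgroup_path_for is an assumption made for this example.

```shell
# cgroup_path_for CONTROLLER < /proc/PID/cgroup   (hypothetical helper)
# Prints the cgroup path of the hierarchy whose controller list contains
# CONTROLLER. The path is relative to that hierarchy's mount point.
cgroup_path_for() {
    awk -F: -v want="$1" '
        {
            n = split($2, ctrls, ",")
            for (i = 1; i <= n; i++) {
                if (ctrls[i] == want) {
                    # Strip "ID:controllers:" and keep the rest, so a path
                    # that itself contains ":" is not truncated.
                    path = $0
                    sub(/^[^:]*:[^:]*:/, "", path)
                    print path
                    exit
                }
            }
        }'
}

# Sample input in the same format as /proc/PID/cgroup:
printf '4:cpu,cpuacct:/my_group\n0::/user.slice\n' | cgroup_path_for cpuacct
```

On a live system the call would look like cgroup_path_for memory < /proc/self/cgroup.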

root@mst6g:/sys/fs/cgroup/cpu,cpuacct# systemd-cgls

systemctl status moss.service also shows the cgroup status.

Alternatively, the cgcreate command can be used to create and configure new control groups.

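2.3.5. Creating a control group

Use the cgcreate command to create a new control group:

sudo cgcreate -g cpu,memory:my_group

This creates a control group named my_group under the cpu and memory hierarchies in /sys/fs/cgroup, attached to the CPU and memory subsystems.

2.3.6. Adding a process to a control group

Use the cgclassify command to add a process to a control group:

sudo cgclassify -g cpu,memory:my_group PID

Here PID is the ID of the process to add.

2.3.7. Moving a process between control groups

Use the cgclassify command to move a process from one control group to another:

sudo cgclassify -g cpu,memory:another_group PID

2.3.8. Deleting a control group

Use the cgdelete command to delete a control group:

sudo cgdelete cpu,memory:my_group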

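2.4. Everyday memory and CPU limits

2.4.1. Limiting CPU usage

Create a control group and set a limit on its CPU usage (run the echo commands as root):

sudo cgcreate -g cpu:cpu_limit

echo 10000 > /sys/fs/cgroup/cpu/cpu_limit/cpu.cfs_quota_us

echo 200000 > /sys/fs/cgroup/cpu/cpu_limit/cpu.cfs_period_us

(With these settings, within each 200000-microsecond period the group may use at most 10000 microseconds of CPU time, that is, 5% of one CPU. Once the quota is used up, the group's tasks are throttled until the next period begins, which is how CPU usage is limited.)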
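The ceiling implied by a quota/period pair is simply quota divided by period. A minimal sketch of that arithmetic, with cpu_limit_percent as a made-up helper name:

```shell
# cpu_limit_percent QUOTA_US PERIOD_US   (hypothetical helper)
# Prints the CPU ceiling as a percentage of one CPU (100.0 = one full core).
cpu_limit_percent() {
    LC_ALL=C awk -v q="$1" -v p="$2" 'BEGIN { printf "%.1f\n", q * 100 / p }'
}

cpu_limit_percent 10000 200000    # the values used above; prints 5.0
```

Values above 100 are possible on multi-core machines: a quota of 200000 with a period of 100000 allows the equivalent of two full cores.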

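Add the process to the control group:

sudo cgclassify -g cpu:cpu_limit PID

Here is the result of one such attempt:

Before limiting the CPU: peak CPU usage fluctuated between 20% and 45%.

After limiting the CPU: peak CPU usage fluctuated between 10% and 23%, and the application stuttered noticeably during swipe and tap operations.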

2.4.2. Limiting memory usage

Create a control group and set its memory limit to 300 MB:

sudo cgcreate -g memory:mem_limit

echo $((300 * 1024 * 1024)) > /sys/fs/cgroup/memory/mem_limit/memory.limit_in_bytes

Add the process to the control group:

sudo cgclassify -g memory:mem_limit PID

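2.5. Other uses

2.5.1. Limiting disk I/O rate

Create a control group and limit its disk read rate to 10 MB/s (8:0 is the major:minor number of the block device, typically /dev/sda):

sudo cgcreate -g blkio:io_limit

echo "8:0 $((10 * 1024 * 1024))" > /sys/fs/cgroup/blkio/io_limit/blkio.throttle.read_bps_device

Add the process to the control group:

sudo cgclassify -g blkio:io_limit PID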

2.5.2. Limiting network bandwidth

Create a control group and limit its network bandwidth to 1 Mbps:

sudo cgcreate -g net_cls:net_limit

echo 0x10001 > /sys/fs/cgroup/net_cls/net_limit/net_cls.classid

Configure the limit with the tc command (the classid 0x10001 corresponds to the tc class 1:1 created here):

sudo tc qdisc add dev eth0 root handle 1: htb

sudo tc class add dev eth0 parent 1: classid 1:1 htb rate 1mbit

sudo tc filter add dev eth0 parent 1: protocol ip prio 1 handle 1: cgroup

Add the process to the control group:

sudo cgclassify -g net_cls:net_limit PID
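The value written to net_cls.classid has the form 0xAAAABBBB, where AAAA is the major number and BBBB the minor number of the tc class, both in hexadecimal. The conversion from a tc handle can be sketched as follows; tc_handle_to_classid is a name invented for this example.

```shell
# tc_handle_to_classid MAJOR:MINOR   (hypothetical helper)
# tc writes both parts of a class handle in hexadecimal, so "10:1" means
# major 0x10, minor 0x1. The classid packs them as (major << 16) | minor.
tc_handle_to_classid() {
    local major="${1%%:*}"
    local minor="${1##*:}"
    printf '0x%x\n' $(( (0x${major:-0} << 16) | 0x${minor:-0} ))
}

tc_handle_to_classid 1:1     # the tc class used in this section; prints 0x10001
```

Deriving the classid from the handle avoids a mismatch between the value in net_cls.classid and the class that tc actually shapes.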

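2.6. Nested Cgroups

Cgroups can be nested, that is, child control groups can be created inside a control group. For example, create a control group named parent_group with two child control groups, child_group1 and child_group2:

sudo cgcreate -g cpu,memory:parent_group

sudo cgcreate -g cpu,memory:parent_group/child_group1

sudo cgcreate -g cpu,memory:parent_group/child_group2

Resource limits and priorities can then be set separately for each child control group.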

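2.7. Things to watch out for

Avoid limiting resources too aggressively when using Cgroup; otherwise processes may slow down or fail to run properly.
Before deleting a control group, make sure all processes in it have exited; otherwise resources may leak.
When using Cgroup for resource monitoring, read the state files periodically so that potential problems are found and handled in time.
When using Cgroup for priority adjustment, balance the resource allocation across control groups to avoid resource contention and similar problems.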
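The second point can be checked before deletion by looking at cgroup.procs. The following is a small sketch; cgroup_is_empty is a helper name made up for this example, and the path in the usage comment assumes the v1 layout used earlier in this part.

```shell
# cgroup_is_empty DIR   (hypothetical helper)
# Succeeds only when DIR/cgroup.procs is readable and lists no PIDs.
cgroup_is_empty() {
    [ -r "$1/cgroup.procs" ] && [ -z "$(cat "$1/cgroup.procs")" ]
}

# Example: delete my_group only when no process is left in it.
#   cgroup_is_empty /sys/fs/cgroup/cpu/my_group && sudo cgdelete cpu:my_group
```

A missing or unreadable cgroup.procs is treated as "not empty", so the deletion is skipped whenever the state of the group cannot be confirmed.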

3. Using cgroups to limit CPU utilization

3.1. Intro to cgroups

Cgroups (control groups) make it possible to allocate system resources such as CPU time, memory, disk I/O and network bandwidth, or combinations of them, among a group of tasks (processes) running on a system.

The following commands list the available subsystems (resource controllers) for cgroups. Each subsystem has a set of tunables that control resource allocation.

$ lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids
rdma /sys/fs/cgroup/rdma

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

3.2. CPU subsystem and tunables

3.2.1. Ceiling enforcement parameters

cpu.cfs_period_us

specifies a period of time in microseconds (µs, represented here as "us") for how regularly a cgroup's access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of the cpu.cfs_period_us parameter is 1 second and the lower limit is 1000 microseconds.

cpu.cfs_quota_us

specifies the total amount of time in microseconds (µs, represented here as “us”) for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000.

Setting the value in cpu.cfs_quota_us to -1 indicates that the cgroup does not adhere to any CPU time restrictions. This is also the default value for every cgroup (except the root cgroup).

3.2.2. Relative shares parameter

cpu.shares

contains an integer value that specifies a relative share of CPU time available to the tasks in a cgroup. For example, tasks in two cgroups that have cpu.shares set to 100 will receive equal CPU time, but tasks in a cgroup that has cpu.shares set to 200 receive twice the CPU time of tasks in a cgroup where cpu.shares is set to 100. The value specified in the cpu.shares file must be 2 or higher.

Note that shares of CPU time are distributed per all CPU cores on multi-core systems. Even if a cgroup is limited to less than 100% of CPU on a multi-core system, it may use 100% of each individual CPU core.

Using relative shares to specify CPU access has two implications on resource management that should be considered:

- Because the CFS does not demand equal usage of CPU, it is hard to predict how much CPU time a cgroup will be allowed to utilize. When tasks in one cgroup are idle and are not using any CPU time, the leftover time is collected in a global pool of unused CPU cycles. Other cgroups are allowed to borrow CPU cycles from this pool.

- The actual amount of CPU time that is available to a cgroup can vary depending on the number of cgroups that exist on the system. If a cgroup has a relative share of 1000 and two other cgroups have a relative share of 500, the first cgroup receives 50% of all CPU time in cases when processes in all cgroups attempt to use 100% of the CPU. However, if another cgroup is added with a relative share of 1000, the first cgroup is only allowed 33.3% of the CPU (the other cgroups receive 16.7%, 16.7%, and 33.3% of the CPU).
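The percentages in the second point follow from dividing each share by the sum of all shares. A minimal sketch of that calculation, assuming every cgroup is fully busy; cpu_share_percent is a helper name invented for this example.

```shell
# cpu_share_percent SHARE...   (hypothetical helper)
# For each cpu.shares value given, prints the percentage of CPU time that
# cgroup receives when all cgroups compete: share * 100 / sum(shares).
cpu_share_percent() {
    LC_ALL=C awk 'BEGIN {
        for (i = 1; i < ARGC; i++) total += ARGV[i]
        if (total == 0) exit 1
        for (i = 1; i < ARGC; i++)
            printf "%s: %.1f%%\n", ARGV[i], ARGV[i] * 100 / total
    }' "$@"
}

cpu_share_percent 1000 500 500         # three cgroups
cpu_share_percent 1000 500 500 1000    # after a fourth cgroup is added
```

The first call gives 50.0%, 25.0% and 25.0%; the second gives 33.3%, 16.7%, 16.7% and 33.3%. These are worst-case figures: when some cgroups are idle, the busy ones may use more.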

3.3. Using libcgroup tools

Install libcgroup package to manage cgroups:

$ yum install libcgroup libcgroup-tools

List the cgroups:

$ lscgroup
hugetlb:/
cpu,cpuacct:/
cpuset:/
blkio:/
memory:/
freezer:/
net_cls,net_prio:/
pids:/
rdma:/
perf_event:/
devices:/
devices:/system.slice
devices:/system.slice/irqbalance.service
devices:/system.slice/systemd-udevd.service
devices:/system.slice/polkit.service
devices:/system.slice/chronyd.service
devices:/system.slice/auditd.service
devices:/system.slice/tuned.service
devices:/system.slice/systemd-journald.service
devices:/system.slice/sshd.service
devices:/system.slice/crond.service
devices:/system.slice/NetworkManager.service
devices:/system.slice/rsyslog.service
devices:/system.slice/abrtd.service
devices:/system.slice/lvm2-lvmetad.service
devices:/system.slice/postfix.service
devices:/system.slice/dbus.service
devices:/system.slice/system-getty.slice
devices:/system.slice/systemd-logind.service
devices:/system.slice/abrt-oops.service

$ ls /sys/fs/cgroup
blkio  cpuacct      cpuset   freezer  memory   net_cls,net_prio  perf_event  rdma
cpu    cpu,cpuacct  devices  hugetlb  net_cls  net_prio          pids        systemd

Create the cgroup:

$ cgcreate -g cpu:/cpulimited

$ lscgroup | grep cpulimited
cpu,cpuacct:/cpulimited

$ ls cpulimited/
cgroup.clone_children  cpuacct.usage_percpu       cpu.cfs_period_us  cpu.stat
cgroup.procs           cpuacct.usage_percpu_sys   cpu.cfs_quota_us   notify_on_release
cpuacct.stat           cpuacct.usage_percpu_user  cpu.rt_period_us   tasks
cpuacct.usage          cpuacct.usage_sys          cpu.rt_runtime_us
cpuacct.usage_all      cpuacct.usage_user         cpu.shares

Limit CPU utilization by percentage:

$ lscpu | grep ^CPU\(s\):
CPU(s):                96

$ cgset -r cpu.cfs_quota_us=200000 cpulimited

Check the cgroup settings:

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000

$ cgget -g cpu:cpulimited
cpulimited:
cpu.cfs_period_us: 100000
cpu.stat: nr_periods 2
	nr_throttled 0
	throttled_time 0
cpu.shares: 1024
cpu.cfs_quota_us: 200000
cpu.rt_runtime_us: 0
cpu.rt_period_us: 1000000

Delete the cgroup:

$ cgdelete cpu,cpuacct:/cpulimited

3.4. Verify the CPU utilization with fio workload

Create a fio job file:

$ cat burn_cpu.job
[burn_cpu]
# Don't transfer any data, just burn CPU cycles
ioengine=cpuio
# Stress the CPU at 100%
cpuload=100
# Make 4 clones of the job
numjobs=4

Run the fio jobs without CPU limit:

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: -1

$ cgexec -g cpu:cpulimited fio burn_cpu.job

Check the CPU usage:

$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13775 root      20   0 1079912   4016   2404 R 100.0  0.0   0:11.65 fio
13776 root      20   0 1079916   4004   2392 R 100.0  0.0   0:11.65 fio
13777 root      20   0 1079920   4004   2392 R 100.0  0.0   0:11.65 fio
13778 root      20   0 1079924   4004   2392 R 100.0  0.0   0:11.65 fio

The CPU utilization is 400% for the 4 fio jobs when no CPU limit is set. Note that a total of 9600% CPU bandwidth is available on this 96-CPU machine.

Limit the CPU utilization to 200%:

$ cgset -r cpu.cfs_quota_us=200000 cpulimited

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000

Run the fio jobs again:

$ cgexec -g cpu:cpulimited fio burn_cpu.job

Check the CPU usage:

$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12908 root      20   0 1079916   3948   2336 R  50.3  0.0   0:06.91 fio
12909 root      20   0 1079920   3948   2336 R  50.0  0.0   0:06.88 fio
12910 root      20   0 1079924   3948   2336 R  50.0  0.0   0:06.93 fio
12907 root      20   0 1079912   3948   2336 R  49.3  0.0   0:06.86 fio

The CPU utilization is 200% for the 4 fio jobs when the CPU utilization is limited to 200%.

Check which CPU cores the processes are running on:

$ mpstat -P ALL 5 | awk '{if ($3=="CPU" || $NF<99)print;}'

12:40:32 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:40:37 AM  all    2.11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.89
12:40:37 AM    0   20.52    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   79.48
12:40:37 AM    1   50.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.40
12:40:37 AM    2   50.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.90
12:40:37 AM   24   29.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   70.26
12:40:37 AM   87   50.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.80

12:40:37 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:40:42 AM  all    2.11    0.00    0.01    0.00    0.00    0.00    0.00    0.00    0.00   97.88
12:40:42 AM    0   11.49    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   88.31
12:40:42 AM    1   50.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.40
12:40:42 AM    2   50.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.70
12:40:42 AM   24   38.97    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   61.03
12:40:42 AM   87   49.90    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.10

The 4 fio jobs run across 5 CPU cores with a total utilization of about 200%. This shows that the quota caps the total CPU time summed over all cores; it does not restrict the number of cores the jobs may run on.

3.5. Reference

https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/chap-using_libcgroup_tools
https://scoutapm.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups

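4. Exploring cgroup2

While learning Docker earlier, I came across the cgroup mechanism, which allows fine-grained control in Linux over the CPU, memory, and other resources that one or more processes may use. The Meituan tech team's 2015 article "Linux 资源管理之 cgroups 简介" (an introduction to cgroups for Linux resource management) covers the basic concepts and is recommended reading. Recently, when I set out to control memory usage with cgroup in practice, I found that current kernels (version 5.10) have long since moved to cgroup2. Its website is cgroup2, and the reference manual in the Linux source tree is cgroup-v2.rst.

(Looking at the root cgroup on Manjaro shows that it is already cgroup2.)

4.1. Creating a cgroup

Let's try using cgroup2 to limit how much of each system resource our own programs may use. The /sys/fs/cgroup/ directory holds the root cgroup. By default all processes belong to this group, and its cgroup.procs file lists all of their PIDs.

The following commands create a group named cg1 and add the process with PID 2345 to it (a process can belong to only one group at a time):

mkdir /sys/fs/cgroup/cg1
echo 2345 > /sys/fs/cgroup/cg1/cgroup.procs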

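4.2. Configuring resource control

Every cgroup contains these two files:

File name: Description
cgroup.controllers: the resource controllers available to this cgroup. The root cgroup lists all available controllers, while a child cgroup inherits its available controllers from its parent's cgroup.subtree_control file.
cgroup.subtree_control: the resource controllers currently enabled for this cgroup's children. Its contents become the cgroup.controllers of the child cgroups.

The hierarchy can therefore be visualized as follows:

(image source: cgroup2)

We can write to the cgroup.subtree_control file to choose which controllers are enabled: + enables a controller and - disables it. For example:

echo '+cpu +memory -io' > /sys/fs/cgroup/cg1/cgroup.subtree_control

Afterwards, cgroup.subtree_control contains "cpu memory".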

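4.3. Viewing PSI (pressure metrics)

According to the PSI website, PSI (Pressure Stall Information) is a new set of Linux metrics for resource pressure. The /proc/pressure directory contains three files: cpu, io, and memory.

Fields starting with avg give the average over the preceding 10, 60, or 300 seconds, and total is the accumulated stall time in microseconds.

4.3.1. Definition of the metrics

some: the percentage of time in which one or more processes were delayed for lack of a resource. For example, suppose that within a 60-second window Task A runs without stalling while Task B waits 30 seconds for memory. The value of some avg60 is then 50%.
full: the percentage of time in which all processes were delayed at once. Continuing the example, if Task A also waits 10 seconds for the resource during the 30 seconds in which Task B is waiting, A and B are stalled together for 10 seconds, and the value of full avg60 is 10 / 60 * 100% = 16.66%.

Back to the main topic: PSI is introduced here because it also provides an interface for cgroup. Inside a cgroup directory, the files cpu.pressure, io.pressure, and memory.pressure report the PSI of the processes in that group.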
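Each pressure file contains lines of the form "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" (and a matching "full" line). A single value can be pulled out with a few lines of awk. This is only a sketch; psi_value is a helper name made up for this example.

```shell
# psi_value KIND FIELD < pressure-file   (hypothetical helper)
# KIND is "some" or "full"; FIELD is avg10, avg60, avg300 or total.
# Prints the value of FIELD on the line that starts with KIND.
psi_value() {
    awk -v kind="$1" -v field="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == field) print kv[2]
            }
        }'
}

# Examples on a live system:
#   psi_value some avg60 < /proc/pressure/memory
#   psi_value full avg10 < /sys/fs/cgroup/cg1/memory.pressure
```

The same function works for the system-wide files under /proc/pressure and for the per-cgroup *.pressure files, since both use the same line format.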

4.4. Memory limits in practice

Suppose we have the following Rust program:

Cargo.toml:

...
[dependencies]
procfs = "0.11"

main.rs:

use procfs::process::Process;

fn main() {
    let me = Process::myself().unwrap();
    println!("PID: {}", me.pid);

    let page_size = procfs::page_size().unwrap() as u64;
    println!("Memory page size: {}", page_size);

    println!("Total virtual memory used: {} kB", me.stat.vsize / 1024);
    println!("Total resident set: {} pages ({} kB)", me.stat.rss, me.stat.rss as u64 * page_size / 1024);

    let mut bad_vec: Vec<u64> = Vec::new();
    let mut count: u64 = 0;

    loop {
        bad_vec.push(count);
        count += 1;
        if count % 100000 == 0 {
            let me = Process::myself().unwrap();
            println!("Current bad_vec size: {}", bad_vec.len());
            println!("Total virtual memory used: {} kB", me.stat.vsize / 1024);
            println!("Total resident set: {} pages ({} kB)", me.stat.rss, me.stat.rss as u64 * page_size / 1024);
        }
    }
}

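The program keeps pushing elements onto the Vec and printing its memory usage until we stop it manually. How can we keep this runaway program from hogging our memory?

The /sys/fs/cgroup/cg1 directory contains many control files, one of which is memory.max. It defines a hard limit on memory usage: if the processes in the group reach the limit and their usage cannot be reduced, the kernel's OOM (Out Of Memory) killer terminates a process in the group.

First open a terminal. Since processes spawned by a process in a cgroup stay in the same cgroup, we can add the PID of that terminal's shell to cg1. (The number before each prompt below indicates the execution order.)

Terminal 1 (Rust project directory):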

1$ echo $$
2580
3$ cat /proc/self/cgroup
0::/cg1
5$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/cgroup`
PID: 5147
Memory page size: 4096
Total virtual memory used: 4404 kB
Total resident set: 263 pages (1052 kB)
Current bad_vec size: 100000
Total virtual memory used: 5432 kB
Total resident set: 813 pages (3252 kB)
...
Total virtual memory used: 69944 kB
Total resident set: 12728 pages (50912 kB)
zsh: killed     cargo run

Terminal 2 (root, in /sys/fs/cgroup/cg1):

2# echo 2580 > cgroup.procs
4# echo 50M > memory.max

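As we can see, cgroup successfully stopped this evil program!

4.5. Summary

cgroup gives Linux the ability to limit the resources of processes, which is very useful for virtualization. v2 is easier to use than the original version and also provides the PSI interface.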

5. Using and testing cgroup v2

5.1. Setting up a cgroup v2 environment

5.1.1. Checking which cgroup version the kernel is using

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
tmpfs on /usr/local/aegis/cgroup type tmpfs (rw,relatime,size=51200k)
cgroup on /usr/local/aegis/cgroup/cpu type cgroup (rw,relatime,cpu)

If the output contains only cgroup entries, either cgroup v2 has not been mounted yet or the kernel does not support cgroup v2.
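The same check can be scripted by looking at the filesystem type column of the mount output. The sketch below assumes the usual "SOURCE on MOUNTPOINT type FSTYPE (OPTIONS)" layout and mount points without spaces; cgroup_mount_versions is a name invented for this example.

```shell
# cgroup_mount_versions < output-of-mount   (hypothetical helper)
# Reports which cgroup versions show up as mounted filesystem types.
# The filesystem type is the fifth whitespace-separated field.
cgroup_mount_versions() {
    awk '
        $5 == "cgroup"  { v1 = 1 }
        $5 == "cgroup2" { v2 = 1 }
        END {
            if (v1 && v2) print "v1+v2"
            else if (v2)  print "v2"
            else if (v1)  print "v1"
            else          print "none"
        }'
}

# Example on a live system:
#   mount | cgroup_mount_versions
```

For the mount output shown above, which contains both a cgroup2 mount and a cgroup (v1) cpu mount, the function reports v1+v2.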

5.1.2. Checking whether the kernel supports cgroup v2

$ cat /proc/filesystems | grep cgroup
nodev	cgroup
nodev	cgroup2

cgroup2 appears in the output, so the kernel supports cgroup v2 and we can continue.

5.1.3. Mounting cgroup v2

On systems booted with systemd, add the following parameter to GRUB_CMDLINE_LINUX_DEFAULT in the boot configuration file /etc/default/grub to enable v2:

"systemd.unified_cgroup_hierarchy=yes"

$ sudo vim /etc/default/grub
Add: GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=yes"
$ sudo grub-mkconfig -o /boot/grub/grub.cfg
$ reboot

The kernel parameter cgroup_no_v1=all can also be added to keep cgroup v1 from claiming any controllers, which gives a pure cgroup v2 environment.

After the reboot, cgroup v2 is ready to use.

5.2. A first look at cgroup v2

View the directory tree of cgroup2:

$ ls /sys/fs/cgroup
cgroup.controllers      cpuset.mems.effective   memory.idle_page_stats.local
cgroup.max.depth        cpu.stat                memory.numa_stat
cgroup.max.descendants  init.scope              memory.pressure
cgroup.procs            io.cost.model           memory.reap_background
cgroup.stat             io.cost.qos             memory.stat
cgroup.subtree_control  io.pressure             memory.use_priority_oom
cgroup.threads          io.stat                 memory.use_priority_swap
cpu.pressure            memory.fast_copy_mm     system.slice
cpuset.cpus.effective   memory.idle_page_stats  user.slice

Compared with cgroup v1, the v2 directory layout is much more straightforward. If cgroup v1 is a forest, cgroup v2 is a single towering tree.

View the resource types that cgroup2 can manage:

$ cat cgroup.controllers
cpuset cpu io memory hugetlb pids rdma

View the controllers enabled in cgroup2:

$ cat cgroup.subtree_control
cpuset cpu io memory pids

Create a cgroup under the root cgroup:

$ mkdir cgrp2test
$ ls cgrp2test
cgroup.controllers      cpu.weight.nice               memory.pressure
cgroup.events           io.bfq.weight                 memory.priority
cgroup.freeze           io.latency                    memory.reap_background
cgroup.max.depth        io.max                        memory.stat
cgroup.max.descendants  io.pressure                   memory.swap.current
cgroup.procs            io.stat                       memory.swap.events
cgroup.stat             io.weight                     memory.swap.high
cgroup.subtree_control  memory.current                memory.swap.max
cgroup.threads          memory.events                 memory.use_priority_oom
cgroup.type             memory.events.local           memory.use_priority_swap
cpu.max                 memory.fast_copy_mm           memory.wmark_high
cpu.pressure            memory.high                   memory.wmark_low
cpuset.cpus             memory.idle_page_stats        memory.wmark_ratio
cpuset.cpus.effective   memory.idle_page_stats.local  memory.wmark_scale_factor
cpuset.cpus.partition   memory.low                    pids.current
cpuset.mems             memory.max                    pids.events
cpuset.mems.effective   memory.min                    pids.max
cpu.stat                memory.numa_stat
cpu.weight              memory.oom.group

View the controllers available to cgrp2test:

$ cat cgroup.controllers 
cpuset cpu io memory pids

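A child cgroup does not automatically inherit its parent's controllers. Only controllers explicitly enabled in the parent's cgroup.subtree_control can be used in the child cgroup. At this point cgrp2test can limit the cpuset, cpu, io, memory, and pids resources.

Add a CPU limit to the cgroup: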

$ echo 5000 10000 > cpu.max

This means that out of every 10000-microsecond CPU period, 5000 microseconds are allotted to this cgroup, so the processes it manages can use at most 50% of a single CPU core.
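A cpu.max file holds two fields, the quota and the period, and the quota may also be the literal word max when no ceiling is set. The following sketch reads such a file and prints the resulting ceiling; cpu_max_percent is a helper name made up for this example.

```shell
# cpu_max_percent FILE   (hypothetical helper)
# Reads a cgroup v2 cpu.max file ("QUOTA PERIOD", both in microseconds) and
# prints the ceiling as a percentage of one CPU core. A quota of "max"
# means that no ceiling is set.
cpu_max_percent() {
    local quota period
    read -r quota period < "$1" || return 1
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        LC_ALL=C awk -v q="$quota" -v p="$period" \
            'BEGIN { printf "%.1f\n", q * 100 / p }'
    fi
}

# Example on a live system:
#   cpu_max_percent /sys/fs/cgroup/cgrp2test/cpu.max
```

For a file containing "5000 10000" the function prints 50.0, matching the 50% figure above.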

Let's test it:

$ vim while.sh
#!/bin/bash
# Busy loop that keeps one CPU core fully loaded
while :
do
    :
done

$ chmod +x while.sh

$ ./while.sh &
[1] 4139

$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
   4139 root      20   0  226356   3100   1372 R  99.7   0.0   0:36.51 bash    
   4143 root      20   0  227880   5636   3928 R   0.3   0.0   0:00.07 top     
      1 root      20   0  168812  11240   8392 S   0.0   0.0   0:02.03 systemd 
    ...

$ cd /sys/fs/cgroup
$ mkdir cputest
$ echo 5000 10000 > cputest/cpu.max
$ echo 4139 > cputest/cgroup.procs

$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
   4139 root      20   0  223708   2888   1328 R  50.0   0.0  25:27.62 bash    
   3288 root      10 -10  119312  28720  16716 S   0.7   0.0   0:11.19 AliYunD+
   4149 root      20   0  227880   5612   3904 R   0.3   0.0   0:00.04 top     
      1 root      20   0  168700  11208   8440 S   0.0   0.0   0:01.88 systemd

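To limit the resources of its direct child cgroups, cgrp2test must itself enable the relevant controllers. View the controllers that cgrp2test has enabled:

$ cat cgroup.subtree_control

cgrp2test has not enabled any controllers yet, so it cannot limit how resources are distributed among its direct child cgroups.

$ mkdir -p cgrp2test/cg1
$ ls cgrp2test/cg1 # no controller interface files are available; cg1 competes freely for resources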
cgroup.controllers  cgroup.max.descendants  cgroup.threads  io.pressure
cgroup.events       cgroup.procs            cgroup.type     memory.pressure
cgroup.freeze       cgroup.stat             cpu.pressure
cgroup.max.depth    cgroup.subtree_control  cpu.stat

Now enable the cpu and memory controllers:

$ echo "+cpu +memory" > cgrp2test/cgroup.subtree_control
$ ls cgrp2test/cg1 # the controller interface files have appeared
cgroup.controllers      cpu.weight.nice               memory.pressure
cgroup.events           io.pressure                   memory.priority
cgroup.freeze           memory.current                memory.reap_background
cgroup.max.depth        memory.events                 memory.stat
cgroup.max.descendants  memory.events.local           memory.swap.current
cgroup.procs            memory.fast_copy_mm           memory.swap.events
cgroup.stat             memory.high                   memory.swap.high
cgroup.subtree_control  memory.idle_page_stats        memory.swap.max
cgroup.threads          memory.idle_page_stats.local  memory.use_priority_oom
cgroup.type             memory.low                    memory.use_priority_swap
cpu.max                 memory.max                    memory.wmark_high
cpu.pressure            memory.min                    memory.wmark_low
cpu.stat                memory.numa_st

Take the following diagram as an example:

 A(cpu,memory) - B(memory) - C()
                           \ D()

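A has enabled cpu and memory, so A controls how CPU cycles and memory are distributed to B. B has enabled memory but not the cpu controller, so C and D compete freely for CPU cycles, while the division of B's available memory between them can be controlled.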

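5.3. cgroup v2 in detail

5.3.1. top-down constraint

Resources are distributed top-down: a cgroup can distribute a resource further down only if it has received that resource from its parent. This means:

A child can enable a controller only if its parent has enabled it;
In terms of implementation, the cgroup.subtree_control file of any non-root cgroup can contain only controllers that are present in its parent's cgroup.subtree_control;
Conversely, a parent cannot disable a controller as long as any child is still using it.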
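In practice, the controllers a cgroup has received from its parent are the ones listed in its own cgroup.controllers file, and only those can be enabled in its cgroup.subtree_control. A pre-check along these lines can be sketched as follows; controllers_available is a helper name invented for this example.

```shell
# controllers_available DIR CONTROLLER...   (hypothetical helper)
# Succeeds if every CONTROLLER is listed in DIR/cgroup.controllers, that is,
# if it has been handed down by the parent and could be enabled in
# DIR/cgroup.subtree_control.
controllers_available() {
    local dir="$1"
    local c
    shift
    for c in "$@"; do
        # -w matches whole words only, so "cpu" does not match "cpuset".
        if ! grep -qw -- "$c" "$dir/cgroup.controllers"; then
            echo "controller '$c' is not available in $dir" >&2
            return 1
        fi
    done
}

# Example on a live system:
#   controllers_available /sys/fs/cgroup/cgrp2test cpu memory &&
#       echo "+cpu +memory" > /sys/fs/cgroup/cgrp2test/cgroup.subtree_control
```

Checking first gives a readable message instead of the bare write error that the kernel returns when a controller is not available.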

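5.3.2. no internal process

A non-root cgroup can distribute domain resources to its children only when it contains no processes itself. In other words, only domain cgroups that contain no processes can have domain controllers written to their cgroup.subtree_control files.

This guarantees that, from the viewpoint of an enabled domain controller, all processes live in leaf nodes. It avoids competition between processes in a child cgroup and processes in its parent, and makes it easier for the domain controller to walk the hierarchy.

The root cgroup is exempt from this restriction.

For most types of controllers, the root contains processes and anonymous resource consumption that are not associated with any other cgroup and that need special treatment.
How the resource consumption of the root cgroup is managed differs from controller to controller.

Note that the no internal process restriction does not take effect until a controller is enabled in cgroup.subtree_control. This is important because it determines how children of a populated cgroup can be created. To control the resource distribution of a cgroup, the cgroup must create child cgroups and move all of its processes into them before enabling controllers in its cgroup.subtree_control.

Let's test it: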

$ cd /sys/fs/cgroup
$ mkdir -p test test/cg1 test/cg2 test/cg1/cg1_1 test/cg2/cg2_1
# test __ cg1 -- cg1_1
#      \_ cg2 -- cg2_1
$ echo "+memory" > test/cgroup.subtree_control # enable the memory controller for test
$ echo "+memory" > test/cg1/cgroup.subtree_control # enable the memory controller for cg1
# test(*) __ cg1(*) -- cg1_1
#         \_ cg2 -- cg2_1
# (*) marks a cgroup that has enabled the controller
$ sleep 10000 &
[1] 6768

$ echo 6768 > test/cg1/cg1_1/cgroup.procs # succeeds
$ echo 6768 > test/cg1/cgroup.procs # -bash: echo: write error: Device or resource busy
$ echo 6768 > test/cg1/../cgroup.procs # -bash: echo: write error: Device or resource busy
$ echo 6768 > test/cg2/cg2_1/cgroup.procs # succeeds
$ echo 6768 > test/cg2/cgroup.procs # succeeds
$ echo 6768 > cgroup.procs # succeeds