O1 - Learning cgroup

shuimuyq

Article link: https://blog.csdn.net/shuimuyq/article/details/48177725

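Cgroups, short for control groups, is a mechanism provided by the Linux kernel for limiting, accounting for, and isolating the physical resources (such as CPU, memory, and I/O) used by groups of processes.

Cgroups was originally meant to provide a unified framework for resource management, both integrating existing subsystems such as cpuset and offering an interface for developing new subsystems in the future. Today cgroups fits many use cases, from controlling the resources of a single process to implementing operating-system-level virtualization (OS Level Virtualization).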

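By default, cgroup is mounted at /sys/fs/cgroup.

Several subsystems can be mounted under this directory, which is the root of the tree, for example blkio cgmanager cpu cpuacct cpuset devices freezer hugetlb memory net_cls net_prio perf_event systemd

The output of the mount command shows the details: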

none on /sys/fs/cgroup type tmpfs (rw)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-agent.memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-agent.devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls,release_agent=/run/cgmanager/agents/cgm-release-agent.net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agent.blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
xxx on /root/dev type cgroup (rw,devices)
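
The same information can be pulled out with a small script. The following is only a sketch: it reads /proc/mounts (which uses a `device mountpoint fstype options ...` layout rather than the `X on Y type T (opts)` layout printed by mount), and the function name `list_cgroup_mounts` is my own invention.

```shell
# List cgroup mount points together with their mount options.
# Usage: list_cgroup_mounts [mounts-file]   (defaults to /proc/mounts)
list_cgroup_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields in /proc/mounts: device mountpoint fstype options dump pass
    awk '$3 == "cgroup" { print $2, $4 }' "$mounts_file"
}
```

Calling `list_cgroup_mounts` without arguments prints one line per mounted hierarchy, with the subsystem names appearing among the options.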

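## Introduction to cgroups subsystems
- blkio -- sets input/output limits on block devices, such as physical devices (disks, SSDs, USB, and so on).
- cpu -- uses the scheduler to give cgroup tasks access to the CPU.
- cpuacct -- automatically generates reports on the CPU used by the tasks in a cgroup.
- cpuset -- assigns individual CPUs (on multi-core systems) and memory nodes to the tasks in a cgroup.
- devices -- allows or denies access to devices by the tasks in a cgroup.
- freezer -- suspends or resumes the tasks in a cgroup.
- memory -- sets limits on the memory used by the tasks in a cgroup and automatically generates reports on the memory resources used by those tasks.
- net_cls -- tags network packets with a class identifier (classid) so that the Linux traffic controller (tc) can identify packets originating from a particular cgroup.
- ns -- the namespace subsystem.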

## Cgroup subsystem configuration

#### blkio - block I/O limits

- blkio.reset_stats - resets the statistics; write an integer to this file
- blkio.time - reports how long the cgroup had I/O access to each device - device_types:node_numbers milliseconds
- blkio.sectors - reports the number of sectors the cgroup transferred to or from each device - device_types:node_numbers sector_count
- blkio.avg_queue_size - reports the average I/O queue size (requires CONFIG_DEBUG_BLK_CGROUP=y)
- blkio.group_wait_time - reports the total time the cgroup spent waiting (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
- blkio.empty_time - reports the total time the cgroup spent without any pending I/O (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
- blkio.idle_time - reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups.
- blkio.dequeue - reports how many times the cgroup's I/O requests were dequeued by a device (requires CONFIG_DEBUG_BLK_CGROUP=y) - device_types:node_numbers number
- blkio.io_serviced - reports the number of I/O operations (read, write, sync, or async) the cgroup performed on specific devices, as seen by the CFQ scheduler - device_types:node_numbers operation number
- blkio.io_service_bytes - reports the amount of data transferred by the cgroup's I/O operations (read, write, sync, or async) on specific devices, as seen by the CFQ scheduler - device_types:node_numbers operation bytes
- blkio.io_service_time - reports the time taken by the cgroup's I/O operations (read, write, sync, or async) on specific devices, as seen by the CFQ scheduler (in ns) - device_types:node_numbers operation time
- blkio.io_wait_time - reports the time the cgroup's operations (read, write, sync, or async) on specific devices spent waiting (in ns) - device_types:node_numbers operation time
- blkio.io_merged - reports the number of BIOs merged into I/O requests for the cgroup, per operation (read, write, sync, or async) - number operation
- blkio.io_queued - reports the number of I/O requests queued for the cgroup, per operation (read, write, sync, or async) - number operation

Proportional weight division policy - distributes block I/O resources proportionally
- blkio.weight - relative weight in the range 100-1000; overridden for specific devices by blkio.weight_device
- blkio.weight_device - weight for a specific device - device_types:node_numbers weight

I/O throttling (upper limit) policy - sets an upper limit on I/O operations

Upper limit on data read/written per second
- blkio.throttle.read_bps_device - device_types:node_numbers bytes_per_second
- blkio.throttle.write_bps_device - device_types:node_numbers bytes_per_second

Upper limit on read/write operations per second
- blkio.throttle.read_iops_device - device_types:node_numbers operations_per_second
- blkio.throttle.write_iops_device - device_types:node_numbers operations_per_second

Per-operation (read, write, sync, or async) counters as seen by the throttling policy; these files are reports, not limits
- blkio.throttle.io_serviced - device_types:node_numbers operation number
- blkio.throttle.io_service_bytes - device_types:node_numbers operation bytes
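
To make the `device_types:node_numbers bytes_per_second` format concrete, here is a minimal sketch that builds such a line from a limit given in MiB per second. The helper name `bps_rule` and the device numbers used in the usage note are assumptions for illustration only.

```shell
# Build a "major:minor bytes_per_second" line for blkio.throttle.*_bps_device.
# $1 = device as major:minor, $2 = limit in MiB per second.
bps_rule() {
    echo "$1 $(( $2 * 1024 * 1024 ))"
}
```

For example, `bps_rule 8:0 10 > blkio.throttle.read_bps_device`, run inside a blkio control group directory, would cap reads from device 8:0 at roughly 10 MiB/s. Check the real major:minor numbers of the disk with `ls -l /dev` first.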

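#### cpu - CPU time limits
- CFS (Completely Fair Scheduler) policy - upper limit on CPU resources
   - cpu.cfs_period_us, cpu.cfs_quota_us - required - used together: the former sets the length of a period (microseconds) and the latter sets how much CPU time (microseconds) the cgroup may use in each period, which caps the tasks' CPU usage relative to a single CPU (setting cfs_quota_us to twice cfs_period_us lets the cgroup fully use two cores).
   - cpu.stat - CPU statistics, containing nr_periods (how many cfs_period_us periods have elapsed), nr_throttled (how many times the tasks in the cgroup were throttled), and throttled_time (how many nanoseconds the tasks in the cgroup were throttled)
   - cpu.shares - optional - relative weight for CPU time sharing
- RT (Real-Time scheduler) policy - CPU time for real-time tasks
   - cpu.rt_period_us, cpu.rt_runtime_us - used together: in every cpu.rt_period_us (microseconds), the real-time tasks in the cgroup may run for up to cpu.rt_runtime_us (microseconds)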
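
The relationship between the period and the quota is plain arithmetic: quota = period × (percentage of one CPU) / 100. A small sketch, where the function name `cfs_quota_us` is my own:

```shell
# Print the cpu.cfs_quota_us value that caps a cgroup at a given CPU percentage.
# $1 = cpu.cfs_period_us in microseconds
# $2 = percentage of one CPU (50 = half a CPU, 200 = two full CPUs)
cfs_quota_us() {
    echo $(( $1 * $2 / 100 ))
}
```

With a period of 100000 microseconds, `cfs_quota_us 100000 200` gives the "twice the period" value mentioned above for two cores, and `cfs_quota_us 100000 50 > cpu.cfs_quota_us` would limit a group to half a CPU.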

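#### cpuacct - CPU usage reports
- cpuacct.usage - total CPU time (nanoseconds) used by all tasks in the cgroup
- cpuacct.stat - CPU time used by all tasks in the cgroup, split into user mode and kernel mode
- cpuacct.usage_percpu - CPU time used by all tasks in the cgroup on each CPU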

#### cpuset - CPU binding
- cpuset.cpus - required - the CPUs the cgroup may use; for example 0-2,16 means the four CPUs 0, 1, 2, and 16
- cpuset.mems - required - the memory nodes the cgroup may use
- cpuset.memory_migrate - optional - whether pages are migrated when cpuset.mems changes, default 0
- cpuset.cpu_exclusive - optional - whether the CPUs are used exclusively, default 0
- cpuset.mem_exclusive - optional - whether the memory nodes are used exclusively, default 0
- cpuset.mem_hardwall - optional - whether memory allocations of the tasks in the cgroup are isolated, default 0
- cpuset.memory_pressure - optional - a read-only file that contains a running average of the memory pressure created by the processes in this cpuset
- cpuset.memory_pressure_enabled - optional - switch for cpuset.memory_pressure, default 0
- cpuset.memory_spread_page - optional - contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset, default 0
- cpuset.memory_spread_slab - optional - contains a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset, default 0
- cpuset.sched_load_balance - optional - whether the cgroup's CPU load is balanced across the CPUs in the cpuset, default 1
- cpuset.sched_relax_domain_level - optional - the load-balancing strategy used with cpuset.sched_load_balance
   - -1 = Use the system default value for load balancing
   - 0 = Do not perform immediate load balancing; balance loads only periodically
   - 1 = Immediately balance loads across threads on the same core
   - 2 = Immediately balance loads across cores in the same package
   - 3 = Immediately balance loads across CPUs on the same node or blade
   - 4 = Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
   - 5 = Immediately balance loads across all CPUs on architectures with NUMA
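
The list notation used by cpuset.cpus (comma-separated numbers and ranges) can be expanded mechanically. The sketch below shows one way to do it in plain shell; `expand_cpu_list` is a hypothetical helper, not part of any cgroup tooling.

```shell
# Expand a cpuset list such as "0-2,16" into individual CPU numbers.
# The function body runs in a subshell so the IFS change does not leak out.
expand_cpu_list() (
    IFS=,
    result=""
    for part in $1; do
        case "$part" in
            *-*)
                # A range "first-last": emit every number in it.
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    result="$result $i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single CPU number.
                result="$result $part"
                ;;
        esac
    done
    echo "${result# }"
)
```

Running `expand_cpu_list 0-2,16` prints `0 1 2 16`, matching the four CPUs in the cpuset.cpus example above.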

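#### devices - device allowlist/denylist for a cgroup
- devices.allow - list of allowed devices
- devices.deny - list of denied devices
- syntax - type device_types:node_numbers access
   - type - b (block device), c (character device), a (all devices)
   - access - r (read), w (write), m (create)
- devices.list - reports the devices currently allowed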
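#### freezer - suspends/resumes the tasks in a cgroup (not available in the root cgroup)
- freezer.state
   - FROZEN - the tasks are suspended
   - FREEZING - the tasks are in the process of being suspended
   - THAWED - the tasks have been resumed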

#### memory - memory limits
- memory.usage_in_bytes - reports the memory currently used by the cgroup, in bytes
- memory.memsw.usage_in_bytes - reports the memory plus swap space currently used by the processes in the cgroup
- memory.max_usage_in_bytes - reports the maximum memory used by the cgroup
- memory.memsw.max_usage_in_bytes - reports the maximum memory plus swap used
- memory.limit_in_bytes - the maximum memory limit for the cgroup; accepts the suffixes k, m, g; -1 removes the limit
- memory.memsw.limit_in_bytes - the maximum memory plus swap limit; accepts the suffixes k, m, g; -1 removes the limit
- memory.failcnt - reports the number of times the memory limit was reached
- memory.memsw.failcnt - reports the number of times the memory plus swap limit was reached
- memory.force_empty - when set to 0 and the cgroup has no tasks, frees the cgroup's memory pages
- memory.swappiness - swap policy; 60 is the baseline, values below 60 make swapping out less likely and values above 60 make it more likely
- memory.use_hierarchy - whether child groups are affected
- memory.oom_control - 0 means enabled: a process is killed when an OOM occurs

- memory.stat - reports memory statistics for the cgroup
   - cache - page cache, including tmpfs (shmem), in bytes
   - rss - anonymous and swap cache, not including tmpfs (shmem), in bytes
   - mapped_file - size of memory-mapped files, including tmpfs (shmem), in bytes
   - pgpgin - number of pages paged into memory
   - pgpgout - number of pages paged out of memory
   - swap - swap usage, in bytes
   - active_anon - anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
   - inactive_anon - anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
   - active_file - file-backed memory on active LRU list, in bytes
   - inactive_file - file-backed memory on inactive LRU list, in bytes
   - unevictable - memory that cannot be reclaimed, in bytes
   - hierarchical_memory_limit - memory limit for the hierarchy that contains the memory cgroup, in bytes
   - hierarchical_memsw_limit - memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes
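
Since memory.stat is a plain list of `name value` lines, a single counter is easy to extract. A minimal sketch, with `memory_stat_field` being a name I made up:

```shell
# Print one counter from a memory.stat file.
# $1 = path to memory.stat, $2 = field name (for example rss or cache).
# Returns a non-zero status if the field is not present.
memory_stat_field() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}
```

Inside a memory control group directory, `memory_stat_field memory.stat rss` would print the rss value in bytes.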

#### net_cls
- net_cls.classid - specifies the tc handle; network control is then done through tc

#### net_prio - sets the network interface priority for tasks
- net_prio.prioidx - a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.

- net_prio.ifpriomap - priority of traffic on each network interface - <network_interface> <priority>

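#### Other
- tasks - the IDs of all tasks (threads) in this cgroup
- cgroup.event_control - event API
- cgroup.procs - the thread group IDs (process IDs) in this cgroup
- release_agent (present in the root cgroup only) - the script that is run when a cgroup no longer contains any tasks, depending on notify_on_release
- notify_on_release - whether release_agent is run when the cgroup no longer contains any tasks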

These settings can be read and changed with commands such as cgget and cgset, or simply with cat and echo.

cgget -n -g devices dev-group-path

cgset -r devices.allow="c 1:5 rwm" dev-group-path

Example:

cd /sys/fs/cgroup/
mkdir devices
mount -t cgroup -o devices device devices/
cd devices/
mkdir group1
cd group1/

root@test:/sys/fs/cgroup/devices/group1# echo $$ > tasks
root@test:/sys/fs/cgroup/devices/group1# cat tasks
15133
15156
root@test:/sys/fs/cgroup/devices/group1# cat devices.list
a *:* rwm
root@test:/sys/fs/cgroup/devices/group1# echo a > devices.deny
root@test:/sys/fs/cgroup/devices/group1# cat devices.list
root@test:/sys/fs/cgroup/devices/group1# echo "c 1:3 rw" > devices.allow
root@test:/sys/fs/cgroup/devices/group1# cat devices.list
c 1:3 rw
root@test:/sys/fs/cgroup/devices/group1# dd if=/dev/zero of=/dev/null bs=1M count=2
dd: failed to open ‘/dev/zero’: Operation not permitted
root@test:/sys/fs/cgroup/devices/group1# echo "c 1:5 rw" > devices.allow
root@test:/sys/fs/cgroup/devices/group1# cat devices.list
c 1:3 rw
c 1:5 rw
root@test:/sys/fs/cgroup/devices/group1# dd if=/dev/zero of=/dev/null bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.00153738 s, 1.4 GB/s
root@test:/sys/fs/cgroup/devices/group1# echo "c 1:3 rw" > devices.deny
root@test:/sys/fs/cgroup/devices/group1# cat devices.list
c 1:5 rw
root@test:/sys/fs/cgroup/devices/group1# dd if=/dev/zero of=/dev/null bs=1M count=2
dd: failed to open ‘/dev/null’: Operation not permitted

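Under each subsystem, control groups can be created and removed with mkdir and rmdir.

A subsystem corresponds to one kind of computer resource, such as CPU, memory, or devices.

These resources are managed through the subsystems. Every group directory under a subsystem contains a tasks file, and every process whose PID is listed in that file is controlled by the group.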
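
The tasks file can be wrapped in two small helpers. This is only a sketch under the assumption that the group directory already exists; the names `cgroup_attach` and `cgroup_contains` are hypothetical.

```shell
# Move a process into a control group by writing its PID to the tasks file.
# $1 = control group directory, $2 = PID (defaults to the current shell).
cgroup_attach() {
    group_dir=$1
    pid=${2:-$$}
    [ -d "$group_dir" ] || { echo "no such group: $group_dir" >&2; return 1; }
    echo "$pid" > "$group_dir/tasks"
}

# Succeed if the given PID is listed in the group's tasks file.
# $1 = control group directory, $2 = PID.
cgroup_contains() {
    grep -qx "$2" "$1/tasks"
}
```

For instance, `cgroup_attach /sys/fs/cgroup/devices/group1` does the same thing as the `echo $$ > tasks` step in the example above, and `cgroup_contains /sys/fs/cgroup/devices/group1 $$` checks the result.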

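A subsystem can be mounted at several directories at the same time, but the contents under each mount of the same subsystem are exactly identical. In other words, no matter how many places a subsystem is mounted, the underlying resources are still the same.

The groups under each subsystem form a tree. A child node inherits some attributes from its parent, and a child can only be restricted more strictly than its parent.

For example, if a device is not in devices.list at the root of the devices subsystem, no setting on a child node can make it accessible.

[Figure: the cgroup tree structure as I understand it]

For the application layer or for a system administrator, this only covers how to use cgroup.

How cgroup is actually implemented inside the kernel still needs further study.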
