我们从进程出发来剖析cgroups相关数据结构之间的关系
在Linux中,管理进程的数据结构是task_struct,其中与cgroups有关的:
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock */
struct css_set __rcu *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list;
#endif
其中cgroups指针指向了一个css_set结构,而css_set存储了与进程相关的cgroups信息。cg_list是一个嵌入的list_head结构,用于将练到同一个css_set的进程组织成一个链表。下面我们来看css_set的结构:
struct css_set {
/* Reference count */
atomic_t refcount;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
*/
struct hlist_node hlist;
/*
* List running through all tasks using this cgroup
* group. Protected by css_set_lock
*/
struct list_head tasks;
/*
* List of cg_cgroup_link objects on link chains from
* cgroups referenced from this css_set. Protected by
* css_set_lock
*/
struct list_head cg_links;
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time) and modular subsystem
* loading/unloading.
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
/* For RCU-protected deletion */
struct rcu_head rcu_head;
};
其中refcount是该css_set的引用数,因为一个css_set可以被多个进程公用,只要这些进程的cgroups信息相同,比如:在所有已创建的层级里面都在同一cgroup里的进程。
hlist是嵌入的hlist_node,用于把所有css_set组织成一个hash表,这样内核可以快速查找特定的css_set。
tasks指向所有练到此css_set的进程练成的链表。
cg_links指向一个由struct cg_cgroup_link连成的链表
subsys是一个指针数组,存储一组指向cgroup_subsys_state的指针。一个cgroup_subsys_state就是进程与一个特定子系统相关的信息。通过这个指针数组,进程就可以获得相应的cgroups控制信息了。
下面看下cgroup_subsys_state的结构:
struct cgroup_subsys_state {
/*
* The cgroup that this subsystem is attached to. Useful
* for subsystems that want to know about the cgroup
* hierarchy structure
*/
struct cgroup *cgroup;
/*
* State maintained by the cgroup system to allow subsystems
* to be "busy". Should be accessed via css_get(),
* css_tryget() and and css_put().
*/
atomic_t refcnt;
unsigned long flags;
/* ID for this css, if possible */
struct css_id __rcu *id;
};
cgroup指针指向了一个cgroup结构,也就是进程属于的cgroup。进程受到子系统的控制,实际上是通过加入到特定的cgroup实现的,因为cgroup在特定的层级上,而子系统又是附加到层级上的。通过以上三个结构,进程就可以和cgroup连接起来了:task_struct->css_set->cgroup_subsys_state->cgroup。
下面看下cgroup的结构:
struct cgroup {
unsigned long flags; /* "unsigned long" so bitops work */
/*
* count users of this cgroup. >0 means busy, but doesn't
* necessarily indicate the number of tasks in the cgroup
*/
atomic_t count;
/*
* We link our 'sibling' struct into our parent's 'children'.
* Our children link their 'sibling' into our 'children'.
*/
struct list_head sibling; /* my parent's children */
struct list_head children; /* my children */
struct cgroup *parent; /* my parent */
struct dentry __rcu *dentry; /* cgroup fs entry, RCU protected */
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
struct cgroupfs_root *root;
struct cgroup *top_cgroup;
/*
* List of cg_cgroup_links pointing at css_sets with
* tasks in this cgroup. Protected by css_set_lock
*/
struct list_head css_sets;
/*
* Linked list running through all cgroups that can
* potentially be reaped by the release agent. Protected by
* release_list_lock
*/
struct list_head release_list;
/*
* list of pidlists, up to two for each namespace (one for procs, one
* for tasks); created on demand.
*/
struct list_head pidlists;
struct mutex pidlist_mutex;
/* For RCU-protected deletion */
struct rcu_head rcu_head;
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
};
sibling,children和parent三个嵌入的list_head负责将同一层级的cgroup连接成一棵cgroup树。
subsys是一个指针数组,存储一组指向cgroup_subsys_state的指针。这指针指向了此cgroup跟各个子系统相关的信息,这个跟css_set中的道理是一样的。
root指向了一个cgroupfs_root的结构,就是cgroup所在的层级对应的结构体。这样一来,之前的几个cgroups概念就全部联系起来了。
top_cgroup指向了所在层级的根cgroup,也就是创建层级时自动创建的那个cgroup
css_set指向一个由struct cg_cgroup_link连成的链表,跟css_set中的cg_links一样。
下面分析一下css_set和cgroup之间的关系。先看下cg_cgroup_linnk的结构:
struct cg_cgroup_link {
/*
* List running through cg_cgroup_links associated with a
* cgroup, anchored on cgroup->css_sets
*/
struct list_head cgrp_link_list;
struct cgroup *cgrp;
/*
* List running through cg_cgroup_links pointing at a
* single css_set object, anchored on css_set->cg_links
*/
struct list_head cg_link_list;
struct css_set *cg;
};
cgrp_link_list连入到cgroup->css_set指向的链表,cgrp则指向此cg_cgroup_link相关的cgroup
cg_link_list则连入到css_set->cg_links指向的链表,cg则指向此cg_cgroup_link相关的css_set。
设计原理:
因为cgroup和css_set是一个多对多的关系,必须添加一个中间结构来将两者联系起来,这跟数据库模式设计是一个道理。cg_cgroup_link中的cgrp和cg就是此结构体的联合主键,而cgrp_link_list和cg_link_list分别连入到cgroup和css_set相应的链表,使得能从cgroup或css_set都可以进行遍历查询。
一个进程对应css_set,一个css_set就存储了一组进程跟各个子系统相关的信息,但是这些信息由可能不是从一个cgroup那里获得的,因为一个进程可以同时属于几个cgroup,只要这些cgroup不在同一个层级。举个例子:我们创建一个层级A,A上面附加了cpu和memory两个子系统,进程B属于A的根cgroup;然后我们再创建一个层级C,C上面附加了ns和blkio两个子系统,进程B同样属于C的根cgroup;那么进程B对应的cpu和memory的信息是从A的跟cgroups获得的,ns和blkio信息则是从C的跟cgroup获得的。因此,一个css_set存储的cgroup_subsys_state可以对应多个cgroup。另一方面,cgroup也存储了一组cgroup_subsys_state,这一组cgroup_subsys_state则是cgroup从所在的层级附加的子系统获得的。一个cgroup钟可以有多个进程,而这些进程的css_set不一定都相同,因为有些进程可能还加入了其他的cgroup。但是同一个cgroup中的进程与该cgroup关联的cgroup_subsys_state都受到该cgroup的管理,所以一个cgroup也可以对应多个css_set。
从前面的分析,我们可以看出从task到cgroup时很容易定位的,但是从cgroup获取此cgroup的所有的task就必须通过这个结构了。每个进程都会指向一个css_set,而与这个css_set关联的所有进程都会链入到css->tasks链表。而cgroup又通过一个中间结构cg_cgroup_link来寻找所有与之关联的css_set,从而可以得到与cgroup关联的所有进程。
最后看下层级和子系统对应的结构体,层级对应的结构体是cgroupfs_root:
struct cgroupfs_root {
struct super_block *sb;
/*
* The bitmask of subsystems intended to be attached to this
* hierarchy
*/
unsigned long subsys_bits;
/* Unique id for this hierarchy. */
int hierarchy_id;
/* The bitmask of subsystems currently attached to this hierarchy */
unsigned long actual_subsys_bits;
/* A list running through the attached subsystems */
struct list_head subsys_list;
/* The root cgroup for this hierarchy */
struct cgroup top_cgroup;
/* Tracks how many cgroups are currently defined in hierarchy.*/
int number_of_cgroups;
/* A list running through the active hierarchies */
struct list_head root_list;
/* Hierarchy-specific flags */
unsigned long flags;
/* The path to use for release notifications. */
char release_agent_path[PATH_MAX];
/* The name for this hierarchy - may be empty */
char name[MAX_CGROUP_ROOT_NAMELEN];
};
sb指向该层级关联的文件系统超级块
subsys_bits和actual_subsys_bits分别指向将要附加到层级的子系统和现在实际附加到层级的子系统,在子系统附加到层级时使用
hierarchy_id是该层级唯一的id
top_cgroup指向该层级的根cgroup
number_of_cgroups记录该层级cgroup的个数
root_list是一个嵌入的list_head,用于将系统所有的层级连成链表
子系统对应的结构体是cgroup_subsys:
struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
int (*pre_destroy)(struct cgroup *cgrp);
void (*destroy)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*fork)(struct task_struct *task);
void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *task);
int (*populate)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*post_clone)(struct cgroup *cgrp);
void (*bind)(struct cgroup *root);
int subsys_id;
int active;
int disabled;
int early_init;
/*
* True if this subsys uses ID. ID is not available before cgroup_init()
* (not available in early_init time.)
*/
bool use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
/*
* Protects sibling/children links of cgroups in this
* hierarchy, plus protects which hierarchy (or none) the
* subsystem is a part of (i.e. root/sibling). To avoid
* potential deadlocks, the following operations should not be
* undertaken while holding any hierarchy_mutex:
*
* - allocating memory
* - initiating hotplug events
*/
struct mutex hierarchy_mutex;
struct lock_class_key subsys_key;
/*
* Link to parent, and list entry in parent's children.
* Protected by this->hierarchy_mutex and cgroup_lock()
*/
struct cgroupfs_root *root;
struct list_head sibling;
/* used when use_id == true */
struct idr idr;
spinlock_t id_lock;
/* should be defined only by modular subsystems */
struct module *module;
};
cgroup_subsys定义了一组操作,让各个子系统根据各自的需要去实现。这个相当于C++中抽象基类,然后各个特定的子系统对应cgroup_subsys则是实现了相应操作的子类。类似的思想还被用在了cgroup_subsys_state中,cgroup_subsys_state并未定义控制信息,而只是定义了各个子系统都需要的共同信息,比如该cgroup_subsys_state从属的cgroup。然后各个子系统再根据各自的需要去定义自己的进程控制信息结构体,最后在各自的结构体中将cgroup_subsys_state包含进去,这样通过Linux内核的container_of等宏就可以通过cgroup_subsys_state来获取相应的结构体。