Linux Kernel Process Management: CPU Topology and Scheduling Domains/Groups

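CPU core counts have grown from single-core to dual-core, then to 4, 8, and even 10 cores. The multi-core SoCs used by Android devices, however, are typically split into big and little cores, and the newest designs add an extra-large ("prime") core on top of those.

Big and little cores are distinguished because they differ in performance (compute capacity) and power consumption. They are also grouped into clusters (little cores in one cluster, big cores in another), and at present all CPUs within the same cluster have their frequency scaled together.

Task scheduling therefore has to distinguish between them as well, in order to balance performance and power.

To that end, the kernel builds scheduling domains and scheduling groups that reflect the CPU topology. Taking an 8-core CPU as an example: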

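At the DIE level: cpu 0-7 are all in one group.

At the MC level: cpu 0-3 form one group and cpu 4-7 form another.

*With SMT (simultaneous multithreading, i.e. hyper-threading) there would be one more split below the MC level: 01, 23, 45, 67. (This can be ignored for now, because the ARM platform discussed here does not support SMT.)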
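As a rough illustration of the two levels above, the sketch below models the span of each level as a bitmask over the 8 CPUs. The names `cluster_of`, `mc_span` and `die_span` are hypothetical and exist only for this example; the kernel itself works with `struct cpumask` rather than plain integers.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* Assumed layout of the 8-core example: cpu0-3 in cluster 0, cpu4-7 in cluster 1. */
static int cluster_of(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpu / 4;
}

/* MC level: every CPU that shares a cluster with `cpu`. */
static uint8_t mc_span(int cpu)
{
    uint8_t mask = 0;

    for (int i = 0; i < NR_CPUS; i++)
        if (cluster_of(i) == cluster_of(cpu))
            mask |= (uint8_t)(1u << i);
    return mask;
}

/* DIE level: every CPU in the system. */
static uint8_t die_span(void)
{
    return (uint8_t)((1u << NR_CPUS) - 1);
}
```

With this layout, `mc_span(0)` covers bits 0-3, `mc_span(6)` covers bits 4-7, and `die_span()` covers all eight bits.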

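Building the CPU Topology

The kernel contains CPU topology code that builds this structure. The structure itself is defined in the DTS file and differs from platform to platform. The relevant DTS for my current MTK platform is shown below (I am not using a qcom platform here because my company currently seems to have only MTK platforms, so some details may differ slightly):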

cpu0: cpu@000 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x000>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu1: cpu@001 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x001>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu2: cpu@002 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x002>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu3: cpu@003 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x003>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu4: cpu@100 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x100>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu5: cpu@101 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x101>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu6: cpu@102 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x102>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu7: cpu@103 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x103>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu-map {
            cluster0 {
                core0 {
                    cpu = <&cpu0>;
                };
                core1 {
                    cpu = <&cpu1>;
                };
                core2 {
                    cpu = <&cpu2>;
                };
                core3 {
                    cpu = <&cpu3>;
                };
            };

            cluster1 {
                core0 {
                    cpu = <&cpu4>;
                };
                core1 {
                    cpu = <&cpu5>;
                };
                core2 {
                    cpu = <&cpu6>;
                };
                core3 {
                    cpu = <&cpu7>;
                };
            };
        };

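Code paths: drivers/base/arch_topology.c and arch/arm64/kernel/topology.c. The code in this article is taken from the CAF kernel msm-5.4.

Part 1: parse the DTS and store each CPU's package_id and core_id in cpu_topology, together with its cpu_scale (cpu_capacity_orig).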

kernel_init()
    -> kernel_init_freeable()
        -> smp_prepare_cpus()
            -> init_cpu_topology()
                -> parse_dt_topology()

In the DTS, the "cpus" node is looked up first, then the "cpu-map" node inside it.

The cluster nodes under cpu-map are parsed first.

Then the CPU capacities are normalized.

static int __init parse_dt_topology(void)
{
    struct device_node *cn, *map;
    int ret = 0;
    int cpu;

    cn = of_find_node_by_path("/cpus");    // look up the /cpus node in the DTS
    if (!cn) {
        pr_err("No CPU information found in DT\n");
        return 0;
    }

    /*
     * When topology is provided cpu-map is essentially a root
     * cluster with restricted subnodes.
     */
    map = of_get_child_by_name(cn, "cpu-map");    // look up the cpu-map node under /cpus
    if (!map)
        goto out;

    ret = parse_cluster(map, 0);    // (1) parse the cluster structure
    if (ret != 0)
        goto out_map;

    topology_normalize_cpu_scale();    // (2) normalize the CPU capacities

    /*
     * Check that all cores are in the topology; the SMP code will
     * only mark cores described in the DT as possible.
     */
    for_each_possible_cpu(cpu)
        if (cpu_topology[cpu].package_id == -1)
            ret = -EINVAL;

out_map:
    of_node_put(map);
out:
    of_node_put(cn);
    return ret;
}

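(1) Parsing the cluster structure

The first do-while loop parses the "cluster<N>" nodes; on this platform that means cluster0 and cluster1. For each one the function calls itself recursively, reusing the same code to parse the "core" nodes inside the cluster.

At the core level, the second do-while loop parses the "core<N>" nodes in the same way. On this platform each cluster contains core0 through core3 (8 cores in total), and each core node is handed to parse_core for further parsing.

So the actual parse order is: cluster0 with core0, 1, 2, 3 (cpu0-3), then cluster1 with core0, 1, 2, 3 (cpu4-7). Note that core_id is a local variable initialized to 0 on every call, so it restarts at 0 in each cluster.

Finally, once every core in a cluster has been parsed and the do-while loop exits, package_id is incremented. In other words, package_id corresponds to the cluster id.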

static int __init parse_cluster(struct device_node *cluster, int depth)
{
    char name[20];
    bool leaf = true;
    bool has_cores = false;
    struct device_node *c;
    static int package_id __initdata;
    int core_id = 0;
    int i, ret;

    /*
     * First check for child clusters; we currently ignore any
     * information about the nesting of clusters and present the
     * scheduler with a flat list of them.
     */
    i = 0;
    do {
        snprintf(name, sizeof(name), "cluster%d", i);    // try cluster0, cluster1, ...; this platform only has cluster0/1
        c = of_get_child_by_name(cluster, name);  // check whether this node has a cluster child
        if (c) {
            leaf = false;
            ret = parse_cluster(c, depth + 1);     // a cluster child exists: recurse to parse the core nodes inside it (the same function is reused)
            of_node_put(c);
            if (ret != 0)
                return ret;
        }
        i++;
    } while (c);

    /* Now check for cores */
    i = 0;
    do {
        snprintf(name, sizeof(name), "core%d", i);    // try core0, core1, ...; this platform has core0-3 in each cluster (8 cores in total)
        c = of_get_child_by_name(cluster, name);    // check whether the cluster has a core child
        if (c) {
            has_cores = true;

            if (depth == 0) {                                            // note: cores are only valid in a call made with depth + 1 above
                pr_err("%pOF: cpu-map children should be clusters\n",    // a core found directly under cpu-map (depth == 0) is an error
                       c);
                of_node_put(c);
                return -EINVAL;
            }

            if (leaf) {                                            // at depth + 1 with leaf == true, this is the core level
                ret = parse_core(c, package_id, core_id++);     // (1-1) parse the core structure
            } else {
                pr_err("%pOF: Non-leaf cluster with core %s\n",
                       cluster, name);
                ret = -EINVAL;
            }

            of_node_put(c);
            if (ret != 0)
                return ret;
        }
        i++;
    } while (c);

    if (leaf && !has_cores)
        pr_warn("%pOF: empty cluster\n", cluster);

    if (leaf)            // the core level has been walked: one cluster is done and the next one follows, so package_id must advance
        package_id++;    // hence package_id corresponds to the cluster id

    return 0;
}
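The id assignment performed by parse_cluster can be modelled in user space as a pair of nested loops. This is only a sketch under stated assumptions: `struct topo` and `assign_ids` are hypothetical names, the 2 x 4 layout matches the DTS above, and the cpu-map is assumed to list CPUs in logical order. The point it illustrates is that package_id survives across clusters (it is static in the kernel) while core_id restarts at 0 in every cluster.

```c
#include <assert.h>

#define NR_CLUSTERS       2
#define CORES_PER_CLUSTER 4
#define NR_CPUS           (NR_CLUSTERS * CORES_PER_CLUSTER)

/* Hypothetical stand-in for the two cpu_topology fields of interest. */
struct topo {
    int package_id;
    int core_id;
};

/* Fill topo[] the way the nested cluster/core walk hands out ids. */
static void assign_ids(struct topo topo[NR_CPUS])
{
    int package_id = 0;  /* static in the kernel, so it persists across clusters */
    int cpu = 0;         /* assumes cpu-map lists CPUs in logical order */

    for (int cl = 0; cl < NR_CLUSTERS; cl++) {
        int core_id = 0; /* a fresh local for every leaf cluster */

        for (int c = 0; c < CORES_PER_CLUSTER; c++, cpu++) {
            assert(cpu < NR_CPUS);
            topo[cpu].package_id = package_id;
            topo[cpu].core_id = core_id++;
        }
        package_id++;    /* leaf cluster finished */
    }
}
```

After calling `assign_ids`, cpu5 ends up with package_id 1 and core_id 1, and cpu7 with package_id 1 and core_id 3.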

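(1-1) Parsing the core structure

Because this platform does not support SMT, there are no "thread<N>" nodes under the "core<N>" nodes.

Parse the information in the cpu node that the core refers to.

Update cpu_topology[cpu].package_id and core_id, which record which core of which cluster this CPU is.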

static int __init parse_core(struct device_node *core, int package_id,
                 int core_id)
{
    char name[20];
    bool leaf = true;
    int i = 0;
    int cpu;
    struct device_node *t;

    do {
        snprintf(name, sizeof(name), "thread%d", i);    // no SMT here, so the DTS has no thread nodes under core
        t = of_get_child_by_name(core, name);
        if (t) {
            leaf = false;
            cpu = get_cpu_for_node(t);
            if (cpu >= 0) {
                cpu_topology[cpu].package_id = package_id;
                cpu_topology[cpu].core_id = core_id;
                cpu_topology[cpu].thread_id = i;
            } else {
                pr_err("%pOF: Can't get CPU for thread\n",
                       t);
                of_node_put(t);
                return -EINVAL;
            }
            of_node_put(t);
        }
        i++;
    } while (t);

    cpu = get_cpu_for_node(core);    // (1-1-1) resolve the cpu node referenced by the core
    if (cpu >= 0) {
        if (!leaf) {
            pr_err("%pOF: Core has both threads and CPU\n",
                   core);
            return -EINVAL;
        }

        cpu_topology[cpu].package_id = package_id;    // store the package id (cluster id) in the cpu_topology array
        cpu_topology[cpu].core_id = core_id;        // store the core id in the cpu_topology array; core ids restart in each cluster, so they are 0-3 here
    } else if (leaf) {
        pr_err("%pOF: Can't get CPU for leaf core\n", core);
        return -EINVAL;
    }

    return 0;
}

(1-1-1) Resolving the cpu node from the core

Find the cpu node referenced by the core node and map it to its CPU id.

Then parse that CPU's capacity.

static int __init get_cpu_for_node(struct device_node *node)
{
    struct device_node *cpu_node;
    int cpu;

    cpu_node = of_parse_phandle(node, "cpu", 0);    // follow the "cpu" phandle in the core node to the cpu node
    if (!cpu_node)
        return -1;

    cpu = of_cpu_node_to_id(cpu_node);    // get the logical CPU id for this cpu node: 0, 1, ...
    if (cpu >= 0)
        topology_parse_cpu_capacity(cpu_node, cpu);    // (1-1-1-1) parse this CPU's capacity
    else
        pr_crit("Unable to find CPU node for %pOF\n", cpu_node);

    of_node_put(cpu_node);
    return cpu;
}

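(1-1-1-1) Parsing each CPU's capacity

The capacity-dmips-mhz value is parsed first and used as the CPU's raw_capacity. This parameter represents the CPU's compute capacity: the larger the number, the more capable the core. (Compare with the MTK DTS above, which is clearly a big.LITTLE design. What is unusual is that cpu0-3 are the big cores and cpu4-7 the little cores; on qcom platforms it is generally the other way round, with cpu0-3 little and cpu4-7 big.)

On this platform raw_capacity is 1024 for cpu0-3 and 801 for cpu4-7.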

bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
    static bool cap_parsing_failed;
    int ret;
    u32 cpu_capacity;

    if (cap_parsing_failed)
        return false;

    ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",    // read the CPU's compute capacity; per the author, configured via this property from kernel 4.19 onward
                   &cpu_capacity);
    if (!ret) {
        if (!raw_capacity) {
            raw_capacity = kcalloc(num_possible_cpus(),        // allocate raw_capacity storage for all possible CPUs
                           sizeof(*raw_capacity),
                           GFP_KERNEL);
            if (!raw_capacity) {
                cap_parsing_failed = true;
                return false;
            }
        }
        capacity_scale = max(cpu_capacity, capacity_scale);    // track the largest CPU capacity as the scale
        raw_capacity[cpu] = cpu_capacity;                    // raw capacity is simply the capacity-dmips-mhz value from the DTS
        pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
            cpu_node, raw_capacity[cpu]);
    } else {
        if (raw_capacity) {
            pr_err("cpu_capacity: missing %pOF raw capacity\n",
                cpu_node);
            pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
        }
        cap_parsing_failed = true;
        free_raw_capacity();
    }

    return !ret;
}

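(2) Normalizing the CPU raw_capacity

Every CPU is visited and normalized: the largest value maps to 1024, and a smaller value that is a fraction n of the largest is normalized to n*1024.

The normalization step is raw_capacity * 1024 / capacity_scale, where capacity_scale is the maximum raw_capacity, which on this platform is 1024.

The normalized capacity is stored in the per-CPU variable cpu_scale. The cpu_capacity_orig and cpu_capacity values that the scheduler uses constantly are both derived from it.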

void topology_normalize_cpu_scale(void)
{
    u64 capacity;
    int cpu;

    if (!raw_capacity)
        return;

    pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
    for_each_possible_cpu(cpu) {
        pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
             cpu, raw_capacity[cpu]);
        capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT)        // normalize so that 100% of the max CPU capacity equals 1024
            / capacity_scale;
        topology_set_cpu_scale(cpu, capacity);                    // update the per-CPU cpu_scale (cpu_capacity_orig) with this CPU's normalized capacity
        pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
            cpu, topology_get_cpu_scale(cpu));
    }
}
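The arithmetic of the normalization step can be checked with a small stand-alone helper. This is a minimal sketch, not the kernel function: `normalize_capacity` is a hypothetical name, and it widens the raw value to 64 bits before shifting, whereas the kernel expression shifts the u32 directly.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10  /* 1 << 10 == 1024, the full capacity scale */

/* Scale a raw capacity so that the largest raw value maps to 1024. */
static uint64_t normalize_capacity(uint32_t raw, uint32_t capacity_scale)
{
    assert(capacity_scale != 0);
    return ((uint64_t)raw << SCHED_CAPACITY_SHIFT) / capacity_scale;
}
```

With this platform's values (capacity_scale = 1024), 1024 stays 1024 and 801 stays 801, so the normalization is effectively a no-op here. With made-up raw values of 500 and 250 (capacity_scale = 500), the results would be 1024 and 512.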

Part 2: updating sibling_mask

The call path for cpu0 is:

kernel_init
    -> kernel_init_freeable
        -> smp_prepare_cpus
            -> store_cpu_topology

The call path for cpu1-7 is:

secondary_start_kernel
    -> store_cpu_topology

void store_cpu_topology(unsigned int cpuid)
{
    struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
    u64 mpidr;

    if (cpuid_topo->package_id != -1)  // package_id was already parsed from the DT, so reading the MPIDR system register and the steps after it are skipped
        goto topology_populated;

    mpidr = read_cpuid_mpidr();

    /* Uniprocessor systems can rely on default topology values */
    if (mpidr & MPIDR_UP_BITMASK)
        return;

    /*
     * This would be the place to create cpu topology based on MPIDR.
     *
     * However, it cannot be trusted to depict the actual topology; some
     * pieces of the a