翻译自:https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html
Per-CPU variables are one of the kernel features. You can understand the meaning of this feature by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.
CPU私有变量是内部提供的一种机制。可以为CPU的每一个core创建一份该变量的副本。本文将详细分析这种机制,分析其内部实现及工作原理。
The kernel provides an API for creating per-cpu variables - the DEFINE_PER_CPU
macro:
内核提供了API,去创建CPU私有变量,即通过宏DEFINE_PER_CPU:
#define DEFINE_PER_CPU(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "")
This macro defined in the include/linux/percpu-defs.h as many other macros for work with per-cpu variables. Now we will see how this feature is implemented.
这个宏同其他操作CPU私有变量的宏一道被定义在include/linux/percpu-defs.h 中。
Take a look at the DECLARE_PER_CPU
definition. We see that it takes 2 parameters: type
and name
, so we can use it to create per-cpu variables, for example like this:
简单看一下DECLARE_PER_CPU的宏定义。它有两个参数:type和name。我们可以使用它创建CPU私有变量,例如:
DEFINE_PER_CPU(int, per_cpu_n)
We pass the type and the name of our variable. DEFINE_PER_CPU
calls the DEFINE_PER_CPU_SECTION
macro and passes the same two parameters and empty string to it. Let's look at the definition of theDEFINE_PER_CPU_SECTION
:
我们传出我们自定义的type和name,DEFINE_PER_CPU会调用DEFINE_PER_CPU_SECTION宏,原样传出type和name,而sec则以空字符串传入。DEFINE_PER_CPU_SECTION的定义如下:
#define DEFINE_PER_CPU_SECTION(type, name, sec) \
__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
__typeof__(type) name
#define __PCPU_ATTRS(sec) \
__percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
PER_CPU_ATTRIBUTES
where section
is:
其中PER_CPU_BASE_SECTION定义如下:
#define PER_CPU_BASE_SECTION ".data..percpu"
After all macros are expanded we will get a global per-cpu variable:
所有宏展开后,我们可以得到一个全局的CPU私有变量。
__attribute__((section(".data..percpu"))) int per_cpu_n
It means that we will have a per_cpu_n
variable in the .data..percpu
section. We can find this section in the vmlinux
:
这意味着我们在.data..percpu数据段上创建了一个名字为per_cpu_n的变量。我们可以再Linux虚拟地址空间中看到这个数据段:
.data..percpu 00013a58 0000000000000000 0000000001a5c000 00e00000 2**12
CONTENTS, ALLOC, LOAD, DATA
Ok, now we know that when we use the DEFINE_PER_CPU
macro, a per-cpu variable in the.data..percpu
section will be created. When the kernel initializes it calls the setup_per_cpu_areas
function which loads the .data..percpu
section multiple times, one section per CPU.
现在我们知道当我们使用DEFINE_PER_CPU宏,将会在.data..percpu段上创建一个CPU私有变量。当内核初始化时,会调用setup_per_spu_areas函数,加载.data..perscpu端多次,为每一个CPU创建一个这样的数据段。
Let's look at the per-CPU areas initialization process. It starts in the init/main.c from the call of thesetup_per_cpu_areas
function which is defined in the arch/x86/kernel/setup_percpu.c.
让我们看看CPU私有变量初始化过程。该过程起始于init/main.c,从调用setup_per_cpu_areas函数开始(该函数在arch/x86/setup_percpu.c中定义)。
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
The setup_per_cpu_areas
starts from the output information about the maximum number of CPUs set during kernel configuration with the CONFIG_NR_CPUS
configuration option, actual number of CPUs,nr_cpumask_bits
is the same that NR_CPUS
bit for the new cpumask
operators and number of NUMA
nodes.
在函数setup_per_cpu_areas开始处,首先会输出最大CPU数,该参数由CONFIG_NR_CPUS选项配置。实际的CPU数量由nr_cpumask的bit位决定(.....)。
We can see this output in the dmesg:
我们可以通过dmesg命令查看percpu输出信息。
$ dmesg | grep percpu
[ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
In the next step we check the percpu
first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has percpu_alloc
command line parameters which provides the type of the first chunk allocator. We can read about it in the kernel documentation:
percpu_alloc= Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Archs may support subset or none of the selections.
See comments in mm/percpu.c for details on each
allocator. This parameter is primarily for debugging
and performance comparison.
The mm/percpu.c contains the handler of this command line option:
early_param("percpu_alloc", percpu_alloc_setup);
Where the percpu_alloc_setup
function sets the pcpu_chosen_fc
variable depends on thepercpu_alloc
parameter value. By default the first chunk allocator is auto
:
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
If the percpu_alloc
parameter is not given to the kernel command line, the embed
allocator will be used which embeds the first percpu chunk into bootmem with the memblock. The last allocator is the first chunkpage
allocator which maps the first chunk with PAGE_SIZE
pages.
As I wrote above, first of all we make a check of the first chunk allocator type in thesetup_per_cpu_areas
. We check that first chunk allocator is not page:
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
...
...
...
}
If it is not PCPU_FC_PAGE
, we will use the embed
allocator and allocate space for the first chunk with thepcpu_embed_first_chunk
function:
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
dyn_size, atom_size,
pcpu_cpu_distance,
pcpu_fc_alloc, pcpu_fc_free);
As shown above, the pcpu_embed_first_chunk
function embeds the first percpu chunk into bootmem then we pass a couple of parameters to the pcup_embed_first_chunk
. They are as follows:
PERCPU_FIRST_CHUNK_RESERVE
- the size of the reserved space for the staticpercpu
variables;dyn_size
- minimum free size for dynamic allocation in bytes;atom_size
- all allocations are whole multiples of this and aligned to this parameter;pcpu_cpu_distance
- callback to determine distance between cpus;pcpu_fc_alloc
- function to allocatepercpu
page;pcpu_fc_free
- function to releasepercpu
page.
We calculate all of these parameters before the call of the pcpu_embed_first_chunk
:
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
atom_size = PMD_SIZE;
#else
atom_size = PAGE_SIZE;
#endif
If the first chunk allocator is PCPU_FC_PAGE
, we will use the pcpu_page_first_chunk
instead of thepcpu_embed_first_chunk
. After that percpu
areas up, we setup percpu
offset and its segment for every CPU with the setup_percpu_segment
function (only for x86
systems) and move some early data from the arrays to the percpu
variables (x86_cpu_to_apicid
, irq_stack_ptr
and etc...). After the kernel finishes the initialization process, we will have loaded N .data..percpu
sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain an uninitialized variable created with the DEFINE_PER_CPU
macro.
The kernel provides an API for per-cpu variables manipulating:
- get_cpu_var(var)
- put_cpu_var(var)
Let's look at the get_cpu_var
implementation:
#define get_cpu_var(var) \
(*({ \
preempt_disable(); \
this_cpu_ptr(&var); \
}))
The Linux kernel is preemptible and accessing a per-cpu variable requires us to know which processor the kernel is running on. So, current code must not be preempted and moved to the another CPU while accessing a per-cpu variable. That's why, first of all we can see a call of the preempt_disable
function then a call of the this_cpu_ptr
macro, which looks like:
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
and
#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
where per_cpu_ptr
returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've created a per-cpu variable and made modifications to it, we must call the put_cpu_var
macro which enables preemption with a call of preempt_enable
function. So the typical usage of a per-cpu variable is as follows:
get_cpu_var(var);
...
//Do something with the 'var'
...
put_cpu_var(var);
Let's look at the per_cpu_ptr
macro:
#define per_cpu_ptr(ptr, cpu) \
({ \
__verify_pcpu_ptr(ptr); \
SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
})
As I wrote above, this macro returns a per-cpu variable for the given cpu. First of all it calls__verify_pcpu_ptr
:
#define __verify_pcpu_ptr(ptr)
do {
const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;
(void)__vpp_verify;
} while (0)
which makes the given ptr
type of const void __percpu *
,
After this we can see the call of the SHIFT_PERCPU_PTR
macro with two parameters. As first parameter we pass our ptr and for second parameter we pass the cpu number to the per_cpu_offset
macro:
#define per_cpu_offset(x) (__per_cpu_offset[x])
which expands to getting the x
element from the __per_cpu_offset
array:
extern unsigned long __per_cpu_offset[NR_CPUS];
where NR_CPUS
is the number of CPUs. The __per_cpu_offset
array is filled with the distances between cpu-variable copies. For example all per-cpu data is X
bytes in size, so if we access__per_cpu_offset[Y]
, X*Y
will be accessed. Let's look at the SHIFT_PERCPU_PTR
implementation:
#define SHIFT_PERCPU_PTR(__p, __offset) \
RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
RELOC_HIDE
just returns offset (typeof(ptr)) (__ptr + (off))
and it will return a pointer to the variable.
That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the include/linux/percpu-defs.h magic.
Let's again look at the algorithm of getting a pointer to a per-cpu variable:
- The kernel creates multiple
.data..percpu
sections (one per-cpu) during initialization process; - All variables created with the
DEFINE_PER_CPU
macro will be relocated to the first section or for CPU0; __per_cpu_offset
array filled with the distance (BOOT_PERCPU_OFFSET
) between.data..percpu
sections;- When the
per_cpu_ptr
is called, for example for getting a pointer on a certain per-cpu variable for the third CPU, the__per_cpu_offset
array will be accessed, where every index points to the required CPU.
That's all.