Linux Memory Management

Memory allocation inside the kernel is not as easy as it is outside the kernel. Simply put, the kernel lacks the luxuries enjoyed by user space: unlike user space, the kernel cannot always afford to allocate memory easily. For example, the kernel often cannot gracefully deal with memory allocation errors, and it frequently cannot sleep. Because of these limitations, and the need for a lightweight memory allocation scheme, getting hold of memory in the kernel is more complicated than in user space.
In this chapter, we discuss the methods used to obtain memory inside the kernel. Before you can delve into the actual interfaces, however, you need to understand how the kernel handles memory.
12.1 Pages
The kernel treats physical pages as the basic unit of memory management. Although the processor's smallest addressable unit is a byte or a word, the memory management unit (MMU, the hardware that manages memory and performs virtual-to-physical address translation) typically deals in pages. Consequently, the MMU maintains the system's page tables with page-sized granularity. In terms of virtual memory, the page is the smallest unit that matters.
As you will see in Chapter 19, each architecture defines its own page size, and many architectures even support multiple page sizes. Most 32-bit architectures have 4KB pages, whereas most 64-bit architectures have 8KB pages. This implies that on a machine with 4KB pages and 1GB of memory, physical memory is divided into 262,144 distinct pages.
The kernel represents every physical page on the system with a struct page structure, which is defined in <linux/mm_types.h>. I have simplified the definition here, removing two confusing unions:
struct page {
unsigned long flags;
atomic_t _count;
atomic_t _mapcount;
unsigned long private;
struct address_space *mapping;
pgoff_t index;
struct list_head lru;
void *virtual;
};
Let's look at the important fields. The flags field stores the status of the page, such as whether the page is dirty or whether it is locked in memory. Bit flags represent the various values, so at least 32 different flags are representable simultaneously. The flag values are defined in <linux/page-flags.h>.
The _count field stores the usage count of the page, that is, how many references there are to this page. When this count reaches negative one, the page is unused and becomes available for a new allocation. Kernel code should not check this field directly but instead use the page_count() interface. Although _count holds negative one for a free page, page_count() returns zero when the page is free and a positive nonzero value when the page is in use. A page may be used by the page cache (in which case the mapping field points at the address_space object associated with the page), as private data, or as a mapping in a process's page tables.
The virtual field is the page's virtual address. Normally, this is simply the page's address in virtual memory. Some memory (called high memory) is not permanently mapped into the kernel's address space; in that case, this field is NULL, and the page must be mapped dynamically when needed.
The important point to understand is that the page structure is associated with physical pages, not virtual pages. Consequently, what the structure describes is transient at best. The kernel uses this structure to describe the physical page itself; the structure's goal is to describe physical memory, not the data contained therein.
The kernel uses this structure to keep track of all the pages in the system, because the kernel needs to know whether a page is free and, if a page is not free, who owns it. Possible owners include user-space processes, dynamically allocated kernel data, static kernel code, the page cache, and so on.
Developers are often surprised that an instance of this structure is allocated for every physical page in the system. They think, "What a lot of memory wasted!" Let's look at just how bad (or good) the space consumption is from all these pages. Assume struct page consumes 40 bytes of memory, the system has 8KB physical pages, and the system has 4GB of physical memory. In that case, there are 524,288 pages, and page structures, on the system. The page structures consume 20MB: perhaps a surprisingly large absolute number, but only a small fraction of a percent of the system's 4GB of memory.
12.2 Zones
Because of hardware limitations, the kernel cannot treat all pages as identical. Some pages, because of their physical address in memory, cannot be used for certain tasks. Because of this limitation, the kernel divides pages into different zones; the kernel uses the zones to group pages of similar properties. In particular, the kernel has to deal with two shortcomings of hardware with respect to memory addressing:
 Some hardware devices can perform DMA (direct memory access) only to certain memory addresses.
 Some architectures can physically address larger amounts of memory than they can virtually address; consequently, some memory is not permanently mapped into the kernel address space.
Because of these constraints, Linux has four primary memory zones:
 ZONE_DMA: This zone contains pages that can undergo DMA.
 ZONE_DMA32: Like ZONE_DMA, this zone contains pages that can undergo DMA; unlike ZONE_DMA, these pages are accessible only by 32-bit devices. On some architectures, this zone is a larger subset of memory.
 ZONE_NORMAL: This zone contains normal, regularly mapped pages.
 ZONE_HIGHMEM: This zone contains "high memory", which consists of pages not permanently mapped into the kernel's address space.
The actual use and layout of the memory zones is architecture-dependent. For example, some architectures have no problem performing DMA into any memory address; on those architectures, ZONE_DMA is empty and ZONE_NORMAL is used for allocations regardless of their use. Conversely, on the x86 architecture, ISA devices can access only the first 16MB of physical memory; consequently, ZONE_DMA on x86 contains all memory in the range 0 to 16MB.
ZONE_HIGHMEM works in the same manner: what an architecture can and cannot directly map varies. On 32-bit x86, ZONE_HIGHMEM is all memory above the physical 896MB mark. On other architectures, ZONE_HIGHMEM is empty because all memory is directly mapped. The memory in ZONE_HIGHMEM is called high memory; the remainder of the system's memory is called low memory.
ZONE_NORMAL tends to be whatever is left over after the previous two zones claim their required shares. On x86, for example, ZONE_NORMAL is all physical memory from 16MB to 896MB. On other (more fortunate) architectures, ZONE_NORMAL is all available memory. Table 12.1 is a listing of each zone and its consumed pages on x86-32.

Linux partitions the system's pages into zones to have a pooling in place to satisfy allocations as needed. For example, having a ZONE_DMA pool gives the kernel the capability to satisfy memory allocations needed for DMA. Note that the zones do not have any physical relevance; they are simply logical groupings used by the kernel to keep track of pages.
Although some allocations may require pages from a particular zone, other allocations may pull from multiple zones. For example, although an allocation for DMA-able memory must originate from ZONE_DMA, a normal allocation can come from ZONE_DMA or ZONE_NORMAL, but not both; allocations cannot cross zone boundaries. The kernel prefers to satisfy normal allocations from the normal zone, of course, to save the pages in ZONE_DMA for allocations that need it. But if push comes to shove (say, if memory gets low), the kernel can dip its fingers into whatever zone is available and suitable.
Not all architectures define all zones. For example, a 64-bit architecture such as x86-64 can fully map and handle 64 bits of memory. Thus, x86-64 has no ZONE_HIGHMEM, and all physical memory is contained within ZONE_DMA and ZONE_NORMAL.
Each zone is represented by struct zone, which is defined in <linux/mmzone.h>:
struct zone {
unsigned long watermark[NR_WMARK];
unsigned long lowmem_reserve[MAX_NR_ZONES];
struct per_cpu_pageset pageset[NR_CPUS];
spinlock_t lock;
struct free_area free_area[MAX_ORDER];
spinlock_t lru_lock;
struct zone_lru {
struct list_head list;
unsigned long nr_saved_scan;
} lru[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
unsigned long pages_scanned;
unsigned long flags;
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
int prev_priority;
unsigned int inactive_ratio;
wait_queue_head_t *wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
struct pglist_data *zone_pgdat;
unsigned long zone_start_pfn;
unsigned long spanned_pages;
unsigned long present_pages;
const char *name;
};
The structure is big, but there are only three zones in the system and, thus, only three of these structures. Let's look at the more important fields.
The lock field is a spin lock that protects the structure from concurrent access. Note that it protects just the structure and not all the pages that reside in the zone.
The watermark array holds the minimum, low, and high watermarks for this zone. The kernel uses the watermarks to set benchmarks for suitable per-zone memory consumption, varying its aggressiveness as the watermarks vary vis-à-vis free memory.
The name field is a string representing the name of this zone. The kernel initializes this value during boot in mm/page_alloc.c, and the three zones are given the names DMA, Normal, and HighMem.
12.3 Getting Pages
Now with an understanding of how the kernel manages memory, via pages, zones, and so on, let's look at the interfaces the kernel implements to enable you to allocate and free memory within the kernel.
The kernel provides one low-level mechanism for requesting memory, along with several interfaces to access it. All these interfaces allocate memory with page-sized granularity and are declared in <linux/gfp.h>. The core function is
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
This allocates 2^order (that is, 1 << order) contiguous physical pages and returns a pointer to the first page's page structure; on error, it returns NULL. We look at the gfp_t type and the gfp_mask parameter in a later section. You can convert a given page to its logical address with the function
void *page_address(struct page *page)
This returns a pointer to the logical address where the given physical page currently resides. If you have no need for the actual struct page, you can call
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
This function works the same as alloc_pages(), except that it directly returns the logical address of the first requested page. Because the pages are contiguous, the other pages simply follow from the first.
If you need only one page, two functions are implemented as wrappers:
struct page *alloc_page(gfp_t gfp_mask);
unsigned long __get_free_page(gfp_t gfp_mask);
These functions work the same as their brethren but pass zero for the order (2^0 = one page).
12.3.1 Getting Zeroed Pages
If you need the returned page filled with zeros, use the function
unsigned long get_zeroed_page(gfp_t gfp_mask);
This function works the same as __get_free_page(), except that the allocated page is zero-filled. This is useful for pages given to user space, because the random garbage in an allocated page is not so random: it might contain sensitive data. All data must be zeroed or otherwise cleaned before it is returned to user space, to ensure system security is not compromised. Table 12.2 is a listing of all the low-level page allocation methods.

12.3.2 Freeing Pages
A family of functions enables you to free allocated pages when you no longer need them:
void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
void free_page(unsigned long addr);
You must be careful to free only pages you allocate. Passing the wrong struct page or address, or the incorrect order, can result in corruption. Remember, the kernel trusts itself: unlike user space, the kernel will happily hang itself if you ask it to.
Let's look at an example, in which we want to allocate eight pages:

unsigned long page;

page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
	/* insufficient memory: you must handle this error! */
	return -ENOMEM;
}

/* 'page' is now the address of the first of eight contiguous pages ... */
And here we free the eight pages, after we are done using them:
free_pages(page, 3);

/*
 * our pages are now freed and we should no
 * longer access the address stored in 'page'
 */
The GFP_KERNEL parameter is an example of a gfp_mask flag. Note the error checking after the call to __get_free_pages(): a kernel allocation can fail, and your code must check for and handle such errors.
These low-level page functions are useful when you need page-sized chunks of physically contiguous pages, especially if you need exactly a single page or two. For more general byte-sized allocations, the kernel provides kmalloc().
12.4 kmalloc()
The kmalloc() function's operation is similar to that of user-space's familiar malloc() routine, with the exception of the additional flags parameter. The kmalloc() function is a simple interface for obtaining kernel memory in byte-sized chunks. If you need whole pages, the previously discussed interfaces might be a better choice. For most kernel allocations, however, kmalloc() is the preferred interface.
The function is declared in <linux/slab.h>:
void *kmalloc(size_t size, gfp_t flags)
The function returns a pointer to a region of memory that is at least size bytes in length. The region of memory allocated is physically contiguous. On error, it returns NULL. Kernel allocations always succeed unless an insufficient amount of memory is available; thus, you must check for NULL after all calls to kmalloc() and handle the error appropriately.
12.4.1 gfp_mask Flags
The flags are broken up into three categories: action modifiers, zone modifiers, and types. Action modifiers specify how the kernel is supposed to allocate the requested memory. In certain situations, only certain methods can be employed to allocate memory. For example, interrupt handlers must instruct the kernel not to sleep (because interrupt handlers cannot reschedule) in the course of allocating memory. As you saw earlier in this chapter, the kernel divides physical memory into multiple zones, each of which serves a different purpose; zone modifiers specify from which of these zones to allocate. Type flags specify a combination of action and zone modifiers as needed by a certain type of memory allocation. Type flags simplify specifying multiple modifiers; GFP_KERNEL, for instance, is a type flag used for code in process context inside the kernel.
12.4.2 Action Modifiers
All the flags, the action modifiers included, are declared in <linux/gfp.h>. Table 12.3 is a list of the action modifiers.

These allocations can be specified together.

12.4.3 Zone Modifiers
Zone modifiers specify from which memory zone the allocation should originate. Normally, allocations can be fulfilled from any zone. There are only three zone modifiers because, aside from ZONE_NORMAL (where kernel allocations originate by default), there are only three other zones. Table 12.4 is a list of the zone modifiers.

12.4.4 Type Flags
The type flags specify the required action and zone modifiers to fulfill a particular type of transaction. Therefore, kernel code tends to use the correct type flag rather than specifying a myriad of other flags; this is both simpler and less error-prone. Table 12.5 is a list of the type flags, and Table 12.6 shows which modifiers are associated with each type flag.

In the vast majority of the code that you write, you will use either GFP_KERNEL or GFP_ATOMIC. Table 12.7 is a list of the common situations and the flags to use. Regardless of the allocation type, you must check for and handle failures.

12.5 kfree()
The counterpart to kmalloc() is kfree(), which is declared in <linux/slab.h>:
void kfree(const void *ptr)
The kfree() method frees a block of memory previously allocated with kmalloc(). Do not call this function on memory not previously allocated with kmalloc(), or on memory that has already been freed; doing so is a bug and can result in freeing memory belonging to another part of the kernel. As in user space, be careful to balance your allocations with your deallocations to prevent memory leaks and other bugs.
Let's look at an example of allocating memory in an interrupt handler. In this example, an interrupt handler wants to allocate a buffer to hold incoming data. Assume the preprocessor macro BUF_SIZE is the size of the desired buffer, which is larger than just a couple of bytes.
char *buf;

buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buf)
	/* error allocating memory ! */
Later, when you no longer need the memory, do not forget to free it:
kfree(buf);

12.6 vmalloc()
The vmalloc() function works in a similar fashion to kmalloc(), except that it allocates memory that is only virtually contiguous and not necessarily physically contiguous. This is how a user-space allocation function works: the pages returned by malloc() are contiguous within the virtual address space of the processor, but there is no guarantee that they are contiguous in physical RAM. The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous). The vmalloc() function ensures only that the pages are contiguous within the virtual address space. It does this by allocating potentially noncontiguous chunks of physical memory and "fixing up" the page tables to map the memory into a contiguous chunk of the logical address space.
For the most part, only hardware devices require physically contiguous memory. On many architectures, hardware devices live on the other side of the memory management unit and, thus, do not understand virtual addresses. Consequently, any regions of memory that hardware devices work with must exist as a physically contiguous block. Blocks of memory used only by software, for example, process-related buffers, are fine using memory that is only virtually contiguous.
Despite the fact that physically contiguous memory is required in only certain cases, most kernel code uses kmalloc() and not vmalloc(). Primarily, this is for performance. The vmalloc() function, to make nonphysically contiguous pages contiguous in the virtual address space, must set up the page table entries explicitly. Worse, pages obtained via vmalloc() must be mapped by their individual pages (because they are not physically contiguous), which results in much greater TLB thrashing than you see when directly mapped memory is used. Because of these concerns, vmalloc() is used only when absolutely necessary, typically to obtain large regions of memory. For example, when modules are dynamically inserted into the kernel, they are loaded into memory created via vmalloc().
The vmalloc() function is declared in <linux/vmalloc.h> and defined in mm/vmalloc.c. Usage is identical to user-space's malloc():
void *vmalloc(unsigned long size)
The function returns a pointer to at least size bytes of virtually contiguous memory. On error, the function returns NULL. The function might sleep and, thus, cannot be called from interrupt context or from other situations in which blocking is not permissible.
To free an allocation obtained via vmalloc(), use
void vfree(const void *addr)
This function frees the block of memory beginning at addr that was previously allocated via vmalloc(). The function can also sleep and, thus, cannot be called from interrupt context. It has no return value.
12.7 Slab Layer
Allocating and freeing data structures is one of the most common operations inside any kernel. To facilitate frequent allocations and deallocations of data, programmers often introduce free lists. A free list contains a block of available, already allocated data structures. When code requires a new instance of a data structure, it can grab one of the structures off the free list rather than allocate memory and set it up for the structure. Later, when the data structure is no longer needed, it is returned to the free list instead of deallocated. In this sense, the free list acts as an object cache, caching a frequently used type of object.
One of the main problems with free lists in the kernel is that there exists no global control. When available memory is low, there is no way for the kernel to communicate with every free list that it should shrink the size of its cache to free up memory; the kernel has no understanding of the random free lists at all. To remedy this, and to consolidate code, the kernel provides the slab layer. The slab layer acts as a generic data structure-caching layer.
The concept of a slab layer was first implemented in SunOS; the Linux data structure caching layer shares the same name and basic design.
The slab layer attempts to leverage several basic tenets:
 Frequently used data structures tend to be allocated and freed often, so cache them.
 Frequent allocation and deallocation can result in memory fragmentation (the inability to find large, contiguous chunks of available memory). To prevent this, the cached free lists are arranged contiguously. Because freed data structures return to the free list, there is no resulting fragmentation.
 The free list provides improved performance during frequent allocation and deallocation because a freed object can be immediately returned to the next allocation.
 If the allocator is aware of concepts such as object size, page size, and total cache size, it can make more intelligent decisions.
 If part of the cache is made per-processor (separate and unique to each processor on the system), allocations and frees can be performed without an SMP lock.
 If the allocator is NUMA-aware (NUMA: Non-Uniform Memory Access), it can fulfill allocations from the same memory node as the requestor.
 Stored objects can be colored to prevent multiple objects from mapping to the same cache lines.
The slab layer in Linux was designed and implemented with these premises in mind.
12.7.1 Design of the Slab Layer
The slab layer divides different objects into groups called caches, each of which stores a different type of object. There is one cache per object type. For example, one cache is for process descriptors (a free list of task_struct structures), whereas another cache is for inode objects (struct inode). Interestingly, the kmalloc() interface is built on top of the slab layer, using a family of general-purpose caches.
The caches are then divided into slabs (hence the name of this subsystem). The slabs are composed of one or more physically contiguous pages. Typically, slabs are composed of only a single page. Each cache may consist of multiple slabs.
Each slab contains some number of objects, which are the data structures being cached. Each slab is in one of three states: full, partial, or empty. A full slab has no free objects (all objects in the slab are allocated). An empty slab has no allocated objects (all objects in the slab are free). A partial slab has some allocated objects and some free objects. When some part of the kernel requests a new object, the request is satisfied from a partial slab, if one exists. Otherwise, the request is satisfied from an empty slab; if no empty slab exists, one is created. Obviously, a full slab can never satisfy a request because it does not have any free objects. This strategy reduces fragmentation.
As an example, let's look at the inode structure, which is the in-memory representation of a disk inode. These structures are frequently created and destroyed, so it makes sense to manage them via the slab allocator. Thus, struct inode is allocated from the inode_cachep cache (such a naming convention is standard). This cache is made up of one or more slabs (probably a lot of slabs, because there are a lot of objects). Each slab contains as many struct inode objects as possible. When the kernel requests a new inode structure, the kernel returns a pointer to an already allocated, but unused, structure from a partial slab or, if there is no partial slab, an empty slab. When the kernel is done using the inode object, the slab allocator marks the object as free. Figure 12.1 diagrams the relationship between caches, slabs, and objects.

Each cache is represented by a kmem_cache structure. This structure contains three lists (slabs_full, slabs_partial, and slabs_empty) stored inside a kmem_list3 structure, which is defined in mm/slab.c. These lists contain all the slabs associated with the cache. A slab descriptor, struct slab, represents each slab:
struct slab {
	struct list_head list;		/* full, partial, or empty list */
	unsigned long colouroff;	/* offset for the slab coloring */
	void *s_mem;			/* first object in the slab */
	unsigned int inuse;		/* allocated objects in the slab */
	kmem_bufctl_t free;		/* first free object, if any */
};
Slab descriptors are allocated either outside of the slab in a general cache or inside the slab itself, at the beginning. The descriptor is stored inside the slab if the total size of the slab is sufficiently small, or if internal slack space is sufficient to hold the descriptor.
The slab allocator creates new slabs by interfacing with the low-level kernel page allocator via __get_free_pages():
static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
struct page *page;
void *addr;
int i;

flags |= cachep->gfpflags;
if (likely(nodeid == -1)) {
addr = (void*)__get_free_pages(flags, cachep->gfporder);
if (!addr)
return NULL;
page = virt_to_page(addr);
} else {
page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)
return NULL;
addr = page_address(page);
}

i = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
atomic_add(i, &slab_reclaim_pages);
add_page_state(nr_slab, i);
while (i--) {
SetPageSlab(page);
page++;
}
return addr;
}
This function uses __get_free_pages() to allocate memory sufficient to hold the cache. The first parameter points to the specific cache that needs more pages. The second parameter points to the flags given to __get_free_pages(). Note how this value is binary OR'ed against another value: this adds default flags that the cache requires to the flags parameter. The power-of-two size of the allocation is stored in cachep->gfporder. This function is more complicated than one might expect because of the code that makes the allocator NUMA-aware. When nodeid is not negative one, the allocator attempts to fulfill the allocation from the same memory node that requested the allocation. This provides better performance on NUMA systems, in which accessing memory outside your node results in a performance penalty.
For educational purposes, we can ignore the NUMA-aware code and write a simple kmem_getpages():
static inline void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags)
{
	void *addr;

	flags |= cachep->gfpflags;
	addr = (void *) __get_free_pages(flags, cachep->gfporder);

	return addr;
}
Memory is then freed by kmem_freepages(), which calls free_pages() on the given cache's pages. Of course, the point of the slab layer is to refrain from allocating and freeing pages; in turn, the slab layer invokes the page allocation function only when there does not exist any partial or empty slab in a given cache. The freeing function is called only when available memory grows low and the system is attempting to free memory, or when a cache is explicitly destroyed.
Management of the slab layer is handled on a per-cache basis through a simple interface, which is exported to the entire kernel. The interface enables the creation and destruction of new caches, and the allocation and freeing of objects within the caches. The sophisticated management of caches and slabs is handled entirely internally by the slab layer. After you create a cache, the slab layer works just like a specialized allocator for the specific type of object.
12.7.2 Slab Allocator Interface
A new cache is created via
struct kmem_cache *kmem_cache_create(const char *name,
                                     size_t size,
                                     size_t align,
                                     unsigned long flags,
                                     void (*ctor)(void *));
The first parameter is a string storing the name of the cache. The second parameter is the size of each element in the cache. The third parameter is the offset of the first object within a slab. This is done to ensure a particular alignment within the page. Normally, zero is sufficient, which results in the standard alignment. The flags parameter specifies optional settings controlling the cache's behavior. It can be zero, specifying no special behavior, or one or more of the following flags OR'ed together:
 SLAB_HWCACHE_ALIGN—This flag instructs the slab layer to align each object within a slab to a cache line. This prevents "false sharing" (two or more objects mapping to the same cache line despite existing at different addresses in memory). This improves performance but comes at a cost of increased memory footprint because the stricter alignment results in more wasted slack space. How large the increase in memory consumption is depends on the size of the objects and how they naturally align with respect to the system's cache lines. For frequently used caches in performance-critical code, setting this option is a good idea; otherwise, think twice.
 SLAB_POISON—This flag causes the slab layer to fill the slab with a known value (a5a5a5a5). This is called poisoning and is useful for catching access to uninitialized memory.
 SLAB_RED_ZONE—This flag causes the slab layer to insert "red zones" around the allocated memory to help detect buffer overruns.
 SLAB_PANIC—This flag causes the slab layer to panic if the allocation fails. This flag is useful when the allocation must not fail, as in, say, allocating the VMA structure cache during bootup.
 SLAB_CACHE_DMA—This flag instructs the slab layer to allocate each slab in DMA-able memory. This is needed if the allocated object is used for DMA and must reside in ZONE_DMA. Otherwise, you do not need this and you should not set it.
The final parameter, ctor, is a constructor for the cache. The constructor is called whenever new pages are added to the cache. In practice, caches in the Linux kernel do not often utilize a constructor. (In fact, there once was a destructor parameter, too, but it was removed because no kernel code used it.) You can pass NULL for this parameter.
On success, kmem_cache_create() returns a pointer to the created cache. Otherwise, it returns NULL. This function must not be called from interrupt context because it can sleep.
To destroy a cache, call
int kmem_cache_destroy(struct kmem_cache *cachep)
As the name implies, this function destroys the given cache. It is generally invoked from module shutdown code in modules that create their own caches. It must not be called from interrupt context because it can sleep. The caller of this function must ensure two conditions are true prior to invoking it:
 All slabs in the cache are empty.
 No one accesses the cache during (and obviously after) a call to kmem_cache_destroy(). The caller must ensure this synchronization.
On success, the function returns zero; it returns nonzero otherwise.
12.7.3 Allocating from Cache
After a cache is created, an object is obtained from the cache via
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
This function returns a pointer to an object from the given cache cachep. If no free objects are in any slabs in the cache, the slab layer must obtain new pages via kmem_getpages(), and the value of flags is passed to __get_free_pages(). These are the same flags we looked at earlier; you probably want GFP_KERNEL or GFP_ATOMIC.
Later, to free an object and return it to its originating slab, use
void kmem_cache_free(struct kmem_cache *cachep, void *objp)
This marks the object objp in cachep as free.
12.7.4 Example of Using the Slab Allocator
For the details of a complete example using these interfaces, based on the task_struct structure, see Section 12.7.4 of the original book.
12.8 Statically Allocating on the Stack
In user space, allocations such as some of those we have discussed could have occurred on the stack, because we knew the size of the allocation a priori. User space is afforded the luxury of a large, dynamically growing stack, whereas the kernel has no such luxury: the kernel's stack is small and fixed. When each process is given a small, fixed stack, memory consumption is minimized, and the kernel need not burden itself with stack management code.
The size of the per-process kernel stack depends on both the architecture and a compile-time option. Historically, the kernel stack has been two pages per process. This is usually 8KB for 32-bit architectures and 16KB for 64-bit architectures, because they usually have 4KB and 8KB pages, respectively.
12.8.1 Single-Page Kernel Stacks
Early in the 2.6 kernel series, however, an option was introduced to move to single-page kernel stacks. When this option is enabled, each process is given only a single page: 4KB on 32-bit architectures and 8KB on 64-bit architectures. This was done for two reasons. First, it reduces the memory consumed by each process's stack. Second, and most important, as uptime increases, it becomes increasingly hard to find two physically contiguous unallocated pages: physical memory becomes fragmented, and the VM pressure from allocating a single new process grows.
There is one more complication. Keep with me: We have almost grasped the entirety of the kernel stack story. Each process's entire call chain has to fit in its kernel stack. Historically, however, interrupt handlers also used the kernel stack of the process they interrupted, and thus they too had to fit. This was efficient and simple, but it placed even tighter constraints on the already meager kernel stack. When the stack moved to only a single page, interrupt handlers no longer fit.
To rectify this problem, the kernel developers implemented a new feature: interrupt stacks. Interrupt stacks provide a single per-processor stack used for interrupt handlers. With this option, interrupt handlers no longer share the kernel stack of the interrupted process; instead, they use their own stacks. This consumes only a single page per processor.
To summarize, kernel stacks are either one or two pages, depending on compile-time configuration options. The stack can therefore range from 4KB to 16KB. Historically, interrupt handlers shared the stack of the interrupted process. When single-page stacks are enabled, interrupt handlers are given their own stacks.
12.8.2 Playing Fair on the Stack
In any given function, you must keep kernel stack usage to a minimum. There is no hard and fast rule, but you should keep the sum of all local (that is, automatic) variables in a particular function to a maximum of a couple hundred bytes. Performing a large static allocation on the stack, such as of a large array or structure, is dangerous. Otherwise, stack allocations are performed in the kernel just as in user space. Stack overflows occur silently and will undoubtedly result in problems. Because the kernel does not make any effort to manage the stack, when the stack overflows, the excess data simply spills into whatever exists at the tail end of the stack. The first thing to get corrupted is the thread_info structure; beyond the stack, any kernel data might lurk. At best, the machine will crash when the stack overflows. At worst, the overflow will silently corrupt data.
Therefore, it is wise to use a dynamic allocation scheme, such as one of those previously discussed in this chapter, for any large memory allocations.
12.9 High Memory Mappings
By definition, pages in high memory might not be permanently mapped into the kernel's address space. Thus, pages obtained via alloc_pages() with the __GFP_HIGHMEM flag might not have a logical address.
On the x86 architecture, all physical memory beyond the 896MB mark is high memory and is not permanently or automatically mapped into the kernel's address space, despite x86 processors being capable of physically addressing more memory. After they are allocated, these pages must be mapped into the kernel's logical address space. On x86, pages in high memory are mapped somewhere between the 3GB and 4GB mark.
12.9.1 Permanent Mappings
To map a given page structure into the kernel's address space, use this function, declared in <linux/highmem.h>:
void *kmap(struct page *page)
This function works on either high or low memory. If the page structure belongs to a page in low memory, the page's virtual address is simply returned. If the page resides in high memory, a permanent mapping is created and the address is returned. The function may sleep, so kmap() works only in process context.
Because the number of permanent mappings is limited, high memory should be unmapped when no longer needed. This is done via the following function, which unmaps the given page:
void kunmap(struct page *page)
12.9.2 Temporary Mappings
For times when a mapping must be created but the current context cannot sleep, the kernel provides temporary mappings (which are also called atomic mappings). These are a set of reserved mappings that can hold a temporary mapping. The kernel can atomically map a high-memory page into one of these reserved mappings. Consequently, a temporary mapping can be used in places that cannot sleep, such as interrupt handlers, because obtaining the mapping never blocks.
Setting up a temporary mapping is done via
void *kmap_atomic(struct page *page, enum km_type type)
The type parameter is one of the following enumerations, which describe the purpose of the temporary mapping. They are defined in <asm-generic/kmap_types.h>:
enum km_type {
KM_BOUNCE_READ,
KM_SKB_SUNRPC_DATA,
KM_SKB_DATA_SOFTIRQ,
KM_USER0,
KM_USER1,
KM_BIO_SRC_IRQ,
KM_BIO_DST_IRQ,
KM_PTE0,
KM_PTE1,
KM_PTE2,
KM_IRQ0,
KM_IRQ1,
KM_SOFTIRQ0,
KM_SOFTIRQ1,
KM_SYNC_ICACHE,
KM_SYNC_DCACHE,
KM_UML_USERCOPY,
KM_IRQ_PTE,
KM_NMI,
KM_NMI_PTE,
KM_TYPE_NR
};
This function does not block and, thus, can be used in interrupt context and other places that cannot reschedule. It also disables kernel preemption, which is needed because the mappings are unique to each processor. (And a reschedule might change which task is running on which processor.)
The mapping is undone via
void kunmap_atomic(void *kvaddr, enum km_type type)
This function also does not block. On many architectures it does not do anything at all except enable kernel preemption, because a temporary mapping is valid only until the next temporary mapping. Thus, the kernel can just "forget about" the kmap_atomic() mapping, and kunmap_atomic() does not need to do anything special. The next atomic mapping then simply overwrites the previous one.
12.10 Per-CPU Allocations
Modern SMP-capable operating systems make heavy use of per-CPU data, that is, data that is unique to a given processor. Typically, per-CPU data is stored in an array, with each item in the array corresponding to a possible processor on the system. The current processor number indexes the array; this is how the 2.4 kernel handles per-CPU data, and plenty of 2.6 kernel code still uses the approach. You declare the data as
unsigned long my_percpu[NR_CPUS];
Then you access it as

int cpu;

cpu = get_cpu();	/* get current processor and disable kernel preemption */
my_percpu[cpu]++;	/* ... or whatever */
printk("my_percpu on cpu=%d is %lu\n", cpu, my_percpu[cpu]);
put_cpu();		/* enable kernel preemption */

Note that no lock is required, because this data is unique to the current processor. If no processor touches this data except the current one, no concurrency concerns exist, and the current processor can safely access the data without lock.
Kernel preemption is the only concern with per-CPU data. Kernel preemption poses two problems, listed here:
 If your code is preempted and reschedules on another processor, the cpu variable is no longer valid because it points to the wrong processor. (In general, code cannot sleep after obtaining the current processor number.)
 If another task preempts your code, it can concurrently access my_percpu on the same processor, which is a race condition.
Any fears are unwarranted, however, because the call get_cpu(), on top of returning the current processor number, also disables kernel preemption. The corresponding call to put_cpu() enables kernel preemption. Note that if you use a call to smp_processor_id() to get the current processor number, kernel preemption is not disabled; always use the aforementioned methods to remain safe.
12.11 The New percpu Interface
The 2.6 kernel introduced a new interface, known as percpu, for creating and manipulating per-CPU data. This interface generalizes the previous example; creation and manipulation of per-CPU data is simplified with this approach.
The previously discussed method of creating and accessing per-CPU data is still valid and accepted. This new interface, however, grew out of the need for a simpler and more powerful method for manipulating per-CPU data on large symmetric multiprocessing computers.
The header <linux/percpu.h> declares all the routines. You can find the actual definitions there, in mm/slab.c, and in <asm/percpu.h>.
12.11.1 Per-CPU Data at Compile-Time
Defining a per-CPU variable at compile time is quite easy:
DEFINE_PER_CPU(type,name);
This creates an instance of a variable of type type, named name, for each processor on the system. If you need a declaration of the variable elsewhere, to avoid compile warnings, the following macro is your friend:
DECLARE_PER_CPU(type, name);
You can manipulate the variables with the get_cpu_var() and put_cpu_var() routines. A call to get_cpu_var() returns an lvalue for the given variable on the current processor. It also disables preemption, which put_cpu_var() correspondingly enables.
get_cpu_var(name)++;	/* increment name on this processor */
put_cpu_var(name);	/* done; enable kernel preemption */
You can obtain the value of another processor's per-CPU data, too:
per_cpu(name, cpu)++;	/* increment name on the given processor */
You need to be careful with this approach because per_cpu() neither disables kernel preemption nor provides any sort of locking mechanism. The lockless nature of per-CPU data exists only if the data is manipulated solely by the current processor. If other processors touch other processors' data, you need locks.
Another subtle issue: These compile-time per-CPU examples do not work for modules because the linker actually creates them in a unique executable section. If you need to access per-CPU data from modules, or if you need to create such data dynamically, there is hope.
12.11.2 Per-CPU Data at Runtime
The kernel implements a dynamic allocator, similar to kmalloc(), for creating per-CPU data. This routine creates an instance of the requested memory for each processor on the system. The prototypes are in <linux/percpu.h>:
void *alloc_percpu(type);	/* a macro */
void *__alloc_percpu(size_t size, size_t align);
void free_percpu(const void *);
The alloc_percpu() macro allocates one instance of an object of the given type for every processor on the system. It is a wrapper around __alloc_percpu(), which takes the actual number of bytes to allocate as a parameter, along with the number of bytes on which to align the allocation. The alloc_percpu() macro aligns the allocation on a byte boundary that is the natural alignment of the given type. Such alignment is the usual behavior. For example,

void *rabid_cheetahs = alloc_percpu(struct rabid_cheetah);

is the same as

void *rabid_cheetahs = __alloc_percpu(sizeof(struct rabid_cheetah),
                                      __alignof__(struct rabid_cheetah));
The __alignof__ construct is a gcc feature that returns the required alignment in bytes for a given type or lvalue. Its syntax is just like that of sizeof. For example, the following would return four on x86:
__alignof__(unsigned long)
When given an lvalue, the return value is the largest alignment that the lvalue might have. For example, an lvalue inside a structure could have a greater alignment requirement than if an instance of the same type were created outside of the structure, because of structure alignment requirements.
12.12 Reasons for Using Per-CPU Data
There are several benefits to using per-CPU data. The first is the reduction in locking requirements. Depending on the semantics by which processors access the per-CPU data, you might not need any locking at all. Keep in mind that the "only this processor accesses this data" rule is only a programming convention; you need to ensure that the local processor accesses only its unique data.
Second, per-CPU data greatly reduces cache invalidation. This invalidation occurs as processors try to keep their caches in sync: if one processor manipulates data held in another processor's cache, that processor must flush or otherwise update its cache. Constant cache invalidation is called thrashing the cache and wreaks havoc on system performance. The use of per-CPU data keeps cache effects to a minimum because processors ideally access only their own data. The percpu interface cache-aligns all data to ensure that accessing one processor's data does not bring in another processor's data on the same cache line.
Consequently, the use of per-CPU data often removes (or at least minimizes) the need for locking. The only safety requirement for the use of per-CPU data is disabling kernel preemption, which is much cheaper than locking, and the interface does so automatically. Per-CPU data can safely be used from either interrupt or process context. Note, however, that you cannot sleep in the middle of manipulating per-CPU data (or you might end up on a different processor).
No one is currently required to use the new per-CPU interface. Doing things manually is fine, as long as you disable kernel preemption. The new interface, however, is much easier to use and might gain additional optimizations in the future. If you decide to use per-CPU data in your kernel code, consider the new interface. One caveat against its use is that it is not backward compatible with earlier kernels.
12.13 Picking an Allocation Method
With myriad allocation functions and methods, it is not always obvious which to use in a given situation. If you need contiguous physical pages, use one of the low-level page allocators or kmalloc(). This is the standard manner of allocating memory from within the kernel. Recall that the two most common flags given to these functions are GFP_ATOMIC and GFP_KERNEL. Specify the GFP_ATOMIC flag to perform a high-priority allocation that will not sleep. Code that can sleep, such as process context code that does not hold a spin lock, should use GFP_KERNEL. This flag specifies an allocation that can sleep, if needed, to obtain the requested memory.
If you want to allocate from high memory, use alloc_pages(). The alloc_pages() function returns a struct page and not a pointer to a logical address. Because high memory might not be mapped, the only way to access it might be via the corresponding struct page structure. To obtain an actual pointer, use kmap() to map the high memory into the kernel's logical address space.
If you do not need physically contiguous pages, only virtually contiguous ones, use vmalloc(), although bear in mind the slight performance hit taken with vmalloc() over kmalloc(). The vmalloc() function allocates kernel memory that is virtually contiguous but not, per se, physically contiguous. It performs this feat much as user-space allocations do, by mapping chunks of physical memory into a contiguous logical address space.
If you are creating and destroying many large data structures, consider setting up a slab cache. The slab layer maintains a per-processor object cache (a free list), which might greatly enhance object allocation and deallocation performance. Rather than frequently allocating and freeing memory, the slab layer stores a cache of already allocated objects for you. When you need a new chunk of memory to hold your data structure, the slab layer often does not need to allocate more memory and instead simply returns an object from the cache.
