1. Frequency of the Timer Interrupt
2. The global variable jiffies holds the number of ticks that have occurred since the system booted.
jiffies is declared in <linux/jiffies.h>:
extern unsigned long volatile jiffies;
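On 32-bit architectures jiffies is the low 32 bits of the 64-bit jiffies_64. Most code uses jiffies; only time-management code reads jiffies_64, which will not overflow in any practical time frame.
Because the 32-bit jiffies does wrap around, compare tick counts with these macros instead of using < and > directly: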
#define time_after(unknown, known) ((long)(known) - (long)(unknown) < 0)
#define time_before(unknown, known) ((long)(unknown) - (long)(known) < 0)
#define time_after_eq(unknown, known) ((long)(unknown) - (long)(known) >= 0)
#define time_before_eq(unknown, known) ((long)(known) - (long)(unknown) >= 0)
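Why the signed difference survives wraparound can be checked in user space. The following is a minimal sketch, not kernel code: tick_after(), tick_before() and naive_before() are hypothetical names, and the subtraction is done on unsigned values (which wrap by definition) before the result is reinterpreted as signed.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for time_after():
 * true if 'unknown' is later than 'known', even across a wrap. */
static bool tick_after(unsigned long unknown, unsigned long known)
{
    return (long)(known - unknown) < 0;
}

/* Hypothetical user-space stand-in for time_before(). */
static bool tick_before(unsigned long unknown, unsigned long known)
{
    return (long)(unknown - known) < 0;
}

/* The naive comparison, which gives the wrong answer once the
 * timeout value has wrapped past zero. */
static bool naive_before(unsigned long now, unsigned long timeout)
{
    return now < timeout;
}
```

With now = ULONG_MAX - 5 and timeout = now + 10, the timeout wraps to a small number: naive_before() reports that the timeout has already passed, while tick_before() still reports that now is before it.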
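3. The timer interrupt handler is split into an architecture-dependent part and an architecture-independent part, tick_periodic().
4. The current time of day (the wall time) is defined in kernel/time/timekeeping.c: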
struct timespec xtime;
The timespec data structure is defined in <linux/time.h> as:
struct timespec {
    __kernel_time_t tv_sec;  /* seconds */
    long tv_nsec;            /* nanoseconds */
};
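xtime.tv_sec stores the number of seconds elapsed since January 1, 1970 (UTC), a date known as the epoch.
Reading or writing xtime requires xtime_lock, which is a seqlock rather than an ordinary spinlock.
Writing xtime: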
write_seqlock(&xtime_lock);
/* update xtime ... */
write_sequnlock(&xtime_lock);
Reading xtime (sec and usec are declared by the enclosing function; the loop repeats if a writer updated xtime during the read):
unsigned long seq;
do {
    unsigned long lost;
    seq = read_seqbegin(&xtime_lock);
    usec = timer->get_offset();
    lost = jiffies - wall_jiffies;
    if (lost)
        usec += lost * (1000000 / HZ);
    sec = xtime.tv_sec;
    usec += (xtime.tv_nsec / 1000);
} while (read_seqretry(&xtime_lock, seq));
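The read-retry pattern above can be sketched in user space with C11 atomics. This is only an illustration of the idea, not the kernel's seqlock: the seq_* names and struct seq_time are hypothetical, a single writer is assumed (the kernel additionally serializes writers with a spinlock), and every access uses the default sequentially consistent ordering for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical seqlock-protected time value. */
struct seq_time {
    atomic_uint  seq;   /* even: stable, odd: write in progress */
    _Atomic long sec;
    _Atomic long nsec;
};

static void seq_init(struct seq_time *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->sec, 0);
    atomic_init(&s->nsec, 0);
}

/* Writer: bump the counter to odd, update, bump it back to even. */
static void seq_write(struct seq_time *s, long sec, long nsec)
{
    atomic_fetch_add(&s->seq, 1);
    atomic_store(&s->sec, sec);
    atomic_store(&s->nsec, nsec);
    atomic_fetch_add(&s->seq, 1);
}

/* Reader side: wait until no write is in progress, remember the counter. */
static unsigned int seq_read_begin(struct seq_time *s)
{
    unsigned int v;

    while ((v = atomic_load(&s->seq)) & 1u)
        ; /* a writer is active; spin */
    return v;
}

/* True if a write happened since seq_read_begin() returned 'start'. */
static bool seq_read_retry(struct seq_time *s, unsigned int start)
{
    return atomic_load(&s->seq) != start;
}

/* Lock-free read: repeat until a consistent snapshot is obtained. */
static void seq_read(struct seq_time *s, long *sec, long *nsec)
{
    unsigned int start;

    do {
        start = seq_read_begin(s);
        *sec = atomic_load(&s->sec);
        *nsec = atomic_load(&s->nsec);
    } while (seq_read_retry(s, start));
}
```

Readers never block the writer, which is the property that makes this scheme suitable for a value that is read often and written rarely.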
User space obtains the wall time through gettimeofday(), which is implemented by sys_gettimeofday() in kernel/time.c:
asmlinkage long sys_gettimeofday(struct timeval *tv, struct timezone *tz)
{
    if (likely(tv)) {
        struct timeval ktv;
        do_gettimeofday(&ktv);
        if (copy_to_user(tv, &ktv, sizeof(ktv)))
            return -EFAULT;
    }
    if (unlikely(tz)) {
        if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
            return -EFAULT;
    }
    return 0;
}
settimeofday() sets the wall time; the caller must have the CAP_SYS_TIME capability.
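From the user-space side, the call looks like the sketch below. now_usec() is a hypothetical helper name; only gettimeofday() and struct timeval come from the C library.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: wall time in microseconds since the epoch,
 * or -1 if gettimeofday() fails. */
static long long now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

The timezone argument is passed as NULL, so only the tv branch of sys_gettimeofday() is exercised.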
5. Timers
Timers are represented by struct timer_list, defined in <linux/timer.h>:
struct timer_list {
    struct list_head entry;           /* entry in linked list of timers */
    unsigned long expires;            /* expiration value, in jiffies */
    void (*function)(unsigned long);  /* the timer handler function */
    unsigned long data;               /* lone argument to the handler */
    struct tvec_t_base_s *base;       /* internal timer field, do not touch */
};
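The first step in creating a timer is to define it:
struct timer_list my_timer;
Initialize the timer's internal fields:
init_timer(&my_timer);
Then fill in the remaining fields as needed:
my_timer.expires = jiffies + delay; /* timer expires in delay ticks */
my_timer.data = 0; /* zero is passed to the timer handler */
my_timer.function = my_function; /* function to run when timer expires */
The handler that runs when the timer expires must have this prototype:
void my_function(unsigned long data);
Finally, activate the timer:
add_timer(&my_timer);
mod_timer() changes the expiration of a timer. It also works on a timer that has been initialized but not yet activated; in either case the timer is active once mod_timer() returns.
To cancel a timer before it expires:
del_timer(&my_timer);
To cancel the timer and wait for a handler that may be running on another processor to finish:
del_timer_sync(&my_timer); /* unlike del_timer(), cannot be used from interrupt context */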
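The kernel runs timers in bottom-half context, as softirqs, after the timer interrupt completes. The timer interrupt handler calls update_process_times(), which calls run_local_timers():
void run_local_timers(void)
{
    hrtimer_run_queues();
    raise_softirq(TIMER_SOFTIRQ); /* raise the timer softirq */
    softlockup_tick();
}
The TIMER_SOFTIRQ softirq is handled by run_timer_softirq(), which runs every expired timer. Timers are stored in linked lists; to avoid walking or sorting one long list, the kernel partitions timers into five groups according to their expiration value.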
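The five-way grouping can be sketched as a function of how many ticks remain until expiration. The sketch assumes the classic 2.6 timer-wheel layout, in which the first group covers the low 8 bits of the remaining ticks and each later group covers 6 more bits; timer_group() and the two constants are hypothetical names chosen for illustration.

```c
/* Assumed bit widths of the classic timer wheel (an assumption of this
 * sketch, not taken from the notes above). */
#define GROUP1_BITS 8
#define GROUPN_BITS 6

/* Hypothetical helper: which of the five groups (1..5) a timer falls
 * into, given the number of ticks remaining until it expires. */
static int timer_group(unsigned long ticks_left)
{
    if (ticks_left < (1UL << GROUP1_BITS))
        return 1;
    if (ticks_left < (1UL << (GROUP1_BITS + GROUPN_BITS)))
        return 2;
    if (ticks_left < (1UL << (GROUP1_BITS + 2 * GROUPN_BITS)))
        return 3;
    if (ticks_left < (1UL << (GROUP1_BITS + 3 * GROUPN_BITS)))
        return 4;
    return 5;
}
```

Only timers in the first group need to be examined on every tick; timers in the later groups are moved down to earlier groups as their expiration approaches.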
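6. Delaying Execution
Busy looping:
unsigned long timeout = jiffies + 10; /* ten ticks */
while (time_before(jiffies, timeout))
    ;
This spins the processor without doing any useful work. The following variant lets other processes run while waiting:
unsigned long delay = jiffies + 5*HZ;
while (time_before(jiffies, delay))
    cond_resched();
For delays shorter than one tick, the kernel provides functions that do not rely on jiffies:
void udelay(unsigned long usecs);
void ndelay(unsigned long nsecs);
void mdelay(unsigned long msecs);
udelay() is implemented as a busy loop. The number of loop iterations needed for a given delay is derived from the BogoMIPS value that the kernel calibrates at boot.
A better way to delay:
schedule_timeout() puts the task to sleep until at least the specified time has elapsed, then wakes it. Usage:
/* set task's state to interruptible sleep */
set_current_state(TASK_INTERRUPTIBLE);
/* take a nap and wake up in "s" seconds */
schedule_timeout(s * HZ);
Code that sleeps waiting for either an event or a timeout can call schedule_timeout() instead of schedule().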
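Timeouts such as s * HZ are expressed in ticks, so their real length depends on HZ. The sketch below converts milliseconds to ticks, rounding up so that the delay is never shorter than requested. ms_to_ticks() is a hypothetical user-space helper, and hz is passed as a parameter because HZ is a kernel build-time constant.

```c
/* Hypothetical helper: number of ticks needed to cover 'ms'
 * milliseconds at a tick rate of 'hz', rounded up. */
static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
    unsigned long long scaled = (unsigned long long)ms * hz;

    return (unsigned long)((scaled + 999) / 1000);
}
```

At HZ = 100 a tick is 10 ms, so any request from 1 ms to 10 ms costs one full tick; this is why sub-tick delays use udelay() and its relatives instead.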
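Chapter 12: Memory Management
1. The page is the smallest unit of virtual memory management. Page size depends on the architecture: 4KB is typical on 32-bit architectures and also on x86-64, while some 64-bit architectures such as Alpha use 8KB.
2. Every physical page is represented by a struct page, defined in <linux/mm_types.h>. A simplified definition: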
struct page {
    unsigned long flags;
    atomic_t _count;
    atomic_t _mapcount;
    unsigned long private;
    struct address_space *mapping;
    pgoff_t index;
    struct list_head lru;
    void *virtual;
};
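flags: stores the status of the page. Each bit represents one status flag, so at least 32 flags are available; the flag values are defined in <linux/page-flags.h>.
_count: the usage count of the page, that is, how many references to it exist. When the count reaches -1, nobody is using the page and it is available for a new allocation. Kernel code should call page_count() instead of reading this field directly; page_count() returns 0 when the page is free and a nonzero value when it is in use. A page can be used by the page cache (in which case mapping points to the associated address_space object), as private data (pointed to by private), or as a mapping in a process's page table.
virtual: the virtual address of the page.
struct page describes physical pages, not virtual pages. It describes the physical memory itself, not the data stored in it.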
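3. The kernel divides pages into zones. Linux has four primary memory zones, defined in <linux/mmzone.h>:
ZONE_DMA: contains pages that can undergo DMA.
ZONE_DMA32: contains pages that can undergo DMA, but that are accessible only by 32-bit devices.
ZONE_NORMAL: contains normal, regularly mapped pages.
ZONE_HIGHMEM: contains high memory, that is, pages that are not permanently mapped into the kernel's address space.
Each zone is represented by struct zone (<linux/mmzone.h>):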
struct zone {
    unsigned long watermark[NR_WMARK];
    unsigned long lowmem_reserve[MAX_NR_ZONES];
    struct per_cpu_pageset pageset[NR_CPUS];
    spinlock_t lock;
    struct free_area free_area[MAX_ORDER];
    spinlock_t lru_lock;
    struct zone_lru {
        struct list_head list;
        unsigned long nr_saved_scan;
    } lru[NR_LRU_LISTS];
    struct zone_reclaim_stat reclaim_stat;
    unsigned long pages_scanned;
    unsigned long flags;
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
    int prev_priority;
    unsigned int inactive_ratio;
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;
    struct pglist_data *zone_pgdat;
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
    unsigned long present_pages;
    const char *name;
};
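lock: a spinlock that protects the structure from concurrent access. It protects only the structure itself, not the pages that reside in the zone.
watermark: holds the minimum, low, and high watermarks for the zone, which the kernel uses to judge how much of the zone's memory has been consumed.
name: the name of the zone, initialized in mm/page_alloc.c. Three of the zones are named DMA, Normal, and HighMem.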
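4. Allocating pages (<linux/gfp.h>):
struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)
This allocates 2^order (that is, 1 << order) contiguous physical pages and returns a pointer to the struct page of the first one, or NULL on failure.
To convert a page to its logical address:
void * page_address(struct page *page)
If you do not need the struct page, __get_free_pages() works like alloc_pages() but returns the logical address of the first page directly:
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
To get just a single page, use these wrappers, which pass an order of zero:
struct page * alloc_page(gfp_t gfp_mask)
unsigned long __get_free_page(gfp_t gfp_mask)
To get a page that is filled with zeros:
unsigned long get_zeroed_page(gfp_t gfp_mask)
To free pages:
void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)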
Allocation example:
unsigned long page;
page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
    /* insufficient memory: you must handle this error! */
    return -ENOMEM;
}
/* 'page' is now the address of the first of eight contiguous pages ... */
Freeing example:
free_pages(page, 3);
/*
 * our pages are now freed and we should no
 * longer access the address stored in 'page'
 */
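The order argument in the example (3, for eight pages) is the smallest power of two that covers the request. A user-space sketch of that calculation follows. The kernel offers get_order() for this purpose; order_for_size() below is a hypothetical stand-in, and the 4KB page size is an assumption of the sketch.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Hypothetical helper: smallest order such that (1 << order) pages
 * of SKETCH_PAGE_SIZE bytes hold at least 'size' bytes. */
static unsigned int order_for_size(size_t size)
{
    size_t pages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

Because allocations come in powers of two, a request for one byte more than eight pages is rounded up to sixteen pages, which is one reason byte-sized allocations usually go through kmalloc() instead.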
5. kmalloc()
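kmalloc() allocates kernel memory in byte-sized chunks. It is declared in <linux/slab.h>. On success it returns the address of the allocated memory; on failure it returns NULL.
void * kmalloc(size_t size, gfp_t flags);
Example:
struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
    /* handle error ... */
gfp_t, defined in <linux/types.h>, holds the flags that control how memory is allocated. The flags fall into three categories: action modifiers, zone modifiers, and type flags.
Action modifiers: specify how the kernel is to allocate the requested memory.
Zone modifiers: specify which zone the memory is allocated from.
Type flags: predefined combinations of action and zone modifiers for a particular kind of allocation. Code should normally use these instead of combining modifiers by hand.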
kfree(), declared in <linux/slab.h>, frees memory obtained from kmalloc():
void kfree(const void *ptr)
Example:
char *buf;
buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buf)
    /* error allocating memory ! */
....
kfree(buf);
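6. vmalloc() (declared in <linux/vmalloc.h>, defined in mm/vmalloc.c)
void * vmalloc(unsigned long size)
vmalloc() is similar to kmalloc(), but the memory it returns is only virtually contiguous and need not be physically contiguous. Most kernel code uses kmalloc() because it is cheaper: vmalloc() has to set up page table mappings to make the pages appear contiguous.
vmalloc() may sleep, so it cannot be called from interrupt context or from any other situation where blocking is not allowed.
To free the memory: void vfree(const void *addr)
Example:
char *buf;
buf = vmalloc(16 * PAGE_SIZE); /* get 16 pages */
if (!buf)
    /* error! failed to allocate memory */
/*
 * buf now points to at least 16*PAGE_SIZE bytes
 * of virtually contiguous memory
 */
After you finish with the memory, make sure to free it:
vfree(buf);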
7. slab layer
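The slab layer is a generic caching layer for data structures. It caches frequently used objects, which reduces memory fragmentation and speeds up code that allocates and frees the same kind of object often.
The slab layer groups objects into caches, one cache per object type. Each cache is divided into slabs, and each slab consists of one or more physically contiguous pages. A cache typically contains multiple slabs.
Each slab is in one of three states: full, partial, or empty.
Each cache is represented by a kmem_cache structure, which contains three lists: slabs_full, slabs_partial, and slabs_empty.
The slab descriptor: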
struct slab {
    struct list_head list;    /* full, partial, or empty list */
    unsigned long colouroff;  /* offset for the slab coloring */
    void *s_mem;              /* first object in the slab */
    unsigned int inuse;       /* allocated objects in the slab */
    kmem_bufctl_t free;       /* first free object, if any */
};
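New slabs are allocated by kmem_getpages(), which obtains the memory through __get_free_pages():
static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
kmem_freepages() releases the memory again.
A new cache is created with:
struct kmem_cache * kmem_cache_create(const char *name,
                                      size_t size,
                                      size_t align,
                                      unsigned long flags,
                                      void (*ctor)(void *));
On success it returns a pointer to the new cache; on failure it returns NULL. This function may sleep, so it must not be called from interrupt context.
To destroy a cache:
int kmem_cache_destroy(struct kmem_cache *cachep);
It returns zero on success and nonzero on failure.
Once a cache exists, objects are obtained from it with:
void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
Example:
A global pointer refers to the task_struct cache: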
struct kmem_cache *task_struct_cachep;
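The cache is created in fork_init():
task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);
When a process calls fork(), a new process descriptor is allocated in dup_task_struct(), which is called from do_fork():
struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
    return NULL;
After a task dies, and provided no children are waiting on it, its process descriptor is returned to the task_struct_cachep slab cache by free_task_struct(), which calls:
kmem_cache_free(task_struct_cachep, tsk);
Process descriptors are a core part of the kernel and are always needed, so this cache is never destroyed.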
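The create, alloc, free cycle can be imitated in user space with a fixed-size object cache. This is a deliberately reduced sketch and not the kernel's slab allocator: it has a single "slab" obtained from malloc(), no coloring, no per-CPU caches and no locking, and every name in it (struct obj_cache, cache_init() and so on) is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size object cache backed by one malloc'd slab. */
struct obj_cache {
    void  *slab;       /* backing memory for all objects */
    void  *free_list;  /* list threaded through the free objects */
    size_t obj_size;   /* object size after rounding */
    size_t inuse;      /* objects currently handed out */
};

/* Carve the slab into nr_objs objects and link them all as free. */
static int cache_init(struct obj_cache *c, size_t obj_size, size_t nr_objs)
{
    size_t align = sizeof(void *);

    /* a free object stores the next pointer inside itself */
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    obj_size = (obj_size + align - 1) / align * align;

    c->slab = malloc(obj_size * nr_objs);
    if (!c->slab)
        return -1;
    c->obj_size = obj_size;
    c->inuse = 0;
    c->free_list = NULL;
    for (size_t i = nr_objs; i-- > 0; ) {
        void *obj = (char *)c->slab + i * obj_size;
        *(void **)obj = c->free_list;
        c->free_list = obj;
    }
    return 0;
}

/* Pop one object off the free list, or return NULL if the slab is full. */
static void *cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (!obj)
        return NULL; /* a real allocator would grow a new slab here */
    c->free_list = *(void **)obj;
    c->inuse++;
    return obj;
}

/* Push the object back; the next allocation reuses it immediately. */
static void cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_list;
    c->free_list = obj;
    c->inuse--;
}

static void cache_destroy(struct obj_cache *c)
{
    free(c->slab);
    c->slab = NULL;
    c->free_list = NULL;
}
```

Freed objects go back on the free list instead of being returned to the system, so a later allocation of the same type is a pointer pop with no search and no fragmentation, which is the benefit described above for frequently allocated objects.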
8. Statically Allocating on the Stack
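Local variables are allocated statically on the kernel stack, which is small and fixed in size. Historically each process has a two-page kernel stack: 8KB with 4KB pages (typical of 32-bit architectures) and 16KB with 8KB pages.
A compile-time option selects single-page kernel stacks. With that option, interrupt handlers no longer use the stack of the interrupted process and run on a separate interrupt stack instead.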
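9. High Memory Mappings
To map a page permanently into the kernel's address space, use kmap(), declared in <linux/highmem.h>:
void *kmap(struct page *page);
kmap() works on both high memory and low memory. Unmap the page when it is no longer needed:
void kunmap(struct page *page);
Temporary mappings:
These can be used where sleeping is not allowed, such as in interrupt handlers.
void *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(void *kvaddr, enum km_type type);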
10. The New percpu Interface
The 2.6 kernel introduced a new interface, percpu, declared in <linux/percpu.h> and implemented in mm/slab.c and <asm/percpu.h>.
To define a per-CPU variable at compile time:
DEFINE_PER_CPU(type, name);
To refer to a per-CPU variable declared elsewhere:
DECLARE_PER_CPU(type, name);
Use get_cpu_var() and put_cpu_var() to manipulate the data:
get_cpu_var(name)++; /* increment name on this processor */
put_cpu_var(name); /* done; enable kernel preemption */
To access another processor's copy:
per_cpu(name, cpu)++; /* increment name on the given processor */
Dynamically allocated per-CPU data:
void *alloc_percpu(type); /* a macro */
void *__alloc_percpu(size_t size, size_t align);
void free_percpu(const void *);
get_cpu_var(ptr); /* return a void pointer to this processor's copy of ptr */
put_cpu_var(ptr); /* done; enable kernel preemption */
Example:
void *percpu_ptr;
unsigned long *foo;
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr)
    /* error allocating memory .. */
foo = get_cpu_var(percpu_ptr);
/* manipulate foo .. */
put_cpu_var(percpu_ptr);