Rereading Kernel Memory Management (2): Related Data Structures

 
快乐虾
http://blog.csdn.net/lights_joy/
lights@hb165.com
  
 
This article applies to:
ADI bf561 DSP
uclinux-2008r1-rc8 (ported to vdsp5)
Visual DSP++ 5.0
  
 
Feel free to repost, but please keep the author information.
 
pglist_data is defined in include/linux/mmzone.h:
 
/*
 * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
 * (mostly NUMA machines?) to denote a higher-level memory zone than the
 * zone denotes.
 *
 * On NUMA machines, each NUMA node would have a pg_data_t to describe
 * it's memory layout.
 *
 * Memory statistics and page replacement data structures are maintained on a
 * per-zone basis.
 */
struct bootmem_data;
typedef struct pglist_data {
     struct zone node_zones[MAX_NR_ZONES];
     struct zonelist node_zonelists[MAX_NR_ZONES];
     int nr_zones;
     struct page *node_mem_map;
     struct bootmem_data *bdata;
     unsigned long node_start_pfn;
     unsigned long node_present_pages; /* total number of physical pages */
     unsigned long node_spanned_pages; /* total size of physical page
                            range, including holes */
     int node_id;
     wait_queue_head_t kswapd_wait;
     struct task_struct *kswapd;
     int kswapd_max_order;
} pg_data_t;
This structure describes the layout of the available memory.
• bdata
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
From this definition we can also see that bdata always points to a fixed location, contig_bootmem_data, and that this member is no longer used once mem_init has been called.
• node_zones
Of the zones in this structure the kernel actually uses only ZONE_DMA (index 0); it covers the range from the end of the kernel code to the end of physical memory.
• node_id
Since the whole kernel uses only one node, node_id is 0.
• node_start_pfn
This will be 0.
• node_spanned_pages / node_present_pages
Both members are initialized in calculate_node_totalpages. Their value is the total number of SDRAM pages, including unused regions and the kernel code itself, so at this point the two are equal. For 64MB of memory (actually limited to 60MB) the value is 0x3bff (see the arithmetic sketch after this list).
• node_mem_map
Every 4KB page managed by the kernel has a struct page associated with it; this member points to the start of that page array, which is allocated during initialization by alloc_node_mem_map (using bootmem).
• nr_zones
This holds the highest index of a usable zone plus one. On the BF561 only ZONE_DMA is used, so the value is 1.
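The 0x3bff figure quoted for node_spanned_pages / node_present_pages is simply the span of page frames covered by the node. Below is a minimal user-space sketch of that arithmetic, using the end PFN reported in the text (which corresponds to a memory_end just below 60MB) rather than values read from a running kernel:
/* Sketch of what calculate_node_totalpages() ends up with on this board:
 * both counters equal end_pfn - start_pfn because the node has no holes. */
#include <stdio.h>

int main(void)
{
	unsigned long node_start_pfn = 0;        /* SDRAM starts at physical address 0 */
	unsigned long node_end_pfn   = 0x3bff;   /* value quoted above for 64MB/60MB   */
	unsigned long spanned        = node_end_pfn - node_start_pfn;

	printf("node_spanned_pages = %#lx\n", spanned);
	printf("node_present_pages = %#lx\n", spanned);  /* no holes, so identical */
	return 0;
}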
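For a combined view of these fields, a hypothetical kernel-side helper (dump_node0() is made up for illustration; the symbols it reads are the real ones discussed above) might look like this:
#include <linux/kernel.h>
#include <linux/mmzone.h>

/* Dump the node-0 fields discussed above; purely illustrative. */
static void dump_node0(void)
{
	pg_data_t *pgdat = &contig_page_data;

	printk(KERN_INFO "node %d: start_pfn=%lu spanned=%#lx present=%#lx nr_zones=%d\n",
	       pgdat->node_id, pgdat->node_start_pfn,
	       pgdat->node_spanned_pages, pgdat->node_present_pages,
	       pgdat->nr_zones);
}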
 
The per_cpu_pageset structure, together with the types it relies on (zone_stat_item and per_cpu_pages), is defined in include/linux/mmzone.h:
enum zone_stat_item {
     /* First 128 byte cacheline (assuming 64 bit words) */
     NR_FREE_PAGES,
     NR_INACTIVE,
     NR_ACTIVE,
     NR_ANON_PAGES,     /* Mapped anonymous pages */
     NR_FILE_MAPPED,    /* pagecache pages mapped into pagetables.
                 only modified from process context */
     NR_FILE_PAGES,
     NR_FILE_DIRTY,
     NR_WRITEBACK,
     /* Second 128 byte cacheline */
     NR_SLAB_RECLAIMABLE,
     NR_SLAB_UNRECLAIMABLE,
     NR_PAGETABLE,      /* used for pagetables */
     NR_UNSTABLE_NFS,   /* NFS unstable pages */
     NR_BOUNCE,
     NR_VMSCAN_WRITE,
     NR_VM_ZONE_STAT_ITEMS };
 
struct per_cpu_pages {
     int count;         /* number of pages in the list */
     int high;     /* high watermark, emptying needed */
     int batch;         /* chunk size for buddy add/remove */
     struct list_head list; /* the list of pages */
};
 
struct per_cpu_pageset {
     struct per_cpu_pages pcp[2];     /* 0: hot. 1: cold */
     s8 stat_threshold;
     s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_aligned_in_smp;
This structure is used only inside struct zone, and is accessed through the zone_pcp macro:
#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
The kernel frequently needs to allocate and free single pages. To improve performance, every memory zone defines a per-CPU page cache. Each cache holds a few pre-allocated pages that are used to satisfy single-page requests issued by the local CPU.
In fact two caches are provided per zone and per CPU: a hot cache, whose page frames are likely still present in the CPU's hardware cache, and a cold cache.
The kernel keeps the size of each cache in check with the count, high and batch fields: when a cache runs empty, it is refilled with batch single pages taken from the buddy system; conversely, when count climbs past the high watermark, batch page frames are released from the cache back to the buddy system. A toy restatement of this rule is sketched below.
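The following user-space toy (not kernel code; refill() and drain_if_needed() are made-up stand-ins for the work done by buffered_rmqueue() and free_hot_cold_page() in mm/page_alloc.c) restates that refill/drain rule:
#include <stdio.h>

struct per_cpu_pages { int count, high, batch; };

/* allocation path: an empty per-CPU list is refilled with 'batch' pages */
static void refill(struct per_cpu_pages *pcp)
{
	if (pcp->count == 0) {
		pcp->count += pcp->batch;
		printf("refilled %d pages from the buddy system\n", pcp->batch);
	}
}

/* free path: once 'count' reaches 'high', 'batch' pages go back to the buddy system */
static void drain_if_needed(struct per_cpu_pages *pcp)
{
	if (pcp->count >= pcp->high) {
		pcp->count -= pcp->batch;
		printf("returned %d pages to the buddy system\n", pcp->batch);
	}
}

int main(void)
{
	struct per_cpu_pages hot = { .count = 0, .high = 18, .batch = 3 };  /* high = 6*batch */

	refill(&hot);           /* a single-page allocation finds the list empty  */
	hot.count = 18;         /* pretend frees have filled the list up to 'high' */
	drain_if_needed(&hot);
	return 0;
}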
The structure is initialized by setup_pageset:
inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
     struct per_cpu_pages *pcp;
 
     memset(p, 0, sizeof(*p));
 
     pcp = &p->pcp[0];      /* hot */
     pcp->count = 0;
     pcp->high = 6 * batch;
     pcp->batch = max(1UL, 1 * batch);
     INIT_LIST_HEAD(&pcp->list);
 
     pcp = &p->pcp[1];      /* cold*/
     pcp->count = 0;
     pcp->high = 2 * batch;
     pcp->batch = max(1UL, batch/2);
     INIT_LIST_HEAD(&pcp->list);
}
For 64MB of SDRAM, batch works out to 3.
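Where that 3 comes from can be reproduced in user space; the sketch below redoes the calculation performed by zone_batchsize() in mm/page_alloc.c of this kernel generation (rounddown_pow_of_two() here is a simplified stand-in for the kernel's helper), using the present_pages value quoted later in this article:
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* simplified stand-in for the kernel's rounddown_pow_of_two() */
static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	unsigned long present_pages = 0x3b6a;   /* ZONE_DMA pages for 64MB/60MB SDRAM */
	unsigned long batch;

	batch = present_pages / 1024;           /* aim for ~1/1000 of the zone...     */
	if (batch * PAGE_SIZE > 512 * 1024)     /* ...but no more than half a meg     */
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	batch = rounddown_pow_of_two(batch + batch / 2) - 1;   /* clamp to 2^n - 1 */

	printf("batch = %lu\n", batch);         /* prints 3 */
	return 0;
}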
 
struct zone itself is defined in include/linux/mmzone.h:
struct zone {
     /* Fields commonly accessed by the page allocator */
     unsigned long      pages_min, pages_low, pages_high;
     /*
      * We don't know if the memory that we're going to allocate will be freeable
      * or/and it will be released eventually, so to avoid totally wasting several
      * GB of ram we must reserve some of the lower zone memory (otherwise we risk
      * to run OOM on the lower zones despite there's tons of freeable ram
      * on the higher zones). This array is recalculated at runtime if the
      * sysctl_lowmem_reserve_ratio sysctl changes.
      */
     unsigned long      lowmem_reserve[MAX_NR_ZONES];
 
     struct per_cpu_pageset pageset[NR_CPUS];
     /*
      * free areas of different sizes
      */
     spinlock_t         lock;
     struct free_area   free_area[MAX_ORDER];
 
     ZONE_PADDING(_pad1_)
 
     /* Fields commonly accessed by the page reclaim scanner */
     spinlock_t         lru_lock;
     struct list_head   active_list;
     struct list_head   inactive_list;
     unsigned long      nr_scan_active;
     unsigned long      nr_scan_inactive;
     unsigned long      pages_scanned;        /* since last reclaim */
     int           all_unreclaimable; /* All pages pinned */
 
     /* A count of how many reclaimers are scanning this zone */
     atomic_t      reclaim_in_progress;
 
     /* Zone statistics */
     atomic_long_t      vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
     /*
      * prev_priority holds the scanning priority for this zone. It is
      * defined as the scanning priority at which we achieved our reclaim
      * target at the previous try_to_free_pages() or balance_pgdat()
      * invokation.
      *
      * We use prev_priority as a measure of how much stress page reclaim is
      * under - it drives the swappiness decision: whether to unmap mapped
      * pages.
      *
      * Access to both this field is quite racy even on uniprocessor. But
      * it is expected to average out OK.
      */
     int prev_priority;
 
 
     ZONE_PADDING(_pad2_)
     /* Rarely used or read-mostly fields */
 
     /*
      * wait_table      -- the array holding the hash table
      * wait_table_hash_nr_entries    -- the size of the hash table array
      * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
      *
      * The purpose of all these is to keep track of the people
      * waiting for a page to become available and make them
      * runnable again when possible. The trouble is that this
      * consumes a lot of space, especially when so few things
      * wait on pages at a given time. So instead of using
      * per-page waitqueues, we use a waitqueue hash table.
      *
      * The bucket discipline is to sleep on the same queue when
      * colliding and wake all in that wait queue when removing.
      * When something wakes, it must check to be sure its page is
      * truly available, a la thundering herd. The cost of a
      * collision is great, but given the expected load of the
      * table, they should be so rare as to be outweighed by the
      * benefits from the saved space.
      *
      * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
      * primary users of these fields, and in mm/page_alloc.c
      * free_area_init_core() performs the initialization of them.
      */
     wait_queue_head_t * wait_table;
     unsigned long      wait_table_hash_nr_entries;
     unsigned long      wait_table_bits;
 
     /*
      * Discontig memory support fields.
      */
     struct pglist_data *zone_pgdat;
     /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
     unsigned long      zone_start_pfn;
 
     /*
      * zone_start_pfn, spanned_pages and present_pages are all
      * protected by span_seqlock. It is a seqlock because it has
      * to be read outside of zone->lock, and it is done in the main
      * allocator path. But, it is written quite infrequently.
      *
      * The lock is declared along with zone->lock because it is
      * frequently read in proximity to zone->lock. It's good to
      * give them a chance of being in the same cacheline.
      */
     unsigned long      spanned_pages;     /* total size, including holes */
     unsigned long      present_pages;     /* amount of memory (excluding holes) */
 
     /*
      * rarely used fields:
      */
     const char         *name;
} ____cacheline_internodealigned_in_smp;
Every zone the kernel actually touches is contig_page_data.node_zones[0]; node_zones[1] has no usable space and can be ignored.
• spanned_pages / present_pages
These give the number of usable SDRAM pages, covering the range from 0 to 60MB. present_pages is spanned_pages minus the pages occupied by the page array. For 64MB of SDRAM with MTD disabled, spanned_pages is 0x3bff and present_pages is 0x3b6a.
• zone_pgdat
Points to the single global pglist_data, contig_page_data.
• prev_priority
Initialized to DEF_PRIORITY:
/*
 * The "priority" of VM scanning is how much of the queues we will scan in one
 * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
 * queues ("queue_length >> 12") during an aging round.
 */
#define DEF_PRIORITY 12
• wait_table_hash_nr_entries / wait_table_bits
For 64MB of memory, wait_table_hash_nr_entries is 0x40 and wait_table_bits is 6. The wait_table itself occupies wait_table_hash_nr_entries * sizeof(wait_queue_head_t) bytes (see the sizing sketch after this list).
• free_area[MAX_ORDER]
struct free_area {
     struct list_head   free_list;
     unsigned long      nr_free;
};
In the buddy algorithm, free pages are organized into 11 block lists; the blocks on these lists contain 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 contiguous pages respectively. This member holds those lists.
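The 0x40 and 6 quoted for the wait table can likewise be reproduced in user space. This is only a minimal re-creation of wait_table_hash_nr_entries() and wait_table_bits() from mm/page_alloc.c, assuming PAGES_PER_WAITQUEUE is 256 as in this kernel:
#include <stdio.h>

#define PAGES_PER_WAITQUEUE 256UL

int main(void)
{
	unsigned long pages = 0x3bff;   /* pages covered by the 60MB zone       */
	unsigned long size = 1, bits = 0;

	pages /= PAGES_PER_WAITQUEUE;   /* roughly one queue per 256 pages      */
	while (size < pages)
		size <<= 1;             /* round up to a power of two           */
	if (size < 4)
		size = 4;               /* the kernel also clamps this to 4096  */

	while ((1UL << bits) < size)
		bits++;

	printf("wait_table_hash_nr_entries = %#lx\n", size);   /* 0x40 */
	printf("wait_table_bits            = %lu\n",  bits);   /* 6    */
	return 0;
}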
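To make the free_area layout concrete, here is a small, purely illustrative program that prints the block size managed by each order, assuming MAX_ORDER is 11 and 4KB pages as on this platform:
#include <stdio.h>

#define MAX_ORDER 11
#define PAGE_SIZE 4096UL

int main(void)
{
	int order;

	for (order = 0; order < MAX_ORDER; order++)
		printf("free_area[%2d]: blocks of %4lu contiguous pages (%5lu KB)\n",
		       order, 1UL << order, ((1UL << order) * PAGE_SIZE) / 1024);
	return 0;
}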
struct zonelist is defined in include/linux/mmzone.h:
/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 *
 * If zlcache_ptr is not NULL, then it is just the address of zlcache,
 * as explained above. If zlcache_ptr is NULL, there is no zlcache.
 */
 
struct zonelist {
     struct zonelist_cache *zlcache_ptr;            // NULL or &zlcache
     struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
};
Here,
/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
Its value here is 2.
• zones
This structure is used only inside pglist_data. After initialization, zones effectively contains a single entry, pointing to ZONE_DMA, i.e. contig_page_data.node_zones[0].
• zlcache_ptr
This field is not actually used and remains NULL.
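To illustrate the "goal zone first, then fallbacks" walk described in the comment above, here is a toy user-space model (try_zone() and the 100-page zone are made up for the example; on this board the NULL-delimited list holds a single zone, so the loop runs at most once):
#include <stdio.h>
#include <stddef.h>

struct zone { const char *name; int free_pages; };

/* made-up stand-in for asking one zone's buddy allocator for pages */
static struct zone *try_zone(struct zone *z, int pages)
{
	return z->free_pages >= pages ? z : NULL;
}

int main(void)
{
	struct zone dma = { "DMA", 100 };
	struct zone *zones[] = { &dma, NULL };   /* NULL-delimited, like zonelist->zones */
	struct zone **z;

	for (z = zones; *z != NULL; z++) {       /* goal zone first, then fallbacks */
		if (try_zone(*z, 4)) {
			printf("allocated from zone %s\n", (*z)->name);
			break;
		}
	}
	return 0;
}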
 
struct page is defined in include/linux/mm_types.h:
/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 */
struct page {
     unsigned long flags;        /* Atomic flags, some possibly
                        * updated asynchronously */
     atomic_t _count;       /* Usage count, see below. */
     union {
         atomic_t _mapcount;    /* Count of ptes mapped in mms,
                        * to show when page is mapped
                        * & limit reverse map searches.
                        */
         struct { /* SLUB uses */
              short unsigned int inuse;
              short unsigned int offset;
         };
     };
     union {
         struct {
         unsigned long private;      /* Mapping-private opaque data:
                             * usually used for buffer_heads
                             * if PagePrivate set; used for
                             * swp_entry_t if PageSwapCache;
                             * indicates order in the buddy
                             * system if PG_buddy is set.
                             */
         struct address_space *mapping;   /* If low bit clear, points to
                             * inode address_space, or NULL.
                             * If page mapped as anonymous
                             * memory, low bit is set, and
                             * it points to anon_vma object:
                             * see PAGE_MAPPING_ANON below.
                             */
         };
         spinlock_t ptl;
         struct {           /* SLUB uses */
         void **lockless_freelist;
         struct kmem_cache *slab;    /* Pointer to slab */
         };
         struct {
         struct page *first_page;    /* Compound pages */
         };
     };
     union {
         pgoff_t index;         /* Our offset within mapping. */
         void *freelist;        /* SLUB: freelist req. slab lock */
     };
     struct list_head lru;       /* Pageout list, eg. active_list
                        * protected by zone->lru_lock !
                        */
};
The kernel describes every 4KB memory page with one page structure; the virt_to_page macro quickly finds the page structure for a given physical address.
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array. That memory is allocated with bootmem in alloc_node_mem_map, but unlike other bootmem allocations it is never reclaimed.
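As a hypothetical usage sketch (not taken from the original text; show_page_of_buffer() is a made-up helper), this is how kernel code could map a kmalloc'ed buffer back to its struct page with virt_to_page:
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Find the struct page behind a small kernel buffer and print its PFN. */
static void show_page_of_buffer(void)
{
	void *buf = kmalloc(128, GFP_KERNEL);   /* kernel address == physical here */
	struct page *pg;

	if (!buf)
		return;

	pg = virt_to_page(buf);                 /* mem_map + (addr >> PAGE_SHIFT)  */
	printk(KERN_INFO "buffer %p is in pfn %lu, flags %#lx\n",
	       buf, page_to_pfn(pg), pg->flags);
	kfree(buf);
}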
The individual bits of this structure's flags member are defined in include/linux/page-flags.h:
/*
 * Various page->flags bits:
 *
 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
 * private allocations for its own usage.
 *
 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
 * is set before writeback starts and cleared when it finishes.
 *
 * PG_locked also pins a page in pagecache, and blocks truncation of the file
 * while it is held.
 *
 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
 * to become unlocked.
 *
 * PG_uptodate tells whether the page's contents is valid. When a read
 * completes, the page becomes uptodate, unless a disk I/O error happened.
 *
 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
 * file-backed pagecache (see mm/vmscan.c).
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit. The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.
 *
 * PG_highmem pages are not permanently mapped into the kernel virtual address
 * space, they need to be kmapped separately for doing IO on the pages. The
 * struct page (these bits with information) are always mapped into kernel
 * address space...
 *
 * PG_buddy is set to indicate that the page is free and in the buddy system
 * (see mm/page_alloc.c).
 *
 */
 
/*
 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
 * locked- and dirty-page accounting.
 *
 * The page flags field is split into two parts, the main flags area
 * which extends from the low bits upwards, and the fields area which
 * extends from the high bits downwards.
 *
 * | FIELD | ... | FLAGS |
 * N-1     ^             0
 *          (N-FLAGS_RESERVED)
 *
 * The fields area is reserved for fields mapping zone, node and SPARSEMEM
 * section. The boundry between these two areas is defined by
 * FLAGS_RESERVED which defines the width of the fields section
 * (see linux/mmzone.h). New flags must _not_ overlap with this area.
 */
#define PG_locked      0   /* Page is locked. Don't touch. */
#define PG_error       1
#define PG_referenced       2
#define PG_uptodate         3
 
#define PG_dirty        4
#define PG_lru              5
#define PG_active      6
#define PG_slab             7   /* slab debug (Suparna wants this) */
 
#define PG_owner_priv_1     8   /* Owner use. If pagecache, fs may use*/
#define PG_arch_1      9
#define PG_reserved         10
#define PG_private     11   /* If pagecache, has fs-private data */
 
#define PG_writeback        12   /* Page is under writeback */
#define PG_compound         14   /* Part of a compound page */
#define PG_swapcache        15   /* Swap page: swp_entry_t in private */
 
#define PG_mappedtodisk     16   /* Has blocks allocated on-disk */
#define PG_reclaim     17   /* To be reclaimed asap */
#define PG_buddy       19   /* Page is free, on buddy lists */
 
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked     PG_owner_priv_1 /* Used by some filesystems */
 
At initialization, flags is set to PG_reserved for every page.
The initialization of the page structures can be seen in memmap_init_zone, in mm/page_alloc.c:
/*
 * Initially all pages are reserved - free ones are freed
 * up by free_all_bootmem() once the early boot process is
 * done. Non-atomic initialization, single-pass.
 */
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
         unsigned long start_pfn, enum memmap_context context)
{
     struct page *page;
     unsigned long end_pfn = start_pfn + size;
     unsigned long pfn;
 
     for (pfn = start_pfn; pfn < end_pfn; pfn++) {
         /*
          * There can be holes in boot-time mem_map[]s
          * handed to this function. They do not
          * exist on hotplugged memory.
          */
         if (context == MEMMAP_EARLY) {
              if (!early_pfn_valid(pfn))
                   continue;
              if (!early_pfn_in_nid(pfn, nid))
                   continue;
         }
         page = pfn_to_page(pfn);
         set_page_links(page, zone, nid, pfn);
         init_page_count(page);
         reset_page_mapcount(page);
         SetPageReserved(page);
         INIT_LIST_HEAD(&page->lru);
     }
}
In the function above, because both the zone and nid arguments are 0, set_page_links() contributes nothing to flags, so as far as flags is concerned only
SetPageReserved(page);
takes effect, leaving the flags member equal to 0x400 (PG_reserved).
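The 0x400 value follows directly from the bit position of PG_reserved; a trivial check:
#include <stdio.h>

#define PG_reserved 10   /* bit number from include/linux/page-flags.h */

int main(void)
{
	printf("flags with only PG_reserved set = %#lx\n", 1UL << PG_reserved);  /* 0x400 */
	return 0;
}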
Below are several macros that operate on pages.
virt_to_pfn is defined in include/asm/page.h:
#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
#define __pa(vaddr)         virt_to_phys((void *)(vaddr))
#define virt_to_phys(vaddr) ((unsigned long) (vaddr))
From these three definitions we can see that virt_to_pfn takes an address (on this MMU-less platform virtual and physical addresses are identical) and computes the index of its page in the overall page array, i.e. its page frame number.
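A tiny user-space illustration of that arithmetic (the address below is arbitrary; with PAGE_OFFSET equal to 0, the PFN is simply the address shifted right by PAGE_SHIFT):
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long addr = 0x00123456;            /* arbitrary example address   */
	unsigned long pfn  = addr >> PAGE_SHIFT;    /* what virt_to_pfn() computes */

	printf("address %#lx lies in page frame %#lx\n", addr, pfn);
	return 0;
}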
page_to_virt is defined in include/asm/page.h:
#define page_to_virt(page) ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
Here mem_map again points to the first element of the page array, which is allocated with bootmem in alloc_node_mem_map.
Here,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a pointer to a page structure and computes the address of the physical memory page that the structure represents.
 
page_to_pfn is defined in include/asm/page.h:
#define page_to_pfn(page)   virt_to_pfn(page_to_virt(page))
From the virt_to_pfn and page_to_virt definitions, page_to_pfn takes a pointer to a page structure and yields that page's frame number within the overall page array.
In fact,
#define page_to_pfn(page)   ((page) - mem_map)
would achieve the same result.
pfn_to_virt is defined in include/asm/page.h:
#define pfn_to_virt(pfn)    __va((pfn) << PAGE_SHIFT)
#define __va(paddr)         phys_to_virt((unsigned long)(paddr))
#define phys_to_virt(vaddr) ((void *) (vaddr))
From these three definitions, this macro takes a page frame number and returns the physical address of the page with that number.
virt_to_page is defined in include/asm/page.h:
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array.
Here,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a physical address and computes a pointer to the page structure representing the page that contains that address.
pfn_to_page is defined in include/asm/page.h:
#define pfn_to_page(pfn)    virt_to_page(pfn_to_virt(pfn))
From the pfn_to_virt and virt_to_page definitions, pfn_to_page takes a page frame number and returns a pointer to the corresponding page structure.
In fact,
#define pfn_to_page(pfn)    (mem_map + (pfn))
would achieve the same result.
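That equivalence can be checked with a toy user-space model of mem_map (the trivial struct page stand-in and the 16-page mem_map below are made up for the example):
#include <stdio.h>

struct page { unsigned long flags; };          /* trivial stand-in for struct page */

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0UL

static struct page mem_map[16];                /* pretend the node has 16 pages */

#define page_to_virt(page) ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
#define virt_to_pfn(kaddr) (((unsigned long)(kaddr)) >> PAGE_SHIFT)
#define page_to_pfn(page)  virt_to_pfn(page_to_virt(page))

int main(void)
{
	struct page *p = &mem_map[5];

	/* the macro chain and the plain pointer subtraction agree */
	printf("page_to_pfn via the macros : %lu\n", page_to_pfn(p));
	printf("page - mem_map             : %ld\n", (long)(p - mem_map));

	/* and mem_map + pfn recovers the same struct page pointer */
	printf("round trip ok              : %d\n", (mem_map + page_to_pfn(p)) == p);
	return 0;
}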