How Memory mmap Is Implemented

Original post: http://blog.chinaunix.net/space.php?uid=24517893&do=blog&id=107464

Posted 2011-01-28 13:57 · Category: linux / memory

 

This post is dedicated to the days gone by.

 

Structures covered

1. struct rb_node_s

2. struct page

3. struct vm_area_struct

4. struct mm_struct

Functions covered

do_mmap

do_mmap2

do_mmap_pgoff

get_unmapped_area / arch_get_unmapped_area

find_vma

mmap_region

 

Red-black tree structure

Since 2.4.10, the Linux kernel has organized virtual memory areas in a red-black tree rather than an AVL tree. This was done for efficiency: the two trees are similar, but red-black trees perform better on insertion and deletion. A brief introduction follows.

A red-black tree is a binary tree with the following properties:

(1) Every node is colored either red or black.

(2) The root node is black.

(3) If a node is red, its children must be black.

(4) Every path from a node down to a leaf contains the same number of black nodes.

The red-black tree node is defined in include/linux/rbtree.h as follows:

typedef struct rb_node_s
{
	struct rb_node_s *rb_parent;
	int rb_color;
#define RB_RED		0
#define RB_BLACK	1
	struct rb_node_s *rb_right;
	struct rb_node_s *rb_left;
} rb_node_t;
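The kernel embeds this node inside larger structures (vm_area_struct embeds one as vm_rb, shown below) and recovers the enclosing structure with rb_entry. A minimal user-space sketch of that pattern, assuming the usual container_of-style definition:

#include <stddef.h>
#include <stdio.h>

struct rb_node_s {
	struct rb_node_s *rb_parent;
	int rb_color;
	struct rb_node_s *rb_right;
	struct rb_node_s *rb_left;
};

/* recover the address of the structure that embeds 'member' */
#define rb_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct my_area {
	unsigned long start;
	struct rb_node_s node;	/* embedded tree linkage */
};

int main(void)
{
	struct my_area a = { .start = 0x1000 };
	struct rb_node_s *n = &a.node;	/* what a tree walk would hand us */

	printf("start = %#lx\n", rb_entry(n, struct my_area, node)->start);
	return 0;
}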

 


struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	atomic_t _count;		/* Usage count, see below. */
	union {
		atomic_t _mapcount;	/* Count of ptes mapped in mms,
					 * to show when page is mapped
					 * & limit reverse map searches.
					 */
		struct {		/* SLUB */
			u16 inuse;
			u16 objects;
		};
	};
	union {
		struct {
			unsigned long private;		/* Mapping-private opaque data:
							 * usually used for buffer_heads
							 * if PagePrivate set; used for
							 * swp_entry_t if PageSwapCache;
							 * indicates order in the buddy
							 * system if PG_buddy is set.
							 */
			struct address_space *mapping;	/* If low bit clear, points to
							 * inode address_space, or NULL.
							 * If page mapped as anonymous
							 * memory, low bit is set, and
							 * it points to anon_vma object:
							 * see PAGE_MAPPING_ANON below.
							 */
		};
#if USE_SPLIT_PTLOCKS
		spinlock_t ptl;
#endif
		struct kmem_cache *slab;	/* SLUB: Pointer to slab */
		struct page *first_page;	/* Compound tail pages */
	};
	union {
		pgoff_t index;		/* Our offset within mapping. */
		void *freelist;		/* SLUB: freelist req. slab lock */
	};
	struct list_head lru;		/* Pageout list, eg. active_list
					 * protected by zone->lru_lock !
					 */
	/*
	 * On machines where all RAM is mapped into kernel address space,
	 * we can simply calculate the virtual address. On machines with
	 * highmem some memory is mapped into kernel virtual memory
	 * dynamically, so we need a place to store that address.
	 * Note that this field could be 16 bits on x86 ... ;)
	 *
	 * Architectures with slow multiplication can define
	 * WANT_PAGE_VIRTUAL in asm/page.h
	 */
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
};

 

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	struct mm_struct *vm_mm;	/* The address space we belong to. */
	unsigned long vm_start;		/* Our start address within vm_mm. */
	unsigned long vm_end;		/* The first byte after our end address
					   within vm_mm. */

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next;

	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
	unsigned long vm_flags;		/* Flags, see mm.h. */

	struct rb_node vm_rb;

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap prio tree, or
	 * linkage to the list of like vmas hanging off its node, or
	 * linkage of vma in the address_space->i_mmap_nonlinear list.
	 */
	union {
		struct {
			struct list_head list;
			void *parent;	/* aligns with prio_tree_node parent */
			struct vm_area_struct *head;
		} vm_set;

		struct raw_prio_tree_node prio_tree_node;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages. A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_node;	/* Serialized by anon_vma->lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units, *not* PAGE_CACHE_SIZE */
	struct file *vm_file;		/* File we map to (can be NULL). */
	void *vm_private_data;		/* was vm_pte (shared mem) */
	unsigned long vm_truncate_count;/* truncate_count or restart_addr */

#ifndef CONFIG_MMU
	atomic_t vm_usage;		/* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
};
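Each vm_area_struct is visible from user space as one line in /proc/<pid>/maps (address range, permissions, offset, backing file). A minimal sketch for inspecting your own process's VMAs:

#include <stdio.h>

int main(void)
{
	/* each line printed corresponds to one vm_area_struct:
	 * vm_start-vm_end perms offset dev inode path */
	FILE *f = fopen("/proc/self/maps", "r");
	int c;

	if (!f)
		return 1;
	while ((c = fgetc(f)) != EOF)
		putchar(c);
	fclose(f);
	return 0;
}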

 

struct core_thread {
	struct task_struct *task;
	struct core_thread *next;
};

struct core_state {
	atomic_t nr_threads;
	struct core_thread dumper;
	struct completion startup;
};


struct mm_struct {
	struct vm_area_struct *mmap;	/* list of VMAs */
	struct rb_root mm_rb;		/* root of the VMA red-black tree */
	struct vm_area_struct *mmap_cache;	/* last find_vma result */
	unsigned long (*get_unmapped_area) (struct file *filp,
				unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags);
	void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
	unsigned long mmap_base;	/* base of mmap area */
	unsigned long task_size;	/* size of task vm space */
	unsigned long cached_hole_size;	/* if non-zero, the largest hole below free_area_cache */
	unsigned long free_area_cache;	/* first hole of size cached_hole_size or larger */
	pgd_t *pgd;			/* page directory */
	atomic_t mm_users;		/* How many users with user space? */
	atomic_t mm_count;		/* How many references to "struct mm_struct" (users count as 1) */
	int map_count;			/* number of VMAs */
	struct rw_semaphore mmap_sem;	/* read/write semaphore protecting the map */
	spinlock_t page_table_lock;	/* Protects page tables and some counters */

	struct list_head mmlist;	/* List of maybe swapped mm's.  These are globally strung
					 * together off init_mm.mmlist, and are protected
					 * by mmlist_lock
					 */

	/* Special counters, in some configurations protected by the
	 * page_table_lock, in other configurations by being atomic.
	 */
	mm_counter_t _file_rss;
	mm_counter_t _anon_rss;

	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
	unsigned long hiwater_vm;	/* High-water virtual memory usage */

	unsigned long total_vm, locked_vm, shared_vm, exec_vm;
	unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
	unsigned long start_code, end_code, start_data, end_data;
	unsigned long start_brk, brk, start_stack;
	unsigned long arg_start, arg_end, env_start, env_end;

	unsigned long saved_auxv[AT_VECTOR_SIZE];	/* for /proc/PID/auxv */

	cpumask_t cpu_vm_mask;

	/* Architecture-specific MM context */
	mm_context_t context;

	/* Swap token stuff */
	/*
	 * Last value of global fault stamp as seen by this process.
	 * In other words, this value gives an indication of how long
	 * it has been since this task got the token.
	 * Look at mm/thrash.c
	 */
	unsigned int faultstamp;
	unsigned int token_priority;
	unsigned int last_interval;

	unsigned long flags;		/* Must use atomic bitops to access the bits */

	struct core_state *core_state;	/* coredumping support */

	/* aio bits */
	rwlock_t ioctx_list_lock;	/* aio lock */
	struct kioctx *ioctx_list;
#ifdef CONFIG_MM_OWNER
	/*
	 * "owner" points to a task that is regarded as the canonical
	 * user/owner of this mm. All of the following must be true in
	 * order for it to be changed:
	 *
	 * current == mm->owner
	 * current->mm != mm
	 * new_owner->mm == mm
	 * new_owner->alloc_lock is held
	 */
	struct task_struct *owner;
#endif

#ifdef CONFIG_PROC_FS
	/* store ref to file /proc/<pid>/exe symlink points to */
	struct file *exe_file;
	unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};


static inline unsigned long do_mmap(struct file *file, unsigned long addr,
				    unsigned long len, unsigned long prot,
				    unsigned long flag, unsigned long offset)
{
	unsigned long ret = -EINVAL;
	/* if offset + PAGE_ALIGN(len) wraps around (i.e. the sum overflows
	 * and ends up below offset), the request is invalid */
	if ((offset + PAGE_ALIGN(len)) < offset)
		goto out;
	/* offset must be page-aligned; PAGE_MASK is ~(PAGE_SIZE - 1) */
	if (!(offset & ~PAGE_MASK))
		ret = do_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
out:
	return ret;
}
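For context, this is the kernel side of the mmap(2) system call. A minimal user-space sketch that exercises this path (the file name is just an example; error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/hostname", O_RDONLY);
	if (fd < 0)
		return 1;
	/* the offset (last argument) must be page-aligned, exactly as the
	 * check in do_mmap above enforces */
	char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	printf("%.16s\n", p);
	munmap(p, 4096);
	close(fd);
	return 0;
}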


inline long do_mmap2(unsigned long addr, unsigned long len,
		     unsigned long prot, unsigned long flags,
		     unsigned long fd, unsigned long pgoff)
{
	int error = -EINVAL;
	struct file *file = NULL;

	/* MAP_EXECUTABLE and MAP_DENYWRITE are ignored here: clear them */
	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

	/* on ARM: #define FIRST_USER_ADDRESS PAGE_SIZE */
	/* with MAP_FIXED the caller demands this exact address; if it lies
	 * below the first valid user address, fail */
	if (flags & MAP_FIXED && addr < FIRST_USER_ADDRESS)
		goto out;

	error = -EBADF;
	/* unless this is an anonymous mapping, fd must refer to an open file */
	if (!(flags & MAP_ANONYMOUS)) {
		file = fget(fd);
		if (!file)
			goto out;
	}
	down_write(&current->mm->mmap_sem);	/* lock out other users of this mm */
	/* note the different last argument: do_mmap passes a byte offset,
	 * while do_mmap2 passes a page-frame offset (pgoff) directly */
	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
	up_write(&current->mm->mmap_sem);

	if (file)
		fput(file);
out:
	return error;
}
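To make the do_mmap/do_mmap2 distinction concrete, here is a small sketch of the byte-offset to page-frame-offset conversion, assuming 4 KiB pages:

#include <stdio.h>

#define PAGE_SHIFT 12			/* 4 KiB pages assumed */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long offset = 0x3000;	/* byte offset, page-aligned */

	/* do_mmap checks (offset & ~PAGE_MASK) == 0, then shifts right by
	 * PAGE_SHIFT to get the pgoff that do_mmap2 receives directly */
	printf("aligned: %d, pgoff: %lu\n",
	       (offset & ~PAGE_MASK) == 0, offset >> PAGE_SHIFT);
	return 0;
}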


unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
			    unsigned long len, unsigned long prot,
			    unsigned long flags, unsigned long pgoff)
{
	struct mm_struct *mm = current->mm;	/* the current process's mm_struct */
	struct inode *inode;
	unsigned int vm_flags;
	int error;
	int accountable = 1;
	unsigned long reqprot = prot;

	/*
	 * Does the application expect PROT_READ to imply PROT_EXEC?
	 *
	 * (the exception is when the underlying filesystem is noexec
	 *  mounted, in which case we dont add PROT_EXEC.)
	 */
	/* READ_IMPLIES_EXEC is a personality flag for binaries that expect
	 * readable mappings to be executable as well */
	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
		if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
			prot |= PROT_EXEC;

	if (!len)
		return -EINVAL;

	/* without MAP_FIXED, addr is only a hint: if it lies below
	 * mmap_min_addr, round it up to the minimum allowed address
	 * (effective only when security checks are configured) */
	if (!(flags & MAP_FIXED))
		addr = round_hint_to_min(addr);

	/* sanity-check the parameters; on ARM arch_mmap_check is not
	 * defined, so the default below makes it a no-op:
	 * #ifndef arch_mmap_check
	 * #define arch_mmap_check(addr, len, flags)	(0)
	 * #endif
	 */
	error = arch_mmap_check(addr, len, flags);
	if (error)
		return error;

	/* Careful about overflows.. */
	/* #define PAGE_ALIGN(addr)	(((addr)+PAGE_SIZE-1)&PAGE_MASK) */
	len = PAGE_ALIGN(len);	/* round len up to the next page boundary */
	/* len must be non-zero and must not exceed the size of the user
	 * address space */
	if (!len || len > TASK_SIZE)
		return -ENOMEM;

	/* offset overflow?  pgoff plus the number of pages must not wrap */
	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
		return -EOVERFLOW;

	/* Too many mappings?  map_count counts this process's VMAs */
	if (mm->map_count > sysctl_max_map_count)
		return -ENOMEM;

	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	if (addr & ~PAGE_MASK)
		return addr;

	/* Do simple checking here so the lower-level routines won't have
	 * to. we assume access permissions have been handled by the open
	 * of the memory object, so we don't do any here.
	 */
	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

	if (flags & MAP_LOCKED) {
		if (!can_do_mlock())
			return -EPERM;
		vm_flags |= VM_LOCKED;
	}

	/* mlock MCL_FUTURE? */
	if (vm_flags & VM_LOCKED) {
		unsigned long locked, lock_limit;
		locked = len >> PAGE_SHIFT;
		locked += mm->locked_vm;
		lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
		lock_limit >>= PAGE_SHIFT;
		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
			return -EAGAIN;
	}

	inode = file ? file->f_path.dentry->d_inode : NULL;

	if (file) {
		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			if ((prot & PROT_WRITE) && !(file->f_mode & FMODE_WRITE))
				return -EACCES;

			/*
			 * Make sure we don't allow writing to an append-only
			 * file..
			 */
			if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
				return -EACCES;

			/*
			 * Make sure there are no mandatory locks on the file.
			 */
			if (locks_verify_locked(inode))
				return -EAGAIN;

			vm_flags |= VM_SHARED | VM_MAYSHARE;
			if (!(file->f_mode & FMODE_WRITE))
				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

			/* fall through */
		case MAP_PRIVATE:
			if (!(file->f_mode & FMODE_READ))
				return -EACCES;
			if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
				if (vm_flags & VM_EXEC)
					return -EPERM;
				vm_flags &= ~VM_MAYEXEC;
			}
			if (is_file_hugepages(file))
				accountable = 0;

			if (!file->f_op || !file->f_op->mmap)
				return -ENODEV;
			break;

		default:
			return -EINVAL;
		}
	} else {
		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			/*
			 * Ignore pgoff.
			 */
			pgoff = 0;
			vm_flags |= VM_SHARED | VM_MAYSHARE;
			break;
		case MAP_PRIVATE:
			/*
			 * Set pgoff according to addr for anon_vma.
			 */
			pgoff = addr >> PAGE_SHIFT;
			break;
		default:
			return -EINVAL;
		}
	}

	error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
	if (error)
		return error;

	return mmap_region(file, addr, len, flags, vm_flags, pgoff, accountable);
}
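The calc_vm_prot_bits/calc_vm_flag_bits step above simply transcribes PROT_*/MAP_* bits into VM_* bits. A rough user-space illustration of the idea (simplified stand-ins, not the kernel macros; the VM_* values here are this sketch's own):

#include <stdio.h>
#include <sys/mman.h>

/* simplified stand-ins for the kernel's VM_* bits */
#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

static unsigned long calc_prot(unsigned long prot)
{
	return ((prot & PROT_READ)  ? VM_READ  : 0) |
	       ((prot & PROT_WRITE) ? VM_WRITE : 0) |
	       ((prot & PROT_EXEC)  ? VM_EXEC  : 0);
}

int main(void)
{
	printf("vm_flags = %#lx\n", calc_prot(PROT_READ | PROT_WRITE));
	return 0;
}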


static inline unsigned long round_hint_to_min(unsigned long hint)
{
#ifdef CONFIG_SECURITY
	hint &= PAGE_MASK;
	if (((void *)hint != NULL) &&
	    (hint < mmap_min_addr))
		return PAGE_ALIGN(mmap_min_addr);
#endif
	return hint;
}

 

unsigned long get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags)
{
	/* function pointer to the allocator that will choose the address */
	unsigned long (*get_area)(struct file *, unsigned long,
				  unsigned long, unsigned long, unsigned long);

	/* start with the process's default allocator; on ARM this normally
	 * ends up calling arch_get_unmapped_area */
	get_area = current->mm->get_unmapped_area;
	/* if the file being mapped provides its own get_unmapped_area
	 * operation, use that instead */
	if (file && file->f_op && file->f_op->get_unmapped_area)
		get_area = file->f_op->get_unmapped_area;
	addr = get_area(file, addr, len, pgoff, flags);
	if (IS_ERR_VALUE(addr))
		return addr;
	if (addr > TASK_SIZE - len)
		return -ENOMEM;
	if (addr & ~PAGE_MASK)
		return -EINVAL;

	/* on ARM this is a no-op that simply returns addr:
	 * #ifndef arch_rebalance_pgtables
	 * #define arch_rebalance_pgtables(addr, len)	(addr)
	 * #endif
	 */
	return arch_rebalance_pgtables(addr, len);
}

 

unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
		unsigned long len, unsigned long pgoff, unsigned long flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long start_addr;
#ifdef CONFIG_CPU_V6
	unsigned int cache_type;
	int do_align = 0, aliasing = 0;

	/*
	 * We only need to do colour alignment if either the I or D
	 * caches alias.  This is indicated by bits 9 and 21 of the
	 * cache type register.
	 */
	cache_type = read_cpuid_cachetype();
	if (cache_type != read_cpuid_id()) {
		aliasing = (cache_type | cache_type >> 12) & (1 << 11);
		if (aliasing)
			do_align = filp || flags & MAP_SHARED;
	}
#else
#define do_align 0
#define aliasing 0
#endif

	/*
	 * We enforce the MAP_FIXED case.
	 */
	/* With MAP_FIXED, simply return addr.  Under MAP_FIXED the caller's
	 * addr must be page-aligned: addr is returned as-is here, and if it
	 * is not page-aligned, the (addr & ~PAGE_MASK) check back in
	 * get_unmapped_area will reject it.  So a page-aligned fixed
	 * mapping gets exactly the address it asked for. */
	if (flags & MAP_FIXED) {
		if (aliasing && flags & MAP_SHARED && addr & (SHMLBA - 1))
			return -EINVAL;
		return addr;
	}

	/* bounds check */
	if (len > TASK_SIZE)
		return -ENOMEM;

	/* if a hint address was given, align it up (to a page, or to a cache
	 * colour if required) and try it first */
	if (addr) {
		if (do_align)
			addr = COLOUR_ALIGN(addr, pgoff);
		else
			addr = PAGE_ALIGN(addr);
		vma = find_vma(mm, addr);
		/* a NULL vma means addr lies beyond every existing VMA
		 * (it falls inside no vma->vm_start ~ vma->vm_end range);
		 * if addr is unmapped and the hole there is large enough
		 * for this mapping, return addr and use it */
		if (TASK_SIZE - len >= addr &&
		    (!vma || addr + len <= vma->vm_start))
			return addr;
	}

	/* if the requested length exceeds the largest known hole, resume
	 * the search from where the previous search left off */
	if (len > mm->cached_hole_size) {
		start_addr = addr = mm->free_area_cache;
	} else {
		/* the request fits in the cached hole, so restart the search
		 * from the bottom to avoid wasting address space.  The search
		 * base TASK_UNMAPPED_BASE matters: addresses the kernel picks
		 * start above it.  It is not an absolute floor, though - as
		 * the if (addr) case above shows, a user hint below
		 * TASK_UNMAPPED_BASE can still be honoured.
		 * On ARM: #define TASK_UNMAPPED_BASE (UL(CONFIG_PAGE_OFFSET) / 3)
		 * with CONFIG_PAGE_OFFSET = 0xC0000000 */
		start_addr = addr = TASK_UNMAPPED_BASE;
		mm->cached_hole_size = 0;
	}

full_search:
	if (do_align)
		addr = COLOUR_ALIGN(addr, pgoff);
	else
		addr = PAGE_ALIGN(addr);

	/* find_vma is called only once; each later iteration advances to
	 * vma->vm_next */
	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
		/* At this point:  (!vma || addr < vma->vm_end). */
		if (TASK_SIZE - len < addr) {
			/*
			 * Start a new search - just in case we missed
			 * some holes.
			 */
			if (start_addr != TASK_UNMAPPED_BASE) {
				start_addr = addr = TASK_UNMAPPED_BASE;
				mm->cached_hole_size = 0;
				goto full_search;
			}
			return -ENOMEM;
		}
		/* !vma can only be true on the first iteration (nothing is
		 * mapped at or above addr), in which case the hole is big
		 * enough by construction.  On later iterations addr is the
		 * previous vma->vm_end and vma->vm_start belongs to the next
		 * VMA, so the test checks whether the gap between two
		 * neighbouring VMAs can hold the mapping. */
		if (!vma || addr + len <= vma->vm_start) {
			/*
			 * Remember the place where we stopped the search:
			 */
			mm->free_area_cache = addr + len;	/* cache the newest mapping end */
			return addr;
		}
		/* this cannot hold on the first pass when vma != NULL; on
		 * later passes it updates cached_hole_size to the gap between
		 * this vma->vm_start and the previous vma->vm_end */
		if (addr + mm->cached_hole_size < vma->vm_start)
			mm->cached_hole_size = vma->vm_start - addr;	/* record the current hole size */
		addr = vma->vm_end;
		if (do_align)
			addr = COLOUR_ALIGN(addr, pgoff);
	}
}
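The loop above is a first-fit search over a sorted sequence of [vm_start, vm_end) intervals. A compact user-space sketch of the same idea (hypothetical data; no colour alignment or hole caching):

#include <stdio.h>

struct area { unsigned long start, end; };

/* find the lowest address >= base where a hole of 'len' bytes fits,
 * given n areas sorted by address */
static unsigned long first_fit(const struct area *v, int n,
			       unsigned long base, unsigned long len)
{
	unsigned long addr = base;

	for (int i = 0; i < n; i++) {
		if (v[i].end <= addr)		/* area entirely below addr */
			continue;
		if (addr + len <= v[i].start)	/* hole before this area fits */
			return addr;
		addr = v[i].end;		/* skip past this area */
	}
	return addr;				/* hole above the last area */
}

int main(void)
{
	struct area v[] = { { 0x40000000, 0x40004000 },
			    { 0x40008000, 0x40010000 } };

	/* finds the 0x4000-byte hole at 0x40004000 */
	printf("%#lx\n", first_fit(v, 2, 0x40000000, 0x4000));
	return 0;
}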

 

find_vma looks up the first VMA whose vm_end is greater than addr, and returns NULL if there is none.

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma = NULL;

	if (mm) {
		/* Check the cache first. */
		/* (Cache hit rate is typically around 35%.) */
		vma = mm->mmap_cache;
		/* if the cached VMA does not satisfy all the conditions
		 * (i.e. it does not contain addr), fall back to a lookup in
		 * the red-black tree: the process's VMAs are organized in an
		 * rb tree via the embedded rb_node, so search it for the VMA
		 * matching addr */
		if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
			struct rb_node *rb_node;

			rb_node = mm->mm_rb.rb_node;
			vma = NULL;

			while (rb_node) {
				struct vm_area_struct *vma_tmp;
				/* recover the vm_area_struct that embeds the
				 * current rb_node */
				vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
				/* standard binary search down the tree */
				if (vma_tmp->vm_end > addr) {
					vma = vma_tmp;
					if (vma_tmp->vm_start <= addr)	/* this VMA contains addr */
						break;
					rb_node = rb_node->rb_left;
				} else
					rb_node = rb_node->rb_right;
			}
			/* if a VMA was found, cache it for the next lookup */
			if (vma)
				mm->mmap_cache = vma;
		}
	}
	return vma;
}

 

unsigned long mmap_region(struct file *file, unsigned long addr,
			  unsigned long len, unsigned long flags,
			  unsigned int vm_flags, unsigned long pgoff,
			  int accountable)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma, *prev;
	struct vm_area_struct *merged_vma;
	int correct_wcount = 0;
	int error;
	struct rb_node **rb_link, *rb_parent;
	unsigned long charged = 0;
	struct inode *inode = file ? file->f_path.dentry->d_inode : NULL;

	/* Clear old maps */
	error = -ENOMEM;
munmap_back:
	vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
	if (vma && vma->vm_start < addr + len) {
		if (do_munmap(mm, addr, len))
			return -ENOMEM;
		goto munmap_back;
	}

find_vma_prepare() works much like find_vma(): it walks the red-black tree of the current process's vm_area_struct structures looking for the first area whose end address lies above addr. If such an area overlaps the requested range, a mapping already exists there, so do_munmap() is called to remove the old area from the process address space. If the unmap fails, a negative value is returned; if it succeeds, the search repeats until no VMA overlapping the requested range remains.

	/* Check against address space limit. */
	/* may_expand_vm returns 0 if the total page count would exceed the
	 * process limit, and 1 otherwise */
	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
		return -ENOMEM;

	if (flags & MAP_NORESERVE)
		vm_flags |= VM_NORESERVE;

	if (accountable && (!(flags & MAP_NORESERVE) ||
			    sysctl_overcommit_memory == OVERCOMMIT_NEVER)) {
		if (vm_flags & VM_SHARED) {
			/* Check memory availability in shmem_file_setup? */
			vm_flags |= VM_ACCOUNT;
		} else if (vm_flags & VM_WRITE) {
			/*
			 * Private writable mapping: check memory availability
			 */
			charged = len >> PAGE_SHIFT;
			if (security_vm_enough_memory(charged))
				return -ENOMEM;
			vm_flags |= VM_ACCOUNT;
		}
	}

If MAP_NORESERVE is not set in flags, the new area contains private writable pages, and there are not enough free pages for a mapping of this size, the function fails with a negative value. security_vm_enough_memory() checks whether the process address space has enough memory available for a new mapping.

	/*
	 * Can we just expand an old private anonymous mapping?
	 * The VM_SHARED test is necessary because shmem_zero_setup
	 * will create the file object for a shared anonymous map below.
	 */
	if (!file && !(vm_flags & VM_SHARED)) {
		vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
				NULL, NULL, pgoff, NULL);
		if (vma)
			goto out;
	}

For an anonymous mapping (file is NULL) that is not shared, the new area can simply be merged with the area immediately preceding it; the merge is performed by vma_merge(). If it succeeds, control jumps to the out label, discussed below.

	/*
	 * Determine the object being mapped and call the appropriate
	 * specific mapper. the address has already been validated, but
	 * not unmapped, but the maps are removed from the list.
	 */
	/* all the checks have finally passed: allocate a real vma structure */
	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
	if (!vma) {
		/* out of memory: all the work above was for nothing */
		error = -ENOMEM;
		goto unacct_error;
	}

	vma->vm_mm = mm;
	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_flags = vm_flags;
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
	vma->vm_pgoff = pgoff;

	if (file) {
		error = -EINVAL;
		if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
			goto free_vma;
		if (vm_flags & VM_DENYWRITE) {
			error = deny_write_access(file);
			if (error)
				goto free_vma;
			correct_wcount = 1;
		}
		vma->vm_file = file;
		get_file(file);
		/* at last, call the mapped file's own mmap operation */
		error = file->f_op->mmap(file, vma);
		if (error)
			goto unmap_and_free_vma;
		if (vm_flags & VM_EXECUTABLE)
			added_exe_file_vma(mm);
	} else if (vm_flags & VM_SHARED) {
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	}

When the mapping is from a file to the virtual area:

·      If VM_GROWSDOWN or VM_GROWSUP is set in the flags, the area would have to be expandable toward lower or higher addresses, but a file-backed area cannot be expanded; control jumps to free_vma, which releases the slab allocated for the vm_area_struct and returns an error.

·      If VM_DENYWRITE is set, ordinary file writes to this file must be excluded, so deny_write_access() is called to block regular file operations (see Chapter 8).

·      get_file() simply increments the reference count in the file structure.

·      Every filesystem provides a file_operations structure whose mmap pointer establishes the mapping from that type of file to a virtual area; this is the call that does the real work (a driver-side sketch follows this list). For most filesystems it is implemented by generic_file_mmap(), which performs the following steps:

(1)   Initializes the vm_ops field of the vm_area_struct. If VM_SHARED is set, it is set to file_shared_mmap, otherwise to file_private_mmap. In a sense this step is analogous to opening a file and initializing the file object's methods.

(2)   Checks, from the inode's i_mode field (see Chapter 8), that the file being mapped is a regular file. If it is some other type (a directory or a socket, for example), an error code is returned.

(3)   Checks, from the inode's i_op field, that a readpage() inode operation is defined. If not, an error code is returned.

(4)   Calls update_atime() to store the current time in the inode's i_atime field and marks the inode dirty.

·      If MAP_SHARED is set in flags (with no backing file), shmem_zero_setup() is called to set up the shared anonymous mapping.
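As promised above, here is a minimal sketch of what a device driver's f_op->mmap method might look like. mydrv_mmap and mydrv_buf_phys are hypothetical names, and the remap_pfn_range call shown is the common driver pattern, not the generic_file_mmap path described above:

#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical physical address of a driver-owned, page-aligned buffer */
static unsigned long mydrv_buf_phys;

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* map the buffer into the vma that mmap_region prepared;
	 * vma->vm_start and vma->vm_page_prot were already filled in
	 * by do_mmap_pgoff/mmap_region */
	return remap_pfn_range(vma, vma->vm_start,
			       mydrv_buf_phys >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}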

 

	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
	 * that memory reservation must be checked; but that reservation
	 * belongs to shared memory object, not to vma: so now clear it.
	 */
	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
		vma->vm_flags &= ~VM_ACCOUNT;

	/* Can addr have changed??
	 *
	 * Answer: Yes, several device drivers can do it in their
	 *         f_op->mmap method. -DaveM
	 */
	addr = vma->vm_start;
	pgoff = vma->vm_pgoff;
	vm_flags = vma->vm_flags;

	if (vma_wants_writenotify(vma))
		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);

	merged_vma = NULL;
	if (file)
		merged_vma = vma_merge(mm, prev, addr, vma->vm_end,
			vma->vm_flags, NULL, file, pgoff, vma_policy(vma));
	if (merged_vma) {
		mpol_put(vma_policy(vma));
		kmem_cache_free(vm_area_cachep, vma);
		fput(file);
		if (vm_flags & VM_EXECUTABLE)
			removed_exe_file_vma(mm);
		vma = merged_vma;
	} else {
		vma_link(mm, vma, prev, rb_link, rb_parent);
		file = vma->vm_file;
	}

At this point the new area is inserted into the process address space. This is done by vma_link(), which does three things:

(1)   inserts vma into the process's linked list of virtual areas;

(2)   inserts vma into the red-black tree of virtual areas;

(3)   inserts vma into the inode's shared mapping list.

atomic_inc(x) atomically increments *x. Functions prefixed with atomic_ appear throughout the kernel; an atomic operation cannot be interrupted partway through.

	/* Once vma denies write, undo our temporary denial count */
	if (correct_wcount)
		atomic_inc(&inode->i_writecount);

The out label below is the normal completion path; the labels after it handle errors.

out:
	mm->total_vm += len >> PAGE_SHIFT;
	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
	if (vm_flags & VM_LOCKED) {
		/*
		 * makes pages present; downgrades, drops, reacquires mmap_sem
		 */
		long nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
		if (nr_pages < 0)
			return nr_pages;	/* vma gone! */
		mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
		make_pages_present(addr, addr + len);
	return addr;

unmap_and_free_vma:
	if (correct_wcount)
		atomic_inc(&inode->i_writecount);
	vma->vm_file = NULL;
	fput(file);

	/* Undo any partial mapping done by a device driver. */
	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
	charged = 0;
free_vma:
	kmem_cache_free(vm_area_cachep, vma);
unacct_error:
	if (charged)
		vm_unacct_memory(charged);
	return error;
}

 

static struct vm_area_struct *find_vma_prepare(struct mm_struct *mm, unsigned long addr,
		struct vm_area_struct **pprev, struct rb_node ***rb_link,
		struct rb_node **rb_parent)
{
	struct vm_area_struct *vma;
	struct rb_node **__rb_link, *__rb_parent, *rb_prev;

	__rb_link = &mm->mm_rb.rb_node;
	rb_prev = __rb_parent = NULL;
	vma = NULL;

	while (*__rb_link) {
		struct vm_area_struct *vma_tmp;

		__rb_parent = *__rb_link;
		vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb);

		if (vma_tmp->vm_end > addr) {
			vma = vma_tmp;
			if (vma_tmp->vm_start <= addr)
				break;
			__rb_link = &__rb_parent->rb_left;
		} else {
			rb_prev = __rb_parent;
			__rb_link = &__rb_parent->rb_right;
		}
	}

	*pprev = NULL;
	if (rb_prev)
		*pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
	*rb_link = __rb_link;
	*rb_parent = __rb_parent;
	return vma;
}

 

int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
	unsigned long cur = mm->total_vm;	/* pages */
	unsigned long lim;

	/* the process's address-space limit, converted to pages */
	lim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
	/* refuse if the pages already in use plus the new pages would
	 * exceed the limit */
	if (cur + npages > lim)
		return 0;
	return 1;
}
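The limit consulted here is the same one user space sees through getrlimit. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	/* RLIMIT_AS is the per-process address-space limit whose rlim_cur,
	 * converted to pages, may_expand_vm checks above */
	if (getrlimit(RLIMIT_AS, &rl) == 0)
		printf("RLIMIT_AS soft limit: %lu bytes\n",
		       (unsigned long)rl.rlim_cur);
	return 0;
}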


