Linux那些事儿 (Those Linux Things): I Am the Block Layer (11) - The Legendary Memory Mapping, Part 1

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/fudan_abc/article/details/2034264

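"If you get the chance to shake hands with the central leaders this time, could you not wash your hand afterwards? Then when you come back and shake hands with us, it will be as if the leaders had shaken our hands themselves." That is how Huang Jinlian, principal of the Sanming Special Education School in Fujian and a delegate to the 17th Party Congress, relayed her students' request on October 17, 2007.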

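The internet mob mocked and attacked this episode fiercely, but I don't think that was called for. The students' idea may look naive, yet it carries a profound thought, and that thought is exactly the idea behind memory mapping in Linux. Linux often runs into this situation: one buffer lives in user space and another in kernel space; one belongs to an application and the other to a device driver. The two have no inherent connection. They are forever mentioned side by side and forever brushing past each other, like a bird in the sky and a fish in the water: they may fall in love, but where would they build their nest?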

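The answer is mapping. Two worlds that look unconnected become related through mapping. But why connect the former to the latter? If I liken the user buffer to the students in the story and the kernel buffer to Principal Huang, you will quickly see that the students want to shake her hand not because she has star quality but because she shook hands with the central leaders. So who plays the central leaders here? Think about it: what is a device driver for? Driving a device. Right, the real protagonist is not the driver but the device. An application is willing to map its user buffer to a kernel buffer precisely because the kernel buffer is connected to the device itself. Shaking hands with the kernel buffer is as good as shaking hands with the device.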

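Let's take two Block-layer functions as examples: blk_rq_map_user and blk_rq_map_kern. Both come from block/ll_rw_blk.c. When we analysed the sd module and talked about ioctl, what we ultimately called was sg_io(), and sg_io() needs to call blk_rq_map_user, so we look at that function first.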

   2394 /**

   2395  * blk_rq_map_user - map user data to a request, for REQ_BLOCK_PC usage

   2396  * @q:          request queue where request should be inserted

   2397  * @rq:         request structure to fill

   2398  * @ubuf:       the user buffer

   2399  * @len:        length of user data

   2400  *

   2401  * Description:

   2402  *    Data will be mapped directly for zero copy io, if possible. Otherwise

   2403  *    a kernel bounce buffer is used.

   2404  *

   2405  *    A matching blk_rq_unmap_user() must be issued at the end of io, while

   2406  *    still in process context.

   2407  *

   2408  *    Note: The mapped bio may need to be bounced through blk_queue_bounce()

   2409  *    before being submitted to the device, as pages mapped may be out of

   2410  *    reach. It's the callers responsibility to make sure this happens. The

   2411  *    original bio must be passed back in to blk_rq_unmap_user() for proper

   2412  *    unmapping.

   2413  */

   2414 int blk_rq_map_user(request_queue_t *q, struct request *rq, void __user *ubuf,

   2415                     unsigned long len)

   2416 {

   2417         unsigned long bytes_read = 0;

   2418         struct bio *bio = NULL;

   2419         int ret;

   2420

   2421         if (len > (q->max_hw_sectors << 9))

   2422                 return -EINVAL;

   2423         if (!len || !ubuf)

   2424                 return -EINVAL;

   2425

   2426         while (bytes_read != len) {

   2427                 unsigned long map_len, end, start;

   2428

   2429                 map_len = min_t(unsigned long, len - bytes_read, BIO_MAX_SIZE);

   2430                 end = ((unsigned long)ubuf + map_len + PAGE_SIZE - 1)

   2431                                                                 >> PAGE_SHIFT;

   2432                 start = (unsigned long)ubuf >> PAGE_SHIFT;

   2433

   2434                 /*

   2435                  * A bad offset could cause us to require BIO_MAX_PAGES + 1

   2436                  * pages. If this happens we just lower the requested

   2437                  * mapping len by a page so that we can fit

   2438                  */

   2439                 if (end - start > BIO_MAX_PAGES)

   2440                         map_len -= PAGE_SIZE;

   2441

   2442                 ret = __blk_rq_map_user(q, rq, ubuf, map_len);

   2443                 if (ret < 0)

   2444                         goto unmap_rq;

   2445                 if (!bio)

   2446                         bio = rq->bio;

   2447                 bytes_read += ret;

   2448                 ubuf += ret;

   2449         }

   2450

   2451         rq->buffer = rq->data = NULL;

   2452         return 0;

   2453 unmap_rq:

   2454         blk_rq_unmap_user(bio);

   2455         return ret;

   2456 }

The parameter ubuf is none other than the user buffer (also called the user-space buffer) passed down from user space, and len is the length of that buffer.
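
The while loop in the listing above can be modelled in user space to see how a long buffer is cut into pieces. This is only a sketch: count_chunks and the SK_ constants are hypothetical names, the 4 KiB page size and the 256-page limit are assumed values, and every chunk is assumed to map in full.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones depend on the kernel build. */
#define SK_PAGE_SHIFT    12UL
#define SK_PAGE_SIZE     (1UL << SK_PAGE_SHIFT)
#define SK_BIO_MAX_PAGES 256UL
#define SK_BIO_MAX_SIZE  (SK_BIO_MAX_PAGES << SK_PAGE_SHIFT)

/*
 * Model of the loop in blk_rq_map_user(): walk a buffer that starts at
 * address ubuf and is len bytes long, and return how many chunks (one bio
 * each) are needed.  Assumes every chunk is mapped completely.
 */
static unsigned long count_chunks(unsigned long ubuf, unsigned long len)
{
    unsigned long done = 0, chunks = 0;

    while (done != len) {
        unsigned long map_len = len - done;
        unsigned long start, end;

        if (map_len > SK_BIO_MAX_SIZE)
            map_len = SK_BIO_MAX_SIZE;

        end = (ubuf + map_len + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
        start = ubuf >> SK_PAGE_SHIFT;

        /* An unaligned start can make the chunk touch one page too many. */
        if (end - start > SK_BIO_MAX_PAGES)
            map_len -= SK_PAGE_SIZE;

        assert(map_len > 0);
        done += map_len;
        ubuf += map_len;
        chunks++;
    }
    return chunks;
}
```

Under these assumptions a page-aligned buffer of exactly SK_BIO_MAX_SIZE bytes needs one chunk, while the same length starting 0x200 bytes into a page needs two, which is the "bad offset" case the kernel comment describes.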

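Perhaps we should have introduced struct bio long ago. It is without doubt one of the most fundamental, most central, flashiest, most dashing and coolest structures in the Generic Block Layer. It represents one block-device I/O operation in flight. The classic Linux books all cover this structure in detail, without exception, but as children of the 80s we have no need to follow the crowd or drift with the tide; we pursue our own individuality. So we won't say much about the structure here, other than that it comes from include/linux/bio.h: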

     68 /*

     69  * main unit of I/O for the block layer and lower layers (ie drivers and

     70  * stacking drivers)

     71  */

     72 struct bio {

     73         sector_t                bi_sector;      /* device address in 512 byte

     74                                                    sectors */

     75         struct bio              *bi_next;       /* request queue link */

     76         struct block_device     *bi_bdev;

     77         unsigned long           bi_flags;       /* status, command, etc */

     78         unsigned long           bi_rw;          /* bottom bits READ/WRITE,

     79                                                  * top bits priority

     80                                                  */

     81

     82         unsigned short          bi_vcnt;        /* how many bio_vec's */

     83         unsigned short          bi_idx;         /* current index into bvl_vec */

     84

     85         /* Number of segments in this BIO after

     86          * physical address coalescing is performed.

     87          */

     88         unsigned short          bi_phys_segments;

     89

     90         /* Number of segments after physical and DMA remapping

     91          * hardware coalescing is performed.

     92          */

     93         unsigned short          bi_hw_segments;

     94

     95         unsigned int            bi_size;        /* residual I/O count */

     96

     97         /*

     98          * To keep track of the max hw size, we account for the

     99          * sizes of the first and last virtually mergeable segments

    100          * in this bio

    101          */

    102         unsigned int            bi_hw_front_size;

    103         unsigned int            bi_hw_back_size;

    104

    105         unsigned int            bi_max_vecs;    /* max bvl_vecs we can hold */

    106

    107         struct bio_vec          *bi_io_vec;     /* the actual vec list */

    108

    109         bio_end_io_t            *bi_end_io;

    110         atomic_t                bi_cnt;         /* pin count */

    111

    112         void                    *bi_private;

    113

    114         bio_destructor_t        *bi_destructor; /* destructor */

    115 };

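It does not exist in isolation; it is tied to the request. struct request has a member struct bio *bio, which represents the bios of this request, because one request can contain several I/O operations. The main job of blk_rq_map_user is to set up the mapping between the user buffer and the bio, and the actual work is done by __blk_rq_map_user.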

   2341 static int __blk_rq_map_user(request_queue_t *q, struct request *rq,

   2342                              void __user *ubuf, unsigned int len)

   2343 {

   2344         unsigned long uaddr;

   2345         struct bio *bio, *orig_bio;

   2346         int reading, ret;

   2347

   2348         reading = rq_data_dir(rq) == READ;

   2349

   2350         /*

   2351          * if alignment requirement is satisfied, map in user pages for

   2352          * direct dma. else, set up kernel bounce buffers

   2353          */

   2354         uaddr = (unsigned long) ubuf;

   2355         if (!(uaddr & queue_dma_alignment(q)) && !(len & queue_dma_alignment(q)))

   2356                 bio = bio_map_user(q, NULL, uaddr, len, reading);

   2357         else

   2358                 bio = bio_copy_user(q, uaddr, len, reading);

   2359

   2360         if (IS_ERR(bio))

   2361                 return PTR_ERR(bio);

   2362

   2363         orig_bio = bio;

   2364         blk_queue_bounce(q, &bio);

   2365

   2366         /*

   2367          * We link the bounce buffer in and could have to traverse it

   2368          * later so we have to get a ref to prevent it from being freed

   2369          */

   2370         bio_get(bio);

   2371

   2372         if (!rq->bio)

   2373                 blk_rq_bio_prep(q, rq, bio);

   2374         else if (!ll_back_merge_fn(q, rq, bio)) {

   2375                 ret = -EINVAL;

   2376                 goto unmap_bio;

   2377         } else {

   2378                 rq->biotail->bi_next = bio;

   2379                 rq->biotail = bio;

   2380

   2381                 rq->data_len += bio->bi_size;

   2382         }

   2383

   2384         return bio->bi_size;

   2385

   2386 unmap_bio:

   2387         /* if it was boucned we must call the end io function */

   2388         bio_endio(bio, bio->bi_size, 0);

   2389         __blk_rq_unmap_user(orig_bio);

   2390         bio_put(bio);

   2391         return ret;

   2392 }
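
The test on line 2355 of the listing above decides between mapping the user pages directly and going through a bounce buffer. A minimal user-space sketch of that decision follows; can_map_directly is a hypothetical name, and mask stands in for whatever queue_dma_alignment(q) returns (511 is used below purely as an assumed example value).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the alignment test: direct mapping is possible only when both
 * the start address and the length have none of the mask bits set.
 * Otherwise the kernel code falls back to a copy through a bounce buffer.
 */
static bool can_map_directly(unsigned long uaddr, unsigned long len,
                             unsigned long mask)
{
    return !(uaddr & mask) && !(len & mask);
}
```

With a mask of 511, a buffer at 0x1000 of 512 bytes qualifies, while a buffer at 0x1001, or one of 100 bytes, does not.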

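But so far bio is just an ethereal pointer, all show and no substance. Who allocates memory for it? Let's dig deeper. The next thing to look at is bio_map_user(). uaddr is the virtual address of ubuf; if it and len satisfy the alignment requirement of the queue, bio_map_user() is called. (Otherwise bio_copy_user() is called to set up a so-called bounce buffer, which we won't go into.) The function comes from fs/bio.c: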

    713 /**

    714  *      bio_map_user    -       map user address into bio

    715  *      @q: the request_queue_t for the bio

    716  *      @bdev: destination block device

    717  *      @uaddr: start of user address

    718  *      @len: length in bytes

    719  *      @write_to_vm: bool indicating writing to pages or not

    720  *

    721  *      Map the user space address into a bio suitable for io to a block

    722  *      device. Returns an error pointer in case of error.

    723  */

    724 struct bio *bio_map_user(request_queue_t *q, struct block_device *bdev,

    725                          unsigned long uaddr, unsigned int len, int write_to_vm)

    726 {

    727         struct sg_iovec iov;

    728

    729         iov.iov_base = (void __user *)uaddr;

    730         iov.iov_len = len;

    731

    732         return bio_map_user_iov(q, bdev, &iov, 1, write_to_vm);

    733 }

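At this point struct sg_iovec should look familiar. Think back: we covered this structure when discussing ioctl in sd. It describes one element of a scatter-gather array. iovec means I/O vector, that is, a structure made up of a base address and a length.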

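The comment explains each parameter clearly and also states the purpose of the function, so it is easy to see that it returns a bio pointer describing one I/O operation. The real work, however, is done by bio_map_user_iov(), so we move on to that. It also comes from fs/bio.c: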

    735 /**

    736  *      bio_map_user_iov - map user sg_iovec table into bio

    737  *      @q: the request_queue_t for the bio

    738  *      @bdev: destination block device

    739  *      @iov:   the iovec.

    740  *      @iov_count: number of elements in the iovec

    741  *      @write_to_vm: bool indicating writing to pages or not

    742  *

    743  *      Map the user space address into a bio suitable for io to a block

    744  *      device. Returns an error pointer in case of error.

    745  */

    746 struct bio *bio_map_user_iov(request_queue_t *q, struct block_device *bdev,

    747                              struct sg_iovec *iov, int iov_count,

    748                              int write_to_vm)

    749 {

    750         struct bio *bio;

    751

    752         bio = __bio_map_user_iov(q, bdev, iov, iov_count, write_to_vm);

    753

    754         if (IS_ERR(bio))

    755                 return bio;

    756

    757         /*

    758          * subtle -- if __bio_map_user() ended up bouncing a bio,

    759          * it would normally disappear when its bi_end_io is run.

    760          * however, we need it for the unmap, so grab an extra

    761          * reference to it

    762          */

    763         bio_get(bio);

    764

    765         return bio;

    766 }
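
Both listings above say they return "an error pointer in case of error", and the callers check the result with IS_ERR() and PTR_ERR(). The idea can be modelled in user space as below. This is a sketch of the idiom, not the kernel's own header: the sk_ names are hypothetical, the size of the reserved range is an assumption, and converting integers to pointers like this relies on implementation-defined behaviour that holds on mainstream platforms.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095 /* assumed size of the reserved error range */

/* Encode a negative errno value as a pointer at the very top of the address space. */
static void *sk_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* A pointer counts as an error if it falls inside the reserved top range. */
static int sk_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-SK_MAX_ERRNO);
}

/* Recover the negative errno value from an error pointer. */
static long sk_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}
```

One return value can therefore carry either a valid bio or a reason for failure such as -ENOMEM, which is why bio_map_user_iov() simply passes the result of __bio_map_user_iov() straight back when the check fires.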

Still not the end of the road. On into __bio_map_user_iov().

    603 static struct bio *__bio_map_user_iov(request_queue_t *q,

    604                                       struct block_device *bdev,

    605                                       struct sg_iovec *iov, int iov_count,

    606                                       int write_to_vm)

    607 {

    608         int i, j;

    609         int nr_pages = 0;

    610         struct page **pages;

    611         struct bio *bio;

    612         int cur_page = 0;

    613         int ret, offset;

    614

    615         for (i = 0; i < iov_count; i++) {

    616                 unsigned long uaddr = (unsigned long)iov[i].iov_base;

    617                 unsigned long len = iov[i].iov_len;

    618                 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;

    619                 unsigned long start = uaddr >> PAGE_SHIFT;

    620

    621                 nr_pages += end - start;

    622                 /*

    623                  * buffer must be aligned to at least hardsector size for now

    624                  */

    625                 if (uaddr & queue_dma_alignment(q))

    626                         return ERR_PTR(-EINVAL);

    627         }

    628

    629         if (!nr_pages)

    630                 return ERR_PTR(-EINVAL);

    631

    632         bio = bio_alloc(GFP_KERNEL, nr_pages);

    633         if (!bio)

    634                 return ERR_PTR(-ENOMEM);

    635

    636         ret = -ENOMEM;

    637         pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);

    638         if (!pages)

    639                 goto out;

    640

    641         for (i = 0; i < iov_count; i++) {

    642                 unsigned long uaddr = (unsigned long)iov[i].iov_base;

    643                 unsigned long len = iov[i].iov_len;

    644                 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;

    645                 unsigned long start = uaddr >> PAGE_SHIFT;

    646                 const int local_nr_pages = end - start;

    647                 const int page_limit = cur_page + local_nr_pages;

    648

    649                 down_read(&current->mm->mmap_sem);

    650                 ret = get_user_pages(current, current->mm, uaddr,

    651                                      local_nr_pages,

    652                                      write_to_vm, 0, &pages[cur_page], NULL);

    653                 up_read(&current->mm->mmap_sem);

    654

    655                 if (ret < local_nr_pages) {

    656                         ret = -EFAULT;

    657                         goto out_unmap;

    658                 }

    659

    660                 offset = uaddr & ~PAGE_MASK;

    661                 for (j = cur_page; j < page_limit; j++) {

    662                         unsigned int bytes = PAGE_SIZE - offset;

    663

    664                         if (len <= 0)

    665                                 break;

    666

    667                         if (bytes > len)

    668                                 bytes = len;

    669

    670                         /*

    671                          * sorry...

    672                          */

    673                         if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <

    674                                             bytes)

    675                                 break;

    676

    677                         len -= bytes;

    678                         offset = 0;

    679                 }

    680

    681                 cur_page = j;

    682                 /*

    683                  * release the pages we didn't map into the bio, if any

    684                  */

    685                 while (j < page_limit)

    686                         page_cache_release(pages[j++]);

    687         }

    688

    689         kfree(pages);

    690

    691         /*

    692          * set data direction, and check if mapped pages need bouncing

    693          */

    694         if (!write_to_vm)

    695                 bio->bi_rw |= (1 << BIO_RW);

    696

    697         bio->bi_bdev = bdev;

    698         bio->bi_flags |= (1 << BIO_USER_MAPPED);

    699         return bio;

    700

    701  out_unmap:

    702         for (i = 0; i < nr_pages; i++) {

    703                 if(!pages[i])

    704                         break;

    705                 page_cache_release(pages[i]);

    706         }

    707  out:

    708         kfree(pages);

    709         bio_put(bio);

    710         return ERR_PTR(ret);

    711 }

Line 632, bio_alloc(). There it is: this is clearly where the memory is allocated, and from this moment bio stands on its own feet.

We could stop digging here, but as Ashin (阿信) tells us, reading code is no fun unless you go all the way.

So we go on into bio_alloc, from fs/bio.c:

    187 struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)

    188 {

    189         struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);

    190

    191         if (bio)

    192                 bio->bi_destructor = bio_fs_destructor;

    193

    194         return bio;

    195 }

It simply calls bio_alloc_bioset(), from the same file:

    147 /**

    148  * bio_alloc_bioset - allocate a bio for I/O

    149  * @gfp_mask:   the GFP_ mask given to the slab allocator

    150  * @nr_iovecs:  number of iovecs to pre-allocate

    151  * @bs:         the bio_set to allocate from

    152  *

    153  * Description:

    154  *   bio_alloc_bioset will first try it's on mempool to satisfy the allocation.

    155  *   If %__GFP_WAIT is set then we will block on the internal pool waiting

    156  *   for a &struct bio to become free.

    157  *

    158  *   allocate bio and iovecs from the memory pools specified by the

    159  *   bio_set structure.

    160  **/

    161 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

    162 {

    163         struct bio *bio = mempool_alloc(bs->bio_pool, gfp_mask);

    164

    165         if (likely(bio)) {

    166                 struct bio_vec *bvl = NULL;

    167

    168                 bio_init(bio);

    169                 if (likely(nr_iovecs)) {

    170                         unsigned long idx = 0; /* shut up gcc */

    171

    172                         bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);

    173                         if (unlikely(!bvl)) {

    174                                 mempool_free(bio, bs->bio_pool);

    175                                 bio = NULL;

    176                                 goto out;

    177                         }

    178                         bio->bi_flags |= idx << BIO_POOL_OFFSET;

    179                         bio->bi_max_vecs = bvec_slabs[idx].nr_vecs;

    180                 }

    181                 bio->bi_io_vec = bvl;

    182         }

    183 out:

    184         return bio;

    185 }

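By now it is basically clear what is going on. mempool_alloc tells us plainly that memory is allocated for bio, and right after that bio_init() initialises it. I won't go into more detail. The one thing worth noting is nr_iovecs, which has been passed all the way down: __bio_map_user_iov() passes nr_pages to bio_alloc(). Lines 615 to 627 compute nr_pages by accumulating in a for loop that runs iov_count times, each time adding the difference between end and start. Clearly the final nr_pages is the number of pages covered by the iov array. iov is the third parameter of __bio_map_user_iov, and iov_count, the fourth, is the number of elements in the iov array. bio_map_user passes 1 as that fourth argument when it calls bio_map_user_iov, so iov_count is 1. None of that matters much, though. What matters is that we now have a bio. We leave bio_alloc and return to __bio_map_user_iov to carry on. Line 637 allocates something else, pages, a pointer to pointer, and one has a hunch that it will stand for an array of pointers.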
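
The page counting on lines 615 to 627 is plain arithmetic and can be tried out in user space. The sketch below assumes 4 KiB pages; struct sk_iovec, span_pages and count_pages are hypothetical stand-ins for illustration, not kernel names.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12UL /* assumed: 4 KiB pages */
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)

/* Stand-in for struct sg_iovec: a base address and a length in bytes. */
struct sk_iovec {
    unsigned long base;
    unsigned long len;
};

/* Number of pages touched by the byte range [base, base + len). */
static unsigned long span_pages(unsigned long base, unsigned long len)
{
    unsigned long end = (base + len + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
    unsigned long start = base >> SK_PAGE_SHIFT;

    return end - start;
}

/* Sum of the page spans over the whole iov array, as in the first for loop. */
static unsigned long count_pages(const struct sk_iovec *iov, int iov_count)
{
    unsigned long nr_pages = 0;

    for (int i = 0; i < iov_count; i++)
        nr_pages += span_pages(iov[i].base, iov[i].len);
    return nr_pages;
}
```

Note that the count depends on the offset inside the page as well as on the length: 4096 bytes starting at 0x1000 touch one page, but 4096 bytes starting at 0x1800 touch two.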

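Right after that comes another for loop. get_user_pages obtains the page descriptors. This line is the soul of the code: from this moment on, the user-space buffer and kernel space are joined in matrimony. Let's start from the figure below.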

The most important members of a bio are bi_io_vec and bi_vcnt. bi_io_vec is a pointer to struct bio_vec, which is defined in include/linux/bio.h:

     54 /*

     55  * was unsigned short, but we might as well be ready for > 64kB I/O pages

     56  */

     57 struct bio_vec {

     58         struct page     *bv_page;

     59         unsigned int    bv_len;

     60         unsigned int    bv_offset;

     61 };

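bi_io_vec actually represents an array of struct bio_vec, and bi_vcnt is the number of elements in that array. As the figure shows, the bv_page member of each bio_vec points to one of the mapped pages. What establishes the mapping is precisely the great get_user_pages() function we just saw: it ties these pages to the user-space buffer. bio_add_pc_page() then makes bv_page point to the corresponding page. The reason for tying pages to the user-space buffer is that the block layer only recognises bios, not user buffers. The block-layer functions all operate on bios. They don't care whether something is user space or not; they only look after their own bios, and all they know is that every request carries its bio.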

As for get_user_pages, its prototype is in include/linux/mm.h:

    795 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,

    796                 int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);

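Here start and len describe the user-space buffer. (len is in units of pages: a len of 3 means 3 pages.) The purpose of the function is to map this user-space buffer into kernel space. pages and vmas are its outputs. pages is a pointer to pointer, in other words an array of pointers: it holds a group of page pointers, and those point to the pages backing the user-space buffer. The return value is the number of pages actually mapped. We can ignore vmas; here at least we pass NULL for it, so it plays no role.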

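A little more gossip about get_user_pages. Just as behind every successful man there is a woman (or several), as with Zhang Bin, Zhao Zhongxiang or Li Jindou, behind every Linux process there is a page table. When a process is created, its own page tables are set up for its address space. On x86 a page table has 1024 entries, and each entry can describe one page. Is that page actually present in physical memory? Hard to say. We can think of the 1024 entries of a page table as 1024 pointers, each 32 bits wide. One of those bits is called Present: 1 means the page is in physical memory, 0 means it is not.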
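
The Present bit can be illustrated with a few lines of user-space C. The sketch assumes the classic 32-bit, non-PAE x86 entry layout, in which Present is bit 0 and the top 20 bits hold the page frame base; the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PTE_PRESENT 0x1u        /* bit 0 of a 32-bit x86 page-table entry */
#define SK_PTE_FRAME   0xFFFFF000u /* top 20 bits: physical page frame base */

/* True if the entry says the page is currently in physical memory. */
static bool pte_is_present(uint32_t pte)
{
    return (pte & SK_PTE_PRESENT) != 0;
}

/* Physical base address of the page frame; meaningful only if present. */
static uint32_t pte_frame_base(uint32_t pte)
{
    return pte & SK_PTE_FRAME;
}
```

For example, the entry 0x12345067 is present and refers to the frame at 0x12345000, while 0x12345066 has the Present bit clear.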

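So what does this have to do with get_user_pages? Its start parameter, together with len, describes a range of linear addresses. On x86 a linear address is 32 bits, split into three fields: bits 31-22 are the Directory, the index into the Page Directory; bits 21-12 are the Table, the index into the Page Table; bits 11-0 are the Offset. Given a virtual address, that is, a linear address, you are given its position in the Page Directory and its position in the Page Table, which means you are given a Page. If that Page is in physical memory, fine. But what if it isn't? That is where get_user_pages() shows its true colours: it allocates a Page Frame and sets up the page table accordingly. After that the range of virtual addresses has backing, with physical addresses standing behind it, so your application can access it and so can the device driver. The driver doesn't access these addresses directly, though. As said before, the Block layer only recognises bios, not pages and not virtual addresses, which is why we have the function bio_add_pc_page(), responsible for tying pages to the bio.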
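
The three-field split of a 32-bit linear address described above can be written out directly. This is a small sketch with hypothetical helper names, assuming the 10/10/12 two-level layout from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 31-22: index into the Page Directory. */
static uint32_t dir_index(uint32_t addr)
{
    return addr >> 22;
}

/* Bits 21-12: index into the Page Table. */
static uint32_t table_index(uint32_t addr)
{
    return (addr >> 12) & 0x3FFu;
}

/* Bits 11-0: byte offset inside the page. */
static uint32_t page_offset(uint32_t addr)
{
    return addr & 0xFFFu;
}
```

Taking 0xC0101234 as an example, the directory index is 0x300 (768), the table index is 0x101 (257) and the offset is 0x234. Ten bits for each index is also why a page table has exactly 1024 entries.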

Let's look at bio_add_pc_page, from fs/bio.c:

    414 /**

    415  *      bio_add_pc_page -       attempt to add page to bio

    416  *      @q: the target queue

    417  *      @bio: destination bio

    418  *      @page: page to add

    419  *      @len: vec entry length

    420  *      @offset: vec entry offset

    421  *

    422  *      Attempt to add a page to the bio_vec maplist. This can fail for a

    423  *      number of reasons, such as the bio being full or target block

    424  *      device limitations. The target block device must allow bio's

    425  *      smaller than PAGE_SIZE, so it is always possible to add a single

    426  *      page to an empty bio. This should only be used by REQ_PC bios.

    427  */

    428 int bio_add_pc_page(request_queue_t *q, struct bio *bio, struct page *page,

    429                     unsigned int len, unsigned int offset)

    430 {

    431         return __bio_add_page(q, bio, page, len, offset, q->max_hw_sectors);

    432 }

__bio_add_page comes from the same file.

    318 static int __bio_add_page(request_queue_t *q, struct bio *bio, struct page

    319                           *page, unsigned int len, unsigned int offset,

    320                           unsigned short max_sectors)

    321 {

    322         int retried_segments = 0;

    323         struct bio_vec *bvec;

    324

    325         /*

    326          * cloned bio must not modify vec list

    327          */

    328         if (unlikely(bio_flagged(bio, BIO_CLONED)))

    329                 return 0;

    330

    331         if (((bio->bi_size + len) >> 9) > max_sectors)

    332                 return 0;

    333

    334         /*

    335          * For filesystems with a blocksize smaller than the pagesize

    336          * we will often be called with the same page as last time and

    337          * a consecutive offset.  Optimize this special case.

    338          */

    339         if (bio->bi_vcnt > 0) {

    340                 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

    341

    342                 if (page == prev->bv_page &&

    343                     offset == prev->bv_offset + prev->bv_len) {

    344                         prev->bv_len += len;

    345                         if (q->merge_bvec_fn &&

    346                             q->merge_bvec_fn(q, bio, prev) < len) {

    347                                 prev->bv_len -= len;

    348                                 return 0;

    349                         }

    350

    351                         goto done;

    352                 }

    353         }

    354

    355         if (bio->bi_vcnt >= bio->bi_max_vecs)

    356                 return 0;

    357

    358         /*

    359          * we might lose a segment or two here, but rather that than

    360          * make this too complex.

    361          */

    362

    363         while (bio->bi_phys_segments >= q->max_phys_segments

    364                || bio->bi_hw_segments >= q->max_hw_segments

    365                || BIOVEC_VIRT_OVERSIZE(bio->bi_size)) {

    366

    367                 if (retried_segments)

    368                         return 0;

    369

    370                 retried_segments = 1;

    371                 blk_recount_segments(q, bio);

    372         }

    373

    374         /*

    375          * setup the new entry, we might clear it again later if we

    376          * cannot add the page

    377          */

    378         bvec = &bio->bi_io_vec[bio->bi_vcnt];

    379         bvec->bv_page = page;

    380         bvec->bv_len = len;

    381         bvec->bv_offset = offset;

    382

    383         /*

    384          * if queue has other restrictions (eg varying max sector size

    385          * depending on offset), it can specify a merge_bvec_fn in the

    386          * queue to get further control

    387          */

    388         if (q->merge_bvec_fn) {

    389                 /*

    390                  * merge_bvec_fn() returns number of bytes it can accept

    391                  * at this offset

    392                  */

    393                 if (q->merge_bvec_fn(q, bio, bvec) < len) {

    394                         bvec->bv_page = NULL;

    395                         bvec->bv_len = 0;

    396                         bvec->bv_offset = 0;

    397                         return 0;

    398                 }

    399         }

    400

    401         /* If we may be able to merge these biovecs, force a recount */

    402         if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec) ||

    403             BIOVEC_VIRT_MERGEABLE(bvec-1, bvec)))

    404                 bio->bi_flags &= ~(1 << BIO_SEG_VALID);

    405

    406         bio->bi_vcnt++;

    407         bio->bi_phys_segments++;

    408         bio->bi_hw_segments++;

    409  done:

    410         bio->bi_size += len;

    411         return len;

    412 }

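Much of the Block layer exists to serve RAID, for example the merge_bvec_fn function pointer here. An ordinary disk driver has no such pointer, or rather the pointer points at thin air. The nice thing is that without this function __bio_add_page becomes very simple, which makes us happy. The most meaningful code in the function is the assignment to bvec on lines 378 to 381 and the updates to bio on lines 406 to 410. A friendly reminder: note the assignment on line 410. bio->bi_size is the running sum of len, and if you trace it carefully you will find that, after all the detours, bio->bi_size ends up being the len that was originally passed down from user space (at least when the whole buffer fits into a single bio).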
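
With merge_bvec_fn out of the picture and the segment limits ignored, the core of __bio_add_page fits in a few lines. The following is a simplified user-space model, not the kernel code: struct sk_vec, struct sk_bio, SK_MAX_VECS and sk_add_page are hypothetical names, and the fixed four-entry table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_VECS 4 /* assumed small table size for the example */

struct sk_vec {
    const void *page;    /* stands in for bv_page */
    unsigned int len;    /* stands in for bv_len */
    unsigned int offset; /* stands in for bv_offset */
};

struct sk_bio {
    struct sk_vec vec[SK_MAX_VECS]; /* stands in for bi_io_vec */
    unsigned short vcnt;            /* stands in for bi_vcnt */
    unsigned int size;              /* stands in for bi_size */
};

/* Returns len on success, or 0 if the table is already full. */
static unsigned int sk_add_page(struct sk_bio *bio, const void *page,
                                unsigned int len, unsigned int offset)
{
    if (bio->vcnt > 0) {
        struct sk_vec *prev = &bio->vec[bio->vcnt - 1];

        /* Same page, directly following the previous piece: extend it. */
        if (page == prev->page && offset == prev->offset + prev->len) {
            prev->len += len;
            goto done;
        }
    }

    if (bio->vcnt >= SK_MAX_VECS)
        return 0;

    bio->vec[bio->vcnt].page = page;
    bio->vec[bio->vcnt].len = len;
    bio->vec[bio->vcnt].offset = offset;
    bio->vcnt++;
done:
    bio->size += len; /* the running sum that ends up as the total length */
    return len;
}
```

Adding two contiguous 512-byte pieces of the same page leaves a single 1024-byte entry, whereas a piece from a different page takes a new slot; in both cases size grows by len.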

In __bio_map_user_iov(), the for loop on lines 661 to 679 adds all those pages, one by one, to the bio's bi_io_vec table, so that every bv_page has something to point at.
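
How that loop cuts a buffer into per-page pieces can be shown without any kernel types. In the sketch below a page is identified only by its index (address shifted right by the page shift); struct sk_seg and split_into_segments are hypothetical names and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12UL /* assumed: 4 KiB pages */
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)

struct sk_seg {
    unsigned long page_index; /* which page the piece lives in */
    unsigned int offset;      /* where the piece starts inside that page */
    unsigned int bytes;       /* how long the piece is */
};

/* Fill out[] with at most max pieces and return how many were written. */
static int split_into_segments(unsigned long uaddr, unsigned long len,
                               struct sk_seg *out, int max)
{
    unsigned int offset = (unsigned int)(uaddr & (SK_PAGE_SIZE - 1));
    unsigned long page = uaddr >> SK_PAGE_SHIFT;
    int n = 0;

    while (len > 0 && n < max) {
        unsigned long bytes = SK_PAGE_SIZE - offset;

        if (bytes > len)
            bytes = len;

        out[n].page_index = page++;
        out[n].offset = offset;
        out[n].bytes = (unsigned int)bytes;
        n++;

        len -= bytes;
        offset = 0; /* only the first piece can start mid-page */
    }
    return n;
}
```

A buffer of 0x1200 bytes starting at 0x1F00 yields three pieces: 0x100 bytes at the end of page 1, all of page 2, and 0x100 bytes at the start of page 3.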

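Then, on line 699, __bio_map_user_iov() returns, and what it returns is the bio. Right after that bio_map_user_iov() and bio_map_user() return in turn, both handing back this same bio. So we are back in __blk_rq_map_user().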

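However, as we just saw, we now have a bio, the bio has an intimate relationship with the pages, and the bio has an intimate relationship with the user buffer. But is that enough? Obviously the bio should also have a relationship with the request; a bio that hasn't joined a request is no use to anyone. The relationship between request and bio is shown in the figure below: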

The job is done by blk_rq_bio_prep(), called on line 2373. It comes from block/ll_rw_blk.c:

   3669 void blk_rq_bio_prep(request_queue_t *q, struct request *rq, struct bio *bio)

   3670 {

   3671         /* first two bits are identical in rq->cmd_flags and bio->bi_rw */

   3672         rq->cmd_flags |= (bio->bi_rw & 3);

   3673

   3674         rq->nr_phys_segments = bio_phys_segments(q, bio);

   3675         rq->nr_hw_segments = bio_hw_segments(q, bio);

   3676         rq->current_nr_sectors = bio_cur_sectors(bio);

   3677         rq->hard_cur_sectors = rq->current_nr_sectors;

   3678         rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);

   3679         rq->buffer = bio_data(bio);

   3680         rq->data_len = bio->bi_size;

   3681

   3682         rq->bio = rq->biotail = bio;

   3683 }

At this point the bio is officially married into rq.
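
The two ways a bio joins a request, line 3682 for the first bio and lines 2378 to 2381 of __blk_rq_map_user() for later ones, amount to appending to a singly linked list while keeping a running byte count. A minimal user-space model, with hypothetical sk_ names and only the fields needed for the illustration:

```c
#include <assert.h>
#include <stddef.h>

struct sk_chain_bio {
    unsigned int size;         /* stands in for bi_size */
    struct sk_chain_bio *next; /* stands in for bi_next */
};

struct sk_request {
    struct sk_chain_bio *bio;     /* head of the chain */
    struct sk_chain_bio *biotail; /* last bio in the chain */
    unsigned int data_len;        /* total bytes over all bios */
};

static void sk_rq_add_bio(struct sk_request *rq, struct sk_chain_bio *bio)
{
    bio->next = NULL;
    if (!rq->bio) {
        /* First bio: it becomes both head and tail. */
        rq->bio = rq->biotail = bio;
        rq->data_len = bio->size;
    } else {
        /* Later bios: link behind the current tail. */
        rq->biotail->next = bio;
        rq->biotail = bio;
        rq->data_len += bio->size;
    }
}
```

The model leaves out everything else blk_rq_bio_prep() fills in (segment counts, sector counts, flags) and the ll_back_merge_fn() check that may refuse the append.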

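Back in __blk_rq_map_user(), it is time to return as well. Line 2384 returns bio->bi_size which, as just said, is the length of the user buffer passed in from user space (more precisely, of the piece handed to this call).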

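Back in blk_rq_map_user(), we find this function is about to finish too; normally it returns 0. With that, the grand mapping project is complete. But a reader who goes by the handle 贱男村村长 raises a question: when do these bios actually get used? It didn't seem to come up when we covered SCSI commands. In fact, when we covered SCSI commands there was a function called scsi_setup_blk_pc_cmnd. Line 1104 of that function checks whether req->bio is NULL. If it is not, it gets handled accordingly: a function called scsi_init_io() is called, which builds a scatter-gather array matching the bi_io_vec vector in this bio.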
