Linux Kernel Adventures: Reading the md Source Code, Part 13: raid5 Retry Read

默默努力的小熊

Article link: https://blog.csdn.net/liumangxiong/article/details/14222589

Please credit the source when reposting: http://blog.csdn.net/liumangxiong

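In the previous part we saw that when a chunk-aligned read fails, the completion callback raid5_align_endio adds the request to the array's retry list and wakes the raid5d thread. raid5d then calls retry_aligned_read on that request to retry the read: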

4539 static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
4540 {
4541     /* We may not be able to submit a whole bio at once as there
4542      * may not be enough stripe_heads available.
4543      * We cannot pre-allocate enough stripe_heads as we may need
4544      * more than exist in the cache (if we allow ever large chunks).
4545      * So we do one stripe head at a time and record in
4546      * ->bi_hw_segments how many have been done.
4547      *
4548      * We *know* that this entire raid_bio is in one chunk, so
4549      * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
4550      */

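If there are not enough struct stripe_head structures available, we cannot submit the whole request in one go. We cannot reserve enough of them in advance either, so we submit one struct stripe_head at a time and record in the bio how many have been submitted (the comment names the bio->bi_hw_segments field).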

Because this is a chunk-aligned read, the whole range of raid_bio lies inside a single chunk, so one call to raid5_compute_sector is enough to find the member-disk index dd_idx.

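From the comment we learn that a field of the bio (bio->bi_hw_segments, according to the comment) is reused to record how many struct stripe_heads have already been issued. How exactly is it used? Let's continue with the code: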

4558     logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
4559     sector = raid5_compute_sector(conf, logical_sector,
4560                          0, &dd_idx, NULL);
4561     last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);

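Line 4558: round the request's start sector down to a stripe boundary, because reads are processed in units of one stripe (STRIPE_SIZE, i.e. one page).

Line 4559: compute the member-disk index dd_idx and the sector offset on that disk.

Line 4561: the end sector of the request, that is, the first sector past it.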
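The arithmetic on lines 4558 and 4561 is easy to try out in user space. The following is only a sketch: it assumes 512-byte sectors and a 4 KiB STRIPE_SIZE (so 8 sectors per stripe), and the helper names stripe_align_down, request_end and stripes_touched are made up for illustration, they are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 512-byte sectors and a 4 KiB STRIPE_SIZE (one page),
 * which gives 8 sectors per stripe. */
#define SECTOR_SHIFT   9
#define STRIPE_SECTORS 8u

typedef uint64_t sector_t;

/* Round a sector down to the start of its stripe (same mask as line 4558). */
static sector_t stripe_align_down(sector_t bi_sector)
{
    return bi_sector & ~((sector_t)STRIPE_SECTORS - 1);
}

/* First sector past the end of the request (same formula as line 4561). */
static sector_t request_end(sector_t bi_sector, uint32_t bi_size)
{
    return bi_sector + (bi_size >> SECTOR_SHIFT);
}

/* How many stripes a loop stepping by STRIPE_SECTORS from the aligned
 * start up to the end sector would visit. */
static unsigned stripes_touched(sector_t bi_sector, uint32_t bi_size)
{
    sector_t first = stripe_align_down(bi_sector);
    sector_t last = request_end(bi_sector, bi_size);

    assert(bi_size > 0);
    return (unsigned)((last - first + STRIPE_SECTORS - 1) / STRIPE_SECTORS);
}
```

Under these assumptions a 64 KiB request starting at sector 128 covers 16 stripes, while a 512-byte request starting at sector 1003 is aligned down to sector 1000 and covers a single stripe.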

4563     for (; logical_sector < last_sector;
4564          logical_sector += STRIPE_SECTORS,
4565               sector += STRIPE_SECTORS,
4566               scnt++) {
4567
4568          if (scnt < raid5_bi_processed_stripes(raid_bio))
4569               /* already done this stripe */
4570               continue;
4571
4572          sh = get_active_stripe(conf, sector, 0, 1, 0);
4573
4574          if (!sh) {
4575               /* failed to get a stripe - must wait */
4576               raid5_set_bi_processed_stripes(raid_bio, scnt);
4577               conf->retry_read_aligned = raid_bio;
4578               return handled;
4579          }
4580
4581          if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
4582               release_stripe(sh);
4583               raid5_set_bi_processed_stripes(raid_bio, scnt);
4584               conf->retry_read_aligned = raid_bio;
4585               return handled;
4586          }
4587
4588          set_bit(R5_ReadNoMerge, &sh->dev[dd_idx].flags);
4589          handle_stripe(sh);
4590          release_stripe(sh);
4591          handled++;
4592     }

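Line 4563: loop over every stripe inside the chunk. For example, with a 64KB chunk, a 4KB stripe and a request covering the whole chunk, the loop runs 16 times.

Line 4568: skip stripes that have already been issued. As the comment above explained, a field of the bio records how many stripes of the request have been issued. If, say, only 8 stripes were issued last time, this continue makes the next call to the function pick up with the remaining 8 stripes.

Line 4572: get the stripe_head that corresponds to sector.

Line 4574: if no stripe_head could be obtained, save the number of stripes already issued and store raid_bio in the array's retry_read_aligned pointer. The next time raid5d is woken, it takes the bio straight from that pointer and continues issuing stripe requests.

Line 4578: return the number of stripes handled in this call.

Line 4581: add the bio to the stripe_head's request list.

Line 4582: if adding fails, release the stripe_head, record the number of stripes issued, and save the request for a later retry.

Line 4588: set the flag that tells the block layer not to merge this read.

Line 4589: handle the stripe.

Line 4590: drop the reference on the stripe_head.

Line 4591: increment the count of handled stripes.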
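The "issue what we can, remember where we stopped, resume later" pattern of this loop can be modelled in a few lines of user-space C. This is a sketch only: struct fake_bio, struct fake_cache, try_get_stripe and retry_pass are hypothetical stand-ins, the model never returns a stripe_head to the cache during a pass, and handle_stripe is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the bio and the stripe cache. */
struct fake_bio {
    unsigned total_stripes;   /* stripes covered by the request */
    unsigned processed;       /* plays the role of raid5_bi_processed_stripes() */
};

struct fake_cache {
    unsigned free_heads;      /* stripe_heads currently available */
};

/* Non-blocking allocation: fails when the cache is exhausted. */
static bool try_get_stripe(struct fake_cache *c)
{
    if (c->free_heads == 0)
        return false;
    c->free_heads--;
    return true;
}

/* One pass over the request. Returns the number of stripes handled in this
 * pass and sets *done once every stripe of the bio has been issued. */
static unsigned retry_pass(struct fake_bio *bio, struct fake_cache *cache,
                           bool *done)
{
    unsigned handled = 0;
    unsigned scnt;

    assert(bio->processed <= bio->total_stripes);
    *done = false;
    for (scnt = 0; scnt < bio->total_stripes; scnt++) {
        if (scnt < bio->processed)
            continue;               /* issued in an earlier pass */
        if (!try_get_stripe(cache)) {
            bio->processed = scnt;  /* remember where to resume */
            return handled;
        }
        handled++;                  /* the real code calls handle_stripe here */
    }
    bio->processed = bio->total_stripes;
    *done = true;
    return handled;
}
```

With a 16-stripe request and 8 free stripe_heads, the first pass handles 8 stripes and stops; once the cache has been refilled, a second pass skips the first 8 and handles the rest, which is the behaviour described for line 4568.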

4593     remaining = raid5_dec_bi_active_stripes(raid_bio);
4594     if (remaining == 0)
4595          bio_endio(raid_bio, 0);
4596     if (atomic_dec_and_test(&conf->active_aligned_reads))
4597          wake_up(&conf->wait_for_stripe);
4598     return handled;

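Line 4593: decrement the bio's count of active stripes.

Line 4594: all issued stripes have completed.

Line 4595: call the request's completion callback.

Line 4596: decrement the array's count of active aligned reads and, if it drops to zero, wake up the processes waiting on wait_for_stripe.

Line 4598: return the number of stripes handled.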
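The helpers raid5_bi_processed_stripes, raid5_set_bi_processed_stripes and raid5_dec_bi_active_stripes are not shown in this article. As far as I recall, kernels of this generation pack both counters into one 32-bit bio field, with the active-stripe count in the low 16 bits and the processed-stripe count in the high 16 bits, and update them atomically. Treat that layout as an assumption to check against your own source tree. Under that assumption, a non-atomic user-space model looks like this (all function names below are invented for the sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the shared 32-bit word:
 *   bits  0..15  number of active stripes still referencing the bio
 *   bits 16..31  number of stripes already processed by the retry loop
 * The kernel updates the word atomically; this model does not. */

static unsigned processed_stripes(uint32_t seg)
{
    return (seg >> 16) & 0xffff;
}

static unsigned active_stripes(uint32_t seg)
{
    return seg & 0xffff;
}

/* Store a new processed count while keeping the active count. */
static uint32_t set_processed_stripes(uint32_t seg, unsigned cnt)
{
    assert(cnt <= 0xffff);
    return (seg & 0xffff) | ((uint32_t)cnt << 16);
}

/* Drop one active reference; *remaining receives the new active count. */
static uint32_t dec_active_stripes(uint32_t seg, unsigned *remaining)
{
    assert(active_stripes(seg) > 0);
    seg -= 1;
    *remaining = seg & 0xffff;
    return seg;
}
```

The point of the packing is that one word can answer two questions: "where should the retry loop resume?" (line 4568) and "was that the last reference, so the bio can be completed?" (lines 4593 to 4595).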

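We have now handed the stripe_head to handle_stripe. How does handle_stripe treat a chunk-aligned read? Let's look at the function. Its first two if blocks are not executed, so we go straight to analyse_stripe. That function is long, but only a line or two of it really matters here, so I will pick out just those lines; you can open the source and follow along. First, retry_aligned_read allocated a stripe_head, and add_stripe_bio, which attaches the bio to the stripe_head, put the bio on the toread queue of the corresponding disk. Since this is a chunk-aligned read, only one data disk has a bio on its toread queue, so analyse_stripe executes: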

3245          if (test_bit(R5_Wantfill, &dev->flags))
3246               s->to_fill++;
3247          else if (dev->toread)
3248               s->to_read++;

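Line 3247: the device's read queue has a request on it.

Line 3248: increment the number of devices that need to be read.

The earlier chunk-aligned read failed because a sector on the physical disk went bad or the disk itself misbehaved. So either the rdev has been marked Faulty, or the sector on the physical disk is a bad block, meaning is_badblock returns true for it. For the first case, where the disk is marked Faulty: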

3271          if (rdev && test_bit(Faulty, &rdev->flags))
3272               rdev = NULL;
...
3287          if (!rdev)
3288               /* Not in-sync */;
...
3304          else if (test_bit(R5_UPTODATE, &dev->flags) &&
3305               test_bit(R5_Expanded, &dev->flags))
...
3310               set_bit(R5_Insync, &dev->flags);

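If the Faulty flag is set, rdev is set to NULL, so the test at line 3287 is true and execution never reaches line 3310. As a result, R5_Insync is not set in dev->flags.

For the second case, where the sector is a bad block, the attempted read is bound to have set the R5_ReadError flag: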

3350          if (test_bit(R5_ReadError, &dev->flags))
3351               clear_bit(R5_Insync, &dev->flags);

Line 3350: true.

Line 3351: clears the R5_Insync flag.

So whichever case applies, the outcome is the same: R5_Insync ends up cleared in dev->flags. Reading on:

3352          if (!test_bit(R5_Insync, &dev->flags)) {
3353               if (s->failed < 2)
3354                    s->failed_num[s->failed] = i;
3355               s->failed++;

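Line 3352: true.

Line 3353: true. s->failed is still 0 at this point. failed_num has only two slots, since RAID6 tolerates at most two failed devices; for RAID5 the array has already failed once more than one device is gone.

Line 3354: record the index of the failed disk.

Line 3355: increment the count of failed disks.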
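The bookkeeping quoted from analyse_stripe can be condensed into a small model. This sketch assumes a heavily simplified rule for "in sync" (the device is neither Faulty nor has a read error) and ignores every other branch of the real function; struct fake_dev, struct fake_state and analyse are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TRACKED_FAILURES 2

/* Hypothetical, reduced view of one member device in a stripe. */
struct fake_dev {
    bool rdev_faulty;   /* models the Faulty flag on the rdev */
    bool read_error;    /* models R5_ReadError */
    bool has_toread;    /* models a non-empty dev->toread */
};

struct fake_state {
    int to_read;
    int failed;
    int failed_num[MAX_TRACKED_FAILURES];
};

/* Count the devices with pending reads and record the failed devices. */
static void analyse(const struct fake_dev *devs, int disks,
                    struct fake_state *s)
{
    int i;

    assert(disks > 0);
    s->to_read = 0;
    s->failed = 0;
    s->failed_num[0] = s->failed_num[1] = -1;

    for (i = 0; i < disks; i++) {
        /* Simplifying assumption: these two conditions alone decide insync. */
        bool insync = !devs[i].rdev_faulty && !devs[i].read_error;

        if (devs[i].has_toread)
            s->to_read++;
        if (!insync) {
            if (s->failed < MAX_TRACKED_FAILURES)
                s->failed_num[s->failed] = i;   /* only two slots exist */
            s->failed++;
        }
    }
}
```

For a four-disk stripe in which disk 2 is Faulty and carries the pending read, the model ends with to_read == 1, failed == 1 and failed_num[0] == 2, which is the state the walkthrough relies on from here on.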

So this trip through analyse_stripe leaves us with two treasures: s->to_read, and s->failed together with s->failed_num[0] = i. Carrying both, we return to handle_stripe:

3468     /* Now we might consider reading some blocks, either to check/generate
3469     * parity, or to satisfy requests
3470     * or to load a block that is being partially written.
3471     */
3472     if (s.to_read || s.non_overwrite
3473         || (conf->level == 6 && s.to_write && s.failed)
3474         || (s.syncing && (s.uptodate + s.compute < disks))
3475         || s.replacing
3476         || s.expanding)
3477          handle_stripe_fill(sh, &s, disks);

Check whether any reads are needed.

Line 3472: s.to_read is true, so we go straight into handle_stripe_fill.

2707 /**
2708  * handle_stripe_fill - read or compute data to satisfy pending requests.
2709  */
2710 static void handle_stripe_fill(struct stripe_head *sh,
2711                      struct stripe_head_state *s,
2712                      int disks)
2713{
2714     int i;
2715
2716     /* look for blocks to read/compute, skip this if a compute
2717     * is already in flight, or if the stripe contents are in the
2718     * midst of changing due to a write
2719     */
2720     if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
2721         !sh->reconstruct_state)
2722          for (i = disks; i--; )
2723               if (fetch_block(sh, s, i, disks))
2724                    break;
2725     set_bit(STRIPE_HANDLE, &sh->state);
2726}

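As the comment says, the function either reads data directly or reads it in order to compute data. The disk we want to read has clearly failed, so what we need to do now is read the other disks and compute the data from them.

handle_stripe_fill is an old friend; we already visited it when we covered RAID5 resync.

Line 2722: call fetch_block for every disk in the stripe.

That brings us to fetch_block. We have read this function before too, but things are different now, and looking at the same scenery in a different mood gives a different impression.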

2624 static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
2625                 int disk_idx, int disks)
2626{
2627     struct r5dev *dev = &sh->dev[disk_idx];
2628     struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
2629                      &sh->dev[s->failed_num[1]] };
2630
2631     /* is the data in this block needed, and can we get it? */
2632     if (!test_bit(R5_LOCKED, &dev->flags) &&
2633         !test_bit(R5_UPTODATE, &dev->flags) &&
2634         (dev->toread ||
2635          (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
2636          s->syncing || s->expanding ||
2637          (s->replacing && want_replace(sh, disk_idx)) ||
2638          (s->failed >= 1 && fdev[0]->toread) ||
2639          (s->failed >= 2 && fdev[1]->toread) ||
2640          (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
2641           !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
2642          (sh->raid_conf->level == 6 && s->failed && s->to_write))) {

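Let's restate our context: dev->toread is non-empty for the disk with index i in the stripe, s->failed == 1, and s->failed_num[0] = i.

Inside fetch_block, when disk_idx == i, the conditions on lines 2632 and 2633 hold and so does line 2634, so we enter the if branch. When disk_idx != i, lines 2632 and 2633 hold and so does line 2638, so we enter the if branch as well.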

2648          if ((s->uptodate == disks - 1) &&
...
2670          } else if (s->uptodate == disks-2 && s->failed >= 2) {
...
2695          } else if (test_bit(R5_Insync, &dev->flags)) {
2696               set_bit(R5_LOCKED, &dev->flags);
2697               set_bit(R5_Wantread, &dev->flags);
2698               s->locked++;
2699               pr_debug("Reading block %d (sync=%d)\n",
2700                    disk_idx, s->syncing);
2701          }

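Since s->uptodate == 0, the first two conditions do not match and we reach line 2695. The failed disk does not have R5_Insync, so nothing is scheduled for it, while every non-failed disk gets R5_LOCKED and R5_Wantread set.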
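The per-disk outcome of fetch_block in our scenario can be written as a small decision function. This is a sketch under stated assumptions: it covers only RAID5 with a single failed disk whose block is wanted by a read, it omits the RAID6 branch at line 2670, and the "compute" condition (all other blocks up to date and this disk is the failed one) is my reading of the elided code at line 2648. enum fetch_action, struct fake_dev and fetch_decision are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

enum fetch_action { FETCH_NONE, FETCH_READ, FETCH_COMPUTE };

/* Hypothetical, reduced view of one r5dev's flags. */
struct fake_dev {
    bool locked;     /* R5_LOCKED */
    bool uptodate;   /* R5_UPTODATE */
    bool insync;     /* R5_Insync */
};

/* Decide what to do for one disk of the stripe. */
static enum fetch_action fetch_decision(const struct fake_dev *dev,
                                        int disk_idx, int disks,
                                        int uptodate_count,
                                        int failed, int failed_idx)
{
    assert(disk_idx >= 0 && disk_idx < disks);

    /* Blocks that are busy or already loaded need nothing more. */
    if (dev->locked || dev->uptodate)
        return FETCH_NONE;

    /* Every other block is in memory: the failed block can be computed. */
    if (uptodate_count == disks - 1 && failed && disk_idx == failed_idx)
        return FETCH_COMPUTE;

    /* Otherwise read from the disk if it is healthy. */
    if (dev->insync)
        return FETCH_READ;

    return FETCH_NONE;
}
```

On the first pass (uptodate_count == 0) the healthy disks come out as FETCH_READ and the failed disk as FETCH_NONE. On a later pass, once the three surviving blocks of a four-disk stripe are up to date, the failed disk comes out as FETCH_COMPUTE.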

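Here is a brief summary of the whole read-retry flow:

1) A chunk-aligned read is issued.

2) The read fails, the request is added to the retry list, and raid5d is woken.

3) raid5d removes the read request from the retry list, obtains a struct stripe_head for each stripe, and calls handle_stripe.

4) handle_stripe calls analyse_stripe, which sets s->to_read and s->failed. It then calls handle_stripe_fill to read the data from the other, redundant disks, and finally calls ops_run_io to send the requests to the disks.

5) When all the sub-requests sent to the disks have returned, raid5_end_read_request adds the stripe_head to the array's handle_list.

6) raid5d takes the stripe_head off handle_list and calls handle_stripe.

7) Since s->uptodate == disks-1 by now, handle_stripe calls handle_stripe_fill, which executes set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request). Because that flag is set, raid_run_ops calls ops_run_compute5 to compute the block that needs to be read.

8) The compute callback ops_complete_compute marks the corresponding dev->flags as R5_UPTODATE, and the stripe_head goes back onto handle_list.

9) handle_stripe runs once more. analyse_stripe sets the R5_Wantfill flag and s->to_fill, and handle_stripe then sets the STRIPE_OP_BIOFILL and STRIPE_BIOFILL_RUN flags. After that, raid_run_ops calls ops_run_biofill to copy the computed data into the bio's pages.

10) In the copy callback ops_complete_biofill, once all the issued stripes have returned, the original bio has all the data it asked for, and return_io completes the originally submitted bio.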
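Step 7 works because RAID5 parity is the XOR of the data blocks in a stripe, so any single missing block equals the XOR of all the surviving blocks, parity included. The sketch below shows only that arithmetic; it is not the kernel's async_tx-based ops_run_compute5, and xor_reconstruct is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rebuild the block at index `missing` by XOR-ing every other block of the
 * stripe (data and parity alike) into `out`. All blocks are `len` bytes. */
static void xor_reconstruct(unsigned char *out,
                            const unsigned char *const *blocks,
                            size_t nblocks, size_t missing, size_t len)
{
    size_t b, i;

    assert(missing < nblocks);
    memset(out, 0, len);
    for (b = 0; b < nblocks; b++) {
        if (b == missing)
            continue;               /* this is the block we cannot read */
        for (i = 0; i < len; i++)
            out[i] ^= blocks[b][i];
    }
}
```

For a three-disk stripe with data blocks d0 and d1 and parity p = d0 XOR d1, calling the function with missing set to the index of d1 XORs d0 and p together and yields d1 again.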

The next part continues with the raid5 read path for reads that are not chunk-aligned.
