RAID5 Source Code in the Linux Kernel Explained: Basic Architecture and Data Structures

License: CC 4.0 BY-SA


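This article takes a close look at the basic architecture and data structures of RAID5 in the Linux kernel. It covers building RAID5 as a module and setting up a test environment, then focuses on the two core data structures, stripe_head and r5conf, to give readers a foundation for understanding how RAID5 works.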

The Basic Architecture and Data Structures of RAID5 in the Linux Kernel

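As everyone knows, early computers stored data first on magnetic tape and later on disks. A single disk, however, offers neither great speed nor great reliability. But human ingenuity is something not even E.T. could imagine: our predecessors kept guessing and experimenting to squeeze more performance out of computers, and so the disk array was born. The arrival of RAID (Redundant Arrays of Independent Disks) brought a big leap in both storage performance and data safety. Storage media keep improving as well, from tape to disk to today's SSDs, and perhaps in the future to PCM (phase-change memory), which promises to outperform SSDs. Still, think about it: a few disks tied together can be several times faster than a single disk? Incredible! How did they pull that off? Today we look at how a real RAID is implemented, using RAID5, one of the most widely used levels, to study the RAID implementation in the kernel.
There are already plenty of articles and blog posts about RAID5, so let's take a different route: we start from the source code in the Linux kernel and work out, from the code's point of view, how RAID5 is implemented step by step. This post covers the basic architecture of RAID5 and the corresponding data structures.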

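Setting up the environment

Building RAID as a module

To dig into the kernel's RAID5 code you need a working RAID5 environment. To avoid a lengthy full kernel rebuild every time the source changes, we build RAID5 as a loadable module; the steps are described in my earlier post "Linux 内核中MD及RAID模块化" (building MD and RAID as modules in the Linux kernel).

Creating the RAID5 array

Once RAID5 is built as a module, we can set up the experimental RAID5 array; the steps are in my post "mdadm创建software RAID" (creating a software RAID with mdadm).

With the RAID5 environment in place, we can start learning how RAID5 works. Let's go!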

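Basic architecture of RAID5

RAID5 reads and writes are organized around the stripe, which is the basic unit of I/O. Take a 3+1 RAID5, that is, four disks providing the capacity of three data disks plus one disk's worth of parity: each stripe then holds 3 data blocks and 1 parity block. Let's walk through the architecture with a figure.
[Figure: RAID5 basic architecture]
As the figure shows, this is a 3+1 RAID5. Each box in the figure is one basic unit of a stripe, also called a chunk; boxes of the same color form one stripe, so each stripe consists of 3 data chunks (A, B, C) plus 1 parity chunk (P). The parity chunk is generated, and lost data recovered, as follows: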

The parity chunk P is generated as P = A ⊕ B ⊕ C (⊕ denotes XOR).
Suppose disk 1 fails and a read request arrives for chunk B0. B0 can then be recovered as B0 = A0 ⊕ C0 ⊕ P0.
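
To make the two formulas concrete, here is a minimal user-space sketch in C. It is not the kernel's code path; the helper name xor_blocks is made up for illustration. Because XOR is its own inverse, the same routine computes the parity and rebuilds a lost block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XOR `nblocks` equally sized blocks together, byte by byte, into `out`.
 * Hypothetical helper for illustration only, not a kernel function.
 *   parity:  out = A ^ B ^ C
 *   rebuild: out = A ^ C ^ P   (which equals the lost block B)
 */
static void xor_blocks(uint8_t *out, const uint8_t *blocks[],
                       size_t nblocks, size_t len)
{
    assert(nblocks > 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t acc = 0;
        for (size_t b = 0; b < nblocks; b++)
            acc ^= blocks[b][i];
        out[i] = acc;
    }
}
```

To rebuild B0 you pass the surviving blocks A0, C0 and P0 as the input array; the output buffer then holds the original contents of B0.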

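Notice that the parity chunks in the figure are not all stored on a single disk. Rotating parity across the disks spreads the parity updates that every write causes, so no single disk becomes a hot spot and wears out ahead of the others. The kernel provides several such parity-placement algorithms; the one used in the figure is ALGORITHM_LEFT_SYMMETRIC.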
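
As a rough illustration of how a left-symmetric layout rotates parity, the sketch below computes, for a given stripe number, which disk holds P and which disk holds the k-th data chunk. The formulas follow my reading of the ALGORITHM_LEFT_SYMMETRIC case in the kernel's raid5_compute_sector(); the function names ls_parity_disk and ls_data_disk are hypothetical, and the whole thing is a sketch rather than the kernel's implementation.

```c
#include <assert.h>

/* Disk index holding the parity chunk of `stripe` (left-symmetric rotation). */
static int ls_parity_disk(unsigned long stripe, int raid_disks)
{
    int data_disks = raid_disks - 1;

    assert(raid_disks >= 3);
    return data_disks - (int)(stripe % (unsigned long)raid_disks);
}

/*
 * Disk index holding the k-th data chunk (k = 0 .. data_disks - 1) of
 * `stripe`: data starts on the disk right after the parity disk and wraps.
 */
static int ls_data_disk(unsigned long stripe, int k, int raid_disks)
{
    int pd_idx = ls_parity_disk(stripe, raid_disks);

    return (pd_idx + 1 + k) % raid_disks;
}
```

For a 3+1 array this puts P on disk 3 for stripe 0, disk 2 for stripe 1, and so on, wrapping back to disk 3 at stripe 4; in stripe 0 the data chunks A0, B0, C0 land on disks 0, 1, 2.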

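Default stripe size

The page size on most systems is 4KB, and the kernel uses the sector as its basic unit of size, with 1 sector = 512B. With the default chunk size of 512KB this gives:

1 page = 8 sectors = 4KB
1 chunk = 128 pages = 512KB
1 stripe = 4 chunks = 2048KB
data size of 1 stripe = 3 chunks = 1536KB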

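RAID5 data structures

stripe_head: the unit RAID5 processes in the kernel

Intuitively the basic processing unit of RAID5 is the stripe, but a chunk is 512KB, far larger than the single page the OS handles at a time. To keep processing uniform, the kernel splits each chunk into 128 pages and takes one page at the same offset from every chunk of a stripe to form the basic unit that RAID5 actually processes: the stripe_head. stripe_head is defined in raid5.h.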

struct stripe_head {
    struct hlist_node   hash;
    struct list_head    lru;          /* inactive_list or handle_list */
    struct llist_node   release_list;
    struct r5conf       *raid_conf; /* global RAID5 configuration */
    short           generation; /* increments with every
                         * reshape */
    sector_t        sector;     /* sector of this row */
    short           pd_idx;     /* parity disk index */
    short           qd_idx;     /* 'Q' disk index for raid6 */
    short           ddf_layout;/* use DDF ordering to calculate Q */
    short           hash_lock_index;
    unsigned long       state;      /* state flags */
    atomic_t        count;        /* nr of active thread/requests */
    int         bm_seq; /* sequence number for bitmap flushes */
    int         disks;      /* disks in stripe */
    enum check_states   check_state;
    enum reconstruct_states reconstruct_state;
    spinlock_t      stripe_lock;
    int         cpu;
    struct r5worker_group   *group;
    /**
     * struct stripe_operations
     * @target - STRIPE_OP_COMPUTE_BLK target
     * @target2 - 2nd compute target in the raid6 case
     * @zero_sum_result - P and Q verification flags
     * @request - async service request flags for raid_run_ops
     */
    struct stripe_operations {
        int              target, target2;
        enum sum_check_flags zero_sum_result;
    } ops;
    struct r5dev {
        /* rreq and rvec are used for the replacement device when
         * writing data to both devices.
         */
        struct bio  req, rreq;
        struct bio_vec  vec, rvec;
        struct page *page, *orig_page;
        struct bio  *toread, *read, *towrite, *written;
        sector_t    sector;         /* sector of this page */
        unsigned long   flags;
    } dev[1]; /* allocated with extra space depending of RAID geometry */
};
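
The trailing dev[1] member, "allocated with extra space depending of RAID geometry", is the classic variable-length tail: one allocation holds the header plus one entry per disk. Below is a small user-space sketch of the idea. The toy_ types and functions are hypothetical, heavily trimmed stand-ins, and the sketch uses a C99 flexible array member (dev[]) as the standard-conforming spelling of the same trick rather than the kernel's dev[1].

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed stand-in for struct r5dev. */
struct toy_dev {
    unsigned long flags;
    unsigned long long sector;
};

/* Hypothetical, trimmed stand-in for struct stripe_head. */
struct toy_stripe_head {
    int disks;              /* number of entries in dev[] */
    struct toy_dev dev[];   /* one buffer descriptor per member disk */
};

/* Bytes needed for a header plus `disks` per-disk entries. */
static size_t toy_sh_size(int disks)
{
    assert(disks >= 1);
    return sizeof(struct toy_stripe_head) +
           (size_t)disks * sizeof(struct toy_dev);
}

/* Allocate a zeroed toy_stripe_head sized for `disks` member disks. */
static struct toy_stripe_head *toy_sh_alloc(int disks)
{
    struct toy_stripe_head *sh = calloc(1, toy_sh_size(disks));

    if (sh)
        sh->disks = disks;
    return sh;
}
```

With a 3+1 array you would call toy_sh_alloc(4) and then index sh->dev[0] through sh->dev[3]; the caller releases the object with free().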

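The fields are annotated above. Let's use a figure to see exactly how a stripe_head differs from a stripe.
[Figure: stripe_head layout]
This figure zooms in on stripe 0 from the first figure, which consists of A0, B0, C0 and P0. Each chunk is broken down further: a chunk is 128 pages in size, at 4KB per page. The pages at the same offset within each chunk form one stripe_head, so the boxes of the same color in the figure make up one stripe_head.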

1 stripe_head = 4 pages = 16KB
1 stripe = 128 stripe_heads = 2048KB
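
The size arithmetic is easy to check in code. The sketch below derives every number from three inputs (page size, chunk size, number of disks); the struct toy_geometry and its helper functions are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define TOY_SECTOR_SIZE 512UL   /* bytes per sector */

/* Hypothetical description of an array's geometry, all sizes in bytes. */
struct toy_geometry {
    unsigned long page_size;    /* e.g. 4096 */
    unsigned long chunk_size;   /* e.g. 512 * 1024 */
    unsigned long raid_disks;   /* data disks + 1 parity, e.g. 4 */
};

static unsigned long sectors_per_page(const struct toy_geometry *g)
{
    return g->page_size / TOY_SECTOR_SIZE;
}

static unsigned long pages_per_chunk(const struct toy_geometry *g)
{
    assert(g->chunk_size % g->page_size == 0);
    return g->chunk_size / g->page_size;
}

/* One stripe_head covers one page on every member disk. */
static unsigned long stripe_head_bytes(const struct toy_geometry *g)
{
    return g->raid_disks * g->page_size;
}

/* One stripe covers one chunk on every member disk. */
static unsigned long stripe_bytes(const struct toy_geometry *g)
{
    return g->raid_disks * g->chunk_size;
}

/* Usable data in a stripe: everything except the single parity chunk. */
static unsigned long stripe_data_bytes(const struct toy_geometry *g)
{
    return (g->raid_disks - 1) * g->chunk_size;
}

/* A stripe is made of one stripe_head per page offset within a chunk. */
static unsigned long stripe_heads_per_stripe(const struct toy_geometry *g)
{
    return pages_per_chunk(g);
}
```

Filling in 4096, 512 * 1024 and 4 reproduces the figures in this post: 8 sectors per page, 128 pages per chunk, a 16KB stripe_head, a 2048KB stripe with 1536KB of data, and 128 stripe_heads per stripe.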

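In other words, the stripe we usually call the processing unit of RAID5 is really a collection of the kernel's actual processing units, stripe_heads. Don't mix the two up!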

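One more point worth stressing: every bio request has a start address, and that address maps to a position (the ALGORITHM_LEFT_SYMMETRIC layout described above determines which disk it lands on and at what offset on that disk). Once that position is known, the page there and the pages at the same offset on the other disks make up one stripe_head. This grouping is fixed and cannot change! The sector field of stripe_head records that offset.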
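
Here is a simplified sketch of that address translation for the left-symmetric layout. It follows my understanding of what the kernel's raid5_compute_sector() does, leaving out reshape, RAID6 and the other layouts; the struct toy_sector_map and the function toy_map_sector are hypothetical names. The input is a sector number within the array, assumed to be already aligned to a page (8 sectors).

```c
#include <assert.h>
#include <stdint.h>

/* Result of mapping an array sector onto the member disks (hypothetical type). */
struct toy_sector_map {
    uint64_t dev_sector;  /* offset on every member disk; what sh->sector records */
    int dd_idx;           /* disk that holds the requested data */
    int pd_idx;           /* disk that holds the parity for this stripe_head */
};

static struct toy_sector_map toy_map_sector(uint64_t r_sector,
                                            unsigned int chunk_sectors,
                                            int raid_disks)
{
    struct toy_sector_map m;
    int data_disks = raid_disks - 1;

    assert(chunk_sectors > 0 && raid_disks >= 3);

    /* Which data chunk of the array, and where inside that chunk. */
    uint64_t chunk_offset = r_sector % chunk_sectors;
    uint64_t chunk_number = r_sector / chunk_sectors;

    /* Which stripe, and which data chunk (0 .. data_disks - 1) within it. */
    uint64_t stripe = chunk_number / (uint64_t)data_disks;
    int k = (int)(chunk_number % (uint64_t)data_disks);

    /* Left-symmetric placement of parity and data. */
    m.pd_idx = data_disks - (int)(stripe % (uint64_t)raid_disks);
    m.dd_idx = (m.pd_idx + 1 + k) % raid_disks;

    /* Same offset on every member disk. */
    m.dev_sector = stripe * chunk_sectors + chunk_offset;
    return m;
}
```

With chunk_sectors = 1024 and 4 disks, array sector 1032 falls 8 sectors into the second data chunk of stripe 0, so it maps to disk 1 at device sector 8, with parity on disk 3. The pages at device sector 8 on all four disks form the stripe_head for that request.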

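Also note the struct r5dev array dev[] in stripe_head: each entry is the buffer for one disk, one page in size. It holds the lists of bios headed for that disk (toread: read requests still to be handled; towrite: write requests still to be handled; read: read requests already handled; written: write requests already handled) together with the buffer's flags.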

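r5conf: the global RAID5 configuration

Every system needs some global state to manage itself, and RAID5 is no exception: the whole array is managed through one data structure, r5conf. r5conf is defined in raid5.h: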

struct r5conf {
    struct hlist_head   *stripe_hashtbl;
    /* only protect corresponding hash list and inactive_list */
    spinlock_t      hash_locks[NR_STRIPE_HASH_LOCKS];
    struct mddev        *mddev;
    int         chunk_sectors; /* sectors per chunk; default 1024, i.e. a 512KB chunk */
    int         level, algorithm; /* level = 5 for RAID5 */
    int         max_degraded;
    int         raid_disks; /* number of disks in the array */
    int         max_nr_stripes; /* maximum number of stripe_heads allowed; default 256 */

    /* reshape_progress is the leading edge of a 'reshape'
     * It has value MaxSector when no reshape is happening
     * If delta_disks < 0, it is the last sector we started work on,
     * else is it the next sector to work on.
     */
    sector_t        reshape_progress;
    /* reshape_safe is the trailing edge of a reshape.  We know that
     * before (or after) this address, all reshape has completed.
     */
    sector_t        reshape_safe;
    int         previous_raid_disks;
    int         prev_chunk_sectors;
    int         prev_algo;
    short           generation; /* increments with every reshape */
    seqcount_t      gen_lock;   /* lock against generation changes */
    unsigned long       reshape_checkpoint; /* Time we last updated
                             * metadata */
    long long       min_offset_diff; /* minimum difference between
                          * data_offset and
                          * new_data_offset across all
                          * devices.  May be negative,
                          * but is closest to zero.
                          */

    struct list_head    handle_list; /* stripes needing handling */
    struct list_head    hold_list; /* preread ready stripes */
    struct list_head    delayed_list; /* stripes that have plugged requests */
    struct list_head    bitmap_list; /* stripes delaying awaiting bitmap update */
    struct bio      *retry_read_aligned; /* currently retrying aligned bios   */
    struct bio      *retry_read_aligned_list; /* aligned bios retry list  */
    atomic_t        preread_active_stripes; /* stripes with scheduled io */
    atomic_t        active_aligned_reads;
    atomic_t        pending_full_writes; /* full write backlog */
    int         bypass_count; /* bypassed prereads */
    int         bypass_threshold; /* preread nice */
    int         skip_copy; /* Don't copy data from bio to stripe cache */
    struct list_head    *last_hold; /* detect hold_list promotions */

    atomic_t        reshape_stripes; /* stripes with pending writes for reshape */
    /* unfortunately we need two cache names as we temporarily have
     * two caches.
     */
    int         active_name;
    char            cache_name[2][32];
    struct kmem_cache       *slab_cache; /* for allocating stripes */

    int         seq_flush, seq_write;
    int         quiesce;

    int         fullsync;  /* set to 1 if a full sync is needed,
                        * (fresh device added).
                        * Cleared when a sync completes.
                        */
    int         recovery_disabled;
    /* per cpu variables */
    struct raid5_percpu {
        struct page *spare_page; /* Used when checking P/Q in raid6 */
        void        *scribble;   /* space for constructing buffer
                          * lists and performing address
                          * conversions
                          */
    } __percpu *percpu;
    size_t          scribble_len; /* size of scribble region must be
                           * associated with conf to handle
                           * cpu hotplug while reshaping
                           */
#ifdef CONFIG_HOTPLUG_CPU
    struct notifier_block   cpu_notify;
#endif

    /*
     * Free stripes pool
     */
    atomic_t        active_stripes;
    struct list_head    inactive_list[NR_STRIPE_HASH_LOCKS];
    atomic_t        empty_inactive_list_nr;
    struct llist_head   released_stripes;
    wait_queue_head_t   wait_for_stripe;
    wait_queue_head_t   wait_for_overlap;
    int         inactive_blocked;   /* release of inactive stripes blocked,
                             * waiting for 25% to be free
                             */
    int         pool_size; /* number of disks in stripeheads in pool */
    spinlock_t      device_lock;
    struct disk_info    *disks;

    /* When taking over an array from a different personality, we store
     * the new thread here until we fully activate the array.
     */
    struct md_thread    *thread;
    struct list_head    temp_inactive_list[NR_STRIPE_HASH_LOCKS];
    struct r5worker_group   *worker_groups;
    int         group_cnt;
    int         worker_cnt_per_group;
};

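The points to pay attention to here are:

Metadata such as chunk_sectors, level, raid_disks and max_nr_stripes; the comments are in the code above.
handle_list, hold_list, delayed_list and bitmap_list; the comments say clearly what each one is for. A stripe_head passes through different states while it is being processed, so it moves between these lists as it goes. Understanding the conditions and flow of those moves helps a great deal in understanding how raid5 runs!
The raid5 daemon thread, struct md_thread *thread. In raid5 the daemon thread is raid5d; as the comment in the struct notes, this field holds the thread until the array is fully activated.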

State flags of the RAID5 data structures

stripe_head state flags

The definition of stripe_head contains the field unsigned long state, and raid5.h has the following enum:

/*
 * Stripe state
 */
enum {
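    STRIPE_ACTIVE,          /* being handled */
    STRIPE_HANDLE,          /* needs handling */
    STRIPE_SYNC_REQUESTED,  /* sync requested */
    STRIPE_SYNCING,         /* sync in progress */
    STRIPE_INSYNC,          /* in sync */
    STRIPE_REPLACED,
    STRIPE_PREREAD_ACTIVE,  /* preread */
    STRIPE_DELAYED,         /* handling delayed */
    STRIPE_DEGRADED,        /* degraded */
    STRIPE_BIT_DELAY,       /* waiting for bitmap update */
    STRIPE_EXPANDING,       /* expanding */
    STRIPE_EXPAND_SOURCE,
    STRIPE_EXPAND_READY,
    STRIPE_IO_STARTED,      /* IO has been issued */
    STRIPE_FULL_WRITE,      /* all blocks are set to be overwritten, i.e. a full-stripe write */
    STRIPE_BIOFILL_RUN,     /* bio fill: copy the stripe cache page contents into the bio's pages */
    STRIPE_COMPUTE_RUN,     /* compute in progress */
    STRIPE_OPS_REQ_PENDING, /* used by handle_stripe for queueing */
    STRIPE_ON_UNPLUG_LIST,  /* whether the stripe is on the unplug list when release_list is processed in batches */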
    STRIPE_ON_RELEASE_LIST,
};

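In practice a flag is set with set_bit(STRIPE_HANDLE, &sh->state) and cleared with clear_bit(STRIPE_HANDLE, &sh->state) (the bit number comes first, then the address of the state word); the meaning of each flag is in the comments above. These states say what a stripe_head needs done or what is currently being done to it, and they decide how the stripe_head is handled next, so they matter a lot and are worth knowing well. This is only a brief overview; the next post follows the changes in a stripe_head through concrete operations.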
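
If the enum-as-bit-number idea is new to you, the user-space sketch below mimics it. The toy_ helpers are hypothetical stand-ins with the same argument order as the kernel's set_bit(), clear_bit() and test_bit(), but unlike the kernel versions they are not atomic; the three-entry enum is a cut-down copy used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Cut-down stand-in for the stripe state enum: each value is a bit number. */
enum toy_stripe_state {
    TOY_STRIPE_ACTIVE,   /* bit 0 */
    TOY_STRIPE_HANDLE,   /* bit 1 */
    TOY_STRIPE_DELAYED,  /* bit 2 */
};

#define TOY_BITS_PER_LONG ((int)(sizeof(unsigned long) * 8))

/* Set bit `nr` in the word at `addr` (non-atomic, illustration only). */
static void toy_set_bit(int nr, unsigned long *addr)
{
    assert(nr >= 0 && nr < TOY_BITS_PER_LONG);
    *addr |= 1UL << nr;
}

/* Clear bit `nr` in the word at `addr` (non-atomic, illustration only). */
static void toy_clear_bit(int nr, unsigned long *addr)
{
    assert(nr >= 0 && nr < TOY_BITS_PER_LONG);
    *addr &= ~(1UL << nr);
}

/* Return true if bit `nr` is set in the word at `addr`. */
static bool toy_test_bit(int nr, const unsigned long *addr)
{
    assert(nr >= 0 && nr < TOY_BITS_PER_LONG);
    return ((*addr >> nr) & 1UL) != 0;
}
```

Several flags can be set at once in the same word, which is why a stripe_head can be, say, both in need of handling and delayed at the same time.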

dev state flags

dev is the buffer for a disk, and its state lives in unsigned long flags. The set of flags is in raid5.h:

/* Flags for struct r5dev.flags */
enum r5dev_flags {
    R5_UPTODATE,    /* page contains current data */
    R5_LOCKED,  /* IO has been submitted on "req" */
    R5_DOUBLE_LOCKED,/* Cannot clear R5_LOCKED until 2 writes complete */
    R5_OVERWRITE,   /* towrite covers whole page */
/* and some that are internal to handle_stripe */
    R5_Insync,  /* rdev && rdev->in_sync at start */
    R5_Wantread,    /* want to schedule a read */
    R5_Wantwrite,
    R5_Overlap, /* There is a pending overlapping request
             * on this block */
    R5_ReadNoMerge, /* prevent bio from merging in block-layer */
    R5_ReadError,   /* seen a read error here recently */
    R5_ReWrite, /* have tried to over-write the readerror */

    R5_Expanded,    /* This block now has post-expand data */
    R5_Wantcompute, /* compute_block in progress treat as
             * uptodate
             */
    R5_Wantfill,    /* dev->toread contains a bio that needs
             * filling
             */
    R5_Wantdrain,   /* dev->towrite needs to be drained */
    R5_WantFUA, /* Write should be FUA */
    R5_SyncIO,  /* The IO is sync */
    R5_WriteError,  /* got a write error - need to record it */
    R5_MadeGood,    /* A bad block has been fixed by writing to it */
    R5_ReadRepl,    /* Will/did read from replacement rather than orig */
    R5_MadeGoodRepl,/* A bad block on the replacement device has been
             * fixed by writing to it */
    R5_NeedReplace, /* This device has a replacement which is not
             * up-to-date at this stripe. */
    R5_WantReplace, /* We need to update the replacement, we have read
             * data in, and now is a good time to write it out.
             */
    R5_Discard, /* Discard the stripe */
    R5_SkipCopy,    /* Don't copy data from bio to stripe cache */
};