GlusterFS Source Code Analysis: dht_create

带着你的名字

Categories: Distributed Storage Notes, GlusterFS. Tags: distributed storage, glusterfs, source code

Article link: https://blog.csdn.net/weixin_42565760/article/details/119940321

1. Introduction

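GlusterFS's Distributed Hash Table (DHT) mechanism is the core of its data distribution, and the directory is the basic unit that file distribution is organized around. When a user creates a directory from a client, the directory is created on every brick; when a file is created, the hash algorithm determines which single brick stores it. In other words, every brick holds the same directory tree, but each file lives on one particular brick only, and that brick is located from the file's parent directory (identified by its gfid) and the file name.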

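GlusterFS has no metadata server. Its designers dropped the metadata-server concept entirely and use a distributed hash algorithm in place of the centralized or distributed metadata services found in traditional distributed file systems. Data and its metadata are stored together, so files can be located independently and in parallel: servers and clients need only the file's path and name to locate the data and read or write it. Access proceeds as follows: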

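1. Compute the hash value; the inputs are the file's path and name.
2. Use the hash value to select a storage subvolume within the unified virtual storage pool, which locates the file.
3. Access the data through the selected subvolume.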

2. Implementation

2.1 Hash strategy

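The system uses an elastic hash algorithm, and the basic unit of data distribution is the file. The hash value is a 32-bit number (the code later in this article stores it in a uint32_t), so the hash space runs from 0x00000000 to 0xffffffff. If a directory is spread across N bricks of equal weight, each brick covers roughly 2^32/N of that space: the first brick takes about 0 to 2^32/N, the second 2^32/N to 2*2^32/N, and in general the i-th brick takes (i-1)*2^32/N to i*2^32/N.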
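The equal split can be sketched in C as follows. This is a minimal illustration, not GlusterFS code: the names hash_range_t and equal_range are hypothetical, the bricks are assumed to have equal weight, and the remainder of the division is simply given to the last brick.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one brick's inclusive hash range. */
typedef struct {
    uint32_t start;
    uint32_t stop;
} hash_range_t;

/* Return the range of brick i (0-based) when the 32-bit hash space
 * 0x00000000..0xffffffff is split evenly across n bricks.
 * The last brick absorbs the remainder so the ranges cover everything. */
static hash_range_t
equal_range(uint32_t i, uint32_t n)
{
    hash_range_t r;
    uint32_t chunk;

    assert(n > 0 && i < n);

    chunk = 0xffffffffu / n; /* size of each slice */
    r.start = i * chunk;
    r.stop = (i == n - 1) ? 0xffffffffu : (i + 1) * chunk - 1;
    return r;
}
```

With two bricks this yields 0x00000000 to 0x7ffffffe for the first and 0x7fffffff to 0xffffffff for the second.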

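Every node in the cluster has the same directory structure, but a given file exists on one node only. The placement works as follows:

First, the file's hash value is computed from the file name (the hash function mixes the name with two fixed-size constants). Next, the value is compared against each node's range in turn to find the range it falls in, which determines the node that should store the file. Finally, the graph is used to reach the corresponding storage node, and the file is placed in the matching directory there.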
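The lookup can be illustrated with a small C sketch. The hash here is 32-bit FNV-1a, used purely as a stand-in; GlusterFS computes its own hash through dht_hash_compute. The names range_entry_t, name_hash and find_brick are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout entry: an inclusive hash range owned by one brick. */
typedef struct {
    uint32_t start;
    uint32_t stop;
    const char *brick;
} range_entry_t;

/* Stand-in hash (32-bit FNV-1a). This is NOT the GlusterFS hash;
 * it only illustrates "file name in, 32-bit value out". */
static uint32_t
name_hash(const char *name)
{
    uint32_t h = 2166136261u;

    assert(name != NULL);
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h;
}

/* Walk the ranges in order and return the brick whose range contains
 * the hash, or NULL if the layout has a hole at that value. */
static const char *
find_brick(const range_entry_t *list, size_t cnt, uint32_t hash)
{
    size_t i;

    for (i = 0; i < cnt; i++) {
        if (list[i].start <= hash && hash <= list[i].stop)
            return list[i].brick;
    }
    return NULL;
}
```

A caller would build the list from the parent directory's ranges and call find_brick(list, cnt, name_hash("file.txt")).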

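2.2 Extended attributes

GlusterFS relies heavily on extended attributes to implement file operations across the cluster. The key get/set extended-attribute operations reach every server involved, so setxattr and getxattr can be used to pass messages to the servers.

When a directory is created, GlusterFS also sets an extended attribute on it. This attribute records the hash range assigned to that directory, along with other information used to locate files.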
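As an illustration of how such a range could be read back, the sketch below decodes a raw attribute value. It rests on an assumption that is not stated in this article: that the value (for example from the trusted.glusterfs.dht attribute of a brick directory) is four 32-bit big-endian words, with the last two holding the start and stop of the range. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one big-endian 32-bit value from a byte buffer. */
static uint32_t
be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* ASSUMED format: four 32-bit big-endian words, where words 2 and 3
 * are the start and stop of the directory's range on this brick.
 * Returns 0 on success, -1 if the buffer is too short. */
static int
decode_layout_xattr(const unsigned char *val, size_t len,
                    uint32_t *start, uint32_t *stop)
{
    assert(val != NULL && start != NULL && stop != NULL);

    if (len < 16)
        return -1;
    *start = be32(val + 8);
    *stop = be32(val + 12);
    return 0;
}
```

On Linux the raw bytes could be fetched with getxattr(2) from <sys/xattr.h>, run directly against the directory on a brick.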

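2.3 Graph analysis

In a volfile that contains the DHT xlator, each brick appears as a volume beneath the dht xlator. Each subvolume has its own hash range, and finding the node that should hold a file means walking the subvolumes and comparing against their ranges. The file looks like this: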

/var/lib/glusterd/vols/vol1/vol1-rebalance.vol:

volume vol1-client-0
    type protocol/client
    option transport.socket.keepalive-count 5
    option transport.socket.keepalive-interval 2
    option transport.socket.keepalive-time 20
    option password 4bd06a73-e935-4e45-8a30-e45bc609e0bb
    option username c8730cf8-037b-440a-9461-95d4f63609e7
    option transport-type tcp
    option remote-subvolume /data/pool1/vol1-brick
    option remote-host Al2Xtao
    option ping-timeout 200
end-volume

volume vol1-io-retry-0
    type features/io-retry
    option io-retry on
    subvolumes vol1-client-0
end-volume

volume vol1-client-1
    type protocol/client
    option transport.socket.keepalive-count 5
    option transport.socket.keepalive-interval 2
    option transport.socket.keepalive-time 20
    option password 4bd06a73-e935-4e45-8a30-e45bc609e0bb
    option username c8730cf8-037b-440a-9461-95d4f63609e7
    option transport-type tcp
    option remote-subvolume /data/pool2/vol1-brick
    option remote-host Al1Xtao
    option ping-timeout 200
end-volume

volume vol1-io-retry-1
    type features/io-retry
    option io-retry on
    subvolumes vol1-client-1
end-volume

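# Load the DHT translator
volume vol1-dht
    type cluster/distribute
    # The subvolumes of dht. A hash is computed from the file name and the parent directory's gfid,
    # and one of these subvolumes is selected, i.e. the node that holds the file is determined.
    # Each dht subvolume corresponds to a different brick.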
    subvolumes vol1-io-retry-0 vol1-io-retry-1	
end-volume

volume vol1
    type debug/io-stats
    option count-fop-hits off
    option frame-latency-measurement off
    option latency-measurement off
    option io-stats on
    option io-stats-global-switch off
    option log-level INFO
    subvolumes vol1-dht
end-volume

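2.4 Implementation details

Steps

GlusterFS sets the extended attributes for the elastic hash algorithm in the following steps:

1. Compute the size of each range from the total hash space and the number of subvolumes.
2. Compute start_subvol, that is, use a hash to obtain the index of the subvolume at which range assignment begins.
3. Assign ranges to the subvolumes from the starting subvolume through the last one.
4. Assign ranges to the subvolumes with indices 0 up to start_subvol in the same way; only the subvolumes differ.
5. Store the resulting ranges in the xattr of the corresponding directory.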
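Steps 1, 3 and 4 can be sketched in C as below. This is an illustrative model under stated assumptions, not the GlusterFS implementation: assign_ranges and hash_range_t are hypothetical names, start_subvol is taken as a parameter rather than derived from a hash, and all subvolumes get equal slices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one subvolume's inclusive hash range. */
typedef struct {
    uint32_t start;
    uint32_t stop;
} hash_range_t;

/* Assign equal slices of the 32-bit hash space to cnt subvolumes,
 * beginning with start_subvol and wrapping around to index 0.
 * out[i] receives the range of subvolume i. */
static void
assign_ranges(hash_range_t *out, uint32_t cnt, uint32_t start_subvol)
{
    uint32_t chunk, k, idx, next = 0;

    assert(out != NULL && cnt > 0 && start_subvol < cnt);

    chunk = 0xffffffffu / cnt; /* step 1: size of each slice */
    for (k = 0; k < cnt; k++) {
        /* steps 3 and 4: start_subvol..cnt-1 first, then 0..start_subvol-1 */
        idx = (start_subvol + k) % cnt;
        out[idx].start = next;
        /* the last slice absorbs the division remainder */
        out[idx].stop = (k == cnt - 1) ? 0xffffffffu : next + chunk - 1;
        next = out[idx].stop + 1; /* wraps to 0 after the last slice; unused */
    }
}
```

With three subvolumes and start_subvol = 1, subvolume 1 gets the first slice, subvolume 2 the second, and subvolume 0 the last.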

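Flowchart

The flow of the program is sketched below (some parts are simplified):

[Figure: flowchart of dht_create; image not available]

Talk is cheap, show me the code. Here is the source: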

dht_create

// Create a file, choosing the brick through the distributed hash algorithm
// DHT: Distributed Hash Table
int
dht_create(call_frame_t *frame, xlator_t *this, loc_t *loc, int32_t flags,
           mode_t mode, mode_t umask, fd_t *fd, dict_t *params)
{
    int op_errno = -1;
    xlator_t *subvol = NULL;
    dht_local_t *local = NULL;
    int i = 0;
    dht_conf_t *conf = NULL;
    int ret = 0;

    // Validate the arguments
    VALIDATE_OR_GOTO(frame, err);
    VALIDATE_OR_GOTO(this, err);
    VALIDATE_OR_GOTO(loc, err);
    DHT_IF_SNAPSHOT_THEN_GOTO(loc->dataset, err);
    DHT_IF_DOTSNAPSHOT_THEN_GOTO(loc->parent, err);

    conf = this->private;

    dht_get_du_info(frame, this, loc);

    // Initialize the per-call local state: copy loc, check fd and take a reference on it
    local = dht_local_init(frame, loc, fd, GF_FOP_CREATE);
    if (!local) {
        op_errno = ENOMEM;
        goto err;
    }

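    // TODO: revisit this later
    // 1. Judging by the log message below, this checks whether the file name
    //    contains a key that pins the file to a specific subvolume (stored in
    //    local->filter_subvol); the hash itself is computed in step 2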
    if (dht_filter_loc_subvol_key(this, loc, local, &local->filter_subvol)) {
        gf_msg(this->name, GF_LOG_INFO, 0, DHT_MSG_SUBVOL_INFO,
               "creating %s on %s (got create on %s)", local->loc.path,
               local->filter_subvol->name, loc->path);
    }

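    // The DHT xlator's subvolumes are the io-retry xlators of the bricks,
    // so each subvolume leads to a different brick; the distributed hash builds on this

    // 2. Hash the file name against the parent directory's layout to get the hashed subvol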
    subvol = dht_subvol_get_hashed(this, &local->loc);

    if (!subvol) {
        // Hashing failed. TODO: look into this later
        gf_msg(this->name, GF_LOG_ERROR, 0, DHT_MSG_HASHED_SUBVOL_GET_FAILED,
               "no subvolume in layout for path=%s", loc->path);

        // Save loc and the other arguments in a stub on local
        local->stub = fop_create_stub(frame, dht_create, loc, flags, mode,
                                      umask, fd, params);
        if (!local->stub) {
            op_errno = ENOMEM;
            goto err;
        }

        // Refresh the hash layout
        ret = dht_layout_refreshing(this, loc, frame);
        // Returns 0 on success, -1 on failure
        if (ret) {
            gf_log(this->name, GF_LOG_ERROR, "refresh layout failed");
            op_errno = ENOMEM;
            goto err;
        }

        goto done;
    }

    // Record the hashed subvolume in local
    local->hashed_subvol = subvol;

    // The layout held by the client may differ from the on-disk layout,
    // so a create call may land on a decommissioned brick
    /* Post remove-brick, the client layout may not be in sync with
     * disk layout because of lack of lookup. Hence,a create call
     * may fall on the decommissioned brick.  Hence, if the
     * hashed_subvol is part of decommissioned bricks  list, do a
     * lookup on parent dir. If a fix-layout is already done by the
     * remove-brick process, the parent directory layout will be in
     * sync with that of the disk. If fix-layout is still ending
     * on the parent directory, we can let the file get created on
     * the decommissioned brick which will be eventually migrated to
     * non-decommissioned brick based on the new layout.
     */

    // If there are decommissioned bricks (the count is greater than 0)
    if (conf->decommission_subvols_cnt) {
        for (i = 0; i < conf->subvolume_cnt; i++) {
            // Walk the list of decommissioned bricks
            if (conf->decommissioned_bricks[i] &&
                conf->decommissioned_bricks[i] == subvol) {
                // The hashed subvol is decommissioned: log it and take the path described above
                gf_msg_debug(this->name, 0,
                             "hashed subvol:%s is "
                             "part of decommission brick list for "
                             "file: %s",
                             subvol->name, loc->path);

                /* dht_refresh_layout needs directory info in
                 * local->loc. Hence, storing the parent_loc in
                 * local->loc and storing the create context in
                 * local->loc2. We will restore this information
                 * in dht_creation do */
                // The two fields used here:
                /* loc_t loc;   // @old in rename(), link() */
                /* loc_t loc2;  // @new in rename(), link() */

                // Copy loc into loc2
                ret = loc_copy(&local->loc2, &local->loc);
                if (ret) {
                    gf_msg(this->name, GF_LOG_ERROR, ENOMEM, DHT_MSG_NO_MEMORY,
                           "loc_copy failed %s", loc->path);

                    goto err;
                }

                // Save the create arguments (params dict, flags, mode, umask)
                local->params = dict_ref(params);
                local->flags = flags;
                local->mode = mode;
                local->umask = umask;

                // Wipe local->loc: drop its references and reset it to NULL
                // so that it can hold the parent directory's loc below
                loc_wipe(&local->loc);

                // Build the parent directory's loc and store it in local->loc
                ret = dht_build_parent_loc(this, &local->loc, loc, &op_errno);

                if (ret) {
                    gf_msg(this->name, GF_LOG_ERROR, ENOMEM, DHT_MSG_NO_MEMORY,
                           "parent loc build failed");
                    goto err;
                }

                // Take the create lock (per the error message below, this locks the parent directory)
                ret = dht_create_lock(frame, subvol);

                if (ret < 0) {
                    gf_msg(this->name, GF_LOG_ERROR, 0, DHT_MSG_INODE_LK_ERROR,
                           "locking parent failed");
                    goto err;
                }

                goto done;
            }
        }
    }

    // hashed_subvol is known at this point; now find avail_subvol
    dht_create_wind_to_avail_subvol(frame, this, subvol, &local->loc, flags,
                                    mode, umask, fd, params);
done:
    return 0;

err:

    op_errno = (op_errno == -1) ? errno : op_errno;
    DHT_STACK_UNWIND(create, frame, -1, op_errno, NULL, NULL, NULL, NULL, NULL,
                     NULL);

    return 0;
}

The main functions it relies on follow.

dht_create_wind_to_avail_subvol

// Wind the create to a usable DHT subvolume
int
dht_create_wind_to_avail_subvol(call_frame_t *frame, xlator_t *this,
                                xlator_t *hashed_subvol, loc_t *loc,
                                int32_t flags, mode_t mode, mode_t umask,
                                fd_t *fd, dict_t *params)
{
    dht_local_t *local = NULL;
    xlator_t *avail_subvol = NULL;

    local = frame->local;

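    // Pick a usable subvolume from the list to balance load across bricks.
    // From reading its code, it returns NULL by default and only returns
    // a non-NULL avail_subvol after the volume has been expanded.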
    avail_subvol = dht_select_low_capcity_subvolume(this, hashed_subvol);

    // A usable subvol was returned
    if (avail_subvol) {
        if (avail_subvol == hashed_subvol) {
            // It is the same as the hashed subvol, so everything is normal:
            // wind directly and return
            gf_msg_debug(this->name, 0, "creating %s on %s", loc->path,
                         hashed_subvol->name);

            STACK_WIND(frame, dht_create_cbk, hashed_subvol,
                       hashed_subvol->fops->create, loc, flags, mode, umask, fd,
                       params);
            goto out;
        }

        // avail_subvol differs from hashed_subvol, which means the hashed brick
        // does not have enough room for the new file. The file has to go to
        // another brick, and that requires a linkfile.
        local->params = dict_ref(params);
        local->flags = flags;
        local->mode = mode;
        local->umask = umask;
        local->cached_subvol = avail_subvol;    // cached_subvol holds avail_subvol
        local->hashed_subvol = hashed_subvol;

        gf_msg_debug(this->name, 0, "creating %s on %s (link at %s)", loc->path,
                     avail_subvol->name, hashed_subvol->name);

        // Storing the file on another brick needs a linkfile to record the redirect.
        // The linkfile has the same gfid as the file; its content is only an EA pointing at the other subvol
        dht_linkfile_create(frame, dht_create_linkfile_create_cbk, this,
                            avail_subvol, hashed_subvol, loc);
        // from hashed_subvol to avail_subvol
        goto out;
    }

    // Control usually reaches this check, but a filter subvol is rarely set
    if (local->filter_subvol &&
        !dht_is_subvol_filled(this, local->filter_subvol, _gf_false)) {
        if (local->filter_subvol != hashed_subvol) {
            avail_subvol = local->filter_subvol;
            goto create_link_file;
        }

        STACK_WIND(frame, dht_create_cbk, hashed_subvol,
                   hashed_subvol->fops->create, loc, flags, mode, umask, fd,
                   params);
        goto out;
    }
    // XXX: this is the most common path
    // Check whether the hashed subvol is full/busy
    else if (!dht_is_subvol_filled(this, hashed_subvol, _gf_false)) {
        gf_msg_debug(this->name, 0, "creating %s on %s", loc->path,
                     hashed_subvol->name);

        // Not full/busy, so wind to it
        STACK_WIND(frame, dht_create_cbk, hashed_subvol,
                   hashed_subvol->fops->create, loc, flags, mode, umask, fd,
                   params);
        goto out;

    } else {
        // The hashed subvolume is full/busy
        avail_subvol = dht_free_disk_available_subvol(this, hashed_subvol,
                                                      local);

        // A linkfile is needed
    create_link_file:
        if (avail_subvol != hashed_subvol) {
            local->params = dict_ref(params);
            local->flags = flags;
            local->mode = mode;
            local->umask = umask;
            local->cached_subvol = avail_subvol;
            local->hashed_subvol = hashed_subvol;

            gf_msg_debug(this->name, 0, "creating %s on %s (link at %s)",
                         loc->path, avail_subvol->name, hashed_subvol->name);

            dht_linkfile_create(frame, dht_create_linkfile_create_cbk, this,
                                avail_subvol, hashed_subvol, loc);

            goto out;
        }

        gf_msg_debug(this->name, 0, "creating %s on %s", loc->path,
                     hashed_subvol->name);

        STACK_WIND(frame, dht_create_cbk, hashed_subvol,
                   hashed_subvol->fops->create, loc, flags, mode, umask, fd,
                   params);
    }
out:
    return 0;
}
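The core decision in the last two branches above can be modelled in a few lines of C. This is a simplified sketch that ignores the filter subvolume and the load-balancing path; create_plan_t and plan_create are hypothetical names, and subvolumes are represented by plain indices.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of the placement decision. */
typedef struct {
    bool needs_linkfile; /* true: linkfile on hashed, data elsewhere */
    int data_subvol;     /* index of the subvolume that stores the data */
} create_plan_t;

/* hashed:      index the file name hashes to
 * hashed_full: result of the full/busy check on that subvolume
 * fallback:    index chosen by the fallback selector; it equals hashed
 *              when no better subvolume exists */
static create_plan_t
plan_create(int hashed, bool hashed_full, int fallback)
{
    create_plan_t plan = {false, hashed};

    assert(hashed >= 0 && fallback >= 0);

    if (hashed_full && fallback != hashed) {
        plan.needs_linkfile = true;
        plan.data_subvol = fallback;
    }
    return plan;
}
```

The design point this makes visible: the hashed subvolume always receives something, either the file itself or a linkfile, so later lookups by hash still find the file.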

dht_local_init

dht_local_t *
dht_local_init(call_frame_t *frame, loc_t *loc, fd_t *fd, glusterfs_fop_t fop)
{
    // local holds the per-call state for this fop
    dht_local_t *local = NULL;
    inode_t *inode = NULL;
    int ret = 0;

    // Allocate from the memory pool
    local = mem_get0(THIS->local_pool);
    if (!local)
        goto out;

    if (loc) {
        // Copy the caller's loc into local
        ret = loc_copy(&local->loc, loc);
        if (ret)
            goto out;

        // Remember the inode being operated on
        inode = loc->inode;
    }

    if (fd) {
        // fd_ref takes the lock, increments the fd's reference count, then unlocks
        local->fd = fd_ref(fd);
        if (!inode)
            inode = fd->inode;
    }

    local->op_ret = -1;
    local->op_errno = EUCLEAN;

    local->main_ret = -1;
    local->main_errno = EUCLEAN;

    // Record which fop is being executed
    local->fop = fop;

    if (inode) {
        // Look up the layout from this xlator and the inode
        local->layout = dht_layout_get(frame->this, inode);
        local->cached_subvol = dht_subvol_get_cached(frame->this, inode);
    }

    local->dst_filter = _gf_false;
    local->over_write = _gf_false;

    frame->local = local;

out:
    if (ret) {
        if (local)
            mem_put(local);
        local = NULL;
    }
    return local;
}

dht_is_subvol_filled

// XXX: an important function that runs in the common case
// Judges the brick's state from three indicators (space, inodes, queue)
gf_boolean_t
dht_is_subvol_filled(xlator_t *this, xlator_t *subvol,
                     gf_boolean_t ignore_watermark)
{
    int i = 0;
    dht_conf_t *conf = NULL;
    gf_boolean_t subvol_filled_inodes = _gf_false;
    gf_boolean_t subvol_filled_space = _gf_false;
    gf_boolean_t subvol_filled_queue = _gf_false;
    // Value returned at the end
    gf_boolean_t is_subvol_filled = _gf_false;

    conf = this->private;

    /* Check for values above specified percent or free disk */
    LOCK(&conf->subvolume_lock);
    {
        for (i = 0; i < conf->subvolume_cnt; i++) {
            // Walk the subvolumes to find the one equal to the hashed_subvol passed in
            if (subvol == conf->subvolumes[i]) {
                // Found it; evaluate the checks below

                if (!conf->du_stats[i].avail_percent ||
                    !conf->du_stats[i].avail_inodes) {
                    // The stats for the hashed subvol are missing (zero); stop here
                    break;
                }

                // Compare the brick's free space; two modes: percentage or absolute size
                if (conf->disk_unit == 'p') {
                    // Compare by percentage
                    if (conf->du_stats[i].avail_percent < conf->min_free_disk) {
                        // The free percentage is below the configured minimum
                        subvol_filled_space = _gf_true;
                        break;
                    }

                } else {
                    // Compare by remaining space
                    if (conf->du_stats[i].avail_space < conf->min_free_disk) {
                        // The free space is below the configured minimum
                        subvol_filled_space = _gf_true;
                        break;
                    }
                }

                if (conf->du_stats[i].avail_inodes < conf->min_free_inodes) {
                    // The subvolume's free inodes are below the configured minimum
                    subvol_filled_inodes = _gf_true;    // mark it
                    break;
                }

                if (conf->watermark_enable && !ignore_watermark &&
                    conf->hw_cnt != conf->subvolume_cnt &&
                    conf->du_stats[i].watermark == 1) {
                    // The hashed subvolume is busy
                    subvol_filled_queue = _gf_true;         // mark the subvolume as busy
                    conf->du_stats[i].filled_queue_cnt++;   // busy count, used later
                    break;
                }
                break;
            }
        }
    }
    UNLOCK(&conf->subvolume_lock);

    // If it is full, write a log entry
    if (subvol_filled_space && conf->subvolume_status[i]) {
        if (!(conf->du_stats[i].log++ % (GF_UNIVERSAL_ANSWER * 10))) {
            gf_msg(this->name, GF_LOG_WARNING, 0, DHT_MSG_SUBVOL_INSUFF_SPACE,
                   "disk space on subvolume '%s' is getting "
                   "full (%.2f %%), consider adding more bricks",
                   subvol->name, (100 - conf->du_stats[i].avail_percent));
        }
    }

    if (subvol_filled_inodes && conf->subvolume_status[i]) {
        if (!(conf->du_stats[i].log++ % (GF_UNIVERSAL_ANSWER * 10))) {
            gf_msg(this->name, GF_LOG_CRITICAL, 0, DHT_MSG_SUBVOL_INSUFF_INODES,
                   "inodes on subvolume '%s' are at "
                   "(%.2f %%), consider adding more bricks",
                   subvol->name, (100 - conf->du_stats[i].avail_inodes));
        }
    }

    if (subvol_filled_queue && conf->subvolume_status[i]) {
        if (!(conf->du_stats[i].filled_queue_cnt %
              (GF_UNIVERSAL_ANSWER * 10))) {
            gf_msg(this->name, GF_LOG_WARNING, 0, DHT_MSG_SUBVOL_INSUFF_SPACE,
                   "queue on subvolume '%s' is getting "
                   "high watermark "
                   "(high watermark grand cnt %" PRIu64 ")",
                   subvol->name, conf->du_stats[i].hw_grand_cnt);
        }
    }

    // If any of the three indicators is set, return true
    is_subvol_filled = (subvol_filled_space || subvol_filled_inodes ||
                        subvol_filled_queue);

    // Starts as false and is updated from the brick's state
    return is_subvol_filled;
}
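Stripped of locking, logging and the watermark (queue) check, the space and inode tests reduce to the sketch below. The struct and function names are hypothetical, and the field types are assumptions inferred from how the listing compares and prints them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subvolume usage statistics. */
typedef struct {
    double avail_percent; /* free space, percent */
    uint64_t avail_space; /* free space, bytes */
    double avail_inodes;  /* free inodes, percent */
} du_stat_t;

/* Hypothetical configured limits. */
typedef struct {
    char disk_unit;       /* 'p': min_free_disk is a percentage */
    double min_free_disk;
    double min_free_inodes;
} limits_t;

/* Return true when the subvolume is below either limit. */
static bool
subvol_is_filled(const du_stat_t *du, const limits_t *lim)
{
    assert(du != NULL && lim != NULL);

    /* Mirrors the early exit in the listing: zeroed stats
     * are treated as "not filled". */
    if (du->avail_percent == 0 || du->avail_inodes == 0)
        return false;

    if (lim->disk_unit == 'p') {
        if (du->avail_percent < lim->min_free_disk)
            return true;
    } else if ((double)du->avail_space < lim->min_free_disk) {
        return true;
    }

    return du->avail_inodes < lim->min_free_inodes;
}
```

The same min_free_disk field is interpreted as a percentage or as a byte count depending on disk_unit, which is why the two comparisons read different statistics.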

dht_free_disk_available_subvol

/*Get the best subvolume to create the file in*/
/* Called when the hashed subvol is full/busy: pick the most suitable of the remaining subvolumes */
xlator_t *
dht_free_disk_available_subvol(xlator_t *this, xlator_t *subvol,
                               dht_local_t *local)
{
    int i = 0;
    xlator_t *avail_subvol = NULL;
    dht_conf_t *conf = NULL;
    dht_layout_t *layout = NULL;
    loc_t *loc = NULL;
    int avail_subvol_cnt = 0;

    conf = this->private;
    if (!local)
        goto out;
    loc = &local->loc;

    // Use the layout cached in local, or fetch the parent directory's layout
    if (!local->layout) {
        layout = dht_layout_get(this, loc->parent);

        if (!layout) {
            gf_msg_debug(this->name, 0,
                         "Missing layout. path=%s,"
                         " parent gfid = %s",
                         loc->path, uuid_utoa(loc->parent->gfid));
            goto out;
        }
    } else {
        layout = dht_layout_ref(this, local->layout);
    }

    LOCK(&conf->subvolume_lock);
    {
        // watermark_enable is also off by default
        if (conf->watermark_enable) {
            for (i = 0; i < conf->subvolume_cnt; i++) {
                // Walk the subvolumes and count those that are neither full nor busy
                if ((conf->du_stats[i].avail_percent > conf->min_free_disk) &&
                    (conf->du_stats[i].avail_inodes > conf->min_free_inodes) &&
                    (conf->du_stats[i].watermark < 1)) {
                    avail_subvol_cnt++;
                }
            }
        }

        if (avail_subvol_cnt) {
            // Some subvolumes are usable: let the round-robin helper pick one
            avail_subvol = dht_subvol_with_round_robin(this, subvol, layout);
        } else if (!conf->watermark_enable) {
            // Watermark checking is disabled: pick a subvolume that still has free space and inodes
            avail_subvol = dht_subvol_with_free_space_inodes(this, subvol,
                                                             layout);
        }

        if (!avail_subvol) {
            // None of the branches above produced a usable subvol:
            // return the one with the most free space among those that still have inodes
            avail_subvol = dht_subvol_maxspace_nonzeroinode(this, subvol,
                                                            layout);
        }
    }
    UNLOCK(&conf->subvolume_lock);
out:
    if (!avail_subvol) {
        gf_msg_debug(this->name, 0,
                     "No subvolume has enough free space \
                              and/or inodes to create");
        avail_subvol = subvol;
    }

    if (layout)
        dht_layout_unref(this, layout);
    return avail_subvol;
}
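The last-resort selection and the final fallback to the hashed subvolume can be modelled as below. dht_subvol_maxspace_nonzeroinode is not shown in this article, so this is only a sketch of the behaviour its name and the comment suggest; pick_max_space is a hypothetical name and subvolumes are represented by indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index with the most free space among subvolumes that
 * still have free inodes; if there is none, fall back to hashed. */
static int
pick_max_space(const uint64_t *avail_space, const double *avail_inodes,
               int cnt, int hashed)
{
    int i, best = -1;

    assert(avail_space != NULL && avail_inodes != NULL && cnt > 0);

    for (i = 0; i < cnt; i++) {
        if (avail_inodes[i] <= 0)
            continue; /* no inodes left: not a candidate */
        if (best < 0 || avail_space[i] > avail_space[best])
            best = i;
    }
    return (best < 0) ? hashed : best;
}
```

Falling back to the hashed index matches the end of the listing, where avail_subvol is set to subvol when nothing better was found.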

dht_subvol_get_hashed

xlator_t *
dht_subvol_get_hashed(xlator_t *this, loc_t *loc)
{
    dht_layout_t *layout = NULL;
    xlator_t *subvol = NULL;
    dht_conf_t *conf = NULL;
    dht_methods_t *methods = NULL;

    GF_VALIDATE_OR_GOTO("dht", this, out);
    GF_VALIDATE_OR_GOTO(this->name, loc, out);

    conf = this->private;
    GF_VALIDATE_OR_GOTO(this->name, conf, out);

    methods = &(conf->methods);

    // Hash the file name against the parent directory's layout to find the subvol

    // If loc itself is the root directory (root gfid), return the subvolume that came up earliest
    if (__is_root_gfid(loc->gfid)) {
        subvol = dht_first_up_subvol(this);
        goto out;
    }

    // Otherwise the subvolume has to be computed
    GF_VALIDATE_OR_GOTO(this->name, loc->parent, out);
    GF_VALIDATE_OR_GOTO(this->name, loc->name, out);

    // First get the parent directory's layout. dht_layout_get increments the layout's
    // reference count so it cannot be freed while in use; it is dropped at the out label below
    layout = dht_layout_get(this, loc->parent);

    if (!layout) {
        gf_msg_debug(this->name, 0, "Missing layout. path=%s, parent gfid =%s",
                     loc->path, uuid_utoa(loc->parent->gfid));
        goto out;
    }

    // Then compute the hash and compare it against the bricks' hash ranges
    // in the layout obtained in the previous step
    subvol = methods->layout_search(this, layout, loc->name);

    if (!subvol) {
        gf_msg_debug(this->name, 0, "No hashed subvolume for path=%s",
                     loc->path);
        goto out;
    }

out:
    if (layout) {
        // Done with the layout; drop the reference
        dht_layout_unref(this, layout);
    }

    return subvol;
}

dht_layout_get

dht_layout_t *
dht_layout_get(xlator_t *this, inode_t *inode)
{
    dht_conf_t *conf = NULL;
    dht_layout_t *layout = NULL;
    int ret = 0;

    conf = this->private;
    if (!conf)
        goto out;

    LOCK(&conf->layout_lock);
    {
        // The layout is obtained from the inode's ctx
        ret = dht_inode_ctx_layout_get(inode, this, &layout);
        if ((!ret) && layout) {
            // Increment the layout's reference count
            layout->ref++;
        }
    }
    UNLOCK(&conf->layout_lock);

out:
    return layout;
}

dht_first_up_subvol

// Used when the target is the root directory: returns the subvolume that came up earliest
xlator_t *
dht_first_up_subvol(xlator_t *this)
{
    dht_conf_t *conf = NULL;
    xlator_t *child = NULL;
    int i = 0;
    time_t time = 0;

    // Get the configuration
    conf = this->private;
    if (!conf)
        goto out;

    LOCK(&conf->subvolume_lock);
    {
        // Walk the subvolumes and keep the one with the earliest non-zero up time
        for (i = 0; i < conf->subvolume_cnt; i++) {
            if (conf->subvol_up_time[i]) {
                if (!time) {
                    time = conf->subvol_up_time[i];
                    child = conf->subvolumes[i];
                } else if (time > conf->subvol_up_time[i]) {
                    time = conf->subvol_up_time[i];
                    child = conf->subvolumes[i];
                }
            }
        }
    }
    UNLOCK(&conf->subvolume_lock);

out:
    return child;
}

dht_layout_search

xlator_t *
dht_layout_search(xlator_t *this, dht_layout_t *layout, const char *name)
{
    uint32_t hash = 0;
    xlator_t *subvol = NULL;
    int i = 0;
    int ret = 0;

    // Compute the hash value
    ret = dht_hash_compute(this, layout->type, name, &hash);
    if (ret != 0) {
        gf_msg(this->name, GF_LOG_WARNING, 0, DHT_MSG_COMPUTE_HASH_FAILED,
               "hash computation failed for type=%d name=%s", layout->type,
               name);
        goto out;
    }

    // Find the hash range that contains the value, i.e. the brick that stores the file, and return that subvol
    for (i = 0; i < layout->cnt; i++) {
        if (layout->list[i].start <= hash && layout->list[i].stop >= hash) {
            subvol = layout->list[i].xlator;
            break;
        }
    }

    if (!subvol) {
        gf_msg(this->name, GF_LOG_WARNING, 0, DHT_MSG_HASHED_SUBVOL_GET_FAILED,
               "no subvolume for hash (value) = %u", hash);
    }

out:
    return subvol;
}

dht_layout_refreshing

/*
 * Refresh the hash layout
 * return values
 * 0    lookup healed parent layout
 * -1   lookup failed
 */
int
dht_layout_refreshing(xlator_t *this, loc_t *loc, call_frame_t *refresh_frame)
{
    loc_t tmp_loc = {
        0,
    };
    int ret = -1;
    call_frame_t *frame = NULL;
    dht_local_t *local = NULL;

    GF_VALIDATE_OR_GOTO("dht", loc, clean);
    GF_VALIDATE_OR_GOTO("dht", this, clean);
    GF_VALIDATE_OR_GOTO("dht", refresh_frame, clean);

    frame = create_frame(THIS, THIS->ctx->pool);
    if (!frame)
        goto clean;

    tmp_loc.inode = inode_ref(loc->parent);
    gf_uuid_copy(tmp_loc.gfid, loc->parent->gfid);

    local = dht_local_init(frame, &tmp_loc, NULL, GF_FOP_LOOKUP);
    if (!local)
        goto clean;

    local->xattr_req = dict_new();

    local->refresh_frame = refresh_frame;
    ret = dht_refresh_parent_layout(this, frame);
    if (ret) {
        gf_log(this->name, GF_LOG_ERROR,
               "refresh parent gfid: %slayout failed ",
               uuid_utoa(tmp_loc.inode->gfid));
        goto clean;
    }

    loc_wipe(&tmp_loc);

    ret = 0;
    return ret;
clean:
    ret = -1;
    loc_wipe(&tmp_loc);

    if (local)
        local->refresh_frame = NULL;

    if (frame)
        DHT_STACK_DESTROY(frame);

    if (refresh_frame)
        dht_refresh_frame_clean(refresh_frame, _gf_false);

    return ret;
}
