The Hardest Big Data Source Code: HBase Source Code (Part 4), HBase DML (Data Insertion) Source Code Analysis

Author: 大数据的江湖

Last modified: 2022-04-04 19:29:11


Categories: Big Data, HBase. Tags: hbase, big data, distributed systems

First published: 2022-01-13 21:52:23

Article link: https://blog.csdn.net/lexoning/article/details/122482949


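HBase Rowkey Addressing Mechanism

1.1. MetaCache in Detail
MetaCache is a component inside the HBase client that caches the region locations of tables, which the client obtains from ZooKeeper or from a RegionServer. It greatly reduces the load on the HBase cluster.
Without a cache, locating and reading data takes three network round trips:
First network round trip: the client sends a request to ZooKeeper and obtains the location of the meta table's region.
Second network round trip: the client sends a request to the RegionServer hosting the meta table's region and scans that region to find the RegionServer hosting the user table's region.
Third network round trip: the client sends a request to the RegionServer hosting the user table's region and scans the actual data.
The biggest benefit of MetaCache is that the client can cache the results of the first and second round trips.
The location of the meta table's region, obtained from ZooKeeper in the first round trip, is cached in MetaCache. Likewise, the user-table region locations obtained by scanning the meta region are stored in MetaCache.
MetaCache is initialized in the constructor of ConnectionImplementation. Here is the definition of MetaCache: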

public class MetaCache {
    // Core field: stores the cached region locations of each table.
    // A table may have multiple regions.
    // ConcurrentMap<A, ConcurrentNavigableMap<B, C>>
    //   A = table name
    //   B = start rowkey
    //   C = RegionLocations, a collection of HRegionLocation objects built from RegionInfo
    private final ConcurrentMap<TableName, ConcurrentNavigableMap<byte[], RegionLocations>> cachedRegionLocations =
        new CopyOnWriteArrayMap<>();
    // The set of RegionServers for which region locations are cached
    private final Set<ServerName> cachedServers = new CopyOnWriteArraySet<>();
    // Read from the cache. Example:
    //   table1 rk01 RegionInfo1
    //   table1 rk03 RegionInfo3
    //   table1 rk02 RegionInfo2
    //   row = rk022
    // Entries are sorted by region start rowkey; the lookup returns the last region
    // whose start rowkey is not greater than the row (RegionInfo2 here)
    public RegionLocations getCachedLocation(final TableName tableName, final byte[] row) {}
    // Clear the cache; if the caller specifies that the cache must not be used, it is emptied
    public void clearCache(...) {}
    // Add to the cache
    public void cacheLocation(final TableName tableName, final RegionLocations locations) {}
}
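The lookup rule in the comments above (sort by start rowkey, then take the last region whose start key is not greater than the row) maps naturally onto floorEntry of a sorted map. The following is a minimal sketch of that rule only, not the MetaCache implementation: the class RegionCacheSketch, its methods, and the use of String labels in place of RegionLocations are assumptions for illustration, and Arrays.compareUnsigned stands in for HBase's byte comparator.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class RegionCacheSketch {
    // Unsigned lexicographic byte order (assumed stand-in for the HBase comparator)
    private static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    // start rowkey -> region label (hypothetical stand-in for RegionLocations)
    private final ConcurrentNavigableMap<byte[], String> byStartKey =
        new ConcurrentSkipListMap<>(BYTES);

    public void cache(String startKey, String region) {
        byStartKey.put(bytes(startKey), region);
    }

    // Returns the region with the greatest start key <= row, or null if none is cached
    public String lookup(String row) {
        Map.Entry<byte[], String> entry = byStartKey.floorEntry(bytes(row));
        return entry == null ? null : entry.getValue();
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        RegionCacheSketch cache = new RegionCacheSketch();
        cache.cache("rk01", "RegionInfo1");
        cache.cache("rk03", "RegionInfo3");
        cache.cache("rk02", "RegionInfo2");
        System.out.println(cache.lookup("rk022")); // prints RegionInfo2
    }
}
```

A real cache would also have to confirm that the row falls before the end key of the region it found; the sketch leaves that check out.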


Next, the definition of RegionLocations:
public class RegionLocations implements Iterable<HRegionLocation> {
    // As this shows, RegionLocations is essentially a wrapper around an array of
    // HRegionLocation objects that adds some higher-level functionality
    private final HRegionLocation[] locations;
}
Then the definition of HRegionLocation:
public class HRegionLocation implements Comparable<HRegionLocation> {
    // Metadata of the region
    private final RegionInfo regionInfo;
    // Name of the server that hosts the region
    private final ServerName serverName;
}

1.2. ConnectionRegistry in Detail

The concrete implementation of ConnectionRegistry discussed here is ZKConnectionRegistry. It lives inside the HBase client and is dedicated to interacting with ZooKeeper, so it effectively acts as a ZooKeeper client. Here is the definition of ZKConnectionRegistry:
class ZKConnectionRegistry implements ConnectionRegistry {
    private final ReadOnlyZKClient zk;
    private final ZNodePaths znodePaths;
    ZKConnectionRegistry(Configuration conf) {
        // Holds the paths of all znodes
        this.znodePaths = new ZNodePaths(conf);
        // ZooKeeper client
        this.zk = new ReadOnlyZKClient(conf);
    }
    // Read the data of a znode from ZooKeeper
    private <T> CompletableFuture<T> getAndConvert(String path, Converter<T> converter) {}
    // Read the data of the clusterID znode from ZooKeeper
    private static String getClusterId(....) {}
    // Read the location of the meta region from ZooKeeper
    // With a single copy of the meta region, the znode path is: /hbase/meta-region-server
    // With more than one copy (note the parameter name metaReplicaZNodes), there are several znodes,
    // for example: /hbase/meta-region-server-1, /hbase/meta-region-server-2
    private void getMetaRegionLocation(CompletableFuture<RegionLocations> future, List<String> metaReplicaZNodes) {}
    public CompletableFuture<RegionLocations> getMetaRegionLocations() {}
    // Get information about the active HMaster of the HBase cluster
    public CompletableFuture<ServerName> getActiveMaster() {}
}
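The znode naming described in the comments above can be illustrated with a small helper. This sketch only mirrors the paths quoted in this article: the class MetaZNodeSketch, the method znodeForReplica, and the convention that replica 0 uses the unsuffixed path are assumptions for illustration, not the ZNodePaths API.

```java
public class MetaZNodeSketch {
    // Name prefix taken from the paths quoted in the article
    static final String META_ZNODE_PREFIX = "meta-region-server";

    // Hypothetical helper: replica 0 gets the plain name, other replicas get a "-<id>" suffix
    public static String znodeForReplica(String baseZNode, int replicaId) {
        String name = replicaId == 0
            ? META_ZNODE_PREFIX
            : META_ZNODE_PREFIX + "-" + replicaId;
        return baseZNode + "/" + name;
    }

    public static void main(String[] args) {
        for (int replicaId = 0; replicaId < 3; replicaId++) {
            System.out.println(znodeForReplica("/hbase", replicaId));
        }
    }
}
```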

1.3. HBase Rowkey Addressing Mechanism

public class HTable implements Table {
    // Connection
    private final ClusterConnection connection;
    // Table name
    private final TableName tableName;
    // Region locator
    private final HRegionLocator locator;
    // Constructor
    protected HTable(.....) {
        this.connection = Preconditions.checkNotNull(connection, "connection is null");
        this.tableName = builder.tableName;
        // Initialize a region locator
        this.locator = new HRegionLocator(tableName, connection);
    }
}

Let's look at the definition of HRegionLocator:

public class HRegionLocator implements RegionLocator {
    private final TableName tableName;
    private final ConnectionImplementation connection;
    public HRegionLocator(TableName tableName, ConnectionImplementation connection) {
        this.connection = connection;
        this.tableName = tableName;
    }
    // Locate the region for a row; internally every variant delegates to
    // the locateRegion method of ConnectionImplementation
    public HRegionLocation getRegionLocation(byte[] row, int replicaId, boolean reload) throws IOException {}
    public List<HRegionLocation> getRegionLocations(byte[] row, boolean reload) throws IOException {}
    public List<HRegionLocation> getAllRegionLocations() throws IOException {}
}

Now the implementation of the locateRegion() method in ConnectionImplementation:

public RegionLocations locateRegion(final TableName tableName, final byte[] row, boolean useCache, boolean retry,
        int replicaId) throws IOException {
    if (tableName.equals(TableName.META_TABLE_NAME)) {
        // First network round trip:
        // locating a region of the meta table means sending a request to ZooKeeper
        return locateMeta(tableName, useCache, replicaId);
    } else {
        // Second network round trip:
        // locating a region of a user table means scanning the meta table for its location
        // Region not in the cache - have to go to the meta RS
        return locateRegionInMeta(tableName, row, useCache, retry, replicaId);
    }
}
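To make the effect of caching on the first two round trips concrete, here is a small simulation of the dispatch above. It is a sketch under stated assumptions, not HBase code: LocateSketch, askZooKeeper, scanMeta and the server names are hypothetical, and the cache is keyed by exact row instead of by region start key to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

public class LocateSketch {
    static final String META_TABLE = "hbase:meta";

    // Stand-in for MetaCache: lookup key -> server name (assumption: keyed by exact row)
    private final Map<String, String> cache = new HashMap<>();
    public int zkRoundTrips = 0;
    public int metaRoundTrips = 0;

    public String locate(String table, String row, boolean useCache) {
        boolean isMeta = META_TABLE.equals(table);
        String key = isMeta ? META_TABLE : table + "/" + row;
        if (useCache && cache.containsKey(key)) {
            return cache.get(key); // no network traffic at all
        }
        String server;
        if (isMeta) {
            server = askZooKeeper(); // first round trip
        } else {
            String metaServer = locate(META_TABLE, "", useCache);
            server = scanMeta(metaServer, table, row); // second round trip
        }
        cache.put(key, server);
        return server;
    }

    // Simulated remote calls; the returned host names are made up
    private String askZooKeeper() {
        zkRoundTrips++;
        return "rs-meta.example:16020";
    }

    private String scanMeta(String metaServer, String table, String row) {
        metaRoundTrips++;
        return "rs-user.example:16020";
    }

    public static void main(String[] args) {
        LocateSketch client = new LocateSketch();
        client.locate("t1", "rk022", true);
        client.locate("t1", "rk022", true); // served entirely from the cache
        System.out.println(client.zkRoundTrips + " " + client.metaRoundTrips); // prints 1 1
    }
}
```

After the first call only the third round trip (the actual data request) remains for later operations on the same row.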

(Figure: image not available in the source.)

DML Data Insertion: Client-Side Processing and Server-Side Processing

Core entry point: HTable.put(List<Put> puts)

HTable.put(List<Put> puts) {
    // The client validates first: by default each Cell must not exceed 10 MB
    for (Put put : puts) {
        validatePut(put);
    }
    // Batched asynchronous submission
    batch(puts, results, writeRpcTimeoutMs) {
        // Wrap this asynchronous batch request in an AsyncProcessTask
        AsyncProcessTask task = AsyncProcessTask.newBuilder().setPool(pool).setTableName(tableName).setRowAccess(actions)
            .setResults(results).setRpcTimeout(rpcTimeout).setOperationTimeout(operationTimeoutMs)
            .setSubmittedRows(AsyncProcessTask.SubmittedRows.ALL).build();
        // Submit
        AsyncRequestFuture ars = multiAp.submit(task) {
            // Continue submitting
            return submitAll(task) {
                // Container for Actions
                List<Action> actions = new ArrayList<>(rows.size());
                for (Row r : rows) {
                    // Abstract each row to be inserted as an Action
                    Action action = new Action(r, posInList, highestPriority);
                    actions.add(action);
                }
                // Build the asynchronous request object: AsyncRequestFutureImpl
                AsyncRequestFutureImpl<CResult> ars = createAsyncRequestFuture(task, actions, ng.getNonceGroup());
                // Group, pack and submit:
                // first take the rowkey of each Put and find the RegionServer hosting its region,
                // then group all Puts by RegionServer
                ars.groupAndSendMultiAction(actions, 1);
            }
        }
    }
}

To understand the packing logic, it is enough to understand the variable actionsByServer. It is a Map<ServerName, MultiAction>: once all Put operations have been abstracted into Actions, these Actions may have to go to different RegionServers, so for ease of management they are collected in this Map. Each RegionServer corresponds to one MultiAction, and one MultiAction contains 1 to n Actions.
Here is the packing logic of groupAndSendMultiAction:

void groupAndSendMultiAction(List<Action> currentActions, int numAttempt) {
    // Result container
    Map<ServerName, MultiAction> actionsByServer = new HashMap<>();
    // Iterate over the Actions and pack each one
    for (Action action : currentActions) {
        // Locate the region for this Action
        RegionLocations locs = findAllLocationsOrFail(action, true);
        // Find the specific HRegionLocation
        HRegionLocation loc = locs.getRegionLocation(action.getReplicaId());
        // Both checks must hold before loc is dereferenced (&&, not ||)
        if (loc != null && loc.getServerName() != null) {
            byte[] regionName = loc.getRegionInfo().getRegionName();
            // Group and pack
            AsyncProcess.addAction(loc.getServerName(), regionName, action, actionsByServer, nonceGroup) {
                // One MultiAction per RegionServer
                // (this trace omits creating and registering the MultiAction the first time a server is seen)
                MultiAction multiAction = actionsByServer.get(server);
                // Add the current Action to that RegionServer's MultiAction
                multiAction.add(regionName, action);
            }
        }
    }
    // Send by group
    if (!actionsByServer.isEmpty()) {
        // If this is a first attempt to group and send, no replicas, we need replica thread.
        sendMultiAction(actionsByServer, numAttempt, (doStartReplica && !hasUnknown) ? currentActions : null,
            numAttempt > 1 && !hasUnknown);
    }
}

To summarize the containers used by the packing logic, which the group-send logic (sendMultiAction) then consumes:

// key: the name of a RegionServer
// value: all Actions destined for that RegionServer
Map<ServerName, MultiAction> actionsByServer = new HashMap<>();
// Which regions do the Actions sent to one RegionServer go to?
// The actions Map inside MultiAction keeps them grouped by region
public final class MultiAction {
    // key: the name of a region
    // value: all Actions to be applied to that region
    protected Map<byte[], List<Action>> actions = new TreeMap<>(Bytes.BYTES_COMPARATOR);
}
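The two container levels above (server, then region, then the list of actions) can be reproduced with plain collections. This is an illustrative sketch rather than the AsyncProcess code: GroupingSketch and Located are hypothetical names, and strings stand in for ServerName, region names and Actions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {
    /** Hypothetical stand-in for an Action whose region has already been located. */
    public static final class Located {
        final String server;
        final String region;
        final String row;

        public Located(String server, String region, String row) {
            this.server = server;
            this.region = region;
            this.row = row;
        }
    }

    // Outer map plays the role of actionsByServer, inner map the role of MultiAction.actions
    public static Map<String, Map<String, List<String>>> group(List<Located> actions) {
        Map<String, Map<String, List<String>>> byServer = new HashMap<>();
        for (Located a : actions) {
            byServer.computeIfAbsent(a.server, s -> new TreeMap<>())  // create on first use
                    .computeIfAbsent(a.region, r -> new ArrayList<>())
                    .add(a.row);
        }
        return byServer;
    }

    public static void main(String[] args) {
        List<Located> puts = Arrays.asList(
            new Located("rs1", "region-a", "rk01"),
            new Located("rs2", "region-c", "rk50"),
            new Located("rs1", "region-a", "rk02"),
            new Located("rs1", "region-b", "rk20"));
        // One entry per server, each holding that server's rows grouped by region
        System.out.println(group(puts));
    }
}
```

The computeIfAbsent calls cover the step the trace leaves out: the per-server container has to be created the first time a server is seen.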

void sendMultiAction(Map<ServerName, MultiAction> actionsByServer, int numAttempt,
        List<Action> actionsForReplicaThread, boolean reuseThread) {
    // Iterate over actionsByServer; for each RegionServer create one or more
    // SingleServerRequestRunnable tasks to send the client data
    for (Map.Entry<ServerName, MultiAction> e : actionsByServer.entrySet()) {
        ServerName server = e.getKey();
        MultiAction multiAction = e.getValue();
        // Build one or more SingleServerRequestRunnable tasks to perform the send
        Collection<? extends Runnable> runnables = getNewMultiActionRunnable(server, multiAction, numAttempt);
        // Start the sending tasks
        for (Runnable runnable : runnables) {
            if ((--actionsRemaining == 0) && reuseThread && numAttempt % HConstants.DEFAULT_HBASE_CLIENT_RETRIES_NUMBER != 0) {
                runnable.run();
            } else {
                pool.submit(runnable);
            }
        }
    }
}

Here is the processing logic of SingleServerRequestRunnable:

SingleServerRequestRunnable.run() {
    // Obtain a MultiServerCallable, a subclass of RegionServerCallable
    callable = createCallable(server, tableName, multiAction);
    // Obtain the RpcRetryingCaller component that sends the RPC
    RpcRetryingCaller<AbstractResponse> caller = asyncProcess.createCaller(callable, rpcTimeout);
    // Send without retries
    res = caller.callWithoutRetries(callable, operationTimeout) {
        // Obtain the connection to the RegionServer
        callable.prepare(false);
        // Perform the send
        return callable.call(callTimeout) {
            return rpcCall() {
                MultiServerCallable.rpcCall() {
                    // Group the Actions destined for this RegionServer by region and build RegionActions
                    for (Map.Entry<byte[], List<Action>> e : this.multiAction.actions.entrySet()) {
                        // The region
                        final byte[] regionName = e.getKey();
                        // All pending Actions for that region
                        final List<Action> actions = e.getValue();
                        // Build the RegionAction
                        if (this.cellBlock) {
                            RequestConverter.buildNoDataRegionActions(...);
                        } else {
                            RequestConverter.buildRegionActions(...);
                        }
                    }
                    // Build the RPC request object
                    ClientProtos.MultiRequest requestProto = multiRequestBuilder.build();
                    // Actually send the RPC request
                    ClientProtos.MultiResponse responseProto = getStub().multi(getRpcController(), requestProto);
                    // Process the response returned by the server
                    return ResponseConverter.getResults(requestProto, indexMap, responseProto, getRpcControllerCellScanner());
                }
            }
        }
    }
}

At this point the client has sorted all the Put data to be inserted and sent it to the appropriate RegionServers.
Key points on the client side:
Put objects are turned into Actions and then grouped and packed, which amounts to building MultiAction objects, one MultiAction per RegionServer.
This process involves locating regions: ServerName, Rowkey.
Threads are built to send the requests, because the data has to go to regions on multiple RegionServers.
Before the actual send, each MultiAction is converted into RegionActions, and then a MultiRequest(List<RegionAction>) is sent.

(Figure: image not available in the source.)

2.2 DML Data Insertion: Server-Side Processing
A RegionServer consists of multiple HRegion objects, an HRegion consists of multiple HStores, and an HStore consists of one MemStore plus multiple StoreFiles.
When a client performs a put, the corresponding RegionServer on the HBase server side receives the RPC request and processes it. The RPC service class of RegionServer is RSRpcServices. Here is its multi method:


RSRpcServices.multi() {
    // Iterate over each RegionAction; each RegionAction contains multiple Actions against one region
    for (RegionAction regionAction : request.getRegionActionList()) {
        // Obtain the region
        RegionSpecifier regionSpecifier = regionAction.getRegion();
        HRegion region = getRegion(regionSpecifier);
        // Perform the mutate
        // Note: in the HBase source these checkAndMutate calls appear to sit in the branch for
        // RegionActions that carry a condition; the trace follows that branch down to batchMutate
        if (regionAction.getActionCount() == 1) {
            CheckAndMutateResult result = checkAndMutate(region, quota, regionAction.getAction(0).getMutation(), ...);
        } else {
            CheckAndMutateResult result = checkAndMutate(region, regionAction.getActionList(), cellScanner, ...);
        }
    }
}

Now the implementation of the checkAndMutate method:

checkAndMutate(region, regionAction.getActionList(), cellScanner, ...) {
    // Build a collection of Puts
    List<Mutation> mutations = new ArrayList<>();
    for (ClientProtos.Action action : actions) {
        MutationProto mutation = action.getMutation();
        Put put = ProtobufUtil.toPut(mutation, cellScanner);
        mutations.add(put);
    }
    // Execute
    CheckAndMutate checkAndMutate = ProtobufUtil.toCheckAndMutate(condition, mutations);
    region.checkAndMutate(checkAndMutate, nonceGroup, nonce) {
        doBatchMutate(mutation, true, nonceGroup, nonce) {
            this.batchMutate(new Mutation[]{mutation}, atomic, nonceGroup, nonce) {
                batchMutate(new MutationBatchOperation(this, mutations, atomic, nonceGroup, nonce)) {
                    // Execute the batch operation
                    doMiniBatchMutate(batchOp);
                    // Decide whether a flush is needed once the operation completes
                    requestFlushIfNeeded();
                }
            }
        }
    }
}

OK, we have finally reached the core processing logic:

// Example: batchOp = 100 rows
HRegion.doMiniBatchMutate(batchOp) {
    // Step 1: try to acquire as many row locks as possible
    miniBatchOp = batchOp.lockRowsAndBuildMiniBatch(acquiredRowLocks);
    // Step 2: update timestamps
    // Timestamps are updated while the row locks are held, so a newer request
    // gets a timestamp at least as large as an older one
    long now = EnvironmentEdgeManager.currentTime();
    batchOp.prepareMiniBatchOperations(miniBatchOp, now, acquiredRowLocks);
    // Step 3: group the Cells of this batch into WALEdits according to their nonce information
    // In other words: not one log record per Cell, but one log record per group of Cells
    // (NonceKey is the criterion that splits the Puts of this batch into multiple WALEdits)
    List<Pair<NonceKey, WALEdit>> walEdits = batchOp.buildWALEdits(miniBatchOp);
    // Step 4: record the operation log for this batch: wal.append(key, value)
    // key = WALKeyImpl, value = WALEdit
    for (Iterator<Pair<NonceKey, WALEdit>> it = walEdits.iterator(); it.hasNext(); ) {
        Pair<NonceKey, WALEdit> nonceKeyWALEditPair = it.next();
        walEdit = nonceKeyWALEditPair.getSecond();
        NonceKey nonceKey = nonceKeyWALEditPair.getFirst();
        if (walEdit != null && !walEdit.isEmpty()) {
            // Write the log
            writeEntry = doWALAppend(walEdit, batchOp.durability, batchOp.getClusterIds(), now, ....);
        }
    }
    // Step 5: write the data to the MemStore
    writeEntry = batchOp.writeMiniBatchOperationsToMemStore(miniBatchOp, writeEntry);
    // Step 6: complete the MVCC (multi-version concurrency control) transaction
    batchOp.completeMiniBatchOperations(miniBatchOp, writeEntry);
    // Step 7: mark success
    success = true;
    // Step 8: release the row locks
    releaseRowLocks(acquiredRowLocks);
}

The most important part of this core logic: first record the operation in the log (HLog, the write-ahead log), then apply the data to the MemStore.
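The ordering matters because the log is what allows the in-memory state to be rebuilt. The following toy model illustrates the idea; it is a sketch under stated assumptions, not the HRegion implementation: WriteAheadSketch, its fields and its replay method are hypothetical, a list of strings stands in for the WAL, and rowkeys are assumed not to contain the '=' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class WriteAheadSketch {
    private final List<String> wal = new ArrayList<>();                                  // stand-in for the log
    private final NavigableMap<String, String> memstore = new ConcurrentSkipListMap<>(); // stand-in for the MemStore
    private boolean walAvailable = true;

    public void setWalAvailable(boolean available) {
        this.walAvailable = available;
    }

    public void put(String row, String value) {
        // Log first: if the append fails, the in-memory state is never touched
        if (!walAvailable) {
            throw new IllegalStateException("WAL append failed; MemStore left untouched");
        }
        wal.add(row + "=" + value);
        // Only then apply the edit in memory
        memstore.put(row, value);
    }

    // Rebuild the in-memory state from the log alone, as a recovery step would
    public static NavigableMap<String, String> replay(List<String> log) {
        NavigableMap<String, String> rebuilt = new ConcurrentSkipListMap<>();
        for (String entry : log) {
            int eq = entry.indexOf('=');
            rebuilt.put(entry.substring(0, eq), entry.substring(eq + 1));
        }
        return rebuilt;
    }

    public List<String> walEntries() {
        return wal;
    }

    public NavigableMap<String, String> memstore() {
        return memstore;
    }

    public static void main(String[] args) {
        WriteAheadSketch region = new WriteAheadSketch();
        region.put("rk02", "b");
        region.put("rk01", "a");
        System.out.println(replay(region.walEntries()).equals(region.memstore())); // prints true
    }
}
```

Because every edit reaches the log before it reaches memory, replaying the log always reproduces at least what the MemStore held.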

HBase MemStore Overview

MemStore > MutableSegment > CellSet > ConcurrentSkipListMap (skip list)
The MemStore is initialized when its HStore is initialized:

protected HStore(final HRegion region, final ColumnFamilyDescriptor family, ....) throws IOException {
    // In passing, note this component: the DataBlock encoder
    this.dataBlockEncoder = new HFileDataBlockEncoderImpl(family.getDataBlockEncoding());
    // Obtain the MemStore
    this.memstore = getMemstore() {
        MemStore ms = null;
        // Defaults to NONE, meaning in-memory compaction is disabled
        MemoryCompactionPolicy inMemoryCompaction = MemoryCompactionPolicy.NONE;
        switch (inMemoryCompaction) {
            case NONE:
                ms = ReflectionUtils.newInstance(DefaultMemStore.class,...);
                break;
            default:
                Class<? extends CompactingMemStore> clz = conf
                    .getClass(MEMSTORE_CLASS_NAME, CompactingMemStore.class, CompactingMemStore.class);
                ms = ReflectionUtils.newInstance(clz, new Object[]{conf, getComparator(), this, this
                    .getHRegion().getRegionServicesForStores(), inMemoryCompaction});
        }
        return ms;
    }
}

Moving on to the DefaultMemStore constructor:

public DefaultMemStore(final Configuration conf, final CellComparator c) {
    super(conf, c, null) {
        // Create the active MutableSegment
        resetActive() {
            active = SegmentFactory.instance().createMutableSegment(conf, comparator, memstoreAccounting) {
                // Create a MemStoreLABImpl instance
                MemStoreLAB memStoreLAB = MemStoreLAB.newInstance(conf);
                return generateMutableSegment(conf, comparator, memStoreLAB, memstoreSizing) {
                    // Create a CellSet
                    CellSet set = new CellSet(comparator) {
                        // Create the skip list structure
                        this.delegatee = new ConcurrentSkipListMap<>(c.getSimpleComparator());
                    }
                    // Create a MutableSegment that wraps the CellSet
                    return new MutableSegment(set, comparator, memStoreLAB, memstoreSizing);
                }
            }
        }
        // Create the ImmutableSegment (the snapshot)
        this.snapshot = SegmentFactory.instance().createImmutableSegment(c) {
            MutableSegment segment = generateMutableSegment(null, comparator, null, null);
            return createImmutableSegment(segment, null);
        }
    }
}

The data write logic:

BatchOperation.writeMiniBatchOperationsToMemStore(final MiniBatchOperationInProgress<Mutation> miniBatchOp,
        final long writeNumber) throws IOException {
    // Perform the insert
    // Write to the HRegion
    applyFamilyMapToMemStore(familyCellMaps[index], memStoreAccounting) {
        // Iterate over the column families and insert into each
        for (Map.Entry<byte[], List<Cell>> e : familyMap.entrySet()) {
            byte[] family = e.getKey();
            List<Cell> cells = e.getValue();
            // Insert into the Store of that column family
            // Write to the HStore
            region.applyToMemStore(region.getStore(family), cells, false, memstoreAccounting) {
                boolean upsert = delta && store.getColumnFamilyDescriptor().getMaxVersions() == 1;
                if (upsert) {
                    store.upsert(cells, getSmallestReadPoint(), memstoreAccounting);
                } else {
                    store.add(cells, memstoreAccounting) {
                        // Write to the MemStore
                        memstore.add(cells, memstoreSizing) {
                            // Iterate over the Cells and insert each one
                            for (Cell cell : cells) {
                                add(cell, memstoreSizing) {
                                    doAddOrUpsert(cell, 0, memstoreSizing, true) {
                                        if (doAdd) {
                                            doAdd(currentActive, cell, memstoreSizing) {
                                                internalAdd(currentActive, toAdd, mslabUsed, memstoreSizing){}
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

(Figure: image not available in the source.)

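In summary, one Put may contain multiple Cells. According to the column family each Cell belongs to, these Cells are added to the ConcurrentSkipListMap (skip list) inside the CellSet of the active segment of the MemStore of the corresponding HStore. After a successful insert, the memory usage is recorded. Once a whole batch of Cells has been applied, a check decides whether a flush is needed.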
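The summary above can be condensed into a toy model: one sorted map per column family, a running size counter, and a flush check. This is a simplified sketch, not the MemStore implementation: MemStoreSketch and its methods are hypothetical, the key is just "row/qualifier" (real Cells also order by timestamp and type), and the size counter counts characters instead of heap bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemStoreSketch {
    // column family -> sorted cells ("row/qualifier" -> value); one skip list per family
    private final Map<String, ConcurrentSkipListMap<String, String>> storesByFamily = new HashMap<>();
    private long approxSize = 0; // rough accounting of what has been inserted

    public void add(String family, String row, String qualifier, String value) {
        storesByFamily.computeIfAbsent(family, f -> new ConcurrentSkipListMap<>())
                      .put(row + "/" + qualifier, value);
        // Record memory use after the insert (overwrites are counted again in this sketch)
        approxSize += row.length() + qualifier.length() + value.length();
    }

    // Keys come back in sorted order regardless of insertion order
    public List<String> sortedKeys(String family) {
        ConcurrentSkipListMap<String, String> store = storesByFamily.get(family);
        return store == null ? new ArrayList<>() : new ArrayList<>(store.keySet());
    }

    public boolean shouldFlush(long threshold) {
        return approxSize >= threshold;
    }

    public static void main(String[] args) {
        MemStoreSketch region = new MemStoreSketch();
        // One "Put" with cells for two column families, inserted out of order
        region.add("cf1", "rk02", "name", "b");
        region.add("cf1", "rk01", "name", "a");
        region.add("cf2", "rk01", "age", "30");
        System.out.println(region.sortedKeys("cf1")); // prints [rk01/name, rk02/name]
        System.out.println(region.shouldFlush(1024)); // prints false
    }
}
```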