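The source code in this article comes from HBase 1.2.6. I consulted some references and added my own understanding; if anything here is wrong, corrections are welcome. Thank you.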
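- The put process has two parts: the client sends the request, and the server handles it.
- Client side: sending the request. The entry point for put is HTable.java.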
/**
* HBase client put
* {@inheritDoc}
* @throws IOException
*/
@Override
public void put(final List<Put> puts) throws IOException {
getBufferedMutator().mutate(puts);
if (autoFlush) {
flushCommits();
}
}
@VisibleForTesting
BufferedMutator getBufferedMutator() throws IOException {
if (mutator == null) {
this.mutator = (BufferedMutatorImpl) connection.getBufferedMutator(
new BufferedMutatorParams(tableName)
.pool(pool)
.writeBufferSize(connConfiguration.getWriteBufferSize()) // client write buffer size (hbase.client.write.buffer)
.maxKeyValueSize(connConfiguration.getMaxKeyValueSize()) // maximum size of a single KeyValue (hbase.client.keyvalue.maxsize)
);
}
return mutator;
}
getBufferedMutator() sets the client write buffer size and the maximum size of a single KeyValue.
Next comes the mutate method:
@Override
public void mutate(List<? extends Mutation> ms) throws InterruptedIOException,
RetriesExhaustedWithDetailsException {
if (closed) {
throw new IllegalStateException("Cannot put when the BufferedMutator is closed.");
}
long toAddSize = 0;
for (Mutation m : ms) {
if (m instanceof Put) {
// Validate the Put, including that no cell exceeds hbase.client.keyvalue.maxsize
validatePut((Put) m);
}
toAddSize += m.heapSize();
}
// This behavior is highly non-intuitive... it does not protect us against
// 94-incompatible behavior, which is a timing issue because hasError, the below code
// and setter of hasError are not synchronized. Perhaps it should be removed.
//AsyncProcess ap
if (ap.hasError()) {
currentWriteBufferSize.addAndGet(toAddSize);
writeAsyncBuffer.addAll(ms);
backgroundFlushCommits(true); // an earlier submit reported an error, so flush synchronously
} else {
currentWriteBufferSize.addAndGet(toAddSize);
writeAsyncBuffer.addAll(ms);
}
// Now try and queue what needs to be queued.
// When currentWriteBufferSize (the bytes buffered so far) exceeds hbase.client.write.buffer (default 2 MB), flush automatically
while (currentWriteBufferSize.get() > writeBufferSize) {
backgroundFlushCommits(false);
}
}
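1. Check that each put stays within the maximum KeyValue size.
2. backgroundFlushCommits submits the buffered mutations: true means synchronous, false means asynchronous.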
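To make the buffering rule in mutate concrete, here is a minimal sketch in plain Java. It is not the HBase implementation: WriteBufferSketch, maxCellSize, and the byte arrays standing in for mutations are assumptions made for illustration, and flush() simply drains the queue where the real client would send the data to region servers.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for a client write buffer; not the HBase class. */
final class WriteBufferSketch {
  private final long writeBufferSize; // plays the role of hbase.client.write.buffer
  private final long maxCellSize;     // plays the role of hbase.client.keyvalue.maxsize
  private final Queue<byte[]> pending = new ArrayDeque<>();
  private long currentSize = 0;
  int flushes = 0;                    // number of flushes triggered so far

  WriteBufferSketch(long writeBufferSize, long maxCellSize) {
    this.writeBufferSize = writeBufferSize;
    this.maxCellSize = maxCellSize;
  }

  /** Validates every cell first, then buffers them and flushes while over the limit. */
  void mutate(List<byte[]> cells) {
    for (byte[] cell : cells) {
      if (maxCellSize > 0 && cell.length > maxCellSize) {
        throw new IllegalArgumentException("KeyValue size too large");
      }
    }
    for (byte[] cell : cells) {
      pending.add(cell);
      currentSize += cell.length;
    }
    while (currentSize > writeBufferSize) {
      flush();
    }
  }

  /** Drains everything that is buffered and counts the flush. */
  void flush() {
    byte[] cell;
    while ((cell = pending.poll()) != null) {
      currentSize -= cell.length;
    }
    flushes++;
  }

  long bufferedBytes() {
    return currentSize;
  }
}
```

With a 10-byte buffer, two 4-byte cells stay buffered; a third one pushes the total to 12 bytes and triggers a flush.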
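// Excerpt from backgroundFlushCommits. writeBufferSize defaults to 2 MB.
// Each iteration polls one mutation from writeAsyncBuffer into buffer until dequeuedSize
// reaches writeBufferSize * 2 (4 MB); the automatic flush threshold itself is 2 MB.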
while (
(writeBufferSize <= 0 || dequeuedSize < (writeBufferSize * 2) || synchronous)
&& (m = writeAsyncBuffer.poll()) != null) {
buffer.add(m);
long size = m.heapSize();
dequeuedSize += size;
currentWriteBufferSize.addAndGet(-size);
}
// asynchronous submit
if (!synchronous) {
ap.submit(tableName, buffer, true, null, false);
if (ap.hasError()) {
LOG.debug(tableName + ": One or more of the operations have failed -"
+ " waiting for all operation in progress to finish (successfully or not)");
}
}
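// Synchronous flush, or an error occurred: submit what is left, wait for all in-flight operations, then throw or report the error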
if (synchronous || ap.hasError()) {
while (!buffer.isEmpty()) {
ap.submit(tableName, buffer, true, null, false);
}
RetriesExhaustedWithDetailsException error =
ap.waitForAllPreviousOpsAndReset(null, tableName.getNameAsString());
if (error != null) {
if (listener == null) {
throw error;
} else {
this.listener.onException(error, this);
}
}
}
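1. The synchronous flag selects asynchronous or synchronous submission. Inside mutate(), the synchronous path is taken only after an asynchronous submit has reported an error; an explicit flush also takes the synchronous path.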
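The drain loop at the top of the excerpt is easy to misread, so here is a small stand-alone sketch of just that condition. DrainSketch and the use of plain integers as mutation sizes are assumptions for illustration; only the loop condition follows the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical helper that isolates the dequeue condition of backgroundFlushCommits. */
final class DrainSketch {
  /**
   * Moves mutation sizes from the queue into a batch. An asynchronous flush stops
   * once the batch has reached twice the buffer size; a synchronous flush drains
   * the whole queue.
   */
  static List<Integer> drain(Queue<Integer> queue, long writeBufferSize, boolean synchronous) {
    List<Integer> batch = new ArrayList<>();
    long dequeuedSize = 0;
    Integer size;
    while ((writeBufferSize <= 0 || dequeuedSize < writeBufferSize * 2 || synchronous)
        && (size = queue.poll()) != null) {
      batch.add(size);
      dequeuedSize += size;
    }
    return batch;
  }
}
```

For example, with five 3-byte mutations and a 4-byte buffer, an asynchronous drain takes three of them (9 bytes, just past the 8-byte cap) and leaves two in the queue, while a synchronous drain takes all five.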
// Excerpt from AsyncProcess.submit: check whether this region and server can take the operation
if (canTakeOperation(loc, regionIncluded, serverIncluded)) {
Action<Row> action = new Action<Row>(r, ++posInList);
setNonce(ng, r, action);
retainedActions.add(action);
// TODO: replica-get is not supported on this path
byte[] regionName = loc.getRegionInfo().getRegionName();
// group the action by server and region
addAction(loc.getServerName(), regionName, action, actionsByServer, nonceGroup);
it.remove();
}
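The rows being put are iterated and grouped by server and region into a Map; submitMultiActions then submits the requests to the server.
sendMultiAction -> getNewMultiActionRunnable() builds the runnables, which are submitted to the thread pool.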
AsyncProcess.SingleServerRequestRunnable.run -> callWithoutRetries -> MultiServerCallable.call() sends the request to the server.
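The grouping step can be pictured with a small sketch. Everything here is hypothetical: GroupByRegionSketch, Location, and the string row keys are stand-ins, and the region lookup is a plain TreeMap keyed by region start key rather than the client's real region locator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of grouping rows by server and then by region. */
final class GroupByRegionSketch {
  /** Hypothetical region location: the hosting server plus the region name. */
  static final class Location {
    final String server;
    final String region;

    Location(String server, String region) {
      this.server = server;
      this.region = region;
    }
  }

  /**
   * regionsByStartKey maps each region's start key to its location. The first
   * region is assumed to start at the empty string, so floorEntry always finds
   * the region whose key range contains the row.
   */
  static Map<String, Map<String, List<String>>> group(
      TreeMap<String, Location> regionsByStartKey, List<String> rows) {
    Map<String, Map<String, List<String>>> actionsByServer = new LinkedHashMap<>();
    for (String row : rows) {
      Location loc = regionsByStartKey.floorEntry(row).getValue();
      actionsByServer
          .computeIfAbsent(loc.server, s -> new LinkedHashMap<>())
          .computeIfAbsent(loc.region, r -> new ArrayList<>())
          .add(row);
    }
    return actionsByServer;
  }
}
```

Each outer entry then corresponds to one request per server, which is why a batch that touches many regions on the same server still costs a single round trip to that server.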
- Server side: handling the request
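An RPC for a single mutation arrives at RSRpcServices.mutate, which dispatches to the matching handler by operation type (batches sent as a multi request are handled by RSRpcServices.multi and reach the same region write path).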
checkAndMutate
case PUT:
Put put = ProtobufUtil.toPut(mutation, cellScanner);
quota.addMutation(put);
if (request.hasCondition()) {
Condition condition = request.getCondition();
byte[] row = condition.getRow().toByteArray();
byte[] family = condition.getFamily().toByteArray();
byte[] qualifier = condition.getQualifier().toByteArray();
CompareOp compareOp = CompareOp.valueOf(condition.getCompareType().name());
ByteArrayComparable comparator =
ProtobufUtil.toComparator(condition.getComparator());
if (region.getCoprocessorHost() != null) {
processed = region.getCoprocessorHost().preCheckAndPut(
row, family, qualifier, compareOp, comparator, put);
}
if (processed == null) {
boolean result = region.checkAndMutate(row, family,
qualifier, compareOp, comparator, put, true);
if (region.getCoprocessorHost() != null) {
result = region.getCoprocessorHost().postCheckAndPut(row, family,
qualifier, compareOp, comparator, put, result);
}
processed = result;
}
} else {
region.put(put);
processed = Boolean.TRUE;
}
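Depending on whether the request carries a condition (row, family, qualifier, and so on), the put goes to region.checkAndMutate or to region.put. The conditional path has to acquire the row lock before it compares and writes.
region.put then goes through doBatchMutate -> batchMutate -> doMiniBatchMutation, which performs the batch operation.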
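The reason the conditional path needs the row lock is that the comparison and the write must happen as one step. A minimal sketch, assuming one value per row (no families or qualifiers) and a hypothetical CheckAndPutSketch class that has nothing to do with the HRegion code:

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical compare-and-put store; one value per row, for illustration only. */
final class CheckAndPutSketch {
  private final ConcurrentHashMap<String, byte[]> cells = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();

  private ReentrantLock lockFor(String row) {
    return rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
  }

  /** Unconditional write, the analogue of the plain put branch. */
  void put(String row, byte[] value) {
    ReentrantLock lock = lockFor(row);
    lock.lock();
    try {
      cells.put(row, value);
    } finally {
      lock.unlock();
    }
  }

  /**
   * Writes value only if the current value equals expected (null means
   * "the row must not exist"). The check and the write share one row lock.
   */
  boolean checkAndPut(String row, byte[] expected, byte[] value) {
    ReentrantLock lock = lockFor(row);
    lock.lock();
    try {
      byte[] current = cells.get(row);
      boolean matches = expected == null ? current == null : Arrays.equals(expected, current);
      if (matches) {
        cells.put(row, value);
      }
      return matches;
    } finally {
      lock.unlock();
    }
  }

  byte[] get(String row) {
    return cells.get(row);
  }
}
```

Without the lock, two writers could both pass the comparison and both write, which is exactly what a conditional put is supposed to rule out.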
OperationStatus[] batchMutate(BatchOperationInProgress<?> batchOp) throws IOException {
boolean initialized = false;
Operation op = batchOp.isInReplay() ? Operation.REPLAY_BATCH_MUTATE : Operation.BATCH_MUTATE;
startRegionOperation(op);
try {
while (!batchOp.isDone()) {
if (!batchOp.isInReplay()) {
checkReadOnly();
}
// Check whether the region's memstore has exceeded the blocking limit
checkResources();
if (!initialized) {
this.writeRequestsCount.add(batchOp.operations.length);
if (!batchOp.isInReplay()) {
doPreMutationHook(batchOp);
}
initialized = true;
}
doMiniBatchMutation(batchOp);
long newSize = this.getMemstoreSize();
if (isFlushSize(newSize)) {
requestFlush(); // the memstore size after this write exceeds the flush size, so request a flush
}
}
} finally {
closeRegionOperation(op);
}
return batchOp.retCodeDetails;
}
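checkResources(): checks whether the region's memstore size exceeds the blocking limit (the flush size multiplied by hbase.hregion.memstore.block.multiplier). If it does, a flush is requested and RegionTooBusyException is thrown.
doMiniBatchMutation() is the main method of the put operation. It consists of the following steps.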
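batchMutate therefore works with two thresholds: a blocking limit checked before the write and a flush size checked after it. The sketch below shows how they interact. MemstorePressureSketch and its fields are hypothetical, the numbers are arbitrary, and an IllegalStateException stands in for RegionTooBusyException.

```java
/** Hypothetical model of the two memstore thresholds used around a batch write. */
final class MemstorePressureSketch {
  private final long flushSize;    // analogue of the memstore flush size
  private final long blockingSize; // flush size multiplied by the block multiplier
  private long memstoreSize = 0;
  int flushRequests = 0;

  MemstorePressureSketch(long flushSize, int blockMultiplier) {
    this.flushSize = flushSize;
    this.blockingSize = flushSize * blockMultiplier;
  }

  /** Rejects the write above the blocking limit; otherwise applies it and may request a flush. */
  void write(long bytes) {
    if (memstoreSize > blockingSize) {
      // Role of checkResources(): request a flush and refuse the write.
      flushRequests++;
      throw new IllegalStateException("region too busy: memstore above blocking limit");
    }
    memstoreSize += bytes;
    if (memstoreSize > flushSize) {
      // Role of isFlushSize(newSize): the write went through, but a flush is due.
      flushRequests++;
    }
  }

  /** Simulates a completed flush that empties the memstore. */
  void flushed() {
    memstoreSize = 0;
  }

  long size() {
    return memstoreSize;
  }
}
```

Writes between the two thresholds still succeed and only ask for a flush; clients start seeing rejections only once flushing has fallen far enough behind to cross the blocking limit.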
@SuppressWarnings("unchecked")
private long doMiniBatchMutation(BatchOperationInProgress<?> batchOp) throws IOException {
boolean isInReplay = batchOp.isInReplay(); // false for normal client writes
// variable to note if all Put items are for the same CF -- metrics related
boolean putsCfSetConsistent = true;
//The set of columnFamilies first seen for Put.
Set<byte[]> putsCfSet = null;
// variable to note if all Delete items are for the same CF -- metrics related
boolean deletesCfSetConsistent = true;
//The set of columnFamilies first seen for Delete.
Set<byte[]> deletesCfSet = null;
long currentNonceGroup = HConstants.NO_NONCE, currentNonce = HConstants.NO_NONCE;
WALEdit walEdit = new WALEdit(isInReplay);
MultiVersionConcurrencyControl.WriteEntry writeEntry = null;
long txid = 0;
boolean doRollBackMemstore = false;
boolean locked = false;
/** Keep track of the locks we hold so we can release them in finally clause */ // empty list sized for the batch; the locks are released in finally
List<RowLock> acquiredRowLocks = Lists.newArrayListWithCapacity(batchOp.operations.length);
// reference family maps directly so coprocessors can mutate them if desired
Map<byte[], List<Cell>>[] familyMaps = new Map[batchOp.operations.length];
// We try to set up a batch in the range [firstIndex,lastIndexExclusive)
int firstIndex = batchOp.nextIndexToProcess;
int lastIndexExclusive = firstIndex;
boolean success = false;
int noOfPuts = 0, noOfDeletes = 0;
WALKey walKey = null;
long mvccNum = 0;
long addedSize = 0;
try {
// ------------------------------------
// STEP 1. Try to acquire as many locks as we can, and ensure
// we acquire at least one.
// Walk the mutations and take the row lock for each row, acquiring as many as possible
// ----------------------------------
int numReadyToWrite = 0;
long now = EnvironmentEdgeManager.currentTime();
while (lastIndexExclusive < batchOp.operations.length) {
Mutation mutation = batchOp.getMutation(lastIndexExclusive);
boolean isPutMutation = mutation instanceof Put;
Map<byte[], List<Cell>> familyMap = mutation.getFamilyCellMap();
// store the family map reference to allow for mutations
familyMaps[lastIndexExclusive] = familyMap;
// skip anything that "ran" already
if (batchOp.retCodeDetails[lastIndexExclusive].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) {
lastIndexExclusive++;
continue;
}
try {
if (isPutMutation) {
// Check the families in the put. If bad, skip this one.
if (isInReplay) {
removeNonExistentColumnFamilyForReplay(familyMap);
} else {
checkFamilies(familyMap.keySet()); // check that the column families exist
}
checkTimestamps(mutation.getFamilyCellMap(), now); // check the timestamps
} else {
prepareDelete((Delete) mutation);
}
checkRow(mutation.getRow(), "doMiniBatchMutation"); // check that the row key belongs to this region
} catch (NoSuchColumnFamilyException nscf) {
LOG.warn("No such column family in batch mutation", nscf);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.BAD_FAMILY, nscf.getMessage());
lastIndexExclusive++;
continue;
} catch (FailedSanityCheckException fsce) {
LOG.warn("Batch Mutation did not pass sanity check", fsce);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.SANITY_CHECK_FAILURE, fsce.getMessage());
lastIndexExclusive++;
continue;
} catch (WrongRegionException we) {
LOG.warn("Batch mutation had a row that does not belong to this region", we);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.SANITY_CHECK_FAILURE, we.getMessage());
lastIndexExclusive++;
continue;
}
// If we haven't got any rows in our batch, we should block to
// get the next one.
RowLock rowLock = null;
try {
rowLock = getRowLock(mutation.getRow(), true); // true acquires the read lock, false the write lock
} catch (IOException ioe) {
LOG.warn("Failed getting lock in batch put, row="
+ Bytes.toStringBinary(mutation.getRow()), ioe);
}
if (rowLock == null) {
// We failed to grab another lock
break; // stop acquiring more rows for this batch
} else {
acquiredRowLocks.add(rowLock);
}
lastIndexExclusive++;
numReadyToWrite++;
if (isPutMutation) {
// If Column Families stay consistent through out all of the
// individual puts then metrics can be reported as a mutliput across
// column families in the first put.
if (putsCfSet == null) {
putsCfSet = mutation.getFamilyCellMap().keySet();
} else {
putsCfSetConsistent = putsCfSetConsistent
&& mutation.getFamilyCellMap().keySet().equals(putsCfSet);
}
} else {
if (deletesCfSet == null) {
deletesCfSet = mutation.getFamilyCellMap().keySet();
} else {
deletesCfSetConsistent = deletesCfSetConsistent
&& mutation.getFamilyCellMap().keySet().equals(deletesCfSet);
}
}
}
// we should record the timestamp only after we have acquired the rowLock,
// otherwise, newer puts/deletes are not guaranteed to have a newer timestamp
now = EnvironmentEdgeManager.currentTime();
byte[] byteNow = Bytes.toBytes(now);
// Nothing to put/delete -- an exception in the above such as NoSuchColumnFamily?
if (numReadyToWrite <= 0) return 0L;
// We've now grabbed as many mutations off the list as we can
// ------------------------------------
// STEP 2. Update any LATEST_TIMESTAMP timestamps
// update the timestamps
// ----------------------------------
for (int i = firstIndex; !isInReplay && i < lastIndexExclusive; i++) {
// skip invalid
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) continue;
Mutation mutation = batchOp.getMutation(i);
if (mutation instanceof Put) {
updateCellTimestamps(familyMaps[i].values(), byteNow);
noOfPuts++;
} else {
prepareDeleteTimestamps(mutation, familyMaps[i], byteNow);
noOfDeletes++;
}
rewriteCellTags(familyMaps[i], mutation);
}
lock(this.updatesLock.readLock(), numReadyToWrite); // take the region's read lock (updatesLock); the wait time scales with numReadyToWrite
locked = true;
// calling the pre CP hook for batch mutation
if (!isInReplay && coprocessorHost != null) {
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex, lastIndexExclusive);
if (coprocessorHost.preBatchMutate(miniBatchOp)) return 0L;
}
// ------------------------------------
// STEP 3. Build WAL edit
// ----------------------------------
Durability durability = Durability.USE_DEFAULT; // WAL durability; the default resolves to SYNC_WAL
for (int i = firstIndex; i < lastIndexExclusive; i++) {
// Skip puts that were determined to be invalid during preprocessing
if (batchOp.retCodeDetails[i].getOperationStatusCode() != OperationStatusCode.NOT_RUN) {
continue;
}
Mutation m = batchOp.getMutation(i);
Durability tmpDur = getEffectiveDurability(m.getDurability());
if (tmpDur.ordinal() > durability.ordinal()) {
durability = tmpDur;
}
if (tmpDur == Durability.SKIP_WAL) {
recordMutationWithoutWal(m.getFamilyCellMap());
continue;
}
long nonceGroup = batchOp.getNonceGroup(i), nonce = batchOp.getNonce(i);
// In replay, the batch may contain multiple nonces. If so, write WALEdit for each.
// Given how nonces are originally written, these should be contiguous.
// They don't have to be, it will still work, just write more WALEdits than needed.
if (nonceGroup != currentNonceGroup || nonce != currentNonce) {
if (walEdit.size() > 0) {
assert isInReplay;
if (!isInReplay) {
throw new IOException("Multiple nonces per batch and not in replay");
}
// txid should always increase, so having the one from the last call is ok.
// we use HLogKey here instead of WALKey directly to support legacy coprocessors.
walKey = new ReplayHLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), now, m.getClusterIds(),
currentNonceGroup, currentNonce, mvcc);
txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(), walKey,
walEdit, true);
walEdit = new WALEdit(isInReplay);
walKey = null;
}
currentNonceGroup = nonceGroup;
currentNonce = nonce;
}
// Add WAL edits by CP
WALEdit fromCP = batchOp.walEditsFromCoprocessors[i];
if (fromCP != null) {
for (Cell cell : fromCP.getCells()) {
walEdit.add(cell);
}
}
addFamilyMapToWALEdit(familyMaps[i], walEdit); // add the cells to the WALEdit
}
// -------------------------
// STEP 4. Append the final edit to WAL. Do not sync wal.
// append the WALEdit to the WAL (HLog)
// -------------------------
Mutation mutation = batchOp.getMutation(firstIndex);
if (isInReplay) {
// use wal key from the original
walKey = new ReplayHLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
mutation.getClusterIds(), currentNonceGroup, currentNonce, mvcc);
long replaySeqId = batchOp.getReplaySequenceId();
walKey.setOrigLogSeqNum(replaySeqId);
}
if (walEdit.size() > 0) {
if (!isInReplay) {
// we use HLogKey here instead of WALKey directly to support legacy coprocessors.
walKey = new HLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
mutation.getClusterIds(), currentNonceGroup, currentNonce, mvcc);
}
// The batch has been built into one WALEdit and is appended to the HLog in a single call; txid identifies this append
txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(), walKey, walEdit, true);
}
// ------------------------------------
// Acquire the latest mvcc number
// ----------------------------------
if (walKey == null) {
// If this is a skip wal operation just get the read point from mvcc
walKey = this.appendEmptyEdit(this.wal);
}
if (!isInReplay) {
writeEntry = walKey.getWriteEntry();
mvccNum = writeEntry.getWriteNumber(); // MVCC keeps reads consistent: the write becomes visible only after the whole transaction completes
} else {
mvccNum = batchOp.getReplaySequenceId();
}
// ------------------------------------
// STEP 5. Write back to memstore
// Write to memstore. It is ok to write to memstore
// first without syncing the WAL because we do not roll
// forward the memstore MVCC. The MVCC will be moved up when
// the complete operation is done. These changes are not yet
// visible to scanners till we update the MVCC. The MVCC is
// moved only when the sync is complete.
// ----------------------------------
for (int i = firstIndex; i < lastIndexExclusive; i++) {
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) {
continue;
}
doRollBackMemstore = true; // If we have a failure, we need to clean what we wrote
addedSize += applyFamilyMapToMemstore(familyMaps[i], mvccNum, isInReplay);
}
// -------------------------------
// STEP 6. Release row locks, etc.
// release the region's read lock (updatesLock)
// -------------------------------
if (locked) {
this.updatesLock.readLock().unlock();
locked = false;
}
releaseRowLocks(acquiredRowLocks); // release all row locks
// -------------------------
// STEP 7. Sync wal. This syncs the HLog to HDFS
// -------------------------
if (txid != 0) {
syncOrDefer(txid, durability);
}
doRollBackMemstore = false;
// update memstore size
this.addAndGetGlobalMemstoreSize(addedSize);
// calling the post CP hook for batch mutation
if (!isInReplay && coprocessorHost != null) {
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex, lastIndexExclusive);
coprocessorHost.postBatchMutate(miniBatchOp);
}
// ------------------------------------------------------------------
// STEP 8. Advance mvcc. This will make this put visible to scanners and getters.
// complete the transaction
// ------------------------------------------------------------------
if (writeEntry != null) {
mvcc.completeAndWait(writeEntry);
writeEntry = null;
} else if (isInReplay) {
// ensure that the sequence id of the region is at least as big as orig log seq id
mvcc.advanceTo(mvccNum);
}
for (int i = firstIndex; i < lastIndexExclusive; i ++) {
if (batchOp.retCodeDetails[i] == OperationStatus.NOT_RUN) {
batchOp.retCodeDetails[i] = OperationStatus.SUCCESS;
}
}
// ------------------------------------
// STEP 9. Run coprocessor post hooks. This should be done after the wal is
// synced so that the coprocessor contract is adhered to.
// ------------------------------------
if (!isInReplay && coprocessorHost != null) {
for (int i = firstIndex; i < lastIndexExclusive; i++) {
// only for successful puts
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.SUCCESS) {
continue;
}
Mutation m = batchOp.getMutation(i);
if (m instanceof Put) {
coprocessorHost.postPut((Put) m, walEdit, m.getDurability());
} else {
coprocessorHost.postDelete((Delete) m, walEdit, m.getDurability());
}
}
}
success = true;
return addedSize;
} finally {
// if the wal sync was unsuccessful, remove keys from memstore
if (doRollBackMemstore) {
for (int j = 0; j < familyMaps.length; j++) {
for(List<Cell> cells:familyMaps[j].values()) {
rollbackMemstore(cells);
}
}
if (writeEntry != null) mvcc.complete(writeEntry);
} else {
if (writeEntry != null) {
mvcc.completeAndWait(writeEntry);
}
}
if (locked) {
this.updatesLock.readLock().unlock();
}
releaseRowLocks(acquiredRowLocks);
// See if the column families were consistent through the whole thing.
// if they were then keep them. If they were not then pass a null.
// null will be treated as unknown.
// Total time taken might be involving Puts and Deletes.
// Split the time for puts and deletes based on the total number of Puts and Deletes.
if (noOfPuts > 0) {
// There were some Puts in the batch.
if (this.metricsRegion != null) {
this.metricsRegion.updatePut();
}
}
if (noOfDeletes > 0) {
// There were some Deletes in the batch.
if (this.metricsRegion != null) {
this.metricsRegion.updateDelete();
}
}
if (!success) {
for (int i = firstIndex; i < lastIndexExclusive; i++) {
if (batchOp.retCodeDetails[i].getOperationStatusCode() == OperationStatusCode.NOT_RUN) {
batchOp.retCodeDetails[i] = OperationStatus.FAILURE;
}
}
}
if (coprocessorHost != null && !batchOp.isInReplay()) {
// call the coprocessor hook to do any finalization steps
// after the put is done
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex,
lastIndexExclusive);
coprocessorHost.postBatchMutateIndispensably(miniBatchOp, success);
}
batchOp.nextIndexToProcess = lastIndexExclusive;
}
}
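Main steps of a put:
1. Walk the rows and take the row lock for each one, acquiring as many row locks as possible in one pass.
2. Update the timestamps of the data and take the read lock of the region being written.
3. Build the WALEdit object; the default WAL durability is SYNC_WAL.
4. Create the walKey and append the WALEdit to the region's HLog (not yet synced, so not yet on disk). The append returns txid, the unique identifier of this operation, which is used later to sync the HLog to HDFS. Obtain the MVCC write number, which keeps the written data consistent: the data becomes visible to users only when the transaction completes.
5. Write the data to the memstore and accumulate the size added to the memstore.
6. Release the region's read lock and release all row locks.
7. Sync the WAL, that is, sync the HLog to HDFS. If the sync succeeds, the memstore size is updated; if it fails, the cells already written to the memstore are rolled back.
8. Complete the transaction. The operation has succeeded and is now visible to users.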
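Steps 4 to 8 hinge on one idea: data is written to the memstore early, but stays invisible until the MVCC read point moves past its write number. The sketch below illustrates that idea only. MvccWriteSketch and its methods are hypothetical, it assumes a single writer that completes writes in order, and it ignores locking; the real MultiVersionConcurrencyControl also handles concurrent writers that finish out of order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-writer model of WAL append, memstore write, and MVCC visibility. */
final class MvccWriteSketch {
  /** A cell tagged with the write number it was written under. */
  static final class Cell {
    final String row;
    final String value;
    final long writeNumber;

    Cell(String row, String value, long writeNumber) {
      this.row = row;
      this.value = value;
      this.writeNumber = writeNumber;
    }
  }

  private final List<String> wal = new ArrayList<>();    // appended edits, in order
  private final List<Cell> memstore = new ArrayList<>();
  private long nextWriteNumber = 1;
  private long readPoint = 0; // readers see cells with writeNumber <= readPoint

  /** Steps 4 and 5: append to the WAL, then write to the memstore under a new write number. */
  long begin(String row, String value) {
    long writeNumber = nextWriteNumber++;
    wal.add(row + "=" + value);
    memstore.add(new Cell(row, value, writeNumber));
    return writeNumber;
  }

  /** Step 8: after a successful sync, advance the read point so the write becomes visible. */
  void complete(long writeNumber) {
    readPoint = Math.max(readPoint, writeNumber);
  }

  /** Failure path of step 7: drop the cells, then complete the write number anyway. */
  void rollback(long writeNumber) {
    memstore.removeIf(c -> c.writeNumber == writeNumber);
    complete(writeNumber);
  }

  /** Returns the newest visible value for the row, or null if none is visible. */
  String get(String row) {
    String result = null;
    long best = -1;
    for (Cell c : memstore) {
      if (c.row.equals(row) && c.writeNumber <= readPoint && c.writeNumber > best) {
        best = c.writeNumber;
        result = c.value;
      }
    }
    return result;
  }

  int walSize() {
    return wal.size();
  }
}
```

A reader calling get between begin and complete still sees the old value, which is why the region can release its locks in step 6 before the WAL sync in step 7 has finished.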