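Preface
How a Scan is processed
1. Obtain a scanner id and sign a lease (Leases)
The code uses the scanner id to decide whether the first phase has already completed. If it has not, the server looks up the Region that holds the data, creates a scanner, and adds that scanner to a cache. (Looking up the Region just means fetching it from a map, as covered in the Put section.)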
if (request.hasScannerId()) {
...
} else {
region = getRegion(request.getRegion());
...
if (!scan.hasFamilies()) {
// Adding all families to scanner
for (byte[] family: region.getTableDesc().getFamiliesKeys()) {
scan.addFamily(family);
}
}
...
if (scanner == null) {
scanner = region.getScanner(scan);
}
...
scannerId = addScanner(scanner, region);
scannerName = String.valueOf(scannerId);
ttl = this.scannerLeaseTimeoutPeriod;
}
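The first-phase branch above can be pictured as a map from scanner name to an open scanner. The following is only a minimal sketch of that idea, not the RSRpcServices code: the class `ScannerCacheSketch`, its counter-based id, and the use of `Iterator<String>` as a stand-in for a region scanner are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the region server's scanner cache.
class ScannerCacheSketch {
    // Assumption: ids come from a simple counter; the real server may generate them differently.
    private final AtomicLong nextId = new AtomicLong(1);
    private final ConcurrentHashMap<String, Iterator<String>> scanners = new ConcurrentHashMap<>();

    // First request: there is no scanner id yet, so cache the scanner and hand back a new id.
    long addScanner(Iterator<String> scanner) {
        long scannerId = nextId.getAndIncrement();
        scanners.put(String.valueOf(scannerId), scanner);
        return scannerId;
    }

    // Later requests: look the scanner up by the id the client sends back.
    Iterator<String> getScanner(long scannerId) {
        Iterator<String> scanner = scanners.get(String.valueOf(scannerId));
        if (scanner == null) {
            throw new IllegalStateException("Unknown scanner id " + scannerId);
        }
        return scanner;
    }

    int size() {
        return scanners.size();
    }
}
```

Because the same cached object is returned on every lookup, a scan can resume where the previous request stopped.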
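Inside scannerId = addScanner(scanner, region);, the scanner signs a lease with the Region Server that says how long the scanner will stay cached. The duration is set by the hbase.client.scanner.timeout.period parameter and defaults to 60000 ms, that is, one minute. The lease is tracked by an asynchronous thread that sleeps until a lease is due and then clears the cached scanner.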
public void createLease(String leaseName, int leaseTimeoutPeriod, final LeaseListener listener)
throws LeaseStillHeldException {
addLease(new Lease(leaseName, leaseTimeoutPeriod, listener));
}
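To make the lease life cycle concrete, here is a small sketch under stated assumptions. `LeaseTableSketch` and its methods are hypothetical names, not the HBase Leases class, and the current time is passed in explicitly instead of using a sleeping thread so that the example stays deterministic.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical lease table; the caller supplies "now" so no real waiting is needed.
class LeaseTableSketch {
    private static final class Lease {
        final long expiresAtMs;
        final Runnable onExpired;

        Lease(long expiresAtMs, Runnable onExpired) {
            this.expiresAtMs = expiresAtMs;
            this.onExpired = onExpired;
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();

    // Sign a lease; a second lease under the same name is rejected.
    void createLease(String name, long nowMs, long timeoutMs, Runnable onExpired) {
        if (leases.containsKey(name)) {
            throw new IllegalStateException("Lease still held: " + name);
        }
        leases.put(name, new Lease(nowMs + timeoutMs, onExpired));
    }

    // Cancel a lease, for example while a request is being served.
    // Returns false if the lease was already gone.
    boolean removeLease(String name) {
        return leases.remove(name) != null;
    }

    // What a background checker would do after waking up: drop expired leases
    // and notify their listeners. Returns how many leases expired.
    int expireDue(long nowMs) {
        int expired = 0;
        Iterator<Map.Entry<String, Lease>> it = leases.entrySet().iterator();
        while (it.hasNext()) {
            Lease lease = it.next().getValue();
            if (lease.expiresAtMs <= nowMs) {
                it.remove();
                lease.onExpired.run();
                expired++;
            }
        }
        return expired;
    }
}
```

In this sketch the expiry callback is where a cached scanner would be closed and removed.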
The response is then populated with the ttl, the scanner id, and moreResults (still at its initial value of true here) and returned.
if (ttl > 0) {
builder.setTtl(ttl);
}
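2. Scan for data and sign a lease (Leases)
If the client sends its next request after the lease has expired, an error is thrown. If it sends the request within the lease period, processing enters the second phase: scanning for data.
First, the scanner id is read from the request.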
long scannerId = -1;
if (request.hasScannerId()) {
scannerId = request.getScannerId();
scannerName = String.valueOf(scannerId);
}
Next, the cached scanner is fetched from the cache; this is the code that was omitted from the if branch in step 1.
if (request.hasScannerId()) {
long scannerId = request.getScannerId();
scanDetails = rsRpcServices.getScanDetailsWithId(scannerId);
} else {
scanDetails = rsRpcServices.getScanDetailsWithRequest(request);
}
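The lease is then removed. Because leases expire asynchronously, one could well expire while the request is still executing, so it is removed up front.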
lease = regionServer.leases.removeLease(scannerName);
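Then the data is fetched by calling scanner.nextRaw in a loop. Each call first fills values; after conversion, the row is added to results. Once results reaches the limit or there is no more data, the loop breaks.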
List<Result> results = new ArrayList<Result>();
...
while (i < rows) {
...
moreRows = scanner.nextRaw(values, scannerContext);
if (!values.isEmpty()) {
final boolean partial = scannerContext.partialResultFormed();
Result r = Result.create(values, null, stale, partial);
lastBlock = addSize(context, r, lastBlock);
results.add(r);
i++;
}
if (limitReached || !moreRows) {
break;
}
}
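The shape of this loop is easier to see without the size accounting. The sketch below is an assumption-laden simplification: `BatchLoopSketch` is a hypothetical name, an `Iterator<String>` plays the role of the scanner, and only the row-count limit is modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class BatchLoopSketch {
    // Pull at most maxRows rows from the scanner and stop early when it runs dry.
    static List<String> nextBatch(Iterator<String> scanner, int maxRows) {
        List<String> results = new ArrayList<>();
        int i = 0;
        while (i < maxRows) {
            if (!scanner.hasNext()) {
                break; // no more rows in this scanner
            }
            results.add(scanner.next());
            i++;
        }
        return results;
    }
}
```

Calling nextBatch repeatedly on the same iterator mirrors how successive scan requests drain the same cached scanner.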
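After one round of scanning, when there is no more data or the per-request limit has been reached, the rows collected in results are put into the builder and returned. A new lease is then signed, again for one minute.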
if (scanner.isFilterDone() && results.isEmpty()) {
builder.setMoreResults(false);
}
assert builder.hasMoreResultsInRegion();
3. Request again to confirm the scan has finished
The server finds that the scan is complete:
if (scanner.isFilterDone() && results.isEmpty()) {
// If the scanner's filter - if any - is done with the scan
// only set moreResults to false if the results is empty. This is used to keep compatible
// with the old scan implementation where we just ignore the returned results if moreResults
// is false. Can remove the isEmpty check after we get rid of the old implementation.
builder.setMoreResults(false);
}
// Later we may close the scanner depending on this flag so here we need to make sure that we
// have already set this flag.
assert builder.hasMoreResultsInRegion();
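When moreResults is false, the cached scanner is closed. The closeScanner method also removes the lease, so all the resources held by the scan are released, and a builder with no results can be returned to confirm that the scan is complete.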
if (!moreResults || closeScanner) {
ttl = 0;
moreResults = false;
closeScanner(region, scanner, scannerName);
}
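Putting the three phases together, the exchange can be sketched as a toy in-memory server and a client loop. Everything here is hypothetical (`ToyScanServer`, `Response`, `scanAll`); it only illustrates that an empty batch is what flips moreResults to false and releases the cached scanner.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory server; none of these names come from HBase.
class ToyScanServer {
    static final class Response {
        final long scannerId;
        final List<String> rows;
        final boolean moreResults;

        Response(long scannerId, List<String> rows, boolean moreResults) {
            this.scannerId = scannerId;
            this.rows = rows;
            this.moreResults = moreResults;
        }
    }

    private final List<String> data;
    private final Map<Long, Iterator<String>> scanners = new HashMap<>();
    private long nextId = 1;

    ToyScanServer(List<String> data) {
        this.data = data;
    }

    // Phase 1: open a scanner and return only its id.
    Response open() {
        long id = nextId++;
        scanners.put(id, data.iterator());
        return new Response(id, new ArrayList<>(), true);
    }

    // Phases 2 and 3: return a batch; an empty batch closes the scanner.
    Response next(long id, int batch) {
        Iterator<String> scanner = scanners.get(id);
        if (scanner == null) {
            throw new IllegalStateException("Unknown scanner " + id);
        }
        List<String> rows = new ArrayList<>();
        while (rows.size() < batch && scanner.hasNext()) {
            rows.add(scanner.next());
        }
        boolean moreResults = !rows.isEmpty();
        if (!moreResults) {
            scanners.remove(id); // release the cached scanner
        }
        return new Response(id, rows, moreResults);
    }

    int openScanners() {
        return scanners.size();
    }

    // Client side: keep asking until the server reports moreResults == false.
    static List<String> scanAll(ToyScanServer server, int batch) {
        List<String> all = new ArrayList<>();
        Response r = server.open();
        while (r.moreResults) {
            r = server.next(r.scannerId, batch);
            all.addAll(r.rows);
        }
        return all;
    }
}
```

With three rows and a batch size of two, the client makes one open call and three next calls; the last one returns nothing and closes the scanner.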
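Now look at moreRows = scanner.nextRaw(values, scannerContext);. The scanner here is a RegionScannerImpl instance created by HRegion#getScanner, and its nextRaw method calls RegionScannerImpl#nextInternal. The main purpose of that method is to fetch the next row into results and to return a value indicating whether more data remains.
HBase does not maintain a secondary index, so rows are found by traversing the files, which is why finding the next row takes a while loop.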
private boolean nextInternal(List<Cell> results, ScannerContext scannerContext)
throws IOException {
if (!results.isEmpty()) {
throw new IllegalArgumentException("First parameter should be an empty list");
}
if (scannerContext == null) {
throw new IllegalArgumentException("Scanner context cannot be null");
}
Optional<RpcCall> rpcCall = RpcServer.getCurrentCall();
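Strip away the while and look inside. The method first peeks a value from storeHeap; the first time this value is null, and data is pushed into storeHeap later. Successive requests actually use the same scanner instance, so a value placed in storeHeap during one request can still be read from storeHeap on the next.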
Cell current = this.storeHeap.peek();
boolean shouldStop = shouldStop(current);
boolean hasFilterRow = this.filter != null && this.filter.hasFilterRow();
if (hasFilterRow) {
if (LOG.isTraceEnabled()) {
LOG.trace("filter#hasFilterRow is true which prevents partial results from being "
+ " formed. Changing scope of limits that may create partials");
}
scannerContext.setSizeLimitScope(LimitScope.BETWEEN_ROWS);
scannerContext.setTimeLimitScope(LimitScope.BETWEEN_ROWS);
limitScope = LimitScope.BETWEEN_ROWS;
}
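Next it checks whether the current row is the stop row. If it is, the method returns false right away to signal that there is no more data. This is why calling scan.setStopRow on the client improves efficiency: the method returns false, so the traversal goes no further.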
boolean stopRow = isStopRow(currentRow, offset, length);
...
if (stopRow) {
return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
}
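The effect of the stop row can be shown with plain byte comparison. This is a sketch, not HBase code: `StopRowSketch` is a hypothetical name, rows are plain strings, and the stop row is treated as exclusive.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StopRowSketch {
    // Rows must already be sorted. The scan ends at the first row that is
    // greater than or equal to stopRow, so later rows are never visited.
    static List<String> scan(List<String> sortedRows, String stopRow) {
        byte[] stop = stopRow.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (String row : sortedRows) {
            byte[] key = row.getBytes(StandardCharsets.UTF_8);
            // Unsigned lexicographic comparison of the raw row key bytes.
            if (Arrays.compareUnsigned(key, stop) >= 0) {
                break;
            }
            out.add(row);
        }
        return out;
    }
}
```

Without a stop row the loop would have to run to the end of the data before it could report that nothing is left.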
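Then the filter is consulted to decide what to do next. If the row does not satisfy the filter, the method checks whether there is more data: if not it returns, otherwise it continues the while loop. Filters used to be criticized because the scan had to run to the end even after no later row could satisfy the filter; hence the isFilterDoneInternal method, which decides whether to exit the loop and return false.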
if (filterRowKey(currentRow, offset, length)) {
if (isFilterDoneInternal()) {
return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
}
boolean moreRows = nextRow(scannerContext, currentRow, offset, length);
if (!moreRows) {
return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
}
results.clear();
continue;
}
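The early exit is easiest to see with a prefix-style filter over sorted rows. The sketch below is an illustration under assumptions: `PrefixFilterSketch` and `FilteredScanSketch` are hypothetical classes, row keys are ASCII strings, and `isDone` plays the role that isFilterDoneInternal plays in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter that keeps only rows starting with a prefix.
class PrefixFilterSketch {
    private final String prefix;
    private boolean passedPrefix = false;

    PrefixFilterSketch(String prefix) {
        this.prefix = prefix;
    }

    // Returns true when the row should be skipped.
    boolean filterRowKey(String row) {
        if (row.startsWith(prefix)) {
            return false;
        }
        if (row.compareTo(prefix) > 0) {
            // Input is sorted, so no later row can match the prefix any more.
            passedPrefix = true;
        }
        return true;
    }

    boolean isDone() {
        return passedPrefix;
    }
}

class FilteredScanSketch {
    int rowsVisited = 0;

    List<String> scan(List<String> sortedRows, PrefixFilterSketch filter) {
        List<String> results = new ArrayList<>();
        for (String row : sortedRows) {
            rowsVisited++;
            if (filter.filterRowKey(row)) {
                if (filter.isDone()) {
                    break; // the filter can never match again, stop scanning
                }
                continue; // skip this row, try the next one
            }
            results.add(row);
        }
        return results;
    }
}
```

Only the rows up to the first one past the prefix are visited; the rest of the data is never touched.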
At this point both stopRow and filterRowKey have let the row through, so the data can be read.
populateResult(results, this.storeHeap, scannerContext, currentRow, offset, length);
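This method traverses the data to find the next row. The difference between the loop in populateResult and the outer loop in nextInternal is that populateResult collects all the columns of one row, as its loop condition moreCellsInRow shows, while the outer loop filters the row, including the stopRow and Filter checks. The other methods are simple, so look at heap.next(results, scannerContext);, where heap is this.storeHeap.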
private boolean populateResult(List<Cell> results, KeyValueHeap heap,
ScannerContext scannerContext, Cell currentRowCell) throws IOException {
Cell nextKv;
boolean moreCellsInRow = false;
boolean tmpKeepProgress = scannerContext.getKeepProgress();
// Scanning between column families and thus the scope is between cells
LimitScope limitScope = LimitScope.BETWEEN_CELLS;
do {
// Check for thread interrupt status in case we have been signaled from
// #interruptRegionOperation.
checkInterrupt();
// We want to maintain any progress that is made towards the limits while scanning across
// different column families. To do this, we toggle the keep progress flag on during calls
// to the StoreScanner to ensure that any progress made thus far is not wiped away.
scannerContext.setKeepProgress(true);
heap.next(results, scannerContext);
scannerContext.setKeepProgress(tmpKeepProgress);
nextKv = heap.peek();
moreCellsInRow = moreCellsInRow(nextKv, currentRowCell);
...
} while (moreCellsInRow);
...
}
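The idea of collecting one row's cells until the next cell belongs to a different row can be sketched as follows. This is a simplification under assumptions: `RowCollectorSketch` is a hypothetical name, a cell is reduced to a "row/column" string, and a `Deque` stands in for the heap.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RowCollectorSketch {
    // Hypothetical cell format "row/column"; returns the row part.
    static String rowOf(String cell) {
        return cell.substring(0, cell.indexOf('/'));
    }

    // Drain every cell that belongs to the row at the head of the queue.
    static List<String> nextRowCells(Deque<String> sortedCells) {
        List<String> results = new ArrayList<>();
        if (sortedCells.isEmpty()) {
            return results;
        }
        String currentRow = rowOf(sortedCells.peek());
        boolean moreCellsInRow;
        do {
            results.add(sortedCells.poll());
            // Peek without consuming: the next cell may belong to the next row.
            String nextCell = sortedCells.peek();
            moreCellsInRow = nextCell != null && rowOf(nextCell).equals(currentRow);
        } while (moreCellsInRow);
        return results;
    }
}
```

The peek after each poll is what lets the loop stop exactly at the row boundary while leaving the next row's first cell in place for the following call.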
Next, look at heap.next(results, scannerContext);, which calls current, that is, the StoreScanner, to read the next row's data.
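One way to picture a heap of scanners is a priority queue that always exposes the scanner whose head element sorts first. The sketch below rests on that assumption and is not the KeyValueHeap implementation: `MergeHeapSketch` and `Source` are hypothetical names, and sorted string lists stand in for per-store scanners.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeHeapSketch {
    // One sorted source together with its current head element.
    private static final class Source {
        final Iterator<String> it;
        String head;

        Source(Iterator<String> it) {
            this.it = it;
            this.head = it.next();
        }
    }

    // Sources are ordered by their head element, smallest first.
    private final PriorityQueue<Source> heap =
        new PriorityQueue<>(Comparator.comparing((Source s) -> s.head));

    MergeHeapSketch(List<List<String>> sortedSources) {
        for (List<String> src : sortedSources) {
            if (!src.isEmpty()) {
                heap.add(new Source(src.iterator()));
            }
        }
    }

    // Smallest element across all sources, or null when everything is drained.
    String peek() {
        return heap.isEmpty() ? null : heap.peek().head;
    }

    // Take the smallest element and advance the source it came from.
    String next() {
        Source top = heap.poll();
        if (top == null) {
            return null;
        }
        String result = top.head;
        if (top.it.hasNext()) {
            top.head = top.it.next();
            heap.add(top); // re-insert so the heap reorders by the new head
        }
        return result;
    }
}
```

Merging this way yields a single sorted stream, which is what lets the caller peek at the next element to detect a row boundary.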
After the data has been read, check whether the next row is the stopRow.
Cell nextKv = this.storeHeap.peek();
shouldStop = shouldStop(nextKv);
// save that the row was empty before filters applied to it.
final boolean isEmptyRow = results.isEmpty();
If no data was found this time and the row is not the stop row, continue the while loop.
if (results.isEmpty()) {
System.out.println("No row after " + Bytes.toStringBinary(row));
} else {
System.out.println("The closest row after " + Bytes.toStringBinary(row) + " is "
+ Bytes.toStringBinary(results.stream().findFirst().get().getRow()));
}
Finally, stopRow is checked once more, and the lookup is complete.
if (stopRow) {
return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
} else {
return scannerContext.setScannerState(NextState.MORE_VALUES).hasMoreValues();
}
Summary
That is all for today. This article introduced how a scan is used; further analysis will follow in later posts.