HBase Source Code Analysis (4) 2021SC@SDUSC


Preface


How a Scan Is Processed

1. Obtain a scanner id and sign a lease (Leases)

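The code uses the scanner id to decide whether the first phase has already been completed. If it has not, the server looks up the Region that holds the data, creates a scanner, and adds that scanner to a cache. (Looking up the Region simply means fetching it from a map, as described in the section on Put.)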

if (request.hasScannerId()) {
  ...
} else {
  region = getRegion(request.getRegion());
  ...
  if (!scan.hasFamilies()) {
    // Adding all families to scanner
    for (byte[] family: region.getTableDesc().getFamiliesKeys()) {
      scan.addFamily(family);
    }
  }
  ...
  if (scanner == null) {
    scanner = region.getScanner(scan);
  }
  ...
  scannerId = addScanner(scanner, region);
  scannerName = String.valueOf(scannerId);
  ttl = this.scannerLeaseTimeoutPeriod;
}
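The branch above can be pictured as a small registry keyed by scanner id. The following is a minimal sketch under stated assumptions, not HBase's actual code: `ScannerRegistry`, `open`, `lookup`, and `close` are hypothetical names, and a plain `Iterator<String>` stands in for a region scanner.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the Region Server's scanner cache.
public class ScannerRegistry {
    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<String, Iterator<String>> scanners = new ConcurrentHashMap<>();

    // Phase one: the request has no scanner id, so register a new scanner and return its id.
    public long open(List<String> rows) {
        long id = nextId.getAndIncrement();
        scanners.put(String.valueOf(id), rows.iterator());
        return id;
    }

    // Later phases: the request carries the id, so the cached scanner is looked up by name.
    public Iterator<String> lookup(long scannerId) {
        Iterator<String> scanner = scanners.get(String.valueOf(scannerId));
        if (scanner == null) {
            throw new IllegalStateException("Unknown scanner " + scannerId);
        }
        return scanner;
    }

    // Drops the cached scanner once the scan is finished.
    public void close(long scannerId) {
        scanners.remove(String.valueOf(scannerId));
    }
}
```

Because `lookup` returns the same iterator every time, progress made in one request is still there in the next, which is the property the later phases rely on.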

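Inside scannerId = addScanner(scanner, region);, the scanner signs a lease with the Region Server. The lease states how long the scanner will be kept in the cache. The duration is set by the hbase.client.scanner.timeout.period parameter and defaults to 60000 ms, i.e. one minute. Lease expiry is handled asynchronously: a background thread sleeps until a lease is due and then evicts the cached scanner.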

public void createLease(String leaseName, int leaseTimeoutPeriod, final LeaseListener listener)
    throws LeaseStillHeldException {
  addLease(new Lease(leaseName, leaseTimeoutPeriod, listener));
}
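The lease life cycle (create, remove before work, re-create afterwards, expire if the client goes quiet) can be sketched as follows. This is a simplified model, not the HBase `Leases` class: `LeaseTable` and `expire` are hypothetical, and instead of a sleeping background thread the current time is passed in explicitly so that the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, single-threaded model of a lease table; time is supplied by the caller.
public class LeaseTable {
    private final Map<String, Long> deadlines = new HashMap<>();

    // Registers a lease that is due at nowMs + timeoutMs.
    public void createLease(String name, long nowMs, long timeoutMs) {
        if (deadlines.containsKey(name)) {
            throw new IllegalStateException("Lease still held: " + name);
        }
        deadlines.put(name, nowMs + timeoutMs);
    }

    // Returns true if a lease with this name existed and was removed.
    public boolean removeLease(String name) {
        return deadlines.remove(name) != null;
    }

    // Drops every lease whose deadline has passed and returns the expired names.
    public List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

Removing and re-creating the lease around each request pushes the deadline forward, so a scanner only expires when the client stays silent for a full timeout period.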

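The server then sets the ttl, the scanner id, and moreResults (which still has its initial value of true) on the response and returns. The snippet shown here guards on ttl > 0 in the same way, although it attaches a TTL tag to a cell rather than populating the scan response.

if (ttl > 0) {
  tags.add(new ArrayBackedTag(TagType.TTL_TAG_TYPE, Bytes.toBytes(ttl)));
}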

2. Scan to fetch data and renew the lease (Leases)

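If the client sends another request after the lease has expired, an error is thrown. If the next request arrives within the lease period, the scan enters the second phase: scanning and fetching data.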

First, the scanner id is read from the request.

long scannerId = -1;
if (request.hasScannerId()) {
  scannerId = request.getScannerId();
  scannerName = String.valueOf(scannerId);
}

Next, the cached scanner is fetched from the cache. This is the code that was elided in the if branch of step 1.

if (request.hasScannerId()) {
  long scannerId = request.getScannerId();
  scanDetails = rsRpcServices.getScanDetailsWithId(scannerId);
} else {
  scanDetails = rsRpcServices.getScanDetailsWithRequest(request);
}

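The lease is then removed. Because leases expire asynchronously, the lease could well run out while the request is being processed, so it is removed up front.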

lease = regionServer.leases.removeLease(scannerName);

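Then the data is fetched by calling scanner.nextRaw in a loop. The cells read by each call are first placed in values and, once converted, added to results. When results reaches its limit or no more data is available, the loop breaks.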

List<Result> results = new ArrayList<Result>();
...
while (i < rows) {
  ...
  moreRows = scanner.nextRaw(values, scannerContext);

  if (!values.isEmpty()) {
    final boolean partial = scannerContext.partialResultFormed();
    Result r = Result.create(values, null, stale, partial);
    lastBlock = addSize(context, r, lastBlock);
    results.add(r);
    i++;
  }
  if (limitReached || !moreRows) {
    break;
  }
}
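The shape of this loop, stripped of HBase types, might look like the sketch below. It is an illustration only: `BatchScan`, `Batch`, and `nextBatch` are hypothetical names, an `Iterator<String>` plays the role of the scanner, and each string stands for one row.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BatchScan {
    // Outcome of one request: the rows collected and whether the scanner may hold more.
    public static final class Batch {
        public final List<String> rows;
        public final boolean moreRows;

        Batch(List<String> rows, boolean moreRows) {
            this.rows = rows;
            this.moreRows = moreRows;
        }
    }

    // Pulls at most maxRows rows; stops early when the scanner runs dry.
    public static Batch nextBatch(Iterator<String> scanner, int maxRows) {
        List<String> results = new ArrayList<>();
        boolean moreRows = true;
        while (results.size() < maxRows) {
            if (!scanner.hasNext()) {
                moreRows = false; // nothing left to read
                break;
            }
            results.add(scanner.next());
        }
        return new Batch(results, moreRows);
    }
}
```

Note that when the row limit is hit exactly on the last row, this sketch still reports `moreRows` as true, so the caller only learns that the scan is finished on the following, empty request.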

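After one round of scanning, if there is no more data or the per-request limit has been reached, the rows collected so far (results) are put into the builder and returned. A new lease is then signed, again with a duration of one minute.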

if (scanner.isFilterDone() && results.isEmpty()) {
  builder.setMoreResults(false);
}
assert builder.hasMoreResultsInRegion();

3. Request again to confirm that the scan is complete

On this request the server finds that the scan is complete.

if (scanner.isFilterDone() && results.isEmpty()) {
  // If the scanner's filter - if any - is done with the scan
  // only set moreResults to false if the results is empty. This is used to keep compatible
  // with the old scan implementation where we just ignore the returned results if moreResults
  // is false. Can remove the isEmpty check after we get rid of the old implementation.
  builder.setMoreResults(false);
}
// Later we may close the scanner depending on this flag so here we need to make sure that we
// have already set this flag.
assert builder.hasMoreResultsInRegion();

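When moreResults is false, the cached scanner is closed. The closeScanner method also removes the lease, so all the resources held by the scan are released. The server can then return a builder with no results, confirming that the scan is complete.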

if (!moreResults || closeScanner) {
  ttl = 0;
  moreResults = false;
  closeScanner(region, scanner, scannerName);
}
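Putting the three phases together, the whole conversation can be simulated in a few lines. This is a toy model under stated assumptions, not the RPC code: `ScanConversation`, `Response`, `scan`, and `scanAll` are hypothetical, a negative scanner id stands for "the request has no scanner id", and `moreResults` only becomes false when a request returns no rows and nothing is left.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ScanConversation {
    public static final class Response {
        public final long scannerId;
        public final List<String> rows;
        public final boolean moreResults;

        Response(long scannerId, List<String> rows, boolean moreResults) {
            this.scannerId = scannerId;
            this.rows = rows;
            this.moreResults = moreResults;
        }
    }

    private final List<String> data;
    private final Map<Long, Iterator<String>> scanners = new HashMap<>();
    private long nextId = 1;

    public ScanConversation(List<String> data) {
        this.data = data;
    }

    public int openScanners() {
        return scanners.size();
    }

    // Server side: one call per client request.
    public Response scan(long scannerId, int rows) {
        if (scannerId < 0) {
            // Phase 1: create and cache the scanner, return only its id.
            long id = nextId++;
            scanners.put(id, data.iterator());
            return new Response(id, new ArrayList<>(), true);
        }
        Iterator<String> scanner = scanners.get(scannerId);
        if (scanner == null) {
            throw new IllegalStateException("Unknown scanner " + scannerId);
        }
        // Phase 2: read up to the requested number of rows.
        List<String> results = new ArrayList<>();
        while (results.size() < rows && scanner.hasNext()) {
            results.add(scanner.next());
        }
        // Phase 3: an empty result with nothing left ends the scan and frees the scanner.
        boolean moreResults = scanner.hasNext() || !results.isEmpty();
        if (!moreResults) {
            scanners.remove(scannerId);
        }
        return new Response(scannerId, results, moreResults);
    }

    // Client side: keep asking until the server reports that nothing is left.
    public static List<String> scanAll(ScanConversation server, int rows) {
        List<String> all = new ArrayList<>();
        Response r = server.scan(-1, rows);
        while (r.moreResults) {
            r = server.scan(r.scannerId, rows);
            all.addAll(r.rows);
        }
        return all;
    }
}
```

In this model the final request always comes back empty, and that request is the one that causes the cached scanner to be released.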

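Now look at moreRows = scanner.nextRaw(values, scannerContext);. The scanner here is a RegionScannerImpl instance created by HRegion#getScanner, and its nextRaw method calls RegionScannerImpl#nextInternal. The main purpose of that method is to fetch the next row into results and to return a flag indicating whether more data remains.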

HBase does not build secondary indexes, so data is located by iterating over the stored files. Finding the next row therefore requires a while loop.

private boolean nextInternal(List<Cell> results, ScannerContext scannerContext)
    throws IOException {
  if (!results.isEmpty()) {
    throw new IllegalArgumentException("First parameter should be an empty list");
  }
  if (scannerContext == null) {
    throw new IllegalArgumentException("Scanner context cannot be null");
  }
  Optional<RpcCall> rpcCall = RpcServer.getCurrentCall();
  ...
}

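Stripping away the while and looking inside, the code first peeks a value from storeHeap. The first value obtained is null; data is pushed into storeHeap later. Successive requests actually share the same scanner instance, so a value placed in storeHeap during one request can still be read from storeHeap on the next.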

Cell current = this.storeHeap.peek();
boolean shouldStop = shouldStop(current);
boolean hasFilterRow = this.filter != null && this.filter.hasFilterRow();

if (hasFilterRow) {
  if (LOG.isTraceEnabled()) {
    LOG.trace("filter#hasFilterRow is true which prevents partial results from being "
        + " formed. Changing scope of limits that may create partials");
  }
  scannerContext.setSizeLimitScope(LimitScope.BETWEEN_ROWS);
  scannerContext.setTimeLimitScope(LimitScope.BETWEEN_ROWS);
  limitScope = LimitScope.BETWEEN_ROWS;
}

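Next, the code checks whether the current row is the stop row. If it is, it returns false straight away to indicate that there is no more data. This is why calling scan.setStopRow on the client improves efficiency: because false is returned immediately, the scan does not keep iterating.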

boolean stopRow = isStopRow(currentRow, offset, length);
...
if (stopRow) {
  return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
}
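The effect of a stop row can be shown with a small, self-contained sketch. It is an illustration rather than HBase's implementation: `StopRowScan` and its methods are hypothetical, row keys are compared as unsigned bytes in lexicographic order, the stop row is treated as exclusive, and a null stop row is assumed to mean "scan to the end".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StopRowScan {
    // Unsigned lexicographic comparison of two row keys.
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // True once the current row has reached or passed the (exclusive) stop row.
    static boolean isStopRow(byte[] row, byte[] stopRow) {
        return stopRow != null && compareRows(row, stopRow) >= 0;
    }

    // Rows are assumed to be sorted already, as they are inside a Region.
    public static List<String> scan(List<String> sortedRows, String stopRow) {
        byte[] stop = stopRow == null ? null : stopRow.getBytes(StandardCharsets.UTF_8);
        List<String> results = new ArrayList<>();
        for (String row : sortedRows) {
            if (isStopRow(row.getBytes(StandardCharsets.UTF_8), stop)) {
                break; // no later row can qualify, so stop reading
            }
            results.add(row);
        }
        return results;
    }
}
```

Because the rows are sorted, the first row at or beyond the stop row ends the loop, and everything after it is never touched.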

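Then the filter is consulted to decide what to do next. If the row does not satisfy the filter, the code checks whether more data remains: if not, it returns; if so, it continues the while loop. Filters used to be criticised because the scan had to run to the end even when no later row could satisfy the filter. The isFilterDoneInternal method was added for this reason: it decides whether to exit the loop and return false.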

if (filterRowKey(currentRow, offset, length)) {
  if (isFilterDoneInternal()) {
    return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
  }
  boolean moreRows = nextRow(scannerContext, currentRow, offset, length);
  if (!moreRows) {
    return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
  }
  results.clear();
  continue;
}
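A prefix match is a convenient way to see why a "filter done" check saves work. The sketch below is an assumption-laden illustration, not the HBase Filter API: `PrefixScan` is hypothetical, rows are plain strings sorted in natural `String` order, and the `examined` array is only there to count how many rows were looked at.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixScan {
    // Returns the rows starting with prefix; examined[0] counts the rows that were inspected.
    public static List<String> scan(List<String> sortedRows, String prefix, int[] examined) {
        List<String> results = new ArrayList<>();
        for (String row : sortedRows) {
            examined[0]++;
            if (!row.startsWith(prefix)) {
                // Row rejected by the filter. If it already sorts after the prefix,
                // no later row can match, so the filter is "done" and the scan stops.
                if (row.compareTo(prefix) > 0) {
                    break;
                }
                continue; // still before the prefix range, keep looking
            }
            results.add(row);
        }
        return results;
    }
}
```

Without the early `break`, every remaining row would be read and rejected one by one, which is the inefficiency described above.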

At this point both the stopRow check and filterRowKey have let the row through, so the data can be read.

populateResult(results, this.storeHeap, scannerContext, currentRow, offset, length);

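This method iterates over the data up to the next row. The difference between the loop in populateResult and the outer loop in nextInternal is that populateResult gathers all the columns of a single row, as its loop condition moreCellsInRow shows, while the outer loop filters that row, including the stopRow and Filter checks. The other calls are straightforward, so we focus on heap.next(results, scannerContext);, where heap is this.storeHeap.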

private boolean populateResult(List<Cell> results, KeyValueHeap heap,
    ScannerContext scannerContext, Cell currentRowCell) throws IOException {
  Cell nextKv;
  boolean moreCellsInRow = false;
  boolean tmpKeepProgress = scannerContext.getKeepProgress();
  // Scanning between column families and thus the scope is between cells
  LimitScope limitScope = LimitScope.BETWEEN_CELLS;
  do {
    // Check for thread interrupt status in case we have been signaled from
    // #interruptRegionOperation.
    checkInterrupt();

    // We want to maintain any progress that is made towards the limits while scanning across
    // different column families. To do this, we toggle the keep progress flag on during calls
    // to the StoreScanner to ensure that any progress made thus far is not wiped away.
    scannerContext.setKeepProgress(true);
    heap.next(results, scannerContext);
    scannerContext.setKeepProgress(tmpKeepProgress);

    nextKv = heap.peek();
    moreCellsInRow = moreCellsInRow(nextKv, currentRowCell);
    ...
  } while (moreCellsInRow);
  ...
}
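The "collect every cell of the current row" loop can be sketched without any HBase types. In this hypothetical example `RowAssembler` and its `Cell` class are made up for illustration, and a `Deque` of cells already sorted by row stands in for the store heap.

```java
import java.util.Deque;
import java.util.List;

public class RowAssembler {
    // A cell reduced to the two fields this sketch needs.
    public static final class Cell {
        public final String row;
        public final String column;

        public Cell(String row, String column) {
            this.row = row;
            this.column = column;
        }
    }

    // Moves every cell of the row at the head of the queue into results.
    // Returns true if cells of a later row are still waiting afterwards.
    public static boolean populateResult(List<Cell> results, Deque<Cell> heap) {
        Cell first = heap.peek();
        if (first == null) {
            return false;
        }
        boolean moreCellsInRow;
        do {
            results.add(heap.poll());
            Cell next = heap.peek();
            // Keep going only while the next cell belongs to the same row.
            moreCellsInRow = next != null && next.row.equals(first.row);
        } while (moreCellsInRow);
        return !heap.isEmpty();
    }
}
```

Peeking at the next cell without consuming it is what lets the loop stop exactly at the row boundary and leave the first cell of the following row in place for the next call.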

Now look at heap.next(results, scannerContext);. This call goes to current, i.e. the StoreScanner, to read the data of the next row. The call site is the heap.next(results, scannerContext) line inside the populateResult loop shown above.
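The idea of a heap that always delegates to whichever underlying scanner currently holds the smallest key can be sketched as a k-way merge. This is a generic illustration and not the KeyValueHeap source: `MergingHeap` and `Source` are hypothetical, and each sorted list of strings stands in for one store's scanner.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergingHeap {
    // One sorted source, e.g. the scanner of a single store.
    private static final class Source {
        final Iterator<String> it;
        String head;

        Source(Iterator<String> it) {
            this.it = it;
            advance();
        }

        void advance() {
            head = it.hasNext() ? it.next() : null;
        }
    }

    // Sources ordered by their current head, so the smallest key is always on top.
    private final PriorityQueue<Source> heap =
        new PriorityQueue<>(Comparator.comparing((Source s) -> s.head));

    public MergingHeap(List<List<String>> sortedSources) {
        for (List<String> src : sortedSources) {
            Source s = new Source(src.iterator());
            if (s.head != null) {
                heap.add(s);
            }
        }
    }

    // Smallest pending key, or null when every source is exhausted.
    public String peek() {
        Source s = heap.peek();
        return s == null ? null : s.head;
    }

    // Returns the smallest key and re-inserts its source under the source's new head.
    public String next() {
        Source s = heap.poll();
        if (s == null) {
            return null;
        }
        String value = s.head;
        s.advance();
        if (s.head != null) {
            heap.add(s);
        }
        return value;
    }
}
```

A source is removed before its head changes and re-added afterwards, so the priority queue never holds an element whose ordering key has been mutated in place.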

Once the data has been read, the code checks whether the next row is the stop row.

Cell nextKv = this.storeHeap.peek();
shouldStop = shouldStop(nextKv);
// save that the row was empty before filters applied to it.
final boolean isEmptyRow = results.isEmpty();

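If no data was found this time and the row is not the stop row, the while loop continues. The snippet below only illustrates branching on results.isEmpty(); it reads like a client-side usage example rather than the body of nextInternal.

if (results.isEmpty()) {
  System.out.println("No row after " + Bytes.toStringBinary(row));
} else {
  System.out.println("The closest row after " + Bytes.toStringBinary(row) + " is "
      + Bytes.toStringBinary(results.stream().findFirst().get().getRow()));
}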

Finally, stopRow is checked once more, and the lookup is complete.

if (stopRow) {
  return scannerContext.setScannerState(NextState.NO_MORE_VALUES).hasMoreValues();
} else {
  return scannerContext.setScannerState(NextState.MORE_VALUES).hasMoreValues();
}

Summary

That is all for this article, which covered how a scan works. The analysis continues in later posts.
