Phoenix duplicate records: causes of duplicated query results and how to fix them

weixin_39983383

Article link: https://blog.csdn.net/weixin_39983383/article/details/111724963

Problem description

Issue A: with phoenix.stats.enabled=true, Phoenix SQL queries return duplicate rows (more rows than HBase actually stores).

Issue B: with phoenix.stats.enabled=false, Phoenix SQL performance drops.

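Environment

Phoenix version: phoenix-4.8.0-HBase-1.1

Purpose of this article

Investigate how stats affect queries.

Parameter description

phoenix.stats.enabled: whether statistics collection is enabled (default: true).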

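What the parameter does

When stats are enabled, major compaction and region split automatically call the updateStatistic method of StatisticsCollector, which collects the region's key information, computes the guideposts, and writes them to the system.stats table.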

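Effect of the parameter (parallelism)

Phoenix SQL improves performance by dividing a query into more scans and running those scans in parallel.

The data between two adjacent guideposts is treated as one chunk, and each chunk maps to one scan. Running the scans in parallel is what speeds up the query.

The chunk size is configured with phoenix.stats.guidepost.width. Smaller chunks mean more scans and higher concurrency, but they also mean the client has to merge more chunks.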
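To make the chunking concrete, here is a minimal Java sketch of the idea, not Phoenix's actual code. It splits one key range into scans at the guidepost boundaries. The class name GuidepostChunks, the method split, and the use of String keys are assumptions made for illustration; Phoenix itself works on byte[] row keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GuidepostChunks {
    /**
     * Splits the half-open range [start, end) at every guidepost that lies
     * strictly inside it. Guideposts are expected in sorted order.
     * Each returned element is a {startKey, stopKey} pair, i.e. one scan.
     */
    static List<String[]> split(String start, String end, List<String> guideposts) {
        List<String[]> chunks = new ArrayList<>();
        String current = start;
        for (String gp : guideposts) {
            if (gp.compareTo(current) > 0 && gp.compareTo(end) < 0) {
                chunks.add(new String[] {current, gp});
                current = gp;
            }
        }
        // The tail after the last guidepost is always its own chunk.
        chunks.add(new String[] {current, end});
        return chunks;
    }

    public static void main(String[] args) {
        // Three guideposts inside [a, z) yield four chunks, hence four scans.
        for (String[] c : split("a", "z", Arrays.asList("f", "m", "t"))) {
            System.out.println("[" + c[0] + ", " + c[1] + ")");
        }
    }
}
```

With no guideposts the range stays a single chunk, which matches the lower chunk count seen when statistics are absent.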

SQL related to guideposts

Set GUIDE_POSTS_WIDTH

ALTER TABLE my_table SET GUIDE_POSTS_WIDTH = 10000000

ALTER TABLE my_table SET GUIDE_POSTS_WIDTH = null

Recompute the guideposts

UPDATE STATISTICS my_table

Inspect the guideposts

select * from system.stats where physical_name='my_table' ;

How guideposts improve performance

As described above, guideposts divide a region's data into smaller blocks and therefore produce more scans. The change is visible in the output of EXPLAIN.

Without guideposts

Clear the guideposts: delete from system.stats where physical_name='DB.TABLE' ;

Run EXPLAIN: explain select * from XXX where XXX > 'XXX';

Result: CLIENT 2-CHUNK PARALLEL 2-WAY ROUND ROBIN RANGE SCAN OVER ……

With guideposts

Generate the guideposts: update statistics DB.TABLE all;

Inspect the generated guideposts: select * from system.stats where physical_name='DB.TABLE' ;

View the execution plan: explain select * from XXX where XXX > 'XXX';

Result: CLIENT 10-CHUNK XXX ROWS XXX BYTES PARALLEL 2-WAY ROUND ROBIN RANGE SCAN OVER ……

With guideposts in place, the plan goes from 2 chunks to 10, so more scans are indeed generated.

Dive deep into the code

Where guideposts come from and where they go

Call chain

DefaultStatisticsCollector -> updateStatistic -> commitStats -> StatisticsWriter -> addStats -> addGuidepost -> addGuidepost

    // tableName = SYSTEM_STATS_NAME(system.stats)
    byte[] prefix = StatisticsUtil.getRowKey(tableName, cfKey, ptr);
    Put put = new Put(prefix);
    put.add(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, PhoenixDatabaseMetaData.GUIDE_POSTS_WIDTH_BYTES,
            timeStamp, PLong.INSTANCE.toBytes(byteCount));
    put.add(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES,
            PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES, timeStamp,
            PLong.INSTANCE.toBytes(rowCount));
    // Add our empty column value so queries behave correctly
    put.add(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, QueryConstants.EMPTY_COLUMN_BYTES, timeStamp,
            ByteUtil.EMPTY_BYTE_ARRAY);
    mutations.add(put);

Why more scans are generated

Call chain

PhoenixStatement.executeQuery -> BaseQueryPlan.iterator -> ScanPlan.newIterator -> ParallelIterators -> BaseResultIterators.getParallelScans

    int gpsSize = gps.getGuidePostsCount();
    int estGuidepostsPerRegion = gpsSize == 0 ? 1 : gpsSize / regionLocations.size() + 1;
    int keyOffset = 0;
    ImmutableBytesWritable currentGuidePost = ByteUtil.EMPTY_IMMUTABLE_BYTE_ARRAY;
    List<Scan> scans = Lists.newArrayListWithExpectedSize(estGuidepostsPerRegion);

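Note that the code uses regionLocations rather than regions: what is counted is the number of nodes hosting the regions, not the number of regions.

The code above shows how guideposts help Phoenix generate more scans.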
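As a worked example of the estimate in the excerpt, the sketch below isolates the expression for estGuidepostsPerRegion and evaluates it for a few inputs. The class name GuidepostEstimate and the parameter regionLocationCount (standing in for regionLocations.size()) are assumptions for illustration. In the quoted code the value is only passed to newArrayListWithExpectedSize, so it sizes the list of scans rather than limiting how many are created.

```java
final class GuidepostEstimate {
    /**
     * Same arithmetic as the quoted line:
     * gpsSize == 0 ? 1 : gpsSize / regionLocations.size() + 1
     * Integer division truncates, so the "+ 1" rounds the estimate up.
     */
    static int estGuidepostsPerRegion(int gpsSize, int regionLocationCount) {
        return gpsSize == 0 ? 1 : gpsSize / regionLocationCount + 1;
    }

    public static void main(String[] args) {
        System.out.println(estGuidepostsPerRegion(0, 2));  // 1
        System.out.println(estGuidepostsPerRegion(9, 2));  // 5
        System.out.println(estGuidepostsPerRegion(10, 2)); // 6
    }
}
```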

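Why duplicate records appear

Definition: "duplicate records" means that a SQL query returns extra rows that do not exist in HBase.

This logic is well hidden. The conclusion first: if the last region of a table is small and does not reach the guidepost width (phoenix.stats.guidepost.width, default 104857600 bytes, i.e. 100 MB), no guidepost is generated for that region. At query time the parallel scan for that region starts from the last guidepost rather than from the region's start key, so the scan generated for the region overlaps an earlier one.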

    while (regionIndex <= stopIndex) {
        ……
        try {
            while (guideIndex < gpsSize && (endKey.length == 0 || currentGuidePost.compareTo(endKey) <= 0)) {
                Scan newScan = scanRanges.intersectScan(scan, currentKeyBytes, currentGuidePostBytes, keyOffset,
                        false);
                ……
                scans = addNewScan(parallelScans, scans, newScan, currentGuidePostBytes, false, regionLocation);
                currentKeyBytes = currentGuidePostBytes;
                currentGuidePost = PrefixByteCodec.decode(decoder, input);
                currentGuidePostBytes = currentGuidePost.copyBytes();
                guideIndex++;
            }
        } catch (EOFException e) {}
        Scan newScan = scanRanges.intersectScan(scan, currentKeyBytes, endKey, keyOffset, true);
        if(newScan != null) {
            ScanUtil.setLocalIndexAttributes(newScan, keyOffset, regionInfo.getStartKey(),
                    regionInfo.getEndKey(), newScan.getStartRow(), newScan.getStopRow());
        }
        scans = addNewScan(parallelScans, scans, newScan, endKey, true, regionLocation);
        currentKeyBytes = endKey;
        regionIndex++;
    }

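When the loop reaches the last region and that region has no guideposts, its scan may start from an earlier guidepost, so the previous scan and this scan read the same rows twice.

Deciding whether a guidepost is written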
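The effect of such an overlap can be illustrated with a toy model. The sketch below does not reproduce Phoenix's scan-planning logic; it only compares a non-overlapping plan with the overlapping plan described in the text and counts the rows each one returns. The class OverlappingScans, the row keys, the region boundary "m", and the guidepost "f" are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class OverlappingScans {
    // Hypothetical row keys: b, g, k live in region 1 ["", "m"); p, x live in region 2 ["m", end).
    static final List<String> ROWS = Arrays.asList("b", "g", "k", "p", "x");

    // Expected plan: guidepost "f" splits region 1, region 2 starts at its own start key "m".
    static final List<String[]> EXPECTED = Arrays.asList(
            new String[] {"", "f"}, new String[] {"f", "m"}, new String[] {"m", null});

    // Plan described in the text: the scan for region 2 starts at the last guidepost "f" instead.
    static final List<String[]> OVERLAPPING = Arrays.asList(
            new String[] {"", "f"}, new String[] {"f", "m"}, new String[] {"f", null});

    /** Counts rows returned by half-open scans [start, stop); a null stop means "to the end". */
    static int rowsReturned(List<String> rows, List<String[]> scans) {
        int count = 0;
        for (String[] scan : scans) {
            for (String row : rows) {
                boolean afterStart = row.compareTo(scan[0]) >= 0;
                boolean beforeStop = scan[1] == null || row.compareTo(scan[1]) < 0;
                if (afterStart && beforeStop) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(rowsReturned(ROWS, EXPECTED));    // 5
        System.out.println(rowsReturned(ROWS, OVERLAPPING)); // 7: rows g and k are returned twice
    }
}
```

The table holds five rows, yet the overlapping plan returns seven, which is the symptom of Issue A: the query shows more rows than HBase stores.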

    if (byteCount >= guidepostDepth) {
        ImmutableBytesWritable row = new ImmutableBytesWritable(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
        if (gps.getSecond().addGuidePosts(row, byteCount, gps.getSecond().getRowCount())) {
            gps.setFirst(0l);
            gps.getSecond().resetRowCount();
        }
    }
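The threshold check can be simulated in isolation. The following sketch is a simplification under stated assumptions, not the Phoenix collector: it accumulates row sizes, emits a guidepost whenever the running byte count reaches guidepostDepth, and then resets the counter. The class GuidepostCollector and the method collect are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class GuidepostCollector {
    /**
     * Returns the indexes of the rows at which a guidepost would be emitted.
     * rowSizes holds the size in bytes of each row, in key order.
     */
    static List<Integer> collect(long[] rowSizes, long guidepostDepth) {
        List<Integer> guideposts = new ArrayList<>();
        long byteCount = 0;
        for (int i = 0; i < rowSizes.length; i++) {
            byteCount += rowSizes[i];
            if (byteCount >= guidepostDepth) {
                guideposts.add(i);
                byteCount = 0; // start counting toward the next guidepost
            }
        }
        return guideposts;
    }

    public static void main(String[] args) {
        // Five 40-byte rows, depth 100: one guidepost at row index 2.
        System.out.println(collect(new long[] {40, 40, 40, 40, 40}, 100)); // [2]
        // A region holding only 80 bytes never reaches the depth: no guidepost.
        System.out.println(collect(new long[] {40, 40}, 100)); // []
    }
}
```

The second call is the case that matters here: a region smaller than the depth produces no guidepost at all.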

Computing guidepostDepth

    public static long getGuidePostDepth(int guidepostPerRegion, long guidepostWidth, HTableDescriptor tableDesc) {
        if (guidepostPerRegion > 0) {
            long maxFileSize = HConstants.DEFAULT_MAX_FILE_SIZE;
            if (tableDesc != null) {
                long tableMaxFileSize = tableDesc.getMaxFileSize();
                if (tableMaxFileSize >= 0) {
                    maxFileSize = tableMaxFileSize;
                }
            }
            return maxFileSize / guidepostPerRegion;
        } else {
            return guidepostWidth;
        }
    }
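For a feel of the numbers, here is a standalone version of the same arithmetic that takes plain long values instead of an HTableDescriptor. The class GuidePostDepth and the tableMaxFileSize parameter (negative meaning "not set on the table") are assumptions for illustration, and the 10 GB constant is assumed to match HConstants.DEFAULT_MAX_FILE_SIZE in HBase 1.x.

```java
final class GuidePostDepth {
    // Assumed value of HConstants.DEFAULT_MAX_FILE_SIZE (10 GB).
    static final long DEFAULT_MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024;

    /**
     * If guidepostPerRegion is positive, the depth is the max file size divided
     * by it; otherwise the configured guidepost width is used unchanged.
     */
    static long depth(int guidepostPerRegion, long guidepostWidth, long tableMaxFileSize) {
        if (guidepostPerRegion > 0) {
            long maxFileSize = tableMaxFileSize >= 0 ? tableMaxFileSize : DEFAULT_MAX_FILE_SIZE;
            return maxFileSize / guidepostPerRegion;
        }
        return guidepostWidth;
    }

    public static void main(String[] args) {
        long width = 104857600L; // 100 MB, the default width quoted in the text
        System.out.println(depth(0, width, -1));           // 104857600
        System.out.println(depth(10, width, 1073741824L)); // 107374182
        System.out.println(depth(20, width, -1));          // 536870912
    }
}
```

With the default settings (first call) a region has to accumulate 100 MB before its first guidepost is written, which is why a small trailing region ends up with none.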

Fixing duplicate records

Solution

Upgrade Phoenix to 4.12.
