Overall Impala architecture
Reference: https://www.cnblogs.com/Rainbow-G/articles/4282444.html
1. Impala is split into a Java frontend (fe) and a C++ backend (be). The fe generates the execution plan tree and ships it over Thrift to the be, which actually executes it.
2. The Impala client is written in Python; it sends requests over Thrift to the be's impalad for execution.
Creating the analytic function
Writing the UDAF
For the ordered funnel, the functions Impala currently provides do not cover our requirements, so we write our own UDAF to implement it.
For the detailed procedure see: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_udf.html
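For reference, here is a minimal sketch of what such a UDAF can look like, written against Impala's UDA API (impala_udf/udf.h). The function names line up with the CREATE AGGREGATE FUNCTION statement shown below, but the step-advance rule in FunnelUpdate is only our assumed funnel semantics (advance only when the incoming step is exactly one past the current one), not the exact production code:

// Minimal funnel UDAF sketch against the Impala UDA API. The transition
// rule below is an assumption about the funnel semantics, not the exact
// production implementation.
#include <impala_udf/udf.h>

using namespace impala_udf;

// Initialize the intermediate state: no funnel step reached yet.
void FunnelInit(FunctionContext* context, IntVal* val) {
  val->is_null = false;
  val->val = 0;
}

// Called once per row (ordered by datestamp within each day/id partition):
// advance the funnel only when the incoming step is exactly one past the
// current step.
void FunnelUpdate(FunctionContext* context, const IntVal& input, IntVal* val) {
  if (input.is_null) return;
  if (input.val == val->val + 1) val->val = input.val;
}

// Merge two partial states by keeping the deeper funnel step.
void FunnelMerge(FunctionContext* context, const IntVal& src, IntVal* dst) {
  if (src.is_null) return;
  if (src.val > dst->val) dst->val = src.val;
}

// The final result is the deepest step reached.
IntVal FunnelFinalize(FunctionContext* context, const IntVal& val) {
  return val;
}

When only UPDATE_FN is given in the CREATE statement, Impala derives the Init/Merge/Finalize entry points from it by naming convention, which is why the statement below names just FunnelUpdate.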
添加参数,支持over
但是我们创建的udaf默认不支持over分析函数,可以执行group by聚合函数,所以要对这块改造。
具体的实现函数是fe的org.apache.impala.catalog.AggregateFunction
具体的是里面有一个isAnalyticFn_字段,如果是true,则表示支持over分析函数
public class AggregateFunction extends Function {
  // ...
  // True if this function can appear within an analytic expr (fn() OVER(...)).
  // TODO: Instead of manually setting this flag for all builtin aggregate functions
  // we should identify this property from the function itself (e.g., based on which
  // functions of the UDA API are implemented).
  // Currently, there is no reliable way of doing that.
  private boolean isAnalyticFn_;
  // ...
}
The key change is to accept an extra attribute when creating the function, ANALYSIS='true'. Its value is passed into AggregateFunction's isAnalyticFn_ field, after which findmaxpage(INT) OVER(...) becomes usable:
CREATE AGGREGATE FUNCTION findmaxpage(INT)
RETURNS INT
LOCATION 'hdfs://localhost/impala_lib/libudasample.so'
UPDATE_FN='FunnelUpdate'
ANALYSIS='true'
Modifying the source to support the parameter
The main changes are to the Thrift file behind AggregateFunction, regenerating the Java, C++, and Python bindings, plus the corresponding SQL-parsing code so that the ANALYSIS attribute is recognized.
Optimizing the analytic function's sort
Overall execution flow
Overall class diagram
The Java frontend analyzes the user's SQL and generates the execution plan tree; different operations map to different PlanNodes, e.g. SelectNode, ScanNode, SortNode, AggregationNode, HashJoinNode, and so on.
The fe generates the execution plan and fills in each node's information, then serializes it and sends it over Thrift to the be. In general each node has a corresponding serialized object; e.g. SortNode maps to TSortNode.java.
The overall class diagram relationships are as follows:
[Class diagram image unavailable: the original external image link is broken.]
The SQL to execute:
select day,s,count(1) from
(
select day,id,max(step) as s from
(
select day, id,
logs.findmaxpage54(
case
when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.homepage.click' and lower(f.cspot) in ('微信首页国际酒店tab按钮') and f.bns in (2) then 1
when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.listpage.load' and f.bns in (2) then 2
when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.detailpage.load' and f.bns in (2) then 3
when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.fillingorderpage.load' and f.bns in (2) then 4
when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.ordercreatedpage.load' and f.bns in (2) then 5
else 0
end
) over (partition by day,id order by datestamp) as step
from logs.ui8 f
where bns = 2 and dateline = day and day between 20180410 and 20180416 and lower(concat(platform,'.',biz,'.',page,'.',et)) in ('ui.ihotel.homepage.click','ui.ihotel.listpage.load','ui.ihotel.detailpage.load','ui.ihotel.fillingorderpage.load','ui.ihotel.ordercreatedpage.load')
) a group by day,id
)b where s > 0
group by day,s order by day,s
The generated execution plan:
... (portions omitted)
|
02:ANALYTIC
| functions: logs.findmaxpage54(CASE WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.homepage.click' AND lower(cspot) IN ('微信首页国际酒店tab按钮') AND bns IN (2) THEN 1 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.listpage.load' AND bns IN (2) THEN 2 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.detailpage.load' AND bns IN (2) THEN 3 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.fillingorderpage.load' AND bns IN (2) THEN 4 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.ordercreatedpage.load' AND bns IN (2) THEN 5 ELSE 0 END)
| partition by: day, id
| order by: datestamp ASC
| window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
| mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
| tuple-ids=9,8 row-size=104B cardinality=unavailable
|
01:SORT
| order by: day ASC NULLS FIRST, id ASC NULLS FIRST, datestamp ASC
| mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
| tuple-ids=9 row-size=100B cardinality=unavailable
|
06:EXCHANGE [HASH(day,id)]
| mem-estimate=0B mem-reservation=0B
| tuple-ids=0 row-size=100B cardinality=unavailable
|
F00:PLAN FRAGMENT [RANDOM] hosts=2 instances=2
Per-Host Resources: mem-estimate=792.00MB mem-reservation=0B
00:SCAN HDFS [logs.ui8 f, RANDOM]
partitions=7/15 files=14 size=2.18GB
predicates: bns = 2, dateline = day, f.dateline <= 20180416, f.dateline >= 20180410, lower(concat(platform, '.', biz, '.', page, '.', et)) IN ('ui.ihotel.homepage.click', 'ui.ihotel.listpage.load', 'ui.ihotel.detailpage.load', 'ui.ihotel.fillingorderpage.load', 'ui.ihotel.ordercreatedpage.load')
stats-rows=unavailable extrapolated-rows=disabled
table stats: rows=unavailable size=unavailable
columns missing stats: id, platform, biz, page, et, cspot, bns, datestamp, dateline
parquet statistics predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
parquet dictionary predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
mem-estimate=792.00MB mem-reservation=0B
tuple-ids=0 row-size=100B cardinality=unavailable
----------------
Flow analysis
Execution proceeds from the bottom up: the scan node at the bottom produces the data, which flows to the exchange node and then on to the sort node.
Querying the execution profile to find the bottleneck
Identify the slowest part from the per-node execution times.
Example: http://ip:25000/queries
Be-side execution optimization: replacing the full sort with pre-sort + merge sort
Computing the HDFS scan ranges
Concretely this happens in the fe, in
org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(Analyzer)
Getting the table's partition information
1. Using the metadata of the queried table, obtain its partition information, held in the List-typed field partitions_.
2. Iterate over the partitions and collect every file (FileDescriptor) under each one.
3. Get all blocks of each file; for every block, build a scan range object TScanRange (start offset, length, etc.) and scan range location objects TScanRangeLocation (disk id, whether the replica is cached, etc.).
4. By default one scan range, called a split, is created per block. The splits are handed to the backend: each scanner thread scans one split, reading it batch by batch, one RowBatch (1024 rows by default) at a time.
private Set<HdfsFileFormat> computeScanRangeLocations(Analyzer analyzer)
    throws ImpalaRuntimeException {
  // ...
  // Get the maximum scan range length from the query options.
  long maxScanRangeLength = analyzer.getQueryCtx().client_request.getQuery_options()
      .getMax_scan_range_length();
  // Loop over all partitions.
  for (HdfsPartition partition: partitions_) {
    // ...
    // Get all files under this partition.
    List<FileDescriptor> fileDescs = partition.getFileDescriptors();
    // ...
    for (FileDescriptor fileDesc: fileDescs) {
      totalBytes_ += fileDesc.getFileLength();
      boolean fileDescMissingDiskIds = false;
      // Get all blocks of this file.
      for (int j = 0; j < fileDesc.getNumFileBlocks(); ++j) {
        FbFileBlock block = fileDesc.getFbFileBlock(j);
        // Get the number of replicas.
        int replicaHostCount = FileBlock.getNumReplicaHosts(block);
        if (replicaHostCount == 0) {
          // we didn't get locations for this block; for now, just ignore the block
          // TODO: do something meaningful with that
          continue;
        }
        // Collect the network address and volume ID of all replicas of this block.
        List<TScanRangeLocation> locations = Lists.newArrayList();
        // For each replica, record the disk the block lives on and whether
        // it is cached.
        for (int i = 0; i < replicaHostCount; ++i) {
          TScanRangeLocation location = new TScanRangeLocation();
          // ...
          location.setVolume_id(FileBlock.getDiskId(block, i));
          location.setIs_cached(FileBlock.isReplicaCached(block, i));
          locations.add(location);
        }
        // Build the TScanRange objects. By default a scan range spans one whole
        // block: it starts at the block's offset and covers the block's length.
        // create scan ranges, taking into account maxScanRangeLength
        long currentOffset = FileBlock.getOffset(block);
        long remainingLength = FileBlock.getLength(block);
        while (remainingLength > 0) {
          long currentLength = remainingLength;
          if (maxScanRangeLength > 0 && remainingLength > maxScanRangeLength) {
            currentLength = maxScanRangeLength;
          }
          TScanRange scanRange = new TScanRange();
          scanRange.setHdfs_file_split(new THdfsFileSplit(fileDesc.getFileName(),
              currentOffset, currentLength, partition.getId(), fileDesc.getFileLength(),
              fileDesc.getFileCompression().toThrift(),
              fileDesc.getModificationTime()));
          TScanRangeLocationList scanRangeLocations = new TScanRangeLocationList();
          scanRangeLocations.scan_range = scanRange;
          scanRangeLocations.locations = locations;
          scanRanges_.add(scanRangeLocations);
          remainingLength -= currentLength;
          currentOffset += currentLength;
        }
      }
      // ...
    }
    if (partitionMissingDiskIds) ++numPartitionsNoDiskIds_;
  }
  return fileFormats;
}
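To make the split arithmetic concrete, here is a small standalone sketch (hypothetical names, not Impala code) of how a single block is cut into scan ranges when max_scan_range_length is set, mirroring the while loop above:

#include <cstdint>
#include <cstdio>
#include <vector>

struct Split { int64_t offset; int64_t length; };

// Cut one block into splits, honoring an optional maximum scan range length,
// the same way the while loop above does.
std::vector<Split> MakeSplits(int64_t block_offset, int64_t block_length,
                              int64_t max_scan_range_length) {
  std::vector<Split> splits;
  int64_t current_offset = block_offset;
  int64_t remaining = block_length;
  while (remaining > 0) {
    int64_t current = remaining;
    if (max_scan_range_length > 0 && remaining > max_scan_range_length) {
      current = max_scan_range_length;
    }
    splits.push_back({current_offset, current});
    remaining -= current;
    current_offset += current;
  }
  return splits;
}

int main() {
  // A 128MB block with max_scan_range_length = 48MB yields three splits:
  // 48MB, 48MB, and 32MB.
  const int64_t MB = 1 << 20;
  for (const Split& s : MakeSplits(0, 128 * MB, 48 * MB)) {
    std::printf("offset=%lld length=%lld\n",
                (long long) s.offset, (long long) s.length);
  }
  return 0;
}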
Detailed flow
1. The scan node starts multiple scanner threads; by default one per block (split), capped at the number of CPU cores.
2. Scan results are placed into a blocking queue, batch by batch, one RowBatch at a time.
3. The main thread pulls data off the queue and sends it over RPC to the exchange; the destination is chosen by hashing the PARTITION BY columns.
4. The sort node receives the data one RowBatch at a time and appends it to a run until the run is full, then sorts that run with quicksort. If multiple runs remain at the end, they are merge-sorted (see the sketch below).
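To illustrate the run-based sorting in step 4, here is a small self-contained sketch (plain ints stand in for RowBatch rows, std::sort for the quicksort; the names are ours, not Impala's Sorter): rows are collected into fixed-capacity runs, each run is sorted when it fills up, and the sorted runs are k-way merged at the end.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Run = std::vector<int>;

// K-way merge of already-sorted runs using a min-heap of (value, run, index).
std::vector<int> MergeRuns(const std::vector<Run>& runs) {
  using Item = std::pair<int, std::pair<size_t, size_t>>;
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  for (size_t r = 0; r < runs.size(); ++r) {
    if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [value, pos] = heap.top();
    heap.pop();
    out.push_back(value);
    auto [r, i] = pos;
    if (i + 1 < runs[r].size()) heap.push({runs[r][i + 1], {r, i + 1}});
  }
  return out;
}

int main() {
  const size_t kRunCapacity = 4;  // stands in for "the run is full"
  std::vector<int> input = {9, 3, 7, 1, 8, 2, 6, 5, 4};
  std::vector<Run> runs;
  Run current;
  for (int v : input) {
    current.push_back(v);
    if (current.size() == kRunCapacity) {
      // Run is full: sort it now (pre-sort) and start a new run.
      std::sort(current.begin(), current.end());
      runs.push_back(std::move(current));
      current.clear();
    }
  }
  if (!current.empty()) {
    std::sort(current.begin(), current.end());
    runs.push_back(std::move(current));
  }
  // Merge the sorted runs into one fully sorted output.
  for (int v : MergeRuns(runs)) std::printf("%d ", v);
  std::printf("\n");
  return 0;
}

Pre-sorting each run while it is still in memory and merging at the end avoids one big sort over the full data set, which is the essence of the be-side optimization this section is about.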
Be initialization (code details, skipped for now)
In the HdfsScanNodeBase::Prepare method, the scan range parameters are read from scan_range_params_.
Before HdfsScanNodeBase::Prepare is called, the parent class ScanNode's SetScanRanges method must be invoked to set this parameter:
void SetScanRanges(const std::vector<TScanRangeParams>& scan_range_params) {
  scan_range_params_ = &scan_range_params;
}
hdfs-scan-node-base.cc
The matching HdfsFileDesc file descriptors are fetched from the map-typed member per_type_files_:
/// File format => file descriptors.
typedef std::map<THdfsFileFormat::type, std::vector<HdfsFileDesc*>>
    FileFormatsMap;
FileFormatsMap per_type_files_;
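A tiny self-contained sketch of this lookup pattern (the enum and HdfsFileDesc here are simplified stand-ins for the real types):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

enum class HdfsFileFormat { TEXT, PARQUET };    // simplified stand-in
struct HdfsFileDesc { std::string filename; };  // simplified stand-in

using FileFormatsMap = std::map<HdfsFileFormat, std::vector<HdfsFileDesc*>>;

int main() {
  HdfsFileDesc f1{"part-0.parq"}, f2{"part-1.parq"};
  FileFormatsMap per_type_files;
  per_type_files[HdfsFileFormat::PARQUET] = {&f1, &f2};

  // Fetch all files of one format, as the scan node does with per_type_files_.
  auto it = per_type_files.find(HdfsFileFormat::PARQUET);
  if (it != per_type_files.end()) {
    for (HdfsFileDesc* file : it->second) {
      std::printf("scan %s\n", file->filename.c_str());
    }
  }
  return 0;
}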
HdfsParquetScanner::InitColumns
This initializes the ScanRanges for the columns to be read. Because Parquet is a columnar format, reading happens column by column: each column's length is obtained from the file metadata, which yields the offset and length the ScanRange needs.
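As a hedged sketch of that idea (illustrative names, not Impala's actual InitColumns), each column chunk's offset and size from the footer metadata become one read range:

#include <cstdint>
#include <cstdio>
#include <vector>

struct ColumnChunkMeta { int64_t file_offset; int64_t total_size; };  // from the footer
struct ColumnRange { int64_t offset; int64_t length; };

// Build one read range per column chunk of a row group.
std::vector<ColumnRange> InitColumnRanges(
    const std::vector<ColumnChunkMeta>& chunks) {
  std::vector<ColumnRange> ranges;
  ranges.reserve(chunks.size());
  for (const ColumnChunkMeta& c : chunks) {
    ranges.push_back({c.file_offset, c.total_size});
  }
  return ranges;
}

int main() {
  // Two column chunks of one row group: (offset, size) pairs from metadata.
  std::vector<ColumnChunkMeta> chunks = {{4, 1024}, {1028, 2048}};
  for (const ColumnRange& r : InitColumnRanges(chunks)) {
    std::printf("offset=%lld length=%lld\n",
                (long long) r.offset, (long long) r.length);
  }
  return 0;
}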
Each range is then added to the unstarted_scan_ranges queue via RequestContext::AddRequestRange:
void RequestContext::AddRequestRange(
    RequestRange* range, bool schedule_immediately) {
  // ...
  bool schedule_context;
  // Read path: a scan range is either scheduled right away or queued as
  // unstarted until a scanner thread picks it up.
  if (range->request_type() == RequestType::READ) {
    ScanRange* scan_range = static_cast<ScanRange*>(range);
    if (schedule_immediately) {
      ScheduleScanRange(scan_range);
    } else {
      state.unstarted_scan_ranges()->Enqueue(scan_range);
      num_unstarted_scan_ranges_.Add(1);
    }
    // ...
  } else {
    // Write path.
    DCHECK(range->request_type() == RequestType::WRITE);
    DCHECK(!schedule_immediately);
    WriteRange* write_range = static_cast<WriteRange*>(range);
    // ...
  }
}