Impala source-level optimization

Impala overall architecture

Reference article: https://www.cnblogs.com/Rainbow-G/articles/4282444.html

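1. Impala consists of a Java frontend (FE) and a C++ backend (BE). The FE builds the execution plan tree and sends it over Thrift to the BE, which executes it.
2. The Impala client (impala-shell) is written in Python; it sends requests over Thrift to the impalad process in the BE for execution.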

Creating the analytic function

Writing the UDAF

Impala's built-in functions do not cover our ordered-funnel requirement, so we wrote our own UDAF to implement it.
For the method, see: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_udf.html
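
To make the requirement concrete, here is a minimal sketch of what an ordered-funnel aggregate might compute. It is written in Java for illustration rather than against the C++ UDA API, and the class FunnelState and its methods are hypothetical; this is not the author's FunnelUpdate code. It assumes events arrive ordered by time, each carrying a step number (0 for events that match no funnel step), and that the state only advances when the next expected step appears.

```java
/** Hypothetical ordered-funnel state; an assumption, not the author's FunnelUpdate code. */
final class FunnelState {
    // Deepest step reached so far while respecting the step order.
    private int reached = 0;

    /** Feeds one event's step number (0 means "not a funnel step"). */
    void update(int step) {
        // Advance only when the event is exactly the next expected step.
        if (step == reached + 1) {
            reached = step;
        }
    }

    int result() {
        return reached;
    }

    /** Convenience helper: runs the state over a time-ordered array of steps. */
    static int maxStep(int[] orderedSteps) {
        FunnelState state = new FunnelState();
        for (int step : orderedSteps) {
            state.update(step);
        }
        return state.result();
    }
}
```

With the steps 1, 2, 2, 4, 3 the state ends at 3: the 4 is ignored because step 3 had not been seen yet when it arrived.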

Adding a parameter to support OVER

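However, a UDAF created this way can only be used as a GROUP BY aggregate function; by default it cannot be used as an analytic function with OVER, so this part has to be changed.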

The relevant class is org.apache.impala.catalog.AggregateFunction in the FE.
It has an isAnalyticFn_ field; when it is true, the function can be used as an analytic function with OVER.


public class AggregateFunction extends Function {
   .......................
  // True if this function can appear within an analytic expr (fn() OVER(...)).
  // TODO: Instead of manually setting this flag for all builtin aggregate functions
  // we should identify this property from the function itself (e.g., based on which
  // functions of the UDA API are implemented).
  // Currently, there is no reliable way of doing that.
  private boolean isAnalyticFn_;

  ..................
}

The main change is a new parameter in the function creation statement, ANALYSIS='true'. Its value is passed to the isAnalyticFn_ field of AggregateFunction, so that findmaxpage(int) over(…) can then be used.


 CREATE AGGREGATE FUNCTION findmaxpage(INT)                              
  RETURNS INT                                                                   
  LOCATION 'hdfs://localhost/impala_lib/libudasample.so'                        
  UPDATE_FN='FunnelUpdate' 
 ANALYSIS='true' 
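
The article does not show how the option reaches isAnalyticFn_. As a rough sketch only, the mapping could look like the following; UdaOptions, isAnalytic and the optArgs map are hypothetical names invented for illustration and are not Impala's actual parser or catalog API.

```java
import java.util.Map;

/** Hypothetical stand-in for handling CREATE AGGREGATE FUNCTION options. */
final class UdaOptions {
    /**
     * Returns true only when the ANALYSIS option is present and equals
     * "true" (case-insensitive). A missing option keeps the default, false,
     * which matches the stock behaviour described in the text.
     */
    static boolean isAnalytic(Map<String, String> optArgs) {
        String value = optArgs.get("ANALYSIS");
        return value != null && value.trim().equalsIgnoreCase("true");
    }
}
```

Defaulting to false means existing CREATE AGGREGATE FUNCTION statements that omit the option behave as before.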

Modifying the source to support the parameter

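This mainly involves modifying the Thrift file that corresponds to AggregateFunction and regenerating the Java, C++ and Python files from it, as well as modifying the SQL parsing code so that it recognizes the ANALYSIS field.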

Optimizing the sort for analytic functions

Overall execution flow

Overall class diagram

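The Java frontend analyzes the user's SQL query and generates the execution plan tree. Different operations map to different PlanNode types, e.g. SelectNode, ScanNode, SortNode, AggregationNode, HashJoinNode and so on.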

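The FE generates the execution plan and fills in the information of each node, then serializes it and sends it over Thrift to the BE. In general each node has a corresponding serialization object; for example, SortNode corresponds to TSortNode.java.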

The overall class diagram is as follows:

[Figure: overall class diagram, image not available (https://note.youdao.com/yws/public/resource/8a4cd9cf20be62512c0c57c4b8690c6f/xmlnote/657A1F1F1EED4F3FB33EA8E1D42D8B0D/12594)]

SQL executed

select day,s,count(1) from
  (
    select day,id,max(step) as s from
    (
      select day, id,
      logs.findmaxpage54(
        case
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.homepage.click' and lower(f.cspot) in ('微信首页国际酒店tab按钮') and f.bns in (2) then 1
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.listpage.load'  and f.bns in (2) then 2
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.detailpage.load' and f.bns in (2) then 3
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.fillingorderpage.load' and f.bns in (2) then 4
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.ordercreatedpage.load' and f.bns in (2) then 5
        else 0
        end
        ) over (partition by day,id order by datestamp) as step
        from logs.ui8 f
        where    bns = 2 and dateline = day  and day between 20180410 and 20180416   and lower(concat(platform,'.',biz,'.',page,'.',et)) in ('ui.ihotel.homepage.click','ui.ihotel.listpage.load','ui.ihotel.detailpage.load','ui.ihotel.fillingorderpage.load','ui.ihotel.ordercreatedpage.load')
      ) a group by day,id
    )b where s > 0
group by day,s order by day,s


Generated execution plan:


(part omitted) ...................


|
02:ANALYTIC
|  functions: logs.findmaxpage54(CASE WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.homepage.click' AND lower(cspot) IN ('微信首页国际酒店tab按钮') AND bns IN (2) THEN 1 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.listpage.load' AND bns IN (2) THEN 2 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.detailpage.load' AND bns IN (2) THEN 3 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.fillingorderpage.load' AND bns IN (2) THEN 4 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.ordercreatedpage.load' AND bns IN (2) THEN 5 ELSE 0 END)
|  partition by: day, id
|  order by: datestamp ASC
|  window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
|  tuple-ids=9,8 row-size=104B cardinality=unavailable
|
01:SORT
|  order by: day ASC NULLS FIRST, id ASC NULLS FIRST, datestamp ASC
|  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
|  tuple-ids=9 row-size=100B cardinality=unavailable
|
06:EXCHANGE [HASH(day,id)]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=0 row-size=100B cardinality=unavailable
|
F00:PLAN FRAGMENT [RANDOM] hosts=2 instances=2
Per-Host Resources: mem-estimate=792.00MB mem-reservation=0B
00:SCAN HDFS [logs.ui8 f, RANDOM]
   partitions=7/15 files=14 size=2.18GB
   predicates: bns = 2, dateline = day, f.dateline <= 20180416, f.dateline >= 20180410, lower(concat(platform, '.', biz, '.', page, '.', et)) IN ('ui.ihotel.homepage.click', 'ui.ihotel.listpage.load', 'ui.ihotel.detailpage.load', 'ui.ihotel.fillingorderpage.load', 'ui.ihotel.ordercreatedpage.load')
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=unavailable
   columns missing stats: id, platform, biz, page, et, cspot, bns, datestamp, dateline
   parquet statistics predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
   parquet dictionary predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
   mem-estimate=792.00MB mem-reservation=0B
   tuple-ids=0 row-size=100B cardinality=unavailable
----------------

Flow analysis

Execution proceeds bottom-up

The scan node is at the bottom; data flows from it to the exchange node and then to the sort node.
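
The exchange in the plan above is shown as HASH(day,id), meaning rows with the same (day, id) are routed to the same receiving instance so that each analytic partition can be sorted and evaluated in one place. A minimal sketch of that routing idea follows. HashExchange and targetInstance are hypothetical names, the id column is assumed to be a string, and Objects.hash is only a stand-in: Impala uses its own hash function in the BE.

```java
import java.util.Objects;

/** Hypothetical illustration of hash-partitioned row routing. */
final class HashExchange {
    /**
     * Picks the receiving instance for a row from its partition-by key.
     * Rows with equal (day, id) always map to the same instance.
     */
    static int targetInstance(int day, String id, int numInstances) {
        int hash = Objects.hash(day, id); // stand-in hash, not Impala's
        // floorMod keeps the result in [0, numInstances) even for negative hashes.
        return Math.floorMod(hash, numInstances);
    }
}
```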

Inspecting the executed plan to find the bottleneck

Use the execution times to find the slowest part.

Example: http://ip:25000/queries


BE execution optimization: replacing the full sort with pre-sort plus merge sort

Computing the HDFS scan ranges

This happens in the FE, in
org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(Analyzer)

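Getting the table's partition information
1. The partition information is obtained from the metadata of the queried table and stored in the List-typed variable partitions_.
2. Loop over the partitions and get all files (FileDescriptor) under each partition.
3. Get all blocks of each file and loop over them, constructing for each block a scan range object TScanRange (the start offset, the length, etc.) and scan range location objects TScanRangeLocation (the disk ID, whether the replica is cached, etc.).
4. By default one scan range, called a split, is created per block and passed to the backend. Each thread scans one split, reading it in batches of one rowbatch at a time (1024 rows by default).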


  private Set<HdfsFileFormat> computeScanRangeLocations(Analyzer analyzer)
      throws ImpalaRuntimeException {
  ...............................

    // Get the maximum scan range length
    long maxScanRangeLength = analyzer.getQueryCtx().client_request.getQuery_options()
        .getMax_scan_range_length();

    // Iterate over all partitions
    for (HdfsPartition partition: partitions_) {
     ...............................
      // Get all files in this partition
      List<FileDescriptor> fileDescs = partition.getFileDescriptors();
      ...............................
      for (FileDescriptor fileDesc: fileDescs) {
        totalBytes_ += fileDesc.getFileLength();
        boolean fileDescMissingDiskIds = false;
        // Iterate over all blocks of the file
        for (int j = 0; j < fileDesc.getNumFileBlocks(); ++j) {
          FbFileBlock block = fileDesc.getFbFileBlock(j);
          // Get the number of replicas
          int replicaHostCount = FileBlock.getNumReplicaHosts(block);
          if (replicaHostCount == 0) {
            // we didn't get locations for this block; for now, just ignore the block
            // TODO: do something meaningful with that
            continue;
          }
          // Collect the network address and volume ID of all replicas of this block.
          List<TScanRangeLocation> locations = Lists.newArrayList();
          // For each replica, record the disk (volume) ID and whether it is cached
          for (int i = 0; i < replicaHostCount; ++i) {
            TScanRangeLocation location = new TScanRangeLocation();
            ...............................
            location.setVolume_id(FileBlock.getDiskId(block, i));
            location.setIs_cached(FileBlock.isReplicaCached(block, i));
            locations.add(location);
          }
          // Build the TScanRange objects. By default one range covers a whole block and starts at the block's offset.
          // create scan ranges, taking into account maxScanRangeLength
          long currentOffset = FileBlock.getOffset(block);
          long remainingLength = FileBlock.getLength(block);
          while (remainingLength > 0) {
            long currentLength = remainingLength;
            if (maxScanRangeLength > 0 && remainingLength > maxScanRangeLength) {
              currentLength = maxScanRangeLength;
            }
            TScanRange scanRange = new TScanRange();
            scanRange.setHdfs_file_split(new THdfsFileSplit(fileDesc.getFileName(),
                currentOffset, currentLength, partition.getId(), fileDesc.getFileLength(),
                fileDesc.getFileCompression().toThrift(),
                fileDesc.getModificationTime()));
            TScanRangeLocationList scanRangeLocations = new TScanRangeLocationList();
            scanRangeLocations.scan_range = scanRange;
            scanRangeLocations.locations = locations;
            scanRanges_.add(scanRangeLocations);
            remainingLength -= currentLength;
            currentOffset += currentLength;
          }
        }
        ...........................
      }
      if (partitionMissingDiskIds) ++numPartitionsNoDiskIds_;
    }
    return fileFormats;
  }
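
The splitting loop at the end of the method is easier to follow in isolation. The sketch below repeats the same loop on plain numbers; ScanRangeSplitter and split are hypothetical names for illustration, and each range is returned as an {offset, length} pair instead of a TScanRange.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone version of the range-splitting loop, for illustration only. */
final class ScanRangeSplitter {
    /** Returns {offset, length} pairs that together cover one block. */
    static List<long[]> split(long blockOffset, long blockLength, long maxScanRangeLength) {
        List<long[]> ranges = new ArrayList<>();
        long currentOffset = blockOffset;
        long remainingLength = blockLength;
        while (remainingLength > 0) {
            long currentLength = remainingLength;
            // A limit of 0 (or less) means "no limit": one range per block.
            if (maxScanRangeLength > 0 && remainingLength > maxScanRangeLength) {
                currentLength = maxScanRangeLength;
            }
            ranges.add(new long[] {currentOffset, currentLength});
            remainingLength -= currentLength;
            currentOffset += currentLength;
        }
        return ranges;
    }
}
```

For a block at offset 256 with length 100 and a limit of 40, this yields three ranges: (256, 40), (296, 40) and (336, 20). With the limit at 0 the whole block becomes a single range.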

Detailed flow


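1. The scan node starts multiple threads to scan. By default it starts one scanner thread per block, up to the number of CPU cores.
2. Scan results are put into a blocking queue batch by batch, one rowbatch at a time.
3. The main thread takes data from the queue and sends it to the exchange over RPC; the destination is chosen by hashing the partition by columns.
4. The sort node receives data one rowbatch at a time and puts it into a run until the run is full, then sorts the run with quicksort. If there are multiple runs at the end, they are merge sorted.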
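
Step 4 is the classic "sort each run, then merge the runs" pattern. The following is a minimal in-memory sketch of it, not Impala's sorter: RunSorter is a hypothetical name, rows are reduced to plain ints, runSize is assumed to be positive, and Arrays.sort stands in for the per-run quicksort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sort fixed-size runs, then k-way merge them. */
final class RunSorter {
    static int[] sort(int[] rows, int runSize) {
        // Phase 1: cut the input into runs and sort each run on its own.
        List<int[]> runs = new ArrayList<>();
        for (int start = 0; start < rows.length; start += runSize) {
            int end = Math.min(rows.length, start + runSize);
            int[] run = Arrays.copyOfRange(rows, start, end);
            Arrays.sort(run); // stand-in for the in-memory sort of one run
            runs.add(run);
        }
        // Phase 2: k-way merge. Heap entries are {value, run index, position}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            heap.add(new int[] {runs.get(r)[0], r, 0});
        }
        int[] out = new int[rows.length];
        int n = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out[n++] = top[0];
            int[] run = runs.get(top[1]);
            int next = top[2] + 1;
            // Refill the heap from the run the smallest value came from.
            if (next < run.length) {
                heap.add(new int[] {run[next], top[1], next});
            }
        }
        return out;
    }
}
```

The merge phase only ever compares the current head of each run, which is why inputs that already arrive as sorted runs can skip most of the cost of a full sort.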

BE initialization (code details, can be skipped for now)

In HdfsScanNodeBase::Prepare, the required parameters are read from scan_range_params_.

Before HdfsScanNodeBase::Prepare is called, SetScanRanges of the parent class ScanNode must be called to set that parameter.


  void SetScanRanges(const std::vector<TScanRangeParams>& scan_range_params) {
    scan_range_params_ = &scan_range_params;
  }

hdfs-scan-node-base.cc
The file descriptors (HdfsFileDesc) are looked up in the map-typed variable per_type_files_.

  /// File format => file descriptors.
  typedef std::map<THdfsFileFormat::type, std::vector<HdfsFileDesc*>>
    FileFormatsMap;
  FileFormatsMap per_type_files_;
  

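HdfsParquetScanner::InitColumns
Initializes the ScanRange for each column to be read. Parquet is a columnar format, so data is read column by column; the column length is taken from the file metadata to obtain the offset and length that the ScanRange needs.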
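
As a rough sketch of how an offset and length can be derived from a Parquet column chunk's metadata: a chunk's bytes start at its dictionary page when it has one, otherwise at its first data page, and span the chunk's total compressed size. ColumnRange and rangeFor are hypothetical names, and passing a dictionary offset of 0 or less to mean "no dictionary page" is an assumption of this sketch, not something taken from the Impala source.

```java
/** Hypothetical helper: column chunk metadata to an {offset, length} pair. */
final class ColumnRange {
    /**
     * @param dataPageOffset       file offset of the first data page
     * @param dictionaryPageOffset file offset of the dictionary page,
     *                             or <= 0 when the chunk has none (sketch convention)
     * @param totalCompressedSize  compressed size of the whole column chunk
     */
    static long[] rangeFor(long dataPageOffset, long dictionaryPageOffset,
                           long totalCompressedSize) {
        // The dictionary page, when present, precedes the data pages.
        long start = dictionaryPageOffset > 0 ? dictionaryPageOffset : dataPageOffset;
        return new long[] {start, totalCompressedSize};
    }
}
```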

The ranges are added to the unstarted_scan_ranges queue through RequestContext::AddRequestRange.


void RequestContext::AddRequestRange(
   RequestRange* range, bool schedule_immediately) {
 ......................
 bool schedule_context;
 // Read request
 if (range->request_type() == RequestType::READ) {
   ScanRange* scan_range = static_cast<ScanRange*>(range);
   if (schedule_immediately) {
     ScheduleScanRange(scan_range);
   } else {
     state.unstarted_scan_ranges()->Enqueue(scan_range);
     num_unstarted_scan_ranges_.Add(1);
   }
   .....................
 } else {
   // Write request
   DCHECK(range->request_type() == RequestType::WRITE);
   DCHECK(!schedule_immediately);
   WriteRange* write_range = static_cast<WriteRange*>(range);
 .....................
}
