Impala source-level optimization

大数据技术与应用实战

Category: impala. Tags: impala, optimization, source code, ordered, sorting

Article link: https://blog.csdn.net/zhangjun5965/article/details/80181245

Contents

Impala overall architecture
Creating the analytic function
Optimizing the analytic function's sort

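Impala overall architecture

Reference article: https://www.cnblogs.com/Rainbow-G/articles/4282444.html

1. Impala consists of a Java frontend (FE) and a C++ backend (BE). The FE builds the execution plan tree and sends it to the BE over Thrift, and the BE carries out the actual execution.
2. The Impala client is written in Python and sends its requests over Thrift to the BE's impalad, which executes them.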

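Creating the analytic function

Writing the UDAF

The functions that Impala currently provides do not cover our ordered-funnel requirement, so we wrote our own UDAF to implement it.
For the detailed procedure, see: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_udf.html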
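
The ordered-funnel idea can be sketched as follows. This is a minimal illustration in Java rather than the author's C++ UDA: the class and method names are hypothetical, and the rule "advance only when the event is exactly the next step" is an assumption about what an update function such as FunnelUpdate might do.

```java
import java.util.List;

/** Hypothetical sketch of an ordered-funnel aggregate; not the author's UDA. */
final class OrderedFunnelSketch {
    /** UDA-style update: advance only when the event is the next expected step. */
    static int update(int reached, int step) {
        return step == reached + 1 ? step : reached;
    }

    /** Folds update() over events that are already sorted by time. */
    static int maxStep(List<Integer> stepsInTimeOrder) {
        int reached = 0; // 0 means "no funnel step reached yet"
        for (int step : stepsInTimeOrder) {
            reached = update(reached, step);
        }
        return reached;
    }

    public static void main(String[] args) {
        // Step 4 arrives before step 3, so it is ignored until 3 has been seen.
        System.out.println(maxStep(List.of(1, 0, 2, 2, 4, 3)));
    }
}
```

Because the result depends on the order of events, the function only makes sense when the rows of each user arrive sorted by time, which is why it needs an OVER (... ORDER BY ...) clause.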

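Adding a parameter to support OVER

By default, however, a UDAF we create can only be used as a GROUP BY aggregate function, not as an analytic function with OVER, so this part has to be modified.

The class involved is org.apache.impala.catalog.AggregateFunction in the FE.
It has an isAnalyticFn_ field; when the field is true, the function can be used as an analytic function with OVER.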


public class AggregateFunction extends Function {
   .......................
  // True if this function can appear within an analytic expr (fn() OVER(...)).
  // TODO: Instead of manually setting this flag for all builtin aggregate functions
  // we should identify this property from the function itself (e.g., based on which
  // functions of the UDA API are implemented).
  // Currently, there is no reliable way of doing that.
  private boolean isAnalyticFn_;

  ..................
}

The main change is an extra parameter at function creation time, ANALYSIS='true'. Its value is passed to the isAnalyticFn_ field of AggregateFunction, so that findmaxpage(int) over(…) can then be used.


 CREATE AGGREGATE FUNCTION findmaxpage(INT)                              
  RETURNS INT                                                                   
  LOCATION 'hdfs://localhost/impala_lib/libudasample.so'                        
  UPDATE_FN='FunnelUpdate' 
 ANALYSIS='true'
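
How such a flag could travel from the CREATE statement to the analysis-time check can be sketched as below. Everything here is a simplified, hypothetical stand-in: AggFnSketch, fromOptions and checkUsableInOver are invented for illustration and are not Impala's real classes or methods; only the ANALYSIS key and the idea of an isAnalyticFn flag come from the text above.

```java
import java.util.Map;

/** Hypothetical, simplified stand-in for the FE catalog class; not Impala's code. */
final class AggFnSketch {
    private final String name;
    private final boolean isAnalyticFn;

    AggFnSketch(String name, boolean isAnalyticFn) {
        this.name = name;
        this.isAnalyticFn = isAnalyticFn;
    }

    /** Builds the function from CREATE FUNCTION key/value options (ANALYSIS is the custom key). */
    static AggFnSketch fromOptions(String name, Map<String, String> options) {
        boolean analytic = Boolean.parseBoolean(options.getOrDefault("ANALYSIS", "false"));
        return new AggFnSketch(name, analytic);
    }

    boolean isAnalyticFn() {
        return isAnalyticFn;
    }

    /** Mimics the analysis-time check applied to fn(...) OVER (...). */
    void checkUsableInOver() {
        if (!isAnalyticFn) {
            throw new IllegalStateException(name + " cannot be used with an OVER clause");
        }
    }
}
```

With this shape, a function created without the option keeps the default behaviour (aggregate only), and only functions that opt in pass the OVER check.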

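Modifying the source to support the parameter

The main work is to modify the Thrift file that corresponds to AggregateFunction and regenerate the Java, C++, and Python versions of the generated files, and to modify the SQL parsing code so that it recognizes the ANALYSIS field.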

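Optimizing the analytic function's sort

Overall execution flow

Overall class diagram

The Java frontend analyzes the user's SQL query and generates the execution plan tree. Different operations map to different PlanNode subclasses, such as SelectNode, ScanNode, SortNode, AggregationNode, and HashJoinNode.

The FE produces the execution plan, fills in the information for each node, serializes the plan, and sends it to the BE over Thrift. In general each node has a corresponding serialized object; for example, SortNode corresponds to TSortNode.java.

The overall class relationships are as follows: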

[Figure: overall class diagram. The image was not preserved; original source: https://note.youdao.com/yws/public/resource/8a4cd9cf20be62512c0c57c4b8690c6f/xmlnote/657A1F1F1EED4F3FB33EA8E1D42D8B0D/12594]
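
To make the "fill in the nodes, serialize, send" step more concrete, here is a small sketch of one common way to ship a tree through a flat message format: a preorder list in which every entry records how many children it has. The Node and FlatNode classes are hypothetical and only loosely play the roles of a PlanNode and its Thrift counterpart; the sketch assumes this layout and does not reproduce Impala's generated Thrift classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: flatten a plan tree in preorder and rebuild it on the other side. */
final class PlanTreeSketch {
    /** In-memory plan node (stand-in for a PlanNode subclass). */
    static final class Node {
        final String type;
        final List<Node> children = new ArrayList<>();

        Node(String type, Node... kids) {
            this.type = type;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Flat record that could be serialized (stand-in for a Thrift struct). */
    static final class FlatNode {
        final String type;
        final int numChildren;

        FlatNode(String type, int numChildren) {
            this.type = type;
            this.numChildren = numChildren;
        }
    }

    /** Sender side: preorder traversal, parent before its children. */
    static List<FlatNode> flatten(Node root) {
        List<FlatNode> out = new ArrayList<>();
        flatten(root, out);
        return out;
    }

    private static void flatten(Node node, List<FlatNode> out) {
        out.add(new FlatNode(node.type, node.children.size()));
        for (Node child : node.children) {
            flatten(child, out);
        }
    }

    /** Receiver side: consume the list in the same order to restore the tree. */
    static Node rebuild(List<FlatNode> flat) {
        return rebuild(flat, new int[] {0});
    }

    private static Node rebuild(List<FlatNode> flat, int[] pos) {
        FlatNode f = flat.get(pos[0]++);
        Node node = new Node(f.type);
        for (int i = 0; i < f.numChildren; i++) {
            node.children.add(rebuild(flat, pos));
        }
        return node;
    }
}
```

For a chain such as ANALYTIC over SORT over EXCHANGE over SCAN, flatten() yields four entries in that order, and rebuild() restores the same parent-child chain.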

SQL executed

select day,s,count(1) from
  (
    select day,id,max(step) as s from
    (
      select day, id,
      logs.findmaxpage54(
        case
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.homepage.click' and lower(f.cspot) in ('微信首页国际酒店tab按钮') and f.bns in (2) then 1
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.listpage.load'  and f.bns in (2) then 2
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.detailpage.load' and f.bns in (2) then 3
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.fillingorderpage.load' and f.bns in (2) then 4
        when concat(platform,'.',biz,'.',page,'.',et)='ui.ihotel.ordercreatedpage.load' and f.bns in (2) then 5
        else 0
        end
        ) over (partition by day,id order by datestamp) as step
        from logs.ui8 f
        where    bns = 2 and dateline = day  and day between 20180410 and 20180416   and lower(concat(platform,'.',biz,'.',page,'.',et)) in ('ui.ihotel.homepage.click','ui.ihotel.listpage.load','ui.ihotel.detailpage.load','ui.ihotel.fillingorderpage.load','ui.ihotel.ordercreatedpage.load')
      ) a group by day,id
    )b where s > 0
group by day,s order by day,s

Generated execution plan:


Part omitted ...................


|
02:ANALYTIC
|  functions: logs.findmaxpage54(CASE WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.homepage.click' AND lower(cspot) IN ('微信首页国际酒店tab按钮') AND bns IN (2) THEN 1 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.listpage.load' AND bns IN (2) THEN 2 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.detailpage.load' AND bns IN (2) THEN 3 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.fillingorderpage.load' AND bns IN (2) THEN 4 WHEN concat(platform, '.', biz, '.', page, '.', et) = 'ui.ihotel.ordercreatedpage.load' AND bns IN (2) THEN 5 ELSE 0 END)
|  partition by: day, id
|  order by: datestamp ASC
|  window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
|  tuple-ids=9,8 row-size=104B cardinality=unavailable
|
01:SORT
|  order by: day ASC NULLS FIRST, id ASC NULLS FIRST, datestamp ASC
|  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
|  tuple-ids=9 row-size=100B cardinality=unavailable
|
06:EXCHANGE [HASH(day,id)]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=0 row-size=100B cardinality=unavailable
|
F00:PLAN FRAGMENT [RANDOM] hosts=2 instances=2
Per-Host Resources: mem-estimate=792.00MB mem-reservation=0B
00:SCAN HDFS [logs.ui8 f, RANDOM]
   partitions=7/15 files=14 size=2.18GB
   predicates: bns = 2, dateline = day, f.dateline <= 20180416, f.dateline >= 20180410, lower(concat(platform, '.', biz, '.', page, '.', et)) IN ('ui.ihotel.homepage.click', 'ui.ihotel.listpage.load', 'ui.ihotel.detailpage.load', 'ui.ihotel.fillingorderpage.load', 'ui.ihotel.ordercreatedpage.load')
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=unavailable
   columns missing stats: id, platform, biz, page, et, cspot, bns, datestamp, dateline
   parquet statistics predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
   parquet dictionary predicates: bns = 2, f.dateline <= 20180416, f.dateline >= 20180410
   mem-estimate=792.00MB mem-reservation=0B
   tuple-ids=0 row-size=100B cardinality=unavailable
----------------

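Flow analysis

Execution proceeds from the bottom of the plan upward.

Data starts at the scan node at the bottom, flows to the exchange node, and then to the sort node.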
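
The exchange step in the plan above is labeled HASH(day,id): every row is routed to a receiver chosen from a hash of the partition columns, so that all rows of one (day, id) pair end up on the same node before sorting. A minimal sketch of that routing follows. The Row class is hypothetical, and Objects.hash stands in for whatever hash function the backend really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of hash partitioning on (day, id); not Impala's exchange code. */
final class HashExchangeSketch {
    static final class Row {
        final int day;
        final String id;
        final long datestamp;

        Row(int day, String id, long datestamp) {
            this.day = day;
            this.id = id;
            this.datestamp = datestamp;
        }
    }

    /** Picks the receiver for a row; floorMod keeps the index non-negative. */
    static int destination(Row row, int numReceivers) {
        return Math.floorMod(Objects.hash(row.day, row.id), numReceivers);
    }

    /** Splits the rows into one list per receiver. */
    static List<List<Row>> partition(List<Row> rows, int numReceivers) {
        List<List<Row>> out = new ArrayList<>();
        for (int i = 0; i < numReceivers; i++) {
            out.add(new ArrayList<>());
        }
        for (Row row : rows) {
            out.get(destination(row, numReceivers)).add(row);
        }
        return out;
    }
}
```

Since datestamp does not take part in the hash, rows of the same user and day with different timestamps always share a receiver, which is what lets each receiver sort and evaluate its partitions independently.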

Inspecting the query's execution details to find the bottleneck

Use the execution times to find the slowest part.

Example: http://ip:25000/queries

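BE-side execution optimization: replacing the full sort with pre-sort + merge sort

Computing the HDFS scan ranges

This is done in the FE, in
org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(Analyzer)

Getting the table's partition information
1. The partition information is obtained from the metadata of the queried table and stored in the List-typed variable partitions_.
2. Loop over the partitions and get all files (FileDescriptor) under each one.
3. Get all blocks of each file and, for each block, construct a scan range object TScanRange (start offset, length, and so on) and scan range location objects TScanRangeLocation (the disk ID, whether the replica is cached, and so on).
4. By default one scan range, called a split, is created per block. The splits are passed to the backend, where each thread scans one split in batches, one RowBatch (1024 rows by default) at a time.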


  private Set<HdfsFileFormat> computeScanRangeLocations(Analyzer analyzer)
      throws ImpalaRuntimeException {
  ...............................

    // Get the maximum scan range length
    long maxScanRangeLength = analyzer.getQueryCtx().client_request.getQuery_options()
        .getMax_scan_range_length();

    // Loop over all partitions
    for (HdfsPartition partition: partitions_) {
     ...............................
      // Get all files in this partition
      List<FileDescriptor> fileDescs = partition.getFileDescriptors();
      ...............................
      for (FileDescriptor fileDesc: fileDescs) {
        totalBytes_ += fileDesc.getFileLength();
        boolean fileDescMissingDiskIds = false;
        // Get all blocks of the file
        for (int j = 0; j < fileDesc.getNumFileBlocks(); ++j) {
          FbFileBlock block = fileDesc.getFbFileBlock(j);
          // Get the number of replicas
          int replicaHostCount = FileBlock.getNumReplicaHosts(block);
          if (replicaHostCount == 0) {
            // we didn't get locations for this block; for now, just ignore the block
            // TODO: do something meaningful with that
            continue;
          }
          // Collect the network address and volume ID of all replicas of this block.
          List<TScanRangeLocation> locations = Lists.newArrayList();
          // For each replica, record the disk that holds the block and whether it is cached
          for (int i = 0; i < replicaHostCount; ++i) {
            TScanRangeLocation location = new TScanRangeLocation();
            ...............................
            location.setVolume_id(FileBlock.getDiskId(block, i));
            location.setIs_cached(FileBlock.isReplicaCached(block, i));
            locations.add(location);
          }
          // Build the TScanRange objects. By default the scan length is the length of
          // one block, and the scan starts at the block's offset.
          // create scan ranges, taking into account maxScanRangeLength
          long currentOffset = FileBlock.getOffset(block);
          long remainingLength = FileBlock.getLength(block);
          while (remainingLength > 0) {
            long currentLength = remainingLength;
            if (maxScanRangeLength > 0 && remainingLength > maxScanRangeLength) {
              currentLength = maxScanRangeLength;
            }
            TScanRange scanRange = new TScanRange();
            scanRange.setHdfs_file_split(new THdfsFileSplit(fileDesc.getFileName(),
                currentOffset, currentLength, partition.getId(), fileDesc.getFileLength(),
                fileDesc.getFileCompression().toThrift(),
                fileDesc.getModificationTime()));
            TScanRangeLocationList scanRangeLocations = new TScanRangeLocationList();
            scanRangeLocations.scan_range = scanRange;
            scanRangeLocations.locations = locations;
            scanRanges_.add(scanRangeLocations);
            remainingLength -= currentLength;
            currentOffset += currentLength;
          }
        }
        ...........................
      }
      if (partitionMissingDiskIds) ++numPartitionsNoDiskIds_;
    }
    return fileFormats;
  }
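
The inner while loop is the part that decides how many scan ranges a block produces. Isolated from the Thrift types, it can be sketched as below; the class name and the {offset, length} array representation are illustrative choices, while the loop itself follows the listing above.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the block-splitting loop from the listing above, without the Thrift types. */
final class ScanRangeSplitSketch {
    /**
     * Returns {offset, length} pairs covering one block.
     * A maxScanRangeLength of 0 or less means "no limit", giving one range per block.
     */
    static List<long[]> split(long blockOffset, long blockLength, long maxScanRangeLength) {
        List<long[]> ranges = new ArrayList<>();
        long currentOffset = blockOffset;
        long remainingLength = blockLength;
        while (remainingLength > 0) {
            long currentLength = remainingLength;
            if (maxScanRangeLength > 0 && remainingLength > maxScanRangeLength) {
                currentLength = maxScanRangeLength;
            }
            ranges.add(new long[] {currentOffset, currentLength});
            remainingLength -= currentLength;
            currentOffset += currentLength;
        }
        return ranges;
    }
}
```

For example, a block at offset 100 with length 250 and a limit of 100 is cut into ranges starting at 100, 200 and 300, with the last one only 50 bytes long; with the limit unset the same block stays a single range.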

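Detailed flow

1. The scan node starts multiple threads to scan; by default it starts one scanner thread per block, up to the number of CPU cores.
2. The scan results are put into a blocking queue in batches, one RowBatch at a time.
3. The main thread takes data from the queue and sends it to the exchange over RPC; the destination is chosen by hashing the PARTITION BY columns.
4. The sort node receives the data one RowBatch at a time and appends it to a run until the run is full, then sorts the run with quicksort. If there are multiple runs at the end, they are combined with a merge sort.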
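
Step 4 is the classic "sorted runs, then merge" pattern, and it is also the shape of the pre-sort + merge sort idea in this section's title. A minimal sketch on plain integers follows. The class name and run capacity are hypothetical, and Collections.sort merely stands in for the in-memory sort of a run (it is not a quicksort).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sort fixed-size runs, then k-way merge them with a heap. */
final class RunMergeSketch {
    /** Cuts the input into runs of at most runCapacity rows and sorts each run. */
    static List<List<Integer>> sortedRuns(List<Integer> input, int runCapacity) {
        if (runCapacity <= 0) {
            throw new IllegalArgumentException("runCapacity must be positive");
        }
        List<List<Integer>> runs = new ArrayList<>();
        for (int start = 0; start < input.size(); start += runCapacity) {
            int end = Math.min(start + runCapacity, input.size());
            List<Integer> run = new ArrayList<>(input.subList(start, end));
            Collections.sort(run); // stands in for the in-memory sort of one run
            runs.add(run);
        }
        return runs;
    }

    /** K-way merge; each heap entry is {runIndex, positionInRun}. */
    static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(runs.get(a[0]).get(a[1]), runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<Integer> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1}); // advance within the same run
            }
        }
        return out;
    }
}
```

The merge never re-sorts data inside a run; it only compares the current head of each run. If the inputs already arrive as sorted streams, the run-building phase can be skipped and only the merge remains.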

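BE initialization (code details; can be skipped for now)

In HdfsScanNodeBase::Prepare, the relevant parameters are read from scan_range_params_.

Before HdfsScanNodeBase::Prepare is called, the SetScanRanges method of its parent class ScanNode must be called to set them.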


  void SetScanRanges(const std::vector<TScanRangeParams>& scan_range_params) {
    scan_range_params_ = &scan_range_params;
  }

hdfs-scan-node-base.cc
The file descriptors (HdfsFileDesc) are looked up in the map-typed variable per_type_files_.

  /// File format => file descriptors.
  typedef std::map<THdfsFileFormat::type, std::vector<HdfsFileDesc*>>
    FileFormatsMap;
  FileFormatsMap per_type_files_;

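HdfsParquetScanner::InitColumns
This initializes the ScanRange for each column to be read. Because Parquet is a columnar format, data is read column by column: the length of each column is taken from the file metadata, which gives the offset and length that the ScanRange needs.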
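
As a rough sketch of how a column's byte range can be derived from Parquet metadata: a column chunk starts at its dictionary page when one is present (otherwise at its first data page), and its total compressed size gives the length. The ColumnMeta class below is a hypothetical subset of the Parquet column-chunk metadata fields, written in Java for illustration, and is not the scanner's actual code.

```java
/** Hypothetical sketch: derive a column chunk's {offset, length} from Parquet metadata. */
final class ColumnChunkRangeSketch {
    /** Subset of Parquet column-chunk metadata; field names are illustrative. */
    static final class ColumnMeta {
        final long dataPageOffset;
        final Long dictionaryPageOffset; // null when the chunk has no dictionary page
        final long totalCompressedSize;

        ColumnMeta(long dataPageOffset, Long dictionaryPageOffset, long totalCompressedSize) {
            this.dataPageOffset = dataPageOffset;
            this.dictionaryPageOffset = dictionaryPageOffset;
            this.totalCompressedSize = totalCompressedSize;
        }
    }

    /** Returns {startOffset, length} for the bytes that hold this column chunk. */
    static long[] scanRange(ColumnMeta meta) {
        long start = meta.dataPageOffset;
        // The dictionary page, if any, is stored before the data pages of the chunk.
        if (meta.dictionaryPageOffset != null && meta.dictionaryPageOffset < start) {
            start = meta.dictionaryPageOffset;
        }
        return new long[] {start, meta.totalCompressedSize};
    }
}
```

Only the columns that the query references need such a range, which is why a columnar scan can skip most of the bytes in a wide table.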

The ranges are added to the unstarted_scan_ranges queue via RequestContext::AddRequestRange.


void RequestContext::AddRequestRange(
   RequestRange* range, bool schedule_immediately) {
 ......................
 bool schedule_context;
 // Read request
 if (range->request_type() == RequestType::READ) {
   ScanRange* scan_range = static_cast<ScanRange*>(range);
   if (schedule_immediately) {
     ScheduleScanRange(scan_range);
   } else {
     state.unstarted_scan_ranges()->Enqueue(scan_range);
     num_unstarted_scan_ranges_.Add(1);
   }
   .....................
 } else {
   // Write request
   DCHECK(range->request_type() == RequestType::WRITE);
   DCHECK(!schedule_immediately);
   WriteRange* write_range = static_cast<WriteRange*>(range);
 .....................
}
