Impala 3.4 Source Code Reading Notes (6): ScanRange Assignment

Eyizoha

Article link: https://blog.csdn.net/Eyizoha/article/details/131554492

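Preface

These are my personal notes from reading the Apache Impala source code. They reflect only my own understanding of the code. My knowledge is limited, so the article may contain misunderstandings, omissions, or outdated material. Corrections and better insights are welcome.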

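Background

A ScanRange can be thought of simply as an abstraction of a byte range within a data file. Before reading a table, Impala generates a number of ScanRanges, which the ScanNodes later process to produce data. In an Impala cluster, the Coordinator has to assign the generated ScanRanges to the Executors that will run the query. The assignment has to weigh several factors: the "distance" (that is, the read cost) between each Executor and the DataNodes that store the data, whether load is balanced across Executors, and whether the assignment is even and free of skew. In addition, since the introduction of the file handle cache (see Impala 3.4 Source Code Reading Notes (4): the file-handle-cache feature) and the data cache (see Impala 3.4 Source Code Reading Notes (1): the data-cache feature), each Executor has its own caches. ScanRange assignment has to take these caches into account as well; otherwise the caches on different nodes end up holding largely the same content, which significantly lowers cache utilization and hit rate.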

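The ScanRange Assignment Process

ScanRange assignment is clearly not a trivial problem, so let's walk through the code to see how Impala handles it, starting from the entry point. Before the Coordinator dispatches a query to the Executors, it has to plan and schedule the query, that is, decide which work goes to which Executor. This is implemented by the Scheduler class in its member function Scheduler::Schedule. The key code is: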

// ExecutorConfig holds the local Backend plus a group of other Executors for the scheduler to use;
// QuerySchedule stores the scheduling result
Status Scheduler::Schedule(
    const ExecutorConfig& executor_config, QuerySchedule* schedule) {
  ...
  // Assign the ScanRanges
  RETURN_IF_ERROR(ComputeScanRangeAssignment(executor_config, schedule));
  // Compute and fill in the fragment execution parameters, mainly the routing
  // information between fragments
  ComputeFragmentExecParams(executor_config, schedule);
  // Compute and fill in the execution parameters for every participating Backend
  ComputeBackendExecParams(executor_config, schedule);
  ...
}

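ComputeScanRangeAssignment is where ScanRanges get assigned. The function has an overload that we will come back to later; first, here is the version called above (non-essential code removed):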

Status Scheduler::ComputeScanRangeAssignment(
    const ExecutorConfig& executor_config, QuerySchedule* schedule) {
  ...
  // Get the execution request
  const TQueryExecRequest& exec_request = schedule->request();

  // Outer loop: iterate over every TPlanExecInfo
  for (const TPlanExecInfo& plan_exec_info : exec_request.plan_exec_info) {
    // Inner loop: iterate over every ScanNode in the plan and its ScanRanges
    // per_node_scan_ranges is a map from ScanNode ID to ScanRanges
    for (const auto& entry : plan_exec_info.per_node_scan_ranges) {
      // Get the node ID and the fragment that contains the node
      const TPlanNodeId node_id = entry.first;
      const TPlanFragment& fragment = schedule->GetContainingFragment(node_id);

      // fragment.partition describes how the fragment's instances are partitioned.
      // UNPARTITIONED means the fragment has a single instance, so its ScanRanges
      // are executed on the Coordinator
      bool exec_at_coord = (fragment.partition.type == TPartitionType::UNPARTITIONED);

      // Whether the HDFS scan carries a replica preference; if so, it is honored
      // (`node` is the TPlanNode for node_id; its lookup is among the omitted lines)
      bool has_preference =
          node.__isset.hdfs_scan_node && node.hdfs_scan_node.__isset.replica_preference;
      const TReplicaPreference::type* node_replica_preference = has_preference ?
          &node.hdfs_scan_node.replica_preference :
          nullptr;

      // Whether replicas should be chosen at random
      bool node_random_replica = node.__isset.hdfs_scan_node
          && node.hdfs_scan_node.__isset.random_replica
          && node.hdfs_scan_node.random_replica;

      // assignment is the map that records the ScanRange assignment of one fragment
      FragmentScanRangeAssignment* assignment =
          &schedule->GetFragmentExecParams(fragment.idx)->scan_range_assignment;

      // A TScanRangeLocationList is one ScanRange plus its location information
      const vector<TScanRangeLocationList>* locations = nullptr;
      vector<TScanRangeLocationList> expanded_locations;
      // entry.second is a TScanRangeSpec, which can carry concrete_ranges
      // and/or descriptive split_specs

      if (entry.second.split_specs.empty()) {
        // No split specs: use the concrete ScanRanges directly
        locations = &entry.second.concrete_ranges;
      } else {
        // Split specs are present: generate ScanRanges from them and merge the
        // result with the existing concrete ranges
        expanded_locations.insert(expanded_locations.end(),
            entry.second.concrete_ranges.begin(), entry.second.concrete_ranges.end());
        RETURN_IF_ERROR(
            GenerateScanRanges(entry.second.split_specs, &expanded_locations));
        locations = &expanded_locations;
      }

      // Call the ComputeScanRangeAssignment overload that does the real assignment
      RETURN_IF_ERROR(
          ComputeScanRangeAssignment(executor_config, node_id, node_replica_preference,
              node_random_replica, *locations, exec_request.host_list, exec_at_coord,
              schedule->query_options(), total_assignment_timer, assignment));

      // Update the ScanRange count
      schedule->IncNumScanRanges(locations->size());
    }
  }
  return Status::OK();
}
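
To make the split_specs branch more concrete, here is a minimal sketch of how a file-level spec could be expanded into fixed-size ranges. FileSplitSpec, SimpleScanRange and ExpandSplitSpec are hypothetical stand-ins invented for this illustration; they are not Impala's Thrift structs, and this is not the body of GenerateScanRanges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; not Impala's Thrift types.
struct FileSplitSpec {
  int64_t file_length;     // total bytes in the file
  int64_t max_block_size;  // upper bound on the bytes in one range
  bool is_splittable;      // false: the file must be read as a single range
};

struct SimpleScanRange {
  int64_t offset;
  int64_t length;
};

// Appends the ranges described by `spec` to `out`, after whatever concrete
// ranges the caller has already placed there.
void ExpandSplitSpec(const FileSplitSpec& spec, std::vector<SimpleScanRange>* out) {
  assert(spec.max_block_size > 0);
  if (!spec.is_splittable) {
    out->push_back({0, spec.file_length});
    return;
  }
  for (int64_t offset = 0; offset < spec.file_length; offset += spec.max_block_size) {
    int64_t length = std::min(spec.max_block_size, spec.file_length - offset);
    out->push_back({offset, length});
  }
}
```

Under these assumptions, a splittable 250-byte file with a 100-byte block limit yields three ranges, the last one covering the remaining 50 bytes.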

This version of ComputeScanRangeAssignment only does the preparation work needed before ScanRanges can be assigned. The assignment logic itself is in the overload, which is a very long function. Again, non-essential code is removed:

Status Scheduler::ComputeScanRangeAssignment(const ExecutorConfig& executor_config,
    PlanNodeId node_id, const TReplicaPreference::type* node_replica_preference,
    bool node_random_replica, const vector<TScanRangeLocationList>& locations,
    const vector<TNetworkAddress>& host_list, bool exec_at_coord,
    const TQueryOptions& query_options, RuntimeProfile::Counter* timer,
    FragmentScanRangeAssignment* assignment) {
  // Get the ExecutorGroup, which describes the Executors that ScanRanges can be assigned to
  const ExecutorGroup& executor_group = executor_config.group;

  // No Executor is available and the scan does not run on the Coordinator: return an error
  if (executor_group.NumExecutors() == 0 && !exec_at_coord) {
    return Status(TErrorCode::NO_REGISTERED_BACKENDS);
  }

  // base_distance is the baseline distance: every replica whose memory distance is
  // smaller than base_distance is treated as being at base_distance.
  // Memory distance can be read as the cost of reading the ScanRange. TReplicaPreference
  // defines 5 levels based on where the Impalad sits relative to the HDFS DataNode,
  // but only 3 of them are actually used here
  TReplicaPreference::type base_distance = query_options.replica_preference;
  // A preference attached to the PlanNode has higher priority and overrides the query option
  if (node_replica_preference) base_distance = *node_replica_preference;

  // Random replica selection is enabled by either the PlanNode or the query option
  bool random_replica = query_options.schedule_random_replica || node_random_replica;

  // Build a temporary ExecutorGroup that contains only the local Executor
  // (the instance the Coordinator runs in)
  ExecutorGroup coord_only_executor_group("coordinator-only-group");
  const TBackendDescriptor& local_be_desc = executor_config.local_be_desc;
  coord_only_executor_group.AddExecutor(local_be_desc);
  VLOG_QUERY << "Exec at coord is " << (exec_at_coord ? "true" : "false");

  // AssignmentCtx stores the assignment context during scheduling and implements the
  // concrete assignment logic; it is covered in more detail below.
  // It is constructed with the ExecutorGroup that matches whether the scan runs on the Coordinator
  AssignmentCtx assignment_ctx(
      exec_at_coord ? coord_only_executor_group : executor_group, total_assignments_,
      total_local_assignments_);

  // Collects the ScanRanges that need a remote read, i.e. the Impalad and the
  // HDFS DataNode are on different physical machines
  vector<const TScanRangeLocationList*> remote_scan_range_locations;

  // Iterate over all ScanRanges: assign the non-remote ones first and collect the
  // rest for later.
  for (const TScanRangeLocationList& scan_range_locations : locations) {
    // min_distance is the smallest memory distance over all DataNode choices for this ScanRange
    TReplicaPreference::type min_distance = TReplicaPreference::REMOTE;

    // A ScanRange that runs on the Coordinator is assigned straight to local_be_desc
    // (the Coordinator's Backend)
    if (exec_at_coord) {
      // RecordScanRangeAssignment records the result in assignment
      assignment_ctx.RecordScanRangeAssignment(local_be_desc, node_id, host_list,
          scan_range_locations, assignment);
    } else {
      // Collect all Executors at the smallest memory distance as candidates
      vector<IpAddr> executor_candidates;
      // When the base distance is REMOTE, every read is treated as remote, so this
      // if-block is skipped entirely
      if (base_distance < TReplicaPreference::REMOTE) {
        // Iterate over all locations of this ScanRange (the DataNodes that hold a replica)
        for (const TScanRangeLocation& location : scan_range_locations.locations) {
          const TNetworkAddress& replica_host = host_list[location.host_idx];
          // The following statements determine the memory distance from the DataNode
          // host to the nearest Executor
          TReplicaPreference::type memory_distance = TReplicaPreference::REMOTE;
          IpAddr executor_ip;
          // Check whether an Executor runs on the DataNode host
          bool has_local_executor = assignment_ctx.executor_group().LookUpExecutorIp(
              replica_host.hostname, &executor_ip);
          // An Executor runs on the DataNode host
          if (has_local_executor) {
            // The DataNode has this ScanRange's data cached
            if (location.is_cached) {
              // CACHE_LOCAL is the smallest memory distance: an Executor can read the
              // data directly from the local cache
              memory_distance = TReplicaPreference::CACHE_LOCAL;
            } else {
              // DISK_LOCAL: an Executor can read the data directly from the local disk
              memory_distance = TReplicaPreference::DISK_LOCAL;
            }
          } else {
            // No Executor is deployed on the DataNode host, so the read must be remote
            memory_distance = TReplicaPreference::REMOTE;
          }
          // Any distance smaller than the base distance is treated as the base distance
          memory_distance = max(memory_distance, base_distance);

          // Only non-remote candidates need to be collected, because a remote read has
          // no preferred Executor.
          if (memory_distance < TReplicaPreference::REMOTE) {
            // Check whether this DataNode has a smaller memory distance than any seen so far.
            if (memory_distance < min_distance) {
              min_distance = memory_distance;
              executor_candidates.clear();
              executor_candidates.push_back(executor_ip);
            } else if (memory_distance == min_distance) {
              executor_candidates.push_back(executor_ip);
            }
          }
        }
      } // Candidate filtering is done; executor_candidates holds the candidate Executor IPs

      // A ScanRange whose minimum distance is CACHE_LOCAL is treated as cached
      bool cached_replica = min_distance == TReplicaPreference::CACHE_LOCAL;
      // Anything other than REMOTE means a DataNode holding a replica has a local Executor
      bool local_executor = min_distance != TReplicaPreference::REMOTE;
      // ScanRanges that need a remote read go into remote_scan_range_locations for later
      if (!local_executor) {
        remote_scan_range_locations.push_back(&scan_range_locations);
        continue;
      }
      // For ScanRanges with random replica selection or a cached replica, the random
      // rank is allowed to break ties
      bool decide_local_assignment_by_rank = random_replica || cached_replica;
      const IpAddr* executor_ip = nullptr;
      // SelectExecutorFromCandidates picks one IP from the candidate Executor IPs.
      // It chooses the host with the fewest assigned bytes. If several Executors tie and
      // decide_local_assignment_by_rank is true, the pre-generated random rank decides;
      // otherwise the first tied candidate is chosen.
      executor_ip = assignment_ctx.SelectExecutorFromCandidates(
          executor_candidates, decide_local_assignment_by_rank);
      TBackendDescriptor executor;
      // SelectExecutorOnHost picks one Executor on that host (a physical machine may run
      // several Executor instances)
      assignment_ctx.SelectExecutorOnHost(*executor_ip, &executor);
      // Record the assignment
      assignment_ctx.RecordScanRangeAssignment(
          executor, node_id, host_list, scan_range_locations, assignment);
    } // Done choosing one Executor among the candidates
  } // Done iterating over all ScanRanges

  // Next, assign the ScanRanges that need a remote read.
  // The limit on the number of candidate Executors for a remote read is the smaller of
  // the query option and the number of Executors
  int num_remote_executor_candidates =
      min(query_options.num_remote_executor_candidates, executor_group.NumExecutors());
  // Iterate over every ScanRange in remote_scan_range_locations
  for (const TScanRangeLocationList* scan_range_locations : remote_scan_range_locations) {
    const IpAddr* executor_ip;
    vector<IpAddr> remote_executor_candidates;
    // For ScanRanges over HDFS files, when num_remote_executor_candidates is greater than
    // zero, the number of candidate Executors is limited.
    // This optimization comes from IMPALA-7928: it keeps remote ScanRange assignment
    // consistent so that the caches on the Impalad nodes are used effectively
    if (scan_range_locations->scan_range.__isset.hdfs_file_split &&
        num_remote_executor_candidates > 0) {
      // GetRemoteExecutorCandidates returns the list of candidate Executors
      assignment_ctx.GetRemoteExecutorCandidates(
          &scan_range_locations->scan_range.hdfs_file_split,
          num_remote_executor_candidates, &remote_executor_candidates);
      // SelectExecutorFromCandidates picks one IP from the candidate Executor IPs
      executor_ip = assignment_ctx.SelectExecutorFromCandidates(
          remote_executor_candidates, random_replica);
    } else {
      // Fall back to SelectRemoteExecutor, the regular way of picking an Executor for a
      // remote read, which may choose among all Executors
      executor_ip = assignment_ctx.SelectRemoteExecutor();
    }
    TBackendDescriptor executor;
    // Pick one Executor on that host and record the assignment
    assignment_ctx.SelectExecutorOnHost(*executor_ip, &executor);
    assignment_ctx.RecordScanRangeAssignment(
        executor, node_id, host_list, *scan_range_locations, assignment);
  }
  ...
  return Status::OK();
}

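As we can see, ScanRange assignment is fairly involved. For non-remote ScanRanges, the assignment depends on where the DataNodes and Executors are deployed relative to each other and on whether the ScanRange is cached by the DataNode.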
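
The candidate filtering in the first loop can be sketched on its own. The version below is a simplified illustration, not Impala code: the Distance enum collapses TReplicaPreference to the three levels that are actually used, and Replica, CandidateResult and CollectCandidates are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for TReplicaPreference; a smaller value is cheaper to read.
enum class Distance { CACHE_LOCAL = 0, DISK_LOCAL = 1, REMOTE = 2 };

// Hypothetical description of one replica location.
struct Replica {
  std::string host;
  bool has_local_executor;  // an Executor runs on the same host as the DataNode
  bool is_cached;           // the DataNode has the data cached
};

struct CandidateResult {
  Distance min_distance = Distance::REMOTE;
  std::vector<std::string> hosts;  // stays empty when the range must be read remotely
};

CandidateResult CollectCandidates(const std::vector<Replica>& replicas, Distance base) {
  CandidateResult result;
  for (const Replica& r : replicas) {
    Distance d = Distance::REMOTE;
    if (r.has_local_executor) {
      d = r.is_cached ? Distance::CACHE_LOCAL : Distance::DISK_LOCAL;
    }
    // Distances below the base distance are rounded up to it.
    d = std::max(d, base);
    if (d == Distance::REMOTE) continue;  // remote reads have no preferred host
    if (d < result.min_distance) {
      result.min_distance = d;
      result.hosts.clear();
    }
    if (d == result.min_distance) result.hosts.push_back(r.host);
  }
  return result;
}
```

With a base distance of DISK_LOCAL, a cached replica and a disk-only replica end up at the same distance, so both hosts become candidates and the choice falls through to the load-based tie-breaking.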

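Assignment Algorithm Details: the AssignmentCtx Class

The assignment logic for Remote ScanRanges, by contrast, is concentrated in the AssignmentCtx object. Let's go through a few of its key methods, starting with SelectExecutorFromCandidates, which picks one Executor IP from a list of candidate IPs: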

const IpAddr* Scheduler::AssignmentCtx::SelectExecutorFromCandidates(
    const std::vector<IpAddr>& data_locations, bool break_ties_by_rank) {
  // Indexes into data_locations of the candidates with the fewest assigned bytes so far
  vector<int> candidates_idxs;
  // Use the heap maintained by this context to find the locations with the fewest
  // assigned bytes
  int64_t min_assigned_bytes = numeric_limits<int64_t>::max();
  // Iterate over every candidate Executor
  for (int i = 0; i < data_locations.size(); ++i) {
    const IpAddr& executor_ip = data_locations[i];
    int64_t assigned_bytes = 0;
    // Look the Executor up in the heap, which is ordered by the bytes already assigned
    // to each Executor
    auto handle_it = assignment_heap_.find(executor_ip);
    if (handle_it != assignment_heap_.end()) {
      // Get the number of bytes already assigned to this Executor
      assigned_bytes = (*handle_it->second).assigned_bytes;
    }
    if (assigned_bytes < min_assigned_bytes) {
      // Fewer bytes than anything seen so far: clear candidates_idxs and start over
      candidates_idxs.clear();
      min_assigned_bytes = assigned_bytes;
    }
    // Add to the list of Executors with the fewest assigned bytes
    if (assigned_bytes == min_assigned_bytes) candidates_idxs.push_back(i);
  }

  // By default, choose the first candidate in the list
  auto min_rank_idx = candidates_idxs.begin();
  if (break_ties_by_rank) {
    // If break_ties_by_rank is set, choose the Executor with the smallest random rank.
    // The rank is a unique number generated at random for each Executor when the
    // AssignmentCtx is initialized
    min_rank_idx = min_element(candidates_idxs.begin(), candidates_idxs.end(),
        [&data_locations, this](const int& a, const int& b) {
          return GetExecutorRank(data_locations[a]) < GetExecutorRank(data_locations[b]);
        });
  }
  return &data_locations[*min_rank_idx];
}
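
The selection rule (fewest assigned bytes, ties broken by rank or by position) can be sketched without the heap. This is a simplified illustration under my own assumptions: PickLeastLoaded is a hypothetical helper, hosts are plain strings, and plain maps replace assignment_heap_ and the rank table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the index into `candidates` of the host with the fewest assigned
// bytes. Ties go to the smallest rank when `break_ties_by_rank` is set and to
// the earliest candidate otherwise. Hosts missing from `assigned_bytes` count
// as 0 bytes; `rank` must contain every candidate.
size_t PickLeastLoaded(const std::vector<std::string>& candidates,
    const std::unordered_map<std::string, int64_t>& assigned_bytes,
    const std::unordered_map<std::string, int>& rank, bool break_ties_by_rank) {
  assert(!candidates.empty());
  size_t best = 0;
  int64_t best_bytes = std::numeric_limits<int64_t>::max();
  for (size_t i = 0; i < candidates.size(); ++i) {
    auto it = assigned_bytes.find(candidates[i]);
    int64_t bytes = (it == assigned_bytes.end()) ? 0 : it->second;
    bool better = bytes < best_bytes;
    bool wins_tie = bytes == best_bytes && break_ties_by_rank
        && rank.at(candidates[i]) < rank.at(candidates[best]);
    if (better || wins_tie) {
      best = i;
      best_bytes = bytes;
    }
  }
  return best;
}
```

Because the ranks are random per query, breaking ties by rank spreads equally loaded assignments across hosts, while breaking ties by position always favors the first replica listed.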

Next is SelectExecutorOnHost, which picks an Executor given an Executor IP:

void Scheduler::AssignmentCtx::SelectExecutorOnHost(
    const IpAddr& executor_ip, TBackendDescriptor* executor) {
  // First call GetExecutorsForHost to get all Executors on the host,
  // i.e. an Executors (vector<Executor>)
  const ExecutorGroup::Executors& executors_on_host =
      executor_group_.GetExecutorsForHost(executor_ip);

  if (executors_on_host.size() == 1) {
    // Only one Executor on the host: choose it directly
    *executor = *executors_on_host.begin();
  } else {
    // Otherwise find the entry for this IP in the map next_executor_per_host_, inserting
    // it if it is missing. The map goes from IP to an iterator into the vector
    ExecutorGroup::Executors::const_iterator* next_executor_on_host;
    next_executor_on_host =
        FindOrInsert(&next_executor_per_host_, executor_ip, executors_on_host.begin());
    // Choose the Executor the iterator points to. For a newly inserted entry this is the
    // first Executor in the vector; for an existing entry it is the one after the
    // Executor used last time
    *executor = **next_executor_on_host;
    // Advance the iterator by one and wrap around to the beginning at the end
    ++(*next_executor_on_host);
    if (*next_executor_on_host == executors_on_host.end()) {
      *next_executor_on_host = executors_on_host.begin();
    }
  }
  }
}

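SelectExecutorOnHost therefore chooses Executors in round-robin fashion. The round-robin order follows the order of Executors, which is essentially the order in which the Executors on that host were started (strictly speaking, the order in which they registered with this Coordinator). Once an Executor has been chosen, RecordScanRangeAssignment records the result. It also does some additional work: it updates the heap assignment_heap_, updates metrics, and fills in extra ScanRange information.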
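
The per-host round-robin can be sketched as follows. PerHostRoundRobin is a hypothetical class written for this illustration; it stores an index per host where the Impala code stores a vector iterator, and it identifies executors by plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: remembers, per host IP, which executor to hand out next.
class PerHostRoundRobin {
 public:
  // `executors_on_host` must be non-empty and must keep the same contents and
  // order across calls for a given host.
  const std::string& Next(
      const std::string& host_ip, const std::vector<std::string>& executors_on_host) {
    assert(!executors_on_host.empty());
    size_t& next = next_index_[host_ip];  // value-initialized to 0 on first use
    const std::string& chosen = executors_on_host[next];
    next = (next + 1) % executors_on_host.size();  // wrap around at the end
    return chosen;
  }

 private:
  std::unordered_map<std::string, size_t> next_index_;
};
```

Each host keeps its own position, so consecutive ScanRanges sent to the same host alternate between its executors, while a different host starts again from its first executor.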

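Finally, let's look at the two modes of Remote ScanRange assignment. The first is the simple SelectRemoteExecutor path, which is used for non-HDFS ScanRanges or when the num_remote_executor_candidates query option is 0. It does not take the Impalad's own caches (the file handle cache and the data cache) into account and is the more primitive approach: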

const IpAddr* Scheduler::AssignmentCtx::SelectRemoteExecutor() {
  const IpAddr* candidate_ip;
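  if (HasUnusedExecutors()) {
    // While some Executor has not been used yet, hand out the next unused Executor
    candidate_ip = GetNextUnusedExecutorAndIncrement();
  } else {
    // By now every Executor has received at least one ScanRange, so all of them should
    // be in the heap. Always pick the Executor at the top of the heap, i.e. the one
    // (or one of those) with the fewest assigned bytes
    candidate_ip = &(assignment_heap_.top().ip);
  }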
  return candidate_ip;
}

bool Scheduler::AssignmentCtx::HasUnusedExecutors() const {
  // random_executor_order_ is the list of Executors, shuffled when the AssignmentCtx is initialized
  return first_unused_executor_idx_ < random_executor_order_.size();
}

const IpAddr* Scheduler::AssignmentCtx::GetNextUnusedExecutorAndIncrement() {
  const IpAddr* ip = &random_executor_order_[first_unused_executor_idx_++];
  return ip;
}
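
The two phases of this fallback (use every executor once in shuffled order, then always take the least-loaded one) can be sketched with a standard min-heap. RemoteSelector is a hypothetical class for illustration. Unlike the Impala code, which selects in SelectRemoteExecutor and updates the heap later in RecordScanRangeAssignment, this sketch merges both steps into one Assign call, and it expects the caller to pass hosts that are already shuffled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplification: each host is picked once in a pre-shuffled
// order; after that, the host with the fewest assigned bytes is picked.
class RemoteSelector {
  using Entry = std::pair<int64_t, std::string>;  // (assigned bytes, host)

 public:
  explicit RemoteSelector(std::vector<std::string> shuffled_hosts)
    : order_(std::move(shuffled_hosts)) {
    assert(!order_.empty());
  }

  // Chooses a host for a scan range of `length` bytes and records the bytes.
  std::string Assign(int64_t length) {
    if (first_unused_ < order_.size()) {
      std::string host = order_[first_unused_++];
      heap_.push({length, host});
      return host;
    }
    Entry least = heap_.top();  // entry with the smallest assigned-byte count
    heap_.pop();
    least.first += length;
    heap_.push(least);
    return least.second;
  }

 private:
  std::vector<std::string> order_;
  size_t first_unused_ = 0;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap_;
};
```

Nothing in this scheme depends on which file a ScanRange belongs to, which is why it balances load well but gives the per-node caches no locality to exploit.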

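In practice, Remote ScanRanges are mostly assigned by first calling GetRemoteExecutorCandidates to obtain candidate Executors and then calling SelectExecutorFromCandidates to choose among them. To keep Remote ScanRange assignment consistent, GetRemoteExecutorCandidates uses the hash ring (HashRing) structure from consistent hashing; for background on consistent hashing, see "How we efficiently implemented consistent hashing". The implementation of GetRemoteExecutorCandidates itself is short: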

void Scheduler::AssignmentCtx::GetRemoteExecutorCandidates(
    const THdfsFileSplit* hdfs_file_split, int num_candidates,
    vector<IpAddr>* remote_executor_candidates) {
  // Set used to deduplicate Executor IPs: two different hash values derived from the same
  // ScanRange may map to the same Executor, and the candidate list must not contain
  // duplicates, so a set is needed to keep the Executors unique.
  unordered_set<IpAddr> distinct_backends;
  distinct_backends.reserve(num_candidates);
  // Hash the ScanRange's partition path, relative file path, and offset, so that the
  // same ScanRange always yields the same candidate Executors
  uint32_t hash = static_cast<uint32_t>(hdfs_file_split->partition_path_hash);
  hash = HashUtil::Hash(hdfs_file_split->relative_path.data(),
      hdfs_file_split->relative_path.length(), hash);
  hash = HashUtil::Hash(&hdfs_file_split->offset, sizeof(hdfs_file_split->offset), hash);
  // pcg32 is a random number generator, seeded here with the hash
  pcg32 prng(hash);
  // The function should return distinct candidates, so it may need more than
  // num_candidates draws. To avoid iterating many times without finding enough
  // candidates, the iteration count is capped at 8 times the number of candidates.
  // For example, with 3 Executors, the probability that collecting 3 distinct candidates
  // takes more than 20 iterations is about 0.09%
  int max_iterations = num_candidates * MAX_ITERATIONS_PER_EXECUTOR_CANDIDATE;
  for (int i = 0; i < max_iterations; ++i) {
    // Look up the nearest Executor on the hash ring (implemented by the HashRing class)
    const IpAddr* executor_addr = executor_group_.GetHashRing()->GetNode(prng());
    // Insert it into the deduplication set
    auto insert_ret = distinct_backends.insert(*executor_addr);
    // unordered_set::insert() returns a pair<iterator, bool>; the second element says
    // whether the element is new. If it is, we have a new candidate Executor, so add it
    // to the candidate list
    if (insert_ret.second) {
      remote_executor_candidates->push_back(*executor_addr);
    }
    // Stop once enough candidates have been collected
    if (remote_executor_candidates->size() == num_candidates) break;
  }
}
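
For readers unfamiliar with hash rings, the lookup that GetNode performs and the candidate collection around it can be sketched with a sorted map. This is a minimal illustration, not Impala's HashRing: SimpleHashRing and PickCandidates are hypothetical, std::hash and std::mt19937 stand in for Impala's hash function and pcg32, and the 16 points per host and the factor of 8 are values chosen for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical minimal hash ring: each host owns several points on a 32-bit
// ring, and a lookup returns the owner of the first point at or after the
// hash, wrapping around at the end.
class SimpleHashRing {
 public:
  void AddHost(const std::string& host, int points_per_host = 16) {
    for (int i = 0; i < points_per_host; ++i) {
      uint32_t point = static_cast<uint32_t>(
          std::hash<std::string>{}(host + "#" + std::to_string(i)));
      ring_[point] = host;
    }
  }

  const std::string& GetNode(uint32_t hash) const {
    assert(!ring_.empty());
    auto it = ring_.lower_bound(hash);
    if (it == ring_.end()) it = ring_.begin();
    return it->second;
  }

 private:
  std::map<uint32_t, std::string> ring_;
};

// Draws ring positions from a generator seeded with the scan range's hash and
// keeps the first `num_candidates` distinct hosts, within a bounded number of
// iterations.
std::vector<std::string> PickCandidates(
    const SimpleHashRing& ring, uint32_t seed, size_t num_candidates) {
  std::vector<std::string> result;
  std::unordered_set<std::string> seen;
  std::mt19937 prng(seed);  // stand-in for the pcg32 generator
  const size_t max_iterations = num_candidates * 8;
  for (size_t i = 0; i < max_iterations && result.size() < num_candidates; ++i) {
    const std::string& host = ring.GetNode(static_cast<uint32_t>(prng()));
    if (seen.insert(host).second) result.push_back(host);
  }
  return result;
}
```

The same seed always produces the same candidate list as long as the ring is unchanged, and adding or removing one host only moves the ring points that host owns, so most ScanRanges keep their candidates when the cluster changes.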

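That completes the walkthrough of ScanRange assignment. To summarize, assignment has two parts: non-remote reads and remote reads. A non-remote read is the case where the Impalad and the HDFS DataNode are co-located, so the Executor can read the data directly from the DataNode on the same machine (from disk or from cache). In this case, assignment is driven by memory distance, and among candidates at the same distance the one with the fewest assigned bytes wins. For a remote read, the Executor and the DataNode are on different physical machines and the data has to be read over the network. To make good use of the Impalad's local caches, assignment is kept consistent by hashing the ScanRange's file path and offset, so the same ScanRange is always steered to the same small set of candidate Executor hosts, and the candidate with the fewest assigned bytes is chosen. Both kinds of assignment pick an Executor host. When one host runs several Executor instances, the ScanRange is then assigned to one specific Executor process in round-robin fashion, in the order in which the Executors registered with the Coordinator.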
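
As a side check on the 0.09% figure quoted in the comments of GetRemoteExecutorCandidates, the probability can be computed by inclusion-exclusion. The sketch assumes three executors and independent, uniformly distributed draws, which a hash ring only approximates; ProbNeedMoreThan is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Probability that n independent uniform draws over 3 executors still miss at
// least one of them: 3 * (2/3)^n - 3 * (1/3)^n by inclusion-exclusion over the
// missed executors.
double ProbNeedMoreThan(int n) {
  assert(n >= 1);
  return 3.0 * std::pow(2.0 / 3.0, n) - 3.0 * std::pow(1.0 / 3.0, n);
}
```

ProbNeedMoreThan(3) is 7/9, and ProbNeedMoreThan(20) evaluates to roughly 0.0009, about 0.09%, which agrees with the figure above.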