2021-03-09 Hive Parameter Settings

Viewing Hive parameters

View all parameters:

set;

View the block size:

set dfs.block.size;
dfs.block.size=134217728 -- default is 128 MB

Number of map tasks (related to input size)

Computing the input split size:

New API (CombineHiveInputFormat): Math.max(minSize, Math.min(maxSize, blockSize))
Old API (HiveInputFormat):        Math.max(minSize, Math.min(goalSize, blockSize))

goalSize: totalSize / numSplits, where totalSize is the total input size and numSplits = mapred.map.tasks
minSize: set by mapred.min.split.size (mapreduce.input.fileinputformat.split.minsize in newer versions)
maxSize: set by mapred.max.split.size (mapreduce.input.fileinputformat.split.maxsize in newer versions)
blockSize: the HDFS block size, 128 MB by default.
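
As a rough sketch (not Hive source; the parameter values below are illustrative), the two formulas can be evaluated like this:

// Standalone sketch of the split-size formulas above; values are made up for illustration.
public class SplitSizeSketch {
  public static void main(String[] args) {
    long blockSize = 134217728L;          // dfs.block.size (128 MB)
    long minSize   = 1L;                  // mapred.min.split.size
    long maxSize   = 256000000L;          // mapred.max.split.size
    long totalSize = 1024L * 1024 * 1024; // 1 GB of input
    int  numSplits = 3;                   // mapred.map.tasks

    // New API (CombineHiveInputFormat)
    long newApiSplit = Math.max(minSize, Math.min(maxSize, blockSize));  // 134217728

    // Old API (HiveInputFormat): goalSize = totalSize / numSplits
    long goalSize    = totalSize / numSplits;                            // ~358 MB
    long oldApiSplit = Math.max(minSize, Math.min(goalSize, blockSize)); // 134217728

    // Either way the 1 GB input is cut into block-sized splits, i.e. about 8 map tasks.
    System.out.println(newApiSplit + " " + oldApiSplit);
  }
}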

Which formula applies is controlled by hive.input.format:

  • hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

set mapred.max.split.size=100000000;
set mapred.min.split.size=256000000;          -- lowest priority
set mapred.min.split.size.per.node=100000000; -- low priority
set mapred.min.split.size.per.rack=100000000; -- high priority
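
The excerpt below (from Hive's CombineFileInputFormat shim) shows how these parameters are read when building combined splits; note that mapred.min.split.size serves as the fallback for the per-node, per-rack, and max values when they are not set: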

public InputSplitShim[] getSplits(JobConf job, int numSplits) throws IOException {
  long minSize = job.getLong("mapred.min.split.size", 0);
 
  // For backward compatibility, let the above parameter be used
  if (job.getLong("mapred.min.split.size.per.node", 0) == 0) {
    super.setMinSplitSizeNode(minSize);
  }
 
  if (job.getLong("mapred.min.split.size.per.rack", 0) == 0) {
    super.setMinSplitSizeRack(minSize);
  }
 
  if (job.getLong("mapred.max.split.size", 0) == 0) {
    super.setMaxSplitSize(minSize);
  }
 
  InputSplit[] splits = (InputSplit[]) super.getSplits(job, numSplits);
 
  InputSplitShim[] isplits = new InputSplitShim[splits.length];
  for (int pos = 0; pos < splits.length; pos++) {
    isplits[pos] = new InputSplitShim((CombineFileSplit)splits[pos]);
  }
 
  return isplits;
}
  • hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

set mapred.map.tasks=3;
set mapred.max.split.size=100000000;
set mapred.min.split.size=100000000;          -- lowest priority
set mapred.min.split.size.per.node=100000000; -- low priority
set mapred.min.split.size.per.rack=100000000; -- high priority
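
A worked (illustrative) example: for a 1 GB input, mapred.map.tasks=3 gives goalSize ≈ 358 MB, so the block size (128 MB) wins and the job still gets about 8 map tasks; raising mapred.map.tasks to 20 lowers goalSize to about 54 MB, but the mapred.min.split.size of 100000000 bytes above then takes over, giving splits of roughly 100 MB and about 11 map tasks.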

Number of reducers

set mapred.reduce.tasks=10; -- set the number of reducers explicitly; highest priority
set hive.exec.reducers.bytes.per.reducer=100000000; -- data processed per reducer; input size / bytes per reducer gives the reducer count
set hive.exec.reducers.max=1009; -- upper bound on the reducer count when input size / bytes per reducer exceeds it

Algorithm (from the Hive source): the number of reducers is computed from the total input file size.

Core computation:
double bytes = Math.max(totalInputFileSize, bytesPerReducer);
int reducers = (int) Math.ceil(bytes / bytesPerReducer);
reducers = Math.max(1, reducers);
reducers = Math.min(maxReducers, reducers);
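
For example, with hive.exec.reducers.bytes.per.reducer=100000000 and a 1 GB input (an illustrative figure), reducers = ceil(1073741824 / 100000000) = 11, which is then clamped to hive.exec.reducers.max (1009 above), so the job runs with 11 reducers.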

/**
 * Estimate the number of reducers needed for this job, based on job input,
 * and configuration parameters.
 *
 * The output of this method should only be used if the output of this
 * MapRedTask is not being used to populate a bucketed table and the user
 * has not specified the number of reducers to use.
 *
 * @return the number of reducers.
 */
public static int estimateNumberOfReducers(HiveConf conf, ContentSummary inputSummary,
                                           MapWork work, boolean finalMapRed) throws IOException {
  long bytesPerReducer = conf.getLongVar(HiveConf.ConfVars.BYTESPERREDUCER);
  int maxReducers = conf.getIntVar(HiveConf.ConfVars.MAXREDUCERS);
 
  double samplePercentage = getHighestSamplePercentage(work);
  long totalInputFileSize = getTotalInputFileSize(inputSummary, work, samplePercentage);
 
  // if all inputs are sampled, we should shrink the size of reducers accordingly.
  if (totalInputFileSize != inputSummary.getLength()) {
    LOG.info("BytesPerReducer=" + bytesPerReducer + " maxReducers="
        + maxReducers + " estimated totalInputFileSize=" + totalInputFileSize);
  } else {
    LOG.info("BytesPerReducer=" + bytesPerReducer + " maxReducers="
      + maxReducers + " totalInputFileSize=" + totalInputFileSize);
  }
 
  // If this map reduce job writes final data to a table and bucketing is being inferred,
  // and the user has configured Hive to do this, make sure the number of reducers is a
  // power of two
  boolean powersOfTwo = conf.getBoolVar(HiveConf.ConfVars.HIVE_INFER_BUCKET_SORT_NUM_BUCKETS_POWER_TWO) &&
      finalMapRed && !work.getBucketedColsByDirectory().isEmpty();
 
  return estimateReducers(totalInputFileSize, bytesPerReducer, maxReducers, powersOfTwo);
}
 
public static int estimateReducers(long totalInputFileSize, long bytesPerReducer,
    int maxReducers, boolean powersOfTwo) {
  double bytes = Math.max(totalInputFileSize, bytesPerReducer);
  int reducers = (int) Math.ceil(bytes / bytesPerReducer);
  reducers = Math.max(1, reducers);
  reducers = Math.min(maxReducers, reducers);
 
  int reducersLog = (int)(Math.log(reducers) / Math.log(2)) + 1;
  int reducersPowerTwo = (int)Math.pow(2, reducersLog);
 
  if (powersOfTwo) {
    // If the original number of reducers was a power of two, use that
    if (reducersPowerTwo / 2 == reducers) {
      // nothing to do
    } else if (reducersPowerTwo > maxReducers) {
      // If the next power of two greater than the original number of reducers is greater
      // than the max number of reducers, use the preceding power of two, which is strictly
      // less than the original number of reducers and hence the max
      reducers = reducersPowerTwo / 2;
    } else {
      // Otherwise use the smallest power of two greater than the original number of reducers
      reducers = reducersPowerTwo;
    }
  }
  return reducers;
}
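
To illustrate the power-of-two rounding: an estimate of 10 reducers gives reducersLog = (int)(log(10)/log(2)) + 1 = 4 and reducersPowerTwo = 16; since 16 / 2 != 10, the estimate is rounded up to 16 if 16 <= maxReducers, and down to 8 otherwise. An estimate that is already a power of two is left unchanged.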

The code above estimates the number of reducers from the input; the final value is decided by setNumberOfReducers, which applies the following priority: a reducer count fixed at compile time, then mapred.reduce.tasks from the job configuration, and finally the estimate from the input size:

/**
 * Set the number of reducers for the mapred work.
 */
private void setNumberOfReducers() throws IOException {
  ReduceWork rWork = work.getReduceWork();
  // this is a temporary hack to fix things that are not fixed in the compiler
  Integer numReducersFromWork = rWork == null ? 0 : rWork.getNumReduceTasks();
 
  if (rWork == null) {
    console
        .printInfo("Number of reduce tasks is set to 0 since there's no reduce operator");
  } else {
    if (numReducersFromWork >= 0) {
      console.printInfo("Number of reduce tasks determined at compile time: "
          + rWork.getNumReduceTasks());
    } else if (job.getNumReduceTasks() > 0) {
      int reducers = job.getNumReduceTasks();
      rWork.setNumReduceTasks(reducers);
      console.printInfo("Number of reduce tasks not specified. Defaulting to jobconf value of: "
          + reducers);
    } else {
      if (inputSummary == null) {
        inputSummary =  Utilities.getInputSummary(driverContext.getCtx(), work.getMapWork(), null);
      }
      int reducers = Utilities.estimateNumberOfReducers(conf, inputSummary, work.getMapWork(),
                                                        work.isFinalMapRed());
      rWork.setNumReduceTasks(reducers);
      console
          .printInfo("Number of reduce tasks not specified. Estimated from input data size: "
          + reducers);
 
    }
    console.printInfo("In order to change the average load for a reducer (in bytes):");
    console.printInfo("  set " + HiveConf.ConfVars.BYTESPERREDUCER.varname
        + "=<number>");
    console.printInfo("In order to limit the maximum number of reducers:");
    console.printInfo("  set " + HiveConf.ConfVars.MAXREDUCERS.varname
        + "=<number>");
    console.printInfo("In order to set a constant number of reducers:");
    console.printInfo("  set " + HiveConf.ConfVars.HADOOPNUMREDUCERS
        + "=<number>");
  }
}

