Statistics in Hive (statistics collection in Hive): a translation

tobyqiu

Article link: https://blog.csdn.net/tobyqiu/article/details/84598796

Original: https://cwiki.apache.org/confluence/display/Hive/StatsDev

Statistics Collection in Hive

Motivation
Scope
Implementation
Usage
- Configuration Variables
- Newly Created Tables
- Existing Tables
Examples

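Motivation

Statistics such as the number of rows in a table, the number of partitions, and column histograms are important in many ways. Their key use is query optimization: statistics are the input to the optimizer's cost functions, which lets it compare candidate query plans and choose among them. Statistics can sometimes also answer a user's query directly. A user who only needs some basic figures can read them from the stored statistics instead of running a full execution plan. Examples include the age distribution of users, the top 10 apps people use, and the number of distinct sessions.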

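Scope

The first milestone in statistics support covers table-level and partition-level statistics. For both newly created and existing tables, table and partition statistics are now stored in the Hive MetaStore. The following statistics are currently supported for partitions:

1. Number of rows

2. Number of files

3. Size in bytes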

For tables, the same statistics are supported, plus the number of partitions in the table.

Column-level top K values can also be gathered along with partition-level statistics. See Top K Statistics.

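Implementation

Statistics collection falls roughly into two cases: newly created tables and existing tables.

For a newly created table, the job that creates the table is a MapReduce job. During creation, each mapper gathers statistics for the rows it encounters while copying them to the target table, and publishes those statistics to a database (possibly MySQL). At the end of the MapReduce job, the published statistics are aggregated and stored in the MetaStore. A similar process applies to existing tables: a map-only job is created, each mapper gathers row statistics while scanning the table, and the same publish-and-aggregate steps follow.

This clearly requires a database for storing the temporarily gathered statistics. There are currently two implementations, one using MySQL and the other using HBase. There are also two pluggable interfaces, IStatsPublisher and IStatsAggregator, which developers can implement to support any other storage. The interfaces are listed below.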

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for publishing statistics.
 */
public interface IStatsPublisher {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method publishes a given statistic into a disk storage, possibly HBase or MySQL.
   *
   * rowID : a string identifying the statistics to be published then gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be published. Ex: "numRows".
   *
   * value : the integer value of the published key, passed as a string.
   */
  public boolean publishStat(String rowID, String key, String value);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for gathering statistics.
 */
public interface IStatsAggregator {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method aggregates a given statistic from a disk storage.
   * After aggregation, this method does cleaning by removing all records from the disk storage that have the same given rowID.
   *
   * rowID : a string identifying the statistic to be gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be gathered. Ex: "numRows".
   */
  public String aggregateStats(String rowID, String key);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}
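
To make the publish-then-aggregate flow concrete, here is a minimal sketch of an in-memory store that follows the same method shapes as the two interfaces above. It is an illustration only, not Hive code: the class names InMemoryStatsStore and StatsStoreDemo, the rowID format, and the choice to sum the values published by each mapper are all assumptions, and the init/terminate methods are left out so the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the MySQL/HBase temporary store; not part of Hive.
class InMemoryStatsStore {
    // rowID -> (key -> partial values published by individual mappers)
    private final Map<String, Map<String, List<Long>>> rows = new HashMap<>();

    // Same shape as IStatsPublisher.publishStat: record one mapper's partial value.
    synchronized boolean publishStat(String rowID, String key, String value) {
        long parsed;
        try {
            parsed = Long.parseLong(value);
        } catch (NumberFormatException e) {
            return false; // reject values that are not integers
        }
        rows.computeIfAbsent(rowID, r -> new HashMap<>())
            .computeIfAbsent(key, k -> new ArrayList<>())
            .add(parsed);
        return true;
    }

    // Same shape as IStatsAggregator.aggregateStats: sum the partial values (assumed
    // aggregation rule), then remove every record stored under the same rowID.
    synchronized String aggregateStats(String rowID, String key) {
        Map<String, List<Long>> stats = rows.remove(rowID);
        if (stats == null || !stats.containsKey(key)) {
            return null; // nothing was published for this rowID/key
        }
        long total = 0;
        for (long v : stats.get(key)) {
            total += v;
        }
        return Long.toString(total);
    }
}

class StatsStoreDemo {
    public static void main(String[] args) {
        InMemoryStatsStore store = new InMemoryStatsStore();
        String rowID = "table1/ds=2008-04-09/hr=11"; // hypothetical rowID format
        // Three mappers each publish the number of rows they processed.
        store.publishStat(rowID, "numRows", "200");
        store.publishStat(rowID, "numRows", "180");
        store.publishStat(rowID, "numRows", "120");
        System.out.println(store.aggregateStats(rowID, "numRows")); // 500
        System.out.println(store.aggregateStats(rowID, "numRows")); // null, records were cleaned up
    }
}
```

Because aggregation removes everything stored under the rowID, as the interface comment describes, a second aggregation for the same rowID finds nothing.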

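Usage

Configuration Variables

See the list of statistics configuration variables for the available settings and how to use them.

Newly Created Tables

For newly created tables and/or partitions (populated through INSERT OVERWRITE), statistics are computed automatically by default. If the user sets hive.stats.autogather to false, statistics are no longer computed automatically and stored in the Hive MetaStore.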

set hive.stats.autogather=false;

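The user can also choose the implementation used for temporary statistics storage by setting the variable hive.stats.dbclass. For example, to use HBase as the temporary statistics store (the default is jdbc:derby), issue the following command: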

set hive.stats.dbclass=hbase;

When temporary storage is implemented over JDBC (e.g. Derby or MySQL), specify the appropriate connection string by setting hive.stats.dbconnectionstring. The JDBC driver can be specified with hive.stats.jdbcdriver.

set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";

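Queries can fail to collect statistics completely accurately. The setting hive.stats.reliable covers this case: when it is true, a query fails if its statistics cannot be collected reliably. It is false by default.

Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into the Hive MetaStore. The syntax of the command is as follows: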

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

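When issuing this command, the user may or may not specify partition specs. If no partition specs are given, statistics are gathered for the table and for all of its partitions (if any). If certain partition specs are given, statistics are gathered only for those partitions. When computing statistics across all partitions, the partition columns still need to be listed.

When the optional parameter NOSCAN is specified, the command does not scan the files, so it runs faster. Instead of all statistics, it gathers only the following:

Number of files
Physical size in bytes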
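
The reason these two figures can be obtained without reading any rows is that they come from file metadata alone. The sketch below shows the idea on a local directory using java.nio.file. It is an analogy under stated assumptions, not Hive's implementation: the class name NoScanStats, the file names, and the use of a local temporary directory in place of a partition's storage location are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: file count and total size from directory metadata only.
class NoScanStats {
    final long numFiles;
    final long totalSize;

    NoScanStats(long numFiles, long totalSize) {
        this.numFiles = numFiles;
        this.totalSize = totalSize;
    }

    // Lists the directory and reads file sizes; file contents are never opened.
    static NoScanStats collect(Path dir) {
        long files = 0;
        long bytes = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)) {
                    files++;
                    bytes += Files.size(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new NoScanStats(files, bytes);
    }

    // Builds a throwaway directory with two files to stand in for a partition.
    static NoScanStats demo() {
        try {
            Path dir = Files.createTempDirectory("partition");
            Files.write(dir.resolve("000000_0"), new byte[1024]);
            Files.write(dir.resolve("000001_0"), new byte[3072]);
            return collect(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NoScanStats s = demo();
        System.out.println("numFiles=" + s.numFiles + ", totalSize=" + s.totalSize);
        // numFiles=2, totalSize=4096
    }
}
```

A row count, by contrast, cannot be derived this way, which is why it is missing from the NOSCAN list.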

Examples

Suppose table Table1 has 4 partitions with the following specs:

Partition1: (ds='2008-04-08', hr=11)
Partition2: (ds='2008-04-08', hr=12)
Partition3: (ds='2008-04-09', hr=11)
Partition4: (ds='2008-04-09', hr=12)

and the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;

then statistics are gathered only for partition 3 (ds='2008-04-09', hr=11).

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;

then statistics are gathered only for partitions 3 and 4.

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;

then statistics are gathered for all 4 partitions.
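
The selection rule behind the three commands above can be sketched in a few lines: a partition column given with a value must match exactly, while a column listed without a value matches anything. This is only an illustration of the rule as described here, not Hive's analyzer. The class PartitionSpecMatcher and its methods are hypothetical, and a null map value is used to represent a column listed without a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that mimics how a partial partition spec selects partitions.
class PartitionSpecMatcher {

    // A spec maps each partition column to a required value, or to null when the
    // column is listed without a value, as in PARTITION(ds='2008-04-09', hr).
    static boolean matches(Map<String, String> spec, Map<String, String> partition) {
        for (Map.Entry<String, String> e : spec.entrySet()) {
            String required = e.getValue();
            if (required != null && !required.equals(partition.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Returns the partitions that the spec selects, in their original order.
    static List<Map<String, String>> select(Map<String, String> spec,
                                            List<Map<String, String>> partitions) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (Map<String, String> p : partitions) {
            if (matches(spec, p)) {
                selected.add(p);
            }
        }
        return selected;
    }

    // Convenience builder for a (ds, hr) pair; either value may be null in a spec.
    static Map<String, String> part(String ds, String hr) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("ds", ds);
        m.put("hr", hr);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> table1 = Arrays.asList(
            part("2008-04-08", "11"), part("2008-04-08", "12"),
            part("2008-04-09", "11"), part("2008-04-09", "12"));

        System.out.println(select(part("2008-04-09", "11"), table1).size()); // 1
        System.out.println(select(part("2008-04-09", null), table1).size()); // 2
        System.out.println(select(part(null, null), table1).size());         // 4
    }
}
```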

For a non-partitioned table, the following command can be used:

ANALYZE TABLE Table1 COMPUTE STATISTICS;

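If Table1 is a partitioned table, the partition columns must be specified explicitly as shown above; otherwise the semantic analyzer throws an error.

The user can view the gathered statistics with the DESCRIBE command. Statistics are stored in the parameters array. To view the statistics for the whole table, issue the following command: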

DESCRIBE EXTENDED TABLE1;

which produces output like the following:

... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....

If the user issues the following command:

DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);

the output looks like this:

... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....
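
When this output has to be consumed by a script, the numeric entries of the parameters block can be pulled out with a regular expression. The sketch below is a minimal, assumption-laden example: the class StatsParameterParser is hypothetical, it handles only the abbreviated one-line form shown above, and real DESCRIBE EXTENDED output contains much more text that a production parser would need to deal with.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the abbreviated parameters:{...} fragment shown above.
class StatsParameterParser {
    // Matches name=digits pairs such as numRows=500.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    static Map<String, Long> parse(String describeOutput) {
        Map<String, Long> stats = new LinkedHashMap<>();
        int start = describeOutput.indexOf("parameters:{");
        if (start < 0) {
            return stats; // no parameters block in this text
        }
        // Only look between "parameters:{" and the first closing brace after it.
        int end = describeOutput.indexOf('}', start);
        String body = end < 0
            ? describeOutput.substring(start)
            : describeOutput.substring(start, end);
        Matcher m = PAIR.matcher(body);
        while (m.find()) {
            stats.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return stats;
    }

    public static void main(String[] args) {
        String line = "... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....";
        System.out.println(parse(line)); // {numFiles=4, numRows=500, totalSize=4096}
    }
}
```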

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS noscan;

then only the number of files and the physical size in bytes are gathered, for partitions 3 and 4.