Reposted from: http://suanwuxian.com/?p=98
Hive Statistics
- 1. Community overview
- 2. Configuration
<property>
<name>hive.stats.dbclass</name>
<value>jdbc:derby</value>
<description>The default database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.jdbcdriver</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>The JDBC driver for the database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.dbconnectionstring</name>
<value>jdbc:derby:;databaseName=TempStatsStore;create=true</value>
<description>The default connection string for the database that stores temporary hive statistics.</description>
</property>
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
<property>
<name>hive.stats.dbconnectionstring</name>
<value>jdbc:mysql://hd17-vm1:3306/hive_stats?createDatabaseIfNotExist=true&amp;user=root&amp;password=root</value>
<description>The default connection string for the database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>jdbc:mysql</value>
<description>The default database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.jdbcdriver</name>
<value>com.mysql.jdbc.Driver</value>
<description>The JDBC driver for the database that stores temporary hive statistics.</description>
</property>
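2: Every node must have postgresql-9.1-902.jdbc4.jar under $HADOOP_HOME/lib
3: Edit pg_hba.conf on host 10.227.8.16 so that all 50 cluster nodes and the Hive client IPs can connect to this PostgreSQL instance
4: Add the following to $HIVE_CONF_DIR/hive-site.xml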
<property>
<name>hive.stats.dbconnectionstring</name>
<value>jdbc:postgresql://10.227.8.16/hive_stats?createDatabaseIfNotExist=true&amp;user=payods&amp;password=payods123</value>
<description>The default connection string for the database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>jdbc:postgresql</value>
<description>The default database that stores temporary hive statistics.</description>
</property>
<property>
<name>hive.stats.jdbcdriver</name>
<value>org.postgresql.Driver</value>
<description>The JDBC driver for the database that stores temporary hive statistics.</description>
</property>
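Because hive-site.xml is an XML file, any `&` that separates parameters in a JDBC connection string has to be written as `&amp;` inside `<value>`. The following is a minimal sketch, not part of Hive itself, that builds a `<property>` entry with Python's standard library so the escaping is handled automatically. The helper name `hive_property` is hypothetical; the URL is the PostgreSQL connection string used above.

```python
import xml.etree.ElementTree as ET


def hive_property(name, value, description=None):
    """Return one <property> element for hive-site.xml as a string.

    ElementTree escapes XML special characters such as '&' on output.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    if description:
        ET.SubElement(prop, "description").text = description
    return ET.tostring(prop, encoding="unicode")


# Connection string as typed by hand, with raw '&' separators.
url = (
    "jdbc:postgresql://10.227.8.16/hive_stats"
    "?createDatabaseIfNotExist=true&user=payods&password=payods123"
)

snippet = hive_property("hive.stats.dbconnectionstring", url)
print(snippet)

# Parsing the snippet back recovers the original, unescaped URL.
round_trip = ET.fromstring(snippet).find("value").text
```

Pasting the printed element into hive-site.xml keeps the file well-formed, and an XML parser reading it back sees the original URL with plain `&` separators.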
<property>
<name>hive.stats.retries.max</name>
<value>3</value>
</property>
<property>
<name>hive.stats.jdbc.timeout</name>
<value>0</value>
</property>
2013-03-01 13:56:41,205 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: StatsPublishing error: cannot connect to database.
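This happens because postgresql-9.1-902.jdbc4.jar has only limited JDBC4 support, and some methods are not yet implemented. setQueryTimeout is one of them: it is not an empty method, so calling it throws an exception. The only option is therefore to set hive.stats.jdbc.timeout to 0, which avoids the exception and the resulting error.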
- 3. Usage
hive (default)> desc formatted src;
last_modified_by zongren
last_modified_time 1354590992
numFiles 4
numPartitions 0
numRows 0
rawDataSize 0
totalSize 41392
transient_lastDdlTime 1358387408
Table Parameters:
numFiles 6
numPartitions 6
numRows 0
rawDataSize 0
totalSize 34872
numFiles 1
numRows 0
rawDataSize 0
totalSize 5812
transient_lastDdlTime 1358386808
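The statistics in the `desc formatted` output above are plain key/value pairs, so they are easy to pick out with a script. The following is a minimal sketch, assuming the output has already been captured as text; the function name `parse_stats` and the sample string are illustrative only and are not part of Hive.

```python
# Statistic names that appear in the table and partition parameters.
STAT_KEYS = {"numFiles", "numPartitions", "numRows", "rawDataSize", "totalSize"}


def parse_stats(text):
    """Extract the statistic key/value pairs from `desc formatted` output."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only two-column lines whose first column is a known statistic.
        if len(parts) == 2 and parts[0] in STAT_KEYS:
            stats[parts[0]] = int(parts[1])
    return stats


sample = """\
Table Parameters:
last_modified_by zongren
last_modified_time 1354590992
numFiles 4
numPartitions 0
numRows 0
rawDataSize 0
totalSize 41392
transient_lastDdlTime 1358387408
"""

stats = parse_stats(sample)
print(stats)
```

Lines such as `last_modified_by` and `transient_lastDdlTime` are skipped because they are table metadata rather than gathered statistics.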
- 4. Statistics provided
numFiles: total number of files under the table or partition