Hive Optimization

你说_

Category: hive. Tags: Hive optimization

Article link: https://blog.csdn.net/yuanyi0501/article/details/83275694

HiveServer2 (see the Hive wiki)

Hive Optimization

FetchTask

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>
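
The effect of this setting is easiest to see by toggling it per session. The following is a minimal sketch, assuming a hypothetical table `emp(id, name, dept)` that is not part of the original article:

```sql
-- Hypothetical table: emp(id INT, name STRING, dept STRING)
SET hive.fetch.task.conversion=more;

-- Plain projection + filter + LIMIT: eligible to run as a single fetch task,
-- so no MapReduce job should be launched.
SELECT id, name FROM emp WHERE dept = 'sales' LIMIT 10;

-- An aggregation is not eligible and still runs as a full job.
SELECT dept, COUNT(*) AS n FROM emp GROUP BY dept;

-- For comparison: with conversion disabled, even the simple query runs as a job.
SET hive.fetch.task.conversion=none;
SELECT id, name FROM emp WHERE dept = 'sales' LIMIT 10;
```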

Splitting a large table into sub-tables

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[AS select_statement];
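
As an illustration of the CTAS syntax above, a sub-table can be carved out of a large table so that later queries scan less data. The table and column names here (`logs`, `logs_20181001`, `ip`, `url`, `ts`, `dt`) are hypothetical:

```sql
-- Hypothetical source table: logs(ip STRING, url STRING, ts STRING, dt STRING)
-- Materialize one day's rows, and only the needed columns, into a smaller table.
CREATE TABLE IF NOT EXISTS logs_20181001
AS
SELECT ip, url, ts
FROM logs
WHERE dt = '2018-10-01';

-- Follow-up queries read the small table instead of the large one.
SELECT url, COUNT(*) AS hits
FROM logs_20181001
GROUP BY url;
```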

Partitioned tables and external tables

CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
 [ROW FORMAT row_format]
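
A concrete instance of the template above might look like the sketch below. The table name, columns, delimiter, and HDFS paths are all assumptions chosen for illustration:

```sql
-- Hypothetical external table over tab-delimited files, partitioned by day.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip  STRING,
  url STRING,
  ts  STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- Register an existing directory as a partition.
ALTER TABLE web_logs ADD IF NOT EXISTS
  PARTITION (dt = '2018-10-01') LOCATION '/data/web_logs/dt=2018-10-01';

-- Filtering on the partition column lets Hive read only that directory.
SELECT COUNT(*) FROM web_logs WHERE dt = '2018-10-01';
```

Because the table is external, dropping it removes only the metadata; the files under the given location stay in place.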

Data
- Storage format
  - Layout:
    - Row-oriented: SEQUENCEFILE, TEXTFILE
    - Column-oriented: RCFILE, ORC, PARQUET
    - File formats

| SEQUENCEFILE – serialized (binary key/value) file
| TEXTFILE – (Default, depending on hive.default.fileformat configuration)
| RCFILE – (Note: Available in Hive 0.6.0 and later)
| ORC – (Note: Available in Hive 0.11.0 and later) an optimized successor to RCFILE; commonly used
| PARQUET – (Note: Available in Hive 0.13.0 and later) commonly used
| AVRO – (Note: Available in Hive 0.14.0 and later)
| JSONFILE – (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
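
The storage format is chosen with the `STORED AS` clause. The sketch below converts a text table into ORC; `web_logs_text` and `web_logs_orc` are hypothetical names:

```sql
-- Hypothetical columnar copy of a text-format table.
CREATE TABLE IF NOT EXISTS web_logs_orc (
  ip  STRING,
  url STRING,
  ts  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Rewrite the rows into the ORC table; Hive handles the format conversion.
INSERT OVERWRITE TABLE web_logs_orc
SELECT ip, url, ts FROM web_logs_text;
```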

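Data compression: Snappy
1) Install the Snappy library
2) Build Hadoop from source with native Snappy support
- mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
3) Copy the files from the built tarball into hadoop/native
4) Run bin/hadoop checknative to verify that the installation succeeded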
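
Once `checknative` reports Snappy as available, compression still has to be switched on for the jobs Hive launches. One possible per-session configuration, assuming the MapReduce execution engine, is:

```sql
-- Compress intermediate data passed between stages and map output.
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final query output as well.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```
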
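SQL optimization
- Optimize the SQL statements themselves
- Joins deserve special attention. Hive has three kinds of join:
  - Common/Shuffle/Reduce join
    - The join happens in the reduce task phase
    - Large table joined to large table. Each table's data is read from files
  - Map Join
    - The join happens in the map task phase
    - Small table joined to large table. The large table's data is read from files; the small table's data is held in memory
    - The DistributedCache class distributes the small table's files to every node, where they are loaded into memory
  - SMB join (Sort-Merge-Bucket)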
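
The join strategy is largely driven by session settings. The sketch below shows the settings usually involved; the tables `orders` (large) and `dept` (small) are hypothetical, and the size threshold is only an example value:

```sql
-- Let Hive convert a common join into a map join when one side is small enough.
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes (example value)

SELECT o.order_id, d.dept_name
FROM orders o
JOIN dept d ON o.dept_id = d.dept_id;

-- Explicit hint form; whether the hint is honored depends on hive.ignore.mapjoin.hint.
SELECT /*+ MAPJOIN(d) */ o.order_id, d.dept_name
FROM orders o
JOIN dept d ON o.dept_id = d.dept_id;

-- SMB join: assumes both tables are bucketed and sorted on the join key.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
```
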
Execution plan: EXPLAIN (see the Hive wiki)
- EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION] query
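
For example, prefixing a query with `EXPLAIN` prints its stage plan without running it. The table `emp` below is hypothetical:

```sql
-- Show the stages and operators Hive would use for this query.
EXPLAIN
SELECT dept, COUNT(*) AS n FROM emp GROUP BY dept;

-- EXTENDED adds more detail about the operators in the plan.
EXPLAIN EXTENDED
SELECT dept, COUNT(*) AS n FROM emp GROUP BY dept;

-- DEPENDENCY lists the tables and partitions the query reads.
EXPLAIN DEPENDENCY
SELECT dept, COUNT(*) AS n FROM emp GROUP BY dept;
```
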
Parallel execution

<property>
    <name>hive.exec.parallel</name>
    <value>false</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>
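
Parallel execution only helps when a query compiles into stages that do not depend on each other. A minimal sketch, assuming two hypothetical tables `table_a` and `table_b`:

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- The two branches of the UNION ALL are independent of each other,
-- so their jobs can be scheduled at the same time.
SELECT u.src, u.n
FROM (
  SELECT 'a' AS src, COUNT(*) AS n FROM table_a
  UNION ALL
  SELECT 'b' AS src, COUNT(*) AS n FROM table_b
) u;
```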

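JVM reuse (by default, each container runs a single task)
- mapreduce.job.jvm.numtasks
Number of reducers
- mapreduce.job.reduces
Speculative execution (the two properties below must both be true or both be false)
- mapreduce.map.speculative
- hive.mapred.reduce.tasks.speculative.execution
Number of mappers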
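
These properties can be set per session. The values below are illustrative only, not recommendations from the original article:

```sql
-- Reuse each JVM for up to 10 tasks instead of starting a new one per task.
SET mapreduce.job.jvm.numtasks=10;

-- Fix the number of reducers explicitly...
SET mapreduce.job.reduces=8;
-- ...or let Hive derive it from the input size per reducer (bytes).
SET hive.exec.reducers.bytes.per.reducer=256000000;

-- Speculative execution: keep both switches in the same state.
SET mapreduce.map.speculative=true;
SET hive.mapred.reduce.tasks.speculative.execution=true;
```
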

  <property>
    <name>hive.merge.size.per.task</name>
    <value>256000000</value>
    <description>Size of merged files at the end of the job</description>
  </property>
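
The number of map tasks follows from the number and size of input splits, so it is usually tuned by merging small files and adjusting the split size. A sketch with example values:

```sql
-- Merge small output files at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when the average output file is smaller than this

-- Combine small input files into larger splits so fewer mappers are started.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;
```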

Dynamic partition tuning
Strict mode
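
Both topics come down to a few session settings. The following sketch assumes a hypothetical partitioned target table `web_logs_part` (partitioned by `dt`) and a source table `web_logs_raw`; the limits are example values:

```sql
-- Dynamic partitioning: let Hive create partitions from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;   -- allow all partition columns to be dynamic
SET hive.exec.max.dynamic.partitions=1000;        -- total limit per statement
SET hive.exec.max.dynamic.partitions.pernode=100; -- limit per mapper/reducer

-- The partition column (dt) must come last in the SELECT list.
INSERT OVERWRITE TABLE web_logs_part PARTITION (dt)
SELECT ip, url, ts, dt FROM web_logs_raw;

-- Strict mode: Hive rejects risky queries such as a scan of a partitioned table
-- without a partition filter, ORDER BY without LIMIT, or a Cartesian product.
SET hive.mapred.mode=strict;
```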
