Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE
| ORC
| PARQUET
| AVRO
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
row_format lets you specify the field delimiter (default \001, i.e. Ctrl-A), the collection item delimiter (default \002), the map key/value delimiter (default \003), the line delimiter (default \n), and the string that represents NULL.
View the CREATE TABLE statement for a table:
show create table table_name;
View table properties:
desc [extended] table_name;
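Putting the grammar together, a minimal sketch (database, table, columns, and HDFS path are all hypothetical):

```sql
-- Hypothetical partitioned external table of page-view events,
-- stored as tab-delimited text files under an assumed HDFS location.
CREATE EXTERNAL TABLE IF NOT EXISTS web.page_view (
  user_id  BIGINT COMMENT 'visitor id',
  url      STRING,
  referrer STRING
)
COMMENT 'page view events'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/warehouse/web/page_view';
```

Because the table is EXTERNAL, dropping it removes only the metadata, not the files under LOCATION.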
Data Type
- Primitive types: tinyint/smallint/int/bigint/float/double/decimal/boolean
- Strings: string/varchar/char
- Byte array: binary
- Date/time: timestamp/date/interval
- Complex types: ARRAY/MAP/STRUCT/UNIONTYPE
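A sketch of a table using the complex types above, together with the delimiters they need in a delimited text layout (table and column names are hypothetical):

```sql
CREATE TABLE user_profile (
  name    STRING,
  tags    ARRAY<STRING>,                      -- repeated values
  props   MAP<STRING, STRING>,                -- key/value pairs
  address STRUCT<city:STRING, zip:STRING>     -- nested named fields
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003';
```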
File Formats
ORC
An ORC file consists of three parts: a series of stripes, a FileFooter, and a Postscript.
The Postscript holds compression parameters; the FileFooter holds file metadata, mainly information about the stripes: the row count of each stripe, the column data types, and per-column count/min/max/sum aggregates for the whole file.
Each stripe defaults to 250 MB; the Index Data inside a stripe holds per-column min/max values and row position indexes for that stripe.
Inspect an ORC file: hive --orcfiledump <path-to-orc-file>
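A sketch of creating an ORC-backed table; orc.compress is a standard ORC table property, and the table name and compression choice are just illustrative:

```sql
CREATE TABLE events_orc (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Running hive --orcfiledump on the table's files then prints the stripe layout and column statistics described above.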
Select
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
Sorting
order by: global ordering
sort by: ordering within each reducer
distribute by: controls how map output is routed to reducers; usually followed by sort by to get per-group ordering
cluster by: distribute by + sort by on the same column(s), but the sort can only be ascending
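A sketch of distribute by + sort by versus cluster by, on a hypothetical events table:

```sql
-- route all rows of a user to the same reducer, then sort within each reducer
SELECT user_id, event_time
FROM events
DISTRIBUTE BY user_id
SORT BY user_id ASC;

-- equivalent shorthand: distribute and sort (ascending only) by the same column
SELECT user_id, event_time
FROM events
CLUSTER BY user_id;
```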
strict
set hive.mapred.mode=strict;
Strict mode blocks performance-sensitive operations:
- order by without limit
- queries on a partitioned table without a partition filter
- Cartesian products, i.e. joins without a join condition
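Under strict mode the query must bound each of those operations; a sketch against a hypothetical partitioned page_view table:

```sql
set hive.mapred.mode=strict;

-- rejected: ORDER BY without LIMIT
-- SELECT * FROM page_view ORDER BY user_id;

-- accepted: LIMIT bounds the global sort
SELECT * FROM page_view ORDER BY user_id LIMIT 100;

-- accepted: the filter on partition column dt restricts the scan
SELECT count(1) FROM page_view WHERE dt = '2020-01-01';
```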
UDF/UDAF/UDTF
Predicate operators:
- A [NOT] BETWEEN B AND C
- A IS [NOT] NULL
- A IS [NOT] (TRUE|FALSE)
- A [NOT] LIKE B
- A RLIKE B
- A AND B
- A OR B
- NOT A
- ! A
- A IN (val1, val2, …)
- A NOT IN (val1, val2, …)
- [NOT] EXISTS (subquery)
Collection
Complex type constructors:
- array: array(val1, val2, …)
- map: map(key1, value1, key2, value2, …)
- struct (anonymous fields): struct(val1, val2, val3, …)
- struct (named fields): named_struct(name1, val1, name2, val2, …)
- union: create_union(tag, val1, val2, …)
Complex type operators:
- array: A[n]
- map: M[key]
- struct: S.x
Complex type functions:
- size(Map<K.V>) / size(Array<T>)
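A sketch combining the operators and size() on a hypothetical user_profile table with array, map, and struct columns:

```sql
SELECT tags[0],         -- array index (0-based)
       props['tier'],   -- map lookup by key
       address.city,    -- struct field access
       size(tags),      -- number of array elements
       size(props)      -- number of map entries
FROM user_profile;
```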
Built-in Functions
Type conversion
- forced cast: type(column)
- cast: cast(expr as <type>)
String functions
- instr(string str, string substr)
- length(string A)
- trim/ltrim/rtrim(string A)
- upper/lower(string A)
- reverse(string A)
- substr(string|binary A, int start)
- concat(string|binary A, string|binary B…)
- split(string str, string pat)
- find_in_set(string str, string strList)
- base64(binary bin)
- get_json_object(string json_string, string path)
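A few of the string functions in action; the literal inputs make the results easy to verify:

```sql
SELECT instr('banana', 'na'),                              -- 3 (1-based position)
       split('a,b,c', ','),                                -- ["a","b","c"]
       find_in_set('b', 'a,b,c'),                          -- 2
       get_json_object('{"user":{"id":7}}', '$.user.id'),  -- 7
       concat('hive', '-', 'ql');                          -- hive-ql
```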
Math functions
- abs(DOUBLE a)
- round(DOUBLE a)
- floor/ceil(DOUBLE a)
- exp/ln(DOUBLE a)
- pow(DOUBLE a, DOUBLE p)
- log(DOUBLE base, DOUBLE a)
- sqrt(DOUBLE a)
- sin/cos/tan(DOUBLE a)
Date functions
String to date:
- bigint unix_timestamp()
- bigint unix_timestamp(string date[, string pattern])
Date to string:
- String from_unixtime(bigint unixtime[, string format])
Date arithmetic:
- int datediff(string enddate, string startdate)
- date date_add(date/timestamp/string startdate, tinyint/smallint/int days)
- date date_sub(date/timestamp/string startdate, tinyint/smallint/int days)
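A sketch of the conversions and date arithmetic (from_unixtime renders in the session time zone, so that result is shown for UTC):

```sql
SELECT unix_timestamp('2020-01-01 00:00:00'),    -- seconds since the epoch
       from_unixtime(1577836800, 'yyyy-MM-dd'),  -- 2020-01-01 in UTC
       datediff('2020-01-10', '2020-01-01'),     -- 9
       date_add('2020-01-01', 31),               -- 2020-02-01
       date_sub('2020-01-01', 1);                -- 2019-12-31
```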
Conditional functions
- if(boolean testCondition, T valueTrue, T valueFalseOrNull)
- isnull/isnotnull(a)
- CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
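A sketch using all three forms on a hypothetical exam_results table:

```sql
SELECT if(score >= 60, 'pass', 'fail'),
       isnull(score),
       CASE grade
         WHEN 'A' THEN 4
         WHEN 'B' THEN 3
         ELSE 0
       END AS grade_points
FROM exam_results;
```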
Hash Function
- int hash(a1[, a2…])
- string md5(string/binary)
- bigint crc32(string/binary)
- string sha(string/binary)
UDAF
- count/max/min/avg/sum
UDTF
- T explode(ARRAY<T> a)
- Tkey,Tvalue explode(MAP<Tkey,Tvalue> m)
Window functions
Function(arg1, …, argn) OVER ([PARTITION BY <…>] [ORDER BY <…>] [window_clause])
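A sketch on a hypothetical purchases table: rank each user's purchases by time and keep a per-user running total:

```sql
SELECT user_id,
       purchase_time,
       row_number() OVER (PARTITION BY user_id
                          ORDER BY purchase_time DESC) AS rn,
       sum(amount)  OVER (PARTITION BY user_id
                          ORDER BY purchase_time
                          ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND CURRENT ROW) AS running_total
FROM purchases;
```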
Optimization settings
set hive.execution.engine=tez;
set hive.exec.parallel=true; // run independent job stages in parallel
set hive.auto.convert.join=true; // automatically convert to map join, with the small table on the left
set hive.map.aggr=true; // map-side aggregation
set hive.merge.mapfiles=true; // merge small files produced by map-only jobs
set hive.merge.mapredfiles=false; // merge small files produced by map-reduce jobs
//Vectorization
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
//CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Join optimization reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
Common settings
hive.cli.print.current.db
hive.cli.print.header
hive.exec.reducers.bytes.per.reducer
mapred.reduce.tasks=-1
hive.mapred.reduce.tasks.speculative.execution
hive.exec.parallel.thread.number
Data Skew
Data skew shows up as one reducer handling too many keys and stalling the whole job; the fundamental fixes are to spread map output evenly across the reducers or to avoid the reduce phase altogether.
- No reduce: aggregate on the map side; use a map join with the large table driving the small table
- Map-side combiner to shrink reducer input: hive.map.aggr=true
- Two-round reduce: set hive.groupby.skewindata=true
- Handle special values separately, then UNION, e.g. count(distinct) over a column full of NULLs
- Salt skewed values with a random number to spread them evenly before the join
- Custom Partitioner to choose the reducer for each map output, assigning more reducers to skewed keys
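A sketch of the random-number technique for a skewed GROUP BY (the same idea hive.groupby.skewindata applies automatically): salt the key, aggregate partials, then merge. Table and column names are hypothetical:

```sql
SELECT key, sum(partial_cnt) AS cnt
FROM (
  SELECT key, count(1) AS partial_cnt
  FROM events                              -- hypothetical skewed table
  GROUP BY key, cast(rand() * 10 AS int)   -- round 1: random salt spreads one hot key over ~10 reducers
) t
GROUP BY key;                              -- round 2: merge the partial counts per key
```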
Data Warehouse Modeling
Fact table: records events as they occur; large volume
Dimension table: dimension attributes; changes slowly
Slowly changing dimension: a dimension that changes slowly; whether the warehouse keeps the change history is decided by business requirements