Hive Summary

Create Table

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] 
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement]; 

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path]; 

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE     
  | ORC        
  | PARQUET     
  | AVRO        
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

In row_format you can specify the field delimiter (default \001, i.e. Ctrl-A), the collection-item delimiter (default \002, Ctrl-B), the map key/value delimiter (default \003, Ctrl-C), the line terminator (default \n), and the character used to represent NULL.
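
A minimal sketch (hypothetical page_views table) that overrides these delimiters explicitly:

-- Hypothetical delimited-text table; omit any clause to fall back to the defaults above.
CREATE TABLE page_views (
  user_id    BIGINT,
  tags       ARRAY<STRING>,
  properties MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;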

View the CREATE TABLE statement of an existing table:

show create table table_name;

View table schema and properties:

desc [extended] table_name;

Data Type

  • Primitive types: tinyint/smallint/int/bigint/float/double/decimal/boolean
  • Strings: string/varchar/char
  • Byte array: binary
  • Date/time: timestamp/date/interval
  • Complex types: ARRAY/MAP/STRUCT/UNIONTYPE

File Formats

ORC

An ORC file consists of three parts: a series of stripes, a file footer, and a postscript.

The postscript stores compression parameters; the file footer stores file-level metadata, mainly information about the stripes: the number of rows in each stripe, the column data types, and per-column count/min/max/sum statistics for the whole file.

Each stripe is 250 MB by default. The index data inside a stripe holds the min/max of every column in that stripe plus row-position indexes into the column data.

Inspect an ORC file: hive --orcfiledump <path_to_file>
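
A minimal sketch (hypothetical table and file path) of creating an ORC table and dumping its structure:

-- Hypothetical ORC table; compression and stripe size are set explicitly here.
CREATE TABLE orders_orc (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY", "orc.stripe.size"="268435456");

-- From the shell, dump stripes, footer statistics and postscript:
-- hive --orcfiledump /user/hive/warehouse/orders_orc/000000_0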

Select

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]

Sorting

order by: total ordering over the full result set

sort by: ordering only within each reducer

distribute by: controls how map output rows are routed to reducers; usually followed by sort by to get ordering within each group

cluster by: distribute by + sort by on the same columns, but the sort can only be ascending
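
A sketch contrasting the clauses on a hypothetical table t(user_id, amount):

-- Single totally ordered result (one reducer; expensive on large data):
SELECT * FROM t ORDER BY amount DESC;

-- Rows with the same user_id go to the same reducer and are sorted within it:
SELECT * FROM t DISTRIBUTE BY user_id SORT BY user_id, amount DESC;

-- Equivalent to DISTRIBUTE BY user_id SORT BY user_id (ascending only):
SELECT * FROM t CLUSTER BY user_id;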

strict

set hive.mapred.mode=strict

Strict mode blocks several performance-sensitive operations:

  • ORDER BY without a LIMIT clause
  • queries on a partitioned table without a partition filter
  • Cartesian products, i.e. joins without a join condition
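
For example, with a hypothetical logs table partitioned by dt, strict mode rejects the first query and accepts the second:

-- Rejected in strict mode: no partition predicate, would scan every partition.
SELECT count(*) FROM logs;

-- Accepted: the partition column dt is filtered.
SELECT count(*) FROM logs WHERE dt = '2020-01-01';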

UDF/UDAF/UDTF

Predicate operators:

  • A [NOT] BETWEEN B AND C
  • A IS [NOT] NULL
  • A IS [NOT] (TRUE|FALSE)
  • A [NOT] LIKE B
  • A RLIKE B
  • A AND B
  • A OR B
  • NOT A
  • ! A
  • A IN (val1, val2, …)
  • A NOT IN (val1, val2, …)
  • [NOT] EXISTS (subquery)

Collection

Complex type constructors:

  • Array: array(val1, val2, …)
  • Map: map(key1, value1, key2, value2, …)
  • Struct (anonymous fields): struct(val1, val2, val3, …)
  • Struct (named fields): named_struct(name1, val1, name2, val2, …)
  • Union: create_union(tag, val1, val2, …)

Complex type operators:

  • Arrays: A[n]
  • Maps: M[key]
  • Structs: S.x

Complex type functions:

  • size(Map<K,V>) / size(Array<T>)
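
A small sketch (literal values, no table needed) combining constructors, element access, and size():

SELECT
  array('a', 'b', 'c')[0]                    AS first_elem,   -- 'a'
  map('k1', 1, 'k2', 2)['k2']                AS map_val,      -- 2
  named_struct('name', 'bob', 'age', 30).age AS struct_field, -- 30
  size(map('k1', 1, 'k2', 2))                AS map_size;     -- 2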

Built-in Functions

Type Conversion

  • constructor-style cast: type(column)
  • cast: cast(expr AS type)
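
For example:

SELECT
  cast('123' AS INT)  AS s_to_int,    -- 123
  cast(3.7 AS BIGINT) AS d_to_long,   -- 3 (fractional part dropped)
  cast(42 AS STRING)  AS i_to_str;    -- '42'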

String Functions

  • instr(string str, string substr)
  • length(string A)
  • trim/ltrim/rtrim(string A)
  • upper/lower(string A)
  • reverse(string A)
  • substr(string|binary A, int start)
  • concat(string|binary A, string|binary B…)
  • split(string str, string pat)
  • find_in_set(string str, string strList)
  • base64(binary bin)
  • get_json_object(string json_string, string path)
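
A few of these in a sketch (literal inputs):

SELECT
  instr('foobar', 'bar')                             AS pos,    -- 4 (1-based)
  split('a,b,c', ',')[1]                             AS second, -- 'b'
  concat('a', '-', 'b')                              AS joined, -- 'a-b'
  get_json_object('{"user":{"id":42}}', '$.user.id') AS jid;    -- '42'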

Math Functions

  • abs(DOUBLE a)
  • round(DOUBLE a)
  • floor/ceil(DOUBLE a)
  • exp/ln(DOUBLE a)
  • pow(DOUBLE a, DOUBLE p)
  • log(DOUBLE base, DOUBLE a)
  • sqrt(DOUBLE a)
  • sin/cos/tan(DOUBLE a)

Date Functions

String to Unix timestamp:

  • bigint unix_timestamp()
  • bigint unix_timestamp(string date[, string pattern])

Unix timestamp to string:

  • String from_unixtime(bigint unixtime[, string format])

Date arithmetic:

  • int datediff(string enddate, string startdate)
  • date date_add(date/timestamp/string startdate, tinyint/smallint/int days)
  • date date_sub(date/timestamp/string startdate, tinyint/smallint/int days)
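
A sketch round-tripping between strings and Unix timestamps:

SELECT
  unix_timestamp('2020-01-01 00:00:00')      AS ts1,  -- default pattern yyyy-MM-dd HH:mm:ss
  unix_timestamp('01/01/2020', 'dd/MM/yyyy') AS ts2,  -- custom pattern
  from_unixtime(1577836800, 'yyyy-MM-dd')    AS d1,   -- '2020-01-01' (timezone-dependent)
  datediff('2020-01-31', '2020-01-01')       AS days, -- 30
  date_add('2020-01-01', 7)                  AS d2;   -- '2020-01-08'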

Conditional Functions

  • if(boolean testCondition, T valueTrue, T valueFalseOrNull)
  • isnull/isnotnull(a)
  • CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
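
For example, on a hypothetical exams(score, grade) table:

SELECT
  if(score >= 60, 'pass', 'fail')                       AS verdict,
  CASE grade WHEN 'A' THEN 4 WHEN 'B' THEN 3 ELSE 0 END AS points,
  isnull(score)                                         AS missing
FROM exams;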

Hash Function

  • int hash(a1[, a2…])
  • string md5(string/binary)
  • bigint crc32(string/binary)
  • string sha(string/binary)

UDAF

  • count/max/min/avg/sum

UDTF

  • T explode(ARRAY<T> a)
  • Tkey, Tvalue explode(MAP<Tkey, Tvalue> m)
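
explode is normally paired with LATERAL VIEW so the other columns of the row survive; a sketch on a hypothetical users(user_id, tags, props) table:

-- One output row per array element; user_id is carried along by LATERAL VIEW.
SELECT u.user_id, t.tag
FROM users u
LATERAL VIEW explode(u.tags) t AS tag;

-- The map form yields two columns, one alias each for key and value.
SELECT u.user_id, kv.k, kv.v
FROM users u
LATERAL VIEW explode(u.props) kv AS k, v;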

Window Functions

Function(arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [window_clause])
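
For example, on a hypothetical sales(dept, amount) table:

SELECT
  dept,
  amount,
  row_number() OVER (PARTITION BY dept ORDER BY amount DESC) AS rk,
  sum(amount)  OVER (PARTITION BY dept)                      AS dept_total,
  sum(amount)  OVER (PARTITION BY dept ORDER BY amount
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;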

Optimization Settings

set hive.execution.engine=tez;
set hive.exec.parallel=true;            -- run independent job stages in parallel
set hive.auto.convert.join=true;        -- auto-convert to map join (small table cached in memory)
set hive.map.aggr=true;                 -- map-side partial aggregation
set hive.merge.mapfiles=true;           -- merge small output files of map-only jobs
set hive.merge.mapredfiles=false;       -- merge small output files of map-reduce jobs
-- Vectorization
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
-- CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Join optimization reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

Common Settings

hive.cli.print.current.db
hive.cli.print.header   
hive.exec.reducers.bytes.per.reducer
mapred.reduce.tasks=-1
hive.mapred.reduce.tasks.speculative.execution
hive.exec.parallel.thread.number

Data Skew

Data skew shows up as one reducer handling far too many keys, which stalls the whole job. The fundamental cure is to spread map output evenly across the reducers, or to avoid the reduce phase entirely.

  • no reduce at all: aggregate on the map side, or map join with the large table streaming past the small in-memory table
  • map-side combiner to shrink reducer input: hive.map.aggr=true
  • two-stage reduce: set hive.groupby.skewindata=true
  • handle special values separately, then UNION the results, e.g. count(distinct …) over data that is mostly NULL
  • salt skewed values with a random number so they spread evenly before the join (see the sketch below)
  • custom Partitioner that decides which reducer each map output key goes to, so skewed keys can be spread over more reducers
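
A sketch of the salting idea on hypothetical logs/users tables: NULL join keys would all hash to one reducer, so they are replaced with unique random strings that spread evenly and can never match, preserving LEFT JOIN semantics:

SELECT a.*, b.name
FROM logs a
LEFT JOIN users b
  ON (CASE WHEN a.user_id IS NULL
           THEN concat('skew_', cast(rand() AS STRING))
           ELSE cast(a.user_id AS STRING) END) = cast(b.user_id AS STRING);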

Data Warehouse Modeling

Fact table: records events as they occur; high volume

Dimension table: holds dimension attributes; changes slowly

Slowly changing dimension (SCD): a dimension that changes slowly; the warehouse decides, per business requirements, whether to keep the change history
