Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE
| ORC
| PARQUET
| AVRO
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
row_format lets you specify the field delimiter (default \001, i.e. Ctrl-A), the collection item delimiter (default \002), the map key/value delimiter (default \003), the line delimiter (default \n), and the string that represents NULL.
View the CREATE TABLE statement for a table:
show create table table_name;
View table properties:
desc [extended] table_name;
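Putting the grammar together, a minimal sketch (database, table, columns, and HDFS path are all hypothetical):

```sql
-- Hypothetical partitioned external table of page-view events,
-- stored as tab-delimited text files under an assumed HDFS location.
CREATE EXTERNAL TABLE IF NOT EXISTS web.page_view (
  user_id  BIGINT COMMENT 'visitor id',
  url      STRING,
  referrer STRING
)
COMMENT 'page view events'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/warehouse/web/page_view';
```

Because the table is EXTERNAL, dropping it removes only the metadata, not the files under LOCATION.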
Data Type
- Primitive types: tinyint/smallint/int/bigint/float/double/decimal/boolean
- Strings: string/varchar/char
- Byte array: binary
- Date/time: timestamp/date/interval
- Complex types: ARRAY/MAP/STRUCT/UNIONTYPE
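A sketch of a table using the complex types above, together with the delimiters they need in a delimited text layout (table and column names are hypothetical):

```sql
CREATE TABLE user_profile (
  name    STRING,
  tags    ARRAY<STRING>,                      -- repeated values
  props   MAP<STRING, STRING>,                -- key/value pairs
  address STRUCT<city:STRING, zip:STRING>     -- nested named fields
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003';
```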
File Formats
ORC
An ORC file consists of three parts: a series of stripes, a FileFooter, and a Postscript.
The Postscript holds compression parameters; the FileFooter holds file metadata, mainly information about the stripes: the row count of each stripe, the column data types, and per-column count/min/max/sum aggregates for the whole file.
Each stripe defaults to 250 MB; the Index Data inside a stripe holds per-column min/max values and row position indexes for that stripe.
Inspect an ORC file: hive --orcfiledump <path-to-orc-file>
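A sketch of creating an ORC-backed table; orc.compress is a standard ORC table property, and the table name and compression choice are just illustrative:

```sql
CREATE TABLE events_orc (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Running hive --orcfiledump on the table's files then prints the stripe layout and column statistics described above.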
Select
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
Sorting
order by: global ordering
sort by: ordering within each reducer
distribute by: controls how map output is routed to reducers; usually followed by sort by to get per-group ordering
cluster by: distribute by + sort by on the same column(s), but the sort can only be ascending
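A sketch of distribute by + sort by versus cluster by, on a hypothetical events table:

```sql
-- route all rows of a user to the same reducer, then sort within each reducer
SELECT user_id, event_time
FROM events
DISTRIBUTE BY user_id
SORT BY user_id ASC;

-- equivalent shorthand: distribute and sort (ascending only) by the same column
SELECT user_id, event_time
FROM events
CLUSTER BY user_id;
```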
strict
set hive.mapred.mode=strict;
Strict mode blocks performance-sensitive operations:
- order by without limit
- queries on a partitioned table without a partition filter
- Cartesian products, i.e. joins without a join condition
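Under strict mode the query must bound each of those operations; a sketch against a hypothetical partitioned page_view table:

```sql
set hive.mapred.mode=strict;

-- rejected: ORDER BY without LIMIT
-- SELECT * FROM page_view ORDER BY user_id;

-- accepted: LIMIT bounds the global sort
SELECT * FROM page_view ORDER BY user_id LIMIT 100;

-- accepted: the filter on partition column dt restricts the scan
SELECT count(1) FROM page_view WHERE dt = '2020-01-01';
```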
UDF/UDAF/UDTF
Predicate operators:
- A [NOT] BETWEEN B AND C
- A IS [NOT] NULL
- A IS [NOT] (TRUE|FALSE)
- A [NOT] LIKE B
- A RLIKE B
- A AND B
- A OR B
- NOT A
- ! A
- A IN (val1, val2, …)
- A NOT IN (val1, val2, …)
- [NOT] EXISTS (subquery)
Collection
Complex type constructors:
- array: array(val1, val2, …)
- map: map(key1, value1, key2, value2, …)
- struct (anonymous fields): struct(val1, val2, val3, …)
- struct (named fields): named_struct(name1, val1, name2, val2, …)
- union: create_union(tag, val1, val2, …)
Complex type operators:
- array: A[n]
- map: M[key]
- struct: S.x
Complex type functions:
- size(Map<K.V>) / size(Array<T>)
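A sketch combining the operators and size() on a hypothetical user_profile table with array, map, and struct columns:

```sql
SELECT tags[0],         -- array index (0-based)
       props['tier'],   -- map lookup by key
       address.city,    -- struct field access
       size(tags),      -- number of array elements
       size(props)      -- number of map entries
FROM user_profile;
```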
Built-in Functions
Type conversion
- forced cast: type(column)
- cast: cast(expr as <type>)
String functions
- instr(string str, string substr)
- length(string A)
- trim/ltrim/rtrim(string A)
- upper/lower(string A)
- reverse(string A)
- substr(string|binary A, int start)
- concat(string|binary A, string|binary B…)
- split(string str, string pat)
- find_in_set(string str, string strList)
- base64(binary bin)
- get_json_object(string json_string, string path)
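A few of the string functions in action; the literal inputs make the results easy to verify:

```sql
SELECT instr('banana', 'na'),                              -- 3 (1-based position)
       split('a,b,c', ','),                                -- ["a","b","c"]
       find_in_set('b', 'a,b,c'),                          -- 2
       get_json_object('{"user":{"id":7}}', '$.user.id'),  -- 7
       concat('hive', '-', 'ql');                          -- hive-ql
```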
Math functions
- abs(DOUBLE a)
- round(DOUBLE a)
- floor/ceil(DOUBLE a)
- exp/ln(DOUBLE a)
- pow(DOUBLE a, DOUBLE p)
- log(DOUBLE base, DOUBLE a)
- sqrt(DOUBLE a)
- sin/cos/tan(DOUBLE a)
Date functions
String to date:
- bigint unix_timestamp()
- bigint unix_timestamp(string date[, string pattern])
Date to string:
- String from_unixtime(bigint unixtime[, string format])
Date arithmetic:
- int datediff(string enddate, string startdate)
- date date_add(date/timestamp/string startdate, tinyint/smallint/int days)
- date date_sub(date/timestamp/string startdate, tinyint/smallint/int days)
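A sketch of the conversions and date arithmetic (from_unixtime renders in the session time zone, so that result is shown for UTC):

```sql
SELECT unix_timestamp('2020-01-01 00:00:00'),    -- seconds since the epoch
       from_unixtime(1577836800, 'yyyy-MM-dd'),  -- 2020-01-01 in UTC
       datediff('2020-01-10', '2020-01-01'),     -- 9
       date_add('2020-01-01', 31),               -- 2020-02-01
       date_sub('2020-01-01', 1);                -- 2019-12-31
```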
Conditional functions
- if(boolean testCondition, T valueTrue, T valueFalseOrNull)
- isnull/isnotnull(a)
- CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
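A sketch using all three forms on a hypothetical exam_results table:

```sql
SELECT if(score >= 60, 'pass', 'fail'),
       isnull(score),
       CASE grade
         WHEN 'A' THEN 4
         WHEN 'B' THEN 3
         ELSE 0
       END AS grade_points
FROM exam_results;
```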
Hash Function
- int hash(a1[, a2…])
- string md5(string/binary)
- bigint crc32(string/binary)
- string sha(string/binary)
UDAF
- count/max/min/avg/sum
UDTF
- T explode(ARRAY<T> a)
- Tkey,Tvalue explode(MAP<Tkey,Tvalue> m)
Window functions
Function(arg1, …, argn) OVER ([PARTITION BY <…>] [ORDER BY <…>] [window_clause])
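A sketch on a hypothetical purchases table: rank each user's purchases by time and keep a per-user running total:

```sql
SELECT user_id,
       purchase_time,
       row_number() OVER (PARTITION BY user_id
                          ORDER BY purchase_time DESC) AS rn,
       sum(amount)  OVER (PARTITION BY user_id
                          ORDER BY purchase_time
                          ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND CURRENT ROW) AS running_total
FROM purchases;
```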
Optimization settings
set hive.execution.engine=tez;
set hive.exec.parallel=true; // run independent job stages in parallel
set hive.auto.convert.join=true; // automatically convert to map join, with the small table on the left
set hive.map.aggr=true; // map-side aggregation
set hive.merge.mapfiles=true; // merge small files produced by map-only jobs
set hive.merge.mapredfiles=false; // merge small files produced by map-reduce jobs
//Vectorization
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
//CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Join optimization reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
Common settings
hive.cli.print.current.db
hive.cli.print.header
hive.exec.reducers.bytes.per.reducer
mapred.reduce.tasks=-1
hive.mapred.reduce.tasks.speculative.execution
hive.exec.parallel.thread.number
Data Skew
Data skew shows up as one reducer handling too many keys and stalling the whole job; the fundamental fixes are to spread map output evenly across the reducers or to avoid the reduce phase altogether.
- No reduce: aggregate on the map side; use a map join with the large table driving the small table
- Map-side combiner to shrink reducer input: hive.map.aggr=true
- Two-round reduce: set hive.groupby.skewindata=true
- Handle special values separately, then UNION, e.g. count(distinct) over a column full of NULLs
- Salt skewed values with a random number to spread them evenly before the join
- Custom Partitioner to choose the reducer for each map output, assigning more reducers to skewed keys
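A sketch of the random-number technique for a skewed GROUP BY (the same idea hive.groupby.skewindata applies automatically): salt the key, aggregate partials, then merge. Table and column names are hypothetical:

```sql
SELECT key, sum(partial_cnt) AS cnt
FROM (
  SELECT key, count(1) AS partial_cnt
  FROM events                              -- hypothetical skewed table
  GROUP BY key, cast(rand() * 10 AS int)   -- round 1: random salt spreads one hot key over ~10 reducers
) t
GROUP BY key;                              -- round 2: merge the partial counts per key
```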
Data Warehouse Modeling
Fact table: records events as they occur; large volume
Dimension table: dimension attributes; changes slowly
Slowly changing dimension: a dimension that changes slowly; whether the warehouse keeps the change history is decided by business requirements