Big Data: Hive Basic Operations


DDL


create

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
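
A minimal sketch of how these clauses combine; the database, table, and column names below are made up for illustration:

CREATE TABLE IF NOT EXISTS demo_db.page_click (
    user_id    BIGINT    COMMENT 'user identifier',
    page_url   STRING,
    click_time TIMESTAMP
)
COMMENT 'example click log table'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ('creator'='demo');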

Column data types (data_type)

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
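
A sketch of a table that uses these complex types (hypothetical column names); elements are read with array indexes, map keys, and dot notation:

CREATE TABLE complex_demo (
    tags    ARRAY<STRING>,
    props   MAP<STRING, INT>,
    address STRUCT<city:STRING, zip:STRING>
);

SELECT tags[0], props['height'], address.city FROM complex_demo;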

Field delimiters: ROW FORMAT row_format

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
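
For example, a delimited text table whose collection and map delimiters are spelled out (the delimiter characters here are only an assumption, adjust them to the actual data):

CREATE TABLE delim_demo (
    name    STRING,
    hobbies ARRAY<STRING>,
    scores  MAP<STRING, INT>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;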

Storage format: STORED AS file_format

file_format:    -- the default is controlled by the hive.default.fileformat setting
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
                -- TEXTFILE data is split with the DELIMITED row format; enable escaping for the delimiter characters
                -- with the 'ESCAPED BY' clause (such as ESCAPED BY '\') if the data itself can contain those characters.
                -- A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N').
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
                -- the default is TextInputFormat, where one line is one record. For multi-line, structured documents
                -- you need a custom InputFormat, OutputFormat, and SerDe. Reference: https://www.coder4.com/archives/4031
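
As a sketch, STORED AS TEXTFILE is shorthand for spelling out the input and output classes explicitly (the same class pair listed for TEXTFILE further down):

CREATE TABLE textfile_demo (line STRING)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';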

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
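
A sketch of these informational (DISABLE NOVALIDATE, i.e. not enforced) constraints, assuming a Hive version that supports them (2.1.0+) and made-up table names:

CREATE TABLE dept_demo (id INT, name STRING,
    PRIMARY KEY (id) DISABLE NOVALIDATE);

CREATE TABLE emp_demo (id INT, dept_id INT,
    CONSTRAINT fk_dept FOREIGN KEY (dept_id) REFERENCES dept_demo(id) DISABLE NOVALIDATE);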


Different storage types use different INPUTFORMAT and OUTPUTFORMAT classes; you can also specify custom INPUTFORMAT and OUTPUTFORMAT classes.

For example:
TEXTFILE corresponds to:
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

RCFILE corresponds to:
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Besides differing in INPUTFORMAT and OUTPUTFORMAT, the SerDes they can be paired with are also restricted. For example:
RCFILE defaults to:
org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
TEXTFILE can be configured with SerDes such as RegEx, JSON, or CSV/TSV.
If the file format is RCFILE but the SerDe is configured as JSON, the query fails with: Failed with exception Wrong file format. Please check the file's format.

ROW FORMAT SERDE & STORED AS INPUTFORMAT OUTPUTFORMAT

This part is more involved, so it is only covered briefly here.
References: https://www.coder4.com/archives/4031
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
SerDe is a short name for "Serializer and Deserializer."
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

To summarize, when Hive reads a file on HDFS it proceeds as follows:

(1) The InputFormat splits the file into records; each record is one row (Row).
(2) The SerDe's Deserializer splits each row into its fields.

When Hive executes an INSERT and writes rows out to files, it calls the OutputFormat and the SerDe's Serializer, in the reverse order of the read path.

hive -e "create table hive_json_test(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;"

An example using a JSON SerDe:
Reference: https://github.com/rcongiu/Hive-JSON-Serde/tree/develop

1. Download the hcatalog package matching your Hive version (it contains the JsonSerDe class) into the Hive lib directory: https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core
2. Add the jar path to hive-site.xml:
<property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/local/execpackage/apache-hive-0.13.0-bin/lib/hive-hcatalog-core-0.13.0.jar</value>
</property>
3. Test:
hive -e "load data local inpath '/Users/hadoop/Desktop/ws/tempfiles/hive_json.1.txt' into table hive_json_test;"
hive -e "select * from hive_json_test;"
hive -e "alter table hive_json_test add columns(c string);"
hive -e "load data local inpath '/Users/hadoop/Desktop/ws/tempfiles/hive_json.2.txt' into table hive_json_test;"
where:
hive_json.1.txt:
{"a":"va2","b":"vb2"}
hive_json.2.txt:
{"a":"va2","b":"vb2","c":"vc"}
Output:
va2 vb2 NULL
va2 vb2 vc


If a record is not well-formed JSON, it will fail the whole job.

Use the following setting to return NULL values instead:
ALTER TABLE json_table SET SERDEPROPERTIES ( 'ignore.malformed.json' = 'true');
https://stackoverflow.com/questions/14137688/hive-elasticmapreduce-how-bring-jsonserde-to-ignore-malformed-json

Specifying table properties when creating a table

TBLPROPERTIES ("comment"="table_comment") – a description of the table
TBLPROPERTIES ("hbase.table.name"="table_name") – see HBase Integration.
TBLPROPERTIES ("immutable"="true") or ("immutable"="false") in release 0.13.0+ (HIVE-6406) – see Inserting Data into Hive Tables from Queries.
TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE") and other ORC properties – see ORC Files.
TBLPROPERTIES ("transactional"="true") or ("transactional"="false") in release 0.14.0+, the default is "false" – see Hive Transactions.
TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false" – see Hive Transactions.
TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory") – see Hive Transactions.
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num") – see Hive Transactions.
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct") – see Hive Transactions.
TBLPROPERTIES ("auto.purge"="true") or ("auto.purge"="false") in release 1.2.0+ (HIVE-9118) – see Drop Table and Drop Partitions.
TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) – Change a managed table to an external table and vice versa for "FALSE".
TBLPROPERTIES("parquet.mr.int96.write.zone"="timeZoneId") (HIVE-12767) – Property set on Parquet tables only, storing the Java time zone ID used 
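
For example, a sketch that combines a few of these properties on an ORC table (the values are illustrative only):

CREATE TABLE orc_demo (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES (
    'comment'      = 'example ORC table',
    'orc.compress' = 'SNAPPY',
    'auto.purge'   = 'true'
);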

STORED BY

Specifies a non-native table format via a storage handler, e.g. for HBase or Druid.
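
For example, the HBase integration described in the Hive wiki uses a storage handler like this (the column mapping and HBase table name depend on your setup):

CREATE TABLE hbase_table_1(key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");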

The difference between managed (internal) and external tables

Managed tables:
1. Both the metadata and the data files are managed by Hive.
2. The data location is determined by the hive.metastore.warehouse.dir parameter.
3. Dropping the table deletes the underlying data.
4. Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
External tables:
1. They wrap existing external data (e.g. on HDFS), creating only the metadata for it.
2. When the external data layout or partitions change, run `MSCK REPAIR TABLE table_name;` or `ALTER TABLE table_name ADD PARTITION ...`; the EMR equivalent is
`ALTER TABLE table_name RECOVER PARTITIONS;` (see the sketch after this list).
If a partition directory contains subdirectories, msck reports an error; set hive.msck.path.validation=ignore;
to suppress it, or remove the subdirectories.
3. Dropping the table does not delete the external data.
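
A minimal sketch of point 2 (all paths and names here are hypothetical): after new partition directories are created directly on HDFS, refresh the metastore:

CREATE EXTERNAL TABLE ext_logs (line STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/ext_logs';

-- new directories such as /data/ext_logs/dt=20200101 are written outside of Hive, then:
MSCK REPAIR TABLE ext_logs;
-- or register a single partition explicitly:
ALTER TABLE ext_logs ADD PARTITION (dt='20200102') LOCATION '/data/ext_logs/dt=20200102';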

Convert a managed table to an external table (and vice versa):
hive -e "ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL'='TRUE');"

Partitions

1. Multiple partition columns can be specified.
2. Partition column names must not duplicate regular column names.
  • Managed table partitions
1. Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
  • External table partitions
1. Create the external table and specify its location:
CREATE EXTERNAL TABLE my_daily_report(  
    last_update string,  
    col_a string,  
    col_b string,  
    col_c string)  
    PARTITIONED BY (  
    par_dt string) location '/user/chenshu/data/daily'; 
2. Add a partition and attach its data:
alter table my_daily_report add partition (par_dt='20140530') location '/user/chenshu/data/daily/my_daily_report/20140530';  

Create Table As Select (CTAS)
* Example

CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

1. You can freely choose the target table's SerDe and storage file format.
2. The target table's column names come from the SELECT list. If (key % 1024) is given the alias new_key, the target table gets a new_key column; otherwise columns are auto-named _col0, _col1, _col2, and so on.
  • Properties:
1. The operation is atomic: while it runs, the target table is not visible to other users.
2. The target table cannot be a partitioned table.
3. The target table cannot be an external table.
4. The target table cannot be a list bucketing table.

Temporary Tables
* Example

CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)

The steps would be:
1. Create the table
2. Fill the table: INSERT OVERWRITE TABLE temptbl <select_query>
3. Run the query
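
A sketch of those steps with hypothetical table names:

CREATE TEMPORARY TABLE temptbl (key STRING, value STRING);
INSERT OVERWRITE TABLE temptbl SELECT key, value FROM some_source_table;
SELECT count(*) FROM temptbl;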

By default the data is stored in the user's scratch directory; with restrictive permission configurations this may not be usable.
  • Properties
1. Visible only to the current session; data lives in the user's scratch directory and is deleted when the session ends.
2. Partitions and indexes are not supported.
3. The storage medium is set with hive.exec.temporary.table.storage: memory, ssd, or default.
  • Reference
http://stackoverflow.com/questions/5385163/create-temporary-table-in-hive

Create Table Like
* Example

CREATE TABLE empty_key_value_store
LIKE key_value_store;
  • Properties
1. No data is copied.
2. The table definition is copied (e.g. SerDe and file formats).

Bucketed Sorted Tables
* Example

1. Create the table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;

2. Write data into the table:
set hive.enforce.bucketing = true;  -- (Note: Not needed in Hive 2.x onward)
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';

Note:
  set hive.enforce.bucketing = true is only needed for Hive 0.x and 1.x.
  If the hive.enforce.bucketing property is not used, set the number of reducers to match the number of buckets and add CLUSTER BY to the SELECT:
    set mapred.reduce.tasks=256;
    FROM user_info
    INSERT OVERWRITE TABLE user_info_bucketed
    PARTITION (ds='2009-02-25')
    SELECT userid, firstname, lastname WHERE ds='2009-02-25' CLUSTER BY user_id;
  • Properties
1. Higher query efficiency: joining buckets that hold the same column values greatly reduces the amount of data a JOIN has to process.
2. Sampling becomes more efficient (see the TABLESAMPLE sketch after this list).
3. Bucketing does not help write performance.
4. Bucket assignment: hash_function(bucketing_column) mod num_buckets, where hash_function depends on the type of bucketing_column.
   Set hive.enforce.bucketing = true (for Hive 0.x and 1.x) so that data is bucketed correctly.
5. If the data being inserted (as in the insert ... select ... example) has types that differ from the table's columns, or the manually specified CLUSTER BY column differs from the table definition, the rows end up in the wrong buckets.
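
A sketch of bucket sampling on the table above (point 2); TABLESAMPLE only reads the selected bucket instead of scanning the whole partition:

SELECT * FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 256 ON user_id) s
WHERE s.ds = '2009-02-25';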

UNION


Concepts

1. Before version 1.2.0, only UNION ALL is supported, which does not remove duplicates.
   From version 1.2.0 on, UNION removes duplicates by default.
2. UNION DISTINCT also removes duplicates, but since 1.2.0 it is no different from the default UNION behavior.
3. UNION ALL does not remove duplicates.
4. The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.

Syntax

select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...

Example

SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
 ) actions JOIN users u ON (u.id = actions.uid)

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to the UNION result, place the clause after the last select_statement:

SELECT key FROM src
UNION
SELECT key FROM src1 
ORDER BY key LIMIT 10

References


Official documentation
