DDL: tables
CREATE statement
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Notes
Table types
CREATE TEMPORARY TABLE: a temporary table, visible only to the current session; when the session closes, the table is dropped.
CREATE EXTERNAL TABLE: an external table; a file location is specified separately, and dropping the table does not delete the data.
CREATE TABLE: a managed (internal) table; dropping the table also deletes the data files in HDFS.
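A minimal sketch of the three table types (table names and the HDFS path are placeholders, not from the source):

```sql
-- Temporary table: dropped automatically when the session ends (Hive 0.14.0+)
CREATE TEMPORARY TABLE tmp_events (id INT, name STRING);

-- External table: dropping it leaves the files under LOCATION intact
CREATE EXTERNAL TABLE ext_events (id INT, name STRING)
LOCATION '/data/ext_events';

-- Managed (internal) table: DROP TABLE also removes the HDFS data files
CREATE TABLE mgd_events (id INT, name STRING);
```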
ROW FORMAT
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
    [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
    [NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES
    (property_name=property_value, property_name=property_value, ...)]
Specifies the format of each row in the table; it takes one of two forms, DELIMITED or SERDE.
ROW FORMAT DELIMITED: each row is split into fields by the given delimiters.
For example, ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' means the fields of each row are separated by commas.
ROW FORMAT SERDE: a custom per-row format. Hive ships SerDes for Avro, ORC, RegEx, Thrift, Parquet, CSV, and JSON (JsonSerDe), and you can plug in your own by extending the abstract class AbstractSerDe and implementing initialize and deserialize.
SerDe is short for "Serializer and Deserializer."
Hive uses the SerDe (and FileFormat) to read and write table rows.
Read path: HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Write path: Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
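The two forms side by side, as a sketch (table names are illustrative; the JsonSerDe class is Hive's built-in one):

```sql
-- DELIMITED: comma-separated fields, one row per line
CREATE TABLE csv_users (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- SERDE: the built-in JsonSerDe parses one JSON object per line
CREATE TABLE json_users (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
```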
Storage Format
The default storage format is plain text (TEXTFILE).
The other storage formats, quoted from the official documentation:
file_format:
  : SEQUENCEFILE
  | TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE -- (Note: Available in Hive 0.6.0 and later)
  | ORC -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET -- (Note: Available in Hive 0.13.0 and later)
  | AVRO -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE -- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
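For example, the STORED AS shorthand and the explicit INPUTFORMAT/OUTPUTFORMAT form are equivalent; a sketch using ORC (table names are placeholders):

```sql
-- Columnar ORC storage (Hive 0.11.0+)
CREATE TABLE orc_events (id INT, name STRING)
STORED AS ORC;

-- Same storage, spelled out with the format class names
CREATE TABLE orc_events_explicit (id INT, name STRING)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```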
Partitioned By
Creates a partitioned table.
Each partition column maps to one directory level in the table's HDFS path.
For example, PARTITIONED BY (year STRING, month STRING, day STRING)
maps to an HDFS path like hdfs:///year=2019/month=06/day=01 (three directory levels),
while PARTITIONED BY (day STRING)
maps to hdfs:///day=2019-06-01 (one directory level).
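A sketch of the three-level example above, plus adding one partition (table name and values are illustrative):

```sql
CREATE TABLE logs (msg STRING)
PARTITIONED BY (year STRING, month STRING, day STRING);

-- Adding this partition creates the directory .../year=2019/month=06/day=01
ALTER TABLE logs ADD PARTITION (year='2019', month='06', day='01');
```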
Summary of partition commands
1. List partitions
Command: SHOW PARTITIONS db_name.table_name;
Lists all partitions of the table.
2. Rename a partition
Command: ALTER TABLE db_name.table_name PARTITION (day='2019-06-01') RENAME TO PARTITION (day='2018-06-01');
Renaming a partition of an external table does not change the path of the data files;
for a managed (internal) table, the data files are moved to the new path.
3. Repair partitions
Command: MSCK REPAIR TABLE db_name.table_name;
Adds to the metastore any partitions whose directories exist on HDFS but have no corresponding metastore entry.
Tuning:
The number of partitions repaired per batch can be set (set hive.msck.repair.batch.size=10) to make the repair more efficient. The default is 0, which repairs all partitions in one batch and can be slow when there are many partitions.
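Putting the batch-size setting and the repair together (db/table names are placeholders):

```sql
-- Repair at most 10 partitions per metastore call (default 0 = all at once)
SET hive.msck.repair.batch.size=10;
MSCK REPAIR TABLE mydb.logs;
```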
4. Drop partitions
Command: ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...] [PURGE];
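A concrete instance of the syntax above (table name and value are illustrative):

```sql
-- Drop one partition; PURGE bypasses the trash, so the data is removed immediately
ALTER TABLE logs DROP IF EXISTS PARTITION (day='2019-06-01') PURGE;
```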