Hive tables fall into three kinds: managed (also called internal) tables, external tables, and temporary tables. The most fundamental difference between managed and external tables is whether the data is managed by Hive.
- Create table methods
1. Creating a table structure directly
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
--partition by one or more columns
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
--bucket by certain columns
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
--data is skewed on specific values of certain columns
[SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
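As a sketch of the less common clauses above (table, column, and skewed value are all hypothetical), a table skewed on one hot value might be declared as:

```sql
-- Hypothetical example: a log table where most rows share one value of `key`.
-- SKEWED BY tells Hive which values are hot; STORED AS DIRECTORIES (list
-- bucketing) additionally stores each skewed value in its own directory.
CREATE TABLE log_skewed (
  key   STRING,
  value STRING
)
SKEWED BY (key) ON ('hot_key')
STORED AS DIRECTORIES;
```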
2. Creating a table from a SELECT statement
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
AS select_statement;
3. Creating a table from an existing table or view
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
Data types supported by Hive:
data_type: primitive_type | array_type | map_type | struct_type | union_type
primitive_type: TINYINT|SMALLINT|INT|BIGINT|BOOLEAN|FLOAT|DOUBLE|DOUBLE PRECISION|STRING|BINARY|TIMESTAMP|DECIMAL
|DECIMAL(precision, scale)|DATE|VARCHAR|CHAR
array_type: ARRAY < data_type >
map_type: MAP < primitive_type, data_type >
struct_type: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type: UNIONTYPE < data_type, data_type, ... >
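A minimal sketch combining the complex types above (table and column names are hypothetical); the collection delimiters are declared so a text file can be parsed into them:

```sql
-- Hypothetical table using ARRAY, MAP, and STRUCT columns.
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, street:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','           -- separates top-level columns
  COLLECTION ITEMS TERMINATED BY '|' -- separates array/struct/map entries
  MAP KEYS TERMINATED BY ':';        -- separates a map key from its value
```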
row_format: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format: SEQUENCEFILE| TEXTFILE| RCFILE| ORC| PARQUET| AVRO
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...)
REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
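A sketch of the constraint syntax above (available since Hive 2.1; table and column names are hypothetical). Note that DISABLE NOVALIDATE makes constraints informational only: Hive neither enforces nor validates them.

```sql
-- Hypothetical parent table with an informational primary key.
CREATE TABLE dept (
  deptno INT,
  dname  STRING,
  PRIMARY KEY (deptno) DISABLE NOVALIDATE
);

-- Hypothetical child table with an informational foreign key.
CREATE TABLE emp (
  empno  INT,
  deptno INT,
  CONSTRAINT fk_emp_dept FOREIGN KEY (deptno)
    REFERENCES dept(deptno) DISABLE NOVALIDATE
);
```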
Notes:
① Table and column names are case-insensitive, but SerDe and property names are case-sensitive.
② A table created with the EXTERNAL keyword is called an external table, whereas a managed table's data is managed by Hive; use DESCRIBE EXTENDED table_name to check whether a table is managed or external.
③ Custom metadata can be attached to a table via TBLPROPERTIES; last_modified_user and last_modified_time are added and maintained automatically by Hive.
TBLPROPERTIES values:
TBLPROPERTIES ("comment"="table_comment")
TBLPROPERTIES ("hbase.table.name"="table_name")
TBLPROPERTIES ("immutable"=["true"|"false"])
TBLPROPERTIES ("orc.compress"=["NONE"|"ZLIB"|"SNAPPY"])
TBLPROPERTIES ("transactional"=["true"|"false"])
TBLPROPERTIES ("NO_AUTO_COMPACTION"=["true"|"false"])
TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory")
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num")
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct")
TBLPROPERTIES ("auto.purge"=["true"|"false"])
TBLPROPERTIES ("EXTERNAL"="TRUE")
TBLPROPERTIES("parquet.mr.int96.write.zone"="timeZoneId")
④ Use the `use database` command to switch databases, or qualify the name as database_name.table_name, to control which database the new table is created in.
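Table properties can also be set after creation with ALTER TABLE (a sketch; the table name and comment text are hypothetical):

```sql
-- Update the table comment via TBLPROPERTIES.
ALTER TABLE page_view SET TBLPROPERTIES ('comment' = 'page view fact table');

-- Mark the table immutable: inserts into a non-empty immutable table fail,
-- protecting against accidental double loads.
ALTER TABLE page_view SET TBLPROPERTIES ('immutable' = 'true');
```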
- Managed tables and external tables
By default, Hive creates managed tables: their files, metadata, and statistics are managed by Hive's internal processes. Managed tables are stored under the directory set by the hive.metastore.warehouse.dir property, although a different path can be given with the LOCATION clause at creation time. A defining trait of managed tables is that when a table or partition is dropped, its data and metadata are deleted with it; if the PURGE option is not specified, the data is moved to a trash directory.
Hive manages the lifecycle of managed tables and of the temporary tables it creates.
An external table's metadata/schema describes files stored outside Hive; those files can be managed and accessed by third-party (non-Hive) tools, and the data can live on storage such as ASV (Azure Storage) or HDFS. If an external table's files or partitions change on disk, run MSCK REPAIR TABLE table_name to refresh the metadata.
A defining trait of external tables: when the table is dropped, its files are retained.
Use DESCRIBE FORMATTED table_name to check whether a table is managed or external.
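A table's type can also be flipped after the fact via the EXTERNAL table property (a sketch; the table name is hypothetical):

```sql
-- Convert a managed table to external: dropping it will now keep the files.
ALTER TABLE page_view SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Convert back to managed: Hive resumes ownership of the data.
ALTER TABLE page_view SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```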
- Hive built-in storage formats
① STORED AS TEXTFILE: plain text, the default (the default can be changed via hive.default.fileformat). The DELIMITED clause parses delimited files; ESCAPED BY sets an escape character for the delimiter, which is required when the data itself contains the delimiter; NULL DEFINED AS sets a custom NULL representation (by default NULL='\N').
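A sketch putting the TEXTFILE clauses just described together (table and column names are hypothetical):

```sql
-- Comma-delimited text with an escape character and a custom NULL marker.
CREATE TABLE raw_events (
  id  INT,
  msg STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'          -- a literal comma inside msg is written as \,
  NULL DEFINED AS 'NULL'   -- store NULL as the string NULL instead of \N
STORED AS TEXTFILE;
```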
② STORED AS SEQUENCEFILE: compressed sequence-file format
③ STORED AS ORC: ORC file format
④STORED AS PARQUET
⑤STORED AS AVRO
⑥ STORED AS RCFILE: Record Columnar File
⑦ STORED BY: non-native table formats (storage handlers)
⑧INPUTFORMAT and OUTPUTFORMAT
- Row Formats & SerDe (serialization/deserialization)
When creating a table you can specify a custom SerDe (with the SERDE clause) or use the native SerDe. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the native SerDe is used.
Row Formats:
① RegEx format
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "<regex>"
)
STORED AS TEXTFILE;
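A concrete (hypothetical) instance of the template above: each capture group in input.regex becomes one column, so a line like `127.0.0.1 GET /index.html` is parsed into three fields.

```sql
-- Hypothetical access-log table parsed with RegexSerDe.
CREATE TABLE access_log (
  host   STRING,
  method STRING,
  path   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+)"
)
STORED AS TEXTFILE;
```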
② JSON format
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
③ CSV/TSV format
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
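OpenCSVSerde accepts separator, quote, and escape characters via SERDEPROPERTIES. A sketch for a tab-separated file (table and column names are hypothetical); note that OpenCSVSerde reads every column as STRING regardless of the declared type:

```sql
-- Hypothetical TSV table parsed with OpenCSVSerde.
CREATE TABLE csv_users (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;
```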
- Partitioned tables
A table created with the PARTITIONED BY clause is a partitioned table. A partitioned table can have one or more partition columns, and a separate directory is created for each distinct combination of partition values. Tables or partitions can be further subdivided into buckets with CLUSTERED BY, and SORTED BY orders the data within each bucket ascending or descending, which can improve the performance of certain queries.
Notes: ① Partition columns are virtual; they are not part of the data itself. ② A partition column must not have the same name as a table column; to partition on what is conceptually a data column, rename that column in the table schema first, otherwise an error is raised.
hive> create table page_view(viewTime int ,
> page_url string,
> referrer_url string,
> ip string comment 'ip address of the user')
> comment 'this is the page view table'
> partitioned by (dt string ,country string)
> stored as sequencefile;
hive> show create table page_view;
OK
CREATE TABLE `page_view`(
`viewtime` int,
`page_url` string,
`referrer_url` string,
`ip` string COMMENT 'ip address of the user')
COMMENT 'this is the page view table'
PARTITIONED BY (
`dt` string,
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://Master:9000/user/hive/warehouse/study.db/page_view'
TBLPROPERTIES (
'transient_lastDdlTime'='1502523496')
Time taken: 0.183 seconds, Fetched: 19 row(s)
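Once the partitioned table exists, each partition is populated explicitly (partition values and the input path below are hypothetical):

```sql
-- Register an empty partition, creating its directory.
ALTER TABLE page_view ADD PARTITION (dt='2017-08-12', country='US');

-- Load a file into that partition.
LOAD DATA INPATH '/tmp/pv_us.seq'
  INTO TABLE page_view PARTITION (dt='2017-08-12', country='US');
```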
- External tables
When creating an external table, the path given in the LOCATION clause overrides the warehouse directory from the configuration. A defining trait of external tables: dropping the table does not delete the corresponding data from the system.
hive> create external table page_view(viwTime int,
> page_url string,
> referrer_url string,
> ip string comment 'ip address of the user',
> country string comment 'country of origination'
> )
> comment 'this is the staging page view table'
> row format delimited fields terminated by ','
> stored as textfile
> location '/user/hadoop';
OK
Time taken: 0.139 seconds
hive> desc formatted page_view;
OK
# col_name data_type comment
viwtime int
page_url string
referrer_url string
ip string ip address of the user
country string country of origination
# Detailed Table Information
Database: study
Owner: root
CreateTime: Sat Aug 12 15:58:53 CST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://Master:9000/user/hadoop
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
comment this is the staging page view table
numFiles 0
totalSize 0
transient_lastDdlTime 1502524733
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.218 seconds, Fetched: 34 row(s)
- Creating tables with CTAS (Create Table As Select)
A table created with CTAS is atomic: other users either see the table with its complete query results or do not see the table at all.
Note: the target table cannot be a partitioned table, an external table, or a bucketed table.
hive> create external table emp_test
> location '/user/hadoop/emp_test'
> as select * from emp ;
FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
hive> create table emp_test
> partitioned by (dt date)
> location '/user/hadoop/emp_test'
> as select * from emp;
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
hive> create table emp_test
> clustered by (job) sorted by (empno desc ) into 5 buckets
> location '/user/hadoop/emp_test'
> as select * from emp;
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
The explain command shows the concrete stages CTAS goes through when creating a table:
hive> explain create table emp_test
> location '/user/hadoop/emp_test'
> as select * from emp;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-8 depends on stages: Stage-0
Stage-2 depends on stages: Stage-8
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
...
- CREATE TABLE LIKE copies the structure of an existing table; unlike CTAS, it copies only the schema, not the data.
hive> create table emp_test
> like emp
> location '/user/hadoop/emp_test';
-chgrp: '*' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
OK
Time taken: 0.134 seconds
hive> select * from emp_test;
-chgrp: '*' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
OK
Time taken: 0.177 seconds
Note that the following clause order raises a parse error:
hive> create table emp_test
> location '/user/hadoop/emp_test'
> like emp;
FAILED: ParseException line 3:0 missing EOF at 'like' near ''/user/hadoop/emp_test''