Hive tables fall into three kinds: managed (also called internal) tables, external tables, and temporary tables. The most fundamental difference between managed and external tables is whether the data is managed by Hive.
- Create table methods
1. Creating a table structure directly
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
--partition by one or more columns
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
--bucket by certain columns
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
--data is skewed on specific values of certain columns
[SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
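As a sketch of the less common clauses above (table, column, and skewed value are all hypothetical), a table skewed on one hot value might be declared as:

```sql
-- Hypothetical example: a log table where most rows share one value of `key`.
-- SKEWED BY tells Hive which values are hot; STORED AS DIRECTORIES (list
-- bucketing) additionally stores each skewed value in its own directory.
CREATE TABLE log_skewed (
  key   STRING,
  value STRING
)
SKEWED BY (key) ON ('hot_key')
STORED AS DIRECTORIES;
```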
2. Creating a table from a SELECT statement
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
AS select_statement;
3. Creating a table from an existing table or view
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
Data types supported by Hive:
data_type: primitive_type | array_type | map_type | struct_type | union_type
primitive_type: TINYINT|SMALLINT|INT|BIGINT|BOOLEAN|FLOAT|DOUBLE|DOUBLE PRECISION|STRING|BINARY|TIMESTAMP|DECIMAL
|DECIMAL(precision, scale)|DATE|VARCHAR|CHAR
array_type: ARRAY < data_type >
map_type: MAP < primitive_type, data_type >
struct_type: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type: UNIONTYPE < data_type, data_type, ... >
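A minimal sketch combining the complex types above (table and column names are hypothetical); the collection delimiters are declared so a text file can be parsed into them:

```sql
-- Hypothetical table using ARRAY, MAP, and STRUCT columns.
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, street:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','           -- separates top-level columns
  COLLECTION ITEMS TERMINATED BY '|' -- separates array/struct/map entries
  MAP KEYS TERMINATED BY ':';        -- separates a map key from its value
```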
row_format: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format: SEQUENCEFILE| TEXTFILE| RCFILE| ORC| PARQUET| AVRO
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...)
REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
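A sketch of the constraint syntax above (available since Hive 2.1; table and column names are hypothetical). Note that DISABLE NOVALIDATE makes constraints informational only: Hive neither enforces nor validates them.

```sql
-- Hypothetical parent table with an informational primary key.
CREATE TABLE dept (
  deptno INT,
  dname  STRING,
  PRIMARY KEY (deptno) DISABLE NOVALIDATE
);

-- Hypothetical child table with an informational foreign key.
CREATE TABLE emp (
  empno  INT,
  deptno INT,
  CONSTRAINT fk_emp_dept FOREIGN KEY (deptno)
    REFERENCES dept(deptno) DISABLE NOVALIDATE
);
```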
Notes:
① Table and column names are case-insensitive, but SerDe and property names are case-sensitive.
② A table created with the EXTERNAL keyword is called an external table, whereas a managed table's data is managed by Hive; use DESCRIBE EXTENDED table_name to check whether a table is managed or external.
③ Custom metadata can be attached to a table via TBLPROPERTIES; last_modified_user and last_modified_time are added and maintained automatically by Hive.
TBLPROPERTIES values:
TBLPROPERTIES ("comment"="table_comment")
TBLPROPERTIES ("hbase.table.name"="table_name")
TBLPROPERTIES ("immutable"=["true"|"false"])
TBLPROPERTIES ("orc.compress"=["NONE"|"ZLIB"|"SNAPPY"])
TBLPROPERTIES ("transactional"=["true"|"false"])
TBLPROPERTIES ("NO_AUTO_COMPACTION"=["true"|"false"])
TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory")
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num")
TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct")
TBLPROPERTIES ("auto.purge"=["true"|"false"])
TBLPROPERTIES ("EXTERNAL"="TRUE")
TBLPROPERTIES("parquet.mr.int96.write.zone"="timeZoneId")
④ Use the `use database` command to switch databases, or qualify the name as database_name.table_name, to control which database the new table is created in.
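Table properties can also be set after creation with ALTER TABLE (a sketch; the table name and comment text are hypothetical):

```sql
-- Update the table comment via TBLPROPERTIES.
ALTER TABLE page_view SET TBLPROPERTIES ('comment' = 'page view fact table');

-- Mark the table immutable: inserts into a non-empty immutable table fail,
-- protecting against accidental double loads.
ALTER TABLE page_view SET TBLPROPERTIES ('immutable' = 'true');
```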
- Managed tables and external tables
By default, Hive creates managed tables: their files, metadata, and statistics are managed by Hive's internal processes. Managed tables are stored under the directory set by the hive.metastore.warehouse.dir property, although a different path can be given with the LOCATION clause at creation time. A defining trait of managed tables is that when a table or partition is dropped, its data and metadata are deleted with it; if the PURGE option is not specified, the data is moved to a trash directory.
Hive manages the lifecycle of managed tables and of the temporary tables it creates.
An external table's metadata/schema describes files stored outside Hive; those files can be managed and accessed by third-party (non-Hive) tools, and the data can live on storage such as ASV (Azure Storage) or HDFS. If an external table's files or partitions change on disk, run MSCK REPAIR TABLE table_name to refresh the metadata.
A defining trait of external tables: when the table is dropped, its files are retained.
Use DESCRIBE FORMATTED table_name to check whether a table is managed or external.
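A table's type can also be flipped after the fact via the EXTERNAL table property (a sketch; the table name is hypothetical):

```sql
-- Convert a managed table to external: dropping it will now keep the files.
ALTER TABLE page_view SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Convert back to managed: Hive resumes ownership of the data.
ALTER TABLE page_view SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```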
- Hive built-in storage formats
① STORED AS TEXTFILE: plain text, the default (the default can be changed via hive.default.fileformat). The DELIMITED clause parses delimited files; ESCAPED BY sets an escape character for the delimiter, which is required when the data itself contains the delimiter; NULL DEFINED AS sets a custom NULL representation (by default NULL='\N').
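A sketch putting the TEXTFILE clauses just described together (table and column names are hypothetical):

```sql
-- Comma-delimited text with an escape character and a custom NULL marker.
CREATE TABLE raw_events (
  id  INT,
  msg STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'          -- a literal comma inside msg is written as \,
  NULL DEFINED AS 'NULL'   -- store NULL as the string NULL instead of \N
STORED AS TEXTFILE;
```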
② STORED AS SEQUENCEFILE: compressed sequence-file format
③ STORED AS ORC: ORC file format
④STORED AS PARQUET
⑤STORED AS AVRO
⑥ STORED AS RCFILE: Record Columnar File
⑦ STORED BY: non-native table formats (storage handlers)
⑧INPUTFORMAT and OUTPUTFORMAT
- Row Formats & SerDe (serialization/deserialization)
When creating a table you can specify a custom SerDe (with the SERDE clause) or use the native SerDe. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the native SerDe is used.
Row Formats:
① RegEx format
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "<regex>"
)
STORED AS TEXTFILE;
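A concrete (hypothetical) instance of the template above: each capture group in input.regex becomes one column, so a line like `127.0.0.1 GET /index.html` is parsed into three fields.

```sql
-- Hypothetical access-log table parsed with RegexSerDe.
CREATE TABLE access_log (
  host   STRING,
  method STRING,
  path   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+)"
)
STORED AS TEXTFILE;
```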
② JSON format
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
③ CSV/TSV format
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
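OpenCSVSerde accepts separator, quote, and escape characters via SERDEPROPERTIES. A sketch for a tab-separated file (table and column names are hypothetical); note that OpenCSVSerde reads every column as STRING regardless of the declared type:

```sql
-- Hypothetical TSV table parsed with OpenCSVSerde.
CREATE TABLE csv_users (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;
```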
- Partitioned tables
A table created with the PARTITIONED BY clause is a partitioned table. A partitioned table can have one or more partition columns, and a separate directory is created for each distinct combination of partition values. Tables or partitions can be further subdivided into buckets with CLUSTERED BY, and SORTED BY orders the data within each bucket ascending or descending, which can improve the performance of certain queries.
Notes: ① Partition columns are virtual; they are not part of the data itself. ② A partition column must not have the same name as a table column; to partition on what is conceptually a data column, rename that column in the table schema first, otherwise an error is raised.
hive> create table page_view(viewTime int ,
> page_url string,
> referrer_url string,
> ip string comment 'ip address of the user')
> comment 'this is the page view table'
> partitioned by (dt string ,country string)
> stored as sequencefile;
hive> show create table page_view;
OK
CREATE TABLE `page_view`(
`viewtime` int,
`page_url` string,
`referrer_url` string,
`ip` string COMMENT 'ip address of the user')
COMMENT 'this is the page view table'
PARTITIONED BY (
`dt` string,
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://Master:9000/user/hive/warehouse/study.db/page_view'
TBLPROPERTIES (
'transient_lastDdlTime'='1502523496')
Time taken: 0.183 seconds, Fetched: 19 row(s)
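Once the partitioned table exists, each partition is populated explicitly (partition values and the input path below are hypothetical):

```sql
-- Register an empty partition, creating its directory.
ALTER TABLE page_view ADD PARTITION (dt='2017-08-12', country='US');

-- Load a file into that partition.
LOAD DATA INPATH '/tmp/pv_us.seq'
  INTO TABLE page_view PARTITION (dt='2017-08-12', country='US');
```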
- External tables
When creating an external table, the path given in the LOCATION clause overrides the warehouse directory from the configuration. A defining trait of external tables: dropping the table does not delete the corresponding data from the system.
hive> create external table page_view(viwTime int,
> page_url string,
> referrer_url string,
> ip string comment 'ip address of the user',
> country string comment 'country of origination'
> )
> comment 'this is the staging page view table'
> row format delimited fields terminated by ','
> stored as textfile
> location '/user/hadoop';
OK
Time taken: 0.139 seconds
hive> desc formatted page_view;
OK
# col_name data_type comment
viwtime int
page_url string
referrer_url string
ip string ip address of the user
country string country of origination
# Detailed Table Information
Database: study
Owner: root
CreateTime: Sat Aug 12 15:58:53 CST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://Master:9000/user/hadoop
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
comment this is the staging page view table
numFiles 0
totalSize 0
transient_lastDdlTime 1502524733
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.218 seconds, Fetched: 34 row(s)
- Creating tables with CTAS (Create Table As Select)
A table created with CTAS is atomic: other users either see the table with its complete query results or do not see the table at all.
Note: the target table cannot be a partitioned table, an external table, or a bucketed table.
hive> create external table emp_test
> location '/user/hadoop/emp_test'
> as select * from emp ;
FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
hive> create table emp_test
> partitioned by (dt date)
> location '/user/hadoop/emp_test'
> as select * from emp;
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
hive> create table emp_test
> clustered by (job) sorted by (empno desc ) into 5 buckets
> location '/user/hadoop/emp_test'
> as select * from emp;
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
The explain command shows the concrete stages CTAS goes through when creating a table:
hive> explain create table emp_test
> location '/user/hadoop/emp_test'
> as select * from emp;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-8 depends on stages: Stage-0
Stage-2 depends on stages: Stage-8
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
...
- CREATE TABLE LIKE copies the structure of an existing table; unlike CTAS, it copies only the schema, not the data.
hive> create table emp_test
> like emp
> location '/user/hadoop/emp_test';
-chgrp: '*' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
OK
Time taken: 0.134 seconds
hive> select * from emp_test;
-chgrp: '*' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
OK
Time taken: 0.177 seconds
Note that the following clause order raises a parse error:
hive> create table emp_test
> location '/user/hadoop/emp_test'
> like emp;
FAILED: ParseException line 3:0 missing EOF at 'like' near ''/user/hadoop/emp_test''