Hive sql操作笔记

最近复习整理一下hive的使用,如有错误欢迎指出,望共同进步。

建表

参考官网
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

Create Table

----------------------------- start-------------------------
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], … [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]
[CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, …) – (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, …), (col_value, col_value, …), …)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY ‘storage.handler.class.name’ [WITH SERDEPROPERTIES (…)] – (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, …)] – (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; – (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];

data_type
primitive_type
| array_type
| map_type
| struct_type
| union_type – (Note: Available in Hive 0.7.0 and later)
primitive_type
TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION – (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY – (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP – (Note: Available in Hive 0.8.0 and later)
| DECIMAL – (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) – (Note: Available in Hive 0.13.0 and later)
| DATE – (Note: Available in Hive 0.12.0 and later)
| VARCHAR – (Note: Available in Hive 0.12.0 and later)
| CHAR – (Note: Available in Hive 0.13.0 and later)
array_type
ARRAY < data_type >
map_type
MAP < primitive_type, data_type >
struct_type
STRUCT < col_name : data_type [COMMENT col_comment], …>
union_type
UNIONTYPE < data_type, data_type, … > – (Note: Available in Hive 0.7.0 and later)
row_format
DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] – (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, …)]
file_format:
SEQUENCEFILE
| TEXTFILE – (Default, depending on hive.default.fileformat configuration)
| RCFILE – (Note: Available in Hive 0.6.0 and later)
| ORC – (Note: Available in Hive 0.11.0 and later)
| PARQUET – (Note: Available in Hive 0.13.0 and later)
| AVRO – (Note: Available in Hive 0.14.0 and later)
| JSONFILE – (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification:
[, PRIMARY KEY (col_name, …) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, …) REFERENCES table_name(col_name, …) DISABLE NOVALIDATE

----------------------------- end-------------------------

创建内部表

hive> create table psn
    > (
    > id int,
    > name string,
    > likes array<string>,
    > address map<string,string>
    > )
    > row format DELIMITED 
    > FIELDS TERMINATED BY ',' 
    > COLLECTION ITEMS TERMINATED BY '-'
    > MAP KEYS TERMINATED BY ':'
    > ;
hive> desc formatted psn;
OK
插入数据

准备如下数据
[root@node03 ~]# cat data_psn.txt
1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pudong

将本地数据导入hdfs
注:这里的’/root/data_psn.txt’是linux下的路径,不是hdfs的。

hive> load data local inpath '/root/data_psn.txt' into table psn;
Loading data to table default.psn
Table default.psn stats: [numFiles=1, totalSize=541]
OK
Time taken: 1.39 seconds
hive> select * from psn;
OK
1	小明1	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
2	小明2	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
3	小明3	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}

查看psn的表的详细属性

hive> desc formatted psn;
OK
# col_name            	data_type           	comment             
	 	 
id                  	int                 	                    
name                	string              	                    
likes               	array<string>       	                    
address             	map<string,string>  	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	root                	 
CreateTime:         	Sat Dec 29 17:37:32 CST 2018	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://mycluster/user/hive/warehouse/psn	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	true                
	numFiles            	1                   
	totalSize           	541                 
	transient_lastDdlTime	1546078121          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	colelction.delim    	-                   
	field.delim         	,                   
	mapkey.delim        	:                   
	serialization.format	,                   
Time taken: 0.385 seconds, Fetched: 35 row(s)
hive> 

其中 Table Type: MANAGED_TABLE 表示此为内部表。内部表是什么?有内部表就有外部表。具体有什么区别吗?
答:
外部表和内部表的区别
删除的时候,内部表是一并删除hive中的元数据和hdfs的数据
外部表只删除hive的元数据,hdfs中的不会删除。

外部表

建表sql

先导入/root/data_psn.txt数据到hdfs中
#hdfs dfs -put /root/data_psn.txt /usr/hive/data_psn
注意 : 这里的location定位的是hdfs的目录路径,不是linux的文件路径

create EXTERNAL table test_data_psn1
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format DELIMITED 
FIELDS TERMINATED BY ',' 
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
location '/usr/hive/data_psn'
;
查看外部表
hive> desc formatted test_data_psn1;
OK
# col_name            	data_type           	comment             
	 	 
id                  	int                 	                    
name                	string              	                    
likes               	array<string>       	                    
address             	map<string,string>  	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	root                	 
CreateTime:         	Fri Apr 05 03:12:31 CST 2019	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://mycluster/usr/hive/data_psn	 
Table Type:         	EXTERNAL_TABLE      	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	false               
	EXTERNAL            	TRUE                
	numFiles            	0                   
	numRows             	-1                  
	rawDataSize         	-1                  
	totalSize           	0                   
	transient_lastDdlTime	1554405151          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	colelction.delim    	-                   
	field.delim         	,                   
	mapkey.delim        	:                   
	serialization.format	,                   
Time taken: 0.31 seconds, Fetched: 38 row(s)
hive> 

其中 Table Type: EXTERNAL_TABLE 表示外部表

建表的其他方式

Hive 建表
方式一 :create table like 如
create table psn_createTableTest1 like psn;
方式二:create table as select 如
create table psn_createTableTest2 as select id,name from psn;

Hive分区

Hive 分区partition
必须在表定义时指定对应的partition字段
a、单分区建表语句:
create table day_table (id int, content string) partitioned by (dt string);
单分区表,按天分区,在表结构中存在id,content,dt三列。
以dt为文件夹区分
b、 双分区建表语句:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
双分区表,按天和小时分区,在表结构中新增加了dt和hour两列。
先以dt为文件夹,再以hour子文件夹区分

分区 单个字段
静态分区

要求:按照年龄来分区
准备数据(其中没有表示年龄的列)

[root@node03 usr]# cat /root/data
1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pud

注意:此时建表的属性中将年龄那个属性去掉

hive>create table psn2
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format DELIMITED 
FIELDS TERMINATED BY ',' 
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
;
hive>load data  local inpath '/root/data' into table psn2 partition (age=10) ;
hive>load data  local inpath '/root/data' into table psn2 partition (age=20) ;

注意:这里面的列age是通过partition (age=10) ;指定age=10导入的,不是通过data文件导入的。

分区 多个字段

要求:以年龄和性别来分区
create table psn3
(
id int,
name string,
likes array,
address map<string,string>
)
partitioned by(age int,sex string)
row format DELIMITED
FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘-’
MAP KEYS TERMINATED BY ‘:’
;
注意:此时建表的属性中将分区的属性去掉

导入数据的指令
load data local inpath ‘/root/data’ into table psn3 partition (age=10,sex=‘boy’) ;
load data local inpath ‘/root/data’ into table psn3 partition (age=20,sex=‘boy’) ;
load data local inpath ‘/root/data’ into table psn3 partition (age=10,sex=‘girl’) ;
load data local inpath ‘/root/data’ into table psn3 partition (age=20,sex=‘girl’) ;

删除,两个分区都可以删除
alter table psn3 drop partition (sex=‘boy’);
追加
alter table psn3 add partition(age=10,sex=‘boy’);

Hive查询执行分区语法

SELECT day_table.* FROM day_table WHERE day_table.dt>= ‘2008-08-08’;
分区表的意义在于优化查询。查询时尽量利用分区字段。如果不使用分区字段,就会全部扫描。
Hive查询表的分区信息语法:
SHOW PARTITIONS day_hour_table;
预先导入分区数据,但是无法识别怎么办
Msck repair table tablename
直接添加分区

Hive DML

create table psn4
(
id int,
name string,
likes array
)
row format DELIMITED
FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘-’
;
从一张表导入另一张表(经常用)
from psn3
insert overwrite table psn4
select id,name,likes
insert overwrite table psn5
select id,name
这样写的好处是一次可以插入多个表,而只执行一次MapReduce任务。

从目录中导入(不常用)
总结,导入hive表的四种方式
load
insert values
from insert overwrite
insert from directory目录

hive 参数设置方式

1、修改配置文件 ${HIVE_HOME}/conf/hive-site.xml
2、启动hive cli时,通过–hiveconf key=value的方式进行设置
例:hive --hiveconf hive.cli.print.header=true
3、进入cli之后,通过使用set命令设置

hive set命令

在hive CLI控制台可以通过set对hive中的参数进行查询、设置
set设置:
set hive.cli.print.header=true;
set查看
set hive.cli.print.header
hive参数初始化配置
当前用户家目录下的.hiverc文件
如: ~/.hiverc
如果没有,可直接创建该文件,将需要设置的参数写到该文件中,hive启动运行时,会加载改文件中的配置。
如果不知道如何设置key=value,可以通过进入命令行 输入set;
即可查询到关于hive的参数信息,如

hive> set;
_hive.hdfs.session.path=/tmp/hive/root/1148b521-b659-45a3-b63b-4bdcd85a834f
_hive.local.session.path=/tmp/root/1148b521-b659-45a3-b63b-4bdcd85a834f
_hive.tmp_table_space=/tmp/hive/root/1148b521-b659-45a3-b63b-4bdcd85a834f/_tmp_space.db
datanucleus.autoCreateSchema=true
datanucleus.autoStartMechanismMode=checked
datanucleus.cache.level2=false
datanucleus.cache.level2.type=none
datanucleus.connectionPoolingType=BONECP
datanucleus.fixedDatastore=false
datanucleus.identifierFactory=datanucleus1
datanucleus.plugin.pluginRegistryBundleCheck=LOG

hive 动态分区

目标:将某个表psn21的部分数据用动态分区的方式插入psn22中。

stpe1

准备原始数据插入原始表psn21,其中 第二列是年龄,第三列为性别,以这两个作为分区字段。

[root@node03 ~]# cat /root/data21
1,小明1,10,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,10,girl,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,10,girl,lol-book-movie,beijing:shangxuetang-shanghai:pudong
4,小明4,20,girl,lol-book-movie,beijing:shangxuetang-shanghai:pudong
5,小明5,20,boy,lol-movie,beijing:shangxuetang-shanghai:pudong
6,小明6,20,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
7,小明7,20,boy,lol-book,beijing:shangxuetang-shanghai:pudong
8,小明8,10,boy,lol-book,beijing:shangxuetang-shanghai:pudong
9,小明9,10,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
stpe2

建立原始表
create table psn21
(
id int,
name string,
age int,
sex string,
likes array,
address map<string,string>
)
row format delimited
fields terminated by ‘,’
collection items terminated by ‘-’
map keys terminated by ‘:’
;
load data local inpath ‘root/data21’ into table psn21;

stpe3

创建动态分区表

create table psn23
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;
stpe4 导入分区表

把原始表psn21数据导入动态分区表psn22
from psn21
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address,age,sex distribute by age,sex;
此时会通过提交MapReduce任务实现具体的分区导入。

先到这边吧,后期有空在拓展。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值