hive练习案例

最新推荐文章于 2024-05-31 09:28:08 发布

浮生若梦1379

最新推荐文章于 2024-05-31 09:28:08 发布

阅读量669

点赞数

分类专栏： hive

本文链接：https://blog.csdn.net/weixin_37796929/article/details/90545638

版权

hive 专栏收录该内容

16 篇文章 1 订阅

订阅专栏

			本地模式

set hive.exec.mode.local.auto=true;

创建的表：

create table if not exists employee(
   name string comment 'employee name',
   salary float comment 'employee salary',
   subordinates array<string> comment 'names of subordinates',
   deductions map<string,float> comment 'keys are deductions values are percentages',
   address struct<street:string, city:string, state:string, zip:int> comment 'home address'
)
comment 'description of the table'
location '/user/hive/warehouse/mydb.db/employee'
tblproperties ('creator'='yoona','date'='20160719');

//注意：此处location放在tblproperties之前
//comment 为注释代码
//TBLPROPERTIES允许开发者定义一些自己的键值对信息。可以对TBLPROPERTIES进行查看和修改（部分可修改）。
在TBLPROPERTIES中有一些预定义信息，比如last_modified_user和last_modified_time，其他的一些预定义信息包括：

 单分区
 create table day_table(id int,content string) partitioned by(dt string)
查询分区：
select day_table  * from day  where day_table.dt='2018-07-087'
分区就是为了缩小查询范围

多分区
create table day_table(id int,content string) partitioned by(dt string,hour string);
分区的修改：
增加分区
alter table 表名 add partition(dt=‘2018-0808’)
删除分区：
alter table 表名 drop partition(dt=‘2018-0808’)
分桶：
字段： cluster by 、sort by、distribute by
指定开启分桶
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
创建分桶表：
create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno)
sorted by(Sno DESC)
into 4 buckets
row format delimited
fields terminated by ‘,’;

		准备数据：
								
					95001,李勇,男,20,CS
					95002,刘晨,女,19,IS
					95003,王敏,女,22,MA
					95004,张立,男,19,IS
					95005,刘刚,男,18,MA
					95006,孙庆,男,23,CS
					95007,易思玲,女,19,MA
					95008,李娜,女,18,CS
					95009,梦圆圆,女,18,MA
					95010,孔小涛,男,19,CS
					95011,包小柏,男,18,MA
					95012,孙花,女,20,CS
					95013,冯伟,男,21,CS
					95014,王小丽,女,19,CS
					95015,王君,男,18,MA
					95016,钱国,男,21,MA
					95017,王风娟,女,18,IS
					95018,王一,女,19,IS
					95019,邢小丽,女,19,IS
					95020,赵钱,男,21,IS
					95021,周二,男,17,MA
					95022,郑明,男,20,MA
		
		注：
		    1分桶表的个数：用户定义HSQL语句设置的reduceTask的个数决定，分桶表已经决定就不能修改，如果要改变分桶，需要重新插入分桶数据
			      
			2表的分区的个数：用户自定义的，可以由程序自定义生成，也可以动态增长
			
		     数据表现：
			         1分桶表中每个分桶数据可以有多个key值
					 2分区表中每个分区只有一个key


        插入数据：
		select * from student cluster by(Sno) sort by(Sage);  报错,cluster 和 sort 不能共存
        所以，用下面的：
           insert overwrite table t07 select * from t07_tmp cluster by(sno);
  
         select * from stu_buck tablesample (bucket 1 out of 2 on sno);

             解析：查询第1桶、第（1+2）桶的数据，其中sno是之前按照sno进行分桶。
                    这里的“2”的位置，必须是分桶个数的整数倍或者因子


 内外部表
        创建内部表：
		创建外部表：
        CREATE external TABLE test1(id INT, content STRING) ROW FORMAT delimited FIELDS TERMINATED BY ',' location '/stu';
       
	 区别：
	      删除内部表时，会删除表和相关的元数据；
          删除外部表时，会删除相关的元数据，但是那个目录不会删除

修改表：

1.2．修改表

增加分区： 

ALTER TABLE table_name ADD PARTITION (dt='20170101') location 

'/user/hadoop/warehouse/table_name/dt=20170101'; //一次添加一个分区 

ALTER TABLE table_name ADD PARTITION (dt='2008-08-08', country='us') location 

 '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') location 

 '/path/to/us/part080809';  //一次添加多个分区 

删除分区 

ALTER TABLE table_name DROP IF EXISTS PARTITION (dt='2008-08-08'); 

ALTER TABLE table_name DROP IF EXISTS PARTITION (dt='2008-08-08', country='us'); 

修改分区 

ALTER TABLE table_name PARTITION (dt='2008-08-08') RENAME TO PARTITION (dt='20080808'); 

添加列 

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name STRING);  

注：ADD 是代表新增一个字段，新增字段位置在所有列后面(partition 列前) 

REPLACE 则是表示替换表中所有字段。 

修改列 

       test_change (a int, b int, c int); 

        ALTER TABLE test_change CHANGE a a1 INT; 
    	修改 a 字段名 
    			表重命名 
		ALTER TABLE table_name RENAME TO new_table_name

显示命令

		show tables; 
		显示当前数据库所有表 
		show databases |schemas; 
		显示所有数据库 
		show partitions table_name; 
		显示表分区信息，不是分区表执行报错 
		show functions; 
		显示当前版本 hive 支持的所有方法 
		desc extended table_name; 
		查看表信息 
		desc formatted table_name; 
		查看表信息（格式化美观） 
		describe database database_name; 
		查看数据库相关信息

like字段：
create table t_t9 like t_t8;
load字段
Load 在将数据加载到表中时， Hive 不会进行任何转换。加载操作是将数据文件移动到与 Hive表对应的位置的纯复制/移动操作。
语法结构
LOAD DATA [LOCAL] INPATH ‘filepath’ [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]

				说明： 
				1、 filepath  
				相对路径，例如：project/data1  
				绝对路径，例如：/user/hive/project/data1  
				完整 URI，例如：hdfs://namenode:9000/user/hive/project/data1 
				filepath 可以引用一个文件（在这种情况下，Hive 将文件移动到表中） ，或
				者它可以是一个目录 （在这种情况下， Hive 将把该目录中的所有文件移动到表中） 。 

				2、 LOCAL 
				如果指定了 LOCAL， load 命令将在本地文件系统中查找文件路径。 
				load 命令会将 filepath 中的文件复制到目标文件系统中。目标文件系统由表
				的位置属性决定。被复制的数据文件移动到表的数据对应的位置。 
				如果没有指定 LOCAL 关键字，如果 filepath 指向的是一个完整的 URI，hive 
				会直接使用这个 URI。 否则：如果没有指定 schema 或者 authority，Hive 会使
				用在 hadoop 配置文件中定义的 schema 和 authority，fs.default.name 指定了 
				Namenode 的 URI。  

				3、 OVERWRITE  
				如果使用了 OVERWRITE 关键字，则目标表（或者分区）中的内容会被删除，
				然后再将 filepath 指向的文件/目录中的内容添加到表/分区中。  
				如果目标表 （分区） 已经有一个文件， 并且文件名和 filepath 中的文件名冲突，
				那么现有的文件会被新文件所替代

动态分区插入

set hive.exec.dynamic.partition=true;    #是否开启动态分区功能，默认false关闭。
set hive.exec.dynamic.partition.mode=nonstrict;   #动态分区的模式，默认strict，表示必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区。

需求：
将dynamic_partition_table中的数据按照时间(day)，插入到目标表d_p_t的相应分区中。

原始表：

create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ","; 

load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;

dynamic_partition_table.txt数据为：

2015-05-10,ip1

2015-05-10,ip2

2015-06-14,ip3

2015-06-14,ip4

2015-06-15,ip1

2015-06-15,ip2

目标表：

create table d_p_t(ip string) partitioned by (month string,day string);

动态插入：

insert overwrite table d_p_t partition (month,day) 

select ip,substr(day,1,7) as month,day 

from dynamic_partition_table;

1: jdbc:hive2://node1:10000> select * from d_p_t;
±----------±-------------±------------±-+
| d_p_t.ip | d_p_t.month | d_p_t.day |
±----------±-------------±------------±-+
| ip1 | 2015-05 | 2015-05-10 |
| ip2 | 2015-05 | 2015-05-10 |
| ip3 | 2015-06 | 2015-06-14 |
| ip4 | 2015-06 | 2015-06-14 |
| ip1 | 2015-06 | 2015-06-15 |
| ip2 | 2015-06 | 2015-06-15 |
±----------±-------------±------------±-+

查询结果导出到文件系统

3、将查询结果保存到指定的文件目录（可以是本地，也可以是hdfs）
insert overwrite local directory ‘/home/hadoop/test’
select * from t_p;

insert overwrite directory ‘/aaa/test’
select * from t_p;

Select
基本的 Select 操作
语法结构
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
JOIN table_other ON expr
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list] ]
[LIMIT number]

说明：
1、order by 会对输入做全局排序，因此只有一个 reducer，会导致当输入规模较大时，
需要较长的计算时间。

2、sort by 不是全局排序，其在数据进入 reducer 前完成排序。因此，如果用 sort by 进
行排序，并且设置 mapred.reduce.tasks>1，则 sort by 只保证每个 reducer 的输出有序，不保
证全局有序。

3、 distribute by(字段)根据指定字段将数据分到不同的 reducer，分发算法是 hash 散列。

4、Cluster by(字段) 除了具有 Distribute by 的功能外，还会对该字段进行排序。
如果 distribute 和 sort 的字段是同一个时，此时，cluster by = distribute by + sort by

   1分桶下载：

insert overwrite local directory ‘/root/hivedata/aaa’
select * from t07_tmp cluster by(Sno);

select * from student order by sage asc
2分桶排序
insert overwrite local directory ‘/root/aaa888’
select * from student DISTRIBUTE by Sno sort by sage

3全局排序：
select * from student order by sage asc

多表关联：

create table a(id int,name string) row format delimited fields terminated by ‘,’;

create table b(id int,name string) row format delimited fields terminated by ‘,’;

关于hive中的各种join

准备数据
1,a
2,b
3,c
4,d
7,y
8,u

2,bb
3,cc
7,yy
9,pp

导入数据
load data local inpath ‘/root/hivedata/a.txt’ into table a;
load data local inpath ‘/root/hivedata/b.txt’ into table b;

select * from a cross join b;

cross join（##慎用）
返回两个表的笛卡尔积结果，不需要指定关联键。
±------±--------±------±--------±-+
| a.id | a.name | b.id | b.name |
±------±--------±------±--------±-+
| 1 | a | 2 | bb |
| 1 | a | 3 | cc |
| 1 | a | 7 | yy |
| 1 | a | 9 | pp |
| 2 | b | 2 | bb |
| 2 | b | 3 | cc |
| 2 | b | 7 | yy |
| 2 | b | 9 | pp |
| 3 | c | 2 | bb |
| 3 | c | 3 | cc |
| 3 | c | 7 | yy |
| 3 | c | 9 | pp |
| 4 | d | 2 | bb |
| 4 | d | 3 | cc |
| 4 | d | 7 | yy |
| 4 | d | 9 | pp |
| 7 | y | 2 | bb |
| 7 | y | 3 | cc |
| 7 | y | 7 | yy |
| 7 | y | 9 | pp |
| 8 | u | 2 | bb |
| 8 | u | 3 | cc |
| 8 | u | 7 | yy |
| 8 | u | 9 | pp |
±------±--------±------±--------±-+

内连接：
select * from a inner join b on a.id=b.id;
±------±--------±------±--------±-+
| a.id | a.name | b.id | b.name |
±------±--------±------±--------±-+
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 7 | y | 7 | yy |
±------±--------±------±--------±-+

左连接：
select * from a left join b on a.id=b.id;
±------±--------±------±--------±-+
| a.id | a.name | b.id | b.name |
±------±--------±------±--------±-+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
±------±--------±------±--------±-+

右连接：
select * from a right join b on a.id=b.id;
±------±--------±------±--------±-+
| a.id | a.name | b.id | b.name |
±------±--------±------±--------±-+
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 7 | y | 7 | yy |
| NULL | NULL | 9 | pp |
±------±--------±------±--------±-+

全连接：
select * from a full outer join b on a.id=b.id;
±------±--------±------±--------±-+
| a.id | a.name | b.id | b.name |
±------±--------±------±--------±-+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
| NULL | NULL | 9 | pp |
±------±--------±------±--------±-+

**hive中的特别join
select * from a left semi join b on a.id = b.id;
±------±--------±-+
| a.id | a.name |
±------±--------±-+
| 2 | b |
| 3 | c |
| 7 | y |
±------±--------±-+
相当于
select a.id,a.name from a where a.id in (select b.id from b); 在hive中效率极低

select a.id,a.name from a join b on (a.id = b.id);

其他：
1参数的配置
2运算符和dual表
create table dual(id string);
load data local inpath ‘/root/hivedata/dual.log’ into table dual;
select 1 from dual where 1=1; //相等显示
| _c0 |
±-----±-+
| 1 |
±-----±-+
select 1 from dual where 1=2; //后面不等于，就不显示
±-----±-+
| _c0 |
±-----±-+
±-----±-+
select 1 from dual where ‘key’ like ‘foot%’;
±-----±-+
| _c0 |
±-----±-+
±-----±-+
加法：
select 1+9 from dual where 1=1;
±-----±-+
| _c0 |
±-----±-+
| 10 |
±-----±-+
除法：
select 40/5 from dual where 1=1;
±-----±-+
| _c0 |
±-----±-+
| 8.0 |
±-----±-+
截取函数：
select substring(‘1234’,2,3) from dual;
±-----±-+
| _c0 |
±-----±-+
| 234 |
±-----±-+

自定义分割符
drop table t_bi_reg;
create table t_bi_reg(id string,name string) row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’ with serdeproperties(
‘input.regex’=’(.)\|\|(.)’,
‘output.format.string’=’%1 $s$ s’
)
stored as textfile;
hive>load data local inpath ‘/root/hivedata/bi.dat’ into table t_bi_reg;
hive>select * from t_bi_reg;

±-------------±---------------±-+
| t_bi_reg.id | t_bi_reg.name |
±-------------±---------------±-+
| zhangsan | 23 |
| lisi | 45 |
±-------------±---------------±-+

联级求和：
create table t_access_times(username string,month string,salary int)
row format delimited fields terminated by ‘,’;

load data local inpath ‘/root/hivedata/t_access_times.dat’ into table t_access_times;

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

select * from t_access_times;
±-------------------------±----------------------±-----------------------±-+
| t_access_times.username | t_access_times.month | t_access_times.salary |
±-------------------------±----------------------±-----------------------±-+
| A | 2015-01 | 5 |
| A | 2015-01 | 15 |
| B | 2015-01 | 5 |
| A | 2015-01 | 8 |
| B | 2015-01 | 25 |
| A | 2015-01 | 5 |
| A | 2015-02 | 4 |
| A | 2015-02 | 6 |
| B | 2015-02 | 10 |
| B | 2015-02 | 5 |
±-------------------------±----------------------±-----------------------±-+

1、第一步，先求个用户的月总金额
select username,month,sum(salary) as salary from t_access_times group by username,month
tmp
±----------±---------±--------±-+
| username | month | salary |
±----------±---------±--------±-+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
±----------±---------±--------±-+

2、第二步，将月总金额表自己连接自己连接
select A.,B. FROM
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B on A.username=B.username
where B.month <= A.month;
±------------±---------±----------±------------±---------±----------±-+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
±------------±---------±----------±------------±---------±----------±-+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
±------------±---------±----------±------------±---------±----------±-+

注：此处有坑：注意空格
bug：Error: Error while compiling statement: FAILED: ParseException line 2:0 cannot recognize input near ‘select’ ‘username’ ‘,’ in join source (state=42000,code=40000)

3、第三步，从上一步的结果中
进行分组查询，分组的字段是a.username a.month
求月累计值：将b.month <= a.month的所有b.salary求和即可
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;

±------------±---------±--------±------------±-+
| a.username | a.month | salary | accumulate |
±------------±---------±--------±------------±-+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
±------------±---------±--------±------------±-+

浮生若梦1379

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
hive练习案例

本地模式set hive.exec.mode.local.auto=true; 创建的表：create table if not exists employee( name string comment 'employee name', salary float comment 'employee salary', subordinates array<st...
复制链接

扫一扫

专栏目录