3. Hive Table Operations: Insert, Delete, Update, and Query

weixin_33974433

Original link: https://my.oschina.net/liufukin/blog/798531

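1. Loading data with LOAD

1.1. Basic syntax:

load data [local] inpath 'path' [overwrite] into table table_name [partition(partitionfield='xx')];

1.2. What it does: LOAD copies or moves the data files from the path given by INPATH into the directory of the table or of the partition.

  If the data is on the local filesystem (LOCAL INPATH), the files are copied.
  If the data is already on HDFS (INPATH without LOCAL), the files are moved.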
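
The copy-versus-move distinction can be illustrated at the file level with a small Python sketch. This is only a model of the behavior described above, not Hive's implementation: the `load_data` helper, the temporary directories, and the file names are all assumptions made for the illustration.

```python
import os
import shutil
import tempfile

def load_data(src_path, table_dir, local):
    """Model LOAD DATA at the file level: copy for LOCAL, move otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    dest = os.path.join(table_dir, os.path.basename(src_path))
    if local:
        shutil.copy(src_path, dest)   # LOCAL INPATH: the source file stays in place
    else:
        shutil.move(src_path, dest)   # INPATH on HDFS: the source file is gone afterwards
    return dest

# Hypothetical stand-ins for a local file and a file that already sits on HDFS.
root = tempfile.mkdtemp()
local_file = os.path.join(root, "a.txt")
hdfs_file = os.path.join(root, "b.txt")
for path in (local_file, hdfs_file):
    with open(path, "w") as f:
        f.write("1,a\n")

warehouse = os.path.join(root, "warehouse", "t")
load_data(local_file, warehouse, local=True)
load_data(hdfs_file, warehouse, local=False)

local_still_exists = os.path.exists(local_file)
hdfs_still_exists = os.path.exists(hdfs_file)
print(local_still_exists)  # True: copied
print(hdfs_still_exists)   # False: moved
```

The practical consequence is that after a load from HDFS the file can no longer be found at its original path.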

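2. Dynamic partitioning

2.1. Meaning: with dynamic partitioning, each row returned by a SELECT is inserted into the partition of the target table that corresponds to the value of one of the selected columns.

2.2. Example:

  First create an ordinary (non-partitioned) student detail table, t_a
  	create table t_a(sno int,sname string,sex string,sage string);

  Create a student basic-info table, t_stu_baseinfo, partitioned by sage
  	create table t_stu_baseinfo(sno int,sname string,sex string) partitioned by(sage string);

  Select several columns from the detail table and insert them into t_stu_baseinfo, placing each row in the partition that matches its sage value
  Note: dynamic partitioning must be enabled first, and because every partition column here is dynamic, the partition mode must be nonstrict

  	hive> set hive.exec.dynamic.partition=true;
  	hive> set hive.exec.dynamic.partition.mode=nonstrict;
  	hive> insert into table t_stu_baseinfo partition(sage) select sno,sname,sex,sage from t_a;

  Check the result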

  	hive>  show partitions t_stu_baseinfo;
  	OK
  	sage=17
  	sage=18
  	sage=19
  	sage=20
  	sage=21
  	sage=22
  	sage=23
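
As a rough mental model of what the dynamic-partition insert does, the Python sketch below routes each selected row to a partition named after the value of its last column. The sample rows and the `dynamic_partition` helper are hypothetical and only mirror the idea; Hive itself writes the rows into one directory per partition.

```python
from collections import defaultdict

# Hypothetical sample rows in the column order (sno, sname, sex, sage).
rows = [
    (1, "Ann", "F", "17"),
    (2, "Bob", "M", "18"),
    (3, "Cid", "M", "17"),
]

def dynamic_partition(rows):
    """Group rows by the last column, which plays the role of the partition column."""
    partitions = defaultdict(list)
    for *data, sage in rows:
        # The partition column is stored in the partition name, not in the row data.
        partitions["sage=" + sage].append(tuple(data))
    return dict(partitions)

partitions = dynamic_partition(rows)
for name in sorted(partitions):
    print(name, partitions[name])
```

The sketch also shows why the partition column has to come last in the SELECT list: it is peeled off the end of each row to decide where the row goes.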

3. About JOIN

3.1. Join types:

Standard: INNER JOIN (JOIN), LEFT JOIN (LEFT OUTER JOIN), RIGHT JOIN (RIGHT OUTER JOIN), FULL OUTER JOIN
Hive-specific: LEFT SEMI JOIN

3.2. Example:

  Prepare the data (the first block is a.txt, the second is b.txt)
  1,a
  2,b
  3,c
  4,d
  7,y
  8,u

  2,bb
  3,cc
  7,yy
  9,pp

3.3. Create the tables:

  create table a(id int,name string)
  row format delimited fields terminated by ',';

  create table b(id int,name string)
  row format delimited fields terminated by ',';

3.4. Load the data:

  load data local inpath '/root/hivedata/a.txt' into table a;
  load data local inpath '/root/hivedata/b.txt' into table b;

  Experiments:
  ** inner join ==> shows only the rows that match on both sides
  select a.*,b.* from a inner join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 7     | y       | 7     | yy      |
  +-------+---------+-------+---------+--+

  **left join  ==> shows every row of table a; the columns from b are NULL where there is no match
  select * from a left join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 1     | a       | NULL  | NULL    |
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 4     | d       | NULL  | NULL    |
  | 7     | y       | 7     | yy      |
  | 8     | u       | NULL  | NULL    |
  +-------+---------+-------+---------+--+

  **right join ==> shows every row of table b; the columns from a are NULL where there is no match
  select * from a right join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 7     | y       | 7     | yy      |
  | NULL  | NULL    | 9     | pp      |
  +-------+---------+-------+---------+--+

  **full outer join  ==> shows every row from both tables
  select * from a full outer join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 1     | a       | NULL  | NULL    |
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 4     | d       | NULL  | NULL    |
  | 7     | y       | 7     | yy      |
  | 8     | u       | NULL  | NULL    |
  | NULL  | NULL    | 9     | pp      |
  +-------+---------+-------+---------+--+

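  **left semi join (Hive-specific)  ==> returns the rows of a that have a match in b
  select * from a left semi join b on a.id = b.id;
  The result corresponds to the left-table columns of the rows that matched in the left join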
  +-------+---------+--+
  | a.id  | a.name  |
  +-------+---------+--+
  | 2     | b       |
  | 3     | c       |
  | 7     | y       |
  +-------+---------+--+
  This is equivalent to
  select * from a where exists (select 1 from b where b.id = a.id); which can be far less efficient in Hive
  --------------------------------------------------------------------------------
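
To make the difference between a left join and a left semi join concrete, here is a minimal Python sketch over the same a and b data. The helper functions are illustrative stand-ins, not how Hive executes joins. The extra row `(2, "bbb")` is a hypothetical addition to show one point the tables above cannot: a duplicate key in b multiplies rows in a left join but not in a semi join.

```python
a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]

def left_join(left, right):
    """Every left row appears; unmatched rows get None for the right columns."""
    out = []
    for lid, lname in left:
        matches = [r for r in right if r[0] == lid]
        if matches:
            out.extend((lid, lname, rid, rname) for rid, rname in matches)
        else:
            out.append((lid, lname, None, None))
    return out

def left_semi_join(left, right):
    """Only left rows with at least one match; only left columns are returned."""
    right_keys = {key for key, _ in right}
    return [row for row in left if row[0] in right_keys]

semi = left_semi_join(a, b)
print(semi)  # [(2, 'b'), (3, 'c'), (7, 'y')]

# Hypothetical duplicate key in b: the left join grows, the semi join does not.
b_dup = b + [(2, "bbb")]
print(len(left_join(a, b)), len(left_join(a, b_dup)))  # 6 7
print(left_semi_join(a, b_dup) == semi)                # True
```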

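4. About grouped queries

A grouped query differs from a bucketed query in that it returns exactly one row per group. Bucketing only distributes rows into buckets by hash % number of reducers, so the total number of rows is the same before and after. Each bucket holds the records of individual people, and each person may have several records. Take each person's monthly internet traffic as an example: after bucketing, all of Xiao Ming's records are guaranteed to be in the same bucket, but that bucket may also contain the records of Xiao Dong and Lao Zhang.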
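
The contrast between bucketing and grouping can be sketched in a few lines of Python. The records, the bucket count, and the use of `zlib.crc32` as the hash are assumptions chosen for the illustration; Hive uses its own hash function, so the actual bucket numbers would differ.

```python
import zlib
from collections import defaultdict

# Hypothetical traffic records: (name, month, traffic).
records = [
    ("Xiao Ming", "2015-01", 5),
    ("Xiao Ming", "2015-02", 4),
    ("Xiao Dong", "2015-01", 8),
    ("Lao Zhang", "2015-01", 3),
    ("Lao Zhang", "2015-02", 6),
]
NUM_BUCKETS = 2  # stands in for the number of reducers

def bucket_of(key, n):
    """Deterministic stand-in for hash(key) % n."""
    return zlib.crc32(key.encode("utf-8")) % n

# Bucketing: rows are only redistributed, none are merged.
buckets = defaultdict(list)
for rec in records:
    buckets[bucket_of(rec[0], NUM_BUCKETS)].append(rec)
total_after = sum(len(rows) for rows in buckets.values())

# Grouping: each key collapses to a single row.
groups = defaultdict(int)
for name, _, traffic in records:
    groups[name] += traffic

print(len(records), total_after)  # same row count before and after bucketing
print(len(groups))                # one row per person after grouping
```

Because the bucket is a function of the key alone, all records of one person always end up together, while several people may share a bucket.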

+-----------------------+--------------------+---------------------+--+
| usermag_tab.username  | usermag_tab.month  | usermag_tab.salary  |
+-----------------------+--------------------+---------------------+--+
| A                     | 2015-01            | 5                   |
| A                     | 2015-01            | 15                  |
| B                     | 2015-01            | 5                   |
| A                     | 2015-01            | 8                   |
| B                     | 2015-01            | 25                  |
| A                     | 2015-01            | 5                   |
| A                     | 2015-02            | 4                   |
| A                     | 2015-02            | 6                   |
| B                     | 2015-02            | 10                  |
| B                     | 2015-02            | 5                   |
+-----------------------+--------------------+---------------------+--+

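First set the number of reducers. Hadoop's own default is 1, while Hive's default is -1, which lets Hive choose the number automatically.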

set mapreduce.job.reduces=5;

To total this data per user and per month (the traffic values are stored in the salary column):

select username,month,sum(salary) as salary from usermag_tab group by username,month;

To accumulate each user's total across all months instead, use:

select username,max(month) as month,sum(salary) as salary from usermag_tab group by username;
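
The two queries above can be checked by hand against the sample table. The Python sketch below reproduces the same aggregation in plain code; the function names are made up for the illustration and the data is copied from the table shown earlier.

```python
from collections import defaultdict

usermag_tab = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("B", "2015-02", 10),
    ("B", "2015-02", 5),
]

def sum_by_user_month(rows):
    """Mirror of: group by username, month."""
    totals = defaultdict(int)
    for username, month, salary in rows:
        totals[(username, month)] += salary
    return dict(totals)

def sum_by_user(rows):
    """Mirror of: max(month), sum(salary) ... group by username."""
    totals = {}
    for username, month, salary in rows:
        last_month, total = totals.get(username, (month, 0))
        totals[username] = (max(last_month, month), total + salary)
    return totals

per_month = sum_by_user_month(usermag_tab)
per_user = sum_by_user(usermag_tab)
print(per_month[("A", "2015-01")], per_month[("B", "2015-02")])  # 33 15
print(per_user)  # {'A': ('2015-02', 43), 'B': ('2015-02', 45)}
```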

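Note: the following query fails with an error. month has several values within each group, so it must either appear in GROUP BY or be reduced to a single value with an aggregate function: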

select username,month,sum(salary) as salary from usermag_tab group by username;

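A grouped query cannot deduplicate whole rows, because the non-grouped columns have several values per group and cannot be matched one-to-one with the group. The usual approach is to use the row_number, rank, or dense_rank function to number the rows within each group, and then filter on that number to get the top N. Taking the top 1 with row_number deduplicates the data; rank and dense_rank can return several rows when values tie.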
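
The row_number approach can be sketched in Python as follows. This is a plain-code model of numbering rows within each group and keeping the first N, under the assumption that rows are ranked by salary in descending order within each username; it is not Hive's implementation, and `top_n_per_group` is a made-up name.

```python
from itertools import groupby

usermag_tab = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("B", "2015-02", 10),
    ("B", "2015-02", 5),
]

def top_n_per_group(rows, n):
    """Number rows per username by salary descending, keep row numbers <= n."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        for row_number, row in enumerate(group, start=1):
            if row_number <= n:
                result.append(row + (row_number,))
    return result

top1 = top_n_per_group(usermag_tab, 1)
print(top1)  # [('A', '2015-01', 15, 1), ('B', '2015-01', 25, 1)]
```

Unlike the grouped query, each surviving row keeps all of its original columns, which is what makes this usable for deduplication.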

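5. About multi-insert

A multi-insert reads the source table once and writes to several targets. It is a single statement, so only the last SELECT ends with a semicolon.

from student
insert into table student_p partition(part='a')
select * where Sno<95011
insert into table student_p partition(part='b')
select * where Sno>95011;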
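
A minimal Python sketch of the same idea: one pass over the source rows, with each row tested against every branch. The student rows and the `multi_insert` helper are hypothetical; only the two conditions come from the statement above.

```python
# Hypothetical student rows: (Sno, name).
students = [(95009, "s1"), (95011, "s2"), (95013, "s3")]

def multi_insert(rows):
    """Scan the source once and route each row to every branch it satisfies."""
    parts = {"a": [], "b": []}
    for row in rows:
        if row[0] < 95011:
            parts["a"].append(row)
        if row[0] > 95011:
            parts["b"].append(row)
    return parts

parts = multi_insert(students)
print(parts)  # {'a': [(95009, 's1')], 'b': [(95013, 's3')]}
```

With these two conditions, a row whose Sno is exactly 95011 lands in neither partition.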

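6. Exporting data to the local filesystem

insert overwrite local directory '/home/hadoop/student.txt'   select * from student;

Note that the target path is treated as a directory: Hive writes the result files into it and overwrites whatever it already contains.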

7. Local mode (useful when running demos locally)

set hive.exec.mode.local.auto=true;

Reposted from: https://my.oschina.net/liufukin/blog/798531
