hive基础操作

启动dfs和yarn

start-dfs.sh
start-yarn.sh

hive操作

hive
hive>show databases;
default

hive>create database test;
hive>show databases;
default
test

hive>use test;

内部表

drop表时存储在hdfs上的文件也会被删除

示例:学生表student,选修关系表sc,课程表course

hive>create table student(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;
hive>create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
hive>create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;

创建的数据库保存在表DBS中
hdfs上的路径:
/user/hive/warehouse/test.db
其中/user/hive/warehouse在hive-site.xml文件中指定的

创建的表在TBLS中
hdfs上的路径:
/user/hive/warehouse/test.db/student

导入数据

创建文件
students.txt

95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA

course.txt

1,数据库
2,数学
3,信息系统
4,操作系统
5,数据结构
6,数据处理

sc.txt

95001,1,81
95001,2,85
95001,3,88
95001,4,70
95002,2,90
95002,3,80
95002,4,71
95002,5,60
95003,1,82
95003,3,90
95003,5,100
95004,1,80
95004,2,92
95004,4,91
95004,5,70
95005,1,70
95005,2,92
95005,3,99
95005,6,87
95006,1,72
95006,2,62
95006,3,100
95006,4,59
95006,5,60
95006,6,98
95007,3,68
95007,4,91
95007,5,94
95007,6,78
95008,1,98
95008,3,89
95008,6,91
95009,2,81
95009,4,89
95009,6,100
95010,2,98
95010,5,90
95010,6,80
95011,1,81
95011,2,91
95011,3,81
95011,4,86
95012,1,81
95012,3,78
95012,4,85
95012,6,98
95013,1,98
95013,2,58
95013,4,88
95013,5,93
95014,1,91
95014,2,100
95014,4,98
95015,1,91
95015,3,59
95015,4,100
95015,6,95
95016,1,92
95016,2,99
95016,4,82
95017,4,82
95017,5,100
95017,6,58
95018,1,95
95018,2,100
95018,3,67
95018,4,78
95019,1,77
95019,2,90
95019,3,91
95019,4,67
95019,5,87
95020,1,66
95020,2,99
95020,5,93
95021,2,93
95021,5,91
95021,6,99
95022,3,69
95022,4,93
95022,5,82
95022,6,100

从本地导入数据
注意:创建表结构时指定的分隔符','要与students.txt文件中每行的字段分隔符','对应

hive>load data local inpath '/home/bingo/data/hive/students.txt' into table student;
hive>load data local inpath '/home/bingo/data/hive/course.txt' overwrite into table course;
hive>load data local inpath '/home/bingo/data/hive/sc.txt' overwrite into table sc;

基本操作:

查询全体学生的学号与姓名

select Sno,Sname from student;

查询选修了课程的学生姓名

select distinct Sname from student inner join sc on student.Sno=sc.Sno;

—-hive的group by 和集合函数

查询学生的总人数

select count(distinct Sno) count from student;

计算1号课程的学生平均成绩

select avg(distinct Grade) from sc where Cno=1;

查询各科成绩平均分

select Cno,avg(Grade) from sc group by Cno;

查询选修1号课程的学生最高分数

select Grade from sc where Cno=1 sort by Grade desc limit 1; 

比较:
select * from sc where Cno=1 sort by Grade
select Grade from sc where Cno=1 order by Grade

求各个课程号及相应的选课人数

select Cno,count(1) from sc group by Cno;

查询选修了3门以上的课程的学生学号

select Sno from (select Sno,count(Cno) CountCno from sc group by Sno)a where a.CountCno>3;
select Sno from sc group by Sno having count(Cno)>3;

—-hive的Order By/Sort By/Distribute By
  Order By ,在strict 模式下(hive.mapred.mode=strict),order by 语句必须跟着limit语句,但是在nonstrict下就不是必须的,这样做的理由是必须有一个reduce对最终的结果进行排序,如果最后输出的行数过多,一个reduce需要花费很长的时间。只有一个reduce,如果数据量大,效率很低,甚至程序会崩掉

查询学生信息,结果按学号全局有序

set hive.mapred.mode=strict;   <默认nonstrict>
select Sno from student order by Sno;
FAILED: Error in semantic analysis: 1:33 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'Sno'

  Sort By,它通常发生在每一个reduce里。"order by"和"sort by"的区别在于:前者能保证输出全局有序,而后者在有多个reduce的时候只保证每个reduce内部的输出局部有序。reduce的个数可以用set mapred.reduce.tasks=<数量>指定;在用sort by的时候,如果没有用distribute by指定划分列,数据会随机地分配到不同的reduce里去。distribute by 按照指定的字段把数据划分到不同的reduce中。
  此方法会根据性别划分到不同的reduce中 ,然后按年龄排序并输出到不同的文件中。

查询学生信息,按性别分区,在分区内按年龄有序

set mapred.reduce.tasks=2;
insert overwrite local directory '/home/bingo/out' 
select * from student distribute by Sex sort by Sage;

—-Join查询,join只支持等值连接
查询每个学生及其选修课程的情况

select student.*,sc.* from student join sc on (student.Sno =sc.Sno);

查询学生的得分情况。

select student.Sname,course.Cname,sc.Grade from student join sc on student.Sno=sc.Sno join course on sc.cno=course.cno;

查询选修2号课程且成绩在90分以上的所有学生。

select student.Sname,sc.Grade from student join sc on student.Sno=sc.Sno 
where  sc.Cno=2 and sc.Grade>90;

—-LEFT,RIGHT 和 FULL OUTER JOIN ,inner join, left semi join
查询所有学生的信息,如果在成绩表中有成绩,则输出成绩表中的课程号

select student.Sname,sc.Cno from student left outer join sc on student.Sno=sc.Sno;

  如果student的Sno值在sc中没有对应记录,则会输出student.Sname与null。如果用right outer join则会保留右表的行,左表无匹配时为null。
  Join 发生在WHERE 子句之前。如果你想限制 join 的输出,应该在 WHERE 子句中写过滤条件——或是在join 子句中写。
  
—-LEFT SEMI JOIN Hive 当前没有实现 IN/EXISTS 子查询,可以用 LEFT SEMI JOIN 重写子查询语句

重写以下子查询为LEFT SEMI JOIN

SELECT a.key, a.value
  FROM a
  WHERE a.key in
   (SELECT b.key
    FROM B);
#可以被重写为:
   SELECT a.key, a.value
   FROM a LEFT SEMI JOIN b on (a.key = b.key)

查询与“刘晨”在同一个系学习的学生

select s1.Sname from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';

注意比较:

select * from student s1 left join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 right join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 inner join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select s1.Sname from student s1 right semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨'; -- 注意:Hive 不支持 right semi join,此语句会报错

外部表

drop表时hdfs上的文件不会被删除

hive>create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string)
>row format delimited fields terminated by ','
>location '/input/hive/student/';

location是hdfs上的路径
导入数据:
与内部表的导入一样

分区表

可以是内部表,也可以是外部表

hive> create external table student_p(Sno int,Sname string,Sex string,Sage int,Sdept string)
>partitioned by(part string) 
>row format delimited fields terminated by ','
>location '/input/hive/partition';

#创建分区:
alter table student_p add if not exists partition (part ='usa');
#删除分区:
若为外部表,删除分区后hdfs上对应的文件仍然存在
alter table student_p drop if exists partition (part ='japan');

#导入数据时决定分区
hive> load data local inpath '/home/bingo/data/hive/students.txt' overwrite into table student_p
    > partition (part='china');
hive> load data local inpath '/home/bingo/data/hive/students.txt' overwrite into table student_p
    > partition (part='japan');

#查询全部
select * from student_p;
#查询分区
select * from student_p where part='china';

详情看官网:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
