1. Two ways to load data into a table
Copying a data file created on Linux into a Hive table (on HDFS):
-----run from the Linux shell
hadoop fs -put '<local data file>' <HDFS path of the Hive table>
Loading a data file (local or already on HDFS) into a Hive table:
------run from the Hive CLI
Syntax:
load data [local] inpath '<path to the data file>' [overwrite] into table table_name;
local: the path refers to the local filesystem of the Hive client; without local, it refers to a path on HDFS
overwrite: replaces the data already in the table instead of appending
Example:
load data inpath '/user/hive/warehouse/student001' overwrite into table student;
2. Inserting data
1) Inserting the result of a query into a table
Syntax:
insert (into | overwrite) table table_name select_statement;
Example:
insert into table student1 select * from student;
2) Inserting literal rows with values
Syntax:
insert into table table_name values (value1, value2, ...), ...;
Example:
insert into table student values("yy", 20, "China", 123);
3) Writing query results to a target directory
Syntax:
insert overwrite [local] directory '<target path>' row format serde ... select_statement;
Example:
insert overwrite local directory '/opt/module/hive/datas/student001' select * from student;
3. export | import
export writes a Hive table (its data and metadata) to a path on HDFS
import loads previously exported data into a Hive table
suited to transferring data between two different Hive clusters
1) export
Syntax:
export table table_name to '<target export path>';
Example:
export table default.student to '/user/hive/export/student_001';
2) import
Syntax:
import table table_name from '<source path of the exported data>';
Example:
import table student2 from '/user/hive/export/student_001';
4. Querying data
Syntax:
select [all | distinct] select_expr, ... from table_name [where <filter condition>] [group by colu_name] [having <filter condition applied after grouping>] [order by colu_name] [limit number];
Example:
select * from student where id < 20; or: select name, id from student where id < 20;
(examples combining aggregate functions with group by appear under 3) Aggregate functions below)
1) Relational operators
<, >, =, <=, >=, !=, [not] like, between A and B, in, is [not] null, ...
Example:
select * from student where id in (19, 20);
2) Logical operators
and, or, not
select * from student where id = 19 or id = 20;
3) Aggregate functions
count(*): number of rows in the table
count(colu_name): number of rows where the column is not null
sum(colu_name): sum of the column's values
max(): maximum value
min(): minimum value
avg(): average value (rows with null are excluded)
Example: grouping
select count(*) as a from student group by id;  ## counts the rows in each group
or: select avg(id) from student;
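A minimal sketch of the count(*) vs count(colu_name) difference, assuming a hypothetical table t with a single column x holding the values 1, null, and 3:

```sql
-- t is a hypothetical one-column table containing the values 1, null, 3
select count(*) from t;  -- 3: every row is counted, including the null row
select count(x) from t;  -- 2: the row where x is null is skipped
select avg(x)   from t;  -- 2.0: (1 + 3) / 2, the null row is excluded
```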
4) having (usually used together with group by)
Syntax: group by colu_name having <filter on the grouped/aggregated results>
Example:
select request, count(*) from ods_weblog_origin group by request having count(*) > 50;
5. join
1) join (inner join)
Syntax: from table_name1 a join table_name2 b  ## a and b are aliases for the tables
on a.colu_name = b.colu_name
Example: joining two tables
select dept.*, emp.* from dept join emp on dept.id = emp.id;
2) Left join
Syntax: from table_name1 [a]
left [outer] join table_name2 [b]
on a.colu_name = b.colu_name
## every row of the left table appears in the result; where the right table has no match, its columns show null
Example:
select dept.deptno, dept.ename, emp.* from dept left join emp on dept.id = emp.id;
3) Right join
Syntax: from table_name1 [a]
right [outer] join table_name2 [b]
on a.colu_name = b.colu_name
## every row of the right table appears in the result; where the left table has no match, its columns show null
Example:
select dept.deptno, dept.ename, emp.* from dept right join emp on dept.id = emp.id;
4) Full (outer) join
Syntax: from table_name1 [a]
full [outer] join table_name2 [b]
on a.colu_name = b.colu_name
Example:
select dept.deptno, dept.ename, emp.* from dept full join emp on dept.id = emp.id;
5) Multi-table join
Syntax: from table_name1 [a] join table_name2 [b]
on a.colu_name = b.colu_name
join table_name3 [c]
on b.no = c.no
Example:
select * from ...
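The elided example above could be filled in along these lines; the third table location and all column names other than the dept/emp join key are illustrative assumptions, not from the original notes:

```sql
-- emp joins dept on deptno, and dept joins a hypothetical location table on loc
select e.ename, d.dname, l.loc_name
from emp e
join dept d
  on e.deptno = d.deptno
join location l
  on d.loc = l.loc;
```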
6. Sorting
1) order by (global sort)
asc: ascending (the default)
desc: descending
Syntax: select * from table_name order by colu_name;
## order by sorts in a single reducer, so it is usually combined with limit to reduce the load on that reducer
Example 1:
select * from emp order by sal desc limit 3;
2) sort by (sorts within each reducer)
Set the number of reducers: set mapreduce.job.reduces = 3;
Syntax: select * from table_name sort by colu_name desc;
Example 2:
set mapreduce.job.reduces = 3;
select * from emp sort by deptno desc;
or, to inspect the sorted output of each reducer:
insert overwrite local directory '/opt/module/datas/hive_demo/sort_result' select * from emp sort by deptno desc;
3) distribute by (partitioned sort)
Distributes rows among reducers by the given column; set the number of reducers first, and usually combine it with sort by
Syntax: select * from table_name distribute by colu_name sort by colu_name desc;
Example 3:
insert overwrite local directory '/opt/module/datas/hive_demo/distribute_result' select * from emp distribute by deptno sort by sal desc;
4) cluster by
Shorthand for distribute by colu_name sort by colu_name when both use the same column; cluster by only sorts in ascending order
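A minimal sketch reusing the emp table from the earlier examples; the two queries below are equivalent, since cluster by only sorts in ascending order:

```sql
set mapreduce.job.reduces = 3;

-- long form: distribute and sort by the same column
select * from emp distribute by deptno sort by deptno;

-- equivalent shorthand
select * from emp cluster by deptno;
```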