20190322-Hive高级操作

最新推荐文章于 2024-05-31 08:45:35 发布

我是鲨鱼胃

最新推荐文章于 2024-05-31 08:45:35 发布

阅读量250

点赞数 2

分类专栏： Hadoop Hive 笔记文章标签： hadoop hive

本文链接：https://blog.csdn.net/sinat_38065180/article/details/88739987

版权

Hadoop 同时被 3 个专栏收录

11 篇文章 0 订阅

订阅专栏

笔记

9 篇文章 0 订阅

订阅专栏

Hive

5 篇文章 0 订阅

订阅专栏

文章目录

Join操作

连接数据准备

数据准备

[yao@master data]$ vim studenta.txt
[yao@master data]$ vim studentb.txt

数据内容

//studenta.txt
10001	shiny
10002	mark
10003	angel
10005	ella
10009	jack
10014	eva
10018	judy
10020	cendy


//studentb.txt
10001	23
10004	22
10007	24
10008	21
10009	25
10012	25
10015	20
10018	19
10020	26

创建表studenta和表studentb

hive> use yr_test;
OK
Time taken: 1.022 seconds
hive> create table studenta(
    > id int,name string)
    > row format delimited fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 2.258 seconds
hive> create table studentb(
    > id int,name string)
    > row format delimited fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.092 seconds

加载数据

hive> load data local inpath '/home/yao/data/studenta.txt' overwrite into table studenta;
Loading data to table yr_test.studenta
Table yr_test.studenta stats: [numFiles=1, numRows=0, totalSize=90, rawDataSize=0]
OK
Time taken: 4.047 seconds
hive> load data local inpath '/home/yao/data/studentb.txt' overwrite into table studentb;
Loading data to table yr_test.studentb
Table yr_test.studentb stats: [numFiles=1, numRows=0, totalSize=81, rawDataSize=0]
OK
Time taken: 0.612 seconds

1 内连接 JOIN

把符合两边条件的数据查找出来

hive> select * from studenta a join studentb b on a.id=b.id;

//结果
Total MapReduce CPU Time Spent: 3 seconds 690 msec
OK
10001	shiny	10001	23
10009	jack	10009	25
10018	judy	10018	19
10020	cendy	10020	26

2 外连接 OUTER JOIN

左外连接 LEFT JOIN
1.以左表数据为匹配标准，左大右小
2.匹配不上的就是null
3.返回的数据条数与左表相同

hive> select * from studenta a left join studentb b on a.id=b.id;

//结果
Total MapReduce CPU Time Spent: 2 seconds 570 msec
OK
10001	shiny	10001	23
10002	mark	NULL	NULL
10003	angel	NULL	NULL
10005	ella	NULL	NULL
10009	jack	10009	25
10014	eva	NULL	NULL
10018	judy	10018	19
10020	cendy	10020	26

右外连接 RIGHT JOIN
1.以右表数据为匹配标准，右大左小
2.匹配不上的就是null
3.返回的数据条数与右表相同

hive> select * from studenta a right join studentb b on a.id=b.id;

//结果
Total MapReduce CPU Time Spent: 2 seconds 670 msec
OK
10001	shiny	10001	23
NULL	NULL	10004	22
NULL	NULL	10007	24
NULL	NULL	10008	21
10009	jack	10009	25
NULL	NULL	10012	25
NULL	NULL	10015	20
10018	judy	10018	19
10020	cendy	10020	26

全外连接 FULL JOIN
1.以两个表数据为匹配标准
2.匹配不上的就是null
3.返回的数据条数等于两表数据去重之和

hive> select * from studenta a full join studentb b on a.id=b.id;

//结果
Total MapReduce CPU Time Spent: 8 seconds 50 msec
OK
10001	shiny	10001	23
10002	mark	NULL	NULL
10003	angel	NULL	NULL
NULL	NULL	10004	22
10005	ella	NULL	NULL
NULL	NULL	10007	24
NULL	NULL	10008	21
10009	jack	10009	25
NULL	NULL	10012	25
10014	eva	NULL	NULL
NULL	NULL	10015	20
10018	judy	10018	19
10020	cendy	10020	26

3 左半连接 LEFT SEMI

1.把符合两边连接条件的左表数据显示出来
2.右表只能在ON子句中设置过滤条件，在where子句、select子句或者其他地方过滤都不行
因为如果连接语句中有WHERE子句，会先执行JOIN子句，再执行WHERE子句。

hive> select * from studenta a left semi join studentb b on a.id=b.id;

//结果
Total MapReduce CPU Time Spent: 3 seconds 250 msec
OK
10001	shiny
10009	jack
10018	judy
10020	cendy

数据类型

原子数据类型

数值、布尔、字符串、日期
在这里插入图片描述

复杂数据类型

数组、映射、结构体
在这里插入图片描述

复杂数据类型实例

1.array

数据准备

[yao@master data]$ vim array.txt

shiny	23	beijing,tianjin,qingdao
jack	34	shanghai,guangzhou
mark	26	beijing,xian
ella	21	beijing
judy	30	shanghai,hangzhou,chongqing
cendy	28	beijing,shanghai,dalian,chengdu

创建表并导入数据

hive> create table employee(
    > name string,
    > age int,
    > work_locations array<string>)
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > stored as textfile;
OK
Time taken: 0.639 seconds

hive> load data local inpath '/home/yao/data/array.txt' into table employee;

查询数据

hive> select name,age,work_locations[0] from employee;
OK
shiny	23	beijing
jack	34	shanghai
mark	26	beijing
ella	21	beijing
judy	30	shanghai
cendy	28	beijing
Time taken: 0.131 seconds, Fetched: 6 row(s)

2.MAP

数据准备

[yao@master data]$ vim scores.txt

shiny	chinese:90,math:100,english:99
mark	chinese:89,math:56,english:87
judy	chinese:94,math:78,english:81
ella	chinese:54,math:23,english:48
jack	chinese:100,math:95,english:69
cendy	chinese:67,math:83,english:45

创建表并导入数据

hive> create table scores(
    > name string,
    > scores map<string,int>)
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > stored as textfile;
OK
Time taken: 0.087 seconds

hive> load data local inpath '/home/yao/data/scores.txt' into table scores;

查询数据

hive> select name,scores['math'] from scores;
OK
shiny	100
mark	56
judy	78
ella	23
jack	95
cendy	83
Time taken: 0.107 seconds, Fetched: 6 row(s)

hive> select name,'math',scores['math'] from scores;
OK
shiny	math	100
mark	math	56
judy	math	78
ella	math	23
jack	math	95
cendy	math	83
Time taken: 0.075 seconds, Fetched: 6 row(s)
hive> select name,'.学',scores['math'] from scores;
OK
shiny	数学	100
mark	数学	56
judy	数学	78
ella	数学	23
jack	数学	95
cendy	数学	83
Time taken: 0.092 seconds, Fetched: 6 row(s)

3.struct

数据准备

[yao@master data]$ vim coursescore.txt

1	chinese,100
2	math,98
3	english,99
4	computer,78

创建表并导入数据

hive> create table coursescore(
    > id int,
    > course struct<name:string,score:int>)
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > stored as textfile;
OK
Time taken: 0.138 seconds

hive> load data local inpath '/home/yao/data/coursescore.txt' into table coursescore;

查询数据

hive> select * from coursescore;
OK
1	{"name":"chinese","score":100}
2	{"name":"math","score":98}
3	{"name":"english","score":99}
4	{"name":"computer","score":78}
Time taken: 0.084 seconds, Fetched: 4 row(s)
hive> select id,course.name from coursescore;
OK
1	chinese
2	math
3	english
4	computer
Time taken: 0.081 seconds, Fetched: 4 row(s)
hive> select * from coursescore;
OK
1	{"name":"chinese","score":100}
2	{"name":"math","score":98}
3	{"name":"english","score":99}
4	{"name":"computer","score":78}
Time taken: 0.084 seconds, Fetched: 4 row(s)
hive> select id,course.name,course.score from coursescore;
OK
1	chinese	100
2	math	98
3	english	99
4	computer	78
Time taken: 0.081 seconds, Fetched: 4 row(s)

函数

内置函数

查看内置函数

show functions;

查看函数的详细信息

desc function 【函数名】;

显示函数的扩展信息

desc function extended 【函数名】;

自定义函数

应用场景
当 Hive 提供的内置函数无法满足业务处理需要时，此时就可以考虑使用用户自定义函数
类型
（1）UDF（user-defined function）：用户自定义函数作用于单个数据行，
产生一个数据行作为输出。（数学函数，字符串函数）
（2）UDAF（User- Defined Aggregation Funcation）：用户自定义聚合函数**接收多个输入数据行，并产生一个输出数据行。（COUNT、MAX）

Json数据解析

1.内置函数

准备数据

rating.json
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
...

创建表并加载数据

hive> create table rat_json(line string) row format delimited;
OK
Time taken: 0.06 seconds
hive> load data local inpath '/home/yao/data/rating.json' into table rat_json;
Loading data to table yr_test.rat_json
Table yr_test.rat_json stats: [numFiles=1, totalSize=65602705]
OK
Time taken: 0.961 seconds
hive> select * from rat_json limit 5;
OK
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}

创建存放数据的表
需要解析Json数据成四个字段，插入一张新的表 rate(用于存放处理的数据，需要有四个字段)

hive> create table rate(
    > movie int,
    > rate int,
    > ts int,
    > userid int)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.09 seconds

插入数据解析Json
解析Json，得到结果之后存入rate表

insert into table rate select
get_json_object(line,'$.movie') as movie,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as ts,
get_json_object(line,'$.uid') as userid
from rat_json;

查看解析出来的Json数据

hive> select * from rate limit 10;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.078 seconds, Fetched: 10 row(s)

2.自定义函数

1.新建一个项目HiveDemo，导入apache-hive-1.2.1-bin\lib下的所有包

在windows下解压hive的安装包文件apache-hive-1.2.1-bin.tar
新建java project
添加包

2.开发一个简单的 Java 类，继承org.apache.hadoop.hive.ql.exec.UDF，重载 evaluate 方法

package org.yao.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

//解析Json格式数据
public class JsonUDF extends UDF {
	//必须是public
public String evaluate(String jsonStr,String fields) throws JSONException {
	//创建一个json对象
	JSONObject json=new JSONObject(jsonStr);
	//把这个json对象转换为string类型的对象
	String result=(String)json.get(fields);
	//返回这个string类型的对象
	return result;	
}
}

3.打成json.jar包上传到Linux服务器

在这里插入图片描述

4.将jar包添加到Hive的classpath下

hive> add jar /home/yao/json.jar;
Added [/home/yao/json.jar] to class path
Added resources: [/home/yao/json.jar]
hive> list jar;
/home/yao/json.jar		//查看jar包是否加入成功

5.创建临时函数与开发好的 class 关联起来

hive> hive> create temporary function jsontostring as 'org.yao.hive.udf.JsonUDF';
OK
Time taken: 0.011 seconds

//jsontostring ：临时函数的名字
//'org.yao.hive.udf.JsonUDF'是'包名.类名'

6.解析Json，得到结果之后存入rates表（rates表用于存放处理的数据，所以需要有四个字段)

create  table rates as select
jsontostring(line,'movie') as movie,
jsontostring(line,'rate') as rate,
jsontostring(line,'timeStamp') as ts,
jsontostring(line,'uid') as userid
from rat_json;

7.至此，便可以使用自定义的函数

hive> select * from rates limit 10;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.063 seconds, Fetched: 10 row(s)

Hive Shell

参数
（1）-i：从文件初始化 HQL
（2）-e：从命令行执行指定的 HQL
（3）-f：执行 HQL 脚本
（4）-v：输出执行的 HQL 语句到控制台
（5）-p： connect to Hive Server on port number
（6）-hiveconf x=y（Use this to set hive/hadoop configuration variables）
（7）-S：表示以不打印日志的形式执行命名操作
从命令行执行指定的HQL

[yao@master ~]$ cd data/
[yao@master data]$ touch studenta
//将数据库yr_test中的表studenta写入这个新建的空文件中
[yao@master data]$ hive -e 'select * from yr_test.studenta' > studenta

//如果当前目录不存在studentaaa，那么会自动创建
[yao@master ~]$ hive -e 'select * from yr_test.studenta' > studentaaa
[yao@master ~]$ cat studentaaa
10001	shiny
10002	mark
10003	angel
10005	ella
10009	jack
10014	eva
10018	judy
10020	cendy

运行一个文件

1.创建一个HQL文件

[yao@master ~]$ vim hive.hql
[yao@master ~]$ cat hive.hql 
create database if not exists yr_0322;
use yr_0322;
create table if not exists stu(id int,name string)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/yao/data/stu.txt' into table stu;
insert overwrite directory '/yr_0322/stu' row format delimited fields terminated by '\t' select * from stu;

overwrite是复写，如果不写就是在后面追加

stu.txt 
1,张三
2,李四
3,rose
4,jack
5,haha
6,hehe
7,aaaaaaaaaaaaa

2.执行HQL脚本

[yao@master ~]$ hive -f hive.hql 
//在hdfs上查看
[yao@master ~]$ hadoop fs -ls /yr_0322/stu
Found 1 items
-rwxr-xr-x   1 yao supergroup         52 2019-03-22 17:25 /yr_0322/stu/000000_0
[yao@master ~]$ hadoop fs -cat /yr_0322/stu/000000_0
1	张三
2	李四
3	rose
4	jack
5	haha
6	hehe
7	aaaaaaaaaaaaa

//在hive上查看
hive> use yr_0322;
OK
Time taken: 0.026 seconds
hive> select * from stu;
OK
1	张三
2	李四
3	rose
4	jack
5	haha
6	hehe
7	aaaaaaaaaaaaa
Time taken: 0.188 seconds, Fetched: 7 row(s)

我是鲨鱼胃

关注

2
点赞
踩
0

收藏

觉得还不错? 一键收藏
1
评论
20190322-Hive高级操作

Join操作连接数据准备数据准备[yao@master data]$ vim studenta.txt[yao@master data]$ vim studentb.txt数据内容//studenta.txt10001 shiny10002 mark10003 angel10005 ella10009 jack10014 eva10018 judy10020 ce...
复制链接

扫一扫