Hive
What is Hive?
1. Hive is a tool that translates SQL into MapReduce programs.
2. Hive lets users map files on HDFS to table structures, after which they can issue SQL to query and analyze those tables (i.e., the files on HDFS).
3. Hive stores user-defined databases, table schemas, and other metadata in its metastore database (which can be a local Derby instance or a remote MySQL instance).
What is Hive for?
It frees big-data analysts from hand-writing large numbers of MapReduce programs to analyze data; writing SQL scripts is enough.
Hive can be used to build a data warehouse on top of a big-data stack.
Install the MySQL database
Hive's database and table metadata will be stored in MySQL.
Uninstall MySQL
Uninstall any existing copy first, to avoid a conflicting duplicate installation.
Check with yum whether MySQL is installed: yum list installed | grep -i mysql
If the command prints nothing, MySQL has never been installed.
Remove the packages yum lists, by name:
#yum remove mysql-community-client mysql-community-common mysql-community-libs mysql-community-libs-compat mysql-community-server mysql57-community-release
#rm -rf /var/lib/mysql
#rm /etc/my.cnf
Check installed packages with rpm
rpm -qa | grep -i mysql
Remove them with rpm
#rpm -e mysql57-community-release-el7-9.noarch
#rpm -e mysql-community-server-5.7.17-1.el7.x86_64
#rpm -e mysql-community-libs-5.7.17-1.el7.x86_64
#rpm -e mysql-community-libs-compat-5.7.17-1.el7.x86_64
#rpm -e mysql-community-common-5.7.17-1.el7.x86_64
#rpm -e mysql-community-client-5.7.17-1.el7.x86_64
#cd /var/lib/
#rm -rf mysql/
Clean up leftovers
#whereis mysql
mysql: /usr/bin/mysql /usr/lib64/mysql /usr/local/mysql /usr/share/mysql /usr/share/man/man1/mysql.1.gz
Delete the directories listed above
rm -rf /usr/bin/mysql
Delete the configuration files
#rm -rf /usr/my.cnf
#rm -rf /root/.mysql_secret
Check for remaining service entries
#chkconfig --list | grep -i mysql
#chkconfig --del mysqld
Install MySQL
For a very detailed walkthrough, see: https://www.jianshu.com/p/9043223c4168
Install Hive
1. Upload the installation package to the cluster
2. Extract the package to the target directory
tar -zxvf apache-hive-2.3.5-bin.tar.gz -C /appdata/
3. Add environment variables
vi /etc/profile
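A typical /etc/profile addition for this step looks like the sketch below. The /appdata/hive path follows the layout used later in this document (which assumes the extracted apache-hive directory was renamed or symlinked to /appdata/hive); adjust it to your own layout.

```shell
# /etc/profile -- Hive environment (path assumed from the tar -C /appdata/ step above)
export HIVE_HOME=/appdata/hive
export PATH=$PATH:$HIVE_HOME/bin
```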
Reload the environment: source /etc/profile
4. Verify the installation
hive --version
5. Edit the configuration file hive-env.sh
cd /appdata/hive/conf
cp hive-env.sh.template hive-env.sh
vi hive-env.sh
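The text does not show the hive-env.sh edits; typically only the Hadoop location and Hive's conf directory need to be set. A sketch, where both paths are assumptions based on the directory layout used in this document:

```shell
# hive-env.sh -- point Hive at Hadoop and at its own conf directory (assumed paths)
export HADOOP_HOME=/appdata/hadoop
export HIVE_CONF_DIR=/appdata/hive/conf
```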
6. Edit the configuration file hive-site.xml
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://n1:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
7. Create the hive metastore database in MySQL (with createDatabaseIfNotExist=true in the JDBC URL above, it is created automatically)
8. Initialize the hive metastore schema
cd /appdata/hive/bin
schematool -dbType mysql -initSchema
9. Inspect the tables created in the MySQL database
10. Start the hive CLI
./hive
11. List the existing databases: show databases;
12. Verify
Installation succeeded!
How can Hive be used?
Local interactive queries (1)
Start command: bin/hive
In the interactive client: hive> select * from t_test;
Run Hive as a service (2)
Start command: bin/hiveserver2 (in the background: nohup hiveserver2 1>/dev/null 2>&1 &)
Then, from any machine, the beeline client can connect to the Hive service for interactive queries.
Client start command: bin/beeline -u jdbc:hive2://n1:10000 -n lys
Note: the client may throw an exception on startup
whose key message is: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: User: lys is not allowed to impersonate anonymous
Fix: access HDFS through the WebHDFS REST interface, proxying the connecting user through an allowed proxy user.
First enable the REST interface by adding to hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
Then add to core-site.xml:
<property>
<name>hadoop.proxyuser.lys.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.lys.groups</name>
<value>*</value>
</property>
Configuring a remote beeline connection
Configure Hive for remote beeline access (source: https://www.cnblogs.com/zuizui1204/p/9999652.html)
0. Install Hive and copy the MySQL driver jar into its lib directory.
1. In hive-site.xml on the node where the Hive service is installed, add the remote username and password for beeline (already done during installation).
2. Make sure the Hadoop cluster is running and, importantly, start hiveserver2 on the Hive service node.
Running hiveserver2 in the foreground keeps the terminal busy printing Hive's operation log, so run it in the background instead.
Background start command: nohup hiveserver2 1>/dev/null 2>&1 &
How do you stop a nohup background job? One way: jobs -l shows the process id, then kill -9 <pid>.
3. Copy the Hadoop and Hive installation directories to the Linux machine that will act as the Hive client, and configure /etc/profile.
Note: without HADOOP_HOME, running beeline complains that the HADOOP_HOME parameter is missing. Think of it as the cluster connection settings: addresses such as master:9000 come from the related hdfs-site.xml configuration. Run source /etc/profile to reload the profile.
Hive needs this information to reach the cluster; after all, it operates on files in HDFS.
4. Command setup: after the three steps above you can already log in to Hive remotely with beeline -u jdbc:hive2://n1:10000 -n lys --color=true --silent=false
But that is a long command, so set up a shorter one: define an alias and add it to ~/.bashrc
alias beeline="/opt/softWare/hive/apache-hive-1.2.2-bin/bin/beeline -u jdbc:hive2://ip:10000 -nroot --color=true --silent=false"
Then source ~/.bashrc. Note that the alias uses an absolute path: the new command is itself called beeline, and the absolute path avoids the name clash; you can of course pick a different name.
5. Finally, test it: typing beeline drops you into the remote Hive command line.
Run Hive as a one-shot command (3)
Hive can also execute statements as a one-off command:
a. Short statements
bin/hive -e "sql1;sql2;sql3;sql4"
b. Longer statements
For longer statements, write the SQL into a file, e.g. q.hql, then execute it with the hive command:
bin/hive -f q.hql
Longer statements can also be put into a shell script (e.g. xxx.sh)
and executed by running the script.
Hive DDL
Create a database
create database db1; --> Hive creates a directory under /user/hive/warehouse/: db1.db
Create an internal (managed) table
An internal table's data is stored in the warehouse directory.
use db1;
create table t_test1(id int,name string,age int,create_time bigint)
row format delimited
fields terminated by '\001';
After the table is created, Hive makes a table directory in the warehouse: /user/hive/warehouse/db1.db/t_test1
Create an external table
create external table t_test1(id int,name string,age int,create_time bigint)
row format delimited
fields terminated by '\001'
location '/external/t_test';
Note: fields terminated by '\001' splits the data stored in HDFS on '\001'.
location '/external/t_test' maps the table onto existing HDFS data; the data is not moved into the warehouse.
Difference: an internal table's directory is created by Hive under the default warehouse path: /user/hive/warehouse/...
An external table's directory is chosen by the user at creation time: location '/path/'
Dropping an internal table deletes both the table's metadata and its data directory;
dropping an external table deletes only the metadata and leaves the data directory in place.
Why it matters: in a data warehouse, the data always has a source, usually produced by some other application,
and its directory layout follows no fixed rule. External tables make such directories easy to map into Hive; and even if
the table is dropped in Hive, the data directory survives, so the other application is unaffected.
Drop a table
drop table <table_name>; -- the table is deleted permanently, with no way to recover it.
Load data
Loading essentially places data files into the table directory;
it can be done with a Hive command:
hive> load data [local] inpath '/data/path' [overwrite] into table t_test;
By default the path refers to HDFS and the data is moved into the warehouse.
local: the path refers to the local Linux filesystem, and the data is copied into the warehouse.
Create a partitioned table
Partitioning stores the data in subdirectories, so queries can restrict reads to a more precise slice of the data;
create table t_test1(id int,name string,age int,create_time bigint)
partitioned by (day string,country string)
row format delimited
fields terminated by '\001';
Load data into a specific partition
hive> load data [local] inpath '/data/path1' [overwrite] into table t_test partition(day='2017-06-04',country='China');
hive> load data [local] inpath '/data/path2' [overwrite] into table t_test partition(day='2017-06-05',country='China');
hive> load data [local] inpath '/data/path3' [overwrite] into table t_test partition(day='2017-06-04',country='England');
After loading, the resulting directory structure is:
/user/hive/warehouse/db1.db/t_test1/day=2017-06-04/country=China/…
/user/hive/warehouse/db1.db/t_test1/day=2017-06-04/country=England/…
/user/hive/warehouse/db1.db/t_test1/day=2017-06-05/country=China/…
Add partitions
alter table t_4 add partition(day='2017-04-10') partition(day='2017-04-11');
After adding them, check t_4's partitions:
show partitions t_4;
+-----------------+
|    partition    |
+-----------------+
| day=2017-04-08  |
| day=2017-04-09  |
| day=2017-04-10  |
| day=2017-04-11  |
+-----------------+
Load data with load
load data local inpath '/root/weblog.3' into table t_4 partition(day='2017-04-10');
0: jdbc:hive2://hdp-nn-01:10000> select * from t_4 where day='2017-04-10';
+--------------+-----------------------+---------------+-------------+
|    t_4.ip    |        t_4.url        | t_4.staylong  |   t_4.day   |
+--------------+-----------------------+---------------+-------------+
| 10.300.33.6  | niubi.com/b/b.html    | 300           | 2017-04-10  |
| 30.20.33.7   | niubi.com/a/c.html    | 400           | 2017-04-10  |
| 30.80.33.6   | niubi.com/c/bb.html   | 300           | 2017-04-10  |
| 30.60.33.8   | niubi.com/aa/bc.html  | 60            | 2017-04-10  |
| 30.100.33.6  | niubi.com/a/b.html    | 30            | 2017-04-10  |
| 30.100.33.9  | niubi.com/a/ab.html   | 10            | 2017-04-10  |
+--------------+-----------------------+---------------+-------------+
insert also works:
insert into table t_4 partition(day='2017-04-11')
select ip,url,staylong from t_4 where day='2017-04-08' and staylong>30;
Drop a partition
alter table t_4 drop partition(day='2017-04-11');
Alter column definitions
Add columns:
alter table t_seq add columns(address string,age int);
Replace all columns:
alter table t_seq replace columns(id int,name string,address string,age int);
Change an existing column's definition:
alter table t_seq change userid uid string;
Show commands
show tables;
show databases;
show partitions
e.g.: show partitions t_4;
show functions; -- list all of Hive's built-in functions
desc t_name; -- show the table definition
desc extended t_name; -- show detailed table information
desc formatted table_name; -- show detailed table information, nicely formatted
show create table table_name; -- show the CREATE TABLE statement
Hive DML
The basic query syntax is essentially standard SQL:
SELECT FIELDS,FUNCTION(FIELDS)
FROM T1
JOIN T2
WHERE CONDITION
GROUP BY FIELDS
HAVING CONDITION
ORDER BY FIELDS DESC|ASC
Joins
Test data:
create table t_a(name string,numb int)
row format delimited
fields terminated by ',';
create table t_b(name string,nick string)
row format delimited
fields terminated by ',';
load data local inpath '/root/test/a.txt' into table t_a;
load data local inpath '/root/test/b.txt' into table t_b;
a.txt
a,123
b,232
c,982
d,444
e,89
b.txt
a,beijing
b,shanghai
c,nanjing
g,chongqing
Inner join / Cartesian product
select a.*,b.*
from t_a a inner join t_b b;
Specify a join condition
select a.*,b.*
from
t_a a join t_b b on a.name=b.name;
Left outer join (left join)
select a.*,b.*
from
t_a a left outer join t_b b on a.name=b.name;
Right outer join (right join)
select a.*,b.*
from
t_a a right outer join t_b b on a.name=b.name;
Full outer join
select a.*,b.*
from
t_a a full outer join t_b b on a.name=b.name;
Left semi join
select a.*
from
t_a a left semi join t_b b on a.name=b.name;
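The join variants above can be sanity-checked outside Hive. A minimal Python sketch, using the a.txt/b.txt rows above as the two tables, reproduces the inner, left-outer, and left-semi join results on the name key:

```python
# Emulate Hive's join variants on the a.txt / b.txt sample data, keyed by name.
t_a = {"a": 123, "b": 232, "c": 982, "d": 444, "e": 89}
t_b = {"a": "beijing", "b": "shanghai", "c": "nanjing", "g": "chongqing"}

# inner join on name: only keys present in both tables
inner = {k: (t_a[k], t_b[k]) for k in t_a if k in t_b}

# left outer join: every row of t_a, None (NULL) where t_b has no match
left = {k: (t_a[k], t_b.get(k)) for k in t_a}

# left semi join: rows of t_a that have at least one match in t_b;
# only t_a's columns appear in the result
semi = {k: t_a[k] for k in t_a if k in t_b}

print(sorted(inner))  # ['a', 'b', 'c']
print(left["d"])      # (444, None)
print(sorted(semi))   # ['a', 'b', 'c']
```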
Grouped aggregate queries
A combined grouping-and-aggregation example over the following data:
192.168.33.3,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.3,http://www.edu360.cn/teach,2017-08-04 15:35:20
192.168.33.4,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.4,http://www.edu360.cn/job,2017-08-04 16:30:20
192.168.33.5,http://www.edu360.cn/job,2017-08-04 15:40:20
192.168.33.3,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.44.3,http://www.edu360.cn/teach,2017-08-05 15:35:20
192.168.33.44,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.33.46,http://www.edu360.cn/job,2017-08-05 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-05 15:40:20
192.168.133.3,http://www.edu360.cn/register,2017-08-06 15:30:20
192.168.111.3,http://www.edu360.cn/register,2017-08-06 15:35:20
192.168.34.44,http://www.edu360.cn/pay,2017-08-06 15:30:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.25,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.36,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
Create a table to map the data above
create table t_access(ip string,url string,access_time string)
partitioned by (dt string)
row format delimited fields terminated by ',';
Load the data
load data local inpath '/root/hivetest/access.log.0804' into table t_access partition(dt='2017-08-04');
load data local inpath '/root/hivetest/access.log.0805' into table t_access partition(dt='2017-08-05');
load data local inpath '/root/hivetest/access.log.0806' into table t_access partition(dt='2017-08-06');
Check the table's partitions
show partitions t_access;
Computing on every row
select ip,upper(url),access_time -- this expression is evaluated on each row of the data, one row at a time
from t_access;
Total access count per URL
select url,count(1) as cnts -- this expression is evaluated once per group of the grouped data
from t_access
group by url;
The largest visitor IP address per URL
select url,max(ip)
from t_access
group by url;
For each user and page, the record with the latest access time
select ip,url,max(access_time)
from t_access
group by ip,url;
Total accesses per day to http://www.edu360.cn/job after Aug 4, plus the largest visitor IP (four equivalent formulations follow)
select dt,'http://www.edu360.cn/job',count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';
select dt,max(url),count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';
select dt,url,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt,url having dt>'2017-08-04';
select dt,url,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job' and dt>'2017-08-04'
group by dt,url;
Per-day, per-page total accesses after Aug 4, plus the largest visitor IP
select dt,url,count(1),max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url;
Same as above, but only return records whose total access count is > 2
Option 1:
select dt,url,count(1) as cnts,max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url having cnts>2;
-- Option 2: a subquery
select dt,url,cnts,max_ip
from
(select dt,url,count(1) as cnts,max(ip) as max_ip
from t_access
where dt>'2017-08-04'
group by dt,url) tmp
where cnts>2;
Complex data types
Arrays
Given the following data:
战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20
普罗米修斯,苍老师:小泽老师:波多老师,2017-09-17
美女与野兽,吴刚:加藤鹰,2017-09-17
Create the mapping table:
create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data
load data local inpath '/root/hivetest/actor.dat' into table t_movie;
load data local inpath '/root/hivetest/actor.dat.2' into table t_movie;
Queries
select movie_name,actors[0],first_show from t_movie;
select movie_name,actors,first_show
from t_movie where array_contains(actors,'吴刚');
select movie_name,size(actors) as actor_number,first_show
from t_movie;
Maps
Given the following data:
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
Create a table to map the data above
create table t_family(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
Load the data
load data local inpath '/root/hivetest/fm.dat' into table t_family;
+--------------+----------------+---------------------------------------------------------------+---------------+
| t_family.id  | t_family.name  | t_family.family_members                                       | t_family.age  |
+--------------+----------------+---------------------------------------------------------------+---------------+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}      | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29            |
| 4            | mayun          | {"father":"mayongzhen","mother":"angelababy"}                 | 26            |
+--------------+----------------+---------------------------------------------------------------+---------------+
Each person's father and sister
select id,name,family_members["father"] as father,family_members["sister"] as sister,age
from t_family;
The kinds of relations each person has
select id,name,map_keys(family_members) as relations,age
from t_family;
The names of each person's relatives
select id,name,map_values(family_members) as relations,age
from t_family;
How many relatives each person has
select id,name,size(family_members) as relations,age
from t_family;
Everyone who has a brother, and who that brother is
Option 1: a single query
select id,name,age,family_members['brother']
from t_family where array_contains(map_keys(family_members),'brother');
Option 2: a subquery
select id,name,age,family_members['brother']
from
(select id,name,age,map_keys(family_members) as relations,family_members
from t_family) tmp
where array_contains(relations,'brother');
The struct data type
Given the following data:
1,zhangsan,18:male:深圳
2,lisi,28:female:北京
3,wangwu,38:male:广州
4,赵六,26:female:上海
5,钱琪,35:male:杭州
6,王八,48:female:南京
Create a table to map the data above
drop table if exists t_user;
create table t_user(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data
load data local inpath '/root/hivetest/user.dat' into table t_user;
Each person's id, name, and address
select id,name,info.addr
from t_user;
Hive built-in functions
Date and time functions
string from_unixtime(bigint unixtime[, string format]);
Converts an epoch-seconds value into the given format (which can be "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd HH", "yyyy-MM-dd HH:mm", and so on), e.g. from_unixtime(1250111000,'yyyy-MM-dd') --> 2009-08-12 in a UTC session timezone (the result shifts with the session timezone)
bigint unix_timestamp();
Returns the current timestamp in the local timezone.
string to_date(string timestamp);
Extracts the date part of a timestamp string, e.g. to_date('2009-08-12 21:03:20') --> '2009-08-12'
Type-casting functions
from_unixtime(cast('21938792183' as bigint),'yyyy-MM-dd HH:mm:ss');
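from_unixtime performs the same conversion as any standard date library; a quick Python check of the example value above (computed in UTC, whereas Hive uses the session timezone, so a UTC+8 session shifts the result by 8 hours):

```python
from datetime import datetime, timezone

# from_unixtime(1250111000, 'yyyy-MM-dd HH:mm:ss'), expressed in UTC terms
dt = datetime.fromtimestamp(1250111000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2009-08-12 21:03:20
```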
String substring and concatenation
substr("abcd",1,3) --> 'abc'
concat('abc','def') --> 'abcdef'
JSON parsing functions
get_json_object('{"key1":3333,"key2":4444}' , '$.key1') --> 3333
json_tuple('{"key1":3333,"key2":4444}','key1','key2') as(key1,key2) --> 3333, 4444
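The '$.key1' argument of get_json_object is a JSONPath expression; for a flat object it amounts to a plain dictionary lookup, which a short Python sketch makes concrete:

```python
import json

doc = '{"key1":3333,"key2":4444}'
obj = json.loads(doc)

# get_json_object(doc, '$.key1') -> value at the top-level key "key1"
print(obj["key1"])                 # 3333
# json_tuple(doc, 'key1', 'key2') -> one column per requested key
print((obj["key1"], obj["key2"]))  # (3333, 4444)
```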
URL parsing function
parse_url_tuple('http://www.edu360.cn/bigdata/baoming?userid=8888','HOST','PATH','QUERY','QUERY:userid')
--> www.edu360.cn /bigdata/baoming userid=8888 8888
The explode function
explode expands an array into multiple rows, one row per element
1,zhangsan,数学:语文:英语:生物
2,lisi,数学:语文
3,wangwu,化学:计算机:java编程
Create the table:
create table t_xuanxiu(uid string,name string,kc array<string>)
row format delimited
fields terminated by ','
collection items terminated by ':';
What explode produces:
select explode(kc) from t_xuanxiu where uid=1;
lateral view and table-generating functions
hive> select uid,name,tmp.* from t_xuanxiu
> lateral view explode(kc) tmp as course;
A Hive word count built from explode and lateral view
a b c d e f g
a b c
e f g a
b c d b
Create a table for the data:
create table t_juzi(line string) row format delimited;
Load the data:
load data local inpath '/root/words.txt' into table t_juzi;
The word count query
select a.word,count(1) cnt
from
(select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word) a
group by a.word
order by cnt desc;
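The split/explode/group-by pipeline above is exactly a word count. The same flow in a few lines of Python, over the sample lines above, makes it easy to verify:

```python
from collections import Counter

# the rows of t_juzi
lines = ["a b c d e f g", "a b c", "e f g a", "b c d b"]

# explode(split(line, ' ')): one row per word
words = [w for line in lines for w in line.split()]

# group by word, count, order by count descending
counts = Counter(words).most_common()
print(counts[0])  # ('b', 4)
```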
Window functions: row_number() and over()
Commonly used for per-group top-N queries.
zhangsan,kc1,90
zhangsan,kc2,95
zhangsan,kc3,68
lisi,kc1,88
lisi,kc2,95
lisi,kc3,98
Create the table:
create table t_rowtest(name string,kcId string,score int)
row format delimited
fields terminated by ',';
See what row_number() over() produces:
select *,row_number() over(partition by name order by score desc) as rank from t_rowtest;
With that, per-group top-N becomes simple:
select name,kcid,score
from
(select *,row_number() over(partition by name order by score desc) as rank from t_rowtest) tmp
where rank<3;
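row_number() over(partition by name order by score desc) can be emulated to check the top-N logic. A Python sketch over the score rows above: sort by (partition key, descending score), then number rows within each partition.

```python
from itertools import groupby

rows = [("zhangsan", "kc1", 90), ("zhangsan", "kc2", 95), ("zhangsan", "kc3", 68),
        ("lisi", "kc1", 88), ("lisi", "kc2", 95), ("lisi", "kc3", 98)]

# partition by name, order by score desc, then number the rows in each partition
ranked = []
for name, grp in groupby(sorted(rows, key=lambda r: (r[0], -r[2])), key=lambda r: r[0]):
    for rank, (n, kc, score) in enumerate(grp, start=1):
        ranked.append((n, kc, score, rank))

# rank < 3 keeps each student's top two scores
top2 = [r for r in ranked if r[3] < 3]
print(top2)
```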
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ',';
load data local inpath '/root/hivetest/rn.dat' into table t_rn;
Number the rows within each group
select *
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn<3;
Window function sum() over(): accumulates row by row within a window
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ',';
load data local inpath '/root/accumulate.dat' into table t_access_times;
The plain-SQL approach:
1. First, compute each user's monthly total
select username,month,sum(counts) as salary from t_access_times group by username,month;
2. Self-join the monthly-totals result with itself
select A.*,B.* FROM
(select username,month,sum(counts) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(counts) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month;
3. Aggregate the joined result (assuming it was saved as t_tmp2):
select auname,amonth,acnts,sum(bcnts)
from t_tmp2
group by auname,amonth,acnts;
The whole logic as a single SQL statement:
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(counts) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(counts) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;
The sum() over() approach
Each user's running total up to each month:
select username,month,counts,
sum(counts) over(partition by username order by month rows between unbounded preceding and current row) as accumulate
from t_access_times;
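sum() over(partition by username order by month rows between unbounded preceding and current row) is a per-user running total. A Python sketch over the per-user monthly totals (pre-aggregated here from the A/B/C rows above; the Hive window query itself operates on the raw rows) shows the accumulation:

```python
from itertools import accumulate

# per-user monthly totals for 2015-01..2015-03,
# e.g. for A: 5+15+8+5 = 33, 4+6 = 10, 14+6 = 20
monthly = {"A": [33, 10, 20], "B": [30, 15, 45], "C": [30, 40, 30]}

# running total per user, month by month (unbounded preceding .. current row)
running = {user: list(accumulate(vals)) for user, vals in monthly.items()}
print(running["A"])  # [33, 43, 63]
```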
Conditional function: case when
Requirement: return each user's id, name, and age bracket (under 30: young; 30 to 40: middle-aged; 40 and above: elderly)
select id,name,
case
when info.age<30 then 'young'
when info.age>=30 and info.age<40 then 'middle-aged'
else 'elderly'
end
from t_user;
Conditional function: IF
Requirement: list the movies, showing 'good movie' when 吴刚 is among the lead actors, otherwise 'bad movie'
select movie_name,actors,first_show,
if(array_contains(actors,'吴刚'),'good movie','bad movie')
from t_movie;
Exercise
Data for 2017-09-15:
192.168.33.6,hunter,2017-09-15 10:30:20,/a
192.168.33.7,hunter,2017-09-15 10:30:26,/b
192.168.33.6,jack,2017-09-15 10:30:27,/a
192.168.33.8,tom,2017-09-15 10:30:28,/b
192.168.33.9,rose,2017-09-15 10:30:30,/b
192.168.33.10,julia,2017-09-15 10:30:40,/c
Data for 2017-09-16:
192.168.33.16,hunter,2017-09-16 10:30:20,/a
192.168.33.18,jerry,2017-09-16 10:30:30,/b
192.168.33.26,jack,2017-09-16 10:30:40,/a
192.168.33.18,polo,2017-09-16 10:30:50,/b
192.168.33.39,nissan,2017-09-16 10:30:53,/b
192.168.33.39,nissan,2017-09-16 10:30:55,/a
192.168.33.39,nissan,2017-09-16 10:30:58,/c
192.168.33.20,ford,2017-09-16 10:30:54,/c
Data for 2017-09-17:
192.168.33.46,hunter,2017-09-17 10:30:21,/a
192.168.43.18,jerry,2017-09-17 10:30:22,/b
192.168.43.26,tom,2017-09-17 10:30:23,/a
192.168.53.18,bmw,2017-09-17 10:30:24,/b
192.168.63.39,benz,2017-09-17 10:30:25,/b
192.168.33.25,haval,2017-09-17 10:30:30,/c
192.168.33.10,julia,2017-09-17 10:30:40,/c
Requirements:
Create a table to store each day's newly added data (a partitioned table)
Compute each day's active users (needs the user's ip, account, and the url and time of that user's earliest access)
Compute each day's new users
Implementation
1. Create a table to map the log data
create table t_web_log(ip string,uid string,access_time string,url string)
partitioned by (day string)
row format delimited fields terminated by ',';
2. Create a table for daily active users
create table t_user_active_day(ip string,uid string,first_access string,url string) partitioned by (day string);
A history-users table
create table t_user_history(uid string);
A new-users table
create table t_user_new_day like t_user_active_day;
3. Load the data:
load data local inpath '/home/lys/test/log.15' into table t_web_log partition(day='2017-09-15');
4. Compute the metrics
Daily active users:
insert into table t_user_active_day partition(day='2017-09-15')
select ip,uid,access_time,url
from
(select ip,uid,access_time,url,row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day='2017-09-15') tmp
where rn=1;
Daily new users:
Join the day's active users against the history table; the ones not yet present in the history table are new, and they are saved into the new-users table.
New users for the 15th:
insert into table t_user_new_day partition(day='2017-09-15')
select ip,uid,first_access,url
from
(select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day='2017-09-15') tmp
where tmp.b_uid is null;
Insert the new users into the history table:
insert into table t_user_history
select uid
from t_user_new_day where day='2017-09-15';
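The new-users step is an anti-join: the day's active users minus everyone already in the history table. The set logic is easy to check in Python (uids are taken from the 2017-09-15 sample log; the prior contents of the history table are a hypothetical assumption):

```python
# uids active on 2017-09-15 (from the sample log above)
active_today = {"hunter", "jack", "tom", "rose", "julia"}
# uids already in t_user_history (hypothetical prior state)
history = {"hunter", "rose"}

# left join + "b.uid is null" keeps only users absent from history
new_users = active_today - history
print(sorted(new_users))  # ['jack', 'julia', 'tom']

# afterwards the new users are appended to the history table
history |= new_users
```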
Run it all from a shell script
truncate table <table_name>: removes the data but keeps the table structure.
#!/bin/bash
day_str=`date -d '-1 day' +'%Y-%m-%d'`
#day_str=2017-09-15
hive_exec=/appdata/hive/bin/hive
HQL_user_active_day="
insert into table test1.t_user_active_day partition(day=\"$day_str\")
select ip,uid,access_time,url
from
(select ip,uid,access_time,url,row_number() over(partition by uid order by access_time) as rn
from test1.t_web_log
where day=\"$day_str\") tmp
where rn=1
"
$hive_exec -e "$HQL_user_active_day"
HQL_user_new_day="
insert into table test1.t_user_new_day partition(day=\"$day_str\")
select ip,uid,first_access,url
from
(select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from test1.t_user_active_day a
left join test1.t_user_history b on a.uid=b.uid
where a.day=\"$day_str\") tmp
where tmp.b_uid is null
"
$hive_exec -e "$HQL_user_new_day"
HQL_new_to_history="
insert into table test1.t_user_history
select uid
from test1.t_user_new_day where day=\"$day_str\"
"
$hive_exec -e "$HQL_new_to_history"
A custom Hive function: parse_user_info()
Steps:
1. Write a Java class implementing the function's logic
import org.apache.hadoop.hive.ql.exec.UDF;

public class UserInfoParser extends UDF {
    // input line format: 1,zhangsan:18-1999063117:30:00-beijing
    public String evaluate(String line, int index) {
        // normalize the three delimiters (',', ':', '-') to '\001'
        String newLine = line.replaceAll(",", "\001").replaceAll(":", "\001").replaceAll("-", "\001");
        String[] split = newLine.split("\001");
        StringBuilder sb = new StringBuilder();
        sb.append(split[0])                       // id
          .append("\t")
          .append(split[1])                       // name
          .append("\t")
          .append(split[2])                       // age
          .append("\t")
          .append(split[3].substring(0, 8))       // date part: yyyyMMdd
          .append("\t")
          .append(split[3].substring(8, 10)).append(split[4]).append(split[5])  // time part: HHmmss
          .append("\t")
          .append(split[6]);                      // city
        return sb.toString().split("\t")[index];
    }
}
2. Package the Java class into a jar: d:/up.jar
3. Upload the jar to the machine where Hive runs: /root/up.jar
4. At the Hive prompt, add the jar:
hive> add jar /root/up.jar;
5. Create a Hive function name bound to the Java class inside the jar:
hive> create temporary function parse_user_info as 'com.doit.hive.udf.UserInfoParser';
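The parsing that evaluate() performs can be prototyped outside the JVM. A Python sketch of the same logic, handy for checking the field indexes before packaging the jar:

```python
def parse_user_info(line: str, index: int) -> str:
    # normalize all three delimiters (',', ':', '-') to '\x01', as the Java UDF does
    parts = line.replace(",", "\x01").replace(":", "\x01").replace("-", "\x01").split("\x01")
    fields = [
        parts[0],                              # id
        parts[1],                              # name
        parts[2],                              # age
        parts[3][:8],                          # date part: yyyyMMdd
        parts[3][8:10] + parts[4] + parts[5],  # time part: HHmmss
        parts[6],                              # city
    ]
    return fields[index]

print(parse_user_info("1,zhangsan:18-1999063117:30:00-beijing", 3))  # 19990631
```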