Hive
What is Hive?
1. Hive is a tool that translates SQL into MapReduce programs.
2. Hive lets users map files on HDFS to table structures, after which they can issue SQL to query and analyze those tables (i.e., the files on HDFS).
3. Hive stores user-defined databases, table schemas, and other metadata in its metastore database (which can be a local Derby instance or a remote MySQL instance).
What is Hive for?
It frees big-data analysts from hand-writing large numbers of MapReduce programs to analyze data; writing SQL scripts is enough.
Hive can be used to build a data warehouse on top of a big-data stack.
Install the MySQL database
Hive's database and table metadata will be stored in MySQL.
Uninstall MySQL
Uninstall any existing copy first, to avoid a conflicting duplicate installation.
Check with yum whether MySQL is installed: yum list installed | grep -i mysql
If the command prints nothing, MySQL has never been installed.
Remove the packages yum lists, by name:
#yum remove mysql-community-client mysql-community-common mysql-community-libs mysql-community-libs-compat mysql-community-server mysql57-community-release
#rm -rf /var/lib/mysql
#rm /etc/my.cnf
Check installed packages with rpm
rpm -qa | grep -i mysql
Remove them with rpm
#rpm -e mysql57-community-release-el7-9.noarch
#rpm -e mysql-community-server-5.7.17-1.el7.x86_64
#rpm -e mysql-community-libs-5.7.17-1.el7.x86_64
#rpm -e mysql-community-libs-compat-5.7.17-1.el7.x86_64
#rpm -e mysql-community-common-5.7.17-1.el7.x86_64
#rpm -e mysql-community-client-5.7.17-1.el7.x86_64
#cd /var/lib/
#rm -rf mysql/
Clean up leftovers
#whereis mysql
mysql: /usr/bin/mysql /usr/lib64/mysql /usr/local/mysql /usr/share/mysql /usr/share/man/man1/mysql.1.gz
Delete the directories listed above
rm -rf /usr/bin/mysql
Delete the configuration files
#rm -rf /usr/my.cnf
#rm -rf /root/.mysql_secret
Check for remaining service entries
#chkconfig --list | grep -i mysql
#chkconfig --del mysqld
Install MySQL
For a very detailed walkthrough, see: https://www.jianshu.com/p/9043223c4168
Install Hive
1. Upload the installation package to the cluster
2. Extract the package to the target directory
tar -zxvf apache-hive-2.3.5-bin.tar.gz -C /appdata/
3. Add environment variables
vi /etc/profile
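A typical /etc/profile addition for this step looks like the sketch below. The /appdata/hive path follows the layout used later in this document (which assumes the extracted apache-hive directory was renamed or symlinked to /appdata/hive); adjust it to your own layout.

```shell
# /etc/profile -- Hive environment (path assumed from the tar -C /appdata/ step above)
export HIVE_HOME=/appdata/hive
export PATH=$PATH:$HIVE_HOME/bin
```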
Reload the environment: source /etc/profile
4. Verify the installation
hive --version
5. Edit the configuration file hive-env.sh
cd /appdata/hive/conf
cp hive-env.sh.template hive-env.sh
vi hive-env.sh
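The text does not show the hive-env.sh edits; typically only the Hadoop location and Hive's conf directory need to be set. A sketch, where both paths are assumptions based on the directory layout used in this document:

```shell
# hive-env.sh -- point Hive at Hadoop and at its own conf directory (assumed paths)
export HADOOP_HOME=/appdata/hadoop
export HIVE_CONF_DIR=/appdata/hive/conf
```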
6. Edit the configuration file hive-site.xml
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://n1:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
7. Create the hive metastore database in MySQL (with createDatabaseIfNotExist=true in the JDBC URL above, it is created automatically)
8. Initialize the hive metastore schema
cd /appdata/hive/bin
schematool -dbType mysql -initSchema
9. Inspect the tables created in the MySQL database
10. Start the hive CLI
./hive
11. List the existing databases: show databases;
12. Verify
Installation succeeded!
How can Hive be used?
Local interactive queries (1)
Start command: bin/hive
In the interactive client: hive> select * from t_test;
Run Hive as a service (2)
Start command: bin/hiveserver2 (in the background: nohup hiveserver2 1>/dev/null 2>&1 &)
Then, from any machine, the beeline client can connect to the Hive service for interactive queries.
Client start command: bin/beeline -u jdbc:hive2://n1:10000 -n lys
Note: the client may throw an exception on startup
whose key message is: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: User: lys is not allowed to impersonate anonymous
Fix: access HDFS through the WebHDFS REST interface, proxying the connecting user through an allowed proxy user.
First enable the REST interface by adding to hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
Then add to core-site.xml:
<property>
<name>hadoop.proxyuser.lys.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.lys.groups</name>
<value>*</value>
</property>
Configuring a remote beeline connection
Configure Hive for remote beeline access (source: https://www.cnblogs.com/zuizui1204/p/9999652.html)
0. Install Hive and copy the MySQL driver jar into its lib directory.
1. In hive-site.xml on the node where the Hive service is installed, add the remote username and password for beeline (already done during installation).
2. Make sure the Hadoop cluster is running and, importantly, start hiveserver2 on the Hive service node.
Running hiveserver2 in the foreground keeps the terminal busy printing Hive's operation log, so run it in the background instead.
Background start command: nohup hiveserver2 1>/dev/null 2>&1 &
How do you stop a nohup background job? One way: jobs -l shows the process id, then kill -9 <pid>.
3. Copy the Hadoop and Hive installation directories to the Linux machine that will act as the Hive client, and configure /etc/profile.
Note: without HADOOP_HOME, running beeline complains that the HADOOP_HOME parameter is missing. Think of it as the cluster connection settings: addresses such as master:9000 come from the related hdfs-site.xml configuration. Run source /etc/profile to reload the profile.
Hive needs this information to reach the cluster; after all, it operates on files in HDFS.
4. Command setup: after the three steps above you can already log in to Hive remotely with beeline -u jdbc:hive2://n1:10000 -n lys --color=true --silent=false
But that is a long command, so set up a shorter one: define an alias and add it to ~/.bashrc
alias beeline="/opt/softWare/hive/apache-hive-1.2.2-bin/bin/beeline -u jdbc:hive2://ip:10000 -nroot --color=true --silent=false"
Then source ~/.bashrc. Note that the alias uses an absolute path: the new command is itself called beeline, and the absolute path avoids the name clash; you can of course pick a different name.
5. Finally, test it: typing beeline drops you into the remote Hive command line.
Run Hive as a one-shot command (3)
Hive can also execute statements as a one-off command:
a. Short statements
bin/hive -e "sql1;sql2;sql3;sql4"
b. Longer statements
For longer statements, write the SQL into a file, e.g. q.hql, then execute it with the hive command:
bin/hive -f q.hql
Longer statements can also be put into a shell script (e.g. xxx.sh)
and executed by running the script.
Hive DDL
Create a database
create database db1; --> Hive creates a directory under /user/hive/warehouse/: db1.db
Create an internal (managed) table
An internal table's data is stored in the warehouse directory.
use db1;
create table t_test1(id int,name string,age int,create_time bigint)
row format delimited
fields terminated by '\001';
After the table is created, Hive makes a table directory in the warehouse: /user/hive/warehouse/db1.db/t_test1
Create an external table
create external table t_test1(id int,name string,age int,create_time bigint)
row format delimited
fields terminated by '\001'
location '/external/t_test';
Note: fields terminated by '\001' splits the data stored in HDFS on '\001'.
location '/external/t_test' maps the table onto existing HDFS data; the data is not moved into the warehouse.
Difference: an internal table's directory is created by Hive under the default warehouse path: /user/hive/warehouse/...
An external table's directory is chosen by the user at creation time: location '/path/'
Dropping an internal table deletes both the table's metadata and its data directory;
dropping an external table deletes only the metadata and leaves the data directory in place.
Why it matters: in a data warehouse, the data always has a source, usually produced by some other application,
and its directory layout follows no fixed rule. External tables make such directories easy to map into Hive; and even if
the table is dropped in Hive, the data directory survives, so the other application is unaffected.
Drop a table
drop table <table_name>; -- the table is deleted permanently, with no way to recover it.
Load data
Loading essentially places data files into the table directory;
it can be done with a Hive command:
hive> load data [local] inpath '/data/path' [overwrite] into table t_test;
By default the path refers to HDFS and the data is moved into the warehouse.
local: the path refers to the local Linux filesystem, and the data is copied into the warehouse.
Create a partitioned table
Partitioning stores the data in subdirectories, so queries can restrict reads to a more precise slice of the data;
create table t_test1(id int,name string,age int,create_time bigint)
partitioned by (day string,country string)
row format delimited
fields terminated by '\001';
Load data into a specific partition
hive> load data [local] inpath '/data/path1' [overwrite] into table t_test partition(day='2017-06-04',country='China');
hive> load data [local] inpath '/data/path2' [overwrite] into table t_test partition(day='2017-06-05',country='China');
hive> load data [local] inpath '/data/path3' [overwrite] into table t_test partition(day='2017-06-04',country='England');
After loading, the resulting directory structure is:
/user/hive/warehouse/db1.db/t_test1/day=2017-06-04/country=China/…
/user/hive/warehouse/db1.db/t_test1/day=2017-06-04/country=England/…
/user/hive/warehouse/db1.db/t_test1/day=2017-06-05/country=China/…
Add partitions
alter table t_4 add partition(day='2017-04-10') partition(day='2017-04-11');
After adding them, check t_4's partitions:
show partitions t_4;
+-----------------+
|    partition    |
+-----------------+
| day=2017-04-08  |
| day=2017-04-09  |
| day=2017-04-10  |
| day=2017-04-11  |
+-----------------+
Load data with load
load data local inpath '/root/weblog.3' into table t_4 partition(day='2017-04-10');
0: jdbc:hive2://hdp-nn-01:10000> select * from t_4 where day='2017-04-10';
+--------------+-----------------------+---------------+-------------+
|    t_4.ip    |        t_4.url        | t_4.staylong  |   t_4.day   |
+--------------+-----------------------+---------------+-------------+
| 10.300.33.6  | niubi.com/b/b.html    | 300           | 2017-04-10  |
| 30.20.33.7   | niubi.com/a/c.html    | 400           | 2017-04-10  |
| 30.80.33.6   | niubi.com/c/bb.html   | 300           | 2017-04-10  |
| 30.60.33.8   | niubi.com/aa/bc.html  | 60            | 2017-04-10  |
| 30.100.33.6  | niubi.com/a/b.html    | 30            | 2017-04-10  |
| 30.100.33.9  | niubi.com/a/ab.html   | 10            | 2017-04-10  |
+--------------+-----------------------+---------------+-------------+
insert also works:
insert into table t_4 partition(day='2017-04-11')
select ip,url,staylong from t_4 where day='2017-04-08' and staylong>30;
Drop a partition
alter table t_4 drop partition(day='2017-04-11');
Alter column definitions
Add columns:
alter table t_seq add columns(address string,age int);
Replace all columns:
alter table t_seq replace columns(id int,name string,address string,age int);
Change an existing column's definition:
alter table t_seq change userid uid string;
Show commands
show tables;
show databases;
show partitions
e.g.: show partitions t_4;
show functions; -- list all of Hive's built-in functions
desc t_name; -- show the table definition
desc extended t_name; -- show detailed table information
desc formatted table_name; -- show detailed table information, nicely formatted
show create table table_name; -- show the CREATE TABLE statement
Hive DML
The basic query syntax is essentially standard SQL:
SELECT FIELDS,FUNCTION(FIELDS)
FROM T1
JOIN T2
WHERE CONDITION
GROUP BY FIELDS
HAVING CONDITION
ORDER BY FIELDS DESC|ASC
Joins
Test data:
create table t_a(name string,numb int)
row format delimited
fields terminated by ',';
create table t_b(name string,nick string)
row format delimited
fields terminated by ',';
load data local inpath '/root/test/a.txt' into table t_a;
load data local inpath '/root/test/b.txt' into table t_b;
a.txt
a,123
b,232
c,982
d,444
e,89
b.txt
a,beijing
b,shanghai
c,nanjing
g,chongqing
Inner join / Cartesian product
select a.*,b.*
from t_a a inner join t_b b;
Specify a join condition
select a.*,b.*
from
t_a a join t_b b on a.name=b.name;
Left outer join (left join)
select a.*,b.*
from
t_a a left outer join t_b b on a.name=b.name;
Right outer join (right join)
select a.*,b.*
from
t_a a right outer join t_b b on a.name=b.name;
Full outer join
select a.*,b.*
from
t_a a full outer join t_b b on a.name=b.name;
Left semi join
select a.*
from
t_a a left semi join t_b b on a.name=b.name;
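The join variants above can be sanity-checked outside Hive. A minimal Python sketch, using the a.txt/b.txt rows above as the two tables, reproduces the inner, left-outer, and left-semi join results on the name key:

```python
# Emulate Hive's join variants on the a.txt / b.txt sample data, keyed by name.
t_a = {"a": 123, "b": 232, "c": 982, "d": 444, "e": 89}
t_b = {"a": "beijing", "b": "shanghai", "c": "nanjing", "g": "chongqing"}

# inner join on name: only keys present in both tables
inner = {k: (t_a[k], t_b[k]) for k in t_a if k in t_b}

# left outer join: every row of t_a, None (NULL) where t_b has no match
left = {k: (t_a[k], t_b.get(k)) for k in t_a}

# left semi join: rows of t_a that have at least one match in t_b;
# only t_a's columns appear in the result
semi = {k: t_a[k] for k in t_a if k in t_b}

print(sorted(inner))  # ['a', 'b', 'c']
print(left["d"])      # (444, None)
print(sorted(semi))   # ['a', 'b', 'c']
```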
Grouped aggregate queries
A combined grouping-and-aggregation example over the following data:
192.168.33.3,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.3,http://www.edu360.cn/teach,2017-08-04 15:35:20
192.168.33.4,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.4,http://www.edu360.cn/job,2017-08-04 16:30:20
192.168.33.5,http://www.edu360.cn/job,2017-08-04 15:40:20
192.168.33.3,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.44.3,http://www.edu360.cn/teach,2017-08-05 15:35:20
192.168.33.44,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.33.46,http://www.edu360.cn/job,2017-08-05 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-05 15:40:20
192.168.133.3,http://www.edu360.cn/register,2017-08-06 15:30:20
192.168.111.3,http://www.edu360.cn/register,2017-08-06 15:35:20
192.168.34.44,http://www.edu360.cn/pay,2017-08-06 15:30:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.25,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.36,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
Create a table to map the data above
create table t_access(ip string,url string,access_time string)
partitioned by (dt string)
row format delimited fields terminated by ',';
Load the data
load data local inpath '/root/hivetest/access.log.0804' into table t_access partition(dt='2017-08-04');
load data local inpath '/root/hivetest/access.log.0805' into table t_access partition(dt='2017-08-05');
load data local inpath '/root/hivetest/access.log.0806' into table t_access partition(dt='2017-08-06');
Check the table's partitions
show partitions t_access;
Computing on every row
select ip,upper(url),access_time -- this expression is evaluated on each row of the data, one row at a time
from t_access;
Total access count per URL
select url,count(1) as cnts -- this expression is evaluated once per group of the grouped data
from t_access
group by url;
The largest visitor IP address per URL
select url,max(ip)
from t_access
group by url;
For each user and page, the record with the latest access time
select ip,url,max(access_time)
from t_access
group by ip,url;
Total accesses per day to http://www.edu360.cn/job after Aug 4, plus the largest visitor IP (four equivalent formulations follow)
select dt,'http://www.edu360.cn/job',count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';
select dt,max(url),count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'2017-08-04';
select dt,url,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job'
group by dt,url having dt>'2017-08-04';
select dt,url,count(1),max(ip)
from t_access
where url='http://www.edu360.cn/job' and dt>'2017-08-04'
group by dt,url;
Per-day, per-page total accesses after Aug 4, plus the largest visitor IP
select dt,url,count(1),max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url;
Same as above, but only return records whose total access count is > 2
Option 1:
select dt,url,count(1) as cnts,max(ip)
from t_access
where dt>'2017-08-04'
group by dt,url having cnts>2;
-- Option 2: a subquery
select dt,url,cnts,max_ip
from
(select dt,url,count(1) as cnts,max(ip) as max_ip
from t_access
where dt>'2017-08-04'
group by dt,url) tmp
where cnts>2;
Complex data types
Arrays
Given the following data:
战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20
普罗米修斯,苍老师:小泽老师:波多老师,2017-09-17
美女与野兽,吴刚:加藤鹰,2017-09-17
Create the mapping table:
create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data
load data local inpath '/root/hivetest/actor.dat' into table t_movie;
load data local inpath '/root/hivetest/actor.dat.2' into table t_movie;
Queries
select movie_name,actors[0],first_show from t_movie;
select movie_name,actors,first_show
from t_movie where array_contains(actors,'吴刚');
select movie_name,size(actors) as actor_number,first_show
from t_movie;
Maps
Given the following data:
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
Create a table to map the data above
create table t_family(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
Load the data
load data local inpath '/root/hivetest/fm.dat' into table t_family;
+--------------+----------------+---------------------------------------------------------------+---------------+
| t_family.id  | t_family.name  | t_family.family_members                                       | t_family.age  |
+--------------+----------------+---------------------------------------------------------------+---------------+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}      | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29            |
| 4            | mayun          | {"father":"mayongzhen","mother":"angelababy"}                 | 26            |
+--------------+----------------+---------------------------------------------------------------+---------------+
Each person's father and sister
select id,name,family_members["father"] as father,family_members["sister"] as sister,age
from t_family;
The kinds of relations each person has
select id,name,map_keys(family_members) as relations,age
from t_family;
The names of each person's relatives
select id,name,map_values(family_members) as relations,age
from t_family;
How many relatives each person has
select id,name,size(family_members) as relations,age
from t_family;
Everyone who has a brother, and who that brother is
Option 1: a single query
select id,name,age,family_members['brother']
from t_family where array_contains(map_keys(family_members),'brother');
Option 2: a subquery
select id,name,age,family_members['brother']
from
(select id,name,age,map_keys(family_members) as relations,family_members
from t_family) tmp
where array_contains(relations,'brother');
The struct data type
Given the following data:
1,zhangsan,18:male:深圳
2,lisi,28:female:北京
3,wangwu,38:male:广州
4,赵六,26:female:上海
5,钱琪,35:male:杭州
6,王八,48:female:南京
Create a table to map the data above
drop table if exists t_user;
create table t_user(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data
load data local inpath '/root/hivetest/user.dat' into table t_user;
Each person's id, name, and address
select id,name,info.addr
from t_user;
Hive built-in functions
Date and time functions
string from_unixtime(bigint unixtime[, string format]);
Converts an epoch-seconds value into the given format (which can be "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd HH", "yyyy-MM-dd HH:mm", and so on), e.g. from_unixtime(1250111000,'yyyy-MM-dd') --> 2009-08-12 in a UTC session timezone (the result shifts with the session timezone)
bigint unix_timestamp();
Returns the current timestamp in the local timezone.
string to_date(string timestamp);
Extracts the date part of a timestamp string, e.g. to_date('2009-08-12 21:03:20') --> '2009-08-12'
Type-casting functions
from_unixtime(cast('21938792183' as bigint),'yyyy-MM-dd HH:mm:ss');
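from_unixtime performs the same conversion as any standard date library; a quick Python check of the example value above (computed in UTC, whereas Hive uses the session timezone, so a UTC+8 session shifts the result by 8 hours):

```python
from datetime import datetime, timezone

# from_unixtime(1250111000, 'yyyy-MM-dd HH:mm:ss'), expressed in UTC terms
dt = datetime.fromtimestamp(1250111000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2009-08-12 21:03:20
```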
String substring and concatenation
substr("abcd",1,3) --> 'abc'
concat('abc','def') --> 'abcdef'
JSON parsing functions
get_json_object('{"key1":3333,"key2":4444}' , '$.key1') --> 3333
json_tuple('{"key1":3333,"key2":4444}','key1','key2') as(key1,key2) --> 3333, 4444
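The '$.key1' argument of get_json_object is a JSONPath expression; for a flat object it amounts to a plain dictionary lookup, which a short Python sketch makes concrete:

```python
import json

doc = '{"key1":3333,"key2":4444}'
obj = json.loads(doc)

# get_json_object(doc, '$.key1') -> value at the top-level key "key1"
print(obj["key1"])                 # 3333
# json_tuple(doc, 'key1', 'key2') -> one column per requested key
print((obj["key1"], obj["key2"]))  # (3333, 4444)
```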
URL parsing function
parse_url_tuple('http://www.edu360.cn/bigdata/baoming?userid=8888','HOST','PATH','QUERY','QUERY:userid')
--> www.edu360.cn /bigdata/baoming userid=8888 8888
The explode function
explode expands an array into multiple rows, one row per element
1,zhangsan,数学:语文:英语:生物
2,lisi,数学:语文
3,wangwu,化学:计算机:java编程
Create the table:
create table t_xuanxiu(uid string,name string,kc array<string>)
row format delimited
fields terminated by ','
collection items terminated by ':';
What explode produces:
select explode(kc) from t_xuanxiu where uid=1;
lateral view and table-generating functions
hive> select uid,name,tmp.* from t_xuanxiu
> lateral view explode(kc) tmp as course;
A Hive word count built from explode and lateral view
a b c d e f g
a b c
e f g a
b c d b
Create a table for the data:
create table t_juzi(line string) row format delimited;
Load the data:
load data local inpath '/root/words.txt' into table t_juzi;
The word count query
select a.word,count(1) cnt
from
(select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word) a
group by a.word
order by cnt desc;
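The split/explode/group-by pipeline above is exactly a word count. The same flow in a few lines of Python, over the sample lines above, makes it easy to verify:

```python
from collections import Counter

# the rows of t_juzi
lines = ["a b c d e f g", "a b c", "e f g a", "b c d b"]

# explode(split(line, ' ')): one row per word
words = [w for line in lines for w in line.split()]

# group by word, count, order by count descending
counts = Counter(words).most_common()
print(counts[0])  # ('b', 4)
```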
Window functions: row_number() and over()
Commonly used for per-group top-N queries.
zhangsan,kc1,90
zhangsan,kc2,95
zhangsan,kc3,68
lisi,kc1,88
lisi,kc2,95
lisi,kc3,98
Create the table:
create table t_rowtest(name string,kcId string,score int)
row format delimited
fields terminated by ',';
See what row_number() over() produces:
select *,row_number() over(partition by name order by score desc) as rank from t_rowtest;
With that, per-group top-N becomes simple:
select name,kcid,score
from
(select *,row_number() over(partition by name order by score desc) as rank from t_rowtest) tmp
where rank<3;
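row_number() over(partition by name order by score desc) can be emulated to check the top-N logic. A Python sketch over the score rows above: sort by (partition key, descending score), then number rows within each partition.

```python
from itertools import groupby

rows = [("zhangsan", "kc1", 90), ("zhangsan", "kc2", 95), ("zhangsan", "kc3", 68),
        ("lisi", "kc1", 88), ("lisi", "kc2", 95), ("lisi", "kc3", 98)]

# partition by name, order by score desc, then number the rows in each partition
ranked = []
for name, grp in groupby(sorted(rows, key=lambda r: (r[0], -r[2])), key=lambda r: r[0]):
    for rank, (n, kc, score) in enumerate(grp, start=1):
        ranked.append((n, kc, score, rank))

# rank < 3 keeps each student's top two scores
top2 = [r for r in ranked if r[3] < 3]
print(top2)
```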
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ',';
load data local inpath '/root/hivetest/rn.dat' into table t_rn;
Number the rows within each group
select *
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn<3;
Window function sum() over(): accumulates row by row within a window
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ',';
load data local inpath '/root/accumulate.dat' into table t_access_times;
The plain-SQL approach:
1. First, compute each user's monthly total
select username,month,sum(counts) as salary from t_access_times group by username,month;
2. Self-join the monthly-totals result with itself
select A.*,B.* FROM
(select username,month,sum(counts) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(counts) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month;
3. Aggregate the joined result (assuming it was saved as t_tmp2):
select auname,amonth,acnts,sum(bcnts)
from t_tmp2
group by auname,amonth,acnts;
The whole logic as a single SQL statement:
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(counts) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(counts) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;
The sum() over() approach
Each user's running total up to each month:
select username,month,counts,
sum(counts) over(partition by username order by month rows between unbounded preceding and current row) as accumulate
from t_access_times;
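sum() over(partition by username order by month rows between unbounded preceding and current row) is a per-user running total. A Python sketch over the per-user monthly totals (pre-aggregated here from the A/B/C rows above; the Hive window query itself operates on the raw rows) shows the accumulation:

```python
from itertools import accumulate

# per-user monthly totals for 2015-01..2015-03,
# e.g. for A: 5+15+8+5 = 33, 4+6 = 10, 14+6 = 20
monthly = {"A": [33, 10, 20], "B": [30, 15, 45], "C": [30, 40, 30]}

# running total per user, month by month (unbounded preceding .. current row)
running = {user: list(accumulate(vals)) for user, vals in monthly.items()}
print(running["A"])  # [33, 43, 63]
```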
Conditional function: case when
Requirement: return each user's id, name, and age bracket (under 30: young; 30 to 40: middle-aged; 40 and above: elderly)
select id,name,
case
when info.age<30 then 'young'
when info.age>=30 and info.age<40 then 'middle-aged'
else 'elderly'
end
from t_user;
Conditional function: IF
Requirement: list the movies, showing 'good movie' when 吴刚 is among the lead actors, otherwise 'bad movie'
select movie_name,actors,first_show,
if(array_contains(actors,'吴刚'),'good movie','bad movie')
from t_movie;
Exercise
Data for 2017-09-15:
192.168.33.6,hunter,2017-09-15 10:30:20,/a
192.168.33.7,hunter,2017-09-15 10:30:26,/b
192.168.33.6,jack,2017-09-15 10:30:27,/a
192.168.33.8,tom,2017-09-15 10:30:28,/b
192.168.33.9,rose,2017-09-15 10:30:30,/b
192.168.33.10,julia,2017-09-15 10:30:40,/c
Data for 2017-09-16:
192.168.33.16,hunter,2017-09-16 10:30:20,/a
192.168.33.18,jerry,2017-09-16 10:30:30,/b
192.168.33.26,jack,2017-09-16 10:30:40,/a
192.168.33.18,polo,2017-09-16 10:30:50,/b
192.168.33.39,nissan,2017-09-16 10:30:53,/b
192.168.33.39,nissan,2017-09-16 10:30:55,/a
192.168.33.39,nissan,2017-09-16 10:30:58,/c
192.168.33.20,ford,2017-09-16 10:30:54,/c
Data for 2017-09-17:
192.168.33.46,hunter,2017-09-17 10:30:21,/a
192.168.43.18,jerry,2017-09-17 10:30:22,/b
192.168.43.26,tom,2017-09-17 10:30:23,/a
192.168.53.18,bmw,2017-09-17 10:30:24,/b
192.168.63.39,benz,2017-09-17 10:30:25,/b
192.168.33.25,haval,2017-09-17 10:30:30,/c
192.168.33.10,julia,2017-09-17 10:30:40,/c
Requirements:
Create a table to store each day's newly added data (a partitioned table)
Compute each day's active users (needs the user's ip, account, and the url and time of that user's earliest access)
Compute each day's new users
Implementation
1. Create a table to map the log data
create table t_web_log(ip string,uid string,access_time string,url string)
partitioned by (day string)
row format delimited fields terminated by ',';
2. Create a table for daily active users
create table t_user_active_day(ip string,uid string,first_access string,url string) partitioned by (day string);
A history-users table
create table t_user_history(uid string);
A new-users table
create table t_user_new_day like t_user_active_day;
3. Load the data:
load data local inpath '/home/lys/test/log.15' into table t_web_log partition(day='2017-09-15');
4. Compute the metrics
Daily active users:
insert into table t_user_active_day partition(day='2017-09-15')
select ip,uid,access_time,url
from
(select ip,uid,access_time,url,row_number() over(partition by uid order by access_time) as rn
from t_web_log
where day='2017-09-15') tmp
where rn=1;
Daily new users:
Join the day's active users against the history table; the ones not yet present in the history table are new, and they are saved into the new-users table.
New users for the 15th:
insert into table t_user_new_day partition(day='2017-09-15')
select ip,uid,first_access,url
from
(select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from t_user_active_day a
left join t_user_history b on a.uid=b.uid
where a.day='2017-09-15') tmp
where tmp.b_uid is null;
Insert the new users into the history table:
insert into table t_user_history
select uid
from t_user_new_day where day='2017-09-15';
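The new-users step is an anti-join: the day's active users minus everyone already in the history table. The set logic is easy to check in Python (uids are taken from the 2017-09-15 sample log; the prior contents of the history table are a hypothetical assumption):

```python
# uids active on 2017-09-15 (from the sample log above)
active_today = {"hunter", "jack", "tom", "rose", "julia"}
# uids already in t_user_history (hypothetical prior state)
history = {"hunter", "rose"}

# left join + "b.uid is null" keeps only users absent from history
new_users = active_today - history
print(sorted(new_users))  # ['jack', 'julia', 'tom']

# afterwards the new users are appended to the history table
history |= new_users
```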
Run it all from a shell script
truncate table <table_name>: removes the data but keeps the table structure.
#!/bin/bash
day_str=`date -d '-1 day' +'%Y-%m-%d'`
#day_str=2017-09-15
hive_exec=/appdata/hive/bin/hive
HQL_user_active_day="
insert into table test1.t_user_active_day partition(day=\"$day_str\")
select ip,uid,access_time,url
from
(select ip,uid,access_time,url,row_number() over(partition by uid order by access_time) as rn
from test1.t_web_log
where day=\"$day_str\") tmp
where rn=1
"
$hive_exec -e "$HQL_user_active_day"
HQL_user_new_day="
insert into table test1.t_user_new_day partition(day=\"$day_str\")
select ip,uid,first_access,url
from
(select a.ip,a.uid,a.first_access,a.url,b.uid as b_uid
from test1.t_user_active_day a
left join test1.t_user_history b on a.uid=b.uid
where a.day=\"$day_str\") tmp
where tmp.b_uid is null
"
$hive_exec -e "$HQL_user_new_day"
HQL_new_to_history="
insert into table test1.t_user_history
select uid
from test1.t_user_new_day where day=\"$day_str\"
"
$hive_exec -e "$HQL_new_to_history"
A custom Hive function: parse_user_info()
Steps:
1. Write a Java class implementing the function's logic
import org.apache.hadoop.hive.ql.exec.UDF;

public class UserInfoParser extends UDF {
    // input line format: 1,zhangsan:18-1999063117:30:00-beijing
    public String evaluate(String line, int index) {
        // normalize the three delimiters (',', ':', '-') to '\001'
        String newLine = line.replaceAll(",", "\001").replaceAll(":", "\001").replaceAll("-", "\001");
        String[] split = newLine.split("\001");
        StringBuilder sb = new StringBuilder();
        sb.append(split[0])                       // id
          .append("\t")
          .append(split[1])                       // name
          .append("\t")
          .append(split[2])                       // age
          .append("\t")
          .append(split[3].substring(0, 8))       // date part: yyyyMMdd
          .append("\t")
          .append(split[3].substring(8, 10)).append(split[4]).append(split[5])  // time part: HHmmss
          .append("\t")
          .append(split[6]);                      // city
        return sb.toString().split("\t")[index];
    }
}
2. Package the Java class into a jar: d:/up.jar
3. Upload the jar to the machine where Hive runs: /root/up.jar
4. At the Hive prompt, add the jar:
hive> add jar /root/up.jar;
5. Create a Hive function name bound to the Java class inside the jar:
hive> create temporary function parse_user_info as 'com.doit.hive.udf.UserInfoParser';
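The parsing that evaluate() performs can be prototyped outside the JVM. A Python sketch of the same logic, handy for checking the field indexes before packaging the jar:

```python
def parse_user_info(line: str, index: int) -> str:
    # normalize all three delimiters (',', ':', '-') to '\x01', as the Java UDF does
    parts = line.replace(",", "\x01").replace(":", "\x01").replace("-", "\x01").split("\x01")
    fields = [
        parts[0],                              # id
        parts[1],                              # name
        parts[2],                              # age
        parts[3][:8],                          # date part: yyyyMMdd
        parts[3][8:10] + parts[4] + parts[5],  # time part: HHmmss
        parts[6],                              # city
    ]
    return fields[index]

print(parse_user_info("1,zhangsan:18-1999063117:30:00-beijing", 3))  # 19990631
```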