Hive SQL Practice

Problem 1: Consecutive Days

The following data records the carbon emission reductions collected by users in Ant Forest:

id    dt          lowcarbon
1001  2021-12-12  123
1002  2021-12-12  45
1001  2021-12-13  43
1001  2021-12-13  45
1001  2021-12-13  23
1002  2021-12-14  45
1001  2021-12-14  230
1002  2021-12-15  45
1001  2021-12-15  23
…

1. Find users whose daily total carbon reduction was above 100 on 3 or more consecutive days.

2. Find users who recorded carbon reductions on 3 or more consecutive days.

LAG

LAG(col, n, DEFAULT) returns the value of col from the n-th row before the current row within the window.

Argument 1 is the column name; argument 2 is the offset n (optional, default 1); argument 3 is the default value used when the n-th preceding row is NULL (if omitted, NULL is returned).
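A quick illustration of those defaults (a sketch against the demo1 table defined below):

-- prev_default_null is NULL on each user's first row; prev_default_minus1 falls back to -1 there
select id, dt, lowcarbon,
       lag(lowcarbon) over (partition by id order by dt) as prev_default_null,
       lag(lowcarbon, 1, -1) over (partition by id order by dt) as prev_default_minus1
from demo1;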

1. Create the Hive table and load the data

CREATE EXTERNAL TABLE IF NOT EXISTS demo1(
id string comment 'user id',
dt string comment 'date',
lowcarbon int comment 'carbon reduction amount'
)
comment 'Ant Forest carbon reduction test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-delimited
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the header line

2. Load the data (upload the data file to HDFS first; to load a local file, use LOAD DATA LOCAL)

load data inpath '/libangqin/data1.txt' overwrite into table demo1 partition(pdt='2022-05-14');
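A quick sanity check after loading (optional; assumes the partition value used above):

select * from demo1 where pdt = '2022-05-14' limit 5;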

3. Analysis

1. First aggregate each user's carbon reduction per day and keep only days totaling at least 100, then attach the previous qualifying date with the lag() window function:

select e.id, e.dt, sum(e.lowcarbon) as lowcarbon
from demo1 e
group by id, dt
having sum(e.lowcarbon) >= 100;

select d.id, d.dt, d.lowcarbon,
       lag(d.dt, 1, date_sub(d.dt, 1)) over (partition by id order by dt) as ldt
from (
    select e.id, e.dt, sum(e.lowcarbon) as lowcarbon
    from demo1 e
    group by id, dt
    having sum(e.lowcarbon) >= 100
) d;

d.id    d.dt    d.lowcarbon     ldt
1001    2021-12-12      123     NULL
1001    2021-12-13      43      2021-12-12
1001    2021-12-13      45      2021-12-13
1001    2021-12-13      23      2021-12-13
1001    2021-12-14      230     2021-12-13
1001    2021-12-15      23      2021-12-14
1001    2021-12-15      33      2021-12-15
1001    2021-12-16      33      2021-12-15
1001    2021-12-17      53      2021-12-16
1002    2021-12-12      45      NULL
1002    2021-12-14      45      2021-12-12
1002    2021-12-15      45      2021-12-14
1002    2021-12-18      63      2021-12-15
1003    2021-12-19      73      NULL
1004    2021-12-16      23      NULL
Compute datediff(dt, ldt) as the diff, group by it, and require count >= 3: a group of 3 or more rows means at least 3 consecutive qualifying days.

Final result:

t2.id   lowcarbon       mindate maxdate
1001    464     2021-12-12      2021-12-14
Time taken: 2.252 seconds, Fetched: 1 row(s)

select t2.id, sum(t2.lowcarbon) as lowcarbon, min(t2.dt) as minDate, max(t2.dt) as maxDate
from (
    select d.id, d.dt, d.lowcarbon,
           datediff(d.dt, lag(d.dt, 1, date_sub(d.dt, 1)) over (partition by d.id order by d.dt)) as df
    from (
        select e.id, e.dt, sum(e.lowcarbon) as lowcarbon
        from demo1 e
        group by id, dt
        having sum(e.lowcarbon) >= 100
    ) d
) t2
group by t2.id, t2.df
having count(t2.id) >= 3;
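One caveat (not raised in the original): grouping by the raw diff value can merge separate runs that happen to produce the same diff. A common alternative, sketched here, subtracts a per-user row_number() from the date so each consecutive run gets its own stable group key:

select t2.id, sum(t2.lowcarbon) as lowcarbon, min(t2.dt) as minDate, max(t2.dt) as maxDate
from (
    select d.id, d.dt, d.lowcarbon,
           -- consecutive dates minus consecutive row numbers yield a constant per run
           date_sub(d.dt, row_number() over (partition by d.id order by d.dt)) as grp
    from (
        select e.id, e.dt, sum(e.lowcarbon) as lowcarbon
        from demo1 e
        group by id, dt
        having sum(e.lowcarbon) >= 100
    ) d
) t2
group by t2.id, t2.grp
having count(1) >= 3;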

Problem 2: Grouping

The following is user visit-time data from an e-commerce company:

id ts (seconds)

1001 17523641234
1001 17523641256
1002 17523641278
1001 17523641334
1002 17523641434
1001 17523641534
1001 17523641544
1002 17523641634
1001 17523641638
1001 17523641654

If consecutive visits by the same user are less than 60 seconds apart, they fall into the same group. The expected result:

id   ts (seconds)  group
1001 17523641234   1
1001 17523641256   1
1001 17523641334   2
1001 17523641534   3
1001 17523641544   3
1001 17523641638   4
1001 17523641654   4
1002 17523641278   1
1002 17523641434   2
1002 17523641634   3
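The original gives no query for this problem; the following sketch applies the same lag-plus-running-sum idea, assuming a hypothetical table demo_visit(id string, ts bigint) loaded with the data above in the same style as the other demo tables:

-- a gap of 60 seconds or more starts a new group; the first row per user gets tsdiff 0
select id, ts,
       sum(if(tsdiff >= 60, 1, 0)) over (partition by id order by ts) + 1 as grp
from (
    select id, ts,
           ts - lag(ts, 1, ts) over (partition by id order by ts) as tsdiff
    from demo_visit
) t1;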

Problem 3: Consecutive Days with Gaps

Daily login data recorded by a game company:

id dt

1001 2021-12-12
1002 2021-12-12
1001 2021-12-13
1001 2021-12-14
1001 2021-12-16
1002 2021-12-16
1001 2021-12-19
1002 2021-12-17
1001 2021-12-20

Compute each user's longest consecutive-login streak, where a one-day gap is allowed. For example, a user who logs in on the 1st, 3rd, 5th, and 6th is considered to have a 6-day streak.

1. Create the Hive table and load the data

CREATE EXTERNAL TABLE IF NOT EXISTS demo2(
id string comment 'user id',
dt string comment 'date'
)
comment 'login streak with gaps test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-delimited
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the header line

2. Load the data (upload the data file to HDFS first; to load a local file, use LOAD DATA LOCAL)

load data inpath '/libangqin/data2.txt' overwrite into table demo2 partition(pdt='2022-05-14');

3. Analysis

First, use lag() to fetch the previous login date and take the difference; the intermediate output looks like this (id, dt, ldt, diff):

1001 2021-12-12 2021-12-11 1
1001 2021-12-13 2021-12-12 1
1001 2021-12-14 2021-12-13 1
1001 2021-12-16 2021-12-14 2
1001 2021-12-19 2021-12-16 3
1001 2021-12-20 2021-12-19 1
1002 2021-12-12 2021-12-11 1
1002 2021-12-16 2021-12-12 4
1002 2021-12-17 2021-12-16 1


1. Use lag() to get the previous login date and take the difference:

select id, dt, datediff(t1.dt, t1.ldt) df
from (
    select id, dt, lag(dt, 1, date_sub(dt, 1)) over (partition by id order by dt) ldt
    from demo2
) t1;

2. Group on df: a one-day gap gives a diff of 2, so only df > 2 breaks a streak; a running sum of those breaks labels each streak:

select id, dt, df, sum(if(df > 2, 1, 0)) over (partition by id order by dt) sumCount
from (
    select id, dt, datediff(t1.dt, t1.ldt) df
    from (
        select id, dt, lag(dt, 1, date_sub(dt, 1)) over (partition by id order by dt) ldt
        from demo2
    ) t1
) t2;

3. GROUP BY (id, sumCount) to get each streak's length maxDay:

select t3.id, (datediff(max(dt), min(dt)) + 1) as maxDay,
       max(t3.dt) as maxDate, min(t3.dt) as minDate, sumCount
from (
    select id, dt, df, sum(if(df > 2, 1, 0)) over (partition by id order by dt) sumCount
    from (
        select id, dt, datediff(t1.dt, t1.ldt) df
        from (
            select id, dt, lag(dt, 1, date_sub(dt, 1)) over (partition by id order by dt) ldt
            from demo2
        ) t1
    ) t2
) t3
group by id, sumCount;

4. The full query:

select id, maxDay, minDate, maxDate, sumCount
from (
    select t3.id, (datediff(max(dt), min(dt)) + 1) as maxDay,
           max(t3.dt) as maxDate, min(t3.dt) as minDate, sumCount
    from (
        select id, dt, df, sum(if(df > 2, 1, 0)) over (partition by id order by dt) sumCount
        from (
            select id, dt, datediff(t1.dt, t1.ldt) df
            from (
                select id, dt, lag(dt, 1, date_sub(dt, 1)) over (partition by id order by dt) ldt
                from demo2
            ) t1
        ) t2
    ) t3
    group by id, sumCount
) t4;


Running it failed with:

hive return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

Workaround: switch the execution engine to MR and run in local mode:

set hive.execution.engine=mr;
set hive.exec.mode.local.auto=true;

Result:

1001    5       2021-12-12      2021-12-16
1001    9       2021-12-19      2021-12-27
1002    1       2021-12-12      2021-12-12
1002    2       2021-12-16      2021-12-17

Group by the running sum(if(df > 2, 1, 0)), shown as the last column below:

1001 2021-12-12 2021-12-11 1 0
1001 2021-12-13 2021-12-12 1 0
1001 2021-12-14 2021-12-13 1 0
1001 2021-12-16 2021-12-14 2 0
1001 2021-12-19 2021-12-16 3 1
1001 2021-12-20 2021-12-19 1 1

Within each group, compute maxDate - minDate + 1; the largest such span per user is that user's final consecutive-day count (see the sketch below).
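The steps above stop at one row per (id, sumCount) group; a final aggregation (a sketch, not in the original) reduces each user to the single maximum streak:

select id, max(maxDay) as maxLoginDays
from (
    select id, datediff(max(dt), min(dt)) + 1 as maxDay
    from (
        select id, dt,
               sum(if(df > 2, 1, 0)) over (partition by id order by dt) as sumCount
        from (
            select id, dt,
                   datediff(dt, lag(dt, 1, date_sub(dt, 1)) over (partition by id order by dt)) as df
            from demo2
        ) t1
    ) t2
    group by id, sumCount
) t3
group by id;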

Problem 4: Overlapping Discount Dates

The following is product promotion data from a platform; the fields are brand, discount start date, and discount end date:

brand stt edt

oppo 2021-06-05 2021-06-09
oppo 2021-06-11 2021-06-21
vivo 2021-06-05 2021-06-15
vivo 2021-06-09 2021-06-21
redmi 2021-06-05 2021-06-21
redmi 2021-06-09 2021-06-15
redmi 2021-06-17 2021-06-26
huawei 2021-06-05 2021-06-26
huawei 2021-06-09 2021-06-15
huawei 2021-06-17 2021-06-21

Compute each brand's total number of discount days, counting overlapping dates only once. For example, vivo's first promotion ran from 2021-06-05 to 2021-06-15 and its second from 2021-06-09 to 2021-06-21; June 9 through June 15 overlap and are counted once, so vivo's total is 2021-06-05 through 2021-06-21, i.e. 17 days.

1. Create the Hive table and load the data

CREATE EXTERNAL TABLE IF NOT EXISTS demo3(
id string comment 'brand',
stt string comment 'discount start date',
edt string comment 'discount end date'
)
comment 'brand discount days test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-delimited
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the header line

2. Load the data (upload the data file to HDFS first; to load a local file, use LOAD DATA LOCAL)

load data inpath '/libangqin/data3.txt' overwrite into table demo3 partition(pdt='2022-05-14');

3. Analysis

1. For each row, bring forward the largest edt among all earlier rows of the same brand (window frame: rows between unbounded preceding and 1 preceding):

select id, stt, edt,
       max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding) as mtt
from demo3;

2. Clamp the effective start date mt to the day after that running maximum, so days already counted are excluded, then compute datediff(edt, mt):

select id, stt, edt, mt, datediff(edt, mt) as df
from (
    select id, stt, edt, if(mtt is null, stt, if(stt > mtt, stt, date_add(mtt, 1))) mt
    from (
        select id, stt, edt,
               max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding) as mtt
        from demo3
    ) t1
) t2;

3. sum(if(df >= 0, df + 1, 0)) gives the total discount days; an interval fully covered by earlier ones yields a negative df and contributes 0:

select id, sum(if(df >= 0, df + 1, 0)) as dayCount
from (
    select id, stt, edt, mt, datediff(edt, mt) as df
    from (
        select id, stt, edt, if(mtt is null, stt, if(stt > mtt, stt, date_add(mtt, 1))) mt
        from (
            select id, stt, edt,
                   max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding) as mtt
            from demo3
        ) t1
    ) t2
) t3
group by t3.id;
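Checking the arithmetic against vivo: the first interval 2021-06-05 to 2021-06-15 has mtt NULL, so mt = stt and df + 1 = 11 days; the second interval 2021-06-09 to 2021-06-21 has mtt = 2021-06-15, so mt = 2021-06-16 and df + 1 = 6 days. 11 + 6 = 17, matching the expected answer.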
Result:
OK
huawei  22
oppo    16
redmi   22
vivo    17
xiaomi  23

Problem 5: Concurrent Online Count

The following records streamers' stream start and end times on a livestream platform; compute the peak number of streamers online at the same time.

id stt edt

1001 2021-06-14 12:12:12 2021-06-14 18:12:12
1003 2021-06-14 13:12:12 2021-06-14 16:12:12
1004 2021-06-14 13:15:12 2021-06-14 20:12:12
1002 2021-06-14 15:12:12 2021-06-14 16:12:12
1005 2021-06-14 15:18:12 2021-06-14 20:12:12
1001 2021-06-14 20:12:12 2021-06-14 23:12:12
1006 2021-06-14 21:12:12 2021-06-14 23:15:12
1007 2021-06-14 22:12:12 2021-06-14 23:10:12

1. Create the Hive table and load the data

CREATE EXTERNAL TABLE IF NOT EXISTS demo4(
id string comment 'streamer id',
stt string comment 'stream start time',
edt string comment 'stream end time'
)
comment 'concurrent online test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ',' -- comma-delimited (the timestamps contain spaces)
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the header line

2. Load the data (upload the data file to HDFS first; to load a local file, use LOAD DATA LOCAL)

load data inpath '/libangqin/data4.txt' overwrite into table demo4 partition(pdt='2022-05-14');

3. Analysis

1. Mark each start time with +1 (streamer goes online) and each end time with -1 (streamer goes offline), then order all events by time:

select id, tt, p
from (
    select id, stt as tt, 1 as p from demo4
    union all
    select id, edt as tt, -1 as p from demo4
) t1
order by tt;

2. Window analysis. No partition by is needed here, since we want a platform-wide running total:

select id, tt, sum(p) over (order by tt) as sumP
from (
    select id, tt, p
    from (
        select id, stt as tt, 1 as p from demo4
        union all
        select id, edt as tt, -1 as p from demo4
    ) t1
    order by tt
) t2;

Result:

1001    2021-06-14 12:12:12     1
1003    2021-06-14 13:12:12     2
1004    2021-06-14 13:15:12     3
1002    2021-06-14 15:12:12     4
1005    2021-06-14 15:18:12     5
1002    2021-06-14 16:12:12     3
1003    2021-06-14 16:12:12     3
1001    2021-06-14 18:12:12     2
1001    2021-06-14 20:12:12     1
1004    2021-06-14 20:12:12     1
1005    2021-06-14 20:12:12     1
1006    2021-06-14 21:12:12     2
1007    2021-06-14 22:12:12     3
1007    2021-06-14 23:10:12     2
1001    2021-06-14 23:12:12     1
1006    2021-06-14 23:15:12     0

3. Take the maximum of the running sum; the subquery is the step-2 query:

select max(sumP) as maxOnline
from (
    select id, tt, sum(p) over (order by tt) as sumP
    from (
        select id, stt as tt, 1 as p from demo4
        union all
        select id, edt as tt, -1 as p from demo4
    ) t1
) t3;

6. grouping sets Test

Create the table and load the data

CREATE EXTERNAL TABLE IF NOT EXISTS demo5(
school string comment 'school',
grade string comment 'grade',
name string comment 'student name'
)
comment 'student table'
row format delimited fields terminated by ',' -- comma-delimited
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the header line

load data inpath '/libangqin/data5.txt' overwrite into table demo5;

Running the DDL failed with:

Hive FAILED: ParseException line 1:4 character ' ' not supported here

Fix: the statement contained a whitespace character Hive's parser does not accept (typically a full-width space picked up by copy-paste). Retyping the CREATE TABLE statement and loading the data again resolved it.

grouping sets test:

select nvl(school, '=school=') as school, nvl(grade, '==grade==') as grade, count(1) as cnt
from demo5
group by school, grade grouping sets((school, grade), school, grade);

Result:
=school=        grade1  3
=school=        grade2  3
=school=        grade3  3
=school=        grade4  3
=school=        grade5  3
=school=        grade6  3
school  ==grade==       6
school  grade1  1
school  grade2  1
school  grade3  1
school  grade4  1
school  grade5  1
school  grade6  1
school2 ==grade==       6
school2 grade1  1
school2 grade2  1
school2 grade3  1
school2 grade4  1
school2 grade5  1
school2 grade6  1
school3 ==grade==       6
school3 grade1  1
school3 grade2  1
school3 grade3  1
school3 grade4  1
school3 grade5  1
school3 grade6  1


Rollup test:

select nvl(school, '=school=') as school, nvl(grade, '==grade==') as grade, count(1) as cnt
from demo5
group by school, grade with rollup;

Result:
=school= ==grade==       18
school  ==grade==       6
school  grade1  1
school  grade2  1
school  grade3  1
school  grade4  1
school  grade5  1
school  grade6  1
school2 ==grade==       6
school2 grade1  1
school2 grade2  1
school2 grade3  1
school2 grade4  1
school2 grade5  1
school2 grade6  1
school3 ==grade==       6
school3 grade1  1
school3 grade2  1
school3 grade3  1
school3 grade4  1
school3 grade5  1
school3 grade6  1
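For comparison (not run in the original), with cube would produce all four grouping combinations in one pass: detail rows, per-school subtotals, per-grade subtotals, and the grand total:

select nvl(school, '=school=') as school, nvl(grade, '==grade==') as grade, count(1) as cnt
from demo5
group by school, grade with cube;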

