SQL专项练习第五天-CSDN博客

本文链接：https://blog.csdn.net/weixin_64726356/article/details/142423773

Hive 在处理各种复杂数据问题时的灵活性和强大功能。计算同时在线人数、处理车辆数据、解决促销日期交叉问题、删除重复数据还是整理数据结构，Hive 都能提供有效的解决方案。本文将介绍五个 Hive 数据处理问题的解决方案，并通过实际案例进行演示。

先在home文件夹下建一个hivedata文件夹，把我们所需的数据写成txt文件导入到/home/hivedata/文件夹下面。

一、同时在线问题

问题描述：给定一个直播平台主播开播及关播时间的表，要求计算出平台最高峰同时在线的主播人数。

数据：

id          stt                 edt
1001,2021-06-14 12:12:12,2021-06-14 18:12:12
1003,2021-06-14 13:12:12,2021-06-14 16:12:12
1004,2021-06-14 13:15:12,2021-06-14 20:12:12
1002,2021-06-14 15:12:12,2021-06-14 16:12:12
1005,2021-06-14 15:18:12,2021-06-14 20:12:12
1001,2021-06-14 20:12:12,2021-06-14 23:12:12
1006,2021-06-14 21:12:12,2021-06-14 23:15:12
1007,2021-06-14 22:12:12,2021-06-14 23:10:12

建表：

-- 建表
create table t16(
    id string,
    stt string,
    edt string
)row format delimited
fields terminated by ','
tblproperties("skip.header.line.count"="1");
-- 导入数据
load data local inpath '/home/hivedata/t16.txt' into table t16;
、

解决方案：

-- 这个题目的思路是：每一个主播同时都跟谁同时在线。
-- 查询sql
with t as (
    select t1.id,t1.stt,t1.edt,count(distinct t2.id) onlineNum from t16 t1 left join t16 t2
    on t1.stt <= t2.edt
            and  t1.edt >= t2.stt group by t1.id,t1.stt,t1.edt
)
select max(onlineNum) from t;

二、车辆每日最后一条数据的获取方法及区别

问题描述：车辆每天上传 N 条数据，要求获得每日最后一条数据，用至少三种方式实现，并说明三种方式的区别。

数据

vin                pkgts             value
1111111111111111,2022-06-01 16:54:57,60.0
1111111111111111,2022-06-01 07:40:27,6.0
2222222222222222,2022-06-01 19:40:27,7.0
1111111111111111,2022-06-26 12:16:45,64.0
2222222222222222,2022-06-05 23:00:42,18.0
2222222222222222,2022-06-05 09:10:17,44.0
2222222222222222,2022-06-05 13:00:25,19.0
2222222222222222,2022-06-13 16:14:04,43.0
2222222222222222,2022-06-13 07:21:08,40.0

建表

-- 建表
create table t17(
    vin string,
    pkgts string,
    value string
)row format delimited
fields terminated by ','
tblproperties("skip.header.line.count"="1"); -- 跳过第一行
-- 导入数据
load data local inpath '/home/hivedata/t17.txt' into table t17;

方法一：使用窗口函数`row_number()`

代码如下：

select vin, pkgts as last_pkgts, value as last_value
from (
    select *, row_number() over (partition by vin, date (pkgts) order by pkgts desc ) as rn
    from t17
) ranked where rn = 1;

方法二：使用表连接

代码如下：

select a.vin, a.pkgts as last_pkgts, a.value as last_value from t17 a
left join t17 b on a.vin = b.vin and a.pkgts < b.pkgts and date (a.pkgts) = date (b.pkgts)
where b.pkgts is null;

方法三：使用聚合函数和子查询

代码如下：

select vin, max(pkgts) as last_pkgts, max(value) as last_value from t17
group by vin, date(pkgts);

三种方法的区别

窗口函数:
简洁，易于理解和维护。

自连接：
比较复杂，数据多的话不容易理解和操作，效率较低，因为需要对每条记录进行比较。需要确保连接条件正确，避免错误。

聚合函数：
简单且易于实现。数据量大时效率低，需要对每个组进行聚合。需要确保字段的唯一性

三、打折日期交叉问题

问题描述：给定平台商品促销数据，包含品牌、打折开始日期和结束日期，要求计算每个品牌总的打折销售天数，注意交叉日期只统计一次。

解决方案：

首先使用自连接，找出每个品牌不同促销活动之间的交叉日期范围。
对于每个品牌，将所有促销活动的日期范围合并起来，去除重复的日期。
计算每个品牌合并后的日期范围的天数，即为总的打折销售天数。

数据：

brand stt        edt
oppo,2021-06-05,2021-06-09
oppo,2021-06-11,2021-06-21
vivo,2021-06-05,2021-06-15
vivo,2021-06-09,2021-06-21
redmi,2021-06-05,2021-06-21
redmi,2021-06-09,2021-06-15
redmi,2021-06-17,2021-06-26
huawei,2021-06-05,2021-06-26
huawei,2021-06-09,2021-06-15
huawei,2021-06-17,2021-06-21

建表：

-- 建表
create table shop_discount(
    brand string,
    stt string,
    edt string
)row format delimited
fields terminated by ','
tblproperties("skip.header.line.count"="1");
-- 导入数据
load data local inpath '/home/hivedata/shop_discount.txt' into table shop_discount;

代码如下：

1）使用开窗进行连续区间划分及合并

-- 使用开窗进行连续区间划分及合并
select brand,
    sum(days) as promotion_day_count
from (
	select  brand, datediff(max(edt), min(stt)) + 1 as days from (
		select brand, stt, edt,
		    sum(is_new_start) over (partition by brand order by  stt) as interval_id
		from (
			select brand, stt,  edt,
			    if(stt > coalesce(lag_end_date,'1970-01-01'),1,0) as is_new_start
			from (
				select brand, stt, edt,
				    max(edt) over (partition by brand order by  stt
		                rows between unbounded preceding and 1 preceding) as lag_end_date
				from shop_discount
			) t1
		) t2
	) t3 group by  brand, interval_id
) t4 group by  brand;

2）使用开窗求出没有活动的天数

select  brand,
    datediff(max(edt), min(stt)) - sum(no_promo_days) + 1 as promotion_day_count
from (
    select brand, stt, edt,
        if(stt > lag_end_date, datediff(stt, lag_end_date) - 1, 0) as no_promo_days
	from (
		select brand, stt, edt,
		    max(edt) over (partition by brand order by  stt
		        rows between unbounded preceding and 1 preceding) as lag_end_date
		from shop_discount
	) t1
) t2 group by brand;

3）使用开窗去除区间之间重复的部分

select brand, sum(datediff(edt,start_date)+1) promotion_day_count
from (
	select brand, max_end_date,
	    if(max_end_date is null or stt > max_end_date, stt, date_add(max_end_date,1)) start_date, edt
	from (
		select brand, stt, edt,
		    max(edt) over (partition by brand order by  stt
		        rows between unbounded preceding and 1 preceding) as max_end_date
		from shop_discount
	) t1
)t2 where edt > start_date group by brand;

4）使用UDTF生成所有活动日期然后去重

select brand,
    count(distinct promo_date) as promotion_day_count
from (
	select brand, date_add(stt, pos) as promo_date from shop_discount
    lateral view posexplode (split(space(datediff(edt, stt)), '') ) tmp as pos, element
	-- 炸开的实现方式有很多种
	-- LATERAL VIEW POSEXPLODE (split(repaet(',',DATEDIFF(end_date, start_date)), ',') ) tmp AS pos, element
) t1 group by  brand;

四、删除重复数据并保留 id 最大的记录

问题描述：有一个表包含 id 和 name 字段，name 列相同的为重复数据，要求删除重复数据，只保留一条 id 最大的记录。

解决方案：

使用窗口函数row_number()按照 name 分组并为每个分组内的记录分配一个序号。
筛选出序号为 1 的记录，即每个 name 组内 id 最大的记录。

建表：

CREATE TABLE `tab` (
  `id`    int,
  `name` string
) ;
  

INSERT INTO `tab` VALUES ('1', 'A')
,('2', 'A')
,('3', 'A')
,('4', 'B')
,('5', 'B')
,('6', 'C')
,('7', 'B')
,('8', 'B')
,('9', 'B')
,('10', 'E')
,('11', 'E')
,('12', 'E');

代码如下：

-- 重复数据只保留一条id最大的的
select t1.id, t1.name from tab t1
where t1.id = (
    select max(t2.id) from tab t2
    where t2.name = t1.name
);

五、整理数据结构

问题描述：给定一个包含平台、用户 ID、点击时间和点击 URL 的数据，要求整理成特定的数据结构，包括平台、用户 ID、序号、上一个 URL 和当前 URL。

现需将数据整理成如下数据结构

PLATFORM  USER_ID   SEQ     FROM_URL        TO_URL
WEB            12332321    1            NULL                /home/
WEB            12332321    2            /home/               /selectcat/er/
WEB            12332321    3            /selectcat/er/           /er/viewad/12.html
WEB            12332321    4            /er/viewad/12.html         /er/viewad/13.html
WEB            12332321    5            /er/viewad/13.html         /er/viewad/24.html
WEB            12332321    6            /er/viewad/24.html         /selectcat/che/
WEB            12332321    7            /selectcat/che/          /che/viewad/93.html
WEB            12332321    8            /che/viewad/93.html        /che/viewad/10.html
WEB            12332321    9            /che/viewad/10.html        /home/
WAP            32483923    1            NULL                /m/home/
WAP            32483923    2            /m/home/              /m/selectcat/fang/
WAP            32483923    3            /m/selectcat/fang/         /m/fang/33.html
WAP            32483923    4            /m/fang/33.html          /m/fang/54.html
WAP            32483923    5            /m/fang/54.html          /m/selectcat/er/
WAP            32483923    6            /m/selectcat/er/          /m/er/49.html

解决方案：

使用窗口函数row_number()为每个用户的点击记录分配一个序号。
使用窗口函数lag()获取每个用户上一个点击的 URL。
选择平台、用户 ID、序号、上一个 URL 和当前 URL 作为结果。

数据：

PLATFORM	USER_ID		CLICK_TIME					CLICK_URL
WEB	12332321	2013-03-21 13:48:31.324	/home/
WEB	12332321	2013-03-21 13:48:32.954	/selectcat/er/
WEB	12332321	2013-03-21 13:48:46.365	/er/viewad/12.html
WEB	12332321	2013-03-21 13:48:53.651	/er/viewad/13.html
WEB	12332321	2013-03-21 13:49:13.435	/er/viewad/24.html
WEB	12332321	2013-03-21 13:49:35.876	/selectcat/che/
WEB	12332321	2013-03-21 13:49:56.398	/che/viewad/93.html
WEB	12332321	2013-03-21 13:50:03.143	/che/viewad/10.html
WEB	12332321	2013-03-21 13:50:34.265	/home/
WAP	32483923	2013-03-21 23:58:41.123	/m/home/
WAP	32483923	2013-03-21 23:59:16.123	/m/selectcat/fang/
WAP	32483923	2013-03-21 23:59:45.123	/m/fang/33.html
WAP	32483923	2013-03-22 00:00:23.984	/m/fang/54.html
WAP	32483923	2013-03-22 00:00:54.043	/m/selectcat/er/
WAP	32483923	2013-03-22 00:01:16.576	/m/er/49.html

建表：

-- 建表：
create table alllog
(
    platform string,
    user_id int,
    visite_time string,
    from_url string
)
row format delimited
fields terminated by '\t'
tblproperties("skip.header.line.count"="1");
-- 导入数据
load data local inpath '/home/hivedata/alllog.txt'overwrite into table alllog;

代码如下：

方法一

-- 方法一
with t as (
    select platform, user_id, visite_time, from_url to_url,
        row_number() over (partition by user_id order by visite_time) as seq,
        lag(from_url, 1) over (partition by user_id order by visite_time) as from_url
    from alllog
) select platform, user_id, seq, from_url, to_url from t;

方法二

--方法二
select platform, user_id,
    row_number() over (partition by user_id order by visite_time) as seq,
    lag(from_url, 1) over (partition by user_id order by visite_time) as from_url,
    from_url to_url
from alllog;

通过以上五个问题的解决，展示了 Hive 在处理各种复杂数据问题时的灵活性和强大功能。无论是计算同时在线人数、处理车辆数据、解决促销日期交叉问题、删除重复数据还是整理数据结构，Hive 都能提供有效的解决方案。