Hive 小练习

最新推荐文章于 2024-09-03 16:35:30 发布

程序员学习圈

最新推荐文章于 2024-09-03 16:35:30 发布

阅读量4.7k

点赞数

分类专栏： # Hive

本文链接：https://blog.csdn.net/Luomingkui1109/article/details/82250638

版权

Hive 专栏收录该内容

20 篇文章 0 订阅

订阅专栏

1.日志的格式如下：

pin|-|request_tm|-url|-|sku_id|-|amount

分隔符为‘|-|’, 数据样例为:

张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|110.15
李四|-|q2014-11-23 11:59:30|-|www.bu.com|-|100023|-|210.15
王五|-|q2013-11-23 11:59:30|-|www.dd.com|-|100023|-|333.15
张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|220.15
张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|110.15
王五|-|q2013-11-23 11:59:30|-|www.dd.com|-|100023|-|322.15

假设本地数据文件为sample.txt,先将其导入到hive的test库的表t_sample中，并计算每个用户的总消费金额，写出详细过程包括表结构。

create external table t_sample
(pin string,
request_tm string,
url string,
sku_id string,
amount string)
row format delimited fields terminated by '|-|';

先清洗数据，改变其分割符变为‘\t’，存入本地文件jd.txt中(注意split()方法也是需要转义的)

create external table t_sample
(pin string,
request_tm string,
url string,
sku_id int,
amount double)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/sample.txt' into table t_sample;

select * from t_sample;
select pin,sum(amount) from t_sample group by pin;

2.订单详情表ord_det(order_id订单号，sku_id商品编号，sale_qtty销售数量，dt日期分区)，计算2016年1月1日商品销量的Top100，并按销量降级排序

建表语句：

123456    111111    100
234567    222222    200
345678    333333    300
456789    444444    400
567890    555555    500

create table ord_det(order_id string,sku_id string,sale_qtty int)
partitioned by (dt string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/ord_det.txt' into table ord_det partition(dt='20160101');

select sku_id,sum(sale_qtty) sale_count from ord_det where dt="20160101" group by sku_id order by sale_count desc limit 100;

3.场景题：北京市(数据量很大)学生成绩分析.

成绩的数据格式:时间,学校,年级,姓名,科目,成绩样例数据如下:

2013,北大,1,裘容絮,语文,97
2013,北大,1,庆眠拔,语文,52
2013,北大,1,乌洒筹,语文,85
2012,清华,0,钦尧,英语,61
2015,北理工,3,冼殿,物理,81
2016,北科,4,况飘索,化学,92
2014,北航,2,孔须,数学,70
2012,清华,0,王脊,英语,59
2014,北航,2,方部盾,数学,49
2018,北航,2,东门雹,数学,77
2018,北大,1,裘容絮,语文,97
2018,北大,1,庆眠拔,语文,52
2013,北大,1,乌洒筹,语文,85
2017,清华,0,钦尧,英语,61
2015,北理工,3,冼殿,物理,81
2017,北科,4,况飘索,化学,92
2014,北航,2,孔须,数学,70
2018,清华,0,王脊,英语,59
2014,北航,2,方部盾,数学,49
2014,北航,2,东门雹,数学,77
... ...

问题:（1）如何设计存储这些数据的表，写出建表语句：

create table score_ori
(year int,
school string,
class string,
name string,
subject string,
score double)
row format delimited fields terminated by ",";

导入数据：

load data local inpath "/opt/module/datas/score.txt" into table score_ori;

开启动态分区：

set hive.exec.dynamic.partition=true;

创建分区语句：

create table score_partition
(school string,
class string,
name string,
subject string,
score double)
partitioned by (year string)
row format delimited fields terminated by "\t";

查询导入数据：

insert into table score_partition partition (year) select school,class,name,subject,score,year from score_ori;

（2）选出今年每个学校,每个年级,分数前三的科目.

select *
from(
select
school,
class,
subject,
score,
row_number() over(partition by school,class,subject order by score desc) rank_code
from score_partition
where year="2018"
) t
where t.rank_code <= 3;

（3）今年清华 1年级总成绩大于200分的学生以及学生数.

select 
school,class,name,
sum(score) as total_score,
count(1)  over (partition by school,class) nct
from 
score_partition
where 
year="2018" and school="清华" and class=1
group by 
school,class,name
having 
total_score>100;

4、有一张很大的表：TRLOG,数据如下：

PLATFORM        USER_ID            CLICK_TIME                    CLICK_URL
WEB                12332321        2013-03-21 13:48:31.324        /home/
WEB                12332321        2013-03-21 13:48:32.954        /selectcat/er/
WEB                12332321        2013-03-21 13:48:46.365        /er/viewad/12.html
WEB                12332321        2013-03-21 13:48:53.651        /er/viewad/13.html

创建原始表：

CREATE TABLE trlog
(platform string,
user_id int,
click_time string,
click_url string)
row format delimited fields terminated by "\t";

导入数据：

load data local inpath "/opt/module/datas/log.txt" into table trlog;

CREATE TABLE allog
(platform string,
user_id int,
seq int,
from_url string,
to_url string)
row format delimited fields terminated by "\t";

查询导入数据：

insert into table allog
select 
platform,
user_id,
row_number() over(partition by user_id order by click_time) seq,
lag(click_url,1) over(partition by user_id order by click_time) as from_url,
click_url as to_url
from
trlog;

结果展示：

select * from allog;
+-----------+-----------+------+---------------------+---------------------+--+
| platform  |  user_id  | seq  |      from_url       |       to_url        |
+-----------+-----------+------+---------------------+---------------------+--+
| WEB       | 12332321  | 1    | NULL                | /selectcat/er/      |
| WEB       | 12332321  | 2    | /home/              | /er/viewad/12.html  |
| WEB       | 12332321  | 3    | /selectcat/er/      | /er/viewad/13.html  |
| WEB       | 12332321  | 4    | /er/viewad/12.html  | Exit!               |
+-----------+-----------+------+---------------------+---------------------+--+