Hive 小练习

1.日志的格式如下:

    pin|-|request_tm|-url|-|sku_id|-|amount

    分隔符为‘|-|’,    数据样例为:

张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|110.15
李四|-|q2014-11-23 11:59:30|-|www.bu.com|-|100023|-|210.15
王五|-|q2013-11-23 11:59:30|-|www.dd.com|-|100023|-|333.15
张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|220.15
张三|-|q2013-11-23 11:59:30|-|www.jd.com|-|100023|-|110.15
王五|-|q2013-11-23 11:59:30|-|www.dd.com|-|100023|-|322.15

    假设本地数据文件为sample.txt,先将其导入到hive的test库的表t_sample中,并计算每个用户的总消费金额,写出详细过程包括表结构。

create external table t_sample
(pin string,
request_tm string,
url string,
sku_id string,
amount string)
row format delimited fields terminated by '|-|';

    先清洗数据,改变其分割符变为‘\t’,存入本地文件jd.txt中(注意split()方法也是需要转义的)

create external table t_sample
(pin string,
request_tm string,
url string,
sku_id int,
amount double)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/sample.txt' into table t_sample;

select * from t_sample;
select pin,sum(amount) from t_sample group by pin;

 

2.订单详情表ord_det(order_id订单号,sku_id商品编号,sale_qtty销售数量,dt日期分区),计算2016年1月1日商品销量的Top100,并按销量降级排序

建表语句:

123456    111111    100
234567    222222    200
345678    333333    300
456789    444444    400
567890    555555    500
create table ord_det(order_id string,sku_id string,sale_qtty int)
partitioned by (dt string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/ord_det.txt' into table ord_det partition(dt='20160101');

select sku_id,sum(sale_qtty) sale_count from ord_det where dt="20160101" group by sku_id order by sale_count desc limit 100;

3.场景题:北京市(数据量很大)学生成绩分析.

成绩的数据格式:时间,学校,年级,姓名,科目,成绩样例数据如下:

2013,北大,1,裘容絮,语文,97
2013,北大,1,庆眠拔,语文,52
2013,北大,1,乌洒筹,语文,85
2012,清华,0,钦尧,英语,61
2015,北理工,3,冼殿,物理,81
2016,北科,4,况飘索,化学,92
2014,北航,2,孔须,数学,70
2012,清华,0,王脊,英语,59
2014,北航,2,方部盾,数学,49
2018,北航,2,东门雹,数学,77
2018,北大,1,裘容絮,语文,97
2018,北大,1,庆眠拔,语文,52
2013,北大,1,乌洒筹,语文,85
2017,清华,0,钦尧,英语,61
2015,北理工,3,冼殿,物理,81
2017,北科,4,况飘索,化学,92
2014,北航,2,孔须,数学,70
2018,清华,0,王脊,英语,59
2014,北航,2,方部盾,数学,49
2014,北航,2,东门雹,数学,77
... ...

问题:(1)如何设计存储这些数据的表,写出建表语句:

create table score_ori
(year int,
school string,
class string,
name string,
subject string,
score double)
row format delimited fields terminated by ",";

    导入数据:

load data local inpath "/opt/module/datas/score.txt" into table score_ori;

    开启动态分区:

set hive.exec.dynamic.partition=true;

    创建分区语句:

create table score_partition
(school string,
class string,
name string,
subject string,
score double)
partitioned by (year string)
row format delimited fields terminated by "\t";

    查询导入数据:

insert into table score_partition partition (year) select school,class,name,subject,score,year from score_ori;

(2)选出今年每个学校,每个年级,分数前三的科目.

select *
from(
select
school,
class,
subject,
score,
row_number() over(partition by school,class,subject order by score desc) rank_code
from score_partition
where year="2018"
) t
where t.rank_code <= 3;


(3)今年 清华 1年级 总成绩大于200分的学生 以及学生数.

select 
school,class,name,
sum(score) as total_score,
count(1)  over (partition by school,class) nct
from 
score_partition
where 
year="2018" and school="清华" and class=1
group by 
school,class,name
having 
total_score>100;

 

4、有一张很大的表:TRLOG,数据如下:

PLATFORM        USER_ID            CLICK_TIME                    CLICK_URL
WEB                12332321        2013-03-21 13:48:31.324        /home/
WEB                12332321        2013-03-21 13:48:32.954        /selectcat/er/
WEB                12332321        2013-03-21 13:48:46.365        /er/viewad/12.html
WEB                12332321        2013-03-21 13:48:53.651        /er/viewad/13.html

    创建原始表:

CREATE TABLE trlog
(platform string,
user_id int,
click_time string,
click_url string)
row format delimited fields terminated by "\t";

    导入数据:

load data local inpath "/opt/module/datas/log.txt" into table trlog;

CREATE TABLE allog
(platform string,
user_id int,
seq int,
from_url string,
to_url string)
row format delimited fields terminated by "\t";

    查询导入数据:

insert into table allog
select 
platform,
user_id,
row_number() over(partition by user_id order by click_time) seq,
lag(click_url,1) over(partition by user_id order by click_time) as from_url,
click_url as to_url
from
trlog;

    结果展示:

select * from allog;
+-----------+-----------+------+---------------------+---------------------+--+
| platform  |  user_id  | seq  |      from_url       |       to_url        |
+-----------+-----------+------+---------------------+---------------------+--+
| WEB       | 12332321  | 1    | NULL                | /selectcat/er/      |
| WEB       | 12332321  | 2    | /home/              | /er/viewad/12.html  |
| WEB       | 12332321  | 3    | /selectcat/er/      | /er/viewad/13.html  |
| WEB       | 12332321  | 4    | /er/viewad/12.html  | Exit!               |
+-----------+-----------+------+---------------------+---------------------+--+

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

程序员学习圈

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值