A Simple Hive Tutorial - Data Analysis

weixin_30131105

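Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_30131105/article/details/114995952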

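Hive is a SQL execution engine that runs on top of HDFS: it translates SQL statements into MapReduce jobs on Hadoop. Because you write plain SQL, anyone who already knows SQL faces almost no extra learning cost when using Hive for data analysis. The trade-off is that Hive is batch-oriented, so queries can be slow. This article walks through a few cases to show the basics of using Hive.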

Sample data

** Randomly generate a batch of order data (order_id, seller_id, price, tag, order_date) **

from datetime import date, timedelta
from random import randint

# Print 1000 random orders, one per line, with fields separated by a single space.
for i in range(1000):
    order_id = 'order_%s' % i
    seller_id = 'seller_%s' % randint(0, 300)
    price = randint(0, 100000) / 100.0
    tag = randint(0, 1)  # 1 = payment succeeded, 0 = payment failed
    order_date = date.today() - timedelta(days=randint(0, 30))
    print(order_id, seller_id, price, tag, order_date)

** Store the data in Hive **

hive> create table test_order_sample
    (order_id string, seller_id string, price double, tag int, order_date string)
    row format delimited fields terminated by ' ';

hive> load data
    local inpath '/data/order_sample'
    into table test_order_sample;
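To make the mapping between a line of the data file and the table columns concrete, here is a minimal Python sketch. It only mirrors the column order and types declared in the create table statement above; the helper name parse_order_line is hypothetical and is not part of Hive.

```python
def parse_order_line(line):
    """Split one space-delimited line into the five columns of test_order_sample."""
    order_id, seller_id, price, tag, order_date = line.rstrip("\n").split(" ")
    return {
        "order_id": order_id,      # string
        "seller_id": seller_id,    # string
        "price": float(price),     # double
        "tag": int(tag),           # int
        "order_date": order_date,  # string, formatted as yyyy-MM-dd
    }


print(parse_order_line("order_0 seller_12 345.67 1 2023-07-10\n"))
```

Because the field delimiter is a single space, none of the generated values may contain a space themselves, which holds for the sample generator.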

Case 1

** For each day in the past week, compute the number of successfully paid orders, the GMV, and the average order value **

hive> select order_date,count(*),round(sum(price),2),round(avg(price),2)
    from test_order_sample
    where tag=1
    and order_date>=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7)
    group by order_date
    order by order_date desc;
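For readers who think more easily in procedural terms, the following Python sketch spells out what the Case 1 query computes: filter, group by day, aggregate, sort. It is an illustration only, not how Hive executes the query; the function name daily_paid_stats and the sample rows are made up for the example, and today is passed in explicitly so the result is reproducible.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_paid_stats(rows, today, days=7):
    """rows: (order_id, seller_id, price, tag, order_date) tuples,
    with order_date as a 'yyyy-MM-dd' string as in test_order_sample."""
    cutoff = (today - timedelta(days=days)).isoformat()
    prices_by_day = defaultdict(list)
    for order_id, seller_id, price, tag, order_date in rows:
        # where tag=1 and order_date >= date_sub(today, 7)
        if tag == 1 and order_date >= cutoff:
            prices_by_day[order_date].append(price)  # group by order_date
    # count(*), round(sum(price),2), round(avg(price),2), order by order_date desc
    return [
        (day, len(p), round(sum(p), 2), round(sum(p) / len(p), 2))
        for day, p in sorted(prices_by_day.items(), reverse=True)
    ]


rows = [
    ("order_0", "seller_1", 10.0, 1, "2023-07-10"),
    ("order_1", "seller_2", 30.0, 1, "2023-07-10"),
    ("order_2", "seller_1", 99.0, 0, "2023-07-10"),  # failed payment, filtered out
    ("order_3", "seller_3", 50.0, 1, "2023-07-01"),  # older than 7 days, filtered out
    ("order_4", "seller_3", 20.0, 1, "2023-07-08"),
]
print(daily_paid_stats(rows, today=date(2023, 7, 10)))
# [('2023-07-10', 2, 40.0, 20.0), ('2023-07-08', 1, 20.0, 20.0)]
```

The date filter works on plain strings because yyyy-MM-dd sorts lexicographically in the same order as the dates themselves.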

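Case 2

** For each day in the past week, compute the order count, GMV, and average order value separately for successful and failed payments **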

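hive> select order_date,
    sum(if(tag=0,1,0)),sum(if(tag=0,price,0)),avg(if(tag=0,price,null)),
    sum(if(tag=1,1,0)),sum(if(tag=1,price,0)),avg(if(tag=1,price,null))
    from test_order_sample
    where order_date>=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7)
    group by order_date
    order by order_date desc;

Both groups are computed in a single pass by combining aggregate functions (sum, avg) with an if condition, instead of joining two separate queries. The query does not filter on tag, since that would empty one of the two groups. avg uses null rather than 0 for non-matching rows, because avg skips nulls but would count the zeros and pull the average down.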
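When an aggregate is combined with an if condition, the value chosen for non-matching rows matters: 0 is harmless inside sum, but inside avg it is counted as a real value, whereas SQL's avg ignores null. The Python sketch below models this for one day's orders. The helper avg is a stand-in written for this example that imitates SQL null handling; it is not a Hive API.

```python
def avg(values):
    """Imitate SQL avg(): ignore None (null) inputs; return None if nothing is left."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


# (price, tag) pairs for a single order_date; tag 1 = paid, 0 = failed.
orders = [(100.0, 1), (300.0, 1), (50.0, 0), (150.0, 0)]

paid_count = sum(1 if tag == 1 else 0 for _, tag in orders)        # sum(if(tag=1,1,0))
paid_gmv = sum(price if tag == 1 else 0 for price, tag in orders)  # sum(if(tag=1,price,0))

# avg(if(tag=1,price,0)): the two failed orders contribute zeros to the average.
avg_with_zero = avg([price if tag == 1 else 0 for price, tag in orders])
# avg(if(tag=1,price,null)): the failed orders are skipped entirely.
avg_with_null = avg([price if tag == 1 else None for price, tag in orders])

print(paid_count, paid_gmv, avg_with_zero, avg_with_null)
# 2 400.0 100.0 200.0
```

Only the null variant gives the true average order value of the paid orders (400 / 2 = 200).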

Case 3

** Select the sellers whose GMV in the past week exceeds 1000 and who have more than 2 orders, together with their orders **

hive> select seller_id,collect_set(order_id)
    from test_order_sample
    where tag=1
    and order_date>=date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),7)
    group by seller_id
    having count(*)>2
    and sum(price)>1000;
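The difference between where and having in this query is that where filters individual rows before grouping, while having filters whole groups after aggregation. A rough Python sketch of the group-then-filter step follows; it assumes the rows have already passed the where clause, and the name sellers_over_threshold and the sample data are hypothetical.

```python
from collections import defaultdict


def sellers_over_threshold(rows, min_orders=2, min_gmv=1000):
    """rows: (seller_id, order_id, price) tuples that already passed the where clause."""
    order_ids = defaultdict(set)  # collect_set(order_id) per seller
    counts = defaultdict(int)     # count(*) per seller
    gmv = defaultdict(float)      # sum(price) per seller
    for seller_id, order_id, price in rows:
        order_ids[seller_id].add(order_id)
        counts[seller_id] += 1
        gmv[seller_id] += price
    # having count(*) > 2 and sum(price) > 1000
    return {
        seller: ids
        for seller, ids in order_ids.items()
        if counts[seller] > min_orders and gmv[seller] > min_gmv
    }


rows = [
    ("seller_1", "order_1", 500.0),
    ("seller_1", "order_2", 400.0),
    ("seller_1", "order_3", 200.0),  # 3 orders, GMV 1100: kept
    ("seller_2", "order_4", 100.0),
    ("seller_2", "order_5", 100.0),
    ("seller_2", "order_6", 100.0),  # 3 orders, GMV 300: dropped by the GMV condition
    ("seller_3", "order_7", 900.0),
    ("seller_3", "order_8", 900.0),  # GMV 1800 but only 2 orders: dropped by the count condition
]
print(sellers_over_threshold(rows))
```

Only seller_1 satisfies both conditions, so the result contains that seller with the set of its three order IDs.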

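Commonly used UDFs

Aggregate-related functions

collect_set(c_1)

After a group by, a select can normally return only the group keys and aggregate values. collect_set also lets you return any non-key column, gathered into a set with duplicates removed. For example, to see which buyers placed orders with each seller after grouping by seller ID: select seller_id, collect_set(buyer_id) from t group by seller_id;

collect_list(c_1)

Similar to collect_set, but duplicate elements are kept.

explode(c_1)

explode flattens array-typed data into rows. For example, if each row holds a collection of seller_ids, explode turns it into one seller_id per row. explode cannot be combined directly with group by, so the aggregation has to go into a subquery. For example, to pick sellers by some condition and then look at the buyers of those shops: select explode(b.buyer_ids) from (select collect_set(buyer_id) as buyer_ids from t group by seller_id) b;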
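The relationship between collect_set, collect_list, and explode can be modelled with plain Python collections. This is a sketch of the semantics only: the function names merely echo the Hive ones, the sample rows are invented, and the element order that Hive returns from collect_set is not something this sketch claims to reproduce.

```python
from collections import defaultdict


def collect_list(rows):
    """Group (seller_id, buyer_id) rows by seller, keeping duplicates."""
    grouped = defaultdict(list)
    for seller_id, buyer_id in rows:
        grouped[seller_id].append(buyer_id)
    return dict(grouped)


def collect_set(rows):
    """Group (seller_id, buyer_id) rows by seller, dropping duplicates."""
    grouped = defaultdict(set)
    for seller_id, buyer_id in rows:
        grouped[seller_id].add(buyer_id)
    return dict(grouped)


def explode(arrays):
    """Flatten a sequence of collections into one element per output row."""
    return [item for arr in arrays for item in arr]


rows = [("s1", "b1"), ("s1", "b1"), ("s1", "b2"), ("s2", "b3")]
print(collect_list(rows))  # {'s1': ['b1', 'b1', 'b2'], 's2': ['b3']}
print(collect_set(rows))
print(sorted(explode(collect_set(rows).values())))  # ['b1', 'b2', 'b3']
```

Here explode is applied to the output of the grouping step, which corresponds to placing the group by in a subquery and calling explode in the outer select.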

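Date and time functions

unix_timestamp()

Returns the current time as a Unix timestamp in seconds.

from_unixtime(timestamp, format)

Converts a Unix timestamp into a human-readable string in the given format, for example: select from_unixtime(unix_timestamp(), 'yyyy-MM-dd');

date_sub(string startdate, int days)

Returns the date that is the given number of days before startdate.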
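The three functions above are chained in every case query to build the "seven days ago" cutoff. The Python sketch below shows rough standard-library analogues of each step. The helper names are chosen to echo the Hive functions and are not Hive APIs, and from_unixtime_day assumes the local time zone of the machine running it.

```python
import time
from datetime import date, datetime, timedelta


def unix_timestamp():
    """Current time as whole seconds since the Unix epoch."""
    return int(time.time())


def from_unixtime_day(ts):
    """Rough analogue of from_unixtime(ts, 'yyyy-MM-dd'), using the local time zone."""
    return datetime.fromtimestamp(ts).strftime("%Y-%m-%d")


def date_sub(start_date, days):
    """Rough analogue of date_sub: start_date is a 'yyyy-MM-dd' string."""
    return (date.fromisoformat(start_date) - timedelta(days=days)).isoformat()


print(date_sub("2023-07-10", 7))  # 2023-07-03
print(date_sub("2023-03-01", 1))  # 2023-02-28

# The cutoff used in the case queries: the date seven days before today.
cutoff = date_sub(from_unixtime_day(unix_timestamp()), 7)
print(cutoff)
```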

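Others

nvl(v1, v2)

nvl handles null values. When a column is null, any arithmetic involving it also yields null. nvl supplies a default for a column that may be null: it returns v1 when v1 is not null, and v2 otherwise.

instr(str1, 'xxx')

Returns the 1-based position of the first occurrence of 'xxx' in str1, so it can be used to test whether str1 contains 'xxx'. It returns 0 if 'xxx' does not occur in str1, and null if either argument is null.

size(a1)

Returns the number of elements in the array a1.

union all

Merges the results of two queries. It is a query operator rather than a function, and both queries must return the same number of columns.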
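The null-handling rules of nvl and instr are easy to get wrong, so here is a small Python model of them, with None standing in for SQL null. It is a sketch of the documented behaviour, not Hive code. Note one difference: arithmetic on None raises a TypeError in Python, whereas in SQL it quietly yields null.

```python
def nvl(value, default):
    """Return default when value is None (null), otherwise value itself."""
    return default if value is None else value


def instr(s, substr):
    """1-based position of the first occurrence of substr in s.
    Returns 0 when substr is absent and None when either argument is None."""
    if s is None or substr is None:
        return None
    return s.find(substr) + 1  # str.find returns -1 when absent, which becomes 0


print(nvl(None, 0))            # 0
print(nvl(5, 0))               # 5
print(instr("seller_12", "_"))  # 7
print(instr("abc", "x"))       # 0
print(instr(None, "x"))        # None
```

A null-safe containment test therefore combines the two: nvl(instr(s, 'xxx'), 0) > 0.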
