Hadoop环境中使用Hive实现对产品和用户的数据分析

3 篇文章 3 订阅
2 篇文章 0 订阅

一、背景
某电商平台为了合理的投入人力物力创造更大的销售利润,现对已有的销售数据进行用户分析,提出合理的促销计划。

围绕产品和用户两大方面展开为电商平台制定策略提供分析及建议。

二、需求
用户分析:从性别、年龄、 职业、城市、居住年限,婚姻状况等维度找到高质量用户,并查看高质量用户人群的占比,为其提供高价值消费品 (定位高价值消费品以销售金额评估)。针对其他的用户,主要引导用户进行购买,多推荐一些热销的商品(定位热销产品)

产品分析:从销量、销售额都高的产品并以二八法则找到高贡献的一级产品类目

三、数据介绍
假定每条记录为一单

在这里插入图片描述

--建表
create table model2_datas
(User_ID int,
Product_ID string,
Gender string,
Age string,
Occupation int,
City_Category string,
Stay_In_Current_City_Years string,
Marital_Status int,
Product_Category_1 int,
Product_Category_2 int,
Product_Category_3 int,
Purchase double
)
row format delimited fields terminated by ',' --指定分隔符csv 为逗号分割
tblproperties(
"skip.header.line.count"="1" --跳过⽂件⾏⾸1⾏
);
--装载数据
load data local inpath '/home/hadoop/datas/model2_datas.csv' overwrite into
table model2_datas

四、 使用方法

本案例是在Hadoop分布式集群环境中完成,主要用到hive基础函数和开窗函数。
Hadoop一般有三个重要部分组成,分别是分布式文件系统(HDFS)、分布式计算(MapReduce)、数据仓库工具(Hive)。

  • HDFS(Hadoop Distributed Fill System) 是Hadoop 项目的子项目,使用多台计算机存储文件,并且提供统一的访问接口(NameNode),像是访问一个普通文件系统一样使用分布式文件系统。
    在这里插入图片描述
  • MapReduce 是一种分布式并行编程框架
    MapReduce 策略: 分而治之
    MapReduce 理念: 计算向数据靠拢而不是数据向计算靠拢
    Mapreduce工作流程
    在这里插入图片描述
  • Hive是Facebook为了解决海量日志数据的统计分析而开发的基于Hadoop的一个数据仓库工具(后来开源给了Apache软件基金会),可以将结构化的数据文件映射为一张数据库表,并提供类SQL查询功能:HQL。
    Hive特点
    Hive 本身并不支持数据存储和处理,只是一个面向用户的编程接口
    Hive 依赖分布式文件系统HDFS存储数据
    Hive 依赖分布式并行计算模型MapReduce 处理数据
    Hive架构
    在这里插入图片描述
    1.用户接口:Client CLI(hive shell)、JDBC/ODBC(java 访问 hive)、WEBUI(浏览器访问 hive)
    2.元数据:Metastore 元数据包括:表名、表所属的数据库(默认是 default)、表的拥有者、列/分区字段、表的类型(是否是外部表)、表的数据所在目录等;
    3.Hadoop 使用 HDFS 进行存储,使用 MapReduce 进行计算。
    4.驱动器:Driver
    (1)解析器(SQL Parser):将 SQL 字符串转换成抽象语法树 AST,这一步一般都用第三方工具库完成,比如 antlr;对 AST 进行语法分析,比如表是否存在、字段是否存在、SQL语义是否有误。
    (2)编译器(Physical Plan):将 AST 编译生成逻辑执行计划。
    (3)优化器(Query Optimizer):对逻辑执行计划进行优化。
    (4)执行器(Execution):把逻辑执行计划转换成可以运行的物理计划。对于 Hive 来说,就是MR/Spark。
    在这里插入图片描述
    Hive 通过给用户提供的一系列交互接口,接收到用户的指令(SQL),使用自己的 Driver,结合元数据(MetaStore),将这些指令翻译成 MapReduce,提交到 Hadoop 中执行,最后,将执行返回的结果输出到用户交互接口。
    Hive 的数据模型主要有以下四种:
    在这里插入图片描述

五、 需求实现

  • 查询订单整体的消费情况(包括:总销售额、人均消费、平均每单消费)
select
	round(sum(purchase),2) totle_sales,
	round(sum(purchase)/count(distinct user_id),2) per_cost,
	round(sum(purchase)/count(*),2) avg_unit_cost
from
	model2_datas;

==>
totle_sales per_cost avg_unit_cost
5.017668378E9 851751.55 9333.86

  • 用户分析(找到高质量人群)

统计各性别消费情况(字段包含性别、人数、人数占比、人均消费、消费金额、消费占比) 并以消费占比降序

select
	*,
	concat(round((num_gender/sum(num_gender) over())*100,2),"%") as num_rate,
	concat(round((cost_gender/sum(cost_gender) over())*100,2),"%") as cost_rate
from
	(
		select
			gender,
			count(distinct user_id) num_gender,
			round(sum(purchase)/count(distinct user_id),2) avg_cost_gender,
			round(sum(purchase),2) cost_gender
		from
			model2_datas
		group by
			gender
	) a
order by
	cost_rate desc;

==>:
a.gendera.num_gender a.avg_cost_gender a.cost_gender num_ratecost_rate
M 4225 911963.16 3.853044357E9 71.72% 76.79%
F 1666 699054.03 1.164624021E9 28.28% 23.21%

统计各年龄段消费情况(字段包含年龄段 、 人数、 人数占比、消费金额、 人均消费、消费占比)并以消费占比降序

select
	*,
	concat(round((num_age/sum(num_age) over())*100,2),"%") as num_rate,
	concat(round((cost_age/sum(cost_age) over())*100,2),"%") as cost_rate
from
(
	select
		age,
		count(distinct user_id) num_age,
		round(sum(purchase)/count(distinct user_id),2) avg_cost_age,
		round(sum(purchase),2) cost_age
	from
		model2_datas
	group by
		age
) a
order by
	cost_rate desc;

==>:
a.age a.num_age a.avg_cost_age a.cost_age num_rate cost_rate
46-50 531 778565.39 4.13418223E8 9.01% 8.24%
51-55 481 752408.22 3.61908356E8 8.16% 7.21%
26-35 2053 974061.91 1.999749106E9 34.85% 39.85%
55+ 372 531222.69 1.97614842E8 6.31% 3.94%
36-45 1167 866023.62 1.010649565E9 19.81% 20.14%
0-17 218 608527.55 1.32659006E8 3.7% 2.64%
18-25 1069 843469.86 9.0166928E818.15% 17.97%
统计各职业消费情况(字段包含职业 、人数、人数占比、消费金额、人均消费、消费占比) 并以消费占比降序

select
	*,
	concat(round((num_o/sum(num_o) over())*100,2),"%") as num_rate,
	concat(round((cost_o/sum(cost_o) over())*100,2),"%") as cost_rate
from
(
	select
		occupation,
		count(distinct user_id) num_o,
		round(sum(purchase)/count(distinct user_id),2) avg_cost_o,
		round(sum(purchase),2) cost_o
	from
		model2_datas
	group by
		occupation
) a
sort by
	cost_rate desc;

==>:
a.occupation a.num_oa.avg_cost_oa.cost_o num_rate cost_rate
1 517 801843.0 4.14552829E8 8.78% 8.26%
17 491 788676.89 3.87240355E8 8.33% 7.72%
12 376 799659.85 3.00672105E8 6.38% 5.99%
20 273 1070611.67 2.92276985E8 4.63% 5.82%
14 294 869369.88 2.55594745E8 4.99% 5.09%
16 235 997626.94 2.3444233E83.99% 4.67%
2 256 911232.0 2.33275393E8 4.35% 4.65%
6 228 811691.65 1.85065697E8 3.87% 3.69%
3 170 943696.76 1.6042845E82.89% 3.2%
15 140 832428.76 1.16540026E8 2.38% 2.32%
10 192 595176.84 1.14273954E8 3.26% 2.28%
5 111 1013741.94 1.12525355E8 1.88% 2.24%
11 128 823729.37 1.05437359E8 2.17% 2.1%
4 740 888554.59 6.57530393E8 12.56% 13.1%
0 688 909614.55 6.25814811E8 11.68% 12.47%
7 669 821050.44 5.49282744E8 11.36% 10.95%
19 71 1029795.62 7.3115489E71.21% 1.46%
13 140 508112.46 7.1135744E72.38% 1.42%
18 67 899249.34 6.0249706E71.14% 1.2%
9 88 609310.33 5.3619309E71.49% 1.07%
8 17 858505.82 1.4594599E70.29% 0.29%

统计各婚姻状况消费情况(字段包含婚姻状况 、人数、人数占比、消费金额、人均消费、消费占比) 并以消费占比降序

select
	*,
	concat(round((num_m/sum(num_m) over())*100,2),"%") as num_rate,
	concat(round((cost_m/sum(cost_m) over())*100,2),"%") as cost_rate
from
(
	select
		marital_status,
		count(distinct user_id) num_m,
		round(sum(purchase)/count(distinct user_id),2) avg_cost_m,
		round(sum(purchase),2) cost_m
	from
		model2_datas
	group by
		marital_status
) a
sort by
	cost_rate desc;

==>:
a.marital_status a.num_m a.avg_cost_m a.cost_m num_rate cost_rate
0 3417 868097.6 2.9662895E958.0% 59.12%
1 2474 829174.97 2.051378878E9 42.0% 40.88%

依据以上查询结果找到高质量人群:

性别为男, 年龄段为 26-35, 职业为 0,4,7, 婚姻状况为 0

  • 产品分析

查询出订单量TOP10的产品 (包含字段 排名编号、商品ID、订单量、订单量占比 )按订单量降序显示

select
	rank() over(order by sales desc) as rank,
	*,
	concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate
from
(
	select
		product_id,
		count(*) as sales
	from
		model2_datas
	group by
		product_id
) a
order by
	sales desc
limit 10;

==>:
ranka.product_id a.sales sales_rate
1 P00265242 1858 0.35%
2 P00110742 1591 0.3%
3 P00025442 1586 0.3%
4 P00112142 1539 0.29%
5 P00057642 1430 0.27%
6 P00184942 1424 0.26%
7 P00046742 1417 0.26%
8 P00058042 1396 0.26%9 P00145042 1384 0.26%
9 P00059442 1384 0.26%

查询出销售额TOP10的产品 (包含字段 排名编号、商品ID、销售额、销售额占比 )

select
	rank() over(order by cost desc) as rank,
	*,
	concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
from
(
	select
		product_id,
		sum(purchase) as cost
	from
		model2_datas
	group by
		product_id
) a
order by
	cost desc
limit 10;

==>:
ranka.product_id a.cost cost_rate
1 P00025442 2.7532426E70.55%
2 P00110742 2.6382569E70.53%
3 P00255842 2.4652442E70.49%
4 P00184942 2.4060871E70.48%
5 P00059442 2.3948299E70.48%
6 P00112142 2.3882624E70.48%
7 P00110942 2.3232538E70.46%
8 P00237542 2.3096487E70.46%
9 P00057642 2.249369E7 0.45%
10 P00010742 2.1865042E70.44%

统计各一级产品类目的订单量、销售额、订单量占比、销售额占比、累计销售额占比 (以销售额占比降序),并根据查询结果找到累计销售额达到20%的几个一级产品类目

select
	*,
	concat(round((Accu_cost/sum(cost) over())*100,2),"%") as Accu_cost_rate
from
(
	select
		*,
		concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate,
		concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate,
		sum(cost) over(order by cost desc rows between unbounded preceding and current row) Accu_cost
	from
		(
		select
			Product_Category_1,
			count(*) as sales,
			sum(purchase) as cost
		from
			model2_datas
		group by
			Product_Category_1
		) as a
) as b
group by
	Product_Category_1
order by
	cost_rate desc

==>:
a.product_category_1 a.sales a.cost sales_rate cost_rate accu_cost
6 20164 3.19355286E8 3.75% 6.36% 3.969632502E9
2 23499 2.64497242E8 4.37% 5.27% 4.234129744E9
1 138353 1.882666325E9 25.74% 37.52% 1.882666325E9
3 19849 2.00412211E8 3.69% 3.99% 4.434541955E9
16 9697 1.43168035E8 1.8% 2.85% 4.57770999E911 23960 1.12203088E8 4.46% 2.24% 4.689913078E9
5 148592 9.26917497E8 27.64% 18.47% 2.809583822E9
8 112132 8.40693394E8 20.86% 16.75% 3.650277216E9
10 5032 9.9029631E70.94% 1.97% 4.788942709E9
15 6203 9.1658147E71.15% 1.83% 4.880600856E9
7 3668 6.0059209E70.68% 1.2% 4.940660065E9
4 11567 2.6937957E72.15% 0.54% 4.967598022E9
14 1500 1.9718178E70.28% 0.39% 4.9873162E9
18 3075 9149071.0 0.57% 0.18% 4.996465271E9
9 404 6277472.0 0.08% 0.13% 5.002742743E9
17 567 5758702.0 0.11% 0.11% 5.008501445E9
12 3875 5235883.0 0.72% 0.1% 5.013737328E9
13 5440 3931050.0 1.01% 0.08% 5.017668378E9

  • 细化分析

查询出各性别销售额TOP10 产品 ( 字段包含商品ID、订单量、销售额、销售额占比、类别1、类别2、类别3)

select
	*
from
(
	select
		*,
		rank() over(partition by gender order by cost desc) as rank,
		concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
	from
		(
		select
			gender,
			product_id,
			Product_Category_1,
			Product_Category_2,
			Product_Category_3,
			count(*) sales,
			sum(purchase) as cost
		from model2_datas
		group by
			gender,
			product_id,
			Product_Category_1,
			Product_Category_2,
			Product_Category_3
		) a
)b
where rank<11
order by
	gender,
	cost desc;

==>:
b.genderb.product_id b.product_category_1 b.product_category_2
b.product_category_3b.sales b.cost b.rank b.cost_rate
F P00255842 16 NULL NULL 366 6690088.0 1 0.13%
F P00059442 6 8 16 350 6007826.0 2 0.12%
F P00110842 1 2 5 351 5933348.0 3 0.12%
F P00025442 1 2 9 341 5763524.0 4 0.11%
F P00110742 1 2 8 357 5632357.0 5 0.11%
F P00110942 1 2 NULL 305 5066142.0 6 0.1%
F P00148642 6 10 13 300 5049905.0 7 0.1%
F P00112142 1 2 14 332 4901047.0 8 0.1%
F P00028842 6 8 NULL 289 4867128.0 9 0.1%
F P00184942 1 8 17 293 4723224.0 10 0.09%
M P00025442 1 2 9 1245 2.1768902E71 0.43%
M P00110742 1 2 8 1234 2.0750212E72 0.41%
M P00184942 1 8 17 1131 1.9337647E73 0.39%
M P00112142 1 2 14 1207 1.8981577E74 0.38%
M P00057642 1 15 16 1174 1.872036E7 5 0.37%
M P00237542 1 15 16 1092 1.8562039E76 0.37%
M P00110942 1 2 NULL 1031 1.8166396E77 0.36%
M P00255842 16 NULL NULL 988 1.7962354E78 0.36%
M P00059442 6 8 16 1034 1.7940473E79 0.36%
M P00010742 1 8 17 1056 1.7517618E710 0.35%

查询高质量用户群年龄段的订单量TOP10产品( 字段包含商品ID、订单量、销售额、订单量占比、销售额占比、类别1、类别2、类别3)

select
	*
from
(
	select
		rank() over(order by sales desc) rank,
		*,
		concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate,
		concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
	from
		(
		select
			age,
			product_id,
			count(*) sales,
			sum(purchase) cost,
			Product_Category_1,
			Product_Category_2,
			Product_Category_3
		from
			model2_datas
		where
			age="26-35"
		group by
			age,
			product_id,
			Product_Category_1,
			Product_Category_2,
			Product_Category_3
		) a
	order by
	sales desc
) b
where
	rank < 11;

==>:
b.rank b.age b.product_id b.sales b.cost b.product_category_1
b.product_category_2 b.product_category_3 b.sales_rate b.cost_rate
10 26-35 P00255842 537 9860878.0 16 NULL NULL 0.250.49%
9 26-35 P00046742 544 7886516.0 1 2 15 0.250.39%
8 26-35 P00237542 568 9697110.0 1 15 16 0.260.48%
7 26-35 P00184942 570 9493975.0 1 8 17 0.270.47%
6 26-35 P00057642 581 9110947.0 1 15 16 0.270.46%
5 26-35 P00058042 584 5116374.0 8 16 NULL 0.270.26%
4 26-35 P00025442 595 1.0594786E71 2 9 0.280.53%
3 26-35 P00112142 597 9258356.0 1 2 14 0.280.46%
2 26-35 P00110742 627 1.0605442E71 2 8 0.290.53%
1 26-35 P00265242 732 5546869.0 5 8 NULL 0.340.28%

  • 0
    点赞
  • 21
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 18
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 18
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

大大大荣子哇

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值