【Spark实训】-- Spark SQL结构化数据文件处理 ②

⚆Pearl

已于 2022-05-19 14:20:22 修改

阅读量1.6k

点赞数 2

分类专栏： spark # 后端实训题文章标签： spark sql hive

于 2022-05-19 14:16:55 首次发布

本文链接：https://blog.csdn.net/weixin_58330979/article/details/124805792

版权

spark 同时被 2 个专栏收录

10 篇文章 2 订阅

订阅专栏

后端实训题

9 篇文章 3 订阅

订阅专栏

题目：统计分析航空公司客户数据的空值以及异常值。

题目：统计分析航空公司客户数据的空值以及异常值。

1、训练要点

（1）、Spark SQL CLI的配置。

（2）、在Spark SQL CLI中操作Hive表。

2、需求说明

某航空公司积累了大量的会员档案和其乘坐航班记录。抽取2012-04-01~2014-03-31内所有乘客的详细数据，总共62988条记录。其中包含了会员卡号、入会时间、性别等属性。

原始数据中存在票价（sum_yr_1）为空值的情况。票价为空值可能是客户不存在乘机记录造成的。同时原始数据中存在票价（sum_yr_1）最小值为0，折扣率（avg_discount）最小值为0，总飞行千米数（seg_km_sum）大于0的数据，这些数据可能是客户乘坐0折机票或积分兑换造成的。现要求统计sum_yr_1、seg_km_sum、avg_discount这3个字段的空值记录，保存到null_count表，以及统计sum_yr_1、seg_km_sum、avg_discount这3个字段的最小值并保存到min_count表。

3、实现思路及步骤

（1）、检查是否已配置Spark SQL CLI，若没有配置Spark SQL CLI环境，则先配置Spark SQL CLI。

（2）、启动spark-sql。

（3）、创建数据库air，并在air数据库下新建表air_data_base，建表语句可参考代码。

（4）、导入数据到air_data_base表。

（5）、使用hiveContext.sql()方式统计SUM_YR_1、SEG_KM_SUM、AVG_DISCOUNT这3个字段的空值记录数，保存到null_count表中。

（6）、使用hiveContext.sql()方式统计SUM_YR_1、SEG_KM_SUM、AVG_DISCOUNT这3个字段的最小值，保存到null_count表中。

4、具体实现过程代码与截图：

一、启动Hive的metastore服务，使用Hive CLI客户端登录hive

1.通过jps命令查看 RunJar进程存在

[root@node1 hive237]# jps

如果不存在，启动metastore，开启RunJar命令:

[root@node1 bin]# nohup /myserver/hive237/bin/hive --service metastore  &

2.Hive CLI客户端登录hive（node1结点）

[root@node1 bin]# /myserver/hive237/bin/hive

二、在hive上建数据库和表

1.建库air

hive> create database air;
OK
Time taken: 9.871 seconds
hive> show databases;
OK
air
default
law
Time taken: 0.223 seconds, Fetched: 3 row(s)

2.在数据库air里建表

use air;
create table air_data_base(
member_no string,
ffp_date string,
first_flight_date string,
gender string,
ffp_tier int,
work_city string,
work_province string,
work_country string,
age int,
load_time string,
flight_count int,
bp_sum bigint,
ep_sum_yr_1 int,
ep_sum_yr_2 bigint,
sum_yr_1 bigint,
sum_yr_2 bigint,
seg_km_sum bigint,
weighted_seg_km double,
last_flight_date string,
avg_flight_count double,
avg_bp_sum double,
begin_to_first int,
last_to_end int,
avg_interval float,
max_interval int,
add_points_sum_yr_1 bigint,
add_points_sum_yr_2 bigint,
exchange_count int,
avg_discount float,
p1y_flight_count int,
l1y_flight_count int,
p1y_bp_sum bigint,
l1y_bp_sum bigint,
ep_sum bigint,
add_point_sum bigint,
eli_add_point_sum bigint,
l1y_eli_add_points bigint,
points_sum bigint,
l1y_points_sum float,
ration_l1y_flight_count float,
ration_p1y_flight_count float,
ration_p1y_bps float,
ration_l1y_bps float,
point_notflight int
)row format delimited fields terminated by ','
STORED AS TEXTFILE;

show tables;

3. 加载数据到air_data_base表

（1）把air_data_base.txt上传到node1的根目录，并上传到hdfs上/user/root/目录下(在新打开的node1，ssh连接上操作)

[root@node1 ~]# hdfs dfs -put air_data_base.txt /user/root/

（2）加载数据

load data inpath '/user/root/air_data_base.txt' overwrite into table air_data_base;

hive> select count(*) from air_data_base;

三、数据分析

1.统计观测窗口的票价收入（SUM_YR_1）、观测窗口的总飞行公里数（SEG_KM_SUM）和平均折扣率（AVG_DISCOUNT）三个字段的空值记录，并将结果保存到名为null_count的表中。

hive> create table null_count as
select * from 
(select count(*) as sum_yr_1_null_count from air_data_base where sum_yr_1 is null) 
sum_yr_1_null,
(select count(*) as seg_km_sum_null_count from air_data_base where seg_km_sum is null) 
seg_km_sum_null,
(select count(*) as avg_discount_null_count from air_data_base where avg_discount is null) 
avg_discount_null ;

hive> select * from null_count;
592     1       1
Time taken: 0.45 seconds, Fetched 1 row(s)

2. 统计air_data_table表中观测窗口的SUM_YR_1（票价收入）、SEG_KM_SUM（总飞行公里数）和AVG_DISCOUNT（平均折扣率）三列的最小值min_count表中。

hive> create table min_count as
         select min(sum_yr_1) as sum_yr_1_min,
         min(seg_km_sum) as seg_km_sum_min,
         min(avg_discount) as avg_discount_min from air_data_base ;

hive> select * from min_count;
0       368     0.0
Time taken: 0.367 seconds, Fetched 1 row(s)

⚆Pearl

关注

2
点赞
踩
11

收藏

觉得还不错? 一键收藏
打赏
0
评论
【Spark实训】-- Spark SQL结构化数据文件处理 ②

题目：统计分析航空公司客户数据的空值以及异常值。1.统计观测窗口的票价收入（SUM_YR_1）、观测窗口的总飞行公里数（SEG_KM_SUM）和平均折扣率（AVG_DISCOUNT）三个字段的空值记录，并将结果保存到名为null_count的表中。2. 统计air_data_table表中观测窗口的SUM_YR_1（票价收入）、SEG_KM_SUM（总飞行公里数）和AVG_DISCOUNT（平均折扣率）三列的最小值min_count表中。
复制链接

扫一扫