Hive数据分析案例

最新推荐文章于 2025-04-15 15:35:08 发布

whvcse_hlzhang

最新推荐文章于 2025-04-15 15:35:08 发布

阅读量5.8k

点赞数 6

文章标签： hive big data

本文链接：https://blog.csdn.net/qq_35447918/article/details/120729025

版权

本文通过实际案例介绍了如何使用Hive进行数据清洗和汽车销售数据分析。首先，创建并使用cardb数据库，建立cars数据表，加载数据并进行数据清洗，通过去除缺失值和重复项得到cars_clean表。接着，分析了车辆使用性质、汽车销售比例、性别购车比例以及月销售量占比。此外，还展示了如何处理JSON格式的新浪微博历史数据，以及在电商推荐系统中使用Hive进行数据清洗和推荐算法的初步应用。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

第9章汽车销售数据分析

1 创建数据库cardb
hive>create database if not exists cardb;
hive>use cardb;

2 创建数据表car
hive>create external table cars(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars';

hive> desc cars;

province                string                                      
month                   int                                         
city                    string                                      
country                 string                                      
year                    int                                         
cartype                 string                                      
productor               string                                      
brand                   string                                      
mold                    string                                      
owner                   string                                      
nature                  string                                      
number                  int                                         
ftype                   string                                      
outv                    int                                         
power                   double                                      
fuel                    string                                      
length                  int                                         
width                   int                                         
height                  int                                         
xlength                 int                                         
xwidth                  int                                         
xheight                 int                                         
count                   int                                         
base                    int                                         
front                   int                                         
norm                    string                                      
tnumber                 int                                         
total                   int                                         
curb                    int                                         
hcurb                   int                                         
passenger               string                                      
zhcurb                  int                                         
business                string                                      
dtype                   string                                      
fmold                   string                                      
fbusiness               string                                      
name                    string                                      
age                     int                                         
sex                     string

3 加载数据文件到数据表cars
1）使用XShell的XFtp文件传输工具，将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\\汽车销售系统\data\car.txt数据文件上传到虚拟机Linux系统的/root目录
2) hive> load data local inpath '/root/car.txt' overwrite into table cars; 加载数据文件
3) hive> select * from cars limit 10; 查询前10条数据

下面开始进行数据分析：

4 统计cars表的数据行数
hive> select count(*) from cars;
OK
70640
说明：cars表总共有70640行数据

hive>select count(*) from cars where nature is not null and nature != '';
OK
70362
说明：car表中nature列值不为null且不为空字符串的数据行共有70362行，70640-70362=278行数据的nature列存在缺失值，即车辆使用性质未知

重要提醒：缺失值和重复列会影响数据分析的准确性，最好在数据分析前先通过数据清洗将有缺失值和重复列的数据行给剔除掉！
数据清洗过程：
1）建立一个表结构相同的新表cars_clean，用于存放数据清洗后的数据
hive>create external table cars_clean(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars_clean';

2）选取后面的数据分析会用到的重要列nature，number, province, year, month, sex, owner, cartype, mold, brand, ftype来剔除缺失值，再对所有列进行distinct去重，并逐行插入到新表cars_clean中
hive>insert into cars_clean select distinct * from cars where nature is not null and nature != '' and number is not null and province is not null and province != '' and year is not null and month is not null and sex is not null and sex != '' and owner is not null and owner !='' and cartype is not null and cartype != '' and mold is not null and mold != '' and brand is not null and brand != '' and ftype is not null and ftype != '';
说明：int类型的列只需判断不为null，string类型的列要判断不为null且不为空字符串；后面数据分析的SQL语句中就不用写类似where nature is not null and !=''的语句了

3）统计经过清洗后的数据表cars_clean的行数
hive> select count(*) from cars_clean;
OK
43079
说明：数据清洗后的cars_clean表共有43079行数据，而数据清洗前的cars表总共有70640行数据，有27561行数据被清洗掉了，原始数据文件car.txt的数据质量堪忧！

下面开始进行数据分析，建议在经过数据清洗后的car_clean表中执行进行数据分析！

以下SQL语句都改成针对car_clean表进行操作，并可以去掉where判缺失值的语句

5 统计车辆的使用性质：商用车、乘用车的数量

hive> select nature,count(*) as cnt from cars group by nature having nature is not null and nature != '';

中小学生校车    119
公交客运    1742
公路客运    1448
出租客运    2
初中生校车    2
小学生校车    111
工程救险    1
幼儿校车    17
救护    1
教练    26
旅游客运    219
消防    7
租赁    24
警用    165
非营运    66478

说明：car表中nature列值不为null且不为空字符串的数据行共有119+1742+1448+2+2+111+1+17+1+26+219+7+24+165+66478=70362行

hive>select '非营运', sum(if(a.nature='非营运',a.cnt,0)), '营运', sum(if(a.nature!='非营运',a.cnt,0))
from (select nature,count(*) as cnt from cars
group by nature having nature is not null and nature != '') a;
OK
非营运 66478 营运 3884
Time taken: 47.985 seconds, Fetched: 1 row(s)
说明：car表的nature列值为“非营运”的数据行有66478行，nature列值为“中小学生校车”、“公交客运”等属于营运性质的数据行有3884行，总共66478+3884=70362行
6 按月统计山西省2013年的汽车销售比例
hive> select sum(number) as total from cars where province='山西省' and year='2013';
OK
70362
说明：山西省2013年按年份统计的汽车销售总量是70362；执行select * from cars where number !=1; 查询不到任何结果，说明cars表的所有行number列值都是1，因此sum(number)恰好等于总行数70362

hive> select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month;

说明：山西省2013年按月份统计的汽车销售数量

hive> select month, c1.ss/c.total
from
(select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month) c1,
(select sum(number) as total from cars where province='山西省' and year='2013') c;
OK

1    0.14799181376311077
2    0.05831272561894204
3    0.09306159574770473
4    0.06587362496802251
5    0.0732071288479577
6    0.05547028225462608
7    0.06323015263920867
8    0.06378442909525028
9    0.06948352804070379
10    0.1044882180722549
11    0.10053722179585571
12    0.1045592791563628

说明：山西省2013年每个月的汽车销售比例，即每月的销售数量/一年的销售数量

7 按性别统计男性、女性买车的比例，以及对品牌的选择

hive>select sex, count(*) as cnt from cars group by sex having sex is not null and sex !='';

1913    37
1921    21
1925    18
1926    37
1928    36
1929    13
1930    18
1931    19
1932    21
1933    13
1934    10
1935    10
1936    4
1937    4
1938    2
1939    3
1940    1
1941    5
1942    1
1943    2
1944    1
1947    2
1952    1
1954    1
1956    1
东风汽车有限公司商用车发动机    23
女性    17696
客车    1238
广西玉柴机器股份有限公司    2
男性    41501
轻型客车    5
重庆渝安淮海动力有限公司    4

说明：按性别sex分组统计每种性别的买车数量，并筛选出性别不为null和空字符串的分组；但该SQL语句先分组再筛选有点别扭，建议执行select sex, count(*) as cnt from cars where sex is not null and sex !='' group by sex;先筛选再分组，效率更好

hive>select '男性', sum(if(b.sex='男性', b.cnt, 0)) as man, '女性', sum(if(b.sex='女性' ,b.cnt, 0)) as woman from (select sex, count(*) as cnt from cars where sex is not null and sex !='' group by sex) b;