第9章 汽车销售数据分析
1 创建数据库cardb
hive>create database if not exists cardb;
hive>use cardb;
2 创建数据表car
hive>create external table cars(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars';
hive> desc cars;
province string
month int
city string
country string
year int
cartype string
productor string
brand string
mold string
owner string
nature string
number int
ftype string
outv int
power double
fuel string
length int
width int
height int
xlength int
xwidth int
xheight int
count int
base int
front int
norm string
tnumber int
total int
curb int
hcurb int
passenger string
zhcurb int
business string
dtype string
fmold string
fbusiness string
name string
age int
sex string
3 加载数据文件到数据表cars
1)使用XShell的XFtp文件传输工具,将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\\汽车销售系统\data\car.txt数据文件上传到虚拟机Linux系统的/root目录
2) hive> load data local inpath '/root/car.txt' overwrite into table cars; 加载数据文件
3) hive> select * from cars limit 10; 查询前10条数据
下面开始进行数据分析:
4 统计cars表的数据行数
hive> select count(*) from cars;
OK
70640
说明:cars表总共有70640行数据
hive>select count(*) from cars where nature is not null and nature != '';
OK
70362
说明:car表中nature列值不为null且不为空字符串的数据行共有70362行,70640-70362=278行数据的nature列存在缺失值,即车辆使用性质未知
重要提醒:缺失值和重复列会影响数据分析的准确性,最好在数据分析前先通过数据清洗将有缺失值和重复列的数据行给剔除掉!
数据清洗过程:
1)建立一个表结构相同的新表cars_clean,用于存放数据清洗后的数据
hive>create external table cars_clean(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars_clean';
2)选取后面的数据分析会用到的重要列nature,number, province, year, month, sex, owner, cartype, mold, brand, ftype来剔除缺失值,再对所有列进行distinct去重,并逐行插入到新表cars_clean中
hive>insert into cars_clean select distinct * from cars where nature is not null and nature != '' and number is not null and province is not null and province != '' and year is not null and month is not null and sex is not null and sex != '' and owner is not null and owner !='' and cartype is not null and cartype != '' and mold is not null and mold != '' and brand is not null and brand != '' and ftype is not null and ftype != '';
说明:int类型的列只需判断不为null,string类型的列要判断不为null且不为空字符串;后面数据分析的SQL语句中就不用写类似where nature is not null and !=''的语句了
3)统计经过清洗后的数据表cars_clean的行数
hive> select count(*) from cars_clean;
OK
43079
说明:数据清洗后的cars_clean表共有43079行数据,而数据清洗前的cars表总共有70640行数据,有27561行数据被清洗掉了,原始数据文件car.txt的数据质量堪忧!
下面开始进行数据分析,建议在经过数据清洗后的car_clean表中执行进行数据分析!
以下SQL语句都改成针对car_clean表进行操作,并可以去掉where判缺失值的语句
5 统计车辆的使用性质:商用车、乘用车的数量
hive> select nature,count(*) as cnt from cars group by nature having nature is not null and nature != '';
中小学生校车 119
公交客运 1742
公路客运 1448
出租客运 2
初中生校车 2
小学生校车 111
工程救险 1
幼儿校车 17
救护 1
教练 26
旅游客运 219
消防 7
租赁 24
警用 165
非营运 66478
说明:car表中nature列值不为null且不为空字符串的数据行共有119+1742+1448+2+2+111+1+17+1+26+219+7+24+165+66478=70362行
hive>select '非营运', sum(if(a.nature='非营运',a.cnt,0)), '营运', sum(if(a.nature!='非营运',a.cnt,0))
from (select nature,count(*) as cnt from cars
group by nature having nature is not null and nature != '') a;
OK
非营运 66478 营运 3884
Time taken: 47.985 seconds, Fetched: 1 row(s)
说明:car表的nature列值为“非营运”的数据行有66478行,nature列值为“中小学生校车”、“公交客运”等属于营运性质的数据行有3884行,总共66478+3884=70362行
6 按月统计山西省2013年的汽车销售比例
hive> select sum(number) as total from cars where province='山西省' and year='2013';
OK
70362
说明:山西省2013年按年份统计的汽车销售总量是70362; 执行select * from cars where number !=1; 查询不到任何结果,说明cars表的所有行number列值都是1,因此sum(number)恰好等于总行数70362
hive> select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month;
1 10413
2 4103
3 6548
4 4635
5 5151
6 3903
7 4449
8 4488
9 4889
10 7352
11 7074
12 7357
说明:山西省2013年按月份统计的汽车销售数量
hive> select month, c1.ss/c.total
from
(select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month) c1,
(select sum(number) as total from cars where province='山西省' and year='2013') c;
OK
1 0.14799181376311077
2 0.05831272561894204
3 0.09306159574770473
4 0.06587362496802251
5 0.0732071288479577
6 0.05547028225462608
7 0.06323015263920867
8 0.06378442909525028
9 0.06948352804070379
10 0.1044882180722549
11 0.10053722179585571
12 0.1045592791563628
说明:山西省2013年每个月的汽车销售比例,即每月的销售数量/一年的销售数量
7 按性别统计男性、女性买车的比例,以及对品牌的选择
hive>select sex, count(*) as cnt from cars group by sex having sex is not null and sex !='';
1913 37
1921 21
1925 18
1926 37
1928 36
1929 13
1930 18
1931 19
1932 21
1933 13
1934 10
1935 10
1936 4
1937 4
1938 2
1939 3
1940 1
1941 5
1942 1
1943 2
1944 1
1947 2
1952 1
1954 1
1956 1
东风汽车有限公司商用车发动机 23
女性 17696
客车 1238
广西玉柴机器股份有限公司 2
男性 41501
轻型客车 5
重庆渝安淮海动力有限公司 4
说明:按性别sex分组统计每种性别的买车数量,并筛选出性别不为null和空字符串的分组;但该SQL语句先分组再筛选有点别扭,建议执行select sex, count(*) as cnt from cars where sex is not null and sex !='' group by sex;先筛选再分组,效率更好
hive>select '男性', sum(if(b.sex='男性', b.cnt, 0)) as man, '女性', sum(if(b.sex='女性' ,b.cnt, 0)) as woman from (select sex, count(*) as cnt from cars where sex is not null and sex !='' group by sex) b;