车辆GPS位置信息分析-大数据项目

[实验数据]
本实验所用数据为江苏省南京市出租车从2010年9月1日到2日的GPS位置数据,包含了7,726辆出租车的共33,042,225条位置记录。数据集已经存放在HDFS上,路径为“/data/13/1/gps/gps.csv”,各字段以制表符分隔。数据集还存放在了Hive上,表名为“bigdata_cases.gps”。

各字段的定义为:

字段	定义
VehicleId	出租车标识符
Time	数据收集时间
Longitude	经度
Latitude	纬度
Speed	速度,以千米/小时计
Direction	方向,与正北方向的夹角度数
PassengerState	载客状态,1为载客,0为空车

数据集的前5行为:

student1@master:~$ hdfs dfs -cat /data/13/1/gps/gps.csv | head -5
806584008859    2010-09-01 00:01:42 118.85504299999999  31.939446   0   0   0
806770549907    2010-09-01 00:01:45 118.74909   32.089162000000002  30  6   1
806770943693    2010-09-01 00:01:46 118.856236  32.094287999999999  26  318 1
806914743134    2010-09-01 00:01:48 118.836872  31.931645   0   0   0
806451847007    2010-09-01 00:01:51 118.795597  32.039382000000003  20  10  0

[实验步骤提示]
在以下提示步骤中,步骤1是用Hive做数据分析和数据准备,所有代码在大数据计算集群上执行,步骤2是用R语言做数据可视化。

  1. 用Hive做数据分析和数据准备
    a. 统计每小时出租车的载客情况
    统计每小时总共的独立出租车数量以及在该小时载过客的独立出租车数量,并计算出载过客的比例。

     hive -e \
     "select day(Time) as Day, hour(Time) as Hour, 
     count(distinct case when PassengerState=1 then VehicleId else NULL end) as Passenger, 
     count(distinct VehicleId) as Total,
     count(distinct case when PassengerState=1 then VehicleId else NULL end) / count(distinct VehicleId) as PassengerRatio
     from bigdata_cases.gps
     group by day(Time), hour(Time)
     order by Day, Hour;" \
    

1.csv
得到结果:

Day	Hour	Passenger	Total	PassengerRatio
1	0	4258	7457	0.5710071074158509
1	1	3760	7453	0.5044948342949148
1	2	3156	7457	0.42322649859192707
1	3	2600	7482	0.3475006682705159
1	4	1219	7462	0.16336102921468776
1	5	1356	7454	0.18191574993292192
1	6	2141	7479	0.2862682176761599
1	7	3798	7494	0.5068054443554844
1	8	5345	7504	0.7122867803837953
1	9	5998	7512	0.7984558040468583
1	10	6198	7508	0.8255194459243473
1	11	6246	7497	0.8331332533013205
1	12	6424	7526	0.8535742758437417
1	13	5810	7486	0.7761154154421587
1	14	6004	7485	0.8021376085504343
1	15	6124	7486	0.818060379374833
1	16	6166	7505	0.8215856095936043
1	17	6085	7507	0.8105767949913414
1	18	5957	7501	0.7941607785628583
1	19	6292	7519	0.836813406038037
1	20	5788	7507	0.7710137205275076
1	21	5644	7491	0.7534374582832732
1	22	5314	7486	0.7098584023510554
1	23	6	11	0.5454545454545454
2	0	6	10	0.6
2	1	5	10	0.5
2	2	527	2179	0.24185406149609912
2	3	245	3115	0.07865168539325842
2	4	420	2747	0.15289406625409538
2	5	156	1881	0.08293460925039872
2	7	4653	7496	0.6207310565635006
2	8	5951	7504	0.793043710021322
2	9	6261	7483	0.8366965120940799
2	10	6324	7500	0.8432
2	11	6231	7497	0.8311324529811924
2	12	6075	7484	0.8117316942811331
2	13	5862	7493	0.7823301748298412
2	14	5956	7508	0.7932871603622802
2	15	6114	7531	0.8118443765768159
2	16	6172	7550	0.8174834437086093
2	17	5943	7561	0.7860071419124455
2	18	5948	7564	0.7863564251718668
2	19	5999	7553	0.794253938832252
2	20	5842	7549	0.773877334746324
2	21	5688	7536	0.7547770700636943
2	22	5481	7535	0.7274054412740544
2	23	5203	7527	0.6912448518666136

b. 统计每小时载客出租车的平均车速
统计每小时正在载客的出租车的平均车速。

hive -e \
"select day(Time) as Day, hour(Time) as Hour, 
avg(Speed) as speed
from bigdata_cases.gps
where PassengerState=1
group by day(Time), hour(Time)
order by day, hour;" \

2.csv
得到结果:

Day	Hour	Speed
1	0	26.40957333997252
1	1	26.484185465929794
1	2	25.639321183088185
1	3	25.07481635824269
1	4	23.407378384509105
1	5	25.52963627342468
1	6	29.044904348241705
1	7	24.384438390355054
1	8	18.168545668750205
1	9	18.274947503908844
1	10	19.536583655156143
1	11	20.64434407258455
1	12	21.36393650253309
1	13	23.031837592073078
1	14	21.183105455791274
1	15	19.92800990942307
1	16	19.278142206512133
1	17	19.409214001377936
1	18	17.76479025717124
1	19	20.073146396873305
1	20	22.744269452020934
1	21	23.00955957379188
1	22	23.94621109754561
1	23	25.748275862068965
2	0	13.095555555555556
2	1	11.86861313868613
2	2	21.301467562748595
2	3	19.34041468388851
2	4	25.451918616210484
2	5	26.88686302010285
2	7	17.94664964407304
2	8	18.322770974793492
2	9	20.82934053237346
2	10	21.53841368936541
2	11	22.018161086650508
2	12	22.64793409392082
2	13	23.1530217969407
2	14	21.25080310758536
2	15	20.629242391066345
2	16	20.087993670711402
2	17	18.5745157655044
2	18	15.634892457581637
2	19	18.011473668637095
2	20	21.97895318156716
2	21	22.934715533298963
2	22	22.65631225433132
2	23	23.96024022264538

c. 统计每小时载客出租车的利用率
统计每小时所有出租车载客的总时间和行驶的总时间,并计算出载客时间的比例,即为利用率。

hive -e \
"select day(a.Time) as Day, hour(a.Time) as Hour,
sum(case when a.PassengerState=1 then unix_timestamp(a.PostTime) - unix_timestamp(a.Time) else 0 end) as Passenger,
sum(unix_timestamp(a.PostTime) - unix_timestamp(a.Time)) as Total,
sum(case when a.PassengerState=1 then unix_timestamp(a.PostTime) - unix_timestamp(a.Time) else 0 end) / sum(unix_timestamp(a.PostTime) - unix_timestamp(a.Time)) as PassengerRatio
from (
select VehicleId, Time, PassengerState, 
lead(Time) over (partition by VehicleId order by Time) as PostTime
from bigdata_cases.gps
) a
group by day(a.Time), hour(a.Time)
order by Day, Hour;" \

3.csv
得到结果:

Day	Hour	Passenger	Total	PassengerRatio
1	0	7095657	26718724	0.265568707547561
1	1	6064334	27223971	0.22275714296051813
1	2	4804754	26990898	0.17801386230276592
1	3	3741590	27248583	0.13731319533202882
1	4	1409216	27171164	0.051864395651213174
1	5	1529474	26881007	0.05689794284864402
1	6	2491744	27014891	0.09223594498308359
1	7	5491842	27089971	0.2027260198986555
1	8	11465755	27143530	0.4224120812584067
1	9	15124532	27521886	0.5495456234358358
1	10	14689703	27472348	0.53470868234488
1	11	14310622	27422351	0.5218597778140904
1	12	13357718	28312497	0.4717958292410592
1	13	11311445	27270035	0.41479393040749674
1	14	13089708	27595631	0.4743398692350974
1	15	14068412	27634367	0.5090911617407412
1	16	14751893	27394349	0.5385013164576388
1	17	13556173	27415544	0.49447032676061436
1	18	13071200	27576153	0.4740037524450927
1	19	12485630	27933084	0.44698358405394834
1	20	11268531	27776822	0.4056810746744174
1	21	12256920	27547401	0.4449392521639337
1	22	84651975	208940995	0.40514775475248405
1	23	8388	41022	0.20447564721369021
2	0	12702	35978	0.3530490855522819
2	1	12478	39939	0.3124264503367636
2	2	1115170	9534887	0.11695681343680318
2	3	424528	7926314	0.05355932152069676
2	4	875639	18153132	0.04823624925990733
2	5	949100	17638080	0.05380971171465375
2	7	3919390	9547005	0.4105360791159112
2	8	14705378	28521412	0.5155908129653609
2	9	14001764	26725269	0.5239148013814192
2	10	13872016	27304917	0.508040951012596
2	11	12864680	27323762	0.47082389313740913
2	12	12156640	27130169	0.4480856717110756
2	13	11046548	27102516	0.4075838567902699
2	14	12666754	27173682	0.4661405105130766
2	15	13725740	27160821	0.5053507035004575
2	16	14459637	27551027	0.5248311433181783
2	17	12361111	26985433	0.45806606104856645
2	18	13820382	27239384	0.5073676409128782
2	19	14369189	27257101	0.5271723137394545
2	20	12115073	27100794	0.4470375665008191
2	21	12152858	27112889	0.4482317616540237
2	22	12834472	27062913	0.47424576947795677
2	23	10539314	26607742	0.3960995262206015

d. 得到各出租车在3时、8时和13时的瞬时载客情况
得到各出租车在3时、8时和13时第一次同步数据时的载客情况,为之后做数据可视化做准备。

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.PassengerState
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=3
) a
where a.Rank=1;" \

4-1.csv

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.PassengerState
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=8
) a
where a.Rank=1;" \

4-2.csv

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.PassengerState
from (
select VehicleId, Longitude, Latitude, PassengerState,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=13
) a
where a.Rank=1;" \

4-3.csv
e. 得到各载客出租车在3时、8时和13时的瞬时速度
得到各载客出租车在3时、8时和13时第一次同步数据时的瞬时速度,为之后做数据可视化做准备。

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.Speed
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=3 and PassengerState=1
) a
where a.Rank=1;" \

5-1.csv

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.Speed
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=8 and PassengerState=1
) a
where a.Rank=1;" \
> 5-2.csv

hive -e \
"select a.VehicleId, a.Longitude, a.Latitude, a.Speed
from (
select VehicleId, Longitude, Latitude, Speed,
row_number() over (partition by VehicleId order by Time) as Rank
from bigdata_cases.gps
where day(Time)=1 and hour(Time)=13 and PassengerState=1
) a
where a.Rank=1;" \

5-3.csv

  1. 用R语言做数据可视化
    a. 载入相关程序包
    载入相关程序包。将Hive输出的结果文件复制到R语言可访问的路径如“D:\workspace\”。

执行以下R语言代码:

> library(ggplot2)
> library(ggmap)
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

其中,由于程序包ggmap所用的是谷歌的地图服务,需要修改程序包中源代码“get_googlemap.R”和“geocode.R”中谷歌API的地址,将URL从“http://maps.googleapis.com/maps/api/”替换为“http://maps.google.cn/maps/api/”,并重新用Rtools编译打包。本实验系统中的RStudio已完成修改,不需要再做替换。

b. 可视化每小时出租车的平均速度和利用率
画出每小时出租车的平均速度和利用率的曲线图,其中横坐标表示小时,纵坐标表示出租车利用率,颜色表示日期,线条粗细表示平均速度。

> data2 <- read.table("D:/workspace/2.csv", sep = "\t")
> data3 <- read.table("D:/workspace/3.csv", sep = "\t")
> names(data2) <- c("Day", "Hour", "Speed")
> names(data3) <- c("Day", "Hour", "Passenger", "Total", "PassengerRatio")
> data23 <- cbind(data2, data3[5])
> data23$Day <- as.factor(data23$Day)
> ggplot(data23, aes(x = Hour, y = PassengerRatio, group = Day)) + geom_line(aes(colour = Day, 
+     size = Speed))

b. 画出3时、8时和13时出租车载客情况的地理信息图
画出3时、8时和13时出租车载客情况的地理信息图,其中颜色表示载客情况,红色为空车,蓝色为载客。需要对经度做+0.006的校正,对维度做−0.003的校正。

> data4.1 <- read.table("D:/workspace/4-1.csv", sep = "\t")
> names(data4.1) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.1$PassengerState <- as.factor(data4.1$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data4.1, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 3:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Information from URL : http://maps.google.cn/maps/api/geocode/json?address=Nanjing&sensor=false
Warning: Removed 1051 rows containing missing values (geom_point).


> data4.2 <- read.table("D:/workspace/4-2.csv", sep = "\t")
> names(data4.2) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.2$PassengerState <- as.factor(data4.2$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data4.2, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 8:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 1078 rows containing missing values (geom_point).


> data4.3 <- read.table("D:/workspace/4-3.csv", sep = "\t")
> names(data4.3) <- c("VehicleId", "Longitude", "Latitude", "PassengerState")
> data4.3$PassengerState <- as.factor(data4.3$PassengerState)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data4.3, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = PassengerState), alpha = 0.5, size = 2) + labs(title = "2010/9/1 13:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 888 rows containing missing values (geom_point).

c. 画出3时、8时和13时载客出租车瞬时速度的地理信息图
画出3时、8时和13时载客出租车瞬时速度的地理信息图,其中颜色表示瞬时速度,越接近红色速度越慢,越接近绿色速度越快。仅考虑车速小于等于80码且大于0的样本。需要对经度做+0.006的校正,对维度做−0.003的校正。

> data5.1 <- read.table("D:/workspace/5-1.csv", sep = "\t")
> names(data5.1) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> summary(data5.1)
   VehicleId           Longitude        Latitude         Speed       
 Min.   :8.064e+11   Min.   :  0.0   Min.   : 0.00   Min.   :  0.00  
 1st Qu.:8.066e+11   1st Qu.:118.8   1st Qu.:32.03   1st Qu.:  0.00  
 Median :8.068e+11   Median :118.8   Median :32.05   Median :  7.00  
 Mean   :8.067e+11   Mean   :117.8   Mean   :31.78   Mean   : 16.98  
 3rd Qu.:8.068e+11   3rd Qu.:118.8   3rd Qu.:32.08   3rd Qu.: 27.00  
 Max.   :8.088e+11   Max.   :119.9   Max.   :38.86   Max.   :196.00  
> data5.1 <- filter(data5.1, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data5.1, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0, 
+     80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 3:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 78 rows containing missing values (geom_point).


> data5.2 <- read.table("D:/workspace/5-2.csv", sep = "\t")
> names(data5.2) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> data5.2 <- filter(data5.2, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data5.2, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0, 
+     80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 8:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 345 rows containing missing values (geom_point).


> data5.3 <- read.table("D:/workspace/5-3.csv", sep = "\t")
> names(data5.3) <- c("VehicleId", "Longitude", "Latitude", "Speed")
> data5.3 <- filter(data5.3, Speed > 0 & Speed <= 80)
> ggmap(get_googlemap(center = "Nanjing", zoom = 12, maptype = "roadmap"), extent = "device") + 
+     geom_point(data = data5.3, aes(x = Longitude + 0.006, y = Latitude - 0.003, 
+         colour = Speed), alpha = 0.5, size = 2) + scale_colour_gradient(limits = c(0, 
+     80), trans = "sqrt", low = "red", high = "green") + labs(title = "2010/9/1 13:00")
Map from URL : http://maps.google.cn/maps/api/staticmap?center=Nanjing&zoom=12&size=640x640&maptype=roadmap&sensor=false
Warning: Removed 280 rows containing missing values (geom_point).
  • 5
    点赞
  • 19
    收藏
    觉得还不错? 一键收藏
  • 3
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值