Data Analysis: Kaggle NYC Taxi Fare Prediction

Environment Setup

Environment and third-party libraries used during the EDA:

  • Python 3.7.2
  • Jupyter Notebook (all of the code below was run here)
  • pandas 0.23.4
  • NumPy 1.15.4 (data processing)
  • Matplotlib 3.0.2
  • seaborn 0.9.0 (data visualization)

Data Preparation

Field descriptions and the train/test set downloads:
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data

Walkthrough

First, import the third-party libraries:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
pd.set_option('max_columns',None) # show all columns when printing a DataFrame
plt.style.use('seaborn-whitegrid')

The full training set is 5.4 GB, too much to load into memory at once, so start with the first 2 million rows.
(This kernel shows how to import a 55-million-row CSV: https://www.kaggle.com/szelee/how-to-import-a-csv-file-of-55-million-rows)

df=pd.read_csv(
    r'./dataset/train.csv',
    nrows=2000000,
    parse_dates=['pickup_datetime'])
df_test=pd.read_csv(r'./dataset/test.csv')
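If memory is tight, the chunked approach from the linked kernel can be combined with down-cast dtypes. This is a sketch under assumptions, not the kernel's exact code; the path and chunk size are illustrative:

```python
import pandas as pd

# float32/uint8 roughly halve memory versus pandas' float64/int64 defaults.
TRAIN_DTYPES = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

def load_train_csv(path, chunksize=1_000_000):
    """Read the CSV in chunks and concatenate, keeping peak memory bounded."""
    chunks = pd.read_csv(path, dtype=TRAIN_DTYPES,
                         parse_dates=['pickup_datetime'],
                         chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)
```

Each chunk is parsed with the reduced dtypes before concatenation, so the full-size float64 frame never exists in memory.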

Take a quick look at the data:

display(df.head())     # preview the first 5 rows
display(df.info())     # dtypes and memory usage
display(df.describe()) # summary statistics of the numeric columns (NaNs excluded): count, mean, std, percentiles, min/max
   key                            fare_amount  pickup_datetime      pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
0  2009-06-15 17:26:21.0000001            4.5  2009-06-15 17:26:21        -73.844311        40.721319         -73.841610         40.712278                1
1  2010-01-05 16:52:16.0000002           16.9  2010-01-05 16:52:16        -74.016048        40.711303         -73.979268         40.782004                1
2  2011-08-18 00:35:00.00000049           5.7  2011-08-18 00:35:00        -73.982738        40.761270         -73.991242         40.750562                2
3  2012-04-21 04:30:42.0000001            7.7  2012-04-21 04:30:42        -73.987130        40.733143         -73.991567         40.758092                1
4  2010-03-09 07:51:00.000000135          5.3  2010-03-09 07:51:00        -73.968095        40.768008         -73.956655         40.783762                1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 8 columns):
key                  object
fare_amount          float64
pickup_datetime      datetime64[ns]
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count      int64
dtypes: datetime64(1), float64(5), int64(1), object(1)
memory usage: 122.1+ MB

        fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
count  2.000000e+06      2.000000e+06     2.000000e+06       1.999986e+06      1.999986e+06     2.000000e+06
mean   1.134779e+01     -7.252321e+01     3.992963e+01      -7.252395e+01      3.992808e+01     1.684113e+00
std    9.852883e+00      1.286804e+01     7.983352e+00       1.277497e+01      1.032382e+01     1.314982e+00
min   -6.200000e+01     -3.377681e+03    -3.458665e+03      -3.383297e+03     -3.461541e+03     0.000000e+00
25%    6.000000e+00     -7.399208e+01     4.073491e+01      -7.399141e+01      4.073400e+01     1.000000e+00
50%    8.500000e+00     -7.398181e+01     4.075263e+01      -7.398016e+01      4.075312e+01     1.000000e+00
75%    1.250000e+01     -7.396713e+01     4.076710e+01      -7.396369e+01      4.076809e+01     2.000000e+00
max    1.273310e+03      2.856442e+03     2.621628e+03       3.414307e+03      3.345917e+03     2.080000e+02

Data Preprocessing

The minimum fare_amount is negative, which is impossible, so filter those rows out:

print("Old Size: {oldsize}".format(oldsize=len(df)))
df_train=df[df['fare_amount']>=0]
print("New Size: {newsize}".format(newsize=len(df_train)))

Old Size: 2000000
New Size: 1999923

Check for missing values:

df_train.isnull().agg(['any','mean','sum','count'])
# any: whether missing values exist; mean: missing-value rate; sum: missing-value count; count: total rows
       key      fare_amount  pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
any    False    False        False            False             False            True               True              False
mean   0        0            0                0                 0                7.00027e-06        7.00027e-06       0
sum    0        0            0                0                 0                14                 14                0
count  1999923  1999923      1999923          1999923           1999923          1999923            1999923           1999923

The data is almost complete, so simply drop the rows with missing values:

print("Old Size: {oldsize}".format(oldsize=len(df_train)))
df_train=df_train.dropna(how='any',axis=0)
print("New Size: {newsize}".format(newsize=len(df_train)))

Old Size: 1999923
New Size: 1999909

Exploratory Data Analysis

First, look at the distribution of the training target, fare_amount. Most values fall in (0, 20), but the range is very wide, as the describe() output above already showed:

plt.figure(figsize=(16,4))
# distribution of all fare_amount values
plt.subplot(1,2,1)
sns.distplot(df_train['fare_amount'])
# distribution of fares below $100
plt.subplot(1,2,2)
sns.distplot(df_train[df_train['fare_amount']<100]['fare_amount'])
plt.show()

(Figure: fare_amount distributions, full range and below $100)
Latitude and longitude ranges of the test-set pickup and dropoff points:

# latitude range of pickup/dropoff points
min(df_test.pickup_latitude.min(),df_test.dropoff_latitude.min()),\
max(df_test.pickup_latitude.max(),df_test.dropoff_latitude.max())
# longitude range of pickup/dropoff points
min(df_test.pickup_longitude.min(),df_test.dropoff_longitude.min()),\
max(df_test.pickup_longitude.max(),df_test.dropoff_longitude.max())

(40.568973, 41.709555)
(-74.263242, -72.986532)

These fit within New York City's latitude/longitude range, looked up online:

# full map
BB = (-74.5, -72.8, 40.5, 41.8)
nyc_map=plt.imread('map')
# zoomed-in map
BB_zoom = (-74.3, -73.7, 40.5, 40.9)
nyc_map_zoom = plt.imread('map_zoom')

def select_within_boundingbox(df,bd):
    '''
    Return a boolean index selecting rows whose coordinates fall inside the box.
    - df: dataset to filter
    - bd: bounding box (lon_min, lon_max, lat_min, lat_max)
    '''
    idx=(df['pickup_longitude']>=bd[0]) & (df['pickup_longitude']<=bd[1])
    idx&=(df['dropoff_longitude']>=bd[0]) & (df['dropoff_longitude']<=bd[1])
    idx&=(df['pickup_latitude']>=bd[2]) & (df['pickup_latitude']<=bd[3])
    idx&=(df['dropoff_latitude']>=bd[2]) & (df['dropoff_latitude']<=bd[3])
    return idx
    
# drop training-set points that fall outside the test bounding box
print("Old Size: {oldsize}".format(oldsize=len(df_train)))
df_train=df_train[select_within_boundingbox(df_train,BB)]
print("New Size: {newsize}".format(newsize=len(df_train)))

Old Size: 1999909
New Size: 1957918
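To see the mask logic in isolation, here is a toy check (the coordinates are invented for illustration). Series.between is equivalent to the paired >=/<= comparisons in select_within_boundingbox; rows with any coordinate outside the box are dropped:

```python
import pandas as pd

BB = (-74.5, -72.8, 40.5, 41.8)  # (lon_min, lon_max, lat_min, lat_max)
toy = pd.DataFrame({
    'pickup_longitude':  [-73.98, -75.00, -73.90],
    'pickup_latitude':   [ 40.75,  40.75,  40.75],
    'dropoff_longitude': [-73.95, -73.95, -73.95],
    'dropoff_latitude':  [ 40.76,  40.76,  42.00],
})
# same predicate as select_within_boundingbox, written with .between
idx = (toy['pickup_longitude'].between(BB[0], BB[1]) &
       toy['dropoff_longitude'].between(BB[0], BB[1]) &
       toy['pickup_latitude'].between(BB[2], BB[3]) &
       toy['dropoff_latitude'].between(BB[2], BB[3]))
kept = toy[idx]
```

Row 1 fails on its pickup longitude (-75.00) and row 2 on its dropoff latitude (42.00), so only row 0 survives.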

Since we will be plotting on the map repeatedly, define a helper function first:

def plot_on_map(df,BB,nyc_map,s=10,alpha=0.2):
    fig,axs=plt.subplots(1,2,figsize=(16,10))
    axs[0].scatter(df.pickup_longitude,df.pickup_latitude,zorder=1,alpha=alpha,c='r',s=s)
    axs[0].set_xlim((BB[0],BB[1]))
    axs[0].set_ylim((BB[2],BB[3]))
    axs[0].set_title('Pickup Locations')
    axs[0].imshow(nyc_map, zorder=0, extent=BB)
    
    axs[1].scatter(df.dropoff_longitude,df.dropoff_latitude,zorder=1,alpha=alpha,c='r',s=s)
    axs[1].set_xlim((BB[0],BB[1]))
    axs[1].set_ylim((BB[2],BB[3]))
    axs[1].set_title('Dropoff Locations')
    axs[1].imshow(nyc_map,zorder=0,extent=BB)

Pickup and dropoff locations in the training set:

# full map
plot_on_map(df_train,BB,nyc_map,alpha=0.2,s=0.5)
# zoomed in
plot_on_map(df_train,BB_zoom,nyc_map_zoom,alpha=0.2,s=0.5)

(Figures: training-set pickup/dropoff locations, full map and zoomed in)
Now the test set's pickup and dropoff locations:

plot_on_map(df_test,BB,nyc_map,alpha=0.2,s=0.5)

(Figure: test-set pickup/dropoff locations)
With this much data, the points themselves outline the city; see how it looks without the map background:

def plot_without_map(df,BB,figsize=(12,12),c=('r','b')):
    fig,ax=plt.subplots(1,1,figsize=figsize)
    idx=select_within_boundingbox(df,BB)
    ax.scatter(df[idx].pickup_longitude, df[idx].pickup_latitude, c=c[0], s=0.01, alpha=0.5)
    ax.scatter(df[idx].dropoff_longitude, df[idx].dropoff_latitude, c=c[1], s=0.01, alpha=0.5)
# zoom to where the red (pickup) points are dense
plot_without_map(df_train, (-74.1, -73.7, 40.6, 40.9))
# zoom further into the busiest area
plot_without_map(df_train, (-74, -73.95, 40.7, 40.8))

(Figures: pickups in red, dropoffs in blue, at two zoom levels)
The maps also reveal some outliers: a few coordinates actually fall in the water.
Let's try filtering these points out:

def remove_datapoints_from_water(df,BB):
    # convert longitude/latitude coordinates into image pixel coordinates
    def lonla_to_xy(longitude,latitude,dx,dy,BB):
        '''
        - longitude: longitude to transform to x
        - latitude: latitude to transform to y
        - dx: image width in pixels
        - dy: image height in pixels
        '''
        return (dx*(longitude-BB[0])/(BB[1]-BB[0])).astype('int'),(dy-dy*(latitude-BB[2])/(BB[3]-BB[2])).astype('int')

    # load the mask (it can be cut out with Photoshop); values > 0.9 are the white, i.e. valid, region
    nyc_mask=plt.imread('nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0]>0.9

    pickup_x,pickup_y=lonla_to_xy(df.pickup_longitude,df.pickup_latitude,nyc_mask.shape[1],nyc_mask.shape[0],BB)
    dropoff_x,dropoff_y=lonla_to_xy(df.dropoff_longitude,df.dropoff_latitude,nyc_mask.shape[1],nyc_mask.shape[0],BB)
    idx=(nyc_mask[pickup_y,pickup_x]) & (nyc_mask[dropoff_y,dropoff_x])
    return df[idx]

print("Old Size: %d"%len(df_train))
df_train=remove_datapoints_from_water(df_train,BB)
print("New Size: %d"%len(df_train))

This filters out nearly 400 more anomalies:

Old Size: 1957918
New Size: 1957530
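The longitude/latitude-to-pixel mapping inside lonla_to_xy can be checked against a synthetic mask. This is my illustration, not code from the post: the tiny mask below is invented, and a clip is added to guard the inclusive box edges (a latitude of exactly BB[2] would otherwise index one past the last row):

```python
import numpy as np

BB = (-74.5, -72.8, 40.5, 41.8)

def lonlat_to_pixels(lon, lat, dx, dy, BB):
    # linear map: longitude -> column, latitude -> row (row 0 = northernmost)
    x = (dx * (lon - BB[0]) / (BB[1] - BB[0])).astype(int)
    y = (dy - dy * (lat - BB[2]) / (BB[3] - BB[2])).astype(int)
    # clip guards the inclusive edges of the bounding box
    return np.clip(x, 0, dx - 1), np.clip(y, 0, dy - 1)

# a 10x10 mask that pretends the western half is land (True)
mask = np.zeros((10, 10), dtype=bool)
mask[:, :5] = True

lon = np.array([-74.4, -73.0])   # far west (land), far east (water)
lat = np.array([40.6, 40.6])
x, y = lonlat_to_pixels(lon, lat, mask.shape[1], mask.shape[0], BB)
on_land = mask[y, x]             # -> [True, False]
```

The western point lands in the True half of the mask and is kept; the eastern point lands in the False half and would be dropped.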

The scatter plots give a rough sense of the spatial distribution, e.g. which areas see the most pickups and dropoffs.
A heat map describes this more precisely. First compute each pixel's data density (pickups or dropoffs per square mile):

# heat-map resolution: 200 pixels in each direction, i.e. 200*200 cells
n_lon,n_lat=200,200
# pickup and dropoff density per pixel
density_pickup,density_dropoff=np.zeros((n_lon,n_lat)),np.zeros((n_lon,n_lat))
# per-pixel longitude/latitude spans delta_lon,delta_lat and the bin edges bins_lon,bins_lat
delta_lon=(BB[1]-BB[0])/n_lon
delta_lat=(BB[3]-BB[2])/n_lat
bins_lon=[BB[0]+i*delta_lon for i in range(n_lon+1)]
bins_lat=[BB[2]+i*delta_lat for i in range(n_lat+1)]

# np.digitize assigns each data point to its pixel (it returns, for each value, the index of the bin it falls into)
inds_pickup_lon=np.digitize(df_train.pickup_longitude,bins_lon)
inds_pickup_lat=np.digitize(df_train.pickup_latitude,bins_lat)
inds_dropoff_lon=np.digitize(df_train.dropoff_longitude,bins_lon)
inds_dropoff_lat=np.digitize(df_train.dropoff_latitude,bins_lat)

# great-circle distance in miles between two (lat, lon) points
def distance(lat1,lon1,lat2,lon2):
    p = 0.017453292519943295 # pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a)) # 12742 km = Earth's diameter; 0.6213712 converts km to miles
# physical size of each pixel
bin_width_miles=distance(BB[2],BB[0],BB[2],BB[1])/n_lon
bin_height_miles=distance(BB[2],BB[0],BB[3],BB[0])/n_lat
dxdy=bin_width_miles*bin_height_miles # pixel area in square miles

# pickup and dropoff density per pixel, in events per square mile
for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j,i]=np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j)))/dxdy
        density_dropoff[j,i]=np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j)))/dxdy
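As an aside (my suggestion, not from the original post), the digitize-plus-double-loop can be replaced by a single vectorized np.histogram2d call, which bins both coordinates at once. A sketch on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
BB = (-74.5, -72.8, 40.5, 41.8)
n_lon, n_lat = 200, 200

# 10,000 uniformly random points inside the bounding box
lon = rng.uniform(BB[0], BB[1], 10_000)
lat = rng.uniform(BB[2], BB[3], 10_000)

# counts[i, j] = number of points in longitude bin i and latitude bin j
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=[n_lon, n_lat],
    range=[[BB[0], BB[1]], [BB[2], BB[3]]])

# transpose and flip to match image orientation (row 0 = maximum latitude)
density_grid = counts.T[::-1]
```

Dividing density_grid by the pixel area in square miles would then reproduce the per-square-mile densities computed by the loop above.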

Using these densities, draw the pickup and dropoff heat maps:

fig, axs = plt.subplots(2, 1, figsize=(18, 24))
axs[0].imshow(nyc_map, zorder=0, extent=BB)
# the data is highly skewed, so smooth it with np.log1p to bring it closer to Gaussian for visualization and fitting
im = axs[0].imshow(np.log1p(density_pickup), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[0].set_title('Pickup density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[0])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)

axs[1].imshow(nyc_map, zorder=0, extent=BB);
im = axs[1].imshow(np.log1p(density_dropoff), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[1].set_title('Dropoff density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[1])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)

(Figures: pickup and dropoff density heat maps over the NYC map)
Having analyzed the data spatially, let's try the time dimension.
First some light preparation: add year, weekday, and hour features:

df_train['year']=df_train.pickup_datetime.apply(lambda t: t.year)
df_train['weekday']=df_train.pickup_datetime.apply(lambda t: t.weekday())
df_train['hour']=df_train.pickup_datetime.apply(lambda t: t.hour)
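A side note, not from the original post: over two million rows, apply(lambda t: ...) is slow, and the vectorized .dt accessor produces the same columns. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({'pickup_datetime': pd.to_datetime(
    ['2009-06-15 17:26:21', '2015-01-04 02:10:00'])})

# vectorized equivalents of the apply(...) calls above
toy['year'] = toy.pickup_datetime.dt.year
toy['weekday'] = toy.pickup_datetime.dt.weekday  # Monday == 0
toy['hour'] = toy.pickup_datetime.dt.hour
```

The .dt accessor works column-wise in C instead of calling a Python lambda per row, so on the full training set it is dramatically faster.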

Then count the traffic within each time window, much like the heat map above:

n_hours=24
n_weekdays = 7
n_years = 7
n_bins_lon = 30
n_bins_lat = 30
# focus on traffic in Manhattan
BB_traffic=(-74.025, -73.925, 40.7, 40.8)
# define a function to calculate pickup traffic density
def caculate_traffic_density(df):
    traffic=np.zeros((n_years,n_weekdays,n_hours,n_bins_lat,n_bins_lon))
    
    delta_lon=(BB_traffic[1]-BB_traffic[0])/n_bins_lon
    delta_lat=(BB_traffic[3]-BB_traffic[2])/n_bins_lat
    bins_lon=[BB_traffic[0]+i*delta_lon for i in range(n_bins_lon+1)]
    bins_lat=[BB_traffic[2]+i*delta_lat for i in range(n_bins_lat+1)]
    
    for y in range(n_years):
        for w in range(n_weekdays):
            for h in range(n_hours):
                # filter data by year,weekday, hour 
                idx=(df.year==(2009+y)) & (df.weekday==w) & (df.hour==h)
                
                idx_pickup_lon=np.digitize(df[idx].pickup_longitude,bins_lon)
                idx_pickup_lat=np.digitize(df[idx].pickup_latitude,bins_lat)
                
                for i in range(n_bins_lon):
                    for j in range(n_bins_lat):
                        traffic[y,w,h,j,i]+=np.sum((idx_pickup_lon==(i+1)) & (idx_pickup_lat==(j+1)))
    return traffic

Define a plotting function: given a year and a day of the week, draw that day's pickup heat map for every hour:

def plot_traffic(traffic,y,d):
    days = {'monday' : 0, 'tuesday' : 1, 'wednesday' : 2, 'thursday' : 3, 'friday' : 4, 'saturday' : 5, 'sunday' : 6}
    fig, axs = plt.subplots(3,8,figsize=(18,7))
    axs = axs.ravel()
    for i in range(24):
        axs[i].imshow(traffic[y-2009,days[d],i,::-1,:],zorder=1,cmap='coolwarm',clim=(0, traffic.max()))
        axs[i].get_xaxis().set_visible(False)
        axs[i].get_yaxis().set_visible(False)
        axs[i].set_title(f'h={i}')
        fig.suptitle("Pickup traffic density, year={}, day={} (max_pickups={})".format(y, d, traffic.max()))

First look at 2009, picking three representative days: Monday, Friday, and Sunday. Monday's early-morning panels show very few bright spots, presumably because people with work on Monday take taxis home early, whereas Friday and especially Sunday still show substantial pickups as late as 4 a.m.; with no work the next morning, Sunday night can run until dawn:

traffic=caculate_traffic_density(df_train)
plot_traffic(traffic,2009,'monday')
plot_traffic(traffic,2009,'friday')
plot_traffic(traffic,2009,'sunday')

(Figures: 2009 hourly pickup density for Monday, Friday, and Sunday)
Now 2015: noticeably dimmer than 2009 overall, but with the same periodic peaks and lulls:

plot_traffic(traffic,2015,'monday')
plot_traffic(traffic,2015,'friday')
plot_traffic(traffic,2015,'sunday')

(Figures: 2015 hourly pickup density for Monday, Friday, and Sunday)
Fares may also differ by time of day:

# fare_per_mile has not been created yet: derive it from the great-circle trip distance
df_train['distance_miles']=distance(df_train.pickup_latitude,df_train.pickup_longitude,df_train.dropoff_latitude,df_train.dropoff_longitude)
df_train['fare_per_mile']=df_train.fare_amount/df_train.distance_miles
df_train.pivot_table('fare_per_mile',index='hour',columns='year',aggfunc=np.mean).plot(figsize=(14,6))
plt.ylabel('Fare $USD / mile')

(Figure: average fare per mile by hour, one line per year)
Another major factor in the fare is distance.
Most trips turn out to be under 20 miles:

# great-circle distance between the pickup and dropoff coordinates
df_train['distance_miles']=distance(df_train.pickup_latitude,df_train.pickup_longitude,df_train.dropoff_latitude,df_train.dropoff_longitude)
df_train.distance_miles.hist(bins=50,figsize=(12,4))
plt.xlabel('distance miles')
plt.title('Histogram ride distances in miles')

(Figure: histogram of ride distances in miles)
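As a quick sanity check (mine, not the original author's), the distance() helper should return roughly 13 miles from JFK to lower Manhattan; the helper is repeated here so the snippet stands alone:

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    # haversine distance in miles: 12742 km is Earth's diameter, 0.6213712 converts km to miles
    p = 0.017453292519943295  # pi / 180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p)
         * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# JFK (40.6442, -73.7822) to lower Manhattan (40.7142, -74.0064): ~12.7 miles
d = distance(40.6441666667, -73.7822222222, 40.7141667, -74.0063889)
```

The formula is symmetric in its two endpoints, so swapping the arguments gives the same value.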
The fare may also depend on the number of passengers; compute the mean distance and fare per passenger count, plus the overall fare per mile:

df_train.groupby(['passenger_count'])['distance_miles', 'fare_amount'].mean()
print("Average $USD/Mile : {:0.2f}".format(df_train.fare_amount.sum()/df_train.distance_miles.sum()))
                 distance_miles  fare_amount
passenger_count
0                      1.732107     8.813325
1                      2.040013    11.202837
2                      2.179313    11.800007
3                      2.100634    11.519105
4                      2.132165    11.721121
5                      2.072168    11.215987
6                      2.122414    12.169590
9                      8.106351   104.000000

Average $USD/Mile : 5.48

How are fare and distance jointly distributed? A scatter plot again:

fig,axs=plt.subplots(1,2,figsize=(16,6))
# all data
axs[0].scatter(df_train.distance_miles,df_train.fare_amount,alpha=0.2,s=5)
axs[0].set_xlabel('distance_miles')
axs[0].set_ylabel('fare_amount')
axs[0].set_title('All data')
# zoom in to distance_miles<=20 & fare_amount<=120
idx=(df_train.distance_miles<=20) & (df_train.fare_amount<=120)
axs[1].scatter(df_train[idx].distance_miles,df_train[idx].fare_amount,alpha=0.2,s=5)
axs[1].set_xlabel('distance_miles')
axs[1].set_ylabel('fare_amount')
axs[1].set_title('Zoom In')

(Figure: fare vs. distance scatter plots, all data and zoomed in)
The right-hand plot shows several horizontal bands, i.e. fixed-fare trips,
most likely flat rates between an airport and the city center. Look at JFK first:

# JFK airport coordinates
jfk = (-73.7822222222, 40.6441666667)
# downtown Manhattan coordinates
nyc = (-74.0063889, 40.7141667)

def plot_location_fare(loc,name,rng=0.5):
    fig,axs=plt.subplots(1,2,figsize=(18,5))
    idx=distance(df_train.pickup_latitude,df_train.pickup_longitude,loc[1],loc[0])<rng
    df_train[idx].fare_amount.hist(bins=100,ax=axs[0])
    axs[0].set_xlabel('fare $USD')
    axs[0].set_title(f'Histogram pickup location within {rng} miles of {name}')
    
    idx=distance(df_train.dropoff_latitude,df_train.dropoff_longitude,loc[1],loc[0])<rng
    df_train[idx].fare_amount.hist(bins=100,ax=axs[1])
    axs[1].set_xlabel('fare $USD')
    axs[1].set_title(f'Histogram dropoff location within {rng} miles of {name}')
    
plot_location_fare(jfk,'JFK Airport')

(Figure: fare histograms for pickups and dropoffs within 0.5 miles of JFK)
Sure enough, a few fare bars stand far above the rest! Check the other two airports:

ewr = (-74.175, 40.69)
lgr = (-73.87, 40.77)
plot_location_fare(ewr, 'Newark Airport')
plot_location_fare(lgr, 'LaGuardia Airport')

(Figures: fare histograms near Newark and LaGuardia airports)
Since pickup and dropoff locations clearly correlate with the fare, adding location-based features should improve prediction accuracy!
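For example, a hedged sketch (my illustration, not from the post): distance-to-airport features built with the same haversine helper could flag the flat-fare airport trips. The airport coordinates are the ones used above; the feature names are invented:

```python
import numpy as np
import pandas as pd

def distance(lat1, lon1, lat2, lon2):
    # same haversine-in-miles helper used throughout the post
    p = 0.017453292519943295
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p)
         * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

AIRPORTS = {'jfk': (40.6441666667, -73.7822222222),
            'ewr': (40.69, -74.175),
            'lgr': (40.77, -73.87)}

def add_airport_features(df):
    """Add pickup/dropoff distance to each airport as new columns."""
    for name, (lat, lon) in AIRPORTS.items():
        df[f'pickup_dist_{name}'] = distance(df.pickup_latitude,
                                             df.pickup_longitude, lat, lon)
        df[f'dropoff_dist_{name}'] = distance(df.dropoff_latitude,
                                              df.dropoff_longitude, lat, lon)
    return df

# toy example: one trip from JFK to lower Manhattan
toy = pd.DataFrame({'pickup_latitude': [40.6442], 'pickup_longitude': [-73.7822],
                    'dropoff_latitude': [40.7142], 'dropoff_longitude': [-74.0064]})
toy = add_airport_features(toy)
```

A model could then learn the flat-fare bands from small pickup_dist_jfk or dropoff_dist_jfk values instead of raw coordinates.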
