Environment Setup
Environment and third-party libraries used in this EDA:
- python 3.7.2
- Jupyter Notebook (all code was tested here)
- pandas 0.23.4
- numpy 1.15.4 (data processing)
- matplotlib 3.0.2
- seaborn 0.9.0 (data visualization)
Data Preparation
Field descriptions and download links for the training and test sets:
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data
Main Analysis
Before diving in, import the third-party libraries:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
pd.set_option('display.max_columns',None) # show all columns when printing a DataFrame
plt.style.use('seaborn-whitegrid')
The full training set is 5.4 GB, too much to load into memory at once, so start with the first 2 million rows:
(This kernel shows how to import the full 55-million-row CSV: https://www.kaggle.com/szelee/how-to-import-a-csv-file-of-55-million-rows)
df=pd.read_csv(
    './dataset/train.csv',
nrows=2000000,
parse_dates=['pickup_datetime'])
df_test=pd.read_csv(r'./dataset/test.csv')
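The kernel linked above reads the full 55 million rows by processing the file in chunks and downcasting column dtypes. A minimal sketch of that idea (the dtype mapping follows the competition's columns, but the helper itself is my own, not from the kernel):

```python
import pandas as pd
import numpy as np

# float32 coordinates and a uint8 passenger count roughly halve memory
# compared with the default float64/int64 dtypes
TRAIN_DTYPES = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

def load_in_chunks(path, chunksize=1_000_000):
    chunks = []
    for chunk in pd.read_csv(path, dtype=TRAIN_DTYPES,
                             parse_dates=['pickup_datetime'],
                             chunksize=chunksize):
        chunks.append(chunk)  # per-chunk filtering could also happen here
    return pd.concat(chunks, ignore_index=True)
```

Filtering each chunk before concatenation (e.g. dropping bad coordinates early) keeps peak memory even lower.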
A quick first look at the data:
display(df.head())     # preview the first 5 rows
display(df.info())     # column dtypes and memory usage
display(df.describe()) # summary statistics of numeric columns (NaNs excluded): count, mean, std, percentiles, min/max
| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 8 columns):
key object
fare_amount float64
pickup_datetime datetime64[ns]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtypes: datetime64(1), float64(5), int64(1), object(1)
memory usage: 122.1+ MB
| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 2.000000e+06 | 2.000000e+06 | 2.000000e+06 | 1.999986e+06 | 1.999986e+06 | 2.000000e+06 |
| mean | 1.134779e+01 | -7.252321e+01 | 3.992963e+01 | -7.252395e+01 | 3.992808e+01 | 1.684113e+00 |
| std | 9.852883e+00 | 1.286804e+01 | 7.983352e+00 | 1.277497e+01 | 1.032382e+01 | 1.314982e+00 |
| min | -6.200000e+01 | -3.377681e+03 | -3.458665e+03 | -3.383297e+03 | -3.461541e+03 | 0.000000e+00 |
| 25% | 6.000000e+00 | -7.399208e+01 | 4.073491e+01 | -7.399141e+01 | 4.073400e+01 | 1.000000e+00 |
| 50% | 8.500000e+00 | -7.398181e+01 | 4.075263e+01 | -7.398016e+01 | 4.075312e+01 | 1.000000e+00 |
| 75% | 1.250000e+01 | -7.396713e+01 | 4.076710e+01 | -7.396369e+01 | 4.076809e+01 | 2.000000e+00 |
| max | 1.273310e+03 | 2.856442e+03 | 2.621628e+03 | 3.414307e+03 | 3.345917e+03 | 2.080000e+02 |
Data Preprocessing
The minimum fare_amount is negative, so filter those rows out:
print("Old Size: {oldsize}".format(oldsize=len(df)))
df_train=df[df['fare_amount']>=0]
print("New Size: {newsize}".format(newsize=len(df_train)))
Old Size: 2000000
New Size: 1999923
Check for missing values:
df_train.isnull().agg(['any','mean','sum','count'])
# any: column contains missing values; mean: fraction missing; sum: number missing; count: total rows
| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| any | False | False | False | False | False | True | True | False |
| mean | 0 | 0 | 0 | 0 | 0 | 7.00027e-06 | 7.00027e-06 | 0 |
| sum | 0 | 0 | 0 | 0 | 0 | 14 | 14 | 0 |
| count | 1999923 | 1999923 | 1999923 | 1999923 | 1999923 | 1999923 | 1999923 | 1999923 |
The data is nearly complete, so we can simply drop the rows with missing values:
print("Old Size: {oldsize}".format(oldsize=len(df_train)))
df_train=df_train.dropna(how='any',axis=0)
print("New Size: {newsize}".format(newsize=len(df_train)))
Old Size: 1999923
New Size: 1999909
Exploratory Data Analysis
First, look at the distribution of the training target, fare_amount. Values concentrate between 0 and 20 but span a wide range, as the describe() output above also showed:
plt.figure(figsize=(16,4))
# distribution of all fare_amount values
plt.subplot(1,2,1)
sns.distplot(df_train['fare_amount'])
# distribution of fare_amount < 100
plt.subplot(1,2,2)
sns.distplot(df_train[df_train['fare_amount']<100]['fare_amount'])
plt.show()
Latitude and longitude ranges of the test-set pickup and dropoff points:
# latitude range of pickup/dropoff points
min(df_test.pickup_latitude.min(),df_test.dropoff_latitude.min()),\
max(df_test.pickup_latitude.max(),df_test.dropoff_latitude.max())
# longitude range of pickup/dropoff points
min(df_test.pickup_longitude.min(),df_test.dropoff_longitude.min()),\
max(df_test.pickup_longitude.max(),df_test.dropoff_longitude.max())
(40.568973, 41.709555)
(-74.263242, -72.986532)
These roughly match New York City's latitude/longitude range (looked up online):
# full map
BB = (-74.5, -72.8, 40.5, 41.8)
nyc_map=plt.imread('map')
# zoomed-in map
BB_zoom = (-74.3, -73.7, 40.5, 40.9)
nyc_map_zoom = plt.imread('map_zoom')
def select_within_boundingbox(df,bd):
    '''
    - Return a boolean mask of the rows whose coordinates fall inside the box
    - df: dataset to filter
    - bd: box boundaries (lon_min, lon_max, lat_min, lat_max)
    '''
    idx=(df['pickup_longitude']>=bd[0]) & (df['pickup_longitude']<=bd[1])
    idx&=(df['dropoff_longitude']>=bd[0]) & (df['dropoff_longitude']<=bd[1])
    idx&=(df['pickup_latitude']>=bd[2]) & (df['pickup_latitude']<=bd[3])
    idx&=(df['dropoff_latitude']>=bd[2]) & (df['dropoff_latitude']<=bd[3])
    return idx
# keep only training points inside the test-set bounding box
print("Old Size: {oldsize}".format(oldsize=len(df_train)))
df_train=df_train[select_within_boundingbox(df_train,BB)]
print("New Size: {newsize}".format(newsize=len(df_train)))
Old Size: 1999909
New Size: 1957918
Since we will be plotting on the map repeatedly, define a helper function first:
def plot_on_map(df,BB,nyc_map,s=10,alpha=0.2):
fig,axs=plt.subplots(1,2,figsize=(16,10))
axs[0].scatter(df.pickup_longitude,df.pickup_latitude,zorder=1,alpha=alpha,c='r',s=s)
axs[0].set_xlim((BB[0],BB[1]))
axs[0].set_ylim((BB[2],BB[3]))
axs[0].set_title('Pickup Locations')
axs[0].imshow(nyc_map, zorder=0, extent=BB)
axs[1].scatter(df.dropoff_longitude,df.dropoff_latitude,zorder=1,alpha=alpha,c='r',s=s)
axs[1].set_xlim((BB[0],BB[1]))
axs[1].set_ylim((BB[2],BB[3]))
axs[1].set_title('Dropoff Locations')
axs[1].imshow(nyc_map,zorder=0,extent=BB)
Pickup and dropoff locations in the training set:
# full view
plot_on_map(df_train,BB,nyc_map,alpha=0.2,s=0.5)
# zoomed in
plot_on_map(df_train,BB_zoom,nyc_map_zoom,alpha=0.2,s=0.5)
Now the same for the test set:
plot_on_map(df_test,BB,nyc_map,alpha=0.2,s=0.5)
We have plenty of data, so let's see how the scatter looks without the map underneath:
def plot_without_map(df,BB,figsize=(12,12),c=('r','b')):
fig,ax=plt.subplots(1,1,figsize=figsize)
idx=select_within_boundingbox(df,BB)
ax.scatter(df[idx].pickup_longitude, df[idx].pickup_latitude, c=c[0], s=0.01, alpha=0.5)
ax.scatter(df[idx].dropoff_longitude, df[idx].dropoff_latitude, c=c[1], s=0.01, alpha=0.5)
# zoom to where the red (pickup) points are densest
plot_without_map(df_train, (-74.1, -73.7, 40.6, 40.9))
# zoom further into the busiest area
plot_without_map(df_train, (-74, -73.95, 40.7, 40.8))
The maps reveal some outliers: a few coordinates land in the water…
Let's filter those out:
def remove_datapoints_from_water(df,BB):
    # convert longitude/latitude coordinates to pixel coordinates on the mask image
    def lonla_to_xy(longitude,latitude,dx,dy,BB):
        '''
        - longitude: longitude to transform to x
        - latitude: latitude to transform to y
        - dx: image width in pixels
        - dy: image height in pixels
        '''
        return (dx*(longitude-BB[0])/(BB[1]-BB[0])).astype('int'),(dy-dy*(latitude-BB[2])/(BB[3]-BB[2])).astype('int')
    # load the mask (it can be cut out in Photoshop); values > 0.9 are the white, valid-land area
    nyc_mask=plt.imread('nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] >0.9
    pickup_x,pickup_y=lonla_to_xy(df.pickup_longitude,df.pickup_latitude,nyc_mask.shape[1],nyc_mask.shape[0],BB)
    dropoff_x,dropoff_y=lonla_to_xy(df.dropoff_longitude,df.dropoff_latitude,nyc_mask.shape[1],nyc_mask.shape[0],BB)
    idx=((nyc_mask[pickup_y,pickup_x]) & (nyc_mask[dropoff_y,dropoff_x]))
    return df[idx]
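The nested coordinate transform is worth a sanity check. Below is a standalone copy of it (lifted out by me, not part of the original post), verifying that the bounding-box corners map to the image corners. One caveat the check exposes: a point exactly on the east or north edge maps one pixel past the end of the mask array, so production code may want an np.clip before indexing.

```python
import pandas as pd

BB = (-74.5, -72.8, 40.5, 41.8)

def lonla_to_xy(longitude, latitude, dx, dy, BB):
    # standalone copy of the helper nested in remove_datapoints_from_water
    x = (dx * (longitude - BB[0]) / (BB[1] - BB[0])).astype('int')
    y = (dy - dy * (latitude - BB[2]) / (BB[3] - BB[2])).astype('int')
    return x, y

# south-west corner should map to pixel (0, dy), north-east corner to (dx, 0);
# note x == dx and y == dy are one past the last valid index of a dx*dy image
corners_lon = pd.Series([BB[0], BB[1]])
corners_lat = pd.Series([BB[2], BB[3]])
x, y = lonla_to_xy(corners_lon, corners_lat, 100, 100, BB)
```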
print("Old Size: %d"%len(df_train))
df_train=remove_datapoints_from_water(df_train,BB)
print("New Size: %d"%len(df_train))
This removes nearly 400 more anomalous rows:
Old Size: 1957918
New Size: 1957530
The scatter plots give a rough sense of where pickups and dropoffs cluster on the map.
A heatmap describes this more precisely. First compute the density for each pixel (pickups or dropoffs per square mile):
# heatmap resolution: 200 pixels along longitude and latitude, i.e. a 200*200 grid
n_lon,n_lat=200,200
# pickup/dropoff density for each pixel
density_pickup,density_dropoff=np.zeros((n_lon,n_lat)),np.zeros((n_lon,n_lat))
# longitude/latitude span of each pixel (delta_lon, delta_lat) and the bin edges (bins_lon, bins_lat)
delta_lon=(BB[1]-BB[0])/n_lon
delta_lat=(BB[3]-BB[2])/n_lat
bins_lon=[BB[0]+i*delta_lon for i in range(n_lon+1)]
bins_lat=[BB[2]+i*delta_lat for i in range(n_lat+1)]
# np.digitize assigns each data point to a pixel (it returns the 1-based index of the interval each value falls in)
inds_pickup_lon=np.digitize(df_train.pickup_longitude,bins_lon)
inds_pickup_lat=np.digitize(df_train.pickup_latitude,bins_lat)
inds_dropoff_lon=np.digitize(df_train.dropoff_longitude,bins_lon)
inds_dropoff_lat=np.digitize(df_train.dropoff_latitude,bins_lat)
# great-circle distance in miles between two latitude/longitude points
def distance(lat1,lon1,lat2,lon2):
p = 0.017453292519943295 # Pi/180
a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
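A quick sanity check of this formula (repeated here so the snippet is self-contained): the great-circle distance from the downtown-Manhattan coordinates to JFK, both used later in this post, should come out near 13 miles.

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    # same formula as above: 12742 km Earth diameter, 0.6213712 miles per km
    p = 0.017453292519943295  # pi / 180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p)
         * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# downtown Manhattan to JFK, roughly 13 miles as the crow flies
d = distance(40.7141667, -74.0063889, 40.6441667, -73.7822222)
```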
# physical size of each pixel in miles
bin_width_miles=distance(BB[2],BB[0],BB[2],BB[1])/n_lon
bin_height_miles=distance(BB[2],BB[0],BB[3],BB[0])/n_lat
# pickup/dropoff density per pixel, in datapoints per square mile
dxdy=bin_width_miles*bin_height_miles # area of one pixel in square miles
for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j,i]=np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j)))/dxdy
        density_dropoff[j,i]=np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j)))/dxdy
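The i+1 in the loop comes from np.digitize's 1-based convention, which a toy example makes clear:

```python
import numpy as np

# np.digitize is 1-based: values below the first edge get 0, values in the
# i-th interval get i, and values past the last edge get len(bins)
bins = [0.0, 1.0, 2.0, 3.0]  # three intervals between four edges
inds = np.digitize([-0.5, 0.2, 1.7, 2.9, 3.5], bins)
# inds is [0, 1, 2, 3, 4]
```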
With these densities, plot the pickup and dropoff heatmaps:
fig, axs = plt.subplots(2, 1, figsize=(18, 24))
axs[0].imshow(nyc_map, zorder=0, extent=BB)
# the density is highly skewed, so apply np.log1p to smooth it toward a Gaussian for easier visualization and fitting
im = axs[0].imshow(np.log1p(density_pickup), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[0].set_title('Pickup density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[0])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)
axs[1].imshow(nyc_map, zorder=0, extent=BB);
im = axs[1].imshow(np.log1p(density_dropoff), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[1].set_title('Dropoff density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[1])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)
Having explored the spatial patterns, let's turn to the time dimension.
First, some light feature engineering: add year, weekday, and hour columns:
df_train['year']=df_train.pickup_datetime.apply(lambda t: t.year)
df_train['weekday']=df_train.pickup_datetime.apply(lambda t: t.weekday())
df_train['hour']=df_train.pickup_datetime.apply(lambda t: t.hour)
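These apply(lambda ...) calls run a Python function per row; pandas' vectorized .dt accessor gives identical results and is much faster on 2 million rows. A minimal sketch on a toy frame:

```python
import pandas as pd

# vectorized alternative to apply(lambda t: t.year) etc.
df = pd.DataFrame({'pickup_datetime': pd.to_datetime(
    ['2009-06-15 17:26:21', '2010-01-05 16:52:16'])})
df['year'] = df.pickup_datetime.dt.year
df['weekday'] = df.pickup_datetime.dt.weekday  # Monday = 0
df['hour'] = df.pickup_datetime.dt.hour
```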
Then compute the traffic in each time window, much like the density heatmap above:
n_hours=24
n_weekdays = 7
n_years = 7
n_bins_lon = 30
n_bins_lat = 30
# focus on traffic on Manhattan
BB_traffic=(-74.025, -73.925, 40.7, 40.8)
# define a function to calculate pickup traffic density
def caculate_traffic_density(df):
traffic=np.zeros((n_years,n_weekdays,n_hours,n_bins_lat,n_bins_lon))
delta_lon=(BB_traffic[1]-BB_traffic[0])/n_bins_lon
delta_lat=(BB_traffic[3]-BB_traffic[2])/n_bins_lat
    bins_lon= [BB_traffic[0]+i*delta_lon for i in range(n_bins_lon+1)]
    bins_lat= [BB_traffic[2]+i*delta_lat for i in range(n_bins_lat+1)]
for y in range(n_years):
for w in range(n_weekdays):
for h in range(n_hours):
# filter data by year,weekday, hour
idx=(df.year==(2009+y)) & (df.weekday==w) & (df.hour==h)
idx_pickup_lon=np.digitize(df[idx].pickup_longitude,bins_lon)
idx_pickup_lat=np.digitize(df[idx].pickup_latitude,bins_lat)
for i in range(n_bins_lon):
for j in range(n_bins_lat):
traffic[y,w,h,j,i]+=np.sum((idx_pickup_lon==(i+1)) & (idx_pickup_lat==(j+1)))
return traffic
Define a plotting helper that, for a given year and weekday, draws the pickup-density heatmap for each hour of that day:
def plot_traffic(traffic,y,d):
days = {'monday' : 0, 'tuesday' : 1, 'wednesday' : 2, 'thursday' : 3, 'friday' : 4, 'saturday' : 5, 'sunday' : 6}
fig, axs = plt.subplots(3,8,figsize=(18,7))
axs = axs.ravel()
for i in range(24):
axs[i].imshow(traffic[y-2009,days[d],i,::-1,:],zorder=1,cmap='coolwarm',clim=(0, traffic.max()))
axs[i].get_xaxis().set_visible(False)
axs[i].get_yaxis().set_visible(False)
axs[i].set_title(f'h={i}')
fig.suptitle("Pickup traffic density, year={}, day={} (max_pickups={})".format(y, d, traffic.max()))
Start with 2009, picking three distinctive days: Monday, Friday, and Sunday. Monday's early-morning cells are nearly dark, presumably because people cab home early with work the next day. Friday and Sunday are another story; Sunday in particular still shows a sizable number of pickups at 4 a.m. With no work on Sunday, the night can run until dawn:
traffic=caculate_traffic_density(df_train)
plot_traffic(traffic,2009,'monday')
plot_traffic(traffic,2009,'friday')
plot_traffic(traffic,2009,'sunday')
Now 2015: noticeably dimmer than 2009 overall, but with the same periodic rush-hour peaks and off-peak lulls:
plot_traffic(traffic,2015,'monday')
plot_traffic(traffic,2015,'friday')
plot_traffic(traffic,2015,'sunday')
Fares may also differ by time of day. The pivot below uses a fare_per_mile column, which is presumably the fare divided by the trip's great-circle distance (computed here with the distance() helper defined above):
# fare per mile; assumes fare_per_mile = fare_amount / great-circle trip distance
df_train['fare_per_mile']=df_train.fare_amount/distance(df_train.pickup_latitude,df_train.pickup_longitude,df_train.dropoff_latitude,df_train.dropoff_longitude)
df_train.pivot_table('fare_per_mile',index='hour',columns='year',aggfunc=np.mean).plot(figsize=(14,6))
plt.ylabel('Fare $USD / mile')
Another major factor in the fare is trip distance.
Most trips are under 20 miles:
# great-circle distance from the pickup/dropoff coordinates
df_train['distance_miles']=distance(df_train.pickup_latitude,df_train.pickup_longitude,df_train.dropoff_latitude,df_train.dropoff_longitude)
df_train.distance_miles.hist(bins=50,figsize=(12,4))
plt.xlabel('distance miles')
plt.title('Histogram ride distances in miles')
The fare may also depend on passenger count; compute the mean distance and fare per group, plus the overall fare per mile:
df_train.groupby('passenger_count')[['distance_miles', 'fare_amount']].mean()
print("Average $USD/Mile : {:0.2f}".format(df_train.fare_amount.sum()/df_train.distance_miles.sum()))
| passenger_count | distance_miles | fare_amount |
|---|---|---|
| 0 | 1.732107 | 8.813325 |
| 1 | 2.040013 | 11.202837 |
| 2 | 2.179313 | 11.800007 |
| 3 | 2.100634 | 11.519105 |
| 4 | 2.132165 | 11.721121 |
| 5 | 2.072168 | 11.215987 |
| 6 | 2.122414 | 12.169590 |
| 9 | 8.106351 | 104.000000 |
Average $USD/Mile : 5.48
How are fare and distance jointly distributed? Scatter plots again:
fig,axs=plt.subplots(1,2,figsize=(16,6))
# all data
axs[0].scatter(df_train.distance_miles,df_train.fare_amount,alpha=0.2,s=5)
axs[0].set_xlabel('distance_miles')
axs[0].set_ylabel('fare_amount')
axs[0].set_title('All data')
# zoom in to distance_miles<=20 & fare_amount<=120
idx=(df_train.distance_miles<=20) & (df_train.fare_amount<=120)
axs[1].scatter(df_train[idx].distance_miles,df_train[idx].fare_amount,alpha=0.2,s=5)
axs[1].set_xlabel('distance_miles')
axs[1].set_ylabel('fare_amount')
axs[1].set_title('Zoom In')
The right-hand plot shows several horizontal lines, i.e. trips at fixed fares.
These are likely flat-rate runs between an airport and the city center. Start with JFK:
# JFK airport (longitude, latitude)
jfk = (-73.7822222222, 40.6441666667)
# city center (longitude, latitude)
nyc = (-74.0063889, 40.7141667)
def plot_location_fare(loc,name,rng=0.5):
fig,axs=plt.subplots(1,2,figsize=(18,5))
idx=distance(df_train.pickup_latitude,df_train.pickup_longitude,loc[1],loc[0])<rng
df_train[idx].fare_amount.hist(bins=100,ax=axs[0])
axs[0].set_xlabel('fare $USD')
axs[0].set_title(f'Histogram pickup location within {rng} miles of {name}')
idx=distance(df_train.dropoff_latitude,df_train.dropoff_longitude,loc[1],loc[0])<rng
df_train[idx].fare_amount.hist(bins=100,ax=axs[1])
axs[1].set_xlabel('fare $USD')
axs[1].set_title(f'Histogram dropoff location within {rng} miles of {name}')
plot_location_fare(jfk,'JFK Airport')
Sure enough, a few fare bars stand out sharply! Now check the other two airports:
ewr = (-74.175, 40.69)
lgr = (-73.87, 40.77)
plot_location_fare(ewr, 'Newark Airport')
plot_location_fare(lgr, 'LaGuardia Airport')
Since pickup and dropoff locations clearly correlate with the fare, adding location-based features should improve prediction accuracy!
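As a closing sketch of that idea (the column names and helper are my own, not from the original post), distance-to-landmark features can be built from the same distance() formula used throughout:

```python
import numpy as np
import pandas as pd

def distance(lat1, lon1, lat2, lon2):
    # same great-circle formula used throughout this post
    p = 0.017453292519943295  # pi / 180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p)
         * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# landmark coordinates from earlier in the post: (longitude, latitude)
LANDMARKS = {'jfk': (-73.7822222222, 40.6441666667),
             'ewr': (-74.175, 40.69),
             'lgr': (-73.87, 40.77)}

def add_airport_features(df):
    # one distance-to-landmark column per airport, for pickup and dropoff
    for name, (lon, lat) in LANDMARKS.items():
        df[f'pickup_dist_{name}'] = distance(
            df.pickup_latitude, df.pickup_longitude, lat, lon)
        df[f'dropoff_dist_{name}'] = distance(
            df.dropoff_latitude, df.dropoff_longitude, lat, lon)
    return df
```

A model could then learn the flat-rate airport fares from small pickup_dist_* or dropoff_dist_* values.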