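The mobile internet era has turned every traveler into a contributor of traffic information: very large volumes of location data are processed and fused in the cloud to produce traffic information that covers the whole city at all hours with no blind spots. This algorithm challenge takes "Smart Traffic Prediction in the Mobile Internet Era" as its theme. Participants are invited to build models on internet traffic data that accurately predict the travel time of each key road segment in a given time period, anticipating fluctuations in traffic conditions and supporting smart travel and intelligent urban traffic management. The organizers evaluate each submitted algorithm by computing the error between the submitted predictions and the recorded true values, which determines the prediction accuracy.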
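Dataset description:
a: Link attribute table
Each travel direction of each road is made up of several road segments (links). The dataset provides the unique identifier, length, width, and road type of every link, as shown in Table 1; Figure 1 illustrates the attributes of the surface-road links link1 and link2.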
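As a minimal sketch of how such static attributes can be attached to travel-time records as features: the two small tables below are hypothetical, their values are made up for illustration, and only the column names (`link_id`, `length`, `width`, `link_class`, `travel_time`) follow the ones used in the script later in this post.

```python
import pandas as pd

# Hypothetical link attribute rows (illustrative values, not taken from the dataset).
link_info = pd.DataFrame({
    "link_id": ["link1", "link2"],
    "length": [57, 247],
    "width": [3, 9],
    "link_class": [1, 1],
})

# Hypothetical travel-time records keyed by the same link_id.
records = pd.DataFrame({
    "link_id": ["link1", "link2", "link1"],
    "travel_time": [9.5, 31.2, 10.1],
})

# A left join keeps every record and adds the static attributes of its link.
features = records.merge(link_info, on="link_id", how="left")
print(features)
```

A left join is used so that a record whose link is missing from the attribute table is kept (with empty attribute values) rather than silently dropped.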
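b: Link upstream/downstream relation table
Links are related as upstream and downstream according to the direction in which vehicles are allowed to travel. The dataset provides the direct upstream links and direct downstream links of every link, as shown in Table 2; Figure 2 illustrates the in_links and out_links of link2.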
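The upstream/downstream relation can be held as two lookup dictionaries. The sketch below is an assumption about the layout rather than the official file format: the toy rows, the `#` separator inside `in_links`/`out_links`, and the helper `split_links` are all hypothetical.

```python
import pandas as pd

# Hypothetical topology rows; '#' as the separator between link ids is an assumption.
topology = pd.DataFrame({
    "link_id": ["link1", "link2", "link3"],
    "in_links": ["", "link1", "link2"],
    "out_links": ["link2", "link3", ""],
})


def split_links(value, sep="#"):
    """Turn a separator-joined string of link ids into a list ('' gives [])."""
    return [item for item in value.split(sep) if item]


# Direct upstream and downstream neighbours of every link.
upstream = {row.link_id: split_links(row.in_links) for row in topology.itertuples()}
downstream = {row.link_id: split_links(row.out_links) for row in topology.itertuples()}

# Neighbour counts, analogous to the inlinks_num / outlinks_num columns used later.
topology["inlinks_num"] = topology["link_id"].map(lambda k: len(upstream[k]))
topology["outlinks_num"] = topology["link_id"].map(lambda k: len(downstream[k]))
print(topology[["link_id", "inlinks_num", "outlinks_num"]])
```

With these dictionaries, neighbour-based features such as the average travel time of a link's upstream links can be computed by looking up each neighbour's own records.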
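c: Link historical travel time table
The dataset records the average travel time on every link for each 2-minute time slot of each historical day. The average travel time of a slot is computed from the travel times, on that link, of the vehicles that entered the link during the slot.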
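To make the aggregation concrete, here is a small sketch of averaging per-vehicle travel times into 2-minute slots keyed by the time the vehicle entered the link. The observation table, its column names, and the slot index `time_slot` (counted from midnight) are assumptions for illustration; the competition data already comes aggregated.

```python
import pandas as pd

# Hypothetical per-vehicle observations: entry time into a link and time spent on it.
obs = pd.DataFrame({
    "link_id": ["link1", "link1", "link1", "link2"],
    "enter_time": pd.to_datetime([
        "2016-03-01 06:00:10", "2016-03-01 06:01:50",
        "2016-03-01 06:02:05", "2016-03-01 06:00:30",
    ]),
    "travel_time": [10.0, 14.0, 20.0, 30.0],
})

# Floor each entry time to the start of its 2-minute slot, e.g. 06:01:50 -> 06:00:00.
obs["slot_start"] = obs["enter_time"].dt.floor("2min")

# Average travel time per link and slot.
slot_avg = obs.groupby(["link_id", "slot_start"], as_index=False)["travel_time"].mean()

# Slot index within the day: 0 for [00:00, 00:02), 180 for [06:00, 06:02), and so on.
minutes = slot_avg["slot_start"].dt.hour * 60 + slot_avg["slot_start"].dt.minute
slot_avg["time_slot"] = minutes // 2
print(slot_avg)
```

Grouping by entry time rather than exit time matches the definition above: a vehicle that enters at 06:01:50 counts toward the [06:00, 06:02) slot even if it leaves the link later.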
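The competition provides static information for 132 links, together with the upstream/downstream topology among them. It also provides the travel time of every link for each day from March 2016 through May 2016, and the average travel time of every link in the morning window [6:00-8:00) in June 2016.
Participants are asked to use the training data to predict the average travel time of every link for each 2-minute slot in [8:00-9:00) in June 2016.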
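The size of the prediction target follows from simple arithmetic. The sketch below enumerates the slots, assuming that every calendar day of June 2016 has to be predicted (the task statement does not list the days explicitly).

```python
import pandas as pd

N_LINKS = 132  # number of links stated in the task description

# Start times of the 2-minute slots in [08:00, 09:00) for one day;
# the last element (09:00) is dropped because the interval is half-open.
slots_one_day = pd.date_range("2016-06-01 08:00", "2016-06-01 09:00", freq="2min")[:-1]

# Assumption: all 30 days of June 2016 are predicted.
days = pd.date_range("2016-06-01", "2016-06-30", freq="D")

total_rows = N_LINKS * len(days) * len(slots_one_day)
print(len(slots_one_day), "slots per day,", total_rows, "predictions in total")
```

Under that assumption there are 30 slots per link per day, so a full submission contains 132 × 30 × 30 = 118,800 predicted values.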
Method 1: prediction with linear regression (LR)
The script below uses features extracted from the competition data:
# -*- coding: utf-8 -*-
import time

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Columns of the test file; the training file additionally has the 'travel_time' label.
TEST_COLUMNS = ['link_id', 'link_seq', 'length', 'width', 'link_class', 'start_date', 'week',
                'time_interval', 'time_slot', 'avg_travel_time', 'sd_travel_time',
                'inlinks_num', 'outlinks_num', 'inlinks_avg_travel_time', 'outlinks_avg_travel_time',
                'inlinks_atl_1', 'inlinks_atl_2', 'inlinks_atl_3', 'inlinks_atl_4',
                'outlinks_atl_1', 'outlinks_atl_2', 'outlinks_atl_3', 'outlinks_atl_4']
TRAIN_COLUMNS = TEST_COLUMNS[:9] + ['travel_time'] + TEST_COLUMNS[9:]
STR_COLUMNS = {'link_id', 'start_date', 'time_interval'}
INT_COLUMNS = {'link_seq', 'length', 'width', 'link_class', 'week', 'time_slot',
               'inlinks_num', 'outlinks_num'}
# Features fed to the model.
FEATURES = ['link_seq', 'time_slot', 'length', 'avg_travel_time',
            'inlinks_atl_1', 'inlinks_atl_2', 'inlinks_atl_3', 'inlinks_atl_4',
            'outlinks_atl_1', 'outlinks_atl_2', 'outlinks_atl_3', 'outlinks_atl_4']


def load_dataset(path, columns):
    """Read a headerless tab-separated file and cast each column to str, int or float."""
    df = pd.read_csv(path, sep='\t', encoding='utf8', names=columns)
    for col in columns:
        if col in STR_COLUMNS:
            df[col] = df[col].astype(str)
        elif col in INT_COLUMNS:
            df[col] = df[col].astype(int)
        else:
            df[col] = df[col].astype(float)
    return df


def main():
    label_ds = load_dataset(r"link_train_0801.txt", TRAIN_COLUMNS)  # labelled data
    unlabel_ds = load_dataset(r"link_test_0801.txt", TEST_COLUMNS)  # data to predict

    # Labelled rows before 2016-06-01 form the training set, the rest the validation set.
    dates = pd.to_datetime(label_ds["start_date"])
    train_df = label_ds.loc[dates < pd.Timestamp('2016-06-01')]
    valid_df = label_ds.loc[dates >= pd.Timestamp('2016-06-01')]

    outit = []    # per-link prediction frames
    mr_list = []  # per-link [link_seq, mape, rmse]
    for linkid in range(1, 133):
        # Select this link's training, validation and test rows.
        train_df_id = train_df.loc[train_df["link_seq"] == linkid]
        valid_df_id = valid_df.loc[valid_df["link_seq"] == linkid]
        test_df = unlabel_ds.loc[unlabel_ds["link_seq"] == linkid]
        print("training set: %d rows, %d columns" % train_df_id.shape)
        print("validation set: %d rows, %d columns" % valid_df_id.shape)
        print("test set: %d rows, %d columns" % test_df.shape)

        # Train one linear model per link; missing feature values are replaced by 0.
        # (sklearn.tree.DecisionTreeRegressor is a possible alternative model.)
        model_it = LinearRegression()
        model_it.fit(train_df_id[FEATURES].fillna(0), train_df_id['travel_time'])

        # Validate on the held-out rows.
        valid_y = valid_df_id['travel_time']
        pre_valid_y = model_it.predict(valid_df_id[FEATURES].fillna(0))
        mape_id = np.mean(np.abs(pre_valid_y - valid_y) / valid_y)  # mean absolute percentage error
        rmse_id = np.sqrt(metrics.mean_squared_error(valid_y, pre_valid_y))  # root mean squared error
        print("linkseq=%d mape=%s" % (linkid, mape_id))
        print("linkseq=%d RMSE=%s" % (linkid, rmse_id))
        mr_list.append([linkid, mape_id, rmse_id])

        # Predict the test rows of this link and keep them for the output file.
        test_info = test_df[['link_id', 'start_date', 'time_interval']].copy()
        test_info["travel_time"] = model_it.predict(test_df[FEATURES].fillna(0))
        outit.append(test_info)

    # Average the per-link scores over all links.
    mr_df = pd.DataFrame(mr_list)
    print("all mape:", mr_df[1].mean())
    print("all RMSE:", mr_df[2].mean())
    mr_df.to_csv('linkmape.txt', sep='#', index=False, header=False)
    pd.concat(outit).to_csv('outit.txt', sep='#', index=False, header=False)  # predictions


# Entry point
if __name__ == '__main__':
    start = time.perf_counter()
    main()
    print('finish all in %s' % (time.perf_counter() - start))
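For reference, a minimal sketch of three common error measures on toy numbers. The helper names `mae`, `mape`, and `rmse` and the sample values are hypothetical; the point is that MAPE divides each absolute error by the true value, whereas the mean absolute error (MAE) does not, so the two should not be confused when reading a per-link score.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes every true value is non-zero."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, which penalises large errors more heavily."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


# Toy travel times (illustrative values only).
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
print("MAE:", mae(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

Because MAPE is relative, an error of 2 on a true travel time of 10 weighs twice as much as the same error on a true travel time of 20, which matters when links differ widely in length.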