百度地图api的出行数据有一个小问题,就是它的换乘太多了
从北京到上海是一条线路,远一点到广州就需要换乘了,这要就需要两条数据记录
这样统计的话就很麻烦,旅途时间和终点站都不统一
用pandas来整合一下
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://root:123@localhost/baidutraffice')
需要的库和MySQL的连接
分析一下数据
如果不需要换乘,那么在journey_number相同的情况下,一个route_number对应一条线路
我们可以先把数据分成不换乘和换乘两部分
df = pd.read_sql('select * from journeys', engine)
#df1需要换乘
df1 = df[((df['journey_number'] == df['journey_number'].shift(1))
& (df['route_number'] == df['route_number'].shift(1)))
| ((df['journey_number'] == df['journey_number'].shift(-1)) &
(df['route_number'] == df['route_number'].shift(-1)))]
#df2不需换乘
df2 = df[((df['route_number'] != df['route_number'].shift(1))
& (df['route_number'] != df['route_number'].shift(-1)))
]
然后单独处理换乘部分
df3 = df1[df1['route_number'] == df1['route_number'].shift(-1)]
df3['The_duration'] = df1[df1['route_number'].shift(-1) == df1['route_number']].loc['The_duration'] = df1['The_duration'] + df1['The_duration'].shift(-1)
df3['vehicle_name'] = df1[df1['route_number'].shift(-1) == df1['route_number']].loc['vehicle_name'] = df1[
'vehicle_name'] + ',' + df1['vehicle_name'].shift( -1)
df3['arrive_station_name'] = df1[df1['route_number'].shift(-1) == df1['route_number']].loc['arrive_station_name'] = df1[
'arrive_station_name'].shift(-1)
df3['arrive_time'] = df1[df1['route_number'].shift(-1) == df1['route_number']].loc['arrive_time'] = df1[
'arrive_time'].shift(-1)
把时间,车名和终点站有机整合一下,最后合并保存
DF = pd.concat([df2, df3])
DF.to_sql(name='journeys_modify', con=engine, if_exists='replace', index=False)
这样就方便多了