DCIC2021 Study Notes - Data Analysis and Baseline Walkthrough

1. Goals:

  • TASK 1: Understand the competition task;
  • TASK 2: Walk through the baseline;

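2. Task 1 description:

To better understand how the morning-peak tidal phenomenon changes and trends, participants use the data provided by the organizers for data analysis and model building. They must identify the 40 areas where the tidal phenomenon is most pronounced during the weekday morning peak (07:00-09:00), list the IDs and names of the shared-bike parking spots in each area, and provide a description of the calculation method together with the model, as support for the next round of optimization measures.

  • Course link: https://coggle.club/learn
  • Shared-bike tidal point analysis: https://coggle.club/learn/dcic2021/task2
  • Shared-bike tidal point optimization: https://coggle.club/learn/dcic2021/task3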

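3. Difficulties

  • [Suggestion from 阿水] Interpreting the current dispatch plan;
  • [My own analysis] Splitting the time data, e.g. into workdays and non-workdays;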
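
The workday / morning-peak split mentioned in the list above can be sketched with pandas datetime accessors. This is only an illustration on made-up timestamps: the column names IS_WORKDAY and IS_MORNING_PEAK are hypothetical, the peak is assumed to be the half-open window [07:00, 09:00), and public holidays are not handled.

```python
import pandas as pd

# Hypothetical sample timestamps; real values would come from bike_order['UPDATE_TIME'].
orders = pd.DataFrame({
    "UPDATE_TIME": pd.to_datetime([
        "2020-12-21 07:30:00",  # Monday, inside the morning peak
        "2020-12-21 09:00:00",  # Monday, just outside the [07:00, 09:00) window
        "2020-12-26 08:15:00",  # Saturday, not a workday
        "2020-12-24 08:59:59",  # Thursday, inside the morning peak
    ])
})

# dayofweek is 0 for Monday ... 6 for Sunday, so values below 5 are Monday-Friday.
orders["IS_WORKDAY"] = orders["UPDATE_TIME"].dt.dayofweek < 5

# Hours 7 and 8 together cover 07:00:00 through 08:59:59.
orders["IS_MORNING_PEAK"] = orders["UPDATE_TIME"].dt.hour.isin([7, 8])

peak = orders[orders["IS_WORKDAY"] & orders["IS_MORNING_PEAK"]]
print(peak)
```

Filtering on a calendar-based flag like this keeps the rule explicit, which makes it easy to add a holiday list later if the data needs one.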

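4. Baseline walkthrough

4.1 Basic data processing

Import the third-party libraries and read the data. This is preparation work; it defines two key variables:

  • bike_fence: electronic fence (parking spot) data;
  • bike_order: shared-bike order data;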
import os, codecs
import pandas as pd
import numpy as np
import seaborn as sns

%pylab inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg')

from matplotlib import font_manager as fm, rcParams
import matplotlib.pyplot as plt

PATH = '../DATA/'

def bike_fence_format(s):
    # Strip the brackets, split on commas, and reshape the flat list
    # into 5 rows, one row of coordinates per fence vertex.
    s = s.replace('[', '').replace(']', '').split(',')
    s = np.array(s).astype(float).reshape(5, -1)
    return s

# Shared-bike parking spot (electronic fence) data
bike_fence = pd.read_csv(PATH + 'gxdc_tcd.csv')
bike_fence['FENCE_LOC'] = bike_fence['FENCE_LOC'].apply(bike_fence_format)

# Shared-bike order data, sorted per bike in time order
bike_order = pd.read_csv(PATH + 'gxdc_dd.csv')
bike_order = bike_order.sort_values(['BICYCLE_ID', 'UPDATE_TIME'])

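4.2 Geographic information processing

Use the third-party geohash library to map coordinates to geohash strings; precision is the number of characters in the hash.
The center of each electronic fence is computed from the fence's four corner coordinates.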
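
To make the precision parameter concrete, here is a minimal pure-Python sketch of the standard geohash encoding. It is not the python-geohash implementation, and the function name geohash_encode is hypothetical. Each output character carries 5 bits, alternating between longitude and latitude, so a longer hash only refines a shorter one.

```python
# Standard geohash base-32 alphabet (digits and letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair by repeatedly halving the lon/lat intervals."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, n_bits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)

short = geohash_encode(57.64911, 10.40744, precision=6)
long_ = geohash_encode(57.64911, 10.40744, precision=9)
print(short, long_)
```

Because a shorter hash is a prefix of a longer one for the same point, lowering precision groups nearby points into the same, larger cell.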

# Encode map locations as geohash strings
import geohash
from geopy.distance import geodesic

bike_order['geohash'] = bike_order.apply(lambda x: 
                        geohash.encode(x['LATITUDE'], x['LONGITUDE'], precision=9), axis=1)

# Find the center of each electronic fence
bike_fence['MIN_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 1]))
bike_fence['MAX_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 1]))

bike_fence['MIN_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 0]))
bike_fence['MAX_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 0]))

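# Note: despite its name, FENCE_AREA is the geodesic length (in meters) of the
# bounding-box diagonal, used here as a measure of fence size.
bike_fence['FENCE_AREA'] = bike_fence.apply(lambda x: geodesic(
    (x['MIN_LATITUDE'], x['MIN_LONGITUDE']), (x['MAX_LATITUDE'], x['MAX_LONGITUDE'])
).meters, axis=1)

# Average all rows except the last and reverse the column order,
# so the center is stored as (latitude, longitude).
bike_fence['FENCE_CENTER'] = bike_fence['FENCE_LOC'].apply(
    lambda x: np.mean(x[:-1, ::-1], 0)
)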
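
The parsing and center calculation above can be checked on a single made-up fence string. This sketch assumes each FENCE_LOC value holds five [longitude, latitude] pairs whose last pair repeats the first (a closed polygon), which is what reshape(5, -1) and x[:-1] suggest; the sample coordinates and the name parse_fence are hypothetical.

```python
import numpy as np

def parse_fence(s):
    # Same steps as bike_fence_format: drop brackets, split, reshape to 5 rows.
    flat = s.replace("[", "").replace("]", "").split(",")
    return np.array(flat).astype(float).reshape(5, -1)

# Hypothetical closed rectangle: 4 corners plus the first corner repeated.
raw = "[[118.10,24.50],[118.12,24.50],[118.12,24.52],[118.10,24.52],[118.10,24.50]]"
loc = parse_fence(raw)

# Drop the repeated last vertex, flip columns to (lat, lon), then average.
center_lat_lon = np.mean(loc[:-1, ::-1], axis=0)

# Averaging all 5 rows would count the first corner twice and bias the center.
biased_lat_lon = np.mean(loc[:, ::-1], axis=0)

print(loc.shape, center_lat_lon, biased_lat_lon)
```

Dropping the repeated vertex is what keeps the mean at the geometric middle of the rectangle.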

Map the bike locations to geohash values

# Encode the bike locations and the electronic fence centers as geohash strings
bike_order['geohash'] = bike_order.apply(
    lambda x: geohash.encode(x['LATITUDE'], x['LONGITUDE'], precision=6), 
axis=1)

bike_fence['geohash'] = bike_fence['FENCE_CENTER'].apply(
    lambda x: geohash.encode(x[0], x[1], precision=6)
)

# Timestamp processing; pay attention to the time slicing
bike_order['UPDATE_TIME'] = pd.to_datetime(bike_order['UPDATE_TIME'])
bike_order['DAY'] = bike_order['UPDATE_TIME'].dt.day.astype(object)
bike_order['DAY'] = bike_order['DAY'].apply(str)

bike_order['HOUR'] = bike_order['UPDATE_TIME'].dt.hour.astype(object)
bike_order['HOUR'] = bike_order['HOUR'].apply(str)
bike_order['HOUR'] = bike_order['HOUR'].str.pad(width=2,side='left',fillchar='0')

bike_order['DAY_HOUR'] = bike_order['DAY'] + bike_order['HOUR']
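
As a possible simplification of the steps above, the same DAY_HOUR key can be built in one call with dt.strftime. The sketch below compares both constructions on made-up timestamps; the variable names are hypothetical. One caveat: the concatenation approach does not zero-pad the day, so the two only agree for two-digit days, which holds for the dates used here.

```python
import pandas as pd

# Hypothetical timestamps standing in for bike_order['UPDATE_TIME'].
ts = pd.Series(pd.to_datetime([
    "2020-12-21 07:05:00",
    "2020-12-22 18:40:00",
    "2020-12-25 00:10:00",
]))

# Baseline-style key: day as a string plus the hour left-padded to 2 characters.
day = ts.dt.day.astype(str)
hour = ts.dt.hour.astype(str).str.pad(width=2, side="left", fillchar="0")
key_concat = day + hour

# One-step alternative: %d is the zero-padded day, %H the zero-padded hour.
key_strftime = ts.dt.strftime("%d%H")

print(key_concat.tolist(), key_strftime.tolist())
```

Zero-padding both parts keeps the keys the same width, so sorting them as strings matches chronological order within a month.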

4.3 Merging by latitude/longitude

Use pivot tables to inspect the inflow and outflow at the cell "wsk52r".

# Pivot tables
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1], 
                   values='LOCK_STATUS', index=['geohash'],
                    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)

bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0], 
                   values='LOCK_STATUS', index=['geohash'],
                    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)

# Inflow/outflow analysis for one specific cell: wsk52r
bike_inflow.loc['wsk52r'].plot()
bike_outflow.loc['wsk52r'].plot()
plt.xticks(list(range(bike_inflow.shape[1])), bike_inflow.columns, rotation=40)
plt.legend(['Inflow', 'OutFlow'])

[Figure: inflow and outflow per DAY_HOUR for cell wsk52r]

Using the inflow and outflow computed from the pivot tables, draw an hourly tidal heatmap.

# Pivot tables
bike_inflow_dayhour = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1], 
                   values='LOCK_STATUS', index=['geohash'],
                    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)

bike_outflow_dayhour = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0], 
                   values='LOCK_STATUS', index=['geohash'],
                    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)

# Plot
plt.figure(figsize=(15, 25))
sns.heatmap((bike_inflow_dayhour - bike_outflow_dayhour), vmin=-100, vmax=100, cmap='RdBu_r')

[Figure: heatmap of inflow minus outflow per geohash cell and DAY_HOUR]

# Overall tidal density
bike_remain = (bike_inflow - bike_outflow).fillna(0)
bike_remain[bike_remain < 0] = 0  
bike_remain = bike_remain.sum(1)
bike_fence['DENSITY'] = bike_fence['geohash'].map(bike_remain).fillna(0)
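
One detail of the net-flow step above is worth checking on toy data: subtracting two pivot tables aligns them on their index, so a cell that appears in only one table becomes NaN and is then set to 0 by fillna(0). The sketch below contrasts this with DataFrame.sub(..., fill_value=0), which treats the missing side as zero before subtracting. The numbers are made up, and whether this matters for the real data is an assumption to verify.

```python
import pandas as pd

cols = ["2107", "2108"]
# Hypothetical counts: cell "b" has only inflow, cell "c" has only outflow.
inflow = pd.DataFrame([[5, 3], [2, 0]], index=["a", "b"], columns=cols)
outflow = pd.DataFrame([[1, 4], [7, 1]], index=["a", "c"], columns=cols)

# Baseline-style: rows missing from either table become NaN, then 0.
baseline = (inflow - outflow).fillna(0)
baseline[baseline < 0] = 0
baseline = baseline.sum(axis=1)

# Alternative: fill the missing side with 0 before subtracting, then clip.
aligned = inflow.sub(outflow, fill_value=0).clip(lower=0).sum(axis=1)

print(baseline.to_dict())
print(aligned.to_dict())
```

In this toy case cell "b" keeps its inflow of 2 only in the second version, while cells with net outflow are clipped to 0 in both.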

4.4 Computing tidal points by nearest-neighbor latitude/longitude

import hnswlib
import numpy as np

p = hnswlib.Index(space='l2', dim=2)
p.init_index(max_elements=300000, ef_construction=1000, M=32)
p.set_ef(1024)
p.set_num_threads(14)

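# Index the fence centers, stored as (latitude, longitude)
p.add_items(np.stack(bike_fence['FENCE_CENTER'].values))

# For every order, look up the single nearest fence center (k=1)
index, dist = p.knn_query(bike_order[['LATITUDE','LONGITUDE']].values[:], k=1)
bike_order['fence'] = bike_fence.iloc[index.flatten()]['FENCE_ID'].values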

bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1], 
                   values='LOCK_STATUS', index=['fence'],
                    columns=['DAY'], aggfunc='count', fill_value=0
)

bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0], 
                   values='LOCK_STATUS', index=['fence'],
                    columns=['DAY'], aggfunc='count', fill_value=0
)

bike_remain = (bike_inflow - bike_outflow).fillna(0)
bike_remain[bike_remain < 0] = 0  
bike_remain = bike_remain.sum(1)

# bike_fence = bike_fence.set_index('FENCE_ID')
bike_density = bike_remain / bike_fence.set_index('FENCE_ID')['FENCE_AREA']

bike_density = bike_density.sort_values(ascending=False).reset_index()
bike_density = bike_density.fillna(0)

bike_density['label'] = '0'
bike_density.iloc[:100, -1] = '1'

bike_density['BELONG_AREA'] ='厦门'
bike_density = bike_density.drop(0, axis=1)

bike_density.columns = ['FENCE_ID', 'FENCE_TYPE', 'BELONG_AREA']
bike_density.to_csv('../RESULT/result_baseline.txt', index=None, sep='|')
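
The nearest-fence assignment above can be sanity-checked on a small sample with a brute-force search. The sketch below uses only the standard library and a haversine distance instead of hnswlib's approximate search over raw degrees; the fence IDs, coordinates, and function names are hypothetical, and a spherical Earth with radius 6,371 km is assumed.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_fence(point, centers):
    """Return the ID of the fence whose center is closest to point (lat, lon)."""
    return min(centers, key=lambda fid: haversine_m(*point, *centers[fid]))

# Hypothetical fence centers as (latitude, longitude).
centers = {
    "fence_A": (24.50, 118.10),
    "fence_B": (24.52, 118.10),
    "fence_C": (24.50, 118.14),
}

# Hypothetical order locations as (latitude, longitude).
orders = [(24.501, 118.101), (24.519, 118.102), (24.502, 118.135)]
assigned = [nearest_fence(p, centers) for p in orders]
print(assigned)
```

Brute force is far too slow for the full order table, but comparing it with the index result on a few hundred rows is a cheap way to confirm that the coordinate order and the fence IDs line up.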

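5. Interpreting the task:

The baseline mainly processes the existing data in layers and uses map libraries to merge the different data sources. Reading through the baseline, the code consists of the following steps:
Read data -> encode map locations (geohash) -> find the center of each electronic fence -> geohash the bike locations and fence centers -> time slicing -> pivot tables of daily flow

The data covers 21-24 December 2020:

  • The 21st-24th are clearly working days and show an obvious time pattern;
  • The 25th falls at Christmas, and its pattern is more irregular;
    Based on how the values change, the focus should be on the change in flow into and out of the parking spots per unit of time.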

6. Results:

Using the suggested library functions, the score is 17.4530.
With KNN, the score is 17.51, only a small improvement.

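7. Problems encountered and solutions

Q1: Installing geohash, geopy, and hnswlib
A1: Use the following:

  • geohash: install with "pip install python-geohash"
  • geopy: install with "pip install geopy"
  • hnswlib: the install failed with "Microsoft Visual C++ 14.0 is required"; download link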