Kaggle Case Study: Bike Sharing Demand

天青如水

Article link: https://blog.csdn.net/qq_16829085/article/details/105724612


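A bike-sharing system is a way of renting bicycles: users can pick up a bike at any location and return it once they reach their destination. These systems explicitly record trip duration, departure location, arrival location, and time, so the data can be used to study mobility within a city. In this project, the task is to combine historical usage patterns with weather data to forecast bike rental demand in Washington, D.C.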

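The data consists of hourly rental records spanning two years, together with weather and date information. The training set contains the first 19 days of each month; the test set covers the 20th through the end of each month.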

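Variable descriptions:

datetime - date (year, month, day) plus the hour
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is a holiday
workingday - whether the day is a working day
weather (weather level) - 1: clear, few clouds, partly cloudy. 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist. 3: light snow, light rain + thunderstorm + scattered clouds, light rain + clouds. 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
temp - temperature
atemp - "feels like" temperature
humidity - relative humidity
windspeed - wind speed
casual - number of rentals by non-registered (casual) users
registered - number of rentals by registered users
count - total number of rentals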

Data exploration

Missing value check
Outlier check

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

data_train = pd.read_csv('./data/train.csv')
data_test  = pd.read_csv('./data/test.csv')

data_train.info()
print('-'*40)
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB

The data has no missing values.
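The info() output answers the question indirectly through the non-null counts. A more direct check is sketched below on a hypothetical toy frame (the frame and its planted NaN are illustrative, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the NaN is planted on purpose.
demo = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan],
    "humidity": [81, 80, 75],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only when the whole frame is free of NaN.
has_no_missing = not demo.isnull().values.any()
print(has_no_missing)
```

Running the same two expressions on data_train and data_test should report zero missing values in every column if the conclusion above holds.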

data_train.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1

data_test.head()

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed
0	2011-01-20 00:00:00	1	1	1	10.66	11.365	56	26.0027
1	2011-01-20 01:00:00	1	1	1	10.66	13.635	56	0.0000
2	2011-01-20 02:00:00	1	1	1	10.66	13.635	56	0.0000
3	2011-01-20 03:00:00	1	1	1	10.66	12.880	56	11.0014
4	2011-01-20 04:00:00	1	1	1	10.66	12.880	56	11.0014

# Descriptive statistics
data_train.describe().T

	count	mean	std	min	25%	50%	75%	max
season	10886.0	2.506614	1.116174	1.00	2.0000	3.000	4.0000	4.0000
holiday	10886.0	0.028569	0.166599	0.00	0.0000	0.000	0.0000	1.0000
workingday	10886.0	0.680875	0.466159	0.00	0.0000	1.000	1.0000	1.0000
weather	10886.0	1.418427	0.633839	1.00	1.0000	1.000	2.0000	4.0000
temp	10886.0	20.230860	7.791590	0.82	13.9400	20.500	26.2400	41.0000
atemp	10886.0	23.655084	8.474601	0.76	16.6650	24.240	31.0600	45.4550
humidity	10886.0	61.886460	19.245033	0.00	47.0000	62.000	77.0000	100.0000
windspeed	10886.0	12.799395	8.164537	0.00	7.0015	12.998	16.9979	56.9969
casual	10886.0	36.021955	49.960477	0.00	4.0000	17.000	49.0000	367.0000
registered	10886.0	155.552177	151.039033	0.00	36.0000	118.000	222.0000	886.0000
count	10886.0	191.574132	181.144454	1.00	42.0000	145.000	284.0000	977.0000

Outlier check

count
casual
registered

# Check whether the distributions look Gaussian
fig,axes = plt.subplots(1,3)
# Set the figure size in inches (1 inch = 2.54 cm)
fig.set_size_inches(18,5)

sns.distplot(data_train['count'],bins=100,ax=axes[0])
sns.distplot(data_train['casual'],bins=100,ax=axes[1])
sns.distplot(data_train['registered'],bins=100,ax=axes[2])

[Figure: histograms of count, casual, and registered]

data_train[['count','casual','registered']].describe().T

	count	mean	std	min	25%	50%	75%	max
count	10886.0	191.574132	181.144454	1.0	42.0	145.0	284.0	977.0
casual	10886.0	36.021955	49.960477	0.0	4.0	17.0	49.0	367.0
registered	10886.0	155.552177	151.039033	0.0	36.0	118.0	222.0	886.0

fig,axes = plt.subplots(1,3)
fig.set_size_inches(12,6)

sns.boxplot(data = data_train['count'],ax=axes[0])
axes[0].set(xlabel='count')
sns.boxplot(data = data_train['casual'], ax=axes[1])
axes[1].set(xlabel='casual')
sns.boxplot(data = data_train['registered'], ax=axes[2])
axes[2].set(xlabel='registered')

[Figure: box plots of count, casual, and registered]

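count: the mean is 191, the standard deviation is 181, the median is 145, the 75th percentile is 284, and the maximum is 977, which indicates a long right tail. Remove the outliers, apply a log transform, and look at the result.

count = casual+registered

# Remove outliers: values more than 3σ away from the mean μ are treated as outliers
def drop_outlier(data,col):
    mask = np.abs(data[col]-data[col].mean())<(3*data[col].std())
    # .copy() so that adding the log column does not write into a view of the original frame
    data = data.loc[mask].copy()
    # Plot col and col_log after the outliers have been removed
    data[col+'_log'] = np.log1p(data[col])
    f, [ax1, ax2] = plt.subplots(1,2, figsize=(15,6))

    sns.distplot(data[col], ax=ax1)
    ax1.set_title(col+' distribution')

    sns.distplot(data[col+'_log'], ax=ax2)
    ax2.set_title(col+'_log distribution')
    return data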

data_train = drop_outlier(data_train,'count')

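[Figure: distributions of count and count_log]

data_train = drop_outlier(data_train,'casual')

[Figure: distributions of casual and casual_log]

data_train = drop_outlier(data_train,'registered')

[Figure: distributions of registered and registered_log]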
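The filtering step can be checked in isolation, without the plotting. The sketch below applies the same 3σ rule and log1p transform to synthetic data; the function name filter_3sigma, the Poisson sample, and the planted value of 5000 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def filter_3sigma(df, col):
    """Keep rows whose `col` lies within 3 standard deviations of its mean."""
    distance = (df[col] - df[col].mean()).abs()
    kept = df.loc[distance <= 3 * df[col].std()].copy()
    # log1p(x) = log(1 + x), so a count of 0 maps to 0 instead of -inf.
    kept[col + "_log"] = np.log1p(kept[col])
    return kept

rng = np.random.default_rng(0)
toy = pd.DataFrame({"count": rng.poisson(50, size=200)})
toy.loc[0, "count"] = 5000  # planted extreme value

cleaned = filter_3sigma(toy, "count")
print(len(toy), len(cleaned))
```

Note that the mean and standard deviation are themselves inflated by the extreme value, so the rule is fairly permissive on heavily skewed data; it removes only the far tail.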

Feature decomposition

Split the datetime feature into date, weekday, year, month, day, and hour.

def split_datetime(data):
    data['date'] = data['datetime'].apply(lambda x:x.split()[0])
    data['weekday'] =data['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d').isoweekday())
    data['year'] = data['date'].apply(lambda x:x.split('-')[0]).astype('int')
    data['month'] = data['date'].apply(lambda x:x.split('-')[1]).astype('int')
    data['day'] = data['date'].apply(lambda x:x.split('-')[2]).astype('int')
    data['hour'] = data['datetime'].apply(lambda x:x.split()[1].split(':')[0]).astype('int')
    return data

data_train = split_datetime(data_train)
data_train.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	count_log	date	weekday	year	month	day	hour
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	2.833213	2011-01-01	6	2011	1	1	0
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	3.713572	2011-01-01	6	2011	1	1	1
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	3.496508	2011-01-01	6	2011	1	1	2
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13	2.639057	2011-01-01	6	2011	1	1	3
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1	0.693147	2011-01-01	6	2011	1	1	4
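The same columns can also be derived without per-row string handling. This is a sketch of an alternative using pandas' .dt accessor, not the code used in this post; the function name split_datetime_vectorized is hypothetical.

```python
import pandas as pd

def split_datetime_vectorized(df):
    """Derive date, weekday, year, month, day, and hour with the .dt accessor."""
    out = df.copy()
    ts = pd.to_datetime(out["datetime"])
    out["date"] = ts.dt.strftime("%Y-%m-%d")
    # dayofweek is Monday=0 ... Sunday=6; +1 matches isoweekday() (Monday=1 ... Sunday=7).
    out["weekday"] = ts.dt.dayofweek + 1
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    return out

sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-01-09 18:00:00"]})
parts = split_datetime_vectorized(sample)
print(parts)
```

The two sample timestamps are a Saturday and a Monday, so the weekday column should read 6 and 1, matching the isoweekday convention used above.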

Visual analysis

Distribution analysis of numerical data
Box plot analysis of categorical data

data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10739 entries, 0 to 10885
Data columns (total 19 columns):
datetime      10739 non-null object
season        10739 non-null int64
holiday       10739 non-null int64
workingday    10739 non-null int64
weather       10739 non-null int64
temp          10739 non-null float64
atemp         10739 non-null float64
humidity      10739 non-null int64
windspeed     10739 non-null float64
casual        10739 non-null int64
registered    10739 non-null int64
count         10739 non-null int64
count_log     10739 non-null float64
date          10739 non-null object
weekday       10739 non-null int64
year          10739 non-null int64
month         10739 non-null int64
day           10739 non-null int64
hour          10739 non-null int64
dtypes: float64(4), int64(13), object(2)
memory usage: 2.0+ MB

fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],bins=60,ax=axes[0,0])
sns.distplot(data_train['atemp'],bins=60,ax=axes[0,1])
sns.distplot(data_train['humidity'],bins=60,ax=axes[1,0])
sns.distplot(data_train['windspeed'],bins=60,ax=axes[1,1])

[Figure: histograms of temp, atemp, humidity, and windspeed]

fig,axes = plt.subplots(2,2)
fig.set_size_inches(15,12)

sns.boxplot(x='season', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,0])
sns.boxplot(x='holiday', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,1])
sns.boxplot(x='workingday', y='count', data = data_train, orient='v', width=0.6, ax=axes[1,0])
sns.boxplot(x='weather',y='count',data=data_train,orient='v',width=0.6,ax=axes[1,1])

[Figure: box plots of count by season, holiday, workingday, and weather]

data_train['windspeed'].describe()

count    10739.000000
mean        12.787706
std          8.171075
min          0.000000
25%          7.001500
50%         12.998000
75%         16.997900
max         56.996900
Name: windspeed, dtype: float64

data_train.boxplot(['windspeed'])

[Figure: box plot of windspeed]

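The plots above show that many records have a wind speed of 0. These may be missing values that were filled with 0. Here we use a random forest to predict replacement values for the records whose wind speed is 0.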

np.sum(data_train['windspeed'] == 0),data_train['windspeed'].shape[0]

(1253, 10264)

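# Fill in wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor

def RFG_windspeed(data):
    # Split the data into rows where windspeed is 0 and rows where it is not
    mask = data['windspeed'] == 0
    # .copy() so the assignment below does not write into a view of data
    wind_0 = data[mask].copy()
    wind_1 = data[~mask]
    
    if len(wind_0.index)==0:
        return data

    Model_wind = RandomForestRegressor(n_estimators=1000,random_state=42)

    # Features used to predict windspeed
    cols = ["season","weather","humidity","month","temp","year","atemp"]
    windspeed_X = wind_1[cols]
    # Target values
    windspeed_y = wind_1['windspeed']

    windspeedpre_X = wind_0[cols]

    Model_wind.fit(windspeed_X,windspeed_y)

    # Predict windspeed for the zero-valued rows
    wind_0Values = Model_wind.predict(X=windspeedpre_X)

    # Fill in the predictions
    wind_0.loc[:,'windspeed'] = wind_0Values
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    # Note: the filled rows end up last, so the original row order changes.
    data = pd.concat([wind_1, wind_0]).reset_index(drop=True)
    return data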

data_train['windspeed'].head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: windspeed, dtype: float64

data_train = RFG_windspeed(data_train)
data_train['windspeed'].head()

0     6.0032
1    16.9979
2    19.0012
3    19.0012
4    19.9995
Name: windspeed, dtype: float64

Look at the density distributions of these four features again.

fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],ax=axes[0,0])
axes[0,0].set(xlabel='temp')

sns.distplot(data_train['atemp'],ax=axes[0,1])
axes[0,1].set(xlabel='atemp')

sns.distplot(data_train['humidity'],ax=axes[1,0])
axes[1,0].set(xlabel='humidity')

sns.distplot(data_train['windspeed'],ax=axes[1,1])
axes[1,1].set(xlabel='windspeed')

[Text(0.5,0,'windspeed')]

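[Figure: density plots of temp, atemp, humidity, and windspeed]

Now look at how the three rental counts relate to the other features overall.

# Use seaborn's pairplot for an overall view of the relationships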
cols =['season','holiday','workingday','weekday','weather','temp',
       'atemp','humidity','windspeed','hour']

sns.pairplot(data_train ,x_vars=cols,
             y_vars=['casual','registered','count'], 
             plot_kws={'alpha': 0.2})

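[Figure: pair plot of casual, registered, and count against the features]

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - holiday
workingday - working day
weather - weather level
temp - temperature
atemp - "feels like" temperature
humidity - relative humidity
windspeed - wind speed
casual - rentals by non-registered (casual) users
registered - rentals by registered users
count - total rentals

Observations:

Rentals are generally lower in the first season (season = 1).
Total rentals on non-holidays are higher than on holidays.
Registered users ride more on working days and less on non-working days; casual users show the opposite pattern.
Rentals decrease as the weather level rises.
Temperature and humidity have a large effect on casual users and a smaller effect on registered users.
The hour of day has a clear effect: registered users show two peaks, while casual users follow a roughly bell-shaped curve.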

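Check the correlation of each feature with count

# numeric_only=True skips the string columns (datetime, date), which pandas 2.0+ would otherwise reject
corr = data_train.corr(numeric_only=True)
plt.subplots(figsize=(14,14))
sns.heatmap(corr,annot=True,vmax=1,cmap='YlGnBu')

[Figure: correlation heatmap]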

# Sort by absolute correlation, descending
np.abs(corr['count']).sort_values(ascending=False)

count             1.000000
registered        0.977642
count_log         0.845294
registered_log    0.839080
casual_log        0.758863
casual            0.711105
hour              0.442659
temp              0.376656
atemp             0.372705
humidity          0.307362
month             0.176115
season            0.173657
year              0.171519
weather           0.123578
windspeed         0.116767
workingday        0.046246
weekday           0.022100
day               0.015243
holiday           0.002421
Name: count, dtype: float64

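Ranked by absolute correlation with count (leaving out casual, registered, and the log columns, which are derived from the target), the features are ordered as follows:
hour > temp > atemp ("feels like" temperature) > humidity > month > season > year
> weather (weather level) > windspeed > workingday > weekday > day (day of month) > holiday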
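The ranking above skips the columns that are components or transforms of the target. That exclusion can be made explicit in code; the sketch below uses a synthetic frame (the column values and the list named leaky are assumptions for illustration) and keeps in mind that Pearson correlation only captures linear association, so a feature like hour can matter more than its coefficient suggests.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temp": rng.uniform(0, 40, n),
    "holiday": rng.integers(0, 2, n),
})
# Synthetic rentals driven by temperature only.
toy["registered"] = (5 * toy["temp"] + rng.normal(0, 10, n)).round()
toy["casual"] = (2 * toy["temp"] + rng.normal(0, 10, n)).round()
toy["count"] = toy["registered"] + toy["casual"]

# Columns derived from the target; they are unavailable at prediction time.
leaky = ["casual", "registered", "count"]

ranking = (
    toy.corr()["count"]
       .drop(labels=leaky)
       .abs()
       .sort_values(ascending=False)
)
print(ranking)
```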

hour and count

# Overall trend by hour
date = data_train.groupby(['hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Total usage
plt.plot(date['hour'], date['count'], linewidth=1.3, label='count')
# Registered users
plt.plot(date['hour'], date['registered'], linewidth=1.3, label='registered')
# Casual users
plt.plot(date['hour'], date['casual'], linewidth=1.3, label='casual')
# The label arguments above give the legend something to show
plt.legend()

[Figure: mean rentals by hour]

# Relationship between hour and count on working days vs. non-working days
date = data_train.groupby(['workingday','hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

mask = date['workingday'] == 1

workingday_date= date[mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)
nworkingday_date = date[~mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)

fig, axes = plt.subplots(1,2,sharey = True)
workingday_date.plot(figsize=(15,5),title ='working day',ax=axes[0])
axes[0].set(xlabel='hour')
nworkingday_date.plot(figsize=(15,5),title ='nonworkdays',ax=axes[1])
axes[1].set(xlabel='hour')

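[Figure: mean rentals by hour on working days and non-working days]

Observations:

Working days
1. Registered users have two peaks at commuting hours, plus a smaller peak around noon, possibly people going out for lunch.
2. Casual usage is relatively flat, peaking around 17:00.
3. Registered users rent far more bikes than casual users.
Non-working days
1. Rentals (count) follow a roughly bell-shaped curve over the day, peaking around 12:00 with a trough around 4:00, and the distribution is fairly even.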

Temperature and count

Plot the overall temperature trend over the two years

# Aggregate by day, taking the median temperature of each day
temp_df = data_train.groupby(['date','weekday'],as_index=False).agg({'year':'mean',
                                                                     'month':'mean',
                                                                     'temp':'median'})
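# Drop missing data
# temp_df.dropna (axis=0,how ='any',inplace=True)

# The daily values are still expected to fluctuate a lot, so aggregate again by month (median of the daily values)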
temp_month = temp_df.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                    'temp':'median'})

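# Convert the date column of the daily data to datetime
temp_df['date']=pd.to_datetime(temp_df['date'])

# Build a date column for the monthly data; the minimum weekday (1-7) only serves as a placeholder day of the month
temp_month.rename(columns={'weekday':'day'},inplace=True)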
temp_month['date']=pd.to_datetime(temp_month[['year','month','day']])


# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of temperature over time
plt.plot(temp_df['date'] , temp_df['temp'], linewidth=1.3, label='daily median')
ax.set_title('Daily temperature over the two years')
plt.plot(temp_month['date'] , temp_month['temp'], marker='o',
         linewidth=1.3,label='monthly median')
ax.legend()

[Figure: daily and monthly temperature over the two years]
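The monthly series needs one date per (year, month) pair so that it can share an axis with the daily series. The code above borrows the minimum weekday as the day of the month; an alternative that pins every month to its first day is sketched here on a hypothetical toy frame (the temperature values are made up).

```python
import pandas as pd

monthly = pd.DataFrame({
    "year": [2011, 2011, 2012],
    "month": [1, 2, 1],
    "temp": [8.2, 11.5, 9.8],  # illustrative values only
})

# pd.to_datetime assembles dates from columns named year, month, and day.
monthly["date"] = pd.to_datetime(monthly[["year", "month"]].assign(day=1))
print(monthly)
```

Either choice only affects where the monthly markers sit horizontally within each month.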

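The temperature follows the same pattern each year, highest in July and lowest in January. Next, look at how the mean hourly rental count changes with temperature.

# Mean rentals at each temperature value
temp = data_train.groupby(['temp'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
temp.plot(figsize=(10,5),title='Rentals vs. temperature')

[Figure: mean rentals by temperature]

count reaches its minimum at about 4 degrees, then generally rises with temperature, and starts to fall once the temperature exceeds 35 degrees.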

Humidity and count

# Overall humidity trend over the two years
humidity_df = data_train.groupby(['date'],as_index=False).agg({'humidity':'mean'})
humidity_df['date']=pd.to_datetime(humidity_df['date'])

# Use the date as the time index
humidity_df = humidity_df.set_index('date')

humidity_month = data_train.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                         'humidity':'mean'})

# Build a date column for the monthly data
humidity_month.rename(columns={'weekday':'day'},inplace=True)
humidity_month['date']=pd.to_datetime(humidity_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of humidity over time
ax.set_title('Daily mean humidity over the two years')
plt.plot(humidity_df.index,humidity_df['humidity'], linewidth=1.3, label='daily mean')
plt.plot(humidity_month['date'],humidity_month['humidity'], marker='o',
         linewidth=1.3,label='monthly mean')
plt.grid()
ax.legend()

[Figure: daily and monthly mean humidity over the two years]

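Next, look at how rentals change with humidity by taking the mean rental count at each humidity value.

# Humidity
humidity = data_train.groupby(['humidity'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
humidity.plot(figsize=(10,5),title='Rentals vs. humidity')

[Figure: mean rentals by humidity]

Rentals climb quickly to a peak at a humidity of around 20 and then decline slowly.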

year, month and count

# First look at how total rentals change over the two years
count_df = data_train.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
                                                                      'month':'mean',
                                                                      'casual':'sum',
                                                                      'registered':'sum',
                                                                       'count':'sum'})

# The daily totals still fluctuate a lot, so also take the mean of the daily totals for each month
count_month = count_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                      'casual':'mean', 
                                                                      'registered':'mean',
                                                                      'count':'mean'})

# Convert the date column of the daily totals to datetime
count_df['date']=pd.to_datetime(count_df['date'])

# Build a date column for the monthly data
count_month.rename(columns={'weekday':'day'},inplace=True)
count_month['date']=pd.to_datetime(count_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of overall rentals (count) over time
ax.set_title('Overall trend of count over the two years')
plt.plot(count_df['date'],count_df['count'],linewidth=1.3,label='daily total')

plt.plot(count_month['date'],count_month['count'],marker='o',
         linewidth=1.3,label='monthly mean of daily totals')
plt.grid()
ax.legend()

[Figure: daily and monthly rentals over the two years]

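Observations:

Rentals in 2012 are higher overall than in 2011;
Rentals vary clearly by month;
The data fluctuates sharply from September to December 2011 and from March to September 2012;
There are many local troughs.

# Monthly usage trend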
date = data_train.groupby(['month'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(date['month'], date['count'] , linewidth=1.3 , label = 'total' )
plt.plot(date['month'], date['registered'] , linewidth=1.3 , label = 'registered' )
plt.plot(date['month'], date['casual'] , linewidth=1.3 , label = 'casual' )
plt.legend()

[Figure: mean rentals by month]

Season and count

day_df=data_train.groupby('date').agg({'year':'mean','season':'mean',
                                      'casual':'sum', 'registered':'sum'
                                      ,'count':'sum','temp':'mean',
                                      'atemp':'mean'})
season_df = day_df.groupby(['year','season'], as_index=True).agg({'casual':'mean', 
                                                                  'registered':'mean',
                                                                  'count':'mean'})
temp_df = day_df.groupby(['year','season'], as_index=True).agg({'temp':'mean', 
                                                                'atemp':'mean'})

fig = plt.figure(figsize=(10,10))
xlabels = season_df.index.map(lambda x:str(x))

ax1 = fig.add_subplot(2,1,1)
ax1.set_title('Rentals by season over the two years')
plt.plot(xlabels,season_df)
plt.legend(['casual','registered','count'])

ax2 = fig.add_subplot(2,1,2)
ax2.set_title('Temperature by season over the two years')
plt.plot(xlabels,temp_df)

plt.legend(['temp','atemp'])

[Figure: rentals and temperature by year and season]

Both casual and registered rentals peak in the fall, while spring has the lowest rentals.

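Weather and count

Different weather levels occur with different frequency; for example, very bad weather (4) is rare. So first check the number of records at each weather level, then take the hourly mean of rentals for each level.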

count_weather = data_train.groupby('weather')
count_weather[['casual','registered','count']].count()

	casual	registered	count
weather
1	6719	6719	6719
2	2705	2705	2705
3	839	839	839
4	1	1	1

weather_df = data_train.groupby('weather',as_index=True).agg({'casual':'mean',
                                                              'registered':'mean'})
weather_df.plot.bar(stacked=True)

[Figure: stacked bar chart of mean casual and registered rentals by weather level]
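A group mean is only as trustworthy as the number of rows behind it, so it can help to put the two side by side. The sketch below does this with a named aggregation on a hypothetical toy frame (the values are illustrative; only the single row at level 4 mirrors the situation in the table above).

```python
import pandas as pd

toy = pd.DataFrame({
    "weather": [1, 1, 1, 2, 2, 4],
    "count":   [200, 180, 220, 150, 130, 164],
})

# One row per weather level: how many records, and their mean rental count.
summary = toy.groupby("weather")["count"].agg(n_rows="count", mean_count="mean")
print(summary)
```

A level with n_rows of 1 reports that single record as its mean, which is worth knowing before reading anything into the bar height.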

Rentals at weather level 4 are also high, which seems implausible, so print the corresponding records for inspection.

data_train[data_train['weather']==4].T

	4863
datetime	2012-01-09 18:00:00
season	1
holiday	0
workingday	1
weather	4
temp	8.2
atemp	11.365
humidity	86
windspeed	6.0032
casual	6
registered	158
count	164
count_log	5.10595
casual_log	1.94591
registered_log	5.0689
date	2012-01-09
weekday	1
year	2012
month	1
day	9
hour	18

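This is the Monday evening rush hour, and weather level 4 has only this single record, so it is treated as an anomalous data point.

windspeed and count

# Overall wind speed trend over the two years
windspeed_df = data_train.groupby('date',as_index=False).agg({'windspeed':'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])

# Use the date as the time index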
windspeed_df = windspeed_df.set_index('date')


windspeed_month = data_train.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                           'windspeed':'mean'})
windspeed_month.rename(columns={'weekday':'day'},inplace=True)
windspeed_month['date']=pd.to_datetime(windspeed_month[['year','month','day']])

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(windspeed_df.index, windspeed_df['windspeed'] , linewidth=1.3,label='daily mean')
plt.plot(windspeed_month['date'], windspeed_month['windspeed'],
         marker='o', linewidth=1.3,label='monthly mean')
ax.legend()
ax.set_title('Overall wind speed trend over the two years')

[Figure: daily and monthly mean wind speed over the two years]

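Wind speed fluctuates strongly in September 2011 and between December 2011 and March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so averaging would give misleading values there; the rentals are therefore aggregated by wind speed using the maximum.

# Wind speed
# Convert to integer values
data_train['windspeed'] = data_train['windspeed'].astype(int)
# Take the maximum per wind speed, as described above
windspeed = data_train.groupby(['windspeed'], as_index=True).agg({'count':'max',
                                                                  'registered':'max',  
                                                                  'casual':'max'})
windspeed.plot(figsize=(10,8))

[Figure: rentals by wind speed]

Rentals decrease as wind speed increases, dropping noticeably once wind speed exceeds 18, but there is a rebound at around 20. As with the weather level, this is probably caused by anomalous records, so print them for inspection.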

df2=data_train[data_train['windspeed']>40]
df2=df2[df2['count']>150]
df2

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed	casual	...	count	count_log	casual_log	registered_log	date	weekday	year	month	day	hour
760	2011-02-19 14:00:00	1	0	1	18.86	22.725	15	43	102	...	196	5.283204	4.634729	4.553877	2011-02-19	6	2011	2	19	14
761	2011-02-19 15:00:00	1	0	1	18.04	21.970	16	50	84	...	171	5.147494	4.442651	4.477337	2011-02-19	6	2011	2	19	15
2447	2011-07-03 17:00:00	3	0	3	32.80	37.120	49	56	181	...	358	5.883322	5.204007	5.181784	2011-07-03	7	2011	7	3	17
2448	2011-07-03 18:00:00	3	0	3	32.80	37.120	49	56	74	...	181	5.204007	4.317488	4.682131	2011-07-03	7	2011	7	3	18
2941	2011-08-07 17:00:00	3	0	3	30.34	35.605	74	43	63	...	194	5.273000	4.158883	4.882802	2011-08-07	7	2011	8	7	17
5590	2012-03-05 18:00:00	1	1	3	11.48	11.365	55	43	12	...	375	5.929589	2.564949	5.897154	2012-03-05	1	2012	3	5	18
5652	2012-03-08 13:00:00	1	1	2	24.60	31.060	49	43	35	...	233	5.455321	3.583519	5.293305	2012-03-08	4	2012	3	8	13
5653	2012-03-08 14:00:00	1	1	2	25.42	31.060	43	43	48	...	203	5.318120	3.891820	5.049856	2012-03-08	4	2012	3	8	14
5654	2012-03-08 15:00:00	1	1	1	26.24	31.060	38	46	24	...	185	5.225747	3.218876	5.087596	2012-03-08	4	2012	3	8	15
5655	2012-03-08 16:00:00	1	1	2	25.42	31.060	41	43	37	...	342	5.837730	3.637586	5.723585	2012-03-08	4	2012	3	8	16
5656	2012-03-08 17:00:00	1	1	1	25.42	31.060	38	43	52	...	597	6.393591	3.970292	6.302619	2012-03-08	4	2012	3	8	17
6015	2012-04-09 12:00:00	2	1	1	22.14	25.760	28	47	94	...	280	5.638355	4.553877	5.231109	2012-04-09	1	2012	4	9	12
7880	2012-09-18 10:00:00	3	1	3	27.88	31.820	79	43	30	...	160	5.081404	3.433987	4.875197	2012-09-18	2	2012	9	18	10
7881	2012-09-18 11:00:00	3	1	2	27.88	31.820	79	43	36	...	151	5.023881	3.610918	4.753590	2012-09-18	2	2012	9	18	11

14 rows × 21 columns

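Effect of the date on rentals

Records from the same date share the same working-day flag, weekday, and year, so sum the rental counts by day and take the mean of the other date-related fields.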

day_df = data_train.groupby(['date'], as_index=False).agg({'casual':'sum','registered':'sum',
                                                          'count':'sum', 'workingday':'mean',
                                                          'weekday':'mean','holiday':'mean',
                                                          'year':'mean'})
day_df.head()

	date	casual	registered	count	workingday	weekday	year
0	2011-01-01	331	654	985	0	6	2011
1	2011-01-02	131	670	801	0	7	2011
2	2011-01-03	120	1229	1349	1	1	2011
3	2011-01-04	108	1454	1562	1	2	2011
4	2011-01-05	82	1518	1600	1	3	2011

number_pei=day_df[['casual','registered']].mean()
number_pei

casual         657.543860
registered    3040.800439
dtype: float64

# Use an equal aspect ratio so the pie chart is drawn as a circle rather than an ellipse
plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.05 , radius=1)  
plt.title('Casual or registered in the total lease')

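[Figure: pie chart of casual vs. registered share of total rentals]

Because the numbers of working days and non-working days differ, take the daily mean of rentals for working and non-working days, and likewise the daily mean for each day of the week.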

workingday_df=day_df.groupby(['workingday'], as_index=True).agg({'casual':'mean', 
                                                                 'registered':'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8,6)) 
plt.subplots_adjust(hspace=0.5, wspace=0.2)     # spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)   # subplot grid layout

plt.subplot2grid((2,2),(1,0), rowspan=2)
width = 0.3       # bar width

p1 = plt.bar(workingday_df.index,workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index,workingday_df['registered'], 
             width,bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0,1], ('nonworking day', 'working day'),rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

plt.subplot2grid((2,2),(0,0))
plt.pie(workingday_df_0, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.axis('equal') 
plt.title('nonworking day')

plt.subplot2grid((2,2),(0,1))
plt.pie(workingday_df_1, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.title('working day')
plt.axis('equal')

(-1.438451504893538,
 1.4304024814759062,
 -1.4388335098293494,
 1.4343901442970892)

[Figure: casual vs. registered rentals on working and non-working days]

weekday_df= day_df.groupby(['weekday'], as_index=True).agg({'casual':'mean', 'registered':'mean'})
weekday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by weekday')

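[Figure: mean daily rentals by weekday]

1. On working days, registered users rent more and casual users rent less;
2. On weekends, registered rentals drop and casual rentals rise.

Holidays
Holidays make up a very small share of the year, so first check how many holidays there are in each year.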

holiday_coun=day_df.groupby('year', as_index=True).agg({'holiday':'sum'})
holiday_coun

	holiday
year
2011	6
2012	7

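Holidays account for a very small share of the days in a year, so take the daily mean for holidays and non-holidays.

holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual':'mean', 'registered':'mean'})
holiday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated ')

[Figure: mean daily rentals on holidays and non-holidays]

Feature engineering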

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime

train = pd.read_csv('./data/data31405/train.csv')
test  = pd.read_csv('./data/data31405/test.csv')

# Remove training rows that lie more than 3 standard deviations from the mean
train_std = train[np.abs(train['count']-train['count'].mean())<=(3*train['count'].std())]

train_std.reset_index(drop=True,inplace=True)
train_std.shape

(10739, 12)

# Distribution of the target after a log transform
ylabels = train_std['count']
ylabels_log = np.log(ylabels)
sns.distplot(ylabels_log)

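[Figure: distribution of log(count)]

# Combine train_std and test so that both can be transformed together

# The indices carry no meaning, so use ignore_index
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
combine_train_test = pd.concat([train_std, test], ignore_index=True)
datetimecol = test['datetime']
print ('Combined dataset:',combine_train_test.shape)

Combined dataset: (17232, 12)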

# Record the number of rows (shape[0] is rows, shape[1] is columns)
row_train = train_std.shape[0]
row_test = test.shape[0]
print('Training rows:',row_train,'\nTest rows:',row_test)

Training rows: 10739 
Test rows: 6493

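# Split the datetime feature
combine_train_test = split_datetime(combine_train_test)

# Fill in wind speed; note that this changes the row order
combine_train_test = RFG_windspeed(combine_train_test)

Based on the observations above, the following 11 features are selected: hour, temperature (temp), humidity, year, month, season, weather level (weather), wind speed (windspeed), weekday, working day (workingday), and holiday. Because CART decision trees use binary splits, the multi-category features are converted into multiple binary columns with one-hot encoding.

# Note: as written, this selection lists 'weather' twice and leaves out 'holiday',
# which is why the outputs below show duplicated weather columns
combine_feature = combine_train_test[['temp','humidity','weather','season','year','weather',
                                      'month','weekday','hour','workingday','windspeed','count']]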

combine_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 12 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weather       17232 non-null int64
season        17232 non-null int64
year          17232 non-null int64
weather       17232 non-null int64
month         17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
dtypes: float64(3), int64(9)
memory usage: 1.6 MB

# Convert the multi-category features into multiple binary columns with one-hot encoding
cols = ['month','season','weather','year']
combine_feature = pd.get_dummies(combine_feature,columns=cols,prefix_sep='_')

combine_feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 33 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
month_1       17232 non-null uint8
month_2       17232 non-null uint8
month_3       17232 non-null uint8
month_4       17232 non-null uint8
month_5       17232 non-null uint8
month_6       17232 non-null uint8
month_7       17232 non-null uint8
month_8       17232 non-null uint8
month_9       17232 non-null uint8
month_10      17232 non-null uint8
month_11      17232 non-null uint8
month_12      17232 non-null uint8
season_1      17232 non-null uint8
season_2      17232 non-null uint8
season_3      17232 non-null uint8
season_4      17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
year_2011     17232 non-null uint8
year_2012     17232 non-null uint8
dtypes: float64(3), int64(4), uint8(26)
memory usage: 1.3 MB
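To make the effect of the one-hot step easier to see, here is a minimal sketch on a hypothetical four-row frame (the values are made up): each category value becomes its own indicator column, and the original column is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "temp": [9.84, 20.5, 30.1, 12.3],  # illustrative values only
})

# Columns listed in `columns` are replaced by one indicator column per value.
encoded = pd.get_dummies(toy, columns=["season"], prefix_sep="_")
print(encoded.columns.tolist())
```

Every row has exactly one indicator set among the season_* columns, which is what lets a tree split on "is it season 3 or not" directly.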

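Building the model

# Split the combined data back into training and test sets; note that the random-forest wind speed fill changed the row order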
mask = pd.notnull(combine_feature['count'])
train_data = combine_feature[mask]
test_data = combine_feature[~mask]

train_data.shape,test_data.shape

((10739, 33), (6493, 33))

# Training features
source_X = train_data.drop(['count'],axis = 1)

# Training labels (log1p of count)
source_y  = np.log1p(train_data['count'])

# Test set features
pred_X = test_data.drop(['count'],axis = 1)
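Since the model is trained on log1p(count), its predictions have to be mapped back with expm1 before they can be read as rental counts. The sketch below shows the round trip on a few counts from the head of the training set, and also that RMSE on the log1p scale equals RMSLE on the original scale. The rmsle helper is hypothetical, and whether the competition scores submissions with RMSLE is an assumption here that should be checked against the competition page.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed on the original count scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

counts = np.array([16, 40, 32, 13, 1], dtype=float)
log_target = np.log1p(counts)      # what the model is trained on

log_pred = log_target + 0.1        # stand-in for model.predict(...)
count_pred = np.expm1(log_pred)    # back to rental counts

rmse_on_log_scale = np.sqrt(np.mean((log_pred - log_target) ** 2))
print(rmse_on_log_scale, rmsle(counts, count_pred))
```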

Models

from sklearn.model_selection import GridSearchCV

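# Grid search helper
def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model, # model to search over
                        params, # parameter grid to try
                        n_jobs=-1,
                        error_score=0.) # if a fit raises an error, score it as 0
    grid.fit(X, y) # fit the model over the parameter grid
    # Best mean cross-validated score; for a regressor the default scorer is R^2, not accuracy
    print("Best Accuracy: {}".format(grid.best_score_))
    # Parameters that achieved the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Mean time to fit (seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Mean time to score (seconds)
    # This gives a sense of the model's prediction speed in practice
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))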
    return grid
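With no scoring argument, GridSearchCV ranks regressors by R^2. If an error measure on the log scale is preferred, the scorer can be set explicitly. The sketch below is one way to do that on synthetic data; the data, the tiny parameter grid, and the choice of "neg_root_mean_squared_error" are assumptions for illustration, not the setup used in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
grid.fit(X, y)

best_rmse = -grid.best_score_  # flip the sign back to a plain RMSE
print(grid.best_params_, best_rmse)
```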

from sklearn.model_selection import train_test_split 

# Split the data into training and validation parts
train_X, test_X, train_y, test_y = train_test_split(source_X,
                                                    source_y,
                                                    train_size = 0.80)

# Print the dataset sizes
print ('Original features:',source_X.shape, 'Training features:',train_X.shape,'Test features:',test_X.shape)

print ('Original labels:',source_y.shape, 'Training labels:',train_y.shape,'Test labels:',test_y.shape)

Original features: (10739, 32) Training features: (8591, 32) Test features: (2148, 32)
Original labels: (10739,) Training labels: (8591,) Test labels: (2148,)

Random forest

from sklearn.ensemble import RandomForestRegressor


# Model parameters
forest_parmas = {'n_estimators':[1300,1500,1700], 'max_depth':range(20,30,4)}

Model = RandomForestRegressor(oob_score=True,n_jobs=-1,random_state = 42)

Model = get_best_model_and_accuracy(Model,forest_parmas ,train_X, train_y)

Best Accuracy: 0.9495761615636888
Best Parameters: {'max_depth': 24, 'n_estimators': 1500}
Average Time to Fit (s): 33.625
Average Time to Score (s): 1.056

Model=Model.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=24,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1500, n_jobs=-1,
           oob_score=True, random_state=42, verbose=0, warm_start=False)

# This is a regression problem: score() returns the coefficient of determination R^2, not an accuracy
Model.score(test_X,test_y)

0.9506555663644386
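For a regressor, `score` reports the coefficient of determination R^2, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that computation in plain Python follows; the helper name `r2` and the sample values are made up for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values, not taken from the bike-sharing data
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r2(y_true, y_pred))
```

A value of 1 means perfect predictions, and a model that always predicts the mean of `y_true` scores 0.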

# Out-of-bag score (R^2 estimated on the samples each tree did not see)
Model.oob_score_

0.9544882922113979

# Save the model
# (sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly)
import joblib

joblib.dump(Model, "rf.pkl", compress=9)

XGBoost

import xgboost as xg

# Parameter grid. subsample: fraction of training rows sampled for each tree
xg_parmas = {'subsample':[i/10.0 for i in range(6,10)],
            'colsample_bytree':[i/10.0 for i in range(6,10)]} # fraction of columns sampled for each tree

xg_model = xg.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4)

xg_model = get_best_model_and_accuracy(xg_model,xg_parmas,train_X.values, train_y.values)

Best Accuracy: 0.9519465121003385
Best Parameters: {'colsample_bytree': 0.9, 'subsample': 0.9}
Average Time to Fit (s): 1.136
Average Time to Score (s): 0.019

xg_model=xg_model.best_estimator_
xg_model

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.9, gamma=0.4, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.300000012, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=nan, monotone_constraints=None,
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='reg:squarederror', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=0.9, tree_method=None,
       validate_parameters=False, verbosity=None)

from sklearn.metrics import mean_absolute_error

pre_y = xg_model.predict(test_X.values)
# MAE on the log1p scale, since the labels are log1p(count); arguments are (y_true, y_pred)
mean_absolute_error(test_y.values, pre_y)

0.20327569987981398
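The error above is measured on the log1p scale, because `source_y` was built with `np.log1p`. The Kaggle competition for this dataset scores submissions with RMSLE, and on log1p-transformed targets RMSLE is simply the RMSE. A small sketch with made-up counts shows the equivalence; the helper names `rmsle` and `rmse` are hypothetical and not part of the original notebook.

```python
import math

def rmsle(count_true, count_pred):
    """Root mean squared logarithmic error, computed on raw counts."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(count_true, count_pred)]
    return math.sqrt(sum(sq) / len(sq))

def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    sq = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# Illustrative counts, not taken from the data
counts_true = [16, 40, 32, 13]
counts_pred = [12, 45, 30, 20]

# The same values on the log1p scale, as the models in this post see them
log_true = [math.log1p(c) for c in counts_true]
log_pred = [math.log1p(c) for c in counts_pred]

print(rmsle(counts_true, counts_pred), rmse(log_true, log_pred))
```

Both printed numbers agree, so a model evaluated with RMSE on `np.log1p(count)` is being evaluated with the competition metric.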

Learning curve

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1, 
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
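    """
    Plot the learning curve of a model on the given data.
    Parameters
    ----------
    estimator : the estimator to evaluate (a regressor in this post).
    title : title of the plot.
    X : input features, NumPy array
    y : input target vector
    ylim : tuple (ymin, ymax) setting the lower and upper limits of the y-axis
    cv : number of cross-validation folds; one fold is held out for validation and the remaining n-1 are used for training (None uses scikit-learn's default: 3 folds in older releases, 5 since version 0.22)
    n_jobs : number of parallel jobs (default -1, i.e. all available cores)
    """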
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Number of training samples")
        plt.ylabel("Score")
        plt.grid() # grid lines
    
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                         alpha=0.1, color="b") # shade the +/- 1 std band around the mean
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label="train score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label="test score")
    
        plt.legend(loc="best")
        
        plt.draw()
        plt.show()
    
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

plot_learning_curve(Model, "Learning curve",train_X,train_y)

[Figure: learning curve, train score and test score against the number of training samples]

(0.969214095199126, 0.048081832611191144)
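The tuple returned by `plot_learning_curve` summarizes the curve at the largest training size: the first value is the midpoint between the top of the train-score band and the bottom of the test-score band, and the second is the width of that gap. The arithmetic can be sketched with made-up means and standard deviations; the numbers below are illustrative and are not the ones behind the output above.

```python
# Illustrative values for the largest training size (not taken from the run above)
train_mean, train_std = 0.99, 0.002
test_mean, test_std = 0.95, 0.005

upper = train_mean + train_std   # top of the train-score band
lower = test_mean - test_std     # bottom of the test-score band

midpoint = (upper + lower) / 2   # centre of the gap between the two bands
diff = upper - lower             # width of the gap; a large value suggests overfitting

print(midpoint, diff)
```

A gap that stays wide as the training size grows points to overfitting, while a narrow gap at a low score points to underfitting.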

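# Predict on the Kaggle test set
pred_value = Model.predict(pred_X)
# The labels were built with np.log1p, so np.expm1 is the exact inverse.
# The sample output below was produced with np.exp, so each count shown is 1 higher.
pred_value = np.expm1(pred_value)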
submission = pd.DataFrame({'datetime':datetimecol, 'count':pred_value})
submission['count'] = submission['count'].astype(int)

submission.to_csv('bike_predictions.csv',index = False)

submission.head()

	datetime	count
0	2011-01-20 00:00:00	10
1	2011-01-20 01:00:00	3
2	2011-01-20 02:00:00	3
3	2011-01-20 03:00:00	6
4	2011-01-20 04:00:00	37
5	2011-01-20 05:00:00	90
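Since the labels were created with `np.log1p`, the matching inverse is `np.expm1`, and a plain exponential overshoots every count by exactly 1. A quick check with the standard library's `math` module, using an arbitrary example count rather than a value from the data:

```python
import math

count = 9                      # arbitrary example count
y = math.log1p(count)          # the transform applied to the labels: log(1 + count)

recovered = math.expm1(y)      # exact inverse: exp(y) - 1
overshoot = math.exp(y)        # plain exp returns count + 1

print(recovered, overshoot)
```

The difference matters most for the small night-time counts, where an extra rental is a large relative error under a logarithmic metric.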
