Kaggle Case Study: Bike Sharing Demand

Bike sharing systems are a means of renting bicycles: a bike can be picked up at any station and returned once the rider reaches the destination. Because these systems explicitly record trip duration, departure location, arrival location, and time, they can be used to study mobility within a city. In this project, the task is to combine historical usage patterns with weather data to forecast bike rental demand in Washington, D.C.

The data cover two years of hourly rental records together with weather and calendar information. The training set consists of the first 19 days of each month; the test set covers day 20 through the end of each month.
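As a quick sanity check on this split (a minimal sketch, assuming the same ./data/train.csv and ./data/test.csv paths used below):

import pandas as pd

# The day of month should confirm the documented split:
# training rows come from days 1-19, test rows from day 20 onward.
train = pd.read_csv('./data/train.csv', parse_dates=['datetime'])
test = pd.read_csv('./data/test.csv', parse_dates=['datetime'])
print(train['datetime'].dt.day.max())  # expected: 19
print(test['datetime'].dt.day.min())   # expected: 20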

Variable descriptions:

  • datetime - date plus the hour of day (year, month, day, hour)
  • season - 1 = spring, 2 = summer, 3 = autumn, 4 = winter
  • holiday - whether the day is a holiday
  • workingday - whether the day is a working day
  • weather (weather severity) - 1: clear, few clouds, partly cloudy; 2: mist + overcast, mist + broken clouds, mist + few clouds, mist; 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds; 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
  • temp - temperature
  • atemp - "feels like" temperature
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of rentals by non-registered (casual) users
  • registered - number of rentals by registered users
  • count - total number of rentals

Data Exploration

  • Check for missing values
  • Check for outliers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
data_train = pd.read_csv('./data/train.csv')
data_test  = pd.read_csv('./data/test.csv')

data_train.info()
print('-'*40)
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB

The data contain no missing values.
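The info() output above already shows equal non-null counts in every column; an explicit missing-value check (a minimal sketch) is:

# Count missing values per column; all zeros confirms there are no gaps
print(data_train.isnull().sum())
print(data_test.isnull().sum())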

data_train.head()
              datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1
data_test.head()
              datetime  season  holiday  workingday  weather   temp   atemp  humidity  windspeed
0  2011-01-20 00:00:00       1        0           1        1  10.66  11.365        56    26.0027
1  2011-01-20 01:00:00       1        0           1        1  10.66  13.635        56     0.0000
2  2011-01-20 02:00:00       1        0           1        1  10.66  13.635        56     0.0000
3  2011-01-20 03:00:00       1        0           1        1  10.66  12.880        56    11.0014
4  2011-01-20 04:00:00       1        0           1        1  10.66  12.880        56    11.0014
# Summary statistics
data_train.describe().T
              count        mean         std   min      25%      50%       75%       max
season      10886.0    2.506614    1.116174  1.00   2.0000    3.000    4.0000    4.0000
holiday     10886.0    0.028569    0.166599  0.00   0.0000    0.000    0.0000    1.0000
workingday  10886.0    0.680875    0.466159  0.00   0.0000    1.000    1.0000    1.0000
weather     10886.0    1.418427    0.633839  1.00   1.0000    1.000    2.0000    4.0000
temp        10886.0   20.230860    7.791590  0.82  13.9400   20.500   26.2400   41.0000
atemp       10886.0   23.655084    8.474601  0.76  16.6650   24.240   31.0600   45.4550
humidity    10886.0   61.886460   19.245033  0.00  47.0000   62.000   77.0000  100.0000
windspeed   10886.0   12.799395    8.164537  0.00   7.0015   12.998   16.9979   56.9969
casual      10886.0   36.021955   49.960477  0.00   4.0000   17.000   49.0000  367.0000
registered  10886.0  155.552177  151.039033  0.00  36.0000  118.000  222.0000  886.0000
count       10886.0  191.574132  181.144454  1.00  42.0000  145.000  284.0000  977.0000

Outlier check

  • count
  • casual
  • registered
# Check whether the distributions look roughly Gaussian
fig,axes = plt.subplots(1,3)
# Set the figure size in inches (1 inch = 2.54 cm)
fig.set_size_inches(18,5)

sns.distplot(data_train['count'],bins=100,ax=axes[0])
sns.distplot(data_train['casual'],bins=100,ax=axes[1])
sns.distplot(data_train['registered'],bins=100,ax=axes[2])

[Figure: distribution plots of count, casual and registered]

data_train[['count','casual','registered']].describe().T
              count        mean         std  min   25%    50%    75%    max
count       10886.0  191.574132  181.144454  1.0  42.0  145.0  284.0  977.0
casual      10886.0   36.021955   49.960477  0.0   4.0   17.0   49.0  367.0
registered  10886.0  155.552177  151.039033  0.0  36.0  118.0  222.0  886.0
fig,axes = plt.subplots(1,3)
fig.set_size_inches(12,6)

sns.boxplot(data = data_train['count'],ax=axes[0])
axes[0].set(xlabel='count')
sns.boxplot(data = data_train['casual'], ax=axes[1])
axes[1].set(xlabel='casual')
sns.boxplot(data = data_train['registered'], ax=axes[2])
axes[2].set(xlabel='registered')

[Figure: box plots of count, casual and registered]

count: the mean is 191 with a standard deviation of 181; the median is 145, the 75th percentile 284, and the maximum 977, indicating a long right tail. Remove the outliers, apply a log transform, and inspect the result.

count = casual+registered
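Before dropping outliers it is worth confirming that this identity holds in the training data; a minimal sketch (not part of the original notebook):

# Sanity check: count should equal casual + registered in every row
assert (data_train['count'] == data_train['casual'] + data_train['registered']).all()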

# Remove outliers: values more than 3 standard deviations from the mean are treated as outliers
def drop_outlier(data,col):
    mask = np.abs(data[col]-data[col].mean())<(3*data[col].std())
    data = data.loc[mask]
    # Visualize col and its log1p transform after removing outliers
    data[col+'_log'] = np.log1p(data[col])
    f, [ax1, ax2] = plt.subplots(1,2, figsize=(15,6))

    sns.distplot(data[col], ax=ax1)
    ax1.set_title(col+' distribution')

    sns.distplot(data[col+'_log'], ax=ax2)
    ax2.set_title(col+'_log distribution')
    return data
data_train = drop_outlier(data_train,'count')

[Figure: count and count_log distributions after outlier removal]

data_train = drop_outlier(data_train,'casual')

[Figure: casual and casual_log distributions after outlier removal]

data_train = drop_outlier(data_train,'registered')

[Figure: registered and registered_log distributions after outlier removal]

Feature Decomposition

Split the datetime feature into date, weekday, year, month, day, and hour.

def split_datetime(data):
    data['date'] = data['datetime'].apply(lambda x:x.split()[0])
    data['weekday'] =data['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d').isoweekday())
    data['year'] = data['date'].apply(lambda x:x.split('-')[0]).astype('int')
    data['month'] = data['date'].apply(lambda x:x.split('-')[1]).astype('int')
    data['day'] = data['date'].apply(lambda x:x.split('-')[2]).astype('int')
    data['hour'] = data['datetime'].apply(lambda x:x.split()[1].split(':')[0]).astype('int')
    return data
data_train = split_datetime(data_train)
data_train.head()
              datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count  count_log        date  weekday  year  month  day  hour
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16   2.833213  2011-01-01        6  2011      1    1     0
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40   3.713572  2011-01-01        6  2011      1    1     1
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32   3.496508  2011-01-01        6  2011      1    1     2
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13   2.639057  2011-01-01        6  2011      1    1     3
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1   0.693147  2011-01-01        6  2011      1    1     4

Visual Analysis

  • Distributions of the numerical features
  • Box plots of count against the categorical features
data_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10739 entries, 0 to 10885
Data columns (total 19 columns):
datetime      10739 non-null object
season        10739 non-null int64
holiday       10739 non-null int64
workingday    10739 non-null int64
weather       10739 non-null int64
temp          10739 non-null float64
atemp         10739 non-null float64
humidity      10739 non-null int64
windspeed     10739 non-null float64
casual        10739 non-null int64
registered    10739 non-null int64
count         10739 non-null int64
count_log     10739 non-null float64
date          10739 non-null object
weekday       10739 non-null int64
year          10739 non-null int64
month         10739 non-null int64
day           10739 non-null int64
hour          10739 non-null int64
dtypes: float64(4), int64(13), object(2)
memory usage: 2.0+ MB
fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],bins=60,ax=axes[0,0])
sns.distplot(data_train['atemp'],bins=60,ax=axes[0,1])
sns.distplot(data_train['humidity'],bins=60,ax=axes[1,0])
sns.distplot(data_train['windspeed'],bins=60,ax=axes[1,1])

[Figure: distributions of temp, atemp, humidity and windspeed]

fig,axes = plt.subplots(2,2)
fig.set_size_inches(15,12)

sns.boxplot(x='season', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,0])
sns.boxplot(x='holiday', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,1])
sns.boxplot(x='workingday', y='count', data = data_train, orient='v', width=0.6, ax=axes[1,0])
sns.boxplot(x='weather',y='count',data=data_train,orient='v',width=0.6,ax=axes[1,1])

[Figure: box plots of count by season, holiday, workingday and weather]

data_train['windspeed'].describe()
count    10739.000000
mean        12.787706
std          8.171075
min          0.000000
25%          7.001500
50%         12.998000
75%         16.997900
max         56.996900
Name: windspeed, dtype: float64
data_train.boxplot(['windspeed'])

[Figure: box plot of windspeed]

The plot above shows a large number of records with wind speed 0; these are likely missing values that were filled with 0. Here we use a random forest to impute the wind speed for those records.

np.sum(data_train['windspeed'] == 0),data_train['windspeed'].shape[0]
(1253, 10264)
# Impute wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor

def RFG_windspeed(data):
    # Split the data into rows with wind speed equal to 0 and rows with non-zero wind speed
    mask = data['windspeed'] == 0
    wind_0 = data[mask]
    wind_1 = data[~mask]

    if len(wind_0.index)==0:
        return data

    Model_wind = RandomForestRegressor(n_estimators=1000,random_state=42)

    # Features used to predict wind speed
    cols = ["season","weather","humidity","month","temp","year","atemp"]
    windspeed_X = wind_1[cols]
    # Target
    windspeed_y = wind_1['windspeed']

    windspeedpre_X = wind_0[cols]

    Model_wind.fit(windspeed_X,windspeed_y)

    # Predict the missing wind speeds
    wind_0Values = Model_wind.predict(X=windspeedpre_X)

    # Fill them in
    wind_0.loc[:,'windspeed'] = wind_0Values
    # Note: concatenating wind_1 followed by wind_0 changes the original row order
    data = wind_1.append(wind_0).reset_index()
    data.drop('index',inplace=True,axis=1)
    return data
data_train['windspeed'].head()
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: windspeed, dtype: float64
data_train = RFG_windspeed(data_train)
data_train['windspeed'].head()
0     6.0032
1    16.9979
2    19.0012
3    19.0012
4    19.9995
Name: windspeed, dtype: float64

Look again at the density distributions of these four features.

fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],ax=axes[0,0])
axes[0,0].set(xlabel='temp')

sns.distplot(data_train['atemp'],ax=axes[0,1])
axes[0,1].set(xlabel='atemp')

sns.distplot(data_train['humidity'],ax=axes[1,0])
axes[1,0].set(xlabel='humidity')

sns.distplot(data_train['windspeed'],ax=axes[1,1])
axes[1,1].set(xlabel='windspeed')
[Text(0.5,0,'windspeed')]

[Figure: distributions of temp, atemp, humidity and windspeed after imputing wind speed]

Now look at how the three rental-related values relate to the other features overall.

# Pair plot with seaborn
cols =['season','holiday','workingday','weekday','weather','temp',
       'atemp','humidity','windspeed','hour']

sns.pairplot(data_train ,x_vars=cols,
             y_vars=['casual','registered','count'], 
             plot_kws={'alpha': 0.2})

[Figure: pair plot of casual, registered and count against the other features]

  • season - 1 = spring, 2 = summer, 3 = autumn, 4 = winter
  • holiday - holiday indicator
  • workingday - working day indicator
  • weather - weather severity level
  • temp - temperature
  • atemp - "feels like" temperature
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - rentals by non-registered users
  • registered - rentals by registered users
  • count - total rentals

Observations:

  1. Ridership is lowest overall in the first season (spring).
  2. Total rentals on non-holidays are higher than on holidays.
  3. Registered users ride mostly on working days and less on holidays; casual users show the opposite pattern.
  4. Rentals decrease as the weather severity level increases.
  5. Temperature and humidity affect casual users strongly and registered users much less.
  6. The hour of day has a clear effect: registered usage shows two peaks, while casual usage follows a roughly bell-shaped curve.

Check the correlation of each feature with count.

corr = data_train.corr()
plt.subplots(figsize=(14,14))
sns.heatmap(corr,annot=True,vmax=1,cmap='YlGnBu')

[Figure: correlation heatmap]

# Absolute correlation with count, in descending order
np.abs(corr['count']).sort_values(ascending=False)
count             1.000000
registered        0.977642
count_log         0.845294
registered_log    0.839080
casual_log        0.758863
casual            0.711105
hour              0.442659
temp              0.376656
atemp             0.372705
humidity          0.307362
month             0.176115
season            0.173657
year              0.171519
weather           0.123578
windspeed         0.116767
workingday        0.046246
weekday           0.022100
day               0.015243
holiday           0.002421
Name: count, dtype: float64

The strength of each feature's (absolute) correlation with count is:
hour > temp > atemp > humidity > month > season > year > weather > windspeed > workingday > weekday > day > holiday

hour vs. count

# Overall trend by hour
date = data_train.groupby(['hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Total rentals
plt.plot(date['hour'], date['count'], linewidth=1.3)
# Registered users
plt.plot(date['hour'], date['registered'], linewidth=1.3)
# Casual users
plt.plot(date['hour'], date['casual'], linewidth=1.3)
plt.legend()

[Figure: hourly trend of count, registered and casual]

# Relationship between hour and count on working vs. non-working days
date = data_train.groupby(['workingday','hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

mask = date['workingday'] == 1

workingday_date= date[mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)
nworkingday_date = date[~mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)

fig, axes = plt.subplots(1,2,sharey = True)
workingday_date.plot(figsize=(15,5),title ='working day',ax=axes[0])
axes[0].set(xlabel='hour')
nworkingday_date.plot(figsize=(15,5),title ='nonworkdays',ax=axes[1])
axes[1].set(xlabel='hour')

[Figure: hourly trend on working days vs. non-working days]

Observations:

  • Working days

    1. Registered users show two peaks around the morning and evening commutes, plus a smaller bump at noon, presumably people going out for lunch.
    2. Casual usage is much flatter, peaking around 17:00.
    3. Registered users rent far more bikes than casual users.
  • Non-working days

    1. Rentals (count) follow a roughly bell-shaped curve over the day, peaking around 12:00 and bottoming out around 4:00, with a fairly even spread.

Temperature vs. count

Visualize the overall temperature trend over the two years.

# Aggregate by day, taking the daily median temperature
temp_df = data_train.groupby(['date','weekday'],as_index=False).agg({'year':'mean',
                                                                     'month':'mean',
                                                                     'temp':'median'})
# Drop missing data if any
# temp_df.dropna (axis=0,how ='any',inplace=True)

# Daily values still fluctuate a lot, so also aggregate per month
temp_month = temp_df.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                    'temp':'median'})

# Convert the daily aggregated dates to datetime format
temp_df['date']=pd.to_datetime(temp_df['date'])

# Build a date column for the monthly aggregates
temp_month.rename(columns={'weekday':'day'},inplace=True)
temp_month['date']=pd.to_datetime(temp_month[['year','month','day']])


# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of the temperature trend over time
plt.plot(temp_df['date'] , temp_df['temp'], linewidth=1.3, label='daily')
ax.set_title('Daily temperature over the two years')
plt.plot(temp_month['date'] , temp_month['temp'], marker='o',
         linewidth=1.3,label='monthly')
ax.legend()

[Figure: daily and monthly temperature over the two years]

The temperature follows the same pattern in both years, peaking in July and bottoming out in January. Next, look at how the average hourly rentals change with temperature.

# Average rentals at each temperature value
temp = data_train.groupby(['temp'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
temp.plot(figsize=(10,5),title='Rentals vs. temperature')

[Figure: average rentals vs. temperature]

Rentals reach their lowest point at around 4°C, then generally rise with temperature until they start to fall once the temperature exceeds about 35°C.

Humidity vs. count

# Overall humidity trend over the two years
humidity_df = data_train.groupby(['date'],as_index=False).agg({'humidity':'mean'})
humidity_df['date']=pd.to_datetime(humidity_df['date'])

# Use the date as the time index
humidity_df = humidity_df.set_index('date')

humidity_month = data_train.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                         'humidity':'mean'})

# Build a date column for the monthly aggregates
humidity_month.rename(columns={'weekday':'day'},inplace=True)
humidity_month['date']=pd.to_datetime(humidity_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of the humidity trend over time
ax.set_title('Daily humidity over the two years')
plt.plot(humidity_df.index,humidity_df['humidity'], linewidth=1.3, label='daily mean')
plt.plot(humidity_month['date'],humidity_month['humidity'], marker='o',
         linewidth=1.3,label='monthly mean')
plt.grid()
ax.legend()

[Figure: daily and monthly average humidity over the two years]

Look at how rentals vary with humidity by averaging the rental counts at each humidity value.

# Average rentals at each humidity value
humidity = data_train.groupby(['humidity'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
humidity.plot(figsize=(10,5),title='Rentals vs. humidity')

[Figure: average rentals vs. humidity]

Rentals peak quickly at a humidity of around 20 and then decline slowly.

year and month vs. count

# First look at how total rentals change over the two years
count_df = data_train.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
                                                                      'month':'mean',
                                                                      'casual':'sum',
                                                                      'registered':'sum',
                                                                       'count':'sum'})

# Daily values still fluctuate a lot, so also take the daily average per month
count_month = count_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                      'casual':'mean', 
                                                                      'registered':'mean',
                                                                      'count':'mean'})

# Convert the daily aggregated dates to datetime format
count_df['date']=pd.to_datetime(count_df['date'])

# Build a date column for the monthly aggregates
count_month.rename(columns={'weekday':'day'},inplace=True)
count_month['date']=pd.to_datetime(count_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of total rentals (count) over time
ax.set_title('Overall trend of count over the two years')
plt.plot(count_df['date'],count_df['count'],linewidth=1.3,label='daily total')

plt.plot(count_month['date'],count_month['count'],marker='o',
         linewidth=1.3,label='monthly mean')
plt.grid()
ax.legend()

[Figure: daily and monthly rental counts over the two years]

Observations:

  1. Rentals in 2012 are higher overall than in 2011;
  2. Rentals fluctuate noticeably from month to month;
  3. The data fluctuate sharply between September and December 2011 and between March and September 2012;
  4. There are many local troughs.
# Monthly usage trend
date = data_train.groupby(['month'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(date['month'], date['count'] , linewidth=1.3 , label = 'total' )
plt.plot(date['month'], date['registered'] , linewidth=1.3 , label = 'registered' )
plt.plot(date['month'], date['casual'] , linewidth=1.3 , label = 'casual' )
plt.legend()

[Figure: monthly averages of count, registered and casual]

Season vs. count

day_df=data_train.groupby('date').agg({'year':'mean','season':'mean',
                                      'casual':'sum', 'registered':'sum'
                                      ,'count':'sum','temp':'mean',
                                      'atemp':'mean'})
season_df = day_df.groupby(['year','season'], as_index=True).agg({'casual':'mean', 
                                                                  'registered':'mean',
                                                                  'count':'mean'})
temp_df = day_df.groupby(['year','season'], as_index=True).agg({'temp':'mean', 
                                                                'atemp':'mean'})

fig = plt.figure(figsize=(10,10))
xlables = season_df.index.map(lambda x:str(x))

ax1 = fig.add_subplot(2,1,1)
ax1.set_title('Rentals by season over the two years')
plt.plot(xlables,season_df)
plt.legend(['casual','registered','count'])

ax2 = fig.add_subplot(2,1,2)
ax2.set_title('Temperature by season over the two years')
plt.plot(xlables,temp_df)

plt.legend(['temp','atemp'])

[Figure: rentals and temperature by year and season]

Both casual and registered usage peaks in autumn (season 3) and is lowest in spring (season 1).

Weather vs. count

Since the number of days differs across weather levels (very bad weather, level 4, is rarely observed), first check the number of records per weather level, then take the hourly mean rentals by weather level.

count_weather = data_train.groupby('weather')
count_weather[['casual','registered','count']].count()
         casual  registered  count
weather
1          6719        6719   6719
2          2705        2705   2705
3           839         839    839
4             1           1      1
weather_df = data_train.groupby('weather',as_index=True).agg({'casual':'mean',
                                                              'registered':'mean'})
weather_df.plot.bar(stacked=True)

[Figure: stacked bar chart of average casual and registered rentals by weather level]

The bar chart shows a surprisingly high number of rentals at weather level 4, which seems implausible, so print the corresponding records to take a closer look.

data_train[data_train['weather']==4].T
                               4863
datetime        2012-01-09 18:00:00
season                            1
holiday                           0
workingday                        1
weather                           4
temp                            8.2
atemp                        11.365
humidity                         86
windspeed                    6.0032
casual                            6
registered                      158
count                           164
count_log                   5.10595
casual_log                  1.94591
registered_log               5.0689
date                     2012-01-09
weekday                           1
year                           2012
month                             1
day                               9
hour                             18

This single record falls in the Monday evening rush hour, which explains the high count; it can be treated as an anomalous data point.

windspeed vs. count

# Overall wind speed trend over the two years
windspeed_df = data_train.groupby('date',as_index=False).agg({'windspeed':'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])

# Use the date as the time index
windspeed_df = windspeed_df.set_index('date')


windspeed_month = data_train.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                           'windspeed':'mean'})
windspeed_month.rename(columns={'weekday':'day'},inplace=True)
windspeed_month['date']=pd.to_datetime(windspeed_month[['year','month','day']])

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(windspeed_df.index, windspeed_df['windspeed'] , linewidth=1.3,label='daily mean')
plt.plot(windspeed_month['date'], windspeed_month['windspeed'],
         marker='o', linewidth=1.3,label='monthly mean')
ax.legend()
ax.set_title('Overall wind speed trend over the two years')

[Figure: daily and monthly average wind speed over the two years]

Wind speed fluctuates a lot in September 2011 and between December 2011 and March 2012. Next, look at how rentals vary with wind speed: very high wind speeds occur rarely, so the aggregated values at the high end are noisy. Group the rental counts by wind speed and inspect the trend.

# Wind speed
# Cast to integer values
data_train['windspeed'] = data_train['windspeed'].astype(int)
windspeed = data_train.groupby(['windspeed'], as_index=True).agg({'count':'mean',
                                                                  'registered':'mean',  
                                                                  'casual':'mean'})
windspeed.plot(figsize=(10,8))

[Figure: average rentals vs. wind speed]

Rentals generally decrease as wind speed increases, dropping noticeably above about 18, yet there are rebounds at some higher wind speeds. As with the weather-level case, these are probably driven by a few anomalous records, so print the high-wind-speed data to check.

df2=data_train[data_train['windspeed']>40]
df2=df2[df2['count']>150]
df2
      datetime             season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  ...  count  count_log  casual_log  registered_log        date  weekday  year  month  day  hour
760   2011-02-19 14:00:00       1        0           0        1  18.86  22.725        15         43     102  ...    196   5.283204    4.634729        4.553877  2011-02-19        6  2011      2   19    14
761   2011-02-19 15:00:00       1        0           0        1  18.04  21.970        16         50      84  ...    171   5.147494    4.442651        4.477337  2011-02-19        6  2011      2   19    15
2447  2011-07-03 17:00:00       3        0           0        3  32.80  37.120        49         56     181  ...    358   5.883322    5.204007        5.181784  2011-07-03        7  2011      7    3    17
2448  2011-07-03 18:00:00       3        0           0        3  32.80  37.120        49         56      74  ...    181   5.204007    4.317488        4.682131  2011-07-03        7  2011      7    3    18
2941  2011-08-07 17:00:00       3        0           0        3  30.34  35.605        74         43      63  ...    194   5.273000    4.158883        4.882802  2011-08-07        7  2011      8    7    17
5590  2012-03-05 18:00:00       1        0           1        3  11.48  11.365        55         43      12  ...    375   5.929589    2.564949        5.897154  2012-03-05        1  2012      3    5    18
5652  2012-03-08 13:00:00       1        0           1        2  24.60  31.060        49         43      35  ...    233   5.455321    3.583519        5.293305  2012-03-08        4  2012      3    8    13
5653  2012-03-08 14:00:00       1        0           1        2  25.42  31.060        43         43      48  ...    203   5.318120    3.891820        5.049856  2012-03-08        4  2012      3    8    14
5654  2012-03-08 15:00:00       1        0           1        1  26.24  31.060        38         46      24  ...    185   5.225747    3.218876        5.087596  2012-03-08        4  2012      3    8    15
5655  2012-03-08 16:00:00       1        0           1        2  25.42  31.060        41         43      37  ...    342   5.837730    3.637586        5.723585  2012-03-08        4  2012      3    8    16
5656  2012-03-08 17:00:00       1        0           1        1  25.42  31.060        38         43      52  ...    597   6.393591    3.970292        6.302619  2012-03-08        4  2012      3    8    17
6015  2012-04-09 12:00:00       2        0           1        1  22.14  25.760        28         47      94  ...    280   5.638355    4.553877        5.231109  2012-04-09        1  2012      4    9    12
7880  2012-09-18 10:00:00       3        0           1        3  27.88  31.820        79         43      30  ...    160   5.081404    3.433987        4.875197  2012-09-18        2  2012      9   18    10
7881  2012-09-18 11:00:00       3        0           1        2  27.88  31.820        79         43      36  ...    151   5.023881    3.610918        4.753590  2012-09-18        2  2012      9   18    11

14 rows × 21 columns

How the date affects ridership

Rows with the same date share the same workingday, weekday, year, and so on, so sum the rental counts per day and take the mean of the other date-related columns.

day_df = data_train.groupby(['date'], as_index=False).agg({'casual':'sum','registered':'sum',
                                                          'count':'sum', 'workingday':'mean',
                                                          'weekday':'mean','holiday':'mean',
                                                          'year':'mean'})
day_df.head()
         date  casual  registered  count  workingday  weekday  holiday  year
0  2011-01-01     331         654    985           0        6        0  2011
1  2011-01-02     131         670    801           0        7        0  2011
2  2011-01-03     120        1229   1349           1        1        0  2011
3  2011-01-04     108        1454   1562           1        2        0  2011
4  2011-01-05      82        1518   1600           1        3        0  2011
number_pei=day_df[['casual','registered']].mean()
number_pei
casual         657.543860
registered    3040.800439
dtype: float64
# Equalize the axes so the pie chart is a true circle rather than an ellipse
plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.05 , radius=1)  
plt.title('Casual or registered in the total lease')

[Figure: pie chart of casual vs. registered share of total rentals]

Because the numbers of working and non-working days differ, take the daily mean rentals for working vs. non-working days, and likewise look at the daily mean rentals for each day of the week.

workingday_df=day_df.groupby(['workingday'], as_index=True).agg({'casual':'mean', 
                                                                 'registered':'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8,6)) 
plt.subplots_adjust(hspace=0.5, wspace=0.2)     # spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)   # grid spec to keep the subplot axes aligned

plt.subplot2grid((2,2),(1,0), rowspan=2)
width = 0.3       # bar width

p1 = plt.bar(workingday_df.index,workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index,workingday_df['registered'], 
             width,bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0,1], ('nonworking day', 'working day'),rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

plt.subplot2grid((2,2),(0,0))
plt.pie(workingday_df_0, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.axis('equal') 
plt.title('nonworking day')

plt.subplot2grid((2,2),(0,1))
plt.pie(workingday_df_1, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.title('working day')
plt.axis('equal')
(-1.438451504893538,
 1.4304024814759062,
 -1.4388335098293494,
 1.4343901442970892)

[Figure: rentals on working vs. non-working days (bar and pie charts)]

weekday_df= day_df.groupby(['weekday'], as_index=True).agg({'casual':'mean', 'registered':'mean'})
weekday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by weekday')

[Figure: average daily rentals by weekday]

1. On working days, registered users account for most trips while casual usage is low;
2. On weekends, registered rentals drop and casual rentals increase.

Holidays
Since holidays account for only a small fraction of the year, first check how many holidays there are in each year.

holiday_coun=day_df.groupby('year', as_index=True).agg({'holiday':'sum'})
holiday_coun
      holiday
year
2011        6
2012        7

Because holidays make up such a small share of the year, take the daily mean rentals for holidays and non-holidays.

holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual':'mean', 'registered':'mean'})
holiday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated ')

[Figure: average daily rentals on holidays vs. non-holidays]

Feature Engineering

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime

train = pd.read_csv('./data/data31405/train.csv')
test  = pd.read_csv('./data/data31405/test.csv')
#Drop training rows whose count is more than 3 standard deviations from the mean
train_std = train[np.abs(train['count']-train['count'].mean())<=(3*train['count'].std())]

train_std.reset_index(drop=True,inplace=True)
train_std.shape
(10739, 12)
#Distribution of count after a log transform
ylabels = train_std['count']
ylabels_log = np.log(ylabels)
sns.distplot(ylabels_log)

[Figure: distribution of log-transformed count]

#Combine train_std and test so that the transformations are applied to both

#The indexes carry no real meaning, so use ignore_index
combine_train_test = train_std.append(test,ignore_index=True)
datetimecol = test['datetime']
print ('Combined dataset:',combine_train_test.shape)
Combined dataset: (17232, 12)
# Record the number of rows (axis 0 = rows, 1 = columns)
row_train = train_std.shape[0]
row_test = test.shape[0]
print('Training rows:',row_train,'\nTest rows:',row_test)
Training rows: 10739 
Test rows: 6493
# Split the datetime feature (split_datetime is the helper defined in the exploration section above)
combine_train_test = split_datetime(combine_train_test)
# Fill in the wind speed (RFG_windspeed, defined above); note that this reorders the rows
combine_train_test = RFG_windspeed(combine_train_test)

Based on the observations above, the following 11 variables are used as features: hour, temp, humidity, year, month, season, weather, windspeed, weekday, workingday, and holiday. Because CART decision trees use binary splits, the multi-category features are one-hot encoded into several binary columns.

combine_feature = combine_train_test[['temp','humidity','weather','season','year','weather',
                                      'month','weekday','hour','workingday','windspeed','count']]

combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 12 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weather       17232 non-null int64
season        17232 non-null int64
year          17232 non-null int64
weather       17232 non-null int64
month         17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
dtypes: float64(3), int64(9)
memory usage: 1.6 MB
# One-hot encode the multi-category features into binary columns
cols = ['month','season','weather','year']
combine_feature = pd.get_dummies(combine_feature,columns=cols,prefix_sep='_')

combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 33 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
month_1       17232 non-null uint8
month_2       17232 non-null uint8
month_3       17232 non-null uint8
month_4       17232 non-null uint8
month_5       17232 non-null uint8
month_6       17232 non-null uint8
month_7       17232 non-null uint8
month_8       17232 non-null uint8
month_9       17232 non-null uint8
month_10      17232 non-null uint8
month_11      17232 non-null uint8
month_12      17232 non-null uint8
season_1      17232 non-null uint8
season_2      17232 non-null uint8
season_3      17232 non-null uint8
season_4      17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
year_2011     17232 non-null uint8
year_2012     17232 non-null uint8
dtypes: float64(3), int64(4), uint8(26)
memory usage: 1.3 MB

Model Building

#Split the combined data back into training and test sets; note that the random forest
#wind speed imputation above shuffled the row order
mask = pd.notnull(combine_feature['count'])
train_data = combine_feature[mask]
test_data = combine_feature[~mask]

train_data.shape,test_data.shape
((10739, 33), (6493, 33))
# Training features
source_X = train_data.drop(['count'],axis = 1)

# Training labels (log1p-transformed counts)
source_y  = np.log1p(train_data['count'])

# Features of the Kaggle test set (to be predicted)
pred_X = test_data.drop(['count'],axis = 1)
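Because RFG_windspeed reorders the rows, the datetime values saved earlier from the original test file no longer line up with test_data. One way to keep them aligned (a sketch, not part of the original pipeline) is to pull the datetimes from the same shuffled frame, using the missing count values to identify the test rows:

# Test rows are exactly those with a missing count in the combined frame,
# and combine_train_test has the same (shuffled) row order as combine_feature.
test_mask = combine_train_test['count'].isnull()
datetimecol = combine_train_test.loc[test_mask, 'datetime'].reset_index(drop=True)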

Models

from sklearn.model_selection import GridSearchCV

# Evaluation helper
def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model, # the model to search over
                        params, # the parameters to try
                        n_jobs=-1,
                        error_score=0.) # if a fit raises an error, score it as 0
    grid.fit(X, y) # fit the model over the parameter grid
    # Best cross-validated score (R^2 for regressors)
    print("Best Accuracy: {}".format(grid.best_score_))
    # Parameters that achieved the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Average fit time (seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average scoring time (seconds)
    # This gives a sense of how the model would perform in production
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
    return grid
from sklearn.model_selection import train_test_split 

# Split the training data into a training and a validation set
train_X, test_X, train_y, test_y = train_test_split(source_X,
                                                    source_y,
                                                    train_size = 0.80)

#Print the dataset sizes
print ('Full features:',source_X.shape, 'Training features:',train_X.shape,'Validation features:',test_X.shape)

print ('Full labels:',source_y.shape, 'Training labels:',train_y.shape,'Validation labels:',test_y.shape)
Full features: (10739, 32) Training features: (8591, 32) Validation features: (2148, 32)
Full labels: (10739,) Training labels: (8591,) Validation labels: (2148,)

Random Forest

from sklearn.ensemble import RandomForestRegressor


# Parameter grid for the random forest
forest_parmas = {'n_estimators':[1300,1500,1700], 'max_depth':range(20,30,4)}

Model = RandomForestRegressor(oob_score=True,n_jobs=-1,random_state = 42)

Model = get_best_model_and_accuracy(Model,forest_parmas ,train_X, train_y)
Best Accuracy: 0.9495761615636888
Best Parameters: {'max_depth': 24, 'n_estimators': 1500}
Average Time to Fit (s): 33.625
Average Time to Score (s): 1.056
Model=Model.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=24,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1500, n_jobs=-1,
           oob_score=True, random_state=42, verbose=0, warm_start=False)
# For a regressor, score returns the R^2 on the validation set
Model.score(test_X,test_y)
0.9506555663644386
# Out-of-bag score
Model.oob_score_
0.9544882922113979
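For a like-for-like comparison with the xgboost model below, the same log-scale MAE can also be computed for the random forest (a minimal sketch):

from sklearn.metrics import mean_absolute_error

# MAE on the log1p-transformed labels, matching the xgboost evaluation below
print(mean_absolute_error(test_y, Model.predict(test_X)))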
# Save the model
from sklearn.externals import joblib

joblib.dump(Model, "rf.pkl", compress=9)

xgboost

import xgboost as xg

# Parameter grid. subsample: fraction of rows sampled for each tree
xg_parmas = {'subsample':[i/10.0 for i in range(6,10)],
            'colsample_bytree':[i/10.0 for i in range(6,10)]} # fraction of columns sampled for each tree

xg_model = xg.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4)

xg_model = get_best_model_and_accuracy(xg_model,xg_parmas,train_X.values, train_y.values)
Best Accuracy: 0.9519465121003385
Best Parameters: {'colsample_bytree': 0.9, 'subsample': 0.9}
Average Time to Fit (s): 1.136
Average Time to Score (s): 0.019
xg_model=xg_model.best_estimator_
xg_model
XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.9, gamma=0.4, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.300000012, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=nan, monotone_constraints=None,
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='reg:squarederror', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=0.9, tree_method=None,
       validate_parameters=False, verbosity=None)
from sklearn.metrics import mean_absolute_error

pre_y = xg_model.predict(test_X.values)
mean_absolute_error(pre_y,test_y.values)
0.20327569987981398
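The 0.20 above is an error on the log scale, since the labels were log1p-transformed. To read it in actual rental counts, invert the transform first (a minimal sketch):

# expm1 undoes the log1p applied to the labels, giving the error in rentals
print(mean_absolute_error(np.expm1(test_y.values), np.expm1(pre_y)))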

learning curve

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1, 
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    画出data在某模型上的learning curve.
    参数解释
    ----------
    estimator : 分类器。
    title : 表格的标题。
    X : 输入的feature,numpy类型
    y : 输入的target vector
    ylim : tuple格式的(ymin, ymax), 设定图像中纵坐标的最低点和最高点
    cv : 做cross-validation的时候,数据分成的份数,其中一份作为cv集,其余n-1份作为training(默认为3份)
    n_jobs : 并行的的任务数(默认1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Number of training samples")
        plt.ylabel("Score")
        plt.gca().invert_yaxis()
        plt.grid()  # grid lines
    
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                         alpha=0.1, color="b")  # shade the +/- one-std band around each curve
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"test score")
    
        plt.legend(loc="best")
        
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff
plot_learning_curve(Model, "Learning curve", train_X, train_y)

[Figure: learning curve of the random forest model]

(0.969214095199126, 0.048081832611191144)
# Predict on the Kaggle test set
pred_value = Model.predict(pred_X)
# expm1 inverts the log1p transform that was applied to the labels
pred_value = np.expm1(pred_value)
submission = pd.DataFrame({'datetime':datetimecol, 'count':pred_value})
submission['count'] = submission['count'].astype(int)
submission.to_csv('bike_predictions.csv',index = False)
submission.head()
              datetime  count
0  2011-01-20 00:00:00     10
1  2011-01-20 01:00:00      3
2  2011-01-20 02:00:00      3
3  2011-01-20 03:00:00      6
4  2011-01-20 04:00:00     37
5  2011-01-20 05:00:00     90
