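Kaggle Case Study: Bike Sharing Demand
Bike-sharing systems are a way of renting bicycles: people can pick up a bike at any location and return it once they reach their destination. These systems explicitly record trip duration, departure location, arrival location, and time, so they can be used to study mobility within a city. In this project, the task is to combine historical usage patterns with weather data to predict bike rental demand in Washington, D.C.
The data provides hourly rental records spanning two years, together with weather and date information. The training set consists of the first 19 days of each month, and the test set covers the 20th through the end of each month.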
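Variable descriptions:
- datetime - date (year, month, day) plus the hour
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is a holiday
- workingday - whether the day is a working day
- weather (weather level) - 1: clear, few clouds, partly cloudy. 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist. 3: light snow, light rain + thunderstorm + scattered clouds, light rain + clouds. 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
- temp - temperature
- atemp - "feels like" temperature
- humidity - relative humidity
- windspeed - wind speed
- casual - number of rentals by non-registered users
- registered - number of rentals by registered users
- count - total number of rentals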
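The relationship between the three rental columns can be checked directly. A minimal sketch, assuming a small hand-made frame in place of train.csv (the values are copied from the first training rows printed in this notebook; the name `sample` is only illustrative):

```python
import pandas as pd

# A few rows copied from the head of the training data.
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# count should equal casual + registered on every row.
mismatch = sample[sample["casual"] + sample["registered"] != sample["count"]]
print(len(mismatch))
```

If the same check on the full training set returns no rows, casual and registered can be treated as an exact decomposition of count.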
Data exploration
- Missing value check
- Outlier check
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
data_train = pd.read_csv('./data/train.csv')
data_test = pd.read_csv('./data/test.csv')
data_train.info()
print('-'*40)
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime 6493 non-null object
season 6493 non-null int64
holiday 6493 non-null int64
workingday 6493 non-null int64
weather 6493 non-null int64
temp 6493 non-null float64
atemp 6493 non-null float64
humidity 6493 non-null int64
windspeed 6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB
The data has no missing values: every column's non-null count equals the number of rows.
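Besides reading the non-null counts from info(), the check can be made explicit. A minimal sketch on a toy frame; `missing_report` is a hypothetical helper, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)

# Toy frame with one deliberately missing humidity value.
toy = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02],
    "humidity": [81, np.nan, 80],
})
report = missing_report(toy)
print(report)
```

Applied to data_train and data_test, a report of all zeros would confirm the statement above. It would not catch missing values that were encoded as a real number such as 0.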
data_train.head()
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
data_test.head()
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | |
---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
# Summary statistics
data_train.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
season | 10886.0 | 2.506614 | 1.116174 | 1.00 | 2.0000 | 3.000 | 4.0000 | 4.0000 |
holiday | 10886.0 | 0.028569 | 0.166599 | 0.00 | 0.0000 | 0.000 | 0.0000 | 1.0000 |
workingday | 10886.0 | 0.680875 | 0.466159 | 0.00 | 0.0000 | 1.000 | 1.0000 | 1.0000 |
weather | 10886.0 | 1.418427 | 0.633839 | 1.00 | 1.0000 | 1.000 | 2.0000 | 4.0000 |
temp | 10886.0 | 20.230860 | 7.791590 | 0.82 | 13.9400 | 20.500 | 26.2400 | 41.0000 |
atemp | 10886.0 | 23.655084 | 8.474601 | 0.76 | 16.6650 | 24.240 | 31.0600 | 45.4550 |
humidity | 10886.0 | 61.886460 | 19.245033 | 0.00 | 47.0000 | 62.000 | 77.0000 | 100.0000 |
windspeed | 10886.0 | 12.799395 | 8.164537 | 0.00 | 7.0015 | 12.998 | 16.9979 | 56.9969 |
casual | 10886.0 | 36.021955 | 49.960477 | 0.00 | 4.0000 | 17.000 | 49.0000 | 367.0000 |
registered | 10886.0 | 155.552177 | 151.039033 | 0.00 | 36.0000 | 118.000 | 222.0000 | 886.0000 |
count | 10886.0 | 191.574132 | 181.144454 | 1.00 | 42.0000 | 145.000 | 284.0000 | 977.0000 |
Outlier check
- count
- casual
- registered
# Check whether the distributions look Gaussian
fig,axes = plt.subplots(1,3)
# Set the figure size in inches (1 inch = 2.54 cm)
fig.set_size_inches(18,5)
sns.distplot(data_train['count'],bins=100,ax=axes[0])
sns.distplot(data_train['casual'],bins=100,ax=axes[1])
sns.distplot(data_train['registered'],bins=100,ax=axes[2])
data_train[['count','casual','registered']].describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
count | 10886.0 | 191.574132 | 181.144454 | 1.0 | 42.0 | 145.0 | 284.0 | 977.0 |
casual | 10886.0 | 36.021955 | 49.960477 | 0.0 | 4.0 | 17.0 | 49.0 | 367.0 |
registered | 10886.0 | 155.552177 | 151.039033 | 0.0 | 36.0 | 118.0 | 222.0 | 886.0 |
fig,axes = plt.subplots(1,3)
fig.set_size_inches(12,6)
sns.boxplot(data = data_train['count'],ax=axes[0])
axes[0].set(xlabel='count')
sns.boxplot(data = data_train['casual'], ax=axes[1])
axes[1].set(xlabel='casual')
sns.boxplot(data = data_train['registered'], ax=axes[2])
axes[2].set(xlabel='registered')
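count: the mean is 191, the standard deviation is 181, the median is 145, the 75th percentile is 284, and the maximum is 977, which indicates a long right tail. We remove the outliers, apply a log transform, and look at the result.
count = casual + registered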
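The two steps just described, a 3σ filter followed by a log transform, can be sketched on synthetic data. The helper name `three_sigma_mask` and the numbers below are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

def three_sigma_mask(values):
    """True for values within 3 standard deviations of the mean."""
    values = pd.Series(values)
    return (values - values.mean()).abs() < 3 * values.std()

# Synthetic right-skewed counts: 99 ordinary hours and one extreme hour.
counts = pd.Series([100] * 50 + [200] * 49 + [5000])
mask = three_sigma_mask(counts)
kept = counts[mask]

# log1p compresses the right tail and is defined at 0, unlike log.
kept_log = np.log1p(kept)
print(len(counts), len(kept), kept.max())
```

Note that the mean and standard deviation are themselves pulled upward by the extreme values, so on strongly skewed data the 3σ rule removes only the most extreme part of the tail.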
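# Remove outliers: values more than 3σ away from the mean μ are treated as outliers
def drop_outlier(data,col):
    mask = np.abs(data[col]-data[col].mean())<(3*data[col].std())
    # Copy so that adding the log column does not write into a slice of the original frame
    data = data.loc[mask].copy()
    # Plot col and col_log after the outliers have been removed
    data[col+'_log'] = np.log1p(data[col])
    f, [ax1, ax2] = plt.subplots(1,2, figsize=(15,6))
    sns.distplot(data[col], ax=ax1)
    ax1.set_title(col+' distribution')
    sns.distplot(data[col+'_log'], ax=ax2)
    ax2.set_title(col+'_log distribution')
    return data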
data_train = drop_outlier(data_train,'count')
data_train = drop_outlier(data_train,'casual')
data_train = drop_outlier(data_train,'registered')
Feature decomposition
Split the datetime feature into date, weekday, year, month, day, and hour.
def split_datetime(data):
    data['date'] = data['datetime'].apply(lambda x:x.split()[0])
    data['weekday'] =data['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d').isoweekday())
    data['year'] = data['date'].apply(lambda x:x.split('-')[0]).astype('int')
    data['month'] = data['date'].apply(lambda x:x.split('-')[1]).astype('int')
    data['day'] = data['date'].apply(lambda x:x.split('-')[2]).astype('int')
    data['hour'] = data['datetime'].apply(lambda x:x.split()[1].split(':')[0]).astype('int')
    return data
data_train = split_datetime(data_train)
data_train.head()
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | count_log | date | weekday | year | month | day | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 2.833213 | 2011-01-01 | 6 | 2011 | 1 | 1 | 0 |
1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 3.713572 | 2011-01-01 | 6 | 2011 | 1 | 1 | 1 |
2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 3.496508 | 2011-01-01 | 6 | 2011 | 1 | 1 | 2 |
3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 2.639057 | 2011-01-01 | 6 | 2011 | 1 | 1 | 3 |
4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 0.693147 | 2011-01-01 | 6 | 2011 | 1 | 1 | 4 |
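The same columns can also be derived without per-row string splitting. A minimal sketch using pandas' `.dt` accessor; `split_datetime_vectorized` is a hypothetical alternative, not the function used in this notebook:

```python
import pandas as pd

def split_datetime_vectorized(data):
    """Derive date parts from a 'datetime' string column using the .dt accessor."""
    out = data.copy()
    ts = pd.to_datetime(out["datetime"])
    out["date"] = ts.dt.strftime("%Y-%m-%d")
    out["weekday"] = ts.dt.dayofweek + 1  # Monday=1 ... Sunday=7, like isoweekday()
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    return out

demo = split_datetime_vectorized(
    pd.DataFrame({"datetime": ["2011-01-01 04:00:00", "2012-01-09 18:00:00"]})
)
print(demo)
```

Unlike split_datetime, this version returns a new frame and leaves its input unchanged.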
Visual analysis
- Distribution of the numeric variables
- Box plots for the categorical variables
data_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10739 entries, 0 to 10885
Data columns (total 19 columns):
datetime 10739 non-null object
season 10739 non-null int64
holiday 10739 non-null int64
workingday 10739 non-null int64
weather 10739 non-null int64
temp 10739 non-null float64
atemp 10739 non-null float64
humidity 10739 non-null int64
windspeed 10739 non-null float64
casual 10739 non-null int64
registered 10739 non-null int64
count 10739 non-null int64
count_log 10739 non-null float64
date 10739 non-null object
weekday 10739 non-null int64
year 10739 non-null int64
month 10739 non-null int64
day 10739 non-null int64
hour 10739 non-null int64
dtypes: float64(4), int64(13), object(2)
memory usage: 2.0+ MB
fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)
sns.distplot(data_train['temp'],bins=60,ax=axes[0,0])
sns.distplot(data_train['atemp'],bins=60,ax=axes[0,1])
sns.distplot(data_train['humidity'],bins=60,ax=axes[1,0])
sns.distplot(data_train['windspeed'],bins=60,ax=axes[1,1])
fig,axes = plt.subplots(2,2)
fig.set_size_inches(15,12)
sns.boxplot(x='season', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,0])
sns.boxplot(x='holiday', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,1])
sns.boxplot(x='workingday', y='count', data = data_train, orient='v', width=0.6, ax=axes[1,0])
sns.boxplot(x='weather',y='count',data=data_train,orient='v',width=0.6,ax=axes[1,1])
data_train['windspeed'].describe()
count 10739.000000
mean 12.787706
std 8.171075
min 0.000000
25% 7.001500
50% 12.998000
75% 16.997900
max 56.996900
Name: windspeed, dtype: float64
data_train.boxplot(['windspeed'])
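The plot above shows a large number of records with a wind speed of 0. These are probably missing values that were filled with 0. Here we use a random forest to predict and fill in the wind speed wherever it is 0.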
np.sum(data_train['windspeed'] == 0),data_train['windspeed'].shape[0]
(1253, 10264)
# Fill in wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor
def RFG_windspeed(data):
    # Split the data into rows where windspeed equals 0 and rows where it does not
    mask = data['windspeed'] == 0
    wind_0 = data[mask].copy()
    wind_1 = data[~mask]
    if len(wind_0.index)==0:
        return data
    Model_wind = RandomForestRegressor(n_estimators=1000,random_state=42)
    # Features used to predict wind speed
    cols = ["season","weather","humidity","month","temp","year","atemp"]
    windspeed_X = wind_1[cols]
    # Target: the observed non-zero wind speeds
    windspeed_y = wind_1['windspeed']
    windspeedpre_X = wind_0[cols]
    Model_wind.fit(windspeed_X,windspeed_y)
    # Predict wind speed for the zero rows
    wind_0Values = Model_wind.predict(X=windspeedpre_X)
    # Fill in the predictions
    wind_0.loc[:,'windspeed'] = wind_0Values
    # Non-zero rows come first, then the filled rows, so the original row order is not preserved
    data = pd.concat([wind_1, wind_0]).reset_index(drop=True)
    return data
data_train['windspeed'].head()
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
Name: windspeed, dtype: float64
data_train = RFG_windspeed(data_train)
data_train['windspeed'].head()
0 6.0032
1 16.9979
2 19.0012
3 19.0012
4 19.9995
Name: windspeed, dtype: float64
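The head of windspeed changes completely after the fill because RFG_windspeed moves the filled rows to the end of the frame. A minimal sketch of a variant that writes the predictions back in place, so row order is kept; `fill_zero_windspeed`, the synthetic data, and the small `n_estimators` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fill_zero_windspeed(data, feature_cols, n_estimators=50, random_state=42):
    """Replace windspeed == 0 with random forest predictions, keeping row order."""
    out = data.copy()
    # Cast to float so that fractional predictions can be written back.
    out["windspeed"] = out["windspeed"].astype(float)
    zero = out["windspeed"] == 0
    if not zero.any() or zero.all():
        return out
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(out.loc[~zero, feature_cols], out.loc[~zero, "windspeed"])
    # Boolean-mask assignment writes predictions into their original positions.
    out.loc[zero, "windspeed"] = model.predict(out.loc[zero, feature_cols])
    return out

# Synthetic data: wind speed loosely tied to temperature, every 10th value zeroed out.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"temp": rng.uniform(0, 35, 200),
                    "humidity": rng.integers(20, 100, 200)})
toy["windspeed"] = 5 + 0.3 * toy["temp"] + rng.normal(0, 1, 200)
toy.loc[toy.index % 10 == 0, "windspeed"] = 0.0
filled = fill_zero_windspeed(toy, ["temp", "humidity"])
print((filled["windspeed"] == 0).sum())
```

Keeping the order matters whenever another object, such as a separately stored datetime column, is later matched to the rows by position.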
Look at the density distributions of these four features again.
fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)
sns.distplot(data_train['temp'],ax=axes[0,0])
axes[0,0].set(xlabel='temp')
sns.distplot(data_train['atemp'],ax=axes[0,1])
axes[0,1].set(xlabel='atemp')
sns.distplot(data_train['humidity'],ax=axes[1,0])
axes[1,0].set(xlabel='humidity')
sns.distplot(data_train['windspeed'],ax=axes[1,1])
axes[1,1].set(xlabel='windspeed')
[Text(0.5,0,'windspeed')]
Take an overall look at how the three rental-related values relate to the other features.
# Use seaborn's pairwise relationship plot
cols =['season','holiday','workingday','weekday','weather','temp',
'atemp','humidity','windspeed','hour']
sns.pairplot(data_train ,x_vars=cols,
y_vars=['casual','registered','count'],
plot_kws={'alpha': 0.2})
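- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - holiday
- workingday - working day
- weather - weather level
- temp - temperature
- atemp - "feels like" temperature
- humidity - relative humidity
- windspeed - wind speed
- casual - number of rentals by non-registered users
- registered - number of rentals by registered users
- count - total number of rentals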
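Observations:
- Rentals are lower overall in season 1 (coded as spring)
- Total rentals on non-holidays are higher than on holidays
- Registered users ride more on working days and less on holidays, while casual users show the opposite pattern
- Rentals decrease as the weather level increases
- Temperature and humidity have a larger effect on casual users and a smaller effect on registered users
- Hour of day clearly affects rentals: registered users show two peaks, while casual users follow a roughly bell-shaped curve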
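Check the correlation of each feature with count
# Restrict to numeric columns; the string columns datetime and date cannot be correlated
corr = data_train.select_dtypes(include='number').corr()
plt.subplots(figsize=(14,14))
sns.heatmap(corr,annot=True,vmax=1,cmap='YlGnBu')
# Sort by absolute correlation with count, descending
np.abs(corr['count']).sort_values(ascending=False)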
count 1.000000
registered 0.977642
count_log 0.845294
registered_log 0.839080
casual_log 0.758863
casual 0.711105
hour 0.442659
temp 0.376656
atemp 0.372705
humidity 0.307362
month 0.176115
season 0.173657
year 0.171519
weather 0.123578
windspeed 0.116767
workingday 0.046246
weekday 0.022100
day 0.015243
holiday 0.002421
Name: count, dtype: float64
Ranked by the absolute value of their linear correlation with count, the features are ordered as follows:
hour > temp > atemp > humidity > month > season > year
> weather > windspeed > workingday > weekday > day > holiday
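The ranking leaves out registered, casual, and the *_log columns, which are components or transforms of count rather than predictors. A minimal sketch of producing such a ranking on synthetic data; `rank_by_correlation` and the toy column names are hypothetical:

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target, exclude=()):
    """Absolute Pearson correlation of numeric columns with target, descending."""
    numeric = df.select_dtypes(include="number").drop(columns=list(exclude))
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Synthetic example: count depends strongly on hour_signal and weakly on noise.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"hour_signal": rng.normal(size=500),
                    "noise": rng.normal(size=500)})
toy["count"] = 3 * toy["hour_signal"] + 0.1 * toy["noise"] + rng.normal(size=500)
toy["registered"] = 0.8 * toy["count"]  # a component of the target, excluded below
ranking = rank_by_correlation(toy, "count", exclude=["registered"])
print(ranking)
```

Pearson correlation only measures linear association, so a feature with a strongly non-linear effect (such as a two-peak hourly pattern) can matter more to a tree model than its rank here suggests.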
hour and count
# Overall trend by hour
date = data_train.groupby(['hour'], as_index=False).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Total rentals
plt.plot(date['hour'], date['count'], linewidth=1.3, label='count')
# Rentals by registered users
plt.plot(date['hour'], date['registered'], linewidth=1.3, label='registered')
# Rentals by casual users
plt.plot(date['hour'], date['casual'], linewidth=1.3, label='casual')
plt.legend()
# Relationship between hour and count on working and non-working days
date = data_train.groupby(['workingday','hour'], as_index=False).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
mask = date['workingday'] == 1
workingday_date= date[mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)
nworkingday_date = date[~mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)
fig, axes = plt.subplots(1,2,sharey = True)
workingday_date.plot(figsize=(15,5),title ='working day',ax=axes[0])
axes[0].set(xlabel='hour')
nworkingday_date.plot(figsize=(15,5),title ='nonworkdays',ax=axes[1])
axes[1].set(xlabel='hour')
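Observations:
- Working days
  - Registered users show two rental peaks at commuting hours, plus a smaller peak around noon, possibly people going out for lunch.
  - Casual users show a flatter curve, peaking at around 17:00.
  - Registered users rent far more bikes than casual users.
- Non-working days
  - Rentals (count) follow a roughly bell-shaped curve over the day, peaking at around 12:00 with a trough at around 4:00, and are spread fairly evenly.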
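The peak hours read off the plots can also be obtained numerically. A minimal sketch with a pivot table on toy records; the values are made up, and in the notebook the input would be data_train:

```python
import pandas as pd

# Toy hourly records with made-up counts.
toy = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "hour":       [8, 8, 12, 12, 8, 8, 12, 12],
    "count":      [400, 420, 150, 170, 60, 80, 300, 340],
})

# Rows are hours, columns are workingday flags, cells are mean rentals.
profile = toy.pivot_table(index="hour", columns="workingday",
                          values="count", aggfunc="mean")
# For each workingday flag, the hour with the highest mean rentals.
peak_hour = profile.idxmax()
print(profile)
print(peak_hour)
```

The pivot keeps hour as the index, so plotting `profile` directly puts the real hour on the x axis.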
Temperature and count
Plot the overall temperature trend over the two years
# Aggregate by day, taking the median temperature of each day
temp_df = data_train.groupby(['date','weekday'],as_index=False).agg({'year':'mean',
'month':'mean',
'temp':'median'})
# Drop missing data
# temp_df.dropna (axis=0,how ='any',inplace=True)
# The daily values still fluctuate a lot, so also take the monthly median of the daily values
temp_month = temp_df.groupby(['year','month'],as_index=False).agg({'weekday':'min',
'temp':'median'})
# Convert the date of the daily data to datetime
temp_df['date']=pd.to_datetime(temp_df['date'])
# Build a date column for the monthly data; the minimum weekday stands in for the day of the month
temp_month.rename(columns={'weekday':'day'},inplace=True)
temp_month['date']=pd.to_datetime(temp_month[['year','month','day']])
# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Line plot of temperature over time
plt.plot(temp_df['date'] , temp_df['temp'], linewidth=1.3, label='daily median')
ax.set_title('Daily temperature trend over the two years')
plt.plot(temp_month['date'] , temp_month['temp'], marker='o',
linewidth=1.3,label='monthly median')
ax.legend()
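Each year shows the same temperature pattern: the highest temperatures are in July and the lowest in January. Next, look at how the average hourly rental count changes with temperature.
# Average by temperature
temp = data_train.groupby(['temp'], as_index=True).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
temp.plot(figsize=(10,5),title='count versus temperature')
count reaches its minimum at a temperature of about 4 degrees, then generally rises as the temperature increases, but starts to fall once the temperature exceeds 35.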
Humidity and count
# Plot the overall humidity trend over the two years
humidity_df = data_train.groupby(['date'],as_index=False).agg({'humidity':'mean'})
humidity_df['date']=pd.to_datetime(humidity_df['date'])
# Use the date as a time index
humidity_df = humidity_df.set_index('date')
humidity_month = data_train.groupby(['year','month'],as_index=False).agg({'weekday':'min',
'humidity':'mean'})
# Build a date column for the monthly data
humidity_month.rename(columns={'weekday':'day'},inplace=True)
humidity_month['date']=pd.to_datetime(humidity_month[['year','month','day']])
# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Line plot of humidity over time
ax.set_title('Daily humidity trend over the two years')
plt.plot(humidity_df.index,humidity_df['humidity'], linewidth=1.3, label='daily mean')
plt.plot(humidity_month['date'],humidity_month['humidity'], marker='o',
linewidth=1.3,label='monthly mean')
plt.grid()
ax.legend()
Look at how rentals change with humidity by averaging the rental counts at each humidity value.
# Humidity
humidity = data_train.groupby(['humidity'], as_index=True).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
humidity.plot(figsize=(10,5),title='count versus humidity')
Rentals rise quickly to a peak at a humidity of about 20 and then decline slowly.
year, month and count
# First look at how the total number of rentals changes over the two years
count_df = data_train.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
'month':'mean',
'casual':'sum',
'registered':'sum',
'count':'sum'})
# The daily totals still fluctuate a lot, so also take the monthly mean of the daily totals
count_month = count_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
'casual':'mean',
'registered':'mean',
'count':'mean'})
# Convert the date of the daily data to datetime
count_df['date']=pd.to_datetime(count_df['date'])
# Build a date column for the monthly data
count_month.rename(columns={'weekday':'day'},inplace=True)
count_month['date']=pd.to_datetime(count_month[['year','month','day']])
# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Line plot of total rentals (count) over time
ax.set_title('Overall trend of count over the two years')
plt.plot(count_df['date'],count_df['count'],linewidth=1.3,label='daily total')
plt.plot(count_month['date'],count_month['count'],marker='o',
linewidth=1.3,label='monthly mean of daily totals')
plt.grid()
ax.legend()
Observations:
- Rentals in 2012 are higher overall than in 2011;
- Rentals vary clearly from month to month;
- The data fluctuates sharply from September to December 2011 and from March to September 2012;
- There are many local troughs.
# Trend of rentals by month
date = data_train.groupby(['month'], as_index=False).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(date['month'], date['count'] , linewidth=1.3 , label = 'total rentals' )
plt.plot(date['month'], date['registered'] , linewidth=1.3 , label = 'registered rentals' )
plt.plot(date['month'], date['casual'] , linewidth=1.3 , label = 'casual rentals' )
plt.legend()
season and count
day_df=data_train.groupby('date').agg({'year':'mean','season':'mean',
'casual':'sum', 'registered':'sum'
,'count':'sum','temp':'mean',
'atemp':'mean'})
season_df = day_df.groupby(['year','season'], as_index=True).agg({'casual':'mean',
'registered':'mean',
'count':'mean'})
temp_df = day_df.groupby(['year','season'], as_index=True).agg({'temp':'mean',
'atemp':'mean'})
fig = plt.figure(figsize=(10,10))
xlabels = season_df.index.map(lambda x:str(x))
ax1 = fig.add_subplot(2,1,1)
ax1.set_title('Overall trend of count by season over the two years')
plt.plot(xlabels,season_df)
plt.legend(['casual','registered','count'])
ax2 = fig.add_subplot(2,1,2)
ax2.set_title('Overall trend of temperature by season over the two years')
plt.plot(xlabels,temp_df)
plt.legend(['temp','atemp'])
Both casual and registered rentals peak in fall and are lowest in spring.
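weather and count
The weather levels occur with different frequencies; very bad weather (level 4), for example, is rare. So first check the number of records at each weather level, then take the hourly average of the rental counts for each level.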
count_weather = data_train.groupby('weather')
count_weather[['casual','registered','count']].count()
casual | registered | count | |
---|---|---|---|
weather | |||
1 | 6719 | 6719 | 6719 |
2 | 2705 | 2705 | 2705 |
3 | 839 | 839 | 839 |
4 | 1 | 1 | 1 |
weather_df = data_train.groupby('weather',as_index=True).agg({'casual':'mean',
'registered':'mean'})
weather_df.plot.bar(stacked=True)
Rentals at weather level 4 are also high, which seems implausible, so print the corresponding record and take a look.
data_train[data_train['weather']==4].T
4863 | |
---|---|
datetime | 2012-01-09 18:00:00 |
season | 1 |
holiday | 0 |
workingday | 1 |
weather | 4 |
temp | 8.2 |
atemp | 11.365 |
humidity | 86 |
windspeed | 6.0032 |
casual | 6 |
registered | 158 |
count | 164 |
count_log | 5.10595 |
casual_log | 1.94591 |
registered_log | 5.0689 |
date | 2012-01-09 |
weekday | 1 |
year | 2012 |
month | 1 |
day | 9 |
hour | 18 |
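The only level-4 record falls in the Monday evening rush hour, which explains its high count, so it is an anomalous data point.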
windspeed and count
# Overall wind speed trend over the two years
windspeed_df = data_train.groupby('date',as_index=False).agg({'windspeed':'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])
# Use the date as a time index
windspeed_df = windspeed_df.set_index('date')
windspeed_month = data_train.groupby(['year','month'], as_index=False).agg({'weekday':'min',
'windspeed':'mean'})
windspeed_month.rename(columns={'weekday':'day'},inplace=True)
windspeed_month['date']=pd.to_datetime(windspeed_month[['year','month','day']])
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(windspeed_df.index, windspeed_df['windspeed'] , linewidth=1.3,label='daily mean')
plt.plot(windspeed_month['date'], windspeed_month['windspeed'],
marker='o', linewidth=1.3,label='monthly mean')
ax.legend()
ax.set_title('Overall wind speed trend over the two years')
Wind speed fluctuates strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so the per-speed averages computed below rest on few records at the high end and can look anomalous there.
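# Wind speed
# Convert to integers
data_train['windspeed'] = data_train['windspeed'].astype(int)
windspeed = data_train.groupby(['windspeed'], as_index=True).agg({'count':'mean',
'registered':'mean',
'casual':'mean'})
windspeed.plot(figsize=(10,8))
Rentals fall as wind speed rises, dropping noticeably once the wind speed exceeds 18, but there is a rebound at a wind speed of around 20. As with the weather levels, this is probably caused by anomalous records, so print them and take a look.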
df2=data_train[data_train['windspeed']>40]
df2=df2[df2['count']>150]
df2
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | ... | count | count_log | casual_log | registered_log | date | weekday | year | month | day | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
760 | 2011-02-19 14:00:00 | 1 | 0 | 0 | 1 | 18.86 | 22.725 | 15 | 43 | 102 | ... | 196 | 5.283204 | 4.634729 | 4.553877 | 2011-02-19 | 6 | 2011 | 2 | 19 | 14 |
761 | 2011-02-19 15:00:00 | 1 | 0 | 0 | 1 | 18.04 | 21.970 | 16 | 50 | 84 | ... | 171 | 5.147494 | 4.442651 | 4.477337 | 2011-02-19 | 6 | 2011 | 2 | 19 | 15 |
2447 | 2011-07-03 17:00:00 | 3 | 0 | 0 | 3 | 32.80 | 37.120 | 49 | 56 | 181 | ... | 358 | 5.883322 | 5.204007 | 5.181784 | 2011-07-03 | 7 | 2011 | 7 | 3 | 17 |
2448 | 2011-07-03 18:00:00 | 3 | 0 | 0 | 3 | 32.80 | 37.120 | 49 | 56 | 74 | ... | 181 | 5.204007 | 4.317488 | 4.682131 | 2011-07-03 | 7 | 2011 | 7 | 3 | 18 |
2941 | 2011-08-07 17:00:00 | 3 | 0 | 0 | 3 | 30.34 | 35.605 | 74 | 43 | 63 | ... | 194 | 5.273000 | 4.158883 | 4.882802 | 2011-08-07 | 7 | 2011 | 8 | 7 | 17 |
5590 | 2012-03-05 18:00:00 | 1 | 0 | 1 | 3 | 11.48 | 11.365 | 55 | 43 | 12 | ... | 375 | 5.929589 | 2.564949 | 5.897154 | 2012-03-05 | 1 | 2012 | 3 | 5 | 18 |
5652 | 2012-03-08 13:00:00 | 1 | 0 | 1 | 2 | 24.60 | 31.060 | 49 | 43 | 35 | ... | 233 | 5.455321 | 3.583519 | 5.293305 | 2012-03-08 | 4 | 2012 | 3 | 8 | 13 |
5653 | 2012-03-08 14:00:00 | 1 | 0 | 1 | 2 | 25.42 | 31.060 | 43 | 43 | 48 | ... | 203 | 5.318120 | 3.891820 | 5.049856 | 2012-03-08 | 4 | 2012 | 3 | 8 | 14 |
5654 | 2012-03-08 15:00:00 | 1 | 0 | 1 | 1 | 26.24 | 31.060 | 38 | 46 | 24 | ... | 185 | 5.225747 | 3.218876 | 5.087596 | 2012-03-08 | 4 | 2012 | 3 | 8 | 15 |
5655 | 2012-03-08 16:00:00 | 1 | 0 | 1 | 2 | 25.42 | 31.060 | 41 | 43 | 37 | ... | 342 | 5.837730 | 3.637586 | 5.723585 | 2012-03-08 | 4 | 2012 | 3 | 8 | 16 |
5656 | 2012-03-08 17:00:00 | 1 | 0 | 1 | 1 | 25.42 | 31.060 | 38 | 43 | 52 | ... | 597 | 6.393591 | 3.970292 | 6.302619 | 2012-03-08 | 4 | 2012 | 3 | 8 | 17 |
6015 | 2012-04-09 12:00:00 | 2 | 0 | 1 | 1 | 22.14 | 25.760 | 28 | 47 | 94 | ... | 280 | 5.638355 | 4.553877 | 5.231109 | 2012-04-09 | 1 | 2012 | 4 | 9 | 12 |
7880 | 2012-09-18 10:00:00 | 3 | 0 | 1 | 3 | 27.88 | 31.820 | 79 | 43 | 30 | ... | 160 | 5.081404 | 3.433987 | 4.875197 | 2012-09-18 | 2 | 2012 | 9 | 18 | 10 |
7881 | 2012-09-18 11:00:00 | 3 | 0 | 1 | 2 | 27.88 | 31.820 | 79 | 43 | 36 | ... | 151 | 5.023881 | 3.610918 | 4.753590 | 2012-09-18 | 2 | 2012 | 9 | 18 | 11 |
14 rows × 21 columns
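Effect of the date on rentals
Within one date, whether it is a working day, the day of the week, and the year are all constant, so sum the rental data by day and take the mean of the other date-related columns.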
day_df = data_train.groupby(['date'], as_index=False).agg({'casual':'sum','registered':'sum',
'count':'sum', 'workingday':'mean',
'weekday':'mean','holiday':'mean',
'year':'mean'})
day_df.head()
date | casual | registered | count | workingday | weekday | holiday | year | |
---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 | 331 | 654 | 985 | 0 | 6 | 0 | 2011 |
1 | 2011-01-02 | 131 | 670 | 801 | 0 | 7 | 0 | 2011 |
2 | 2011-01-03 | 120 | 1229 | 1349 | 1 | 1 | 0 | 2011 |
3 | 2011-01-04 | 108 | 1454 | 1562 | 1 | 2 | 0 | 2011 |
4 | 2011-01-05 | 82 | 1518 | 1600 | 1 | 3 | 0 | 2011 |
number_pei=day_df[['casual','registered']].mean()
number_pei
casual 657.543860
registered 3040.800439
dtype: float64
# Use an equal aspect ratio so the pie is drawn as a circle rather than an ellipse
plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual','registered'], autopct='%1.1f%%',
pctdistance=0.6 , labeldistance=1.05 , radius=1)
plt.title('Casual or registered in the total lease')
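Because the numbers of working and non-working days differ, the daily totals are averaged separately for working and non-working days; the same per-day averaging is then applied to each day of the week.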
workingday_df=day_df.groupby(['workingday'], as_index=True).agg({'casual':'mean',
'registered':'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]
# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8,6))
plt.subplots_adjust(hspace=0.5, wspace=0.2) # Set the spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5) # Set up the subplot grid so the axes align
plt.subplot2grid((2,2),(1,0), rowspan=2)
width = 0.3 # Bar width
p1 = plt.bar(workingday_df.index,workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index,workingday_df['registered'],
width,bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0,1], ('nonworking day', 'working day'),rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))
plt.subplot2grid((2,2),(0,0))
plt.pie(workingday_df_0, labels=['casual','registered'], autopct='%1.1f%%',
pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.axis('equal')
plt.title('nonworking day')
plt.subplot2grid((2,2),(0,1))
plt.pie(workingday_df_1, labels=['casual','registered'], autopct='%1.1f%%',
pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.title('working day')
plt.axis('equal')
(-1.438451504893538,
1.4304024814759062,
-1.4388335098293494,
1.4343901442970892)
weekday_df= day_df.groupby(['weekday'], as_index=True).agg({'casual':'mean', 'registered':'mean'})
weekday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by weekday')
1. On working days, registered users rent more and casual users rent less;
2. On weekends, registered rentals drop and casual rentals increase.
Holidays
Holidays make up a very small share of the year, so first look at how many holidays there are in each year.
holiday_coun=day_df.groupby('year', as_index=True).agg({'holiday':'sum'})
holiday_coun
holiday | |
---|---|
year | |
2011 | 6 |
2012 | 7 |
Holidays account for a very small share of the days in a year, so take the daily average for holidays and for non-holidays.
holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual':'mean', 'registered':'mean'})
holiday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated ')
Feature engineering
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
train = pd.read_csv('./data/data31405/train.csv')
test = pd.read_csv('./data/data31405/test.csv')
# Remove training rows whose count lies more than 3 standard deviations from the mean
train_std = train[np.abs(train['count']-train['count'].mean())<=(3*train['count'].std())]
train_std.reset_index(drop=True,inplace=True)
train_std.shape
(10739, 12)
# Distribution of the target after a log transform
ylabels = train_std['count']
ylabels_log = np.log(ylabels)
sns.distplot(ylabels_log)
# Combine train_std and test so that the same transformations are applied to both
# The indices carry no meaning, so use ignore_index
combine_train_test = pd.concat([train_std, test], ignore_index=True)
datetimecol = test['datetime']
print ('Combined dataset:',combine_train_test.shape)
Combined dataset: (17232, 12)
# Record the number of rows (shape[0] is the row count, shape[1] the column count)
row_train = train_std.shape[0]
row_test = test.shape[0]
print('Training rows:',row_train,'\nTest rows:',row_test)
Training rows: 10739
Test rows: 6493
# Split the datetime feature
combine_train_test = split_datetime(combine_train_test)
# Fill in wind speed (note: this changes the row order)
combine_train_test = RFG_windspeed(combine_train_test)
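Based on the observations above, we use hour, temp, humidity, year, month, season, weather, windspeed, weekday, workingday, and holiday (11 items) as features. Because CART decision trees make binary splits, the multi-category variables are converted into multiple binary indicators with one-hot encoding.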
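What one-hot encoding does to a single categorical column can be seen on a tiny example. A minimal sketch; the frame `toy` is made up, and `dtype=int` is only there to print 0/1 rather than booleans:

```python
import pandas as pd

toy = pd.DataFrame({"season": [1, 2, 3, 4], "temp": [9.8, 18.0, 28.7, 14.2]})

# Each category of season becomes its own 0/1 column; temp is left untouched.
encoded = pd.get_dummies(toy, columns=["season"], prefix_sep="_", dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Encoding the combined train and test frame in one call, as this notebook does, guarantees that both parts end up with the same indicator columns even when a category appears in only one of them.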
combine_feature = combine_train_test[['temp','humidity','weather','season','year','weather',
'month','weekday','hour','workingday','windspeed','count']]
combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 12 columns):
temp 17232 non-null float64
humidity 17232 non-null int64
weather 17232 non-null int64
season 17232 non-null int64
year 17232 non-null int64
weather 17232 non-null int64
month 17232 non-null int64
weekday 17232 non-null int64
hour 17232 non-null int64
workingday 17232 non-null int64
windspeed 17232 non-null float64
count 10739 non-null float64
dtypes: float64(3), int64(9)
memory usage: 1.6 MB
# Convert the multi-category variables into binary indicators with one-hot encoding
cols = ['month','season','weather','year']
combine_feature = pd.get_dummies(combine_feature,columns=cols,prefix_sep='_')
combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 33 columns):
temp 17232 non-null float64
humidity 17232 non-null int64
weekday 17232 non-null int64
hour 17232 non-null int64
workingday 17232 non-null int64
windspeed 17232 non-null float64
count 10739 non-null float64
month_1 17232 non-null uint8
month_2 17232 non-null uint8
month_3 17232 non-null uint8
month_4 17232 non-null uint8
month_5 17232 non-null uint8
month_6 17232 non-null uint8
month_7 17232 non-null uint8
month_8 17232 non-null uint8
month_9 17232 non-null uint8
month_10 17232 non-null uint8
month_11 17232 non-null uint8
month_12 17232 non-null uint8
season_1 17232 non-null uint8
season_2 17232 non-null uint8
season_3 17232 non-null uint8
season_4 17232 non-null uint8
weather_1 17232 non-null uint8
weather_2 17232 non-null uint8
weather_3 17232 non-null uint8
weather_4 17232 non-null uint8
weather_1 17232 non-null uint8
weather_2 17232 non-null uint8
weather_3 17232 non-null uint8
weather_4 17232 non-null uint8
year_2011 17232 non-null uint8
year_2012 17232 non-null uint8
dtypes: float64(3), int64(4), uint8(26)
memory usage: 1.3 MB
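Building the model
# Split the combined data back into training and test sets; note that the random forest wind speed fill above changed the row order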
mask = pd.notnull(combine_feature['count'])
train_data = combine_feature[mask]
test_data = combine_feature[~mask]
train_data.shape,test_data.shape
((10739, 33), (6493, 33))
# Training features
source_X = train_data.drop(['count'],axis = 1)
# Training target (log1p-transformed count)
source_y = np.log1p(train_data['count'])
# Test-set features
pred_X = test_data.drop(['count'],axis = 1)
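Because the target is log1p(count), model outputs live on that scale and have to be mapped back with the matching inverse. A minimal sketch of the round trip; the array values are arbitrary:

```python
import numpy as np

counts = np.array([0, 1, 16, 977], dtype=float)

y_log = np.log1p(counts)      # log(1 + count), defined at count == 0
restored = np.expm1(y_log)    # exp(y) - 1, the exact inverse of log1p
off_by_one = np.exp(y_log)    # exp alone returns count + 1

print(restored)
print(off_by_one)
```

Using exp instead of expm1 shifts every prediction up by one rental, which matters most for the low-count night hours.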
Models
from sklearn.model_selection import GridSearchCV
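# Helper that runs a grid search and reports the result
def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,           # the model to search over
                        params,          # the parameter grid to try
                        n_jobs=-1,
                        error_score=0.)  # if a fit raises an error, its score is 0
    grid.fit(X, y)  # fit the model for every parameter combination
    # Best cross-validated score; for a regressor the default score is R^2, not accuracy
    print("Best Accuracy: {}".format(grid.best_score_))
    # Parameters that achieved the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Average time to fit (s)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average time to score (s)
    # This gives an idea of how the model would perform in a real-world setting
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
    return grid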
from sklearn.model_selection import train_test_split
# Hold out 20% of the labelled data for evaluation
train_X, test_X, train_y, test_y = train_test_split(source_X,
source_y,
train_size = 0.80)
# Print the dataset sizes
print ('Source features:',source_X.shape, 'Training features:',train_X.shape,'Test features:',test_X.shape)
print ('Source labels:',source_y.shape, 'Training labels:',train_y.shape,'Test labels:',test_y.shape)
Source features: (10739, 32) Training features: (8591, 32) Test features: (2148, 32)
Source labels: (10739,) Training labels: (8591,) Test labels: (2148,)
Random forest
from sklearn.ensemble import RandomForestRegressor
# Parameter grid
forest_parmas = {'n_estimators':[1300,1500,1700], 'max_depth':range(20,30,4)}
Model = RandomForestRegressor(oob_score=True,n_jobs=-1,random_state = 42)
Model = get_best_model_and_accuracy(Model,forest_parmas ,train_X, train_y)
Best Accuracy: 0.9495761615636888
Best Parameters: {'max_depth': 24, 'n_estimators': 1500}
Average Time to Fit (s): 33.625
Average Time to Score (s): 1.056
Model=Model.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=24,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1500, n_jobs=-1,
oob_score=True, random_state=42, verbose=0, warm_start=False)
# This is a regression problem, so score returns the coefficient of determination R^2 (on the log1p scale)
Model.score(test_X,test_y)
0.9506555663644386
# Out-of-bag score
Model.oob_score_
0.9544882922113979
# Save the model (sklearn.externals.joblib has been removed from scikit-learn; use the joblib package directly)
import joblib
joblib.dump(Model, "rf.pkl", compress=9)
xgboost
import xgboost as xg
# Parameter grid; subsample: fraction of the rows sampled at random for each tree
xg_parmas = {'subsample':[i/10.0 for i in range(6,10)],
'colsample_bytree':[i/10.0 for i in range(6,10)]} # fraction of the columns sampled at random for each tree
xg_model = xg.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4)
xg_model = get_best_model_and_accuracy(xg_model,xg_parmas,train_X.values, train_y.values)
Best Accuracy: 0.9519465121003385
Best Parameters: {'colsample_bytree': 0.9, 'subsample': 0.9}
Average Time to Fit (s): 1.136
Average Time to Score (s): 0.019
xg_model=xg_model.best_estimator_
xg_model
XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, gamma=0.4, gpu_id=-1,
importance_type='gain', interaction_constraints=None,
learning_rate=0.300000012, max_delta_step=0, max_depth=8,
min_child_weight=6, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=0, num_parallel_tree=1,
objective='reg:squarederror', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=0.9, tree_method=None,
validate_parameters=False, verbosity=None)
from sklearn.metrics import mean_absolute_error
pre_y = xg_model.predict(test_X.values)
mean_absolute_error(pre_y,test_y.values)
0.20327569987981398
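The mean absolute error above is computed on the log1p scale. The competition's leaderboard metric is the root mean squared logarithmic error (RMSLE), which for a log1p-transformed target is simply the RMSE of the log-scale predictions. A minimal sketch; the function names and the sample counts are made up for illustration:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original count scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(a, b):
    """Plain root mean squared error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))

true_counts = np.array([16, 40, 32, 13, 1])
pred_counts = np.array([12, 45, 30, 20, 2])

score = rmsle(true_counts, pred_counts)
# Same number, computed as RMSE between log1p-transformed values.
same = rmse(np.log1p(true_counts), np.log1p(pred_counts))
print(score, same)
```

In this notebook's variable names, that would correspond to taking the RMSE between pre_y and test_y.values, since both are already on the log1p scale.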
learning curve
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of a model on the given data.
    Parameters
    ----------
    estimator : the estimator to evaluate.
    title : title of the figure.
    X : input features, numpy array
    y : input target vector
    ylim : tuple (ymin, ymax) setting the lower and upper limits of the y axis
    cv : number of cross-validation folds; one fold is held out for validation and the rest are used for training (None uses scikit-learn's default)
    n_jobs : number of parallel jobs (default -1, all cores)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Number of training samples")
        plt.ylabel("Score")
        plt.gca().invert_yaxis()
        plt.grid() # grid lines
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                         alpha=0.1, color="b") # shade the band around the training curve
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label="train score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label="test score")
        plt.legend(loc="best")
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff
plot_learning_curve(Model, "Learning curve",train_X,train_y)
(0.969214095199126, 0.048081832611191144)
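# Predict on the test set
pred_value = Model.predict(pred_X)
# The target was transformed with log1p, so invert it with expm1
pred_value = np.expm1(pred_value)
# The wind speed fill changed the row order, so take datetime from the reordered frame instead of from test
submission = pd.DataFrame({'datetime':combine_train_test.loc[pred_X.index,'datetime'].values, 'count':pred_value})
submission['count'] = submission['count'].astype(int)
submission.to_csv('bike_predictions.csv',index = False)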
submission.head()
datetime | count | |
---|---|---|
0 | 2011-01-20 00:00:00 | 10 |
1 | 2011-01-20 01:00:00 | 3 |
2 | 2011-01-20 02:00:00 | 3 |
3 | 2011-01-20 03:00:00 | 6 |
4 | 2011-01-20 04:00:00 | 37 |
5 | 2011-01-20 05:00:00 | 90 |