Kaggle: Bike Sharing Data Analysis

dadadachai

Article link: https://blog.csdn.net/dadadachai/article/details/118293090

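Problem

Using two years of hourly rental data (the train set), predict the rental count for each hour in the test set from features such as the weather.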

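Feature description

datetime: timestamp (year, month, day, hour)
season: 1 = spring; 2 = summer; 3 = fall; 4 = winter
holiday: whether the day is a holiday. 0 = no; 1 = yes
workingday: whether the day is a working day. 0 = no; 1 = yes
weather: 1 = clear; 2 = cloudy; 3 = light rain or light snow; 4 = severe weather
temp: actual temperature
atemp: "feels like" temperature
humidity: humidity
windspeed: wind speed
casual: number of rentals by unregistered users
registered: number of rentals by registered users
count: total number of rentals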

Data preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data and inspect its structure
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape)
train.info()
print(test.shape)
test.info()
# Box plots to check for outliers
train.plot(kind='box', figsize=(18,16), layout=(5,4), subplots=True, sharex=False, sharey=False)

The box plots show outliers in the rental count and in the wind speed.

# Distribution of the rental count
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(8,6))
plt.hist(train['count'], bins=20)
plt.title('Distribution of rental count')
plt.xlabel('Rental count')

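The distribution is clearly skewed with a long right tail, so the outliers are removed and a log transform is applied to bring the target closer to a Gaussian distribution.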

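# Drop rows more than 3 standard deviations from the mean
train = train[np.abs(train['count'] - train['count'].mean()) <= 3*train['count'].std()]
fig = plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
# histplot replaces the deprecated sns.distplot
sns.histplot(train['count'], kde=True, stat='density')
plt.title('Rental count after removing outliers')
plt.xlabel('Rental count')
# Log-transform the heavily skewed target
y_log = np.log1p(train['count'])
plt.subplot(1,2,2)
sns.histplot(y_log, kde=True, stat='density')
plt.title('Rental count after log transform')
plt.xlabel('log1p(rental count)')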
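
To see what the 3-sigma filter and the log transform do to a skewed variable, here is a minimal sketch on synthetic data. The log-normal sample and the `skewness` helper are assumptions made for illustration; neither comes from the original notebook or from the Kaggle files.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for hourly rental counts (not the Kaggle data).
counts = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores (hypothetical helper)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Keep only values within 3 standard deviations of the mean.
mask = np.abs(counts - counts.mean()) <= 3 * counts.std()
filtered = counts[mask]

# log1p(x) = log(1 + x) stays defined at x = 0, unlike log(x).
logged = np.log1p(filtered)

print(f"removed {len(counts) - len(filtered)} of {len(counts)} rows")
print(f"skewness before: {skewness(filtered):.2f}, after log1p: {skewness(logged):.2f}")
```

On a sample like this the skewness should move much closer to 0 after the transform, which is the effect the log-transformed histogram is meant to show.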

Merge the train and test sets so that the date column can be processed for both at once.

# Combine the two sets so they are processed identically
# (pd.concat replaces DataFrame.append, which pandas 2.0 removed)
data = pd.concat([train, test], ignore_index=True)
# Split the timestamp into year, month, day, hour and weekday
timestamps = pd.to_datetime(data['datetime'])
data['year'] = timestamps.dt.year
data['month'] = timestamps.dt.month
data['day'] = timestamps.dt.day
data['hour'] = timestamps.dt.hour
data['weekday'] = timestamps.dt.weekday
data = data.drop('datetime', axis=1)
# Reorder the columns
cols = ['year','month','day','weekday','hour','season','holiday','workingday','weather','temp','atemp','humidity','windspeed','casual','registered','count']
data = data.loc[:,cols]
# Distribution of wind speed
fig=plt.figure()
plt.subplot(1,1,1)
sns.histplot(data['windspeed'], kde=True, stat='density')
plt.title('Wind speed distribution')
plt.xlabel('windspeed')

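The distribution plot and the raw table show that wind speed contains many zeros. Presumably 0 was used in place of missing readings, so a random forest trained on the rows with non-zero wind speed (using year, month, season, weather, temperature and humidity as features) is used to predict the wind speed for the rows where it is 0.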
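
That zeros stand in for missing readings is an inference, not something the data set states. One rough way to probe it is to look at the share of exact zeros and at the gap between zero and the smallest non-zero reading: a large share combined with a clear gap would be consistent with the assumption. The sketch below uses made-up readings, not values from the Kaggle files.

```python
import pandas as pd

# Illustrative readings only; these are not values from train.csv or test.csv.
windspeed = pd.Series([0.0, 0.0, 6.0, 7.0, 0.0, 9.0, 11.0, 13.0, 0.0, 15.0])

# Fraction of rows that are exactly zero.
zero_share = (windspeed == 0).mean()
# Smallest reading that is not zero; a gap above 0 hints at a sentinel value.
smallest_nonzero = windspeed[windspeed > 0].min()

print(f"share of zeros: {zero_share:.0%}")
print(f"smallest non-zero reading: {smallest_nonzero}")
```

Running the same two lines on `data['windspeed']` gives a quick sanity check before committing to the imputation.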

# Predict wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
null = data[data.windspeed == 0]
not_null = data[data.windspeed != 0]
# Build the training inputs (non-zero rows) and the rows to predict (zero rows)
Xtrain_windspeed = not_null[['year','month','season','weather','temp','atemp','humidity']]
ytrain_windspeed = not_null['windspeed']
Xtest_windspeed = null[['year','month','season','weather','temp','atemp','humidity']]
# Choose parameters with grid search
params1 = {'n_estimators': np.arange(100,500,50)}
model1 = GridSearchCV(estimator=RandomForestRegressor(random_state=42), param_grid=params1, scoring='neg_mean_squared_error', cv=5)
model1.fit(Xtrain_windspeed, ytrain_windspeed)
model1.best_params_, model1.best_score_
params2 = {'min_samples_split': np.arange(2,10,2)}
model2 = GridSearchCV(estimator=RandomForestRegressor(n_estimators = 450, max_depth=10, random_state=42), param_grid=params2, scoring='neg_mean_squared_error', cv=5)
model2.fit(Xtrain_windspeed, ytrain_windspeed)
model2.best_params_, model2.best_score_
# Train the model and fill in the zero readings
speed_model = RandomForestRegressor(n_estimators=450, random_state=42, max_depth=10, min_samples_split=8)
speed_model.fit(Xtrain_windspeed, ytrain_windspeed)
pred_ytest = speed_model.predict(Xtest_windspeed)
data.loc[data.windspeed == 0, 'windspeed'] = pred_ytest
# Plot the distribution again
fig=plt.figure()
plt.subplot(1,1,1)
sns.histplot(data['windspeed'], kde=True, stat='density')
plt.title('Wind speed distribution')
plt.xlabel('windspeed')

Visual analysis

# Histogram of every column
data.hist(figsize=(18,16))

# Pairwise relationships between variables
sns.pairplot(data, vars=['count','temp', 'atemp', 'humidity', 'windspeed'])

Rentals by hour of day on working and non-working days:

workingday = data[data['workingday']==1]
group_working = workingday.groupby('hour')
hour_mean1 = group_working[['count', 'registered', 'casual']].mean()
weekend = data[data['workingday']==0]
group_weekend = weekend.groupby('hour')
hour_mean2 = group_weekend[['count', 'registered', 'casual']].mean()

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
plt.plot(hour_mean1['count'],label='count')
plt.plot(hour_mean1['registered'],label='registered')
plt.plot(hour_mean1['casual'],label='casual')
plt.title('Rentals by hour on working days')
plt.legend()
plt.subplot(1,2,2)
plt.plot(hour_mean2['count'],label='count')
plt.plot(hour_mean2['registered'],label='registered')
plt.plot(hour_mean2['casual'],label='casual')
plt.title('Rentals by hour on non-working days')
plt.legend()

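Non-working days: usage is concentrated in the daytime.
Working days: registered users ride mainly during the morning and evening commute peaks, while unregistered users ride mainly in the daytime.

Rentals by month: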

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16,8))
sns.boxplot(x="month",y="count",data=data,ax=axes[0]);
sns.boxplot(x="month",y="registered",data=data,ax=axes[1]);
sns.boxplot(x="month",y="casual",data=data,ax=axes[2]);
plt.show()

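January and February, the coldest months, have the fewest riders.
Rentals rise month over month from February to April, peak and level off from June to October, and decline after October.

Rentals by season: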

fig, axes = plt.subplots(nrows=1, ncols=3)
fig.set_size_inches(16, 8)
sns.boxplot(x='season',y='count',data=data,ax=axes[0]);
sns.boxplot(x='season',y='registered',data=data,ax=axes[1]);
sns.boxplot(x='season',y='casual', data=data,ax=axes[2]);
plt.show()

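Usage is highest in summer and fall and lowest in spring.

Rentals on holidays and non-holidays:

Registered users ride more on non-holidays than on holidays;
unregistered users ride more on holidays.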
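
The post shows only a figure for this comparison, without the code behind it. A group-by on the `holiday` flag is one way such a comparison could be computed. The sketch below is an assumption about the approach, and the six rows are invented to mirror the pattern described above; they are not results from the data set.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the data set.
df = pd.DataFrame({
    "holiday":    [0, 0, 0, 1, 1, 1],
    "registered": [120, 150, 130, 80, 90, 70],
    "casual":     [20, 30, 25, 60, 55, 65],
})

# Mean rentals per user type, split by the holiday flag.
by_holiday = df.groupby("holiday")[["registered", "casual"]].mean()
print(by_holiday)
```

On the real data the same call would be `data.groupby('holiday')[['registered', 'casual']].mean()`; a box plot with `x='holiday'`, as in the neighbouring sections, would show the spread as well as the mean.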

Rentals by weather:

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16,8))
sns.boxplot(x='weather',y='count',data=data,ax=axes[0])
sns.boxplot(x='weather',y='registered',data=data,ax=axes[1])
sns.boxplot(x='weather',y='casual', data=data,ax=axes[2])
plt.show()

The better the weather, the more riders; in extreme weather such as heavy rain or heavy snow there are almost none.

Rentals by temperature:

temp_grouped = data.groupby(['temp']).agg({'count':'mean', 'registered':'mean', 'casual':'mean'})
temp_grouped.plot(title='Rentals by temperature')

Usage rises with temperature.
Once the temperature exceeds 37° (or the "feels like" temperature exceeds 34°), usage falls.

Rentals by humidity:

humidity_grouped = data.groupby(['humidity']).agg({'count':'mean', 'registered':'mean', 'casual':'mean'})
humidity_grouped.plot(title='Rentals by humidity')

Usage is highest at humidity between 20 and 40.
It is lower at humidity between 0 and 20.
It declines once humidity exceeds 40.

Rentals by wind speed:

fig,axes = plt.subplots(nrows=1, ncols=3, figsize=(16,8))
data.groupby('windspeed').mean()["count"].plot(linestyle='-',color='b',ax=axes[0],legend=" ")
data.groupby('windspeed').mean()["registered"].plot(linestyle="-",color='g',ax=axes[1],legend=" ")
data.groupby('windspeed').mean()["casual"].plot(linestyle="-",color='r',ax=axes[2],legend=" ")
plt.show()

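Usage drops sharply when wind speed exceeds 20.

Summary

Registered users are mostly commuters and ride relatively more on working days; unregistered users ride mostly in the daytime on weekends and holidays.
Summer and fall are the peak seasons, so more bikes could be deployed then; spring has fewer riders, so deployment could be reduced accordingly.
The better the weather, the more riders; humidity between 20 and 40 and wind speed below 20 are the most comfortable for riding; rentals rise with temperature up to 37° and fall above it.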

Feature processing

Correlation between variables:

data.corrwith(data['count']).sort_values(ascending=False)
# Visualize the correlation matrix as a heat map
plt.figure(figsize=(16,9))
sns.heatmap(data.corr(),annot=True)

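Convert the numeric variables that will be one-hot encoded to categorical:

# hour and weekday stay integer-coded: tree models can split on them directly,
# and XGBoost rejects pandas category columns unless enable_categorical=True
category_features = ['year','month','weather','season']
for feature in category_features:
    data[feature] = data[feature].astype('category')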

Apply one-hot encoding:

dummies_year = pd.get_dummies(data['year'], prefix='year')
dummies_month = pd.get_dummies(data['month'], prefix='month')
dummies_season = pd.get_dummies(data['season'], prefix='season')
dummies_weather = pd.get_dummies(data['weather'], prefix='weather')
data_onehot = pd.concat([data, dummies_year, dummies_month, dummies_season, dummies_weather], axis=1)
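
For readers unfamiliar with `pd.get_dummies`, here is a minimal sketch of what it produces, on a made-up five-row frame (the values are illustrative only):

```python
import pandas as pd

# Made-up rows; only the column names match the data set.
df = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "temp":   [9.8, 22.1, 28.7, 13.9, 10.7],
})

# One indicator column per distinct season value, named season_<value>.
dummies = pd.get_dummies(df["season"], prefix="season")

# Replace the integer code with its indicator columns.
encoded = pd.concat([df.drop(columns="season"), dummies], axis=1)
print(encoded.columns.tolist())
```

Each row has exactly one active `season_*` column, so the original integer code carries no extra information once the dummies are in place.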

Drop the original columns:

drop_columns = ['year','month','season','weather']
data_onehot = data_onehot.drop(drop_columns, axis=1)

Separate the training and test sets again:

train_X = data_onehot[pd.notnull(data_onehot['count'])]
test_X = data_onehot[~pd.notnull(data_onehot['count'])]
y_log = np.log1p(train_X['count'])
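
This split relies on test.csv having no `count` column: after the two sets are combined, the test rows carry NaN in `count`, so a null check tells the two apart. A small sketch with made-up rows (the frames below are stand-ins, not the real files):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real files; the test frame has no target column.
train = pd.DataFrame({"temp": [9.8, 13.1, 22.1], "count": [16, 40, 110]})
test = pd.DataFrame({"temp": [10.7, 25.4]})

# Columns missing from `test` are filled with NaN when the frames are combined.
data = pd.concat([train, test], ignore_index=True)

# Rows with a known target are training rows; the rest are test rows.
is_train = data["count"].notna()
train_part = data[is_train]
test_part = data[~is_train]

# Target on the log1p scale, as used for model fitting.
y_log = np.log1p(train_part["count"])
print(len(train_part), len(test_part))
```

The approach would misassign rows if the training target itself contained missing values, so it is worth confirming with `train.info()` that `count` has no nulls.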

Drop the remaining unneeded features:

drop_features = ['count','registered','casual','day','atemp']
train_X = train_X.drop(drop_features, axis=1)
test_X = test_X.drop(drop_features, axis=1)

Model building and prediction

Split the training data into a training subset and a held-out test subset:

from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(train_X, y_log, test_size=0.2, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

1. Random forest: tuning with grid search:

rf = RandomForestRegressor(random_state=42)
params1 = {'n_estimators' : np.arange(300,1000,50)}
model1 = GridSearchCV(estimator=rf, param_grid = params1, scoring='neg_mean_squared_error', cv=5)
model1.fit(x_train, y_train)
model1.best_score_, model1.best_params_
params2 = {'min_samples_split': np.arange(2,10,2)}
model2 = GridSearchCV(estimator=RandomForestRegressor(n_estimators=750, random_state=42),\
                      param_grid = params2, scoring='neg_mean_squared_error', cv=5)
model2.fit(x_train, y_train)
model2.best_score_, model2.best_params_
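
Tuning one parameter at a time, as above, keeps the search cheap but cannot see interactions between parameters. A joint grid over both parameters is one alternative. The sketch below shows the idea on synthetic data with a deliberately tiny grid; the data, the grid values and `cv=3` are assumptions chosen so that it runs in seconds, not settings from the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
# Synthetic regression data standing in for x_train / y_train.
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Every combination of the two parameters is cross-validated.
param_grid = {"n_estimators": [20, 40], "min_samples_split": [2, 8]}
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is the negated MSE, so values closer to 0 are better.
print(search.best_params_, -search.best_score_)
```

The cost grows with the product of the grid sizes, which is why the sequential approach in the post can be the more practical choice for large grids.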

Build, fit and predict with the random forest:

rf_final = RandomForestRegressor(n_estimators=750, random_state=42, max_depth=10, min_samples_split=2)
rf_final.fit(x_train, y_train)
pred = rf_final.predict(x_test)

R squared:

from sklearn.metrics import r2_score
r2_score(y_test, pred)

RMSE:

from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(y_test, pred))
RMSE
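
Because `y_test` and `pred` are both on the log1p scale, this RMSE is the root mean squared logarithmic error (RMSLE) of the underlying counts. RMSLE also appears to be the metric this Kaggle competition scores submissions with, though the post does not say so. A small check of the equivalence, using made-up numbers and two hypothetical helper functions:

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original count scale."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up counts and predictions, for illustration only.
y_true = np.array([16.0, 40.0, 110.0, 250.0])
y_pred = np.array([20.0, 35.0, 120.0, 230.0])

# RMSE on log1p-transformed values matches RMSLE on the raw counts.
print(rmse(np.log1p(y_true), np.log1p(y_pred)), rmsle(y_true, y_pred))
```

One consequence is that the score penalizes relative errors: being off by 10 rentals matters more at 20 rentals than at 200.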

2. XGBoost: tuning with grid search:

# Install xgboost (notebook shell command)
!pip install xgboost
from xgboost import XGBRegressor
params3 = {'n_estimators': np.arange(500,1000,50)}
model3 = GridSearchCV(estimator=XGBRegressor(random_state=42), \
                      param_grid = params3, scoring='neg_mean_squared_error', cv=5)
model3.fit(x_train, y_train)
model3.best_score_, model3.best_params_
params4 = {'max_depth': np.arange(3,10), 'learning_rate':[0.01,0.03,0.1,0.3]}
model4 = GridSearchCV(estimator=XGBRegressor(random_state=42, n_estimators=600),\
                      param_grid = params4, scoring='neg_mean_squared_error', cv=5)
model4.fit(x_train, y_train)
model4.best_score_, model4.best_params_

Build, fit and predict with XGBoost:

from xgboost import XGBRegressor
xgb = XGBRegressor(max_depth=7, learning_rate=0.03, n_estimators=600, random_state=42)
xgb.fit(x_train, y_train)
preds = xgb.predict(x_test)

R squared:

r2_score(y_test, preds)

RMSE:

RMSE = np.sqrt(mean_squared_error(y_test, preds))
RMSE

XGBoost predicts better than the random forest on the held-out subset, so XGBoost is chosen.

Retrain the model on the full training set and predict the test set:

xgb_final = XGBRegressor(max_depth=7, learning_rate=0.03, n_estimators=600, random_state=42)
xgb_final.fit(train_X, y_log)
preds_test = xgb_final.predict(test_X)
# The target was transformed with log1p, so invert with expm1 (not exp)
submission = pd.DataFrame({'datetime': test['datetime'], 'count': [max(0,x) for x in np.expm1(preds_test)]})
submission.to_csv(r'C:\Users\Administrator\Desktop\submission.csv', index=False)
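
Since the model is trained on `log1p(count)`, predictions have to be mapped back with the matching inverse, `expm1`; a plain `exp` would overshoot every count by exactly one. A quick check on a few made-up values:

```python
import numpy as np

# Made-up counts, including 0 to show why log1p is used instead of log.
counts = np.array([0.0, 1.0, 16.0, 250.0])
logged = np.log1p(counts)

restored = np.expm1(logged)  # inverse of log1p: recovers the counts
shifted = np.exp(logged)     # exp(log(1 + x)) = x + 1, one too many

print(restored)
print(shifted)
```

The offset is small for busy hours but matters for the quiet ones, where an extra rental is a large relative error.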

Kaggle submission result: [screenshot]
