Bike-sharing-demand

weixin_35980266

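This post walks through a Kaggle competition task: predicting bike-sharing demand in Washington, D.C. from historical usage data and weather information. It covers data processing, data analysis with box plots and heat maps, and prediction with the random forest algorithm.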

Kaggle competition: https://www.kaggle.com/c/bike-sharing-demand
Source: https://www.kaggle.com/c/bike-sharing-demand

1. Problem statement:
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
(Quoted from https://www.kaggle.com/c/bike-sharing-demand)

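Simply put, the task is to predict the number of bike rentals in Washington, D.C. from historical rental data. The data spans two years: the train data (rental counts, date, weather, etc.) covers the first 19 days of each month and is used to predict the rental counts for the test data, which covers the remaining days.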

2. Data processing:
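import pandas as pd
from datetime import datetime

# Assumption: df is Kaggle's train.csv and test.csv stacked, so "count" is NaN on the test rows
df = pd.concat([pd.read_csv("train.csv"), pd.read_csv("test.csv")], ignore_index=True)

df["date"] = df["datetime"].map(lambda s: s.split("-")[2].split()[0])   # day of month, e.g. "01"
df["year"] = df["datetime"].map(lambda s: s.split("-")[0])
df["month"] = df["datetime"].map(lambda s: s.split("-")[1])
df["weekday"] = df["datetime"].map(lambda s: datetime.strptime(s.split()[0], "%Y-%m-%d").weekday())   # Monday=0 ... Sunday=6
df["hour"] = df["datetime"].map(lambda s: s.split()[1].split(":")[0])
Datetime_train = df.loc[pd.notnull(df["count"]), "datetime"]   # datetime of the training rows
Datetime_test = df.loc[pd.isnull(df["count"]), "datetime"]     # datetime of the test rows
df.drop("datetime", axis=1, inplace=True)    # drop the datetime column
data = df[pd.notnull(df["count"])]           # training rows only: 10886 rows x 16 columns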

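This splits the "datetime" column (e.g. 2011-01-01 01:00:00) into "date" (01, the day of the month), "year" (2011), "month" (01), "weekday" (5, i.e. Saturday, counting from Monday=0), and "hour" (01), and appends them to the table as new columns. All of them are strings except "weekday", which is an integer.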
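The decomposition described above can also be sketched with pandas' vectorized `.dt` accessor instead of string splitting. This is only an illustrative alternative, not the method used in this post: the helper name `split_datetime` and the two sample rows are made up for the example, and the parts come out as integers rather than zero-padded strings.

```python
import pandas as pd


def split_datetime(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with date/year/month/weekday/hour columns added.

    Hypothetical helper for illustration; the column names follow the post.
    """
    out = frame.copy()
    ts = pd.to_datetime(out["datetime"])
    out["date"] = ts.dt.day          # day of month
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["weekday"] = ts.dt.weekday   # Monday=0 ... Sunday=6
    out["hour"] = ts.dt.hour
    return out


# Two made-up rows in the same format as the Kaggle "datetime" column.
sample = pd.DataFrame({"datetime": ["2011-01-01 01:00:00", "2012-12-19 23:00:00"]})
parts = split_datetime(sample)
print(parts)
```

Integer parts sort numerically and work with `pd.get_dummies` just as the string versions do, so either form can feed the later steps.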

3.1. Data analysis (box plots):
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 20))
sns.boxplot(y="count", data=data, ax=axes[0][0])
sns.boxplot(x="hour", y="count", data=data, ax=axes[0][1])
sns.boxplot(x="weather", y="count", data=data, ax=axes[1][0])
# weekday holds integers 0 (Monday) to 6 (Sunday), so the order must use them, not day names
sns.boxplot(x="weekday", y="count", data=data, ax=axes[1][1], order=list(range(7)))
sns.boxplot(x="season", y="count", data=data, ax=axes[2][0])
sns.boxplot(x="workingday", y="count", data=data, ax=axes[2][1])
plt.show()

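This calls seaborn's box plot function to plot the total rental count against hour, weather, weekday, season, and working day.

The plots show that "weather", "hour", and "season" have a large effect on the total count, while the other factors have little effect.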

3.2. Data analysis (heat map):
import numpy as np

corrMatt = data[["atemp", "casual", "humidity", "registered", "windspeed", "temp", "count"]].corr()   # pairwise correlation coefficients
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)   # True hides the cells above the diagonal
fig, axes = plt.subplots()
fig.set_size_inches(20, 10)
sns.heatmap(corrMatt, mask=mask, annot=True, square=True, ax=axes)   # (data, cells to hide, print the value in each cell, square cells)
plt.show()

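This calls seaborn's heat map function to show the correlation coefficients between the total rental count and the factors other than the time and season columns. (The time columns are easy to read when plotted directly against the count, whereas factors such as weather, wind speed, and temperature do not increase steadily and are hard to show that way, so a heat map is used for them instead.)

The heat map shows that "registered" and "casual" have the largest correlation coefficients with the total count, i.e. the strongest association. However, the test file does not contain these two columns, so they are not used.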
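The strong correlation is unsurprising: in the Kaggle data, count is the sum of casual and registered. A minimal sketch of ranking factors by their correlation with count is given here; the six rows in `toy` are invented for the example and are not taken from the competition data.

```python
import pandas as pd

# Invented rows; only the column names mirror the Kaggle data.
toy = pd.DataFrame({
    "temp":       [9.8, 14.0, 20.5, 25.4, 30.3, 12.3],
    "humidity":   [81, 75, 60, 50, 40, 88],
    "casual":     [3, 8, 20, 35, 50, 5],
    "registered": [13, 32, 90, 120, 160, 20],
})
# In the real data, count is defined as casual + registered.
toy["count"] = toy["casual"] + toy["registered"]

# Correlation of every other column with count, strongest positive first.
corr_with_count = toy.corr()["count"].drop("count").sort_values(ascending=False)
print(corr_with_count)
```

Because both columns are components of the target itself, using them as features would leak the answer even if the test file contained them.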

3.3. Data analysis (line plots):

# How hour affects count under different conditions
fig, axes = plt.subplots(nrows=3)         # three stacked axes, one per plot
fig.set_size_inches(12, 20)
df2 = data.groupby(["hour", "season"])["count"].mean().reset_index()      # mean count per hour and season
sns.pointplot(x="hour", y="count", hue="season", data=df2, ax=axes[0])
df3 = data.groupby(["hour", "weekday"])["count"].mean().reset_index()     # mean count per hour and weekday
sns.pointplot(x="hour", y="count", hue="weekday", data=df3, ax=axes[1])
# Reshape to long form so casual, registered, and count can share one plot
df4 = pd.melt(data[["hour", "casual", "registered", "count"]], id_vars=["hour"],
              value_vars=["casual", "registered", "count"])
df5 = df4.groupby(["hour", "variable"])["value"].mean().reset_index()
sns.pointplot(x="hour", y="value", hue="variable", data=df5, ax=axes[2],
              hue_order=["count", "registered", "casual"])
plt.show()

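Because the box plots showed that "hour" has a large effect on the total, the effect of hour on "count" is analyzed in more detail under different conditions.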

The plots make the demand for shared bikes at each time of day clearly visible.
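The same hourly pattern can be read off as a table instead of a plot. The sketch below pivots the mean count by hour and weekday and picks the busiest hour for each weekday; the rows in `toy` are invented, and only the column names follow the post.

```python
import pandas as pd

# Invented rows: three hours on two working days (0, 1) and one Saturday (5).
toy = pd.DataFrame({
    "hour":    ["08", "08", "13", "13", "17", "17", "08", "13", "17"],
    "weekday": [0, 1, 0, 1, 0, 1, 5, 5, 5],
    "count":   [300, 320, 150, 170, 400, 420, 80, 350, 200],
})

# Rows are hours, columns are weekdays, cells are mean rental counts.
hourly = toy.pivot_table(index="hour", columns="weekday", values="count", aggfunc="mean")
# For each weekday, the hour with the highest mean count.
peak_hour = hourly.idxmax()
print(hourly)
print(peak_hour)
```

On the real `data` frame the same two calls would give one column per weekday, which makes it easy to compare commuting peaks on working days with midday peaks at weekends.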

4. Model prediction (random forest):
# Data preprocessing
from sklearn.ensemble import RandomForestRegressor

df = pd.get_dummies(data=df, columns=["holiday", "season", "weather", "workingday", "date", "month", "weekday", "hour", "year"])   # one-hot encode the categorical columns
df.drop(["casual", "registered"], axis=1, inplace=True)      # drop the columns that are empty for the test rows

train = df[pd.notnull(df["count"])]                          # rows with a count: training set
test = df[pd.isnull(df["count"])].drop("count", axis=1)      # rows without a count: test set, minus the empty target column

# Train on the train data; parameters not listed keep scikit-learn's defaults.
# The original call also passed criterion='mse', max_features='auto', and min_impurity_split,
# which recent scikit-learn releases no longer accept.
rfr = RandomForestRegressor(n_estimators=50, n_jobs=-1)
rfr.fit(train.drop("count", axis=1), train["count"])

predicted = rfr.predict(test)                                # predict counts for the test rows
print(predicted)

Output: (the values elided in the middle of the printout can be displayed by converting predicted to a list)
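The post stops at printing the predictions. One plausible next step, sketched here under assumptions, is to pair them with the test datetimes saved earlier, since the competition's sample submission uses a datetime column and a count column. The helper name `make_submission`, the demo values, and the output file name are all made up for illustration.

```python
import numpy as np
import pandas as pd


def make_submission(datetimes, predicted) -> pd.DataFrame:
    """Pair each test datetime with its predicted count (hypothetical helper)."""
    return pd.DataFrame({
        # list() discards the original row labels so the result gets a clean 0..n-1 index
        "datetime": list(datetimes),
        "count": np.asarray(predicted, dtype=float),
    })


# Made-up stand-ins for Datetime_test and predicted from the post.
demo_datetimes = pd.Series(["2011-01-20 00:00:00", "2011-01-20 01:00:00"], index=[10886, 10887])
demo_predicted = np.array([12.4, 5.9])

submission = make_submission(demo_datetimes, demo_predicted)
print(submission)
# submission.to_csv("submission.csv", index=False)  # file name is an assumption
```

With the variables from the post, the call would be `make_submission(Datetime_test, predicted)`.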

5. References:

1. https://zhuanlan.zhihu.com/p/27585123   Article by an expert

2. http://blog.csdn.net/ac540101928/article/details/51689505   Mathematical principles of the random forest algorithm

3. http://blog.csdn.net/lulei1217/article/details/49583287   Handling random forest parameters

4. http://www.cnblogs.com/kylinlin/p/5236601.html, http://blog.csdn.net/w5310335/article/details/48247689   Using the Seaborn library

5. http://www.cnblogs.com/yymn/p/4801875.html   A simple random forest example