Bike-sharing-demand

Kaggle competition: https://www.kaggle.com/c/bike-sharing-demand
Source: https://www.kaggle.com/c/bike-sharing-demand


1. Problem statement:
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
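(Excerpted from https://www.kaggle.com/c/bike-sharing-demand)

In short, the task is to predict hourly bike rental counts in Washington, D.C. from historical rental data. The train data (rental counts, dates, weather, and so on) is used to predict the rental counts for the test data. In this competition the training set covers the first 19 days of each month, and the test set covers the remaining days of each month.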


2. Data processing:
import pandas as pd
from datetime import datetime

# df is assumed to hold the train and test sets concatenated, so test rows have NaN in "count"
df["date"] = df["datetime"].map(lambda s: s.split("-")[2].split()[0])
df["year"] = df["datetime"].map(lambda s: s.split("-")[0])
df["month"] = df["datetime"].map(lambda s: s.split("-")[1])
df["weekday"] = df["datetime"].map(lambda s: datetime.strptime(s.split()[0], "%Y-%m-%d").weekday())
df["hour"] = df["datetime"].map(lambda s: s.split()[1].split(":")[0])
Datetime_train = df.loc[pd.notnull(df["count"]), "datetime"]   # datetime values of the training rows
Datetime_test = df.loc[pd.isnull(df["count"]), "datetime"]     # datetime values of the test rows
df.drop("datetime", axis=1, inplace=True)    # drop the datetime column
data = df[pd.notnull(df["count"])]           # rows with a non-null count: 10886 rows x 16 columns

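  This splits the "datetime" column (e.g. 2011-01-01 01:00:00) into "date" (01), "year" (2011), "month" (01), "weekday" (5, because weekday() counts from Monday = 0 and 2011-01-01 was a Saturday), and "hour" (01), and appends them to the table as new columns.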
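To make the split concrete, here is a minimal standard-library sketch of the same decomposition for a single timestamp string. The helper name `split_datetime` is hypothetical and is not part of the original code; it only mirrors the five fields described above.

```python
from datetime import datetime


def split_datetime(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into the five derived fields.

    Hypothetical helper for illustration; the post itself uses string splits.
    """
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "date": parsed.strftime("%d"),    # day of month as a zero-padded string
        "year": parsed.strftime("%Y"),
        "month": parsed.strftime("%m"),
        "weekday": parsed.weekday(),      # integer, Monday = 0 ... Sunday = 6
        "hour": parsed.strftime("%H"),
    }


fields = split_datetime("2011-01-01 01:00:00")
print(fields)
```

Parsing once with `strptime` avoids repeating the string splits, at the cost of being stricter about the input format.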


3.1 Data analysis (box plots):
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 20))
sns.boxplot(y="count", data=data, ax=axes[0][0])
sns.boxplot(x="hour", y="count", data=data, ax=axes[0][1])
sns.boxplot(x="weather", y="count", data=data, ax=axes[1][0])
# "weekday" holds integers from datetime.weekday(): 0 = Monday ... 6 = Sunday
sns.boxplot(x="weekday", y="count", data=data, ax=axes[1][1], order=list(range(7)))
sns.boxplot(x="season", y="count", data=data, ax=axes[2][0])
sns.boxplot(x="workingday", y="count", data=data, ax=axes[2][1])
plt.show()

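This calls seaborn's box plot function to show how the rental count is distributed across each time-related and seasonal feature, as shown below: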


The plots show that "weather", "hour", and "season" have a large effect on the total count, while the remaining features have little effect.
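For readers unfamiliar with box plots, the sketch below computes the numbers that one box summarizes: the quartiles, and the points beyond 1.5 times the interquartile range that are drawn as outliers (1.5 is seaborn's default `whis`). The function name `box_summary` and the sample values are made up for illustration and do not come from the competition data.

```python
import statistics


def box_summary(values):
    """Return the quartiles and the 1.5 * IQR outliers of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up hourly rental counts, with one unusually busy hour
sample_counts = [10, 20, 30, 40, 50, 400]
summary = box_summary(sample_counts)
print(summary)
```

A feature matters in the box plots above when the boxes for its different values sit at clearly different heights.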

3.2 Data analysis (heatmap):
import numpy as np

corrMatt = data[["atemp", "casual", "humidity", "registered", "windspeed", "temp", "count"]].corr()   # pairwise correlation coefficients
mask = np.triu(np.ones(corrMatt.shape, dtype=bool), k=1)   # True above the diagonal: hide the redundant upper triangle
fig, axes = plt.subplots()
fig.set_size_inches(20, 10)
sns.heatmap(corrMatt, mask=mask, annot=True, square=True)  # (data, mask hiding half of the matrix, show values, square cells)
plt.show()

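This calls seaborn's heatmap function to show the correlation coefficients between the rental count and the factors other than time and season. The relationship between time and rental count is intuitive when drawn as a plot, but factors such as weather, wind speed, and temperature do not increase steadily and are hard to show that way, so they are analyzed with a heatmap instead, as shown below: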



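The heatmap shows that "registered" and "casual" have large correlation coefficients with the total count, meaning they are strongly related to it. However, the test file does not contain these two columns, so they are not used.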
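The correlation coefficient in each heatmap cell is the Pearson coefficient, which is what `DataFrame.corr()` computes by default. The plain-Python sketch below shows the calculation. The sample numbers are made up; the only property taken from the dataset is that "count" is the sum of "casual" and "registered", which is why those two columns correlate so strongly with it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values for illustration only
casual = [3, 8, 5, 40, 35]
registered = [13, 32, 27, 120, 90]
count = [c + r for c, r in zip(casual, registered)]

print(pearson(registered, count))
print(pearson(casual, count))
```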


3.3 Data analysis (line plots):

# How "hour" affects "count" under different conditions
fig, axes = plt.subplots(nrows=3)         # one subplot per comparison
fig.set_size_inches(12, 15)
# Mean count per hour, split by season
df2 = data.groupby(["hour", "season"])["count"].mean().reset_index()
sns.pointplot(x="hour", y="count", hue="season", data=df2, ax=axes[0])
# Mean count per hour, split by weekday
df3 = data.groupby(["hour", "weekday"])["count"].mean().reset_index()
sns.pointplot(x="hour", y="count", hue="weekday", data=df3, ax=axes[1])
# Mean of casual, registered, and total count per hour
df4 = pd.melt(data[["hour", "casual", "registered", "count"]], id_vars=["hour"],
              value_vars=["casual", "registered", "count"])
df5 = df4.groupby(["hour", "variable"])["value"].mean().reset_index()
sns.pointplot(x="hour", y="value", hue="variable", data=df5, ax=axes[2],
              hue_order=["count", "registered", "casual"])
plt.show()

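Because the box plots showed that "hour" has a large effect on the total, this step looks in more detail at how it affects "count" under different conditions.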




The plots clearly show how demand for shared bikes varies across the hours of the day.
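Each line in these plots is a group mean: the rows are grouped by hour and by a second feature, and "count" is averaged within each group. The sketch below does the same grouping in plain Python. The helper `mean_by_key` is hypothetical and the rows are made up.

```python
from collections import defaultdict


def mean_by_key(rows, key_fields, value_field):
    """Average value_field over the rows that share the same key_fields."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [running sum, row count]
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key][0] += row[value_field]
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Made-up rows for illustration only
rows = [
    {"hour": "08", "season": 1, "count": 100},
    {"hour": "08", "season": 1, "count": 140},
    {"hour": "08", "season": 3, "count": 300},
    {"hour": "17", "season": 3, "count": 420},
]
means = mean_by_key(rows, ["hour", "season"], "count")
print(means)
```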

4. Model prediction (random forest):
# Data preprocessing
from sklearn.ensemble import RandomForestRegressor

# One-hot encode the categorical features
df = pd.get_dummies(data=df, columns=["holiday", "season", "weather", "workingday", "date", "month", "weekday", "hour", "year"])
df.drop(["casual", "registered"], axis=1, inplace=True)      # drop the columns that are missing from the test set

train = df[pd.notnull(df["count"])]                          # rows with a known count (training set)
test = df[pd.isnull(df["count"])].drop("count", axis=1)      # rows without a count (test set), minus the empty target column

# Train on the train data with 50 trees on all CPU cores; the other parameters keep their defaults
rfr = RandomForestRegressor(n_estimators=50, n_jobs=-1)
rfr.fit(train.drop("count", axis=1), train["count"])

predicted = rfr.predict(test)        # predict the rental counts for the test data
print(predicted)

Output: (the elided middle values can be displayed by converting predicted to a list)
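The post stops at printing the predictions. As a possible next step, the sketch below pairs each test timestamp with its prediction and writes a CSV file. It assumes the submission file needs a "datetime" column and a "count" column, as in the competition's sample submission. The helper name `write_submission` and the demo values are made up.

```python
import csv
import os
import tempfile


def write_submission(datetimes, predictions, path):
    """Write one 'datetime,count' row per prediction (hypothetical helper)."""
    if len(datetimes) != len(predictions):
        raise ValueError("datetimes and predictions must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["datetime", "count"])
        for stamp, value in zip(datetimes, predictions):
            # Rental counts cannot be negative, so clip at zero as a safeguard
            writer.writerow([stamp, max(0.0, float(value))])


# Demo with made-up values, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(["2011-01-20 00:00:00", "2011-01-20 01:00:00"], [11.4, 5.02], demo_path)
with open(demo_path, newline="") as handle:
    demo_rows = list(csv.reader(handle))
print(demo_rows)
```

With the variables from this post, the call would look like `write_submission(list(Datetime_test), list(predicted), "submission.csv")`.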



5. References:

1. https://zhuanlan.zhihu.com/p/27585123   Article by an expert author

2. http://blog.csdn.net/ac540101928/article/details/51689505     Mathematical principles of the random forest algorithm

3. http://blog.csdn.net/lulei1217/article/details/49583287       Random forest parameter handling

4. http://www.cnblogs.com/kylinlin/p/5236601.html,
       http://blog.csdn.net/w5310335/article/details/48247689      Using the Seaborn library

5. http://www.cnblogs.com/yymn/p/4801875.html         A simple random forest example
