Bike-Sharing Project Plan: Correlation Analysis for a Bike-Sharing Project

Latest recommended article published on 2023-09-15 14:09:56

weixin_39734184

Article link: https://blog.csdn.net/weixin_39734184/article/details/112177090


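I. Project Introduction

1.1 Project Background

Bike sharing is a service in which companies provide shared bicycles at campuses, subway stations, bus stops, residential areas, commercial districts, public service areas, and similar locations. Users rent a bicycle at one station and return it when they reach their destination.

1.2 Project Requirements

The project calls for combining historical usage records with weather data to forecast bike-rental demand. This article uses correlation analysis to explore the factors that affect rental counts; it does not yet forecast demand.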

II. Data Understanding

2.1 Data Source

Bike Sharing Demand | Kaggle (www.kaggle.com)

2.2 Importing the Data

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Data-handling packages
import numpy as np
import pandas as pd

# Load the data
# Training set
train = pd.read_csv("D:/BaiduNetdiskDownload/train_bike.csv")  # use forward slashes in the path
# Test set
test = pd.read_csv("D:/BaiduNetdiskDownload/test_bike.csv")

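2.3 Field Meanings

The training set has 12 fields, with the following meanings:

datetime: date and hour of the record
season: season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
holiday: whether the day is a holiday
workingday: whether the day is a working day (neither a weekend nor a holiday)
weather: weather grade from 1 to 4; higher values mean worse weather
temp: temperature in degrees Celsius
atemp: "feels like" temperature in degrees Celsius
humidity: relative humidity
windspeed: wind speed
casual: number of rentals by casual (non-registered) users
registered: number of rentals by registered users
count: total number of rentals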

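2.4 Inspecting the Data

# Show the number of rows and columns in the training and test sets
print('Training set shape:', train.shape, 'Test set shape:', test.shape)

# Show the first few rows
train.head()

# Show descriptive statistics for the training set
train.describe()

train.info()

# Inspect the test set
test.info()

The training set has shape (10886, 12) and the test set has shape (6493, 8). The info method reports the non-null count and data type of each column, and it shows that neither the training set nor the test set has missing values.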
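
The no-missing-values conclusion can also be checked directly instead of being read off the info output. The sketch below uses a small made-up frame (the values are illustrative and one NaN is planted on purpose; none of it comes from the Kaggle files) and counts nulls per column with isnull().sum().

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan, 9.84],   # one missing value planted on purpose
    "humidity": [81, 80, 80, 75],
    "count": [16, 40, 32, 13],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
# True only when no column contains a missing value.
has_no_missing = bool(missing_per_column.sum() == 0)

print(missing_per_column)
print("no missing values:", has_no_missing)
```

Applied to train and test, the same two lines would be expected to report zero for every column if the conclusion above holds.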

III. Data Cleaning

Process the timestamp to add date, month, hour, and weekday columns.

# Work on a copy of the training set so the original DataFrame stays unchanged
bikeDf = train.copy()

# Derive the date, month, hour, and day of the week from the timestamp
from datetime import datetime

bikeDf['date'] = bikeDf['datetime'].apply( lambda c : c.split()[0] )
bikeDf['month'] = bikeDf['datetime'].apply( lambda c : c.split()[0].split('-')[1] ).astype('int')
bikeDf['hour'] = bikeDf['datetime'].apply( lambda c : c.split()[1].split(':')[0] ).astype('int')
# datetime.strptime() parses a string into a datetime object
# datetime.isoweekday() returns 1-7 for Monday through Sunday; date.weekday() returns 0-6 for Monday through Sunday
bikeDf['weekday'] = bikeDf['date'].apply( lambda c : datetime.strptime(c,'%Y-%m-%d').isoweekday() )
bikeDf.head()
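
As an alternative to splitting strings row by row, the same four columns can be derived by parsing the timestamp once and reading its parts through the pandas .dt accessor. This is a sketch on two made-up timestamps, assuming the datetime column uses the 'YYYY-MM-DD HH:MM:SS' format that the string splitting above implies; the frame name demo is hypothetical.

```python
import pandas as pd

# Two invented timestamps in the assumed 'YYYY-MM-DD HH:MM:SS' format.
demo = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-03 17:00:00"]})

# Parse the strings once, then read each part through the .dt accessor.
parsed = pd.to_datetime(demo["datetime"], format="%Y-%m-%d %H:%M:%S")
demo["date"] = parsed.dt.strftime("%Y-%m-%d")
demo["month"] = parsed.dt.month
demo["hour"] = parsed.dt.hour
# dt.weekday is 0-6 with Monday=0; adding 1 matches isoweekday()'s 1-7.
demo["weekday"] = parsed.dt.weekday + 1

print(demo)
```

Parsing once is usually faster on large frames than several apply calls, and it fails loudly if a timestamp does not match the expected format.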

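IV. Correlation Analysis

Since total rentals = registered-user rentals + casual-user rentals, the analysis below examines all three rental counts together against the other features.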
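
The identity between the three rental counts is easy to verify before relying on it. The sketch below checks it row by row on a made-up frame; the numbers are illustrative, and the column names casual, registered, and count are the ones used in this article.

```python
import pandas as pd

# Invented rows; each count is meant to equal casual + registered.
rides = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True only if the identity holds in every row.
identity_holds = bool((rides["casual"] + rides["registered"] == rides["count"]).all())
print("count == casual + registered in every row:", identity_holds)
```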

# seaborn must be imported before its first use
import seaborn as sns
sns.pairplot(bikeDf,x_vars = ['holiday','workingday','weather','season','month','weekday','hour','windspeed','humidity','temp','atemp'],
            y_vars = ['casual','registered','count'],plot_kws = {'alpha':0.1})

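The plot above shows that:

(1) Registered users account for most rentals. Registered users ride more on working days than on days off, whereas casual users ride more on days off than on working days.

(2) Both casual and registered users ride less as the weather grade increases.

(3) Both casual and registered users rent fewer bikes in spring.

(4) Casual rentals follow a single-peaked, roughly bell-shaped curve over the hours of the day, whereas registered rentals show two peaks.

(5) Rentals tend to rise as the temperature rises.

(6) Humidity has a large effect on casual users and little effect on registered users.

Because there are many features, only the three factors with the greatest influence on rental counts are studied here. They are identified with the correlation heatmap that follows.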

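4.1 Correlation Heatmap

# Compute the correlation coefficients; only numeric columns are used,
# because the string columns (datetime, date) cannot be correlated
correlation = bikeDf.select_dtypes(include='number').corr()
correlation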
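
By default, DataFrame.corr() computes the Pearson correlation coefficient for each pair of columns. To make clear what each cell of that matrix means, here is a minimal pure-Python sketch of the formula; the function name pearson and the sample numbers are illustrative and do not come from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

temps = [10.0, 15.0, 20.0, 25.0, 30.0]
r_linear = pearson(temps, [20, 30, 40, 50, 60])   # rentals rise linearly with temp
r_inverse = pearson(temps, [60, 50, 40, 30, 20])  # rentals fall linearly with temp

print(r_linear, r_inverse)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. A feature with a strong non-linear effect, such as a two-peak pattern over the hours of the day, can still have a modest coefficient.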

Visualize the correlation coefficients to make them easier to analyze.

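import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic; omit this line when running as a plain script
%matplotlib inline

# figsize=(a, b) sets the figure size: a is the width and b is the height, in inches
fig = plt.figure(figsize = (12,12))
ax1 = fig.add_subplot(1,1,1)
sns.set(style = 'dark')

"""
seaborn.heatmap() parameters
(1) vmin, vmax: the values that anchor the two ends of the colormap; when omitted, they are inferred from the data
(2) annot: short for annotate; off by default; when True, the data value is written in each cell
(3) cmap: a matplotlib colormap name or object; when omitted, the default depends on whether center is set
  Note that the second character of 'YlGnBu' is a lowercase L, not the digit 1 or an uppercase I. If the name is rejected, copy the correct spelling from the list of valid names in the error message
(4) linewidths: width of the lines that divide the cells
"""
sns.heatmap(correlation,ax=ax1,vmax=1,square=False,annot=True,cmap='YlGnBu',linewidths=.5)

plt.show()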

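The heatmap shows that the rental count (count) is relatively strongly correlated with hour, temp, and humidity, so the following sections study how these three factors affect rentals.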
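
Reading the strongest correlations off a heatmap by eye can be cross-checked by sorting the count column of the correlation matrix. The sketch below does this on a made-up frame; the values are invented, and casual and registered are dropped from the ranking because they sum to count by construction. Ranking by absolute value is a choice made here so that strong negative correlations are not overlooked.

```python
import pandas as pd

# Invented data; casual + registered equals count in every row.
toy = pd.DataFrame({
    "hour": [0, 6, 12, 18, 23],
    "temp": [10, 12, 20, 18, 11],
    "humidity": [80, 75, 40, 50, 78],
    "windspeed": [5, 0, 7, 3, 6],
    "casual": [2, 5, 60, 50, 5],
    "registered": [8, 55, 140, 130, 25],
    "count": [10, 60, 200, 180, 30],
})

# Correlation of every feature with count, excluding count itself and its two components.
corr_with_count = toy.corr()["count"].drop(["count", "casual", "registered"])
# Rank by strength regardless of sign and keep the three strongest.
ranked = corr_with_count.abs().sort_values(ascending=False)
top3 = ranked.head(3).index.tolist()

print(ranked)
print("three strongest features:", top3)
```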

4.2 Effect of the Hour of Day on Rentals

workingday_df = bikeDf[bikeDf['workingday']==1]
# agg applies an aggregation function to each column within a group and returns one scalar per column; here it returns the mean
workingday_df = workingday_df.groupby(['hour'],as_index=True).agg({'casual':'mean','registered':'mean','count':'mean'})

nworkingday_df = bikeDf[bikeDf['workingday']==0]
nworkingday_df = nworkingday_df.groupby(['hour'],as_index=True).agg({'casual':'mean','registered':'mean','count':'mean'})

# Set the style before creating the axes so that it applies to them
sns.set_style('whitegrid')
fig,axes = plt.subplots(1 ,2 ,sharey = True, figsize = (15,5))

workingday_df.plot(title = 'The average number of rentals initiated per hour on working days',ax = axes[0])
nworkingday_df.plot(title = 'The average number of rentals initiated per hour on non-working days',ax = axes[1])

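Conclusions:

(1) On working days, registered-user rentals show two peaks concentrated around commuting hours, while casual-user rentals peak at 17:00.

(2) On non-working days, rentals by both registered and casual users follow a single-peaked, roughly bell-shaped curve.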
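
Peak hours like the ones described above can be read from the grouped means programmatically rather than from the chart. The sketch below uses made-up hourly averages in the same layout as workingday_df (hour as the index, one column per user type); the numbers are illustrative only.

```python
import pandas as pd

# Invented hourly averages for working days, indexed by hour.
hourly = pd.DataFrame(
    {
        "casual": [5, 10, 30, 60, 75, 40],
        "registered": [20, 400, 150, 180, 450, 120],
    },
    index=pd.Index([0, 8, 12, 15, 17, 21], name="hour"),
)

# Hour at which each column reaches its maximum.
peak_hour = hourly.idxmax()
# The two busiest hours for registered users, busiest first.
top2_registered = hourly["registered"].nlargest(2).index.tolist()

print(peak_hour)
print("two busiest hours for registered users:", top2_registered)
```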

4.3 Effect of Temperature on Rentals

temp_df = bikeDf.groupby(['temp'],as_index = True).agg({'casual':'mean','registered':'mean','count':'mean'})
temp_df.plot(figsize = (15,5),title = 'The average number of rentals initiated per hour by temperature')

Rentals tend to rise with temperature. They begin to fall at about 36 degrees and reach a low point at 38 degrees.
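
Grouping by each exact temperature value can give noisy means where only a few records exist, which may be the case at the hottest temperatures; that is an assumption, not something the data above confirms. One way to check is to group temperatures into bins with pd.cut and report the group size next to the mean. The sketch below uses made-up observations and arbitrary 10-degree bins.

```python
import pandas as pd

# Invented observations; temp in degrees Celsius.
obs = pd.DataFrame({
    "temp": [5.0, 8.0, 14.0, 18.0, 26.0, 29.0, 36.0, 38.0],
    "count": [20, 40, 90, 110, 220, 240, 200, 150],
})

# Arbitrary 10-degree bins: (0, 10], (10, 20], (20, 30], (30, 40].
bins = [0, 10, 20, 30, 40]
obs["temp_bin"] = pd.cut(obs["temp"], bins=bins)

# Mean rentals and number of records per bin.
by_bin = obs.groupby("temp_bin", observed=True)["count"].agg(["mean", "size"])
print(by_bin)
```

A bin with a very small size deserves less weight when interpreting a rise or fall in its mean.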

4.4 Effect of Humidity on Rentals

humidity_df = bikeDf.groupby(['humidity'],as_index = True ).agg({'casual':'mean','registered':'mean','count':'mean'})
humidity_df.plot(figsize = (15,5),title = 'The average number of rentals initiated by humidity')
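
The earlier observation that humidity affects casual users more than registered users can be quantified by correlating humidity with each user type separately. The sketch below does this with DataFrame.corrwith on made-up numbers, which were chosen so that casual rentals fall steadily with humidity while registered rentals barely respond; they are not results from the dataset.

```python
import pandas as pd

# Invented data: casual falls linearly with humidity, registered fluctuates.
toy = pd.DataFrame({
    "humidity": [30, 45, 60, 75, 90],
    "casual": [50, 40, 30, 20, 10],
    "registered": [100, 110, 90, 105, 95],
})

# Pearson correlation of each user type with humidity.
hum_corr = toy[["casual", "registered"]].corrwith(toy["humidity"])
print(hum_corr)
```

On the real data, a clearly larger absolute value for casual than for registered would support that observation.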