Bike-Sharing Project Analysis

Latest recommended article published on 2020-12-03 17:42:09

weixin_44422539

Article link: https://blog.csdn.net/weixin_44422539/article/details/89341819

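This post analyzes a bike-sharing project and uncovers patterns and problems in the rental data. Using data preprocessing and a random forest model, it studies how time of day, temperature, humidity, wind speed, and other factors affect the number of rentals, and predicts the rental count.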

(Summary generated automatically by CSDN.)

Background and Objectives:

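After years of development, shared bikes have become increasingly popular and are now an everyday means of transport for many people.
Shared bikes, visible on streets everywhere, have become a distinctive sight in fast-growing cities.
Alongside this rapid growth have come problems such as intensifying competition and slow growth in the number of rentals.
By diagnosing this set of bike-sharing rental data and examining the problems it reveals, this analysis aims to provide reference and guidance for future product maintenance and operations.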

Analysis Approach

[Figure: diagram of the analysis approach; image not included in the source]

The concrete analysis steps follow.

Data source: downloaded from Kaggle.

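The analysis could be done in Excel or Python. For ease of demonstration, and because a random forest model has to be built later to predict the rental count, everything here is done in Python in a Jupyter Notebook environment.

Import the required modules and the data: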

import numpy as np
import pandas as pd
from datetime import datetime
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')
import seaborn as sns
sns.set(style='whitegrid', palette='tab10')

path = 'C:\\Users\\Master\\Desktop\\bike-sharing-demand\\'
train=pd.read_csv(path + 'train.csv')

Check whether the training data has missing values:

train.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB

The training data has 10886 rows and no missing values.
Next, load the test data:

test = pd.read_csv(path + 'test.csv')  # use the same folder as train.csv
test.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB
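
Reading the non-null counts off `info()` works for a dozen columns, but the same check can be made explicit in code. A minimal sketch, assuming a hypothetical helper `missing_report` and a toy frame in place of the real files:

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, listing only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Hypothetical toy frame standing in for train.csv / test.csv
demo = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02],
    "humidity": [81, 80, 80],
    "windspeed": [0.0, 0.0, np.nan],
})

report = missing_report(demo)
print(report)

# An empty report means no column has missing values
print(missing_report(demo.dropna()).empty)
```

Applied to the article's `train` and `test` frames, an empty result would correspond to the "no missing values" reading of the `info()` output.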

Neither dataset has missing values, but the absence of missing values does not rule out anomalies.
Check for outliers:

# Check whether the data contains outliers
train.describe()

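Output:
[Image: summary statistics table returned by train.describe()]
Starting with the numeric columns: to see how widely the rental count (count) varies, look at its density distribution: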

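# Plot the density distribution of the rental count
fig, ax = plt.subplots(figsize=(6, 5))

# Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the modern replacement
sns.distplot(train['count'], ax=ax)
ax.set(xlabel='count', title='Distribution of count');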

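Output:
[Image: density plot titled "Distribution of count"]
The distribution is heavily skewed and has a long tail. To improve the accuracy of the later prediction, the target should be as close to a normal distribution as possible, so the long tail is trimmed by dropping the rows whose count lies more than 3 standard deviations from the mean: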

train_withountOutliers = train[np.abs(train['count']-train['count'].mean())<=(3*train['count'].std())]
train_withountOutliers.shape

Output:
(10739, 12)
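
The 3-standard-deviation rule used above can be sketched as a small reusable function. This is only an illustration under stated assumptions: the helper name `drop_outliers_3sigma` and the synthetic log-normal data are hypothetical and are not part of the original notebook. In the article's data the filter removes 10886 - 10739 = 147 rows.

```python
import numpy as np
import pandas as pd


def drop_outliers_3sigma(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k sample standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # pandas uses the sample standard deviation (ddof=1) by default
    mask = (df[column] - mean).abs() <= k * std
    return df[mask]


# Synthetic, right-skewed stand-in for the rental count (NOT the Kaggle data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"count": rng.lognormal(mean=4.5, sigma=1.0, size=5000).round() + 1})

filtered = drop_outliers_3sigma(toy, "count")
upper = toy["count"].mean() + 3 * toy["count"].std()

print("rows before:", len(toy))
print("rows after: ", len(filtered))
print("upper cutoff:", round(upper, 1))
```

Because the count is non-negative and skewed to the right, the lower cutoff (mean minus 3 standard deviations) is typically below zero, so in practice only the upper tail is trimmed.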
Plot the density again and compare it with the original data:

fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.distplot(train['count'], ax=ax1)
sns.distplot(train_withountOutliers['count'], ax=ax2)

ax1.set(xlabel='count', title='Distribution of count')
ax2.set(xlabel='count', title='Distribution of count without outliers')

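Output:
[Image: side-by-side density plots titled "Distribution of count" and "Distribution of count without outliers"]
The spread is still large. To reduce the risk of overfitting during training, a more stable target is preferable, so a log transform is applied: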

yLabels = train_withountOutliers['count']
yLabels_log = np.log(yLabels)
sns.distplot(yLabels_log)

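Output:
[Image: density plot of the log-transformed count]
The distribution is now much more even and the differences in magnitude are smaller. Training on labels like these should help the model.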
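
The effect of the log transform can also be checked numerically rather than only by eye. The sketch below uses synthetic data (an assumption, not the Kaggle data) and compares the sample skewness before and after `np.log`; it also shows that predictions made on the log scale have to be mapped back with `np.exp`. If a target could contain zeros, `np.log1p` with `np.expm1` as its inverse would be the usual alternative.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed counts that are all >= 1 (NOT the Kaggle data)
rng = np.random.default_rng(42)
y = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=10_000).round() + 1, name="count")

# Same transform as in the article; it requires strictly positive values
y_log = np.log(y)

# Skewness near 0 indicates a roughly symmetric distribution
print("skewness of raw count:", round(y.skew(), 2))
print("skewness of log count:", round(y_log.skew(), 2))

# A model trained on y_log predicts on the log scale,
# so predictions must be converted back with the inverse transform
y_back = np.exp(y_log)
print("round trip recovers the original values:", np.allclose(y_back, y))
```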

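Next, the remaining columns are processed. The test data can be concatenated with the training data here so that both are processed together: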

Bike_data = pd.concat([train_withountOutliers,test],ignore_index=True)
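
One consequence of concatenating the two sets is worth making explicit: the test data lacks the casual, registered, and count columns, so those cells are filled with NaN for the test rows. The sketch below uses tiny hypothetical frames (`train_part` and `test_part` are stand-ins, not the real files) to show this, along with one possible way to split the rows apart again after shared preprocessing.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the training and test sets
train_part = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "temp": [9.84, 9.02],
    "count": [16, 40],
})
test_part = pd.DataFrame({
    "datetime": ["2011-01-20 00:00:00"],
    "temp": [10.66],
})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping both original indexes
combined = pd.concat([train_part, test_part], ignore_index=True)
print(combined)

# Test rows have no label, so "count" is NaN there (and the column becomes float64)
is_test = combined["count"].isnull()

# After shared feature engineering, the two parts can be separated again
train_again = combined[~is_test]
test_again = combined[is_test].drop(columns=["count"])
print(len(train_again), len(test_again))
```

Splitting on the missing label works here only because the training labels themselves contain no missing values, which the `info()` output earlier confirms for this dataset.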
