Python Data Analysis Case Study (4): Bike-Share Rental Analysis

bb8886

Category: Data Analysis. Tags: python, data analysis, programming languages

Article link: https://blog.csdn.net/bb8886/article/details/128121567

1. Data acquisition

Dataset source: https://www.kaggle.com/pronto/cycle-share-dataset

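trip.csv field descriptions

trip_id	starttime	stoptime	bikeid	from_station_id	to_station_id	usertype	gender	birthyear
Trip ID	Ride start time	Ride end time	Bike ID	Origin station ID	Destination station ID	User type	Gender	Birth year

weather.csv field descriptions

date	temperature	dew_point	humidity	sea_pressure	visibility_miles	wind_speed	precipitation_in
Date	Temperature	Dew point	Humidity	Sea-level pressure	Visibility	Wind speed	Precipitation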

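2. Analysis questions

1) Which time of day sees the most bike-share use?

2) How does usage differ between weekdays and weekends?

3) How does usage vary by month over the year?

4) How does demand compare between members and non-members?

5) What is the male-to-female split among riders?

6) How are riders distributed by age?

7) How do weather factors such as temperature, humidity, and visibility affect usage?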

3. Steps

3.1 Import data
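
The post gives no code for this step, so here is a minimal sketch of how the two tables might be loaded. The helper name load_tables is hypothetical, and the in-memory rows are made up so that the sketch runs without the Kaggle files; with the real data, the file names trip.csv and weather.csv from the field descriptions above would be passed instead.

```python
import io

import pandas as pd


def load_tables(trip_source, weather_source):
    """Read the trip and weather tables into DataFrames.

    Each source may be a file path such as 'trip.csv' or a file-like object.
    """
    trip = pd.read_csv(trip_source)
    weather = pd.read_csv(weather_source)
    return trip, weather


# Made-up rows for illustration only (not taken from the dataset).
# With the real files: trip, weather = load_tables('trip.csv', 'weather.csv')
trip_csv = io.StringIO(
    "trip_id,starttime,stoptime,bikeid,usertype,gender,birthyear\n"
    "1,10/13/2014 10:31,10/13/2014 10:48,BIKE_A,Member,Male,1960\n"
    "2,10/13/2014 10:32,10/13/2014 10:48,BIKE_B,Short-Term Pass Holder,,\n"
)
weather_csv = io.StringIO(
    "Date,Mean_Temperature_F\n"
    "10/13/2014,62\n"
)

sample_trip, sample_weather = load_tables(trip_csv, weather_csv)
print(sample_trip.shape, sample_weather.shape)
```

Empty cells, as in the second made-up row, are read as NaN, which matters later when rows with missing values are dropped.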

3.2 Inspect the data
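
No code is shown for this step either. One possible way to get a quick overview is sketched below: head() for the first rows, plus a small per-column summary of dtypes and missing values. The summarize helper and the sample frame are hypothetical and only stand in for the real trip table.

```python
import pandas as pd


def summarize(df):
    """Return one row per column: dtype, non-null count, and missing count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'missing': df.isna().sum(),
    })


# Made-up stand-in for the trip table
sample = pd.DataFrame({
    'usertype': ['Member', 'Short-Term Pass Holder', 'Member'],
    'gender': ['Male', None, 'Female'],
    'birthyear': [1985.0, None, 1992.0],
})

print(sample.head())      # first rows of the table
overview = summarize(sample)
print(overview)           # which columns have gaps, and how many
```

The built-in df.info() prints similar information; returning a DataFrame instead makes the counts easy to sort or filter.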

3.3 Data cleaning

1) Drop redundant fields

Code:

trip.drop(['tripduration', 'from_station_name', 'to_station_name'], axis=1, inplace=True)
weather.drop([
    'Max_Temperature_F',
    'Min_TemperatureF',
    'Max_Dew_Point_F',
    'Min_Dewpoint_F',
    'Max_Humidity',
    'Min_Humidity ',
    'Max_Sea_Level_Pressure_In ',
    'Min_Sea_Level_Pressure_In ',
    'Max_Visibility_Miles ',
    'Min_Visibility_Miles ',
    'Max_Wind_Speed_MPH ',
    'Max_Gust_Speed_MPH'
], axis=1, inplace=True)

2) Rename fields

weather.rename(columns={
    'Mean_Temperature_F': 'Temperature',
    'MeanDew_Point_F': 'Dew_point',
    'Mean_Humidity ': 'Humidity',
    'Mean_Sea_Level_Pressure_In ': 'Sea_Pressure',
    'Mean_Visibility_Miles ': 'Visibility_Miles',
    'Mean_Wind_Speed_MPH ': 'wind_Speed'
}, inplace=True)

3) Convert data types

Convert starttime, stoptime, and Date to datetime columns.

Code:

# Convert starttime, stoptime, and Date to datetime columns
trip['starttime'] = pd.to_datetime(trip.starttime)
trip['stoptime'] = pd.to_datetime(trip.stoptime)
weather['Date'] = pd.to_datetime(weather.Date)

4) Table optimization (add derived fields to trip)

# Add date-part fields to trip
# dt.normalize() resets the time to midnight and keeps a datetime64 column;
# astype('datetime64[D]') is not supported by pandas 2.x
trip['date'] = trip.starttime.dt.normalize()
trip['year'] = trip.starttime.dt.year
trip['month'] = trip.starttime.dt.month
trip['day'] = trip.starttime.dt.day
trip['hour'] = trip.starttime.dt.hour
trip['weekday'] = trip.starttime.dt.weekday  # Monday=0 ... Sunday=6
print(trip.head())

5) Drop missing values and derive age groups

trip_age = trip.dropna().copy()
trip_age['birthyear'] = trip_age.birthyear.astype('int64')
trip_age.info()

# Bucket users by birth year with a custom function and store the result
# in a new age field:
#   birthyear > 2000          -> '0-20'
#   1990 < birthyear <= 2000  -> '20-30'
#   1980 < birthyear <= 1990  -> '30-40'
#   1970 < birthyear <= 1980  -> '40-50'
#   otherwise                 -> '50+'
def birthyear2age(x):
    # Strict comparisons on both sides (e.g. 1990 < x < 2000) would let the
    # boundary years 1980, 1990, and 2000 fall through to '50+', so each
    # boundary year is assigned to the older group instead
    if x > 2000:
        return '0-20'
    elif x > 1990:
        return '20-30'
    elif x > 1980:
        return '30-40'
    elif x > 1970:
        return '40-50'
    else:
        return '50+'

trip_age['age'] = trip_age.birthyear.apply(birthyear2age)
# Equivalent to:
# trip_age['age'] = trip_age['birthyear'].map(birthyear2age)

# df.head() truncates wide tables; show up to 20 columns
pd.set_option('display.max_columns', 20)
# Show up to 100 rows
# pd.set_option('display.max_rows', 100)
print(trip_age.head())
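
As an alternative to a hand-written function, the same kind of bucketing can be sketched with pd.cut, which makes the interval boundaries explicit. This is only an illustration on made-up birth years. It assumes right-closed intervals, so the boundary years 1970, 1980, 1990, and 2000 go to the older group; the text does not say which group a boundary year belongs to.

```python
import pandas as pd

# Made-up birth years, including the boundary years
birthyears = pd.Series([2001, 2000, 1995, 1990, 1985, 1975, 1960])

# Right-closed intervals: (-inf, 1970], (1970, 1980], ..., (2000, inf]
edges = [float('-inf'), 1970, 1980, 1990, 2000, float('inf')]
labels = ['50+', '40-50', '30-40', '20-30', '0-20']

age_group = pd.cut(birthyears, bins=edges, labels=labels, right=True)
print(age_group.tolist())
# ['0-20', '20-30', '20-30', '30-40', '30-40', '40-50', '50+']
```

Because the result is an ordered categorical, a later groupby or bar chart lists the groups in bin order rather than alphabetically.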

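3.4 Data analysis and visualization

1) Which time of day sees the most bike-share use?

Code:

import matplotlib.pyplot as plt

# Distribution of bike-share usage across the hours of the day
# Count trips per (date, hour). groupby + size keeps every count attached to
# its own row, instead of relying on two separately ordered results lining up
df = (trip.groupby(['date', 'year', 'month', 'day', 'hour', 'weekday'])
          .size()
          .reset_index(name='count'))
df1 = df[['hour', 'count']]
df1.boxplot(by='hour', figsize=[8.2, 4.2])
plt.title('Pronto bike-share hourly usage, 2014-2015')
plt.xlabel('Figure 1')
plt.show()

The box plot shows that usage peaks at 8 a.m. and 5 p.m., which coincide with the morning and evening commute.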
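
The peak hours can also be read off numerically rather than by eye. The sketch below takes the median count per hour and picks the two largest; the counts are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Made-up per-(date, hour) trip counts
counts = pd.DataFrame({
    'hour':  [7, 8, 8, 12, 17, 17, 22],
    'count': [10, 30, 34, 12, 40, 36, 3],
})

# The median is the center line of each box in a box plot
median_by_hour = counts.groupby('hour')['count'].median()
peak_hours = median_by_hour.nlargest(2).index.tolist()
print(peak_hours)
# [17, 8]
```

Applied to the real hourly table, the same two lines would give the hours with the highest typical usage.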

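2) How does usage differ between weekdays and weekends?

Code:

# Distribution of bike-share usage by day of the week
df2 = df[['weekday', 'count']]
ax = df2.boxplot(by='weekday', figsize=[8.2, 4.2])
plt.title('Pronto bike-share usage by day of week, 2014-2015')
# weekday() numbers the days from Monday=0 to Sunday=6
ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation='horizontal')
plt.xlabel('Figure 2')
plt.show()

The box plot shows that usage is generally higher on weekdays than on weekends.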

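3) How does usage vary by month over the year?

Code:

# Distribution of bike-share usage by month
df3 = df[['month', 'count']]
ax3 = df3.boxplot(by='month', figsize=[8.2, 4.2])
ax3.set_title('Pronto bike-share usage by month, 2014-2015')
# Month initials, January through December
ax3.set_xticklabels(['J','F','M','A','M','J','J','A','S','O','N','D'],rotation='horizontal')
ax3.set_xlabel('Figure 3')
plt.show()

The box plot shows that usage is higher from June to August and lower from December to the following February.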

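4) How does demand compare between members and non-members?

The pie chart shows that members account for 61.2% of trips and non-members for 38.8%.

Code:

# Share of trips by user type
df4 = trip.groupby(['usertype']).usertype.count()
ax4 = df4.plot.pie(startangle=90, autopct='%.1f%%')
ax4.set_title('Pronto bike-share trips by user type, 2014-2015')
ax4.set_xlabel('Figure 4')
plt.axis('equal')
plt.show()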
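
The percentages that the pie chart prints can also be computed directly, which is handy when quoting them in text. A minimal sketch using value_counts(normalize=True) on made-up labels follows; the values here are illustrative and unrelated to the percentages reported above.

```python
import pandas as pd

# Made-up user types, one entry per trip
usertype = pd.Series(
    ['Member', 'Member', 'Short-Term Pass Holder', 'Member', 'Short-Term Pass Holder'],
    name='usertype',
)

# normalize=True returns fractions instead of counts
share = usertype.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Note that value_counts skips missing values by default, so the shares are relative to the rows where the field is filled in.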

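5) What is the male-to-female split among riders?

The pie chart shows that male riders account for the largest share of trips, at 77.4%.

Code:

# Share of trips by rider gender
df5 = trip.groupby(['gender']).gender.count()
ax5 = df5.plot.pie(startangle=90, autopct='%.1f%%')
ax5.set_title('Pronto bike-share trips by gender, 2014-2015')
ax5.set_xlabel('Figure 5')
plt.axis('equal')
plt.show()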

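6) How are riders distributed by age?

Code:

# Number of trips per age group
df6 = trip_age.groupby(['age']).age.count()
ax6 = df6.plot(kind='bar')
# Write each count above its bar
for a, b in enumerate(df6):
    plt.text(a, b, b, ha='center', va='bottom')
ax6.set_title('Pronto bike-share trips by age group, 2014-2015')
ax6.set_xlabel('Figure 6')
plt.show()

The bar chart shows that riders aged 30-40 use the bikes the most.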

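7) How do weather factors such as temperature, humidity, and visibility affect usage?

The box plot shows that demand is highest when the temperature is between 50 and 80 °F.

Code:

# Weather dimension
# Both merge keys must have the same dtype. Converting df['date'] to a
# midnight-normalized datetime makes it match weather['Date'], whether it
# currently holds datetimes or strings such as '2014-10-13 00:00:00'
df['date'] = pd.to_datetime(df['date']).dt.normalize()
merged = df.merge(right=weather, how='inner', left_on='date', right_on='Date')
print(merged.head())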
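
The join between the hourly counts and the daily weather table can be sketched on a small example. The frames below are made up; the validate='many_to_one' argument is an optional extra check, added here as an assumption about what is worth guarding against, that the weather table has at most one row per date.

```python
import pandas as pd

# Made-up hourly counts whose date column still holds strings
hourly = pd.DataFrame({
    'date': ['2014-10-13 00:00:00', '2014-10-13 00:00:00', '2014-10-14 00:00:00'],
    'hour': [8, 17, 8],
    'count': [30, 40, 28],
})
# Made-up daily weather with a datetime key
daily_weather = pd.DataFrame({
    'Date': pd.to_datetime(['2014-10-13', '2014-10-14']),
    'Temperature': [62, 59],
})

# Bring the left key to the same type as the right key before merging
hourly['date'] = pd.to_datetime(hourly['date']).dt.normalize()

merged_sample = hourly.merge(
    daily_weather,
    how='inner',
    left_on='date',
    right_on='Date',
    validate='many_to_one',  # raises MergeError if a date repeats in daily_weather
)
print(merged_sample[['date', 'hour', 'count', 'Temperature']])
```

Each hourly row picks up the weather of its day, so the daily temperature is repeated for every hour of that date.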

# Effect of temperature on bike-share usage
df7 = merged[['Temperature', 'count']]
ax7 = df7.boxplot(by='Temperature', figsize=[16.2, 8.2])
ax7.set_title('Pronto bike-share usage by temperature, 2014-2015')
ax7.set_xlabel('Temperature')
ax7.set_ylabel('Count')
plt.show()
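
A box plot with one box per distinct temperature value can get crowded. One option, sketched here on made-up numbers, is to group temperatures into 10 °F bands with pd.cut and summarize each band. The band width and the sample values are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Made-up (temperature, hourly count) pairs
sample = pd.DataFrame({
    'Temperature': [38, 45, 52, 58, 63, 67, 71, 78, 84],
    'count':       [4, 6, 12, 15, 18, 20, 22, 19, 9],
})

# Left-closed 10-degree bands: [30, 40), [40, 50), ..., [80, 90)
bands = pd.cut(sample['Temperature'], bins=list(range(30, 100, 10)), right=False)

median_by_band = sample.groupby(bands, observed=True)['count'].median()
print(median_by_band)
```

The same banding could be applied to the humidity or visibility columns to compare ranges rather than single values.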

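The box plot shows that demand is highest when humidity is between 30% and 70%, and that very high humidity suppresses demand.

Code:

# Effect of humidity on bike-share usage
df7 = merged[['Humidity', 'count']]
ax7 = df7.boxplot(by='Humidity', figsize=[16.2, 8.2])
ax7.set_title('Pronto bike-share usage by humidity, 2014-2015')
ax7.set_xlabel('Humidity')
ax7.set_ylabel('Count')
plt.show()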

The box plot shows that usage differs little once visibility exceeds 3 miles, while visibility below 3 miles reduces usage.

Code:

# Effect of visibility on bike-share usage
df7 = merged[['Visibility_Miles', 'count']]
ax7 = df7.boxplot(by='Visibility_Miles', figsize=[10.2, 6.2])
ax7.set_title('Pronto bike-share usage by visibility, 2014-2015')
ax7.set_xlabel('Visibility')
ax7.set_ylabel('Count')
plt.show()