- Required environment
Windows 10, Python, PyCharm
Recommended installation order: Python -> PyCharm
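Python and third-party library installation tutorial: https://blog.csdn.net/weixin_42128329/article/details/90046108
PyCharm and third-party library installation tutorial: https://blog.csdn.net/weixin_42128329/article/details/90046725
Third-party libraries used: NumPy, pandas, Matplotlib, Seaborn
Tutorial on installing third-party libraries from within PyCharm: https://blog.csdn.net/qq_41106517/article/details/81140563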
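If PyCharm shows "No information available", check your network connection first. If the network is fine, switching pip's package index to the Aliyun mirror can resolve it.
Create a file named pip.ini with the following content:
[global]
index-url = http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
Then create a folder named pip under C:\Users\<your username>\AppData\Roaming and put pip.ini inside it.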
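If you would rather not create the file by hand, the same configuration could be written from Python. This is only a minimal sketch: the helper name write_pip_ini is made up for illustration, and it assumes the APPDATA environment variable points at the Roaming folder, as it normally does on Windows.

```python
import configparser
import os
from pathlib import Path


def write_pip_ini(target_dir):
    """Write a pip.ini holding the Aliyun mirror settings into target_dir.

    write_pip_ini is a hypothetical helper for this tutorial, not part of pip.
    """
    config = configparser.ConfigParser()
    config["global"] = {"index-url": "http://mirrors.aliyun.com/pypi/simple/"}
    config["install"] = {"trusted-host": "mirrors.aliyun.com"}

    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the pip folder if missing
    ini_path = target_dir / "pip.ini"
    with open(ini_path, "w", encoding="utf-8") as handle:
        config.write(handle)
    return ini_path


if __name__ == "__main__":
    # Assumption: APPDATA is set (true on a normal Windows session).
    appdata = os.environ.get("APPDATA")
    if appdata:
        print(write_pip_ini(Path(appdata) / "pip"))
    else:
        print("APPDATA is not set; pass a directory to write_pip_ini() instead.")
```

Running the script on Windows prints the path of the file it wrote; on other systems it only prints a hint, because the folder layout differs there.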
- Project background
I got hold of a set of bike-sharing ride data and used Python to visualize it.
Data download: https://pan.baidu.com/s/11DiPR0LjT5xNgic4wrpLGg (extraction code: g7bc)
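- Understanding the data
Open the file as a spreadsheet (for example in Excel). The columns are:
Datetime: timestamp
Season: season (quarter)
Holiday: public holiday (0 = no, 1 = yes)
Workingday: working day (0 = no, 1 = yes)
Casual: rides by non-members
Registered: rides by members
Count: total rides
How to read one row: in the first quarter, during the 2011/1/1 0:00 - 2011/1/1 1:00 time slot, on a day that is neither a holiday nor a working day, there were 3 non-member riders and 13 member riders, 16 in total.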
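The relationship between the last three columns can be checked in code. The sketch below rebuilds only the single row described above (it does not read the real file) and confirms that Casual plus Registered equals Count; lower-case column names are assumed, matching how the later code refers to them.

```python
import pandas as pd

# One row rebuilt from the description above; not read from train.csv.
sample = pd.DataFrame(
    {
        "datetime": ["2011/1/1 0:00"],
        "season": [1],
        "holiday": [0],
        "workingday": [0],
        "casual": [3],
        "registered": [13],
        "count": [16],
    }
)

# True for every row where the two rider groups add up to the total.
consistent = (sample["casual"] + sample["registered"]) == sample["count"]
print(consistent.all())
```

The same comparison can be run on the full table once it is loaded, as a quick sanity check before plotting.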
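- Code walkthrough
Imports (load the libraries the script needs):
# Imports
import calendar
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sn

# Use a font that has Chinese glyphs so any Chinese text in the plots renders correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']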
Reading the data and taking a first look:
# Read the data
bikedata = pd.read_csv("train.csv")
# Print the views we need
print(bikedata)
print(bikedata.shape)
print(bikedata.head())
print(bikedata.tail())
print(bikedata.dtypes)
print(bikedata.describe())
bikedata.shape: the size of the data (rows, columns)
bikedata.head(): the first 5 rows
bikedata.tail(): the last 5 rows
bikedata.dtypes: the data type of each column
bikedata.describe(): summary statistics
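To try these calls without downloading anything, a tiny in-memory CSV can stand in for train.csv. The two rows below are hypothetical sample values that only borrow the column names described earlier.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for train.csv.
csv_text = """datetime,season,holiday,workingday,casual,registered,count
2011/1/1 0:00,1,0,0,3,13,16
2011/1/1 1:00,1,0,0,8,32,40
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # datetime is read as text, not as a timestamp
print(sample.describe())  # count, mean, std, min, quartiles, max per numeric column
```

Note that read_csv leaves the datetime column as text unless told otherwise, which is why the next step splits it as a string.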
Result (printed output):
Extracting features:
# New date column: the year/month/day part of datetime
bikedata['date'] = bikedata.datetime.apply(lambda x: x.split()[0])
# New hour column: the hour part of datetime
bikedata['hour'] = bikedata.datetime.apply(lambda x: x.split()[1].split(':')[0])
# Drop the original season column
bikedata.drop('season', axis=1, inplace=True)
# Recreate season; at first it holds the month number taken from date
bikedata['season'] = bikedata.date.apply(lambda x: x.split('/')[1])
bikedata['weekday'] = bikedata.date.apply(lambda dateString: calendar.day_name[datetime.strptime(dateString, '%Y/%m/%d').weekday()])
bikedata['month'] = bikedata.date.apply(lambda dateString: calendar.month_name[datetime.strptime(dateString, '%Y/%m/%d').month])
# Convert the month number to int
bikedata['season'] = bikedata['season'].astype('int')
# Replace month numbers with season names using a dictionary
bikedata['season'] = bikedata.season.map({3:'Spring',4:'Spring',5:'Spring',6:'summer',7:'summer',8:'summer',9:'Fall',10:'Fall',11:'Fall',12:'Winter',1:'Winter',2:'Winter'})
bikedata['hour'] = bikedata['hour'].astype('int')
varlist = ['weekday', 'month', 'season', 'holiday', 'workingday']
for x in varlist:
    bikedata[x] = bikedata[x].astype('category')
bikedata.drop('datetime', axis=1, inplace=True)
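The string splitting above works, but the same columns could also be derived by parsing the timestamps once with pd.to_datetime. This is an alternative sketch, not the approach used in this post; it assumes the text looks like 'YYYY/M/D H:MM', as the strptime format above implies, and the sample values are made up.

```python
import pandas as pd

# Hypothetical timestamps in the same text format the code above assumes.
sample = pd.DataFrame({"datetime": ["2011/1/1 0:00", "2011/7/15 17:00", "2012/10/3 8:00"]})

# Parse once, then read the parts from the .dt accessor.
parsed = pd.to_datetime(sample["datetime"], format="%Y/%m/%d %H:%M")
sample["hour"] = parsed.dt.hour
sample["weekday"] = parsed.dt.day_name()
sample["month"] = parsed.dt.month_name()

# Same month-to-season mapping as the dictionary above, written more compactly.
season_of_month = {
    **dict.fromkeys([3, 4, 5], "Spring"),
    **dict.fromkeys([6, 7, 8], "summer"),
    **dict.fromkeys([9, 10, 11], "Fall"),
    **dict.fromkeys([12, 1, 2], "Winter"),
}
sample["season"] = parsed.dt.month.map(season_of_month)

print(sample)
```

With this variant the hour is already an integer, so the later astype('int') conversion would not be needed.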
# Inspect the data with box plots
# plt.subplots creates the figure and a 2x2 grid of axes; with no arguments it creates a single plot
fig, axes = plt.subplots(nrows=2, ncols=2)
fig.set_size_inches(12, 12)
# Draw the box plots
sn.boxplot(data=bikedata, y="count", orient="v", ax=axes[0][0])
sn.boxplot(data=bikedata, y="count", x="season", orient="v", ax=axes[0][1])
sn.boxplot(data=bikedata, y="count", x="hour", orient="v", ax=axes[1][0])
sn.boxplot(data=bikedata, y="count", x="workingday", orient="v", ax=axes[1][1])
# Set axis labels and titles
axes[0][0].set(ylabel='Number of riders', title="Number of riders")
axes[0][1].set(ylabel='Number of riders', xlabel='Season', title="Riders by season")
axes[1][0].set(xlabel='Hour', ylabel='Number of riders', title="Riders by hour of day")
axes[1][1].set(xlabel='Working day', ylabel='Number of riders', title="Riders on working vs. non-working days")
plt.savefig("Abnormal_value_analysis.png")
plt.show()
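# Remove outliers
# Keep rows whose count lies within three standard deviations of the mean (abs = absolute value, mean = average)
bikedata1 = bikedata[np.abs(bikedata["count"] - bikedata["count"].mean()) <= (3 * bikedata["count"].std())]
# Save the filtered data (bikedata1) to a file
bikedata1.to_csv('processed_data.csv')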
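To see what the three-standard-deviation rule does, here is a small sketch on made-up numbers: twenty ordinary counts plus one extreme value. The filter is the same expression as above, just applied to a Series instead of the real table.

```python
import numpy as np
import pandas as pd

# Made-up counts: twenty ordinary values plus one extreme value.
counts = pd.Series([10, 12, 11, 9, 13] * 4 + [500], name="count")

mean = counts.mean()
std = counts.std()  # pandas uses the sample standard deviation (ddof=1)

# Keep values within three standard deviations of the mean.
within_three_sigma = np.abs(counts - mean) <= 3 * std
filtered = counts[within_three_sigma]

print(len(counts), len(filtered))
```

Only the extreme value is dropped here. Keep in mind that the outlier itself inflates the mean and the standard deviation, so on very small samples the rule can fail to remove anything.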
Result (the figure saved as Abnormal_value_analysis.png):
Plotting riders by month:
# Riders by month
def Data_Analysis_and_Visualization_month(bikedata1):
    fig1, ax1 = plt.subplots()
    fig1.set_size_inches(12, 20)
    sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
    # Mean count per month
    monthAggregated = pd.DataFrame(bikedata1.groupby("month")["count"].mean()).reset_index()
    monthSorted = monthAggregated.sort_values(by="count", ascending=False)
    # order= draws the bars in calendar order regardless of the sort above
    sn.barplot(data=monthSorted, x="month", y="count", order=sortOrder, ax=ax1)
    ax1.set(xlabel='Month', ylabel='Average number of riders', title="Riders by month")
    plt.savefig('result1.png')
    plt.show()
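The month names are plain strings, so left alone they would sort alphabetically. The function above handles that with order=sortOrder; another option, sketched here on made-up data, is to make the column an ordered categorical so that groupby itself returns calendar order.

```python
import calendar

import pandas as pd

# Made-up monthly counts, deliberately out of calendar order.
sample = pd.DataFrame(
    {
        "month": ["March", "January", "February", "January", "March"],
        "count": [30, 10, 20, 14, 34],
    }
)

# calendar.month_name[0] is an empty string, so skip it.
month_order = list(calendar.month_name)[1:]
sample["month"] = pd.Categorical(sample["month"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data.
monthly_mean = sample.groupby("month", observed=True)["count"].mean()
print(monthly_mean)
```

The result is indexed January, February, March, with the mean of the rows for each month.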
Result (saved as result1.png):
Plotting riders by hour for each day of the week:
# Riders by hour for each day of the week
def Data_Analysis_and_Visualization_week(bikedata1):
    fig2, ax2 = plt.subplots()
    fig2.set_size_inches(12, 20)
    hueOrder = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
    # Mean count for every (hour, weekday) pair
    hourAggregated = pd.DataFrame(bikedata1.groupby(["hour","weekday"])["count"].mean()).reset_index()
    print(hourAggregated)
    sn.pointplot(data=hourAggregated, x="hour", y="count", hue="weekday", hue_order=hueOrder, ax=ax2)
    ax2.set(xlabel='Hour', ylabel='Number of riders', title='Riders by hour for each day of the week')
    plt.savefig('result2.png')
    plt.show()
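The key step in this function is grouping by two columns at once. The sketch below, on made-up data, shows the shape of the table that reaches pointplot: one row per (hour, weekday) pair with the mean count for that pair.

```python
import pandas as pd

# Made-up rides: two hours observed on two weekdays.
sample = pd.DataFrame(
    {
        "hour": [8, 8, 8, 17, 17, 17],
        "weekday": ["Monday", "Monday", "Sunday", "Monday", "Sunday", "Sunday"],
        "count": [100, 120, 20, 200, 60, 80],
    }
)

# One row per (hour, weekday) pair, holding the mean count for that pair.
hourly = sample.groupby(["hour", "weekday"])["count"].mean().reset_index()
print(hourly)
```

Each weekday then becomes one line in the point plot, with hour on the x axis.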
Result (saved as result2.png):
Plotting riders by hour for each season:
# Riders by hour for each season
def season_And_hour(bikedata1):
    fig2, ax2 = plt.subplots()
    fig2.set_size_inches(12, 20)
    hueOrder = ['Spring','summer','Fall','Winter']
    # Mean count for every (hour, season) pair
    hourAggregated = pd.DataFrame(bikedata1.groupby(["hour", "season"])["count"].mean()).reset_index()
    sn.pointplot(data=hourAggregated, x="hour", y="count", hue="season", hue_order=hueOrder, ax=ax2)
    ax2.set(xlabel='Hour', ylabel='Number of riders', title='Riders by hour for each season')
    plt.savefig('result3.png')
    plt.show()
Result (saved as result3.png):
Plotting riders by hour for each user type:
# Riders by hour for each user type
def user_And_hour(bikedata1):
    fig, axes = plt.subplots()
    fig.set_size_inches(12, 20)
    # Stack the casual and registered columns into variable/value pairs
    hour_Transform = pd.melt(bikedata1[['hour', 'casual', 'registered', 'weekday']], id_vars=['hour', 'weekday'], value_vars=['casual', 'registered'])
    hour_Aggregated = pd.DataFrame(hour_Transform.groupby(['hour', 'variable'])['value'].mean()).reset_index()
    sn.pointplot(data=hour_Aggregated, x='hour', y='value', hue='variable', hue_order=['casual', 'registered'], ax=axes)
    axes.set(xlabel='Hour', ylabel='Number of riders', title='Riders by hour for each user type')
    plt.savefig('result4.png')
    plt.show()
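pd.melt is what turns the two rider columns into a single column that can serve as the hue. A minimal sketch on made-up numbers shows the reshaping on its own:

```python
import pandas as pd

# Made-up wide table: one column per user type.
wide = pd.DataFrame(
    {
        "hour": [8, 17],
        "casual": [3, 8],
        "registered": [13, 32],
    }
)

# melt stacks the two rider columns into 'variable' (column name) and 'value' (number).
long = pd.melt(wide, id_vars=["hour"], value_vars=["casual", "registered"])
print(long)
```

The two wide rows become four long rows, one per (hour, user type) pair, which is the layout the groupby and pointplot calls above expect.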
Result (saved as result4.png):
Source code download: https://pan.baidu.com/s/1ZJjKHMRo07xfFGk84OnDCQ (extraction code: a5v0)