Feature Engineering: Bike Sharing Prediction


fly_Xiaoma


Category: ML-Demo

Article link: https://blog.csdn.net/weixin_38664232/article/details/86760786


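1 Field descriptions for the bike sharing dataset

instant: record index
dteday: date
season: season (1 spring; 2 summer; 3 fall; 4 winter)
yr: year (0 = 2011; 1 = 2012)
mnth: month (1 to 12)
holiday: whether the day is a holiday (0/1)
weekday: day of the week (0 to 6)
workingday: whether the day is a working day (1 working day; 0 non-working day)
weathersit: weather (1 clear or partly cloudy; 2 mist or overcast; 3 light snow or light rain; 4 heavy rain, thick fog, or heavy snow)
temp: temperature in Celsius (normalized)
atemp: feels-like temperature (normalized)
hum: humidity (normalized)
windspeed: wind speed (normalized)
casual: number of casual (unregistered) users
registered: number of registered users
cnt: total number of rentals (= casual + registered)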
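
The relation cnt = casual + registered is easy to check in code. The sketch below uses a small hand-written frame (toy, with illustrative values) instead of day.csv so that it runs on its own; on the real data the same expression would be applied to train.

```python
import pandas as pd

# Illustrative rows only; `toy` is a stand-in for the real `train` frame.
toy = pd.DataFrame({
    'casual': [331, 131],
    'registered': [654, 670],
    'cnt': [985, 801],
})

# True only if every row satisfies cnt == casual + registered.
consistent = bool((toy['casual'] + toy['registered'] == toy['cnt']).all())
print(consistent)
```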

2 Data preprocessing

Before loading the data, set the display options for pandas and the plotting parameters for seaborn.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
import sklearn.datasets
from sklearn import preprocessing

params={'legend.fontsize':'x-large',
        'figure.figsize':(30,10),
        'axes.labelsize':'x-large',
        'axes.titlesize':'x-large',
        'xtick.labelsize':'x-large',
        'ytick.labelsize':'x-large'
        }
sn.set_style('whitegrid')
sn.set_context('talk')
plt.rcParams.update(params)
pd.options.display.max_colwidth=600
from IPython.display import display,HTML

3 Loading the data

train=pd.read_csv('day.csv')
print(train.head())
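train.info()  # info() prints its summary itself and returns None

First 5 rows of the data:

Overall information about the data:

There are no missing values here: every column has 731 non-null entries.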
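
Besides reading the info() summary, missing values can be counted directly. A minimal sketch, using a small made-up frame (toy) in place of day.csv, with one NaN planted on purpose so the check has something to find:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for day.csv; the values are made up for illustration.
toy = pd.DataFrame({
    'temp': [0.34, 0.36, np.nan],
    'hum': [0.80, 0.69, 0.43],
    'cnt': [985, 801, 1349],
})

# Number of missing entries per column; all zeros means nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single boolean answering "is anything missing at all?"
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data, train.isnull().sum() should report zero for every column if the reading of info() above is right.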

4 Data exploration

For the numerical features, use common summary statistics to look at their distributions.

print(train.describe())

The output contains:

count: number of non-null values

mean: mean

std: standard deviation

min: minimum

max: maximum

25%: first quartile

50%: median

75%: third quartile
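
To make the quantile rows concrete, here is a small sketch on a hand-picked series (not taken from day.csv) where the quartiles are easy to verify by eye:

```python
import pandas as pd

# Five evenly spaced values, chosen so the quartiles land on data points.
s = pd.Series([1, 2, 3, 4, 5])

summary = s.describe()
print(summary)

# The same numbers can be requested individually.
q1 = s.quantile(0.25)
median = s.median()
q3 = s.quantile(0.75)
print(q1, median, q3)
```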

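5 Feature analysis

5.1 Analysis of discrete features

Four discrete features can be identified among the fields of the bike data: season, mnth, weathersit, and weekday. They will be encoded with one-hot encoding. For these four features, look at the values each one takes and how often each value occurs.

category_features=['season','mnth','weathersit','weekday']
for col in category_features:
    print('\nDistinct values of %s and their counts'%col)
    print(train[col].value_counts())
    # convert the column from int to object dtype
    train[col]=train[col].astype('object')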

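Output 1:

Note that value_counts() counts records (days), not rides. For season: winter has the fewest records; for mnth: February has the fewest (57).

Output 2:

For weathersit: category 1 has the most records; for weekday: the records are spread roughly evenly.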
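
Since value_counts() only counts rows, ride volume per category needs a group-wise sum of cnt instead. The sketch below contrasts the two on a made-up frame (toy); the numbers are chosen so that the category with the most days is not the one with the most rides.

```python
import pandas as pd

# Made-up data: season 1 has more days, season 3 has more rides.
toy = pd.DataFrame({
    'season': [1, 1, 1, 3, 3],
    'cnt': [100, 120, 110, 900, 950],
})

# How many records (days) fall into each season.
days_per_season = toy['season'].value_counts()

# How many rides were made in each season.
rides_per_season = toy.groupby('season')['cnt'].sum()

print(days_per_season)
print(rides_per_season)
```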

5.2 Analysis of numerical features

numerical_features=['temp','atemp','hum','windspeed']
train[numerical_features].hist()
plt.show()

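Output:

The histograms show how each of the four features is distributed. They are read here as suggesting that all four affect ride counts, especially wind speed and humidity, although a histogram on its own only shows a feature's own distribution; the relationship with cnt is examined in 5.2.7.

5.2.1 Distribution of ride counts by year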

sn.violinplot(data=train[['yr','cnt']],x='yr',y='cnt')
plt.show()

Violin plot:

In 2011 the ride counts are spread fairly evenly; in 2012 they increase substantially.

5.2.2 Daily ride counts over the year

Use the color parameter hue to encode the category (year).

train['date']=pd.to_datetime(train['dteday'])
train['dayofyear']=train['date'].dt.dayofyear  # day of the year (1 to 366)
fig,ax=plt.subplots()
sn.pointplot(data=train[['dayofyear','cnt','yr']],x='dayofyear',y='cnt',hue='yr',ax=ax)
ax.set(title='daily distribution of counts')
plt.show()

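Resulting plot:

This plot agrees with the violin plot in 5.2.1: ride counts in 2011 are lower than in 2012. Ride counts are also lower at the beginning and end of each year.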
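
The dayofyear column used for the x axis comes from the pandas datetime accessor. A short sketch on three hand-picked dates shows what it returns, including the extra day of the 2012 leap year:

```python
import pandas as pd

# Three sample dates in the same string format as the dteday column.
dates = pd.to_datetime(pd.Series(['2011-01-01', '2011-12-31', '2012-12-31']))

# Position of each date within its own year, starting at 1.
day_of_year = dates.dt.dayofyear
print(day_of_year.tolist())  # [1, 365, 366]
```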

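5.2.3 Relationship between season and ride counts

The year is divided into 4 seasons, with 1 to 4 standing for spring, summer, fall, and winter. Because season is a discrete numeric variable, a violin plot is used to look at it.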

sn.violinplot(data=train[['season','cnt']],x='season',y='cnt')
plt.show()

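Resulting plot:

The distribution of ride counts clearly differs from season to season. Next, look at a barplot:

fig,ax=plt.subplots()
sn.barplot(data=train[['season','cnt']],x='season',y='cnt',ax=ax)
ax.set(title='seasonal distribution of counts')
plt.show()

Resulting plot:

barplot uses the height of each rectangle to show the central tendency of the numeric variable (the mean by default), and adds error bars (the black vertical lines in the figure) to show its variability (by default a bootstrapped 95% confidence interval of the mean).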
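
To make the bar heights and error bars less of a black box, the sketch below computes a group mean and a percentile bootstrap interval by hand with numpy. It illustrates the idea only and is not seaborn's own code; the toy data, the helper name bootstrap_ci, the number of resamples, and the seed are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Made-up ride counts for two seasons.
toy = pd.DataFrame({
    'season': [1, 1, 1, 1, 3, 3, 3, 3],
    'cnt': [1000, 1200, 900, 1100, 5000, 5600, 5200, 5400],
})

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of `values` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = samples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Bar heights: the mean of cnt within each season.
bar_heights = toy.groupby('season')['cnt'].mean()

# Error bars: one bootstrap interval per season.
intervals = {season: bootstrap_ci(group['cnt'])
             for season, group in toy.groupby('season')}

print(bar_heights)
print(intervals)
```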

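5.2.4 Relationship between month and ride counts

Display the ride counts for each of the months 1 to 12. Because month is a discrete variable, either a bar plot or a violin plot can be used.

fig,ax=plt.subplots()
sn.barplot(data=train[['mnth','cnt']],x='mnth',y='cnt',ax=ax)
ax.set(title='monthly distribution of counts')
plt.show()

Resulting plot:

The plot shows that ride counts are relatively high from May through October. Compare with a violin plot: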

5.2.5 Relationship between weather and ride counts

fig,ax=plt.subplots()
sn.barplot(data=train[['weathersit','cnt']],x='weathersit',y='cnt')
ax.set(title='weather distribution of counts')
plt.show()

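Resulting plot:

Ride counts are highest on clear days, lower on overcast days, and lowest on days with rain or snow. Heavy rain or heavy snow (category 4) shows no rides in the plot, which most likely means that no day in day.csv has this category rather than that a day had zero rides.

5.2.6 Distribution across working days and holidays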

fig,(ax1,ax2)=plt.subplots(ncols=2)
sn.barplot(data=train,x='holiday',y='cnt',ax=ax1)
sn.barplot(data=train,x='weekday',y='cnt',ax=ax2)
plt.show()

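Note: ncols is set to 2 in subplots(); its default value is 1.

Resulting plot:

5.2.7 Correlation between the numerical features and y

# compute the correlation matrix
corrMatt=train[['temp','atemp','hum','windspeed','casual','registered','cnt']].corr()
# boolean mask: True hides a cell, so only the lower triangle and the diagonal are drawn
mask=np.triu(np.ones(corrMatt.shape,dtype=bool),k=1)
sn.heatmap(corrMatt,mask=mask,vmax=0.8,square=True,annot=True)
plt.show()

Resulting plot: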
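
Because a correlation matrix is symmetric, the heatmap only needs one triangle, and the mask decides which cells are hidden. The sketch below builds such a mask for a made-up three-column frame (toy) and shows which cells remain, without drawing anything:

```python
import numpy as np
import pandas as pd

# Made-up values; only the shape of the result matters here.
toy = pd.DataFrame({
    'temp': [0.2, 0.4, 0.6, 0.8],
    'hum': [0.9, 0.7, 0.6, 0.3],
    'cnt': [1000, 2500, 4000, 6000],
})
corr = toy.corr()

# True marks a cell to hide: everything strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)

# The cells that stay visible; hidden ones are shown as NaN.
visible = corr.where(~mask)
print(visible)
```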

6 Feature engineering

6.1 Feature engineering for discrete values:

category_features=['season','mnth','weathersit','weekday']
for col in category_features:
    train[col]=train[col].astype('object')

X_train_cat=train[category_features]
X_train_cat=pd.get_dummies(X_train_cat)
print(X_train_cat.head())

Output:
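
The cast to object before get_dummies matters: called on a whole DataFrame without a columns argument, pd.get_dummies only expands object, string, or category columns and passes integer columns through unchanged. A small sketch on a made-up frame (toy) shows the difference:

```python
import pandas as pd

# Made-up frame with two integer columns.
toy = pd.DataFrame({'season': [1, 2, 3], 'holiday': [0, 1, 0]})

# With integer dtype, get_dummies leaves both columns as they are.
untouched = pd.get_dummies(toy)
print(untouched.columns.tolist())

# After casting to object, 'season' becomes one indicator column per value.
toy['season'] = toy['season'].astype('object')
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

This is also why holiday and workingday, which stay integer, are carried over as plain 0/1 columns later on.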

6.2 Feature engineering for numerical values

6.2.1 First, scale the numerical features to [0, 1] with MinMaxScaler()

numerical_features=['temp','atemp','hum','windspeed']
from sklearn.preprocessing import MinMaxScaler
mn_X=MinMaxScaler()
temp=mn_X.fit_transform(train[numerical_features])
X_trian_num=pd.DataFrame(data=temp,columns=numerical_features,index=train.index)
print(X_trian_num.head())

Output:
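
With its default feature_range of (0, 1), MinMaxScaler maps each column through (x - min) / (max - min). The sketch below reproduces that formula in plain pandas on a made-up frame (toy), as a way to see what the scaler computes; it is an illustration, not a replacement for the scikit-learn class.

```python
import pandas as pd

# Made-up values for two numerical columns.
toy = pd.DataFrame({
    'temp': [0.10, 0.35, 0.60],
    'windspeed': [0.05, 0.25, 0.15],
})

# Column-wise minimum and maximum.
col_min = toy.min()
col_max = toy.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```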

6.2.2 Combine the categorical and numerical features


X_train=pd.concat([X_train_cat,X_trian_num,train['holiday'],train['workingday']],
                  axis=1,ignore_index=False)
print(X_train.head())

Output:
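
pd.concat with axis=1 places the pieces side by side and aligns rows by index label, which is why the scaled frame was built with index=train.index. The sketch below uses small made-up pieces (cat_part, num_part, flag) to show both the aligned case and what happens when the index labels differ:

```python
import pandas as pd

# Made-up pieces that share the same index.
cat_part = pd.DataFrame({'season_1': [1, 0], 'season_2': [0, 1]}, index=[0, 1])
num_part = pd.DataFrame({'temp': [0.2, 0.8]}, index=[0, 1])
flag = pd.Series([0, 1], index=[0, 1], name='holiday')

# Same index everywhere: columns are simply placed next to each other.
combined = pd.concat([cat_part, num_part, flag], axis=1)
print(combined)

# A piece whose index is shifted by one produces NaN where labels do not match.
num_shifted = num_part.copy()
num_shifted.index = [1, 2]
shifted = pd.concat([cat_part, num_shifted], axis=1)
print(shifted)
```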

6.2.3 Further combine the record index (instant), the year, and the ride count into one combined dataset

FE_train=pd.concat([train['instant'],X_train,train['yr'],train['cnt']],
                   axis=1)
print(FE_train.head())

Output:

6.2.4 Final step: save the result

FE_train.to_csv('FE_train.csv',index=False)
print(FE_train.info())

Full information:
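
The index=False argument in to_csv keeps the row index out of the file. The round trip below, written against an in-memory buffer and a made-up frame (toy) so that no file is created, shows what changes when it is left out:

```python
import io
import pandas as pd

# Made-up frame standing in for FE_train.
toy = pd.DataFrame({'instant': [1, 2], 'temp': [0.25, 0.75], 'cnt': [985, 801]})

# With index=False the file holds only the data columns.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())

# With the default index=True an extra unnamed column appears on reload.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(restored_with_index.columns.tolist())
```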
