Machine Learning Projects in Practice, Feature Engineering 6: City Bike Sharing System Usage

A Small Feature Engineering Case Study

There is a Kaggle competition on exactly this topic: predicting usage of a city bike sharing system.

The data consists of two years of hourly bike rental records. The training set covers the first 19 days of each month; the test set covers day 20 onward.

This project covers: data cleaning, feature extraction, standardizing continuous features, and one-hot encoding of categorical features.
Data and source code: https://github.com/qiu997018209/MachineLearning

# Read in the data
# (on_bad_lines='skip' is the modern replacement for the removed error_bad_lines=False flag)
import pandas as pd
data = pd.read_csv('kaggle_bike_competition_train.csv', header=0, on_bad_lines='skip')
# Take a quick look at what the data looks like
data.head()
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32
3  2011-01-01 03:00:00  1       0        0           1        9.84  14.395  75        0.0        3       10          13
4  2011-01-01 04:00:00  1       0        0           1        9.84  14.395  75        0.0        0       1           1

Split the datetime field into separate date and time parts.

# Process the time field
temp = pd.DatetimeIndex(data['datetime'])
data['date'] = temp.date
data['time'] = temp.time
data.head()
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count  date        time
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16     2011-01-01  00:00:00
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40     2011-01-01  01:00:00
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32     2011-01-01  02:00:00
3  2011-01-01 03:00:00  1       0        0           1        9.84  14.395  75        0.0        3       10          13     2011-01-01  03:00:00
4  2011-01-01 04:00:00  1       0        0           1        9.84  14.395  75        0.0        0       1           1      2011-01-01  04:00:00

The time part has hourly granularity at best, so we simply pull out the hour as a cleaner feature.

# Create an hour field from the time column
data['hour'] = pd.to_datetime(data.time, format="%H:%M:%S")
data['hour'] = pd.Index(data['hour']).hour
data
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count  date        time      hour
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16     2011-01-01  00:00:00  0
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40     2011-01-01  01:00:00  1
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32     2011-01-01  02:00:00  2
3  2011-01-01 03:00:00  1       0        0           1        9.84  14.395  75        0.0        3       10          13     2011-01-01  03:00:00  3
4  2011-01-01 04:00:00  1       0        0           1        9.84  14.395  75        0.0        0       1           1      2011-01-01  04:00:00  4
...

10886 rows × 15 columns
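As a side note, newer pandas versions expose these fields directly through the `.dt` accessor, so the conversion above can be done in one pass. A minimal sketch on synthetic data standing in for the real frame:

```python
import pandas as pd

# Tiny frame with a datetime column, mimicking the Kaggle data
df = pd.DataFrame({'datetime': ['2011-01-01 00:00:00', '2011-01-01 13:00:00']})
dt = pd.to_datetime(df['datetime'])

# .dt gives vectorized access to each datetime component
df['date'] = dt.dt.date
df['hour'] = dt.dt.hour

print(df['hour'].tolist())  # [0, 13]
```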

Come to think of it, so far the data only tells us the date. Intuitively, the number of people going out should differ between weekends and workdays, so we add a new field, dayofweek, for the day of the week, and another, dateDays, for the number of days since rentals began (guessing that in Europe and the US this green, eco-friendly mode of transport spread quickly).

# From the date, derive a day-of-week categorical feature
data['dayofweek'] = pd.DatetimeIndex(data.date).dayofweek

# And an elapsed-time feature: days since the first record
data['dateDays'] = (data.date - data.date[0]).astype('timedelta64[D]')

data
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count  date        time      hour  dayofweek  dateDays
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16     2011-01-01  00:00:00  0     5          0.0
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40     2011-01-01  01:00:00  1     5          0.0
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32     2011-01-01  02:00:00  2     5          0.0
3  2011-01-01 03:00:00  1       0        0           1        9.84  14.395  75        0.0        3       10          13     2011-01-01  03:00:00  3     5          0.0
4  2011-01-01 04:00:00  1       0        0           1        9.84  14.395  75        0.0        0       1           1      2011-01-01  04:00:00  4     5          0.0
...

10886 rows × 17 columns

So far we have only been guessing; we don't actually know how the data is distributed across days. Let's run a quick count to see the real distribution: total rentals on each day of the week, split into registered and unregistered (casual) users.

byday = data.groupby('dayofweek')
# Total rentals by casual (unregistered) users, per weekday
byday['casual'].sum().reset_index()
   dayofweek  casual
0          0   46288
1          1   35365
2          2   34931
3          3   37283
4          4   47402
5          5  100782
6          6   90084
# Total rentals by registered users, per weekday
byday['registered'].sum().reset_index()
   dayofweek  registered
0          0      249008
1          1      256620
2          2      257295
3          3      269118
4          4      255102
5          5      210736
6          6      195462
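Both sums can also be pulled in a single groupby pass over the two columns. A sketch with toy data standing in for the real frame:

```python
import pandas as pd

# Toy data: a few records with weekday and casual/registered counts
toy = pd.DataFrame({
    'dayofweek':  [5, 5, 6, 6, 0],
    'casual':     [10, 20, 30, 40, 1],
    'registered': [100, 200, 300, 400, 50],
})

# Sum both columns per weekday in one call
sums = toy.groupby('dayofweek')[['casual', 'registered']].sum()
print(sums)
```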

Since weekends do look different, give Saturday its own indicator column, and Sunday another.

# Use .loc instead of chained indexing, which would trigger a SettingWithCopyWarning
data['Saturday'] = 0
data.loc[data.dayofweek == 5, 'Saturday'] = 1

data['Sunday'] = 0
data.loc[data.dayofweek == 6, 'Sunday'] = 1

data
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count  date        time      hour  dayofweek  dateDays  Saturday  Sunday
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16     2011-01-01  00:00:00  0     5          0.0       1         0
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40     2011-01-01  01:00:00  1     5          0.0       1         0
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32     2011-01-01  02:00:00  2     5          0.0       1         0
3  2011-01-01 03:00:00  1       0        0           1        9.84  14.395  75        0.0        3       10          13     2011-01-01  03:00:00  3     5          0.0       1         0
4  2011-01-01 04:00:00  1       0        0           1        9.84  14.395  75        0.0        0       1           1      2011-01-01  04:00:00  4     5          0.0       1         0
...

10886 rows × 19 columns
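The same two dummy columns can also be built in a single vectorized step, casting a boolean mask to 0/1, which avoids the assignment pattern entirely. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'dayofweek': [4, 5, 6, 5]})

# Boolean comparison -> 0/1 indicator column, no row-wise assignment needed
toy['Saturday'] = (toy['dayofweek'] == 5).astype(int)
toy['Sunday'] = (toy['dayofweek'] == 6).astype(int)

print(toy['Saturday'].tolist())  # [0, 1, 0, 1]
print(toy['Sunday'].tolist())    # [0, 0, 1, 0]
```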

Now drop the raw time fields (and other columns we no longer need) from the data.

# remove old data features
dataRel = data.drop(['datetime', 'count','date','time','dayofweek'], axis=1)
dataRel.head()
   season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  hour  dateDays  Saturday  Sunday
0  1       0        0           1        9.84  14.395  81        0.0        3       13          0     0.0       1         0
1  1       0        0           1        9.02  13.635  80        0.0        8       32          1     0.0       1         0
2  1       0        0           1        9.02  13.635  80        0.0        5       27          2     0.0       1         0
3  1       0        0           1        9.84  14.395  75        0.0        3       10          3     0.0       1         0
4  1       0        0           1        9.84  14.395  75        0.0        0       1           4     0.0       1         0

Feature vectorization

We plan to model with scikit-learn. A pandas DataFrame can be converted directly into a collection of Python dicts.
Here we also separate categorical features from continuous ones, so we can process the two groups differently later.

from sklearn.feature_extraction import DictVectorizer
# Put the continuous features into a list of dicts
featureConCols = ['temp','atemp','humidity','windspeed','dateDays','hour']
dataFeatureCon = dataRel[featureConCols]
dataFeatureCon = dataFeatureCon.fillna( 'NA' ) #in case I missed any
X_dictCon = dataFeatureCon.T.to_dict().values() 

# Put the categorical features into another list of dicts
featureCatCols = ['season','holiday','workingday','weather','Saturday', 'Sunday']
dataFeatureCat = dataRel[featureCatCols]
dataFeatureCat = dataFeatureCat.fillna( 'NA' ) #in case I missed any
X_dictCat = dataFeatureCat.T.to_dict().values() 

# Vectorize both feature sets
vec = DictVectorizer(sparse = False)
X_vec_cat = vec.fit_transform(X_dictCat)
X_vec_con = vec.fit_transform(X_dictCon)
dataFeatureCon.head()
   temp  atemp   humidity  windspeed  dateDays  hour
0  9.84  14.395  81        0.0        0.0       0
1  9.02  13.635  80        0.0        0.0       1
2  9.02  13.635  80        0.0        0.0       2
3  9.84  14.395  75        0.0        0.0       3
4  9.84  14.395  75        0.0        0.0       4
X_vec_con
array([[  14.395 ,    0.    ,    0.    ,   81.    ,    9.84  ,    0.    ],
       [  13.635 ,    0.    ,    1.    ,   80.    ,    9.02  ,    0.    ],
       [  13.635 ,    0.    ,    2.    ,   80.    ,    9.02  ,    0.    ],
       ..., 
       [  15.91  ,  718.    ,   21.    ,   61.    ,   13.94  ,   15.0013],
       [  17.425 ,  718.    ,   22.    ,   61.    ,   13.94  ,    6.0032],
       [  16.665 ,  718.    ,   23.    ,   66.    ,   13.12  ,    8.9981]])
dataFeatureCat.head()
   season  holiday  workingday  weather  Saturday  Sunday
0  1       0        0           1        1         0
1  1       0        0           1        1         0
2  1       0        0           1        1         0
3  1       0        0           1        1         0
4  1       0        0           1        1         0
X_vec_cat
array([[ 1.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  1.,  0.],
       ..., 
       [ 0.,  0.,  0.,  4.,  1.,  1.],
       [ 0.,  0.,  0.,  4.,  1.,  1.],
       [ 0.,  0.,  0.,  4.,  1.,  1.]])

Standardizing continuous features

The continuous features need some processing, the most basic being standardization: after it, each continuous feature has zero mean and unit variance.
Feeding data in this form to a model helps both training convergence and model accuracy.

from sklearn import preprocessing
# Standardize the continuous features
scaler = preprocessing.StandardScaler().fit(X_vec_con)
X_vec_con = scaler.transform(X_vec_con)
X_vec_con
array([[-1.09273697, -1.70912256, -1.66894356,  0.99321305, -1.33366069,
        -1.56775367],
       [-1.18242083, -1.70912256, -1.52434128,  0.94124921, -1.43890721,
        -1.56775367],
       [-1.18242083, -1.70912256, -1.379739  ,  0.94124921, -1.43890721,
        -1.56775367],
       ..., 
       [-0.91395927,  1.70183906,  1.36770431, -0.04606385, -0.80742813,
         0.26970368],
       [-0.73518157,  1.70183906,  1.51230659, -0.04606385, -0.80742813,
        -0.83244247],
       [-0.82486544,  1.70183906,  1.65690887,  0.21375537, -0.91267464,
        -0.46560752]])
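To sanity-check what StandardScaler does, here is a toy run on a small two-column array; after scaling, each column has (near) zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```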

Encoding categorical features

The most common choice is of course one-hot encoding. For example, the colors red, blue, and yellow would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1].

from sklearn import preprocessing
# One-hot encode the categorical features
enc = preprocessing.OneHotEncoder()
enc.fit(X_vec_cat)
X_vec_cat = enc.transform(X_vec_cat).toarray()
X_vec_cat
array([[ 1.,  0.,  0., ...,  1.,  1.,  0.],
       [ 1.,  0.,  0., ...,  1.,  1.,  0.],
       [ 1.,  0.,  0., ...,  1.,  1.,  0.],
       ..., 
       [ 0.,  1.,  1., ...,  0.,  0.,  1.],
       [ 0.,  1.,  1., ...,  0.,  0.,  1.],
       [ 0.,  1.,  1., ...,  0.,  0.,  1.]])
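The red/blue/yellow example from above, made concrete: DictVectorizer one-hot encodes string-valued features directly (feature columns come out in alphabetical order):

```python
from sklearn.feature_extraction import DictVectorizer

# Three samples, one categorical feature each
samples = [{'color': 'red'}, {'color': 'blue'}, {'color': 'yellow'}]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(samples)

print(vec.feature_names_)  # ['color=blue', 'color=red', 'color=yellow']
print(encoded)
```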

Putting the features together

Combine the categorical and continuous features into one matrix.

import numpy as np
# combine cat & con features
X_vec = np.concatenate((X_vec_con,X_vec_cat), axis=1)
X_vec
array([[-1.09273697, -1.70912256, -1.66894356, ...,  1.        ,
         1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.52434128, ...,  1.        ,
         1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.379739  , ...,  1.        ,
         1.        ,  0.        ],
       ..., 
       [-0.91395927,  1.70183906,  1.36770431, ...,  0.        ,
         0.        ,  1.        ],
       [-0.73518157,  1.70183906,  1.51230659, ...,  0.        ,
         0.        ,  1.        ],
       [-0.82486544,  1.70183906,  1.65690887, ...,  0.        ,
         0.        ,  1.        ]])
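np.concatenate along axis=1 simply stacks the two blocks side by side, keeping the row count; a toy sketch:

```python
import numpy as np

# 3 rows of 2 "continuous" columns, 3 rows of 4 "categorical" columns
a = np.zeros((3, 2))
b = np.ones((3, 4))

combined = np.concatenate((a, b), axis=1)
print(combined.shape)  # (3, 6)
```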

In the final feature matrix, the first 6 columns are the standardized continuous features; the remaining columns are the encoded categorical features.

Process the target values too

Extract the targets as floats.

# Vectorize the targets
Y_vec_reg = dataRel['registered'].values.astype(float)
Y_vec_cas = dataRel['casual'].values.astype(float)
Y_vec_reg
array([  13.,   32.,   27., ...,  164.,  117.,   84.])


Y_vec_cas
array([  3.,   8.,   5., ...,   4.,  12.,   4.])

