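3.7 Advanced Processing: Data Discretization
- Objectives
  - Use cut and qcut to group data into intervals
  - Use get_dummies to one-hot encode data
- Contents
  - 3.7.1 What is data discretization
  - 3.7.2 Why discretize
  - 3.7.3 How to discretize data
3.7.1 What is data discretization
Discretizing a continuous attribute means dividing its value range into a number of intervals, then using a distinct symbol or integer to represent the values that fall into each interval.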
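The definition above can be sketched in a few lines of plain Python. This is only an illustration: the `discretize` helper is hypothetical, and the heights and interval edges are the sample values used later in this section.

```python
from bisect import bisect_left

# Interval edges describing (150, 165], (165, 180], (180, 195]
edges = [150, 165, 180, 195]


def discretize(value, edges):
    """Return the 0-based index of the right-closed interval that contains value."""
    if not edges[0] < value <= edges[-1]:
        raise ValueError(f"{value} is outside ({edges[0]}, {edges[-1]}]")
    # bisect_left finds the first edge that is >= value;
    # the interval index is one less than that position
    return bisect_left(edges, value) - 1


heights = [165, 174, 160, 180, 159, 163, 192, 184]
codes = [discretize(h, edges) for h in heights]
print(codes)  # [0, 1, 0, 1, 0, 0, 2, 2]
```

Each continuous height is replaced by a small integer that names its interval, which is what pd.cut does with more options.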
Non-discretized data:
| | Sex | Age |
---|---|---|
A | 1 | 23 |
B | 2 | 30 |
C | 1 | 18 |
Non-discretized data:
| | Species | Fur |
---|---|---|
A | 1 | |
B | 2 | |
C | 3 | |
Discretized data
One-hot encoding / dummy variables:
| | Male | Female | Age |
---|---|---|---|
A | 1 | 0 | 23 |
B | 0 | 1 | 30 |
C | 1 | 0 | 18 |
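As a rough sketch of how the integer-coded table becomes the one-hot table, the snippet below rebuilds the small example with pandas. The column names `sex` and `age` and the labels `male` and `female` are assumptions chosen for illustration; the coding 1 = male, 2 = female is read off the two tables above.

```python
import pandas as pd

# Integer-coded sample mirroring the first table (assumed coding: 1 = male, 2 = female)
df = pd.DataFrame({"sex": [1, 2, 1], "age": [23, 30, 18]}, index=["A", "B", "C"])

# Map the codes to readable labels, then expand into one indicator column per label
sex = df["sex"].map({1: "male", 2: "female"})
dummies = pd.get_dummies(sex, dtype=int)[["male", "female"]]

# Put the indicator columns next to the untouched age column
encoded = pd.concat([dummies, df["age"]], axis=1)
print(encoded)
```

Exactly one of the indicator columns is 1 in every row, which is what makes the encoding "one-hot".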
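3.7.2 Why discretize
Discretizing a continuous attribute simplifies the data by reducing the number of distinct values the attribute can take. Discretization is commonly used as a preprocessing step in data mining.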
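3.7.3 How to discretize data
Workflow:
- Group the data
  - Automatic grouping: pd.qcut(data, q) # q is the number of quantile-based groups, each holding roughly the same number of values; given a Series, it returns a categorical Series.
  - Custom grouping: pd.cut(data, bins) # bins is a list of interval edges that you choose; given a Series, it returns a categorical Series.
  - Grouping is usually paired with value_counts
    - Series.value_counts(): counts how many values fall into each group
- Compute dummy variables for the grouped data
  - pd.get_dummies(data, prefix=None)
    - data: array-like, Series or DataFrame
    - prefix: string prepended to the generated column names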
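To make the difference between the two grouping functions concrete, here is a minimal sketch on made-up data containing one outlier. The values, the edges, and the `low`/`high` labels are hypothetical and not part of the original example.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # hypothetical data with one outlier

# qcut: edges come from quantiles, so each bin holds roughly the same number of values
by_quantile = pd.qcut(values, q=3)
print(by_quantile.value_counts(sort=False))

# cut: edges are given explicitly, so bin sizes depend on how the data is distributed;
# labels= replaces the interval notation with readable names
by_edges = pd.cut(values, bins=[0, 50, 100], labels=["low", "high"])
print(by_edges.value_counts(sort=False))
```

With quantile edges the outlier simply ends up in the top bin, whereas with fixed edges it sits alone in its own bin.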
# 1) Prepare the data
import pandas as pd
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184], index=['NO1:165', 'NO2:174', 'NO3:160', 'NO4:180', 'NO5:159', 'NO6:163', 'NO7:192', 'NO8:184'])
data
NO1:165 165
NO2:174 174
NO3:160 160
NO4:180 180
NO5:159 159
NO6:163 163
NO7:192 192
NO8:184 184
dtype: int64
# 2) Group the data
# Automatic grouping
sr = pd.qcut(data, 3)
sr
type(sr)
NO1:165 (163.667, 178.0]
NO2:174 (163.667, 178.0]
NO3:160 (158.999, 163.667]
NO4:180 (178.0, 192.0]
NO5:159 (158.999, 163.667]
NO6:163 (158.999, 163.667]
NO7:192 (178.0, 192.0]
NO8:184 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
pandas.core.series.Series
# Count how many values fall into each group
sr.value_counts()
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
# 3) Convert to dummy variables
pd.get_dummies(sr, prefix='height')
| | height_(158.999, 163.667] | height_(163.667, 178.0] | height_(178.0, 192.0] |
---|---|---|---|
NO1:165 | 0 | 1 | 0 |
NO2:174 | 0 | 1 | 0 |
NO3:160 | 1 | 0 | 0 |
NO4:180 | 0 | 0 | 1 |
NO5:159 | 1 | 0 | 0 |
NO6:163 | 1 | 0 | 0 |
NO7:192 | 0 | 0 | 1 |
NO8:184 | 0 | 0 | 1 |
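One practical caveat: pandas 2.0 and later return boolean (True/False) columns from get_dummies by default, so the 0/1 table above may not be reproduced exactly. A minimal sketch, assuming integer indicators are wanted, passes `dtype=int` explicitly (the index labels are dropped here for brevity):

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
sr = pd.qcut(data, 3)

# dtype=int forces 0/1 integers regardless of the installed version's default
dummies = pd.get_dummies(sr, prefix="height", dtype=int)
print(dummies.dtypes)

# Every value belongs to exactly one bin, so each row sums to 1
print(dummies.sum(axis=1).tolist())
```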
# 2) Group the data
# Custom grouping
bins = [150, 165, 180, 195]
sr2 = pd.cut(data, bins)
sr2
NO1:165 (150, 165]
NO2:174 (165, 180]
NO3:160 (150, 165]
NO4:180 (165, 180]
NO5:159 (150, 165]
NO6:163 (150, 165]
NO7:192 (180, 195]
NO8:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr2.value_counts()
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd.get_dummies(sr2, prefix='height')
| | height_(150, 165] | height_(165, 180] | height_(180, 195] |
---|---|---|---|
NO1:165 | 1 | 0 | 0 |
NO2:174 | 0 | 1 | 0 |
NO3:160 | 1 | 0 | 0 |
NO4:180 | 0 | 1 | 0 |
NO5:159 | 1 | 0 | 0 |
NO6:163 | 1 | 0 | 0 |
NO7:192 | 0 | 0 | 1 |
NO8:184 | 0 | 0 | 1 |
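Values that sit exactly on an edge, such as 165 and 180 here, go to the bin on their left because pd.cut closes intervals on the right by default. The sketch below shows how the `right` parameter changes that; it reuses the heights and edges above and compares the resulting bin codes.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
bins = [150, 165, 180, 195]

# Default right=True gives (150, 165], so 165 lands in the first bin
right_closed = pd.cut(data, bins)

# right=False gives [150, 165), so 165 moves up into [165, 180)
left_closed = pd.cut(data, bins, right=False)

# .cat.codes exposes the 0-based bin number of each value
print(right_closed.cat.codes.tolist())  # [0, 1, 0, 1, 0, 0, 2, 2]
print(left_closed.cat.codes.tolist())   # [1, 1, 0, 2, 0, 0, 2, 2]
```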
Case study: discretizing stock price changes
# 1) Read the data
import pandas as pd
stock = pd.read_excel('stock.xls')
stock
| | trade_date | close | open | high | low | pre_close | change | pct_chg | vol | amount |
---|---|---|---|---|---|---|---|---|---|---|
0 | 20200313 | 2887.4265 | 2804.2322 | 2910.8812 | 2799.9841 | 2923.4856 | -36.0591 | -1.2334 | 366450436.0 | 3.930197e+08 |
1 | 20200312 | 2923.4856 | 2936.0163 | 2944.4651 | 2906.2838 | 2968.5174 | -45.0318 | -1.5170 | 307778457.0 | 3.282092e+08 |
2 | 20200311 | 2968.5174 | 3001.7616 | 3010.0286 | 2968.5174 | 2996.7618 | -28.2444 | -0.9425 | 352470970.0 | 3.787666e+08 |
3 | 20200310 | 2996.7618 | 2918.9347 | 3000.2963 | 2904.7989 | 2943.2907 | 53.4711 | 1.8167 | 393296648.0 | 4.250172e+08 |
4 | 20200309 | 2943.2907 | 2987.1805 | 2989.2051 | 2940.7138 | 3034.5113 | -91.2206 | -3.0061 | 414560736.0 | 4.381439e+08 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6997 | 19910719 | 136.7000 | 137.6600 | 138.5400 | 136.6600 | 137.1700 | -0.4700 | -0.3426 | 10823.0 | 5.242826e+03 |
6998 | 19910718 | 137.1700 | 137.1700 | 137.1700 | 135.8100 | 135.8100 | 1.3600 | 1.0014 | 847.0 | 4.644160e+02 |
6999 | 19910717 | 135.8100 | 135.8100 | 135.8100 | 135.3900 | 134.4700 | 1.3400 | 0.9965 | 660.0 | 3.975240e+02 |
7000 | 19910716 | 134.4700 | 134.3900 | 134.4700 | 133.1400 | 133.1400 | 1.3300 | 0.9989 | 2796.0 | 1.328502e+03 |
7001 | 19910715 | 133.1400 | 133.9000 | 134.1000 | 131.8700 | 132.8000 | 0.3400 | 0.2560 | 11938.0 | 5.534900e+03 |
7002 rows × 10 columns
change = stock['change']
change
0 -36.0591
1 -45.0318
2 -28.2444
3 53.4711
4 -91.2206
...
6997 -0.4700
6998 1.3600
6999 1.3400
7000 1.3300
7001 0.3400
Name: change, Length: 7002, dtype: float64
# 2) Group the data
# Automatic grouping
sr3 = pd.qcut(change, 10)
sr3
0 (-354.685, -33.319]
1 (-354.685, -33.319]
2 (-33.319, -17.08]
3 (37.551, 649.5]
4 (-354.685, -33.319]
...
6997 (-3.416, 0.934]
6998 (0.934, 4.84]
6999 (0.934, 4.84]
7000 (0.934, 4.84]
7001 (-3.416, 0.934]
Name: change, Length: 7002, dtype: category
Categories (10, interval[float64]): [(-354.685, -33.319] < (-33.319, -17.08] < (-17.08, -9.298] < (-9.298, -3.416] ... (4.84, 10.614] < (10.614, 19.612] < (19.612, 37.551] < (37.551, 649.5]]
sr3.value_counts()
(37.551, 649.5] 701
(0.934, 4.84] 701
(-354.685, -33.319] 701
(19.612, 37.551] 700
(10.614, 19.612] 700
(-3.416, 0.934] 700
(-9.298, -3.416] 700
(-17.08, -9.298] 700
(-33.319, -17.08] 700
(4.84, 10.614] 699
Name: change, dtype: int64
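If the same quantile edges need to be applied to other data later, pd.qcut can return them through `retbins=True`. The sketch below uses synthetic random numbers as a stand-in for the `change` column, since stock.xls is not available here; the distribution and sample size are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `change` column (assumed: 1000 normally distributed values)
rng = np.random.default_rng(0)
change = pd.Series(rng.normal(0, 20, size=1000))

# retbins=True additionally returns the quantile edges as a NumPy array
binned, edges = pd.qcut(change, 10, retbins=True)
print(edges)

# Reuse the same edges on new values; anything outside the edges becomes NaN
new_values = pd.Series([edges[0] - 1.0, 0.0, edges[-1] + 1.0])
reused = pd.cut(new_values, bins=edges, include_lowest=True)
print(reused)
```

Ten bins need eleven edges, and values beyond the outermost edges are left unassigned rather than forced into the first or last bin.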
# 3) Discretize (get dummy variables / one-hot encoding)
stock_change = pd.get_dummies(sr3, prefix='change')
stock_change
| | change_(-354.685, -33.319] | change_(-33.319, -17.08] | change_(-17.08, -9.298] | change_(-9.298, -3.416] | change_(-3.416, 0.934] | change_(0.934, 4.84] | change_(4.84, 10.614] | change_(10.614, 19.612] | change_(19.612, 37.551] | change_(37.551, 649.5] |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6997 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
6998 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
6999 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
7000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
7001 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
7002 rows × 10 columns
# Custom grouping
bins = [-600, -300, 0, 300, 600, 900]
sr = pd.cut(change, bins)
sr
0 (-300, 0]
1 (-300, 0]
2 (-300, 0]
3 (0, 300]
4 (-300, 0]
...
6997 (-300, 0]
6998 (0, 300]
6999 (0, 300]
7000 (0, 300]
7001 (0, 300]
Name: change, Length: 7002, dtype: category
Categories (5, interval[int64]): [(-600, -300] < (-300, 0] < (0, 300] < (300, 600] < (600, 900]]
sr.value_counts()
(0, 300] 3702
(-300, 0] 3290
(-600, -300] 7
(300, 600] 2
(600, 900] 1
Name: change, dtype: int64
stock_change = pd.get_dummies(sr, prefix='change') # one-hot / dummy variables
stock_change
| | change_(-600, -300] | change_(-300, 0] | change_(0, 300] | change_(300, 600] | change_(600, 900] |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... |
6997 | 0 | 1 | 0 | 0 | 0 |
6998 | 0 | 0 | 1 | 0 | 0 |
6999 | 0 | 0 | 1 | 0 | 0 |
7000 | 0 | 0 | 1 | 0 | 0 |
7001 | 0 | 0 | 1 | 0 | 0 |
7002 rows × 5 columns
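The fixed 300-point edges above put almost every row into two bins. One possible alternative, sketched here with hypothetical thresholds and label names, uses open-ended outer edges and readable labels. The sample values are a few rows copied from the `change` column shown earlier.

```python
import numpy as np
import pandas as pd

# A few values borrowed from the `change` column shown earlier
change = pd.Series([-36.0591, -45.0318, -28.2444, 53.4711, -91.2206, -0.47, 1.36, 0.34])

# Hypothetical thresholds; infinite outer edges guarantee that no value falls outside the bins
bins = [-np.inf, -30, 0, 30, np.inf]
labels = ["big_drop", "drop", "rise", "big_rise"]

levels = pd.cut(change, bins=bins, labels=labels)
print(levels.value_counts(sort=False))

# The labels become part of the one-hot column names
one_hot = pd.get_dummies(levels, prefix="change", dtype=int)
print(one_hot.columns.tolist())
```

Readable labels keep the resulting column names short, which is easier to work with than interval strings.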