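Data Analysis, Part 3: Data Discretization with Pandas

3.7 Advanced processing: data discretization

  • Objectives
    • Use cut and qcut to group data into intervals
    • Use get_dummies to one-hot encode data
  • Contents
    • 3.7.1 What is data discretization
    • 3.7.2 Why discretize
    • 3.7.3 How to discretize data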

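3.7.1 What is data discretization

Discretizing a continuous attribute means dividing its value range into a number of intervals and then using a distinct symbol or integer value to represent every attribute value that falls into a given interval.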
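
As a minimal sketch of this definition, the snippet below replaces a few ages with the integer code of the interval each one falls into. The ages and the bin edges are made up for illustration; only pd.cut and its labels=False option come from pandas.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration.
ages = pd.Series([23, 30, 18, 45, 67])

# Split the value range into three intervals, (0, 25], (25, 50], (50, 100],
# and represent each value by the integer code of its interval.
codes = pd.cut(ages, bins=[0, 25, 50, 100], labels=False)

print(codes.tolist())  # [0, 1, 0, 1, 2]
```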

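Non-discretized data:

   gender  age
A       1   23
B       2   30
C       1   18

Non-discretized data:

species  fur
A          1
B          2
C          3

Discretized data

One-hot encoding / dummy variables:

   gender=1  gender=2  age
A         1         0   23
B         0         1   30
C         1         0   18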
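
The one-hot table above can be reproduced roughly as follows. This is a sketch that assumes the two indicator columns encode the gender values 1 and 2; the column names gender_1 and gender_2 are simply what pd.get_dummies generates, and dtype=int is passed so the result shows 0/1 rather than booleans.

```python
import pandas as pd

# The small gender/age table from the text.
people = pd.DataFrame(
    {"gender": [1, 2, 1], "age": [23, 30, 18]},
    index=["A", "B", "C"],
)

# One-hot encode only the gender column; age is left untouched.
encoded = pd.get_dummies(people, columns=["gender"], dtype=int)

print(encoded)
```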

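3.7.2 Why discretize

Discretizing a continuous attribute simplifies the data structure: it reduces the number of distinct values the attribute can take. Discretization is commonly used as a tool in data mining.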

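3.7.3 How to discretize data

Workflow:

  1. Group the data
    • Automatic grouping: pd.qcut(data, q) # q is the number of groups; the bin edges are placed at quantiles so the groups are roughly equal in size. Returns a Series when data is a Series.
    • Custom grouping: pd.cut(data, bins) # bins is a list of bin edges that you define. Returns a Series when data is a Series.
    • Grouping is usually combined with value_counts
      • Series.value_counts(): counts the number of values in each group
  2. Compute dummy variables for the grouped data
    • pd.get_dummies(data, prefix=None)
      • data: array-like, Series or DataFrame
      • prefix: string prepended to the dummy column names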
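
To make the difference between the two grouping functions concrete, here is a small sketch on made-up, deliberately skewed values. It also uses pd.cut with an integer instead of a list of edges, which splits the value range into equal-width bins; that variant is not covered in the text above and is shown only for contrast.

```python
import pandas as pd

# Hypothetical values with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: edges are placed at quantiles, so both groups hold 4 values.
by_quantile = pd.qcut(values, 2)
q_counts = by_quantile.value_counts().sort_index().tolist()

# cut with an integer: the value range is split into 2 equal-width bins,
# so the outlier ends up alone in the upper bin.
by_width = pd.cut(values, 2)
w_counts = by_width.value_counts().sort_index().tolist()

print(q_counts)  # [4, 4]
print(w_counts)  # [7, 1]
```
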
# 1) Prepare the data
import pandas as pd
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184],
                 index=['NO1:165', 'NO2:174', 'NO3:160', 'NO4:180',
                        'NO5:159', 'NO6:163', 'NO7:192', 'NO8:184'])
data

NO1:165    165
NO2:174    174
NO3:160    160
NO4:180    180
NO5:159    159
NO6:163    163
NO7:192    192
NO8:184    184
dtype: int64

# 2) Group the data
# Automatic grouping: 3 quantile-based bins
sr = pd.qcut(data, 3)
sr

NO1:165      (163.667, 178.0]
NO2:174      (163.667, 178.0]
NO3:160    (158.999, 163.667]
NO4:180        (178.0, 192.0]
NO5:159    (158.999, 163.667]
NO6:163    (158.999, 163.667]
NO7:192        (178.0, 192.0]
NO8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]

type(sr)

pandas.core.series.Series

# Count the values in each group
sr.value_counts()

(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64

# 3) Convert to dummy variables
pd.get_dummies(sr, prefix='height')

         height_(158.999, 163.667]  height_(163.667, 178.0]  height_(178.0, 192.0]
NO1:165                          0                        1                      0
NO2:174                          0                        1                      0
NO3:160                          1                        0                      0
NO4:180                          0                        0                      1
NO5:159                          1                        0                      0
NO6:163                          1                        0                      0
NO7:192                          0                        0                      1
NO8:184                          0                        0                      1
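
The 0/1 output above matches what older pandas versions print. On pandas 2.0 and later, get_dummies returns boolean columns by default, so the sketch below passes dtype=int to get integers back. It reuses the height values from the example, without the custom index.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
groups = pd.qcut(data, 3)

# dtype=int forces 0/1 integers instead of the True/False default
# used by newer pandas versions.
dummies = pd.get_dummies(groups, prefix="height", dtype=int)

# Each row belongs to exactly one interval, so every row sums to 1,
# and each column sum equals the size of that group.
row_sums = dummies.sum(axis=1).tolist()
column_sums = dummies.sum(axis=0).tolist()

print(row_sums)     # [1, 1, 1, 1, 1, 1, 1, 1]
print(column_sums)  # [3, 2, 3]
```
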
# 2) Group the data
# Custom grouping with user-defined bin edges
bins = [150, 165, 180, 195]
sr2 = pd.cut(data, bins)
sr2

NO1:165    (150, 165]
NO2:174    (165, 180]
NO3:160    (150, 165]
NO4:180    (165, 180]
NO5:159    (150, 165]
NO6:163    (150, 165]
NO7:192    (180, 195]
NO8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]

sr2.value_counts()

(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64

pd.get_dummies(sr2, prefix='身高')  # the prefix '身高' means 'height'

         身高_(150, 165]  身高_(165, 180]  身高_(180, 195]
NO1:165                1                0                0
NO2:174                0                1                0
NO3:160                1                0                0
NO4:180                0                1                0
NO5:159                1                0                0
NO6:163                1                0                0
NO7:192                0                0                1
NO8:184                0                0                1
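
pd.cut can also attach readable names to the intervals and change which side of each interval is closed. The sketch below uses the same heights and bin edges; the labels short, medium and tall are hypothetical names picked for this example.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
bins = [150, 165, 180, 195]
names = ["short", "medium", "tall"]  # hypothetical labels

# Default: intervals are closed on the right, so 165 falls in (150, 165].
labeled = pd.cut(data, bins, labels=names)

# right=False closes the intervals on the left instead,
# so 165 moves to [165, 180) and 180 moves to [180, 195).
left_closed = pd.cut(data, bins, labels=names, right=False)

print(labeled.tolist())
# ['short', 'medium', 'short', 'medium', 'short', 'short', 'tall', 'tall']
print(left_closed.tolist())
# ['medium', 'medium', 'short', 'tall', 'short', 'short', 'tall', 'tall']
```
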

Case study: discretizing stock price changes

# 1) Read the data
import pandas as pd
stock = pd.read_excel('stock.xls')
stock

      trade_date      close       open       high        low  pre_close    change  pct_chg          vol        amount
0       20200313  2887.4265  2804.2322  2910.8812  2799.9841  2923.4856  -36.0591  -1.2334  366450436.0  3.930197e+08
1       20200312  2923.4856  2936.0163  2944.4651  2906.2838  2968.5174  -45.0318  -1.5170  307778457.0  3.282092e+08
2       20200311  2968.5174  3001.7616  3010.0286  2968.5174  2996.7618  -28.2444  -0.9425  352470970.0  3.787666e+08
3       20200310  2996.7618  2918.9347  3000.2963  2904.7989  2943.2907   53.4711   1.8167  393296648.0  4.250172e+08
4       20200309  2943.2907  2987.1805  2989.2051  2940.7138  3034.5113  -91.2206  -3.0061  414560736.0  4.381439e+08
...
6997    19910719   136.7000   137.6600   138.5400   136.6600   137.1700   -0.4700  -0.3426      10823.0  5.242826e+03
6998    19910718   137.1700   137.1700   137.1700   135.8100   135.8100    1.3600   1.0014        847.0  4.644160e+02
6999    19910717   135.8100   135.8100   135.8100   135.3900   134.4700    1.3400   0.9965        660.0  3.975240e+02
7000    19910716   134.4700   134.3900   134.4700   133.1400   133.1400    1.3300   0.9989       2796.0  1.328502e+03
7001    19910715   133.1400   133.9000   134.1000   131.8700   132.8000    0.3400   0.2560      11938.0  5.534900e+03

7002 rows × 10 columns

# Select the price-change column
change = stock['change']
change

0      -36.0591
1      -45.0318
2      -28.2444
3       53.4711
4      -91.2206
         ...   
6997    -0.4700
6998     1.3600
6999     1.3400
7000     1.3300
7001     0.3400
Name: change, Length: 7002, dtype: float64
# 2) Group the data
# Automatic grouping: 10 quantile-based bins
sr3 = pd.qcut(change, 10)
sr3

0       (-354.685, -33.319]
1       (-354.685, -33.319]
2         (-33.319, -17.08]
3           (37.551, 649.5]
4       (-354.685, -33.319]
               ...         
6997        (-3.416, 0.934]
6998          (0.934, 4.84]
6999          (0.934, 4.84]
7000          (0.934, 4.84]
7001        (-3.416, 0.934]
Name: change, Length: 7002, dtype: category
Categories (10, interval[float64]): [(-354.685, -33.319] < (-33.319, -17.08] < (-17.08, -9.298] < (-9.298, -3.416] ... (4.84, 10.614] < (10.614, 19.612] < (19.612, 37.551] < (37.551, 649.5]]

sr3.value_counts()

(37.551, 649.5]        701
(0.934, 4.84]          701
(-354.685, -33.319]    701
(19.612, 37.551]       700
(10.614, 19.612]       700
(-3.416, 0.934]        700
(-9.298, -3.416]       700
(-17.08, -9.298]       700
(-33.319, -17.08]      700
(4.84, 10.614]         699
Name: change, dtype: int64
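
The counts above are listed from largest to smallest, which scrambles the interval order. A sketch of one way to list them from the lowest interval to the highest is to call sort_index() on the result. Because stock.xls is not available here, the snippet uses synthetic random numbers as a stand-in for the change column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for stock['change']; these are not real market data.
rng = np.random.default_rng(0)
changes = pd.Series(rng.normal(0, 20, size=1000), name="change")

groups = pd.qcut(changes, 10)

# value_counts() sorts by count; sort_index() restores interval order.
counts = groups.value_counts().sort_index()

print(counts)
```
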
# 3) Convert to dummy variables (one-hot encoding)
stock_change = pd.get_dummies(sr3, prefix='涨跌幅')  # the prefix '涨跌幅' means 'price change'
stock_change

      涨跌幅_(-354.685, -33.319]  涨跌幅_(-33.319, -17.08]  涨跌幅_(-17.08, -9.298]  涨跌幅_(-9.298, -3.416]  涨跌幅_(-3.416, 0.934]  涨跌幅_(0.934, 4.84]  涨跌幅_(4.84, 10.614]  涨跌幅_(10.614, 19.612]  涨跌幅_(19.612, 37.551]  涨跌幅_(37.551, 649.5]
0                              1                         0                        0                        0                       0                     0                      0                        0                        0                       0
1                              1                         0                        0                        0                       0                     0                      0                        0                        0                       0
2                              0                         1                        0                        0                       0                     0                      0                        0                        0                       0
3                              0                         0                        0                        0                       0                     0                      0                        0                        0                       1
4                              1                         0                        0                        0                       0                     0                      0                        0                        0                       0
...
6997                           0                         0                        0                        0                       1                     0                      0                        0                        0                       0
6998                           0                         0                        0                        0                       0                     1                      0                        0                        0                       0
6999                           0                         0                        0                        0                       0                     1                      0                        0                        0                       0
7000                           0                         0                        0                        0                       0                     1                      0                        0                        0                       0
7001                           0                         0                        0                        0                       1                     0                      0                        0                        0                       0

7002 rows × 10 columns

# Custom grouping with user-defined bin edges
bins = [-600, -300, 0, 300, 600, 900]
sr = pd.cut(change, bins)
sr

0       (-300, 0]
1       (-300, 0]
2       (-300, 0]
3        (0, 300]
4       (-300, 0]
          ...    
6997    (-300, 0]
6998     (0, 300]
6999     (0, 300]
7000     (0, 300]
7001     (0, 300]
Name: change, Length: 7002, dtype: category
Categories (5, interval[int64]): [(-600, -300] < (-300, 0] < (0, 300] < (300, 600] < (600, 900]]

sr.value_counts()

(0, 300]        3702
(-300, 0]       3290
(-600, -300]       7
(300, 600]         2
(600, 900]         1
Name: change, dtype: int64
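
With hand-picked edges, any value outside the outermost edges is not assigned to a bin and becomes NaN. The sketch below shows this on a few made-up values and, as one possible remedy, uses infinite outer edges so that every value is covered. The open-ended bin list is an assumption for illustration and is not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical price changes; two of them lie outside [-600, 900].
changes = pd.Series([-700.0, -10.0, 5.0, 1000.0])

bins = [-600, -300, 0, 300, 600, 900]
bounded = pd.cut(changes, bins)
print(bounded.isna().tolist())     # [True, False, False, True]

# Infinite outer edges catch everything below -300 and above 300.
open_ended = pd.cut(changes, [-np.inf, -300, 0, 300, np.inf])
print(open_ended.isna().tolist())  # [False, False, False, False]
```
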
stock_change = pd.get_dummies(sr, prefix='涨跌幅')  # one-hot / dummy variables; the prefix '涨跌幅' means 'price change'
stock_change

      涨跌幅_(-600, -300]  涨跌幅_(-300, 0]  涨跌幅_(0, 300]  涨跌幅_(300, 600]  涨跌幅_(600, 900]
0                       0                 1                0                  0                  0
1                       0                 1                0                  0                  0
2                       0                 1                0                  0                  0
3                       0                 0                1                  0                  0
4                       0                 1                0                  0                  0
...
6997                    0                 1                0                  0                  0
6998                    0                 0                1                  0                  0
6999                    0                 0                1                  0                  0
7000                    0                 0                1                  0                  0
7001                    0                 0                1                  0                  0

7002 rows × 5 columns
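
Once the dummy columns exist, a common next step is to place them next to the original columns. The following is only a sketch of one way to do that with pd.concat; the four rows are a small hand-copied subset of the table above (with only two of its columns), and the two-bin edge list is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the stock table; values copied from the rows shown above.
stock = pd.DataFrame({
    "close": [2887.4265, 2996.7618, 136.70, 137.17],
    "change": [-36.0591, 53.4711, -0.47, 1.36],
})

# Hypothetical edges: one bin for falls, one for rises.
groups = pd.cut(stock["change"], [-100, 0, 100])
dummies = pd.get_dummies(groups, prefix="change", dtype=int)

# axis=1 aligns on the row index and appends the dummy columns on the right.
combined = pd.concat([stock, dummies], axis=1)

print(combined)
```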
