Data Binning
Import the required libraries
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
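Getting the data
Generate some exam scores: 20 random integers from 25 up to, but not including, 100 (the upper bound of np.random.randint is exclusive)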
score_list=np.random.randint(25,100,size=20)
score_list
array([66, 40, 32, 55, 81, 91, 49, 57, 36, 96, 83, 55, 98, 38, 36, 82, 71,
39, 73, 60])
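Because the scores are random, the array above will differ on every run. The following is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed value 42 and the upper bound of 101 (which lets a score of exactly 100 appear) are illustrative choices, not part of the original.

```python
import numpy as np

# Seeded generator so the same scores come back on every run
# (the seed value is an arbitrary assumption).
rng = np.random.default_rng(42)

# integers() excludes the upper bound by default, just like np.random.randint,
# so 101 is passed here to make a score of exactly 100 possible.
score_list = rng.integers(25, 101, size=20)
print(score_list)
```

Re-creating the generator with the same seed yields the same 20 scores, which makes the bin counts in the later steps repeatable.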
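Binning the data
Scores up to 59 count as failing, above 59 up to 70 as OK, above 70 up to 80 as good, and above 80 up to 100 as excellent. Define a list that holds these bin edges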
bins=[0,59,70,80,100]
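Call the cut function: the first argument is the data to bin, and the second is the list of bin edges. By default each bin is open on the left and closed on the right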
pd.cut(score_list,bins)
[(59, 70], (0, 59], (0, 59], (0, 59], (80, 100], ..., (80, 100], (70, 80], (0, 59], (70, 80], (59, 70]]
Length: 20
Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
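The `(59, 70]` notation means each bin excludes its left edge and includes its right edge, so a score of exactly 0 would fall outside every bin. The sketch below illustrates this with the `right` and `include_lowest` parameters of `pd.cut`; the `edge_scores` array is a hypothetical set of boundary values chosen only for the demonstration.

```python
import numpy as np
import pandas as pd

bins = [0, 59, 70, 80, 100]
# Hypothetical boundary scores, chosen only to show the edge behaviour.
edge_scores = np.array([0, 59, 60, 70, 100])

# Default: right-closed intervals (a, b]; 0 lies outside (0, 59] and becomes NaN.
default_cut = pd.cut(edge_scores, bins)

# include_lowest=True also closes the first interval on the left, so 0 is kept.
with_lowest = pd.cut(edge_scores, bins, include_lowest=True)

# right=False switches to left-closed intervals [a, b); now 100 falls outside.
left_closed = pd.cut(edge_scores, bins, right=False)

print(default_cut)
print(with_lowest)
print(left_closed)
```

Which variant fits depends on where a boundary score such as 0 or 100 should land; the random scores in this walkthrough start at 25, so the default is sufficient here.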
score_cat=pd.cut(score_list,bins)
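Count how many scores fall into each bin (the top-level pd.value_counts is deprecated in recent pandas releases; pd.Series(score_cat).value_counts() returns the same counts)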
pd.value_counts(score_cat)
(0, 59] 10
(80, 100] 6
(70, 80] 2
(59, 70] 2
dtype: int64
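The counts above are sorted by frequency, which scrambles the natural order of the bins. As a sketch of one alternative, the Categorical can be wrapped in a Series and counted with `sort=False`; the scores are hard-coded from the array shown earlier so that the result is reproducible.

```python
import pandas as pd

# The 20 scores shown above, hard-coded so the counts are reproducible.
scores = [66, 40, 32, 55, 81, 91, 49, 57, 36, 96,
          83, 55, 98, 38, 36, 82, 71, 39, 73, 60]
bins = [0, 59, 70, 80, 100]
score_cat = pd.cut(scores, bins)

# Wrap the Categorical in a Series and count; sort=False keeps the bins
# in their category order instead of ordering them by frequency.
counts = pd.Series(score_cat).value_counts(sort=False)
print(counts)
```

The totals match the output above: 10 scores in the lowest bin, 2 in each of the two middle bins, and 6 in the highest.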
Create an empty DataFrame
df=DataFrame()
Assign score_list to the 'score' column
df['score']=score_list
df
| | score |
---|---|
0 | 66 |
1 | 40 |
2 | 32 |
3 | 55 |
4 | 81 |
5 | 91 |
6 | 49 |
7 | 57 |
8 | 36 |
9 | 96 |
10 | 83 |
11 | 55 |
12 | 98 |
13 | 38 |
14 | 36 |
15 | 82 |
16 | 71 |
17 | 39 |
18 | 73 |
19 | 60 |
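Use rands to generate 20 random three-character strings to serve as student names (pd.util.testing was deprecated in pandas 1.0 and removed in 2.0, so this call only works on older versions)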
df['student']=[pd.util.testing.rands(3) for i in range(20)]
df
| | score | student |
---|---|---|
0 | 66 | mDG |
1 | 40 | pCe |
2 | 32 | Iuv |
3 | 55 | sWt |
4 | 81 | eaR |
5 | 91 | 5Gw |
6 | 49 | 8Xc |
7 | 57 | Tu2 |
8 | 36 | IRS |
9 | 96 | VQU |
10 | 83 | lsc |
11 | 55 | nek |
12 | 98 | cFQ |
13 | 38 | ZeB |
14 | 36 | Lfi |
15 | 82 | jYv |
16 | 71 | x6q |
17 | 39 | t9I |
18 | 73 | CJg |
19 | 60 | hF4 |
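On pandas versions where `pd.util.testing.rands` is unavailable, the same kind of random names can be produced with the standard library alone. This is only a sketch: `random_name` is a hypothetical helper introduced here for illustration, and it draws from ASCII letters and digits to mimic the names in the table above.

```python
import random
import string

def random_name(length=3):
    """Return a random string of ASCII letters and digits (hypothetical helper)."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(random.choices(alphabet, k=length))

# Twenty three-character names, one per student.
students = [random_name(3) for _ in range(20)]
print(students)
```

The resulting list can be assigned to the 'student' column in the same way as the list comprehension above.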
Apply cut to the DataFrame's score column
pd.cut(df['score'],bins)
0 (59, 70]
1 (0, 59]
2 (0, 59]
3 (0, 59]
4 (80, 100]
5 (80, 100]
6 (0, 59]
7 (0, 59]
8 (0, 59]
9 (80, 100]
10 (80, 100]
11 (0, 59]
12 (80, 100]
13 (0, 59]
14 (0, 59]
15 (80, 100]
16 (70, 80]
17 (0, 59]
18 (70, 80]
19 (59, 70]
Name: score, dtype: category
Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
Assign the categories produced by cut to a 'Categories' column of the DataFrame
df['Categories']=pd.cut(df['score'],bins)
df
| | score | student | Categories |
---|---|---|---|
0 | 66 | mDG | (59, 70] |
1 | 40 | pCe | (0, 59] |
2 | 32 | Iuv | (0, 59] |
3 | 55 | sWt | (0, 59] |
4 | 81 | eaR | (80, 100] |
5 | 91 | 5Gw | (80, 100] |
6 | 49 | 8Xc | (0, 59] |
7 | 57 | Tu2 | (0, 59] |
8 | 36 | IRS | (0, 59] |
9 | 96 | VQU | (80, 100] |
10 | 83 | lsc | (80, 100] |
11 | 55 | nek | (0, 59] |
12 | 98 | cFQ | (80, 100] |
13 | 38 | ZeB | (0, 59] |
14 | 36 | Lfi | (0, 59] |
15 | 82 | jYv | (80, 100] |
16 | 71 | x6q | (70, 80] |
17 | 39 | t9I | (0, 59] |
18 | 73 | CJg | (70, 80] |
19 | 60 | hF4 | (59, 70] |
Define a list of names and pass it as the labels argument of cut, so that each bin carries a label that is easier for people to read
df['Categories']=pd.cut(df['score'],bins,labels=['Low','OK','Good','Great'])
df
| | score | student | Categories |
---|---|---|---|
0 | 66 | mDG | OK |
1 | 40 | pCe | Low |
2 | 32 | Iuv | Low |
3 | 55 | sWt | Low |
4 | 81 | eaR | Great |
5 | 91 | 5Gw | Great |
6 | 49 | 8Xc | Low |
7 | 57 | Tu2 | Low |
8 | 36 | IRS | Low |
9 | 96 | VQU | Great |
10 | 83 | lsc | Great |
11 | 55 | nek | Low |
12 | 98 | cFQ | Great |
13 | 38 | ZeB | Low |
14 | 36 | Lfi | Low |
15 | 82 | jYv | Great |
16 | 71 | x6q | Good |
17 | 39 | t9I | Low |
18 | 73 | CJg | Good |
19 | 60 | hF4 | OK |
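The labelled result is an ordered categorical, so the labels can be compared in bin order rather than alphabetically. The sketch below uses five of the scores from the table to show this, together with `labels=False`, which returns the integer position of each bin instead of a name; the variable names `grade`, `passed` and `codes` are illustrative.

```python
import pandas as pd

# Five of the scores from the table above.
scores = pd.Series([66, 40, 81, 71, 60], name='score')
bins = [0, 59, 70, 80, 100]
labels = ['Low', 'OK', 'Good', 'Great']

# labels needs exactly one entry per bin, that is len(bins) - 1 entries.
grade = pd.cut(scores, bins, labels=labels)

# The categories are ordered Low < OK < Good < Great, so comparisons
# follow the bin order; this keeps every score graded OK or better.
passed = scores[grade >= 'OK']

# labels=False returns the integer index of the bin for each score.
codes = pd.cut(scores, bins, labels=False)

print(grade.tolist())
print(passed.tolist())
print(codes.tolist())
```

Keeping the labels ordered is what makes a filter such as `grade >= 'OK'` meaningful; with plain strings the comparison would be alphabetical.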