Data Binning

Latest recommended article published on 2023-09-22 17:05:17

法蒂芬

Article link: https://blog.csdn.net/weixin_44039183/article/details/108013021

Importing the libraries

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

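Getting the data

Generate some exam scores: 20 integers from 25 to 99 (np.random.randint excludes the upper bound, so 100 itself is never drawn)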

score_list=np.random.randint(25,100,size=20)
score_list

array([66, 40, 32, 55, 81, 91, 49, 57, 36, 96, 83, 55, 98, 38, 36, 82, 71,
       39, 73, 60])
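
Because the scores are drawn at random, every run produces different numbers. A minimal sketch, assuming a repeatable draw is wanted: seeding NumPy's generator fixes the result. The seed value 42 and the names rng and seeded_scores are illustrative choices, not part of the original walkthrough.

```python
import numpy as np

# Seeded generator: the same seed always yields the same scores.
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed

# Like np.random.randint, integers() excludes the upper bound,
# so the scores fall in the range 25..99.
seeded_scores = rng.integers(25, 100, size=20)
print(seeded_scores)
```

Re-running the block with the same seed gives the same 20 scores, which makes the binning results that follow easier to compare between runs.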

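Binning the data

Treat 0-59 as failing, 60-70 as OK, 71-80 as good, and 81-100 as great. Define a list holding the bin edges for these ranges; pd.cut treats each range as open on the left and closed on the right, for example (59, 70]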

bins=[0,59,70,80,100]

Call the cut method: the first argument is the data to bin, and the second is the list of bin edges

pd.cut(score_list,bins)

[(59, 70], (0, 59], (0, 59], (0, 59], (80, 100], ..., (80, 100], (70, 80], (0, 59], (70, 80], (59, 70]]
Length: 20
Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
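
By default pd.cut builds right-closed intervals, which is why a score of 60 lands in (59, 70] above. As a sketch of an alternative, passing right=False produces left-closed intervals, so the edges can be written as the round numbers 60, 70 and 80. The edges and sample scores below are illustrative; the top edge is set to 101 so that a score of 100 still falls inside the last bin.

```python
import pandas as pd

sample_scores = [59, 60, 70, 80, 100]   # illustrative boundary values
left_closed_bins = [0, 60, 70, 80, 101]

# right=False makes each interval include its left edge and exclude its right edge.
boundary_cat = pd.cut(sample_scores, left_closed_bins, right=False)
print(boundary_cat)

# Integer position of the bin each score fell into (0 = lowest bin).
print(list(boundary_cat.codes))
```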

score_cat=pd.cut(score_list,bins)

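Count the number of students in each score range

# pd.value_counts(score_cat) does the same on older pandas but is deprecated in pandas 2.x
pd.Series(score_cat).value_counts()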

(0, 59]      10
(80, 100]     6
(70, 80]      2
(59, 70]      2
dtype: int64
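
value_counts sorts by count, so the bins appear out of order. A small sketch, reusing the 20 scores printed earlier, that lists the counts in bin order instead by sorting on the index (the name counts_in_bin_order is illustrative):

```python
import pandas as pd

# The 20 scores shown earlier in the article.
scores = [66, 40, 32, 55, 81, 91, 49, 57, 36, 96, 83, 55, 98, 38, 36, 82, 71,
          39, 73, 60]
bins = [0, 59, 70, 80, 100]

score_cat = pd.cut(scores, bins)

# sort_index() orders the rows by interval rather than by count.
counts_in_bin_order = pd.Series(score_cat).value_counts().sort_index()
print(counts_in_bin_order)
```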

Create an empty DataFrame

df=DataFrame()

Assign score_list to the 'score' column

df['score']=score_list
df

	score
0	66
1	40
2	32
3	55
4	81
5	91
6	49
7	57
8	36
9	96
10	83
11	55
12	98
13	38
14	36
15	82
16	71
17	39
18	73
19	60

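Generate 20 random three-character strings to use as student names. Older pandas exposed pd.util.testing.rands for this; that module is gone in pandas 2.x, so the standard library is used here

import random, string
name_chars=string.ascii_letters+string.digits
df['student']=[''.join(random.choices(name_chars,k=3)) for i in range(20)]
df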

	score	student
0	66	mDG
1	40	pCe
2	32	Iuv
3	55	sWt
4	81	eaR
5	91	5Gw
6	49	8Xc
7	57	Tu2
8	36	IRS
9	96	VQU
10	83	lsc
11	55	nek
12	98	cFQ
13	38	ZeB
14	36	Lfi
15	82	jYv
16	71	x6q
17	39	t9I
18	73	CJg
19	60	hF4

Apply cut to the score column of the DataFrame

pd.cut(df['score'],bins)

0      (59, 70]
1       (0, 59]
2       (0, 59]
3       (0, 59]
4     (80, 100]
5     (80, 100]
6       (0, 59]
7       (0, 59]
8       (0, 59]
9     (80, 100]
10    (80, 100]
11      (0, 59]
12    (80, 100]
13      (0, 59]
14      (0, 59]
15    (80, 100]
16     (70, 80]
17      (0, 59]
18     (70, 80]
19     (59, 70]
Name: score, dtype: category
Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
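
Scores that fall outside the bin edges are not assigned to any bin and come back as NaN. Because the default intervals are open on the left, a score of exactly 0 is also excluded from (0, 59]. A hedged sketch with made-up scores showing this, and how include_lowest=True pulls the lowest edge into the first bin:

```python
import pandas as pd

bins = [0, 59, 70, 80, 100]
edge_scores = pd.Series([0, 59, 100, 105])  # made-up values for illustration

# Default: 0 and 105 lie outside every interval, so they become NaN.
default_cut = pd.cut(edge_scores, bins)
print(default_cut.isna().tolist())

# include_lowest=True makes the first interval include its left edge (0).
inclusive_cut = pd.cut(edge_scores, bins, include_lowest=True)
print(inclusive_cut.isna().tolist())
```

None of the 20 generated scores is affected, since they all lie between 25 and 99, but the check matters when real data may contain zeros or out-of-range values.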

Assign the result of cut to a 'Categories' column of the DataFrame

df['Categories']=pd.cut(df['score'],bins)
df

	score	student	Categories
0	66	mDG	(59, 70]
1	40	pCe	(0, 59]
2	32	Iuv	(0, 59]
3	55	sWt	(0, 59]
4	81	eaR	(80, 100]
5	91	5Gw	(80, 100]
6	49	8Xc	(0, 59]
7	57	Tu2	(0, 59]
8	36	IRS	(0, 59]
9	96	VQU	(80, 100]
10	83	lsc	(80, 100]
11	55	nek	(0, 59]
12	98	cFQ	(80, 100]
13	38	ZeB	(0, 59]
14	36	Lfi	(0, 59]
15	82	jYv	(80, 100]
16	71	x6q	(70, 80]
17	39	t9I	(0, 59]
18	73	CJg	(70, 80]
19	60	hF4	(59, 70]

Define a list of names and pass it as the labels parameter of cut, so that each category is shown by name and is easier to read

df['Categories']=pd.cut(df['score'],bins,labels=['Low','OK','Good','Great'])
df

	score	student	Categories
0	66	mDG	OK
1	40	pCe	Low
2	32	Iuv	Low
3	55	sWt	Low
4	81	eaR	Great
5	91	5Gw	Great
6	49	8Xc	Low
7	57	Tu2	Low
8	36	IRS	Low
9	96	VQU	Great
10	83	lsc	Great
11	55	nek	Low
12	98	cFQ	Great
13	38	ZeB	Low
14	36	Lfi	Low
15	82	jYv	Great
16	71	x6q	Good
17	39	t9I	Low
18	73	CJg	Good
19	60	hF4	OK
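
Once every row carries a label, the column can drive a grouped summary. A minimal sketch, rebuilding the score column from the values shown above (the student column is omitted for brevity, and the name summary is illustrative), that counts the students and averages the scores per category:

```python
import pandas as pd

# The 20 scores shown in the tables above.
scores = [66, 40, 32, 55, 81, 91, 49, 57, 36, 96, 83, 55, 98, 38, 36, 82, 71,
          39, 73, 60]
bins = [0, 59, 70, 80, 100]
labels = ['Low', 'OK', 'Good', 'Great']

df = pd.DataFrame({'score': scores})
df['Categories'] = pd.cut(df['score'], bins, labels=labels)

# observed=False keeps every label in the result, even one with no students.
summary = df.groupby('Categories', observed=False)['score'].agg(['count', 'mean'])
print(summary)
```

The rows come out in label order (Low, OK, Good, Great) because cut returns an ordered categorical.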
