![Image](https://i-blog.csdnimg.cn/direct/09050e5904f84e51bffd8cbced07f311.png)
1. Binning
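- Hey everyone! Binning is something of a magician in the data world. Picture a dense, eye-straining run of continuous numbers being sorted into tidy little boxes, each holding one range of values, like sorting candy by color. The data becomes easier to read, and outliers get absorbed into a bin instead of causing trouble, which tends to make later analysis and modeling go more smoothly. Think of it as a one-click beauty filter for your data.
- Next, meet the two main binning methods: equal-width and equal-frequency. Equal-width binning is like marking off heights at fixed intervals: every bin spans the same range of values, no matter how many data points fall into it. Equal-frequency binning is the fair-share approach: it picks the boundaries so that each bin holds roughly the same number of data points, like cutting a cake so that everyone gets a slice. Each has its strengths, and both make data processing more fun and more efficient, don't you think?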
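To make the difference concrete, here is a minimal sketch on a small made-up sample (the `values` series is an assumption for illustration, not data from this post). It uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins, then counts how many values land in each bin.

```python
import pandas as pd

# Hypothetical skewed sample: seven small values and one large one.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: 4 intervals of the same length covering the range 1..100.
equal_width = pd.cut(values, bins=4)

# Equal-frequency: 4 intervals chosen so each holds about the same number of values.
equal_freq = pd.qcut(values, q=4)

# Count the values per bin, listed in bin order.
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()

print(width_counts)  # seven of the eight values share the first bin
print(freq_counts)   # two values in every bin
```

With skewed data like this, equal-width bins can end up nearly empty, while equal-frequency bins stay balanced at the cost of very uneven interval lengths.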
2. Imports and data preparation
import numpy as np
import pandas as pd
data = np.random.randint(0, 100, size=(5, 3))  # 5 rows x 3 columns of random integers in [0, 100)
df = pd.DataFrame(data=data, columns=["Python", "NumPy", "Pandas"])
df
|   | Python | NumPy | Pandas |
|---|---|---|---|
| 0 | 95 | 59 | 16 |
| 1 | 81 | 68 | 63 |
| 2 | 73 | 70 | 35 |
| 3 | 30 | 4 | 47 |
| 4 | 58 | 18 | 80 |
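Because the data is random, your numbers will differ from the table on every run. If you want a repeatable table, one option is a seeded generator. This is a sketch only: the seed value `0` is an arbitrary assumption, and it will not reproduce the exact values shown in this post.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same "random" integers on every run.
rng = np.random.default_rng(seed=0)  # the seed value is arbitrary

# 5 rows x 3 columns of integers in [0, 100), same shape as the table above.
data = rng.integers(0, 100, size=(5, 3))
df = pd.DataFrame(data=data, columns=["Python", "NumPy", "Pandas"])
print(df)
```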
3. Equal-width binning
s = pd.cut(df.Python, bins=4)  # split the range of the Python column into 4 equal-width intervals
s
0 (78.75, 95.0]
1 (78.75, 95.0]
2 (62.5, 78.75]
3 (29.935, 46.25]
4 (46.25, 62.5]
Name: Python, dtype: category
Categories (4, interval[float64, right]): [(29.935, 46.25] < (46.25, 62.5] < (62.5, 78.75] < (78.75, 95.0]]
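Wondering where edges like 29.935 and 46.25 come from? The sketch below rebuilds the `Python` column from the values printed in the table and compares the edges pandas reports (via `retbins=True`) with edges computed by hand using `np.linspace`. The name `manual_edges` is just for illustration.

```python
import numpy as np
import pandas as pd

# The Python column, copied from the table printed earlier in this post.
scores = pd.Series([95, 81, 73, 30, 58], name="Python")

# retbins=True also returns the bin edges pandas actually used.
binned, edges = pd.cut(scores, bins=4, retbins=True)

# By hand: split [min, max] into 4 equal parts, which needs 5 edges.
manual_edges = np.linspace(scores.min(), scores.max(), num=5)

print(edges)
print(manual_edges)
```

The inner and upper edges agree. The lowest edge from pandas sits slightly below the minimum (29.935 instead of 30) so that the smallest value still falls inside the first right-closed interval.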
s.value_counts()
Python
(78.75, 95.0] 2
(29.935, 46.25] 1
(46.25, 62.5] 1
(62.5, 78.75] 1
Name: count, dtype: int64
s.value_counts().plot.bar()
<Axes: xlabel='Python'>
![Bar chart of the number of values in each bin](https://i-blog.csdnimg.cn/direct/6a10ab18c73a465e91dcde3446c00c15.png)
pd.cut(
    df.Python,
    bins=[0, 30, 60, 80, 100],
    right=False,
    labels=["D", "C", "B", "A"]
)
0 A
1 A
2 B
3 C
4 C
Name: Python, dtype: category
Categories (4, object): ['D' < 'C' < 'B' < 'A']
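Notice that the score of 30 got a C, not a D. That is the effect of `right=False`, which makes each interval closed on the left and open on the right. Here is a small sketch comparing both settings, reusing the `Python` column values from the table (the variable names are illustrative).

```python
import pandas as pd

# The Python column, copied from the table printed earlier in this post.
scores = pd.Series([95, 81, 73, 30, 58], name="Python")
edges = [0, 30, 60, 80, 100]
grades = ["D", "C", "B", "A"]

# right=False -> intervals [0, 30), [30, 60), [60, 80), [80, 100)
left_closed = pd.cut(scores, bins=edges, right=False, labels=grades)

# right=True (the default) -> intervals (0, 30], (30, 60], (60, 80], (80, 100]
right_closed = pd.cut(scores, bins=edges, right=True, labels=grades)

print(left_closed.tolist())   # ['A', 'A', 'B', 'C', 'C']
print(right_closed.tolist())  # ['A', 'A', 'B', 'D', 'C']

# With right=False the last interval is [80, 100), so a perfect 100 gets no bin.
perfect = pd.cut(pd.Series([100]), bins=edges, right=False, labels=grades)
print(perfect.isna().tolist())  # [True]
```

If a score of exactly 100 is possible in your data, one workaround is to extend the last edge, for example to 101.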
4. Equal-frequency binning
pd.qcut(
    df.Python,
    q=4,
)
0 (81.0, 95.0]
1 (73.0, 81.0]
2 (58.0, 73.0]
3 (29.999, 58.0]
4 (29.999, 58.0]
Name: Python, dtype: category
Categories (4, interval[float64, right]): [(29.999, 58.0] < (58.0, 73.0] < (73.0, 81.0] < (81.0, 95.0]]
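The edges 58, 73, and 81 are simply the quartiles of the column. As a rough check, this sketch compares the edges returned by `pd.qcut` (via `retbins=True`) with the quantiles computed directly by `Series.quantile`, again using the values from the table. The name `manual_edges` is illustrative.

```python
import numpy as np
import pandas as pd

# The Python column, copied from the table printed earlier in this post.
scores = pd.Series([95, 81, 73, 30, 58], name="Python")

# retbins=True also returns the quantile edges that qcut used.
binned, edges = pd.qcut(scores, q=4, retbins=True)

# The same cut points computed directly: 0%, 25%, 50%, 75%, 100%.
manual_edges = scores.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_numpy()

print(edges)
print(manual_edges)
```

With only five values, the four bins cannot all be the same size, which is why the first bin holds two values and the others hold one each.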
pd.qcut(
    df.Python,
    q=4,
    labels=["D", "C", "B", "A"]
)
0 A
1 B
2 C
3 D
4 D
Name: Python, dtype: category
Categories (4, object): ['D' < 'C' < 'B' < 'A']
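One thing to watch for with equal-frequency binning: when many values are tied, several quantiles can coincide, and `pd.qcut` raises a `ValueError` because the bin edges are not unique. A minimal sketch with a made-up series (`tied` is hypothetical data, not from this post) shows the error and the `duplicates="drop"` option, which merges the repeated edges.

```python
import pandas as pd

# Hypothetical data with many ties: half of the values are 0.
tied = pd.Series([0, 0, 0, 0, 1, 2, 3, 4])

# The 0% and 25% quantiles are both 0, so the edges are not unique.
try:
    pd.qcut(tied, q=4)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# duplicates="drop" removes the repeated edge, leaving fewer bins than requested.
binned = pd.qcut(tied, q=4, duplicates="drop")
n_bins = len(binned.cat.categories)
print(n_bins)  # 3
```

Since the number of bins can shrink, a fixed `labels` list of length 4 would no longer match here, so it is safer to leave `labels` out or build the list after checking how many bins came back.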