Advanced Python 21. pandas: Data Splitting

Article link: https://blog.csdn.net/weixin_47440383/article/details/108165341

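This article shows how to split data with the pandas library. It focuses on `pd.cut()` and `pd.qcut()`, which bin data by interval and by count respectively, and combines them with `value_counts()` for summary statistics. It also mentions ways to check for and filter outliers, such as applying a custom function to individual values with `Series.apply()` and `DataFrame.applymap()`.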


Data splitting

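pd.cut(): given the intervals, find how many values fall into each one. Combine with value_counts().
pd.qcut(): given the number of bins, find the intervals (the edges are placed at quantiles, so each bin holds roughly the same number of values). Combine with value_counts().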
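
The contrast between the two functions can be sketched as follows. This is a minimal illustration rather than part of the original article: the bin edges `[0, 40, 60, 1000]` and the choice `q=4` are assumptions made for the example, and the ages are the same list used later in this post.

```python
import pandas as pd

ages = [16, 20, 24, 28, 30, 38, 40, 44, 47, 54, 56, 61, 77, 88, 99, 800]

# pd.cut: we fix the edges, pandas tells us how many values land in each bin.
# (The edges below are illustrative assumptions.)
by_edges = pd.cut(ages, bins=[0, 40, 60, 1000])
edge_counts = pd.Series(by_edges).value_counts().sort_index()

# pd.qcut: we fix the number of bins, pandas places the edges at quantiles
# so that each bin receives about the same number of values.
by_quantile = pd.qcut(ages, q=4)
quantile_counts = pd.Series(by_quantile).value_counts().sort_index()

print(edge_counts)      # uneven counts, edges chosen by us
print(quantile_counts)  # even counts, edges chosen by pandas
```

With `pd.cut()` the counts differ from bin to bin, while `pd.qcut()` produces bins of equal size whose edges depend on the data.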

pd.cut()

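pd.cut(
x,  # the 1-D array-like to be binned
bins,  # an int (number of equal-width bins) or a list-like of bin edges
right: bool = True,  # intervals are closed on the right by default
labels=None,  # alias for each interval
retbins: bool = False,  # also return the array of bin edges
precision: int = 3,  # precision used to store and display the bin labels, 3 by default
include_lowest: bool = False,  # the first interval excludes its left edge by default; True includes it
duplicates: str = 'raise',  # 'raise' or 'drop': what to do when bin edges are not unique
ordered: bool = True,  # whether the resulting categories are ordered
)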
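
A few of these parameters are easier to see in action. The following is a small sketch under assumed data: the `scores` list and the edges `[0, 50, 100]` are hypothetical and chosen only to show `bins` as an integer, `retbins`, and `include_lowest`.

```python
import pandas as pd

# Hypothetical sample data for illustration.
scores = [0, 15, 50, 85, 100]

# An integer for bins requests that many equal-width bins over the data range;
# retbins=True additionally returns the computed edges.
binned, edges = pd.cut(scores, bins=4, retbins=True)

# With explicit edges the first interval is open on the left by default,
# so the value 0 falls outside (0, 50] and becomes NaN.
default = pd.cut(scores, bins=[0, 50, 100])

# include_lowest=True closes the first interval on the left, so 0 is kept.
with_lowest = pd.cut(scores, bins=[0, 50, 100], include_lowest=True)

print(edges)
print(default)
print(with_lowest)
```

Four bins need five edges, which is why `edges` has one more element than the requested number of bins.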

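import pandas as pd
from pandas import Series

# Prepare some sample data: a list of ages and the bin edges
bins = [18, 40, 60, 100, 801]
ages = [16, 20, 24, 28, 30, 38, 40, 44, 47, 54, 56, 61, 77, 88, 99, 800]
# Use value_counts(), covered earlier, with its bins parameter
Series(ages).value_counts(bins=bins)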

(17.999, 40.0]    6
(60.0, 100.0]     4
(40.0, 60.0]      4
(100.0, 801.0]    1
dtype: int64

# pd.cut(): values outside every interval (here the age 16) become NaN
pd.cut(ages, bins=bins)

[NaN, (18.0, 40.0], (18.0, 40.0], (18.0, 40.0], (18.0, 40.0], ..., (60, 100], (60, 100], (60, 100], (60, 100], (100, 801]]
Length: 16
Categories (4, interval[int64]): [(18, 40] < (40, 60] < (60, 100] < (100, 801]]

pd.cut(ages,bins=bins).value_counts(dropna=False)

(18.0, 40.0]      6
(40.0, 60.0]      4
(60.0, 100.0]     4
(100.0, 801.0]    1
NaN               1
dtype: int64
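
The single NaN in the counts above is the age 16, which lies below the lowest edge 18. A minimal sketch of how one might locate such values and, if desired, widen the bins to cover them; the extra lower edge `0` is an assumption for illustration, not something the original article does.

```python
import pandas as pd

bins = [18, 40, 60, 100, 801]
ages = [16, 20, 24, 28, 30, 38, 40, 44, 47, 54, 56, 61, 77, 88, 99, 800]

cats = pd.cut(ages, bins=bins)

# Values that fall outside every interval are marked as missing,
# so pd.isna() gives a mask for finding them.
outside = [age for age, missing in zip(ages, pd.isna(cats)) if missing]
print(outside)

# Assumed fix: prepend a lower edge so the first interval becomes (0, 18].
wider = pd.cut(ages, bins=[0] + bins)
print(pd.Series(wider).value_counts().sort_index())
```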

pd.cut(ages,bins=bins,right=False)

[NaN, [18.0, 40.0), [18.0, 40.0), [18.0, 40.0), [18.0, 40.0), ..., [60, 100), [60, 100), [60, 100), [60, 100), [100, 801)]
Length: 16
Categories (4, interval[int64]): [[18, 40) < [40, 60) < [60, 100) < [100, 801)]

# The labels parameter gives each interval an alias
# (青年 = youth, 中年 = middle-aged, 老年 = elderly, 神仙 = immortal)
pd.cut(ages, bins=bins, right=False, labels=['青年', '中年', '老年', '神仙'])

[NaN, '青年', '青年', '青年', '青年', ..., '老年', '老年', '老年', '老年', '神仙']
Length: 16
Categories (4, object): ['青年' < '中年' < '老年' < '神仙']

# The benefit of labels: the value counts are easier to read
pd.cut(ages, bins=bins, right=False, labels=['青年', '中年', '老年', '神仙']).value_counts()

青年    5
中年    5
老年    4
神仙    1
dtype: int64
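
When the bins only need to be identified, not named, `labels=False` returns the integer position of each bin instead of a category. A short sketch, using a shortened version of the ages list as an assumption to keep the output small:

```python
import pandas as pd

bins = [18, 40, 60, 100, 801]
# Shortened sample, one value per case, chosen for illustration.
ages = [16, 20, 40, 61, 800]

# labels=False yields the index of the bin each value falls into.
# Out-of-range values become NaN, which makes the result a float array.
codes = pd.cut(ages, bins=bins, right=False, labels=False)
print(codes)
```

Because `right=False` makes the intervals closed on the left, the age 40 is assigned to the second bin, [40, 60), rather than the first.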

pd.cut(ages,bins=bins,right=False,labels=['青年'

Advanced Python 21. pandas: Data Splitting

Contents

Data splitting

pd.cut()