Pandas Notes (Part 3)

一杯敬朝阳一杯敬月光

Category: pandas. Tags: pandas, python, discretizing continuous features

Article link: https://blog.csdn.net/qq_xuanshuang/article/details/109691640


Continuous -> Discrete

pandas.cut splits a set of values into discrete intervals. For example, given a set of ages, pandas.cut can divide them into age groups and attach a label to each group.
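The age example can be sketched as follows. The ages, bin edges, and label names are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration.
ages = pd.Series([5, 17, 18, 34, 65, 80])

# Bin edges; with the default right=True each interval is (left, right].
age_bins = [0, 17, 64, 120]
age_labels = ['minor', 'adult', 'senior']

# Each age is replaced by the label of the interval it falls into.
age_groups = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_groups.tolist())
# ['minor', 'minor', 'adult', 'adult', 'senior', 'senior']
```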

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')  # signature as of pandas 0.23.4

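x: the array-like data to bin; it must be 1-dimensional (a DataFrame is not accepted);
bins: the intervals the data is cut into (also called "buckets" or "bins"). It takes one of three forms: an int scalar, a sequence of scalars (the bin edges), or a pandas.IntervalIndex.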
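A minimal sketch of the three forms of bins described above. The sample values and the edges 0, 3, 6, 9 are assumptions picked for illustration.

```python
import pandas as pd

values = pd.Series([1, 4, 5, 6, 2, 5, 8])

# 1) An int: the range of the data is split into that many equal-width bins.
by_count = pd.cut(values, bins=3)

# 2) A sequence of scalars: explicit bin edges.
by_edges = pd.cut(values, bins=[0, 3, 6, 9])

# 3) An IntervalIndex: the exact intervals to use.
intervals = pd.IntervalIndex.from_breaks([0, 3, 6, 9])
by_intervals = pd.cut(values, bins=intervals)

# Inspect the intervals that each form produced.
print(by_count.cat.categories)
print(by_edges.cat.categories)
print(by_intervals.cat.categories)
```

In this sketch the second and third forms describe the same intervals, so they assign every value to the same bin.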

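Note: the points to watch are the categorical dtype of the result, detecting missing values with pd.isna, how cut assigns values to bins, and type conversion with astype.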
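These points can be seen together in a small sketch. The three input values are chosen for illustration; the exact exception type and message raised by astype may differ between pandas versions, so both ValueError and TypeError are caught here as an assumption.

```python
import pandas as pd

values = pd.Series([1, 4, 8])
binned = pd.cut(values, bins=[1, 3, 5, 7, 8], labels=[0, 1, 2, 3])

print(binned.dtype)      # a categorical dtype, not an integer dtype
print(pd.isna(binned))   # True where a value fell outside every bin

# Casting to an integer dtype fails while NaN is present ...
try:
    binned.astype('int32')
except (ValueError, TypeError) as exc:
    print('astype failed:', exc)

# ... whereas the category codes are always integers, with -1 marking NaN.
print(binned.cat.codes.tolist())
# [-1, 1, 3]
```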

Example

import pandas as pd

pddf = pd.DataFrame({'A': range(10, 17),
                     'B': [1, 4, 5, 6, 2, 5, 8]})
b_bins = [1, 3, 5, 7, 8]
print(b_bins)
# One integer label per interval: 0, 1, 2, 3
group_names = range(len(b_bins) - 1)
# With the defaults right=True and include_lowest=False the first interval
# is (1, 3], so B == 1 falls outside every bin and becomes NaN.
pddf['C'] = pd.cut(pddf['B'], b_bins, labels=group_names)
print('Number of NaN values in column C: {}'.format(pd.isna(pddf['C']).sum()))

# Adds the string '4' as a new category, so the categories mix int and str.
pddf['C'] = pddf['C'].cat.add_categories([str(len(b_bins) - 1)])

print(pddf)

print('=' * 20)

# Map each column name to its dtype.
type_dic = dict(zip(pddf.columns, pddf.dtypes))
print(type_dic)

Output:

[1, 3, 5, 7, 8]
Number of NaN values in column C: 1
    A  B    C
0  10  1  NaN
1  11  4    1
2  12  5    1
3  13  6    2
4  14  2    0
5  15  5    1
6  16  8    3
====================
{'A': dtype('int64'), 'B': dtype('int64'), 'C': CategoricalDtype(categories=[0, 1, 2, 3, '4'], ordered=True)}
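The single NaN in the output above comes from B == 1, which sits exactly on the lowest bin edge and is excluded from the interval (1, 3]. One alternative to adding a lower edge, not used in the original post, is the include_lowest parameter from the signature. A minimal sketch with the same B values:

```python
import pandas as pd

b_values = pd.Series([1, 4, 5, 6, 2, 5, 8])
b_bins = [1, 3, 5, 7, 8]
group_names = list(range(len(b_bins) - 1))

# include_lowest=True closes the first interval on the left, so 1 is kept.
c = pd.cut(b_values, b_bins, labels=group_names, include_lowest=True)
print(c.tolist())
# [0, 1, 1, 2, 0, 1, 3]
print(pd.isna(c).sum())
# 0
```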

# Adding 0 as a lower edge creates the interval (0, 1], which now covers B == 1.
b_bins = [0, 1, 3, 5, 7, 8]
group_names = range(len(b_bins) - 1)
pddf['C'] = pd.cut(pddf['B'], b_bins, labels=group_names)
print('Number of NaN values in column C: {}'.format(pd.isna(pddf['C']).sum()))

# This time the new category is the int 5, so all categories stay integers.
pddf['C'] = pddf['C'].cat.add_categories([len(b_bins) - 1])

type_dic = dict(zip(pddf.columns, pddf.dtypes))
print(type_dic)
print('=' * 20)

# No NaN is present, so the categorical column can be cast to int32.
pddf['C'] = pddf['C'].astype('int32')

type_dic = dict(zip(pddf.columns, pddf.dtypes))
print(type_dic)
print('=' * 20)
print(pddf)

Output:

Number of NaN values in column C: 0
{'A': dtype('int64'), 'B': dtype('int64'), 'C': CategoricalDtype(categories=[0, 1, 2, 3, 4, 5], ordered=True)}
====================
{'A': dtype('int64'), 'B': dtype('int64'), 'C': dtype('int32')}
====================
    A  B  C
0  10  1  0
1  11  4  2
2  12  5  2
3  13  6  3
4  14  2  1
5  15  5  2
6  16  8  4
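When only the integer bin index is needed, as in the final C column above, passing labels=False to pandas.cut is a possible shortcut that skips the categorical step and the astype call. This is an alternative sketch, not the approach taken in the original post.

```python
import pandas as pd

b_values = pd.Series([1, 4, 5, 6, 2, 5, 8])
b_bins = [0, 1, 3, 5, 7, 8]

# labels=False returns the position of each value's bin instead of a label.
bin_index = pd.cut(b_values, b_bins, labels=False)
print(bin_index.tolist())
# [0, 2, 2, 3, 1, 2, 4]
```

If a value falls outside every bin, its entry is NaN and the result becomes a float column, so the missing-value check with pd.isna still applies.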
