Continuous -> Discrete
pandas.cut splits a set of values into discrete intervals. For example, given a set of ages, pandas.cut can divide them into age ranges and attach a label to each range.
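The age scenario described above might look like the following minimal sketch. The ages, the bin edges, and the label names are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages and age-range boundaries, for illustration only
ages = pd.Series([5, 17, 18, 34, 65, 80])
age_bins = [0, 17, 64, 120]              # intervals: (0, 17], (17, 64], (64, 120]
age_labels = ['minor', 'adult', 'senior']

# Each age is replaced by the label of the interval it falls into
age_groups = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_groups.tolist())
# ['minor', 'minor', 'adult', 'adult', 'senior', 'senior']
```

Because right=True is the default, each interval includes its right edge, so an age of 17 still lands in the first range.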
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')  # signature as of pandas 0.23.4
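x: the array-like data to be binned. It must be 1-dimensional (a DataFrame cannot be used).
bins: the intervals the data is cut into (also called "buckets" or "bins"). It can take 3 forms: an int scalar (the number of equal-width bins), a sequence of scalars (an array of bin edges), or a pandas.IntervalIndex.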
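The three forms of bins can be compared side by side. This is a small sketch with made-up values and edges, not output taken from the original post.

```python
import pandas as pd

values = pd.Series([1, 4, 5, 6, 2, 5, 8])

# 1) int scalar: the number of equal-width bins over the range of the data
by_count = pd.cut(values, 3)

# 2) sequence of scalars: explicit bin edges
by_edges = pd.cut(values, [0, 3, 6, 9])

# 3) IntervalIndex: the exact intervals to use (closed on the right by default)
intervals = pd.IntervalIndex.from_breaks([0, 3, 6, 9])
by_intervals = pd.cut(values, intervals)

print(by_count.cat.categories)
print(by_edges.cat.categories)
print(by_intervals.cat.categories)
```

With the edges chosen here, forms 2 and 3 assign every value to the same interval. They differ mainly in how the intervals are specified.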
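Note: the example below touches on the categorical dtype, detecting missing values with pd.isna, pd.cut itself, and type conversion with astype.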
Example
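import pandas as pd
from pandas import DataFrame

pddf = DataFrame({'A': range(10, 17),
                  'B': [1, 4, 5, 6, 2, 5, 8]})
b_bins = [1, 3, 5, 7, 8]
print(b_bins)
# One integer label per interval: 0 .. len(b_bins) - 2
group_names = range(len(b_bins) - 1)
# right=True by default, so the intervals are (1, 3], (3, 5], (5, 7], (7, 8];
# B == 1 lies outside all of them and becomes NaN
pddf['C'] = pd.cut(pddf['B'], b_bins, labels=group_names)
print('Number of NaN values in column C: {}'.format(len(pddf[pd.isna(pddf['C'])])))
# Add an extra category; note that it is the string '4', not the int 4
pddf['C'] = pddf['C'].cat.add_categories([str(len(b_bins) - 1)])
print(pddf)
print('=' * 20)
# Map each column name to its dtype
type_dic = {}
for k, v in zip(pddf.columns, pddf.dtypes):
    type_dic.update({k: v})
print(type_dic)

Output:

[1, 3, 5, 7, 8]
Number of NaN values in column C: 1
    A  B    C
0  10  1  NaN
1  11  4    1
2  12  5    1
3  13  6    2
4  14  2    0
5  15  5    1
6  16  8    3
====================
{'A': dtype('int64'), 'B': dtype('int64'), 'C': CategoricalDtype(categories=[0, 1, 2, 3, '4'], ordered=True)}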
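The single NaN in column C comes from the default right=True: the first interval is (1, 3], which excludes the value 1. One way to keep that value without changing the edges is the include_lowest=True parameter from the signature above, which closes the first interval on the left. A minimal sketch, reusing the same B values and bin edges:

```python
import pandas as pd

b = pd.Series([1, 4, 5, 6, 2, 5, 8])
b_bins = [1, 3, 5, 7, 8]
labels = list(range(len(b_bins) - 1))

# Default: the lowest edge is excluded, so the value 1 is not binned
default_cut = pd.cut(b, b_bins, labels=labels)

# include_lowest=True: the first interval also contains its left edge
inclusive_cut = pd.cut(b, b_bins, labels=labels, include_lowest=True)

print(pd.isna(default_cut).sum())    # 1
print(pd.isna(inclusive_cut).sum())  # 0
print(inclusive_cut.tolist())        # [0, 1, 1, 2, 0, 1, 3]
```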
# Prepend 0 to the bin edges so that B == 1 falls into (0, 1] instead of becoming NaN
b_bins = [0, 1, 3, 5, 7, 8]
group_names = range(len(b_bins) - 1)
pddf['C'] = pd.cut(pddf['B'], b_bins, labels=group_names)
print('Number of NaN values in column C: {}'.format(len(pddf[pd.isna(pddf['C'])])))
# This time the extra category is the int 5, so all categories are ints
pddf['C'] = pddf['C'].cat.add_categories([len(b_bins) - 1])
type_dic = {}
for k, v in zip(pddf.columns, pddf.dtypes):
    type_dic.update({k: v})
print(type_dic)
print('=' * 20)
# With no NaN left, the categorical column can be cast to a plain integer dtype
pddf['C'] = pddf['C'].astype('int32')
type_dic = {}
for k, v in zip(pddf.columns, pddf.dtypes):
    type_dic.update({k: v})
print(type_dic)
print('=' * 20)
print(pddf)

Output:

Number of NaN values in column C: 0
{'A': dtype('int64'), 'B': dtype('int64'), 'C': CategoricalDtype(categories=[0, 1, 2, 3, 4, 5], ordered=True)}
====================
{'A': dtype('int64'), 'B': dtype('int64'), 'C': dtype('int32')}
====================
    A  B  C
0  10  1  0
1  11  4  2
2  12  5  2
3  13  6  3
4  14  2  1
5  15  5  2
6  16  8  4
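The cast to int32 works here only because column C no longer contains NaN. The sketch below illustrates what is expected to happen otherwise: casting a categorical column that still holds NaN to an integer dtype should raise an error. It also shows .cat.codes as one possible fallback, which marks missing values with -1. The variable names are hypothetical, and the exact exception type may vary between pandas versions.

```python
import pandas as pd

b = pd.Series([1, 4, 5, 6, 2, 5, 8])
# Same edges as the first example: the value 1 is left unbinned (NaN)
c = pd.cut(b, [1, 3, 5, 7, 8], labels=[0, 1, 2, 3])

try:
    c.astype('int32')
    converted = True
except (ValueError, TypeError):
    # NaN has no integer representation, so the cast is rejected
    converted = False
print(converted)

# .cat.codes gives the position of each value's category; -1 marks a missing value
codes = c.cat.codes
print(codes.tolist())
```

Because the labels here are 0 to 3 in order, the codes coincide with the labels for every binned value.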