Datawhale December Team Study Notes (9): pandas Categorical Data


1/8s延时

Tags: python

Article link: https://blog.csdn.net/weixin_45943100/article/details/112341388


Mind map update: [image]
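Exercise 1: Counting categories that do not appear
Implement a my_crosstab function with a dropna parameter that reproduces the behavior above: when dropna=False, categories that are declared but never observed are still listed, with a count of 0.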

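import numpy as np
import pandas as pd

def my_crosstab(x1, x2, dropna=True):
    # Use the full category list only for categorical input with dropna=False;
    # otherwise fall back to the values that actually occur.
    idx1 = (x1.cat.categories if x1.dtype.name == 'category' and not dropna else x1.unique())
    idx2 = (x2.cat.categories if x2.dtype.name == 'category' and not dropna else x2.unique())
    # Start from an all-zero table, then count each (x1, x2) pair.
    res = pd.DataFrame(np.zeros((idx1.shape[0], idx2.shape[0])), index=idx1, columns=idx2)
    for i, j in zip(x1, x2):
        res.at[i, j] += 1
    res = res.rename_axis(index=x1.name, columns=x2.name).astype('int')
    return res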

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a'], 'B': ['cat', 'cat', 'dog', 'cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
my_crosstab(df.A, df.B)  # dropna=True: the unobserved 'sheep' column is omitted
print(my_crosstab(df.A, df.B, dropna=False))
# B  cat  dog  sheep
# A
# a    2    0      0
# b    1    0      0
# c    0    1      0
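
For a single column, pandas already behaves this way, which gives a quick way to sanity-check the exercise. The following is a minimal sketch on a toy series; the names `s`, `counts`, and `trimmed` are illustrative and not part of the original notes.

```python
import pandas as pd

# Toy categorical series with one declared-but-unobserved category ('sheep').
s = pd.Series(['cat', 'cat', 'dog', 'cat'], dtype='category')
s = s.cat.add_categories('sheep')

# value_counts on a categorical Series reports every declared category,
# including those with zero observations.
counts = s.value_counts()
print(counts)

# remove_unused_categories drops categories that never occur in the data.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```

Here `counts` contains an entry for 'sheep' even though it never occurs, which is the same idea my_crosstab extends to two columns.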

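Exercise 2: The diamonds dataset
We are given a dataset about diamonds in which carat, cut, clarity, and price denote the carat weight, cut quality, clarity, and price respectively. A sample is shown below:
[image]

1. Call nunique on df.cut once with object dtype and once with category dtype, and compare the performance.
2. Cut quality has five grades, from worst to best: Fair, Good, Very Good, Premium, Ideal. Clarity has eight grades, from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. Sort the data by cut quality from best to worst; among diamonds with the same cut quality, sort by clarity from worst to best.
3. Using two different methods, map the cut and clarity columns to the integers 0 to n-1 in best-to-worst order, where n is the number of categories.
4. Bin the price per carat into the five categories Very Low, Low, Mid, High, Very High in two ways: by quantiles (q=[0.2, 0.4, 0.6, 0.8]) and by the cut points [1000, 3500, 5500, 18000]. Add the two resulting category series to the original table, in that order.
5. In the series obtained in question 4 by binning at the integer cut points, do all categories appear? If a category does not appear, remove it.
6. For the series obtained in question 4 by quantile binning, find the left endpoint, right endpoint, and length of the interval each sample falls into.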
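
For question 1, one possible way to set up the comparison is sketched below. Since data/diamonds.csv is not available here, the sketch assumes a synthetic stand-in column (`cut_obj`, `cut_cat`) with the same five labels. The timings depend on the machine and pandas version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for df.cut (assumption: same five labels as the real column).
rng = np.random.default_rng(0)
levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_obj = pd.Series(rng.choice(levels, size=50_000), dtype=object)  # object dtype
cut_cat = cut_obj.astype('category')                                # category dtype

# Both dtypes should agree on the number of distinct values.
n_obj = cut_obj.nunique()
n_cat = cut_cat.nunique()
print(n_obj, n_cat)

# Time 20 calls of nunique on each representation.
t_obj = timeit.timeit(cut_obj.nunique, number=20)
t_cat = timeit.timeit(cut_cat.nunique, number=20)
print(f"object: {t_obj:.4f}s  category: {t_cat:.4f}s")
```

A category column stores integer codes plus a small table of labels, so the categorical call would typically be expected to be faster. It is worth measuring on the real data rather than assuming.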

import pandas as pd

# Question 2: sort by cut (best to worst), then by clarity (worst to best).
df = pd.read_csv('data/diamonds.csv')
# Convert to category dtype, then declare the order (worst to best).
df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], ordered=True)
res = df.sort_values(['cut', 'clarity'], ascending=[False, True])  # cut descending, clarity ascending
print(res.head())
#      carat    cut clarity  price
# 315   0.96  Ideal      I1   2801
# 535   0.96  Ideal      I1   2826
# 551   0.97  Ideal      I1   2830
# 653   1.01  Ideal      I1   2844
# 718   0.97  Ideal      I1   2856
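
The sort above works because an ordered categorical is sorted by its declared category order rather than alphabetically. A small illustration on toy data (the series `s` and `s_cat` are made up for this sketch):

```python
import pandas as pd

grades = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']  # worst -> best
s = pd.Series(['Premium', 'Fair', 'Ideal', 'Good'])

# Plain strings sort alphabetically.
print(s.sort_values().tolist())
# ['Fair', 'Good', 'Ideal', 'Premium']

# An ordered categorical sorts by its declared category order instead.
s_cat = s.astype(pd.CategoricalDtype(grades, ordered=True))
print(s_cat.sort_values(ascending=False).tolist())
# ['Ideal', 'Premium', 'Good', 'Fair']
```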
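# Question 3: reverse the category order (best to worst), then read off the integer codes.
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
df.cut = df.cut.cat.codes  # the cut column is replaced by its integer codes
clarity_cat = df.clarity.cat.codes  # clarity codes are stored separately; df.clarity itself is unchanged
print(df.head())
#    carat  cut clarity  price
# 0   0.23    0     SI2    326
# 1   0.21    1     SI1    326
# 2   0.23    3     VS1    327
# 3   0.29    1     VS2    334
# 4   0.31    3     SI2    335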
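
The exercise asks for two different mapping methods, while the code above uses cat.codes and leaves the clarity column unmapped. One possible second method, sketched here on a toy sample rather than the real data, is an explicit dictionary applied with Series.map. The names `clarity_levels`, `mapping`, and `clarity_codes` are illustrative assumptions.

```python
import pandas as pd

clarity_levels = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']  # worst -> best
clarity = pd.Series(['SI2', 'SI1', 'VS1', 'VS2', 'IF'])  # toy sample, not the real data

# Best -> worst should receive 0 .. n-1, so reverse the list before enumerating.
mapping = {level: code for code, level in enumerate(clarity_levels[::-1])}
clarity_codes = clarity.map(mapping)
print(clarity_codes.tolist())
# [6, 5, 3, 4, 0]

# Cross-check against the cat.codes approach on the same toy sample.
via_codes = pd.Categorical(clarity, categories=clarity_levels[::-1], ordered=True).codes
print(via_codes.tolist())
# [6, 5, 3, 4, 0]
```

On the real table the same mapping could be assigned back with df.clarity = df.clarity.map(mapping), assuming every value in the column is one of the eight listed grades.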