Pandas Exercises (8)

Ex1: Counting categories that do not appear

import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)
B  cat  dog
A
a    2    0
b    1    0
c    0    1
df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)
B  cat  dog  sheep
A
a    2    0      0
b    1    0      0
c    0    1      0
def my_crosstab(s1, s2, dropna=True):
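    # Row labels: all declared categories if s1 is categorical and dropna=False, otherwise the observed values
    idx1 = (s1.cat.categories if s1.dtype.name == 'category' and not dropna else s1.unique())
    # Column labels: the same rule applied to s2
    idx2 = (s2.cat.categories if s2.dtype.name == 'category' and not dropna else s2.unique())
    # Zero-filled table that the loop below fills with counts
    res = pd.DataFrame(np.zeros((idx1.shape[0], idx2.shape[0])), index=idx1, columns=idx2)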
    for i, j in zip(s1, s2):
        res.at[i, j] += 1
    res = res.rename_axis(index=s1.name, columns=s2.name).astype('int')
    return res
df = pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
my_crosstab(df.A, df.B)
my_crosstab(df.A, df.B, dropna=False)
B  cat  dog  sheep
A
a    2    0      0
b    1    0      0
c    0    1      0
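As a cross-check on the loop-based version, the same table can probably be obtained without a Python loop: let pd.crosstab count the observed pairs, then reindex the columns against the full category list. This is only a sketch. The names all_b and res are illustrative, and it pads only the column side (B), assuming A is not categorical.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a'],
                   'B': ['cat', 'cat', 'dog', 'cat']})
df['B'] = df['B'].astype('category').cat.add_categories('sheep')

# Count the observed (A, B) pairs, then add a zero column for every
# declared category of B that never occurs in the data.
all_b = list(df['B'].cat.categories)
res = pd.crosstab(df['A'], df['B']).reindex(columns=all_b, fill_value=0)
print(res)
```

If the row side were categorical as well, the same reindex call could be given an index= argument in the same way.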

Ex2: The diamonds dataset

data = [[0.13,'Fair','I1',326],[0.31,'Premium','SI2',326],[0.13,'Good','VS1',327],[0.23,'Ideal','SI2',326],[0.25,'Premium','VS2',326],[0.27,'Good','VS1',327],[0.28,'Ideal','SI2',326],[0.20,'Premium','SI1',326],[0.21,'Good','VVS1',327],[0.24,'Ideal','SI2',326],[0.26,'Premium','SI1',326],[0.24,'Very Good','VVS2',327],[0.23,'Ideal','SI2',326],[0.21,'Premium','SI1',326],[0.23,'Ideal','IF',327]]
df = pd.DataFrame(data=data,
                 columns =['carat','cut','clarity','price'])
1. Call nunique on df.cut under both the object dtype and the category dtype, and compare the performance.
s_obj, s_cat = df.cut, df.cut.astype('category')
%timeit -n 30 s_obj.nunique()
%timeit -n 30 s_cat.nunique()
74.5 µs ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
The slowest run took 4.54 times longer than the fastest. This could mean that an intermediate result is being cached.
551 µs ± 337 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
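With only 15 rows, the timings above are likely dominated by fixed per-call overhead, so they say little about how the two dtypes scale. A rough way to repeat the comparison on a larger Series is sketched below. The repetition factor and the use of the standard-library timeit module (instead of the %timeit magic) are assumptions made so that the snippet runs outside IPython. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

cuts = ['Fair', 'Premium', 'Good', 'Ideal', 'Very Good']
# 100,000 rows; dtype=object is set explicitly so the comparison is
# object vs. category regardless of the default string dtype.
s_obj = pd.Series(cuts * 20_000, dtype=object)
s_cat = s_obj.astype('category')

# Time 10 calls of each bound method and report the mean per call.
t_obj = timeit.timeit(s_obj.nunique, number=10)
t_cat = timeit.timeit(s_cat.nunique, number=10)
print(f'object:   {t_obj / 10 * 1e3:.3f} ms per call')
print(f'category: {t_cat / 10 * 1e3:.3f} ms per call')
```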
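2. Diamond cut quality has five grades, from worst to best: Fair, Good, Very Good, Premium, Ideal. Clarity has eight grades, from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. Sort the rows by cut quality from best to worst; among diamonds with the same cut, sort by clarity from worst to best.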
df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)
res = df.sort_values(['cut', 'clarity'], ascending=[False, True])
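The sort above relies on the categories being ordered: an ordered categorical sorts by position in the category list, not alphabetically. The small sketch below illustrates the difference on made-up values. It uses pd.CategoricalDtype instead of reorder_categories because the toy Series does not contain every grade, and reorder_categories expects exactly the existing set of categories.

```python
import pandas as pd

cut = pd.Series(['Good', 'Ideal', 'Fair', 'Premium'])
order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']  # worst to best
cut_cat = cut.astype(pd.CategoricalDtype(order, ordered=True))

# Plain strings sort lexically; the ordered categorical sorts by grade.
alphabetical = cut.sort_values(ascending=False).tolist()
by_quality = cut_cat.sort_values(ascending=False).tolist()
print(alphabetical)  # ['Premium', 'Ideal', 'Good', 'Fair']
print(by_quality)    # ['Ideal', 'Premium', 'Good', 'Fair']
```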
3. Using two different methods, map the cut and clarity columns, ordered from best to worst, to the integers 0 to n-1, where n is the number of categories.
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
df.cut = df.cut.cat.codes
clarity_cat = df.clarity.cat.categories
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))
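Both methods assign each label its position in the (reversed) category list. In recent pandas releases, calling replace on a categorical column may emit a deprecation warning, so the sketch below shows one possible variant of the second method that maps a dictionary over the plain object values instead. The toy data and the names mapping and mapped are illustrative, not part of the original solution.

```python
import pandas as pd

order = ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']  # best to worst
cut = pd.Series(['Good', 'Ideal', 'Fair', 'Ideal']).astype(
    pd.CategoricalDtype(order, ordered=True))

# Method 1: the code of each value is its position in the category list.
codes = cut.cat.codes

# Method 2: build the label -> position dictionary explicitly and map it
# over the values after converting them back to plain objects.
mapping = {label: i for i, label in enumerate(cut.cat.categories)}
mapped = cut.astype(object).map(mapping)

print(codes.tolist())   # [3, 0, 4, 0]
print(mapped.tolist())  # [3, 0, 4, 0]
```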
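4. Bin the price per carat in two ways, by quantiles (q=[0.2, 0.4, 0.6, 0.8]) and by the cut points [1000, 3500, 5500, 18000], to get the five categories Very Low, Low, Mid, High, Very High, and append the two resulting category Series to the original table in that order.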
q = [0, 0.2, 0.4, 0.6, 0.8, 1]
point = [-np.inf, 1000, 3500, 5500, 18000, np.inf]  # np.infty was removed in NumPy 2.0
avg = df.price / df.carat
df['avg_cut'] = pd.cut(avg, bins=point, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df['avg_qcut'] = pd.qcut(avg, q=q, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
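The two calls differ in where the bin edges come from: pd.cut uses the edges it is given, while pd.qcut derives them from the quantiles of the data, so each bin receives roughly the same number of samples. A minimal sketch on made-up values (not the diamonds data) shows the effect of a single outlier:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

# Fixed edge at 10: only the outlier lands in the upper bin.
fixed = pd.cut(values, bins=[-np.inf, 10, np.inf], labels=['low', 'high'])

# Edge at the median (3): the samples are split as evenly as possible.
by_rank = pd.qcut(values, q=[0, 0.5, 1], labels=['low', 'high'])

print(fixed.tolist())    # ['low', 'low', 'low', 'low', 'high']
print(by_rank.tolist())  # ['low', 'low', 'low', 'high', 'high']
```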
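5. In the Series obtained in question 4 by binning with the integer cut points, do all categories appear? If any category does not appear, remove it.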
df.avg_cut.unique()
['Low']
Categories (1, object): ['Low']
df.avg_cut.cat.categories
Index(['Very Low', 'Low', 'Mid', 'High', 'Very High'], dtype='object')
df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Mid', 'High', 'Very High'])
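Listing the unused labels by hand works here, but the cat accessor also provides remove_unused_categories, which drops whatever does not occur in the data. A short sketch follows, using three illustrative price-per-carat values in place of the full avg Series:

```python
import numpy as np
import pandas as pd

labels = ['Very Low', 'Low', 'Mid', 'High', 'Very High']
point = [-np.inf, 1000, 3500, 5500, 18000, np.inf]
avg = pd.Series([2507.7, 1051.6, 2515.4])  # illustrative values only
avg_cut = pd.cut(avg, bins=point, labels=labels)

# Drop every category that has no sample, without naming them explicitly.
trimmed = avg_cut.cat.remove_unused_categories()
print(trimmed.cat.categories)
```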
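6. For the Series obtained in question 4 by quantile binning, find the left endpoint, the right endpoint, and the length of the interval that each sample falls in.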
interval_avg = pd.IntervalIndex(pd.qcut(avg, q=q))
interval_avg.right.to_series().reset_index(drop=True).head(3)
0    2515.385
1    1245.299
2    2515.385
dtype: float64
interval_avg.left.to_series().reset_index(drop=True).head(3)
0    1571.714
1    1051.612
2    1571.714
dtype: float64
interval_avg.length.to_series().reset_index(drop=True).head(3)
0    943.671
1    193.687
2    943.671
dtype: float64
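The three quantities can also be gathered into one table. The sketch below takes a slightly different route from the one above: it reads the IntervalIndex from the categories and expands it to one interval per sample with the integer codes. The toy data, the bin edges, and the name summary are assumptions for illustration, and the approach presumes no value falls outside the bins (which would give a code of -1).

```python
import pandas as pd

s = pd.Series([5, 15, 25])
binned = pd.cut(s, bins=[0, 10, 30])

# The categories form an IntervalIndex; take() with the integer codes
# yields the interval of each individual sample.
intervals = binned.cat.categories.take(binned.cat.codes.to_numpy())

summary = pd.DataFrame({'left': intervals.left,
                        'right': intervals.right,
                        'length': intervals.length})
print(summary)
```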