Joyful-Pandas: Chapter 3, Grouping

Axiiiz

Category: pandas. Tags: python

Article link: https://blog.csdn.net/sinat_38059712/article/details/105766452

Chapter 3: Grouping

Chapter mind map
Chapter questions and exercises
Supplement

Chapter mind map

[Image: chapter mind map]

Chapter questions and exercises

[Exercise 1] Given a dataset about diamonds whose columns record carat weight, color, mining depth (`depth`), and price, solve the following problems:

(a) Among all diamonds weighing more than 1 carat, what is the range of the price (maximum minus minimum)?

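import numpy as np
import pandas as pd
df = pd.read_csv('data/Diamonds.csv')
df_r = df.query('carat>1')['price']  # prices of diamonds heavier than 1 carat
tp1 = df_r.max()-df_r.min()          # range = max - min
print(tp1)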

# 17561

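(b) Using the 0.2/0.4/0.6/0.8 quantiles of depth as bin edges, which color is the most common in each group? Is that color also the one with the highest average price per unit weight within the group?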

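bins = df['depth'].quantile(np.linspace(0,1,6)).tolist()  # edges at the 0, 0.2, ..., 1 quantiles
# pd.cut bins are left-open, so rows at the minimum depth get NaN unless include_lowest=True is passed
cuts = pd.cut(df['depth'],bins=bins) # optionally pass labels= to set custom labels
df['cuts'] = cuts
print(df.head())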

#   carat color	depth	price	cuts
# 0	0.23	E	61.5	326	(60.8, 61.6]
# 1	0.21	E	59.8	326	(43.0, 60.8]
# 2	0.23	E	56.9	327	(43.0, 60.8]
# 3	0.29	I	62.4	334	(62.1, 62.7]
# 4	0.31	J	63.3	335	(62.7, 79.0]

color_result = df.groupby('cuts')['color'].describe()
print(color_result)

#               count	unique	top	freq
# cuts				
# (43.0, 60.8]	11294	7	E	2259
# (60.8, 61.6]	11831	7	G	2593
# (61.6, 62.1]	10403	7	G	2247
# (62.1, 62.7]	10137	7	G	2193
# (62.7, 79.0]	10273	7	G	2000

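# The first three quantile intervals do not satisfy the condition; in the last two,
# the most common color is indeed the one with the highest average price per carat
df['均重价格']=df['price']/df['carat']  # price per carat
color_result['top'] == [i[1] for i in df.groupby(['cuts','color'])['均重价格'].mean().groupby(level='cuts').idxmax().values]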

# cuts
# (43.0, 60.8]    False
# (60.8, 61.6]    False
# (61.6, 62.1]    False
# (62.1, 62.7]     True
# (62.7, 79.0]     True
# Name: top, dtype: bool
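One detail of the binning step in part (b) is easy to miss: when the first edge passed to `pd.cut` is the minimum of the data, the first bin is open on the left and the minimum itself lands in no bin. The sketch below illustrates this on made-up depth values (they are not taken from Diamonds.csv) and compares it with `include_lowest=True` and with `pd.qcut`, which builds quantile bins directly.

```python
import numpy as np
import pandas as pd

# Toy depths, invented for illustration only
depth = pd.Series([43.0, 58.0, 60.5, 61.0, 61.8, 62.0, 62.5, 63.0, 65.0, 79.0])

# Same construction as in part (b): edges at the 0, 0.2, ..., 1 quantiles
bins = depth.quantile(np.linspace(0, 1, 6)).tolist()

open_left = pd.cut(depth, bins=bins)                          # first bin excludes the minimum
closed_left = pd.cut(depth, bins=bins, include_lowest=True)   # first bin includes the minimum

print(open_left.isna().sum())    # number of rows that fell outside every bin
print(closed_left.isna().sum())

# pd.qcut computes the quantile edges itself and keeps the minimum
by_qcut = pd.qcut(depth, q=5)
print(by_qcut.value_counts(sort=False))
```

Whether the dropped rows matter depends on the question being asked; for counting the most common color per bin they change the totals only slightly.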

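(c) Group by weight (0-0.5, 0.5-1, 1-1.5, 1.5-2, 2+), sort each group by increasing depth, and find the maximum length of a run of consecutive, strictly increasing prices in each group.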

df = df.drop(columns='均重价格')
cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.inf]) # optionally pass labels= to set custom labels
df['cuts'] = cuts
print(df.head())

#   carat	color	depth	price	cuts
# 0	0.23	E	61.5	326	(0.0, 0.5]
# 1	0.21	E	59.8	326	(0.0, 0.5]
# 2	0.23	E	56.9	327	(0.0, 0.5]
# 3	0.29	I	62.4	334	(0.0, 0.5]
# 4	0.31	J	63.3	335	(0.0, 0.5]

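def f(nums):
    """Return the length of the longest run of consecutive strictly increasing values in a list."""
    if not nums:
        return 0
    res = 1      # best run length found so far
    cur_len = 1  # length of the run ending at the current element
    for i in range(1, len(nums)):
        if nums[i-1] < nums[i]:
            cur_len += 1
            res = max(cur_len, res)
        else:
            cur_len = 1
    return res

for name,group in df.groupby('cuts'):
    # Order each weight group by increasing depth; rows with equal depth have no
    # defined order here, so the result can depend on the sort algorithm
    group = group.sort_values(by='depth')
    s = group['price']
    print(name,f(s.tolist()))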

# (0.0, 0.5] 8
# (0.5, 1.0] 8
# (1.0, 1.5] 7
# (1.5, 2.0] 11
# (2.0, inf] 7
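The run-length computation in part (c) can also be written without an explicit Python loop. The following is a sketch of one possible vectorized alternative, not the author's method: the helper name `longest_increasing_run` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def longest_increasing_run(s: pd.Series) -> int:
    """Length of the longest run of consecutive strictly increasing values."""
    if s.empty:
        return 0
    rising = s.diff().gt(0)          # True where a value exceeds its predecessor
    run_id = (~rising).cumsum()      # every non-rising position starts a new run
    return int(run_id.value_counts().max())  # largest number of elements in one run

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 4,
    'depth': [3, 1, 2, 5, 4, 1, 2, 3, 4],
    'price': [30, 10, 20, 5, 40, 9, 9, 8, 7],
})

# A stable sort keeps the original order among rows with equal depth
result = (toy.sort_values('depth', kind='stable')
             .groupby('group')['price']
             .apply(longest_increasing_run))
print(result)
```

In group `a` the prices ordered by depth are 10, 20, 30, 40, 5, so the longest run has length 4; in group `b` no price exceeds its predecessor, so the answer is 1.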

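(d) Group by color and compute, for each group, the regression coefficient of price on carat. (Simple linear regression with a single variable, using only Pandas and NumPy.)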

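for name,group in df[['carat','price','color']].groupby('color'):
    n = group.shape[0]
    X = np.array([np.ones(n),group['carat']])  # 2 x n design matrix: a row of ones and the carat values
    y = group['price'].to_numpy()
    # Normal equations; observations are stored in columns, so beta = (X X^T)^{-1} X y
    result = np.linalg.inv(X.dot(X.T)).dot(X).dot(y)
    print('For color %s, intercept: %f, regression coefficient: %f'%(name,result[0],result[1]))

# For color D, intercept: -2361.017152, regression coefficient: 8408.353126
# For color E, intercept: -2381.049600, regression coefficient: 8296.212783
# For color F, intercept: -2665.806191, regression coefficient: 8676.658344
# For color G, intercept: -2575.527643, regression coefficient: 8525.345779
# For color H, intercept: -2460.418046, regression coefficient: 7619.098320
# For color I, intercept: -2878.150356, regression coefficient: 7761.041169
# For color J, intercept: -2920.603337, regression coefficient: 7094.192092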
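For a single predictor, the normal-equation solution in part (d) reduces to the familiar closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below checks this on made-up data that lies exactly on known lines (the data and the names `toy`, `line`, and `coef` are illustrative assumptions), and cross-checks the result with `np.polyfit`.

```python
import numpy as np
import pandas as pd

# Made-up data on exact lines: price = 100 + 50*carat for D, 20 + 30*carat for E
toy = pd.DataFrame({
    'color': ['D'] * 4 + ['E'] * 4,
    'carat': [0.5, 1.0, 1.5, 2.0, 0.4, 0.8, 1.2, 2.0],
})
line = {'D': (100.0, 50.0), 'E': (20.0, 30.0)}
toy['price'] = [line[c][0] + line[c][1] * x for c, x in zip(toy['color'], toy['carat'])]

rows = {}
for color, group in toy.groupby('color'):
    x, y = group['carat'], group['price']
    slope = x.cov(y) / x.var()               # sample covariance / sample variance
    intercept = y.mean() - slope * x.mean()  # the fitted line passes through the means
    rows[color] = {'intercept': intercept, 'slope': slope}

coef = pd.DataFrame.from_dict(rows, orient='index')
print(coef)

# Cross-check against NumPy's least-squares polynomial fit of degree 1
for color, group in toy.groupby('color'):
    slope_np, intercept_np = np.polyfit(group['carat'], group['price'], 1)
    assert np.allclose([intercept_np, slope_np], coef.loc[color].to_numpy())
```

Compared with inverting a matrix, the closed form avoids `np.linalg.inv` entirely, which is usually preferable when there is only one predictor.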

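[Exercise 2] Given a dataset on illegal drugs in the United States from 2010 to 2017 whose columns record the year, state (5 states), county, substance type, and number of reports, solve the following problems: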

(a) For each year, which county has the most reports? Is the state that county belongs to also the state with the most reports in that year?

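df = pd.read_csv('data/Drugs.csv')
# Total reports per county and per state for every year; grouping by State as well
# keeps same-named counties in different states apart
county_sum = df.groupby(['YYYY','State','COUNTY'])['DrugReports'].sum()
state_sum = df.groupby(['YYYY','State'])['DrugReports'].sum()
for i in range(2010,2018):
    state, county = county_sum.loc[i].idxmax()  # (State, COUNTY) label of the largest county total
    state_true = state_sum.loc[i].idxmax()      # state with the largest total in that year
    if state==state_true:
        print('In %d, %s County had the most reports, and its state %s also had the most reports'%(i,county,state))
    else:
        print('In %d, %s County had the most reports, but its state %s did not; %s had the most reports'%(i,county,state,state_true))

# In 2010, PHILADELPHIA County had the most reports, and its state PA also had the most reports
# In 2011, PHILADELPHIA County had the most reports, but its state PA did not; OH had the most reports
# In 2012, PHILADELPHIA County had the most reports, but its state PA did not; OH had the most reports
# In 2013, PHILADELPHIA County had the most reports, but its state PA did not; OH had the most reports
# In 2014, PHILADELPHIA County had the most reports, but its state PA did not; OH had the most reports
# In 2015, PHILADELPHIA County had the most reports, but its state PA did not; OH had the most reports
# In 2016, HAMILTON County had the most reports, and its state OH also had the most reports
# In 2017, HAMILTON County had the most reports, and its state OH also had the most reports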
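The per-year lookup in part (a) can also be done without a loop by calling `idxmax` on a grouped Series. The sketch below uses a small made-up table with the same column names as the code above; the county names `ALPHA`, `BETA`, `GAMMA` and all numbers are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'YYYY':        [2010, 2010, 2010, 2011, 2011, 2011],
    'State':       ['PA', 'OH', 'OH', 'PA', 'OH', 'OH'],
    'COUNTY':      ['ALPHA', 'BETA', 'GAMMA', 'ALPHA', 'BETA', 'GAMMA'],
    'DrugReports': [50, 30, 40, 60, 20, 30],
})

county_sum = toy.groupby(['YYYY', 'State', 'COUNTY'])['DrugReports'].sum()
state_sum = toy.groupby(['YYYY', 'State'])['DrugReports'].sum()

# For each year, idxmax returns the full index label of the largest total
top_county = county_sum.groupby(level='YYYY').idxmax()  # (year, state, county) tuples
top_state = state_sum.groupby(level='YYYY').idxmax()    # (year, state) tuples

summary = pd.DataFrame({
    'top_county': [label[2] for label in top_county],
    'county_state': [label[1] for label in top_county],
    'top_state': [label[1] for label in top_state],
}, index=top_county.index)
summary['same_state'] = summary['county_state'] == summary['top_state']
print(summary)
```

In the toy table, `ALPHA` leads in both years, but its state `PA` has the largest state total only in 2011, because the two `OH` counties together outweigh it in 2010.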

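(b) From 2014 to 2015, which state saw the largest increase in the number of Heroin reports? Within that state, is Heroin the substance with the largest increase among all substances? If not, find the substance that is.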

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
df_add = df_b.groupby(['YYYY','State'])[['DrugReports']].sum()  # sum only the report counts
(df_add.loc[2015]-df_add.loc[2014]).idxmax()

# DrugReports    OH
# dtype: object

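from IPython.display import display  # only needed when running outside a notebook
df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')]
df_add = df_b.groupby(['YYYY','SubstanceName'])[['DrugReports']].sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax()) # largest absolute increase; the subtraction relies on index alignment
display((df_add.loc[2015]/df_add.loc[2014]).idxmax()) # largest relative increase (ratio)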

# DrugReports    Heroin
# dtype: object
# DrugReports    Acetyl fentanyl
# dtype: object
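The two `idxmax` calls in part (b) answer different questions, and both depend on pandas aligning the two years by label rather than by position. A minimal sketch with made-up counts (the numbers and the names `Substance X` and `Substance Y` are hypothetical) shows the behavior:

```python
import pandas as pd

# Toy yearly totals, invented for illustration only; note the different order and labels
reports_2014 = pd.Series({'Heroin': 100, 'Acetyl fentanyl': 10, 'Substance X': 5})
reports_2015 = pd.Series({'Acetyl fentanyl': 40, 'Heroin': 150, 'Substance Y': 7})

diff = reports_2015 - reports_2014   # matched by label; labels missing in one year give NaN
ratio = reports_2015 / reports_2014

print(diff)
print(diff.idxmax())    # label with the largest absolute increase
print(ratio.idxmax())   # label with the largest relative increase
```

Labels present in only one year produce NaN and are skipped by `idxmax`, so a substance first reported in 2015 cannot win either comparison under this approach.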