Getting Started with pandas (Part 3): Grouping Data in pandas

go2coding

Last modified: 2022-04-07 13:26:09

First published: 2022-04-02 14:45:02

Article link: https://blog.csdn.net/weixin_40425640/article/details/123920008

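Grouping is one of the most common operations in day-to-day data work. In SQL, the GROUP BY clause handles it flexibly. pandas offers the same grouping capability, and for some tasks it is more convenient than SQL.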

Data analysts who work with SQL will find the following grouping statement very familiar:

SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2

This groups the rows of SomeTable by Column1 and Column2, then computes the average of Column3 and the sum of Column4 for each group.

Let's see how to do the same thing in pandas.

The sample data is built as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

The resulting table (columns C and D are random, so the values differ on every run):

     A      B         C         D
0  foo    one  0.979662  0.279855
1  bar    one -0.953565 -1.208252
2  foo    two  0.674858 -1.309619
3  bar  three -0.955485  0.117016
4  foo    two  0.847228 -0.188685
5  bar    two  0.028046  0.594465
6  foo    one  1.696228 -0.835612
7  foo  three  0.326754  1.910218

Group by column A and list the distinct groups it produces:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

print(df)

print()
print('All Group:')
# Iterating over .groups yields the group keys (the distinct values of A)
for group in df.groupby('A').groups:
    print(group)

Output:

     A      B         C         D
0  foo    one  0.987971  0.189223
1  bar    one  1.099048  2.043736
2  foo    two  0.440967 -0.425073
3  bar  three  1.689837  0.476937
4  foo    two  0.700422 -0.873579
5  bar    two  0.321166  2.573001
6  foo    one -0.452850  0.166900
7  foo  three -0.200025 -0.125616

All Group:
bar
foo
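
To look inside the groups rather than only list their names, a small sketch along these lines may help. It replaces the random columns with fixed integers (an assumption made here so the result is reproducible) and uses size() and get_group(), both standard GroupBy methods.

```python
import pandas as pd

# Same A/B layout as the article; C holds fixed integers
# (an assumption for reproducibility, not the article's random data).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, 2, 3, 4, 5, 6, 7, 8],
})

grouped = df.groupby('A')

# Number of rows in each group
group_sizes = grouped.size()
print(group_sizes)

# All rows that belong to one particular group
bar_rows = grouped.get_group('bar')
print(bar_rows)
```

get_group() returns an ordinary DataFrame, so any further filtering or computation can be applied to a single group.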

Compute the sum of each numeric column within each group:

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

print(df)
print()
# numeric_only=True skips the string column B, so only C and D are summed
print(df.groupby(['A']).sum(numeric_only=True))

Output:

     A      B         C         D
0  foo    one  0.241693 -0.390582
1  bar    one  1.030158 -1.204830
2  foo    two  1.644640  0.028990
3  bar  three -0.271512  0.589653
4  foo    two -0.206261  0.874401
5  bar    two  0.899257  0.208874
6  foo    one  1.091846  0.157223
7  foo  three -1.843262 -0.716281

            C         D
A                      
bar  1.657903 -0.406302
foo  0.928656 -0.046249

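Much like a map-style computation, pandas can also apply a function to each group through aggregate (agg is an alias). The sum above can equivalently be written as df.groupby(['A'])[['C', 'D']].aggregate('sum'). Recent pandas versions may warn when np.sum is passed, so the string name 'sum' is the safer choice.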
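
As a rough illustration of what aggregate allows beyond a single sum, the sketch below applies two aggregations at once. The four-row table and its values are made up for this example, not taken from the article's data.

```python
import pandas as pd

# Small illustrative table (values are assumptions for this sketch)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Select the numeric columns explicitly, then apply several aggregations at once
summary = df.groupby('A')[['C', 'D']].agg(['sum', 'mean'])
print(summary)

# The result has two-level column labels: (column, aggregation)
foo_c_sum = summary.loc['foo', ('C', 'sum')]
print(foo_c_sum)
```

Passing a list of function names yields one output column per (input column, function) pair, which is why the columns become a two-level index.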

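Different columns can also be aggregated in different ways. Here we take the mean of C and the standard deviation of D, in the same spirit as the SQL statement above: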

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

print(df)
print()
# The dict maps each column to its own aggregation; keys are listed
# in the order the columns should appear in the result
print(df.groupby(['A']).agg({'C': 'mean', 'D': 'std'}))

Output:

            C         D
A                      
bar -0.262869  1.200517
foo -0.481279  1.500957
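
For completeness, here is one possible pandas counterpart to the SQL statement at the start of the article, grouping by two columns and computing a mean and a sum. This is a sketch under stated assumptions: the numeric values are fixed for reproducibility, and the output column names C_mean and D_sum are hypothetical names chosen for illustration.

```python
import pandas as pd

# Same A/B layout as the article; C and D are fixed values
# (an assumption so the result is reproducible).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Roughly: SELECT A, B, AVG(C), SUM(D) FROM df GROUP BY A, B
# as_index=False keeps A and B as regular columns, like a SQL result set.
result = (
    df.groupby(['A', 'B'], as_index=False)
      .agg(C_mean=('C', 'mean'), D_sum=('D', 'sum'))
)
print(result)
```

Named aggregation (the keyword=(column, function) form) gives each output column an explicit name, which plays a role similar to AS aliases in SQL.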
