[Python] Pandas之merge groupby

1.merge用来连接两个DataFrame对象,参数on选择基于哪个列进行连接,how选择内连接,左连接还是全连接的方式。merge操作类似于SQL中的join操作。

import pandas as pd
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
df2 = pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
print df1
print df2
#    data1 key
# 0      0   b
# 1      1   b
# 2      2   a
# 3      3   c
# 4      4   a
# 5      5   b
#    data2 key
# 0      0   a
# 1      1   b
# 2      2   a
# 3      3   b
# 4      4   d
df3 = pd.merge(df1,df2,on = 'key',how = 'left')
print df3
#     data1 key  data2
# 0       0   b    1.0
# 1       0   b    3.0
# 2       1   b    1.0
# 3       1   b    3.0
# 4       2   a    0.0
# 5       2   a    2.0
# 6       3   c    NaN
# 7       4   a    0.0
# 8       4   a    2.0
# 9       5   b    1.0
# 10      5   b    3.0

2.对于大数据,很可能要对于其中的部分列进行聚合,这里使用groupby

import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','one','two','one'],
                   'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})
print df
#       data1     data2 key1 key2
# 0  0.336673  0.540336    a  one
# 1 -0.821839 -1.348654    a  two
# 2  1.066305  0.230884    b  one
# 3  0.788950 -0.540482    b  two
# 4 -0.872019 -0.813607    a  one
df1 = df[['data1','data2']].groupby(df['key1']).sum()
print df1
#          data1     data2
# key1
# a     0.827444  1.512656
# b    -0.060730 -0.461793
print df.groupby('key1')[['data1','data2']].sum()   #简便写法
#          data1     data2
# key1
# a     0.827444  1.512656
# b    -0.060730 -0.461793
注意,这里还可以利用字典对表进行分组聚合,这里以行聚合为例(以列分组聚合只要修改字典并把axis设置成1)

import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','one','two','one'],
                   'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})
print df
#       data1     data2 key1 key2
# 0  0.336673  0.540336    a  one
# 1 -0.821839 -1.348654    a  two
# 2  1.066305  0.230884    b  one
# 3  0.788950 -0.540482    b  two
# 4 -0.872019 -0.813607    a  one
mapping = {0:'one',1:'one',2:'two',3:'two',4:'two'}
print df.groupby(mapping,axis = 0).sum()
#         data1     data2
# one  2.044990  0.916197
# two  2.310946 -2.240196



  • 0
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值