《利用python进行数据分析》读书笔记之联合与合并数据集

数据库风格的DataFrame连接

我们可以使用pandas中的merge函数,将两个DataFrame相连,先看多对一的列子:

import pandas as pd

df1 = pd.DataFrame({"key":list('bbacaab'),'data1':range(7)})
#   key  data1
# 0   b      0
# 1   b      1
# 2   a      2
# 3   c      3
# 4   a      4
# 5   a      5
# 6   b      6
df2 = pd.DataFrame({'key':list('abd'),'data2':range(3)})
#   key  data2
# 0   a      0
# 1   b      1
# 2   d      2
print(pd.merge(df1,df2))
#   key  data1  data2
# 0   b      0      1
# 1   b      1      1
# 2   b      6      1
# 3   a      2      0
# 4   a      4      0
# 5   a      5      0

对以上代码进行以下说明:

  1. 当没有指定按哪一列进行连接的情况下,会默认将重列列名作为连接的键。可以使用left_on、right_on(在需要连接的列名不同的情况下)和on(相同的情况下)来指定连接的列。
    2.我们可以看到结果中缺少c和d的值,这是因为merge采取的是内连接(inner),结果中的键是两张表的交集,其他连接方式还有外连接(outer)、左连接(left)和右连接(right),其中外连接是键的并集。
print(pd.merge(df1,df2,how='outer'))
#   key  data1  data2
# 0   b    0.0    1.0
# 1   b    1.0    1.0
# 2   b    6.0    1.0
# 3   a    2.0    0.0
# 4   a    4.0    0.0
# 5   a    5.0    0.0
# 6   c    3.0    NaN
# 7   d    NaN    2.0
print(pd.merge(df1,df2,how='left'))
# key  data1  data2
# 0   b      0    1.0
# 1   b      1    1.0
# 2   a      2    0.0
# 3   c      3    NaN
# 4   a      4    0.0
# 5   a      5    0.0
# 6   b      6    1.0
print(pd.merge(df1,df2,how='right'))
# key  data1  data2
# 0   b    0.0      1
# 1   b    1.0      1
# 2   b    6.0      1
# 3   a    2.0      0
# 4   a    4.0      0
# 5   a    5.0      0
# 6   d    NaN      2

下面考虑多对多(两个DataFrame中一个键值对应着多个数值)的合并:

import pandas as pd

df1 = pd.DataFrame({'key':list('bbacab'),
                    'data1':range(6)})
#   key  data1
# 0   b      0
# 1   b      1
# 2   a      2
# 3   c      3
# 4   a      4
# 5   b      5
df2 = pd.DataFrame({'key':list('ababd'),
                    'data2':range(5)})
#   key  data2
# 0   a      0
# 1   b      1
# 2   a      2
# 3   b      3
# 4   d      4
print(pd.merge(df1,df2,on = 'key',how = 'outer'))
#    key  data1  data2
# 0    b      0    1.0
# 1    b      0    3.0
# 2    b      1    1.0
# 3    b      1    3.0
# 4    a      2    0.0
# 5    a      2    2.0
# 6    c      3    NaN
# 7    a      4    0.0
# 8    a      4    2.0
# 9    b      5    1.0
# 10   b      5    3.0

这里我们注意到,df1中有3个’b’行,df2中有2个’b’行,因此结果中有6个’b’行

使用多个键进行合并时,传入的是一个列名的列表:

import pandas as pd

left = pd.DataFrame({'key1':['foo','foo','bar'],
                     'key2':['one','two','one'],
                     'lval':[1,2,3]})
#   key1 key2  lval
# 0  foo  one     1
# 1  foo  two     2
# 2  bar  one     3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
                      'key2':['one','one','one','two'],
                      'rval':[4,5,6,7]})
#   key1 key2  rval
# 0  foo  one     4
# 1  foo  one     5
# 2  bar  one     6
# 3  bar  two     7
print(pd.merge(left,right,on = ['key1','key2'],how = 'outer'))
#   key1 key2  lval  rval
# 0  foo  one   1.0   4.0
# 1  foo  one   1.0   5.0
# 2  foo  two   2.0   NaN
# 3  bar  one   3.0   6.0
# 4  bar  two   NaN   7.0

在上面的例子中,倘若我们连接的键是’key1’而不是’key1’和’key2’,那么就会产生重名问题,这种情况下pandas会默认为其加上后缀:

print(pd.merge(left,right,on = 'key1',how = 'outer'))
#   key1 key2_x  lval key2_y  rval
# 0  foo    one     1    one     4
# 1  foo    one     1    one     5
# 2  foo    two     2    one     4
# 3  foo    two     2    one     5
# 4  bar    one     3    one     6
# 5  bar    one     3    two     7

我们可以指定suffixes来指定标签在重名后添加的后缀:

print(pd.merge(left,right,on = 'key1',how = 'outer',suffixes=('_left','_right')))
#   key1 key2_left  lval key2_right  rval
# 0  foo       one     1        one     4
# 1  foo       one     1        one     5
# 2  foo       two     2        one     4
# 3  foo       two     2        one     5
# 4  bar       one     3        one     6
# 5  bar       one     3        two     7

以下是merge函数的参数:

参数描述
left合并时操作左边的DataFrame
right合并时操作左边的DataFrame
how连接的方式
on需要连接的列名。必须是两边的DataFrame对象都有的列名,并以left和right中的列名的交集作为连接键
left_onleft DataFrame中用作连接键的列
right_onright DataFrame中用作连接键的列
sort通过连接键的字母顺序对合并的数据进行排序;默认为True
suffixes在重叠的情况下,添加到列名后的字符串元组;默认是(_x,_y)
copy如果为False,则在某些特殊情况下避免将数据复制到结果数据结果中,默认总是复制
indicator添加一个列_merge,指示每一行的来源,值将根据每行中的连接数据的来源分别为‘left_only’,‘right_only’,‘both’

搜索索引合并

我们可以将right_index和left_index的值指定为True来将索引作为合并的键:

import pandas as pd

left = pd.DataFrame({'key':list('abaabc'),
                     'value':range(6)})
#   key  value
# 0   a      0
# 1   b      1
# 2   a      2
# 3   a      3
# 4   b      4
# 5   c      5
right = pd.DataFrame({'group_val':[3.5,7]},
                      index = ['a','b'])
#    group_val
# a        3.5
# b        7.0

print(pd.merge(left,right,left_on='key',right_index=True,how = 'outer'))
#   key  value  group_val
# 0   a      0        3.5
# 2   a      2        3.5
# 3   a      3        3.5
# 1   b      1        7.0
# 4   b      4        7.0
# 5   c      5        NaN

当具有多层索引时,情况更加的复杂:

left = pd.DataFrame({'key1':['O','O','O','N','N'],
                    'key2':[2000,2001,2002,2001,2002],
                    'data':np.arange(5.)})
#   key1  key2  data
# 0    O  2000   0.0
# 1    O  2001   1.0
# 2    O  2002   2.0
# 3    N  2001   3.0
# 4    N  2002   4.0
right = pd.DataFrame(np.arange(12).reshape((6,2)),
                                           index = [['N','N','O','O','O','O'],
                                                    [2001,2000,2000,2000,2001,2002]],
                                        columns = ['envent1','evnet2'])
#         envent1  evnet2
# N 2001        0       1
#   2000        2       3
# O 2000        4       5
#   2000        6       7
#   2001        8       9
#   2002       10      11

print(pd.merge(left,right,left_on=['key1','key2'],right_index=True))
#   key1  key2  data  envent1  evnet2
# 0    O  2000   0.0        4       5
# 0    O  2000   0.0        6       7
# 1    O  2001   1.0        8       9
# 2    O  2002   2.0       10      11
# 3    N  2001   3.0        0       1
print(pd.merge(left,right,left_on=['key1','key2'],right_index=True,how='outer'))
#   key1  key2  data  envent1  evnet2
# 0    O  2000   0.0      4.0     5.0
# 0    O  2000   0.0      6.0     7.0
# 1    O  2001   1.0      8.0     9.0
# 2    O  2002   2.0     10.0    11.0
# 3    N  2001   3.0      0.0     1.0
# 4    N  2002   4.0      NaN     NaN
# 4    N  2000   NaN      2.0     3.0

也可以使用两边的索引进行合并,用法与上面类似,将left_index和right_index都设置为True即可。对于索引之间的合并我们还可以使用join方法:

left2 = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],
                     index=list('abc'),columns=['O','N'])
#      O    N
# a  1.0  2.0
# b  3.0  4.0
# c  5.0  6.0
right2 = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13,14]],
                      index = list('abcd'),columns=['M','A'])
#       M     A
# a   7.0   8.0
# b   9.0  10.0
# c  11.0  12.0
# d  13.0  14.0


print(pd.merge(left2,right2,right_index=True,left_index=True))
#      O    N     M     A
# a  1.0  2.0   7.0   8.0
# b  3.0  4.0   9.0  10.0
# c  5.0  6.0  11.0  12.0
print(left2.join(right2,how='outer'))
#      O    N     M     A
# a  1.0  2.0   7.0   8.0
# b  3.0  4.0   9.0  10.0
# c  5.0  6.0  11.0  12.0
# d  NaN  NaN  13.0  14.0

沿轴向连接

我们可以使用pandas中的cancat函数来进行轴向连接,默认是沿着axis=0生效的,此时会返回一个Series,如果axis=1则返回一个DataFrame:

s1 = pd.Series([0,1],index=list('ab'))
s2 = pd.Series([2,3,4],index=list('cde'))
s3 = pd.Series([5,6],index=list('fg'))

print(pd.concat([s1,s2,s3]))
# a    0
# b    1
# c    2
# d    3
# e    4
# f    5
# g    6
# dtype: int64
print(pd.concat([s1,s2,s3],axis = 1,sort=False))
#      0    1    2
# a  0.0  NaN  NaN
# b  1.0  NaN  NaN
# c  NaN  2.0  NaN
# d  NaN  3.0  NaN
# e  NaN  4.0  NaN
# f  NaN  NaN  5.0
# g  NaN  NaN  6.0

如果我们在进行连接时想加上索引以进行区分,可以使用keys参数实现:

print(pd.concat([s1,s2,s3],keys=['one','two','three']))
# one    a    0
#        b    1
# two    c    2
#        d    3
#        e    4
# three  f    5
#        g    6
# dtype: int64
print(pd.concat([s1,s2,s3],keys=['one','two','three'],axis=1,sort=False))
#    one  two  three
# a  0.0  NaN    NaN
# b  1.0  NaN    NaN
# c  NaN  2.0    NaN
# d  NaN  3.0    NaN
# e  NaN  4.0    NaN
# f  NaN  NaN    5.0
# g  NaN  NaN    6.0

对于DataFrame,concat的用法类似:

df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),
                   columns=['one','two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
df2 = pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','b'],
                   columns=['three','four'])
#    three  four
# a      5     6
# b      7     8

print(pd.concat([df1,df2],axis=1,keys=['lv1','lv2'],sort=False))
#   lv1       lv2
#   one two three four
# a   0   1   5.0  6.0
# b   2   3   7.0  8.0
# c   4   5   NaN  NaN

对于按行连接,有:

df1 = pd.DataFrame(np.random.randn(3,4),columns=list('abcd'))
#           a         b         c         d
# 0 -0.804995  0.943988 -1.182787 -0.137897
# 1  0.134306  1.961522  0.752514 -0.100846
# 2 -0.129814  0.036155 -0.449419  0.005875
df2 = pd.DataFrame(np.random.randn(2,3),columns=list('bda'))
#           b         d         a
# 0 -1.404842  2.205611  0.597599
# 1  0.662636 -1.347026  0.069774
print(pd.concat([df1,df2],ignore_index=True,sort=False))
#           a         b         c         d
# 0 -0.804995  0.943988 -1.182787 -0.137897
# 1  0.134306  1.961522  0.752514 -0.100846
# 2 -0.129814  0.036155 -0.449419  0.005875
# 3  0.597599 -1.404842       NaN  2.205611
# 4  0.069774  0.662636       NaN -1.347026

其中ignore_index设为True则表示产生新的轴索引。
concat函数的参数见下表:

参数描述
objs需要连接的pandas对象列表或者字典;必选参数
axis连接的轴向;默认是0(沿着行方向)
join可以是’inner’或者’outer’ (默认为’outer’),用于指定连接方式是内连接或者外连接
join_axis用于指定其他n-1轴的特定索引,可以替代内/外连接的逻辑
keys与要连接的对象关联的值,沿着连接轴形成分层索引;可以是任意值的列表或者数组,也可以是元组的数组,也可以是数组的列表
levels在键值传递时,该参数用于指示多层索引的层级
names如果传入了keys或者levels参数,该参数用于多层索引的层级名称
verify_integrity检查连接对象中的新轴是否重复,如果是则引发异常,默认为False
ignore_index不沿着连接轴保留索引,而是产生一段新的索引

联合重叠数据

使用combine_frist方法可以将对象中的缺失部分用传入对象中对应的部分填补:

a = pd.Series([np.nan,2.5,0.0,3.5,4.5,np.nan],
              index=list('fedcba'))
# f    NaN
# e    2.5
# d    0.0
# c    3.5
# b    4.5
# a    NaN
# dtype: float64
b = pd.Series([0.,np.nan,2,np.nan,np.nan,5.],
              index=list('abcdef'))
# a    0.0
# b    NaN
# c    2.0
# d    NaN
# e    NaN
# f    5.0
print(b.combine_first(a))
# a    0.0
# b    4.5
# c    2.0
# d    0.0
# e    2.5
# f    5.0
# dtype: float64

在DataFrame对象中,combine_frist逐列做相同的操作:

df1 = pd.DataFrame({
    'a':[1.,np.nan,5.,np.nan],
    'b':[np.nan,2,np.nan,6.],
    'c':range(2,18,4)
})
#      a    b   c
# 0  1.0  NaN   2
# 1  NaN  2.0   6
# 2  5.0  NaN  10
# 3  NaN  6.0  14
df2 = pd.DataFrame({
    'a':[5.,4.,np.nan,3.,7.],
    'b':[np.nan,3.,4.,6.,8.]
})
#      a    b
# 0  5.0  NaN
# 1  4.0  3.0
# 2  NaN  4.0
# 3  3.0  6.0
# 4  7.0  8.0
print(df1.combine_first(df2))
#      a    b     c
# 0  1.0  NaN   2.0
# 1  4.0  2.0   6.0
# 2  5.0  4.0  10.0
# 3  3.0  6.0  14.0
# 4  7.0  8.0   NaN
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值