0 引言
Pandas中不仅可以调用 .concat 函数对多个表格进行合并,还可以调用 .merge 函数进行多个表格的合并。
1 .merge合并DataFrame表格
import pandas as pd
生成多个DataFrame表格
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
print(right)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
掉用 .merge 函数并依据’key’进行两个表格的合并
res = pd.merge(left,right,on='key')
res
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
left = pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K1','K3'],
'key2':['K0','K0','K0','K0'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
print(right)
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K3 K0 C3 D3
调用 .merge,并依据多个key,设置how为’outer’,结果就是两个表格的位置一样的值就完全合并,不一样的值就设置为空值(NaN)
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer')# how默认inner
res
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
5 | K3 | K0 | NaN | NaN | C3 | D3 |
how 设置为 ‘inner’ ,有空值的行会直接删掉
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='inner')# how默认inner
res
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
how 设置为’left‘,只会考虑左边的DataFrame
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='left')# how默认inner
res
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
继续在 .merge 函数中添加参数 indicator ,显示merge信息
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)# 显示merge信息
res
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
5 | K3 | K0 | NaN | NaN | C3 | D3 | right_only |
也可以为参数 indicator 设置名字
# how = ['left','right','inner','outer']
res = pd.merge(left,right,on=['key1','key2'],how='outer',indicator='indicator_column')# 显示merge信息
res
key1 | key2 | A | B | C | D | indicator_column | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
5 | K3 | K0 | NaN | NaN | C3 | D3 | right_only |
left = pd.DataFrame({'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']},
index = ['K0','K1','K2','K3'],)
right = pd.DataFrame({'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']},
index = ['K0','K1','K2','K3'],)
print(left)
print(right)
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
C D
K0 C0 D0
K1 C1 D1
K2 C2 D2
K3 C3 D3
调用 .merge 也可以像 .concat 一样依据索引 index 进行合并
res = pd.merge(left,right,left_index=True,right_index=True,how='outer')
res
A | B | C | D | |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | C1 | D1 |
K2 | A2 | B2 | C2 | D2 |
K3 | A3 | B3 | C3 | D3 |
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})
print(boys)
print(girls)
k age
0 K0 1
1 K1 2
2 K2 3
k age
0 K0 4
1 K0 5
2 K3 6
参数suffixes 可以为合并后的表格重新修改表头,区分不同表格的表头
res = pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='outer')
res
k | age_boy | age_girl | |
---|---|---|---|
0 | K0 | 1.0 | 4.0 |
1 | K0 | 1.0 | 5.0 |
2 | K1 | 2.0 | NaN |
3 | K2 | 3.0 | NaN |
4 | K3 | NaN | 6.0 |