Pandas.merge()的基本用法
pandas.merge(left,right,how=‘inner’,on=None,left_on=None,right_on=None,
left_index=False,right_index=False,sort=False,suffixes=(’_x’, ‘_y’), copy=True,indicator=False,validate=None)
l_df.merge(r_df,how=‘inner’,on=None,left_on=None,right_on=None,
left_index=False,right_index=False,sort=False,suffixes=(’_x’, ‘_y’),
copy=True,indicator=False,validate=None)
参数:
merge需要依据共同的某一列或者某一行来进行合并
left: 左表(DataFrame)
right:右表(DataFrame)
how:连接方式
left: 仅保留左表的键
right: 仅保留右表的键
outer:两表的键取并集
inner:两表的键取交集
on:用来对齐的那一列的名字,用到这个参数的时候一定要保证左表和右表用来对齐的 那一列都有相同的列名.
left_on :左侧DataFarme中用作连接键的列.
right_on:右侧DataFarme中用作连接键的列.
left_index : 将左侧的行索引用作其连接键 .
right_index :将右侧的行索引用作其连接键 .
sort :根据连接键对合并后的数据进行排序,默认为True.有时在处理大数据集时,禁用该选项可获得更好的性能.
suffixes :字符串值元组,用于追加到重叠列名的末尾,默认为(‘_x’,‘_y’).例如,左右两个DataFrame对象都有‘data’,则结果中就会出现‘data_x’,‘data_y’.
copy :设置为False,可以在某些特殊情况下避免将数据复制到结果数据结构中.
import pandas as pd
## 一、参数on
df1 = pd.DataFrame({'name':['Tom','Tom','Bob','Bob','Tom','Bob','Jack'],
'data1':range(7)})
df2 = pd.DataFrame({'name':['Bob','Tom','Smith'],'data2':range(3)})
df1
name data1
0 Tom 0
1 Tom 1
2 Bob 2
3 Bob 3
4 Tom 4
5 Bob 5
6 Jack 6
df2
name data2
0 Bob 0
1 Tom 1
2 Smith 2
pd.merge(df1,df2,on='name')
name data1 data2
0 Tom 0 1
1 Tom 1 1
2 Tom 4 1
3 Bob 2 0
4 Bob 3 0
5 Bob 5 0
# on选择多个连接键
df3 = pd.DataFrame({'name':['Tom','Tom','Bob','Bob','Tom','Bob','Jack'],
'subject':['Chinese','math','English','English',
'Chinese','math','English'],'data1':range(7)})
df4 = pd.DataFrame({'name':['Bob','Tom','Smith'], 'subject':['English',
'math','Chinese'],'data2':range(3)})
df3
name subject data1
0 Tom Chinese 0
1 Tom math 1
2 Bob English 2
3 Bob English 3
4 Tom Chinese 4
5 Bob math 5
6 Jack English 6
df4
name subject data2
0 Bob English 0
1 Tom math 1
2 Smith Chinese 2
pd.merge(df3,df4,on=['name','subject'])
name subject data1 data2
0 Tom math 1 1
1 Bob English 2 0
2 Bob English 3 0
# left_on,right_on 分别指明左右两侧的连接键
df5=pd.DataFrame({'l_name':['Tom','Tom','Bob','Bob','Tom','Bob','Jack'],
'data1':range(7)})
df6=pd.DataFrame({'r_name':['Bob','Tom','Smith'],'data2':range(3)})
pd.merge(df5,df6,left_on='l_name',right_on='r_name')
l_name data1 r_name data2
0 Tom 0 Tom 1
1 Tom 1 Tom 1
2 Tom 4 Tom 1
3 Bob 2 Bob 0
4 Bob 3 Bob 0
5 Bob 5 Bob 0
# 参数 left_index,right_index 进行索引上的合并
df7 = pd.DataFrame({'name':['Tom','Bob','Tom','Bob','Jack'],
'data1':range(5)})
df8 = pd.DataFrame({'data2':[3,6,9]},index=['Tom','Bob','Jim'])
pd.merge(df7,df8,left_on='name',right_index=True)
name data1 data2
0 Tom 0 3
2 Tom 2 3
1 Bob 1 6
3 Bob 3 6
# 参数 left_index,right_index 进行索引上的合并
df7 = pd.DataFrame({'name':['Tom','Bob','Tom','Bob','Jack'],
'data1':range(5)})
df8 = pd.DataFrame({'data2':[3,6,9]},index=['Tom','Bob','Jim'])
pd.merge(df7,df8,left_on='name',right_index=True)
name data1 data2
0 Tom 0 3
2 Tom 2 3
1 Bob 1 6
3 Bob 3 6
df9 = pd.DataFrame({'data1':[3,6,9]},index=['Tom','Bob','Jim'])
df10 = pd.DataFrame({'name':['Tom','Bob','Tom','Bob','Jack'],
'data2':range(5)})
pd.merge(df9,df10,right_on='name',left_index=True)
data1 name data2
0 3 Tom 0
2 3 Tom 2
1 6 Bob 1
3 6 Bob 3
## 二、参数how
#how='inner',默认,仅仅从两dataframe找有共同clubs,再合并
TeamA.merge(SponsershipB,on='clubs',how='inner')
cities clubs sponsers
0 London Chelsea Yokohama
1 Madrid Real Madrid Fly Emirates
2 Turin Juventus Jeep
#how='outer',从两dataframe找所有clubs,再合并,缺失值用Nan替代
TeamA.merge(SponsershipB,on='clubs',how='outer')
cities clubs sponsers
0 London Chelsea Yokohama
1 Madrid Real Madrid Fly Emirates
2 Turin Juventus Jeep
3 Munich Bayern Munich NaN
4 NaN Manchester United Chervolet
#how='right',仅从后边dataframe取clubs,再与前边dataframe从右边合并,缺失值用Nan替代
TeamA.merge(SponsershipB,on='clubs',how='right')
cities clubs sponsers
0 London Chelsea Yokohama
1 Madrid Real Madrid Fly Emirates
2 Turin Juventus Jeep
3 NaN Manchester United Chervolet
#how='left',仅从前边dataframe找出所有clubs,再与后边dataframe合并,缺失值用Nan替代
TeamA.merge(SponsershipB,on='clubs',how='left')
cities clubs sponsers
0 London Chelsea Yokohama
1 Madrid Real Madrid Fly Emirates
2 Turin Juventus Jeep
3 Munich Bayern Munich NaN
## 三、参数suffixes,设置二元后缀元组区分相同列
df11 = pd.DataFrame({'name':['Tom','Tom','Bob','Bob','Tom','Bob','Jack'],
'data':range(7)})
df12 = pd.DataFrame({'name':['Bob','Tom','Smith'],'data':range(3)})
pd.merge(df11,df12,on='name',suffixes=('_1','_2'))
name data_1 data_2
0 Tom 0 1
1 Tom 1 1
2 Tom 4 1
3 Bob 2 0
4 Bob 3 0
5 Bob 5 0
df11.merge(df12,on='name') #suffixes默认为(’_x’,‘_y’)
name data_x data_y
0 Tom 0 1
1 Tom 1 1
2 Tom 4 1
3 Bob 2 0
4 Bob 3 0
5 Bob 5 0