Dataframe的merge操作
dataframe的merge是按照两个dataframe共有的column进行连接,两个dataframe必须有同名的column ,merger操作与操作数据库表类似。
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
pd1=pd.DataFrame({'key1':['a','b','c'],'value1':[1,2,3]})
pd1
key1 | value1 | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
2 | c | 3 |
pd2=pd.DataFrame({'key1':['d','e','f'],'value1':[4,5,6]})
pd2
key1 | value1 | |
---|---|---|
0 | d | 4 |
1 | e | 5 |
2 | f | 6 |
pd.merge(pd1,pd2) #没有相同的列值,则无法合并
pd3=pd.DataFrame({'key1':['a','b','c'],'value1':[4,5,6]})
pd3
key1 | value1 | |
---|---|---|
0 | a | 4 |
1 | b | 5 |
2 | c | 6 |
pd.merge(pd1,pd3) #不指定列 关联条件就是所有同名的列
pd.merge(pd1,pd3,on='key1') #指定列 关联条件是所有同名的列
key1 | value1_x | value1_y | |
---|---|---|---|
0 | a | 1 | 4 |
1 | b | 2 | 5 |
2 | c | 3 | 6 |
pd4=pd.DataFrame({'name':['a','b','c','d'],'age':[1,2,3,4],'score':[99,100,98,87],'class':[2000,2000,2000,20001]})
pd4
name | age | score | class | |
---|---|---|---|---|
0 | a | 1 | 99 | 2000 |
1 | b | 2 | 100 | 2000 |
2 | c | 3 | 98 | 2000 |
3 | d | 4 | 87 | 20001 |
pd5=pd.DataFrame({'name':['a','c'],'class':[2000,20001],'sex':[0,1]})
pd5
name | class | sex | |
---|---|---|---|
0 | a | 2000 | 0 |
1 | c | 20001 | 1 |
pd.merge(pd4,pd5)
name | age | score | class | sex | |
---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 0 |
pd.merge(pd4,pd5,on='name')
name | age | score | class_x | class_y | sex | |
---|---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 2000 | 0 |
1 | c | 3 | 98 | 2000 | 20001 | 1 |
pd.merge(pd4,pd5,on=['name','class'])
pd.merge(pd4,pd5,on='name',how='left') #左关联
name | age | score | class_x | class_y | sex | |
---|---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 2000.0 | 0.0 |
1 | b | 2 | 100 | 2000 | NaN | NaN |
2 | c | 3 | 98 | 2000 | 20001.0 | 1.0 |
3 | d | 4 | 87 | 20001 | NaN | NaN |
pd.merge(pd4,pd5,on='name',how='right') #右关联
name | age | score | class_x | class_y | sex | |
---|---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 2000 | 0 |
1 | c | 3 | 98 | 2000 | 20001 | 1 |
pd.merge(pd5,pd4,on='name',how='left') #左关联
name | class_x | sex | age | score | class_y | |
---|---|---|---|---|---|---|
0 | a | 2000 | 0 | 1 | 99 | 2000 |
1 | c | 20001 | 1 | 3 | 98 | 2000 |
pd.merge(pd4,pd5,on='name',how='outer') #外关联
name | age | score | class_x | class_y | sex | |
---|---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 2000.0 | 0.0 |
1 | b | 2 | 100 | 2000 | NaN | NaN |
2 | c | 3 | 98 | 2000 | 20001.0 | 1.0 |
3 | d | 4 | 87 | 20001 | NaN | NaN |
pd.merge(pd4,pd5,on='name',how='inner') #内关联
name | age | score | class_x | class_y | sex | |
---|---|---|---|---|---|---|
0 | a | 1 | 99 | 2000 | 2000 | 0 |
1 | c | 3 | 98 | 2000 | 20001 | 1 |