1 数据规整化-合并数据集
1.1 merge的内连,外连,左连,右连
- merge默认采用的是“inner连结”,取交集部分,没有交集的会舍弃掉
- 如果没指定用哪列连结,默认情况下merge会将重叠列的列名当做健,一般建议用 on 指定一下
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from numpy import nan as NA
df1 = DataFrame({
'key':['b','b','a','c','a','a','b'],
'data1':range(7)})
df2 = DataFrame({
'key':list('abd'),
'data2':range(3)})
pd.merge(df1,df2)
.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }
|
data1 |
key |
data2 |
0 |
0 |
b |
1 |
1 |
1 |
b |
1 |
2 |
6 |
b |
1 |
3 |
2 |
a |
0 |
4 |
4 |
a |
0 |
5 |
5 |
a |
0 |
pd.merge(df1,df2,on='key')
.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }
|
data1 |
key |
data2 |
0 |
0 |
b |
1 |
1 |
1 |
b |
1 |
2 |
6 |
b |
1 |
3 |
2 |
a |
0 |
4 |
4 |
a |
0 |
5 |
5 |
a |
0 |
pd.merge(df1,df2,on='key',how='outer')
.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }
|
data1 |
key |
data2 |
0 |
0.0 |
b |
1.0 |
1 |
1.0 |
b |
1.0 |
2 |
6.0 |
b |
1.0 |
3 |
2.0 |
a |
0.0 |
4 |
4.0 |
a |
0.0 |
5 |
5.0 |
a |
0.0 |
6 |
3.0 |
c |
NaN |
7 |
NaN |
d |
2.0 |
- how=left , 取merge连结的左边数据集,右边只在取有关联的,没关联的NAN值填充
- how=right , 取merge连结的右边数据集,左边只在取有关联的,没关联的NAN值填充
df5 = DataFrame({
'key':['b','b','a','c','a','a'],
'data1':range(6)})
df6 = DataFrame({
'key':list('ababd'),
'data2':range(5)})
pd.merge(df1,df2,on='key',how='left')
.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }
|
data1 |
key |
data2 |
0 |
0 |
b |
1.0 |
1 |
1 |
b |
1.0 |
2 |
2 |
a |
0.0 |
3 |
3 |
c |
NaN |
4 |
4 |
a |
0.0 |
5 |
5 |
a |
0.0 |
6 |
6 |
b |
1.0 |
df5 = DataFrame({
'key':['b','b','a','c','a','a'],
'data1':range(6)})
df6 = DataFrame({
'key':list('ababd'),
'data2':range(5)})
pd.merge(df1,df2,on='key',how='right')
.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; }
|
data1 |
key |
data2 |
0 |
0.0 |
b |
1 |
1 |
1.0 |
b |
1 |
2 |
6.0 |
b |
1 |
3 |
2.0 |
a |
0 |
4 |
4.0 |
a |
0 |
5 |
5.0 |
a |
0 |
6 |
NaN |
d |
2 |
如果两个对象的列名不同,也可以分别进行指定
df3 = DataFrame({
'lkey':['b','b','a','c','a','a','b'],
'data1':range(7)})
df4 = DataFrame({
'rkey':list('abd'),
'data2':range(3)})
pd.