merge 在 pandas 里面的基本用法
- pd.merge([df1, df2], on=’ ‘, how=’ left/right/inner’)
- df1.merge(df2, on=’ ‘, how=’ ')
如果你要拼接的两个dataframe中,有一个相同的关键字段,且两个列名是相同的,就可以指定 on=‘column name’
接下来看例子
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4], ['b', 5], ['b', 6], ['a', 7]],
columns=['letter', 'num'])
df1, df2
df1
letter number
0 a 1
1 b 2
df2
letter num
0 c 3
1 d 4
2 b 5
3 b 6
4 a 7
可以看到df1, df2 均有letter列,接下来我们希望按照letter列为 key来join两个dataframe,且只保留存在于df1中的letter, 所以用‘left’ join
用法1
df1.merge(df2, on='letter', how='left')
letter number num
0 a 1 7
1 b 2 5
2 b 2 6
用法2
pd.merge(df1,df2, on='letter', how='left')
letter number num
0 a 1 7
1 b 2 5
2 b 2 6
本来按照 left join,以为出来的dataframe的行数会和df1一样,但是结果并不是。因为df2的letter列由重复的 ’b’,所以会多出来一行。所并不是一 一对应的关系。
解决方法
加上一个validate参数,参数格式见下方
validate : str, optional
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
官方文档见: pandas.DataFrame.merge
如果我们指定 validate=“1:1" ( 或者 ”one_to_one")
pd.merge(df1,df2, on='letter', how='left', validate="1:1")
Traceback (most recent call last):
File "C:\Apps\anaconda3.6\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-37b8e1d89c1a>", line 1, in <module>
pd.merge(df1,df2, on='letter', how='left', validate="1:1")
File "C:\Apps\anaconda3.6\lib\site-packages\pandas\core\reshape\merge.py", line 60, in merge
validate=validate)
File "C:\Apps\anaconda3.6\lib\site-packages\pandas\core\reshape\merge.py", line 560, in __init__
self._validate(validate)
File "C:\Apps\anaconda3.6\lib\site-packages\pandas\core\reshape\merge.py", line 1080, in _validate
raise MergeError("Merge keys are not unique in right dataset;"
pandas.errors.MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
会出现报错,这样就会避免上面那种 奇怪多出一行的情况