问题
你想要对用一个 pandas.DataFrame 数据,对原来的一个 pandas.DataFrame 进行 UPSERT 更新,
I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.
例如你有这两个 DataFrame:
df1:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'A2' 'B2'
'2015-10-03' 'A3' 'B3'
df2:
date A B
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
你想要得到如下的 DataFrame 结果:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
This is analogous to what I think is called “upsert” in some SQL systems — a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted into df1 at the end if the row key does not already exist.
Solution
以下是我在 StackOverflow 上的 “answer” 👇
This question is very help to me.
两种 Solution
answer of @Alexander that is good enough.
I just want to give a little explain the different of two solution:
- much lines
def upsert(target, new)
df = pd.concat([target, new[~new.index.isin(target.index)]], sort=False)
df.update(new)
return df
and
- one-liner
def upsert(target, new):
return pd.concat([target[~target.index.isin(new.index)], new], sort=False)
解释两种 Solution 的不同之处
There is the example can explain the difference:
>>> df = pd.DataFrame([('python', None, 4),
('java', '1992', 3),
('javascript', None, 2)],
columns=['lang', 'year', 'rank'])
>>> df
lang year rank
0 python None 4
1 java 1992 3
2 javascript None 2
>>> upsert_df = pd.DataFrame([('python', '1987'),
('GOLANG', '2009')],
columns=['lang', 'year'])
>>> upsert_df
lang year
0 python 1987
1 GOLANG 2009
>>> target, new = df.set_index(['lang']), upsert_df.set_index(['lang'])
>>> upsert(target, new, full_update=True)
year rank
lang
java 1992 3.0
javascript None 2.0
python 1987 NaN
GOLANG 2009 NaN
>>> upsert(target, new, full_update=False)
year rank
lang
python 1987 4.0
java 1992 3.0
c 1976 1.0
javascript None 2.0
GOLANG 2009 NaN
the upsert
function like this 👇
完整的 solution
def upsert(target, new, *, full_update=True, sort=False) -> pd.DataFrame:
"""do a upsert in pandas
:param target: to be upsert;
:param new: upsert data to target;
:param full_update: bool: 在 new 没有完整的 target.columns 的情况下,对于 update 的那部分 rows,
是对所有的 target.columns 都 update(没有的 columns 值将为空)?
还是对 new 中存在的 columns 对 target 执行该部分的 columns update?
默认对所有的 target.columns 都 update(new 中没有的 columns 值将为空);
:param sort: bool: if new.columns != target.columns,
the new version of pandas execute .concat will sort columns by default.
Reference: https://stackoverflow.com/a/33002097
"""
if full_update:
return pd.concat([target[~target.index.isin(new.index)], new], sort=sort)
else:
df = pd.concat([target, new[~new.index.isin(target.index)]], sort=sort)
df.update(new)
return df