【pandas】解决怎么对 pandas.DataFrame UPSERT(update & insert) 另一个 DataFrame 问题


问题

你想要对用一个 pandas.DataFrame 数据,对原来的一个 pandas.DataFrame 进行 UPSERT 更新,

I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.

例如你有这两个 DataFrame:

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
    date            A      B
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

你想要得到如下的 DataFrame 结果:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

This is analogous to what I think is called “upsert” in some SQL systems — a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted into df1 at the end if the row key does not already exist.



Solution

以下是我在 StackOverflow 上的 “answer” 👇

This question is very help to me.

两种 Solution

answer of @Alexander that is good enough.

I just want to give a little explain the different of two solution:

  1. much lines
def upsert(target, new)
  df = pd.concat([target, new[~new.index.isin(target.index)]], sort=False)
  df.update(new)
  return df

and

  1. one-liner
def upsert(target, new):
  return pd.concat([target[~target.index.isin(new.index)], new], sort=False)


解释两种 Solution 的不同之处

There is the example can explain the difference:

>>> df = pd.DataFrame([('python', None, 4),
                       ('java',   '1992', 3),
                       ('javascript', None, 2)],
                      columns=['lang', 'year', 'rank'])
>>> df
	lang	year	rank
0	python	None	4
1	java	1992	3
2	javascript	None	2

>>> upsert_df = pd.DataFrame([('python', '1987'),
                              ('GOLANG', '2009')],
                             columns=['lang', 'year'])
>>> upsert_df
	lang	year
0	python	1987
1	GOLANG	2009

>>> target, new = df.set_index(['lang']), upsert_df.set_index(['lang'])
>>> upsert(target, new, full_update=True)
	year	rank
lang		
java	1992	3.0
javascript	None	2.0
python	1987	NaN
GOLANG	2009	NaN

>>> upsert(target, new, full_update=False)
	year	rank
lang		
python	1987	4.0
java	1992	3.0
c	1976	1.0
javascript	None	2.0
GOLANG	2009	NaN

the upsert function like this 👇

完整的 solution

def upsert(target, new, *, full_update=True, sort=False) -> pd.DataFrame:
  """do a upsert in pandas

  :param target: to be upsert;
  :param new:  upsert data to target;
  :param full_update: bool: 在 new 没有完整的 target.columns 的情况下,对于 update 的那部分 rows,
                            是对所有的 target.columns 都 update(没有的 columns 值将为空)?
                            还是对 new 中存在的 columns 对 target 执行该部分的 columns update?
                            默认对所有的 target.columns 都 update(new 中没有的 columns 值将为空);
  :param sort: bool: if new.columns != target.columns,
                     the new version of pandas execute .concat will sort columns by default.

  Reference: https://stackoverflow.com/a/33002097
  """

  if full_update:
    return pd.concat([target[~target.index.isin(new.index)], new], sort=sort)
  else:
    df = pd.concat([target, new[~new.index.isin(target.index)]], sort=sort)
    df.update(new)
    return df


Reference

🔗 pandas DataFrame concat / update (“upsert”)?

  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值