pandas.DataFrame.merge

Latest recommended article published on 2024-04-27 06:27:59

梓沂

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

left: use only keys from left frame, similar to a SQL left outer join; preserve key order
right: use only keys from right frame, similar to a SQL right outer join; preserve key order
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

on : label or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on : label or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on : label or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)

suffixes : 2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

validate : string, default None

If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 0.21.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.
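The indicator and validate parameters described above are easiest to understand with a small example. The following is a minimal sketch; the frames left, right and dup_right and their columns are invented for illustration and are not part of the official documentation.

```python
import pandas as pd

# Hypothetical frames, invented for this illustration.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# indicator=True adds a categorical "_merge" column naming each row's source.
merged = left.merge(right, on="key", how="outer", indicator=True)
sources = dict(zip(merged["key"], merged["_merge"].astype(str)))

# validate raises MergeError when the keys break the declared relationship.
# Here key 2 is duplicated on the right, so "one_to_one" cannot hold.
dup_right = pd.DataFrame({"key": [2, 2], "b": [20, 21]})
try:
    left.merge(dup_right, on="key", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True
```

Passing validate is a cheap way to catch duplicated keys before they silently multiply rows in the result.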

Data Matching with Pandas

This article is reposted from: 蓝鲸的网站分析笔记

Original article: Data Matching with Pandas (使用Pandas进行数据匹配)

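Introduction to merge()

The merge function in Pandas is similar to VLOOKUP in Excel: it matches and joins two tables. Unlike Excel, merge offers four join modes: inner, left, right, and outer, with inner as the default. This article covers how to use merge and how the four modes differ.

The two tables to be joined are the loan status table loanstats (the left table) and the member grade table member_grade (the right table). We will join them with each of the four merge modes in turn.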

Preparation

Before joining data with merge, import the required libraries, then read the two tables and name them loanstats and member_grade.

import numpy as np
import pandas as pd

loanstats = pd.read_excel('loanStats.xlsx')
member_grade = pd.read_excel('member_grade.xlsx')
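The two Excel files are not distributed with this post, so the steps cannot be run as written. As a minimal sketch, the frames below are hypothetical stand-ins built in memory: the column names member_id, loan_amnt and grade and all values are assumptions, chosen only so that each table has two keys missing from the other.

```python
import pandas as pd

# Hypothetical stand-ins for loanStats.xlsx and member_grade.xlsx;
# column names and values are invented for illustration.
loanstats = pd.DataFrame({
    "member_id": [1001, 1002, 1003, 1004],
    "loan_amnt": [5000, 2500, 2400, 10000],
})
member_grade = pd.DataFrame({
    "member_id": [1001, 1002, 1005, 1006],
    "grade": ["A", "B", "C", "D"],
})

# The columns shared by both frames are what merge() joins on when on is omitted.
shared = loanstats.columns.intersection(member_grade.columns).tolist()
```

With these frames, member_id is the only shared column, so it becomes the join key by default.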

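Function overview

merge is simple to use; the official description of the function is reproduced at the top of this post. The first table passed to merge becomes the left part of the result, and the second becomes the right part. The third argument is the join mode, which defaults to inner. The fourth argument, on, names the column to match on. If that column name appears in both tables, on can be omitted, because merge defaults to matching on the column names the two tables share. If the key columns have different names in the two tables, give each name separately in left_on and right_on. If the tables have no key column and must be matched on their index instead, set left_index and right_index to True. merge also takes sorting and other parameters, which can be set when needed.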
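The two less common cases, differently named key columns and index-based matching, can be sketched as follows. The frames loans and grades and the column name id are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
loans = pd.DataFrame({"member_id": [1001, 1002], "loan_amnt": [5000, 2500]})
grades = pd.DataFrame({"id": [1001, 1002], "grade": ["A", "B"]})

# Different key names: name the key on each side explicitly.
by_columns = pd.merge(loans, grades, left_on="member_id", right_on="id")

# No key column at all: join on the row index of both frames.
by_index = pd.merge(
    loans.set_index("member_id"),
    grades.set_index("id"),
    left_index=True,
    right_index=True,
)
```

Note that with left_on and right_on both key columns are kept in the result, whereas the index-based join carries the key in the index.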

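Inner join

inner is the default mode of merge, and it is easiest to picture as a Venn diagram: it keeps only the rows whose key values appear in both loanstats and member_grade, that is, the intersection of the two tables. Key values that appear in only one table are left out of the result.

The code below performs the join. Because inner is the default mode, how='inner' can be omitted. loanstats, passed first, ends up on the left of the result, and member_grade on the right. The result contains only the intersection of the two tables, so the join itself introduces no NaN values for unmatched rows.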

`loan_inner = pd.merge(loanstats, member_grade, how='inner')`
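To see the inner join without the original Excel files, here is a minimal sketch on hypothetical stand-in frames; the columns loan_amnt and grade and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are invented.
loanstats = pd.DataFrame({
    "member_id": [1001, 1002, 1003, 1004],
    "loan_amnt": [5000, 2500, 2400, 10000],
})
member_grade = pd.DataFrame({
    "member_id": [1001, 1002, 1005, 1006],
    "grade": ["A", "B", "C", "D"],
})

# how='inner' is the default, so it is omitted here.
loan_inner = pd.merge(loanstats, member_grade)

# Only keys present in both tables survive, so nothing is left unmatched.
kept_ids = loan_inner["member_id"].tolist()
has_nan = bool(loan_inner.isna().any().any())
```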

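Left join

left mode is a left join: it takes the left table, loanstats, as the base and looks up matching content in the right table, member_grade. Anything that cannot be matched is shown as NaN. In Excel terms, it is as if the VLOOKUP formula were written in the left table. As a Venn diagram, the result is the whole of the left table plus the part it shares with the right table.

Below is the left join. loanstats is passed to merge first, so it is the left table; member_grade is passed second, so it is the right table. The result keeps the loanstats table complete and uses it as the base for looking up content in member_grade. Two member_id values in loanstats cannot be found in member_grade, so the grades field shows NaN for them.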

`loan_left = pd.merge(loanstats, member_grade, how='left')`
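A minimal sketch of the same behavior on hypothetical stand-in frames (column names and values are assumptions) shows which left-table keys end up without a grade:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are invented.
loanstats = pd.DataFrame({
    "member_id": [1001, 1002, 1003, 1004],
    "loan_amnt": [5000, 2500, 2400, 10000],
})
member_grade = pd.DataFrame({
    "member_id": [1001, 1002, 1005, 1006],
    "grade": ["A", "B", "C", "D"],
})

loan_left = pd.merge(loanstats, member_grade, how="left")

# Every loanstats row is kept; keys missing from member_grade get NaN grades.
unmatched = loan_left.loc[loan_left["grade"].isna(), "member_id"].tolist()
```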

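Right join

The third mode is right, the exact opposite of left: it takes the right table, member_grade, as the base and looks up matching content in the left table, loanstats. Anything that cannot be matched is shown as NaN. As a Venn diagram, the result is the whole of the right table plus the part it shares with the left table.

Below is the right join. The result keeps the member_grade table complete and uses it as the base for looking up content in loanstats. Two entries in member_grade cannot be found in loanstats, so the loanstats columns show NaN for them.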

`loan_right = pd.merge(loanstats, member_grade, how='right')`
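The right join can be sketched the same way on hypothetical stand-in frames (column names and values are assumptions). One side effect worth noticing is that an integer column picks up NaN and is therefore stored as float64.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are invented.
loanstats = pd.DataFrame({
    "member_id": [1001, 1002, 1003, 1004],
    "loan_amnt": [5000, 2500, 2400, 10000],
})
member_grade = pd.DataFrame({
    "member_id": [1001, 1002, 1005, 1006],
    "grade": ["A", "B", "C", "D"],
})

loan_right = pd.merge(loanstats, member_grade, how="right")

# Every member_grade row is kept; keys missing from loanstats get NaN amounts.
missing_amounts = int(loan_right["loan_amnt"].isna().sum())
amount_dtype = str(loan_right["loan_amnt"].dtype)
```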

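Outer join

The last mode is outer. It combines the two tables: loanstats and member_grade are pooled together, producing one table that holds the unique key values from both plus the matched content.

Below is the outer join. The member_id column contains the unique values from both loanstats and member_grade, the grade column shows what was matched from member_grade, and the remaining columns show what was matched from loanstats. Anything that cannot be matched is shown as NaN.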

`loan_outer = pd.merge(loanstats, member_grade, how='outer')`
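As a minimal sketch on hypothetical stand-in frames (column names and values are assumptions), the outer join returns the union of the keys, with NaN on whichever side has no match:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are invented.
loanstats = pd.DataFrame({
    "member_id": [1001, 1002, 1003, 1004],
    "loan_amnt": [5000, 2500, 2400, 10000],
})
member_grade = pd.DataFrame({
    "member_id": [1001, 1002, 1005, 1006],
    "grade": ["A", "B", "C", "D"],
})

loan_outer = pd.merge(loanstats, member_grade, how="outer")

# Union of keys: 2 shared + 2 only in loanstats + 2 only in member_grade.
all_ids = sorted(loan_outer["member_id"].tolist())
missing_grades = int(loan_outer["grade"].isna().sum())
missing_amounts = int(loan_outer["loan_amnt"].isna().sum())
```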

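Matching NaN values

NaN values come up often when matching and joining data. How does merge handle them? It cross-matches the NaN keys of the two tables: each NaN in the member_id column of loanstats is matched against every NaN in the member_id column of member_grade, and all of those pairs appear in the result. In the join below, both tables contain NaN keys and the mode is left, so loanstats is the base table and each of its NaN member_id rows is matched with every NaN member_id row of member_grade.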

`loan_left = pd.merge(loanstats, member_grade, how='left')`
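The cross-matching of NaN keys can be reproduced with a minimal sketch. The frames below are hypothetical stand-ins with invented values; dropping NaN keys before the join is shown as one possible way to avoid the effect, not as something the original article prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: one NaN key on the left, two on the right.
loanstats = pd.DataFrame({"member_id": [1001, np.nan], "loan_amnt": [5000, 2500]})
member_grade = pd.DataFrame({
    "member_id": [1001, np.nan, np.nan],
    "grade": ["A", "B", "C"],
})

# The single NaN row on the left pairs with both NaN rows on the right.
loan_left = pd.merge(loanstats, member_grade, how="left")
nan_rows = loan_left[loan_left["member_id"].isna()]

# One option: drop rows with a missing key on both sides before merging.
loan_clean = pd.merge(
    loanstats.dropna(subset=["member_id"]),
    member_grade.dropna(subset=["member_id"]),
    how="left",
)
```

This differs from SQL, where NULL keys never match each other, so it is worth checking for missing keys before a merge.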

df3['objectid'].isnull() produces a boolean Series, with one True or False per row.

It can be used as a mask to keep only the non-null rows.
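The filtering step can be sketched as follows. The names df3 and objectid come from the text; the values and the name column are made up for illustration.

```python
import numpy as np
import pandas as pd

# df3 and objectid are named in the text; the data here is invented.
df3 = pd.DataFrame({"objectid": [10.0, np.nan, 30.0], "name": ["a", "b", "c"]})

mask = df3["objectid"].isnull()   # boolean Series: False, True, False
non_null_rows = df3[~mask]        # same rows as df3[df3["objectid"].notnull()]
```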

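Selecting rows and columns of a DataFrame:

.loc is primarily label based: it selects by row label and column label. There is one case that looks like an exception:

When the row labels are integers, the 1:2 in ddf.loc[1:2,['eci']] looks like a selection by row position, but it is actually a selection by row label.

If the row labels are changed to 3 and 4, .loc[1:2,] no longer finds anything.

.iloc is primarily integer position based: it selects by row position and column position.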
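The difference between label-based and position-based slicing can be checked with a minimal sketch. The names ddf and eci come from the text; the values are made up.

```python
import pandas as pd

# ddf and the eci column are named in the text; the data here is invented.
ddf = pd.DataFrame({"eci": [0.1, 0.2, 0.3]}, index=[0, 1, 2])

by_label = ddf.loc[1:2, ["eci"]]      # labels 1 and 2; the end label is included
by_position = ddf.iloc[1:2, [0]]      # position 1 only; the end position is excluded

# Relabel two rows as 3 and 4: the label slice 1:2 now matches nothing.
relabelled = ddf.iloc[:2].copy()
relabelled.index = [3, 4]
missing = relabelled.loc[1:2, ["eci"]]
```

The inclusive end label of .loc versus the exclusive end position of .iloc is the detail that most often causes off-by-one surprises.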

Object Type	Indexers
Series	`s.loc[indexer]`
DataFrame	`df.loc[row_indexer, column_indexer]` (rows first, then columns)
Panel	`p.loc[item_indexer, major_indexer, minor_indexer]` (Panel was removed in pandas 0.25)
