Pandas: Merging, Joining, and Comparing

Jesse_chao122

Published on 2024-04-18 16:45:51

Tags: pandas

Article link: https://blog.csdn.net/Jesse_chao122/article/details/137914953

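This article explains how to combine data in Pandas: the concat function for concatenating objects, merge and join for joining on indexes or columns, and the DataFrame compare method for finding differences. Concrete code examples show each operation in use.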

1. Concatenating (concat)

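The concat() function concatenates any number of Series or DataFrame objects along one axis, while applying optional set logic (union or intersection) to the indexes on the other axes. Like numpy.concatenate, concat() takes a list or dict of homogeneously typed objects and concatenates them.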

pandas.concat(objs, *, axis=0, join='outer', 
                ignore_index=False, keys=None, 
                levels=None, names=None, verify_integrity=False, 
                sort=False, copy=None)

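Parameters:

Parameter	Description
objs	A sequence or mapping of Series or DataFrame objects.
axis	{0/'index', 1/'columns'}, default 0. The axis to concatenate along.
join	{'inner', 'outer'}, default 'outer'. How to handle the indexes on the other axis (union or intersection).
ignore_index	bool, default False. If True, the index values along the concatenation axis are discarded and the result is labeled 0, ..., n - 1.
keys	sequence, default None. Builds a hierarchical index with the passed keys as the outermost level.
levels	list of sequences, default None. Specific levels (unique values) to use when constructing a MultiIndex.
names	list, default None. Names for the levels of the resulting hierarchical index.
verify_integrity	bool, default False. Checks whether the new concatenated axis contains duplicates.
sort	bool, default False. Sorts the non-concatenation axis if it is not already aligned.
copy	bool, default True. If False, data is not copied unnecessarily.

Returns: object, type of objs

See the examples below for usage: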

import pandas as pd
import numpy as np

# Declare the DataFrames and the Series
df1 = pd.DataFrame([
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
])
df2 = pd.DataFrame([
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
])
df3 = pd.DataFrame([
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
])

df4 = pd.DataFrame([
    np.random.randint(1, 10, size=4),
    np.random.randint(1, 10, size=4),
    np.random.randint(1, 10, size=4),
])

series = pd.Series(np.random.randint(1, 10, size=3))

# Basic usage
df = pd.concat([df1, df2, df3])
# Output (the values differ between runs because the data is random)
DataFrame concatenated along the index:
   0  1  2
0  6  5  7
1  3  5  9
2  8  6  8
0  4  5  4
1  4  9  5
2  1  3  2
0  2  4  8
1  4  8  1
2  1  9  2

# ignore_index parameter: reset the index
df = pd.concat([df1, df2, df3], ignore_index=True)
# Output
Concatenated DataFrame:
   0  1  2
0  5  3  5
1  3  6  1
2  4  8  2
3  9  4  3
4  8  3  4
5  7  6  3
6  2  6  8
7  6  3  5
8  4  8  6

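# Concatenating a DataFrame with a Series
# (About the axis parameter: axis=0 (index) appends data along the row direction; axis=1 (columns) does the same along the column direction)
df = pd.concat([df1, series], axis=0, ignore_index=True)
# Output
Concatenated DataFrame: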
   0    1    2
0  4  7.0  9.0
1  9  6.0  4.0
2  1  5.0  3.0
3  7  NaN  NaN
4  5  NaN  NaN
5  1  NaN  NaN
df = pd.concat([df1, series], axis=1, ignore_index=True)
# Output
Concatenated DataFrame:
   0  1  2  3
0  1  1  3  1
1  1  8  4  7
2  8  7  5  4

# join parameter: 'outer' keeps the union of the columns and fills the gaps with NaN
df = pd.concat([df1, df4, df3], join='outer', ignore_index=True)
# Output
Concatenated DataFrame:
   0  1  2    3
0  7  4  1  NaN
1  7  5  1  NaN
2  9  5  5  NaN
3  6  5  1  3.0
4  2  1  2  1.0
5  7  5  8  2.0
6  3  6  5  NaN
7  6  4  3  NaN
8  6  1  8  NaN
# join parameter: 'inner' keeps only the columns shared by all inputs
df = pd.concat([df1, df4, df3], join='inner', ignore_index=True)
# Output
Concatenated DataFrame:
   0  1  2
0  9  8  3
1  6  3  4
2  5  5  6
3  9  2  5
4  3  5  9
5  3  7  5
6  6  5  9
7  5  2  3
8  9  8  9
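
The keys and verify_integrity parameters listed in the table are not exercised by the examples above. The following is a minimal sketch of how they might be used; the frames left and right and their values are hypothetical and fixed so that the result is reproducible.

```python
import pandas as pd

# Hypothetical sample frames with fixed values (not from the article)
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# keys adds an outer index level recording which input each row came from
labeled = pd.concat([left, right], keys=["left", "right"])
print(labeled)

# Passing a dict uses the dict keys as the keys argument
labeled_from_dict = pd.concat({"left": left, "right": right})
print(labeled_from_dict.equals(labeled))

# The outer level can then be used to select the rows of one input
print(labeled.loc["right"])

# verify_integrity=True raises ValueError when the concatenated index
# would contain duplicate labels (both inputs use the labels 0 and 1)
duplicate_index_rejected = False
try:
    pd.concat([left, right], verify_integrity=True)
except ValueError as exc:
    duplicate_index_rejected = True
    print(f"ValueError: {exc}")
```

Using keys is one way to keep track of the origin of each row when ignore_index=True would throw that information away.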

2. Joining (merge, join)

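Joins can be performed with join() or merge(). By default, join() joins DataFrames on their indexes. Each method has parameters that let you specify the type of join to perform (LEFT, RIGHT, INNER, FULL) and the columns to join on (column names or indexes).

If both key columns contain rows where the key is a null value, those rows are matched against each other. This differs from the usual SQL join behavior and can lead to unexpected results.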
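
The null-key behavior can be seen in a small sketch. The frames and column names here are hypothetical, and dropping null keys before merging is just one possible way to approximate SQL semantics, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: each has one row whose key is NaN
left = pd.DataFrame({"key": [1.0, np.nan], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [np.nan, 2.0], "right_val": ["x", "y"]})

# The NaN key on the left is matched with the NaN key on the right
matched = pd.merge(left, right, on="key", how="inner")
print(matched)

# One way to approximate SQL behavior: drop rows with null keys first
sql_like = pd.merge(
    left.dropna(subset=["key"]),
    right.dropna(subset=["key"]),
    on="key",
    how="inner",
)
print(sql_like)
```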

2.1 merge()

pandas.merge(left, right, how='inner', on=None, 
            left_on=None, right_on=None, left_index=False, 
            right_index=False, sort=False, suffixes=('_x', '_y'), 
            copy=None, indicator=False, validate=None)

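Parameters:

Parameter	Description
left	DataFrame or named Series
right	DataFrame or named Series
how	{'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
on	label or list. Column or index level names to join on. These must be found in both DataFrames. If on is None and the merge is not on indexes, this defaults to the intersection of the columns in both DataFrames.
left_on	label or list. Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they were columns.
right_on	label or list. Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they were columns.
left_index	bool, default False. Use the index of the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index	bool, default False. Use the index of the right DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
sort	bool, default False. Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixes	list-like, default ('_x', '_y'). A length-2 sequence in which each element is optionally a string giving the suffix to add to overlapping column names in left and right respectively. Pass None instead of a string to indicate that the column name from left or right should be kept as-is, with no suffix. At least one of the values must not be None.
copy	bool, default True. If False, avoid copying where possible.
indicator	bool or str, default False. If True, adds a column named "_merge" to the output DataFrame with information on the source of each row. The column can be given a different name by passing a string. The column is of Categorical type, with the value "left_only" for observations whose merge key appears only in the left DataFrame, "right_only" for observations whose merge key appears only in the right DataFrame, and "both" if the merge key is found in both DataFrames.
validate	str, optional. If specified, checks whether the merge is of the specified type. "one_to_one" or "1:1": check that merge keys are unique in both the left and right datasets. "one_to_many" or "1:m": check that merge keys are unique in the left dataset. "many_to_one" or "m:1": check that merge keys are unique in the right dataset. "many_to_many" or "m:m": allowed, but does not result in checks.

Returns: a DataFrame of the two merged objects.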
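
As a minimal sketch of the left_on, right_on, and suffixes parameters described in the table above, consider two hypothetical tables (students and retakes, not part of the original article) whose key columns have different names and which share a score column.

```python
import pandas as pd

# Hypothetical tables: the key column is named differently in each
students = pd.DataFrame({"student_id": [1, 2, 3], "score": [90, 85, 70]})
retakes = pd.DataFrame({"sid": [2, 3, 4], "score": [88, 75, 60]})

# left_on/right_on name the key column on each side;
# suffixes disambiguates the overlapping "score" column
merged = pd.merge(
    students,
    retakes,
    left_on="student_id",
    right_on="sid",
    how="inner",
    suffixes=("_first", "_retake"),
)
print(merged)
```

Both key columns are kept in the result because their names differ; dropping one of them afterwards is a matter of taste.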

import pandas as pd
import numpy as np

data1 = {
    "id":[1,2,3,4,5,6],
    "age":np.random.randint(15, 20, size=6),
    "class":np.random.randint(1, 4, size=6)
}

df1 = pd.DataFrame(data1)

data2 = {
    "id": [1, 2, 4, 5, 6],
    "name": ["John", "Smith", "Jesse", "Neal", "Bob"],
}

df2 = pd.DataFrame(data2)
df_left = pd.merge(df1, df2, on="id", how="left")
df_right = pd.merge(df1, df2, on="id", how="right")
df_inner = pd.merge(df1, df2, on="id", how="inner")
df_outer = pd.merge(df1, df2, on="id", how="outer")
df_cross = pd.merge(df1, df2, how="cross")
print(f'Left join result:\n{df_left}')
print(f'Right join result:\n{df_right}')
print(f'Inner join result:\n{df_inner}')
print(f'Outer join result:\n{df_outer}')
print(f'Cartesian product result:\n{df_cross}')
----------------------------------------------------
Left join result:
   id  age  class   name
0   1   18      2   John
1   2   15      2  Smith
2   3   15      2    NaN
3   4   15      1  Jesse
4   5   18      2   Neal
5   6   19      2    Bob
Right join result:
   id  age  class   name
0   1   18      2   John
1   2   15      2  Smith
2   4   15      1  Jesse
3   5   18      2   Neal
4   6   19      2    Bob
Inner join result:
   id  age  class   name
0   1   18      2   John
1   2   15      2  Smith
2   4   15      1  Jesse
3   5   18      2   Neal
4   6   19      2    Bob
Outer join result:
   id  age  class   name
0   1   18      2   John
1   2   15      2  Smith
2   3   15      2    NaN
3   4   15      1  Jesse
4   5   18      2   Neal
5   6   19      2    Bob
Cartesian product result:
    id_x  age  class  id_y   name
0      1   18      2     1   John
1      1   18      2     2  Smith
2      1   18      2     4  Jesse
3      1   18      2     5   Neal
4      1   18      2     6    Bob
5      2   15      2     1   John
6      2   15      2     2  Smith
7      2   15      2     4  Jesse
8      2   15      2     5   Neal
9      2   15      2     6    Bob
10     3   15      2     1   John
11     3   15      2     2  Smith
12     3   15      2     4  Jesse
13     3   15      2     5   Neal
14     3   15      2     6    Bob
15     4   15      1     1   John
16     4   15      1     2  Smith
17     4   15      1     4  Jesse
18     4   15      1     5   Neal
19     4   15      1     6    Bob
20     5   18      2     1   John
21     5   18      2     2  Smith
22     5   18      2     4  Jesse
23     5   18      2     5   Neal
24     5   18      2     6    Bob
25     6   19      2     1   John
26     6   19      2     2  Smith
27     6   19      2     4  Jesse
28     6   19      2     5   Neal
29     6   19      2     6    Bob
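
The indicator and validate parameters from the table are not used in the example above. The sketch below shows one way they might be applied; the frames are hypothetical and use fixed values.

```python
import pandas as pd

# Hypothetical frames with fixed values
left = pd.DataFrame({"id": [1, 2, 3], "age": [18, 15, 16]})
right = pd.DataFrame({"id": [1, 2, 4], "name": ["John", "Smith", "Jesse"]})

# indicator=True adds a categorical "_merge" column with the row source
tracked = pd.merge(left, right, on="id", how="outer", indicator=True)
print(tracked)

# Rows whose key was found only in the left frame
left_only = tracked[tracked["_merge"] == "left_only"]
print(left_only)

# validate checks key uniqueness before merging; here the right frame
# repeats id 1, so a one-to-one merge is rejected with MergeError
dup_right = pd.DataFrame({"id": [1, 1], "name": ["John", "Johnny"]})
one_to_one_rejected = False
try:
    pd.merge(left, dup_right, on="id", validate="one_to_one")
except pd.errors.MergeError as exc:
    one_to_one_rejected = True
    print(f"MergeError: {exc}")
```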

2.2 join()

DataFrame.join(other, on=None, how='left', 
               lsuffix='', rsuffix='', sort=False, 
               validate=None)

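Parameters:

Parameter	Description
other	DataFrame, Series, or a list containing any combination of them. Its index should be similar to one of the columns in the caller. If a Series is passed, its name attribute must be set, and it is used as the column name in the resulting joined DataFrame.
on	str, list of str, or array-like, optional. Column or index level name(s) in the caller to join on the index of other; otherwise the join is index-on-index. If multiple values are given, the other DataFrame must have a MultiIndex. An array can be passed as the join key if it is not already contained in the calling DataFrame. Similar to an Excel VLOOKUP operation.
how	{'left', 'right', 'outer', 'inner', 'cross'}, default 'left'. How to handle the operation of the two objects. left: use the calling frame's index (or column, if on is specified). right: use other's index. outer: form the union of the calling frame's index (or column, if on is specified) with other's index, and sort it lexicographically. inner: form the intersection of the calling frame's index (or column, if on is specified) with other's index, preserving the order of the calling frame. cross: create the Cartesian product of both frames, preserving the order of the left keys.
lsuffix	str, default ''. Suffix to use for overlapping columns from the left frame.
rsuffix	str, default ''. Suffix to use for overlapping columns from the right frame.
sort	bool, default False. Sort the result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
validate	str, optional. If specified, checks whether the join is of the specified type. "one_to_one" or "1:1": check that join keys are unique in both the left and right datasets. "one_to_many" or "1:m": check that join keys are unique in the left dataset. "many_to_one" or "m:1": check that join keys are unique in the right dataset. "many_to_many" or "m:m": allowed, but does not result in checks.

Returns: a DataFrame containing the columns of both the caller and other.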

import pandas as pd
import numpy as np

data1 = {
    "id":[1,2,3,4,5,6],
    "age":np.random.randint(15, 20, size=6),
    "class":np.random.randint(1, 4, size=6)
}

df1 = pd.DataFrame(data1)

data2 = {
    "name": ["John", "Smith", "Jesse", "Neal", "Bob"],
}

df2 = pd.DataFrame(data2)
df_left = df1.join(df2, how="left", rsuffix="_right", lsuffix="_left")
df_right = df1.join(df2, how="right", rsuffix="_right", lsuffix="_left")
df_inner = df1.join(df2, how="inner", rsuffix="_right", lsuffix="_left")
df_outer = df1.join(df2, how="outer", rsuffix="_right", lsuffix="_left")
df_cross = df1.join(df2, how="cross", rsuffix="_right", lsuffix="_left")
print(f'Left join result:\n{df_left}')
print(f'Right join result:\n{df_right}')
print(f'Inner join result:\n{df_inner}')
print(f'Outer join result:\n{df_outer}')
print(f'Cartesian product result:\n{df_cross}')
---------------------------------------------------------
Left join result:
   id  age  class   name
0   1   16      3   John
1   2   19      1  Smith
2   3   16      3  Jesse
3   4   19      1   Neal
4   5   16      3    Bob
5   6   19      1    NaN
Right join result:
   id  age  class   name
0   1   16      3   John
1   2   19      1  Smith
2   3   16      3  Jesse
3   4   19      1   Neal
4   5   16      3    Bob
Inner join result:
   id  age  class   name
0   1   16      3   John
1   2   19      1  Smith
2   3   16      3  Jesse
3   4   19      1   Neal
4   5   16      3    Bob
Outer join result:
   id  age  class   name
0   1   16      3   John
1   2   19      1  Smith
2   3   16      3  Jesse
3   4   19      1   Neal
4   5   16      3    Bob
5   6   19      1    NaN
Cartesian product result:
    id  age  class   name
0    1   16      3   John
1    1   16      3  Smith
2    1   16      3  Jesse
3    1   16      3   Neal
4    1   16      3    Bob
5    2   19      1   John
6    2   19      1  Smith
7    2   19      1  Jesse
8    2   19      1   Neal
9    2   19      1    Bob
10   3   16      3   John
11   3   16      3  Smith
12   3   16      3  Jesse
13   3   16      3   Neal
14   3   16      3    Bob
15   4   19      1   John
16   4   19      1  Smith
17   4   19      1  Jesse
18   4   19      1   Neal
19   4   19      1    Bob
20   5   16      3   John
21   5   16      3  Smith
22   5   16      3  Jesse
23   5   16      3   Neal
24   5   16      3    Bob
25   6   19      1   John
26   6   19      1  Smith
27   6   19      1  Jesse
28   6   19      1   Neal
29   6   19      1    Bob
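
The example above joins index-on-index and has no overlapping column names, so neither on nor the suffixes have a visible effect. The following sketch, using hypothetical data (orders, prices, a, b), shows a VLOOKUP-style join through on with a named Series, and what lsuffix and rsuffix do when column names collide.

```python
import pandas as pd

# Hypothetical lookup: a column in the caller is matched against the
# index of a named Series
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "product": ["pen", "ink", "pen"]})
prices = pd.Series({"pen": 1.5, "ink": 4.0}, name="unit_price")

# on="product" matches orders["product"] against the index of prices;
# the Series name becomes the new column name
priced = orders.join(prices, on="product")
print(priced)

# When both frames share a column name, the suffixes keep them apart
a = pd.DataFrame({"value": [1, 2]})
b = pd.DataFrame({"value": [3, 4]})
both = a.join(b, lsuffix="_left", rsuffix="_right")
print(both)
```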

3. Comparing (compare)

3.1 Series.compare
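
Series.compare takes the same keyword arguments as DataFrame.compare and reports where two identically indexed Series differ. A minimal sketch, assuming two hypothetical Series s1 and s2:

```python
import pandas as pd

# Hypothetical Series with the same index labels
s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# With the default align_axis=1 the result is a DataFrame with the
# columns "self" and "other", holding only the positions that differ
diff = s1.compare(s2)
print(diff)

# keep_shape=True keeps every position; equal values are shown as NaN
full_shape = s1.compare(s2, keep_shape=True)
print(full_shape)
```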

3.2 DataFrame.compare

Compares the caller with another DataFrame and shows the differences.

DataFrame.compare(other, align_axis=1, 
                  keep_shape=False, keep_equal=False, 
                  result_names=('self', 'other'))

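Parameters:

Parameter	Description
other	DataFrame
align_axis	{0 or 'index', 1 or 'columns'}, default 1. Determines the axis on which to align the comparison. 0 or 'index': the resulting differences are stacked vertically, with rows drawn alternately from self and other. 1 or 'columns': the resulting differences are aligned horizontally, with columns drawn alternately from self and other.
keep_shape	bool, default False. If True, all rows and columns are kept. Otherwise, only the rows and columns with differing values are kept.
keep_equal	bool, default False. If True, equal values are kept in the result. Otherwise, equal values are shown as NaN.
result_names	tuple, default ('self', 'other'). Sets the names of the two DataFrames in the comparison.

Returns: DataFrame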

import pandas as pd
import numpy as np


df1 = pd.DataFrame([
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
])
df2 = pd.DataFrame([
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
    np.random.randint(1, 10, size=3),
])

df = df1.compare(df2, align_axis=0)
# Print the result
print(f'Comparison result:\n{df}')
Comparison result:
         0  1    2
0 self   6  8  9.0
  other  7  9  8.0
1 self   1  4  4.0
  other  7  7  5.0
2 self   3  8  NaN
  other  4  3  NaN
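
Because the example above uses random data, its output changes on every run. The sketch below uses fixed, hypothetical frames (before and after) to show the default horizontal alignment together with keep_shape, keep_equal, and result_names. Note that result_names is assumed to be available, which requires pandas 1.5 or later, and that both frames must have identical row and column labels.

```python
import pandas as pd

# Hypothetical frames with identical labels and two changed cells
before = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
after = pd.DataFrame({"a": [1, 20, 3], "b": [4, 5, 60]})

# Default align_axis=1: differences side by side under each column
side_by_side = before.compare(after)
print(side_by_side)

# keep_shape=True keeps all rows and columns;
# keep_equal=True shows equal values instead of NaN
full = before.compare(after, keep_shape=True, keep_equal=True)
print(full)

# result_names replaces the default "self"/"other" labels
named = before.compare(after, result_names=("before", "after"))
print(named)
```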
