在SQL和Pandas中,连接查询(join)是处理数据集之间关系的重要工具。下面是SQL中的各种连接查询类型及其与Pandas中相应操作的对应关系:
1. INNER JOIN
-
SQL:
INNER JOIN
返回两个表中具有匹配值的行。
-
Pandas:
merge()
方法的how
参数设置为'inner'
。- 示例代码:
merged_df = pd.merge(df1, df2, on='key', how='inner')
2. LEFT JOIN / LEFT OUTER JOIN
-
SQL:
LEFT JOIN
返回左表中的所有行,并在右表中找到匹配项时返回相应的行。如果没有匹配项,则返回NULL
。
-
Pandas:
merge()
方法的how
参数设置为'left'
。- 示例代码:
merged_df = pd.merge(df1, df2, on='key', how='left')
3. RIGHT JOIN / RIGHT OUTER JOIN
-
SQL:
RIGHT JOIN
返回右表中的所有行,并在左表中找到匹配项时返回相应的行。如果没有匹配项,则返回NULL
。
-
Pandas:
merge()
方法的how
参数设置为'right'
。- 示例代码:
merged_df = pd.merge(df1, df2, on='key', how='right')
4. FULL OUTER JOIN / FULL JOIN
-
SQL:
FULL OUTER JOIN
返回两个表中的所有行。对于没有匹配项的行,缺失的列会被填充为NULL
。
-
Pandas:
merge()
方法的how
参数设置为'outer'
。- 示例代码:
merged_df = pd.merge(df1, df2, on='key', how='outer')
5. CROSS JOIN
-
SQL:
CROSS JOIN
返回两个表的笛卡尔积,即所有可能的行组合。
-
Pandas:
merge()
方法没有直接对应的方法,但可以通过设置on
参数为None
并将how
设置为'outer'
来实现。- 示例代码:
merged_df = pd.merge(df1, df2, how='outer')
6. SEMI JOIN
-
SQL:
SEMI JOIN
返回左表中在右表中有匹配项的行。
-
Pandas:
merge()
方法结合boolean indexing
可以模拟SEMI JOIN
。- 示例代码:
semi_joined_df = df1[df1['key'].isin(df2['key'])]
7. ANTI JOIN
-
SQL:
ANTI JOIN
返回左表中在右表中没有匹配项的行。
-
Pandas:
merge()
方法结合boolean indexing
可以模拟ANTI JOIN
。- 示例代码:
anti_joined_df = df1[~df1['key'].isin(df2['key'])]
示例代码
假设我们有两个DataFrame df1
和 df2
,我们将演示这些连接操作:
import pandas as pd
# 创建示例 DataFrame
data1 = {
'key': ['A', 'B', 'C', 'D'],
'value1': [1, 2, 3, 4]
}
df1 = pd.DataFrame(data1)
data2 = {
'key': ['B', 'D', 'E'],
'value2': [5, 6, 7]
}
df2 = pd.DataFrame(data2)
# INNER JOIN
inner_joined_df = pd.merge(df1, df2, on='key', how='inner')
print("INNER JOIN:")
print(inner_joined_df)
# LEFT JOIN
left_joined_df = pd.merge(df1, df2, on='key', how='left')
print("\nLEFT JOIN:")
print(left_joined_df)
# RIGHT JOIN
right_joined_df = pd.merge(df1, df2, on='key', how='right')
print("\nRIGHT JOIN:")
print(right_joined_df)
# FULL OUTER JOIN
full_outer_joined_df = pd.merge(df1, df2, on='key', how='outer')
print("\nFULL OUTER JOIN:")
print(full_outer_joined_df)
# CROSS JOIN
cross_joined_df = pd.merge(df1, df2, how='outer')
print("\nCROSS JOIN:")
print(cross_joined_df)
# SEMI JOIN
semi_joined_df = df1[df1['key'].isin(df2['key'])]
print("\nSEMI JOIN:")
print(semi_joined_df)
# ANTI JOIN
anti_joined_df = df1[~df1['key'].isin(df2['key'])]
print("\nANTI JOIN:")
print(anti_joined_df)
输出示例
假设 df1
和 df2
如下所示:
df1:
key value1
0 A 1
1 B 2
2 C 3
3 D 4
df2:
key value2
0 B 5
1 D 6
2 E 7
输出结果将会是:
INNER JOIN:
key value1 value2
1 B 2 5
3 D 4 6
LEFT JOIN:
key value1 value2
0 A 1 NaN
1 B 2 5.0
2 C 3 NaN
3 D 4 6.0
RIGHT JOIN:
key value1 value2
1 B 2.0 5.0
3 D 4.0 6.0
2 E NaN 7.0
FULL OUTER JOIN:
key value1 value2
0 A 1.0 NaN
1 B 2.0 5.0
2 C 3.0 NaN
3 D 4.0 6.0
4 E NaN 7.0
CROSS JOIN:
key_x value1 key_y value2
0 A 1 B 5
1 A 1 D 6
2 A 1 E 7
3 B 2 B 5
4 B 2 D 6
5 B 2 E 7
6 C 3 B 5
7 C 3 D 6
8 C 3 E 7
9 D 4 B 5
10 D 4 D 6
11 D 4 E 7
SEMI JOIN:
key value1
1 B 2
3 D 4
ANTI JOIN:
key value1
0 A 1
2 C 3