PysparkNote104---join (table joins)

Intro

    Usage of pyspark's join -- for most cases, reading the API docs is enough.

Data setup

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
def get_or_create(app_name):
    # Build (or reuse) a SparkSession with a few common tuning options
    spark = (
        SparkSession.builder.appName(app_name)
        .config("spark.driver.maxResultSize", "10g")
        .config("spark.sql.execution.arrow.enabled", "true")  # faster pandas <-> Spark conversion
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.sql.crossJoin.enabled", "true")
        .config("spark.kryoserializer.buffer.max", "512m")
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")  # quiet the console output
    return spark
import pandas as pd
df1 = pd.DataFrame({"name": ["A", "B"],"name1": ["A", "B"], "age": [10, 20]})
df2 = pd.DataFrame({"name": ["A"],"name1": ["A"], "sex": ["male"]})
spark = get_or_create("spark")
df_spark1 = spark.createDataFrame(df1)
df_spark2 = spark.createDataFrame(df2)

df_spark1.show(truncate=False)
+----+-----+---+
|name|name1|age|
+----+-----+---+
|A   |A    |10 |
|B   |B    |20 |
+----+-----+---+
df_spark2.show(truncate=False)
+----+-----+----+
|name|name1|sex |
+----+-----+----+
|A   |A    |male|
+----+-----+----+

join

The main point is the difference in usage when the join column names are the same versus different. It is all in help(DataFrame.join), excerpted below:

        :param other: Right side of the join
        :param on: a string for the join column name, a list of column names,
            a join expression (Column), or a list of Columns.
            If `on` is a string or a list of strings indicating the name of the join column(s),
            the column(s) must exist on both sides, and this performs an equi-join.
        :param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
            ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
            ``left_semi``, and ``left_anti``.

When the join column names are the same

df_spark1.join(other=df_spark2,on=['name'],how='left').show()
+----+-----+---+-----+----+
|name|name1|age|name1| sex|
+----+-----+---+-----+----+
|   B|    B| 20| null|null|
|   A|    A| 10|    A|male|
+----+-----+---+-----+----+

Note that only the join column name is deduplicated; name1 appears twice, once from each side.

When multiple join column names are the same

df_spark1.join(other=df_spark2,on=['name','name1'],how='left').show()
+----+-----+---+----+
|name|name1|age| sex|
+----+-----+---+----+
|   A|    A| 10|male|
|   B|    B| 20|null|
+----+-----+---+----+

When the join column names differ

df_spark1.join(other=df_spark2,on=[df_spark1.name==df_spark2.name1],how='left').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   B|    B| 20|null| null|null|
|   A|    A| 10|   A|    A|male|
+----+-----+---+----+-----+----+

Multiple join conditions

df_spark1.join(other=df_spark2,on=[df_spark1.name==df_spark2.name1,df_spark1.name1==df_spark2.name],how='left').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   A|    A| 10|   A|    A|male|
|   B|    B| 20|null| null|null|
+----+-----+---+----+-----+----+
df_spark1.join(other=df_spark2,on=[df_spark1.name==df_spark2.name1,df_spark1.name1==df_spark2.name],how='outer').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   A|    A| 10|   A|    A|male|
|   B|    B| 20|null| null|null|
+----+-----+---+----+-----+----+

That concludes this brief introduction to the basic usage.

                                2022-08-04, Jiulonghu, Jiangning District, Nanjing
