#### Preface
When working with PySpark, I occasionally forget how to disambiguate duplicate column names after a multi-table join and end up searching online, so this post records the basic operations for my own future reference.
#### Create two DataFrames
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xx").getOrCreate()

data1_set = [
    ["123", "A", "cat"],
    ["123", "B", "cat"],
    ["124", "B", "cat"],
    ["124", "B", "cat"],
]
df1 = spark.createDataFrame(data1_set, schema=["num1", "letter", "pet"])

data2_set = [
    ["123", "A", "dog"],
    ["123", "B", "dog"],
    ["124", "B", "dog"],
    ["124", "A", "dog"],
]
# note that df1 and df2 share the column names "letter" and "pet"
df2 = spark.createDataFrame(data2_set, schema=["num2", "letter", "pet"])
```
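For reference, `df1.show()` and `df2.show()` should print something like the following (row order may vary):

```
+----+------+---+        +----+------+---+
|num1|letter|pet|        |num2|letter|pet|
+----+------+---+        +----+------+---+
| 123|     A|cat|        | 123|     A|dog|
| 123|     B|cat|        | 123|     B|dog|
| 124|     B|cat|        | 124|     B|dog|
| 124|     B|cat|        | 124|     A|dog|
+----+------+---+        +----+------+---+
```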
#### join
```python
df = df1.join(df2,
              on=[df1.num1 == df2.num2, df1.letter == df2.letter],
              how='inner')
# or, equivalently, with bracket notation
df = df1.join(df2,
              on=[df1["num1"] == df2["num2"], df1["letter"] == df2["letter"]],
              how='inner')
df.show()
```
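Given the data above, the inner join should print something like this (row order may vary):

```
+----+------+---+----+------+---+
|num1|letter|pet|num2|letter|pet|
+----+------+---+----+------+---+
| 123|     A|cat| 123|     A|dog|
| 123|     B|cat| 123|     B|dog|
| 124|     B|cat| 124|     B|dog|
| 124|     B|cat| 124|     B|dog|
+----+------+---+----+------+---+
```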
As the output shows, the columns `letter` and `pet` each appear twice, once from each DataFrame.
#### select columns
There are many ways to pass columns to `select`; all of the following are equivalent:
dfp = df.select("num1", "num2")
dfp.show()
dfp = df.select(["num1", "num2"])
dfp.show()
dfp = df.select([df["num1"], df["num2"]])
dfp.show()
dfp = df.select(df["num1"], df["num2"])
dfp.show()
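All four forms should print the same two-column result (row order may vary):

```
+----+----+
|num1|num2|
+----+----+
| 123| 123|
| 123| 123|
| 124| 124|
| 124| 124|
+----+----+
```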
But selecting one of the duplicated columns, `letter` or `pet`, by name alone raises an error:

```python
dfp = df.select("num1", "pet")
```
This fails with an `AnalysisException`; in Spark 3.x the message reads roughly as follows (the exact wording varies by version):
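```
pyspark.sql.utils.AnalysisException: Reference 'pet' is ambiguous, could be: pet, pet.
```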
Spark cannot tell whether `pet` refers to the column from df1 or the one from df2, so the reference must be qualified with its source DataFrame:
dfp = df.select("num1", df1["pet"]) # 可以混合使用
dfp.show()```
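If both `pet` columns are needed in the result, one common pattern (a sketch going slightly beyond the original example, using the standard `Column.alias` method) is to qualify each reference and rename it so the output has unique names:

```python
# keep both duplicated columns under distinct names;
# "pet_left" / "pet_right" are hypothetical names chosen for illustration
dfp = df.select(
    df1["pet"].alias("pet_left"),   # pet from df1
    df2["pet"].alias("pet_right"),  # pet from df2
)
dfp.show()
```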
#### Summary
PySpark's join and select accept many equivalent forms, which are easy to mix up; the key point is that a column name duplicated by a join must be qualified through the DataFrame it came from.