pyspark主要通过DataFrame来完成数据的一些列处理,之前pyspark训练中已经给出了一些数据处理的方法,这里重点说下之前没说过的数据合并的问题,我们给出一个例子:
from pyspark.sql import SparkSession
import sys
if __name__ == "__main__":
print("begin")
spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate()
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['name','id'])
valuesB = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
TableB = spark.createDataFrame(valuesB,['name','id'])
TableA.show()
TableB.show()
ta = TableA.alias('ta')
tb = TableB.alias('tb')
inner_join = ta.join(tb, ta.name == tb.name)
inner_join.show()
mta = ta.selectExpr("name as name1", "id as id1")
mta.show()
mtb = tb.selectExpr("name as name2", "id as id2")
mtb.show()
left_join = mta.join(mtb, mta.name1 == mtb.name2,how='left') # Could also use 'left_outer'
left_join.show()
outer_join = mta.join(mtb, mta.name1 == mtb.name2,how='outer') # Could also use 'left_outer'
outer_join.show()
运行结果如下:
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
+-----------+---+
| name| id|
+-----------+---+
| Rutabaga| 1|
| Pirate| 2|
| Ninja| 3|
|Darth Vader| 4|
+-----------+---+
+------+---+------+---+
| name| id| name| id|
+------+---+------+---+
| Ninja| 3| Ninja| 3|
|Pirate| 1|Pirate| 2|
+------+---+------+---+
+---------+---+
| name1|id1|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
+-----------+---+
| name2|id2|
+-----------+---+
| Rutabaga| 1|
| Pirate| 2|
| Ninja| 3|
|Darth Vader| 4|
+-----------+---+
+---------+---+------+----+
| name1|id1| name2| id2|
+---------+---+------+----+
|Spaghetti| 4| null|null|
| Ninja| 3| Ninja| 3|
| Pirate| 1|Pirate| 2|
| Monkey| 2| null|null|
+---------+---+------+----+
+---------+----+-----------+----+
| name1| id1| name2| id2|
+---------+----+-----------+----+
| null|null| Rutabaga| 1|
|Spaghetti| 4| null|null|
| Ninja| 3| Ninja| 3|
| Pirate| 1| Pirate| 2|
| Monkey| 2| null|null|
| null|null|Darth Vader| 4|
+---------+----+-----------+----+
这里首先有两个表合并的方法,用到了3种方法,inner、left和outer,分别实现取交集、取左并集和取并集,join操作的时候需要用alias为表命名别名。
最后还有一个常用的问题就是合并时有两列列名相同的情况,可以使用
mta = ta.selectExpr("name as name1", "id as id1")
这种方式来修改列名。