To learn more, visit "文渊小站".
It has more write-ups and some fun small projects.
Environment
Spark 2.4.0
Joining DataFrames column-wise (join)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
import re
import pandas as pd
from pyspark.sql.types import *
import pyspark.sql.functions as F
conf = SparkConf().setAppName('fea_extracting').set('spark.io.compression.codec','snappy')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
peopleDF = sqlContext.read.json("data/test/people.json")
peopleDF.show()
+----+-------+-----+-----+
| age| name|pcode| pcoe|
+----+-------+-----+-----+
|null| Alice|94304| null|
| 30|Brayden|94304| null|
| 19| Carla|94304|10036|
| 46| Diana|94304| null|
|null|Etienne|94104| null|
+----+-------+-----+-----+
pcodesDF = sqlContext.read.json("data/test/pcodes.json")
pcodesDF.show()
+-------------+-----+-----+
| city|pcode|state|
+-------------+-----+-----+
| Santa Fe|87501| NM|
| New York|10036| NY|
| HUGO|94304| CA|
| Palo Alto|94304| CA|
|San Francisco|94104| CA|
+-------------+-----+-----+
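# Inner join (the default when no join type is given): keeps only the rows whose pcode matches on both sides; the result has the columns of both DataFrames
# To join on several columns, pass a list of names present in both DataFrames, e.g. ['pcode', 'state'] if peopleDF also had a state column
mydf000 = peopleDF.join(pcodesDF, ['pcode'])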
mydf000.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age| name| pcoe| city|state|
+-----+----+-------+-----+-------------+-----+
|94304| 46| Diana| null| HUGO| CA|
|94304| 19| Carla|10036| HUGO| CA|
|94304| 30|Brayden| null| HUGO| CA|
|94304|null| Alice| null| HUGO| CA|
|94304| 46| Diana| null| Palo Alto| CA|
|94304| 19| Carla|10036| Palo Alto| CA|
|94304| 30|Brayden| null| Palo Alto| CA|
|94304|null| Alice| null| Palo Alto| CA|
|94104|null|Etienne| null|San Francisco| CA|
+-----+----+-------+-----+-------------+-----+
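# Left semi join: keeps only the rows of the left DataFrame that have a match, each at most once, with the left DataFrame's columns only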
mydf001=peopleDF.join(pcodesDF,"pcode","leftsemi")
mydf001.show()
+-----+----+-------+-----+
|pcode| age| name| pcoe|
+-----+----+-------+-----+
|94304|null| Alice| null|
|94304| 30|Brayden| null|
|94304| 19| Carla|10036|
|94304| 46| Diana| null|
|94104|null|Etienne| null|
+-----+----+-------+-----+
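The difference between the inner join and the left semi join shown so far can be modeled in plain Python. This is only a sketch of the semantics on lists of dicts, not how Spark executes a join; the helper names `inner_join` and `left_semi_join`, the trimmed sample rows, and the unmatched "Zed" row are made up for illustration.

```python
# Plain-Python model of inner vs. left semi join semantics (not Spark code).
people = [
    {"name": "Carla", "pcode": "94304"},
    {"name": "Etienne", "pcode": "94104"},
    {"name": "Zed", "pcode": "00000"},  # hypothetical row with no matching pcode
]
pcodes = [
    {"city": "HUGO", "pcode": "94304"},
    {"city": "Palo Alto", "pcode": "94304"},
    {"city": "San Francisco", "pcode": "94104"},
]


def inner_join(left, right, key):
    """One output row per matching (left, right) pair, with the columns of both."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_semi_join(left, right, key):
    """Each left row at most once, left columns only, if any right row matches."""
    right_keys = {r[key] for r in right}
    return [dict(l) for l in left if l[key] in right_keys]


inner_rows = inner_join(people, pcodes, "pcode")
semi_rows = left_semi_join(people, pcodes, "pcode")
print(len(inner_rows), len(semi_rows))
```

Carla appears twice in `inner_rows` because 94304 matches two cities, but only once in `semi_rows`, which also carries no `city` column.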
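# Left outer join: keeps every row of the left DataFrame; the result has the columns of both, and the right-hand columns are null where nothing matches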
mydf002=peopleDF.join(pcodesDF,"pcode","left_outer")
mydf002.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age| name| pcoe| city|state|
+-----+----+-------+-----+-------------+-----+
|94304|null| Alice| null| Palo Alto| CA|
|94304|null| Alice| null| HUGO| CA|
|94304| 30|Brayden| null| Palo Alto| CA|
|94304| 30|Brayden| null| HUGO| CA|
|94304| 19| Carla|10036| Palo Alto| CA|
|94304| 19| Carla|10036| HUGO| CA|
|94304| 46| Diana| null| Palo Alto| CA|
|94304| 46| Diana| null| HUGO| CA|
|94104|null|Etienne| null|San Francisco| CA|
+-----+----+-------+-----+-------------+-----+
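# Right outer join: keeps every row of the right DataFrame; the result has the columns of both, and the left-hand columns are null where nothing matches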
mydf003=peopleDF.join(pcodesDF,"pcode","right_outer")
mydf003.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age| name| pcoe| city|state|
+-----+----+-------+-----+-------------+-----+
|87501|null| null| null| Santa Fe| NM|
|10036|null| null| null| New York| NY|
|94304| 46| Diana| null| HUGO| CA|
|94304| 19| Carla|10036| HUGO| CA|
|94304| 30|Brayden| null| HUGO| CA|
|94304|null| Alice| null| HUGO| CA|
|94304| 46| Diana| null| Palo Alto| CA|
|94304| 19| Carla|10036| Palo Alto| CA|
|94304| 30|Brayden| null| Palo Alto| CA|
|94304|null| Alice| null| Palo Alto| CA|
|94104|null|Etienne| null|San Francisco| CA|
+-----+----+-------+-----+-------------+-----+
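The null-filling behavior of the two outer joins can be sketched the same way in plain Python. This is a model of the semantics under simplifying assumptions, not Spark's implementation; `left_outer_join`, `right_outer_join`, the reduced sample rows, and the unmatched "Zed" row are hypothetical.

```python
# Plain-Python model of left/right outer join semantics (not Spark code).
people = [
    {"name": "Carla", "pcode": "94304"},
    {"name": "Zed", "pcode": "00000"},  # hypothetical: no matching pcode
]
pcodes = [
    {"city": "Palo Alto", "pcode": "94304"},
    {"city": "Santa Fe", "pcode": "87501"},  # no matching person
]


def left_outer_join(left, right, key, right_cols):
    """Keep every left row; fill the right columns with None when nothing matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out


def right_outer_join(left, right, key, left_cols):
    """A right outer join is a left outer join with the two sides swapped."""
    return left_outer_join(right, left, key, left_cols)


left_rows = left_outer_join(people, pcodes, "pcode", ["city"])
right_rows = right_outer_join(people, pcodes, "pcode", ["name"])
for row in left_rows + right_rows:
    print(row)
```

In `left_rows` the unmatched person keeps a `None` city, while in `right_rows` the unmatched postcode keeps a `None` name, mirroring the null cells in the `show()` output above.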
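Appending DataFrames row-wise (union)
# Append the rows of df2 to those of df1 (note: union matches columns by position, not by name, so df1 and df2 must have the same columns in the same order)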
df3 = df1.union(df2)
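Why the column order matters for `union` can be illustrated with a small plain-Python model. It contrasts a positional union with a by-name union, which is what `DataFrame.unionByName` (available since Spark 2.3) does. The helper functions and sample rows below are assumptions made for the sketch, not Spark code.

```python
# Plain-Python model of positional vs. by-name union (not Spark code).
cols1, rows1 = ["name", "age"], [("Alice", 30)]
cols2, rows2 = ["age", "name"], [(19, "Carla")]  # same columns, different order


def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Like DataFrame.union: keep the first schema and append rows unchanged."""
    assert len(cols_a) == len(cols_b), "column counts must match"
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Like DataFrame.unionByName: reorder the second side to the first schema."""
    assert set(cols_a) == set(cols_b), "column names must match"
    idx = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(r[i] for i in idx) for r in rows_b]


_, by_pos = union_by_position(cols1, rows1, cols2, rows2)
_, by_name = union_by_name(cols1, rows1, cols2, rows2)
print(by_pos)   # the second row lands in the wrong columns
print(by_name)  # the second row is realigned to (name, age)
```

If df1 and df2 may list their columns in a different order, `df1.unionByName(df2)` is the safer call; otherwise reorder df2 with `select` before calling `union`.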