Common PySpark syntax: joining DataFrames by column and by row

To learn more, visit "文渊小站".

It has more write-ups, plus some fun small projects.

Environment

Spark 2.4.0

Joining DataFrames by column (join)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
import re
import pandas as pd
from pyspark.sql.types import *
import pyspark.sql.functions as F
conf = SparkConf().setAppName('fea_extracting').set('spark.io.compression.codec','snappy')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
peopleDF = sqlContext.read.json("data/test/people.json")
peopleDF.show()
+----+-------+-----+-----+
| age|   name|pcode| pcoe|
+----+-------+-----+-----+
|null|  Alice|94304| null|
|  30|Brayden|94304| null|
|  19|  Carla|94304|10036|
|  46|  Diana|94304| null|
|null|Etienne|94104| null|
+----+-------+-----+-----+
pcodesDF = sqlContext.read.json("data/test/pcodes.json")
pcodesDF.show()
+-------------+-----+-----+
|         city|pcode|state|
+-------------+-----+-----+
|     Santa Fe|87501|   NM|
|     New York|10036|   NY|
|         HUGO|94304|   CA|
|    Palo Alto|94304|   CA|
|San Francisco|94104|   CA|
+-------------+-----+-----+
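# Inner join (the default when no join type is given): keeps only the rows whose pcode appears in both DataFrames, and returns the columns of both
# To join on several shared columns, list them all, e.g. ['pcode', 'state'] (every listed column must exist in both DataFrames)
mydf000 = peopleDF.join(pcodesDF, ['pcode'])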
mydf000.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age|   name| pcoe|         city|state|
+-----+----+-------+-----+-------------+-----+
|94304|  46|  Diana| null|         HUGO|   CA|
|94304|  19|  Carla|10036|         HUGO|   CA|
|94304|  30|Brayden| null|         HUGO|   CA|
|94304|null|  Alice| null|         HUGO|   CA|
|94304|  46|  Diana| null|    Palo Alto|   CA|
|94304|  19|  Carla|10036|    Palo Alto|   CA|
|94304|  30|Brayden| null|    Palo Alto|   CA|
|94304|null|  Alice| null|    Palo Alto|   CA|
|94104|null|Etienne| null|San Francisco|   CA|
+-----+----+-------+-----+-------------+-----+
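# Left semi join: keeps the rows of peopleDF that have a match in pcodesDF, returns only peopleDF's columns, and never duplicates a row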
mydf001=peopleDF.join(pcodesDF,"pcode","leftsemi")
mydf001.show()
+-----+----+-------+-----+
|pcode| age|   name| pcoe|
+-----+----+-------+-----+
|94304|null|  Alice| null|
|94304|  30|Brayden| null|
|94304|  19|  Carla|10036|
|94304|  46|  Diana| null|
|94104|null|Etienne| null|
+-----+----+-------+-----+
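# Left outer join: keeps every row of peopleDF and returns the columns of both DataFrames (nulls where pcodesDF has no match)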
mydf002=peopleDF.join(pcodesDF,"pcode","left_outer")
mydf002.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age|   name| pcoe|         city|state|
+-----+----+-------+-----+-------------+-----+
|94304|null|  Alice| null|    Palo Alto|   CA|
|94304|null|  Alice| null|         HUGO|   CA|
|94304|  30|Brayden| null|    Palo Alto|   CA|
|94304|  30|Brayden| null|         HUGO|   CA|
|94304|  19|  Carla|10036|    Palo Alto|   CA|
|94304|  19|  Carla|10036|         HUGO|   CA|
|94304|  46|  Diana| null|    Palo Alto|   CA|
|94304|  46|  Diana| null|         HUGO|   CA|
|94104|null|Etienne| null|San Francisco|   CA|
+-----+----+-------+-----+-------------+-----+
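# Right outer join: keeps every row of pcodesDF and returns the columns of both DataFrames (nulls where peopleDF has no match)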
mydf003=peopleDF.join(pcodesDF,"pcode","right_outer")
mydf003.show()
+-----+----+-------+-----+-------------+-----+
|pcode| age|   name| pcoe|         city|state|
+-----+----+-------+-----+-------------+-----+
|87501|null|   null| null|     Santa Fe|   NM|
|10036|null|   null| null|     New York|   NY|
|94304|  46|  Diana| null|         HUGO|   CA|
|94304|  19|  Carla|10036|         HUGO|   CA|
|94304|  30|Brayden| null|         HUGO|   CA|
|94304|null|  Alice| null|         HUGO|   CA|
|94304|  46|  Diana| null|    Palo Alto|   CA|
|94304|  19|  Carla|10036|    Palo Alto|   CA|
|94304|  30|Brayden| null|    Palo Alto|   CA|
|94304|null|  Alice| null|    Palo Alto|   CA|
|94104|null|Etienne| null|San Francisco|   CA|
+-----+----+-------+-----+-------------+-----+

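Joining DataFrames by row (union)

# Append the rows of df2 after the last row of df1. Note: union matches columns by position, not by name, so df1 and df2 must have the same columns in the same order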
df3 = df1.union(df2)