Splitting a list into multiple arrays in Python / PySpark: splitting multiple array columns into rows

Spark >= 2.4

You can replace the zip_ UDF with the built-in arrays_zip function:

from pyspark.sql.functions import arrays_zip, col, explode

(df
    # Combine b and c element-wise into an array of structs
    .withColumn("tmp", arrays_zip("b", "c"))
    # Emit one row per struct
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.b"), col("tmp.c"), "d"))
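One behavioural difference is worth checking before swapping the UDF for arrays_zip: Python's zip stops at the shortest input, whereas arrays_zip, to the best of my knowledge, pads the shorter array with nulls up to the length of the longest one. The plain-Python sketch below models both behaviours without a Spark session. The helper names and sample values are made up for illustration, and zip_longest only stands in for arrays_zip, so verify the padding on your own Spark version.

```python
from itertools import zip_longest


def zip_truncating(b, c):
    """Model of the zip_ UDF: stops at the shorter array."""
    return list(zip(b, c))


def zip_padding(b, c):
    """Stand-in for arrays_zip (assumed behaviour): pads with None."""
    return list(zip_longest(b, c, fillvalue=None))


# Hypothetical arrays of unequal length
b = [1, 2, 3]
c = [7, 8]

truncated = zip_truncating(b, c)
padded = zip_padding(b, c)

print(truncated)
print(padded)
```

If the arrays in a row always have the same length, the two variants produce the same rows.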

Spark < 2.4

With DataFrames and a UDF:

from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode

zip_ = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        # Adjust types to reflect data types
        StructField("first", IntegerType()),
        StructField("second", IntegerType())
    ]))
)

(df
    .withColumn("tmp", zip_("b", "c"))
    # UDF output cannot be directly passed to explode
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))

With RDDs:

(df
    .rdd
    .flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
    .toDF(["a", "b", "c", "d"]))
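To see what the flatMap lambda produces, its body can be run on ordinary Python objects. This is only a sketch of the per-row logic, not Spark code: the Row namedtuple and the sample values are assumptions standing in for the real DataFrame rows.

```python
from collections import namedtuple

# Hypothetical row layout matching the columns used above
Row = namedtuple("Row", ["a", "b", "c", "d"])


def explode_row(row):
    """Same expression as the lambda passed to flatMap."""
    return [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)]


rows = [
    Row(a=1, b=[1, 2, 3], c=[7, 8, 9], d="foo"),
    Row(a=2, b=[4, 5], c=[10, 11], d="bar"),
]

# flatMap: map every row to a list of tuples, then flatten the lists
flattened = [out for row in rows for out in explode_row(row)]

for out in flattened:
    print(out)
```

Each input row yields as many output tuples as its arrays have elements, with a and d repeated on every one.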

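Neither of these solutions is efficient, because of the overhead of moving data between the JVM and Python. If the arrays have a fixed, known length, you can do something like this instead: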

from functools import reduce
from pyspark.sql import DataFrame

# Length of array
n = 3

# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
    DataFrame.unionAll,
    (df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
        for i in range(n))
).toDF("a", "b", "c", "d")
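The union of per-index selections can be modelled in plain Python to check that it yields the same rows as the row-wise approaches. This is a sketch under assumptions: the rows are made-up dicts, list concatenation stands in for unionAll, and the order shown is only what this model produces, since Spark does not guarantee row order after a union.

```python
from functools import reduce

# Hypothetical input rows with arrays of fixed length n
rows = [
    {"a": 1, "b": [1, 2, 3], "c": [7, 8, 9], "d": "foo"},
    {"a": 2, "b": [4, 5, 6], "c": [10, 11, 12], "d": "bar"},
]
n = 3


def select_index(rows, i):
    """Model of df.select("a", b[i], c[i], "d") for one index i."""
    return [(r["a"], r["b"][i], r["c"][i], r["d"]) for r in rows]


# Concatenating the n selections plays the role of unionAll
unioned = reduce(
    lambda left, right: left + right,
    (select_index(rows, i) for i in range(n)),
)

# Row-wise zip for comparison
row_wise = [
    (r["a"], b, c, r["d"]) for r in rows for b, c in zip(r["b"], r["c"])
]

print(unioned)
print(sorted(unioned) == sorted(row_wise))
```

In this model the output is grouped by array index rather than by source row, so add an explicit sort if the order matters.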

Or even:

from pyspark.sql.functions import array, struct

# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
    struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
    for i in range(n)
]))

(df
    .withColumn("tmp", tmp)
    .select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))

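This should be significantly faster than the UDF or RDD approaches. It can be generalized to support an arbitrary number of columns: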

# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
    return explode(array(*[
        struct(*[col(c).getItem(i).alias(c) for c in colnames])
        for i in range(n)
    ]))

df.withColumn("tmp", zip_and_explode("b", "c", n=3))
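The generalized version can also be mirrored in plain Python to show the keyword-only n and the shape of the result. This is a rough model, not Spark code: zip_and_explode_rows and the sample row are hypothetical, and the model writes the exploded values straight back into the row, which in Spark would take an additional select on tmp.b and tmp.c.

```python
def zip_and_explode_rows(rows, *colnames, n):
    """Plain-Python model: one output dict per (row, index) pair.

    n is keyword-only, mirroring the signature above.
    """
    exploded = []
    for row in rows:
        for i in range(n):
            new_row = dict(row)  # keep untouched columns such as a and d
            for name in colnames:
                new_row[name] = row[name][i]  # element i replaces the array
            exploded.append(new_row)
    return exploded


# Hypothetical single-row input
rows = [{"a": 1, "b": [1, 2, 3], "c": [7, 8, 9], "d": "foo"}]

result = zip_and_explode_rows(rows, "b", "c", n=3)
for out in result:
    print(out)
```

Because n comes after *colnames, it has to be passed by name. A call such as zip_and_explode_rows(rows, "b", "c", 3) would treat 3 as another column name and fail with a TypeError for the missing n.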
