python nested structs_python – Renaming nested fields in a Spark DataFrame

Latest recommended article published on 2022-01-22 17:07:08

与何人说


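Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_35683697/article/details/114921486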


Python

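You cannot modify a single nested field in place. You have to recreate the whole structure. In this particular case, the simplest solution is to use cast.

First, a handful of imports: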

from collections import namedtuple

from pyspark.sql.functions import col
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)

And some example data:

Record = namedtuple("Record", ["a", "b", "c"])

df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])

Let's confirm that the schema is the same as in your case:

df.printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)

You can define the new schema as a string:

str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"

df.select(col("array_field").cast(str_schema)).printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a_renamed: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)
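When the struct has many fields, typing the schema string by hand gets error-prone. The following is a minimal sketch of building it from a field list plus a rename mapping. The helper name `struct_array_ddl` and its arguments are hypothetical, not part of PySpark, and it assumes plain field names that need no backtick quoting. The field order is preserved, since the cast should only change names, not the layout of the struct.

```python
def struct_array_ddl(fields, renames):
    """Build a schema string for an array of structs, applying renames.

    fields:  list of (name, type) pairs in the struct's original field order.
    renames: dict mapping old field names to new ones (hypothetical helper input).
    """
    parts = [
        f"{renames.get(name, name)}:{ddl_type}"  # keep the name unless a rename is given
        for name, ddl_type in fields
    ]
    return f"array<struct<{','.join(parts)}>>"


fields = [("a", "string"), ("b", "bigint"), ("c", "bigint")]
str_schema = struct_array_ddl(fields, {"a": "a_renamed"})
print(str_schema)  # array<struct<a_renamed:string,b:bigint,c:bigint>>
```

The resulting string can be passed to `cast` in the same way as the handwritten `str_schema` above.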

Or as a DataType:

struct_schema = ArrayType(StructType([
    StructField("a_renamed", StringType()),
    StructField("b", LongType()),
    StructField("c", LongType())
]))

df.select(col("array_field").cast(struct_schema)).printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a_renamed: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)

Scala

The same technique can be used in Scala:

case class Record(a: String, b: Long, c: Long)

val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")

val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"

df.select($"array_field".cast(strSchema))

or

import org.apache.spark.sql.types._

val structSchema = ArrayType(StructType(Seq(
  StructField("a_renamed", StringType),
  StructField("b", LongType),
  StructField("c", LongType)
)))

df.select($"array_field".cast(structSchema))

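Possible improvements:

If you use an expressive data manipulation or JSON processing library, it can be easier to dump the data type to a dict or JSON string and work from there, for example (Python / toolz):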

from operator import attrgetter

from toolz.curried import map, pipe, update_in

# Update name to "a_updated" if name is "a"
rename_field = update_in(
    keys=["name"], func=lambda x: "a_updated" if x == "a" else x)

updated_schema = pipe(
    # Get schema of the field as a dict
    df.schema["array_field"].jsonValue(),
    # Update fields with rename
    update_in(
        keys=["type", "elementType", "fields"],
        func=lambda x: pipe(x, map(rename_field), list)),
    # Load schema from dict
    StructField.fromJson,
    # Get data type
    attrgetter("dataType"))

df.select(col("array_field").cast(updated_schema)).printSchema()
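If toolz is not available, the same dict-based idea can be sketched with the standard library alone. The function `rename_fields` below is a hypothetical helper, not a PySpark API. The `field_json` literal is written by hand to mirror what `df.schema["array_field"].jsonValue()` is expected to return for the example data, so the snippet runs without Spark. Note that it renames matching field names at every nesting level, and for maps it only descends into the value type.

```python
def rename_fields(dtype, renames):
    """Return a copy of a schema JSON data type with struct fields renamed.

    dtype is a str for primitive types ("string", "long") and a dict for
    complex ones; the input is never mutated.
    """
    if not isinstance(dtype, dict):
        return dtype  # primitive type name, nothing to rename
    kind = dtype.get("type")
    if kind == "struct":
        return {**dtype, "fields": [
            {**f,
             "name": renames.get(f["name"], f["name"]),
             "type": rename_fields(f["type"], renames)}
            for f in dtype["fields"]
        ]}
    if kind == "array":
        return {**dtype,
                "elementType": rename_fields(dtype["elementType"], renames)}
    if kind == "map":
        return {**dtype,
                "valueType": rename_fields(dtype["valueType"], renames)}
    return dtype  # any other complex type is returned unchanged


# Hand-written stand-in for df.schema["array_field"].jsonValue()
field_json = {
    "name": "array_field",
    "type": {
        "type": "array",
        "elementType": {
            "type": "struct",
            "fields": [
                {"name": "a", "type": "string", "nullable": True, "metadata": {}},
                {"name": "b", "type": "long", "nullable": True, "metadata": {}},
                {"name": "c", "type": "long", "nullable": True, "metadata": {}},
            ],
        },
        "containsNull": True,
    },
    "nullable": True,
    "metadata": {},
}

updated = {**field_json,
           "type": rename_fields(field_json["type"], {"a": "a_updated"})}
names = [f["name"] for f in updated["type"]["elementType"]["fields"]]
print(names)  # ['a_updated', 'b', 'c']
```

With Spark at hand, `StructField.fromJson(updated).dataType` should then give the data type to pass to `cast`, as in the toolz version.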
