First, at import time: if you declare the column as IntegerType in the schema, createDataFrame immediately throws an "IntegerType can not accept ..." error.
And if you dodge that by declaring the column as StringType and then casting it to IntegerType, every value comes out as null:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructField, StructType, StringType, LongType

spark = SparkSession.builder.appName("example").getOrCreate()

# documents: the list of dicts fetched from MongoDB
schema = StructType([
    StructField("created_at", StringType(), True)
])
df = spark.createDataFrame(documents, schema=schema)
# the cast turns every bson.int64.Int64 value into null
df = df.withColumn("created_at", col("created_at").cast("integer"))
Worse, as long as this type is present in the data, even a casual df.show() throws:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 55) (driver-7b9bff5d64-v94tb executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for bson.int64.Int64). This happens when an unsupported/unregistered class is being unpickled that requires construction arguments. Fix it by registering a custom IObjectConstructor for this class.
When you then go hunting for serialization/deserialization fixes, it is baffling for a PySpark beginner: I fiddled with CloudPickleSerializer every which way for ages and still couldn't figure out what to actually do, and asking GPT just got the same talking points going in circles.
Since show() fails, select() and collect() also error out, which makes it hard to pin down where the problem actually is.
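One quick way to locate the offending fields without touching show() is to inspect the raw documents' value types before Spark ever sees them. A minimal sketch, assuming documents is the list of dicts fetched from pymongo:

from collections import Counter

# tally the Python type of every field across all documents;
# any bson.int64.Int64 entries are what Spark's pickler chokes on
type_counts = Counter(
    (k, type(v).__name__) for doc in documents for k, v in doc.items()
)
for (field, type_name), n in sorted(type_counts.items()):
    print(f"{field}: {type_name} x{n}")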
Then it hit me: there is a dead-simple workaround,
import bson

# pre-convert: any bson.int64.Int64 value becomes a plain Python int
documents = [
    {k: int(v) if isinstance(v, bson.int64.Int64) else v for k, v in doc.items()}
    for doc in documents
]
That is, iterate over the docs fetched from MongoDB and, whenever a value is of type bson.int64.Int64, convert it to a plain int first. (Note this one-liner only touches top-level keys; a recursive variant for nested documents is sketched below.)
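If your MongoDB documents contain nested sub-documents or arrays, the same idea would need to recurse. A sketch, assuming values are only ever dicts, lists, or scalars:

import bson

def to_plain_int(value):
    # recursively replace bson.int64.Int64 with plain int,
    # descending into nested dicts and lists
    if isinstance(value, bson.int64.Int64):
        return int(value)
    if isinstance(value, dict):
        return {k: to_plain_int(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_plain_int(v) for v in value]
    return value

documents = [to_plain_int(doc) for doc in documents]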
After that, convert to a PySpark DataFrame as usual; this time you can declare IntegerType or LongType directly, and the values actually come through:
schema = StructType([
    StructField("created_at", LongType(), True)
])
df = spark.createDataFrame(documents, schema=schema)
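To confirm the fix took, the schema and the data should now display without the PickleException:

df.printSchema()  # expect: created_at: long (nullable = true)
df.show(5)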