Spark: Non-Serializable Third-Party Libraries (Task not serializable, Caused by: java.io.NotSerializableException)

innersense

Category: spark

Article link: https://blog.csdn.net/innersense/article/details/104692450

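RDD operations in Spark such as map and filter require their functions to be serializable. When a function involves an object from a third-party library that is not serializable, the problem can be solved by wrapping the object and marking it with the `transient` annotation. For example, Google's HyperLogLogPlusPlus does not support serialization, so you can create a wrapper class and declare the hll variable as `transient`. If the data must be preserved, implement custom `writeObject` and `readObject` methods. The reference link provides a detailed solution.

(Summary generated automatically by CSDN.)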

Functions passed to RDD operations such as map and filter are serialized and then sent to the worker nodes, so any function used in map or filter must be serializable.
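The failure can be reproduced without a cluster. The following is a minimal sketch, assuming Scala 2.12 or later (where function literals are serializable) and using plain Java serialization, which is what Spark uses for task closures. `ThirdPartyCounter`, `ClosureCheck`, and `trySerialize` are hypothetical names made up for this illustration; `ThirdPartyCounter` stands in for a third-party class that does not implement `Serializable`.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a third-party class that does not implement Serializable.
class ThirdPartyCounter {
  def accept(s: String): Boolean = s.nonEmpty
}

object ClosureCheck {
  // Writes the object with Java serialization and returns the byte count,
  // or the NotSerializableException if some reachable object cannot be written.
  def trySerialize(obj: AnyRef): Either[NotSerializableException, Int] = {
    try {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj)
      out.close()
      Right(bytes.size())
    } catch {
      case e: NotSerializableException => Left(e)
    }
  }

  def main(args: Array[String]): Unit = {
    val counter = new ThirdPartyCounter

    // Captures nothing, so only the function itself is written.
    val plain: String => Boolean = s => s.nonEmpty
    // Captures `counter`, so serialization must also write the ThirdPartyCounter.
    val capturing: String => Boolean = s => counter.accept(s)

    println(s"plain serializable: ${trySerialize(plain).isRight}")
    println(s"capturing serializable: ${trySerialize(capturing).isRight}")
  }
}
```

In Spark the same `NotSerializableException` surfaces as the cause of the "Task not serializable" error when the closure is passed to map or filter.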

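If the function depends on an object created from a third-party jar whose class is not serializable, wrap the third-party object in a serializable class and mark the wrapped field with the `@transient` annotation.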
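The wrapping pattern can be sketched in isolation first. This is a minimal illustration under stated assumptions, not the implementation from this article: `ThirdPartySketch` is a hypothetical stand-in for a non-serializable library class, and `SketchWrapper` and `RoundTrip` are made-up names. The wrapper marks the field `@transient` and uses the private `writeObject`/`readObject` hooks of Java serialization to save and rebuild its state.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a third-party class that is NOT Serializable.
class ThirdPartySketch(private var total: Long = 0L) {
  def add(n: Long): Unit = total += n
  def result: Long = total
}

class SketchWrapper extends Serializable {
  // Skipped by Java serialization; rebuilt in readObject after deserialization.
  @transient private var sketch: ThirdPartySketch = new ThirdPartySketch()

  def add(n: Long): Unit = sketch.add(n)
  def result: Long = sketch.result

  // Hook invoked by ObjectOutputStream: write the default fields,
  // then the state needed to rebuild the transient field.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    out.writeLong(sketch.result)
  }

  // Hook invoked by ObjectInputStream: read in the same order and rebuild the field.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    sketch = new ThirdPartySketch(in.readLong())
  }
}

object RoundTrip {
  // Serializes and deserializes an object in memory, as happens when a task is shipped.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val wrapper = new SketchWrapper
    wrapper.add(3L)
    wrapper.add(4L)
    val copy = roundTrip(wrapper)
    println(copy.result)
  }
}
```

Without the two hooks, the `@transient` field would simply be null in the deserialized copy, and the first call to `add` or `result` would fail with a NullPointerException.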

package bigdata.zetaskech

import java.io.{ObjectInputStream, ObjectOutputStream}

import scala.collection.mutable

import com.google.zetasketch.HyperLogLogPlusPlus

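// Serializable wrapper around the non-serializable HyperLogLogPlusPlus sketch.
class HyperLogLog extends Serializable {

  // @transient keeps Java serialization from trying to write this field;
  // it is null after deserialization unless it is rebuilt explicitly.
  @transient
  var hll: HyperLogLogPlusPlus[String] = new HyperLogLogPlusPlus.Builder().buildForStrings()

  val pathSet: mutable.Set[String] = mutable.Set[String]()

  // Latest estimate, kept in a regular field so that it is serialized with the wrapper.
  var count: Long = 0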

  def add(v:String):Unit = {
    hll.add(v)
    count = hll.longResult()
  }

  def merge(other: HyperLogLog): HyperLogLog = {
    hll.merge(other.hll)
    // Keep the serializable snapshot in sync with the merged sketch.
    count = hll.longResult()
    println(s"this ${hll.longResult()} other ${other.longResult}")
    this
  }

  def numValues: Long = {
    hll.numValues()
  }

  def longResult:Long = {
    hll.longResult()
  }


  private de