Using a Serializable lambda in a Spark JavaRDD transformation

I am trying to understand the following code.

// File: LambdaTest.java

package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LambdaTest implements Ops {

    public static void main(String[] args) {
        new LambdaTest().job();
    }

    public void job() {
        SparkConf conf = new SparkConf()
                .setAppName(LambdaTest.class.getName())
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Integer> lst = Arrays.asList(1, 2, 3, 4, 5, 6);
        JavaRDD<Integer> rdd = jsc.parallelize(lst);

        Function<Integer, Integer> func1 = (Function<Integer, Integer> & Serializable) x -> x * x;
        Function<Integer, Integer> func2 = x -> x * x;

        System.out.println(func1.getClass()); // test.LambdaTest$$Lambda$8/390374517
        System.out.println(func2.getClass()); // test.LambdaTest$$Lambda$9/208350681

        this.doSomething(rdd, func1); // works
        this.doSomething(rdd, func2); // org.apache.spark.SparkException: Task not serializable
    }
}

// File: Ops.java

package test;

import org.apache.spark.api.java.JavaRDD;

import java.util.function.Function;

public interface Ops {

    default void doSomething(JavaRDD<Integer> rdd, Function<Integer, Integer> func) {
        rdd.map(x -> x + func.apply(x))
           .collect()
           .forEach(System.out::println);
    }
}

The difference is that func1 is created with the Serializable bound, while func2 is not. Looking at the runtime classes of the two functions, both are anonymous classes generated under the LambdaTest class. Both of them are used in the RDD transformation defined in the interface, so the two functions, and LambdaTest, should have to be serializable. As you can see, LambdaTest does not implement the Serializable interface, so I thought neither function should work. Surprisingly, func1 works.

The stack trace for func2 is as follows:

Serialization stack:
- object not serializable (class: test.LambdaTest$$Lambda$9/208350681, value: test.LambdaTest$$Lambda$9/208350681@61d84e08)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=interface fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$1024e30a$1:(Ljava/util/function/Function;Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349, fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349@4e1459ea)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 19 more

It seems that if a function is bound with Serializable, the object that contains it does not need to be serialized, which confuses me.

Any explanation of this would be greatly appreciated.

---------- Update ----------

I tried using an abstract class instead of an interface:

// File: AbstractTest.java

package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class AbstractTest {

    public static void main(String[] args) {
        new AbstractTest().job();
    }

    public void job() {
        SparkConf conf = new SparkConf()
                .setAppName(AbstractTest.class.getName())
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Integer> lst = Arrays.asList(1, 2, 3, 4, 5, 6);
        JavaRDD<Integer> rdd = jsc.parallelize(lst);

        Ops ops = new Ops() {
            @Override
            public Integer apply(Integer x) {
                return x + 1;
            }
        };

        System.out.println(ops.getClass()); // class fr.leboncoin.etl.jobs.test.AbstractTest$1

        ops.doSomething(rdd);
    }
}

// File: Ops.java

package test;

import org.apache.spark.api.java.JavaRDD;

import java.io.Serializable;

public abstract class Ops implements Serializable {

    public abstract Integer apply(Integer x);

    public void doSomething(JavaRDD<Integer> rdd) {
        rdd.map(x -> x + apply(x))
           .collect()
           .forEach(System.out::println);
    }
}

It does not work, even though the Ops class is compiled in a file separate from the AbstractTest class. The class name of the ops object is class fr.leboncoin.etl.jobs.test.AbstractTest$1. According to the stack trace below, it seems that AbstractTest needs to be serialized in order to serialize AbstractTest$1.

Serialization stack:
- object not serializable (class: test.AbstractTest, value: test.AbstractTest@21ac5eb4)
- field (class: test.AbstractTest$1, name: this$0, type: class test.AbstractTest)
- object (class test.AbstractTest$1, test.AbstractTest$1@36fc05ff)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeSpecial fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$6d6228b6$1:(Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681, fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681@4acb2510)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
... 19 more

Solution:

LambdaTest does not need to be Serializable, because it is not sent over the network – there is no reason for it to be.

On the other hand, func1 and func2 do have to be Serializable, because Spark uses them to perform the computation on the RDD, so this code has to be shipped over the wire to the worker nodes. Note that even though you write everything in the same class, after compilation your lambdas are placed in separate class files; thanks to this, the whole enclosing class does not have to be sent over the wire, and therefore the outer class does not need to be Serializable.
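To see that it is only the lambda object itself, and not the enclosing class, that has to survive serialization, here is a small plain-Java sketch independent of Spark; the class name LambdaSerializationDemo and the variable names are made up for illustration:

import java.io.*;
import java.util.function.Function;

// Note: this class deliberately does NOT implement Serializable.
public class LambdaSerializationDemo {

    public static void main(String[] args) throws Exception {
        // The intersection-type cast makes the generated lambda class
        // implement both Function and Serializable.
        Function<Integer, Integer> square =
                (Function<Integer, Integer> & Serializable) x -> x * x;

        // Round-trip through plain Java serialization, which is what
        // Spark's default JavaSerializer does with the task closure.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(square); // would throw NotSerializableException without the cast
        }

        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Function<Integer, Integer> copy =
                    (Function<Integer, Integer>) in.readObject();
            System.out.println(copy.apply(5)); // prints 25
        }
    }
}

Only the SerializedLambda form of square is written to the stream; the declaring class is never serialized, which is why LambdaTest itself does not have to be Serializable.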

As for why func1 works: when you do not use a type cast, the Java compiler infers the type of the lambda expression for you. So in this case, the code generated for func2 simply implements Function (since that is the type of the target variable). On the other hand, if a type cannot be inferred from the context (as in your case – the compiler has no way of knowing that func1 has to be Serializable, since that is only required by Spark), you can use an intersection-type cast, as in your example, to provide the type explicitly. In that case, the code generated by the compiler implements both the Function and Serializable interfaces, and the compiler does not try to infer the type on its own.
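The difference is also easy to verify at runtime: only the lambda created with the cast implements Serializable. A hypothetical check that could be added to LambdaTest.job():

Function<Integer, Integer> func1 = (Function<Integer, Integer> & Serializable) x -> x * x;
Function<Integer, Integer> func2 = x -> x * x;

System.out.println(func1 instanceof Serializable); // true  – the cast adds the extra interface
System.out.println(func2 instanceof Serializable); // false – the inferred type is just Function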

You can find this described in The State of the Lambda, section 5, "Contexts for target typing".
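As a side note on the abstract-class variant from the update: it fails only because the anonymous subclass keeps a hidden this$0 reference to the non-serializable AbstractTest instance. One way around that is sketched below, assuming the same abstract Ops class as above; the class names AbstractTestFixed and AddOne are illustrative. Using a static nested (or top-level) subclass means nothing but the serializable Ops subclass is captured; alternatively, making AbstractTest itself implement Serializable would also work.

// File: AbstractTestFixed.java (illustrative, not from the original post)
package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class AbstractTestFixed {

    // Static nested class: no hidden this$0 field pointing at the enclosing
    // instance, and it inherits Serializable from Ops.
    static class AddOne extends Ops {
        @Override
        public Integer apply(Integer x) {
            return x + 1;
        }
    }

    public static void main(String[] args) {
        new AbstractTestFixed().job();
    }

    public void job() {
        SparkConf conf = new SparkConf()
                .setAppName(AbstractTestFixed.class.getName())
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        List<Integer> lst = Arrays.asList(1, 2, 3, 4, 5, 6);
        JavaRDD<Integer> rdd = jsc.parallelize(lst);

        new AddOne().doSomething(rdd); // works: AddOne is serializable on its own
    }
}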

Tags: java, lambda, serializable, apache-spark

Source: https://codeday.me/bug/20190702/1358653.html
