Spark Java map function is being executed twice

I have the code below as my Spark driver. When I execute the program it works properly, saving the required data as a Parquet file.

String indexFile = "index.txt";

JavaRDD<String> indexData = sc.textFile(indexFile).cache();

// Map each patient id to a JSON array serialized as a string.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I observed that my mapper function on the RDD indexData is getting executed twice:

first, when I read jsonStringRDD as a DataFrame using SQLContext;

second, when I write dataSchemaDF to the Parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?

Solution

I believe that the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer a schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
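If you want to confirm the double execution, here is a minimal sketch of a check using a Spark accumulator (assuming the Spark 1.x Java API implied by SQLContext and DataFrame above):

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Counts how many times the map function is invoked across all jobs.
final Accumulator<Integer> mapCalls = sc.accumulator(0);

JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        mapCalls.add(1); // incremented once per record per pass over the RDD
        return "json array as string";
    }
});

// After both read().json(...) and write().parquet(...) have run,
// mapCalls.value() is roughly twice the record count: one pass for
// schema inference, one pass for the Parquet write.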

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;

...

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
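Putting it together, here is a minimal end-to-end sketch. The field names (patientId, data) and their types are hypothetical placeholders, since the question does not show the actual shape of the JSON documents; substitute the fields your mapper really emits:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Declare the schema up front so the JSON reader does not need to
// scan the RDD (and run the map function) just to infer it.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, false),  // hypothetical field
    DataTypes.createStructField("data", DataTypes.createArrayType(DataTypes.StringType), true)  // hypothetical field
});

// With an explicit schema, the map function runs only when an action
// (here, the Parquet write) triggers the job.
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");

An alternative, if you would rather not spell out the schema, is to cache jsonStringRDD itself (jsonStringRDD.cache()): the inference scan still happens, but the second pass reads the cached strings, so the map function executes only once.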
