Spark Java map function is being executed twice

I have the code below as my Spark driver. When I execute the program it works properly, saving the required data as a Parquet file.

String indexFile = "index.txt";

JavaRDD<String> indexData = sc.textFile(indexFile).cache();

// Map each patient id to a JSON array serialized as a string.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I observed that my mapper function on the RDD indexData is getting executed twice:

first, when I read jsonStringRDD as a DataFrame using SQLContext;

second, when I write dataSchemaDF to the Parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?

Solution

I believe that the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer a schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
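If you want to confirm the double execution, here is a minimal sketch of a check using a Spark accumulator (assuming the Spark 1.x Java API implied by SQLContext and DataFrame above):

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Counts how many times the map function is invoked across all jobs.
final Accumulator<Integer> mapCalls = sc.accumulator(0);

JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        mapCalls.add(1); // incremented once per record per pass over the RDD
        return "json array as string";
    }
});

// After both read().json(...) and write().parquet(...) have run,
// mapCalls.value() is roughly twice the record count: one pass for
// schema inference, one pass for the Parquet write.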

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;

...

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
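Putting it together, here is a minimal end-to-end sketch. The field names (patientId, data) and their types are hypothetical placeholders, since the question does not show the actual shape of the JSON documents; substitute the fields your mapper really emits:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Declare the schema up front so the JSON reader does not need to
// scan the RDD (and run the map function) just to infer it.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, false),  // hypothetical field
    DataTypes.createStructField("data", DataTypes.createArrayType(DataTypes.StringType), true)  // hypothetical field
});

// With an explicit schema, the map function runs only when an action
// (here, the Parquet write) triggers the job.
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");

An alternative, if you would rather not spell out the schema, is to cache jsonStringRDD itself (jsonStringRDD.cache()): the inference scan still happens, but the second pass reads the cached strings, so the map function executes only once.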
