Reading pretty-printed JSON files in Apache Spark

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed: each file holds one massive dictionary, but it is not on a single line. As per this thread, each dictionary in a JSON file should be on one line, which is a limitation of Apache Spark, and my files are not structured that way.

My JSON schema looks like this:

{
    "dataset": [
        {
            "key1": [
                {
                    "range": "range1",
                    "value": 0.0
                },
                {
                    "range": "range2",
                    "value": 0.23
                }
            ]
        }, {..}, {..}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions:

1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

2. If not, what's the best way to do the conversion in Python? I have a bunch of these files for each day in the bucket, and the bucket is partitioned by day.

3. Is there any tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any suggested tool in a Zeppelin notebook.

Solution

You could use sc.wholeTextFiles() to read each file as a single record and parse it yourself; there is a related post covering this approach.
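As a minimal sketch of that approach, assuming a PySpark session where sc is the SparkContext, spark is the SparkSession, and the bucket path is a hypothetical placeholder:

import json

# Each element is a (file_path, file_content) pair, so a pretty-printed
# JSON document stays intact instead of being split line by line.
raw = sc.wholeTextFiles("s3a://my-bucket/2016-09-08/*.json")  # hypothetical path

# Parse each file and flatten its "dataset" array into one record per entry.
records = raw.flatMap(lambda kv: json.loads(kv[1])["dataset"])

# Serialize each entry back to a one-line JSON string and build a DataFrame.
df = spark.read.json(records.map(json.dumps))

This keeps everything inside Spark, at the cost of loading each file's full content as one record, so very large individual files may strain executor memory.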

Alternatively, you could reformat your JSON using a simple function and load the generated file:

import json

def reformat_json(input_path, output_path):
    # Load the entire pretty-printed JSON array into memory.
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    # Write one entry per line so Spark can read the output directly.
    with open(output_path, 'w') as f:
        for entry in jarr:
            f.write(json.dumps(entry) + "\n")
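Note that this function assumes the top-level JSON value is an array; for the schema shown above you would iterate over jarr["dataset"] instead. A hypothetical usage, with placeholder file names, would then be:

reformat_json("pretty_input.json", "one_per_line.json")  # hypothetical paths
df = spark.read.json("one_per_line.json")
df.show()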
