Reading pretty-printed JSON files in Apache Spark

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed: each file holds one massive dictionary, but it is not on a single line. As per this thread, each dictionary in a JSON file should be on one line, which is a limitation of Apache Spark, and my files are not structured that way.

My JSON schema looks like this:

{
    "dataset": [
        {
            "key1": [
                {
                    "range": "range1",
                    "value": 0.0
                },
                {
                    "range": "range2",
                    "value": 0.23
                }
            ]
        }, {..}, {..}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions:

1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

2. If not, what's the best way to do the conversion in Python? I have a bunch of these files for each day in the bucket, and the bucket is partitioned by day.

3. Is there any tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any suggested tool in a Zeppelin notebook.

Solution

You could use sc.wholeTextFiles() to read each file as a single record and parse it yourself; there is a related post covering this approach.
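As a minimal sketch of that approach, assuming a PySpark session where sc is the SparkContext, spark is the SparkSession, and the bucket path is a hypothetical placeholder:

import json

# Each element is a (file_path, file_content) pair, so a pretty-printed
# JSON document stays intact instead of being split line by line.
raw = sc.wholeTextFiles("s3a://my-bucket/2016-09-08/*.json")  # hypothetical path

# Parse each file and flatten its "dataset" array into one record per entry.
records = raw.flatMap(lambda kv: json.loads(kv[1])["dataset"])

# Serialize each entry back to a one-line JSON string and build a DataFrame.
df = spark.read.json(records.map(json.dumps))

This keeps everything inside Spark, at the cost of loading each file's full content as one record, so very large individual files may strain executor memory.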

Alternatively, you could reformat your JSON using a simple function and load the generated file:

import json

def reformat_json(input_path, output_path):
    # Load the entire pretty-printed JSON array into memory.
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    # Write one entry per line so Spark can read the output directly.
    with open(output_path, 'w') as f:
        for entry in jarr:
            f.write(json.dumps(entry) + "\n")
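Note that this function assumes the top-level JSON value is an array; for the schema shown above you would iterate over jarr["dataset"] instead. A hypothetical usage, with placeholder file names, would then be:

reformat_json("pretty_input.json", "one_per_line.json")  # hypothetical paths
df = spark.read.json("one_per_line.json")
df.show()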
