Background: we have a 2.2 GB file with one sentence per line, over 100 million lines in total. The file needs to be converted to JSON: all sentences collected into a single list, with each sentence as one element of that list.
Small-data approach:

import json

input_file = "XXXX"
sen_list = []
with open(input_file, "r", encoding="utf8") as fin:
    for line in fin:
        line = line.strip()
        sen_list.append(line)

output_file = "YYYY"
with open(output_file, "w", encoding="utf8") as fout:
    # ensure_ascii=False keeps Chinese characters readable (see the note below)
    json.dump(sen_list, fout, ensure_ascii=False)
Risk: reading the entire file into memory at once may not fit, since the whole 2.2 GB of text (plus Python object overhead) has to be held in the list before anything is written.
Large-data approach: stream the file line by line and write each JSON-encoded sentence as soon as it is read, so memory usage stays constant. Writing the comma separator before every element except the first avoids a trailing comma, so the array closes cleanly without any padding element:

import json

input_file = "XXXX"
output_file = "YYYY"
with open(input_file, "r", encoding="utf8") as fin, \
     open(output_file, "w", encoding="utf8") as fout:
    fout.write("[")
    first = True
    for line in fin:
        line = line.strip()
        if not first:
            fout.write(",")  # separator before every element except the first
        fout.write(json.dumps(line, ensure_ascii=False))
        first = False
    fout.write("]")
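As a sanity check, the streaming approach can be exercised end-to-end on a tiny temporary file (the file names here are placeholders created just for this test); reading the output back with json.load should recover exactly the original sentences:

```python
import json
import os
import tempfile

sentences = ["hello", "今天天气不错", "third line"]

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "in.txt")
    output_file = os.path.join(tmp, "out.json")
    with open(input_file, "w", encoding="utf8") as f:
        f.write("\n".join(sentences))

    # Stream-convert: write the separator before every element except
    # the first, so the array closes with no trailing comma.
    with open(input_file, "r", encoding="utf8") as fin, \
         open(output_file, "w", encoding="utf8") as fout:
        fout.write("[")
        first = True
        for line in fin:
            if not first:
                fout.write(",")
            fout.write(json.dumps(line.strip(), ensure_ascii=False))
            first = False
        fout.write("]")

    # Reading the result back should reproduce the original list
    with open(output_file, "r", encoding="utf8") as f:
        result = json.load(f)
```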
Note: when the JSON contains Chinese or other non-ASCII text, be sure to pass ensure_ascii=False when serializing; otherwise the output will contain \uXXXX escape sequences instead of the characters themselves.
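A quick demonstration of the difference:

```python
import json

# Default: non-ASCII characters are escaped as \uXXXX sequences
print(json.dumps("今天天气不错"))                      # "\u4eca\u5929\u5929\u6c14\u4e0d\u9519"

# With ensure_ascii=False the characters are written as-is
print(json.dumps("今天天气不错", ensure_ascii=False))  # "今天天气不错"
```

Both forms are valid JSON and decode to the same string; the escaped form is just harder to read and slightly larger on disk.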