Converting the data to JSON

随心所欲～～

Category column: 裁判文书合规性审查项目 (Court Judgment Compliance Review Project). Tags: python

Article link: https://blog.csdn.net/qq_60983016/article/details/139707755

To store all of the data in one place, that is, the court judgments from 2019, 2020, 2021, 2022, 2023, and 2024, we process it as follows.

Processing each year's TXT files

 import os
 import json
 import re
 
 
 def process_file_content(content):
     # Remove all whitespace (spaces, tabs, newlines) from the content
     return re.sub(r'\s+', '', content)
 
 
 def process_txt_files(input_dir, output_file):
     # Collect all TXT files, sorted so the output order is reproducible
     txt_files = sorted(f for f in os.listdir(input_dir) if f.endswith('.txt'))
 
     with open(output_file, 'w', encoding='utf-8') as jsonl_file:
         for txt_file in txt_files:
             txt_file_path = os.path.join(input_dir, txt_file)
             with open(txt_file_path, 'r', encoding='utf-8') as file:
                 content = file.read()
             processed_content = process_file_content(content)
             # Write one JSON object per line
             jsonl_file.write(json.dumps({"content": processed_content}, ensure_ascii=False) + '\n')
 
 
 # Set the input directory and output file path
 input_directory = 'F:/python_code/2019'
 output_jsonl_file = 'F:/python_code/2019/processed_output.jsonl'
 
 # Process the files
 process_txt_files(input_directory, output_jsonl_file)
 
 print(f"Processed TXT files and saved to {output_jsonl_file}")

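This script walks through every TXT file in the 2019 folder, strips all whitespace from it (the `\s+` pattern removes tabs and line breaks as well as spaces), and writes each TXT file as one line of the JSONL file.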
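
To make the effect of this step concrete, here is a minimal sketch of what happens to a single document. The sample string is made up for illustration and is not taken from the real dataset; the regex and the `json.dumps` call are the same ones used in the script above.

```python
import json
import re

# Hypothetical sample text, not taken from the real dataset.
sample = "某某 人民法院\n民事 判决书\t(2019)"

# Same regex as process_file_content: \s+ matches spaces, tabs and newlines.
cleaned = re.sub(r'\s+', '', sample)

# One JSON object per line; ensure_ascii=False keeps the Chinese text readable
# instead of escaping it as \uXXXX sequences.
line = json.dumps({"content": cleaned}, ensure_ascii=False)
print(line)
```

Because every line break inside a document is removed, each record fits on a single physical line, which is what the JSONL format requires.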

Likewise, run the same step on the 2020, 2021, 2022, 2023, and 2024 folders, which produces one JSONL file per year.
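
Rather than editing the paths by hand for each year, the per-year runs could be driven by a loop. The following is only a sketch under assumptions: the post does not show how the per-year files reach the folder that the merge step reads from, so the `<year>.jsonl` naming scheme and the helper names `txt_dir_to_jsonl` and `process_all_years` are hypothetical.

```python
import json
import os
import re


def txt_dir_to_jsonl(input_dir, output_file):
    """Write each .txt file in input_dir as one whitespace-free JSONL line."""
    count = 0
    with open(output_file, 'w', encoding='utf-8') as out:
        for name in sorted(os.listdir(input_dir)):
            if not name.endswith('.txt'):
                continue
            with open(os.path.join(input_dir, name), 'r', encoding='utf-8') as f:
                content = re.sub(r'\s+', '', f.read())
            out.write(json.dumps({"content": content}, ensure_ascii=False) + '\n')
            count += 1
    return count


def process_all_years(base_dir, merged_dir, years):
    """Run the conversion for every year folder that exists under base_dir."""
    os.makedirs(merged_dir, exist_ok=True)
    written = {}
    for year in years:
        year_dir = os.path.join(base_dir, str(year))
        if not os.path.isdir(year_dir):
            continue  # skip years whose folder is missing
        # Hypothetical naming scheme: one file per year, e.g. 2019.jsonl,
        # so the per-year outputs do not overwrite each other.
        out_path = os.path.join(merged_dir, f"{year}.jsonl")
        written[year] = txt_dir_to_jsonl(year_dir, out_path)
    return written
```

Assuming the year folders sit directly under `F:/python_code`, a call such as `process_all_years('F:/python_code', 'F:/python_code/reptilework', range(2019, 2025))` would place all six files in the folder used by the merge step and return the number of documents written per year.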

Merging the files

 import os
 import json
 
 
 def merge_jsonl_files(input_dir, output_file):
     all_entries = []
 
     # Collect all JSONL files, sorted so the merge order is reproducible
     jsonl_files = sorted(f for f in os.listdir(input_dir) if f.endswith('.jsonl'))
 
     for jsonl_file in jsonl_files:
         jsonl_file_path = os.path.join(input_dir, jsonl_file)
         with open(jsonl_file_path, 'r', encoding='utf-8') as file:
             for line in file:
                 line = line.strip()
                 if not line:
                     continue  # skip blank lines, which json.loads would reject
                 all_entries.append(json.loads(line))
 
     # Write all entries to a single JSON file
     with open(output_file, 'w', encoding='utf-8') as output_json_file:
         json.dump(all_entries, output_json_file, ensure_ascii=False, indent=4)
 
 
 # Set the input directory and output file path
 input_directory = 'F:/python_code/reptilework'
 output_json_file = 'F:/python_code/reptilework/merged_output.json'
 
 # Merge the JSONL files
 merge_jsonl_files(input_directory, output_json_file)
 
 print(f"Merged JSONL files and saved to {output_json_file}")
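
After merging, it may be worth checking the output before using it further. The helper below is not part of the original post; `summarize_merged` is a hypothetical name, and the sketch assumes the merged file is a JSON array of objects with a `content` key, as written by the script above.

```python
import json


def summarize_merged(path):
    """Load the merged JSON array and report basic sanity statistics."""
    with open(path, 'r', encoding='utf-8') as f:
        entries = json.load(f)
    total = len(entries)
    # Entries whose content is missing or an empty string.
    empty = sum(1 for e in entries if not e.get("content"))
    # Exact duplicates can appear if the same document was saved twice.
    unique = len({e.get("content") for e in entries})
    return {"total": total, "empty": empty, "duplicates": total - unique}
```

Calling `summarize_merged('F:/python_code/reptilework/merged_output.json')` would return the total number of judgments together with counts of empty and duplicated entries.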

The result is as follows: