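Background
Using the chatlama approach, a large amount of text data was generated and stored as a JSON file. This data needs to be cleaned to produce HTML files and their corresponding Markdown files.
How can 250,000 lines of JSON data be batch processed to extract the required data?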
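Method
Reading and parsing the JSON file:
- Read the JSON file line by line and try to parse each line as a JSON object.
- If parsing succeeds, append the object to the json_objects list and increment the json_count counter. If it fails, print the offending line and the error, then move on.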
json_objects = []
json_count = 0
with open(json_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            try:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
                json_count += 1  # increment the counter
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
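With 250,000 lines, it may be preferable not to hold every parsed object in a list. The following is a minimal sketch of a streaming variant, not part of the original script: the helper name iter_json_lines is hypothetical, and it assumes the same one-object-per-line layout described above.

```python
import json
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed value per non-empty line, skipping lines that fail to parse.

    Hypothetical helper: the name and the error message format are illustrative.
    """
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                # Report the line number and a short preview instead of the whole line.
                print(f"Line {line_number}: {e} (starts with {line[:80]!r})")
```

A loop such as "for obj in iter_json_lines(json_file_path):" could then replace the loop over json_objects, so that only one object is in memory at a time.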
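Matching Markdown with a regular expression
Each row of a Markdown table usually starts with a pipe | and ends with a newline \n, so the rows can be picked out with a regular expression: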
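markdown_pattern = re.compile(r'\|.*?\n')
- \| : matches the pipe character |.
- .*? : matches any character except a newline, zero or more times, in non-greedy mode.
- \n : matches a newline.
Note that the pattern is not anchored to the start of a line: it matches from the first | found on any line through that line's newline, and a final row with no trailing newline is not matched.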
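To see what this pattern actually captures, here is a small sketch run on made-up sample text. The second pattern, table_row_pattern, is an assumed stricter alternative (not from the original post) that only accepts lines whose first non-blank character is a pipe.

```python
import re

# Pattern from the text: first "|" on a line through the end of that line.
markdown_pattern = re.compile(r'\|.*?\n')

# Assumed stricter variant: whole lines whose first non-blank character is "|".
table_row_pattern = re.compile(r'^[ \t]*\|.*$', re.MULTILINE)

# Made-up sample: a table, a line of JavaScript containing "||",
# and a last table row without a trailing newline.
sample = (
    "Here is the table:\n"
    "| name | value |\n"
    "| --- | --- |\n"
    "| a | 1 |\n"
    "var x = a || b;\n"
    "| last | 2 |"
)

loose_rows = markdown_pattern.findall(sample)
strict_rows = table_row_pattern.findall(sample)

print(loose_rows)   # includes '|| b;\n', misses the last row
print(strict_rows)  # the four table rows, without trailing newlines
```

Because the strict variant returns rows without their newlines, they would need to be combined with '\n'.join(...) rather than ''.join(...).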
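Saving the HTML and Markdown files
The base name of html_path, without its extension, is used as the file name for both the HTML file and the Markdown file. The HTML content is answer_mode4.split("```html")[1].split("```")[0].strip(), and the Markdown content is the concatenation of all regex matches found in answer_mode4.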
for obj in json_objects:
    answer_mode4 = obj.get("answer_mode4", "")
    html_path = obj.get("html_path", "")
    if not html_path or not answer_mode4:
        continue
    # File name: base name of html_path without its extension
    filename = os.path.splitext(os.path.basename(html_path))[0]
    try:
        html_code = answer_mode4.split("```html")[1].split("```")[0].strip()
    except IndexError:
        # No html code block in this answer
        html_code = ""
    markdown_code_matches = markdown_pattern.findall(answer_mode4)
    markdown_code = ''.join(markdown_code_matches).strip()
    if html_code:
        with open(os.path.join(html_folder, f"{filename}.html"), 'w', encoding='utf-8') as html_file:
            html_file.write(html_code)
    if markdown_code:
        with open(os.path.join(markdown_folder, f"{filename}.md"), 'w', encoding='utf-8') as markdown_file:
            markdown_file.write(markdown_code)
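One behavior of the split-based extraction is worth knowing: if the html block is never closed (for example, when the generated answer was cut off), the second split still returns the truncated remainder, and it gets saved. The sketch below is an assumed alternative using a regular expression that returns an empty string in that case. The helper name extract_html_block is hypothetical, and the fence marker is built with string multiplication only to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks

# Assumed pattern: a block opened by FENCE + "html" and closed by the next FENCE.
html_block_pattern = re.compile(
    re.escape(FENCE + "html") + r"\s*(.*?)" + re.escape(FENCE),
    re.DOTALL,  # let "." span multiple lines of HTML
)


def extract_html_block(text: str) -> str:
    """Return the body of the first closed html block, or "" if there is none."""
    match = html_block_pattern.search(text)
    return match.group(1).strip() if match else ""


# Made-up sample answer containing one closed html block.
sample_answer = "intro\n" + FENCE + "html\n<p>hi</p>\n" + FENCE + "\n| a | b |\n"
print(extract_html_block(sample_answer))
```

Whether truncated HTML should be skipped or kept depends on the cleaning goal; the original split-based code keeps it.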
Complete code
import os
import json
import re

json_file_path = r'g1138_rerender_echart_total.json'
html_folder = r'html_files1'
markdown_folder = r'markdown_files1'
os.makedirs(html_folder, exist_ok=True)
os.makedirs(markdown_folder, exist_ok=True)

# Read the file line by line; each non-empty line should be one JSON object
json_objects = []
json_count = 0
with open(json_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            try:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
                json_count += 1  # increment the counter
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)

# Matches from a pipe character through the end of its line
markdown_pattern = re.compile(r'\|.*?\n')

for obj in json_objects:
    answer_mode4 = obj.get("answer_mode4", "")
    html_path = obj.get("html_path", "")
    if not html_path or not answer_mode4:
        continue
    # File name: base name of html_path without its extension
    filename = os.path.splitext(os.path.basename(html_path))[0]
    try:
        html_code = answer_mode4.split("```html")[1].split("```")[0].strip()
    except IndexError:
        # No html code block in this answer
        html_code = ""
    markdown_code_matches = markdown_pattern.findall(answer_mode4)
    markdown_code = ''.join(markdown_code_matches).strip()
    if html_code:
        with open(os.path.join(html_folder, f"{filename}.html"), 'w', encoding='utf-8') as html_file:
            html_file.write(html_code)
    if markdown_code:
        with open(os.path.join(markdown_folder, f"{filename}.md"), 'w', encoding='utf-8') as markdown_file:
            markdown_file.write(markdown_code)

# Report the folders actually used above
print(f"Files saved to the '{html_folder}' and '{markdown_folder}' folders.")