Extracting HTML (with split) and Markdown (with a regular expression) from a JSON document

gatinaa

Tags: json

Article link: https://blog.csdn.net/weixin_46636042/article/details/139841016

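Background

A large amount of text data was generated using the chatlama approach and stored in a JSON file. The data needs to be cleaned to produce HTML files and their corresponding Markdown files.

How can 250,000 lines of JSON data be processed in bulk to get the data we want?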

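Method

Reading and parsing the JSON file:

Read the file line by line and try to parse each line as a JSON object (the file is assumed to hold one JSON object per line).
If parsing succeeds, append the object to the json_objects list and increment the counter json_count.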

import json

# json_file_path is defined in the complete code at the end of the article.
json_objects = []
json_count = 0
with open(json_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:  # skip blank lines
            try:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
                json_count += 1  # increment the counter
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
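With 250,000 lines, it may be preferable not to keep every parsed object in a list at once. The following is a minimal sketch under the same one-object-per-line assumption. The function name iter_json_lines is hypothetical and not part of the original code. It yields objects one at a time and reports the line number, rather than the whole line, when parsing fails.

```python
import json
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the file at `path`."""
    with open(path, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                # Report the position only; the line itself may be very long.
                print(f"Error decoding JSON on line {line_number}: {e}")
```

A loop such as `for obj in iter_json_lines(json_file_path):` could then stand in for the loop over json_objects, at the cost of no longer having the full list available afterwards.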

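Matching Markdown with a regular expression

Each row of a Markdown table usually begins with a pipe | and ends with a newline \n, so the rows can be matched with a regular expression:

markdown_pattern = re.compile(r'\|.*?\n')

\|: matches a literal pipe character |.

.*?: matches any character except a newline, zero or more times, non-greedily.

\n: matches a newline.

The pattern is not anchored to the start of a line, so it matches from the first | on any line through the end of that line, and a final row with no trailing newline is not matched.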
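To see what the pattern picks up in practice, here is a small sketch on a made-up sample string. The second pattern, anchored_pattern, is an assumed alternative that is not in the original code: it uses re.MULTILINE to accept only lines whose first non-blank character is a pipe, and it does not require a trailing newline.

```python
import re

# Pattern from the article: first pipe on a line through the newline.
markdown_pattern = re.compile(r'\|.*?\n')
# Assumed alternative: whole lines that begin with a pipe (after optional indent).
anchored_pattern = re.compile(r'^[ \t]*\|.*$', re.MULTILINE)

sample = (
    "Here is the table:\n"
    "| name | value |\n"
    "| --- | --- |\n"
    "| a | 1 |\n"
    "Use a || b in code\n"
    "| last | row |"  # final row without a trailing newline
)

original_matches = markdown_pattern.findall(sample)
anchored_matches = anchored_pattern.findall(sample)

print(original_matches)
# ['| name | value |\n', '| --- | --- |\n', '| a | 1 |\n', '|| b in code\n']
print(anchored_matches)
# ['| name | value |', '| --- | --- |', '| a | 1 |', '| last | row |']
```

The original pattern picks up the stray "|| b in code" fragment and misses the last row, while the anchored variant does the opposite. Which behavior is wanted depends on what the generated answers actually look like. Note also that the anchored matches carry no newline, so they would be joined with '\n' rather than ''.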

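Saving the HTML and Markdown

The base name of html_path, without its extension, is used as the file name for both the HTML and the Markdown file, and answer_mode4.split("```html")[1].split("```")[0].strip() is used as the HTML content. If answer_mode4 contains no ```html fence, the resulting IndexError is caught and no HTML file is written.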

for obj in json_objects:
    answer_mode4 = obj.get("answer_mode4", "")
    html_path = obj.get("html_path", "")

    if not html_path or not answer_mode4:
        continue

    filename = os.path.splitext(os.path.basename(html_path))[0]

    try:
        html_code = answer_mode4.split("```html")[1].split("```")[0].strip()
    except IndexError:
        html_code = ""

    markdown_code_matches = markdown_pattern.findall(answer_mode4)
    markdown_code = ''.join(markdown_code_matches).strip()

    if html_code:
        with open(os.path.join(html_folder, f"{filename}.html"), 'w', encoding='utf-8') as html_file:
            html_file.write(html_code)

    if markdown_code:
        with open(os.path.join(markdown_folder, f"{filename}.md"), 'w', encoding='utf-8') as markdown_file:
            markdown_file.write(markdown_code)
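One thing to watch for in the loop above: the Markdown pattern runs over the whole of answer_mode4, so any pipe character inside the HTML block would also be collected as a table row. The sketch below is one possible way to handle that, assuming the answer contains at most one html code fence. The helper name split_answer is hypothetical. It uses str.partition in place of split, and removes the HTML block before applying the regular expression.

```python
import re
from typing import Tuple

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
markdown_pattern = re.compile(r'\|.*?\n')


def split_answer(answer: str) -> Tuple[str, str]:
    """Return (html_code, markdown_code) extracted from one answer string."""
    html_code = ""
    remainder = answer
    if FENCE + "html" in answer:
        before, _, after = answer.partition(FENCE + "html")
        html_body, _, rest = after.partition(FENCE)
        html_code = html_body.strip()
        # Drop the HTML block so pipes inside it are not taken for table rows.
        remainder = before + rest
    markdown_code = ''.join(markdown_pattern.findall(remainder)).strip()
    return html_code, markdown_code


sample_answer = (
    "Result:\n"
    f"{FENCE}html\n"
    "<table><tr><td>a | b</td></tr></table>\n"
    f"{FENCE}\n"
    "| x | y |\n"
    "| - | - |\n"
    "| 1 | 2 |\n"
)
html_code, markdown_code = split_answer(sample_answer)
```

Here html_code holds the table markup and markdown_code holds only the three table rows; with the original loop, the fragment "| b</td></tr></table>" would have been prepended to the Markdown output.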

Complete code

import os
import json
import re

json_file_path = r'g1138_rerender_echart_total.json'
html_folder = r'html_files1'
markdown_folder = r'markdown_files1'

os.makedirs(html_folder, exist_ok=True)
os.makedirs(markdown_folder, exist_ok=True)

json_objects = []
json_count = 0
with open(json_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            try:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
                json_count += 1  # increment the counter
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)

markdown_pattern = re.compile(r'\|.*?\n')

for obj in json_objects:
    answer_mode4 = obj.get("answer_mode4", "")
    html_path = obj.get("html_path", "")

    if not html_path or not answer_mode4:
        continue

    filename = os.path.splitext(os.path.basename(html_path))[0]

    try:
        html_code = answer_mode4.split("```html")[1].split("```")[0].strip()
    except IndexError:
        html_code = ""

    markdown_code_matches = markdown_pattern.findall(answer_mode4)
    markdown_code = ''.join(markdown_code_matches).strip()

    if html_code:
        with open(os.path.join(html_folder, f"{filename}.html"), 'w', encoding='utf-8') as html_file:
            html_file.write(html_code)

    if markdown_code:
        with open(os.path.join(markdown_folder, f"{filename}.md"), 'w', encoding='utf-8') as markdown_file:
            markdown_file.write(markdown_code)

print(f"Files saved to the '{html_folder}' and '{markdown_folder}' folders.")
