2021.4.2项目阶段报告_mr项目开发实验报告-CSDN博客

本文链接：https://blog.csdn.net/Mr__666/article/details/115392832

本周进度

项目成果

目前已经可以基本满足提取的要求；
对于实验报告，可以提取内容生成(实验报告名)text.txt和(实验报告名)code.txt；
对于图片，可以提取其中的文字生成txt文件；
可以批量处理，选择指定路径，准确率较高，速度还算可观。

具体实现

提取实验报告使用docx库：

def for_docx(in_path, out_path_text, out_path_code):

    ft = open(out_path_text, 'w', encoding='utf-8')
    fc = open(out_path_code, 'w', encoding='utf-8')
    document = Document(in_path)

    all_text = ""
    # 段落
    for paragraph in document.paragraphs:
        all_text = all_text + paragraph.text

    # 表格
    tables = document.tables  # 获取文件中的表格集
    table = tables[0]  # 获取文件中的第一个表格
    for i in range(0, len(table.rows)):  # 从表格第一行开始循环读取表格数据
        result = f'{(table.cell(i, 0).text):<5}'
        # cell(i,0)表示第(i+1)行第1列数据,以此类推
        all_text = all_text + result

    # 正则表达式处理
    pattern = re.compile(r'--begin--code--(.+?)(?=--end--code--)', re.DOTALL)
    procedure = pattern.findall(all_text)

    for code in procedure:
        fc.write(code)

    ft.write(all_text)

    ft.close()
    fc.close()

后期有提取其他类型文件的需求会考虑用tika；

提取图片用百度ocr接口：

def for_picture(in_path, out_path):
    url = 'https://aip.baidubce.com/oauth/2.0/token'
    data = {
        'grant_type': 'client_credentials',  # 固定值
        'client_id': '',  # 在开放平台注册后所建应用的API Key
        'client_secret': ''  # 所建应用的Secret Key
    }
    res = requests.post(url, data=data)
    res = res.json()
    access_token = res['access_token']

    # 通用文字识别接口url
    general_word_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"

    # 二进制方式打开图片文件
    f = open(in_path, 'rb')
    img = base64.b64encode(f.read())
    params = {"image": img,
              "language_type": "CHN_ENG"}
    request_url = general_word_url + "?access_token=" + access_token
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(request_url, data=params, headers=headers)

    if response:
        res = response.json()["words_result"]
        with open(out_path, 'w', encoding='utf-8') as f:
            for j in res:
                f.write(j["words"] + "\n")