Unstructured Data Processing Before Building LLM Applications (3): Extracting Tables from Documents

l8947943

Category: AIGC. Tags: langchain, unstructured data

Article link: https://blog.csdn.net/l8947943/article/details/140309139

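1. Learning content

This material comes from Andrew Ng's course Preprocessing Unstructured Data for LLM Applications. Since it covers how to process unstructured data, I have written up my study notes here.
This part focuses on processing table data in PDF files.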

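2. Environment setup

As before, see "Unstructured Data Processing Before Building LLM Applications (1): Getting to Know the Data Through Normalization"; the configuration is the same as in that post.

Likewise, you need to obtain an API key from unstructured.io.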
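
Rather than pasting the key into a notebook cell, one option is to read it from the environment. The sketch below is only an illustration: the helper `load_unstructured_config` and the variable names `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` are my own assumptions, not names required by the library.

```python
import os


def load_unstructured_config(env=None):
    """Return (api_key, server_url) read from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("UNSTRUCTURED_API_KEY")
    server_url = env.get("UNSTRUCTURED_API_URL")
    if not api_key or not server_url:
        raise RuntimeError(
            "Set UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL before running."
        )
    return api_key, server_url
```

The two returned values can then be passed as `api_key_auth` and `server_url` when creating the `UnstructuredClient`.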

3. Getting started

3.1 Importing the environment

# Warning control
import warnings
warnings.filterwarnings('ignore')

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.staging.base import dict_to_elements

# Initialize the API client
s = UnstructuredClient(
    api_key_auth="XXX",
    server_url="https://api.unstrXXX",
)

3.2 Viewing the sample

from IPython.display import Image
Image(filename="images/embedded-images-tables.jpg", height=600, width=600)

The output is as follows:
[Image: the sample page, images/embedded-images-tables.jpg]

3.3 Processing the PDF document

filename = "example_files/embedded-images-tables.pdf"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
)

try:
    resp = s.general.partition(req)
    elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

# Pick out the Table elements from the partitioned unstructured elements
tables = [el for el in elements if el.category == "Table"]
tables[0].text

The output is as follows:

'Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440 1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540 —0.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05 305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919'
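
The filter above works on element objects. The same idea can be applied to the raw response before conversion, which is a list of dictionaries. The sketch below uses a small hand-written list as a stand-in for `resp.elements`. The keys `type`, `text` and `metadata` follow the JSON shape I understand the API to return, the sample values are invented, and `select_by_type` is a hypothetical helper.

```python
def select_by_type(raw_elements, wanted="Table"):
    """Keep only the element dicts whose "type" equals `wanted`."""
    return [el for el in raw_elements if el.get("type") == wanted]


# Invented stand-in for resp.elements, for illustration only.
sample_elements = [
    {"type": "Title", "text": "Results", "metadata": {}},
    {
        "type": "Table",
        "text": "a b 1 2",
        "metadata": {"text_as_html": "<table><tr><td>a</td></tr></table>"},
    },
    {"type": "NarrativeText", "text": "Some prose.", "metadata": {}},
]

tables_raw = select_by_type(sample_elements)
print(len(tables_raw), tables_raw[0]["text"])
```

Note how the flat `text` field loses the row and column layout, which is why the HTML form stored in the metadata is more useful for tables.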

Get the table in HTML form:

table_html = tables[0].metadata.text_as_html
table_html

The output is as follows:

'<table><thead><tr><th>Inhibitor concentration (g)</th><th>be (V/dec)</th><th>ba (V/dec)</th><th>Ecorr (V)</th><th>icorr (AJcm?)</th><th>Polarization resistance (Q)</th><th>Corrosion rate (mmj/year)</th></tr></thead><tbody><tr><td></td><td>0.0335</td><td>0.0409</td><td>—0.9393</td><td>0.0003</td><td>24.0910</td><td>2.8163</td></tr><tr><td>NO</td><td>1.9460</td><td>0.0596</td><td>—0.8276</td><td>0.0002</td><td>121.440</td><td>1.5054</td></tr><tr><td></td><td>0.0163</td><td>0.2369</td><td>—0.8825</td><td>0.0001</td><td>42121</td><td>0.9476</td></tr><tr><td>s</td><td>03233</td><td>0.0540</td><td>—0.8027</td><td>5.39E-05</td><td>373.180</td><td>0.4318</td></tr><tr><td></td><td>0.1240</td><td>0.0556</td><td>—0.5896</td><td>5.46E-05</td><td>305.650</td><td>0.3772</td></tr><tr><td>= 5</td><td>0.0382</td><td>0.0086</td><td>—0.5356</td><td>1.24E-05</td><td>246.080</td><td>0.0919</td></tr></tbody></table>'
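
Because the HTML keeps the cell boundaries, it can be turned back into rows of cells for downstream code. Below is a minimal sketch using only the standard library `html.parser`. The class `TableRows`, the helper `html_table_to_rows`, and the shortened sample string are my own illustrations, not part of the course material.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Shortened, hand-made sample in the same shape as text_as_html.
sample_html = (
    "<table><thead><tr><th>Ecorr (V)</th><th>icorr</th></tr></thead>"
    "<tbody><tr><td></td><td>0.0003</td></tr></tbody></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

Empty cells come back as empty strings, so every row keeps the same number of columns as the header.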

3.4 Pretty-printing the HTML

from io import StringIO 
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

The output is as follows:

<table>
  <thead>
    <tr>
      <th>Inhibitor concentration (g)</th>
      <th>be (V/dec)</th>
      <th>ba (V/dec)</th>
      <th>Ecorr (V)</th>
      <th>icorr (AJcm?)</th>
      <th>Polarization resistance (Q)</th>
      <th>Corrosion rate (mmj/year)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td/>
      <td>0.0335</td>
      <td>0.0409</td>
      <td>&#8212;0.9393</td>
      <td>0.0003</td>
      <td>24.0910</td>
      <td>2.8163</td>
    </tr>
    <tr>
      <td>NO</td>
      <td>1.9460</td>
      <td>0.0596</td>
      <td>&#8212;0.8276</td>
      <td>0.0002</td>
      <td>121.440</td>
      <td>1.5054</td>
    </tr>
    <tr>
      <td/>
      <td>0.0163</td>
      <td>0.2369</td>
      <td>&#8212;0.8825</td>
      <td>0.0001</td>
      <td>42121</td>
      <td>0.9476</td>
    </tr>
    <tr>
      <td>s</td>
      <td>03233</td>
      <td>0.0540</td>
      <td>&#8212;0.8027</td>
      <td>5.39E-05</td>
      <td>373.180</td>
      <td>0.4318</td>
    </tr>
    <tr>
      <td/>
      <td>0.1240</td>
      <td>0.0556</td>
      <td>&#8212;0.5896</td>
      <td>5.46E-05</td>
      <td>305.650</td>
      <td>0.3772</td>
    </tr>
    <tr>
      <td>= 5</td>
      <td>0.0382</td>
      <td>0.0086</td>
      <td>&#8212;0.5356</td>
      <td>1.24E-05</td>
      <td>246.080</td>
      <td>0.0919</td>
    </tr>
  </tbody>
</table>
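
The pretty-printed form makes one OCR artifact easy to spot: the negative values in the Ecorr column start with the em dash character (`&#8212;`, U+2014) instead of a minus sign. If the cells are to be used as numbers, a cleanup step along the following lines may help. This is a sketch under my own assumptions: `to_number` is a hypothetical helper, and it does not try to detect cells that may have lost a decimal point (for example `03233`), which would still parse as numbers.

```python
def to_number(cell):
    """Return the cell as a float if it parses after dash cleanup, else None."""
    cleaned = cell.strip()
    # Map dash look-alikes (em dash, en dash, minus sign) to ASCII "-".
    for dash in ("\u2014", "\u2013", "\u2212"):
        cleaned = cleaned.replace(dash, "-")
    try:
        return float(cleaned)
    except ValueError:
        return None


cells = ["\u20140.9393", "5.39E-05", "NO", "", "24.0910"]
print([to_number(c) for c in cells])
```

Returning `None` for non-numeric cells keeps stray OCR fragments such as `NO` or `s` visible to the caller instead of silently dropping them.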

3.5 Rendering the table as HTML

from IPython.display import HTML
HTML(table_html)

The output is as follows:
[Image: the extracted table rendered as HTML]
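
Outside a notebook there is no rich display, so one way to inspect the table is to write it to a file and open that file in a browser. The following is a minimal sketch. The helper `save_table_html`, the file name `table_preview.html`, and the one-cell sample are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_table_html(table_html, path):
    """Wrap a table fragment in a minimal HTML page and write it to `path`."""
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Extracted table</title></head>\n'
        f"<body>{table_html}</body></html>\n"
    )
    path = Path(path)
    path.write_text(page, encoding="utf-8")
    return path


# One-cell sample standing in for tables[0].metadata.text_as_html.
sample_html = "<table><tr><td>0.0335</td></tr></table>"
out_path = save_table_html(
    sample_html, Path(tempfile.gettempdir()) / "table_preview.html"
)
print(out_path)
```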

3.6 Summarizing with langchain

from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")
chain.invoke([Document(page_content=table_html)])

The output is as follows:

{'input_documents': [Document(page_content='<table><thead><tr><th>Inhibitor concentration (g)</th><th>be (V/dec)</th><th>ba (V/dec)</th><th>Ecorr (V)</th><th>icorr (AJcm?)</th><th>Polarization resistance (Q)</th><th>Corrosion rate (mmj/year)</th></tr></thead><tbody><tr><td></td><td>0.0335</td><td>0.0409</td><td>—0.9393</td><td>0.0003</td><td>24.0910</td><td>2.8163</td></tr><tr><td>NO</td><td>1.9460</td><td>0.0596</td><td>—0.8276</td><td>0.0002</td><td>121.440</td><td>1.5054</td></tr><tr><td></td><td>0.0163</td><td>0.2369</td><td>—0.8825</td><td>0.0001</td><td>42121</td><td>0.9476</td></tr><tr><td>s</td><td>03233</td><td>0.0540</td><td>—0.8027</td><td>5.39E-05</td><td>373.180</td><td>0.4318</td></tr><tr><td></td><td>0.1240</td><td>0.0556</td><td>—0.5896</td><td>5.46E-05</td><td>305.650</td><td>0.3772</td></tr><tr><td>= 5</td><td>0.0382</td><td>0.0086</td><td>—0.5356</td><td>1.24E-05</td><td>246.080</td><td>0.0919</td></tr></tbody></table>')],
 'output_text': 'The table provides data on the corrosion rate and polarization resistance of different inhibitor concentrations in a solution. The data includes the inhibitor concentration, be and ba values, Ecorr, icorr, polarization resistance, and corrosion rate. The table shows the impact of different inhibitor concentrations on the corrosion rate and polarization resistance.'}