1. Learning Content

The material in this post comes from Andrew Ng's course Preprocessing Unstructured Data for LLM Applications. Since it deals with processing unstructured data, I have organized these study notes.

2. Environment Setup

2.1 Python 3.9 or later is recommended

chromadb==0.4.22
langchain==0.1.5
langchain-community==0.0.17
langchain-core==0.1.19
langchain-openai==0.0.5
openai==1.11.1
tiktoken==0.5.2
#"unstructured[md,pdf,pptx]"
unstructured-client==0.16.0
unstructured==0.12.3
unstructured-inference==0.7.23
unstructured.pytesseract==0.3.12
urllib3==1.26.18
python-dotenv==1.0.1
panel==1.3.8
ipython==8.18.1
python-pptx==0.6.23
pdf2image==1.17.0
pdfminer==20191125
opencv-python==4.9.0.80
pikepdf==8.13.0
pypdf==4.0.1

2.2 Apply for an Unstructured API key

Sign up at unstructured.io. Newly registered users get a 14-day free trial with a quota of 1,000 converted pages per day, as shown below:

(Screenshot: the Unstructured API key sign-up page)

3. Prepare the Example Materials


# Warning control
import warnings
warnings.filterwarnings('ignore')

from IPython.display import JSON

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json
# Initialize the Unstructured API client
s = UnstructuredClient(
    api_key_auth="XXX",
    server_url="https://api.unstrXXX",
)
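
The key is hard-coded above only as a placeholder. Since python-dotenv is already in the requirements, a common alternative is to keep the credentials in a local .env file and load them at runtime. A minimal sketch; the variable names UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL are my own illustrative choices, not names the course prescribes:

# Load the credentials from a local .env file instead of hard-coding them.
# Assumed .env contents (names are illustrative):
#   UNSTRUCTURED_API_KEY=<your key>
#   UNSTRUCTURED_API_URL=<your API endpoint>
import os
from dotenv import load_dotenv
from unstructured_client import UnstructuredClient

load_dotenv()

s = UnstructuredClient(
    api_key_auth=os.environ["UNSTRUCTURED_API_KEY"],
    server_url=os.environ["UNSTRUCTURED_API_URL"],
)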

3.1 Parsing HTML

from IPython.display import Image
Image(filename="example_screnshoot/HTML_demo.png", height=100, width=100)

The rendered screenshot is shown below:

(Screenshot: HTML_demo.png, the Medium blog post to be parsed)

filename = "examples/medium_blog.html"
elements = partition_html(filename=filename)
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[11:15], indent=2)
JSON(example_output)
[
  {
    "type": "Title",
    "element_id": "ca4a2c78bca728f3477958ece3222e10",
    "text": "Share",
    "metadata": {
      "category_depth": 0,
      "last_modified": "2024-07-09T15:00:56",
      "languages": [
        "eng"
      ],
      "file_directory": "examples",
      "filename": "medium_blog.html",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "23a7f3e28178ea0fa2b3e98b0275d2e3",
    "text": "In the vast digital universe, data is the lifeblood that drives decision-making and innovation. But not all data is created equal. Unstructured data in images and documents often hold a wealth of information that can be challenging to extract and analyze.",
    "metadata": {
      "last_modified": "2024-07-09T15:00:56",
      "languages": [
        "eng"
      ],
      "parent_id": "ca4a2c78bca728f3477958ece3222e10",
      "file_directory": "examples",
      "filename": "medium_blog.html",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "e1b7532458a93cfc789751895884e7bb",
    "text": "Enter Unstructured.io, a powerful tool to extract and efficiently transform structured data. With sixteen and counting pre-built connectors, the API can easily integrate with various data sources, including AWS S3, GitHub, Google Cloud Storage, and more.",
    "metadata": {
      "link_texts": [
        "Unstructured.io"
      ],
      "link_urls": [
        "https://www.unstructured.io/"
      ],
      "link_start_indexes": [
        6
      ],
      "last_modified": "2024-07-09T15:00:56",
      "languages": [
        "eng"
      ],
      "parent_id": "ca4a2c78bca728f3477958ece3222e10",
      "file_directory": "examples",
      "filename": "medium_blog.html",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "a6179d69ca1a55e0a3f98c08af0034e0",
    "text": "In this guide, we’ll cover the advantages of using the Unstructured API and Connector module, walk you through a step-by-step process of using it with the S3 Connector as an example, and show you how to be a part of the Unstructured community.",
    "metadata": {
      "last_modified": "2024-07-09T15:00:56",
      "languages": [
        "eng"
      ],
      "parent_id": "ca4a2c78bca728f3477958ece3222e10",
      "file_directory": "examples",
      "filename": "medium_blog.html",
      "filetype": "text/html"
    }
  }
]
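
elements_to_json was imported earlier but is not used in this snippet. A small sketch (the output path is an arbitrary choice of mine) of how the parsed elements could be inspected and persisted for later processing:

# Count how many elements of each type were extracted from the page.
from collections import Counter
print(Counter(el.category for el in elements))

# Persist the elements so later steps can reuse them without re-parsing;
# the file name is illustrative only.
elements_to_json(elements, filename="examples/medium_blog.json")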

3.2 Parsing PPTX

Image(filename="example_screnshoot/pptx_slide.png", height=600, width=600)

(Screenshot: pptx_slide.png, the slide to be parsed)

filename = "examples/msft_openai.pptx"
elements = partition_pptx(filename=filename)
element_dict = [el.to_dict() for el in elements]
JSON(json.dumps(element_dict[:], indent=2))

The output is as follows:

[
  {
    "type": "Title",
    "element_id": "e53cb06805f45fa23fb6d77966c5ec63",
    "text": "ChatGPT",
    "metadata": {
      "category_depth": 1,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "34a50527166e6765aa3e40778b5764e1",
    "text": "Chat-GPT: AI Chatbot, developed by OpenAI, trained to perform conversational tasks and creative tasks",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "631df69dff044f977d66d71c5cbdab83",
    "text": "Backed by GPT-3.5 model (gpt-35-turbo), GPT-4 models",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "6ac7cc52b0b2842ce7803bb176add0fb",
    "text": "Trained over 175 billion machine learning parameters",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "01133c5465c85564ab1e39568d8b51f5",
    "text": "Conversation-in and message-out ",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "1d495819227b92f341fb4b58d723a497",
    "text": "Note: Chat Completion API for GPT-4 models",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "e450241caa0f39c30939a474bcff06ac",
    "text": "GPT-4 is multimodal (e.g., images + text)",
    "metadata": {
      "category_depth": 0,
      "file_directory": "examples",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-07-09T15:01:08",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  }
]
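
Because every element carries a type, an element_id, and (for children) a parent_id, the slide's bullet points can be collected without any knowledge of the PowerPoint layout. A small sketch over the element_dict produced above; this is my own illustration, not course code:

# Gather the ListItem texts that hang off the "ChatGPT" Title element,
# using only the type / element_id / parent_id fields shown in the output.
title = next(el for el in element_dict if el["type"] == "Title")
bullets = [
    el["text"]
    for el in element_dict
    if el["type"] == "ListItem"
    and el["metadata"].get("parent_id") == title["element_id"]
]
for b in bullets:
    print("-", b)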

3.3 Parsing PDF

Image(filename="example_screnshoot/cot_paper.png", height=600, width=600)

(Screenshot: cot_paper.png, the chain-of-thought paper page to be parsed)

filename = "examples/CoT.pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(), 
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)
JSON(json.dumps(resp.elements, indent=2))

The output is as follows:

[
  {
    "type": "Title",
    "element_id": "826446fa7830f0352c88808f40b0cc9b",
    "text": "B All Experimental Results",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "055f2fa97fbdee35766495a3452ebd9d",
    "text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "parent_id": "826446fa7830f0352c88808f40b0cc9b",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "9bf5af5255b80aace01b2da84ea86531",
    "text": "For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except the model performed an arithmetic operation incorrectly. A similar observation was made in Cobbe et al. (2021). Hence, we can further add a Python program as an external calculator (using the Python eval function) to all the equations in the generated chain of thought. When there are multiple equations in a chain of thought, we propagate the external calculator results from one equation to the following equations via string matching. As shown in Table 1, we see that adding a calculator significantly boosts performance of chain-of-thought prompting on most tasks.",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "parent_id": "826446fa7830f0352c88808f40b0cc9b",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "46381dc72867b437cb990fc7734840ee",
    "text": "Table 1: Chain of thought prompting outperforms standard prompting for various large language models on five arithmetic reasoning benchmarks. All metrics are accuracy (%). Ext. calc.: post-hoc external calculator for arithmetic computations only. Prior best numbers are from the following. a: Cobbe et al. (2021). b & e: Pi et al. (2022), c: Lan et al. (2021), d: Pi˛ekos et al. (2021).",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "parent_id": "826446fa7830f0352c88808f40b0cc9b",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "Table",
    "element_id": "3d22e4ba38f71ed038e9a72e4e8e225d",
    "text": "Prior best Prompting N/A (finetuning) 55a GSM8K SVAMP ASDiv 57.4b 75.3c AQuA 37.9d MAWPS 88.4e UL2 20B Standard Chain of thought 4.4 (+0.3) + ext. calc 4.1 6.9 10.1 12.5 (+2.4) 16.9 (+0.9) 23.6 (+3.1) 28.3 16.0 20.5 34.3 23.6 16.6 19.1 (+2.5) 42.7 LaMDA 137B Standard Chain of thought 14.3 (+7.8) + ext. calc 6.5 17.8 29.5 37.5 (+8.0) 46.6 (+6.5) 20.6 (-4.9) 42.1 40.1 25.5 53.4 20.6 43.2 57.9 (+14.7) 69.3 GPT-3 175B (text-davinci-002) Chain of thought 46.9 (+31.3) 68.9 (+3.2) 71.3 (+1.0) 35.8 (+11.0) 87.1 (+14.4) Standard 15.6 65.7 70.3 24.8 72.7 + ext. calc 49.6 70.3 71.1 35.8 87.5 Codex (code-davinci-002) Chain of thought 63.1 (+43.4) 76.4 (+6.5) 80.4 (+6.4) 45.3 (+15.8) 92.6 (+13.9) Standard 19.7 69.9 74.0 29.5 78.7 + ext. calc 65.4 77.0 80.0 45.3 93.3 PaLM 540B Standard Chain of thought 56.9 (+39.0) 79.0 (+9.6) 73.9 (+1.8) 35.8 (+10.6) 93.3 (+14.2) + ext. calc 17.9 69.4 72.1 25.2 79.2 79.8 58.6 72.6 35.8 93.5",
    "metadata": {
      "text_as_html": "<table><thead><tr><th></th><th>Prompting</th><th>GSMBK</th><th>SVAMP</th><th>ASDiv</th><th>AQUA</th><th>MAWPS</th></tr></thead><tbody><tr><td>Prior best</td><td>(finetuning)</td><td>55¢</td><td>57.4°</td><td>75.3¢</td><td>37.9¢</td><td>88.4¢</td></tr><tr><td rowspan=\"3\">UL2 20B</td><td>Standard</td><td>4.1</td><td>10.1</td><td>16.0</td><td>20.5</td><td>16.6</td></tr><tr><td>Chain of thought</td><td>4.4 (+0.3)</td><td>12.5 2.4</td><td>16.9 (+0.9)</td><td>23.6 (+3.1)</td><td>19.1 (2.5</td></tr><tr><td>+ ext. cale</td><td>.9</td><td>283</td><td>343</td><td>23.6</td><td>4.7</td></tr><tr><td rowspan=\"3\">LaMDA 137B</td><td>Standard</td><td>6.5</td><td>29.5</td><td>40.1</td><td>25.5</td><td>432</td></tr><tr><td>Chain of thought</td><td>14.3 (+7.8)</td><td>37.5 +8.0)</td><td>46.6 (+6.5)</td><td>20.6 (-4.9)</td><td>57.9 (+14.7)</td></tr><tr><td>+ ext. cale</td><td>78</td><td>42.</td><td>534</td><td>20.6</td><td>69.3</td></tr><tr><td>GPT-3 175B</td><td>Standard</td><td>15.6</td><td>65.7</td><td>70.3</td><td>24.8</td><td>72.7</td></tr><tr><td rowspan=\"2\">(text-davinci-002)</td><td>Chain of thought</td><td>46.9 (+31.3)</td><td>68.9 +3.2)</td><td>71.3 (+1.0)</td><td>35.8 (+11.0)</td><td>87.1 (+14.4)</td></tr><tr><td>+ ext. cale</td><td>49.6</td><td>0.3</td><td>71.1</td><td>358</td><td>875</td></tr><tr><td>Codex</td><td>Standard</td><td>19.7</td><td>69.9</td><td>74.0</td><td>29.5</td><td>8.7</td></tr><tr><td rowspan=\"2\">(code-davinci-002)</td><td>Chain of thought</td><td>63.1 (+434)</td><td>76.4 (+6.5)</td><td>80.4 (+6.4)</td><td>45.3 (+15.8)</td><td>92.6 (+13.9)</td></tr><tr><td>+ ext. cale</td><td>65.4</td><td>77.0</td><td>80.0</td><td>453</td><td>933</td></tr><tr><td rowspan=\"3\">PalLM 540B</td><td>Standard</td><td>17.9</td><td>69.4</td><td>72.1</td><td>252</td><td>79.2</td></tr><tr><td>Chain of thought</td><td>56.9 +39.0)</td><td>79.0 (+9.6)</td><td>73.9 (+1.8)</td><td>35.8 (+10.6)</td><td>93.3 (+142)</td></tr><tr><td>+ ext. cale</td><td>58.6</td><td>79.8</td><td>726</td><td>358</td><td>935</td></tr></tbody></table>",
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "parent_id": "826446fa7830f0352c88808f40b0cc9b",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "PageNumber",
    "element_id": "0301f13983c12f215df253d2e16300d0",
    "text": "20",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "CoT.pdf"
    }
  }
]
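
The API returns plain dictionaries. dict_to_elements, imported at the top but unused so far, converts them back into unstructured Element objects, after which the detected table's HTML rendering is available through metadata.text_as_html. A hedged sketch based on the response above:

# Convert the JSON response back into unstructured Element objects.
pdf_elements = dict_to_elements(resp.elements)

# Pull out the HTML rendering of the detected table
# (the "Table" element shown in the output above).
tables = [el for el in pdf_elements if el.category == "Table"]
if tables:
    print(tables[0].metadata.text_as_html[:300])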

4. Summary

The examples above normalize unstructured documents into a standard element format, after which the data can be processed with ease. See reference 1 for the course itself.

References

  1.  Preprocessing Unstructured Data for LLM Applications
  2.  https://unstructured.io/