1. Learning Content
This section is based on Andrew Ng's course Preprocessing Unstructured Data for LLM Applications. Since it covers the handling of unstructured data, I compiled these study notes while working through it.
2. Environment Setup
2.1 Python 3.9 or later is recommended. The pinned dependencies used here:
chromadb==0.4.22
langchain==0.1.5
langchain-community==0.0.17
langchain-core==0.1.19
langchain-openai==0.0.5
openai==1.11.1
tiktoken==0.5.2
# Or pull in the format extras in one step: pip install "unstructured[md,pdf,pptx]"
unstructured-client==0.16.0
unstructured==0.12.3
unstructured-inference==0.7.23
unstructured.pytesseract==0.3.12
urllib3==1.26.18
python-dotenv==1.0.1
panel==1.3.8
ipython==8.18.1
python-pptx==0.6.23
pdf2image==1.17.0
pdfminer==20191125
opencv-python==4.9.0.80
pikepdf==8.13.0
pypdf==4.0.1
2.2 Requesting an Unstructured API key
Sign up at unstructured.io. Newly registered users get a 14-day free trial with up to 1,000 pages of conversion per day.
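Once the key is issued, one convenient pattern (a sketch of this note's own, not course code) is to keep it out of the source and load it from the environment; python-dotenv from the dependency list above can populate `os.environ` from a `.env` file. The helper name `get_unstructured_api_key` is made up for this example:

```python
import os

def get_unstructured_api_key(env=None):
    """Fetch the Unstructured API key, failing loudly when it is missing.

    `env` defaults to os.environ; pass a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; register at unstructured.io "
            "and export the key (e.g. via a .env file and python-dotenv)."
        )
    return key
```

The API client is then built as `UnstructuredClient(api_key_auth=get_unstructured_api_key())`.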
3. Preparing the Materials
# Warning control
import warnings
warnings.filterwarnings('ignore')
from IPython.display import JSON
import json
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json
3.1 Parsing HTML
The parsed elements look like this:
[
{
"type": "Title",
"element_id": "ca4a2c78bca728f3477958ece3222e10",
"text": "Share",
"metadata": {
"category_depth": 0,
"last_modified": "2024-07-09T15:00:56",
"languages": [
"eng"
],
"file_directory": "examples",
"filename": "medium_blog.html",
"filetype": "text/html"
}
},
{
"type": "NarrativeText",
"element_id": "23a7f3e28178ea0fa2b3e98b0275d2e3",
"text": "In the vast digital universe, data is the lifeblood that drives decision-making and innovation. But not all data is created equal. Unstructured data in images and documents often hold a wealth of information that can be challenging to extract and analyze.",
"metadata": {
"last_modified": "2024-07-09T15:00:56",
"languages": [
"eng"
],
"parent_id": "ca4a2c78bca728f3477958ece3222e10",
"file_directory": "examples",
"filename": "medium_blog.html",
"filetype": "text/html"
}
},
{
"type": "NarrativeText",
"element_id": "e1b7532458a93cfc789751895884e7bb",
"text": "Enter Unstructured.io, a powerful tool to extract and efficiently transform structured data. With sixteen and counting pre-built connectors, the API can easily integrate with various data sources, including AWS S3, GitHub, Google Cloud Storage, and more.",
"metadata": {
"link_texts": [
"Unstructured.io"
],
"link_urls": [
"https://www.unstructured.io/"
],
"link_start_indexes": [
6
],
"last_modified": "2024-07-09T15:00:56",
"languages": [
"eng"
],
"parent_id": "ca4a2c78bca728f3477958ece3222e10",
"file_directory": "examples",
"filename": "medium_blog.html",
"filetype": "text/html"
}
},
{
"type": "NarrativeText",
"element_id": "a6179d69ca1a55e0a3f98c08af0034e0",
"text": "In this guide, we’ll cover the advantages of using the Unstructured API and Connector module, walk you through a step-by-step process of using it with the S3 Connector as an example, and show you how to be a part of the Unstructured community.",
"metadata": {
"last_modified": "2024-07-09T15:00:56",
"languages": [
"eng"
],
"parent_id": "ca4a2c78bca728f3477958ece3222e10",
"file_directory": "examples",
"filename": "medium_blog.html",
"filetype": "text/html"
}
}
]
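Note how each `NarrativeText` element above carries a `metadata.parent_id` pointing back at the `Title` it belongs to, so the flat element list can be folded back into a tree. A pure-Python sketch over a trimmed copy of the output (IDs shortened):

```python
def group_by_parent(elements):
    """Map each parent element_id to the texts of its child elements."""
    children = {}
    for el in elements:
        parent = el.get("metadata", {}).get("parent_id")
        if parent is not None:
            children.setdefault(parent, []).append(el["text"])
    return children

# Trimmed copy of the JSON output above
elements = [
    {"type": "Title", "element_id": "ca4a2c78", "text": "Share",
     "metadata": {}},
    {"type": "NarrativeText", "element_id": "23a7f3e2",
     "text": "In the vast digital universe...",
     "metadata": {"parent_id": "ca4a2c78"}},
    {"type": "NarrativeText", "element_id": "e1b75324",
     "text": "Enter Unstructured.io...",
     "metadata": {"parent_id": "ca4a2c78"}},
]
print(group_by_parent(elements))
```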
3.2 Parsing PPTX
The output looks like this:
[
{
"type": "Title",
"element_id": "e53cb06805f45fa23fb6d77966c5ec63",
"text": "ChatGPT",
"metadata": {
"category_depth": 1,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "34a50527166e6765aa3e40778b5764e1",
"text": "Chat-GPT: AI Chatbot, developed by OpenAI, trained to perform conversational tasks and creative tasks",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "631df69dff044f977d66d71c5cbdab83",
"text": "Backed by GPT-3.5 model (gpt-35-turbo), GPT-4 models",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "6ac7cc52b0b2842ce7803bb176add0fb",
"text": "Trained over 175 billion machine learning parameters",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "01133c5465c85564ab1e39568d8b51f5",
"text": "Conversation-in and message-out ",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "1d495819227b92f341fb4b58d723a497",
"text": "Note: Chat Completion API for GPT-4 models",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
},
{
"type": "ListItem",
"element_id": "e450241caa0f39c30939a474bcff06ac",
"text": "GPT-4 is multimodal (e.g., images + text)",
"metadata": {
"category_depth": 0,
"file_directory": "examples",
"filename": "msft_openai.pptx",
"last_modified": "2024-07-09T15:01:08",
"page_number": 1,
"languages": [
"eng"
],
"parent_id": "e53cb06805f45fa23fb6d77966c5ec63",
"filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
}
}
]
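Every `ListItem` above again points at the slide's `Title` through `parent_id`, so the bullets can be flattened back into a readable outline. A pure-Python sketch over a shortened copy of the output:

```python
def slide_outline(elements):
    """Render Title elements and their ListItem children as an indented outline."""
    title_ids = {el["element_id"] for el in elements if el["type"] == "Title"}
    lines = []
    for el in elements:
        if el["type"] == "Title":
            lines.append(el["text"])
        elif el["type"] == "ListItem":
            indent = "  - " if el["metadata"].get("parent_id") in title_ids else "- "
            lines.append(indent + el["text"].strip())
    return "\n".join(lines)

# Shortened copy of the JSON output above (IDs abbreviated)
elements = [
    {"type": "Title", "element_id": "e53cb068", "text": "ChatGPT",
     "metadata": {}},
    {"type": "ListItem", "element_id": "34a50527",
     "text": "Chat-GPT: AI Chatbot, developed by OpenAI",
     "metadata": {"parent_id": "e53cb068"}},
    {"type": "ListItem", "element_id": "e450241c",
     "text": "GPT-4 is multimodal (e.g., images + text)",
     "metadata": {"parent_id": "e53cb068"}},
]
print(slide_outline(elements))
```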
3.3 Parsing PDF
# The API client itself is not shown in the excerpt above; it is created
# roughly like this (swap in your own key from section 2.2):
s = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")

filename = "examples/CoT.pdf"
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",               # layout-model parsing: slower, but table-aware
    pdf_infer_table_structure=True,  # also return tables as text_as_html
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)
The output looks like this:
[
{
"type": "Title",
"element_id": "826446fa7830f0352c88808f40b0cc9b",
"text": "B All Experimental Results",
"metadata": {
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"filename": "CoT.pdf"
}
},
{
"type": "NarrativeText",
"element_id": "055f2fa97fbdee35766495a3452ebd9d",
"text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
"metadata": {
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"parent_id": "826446fa7830f0352c88808f40b0cc9b",
"filename": "CoT.pdf"
}
},
{
"type": "NarrativeText",
"element_id": "9bf5af5255b80aace01b2da84ea86531",
"text": "For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except the model performed an arithmetic operation incorrectly. A similar observation was made in Cobbe et al. (2021). Hence, we can further add a Python program as an external calculator (using the Python eval function) to all the equations in the generated chain of thought. When there are multiple equations in a chain of thought, we propagate the external calculator results from one equation to the following equations via string matching. As shown in Table 1, we see that adding a calculator significantly boosts performance of chain-of-thought prompting on most tasks.",
"metadata": {
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"parent_id": "826446fa7830f0352c88808f40b0cc9b",
"filename": "CoT.pdf"
}
},
{
"type": "NarrativeText",
"element_id": "46381dc72867b437cb990fc7734840ee",
"text": "Table 1: Chain of thought prompting outperforms standard prompting for various large language models on five arithmetic reasoning benchmarks. All metrics are accuracy (%). Ext. calc.: post-hoc external calculator for arithmetic computations only. Prior best numbers are from the following. a: Cobbe et al. (2021). b & e: Pi et al. (2022), c: Lan et al. (2021), d: Pi˛ekos et al. (2021).",
"metadata": {
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"parent_id": "826446fa7830f0352c88808f40b0cc9b",
"filename": "CoT.pdf"
}
},
{
"type": "Table",
"element_id": "3d22e4ba38f71ed038e9a72e4e8e225d",
"text": "Prior best Prompting N/A (finetuning) 55a GSM8K SVAMP ASDiv 57.4b 75.3c AQuA 37.9d MAWPS 88.4e UL2 20B Standard Chain of thought 4.4 (+0.3) + ext. calc 4.1 6.9 10.1 12.5 (+2.4) 16.9 (+0.9) 23.6 (+3.1) 28.3 16.0 20.5 34.3 23.6 16.6 19.1 (+2.5) 42.7 LaMDA 137B Standard Chain of thought 14.3 (+7.8) + ext. calc 6.5 17.8 29.5 37.5 (+8.0) 46.6 (+6.5) 20.6 (-4.9) 42.1 40.1 25.5 53.4 20.6 43.2 57.9 (+14.7) 69.3 GPT-3 175B (text-davinci-002) Chain of thought 46.9 (+31.3) 68.9 (+3.2) 71.3 (+1.0) 35.8 (+11.0) 87.1 (+14.4) Standard 15.6 65.7 70.3 24.8 72.7 + ext. calc 49.6 70.3 71.1 35.8 87.5 Codex (code-davinci-002) Chain of thought 63.1 (+43.4) 76.4 (+6.5) 80.4 (+6.4) 45.3 (+15.8) 92.6 (+13.9) Standard 19.7 69.9 74.0 29.5 78.7 + ext. calc 65.4 77.0 80.0 45.3 93.3 PaLM 540B Standard Chain of thought 56.9 (+39.0) 79.0 (+9.6) 73.9 (+1.8) 35.8 (+10.6) 93.3 (+14.2) + ext. calc 17.9 69.4 72.1 25.2 79.2 79.8 58.6 72.6 35.8 93.5",
"metadata": {
"text_as_html": "<table><thead><tr><th></th><th>Prompting</th><th>GSMBK</th><th>SVAMP</th><th>ASDiv</th><th>AQUA</th><th>MAWPS</th></tr></thead><tbody><tr><td>Prior best</td><td>(finetuning)</td><td>55¢</td><td>57.4°</td><td>75.3¢</td><td>37.9¢</td><td>88.4¢</td></tr><tr><td rowspan=\"3\">UL2 20B</td><td>Standard</td><td>4.1</td><td>10.1</td><td>16.0</td><td>20.5</td><td>16.6</td></tr><tr><td>Chain of thought</td><td>4.4 (+0.3)</td><td>12.5 2.4</td><td>16.9 (+0.9)</td><td>23.6 (+3.1)</td><td>19.1 (2.5</td></tr><tr><td>+ ext. cale</td><td>.9</td><td>283</td><td>343</td><td>23.6</td><td>4.7</td></tr><tr><td rowspan=\"3\">LaMDA 137B</td><td>Standard</td><td>6.5</td><td>29.5</td><td>40.1</td><td>25.5</td><td>432</td></tr><tr><td>Chain of thought</td><td>14.3 (+7.8)</td><td>37.5 +8.0)</td><td>46.6 (+6.5)</td><td>20.6 (-4.9)</td><td>57.9 (+14.7)</td></tr><tr><td>+ ext. cale</td><td>78</td><td>42.</td><td>534</td><td>20.6</td><td>69.3</td></tr><tr><td>GPT-3 175B</td><td>Standard</td><td>15.6</td><td>65.7</td><td>70.3</td><td>24.8</td><td>72.7</td></tr><tr><td rowspan=\"2\">(text-davinci-002)</td><td>Chain of thought</td><td>46.9 (+31.3)</td><td>68.9 +3.2)</td><td>71.3 (+1.0)</td><td>35.8 (+11.0)</td><td>87.1 (+14.4)</td></tr><tr><td>+ ext. cale</td><td>49.6</td><td>0.3</td><td>71.1</td><td>358</td><td>875</td></tr><tr><td>Codex</td><td>Standard</td><td>19.7</td><td>69.9</td><td>74.0</td><td>29.5</td><td>8.7</td></tr><tr><td rowspan=\"2\">(code-davinci-002)</td><td>Chain of thought</td><td>63.1 (+434)</td><td>76.4 (+6.5)</td><td>80.4 (+6.4)</td><td>45.3 (+15.8)</td><td>92.6 (+13.9)</td></tr><tr><td>+ ext. cale</td><td>65.4</td><td>77.0</td><td>80.0</td><td>453</td><td>933</td></tr><tr><td rowspan=\"3\">PalLM 540B</td><td>Standard</td><td>17.9</td><td>69.4</td><td>72.1</td><td>252</td><td>79.2</td></tr><tr><td>Chain of thought</td><td>56.9 +39.0)</td><td>79.0 (+9.6)</td><td>73.9 (+1.8)</td><td>35.8 (+10.6)</td><td>93.3 (+142)</td></tr><tr><td>+ ext. 
cale</td><td>58.6</td><td>79.8</td><td>726</td><td>358</td><td>935</td></tr></tbody></table>",
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"parent_id": "826446fa7830f0352c88808f40b0cc9b",
"filename": "CoT.pdf"
}
},
{
"type": "PageNumber",
"element_id": "0301f13983c12f215df253d2e16300d0",
"text": "20",
"metadata": {
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1,
"filename": "CoT.pdf"
}
}
]
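Because the request set `pdf_infer_table_structure=True`, the `Table` element above carries both a flat `text` and a structured `metadata.text_as_html`; the HTML form is usually the one to keep for downstream chunking or LLM prompts. A pure-Python sketch for pulling it out of the response elements:

```python
def extract_table_html(elements):
    """Collect (page_number, text_as_html) for every Table element."""
    tables = []
    for el in elements:
        if el.get("type") == "Table":
            md = el.get("metadata", {})
            tables.append((md.get("page_number"), md.get("text_as_html", "")))
    return tables

# Shortened stand-in for the PDF response elements above
elements = [
    {"type": "Title", "text": "B All Experimental Results",
     "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Prior best Prompting ...",
     "metadata": {"page_number": 1,
                  "text_as_html": "<table><tr><td>55</td></tr></table>"}},
]
print(extract_table_html(elements))
```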
4. Summary
The steps above turn unstructured documents into a standardized element representation, after which the data can be processed with ease. The course itself is listed as reference link 1.
References