1. What this post covers

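The material in this post comes from Andrew Ng's course Preprocessing Unstructured Data for LLM Applications. Because it deals with processing unstructured data, I have written up my study notes here.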

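What is metadata? Metadata can exist at the document level or at the element level. It can be information extracted from the document itself, such as the last-modified date or the file name, or it can be something inferred while preprocessing the document. Metadata is very useful when building hybrid search for RAG. See the figure:

[Figure: sample element showing its metadata fields]

In the figure, pagenumber and language are examples of metadata.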

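What is metadata used for? If you want to restrict a search to a particular section, you can filter on that metadata field. If you want to restrict results to more recent information, you can construct the query so that it returns only documents dated after a given date.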
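To make the two filters above concrete, here is a minimal sketch in plain Python. The field names `section` and `last_modified` and the sample documents are my own assumptions for illustration; they are not taken from the course data.

```python
from datetime import date

# Hypothetical documents: the "section" and "last_modified" fields are
# assumptions for this illustration, not fields from the course data.
docs = [
    {"text": "Rules of ice hockey",
     "metadata": {"section": "ICE-HOCKEY", "last_modified": date(2024, 3, 1)}},
    {"text": "Curling basics",
     "metadata": {"section": "CURLING", "last_modified": date(2023, 1, 15)}},
    {"text": "Hockey rinks",
     "metadata": {"section": "ICE-HOCKEY", "last_modified": date(2022, 6, 30)}},
]


def filter_docs(docs, section=None, modified_after=None):
    """Keep documents that match the section and are newer than the cutoff."""
    kept = []
    for doc in docs:
        meta = doc["metadata"]
        # Filter 1: restrict the search to one section.
        if section is not None and meta.get("section") != section:
            continue
        # Filter 2: keep only documents modified after the given date.
        if modified_after is not None and meta["last_modified"] <= modified_after:
            continue
        kept.append(doc)
    return kept


recent_hockey = filter_docs(docs, section="ICE-HOCKEY", modified_after=date(2023, 1, 1))
print([d["text"] for d in recent_hockey])
```

In a real RAG setup the vector store would normally apply this kind of filter, as the `where` clause does in section 3.7.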

2. Environment setup

For setup, see part one of this series: "Unstructured Data Processing Before Building LLM Applications (1)".

The directory structure is shown in the figure:

[Figure: project directory structure]

This time we parse EPUB data, which is an e-book format. As before, you need to get an API key from unstructured.io.

3. Getting started

3.1 Imports and initialization

# Warning control
import warnings
warnings.filterwarnings('ignore')

import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

import json
from IPython.display import JSON

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

import chromadb
# Initialize the API client
s = UnstructuredClient(
    api_key_auth="XXX",
    server_url="https://api.unstrXXX",
)
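Rather than pasting the key into the notebook, one option is to read it from environment variables. The sketch below is only a suggestion: the variable names `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` and the helper `load_api_settings` are names I made up, not something the SDK defines.

```python
import os


def load_api_settings(env=os.environ):
    """Read the API key and server URL from environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    api_key = env.get("UNSTRUCTURED_API_KEY")
    server_url = env.get("UNSTRUCTURED_API_URL")
    if not api_key or not server_url:
        raise RuntimeError("Set UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL first")
    # Keys match the keyword arguments used in the client call above.
    return {"api_key_auth": api_key, "server_url": server_url}


# Demo with a fake environment so the sketch runs without real credentials.
settings = load_api_settings({
    "UNSTRUCTURED_API_KEY": "demo-key",
    "UNSTRUCTURED_API_URL": "https://example.invalid",
})
print(sorted(settings))
```

With real variables set, the client could then be created as `s = UnstructuredClient(**load_api_settings())`.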

3.2 Viewing the e-book's content and format

from IPython.display import Image
Image(filename='images/winter-sports-cover.png', height=400, width=400)

[Figure: winter-sports-cover.png]

Image(filename="images/winter-sports-toc.png", height=400, width=400)

[Figure: winter-sports-toc.png]

3.3 Parsing the book

filename = "example_files/winter-sports.epub"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)
try:
    resp = s.general.partition(req)
except SDKError as e:
    print(e)

JSON(json.dumps(resp.elements[0:3], indent=2))

The output is:

[
  {
    "type": "Title",
    "element_id": "6c6310b703135bfe4f64a9174a7af8eb",
    "text": "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson",
    "metadata": {
      "category_depth": 1,
      "emphasized_text_contents": [
        "Winter Sports in\nSwitzerland"
      ],
      "emphasized_text_tags": [
        "span"
      ],
      "languages": [
        "eng"
      ],
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "9ecb42d4f263247a920448ed98830388",
    "text": "\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online at\n",
    "metadata": {
      "link_texts": [
        "www.gutenberg.org"
      ],
      "link_urls": [
        "https://www.gutenberg.org"
      ],
      "link_start_indexes": [
        285
      ],
      "languages": [
        "eng"
      ],
      "parent_id": "6c6310b703135bfe4f64a9174a7af8eb",
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "87ad8d091d5904b17bc345b10a1c964a",
    "text": "www.gutenberg.org. If you are not located\nin the United States, you’ll have to check the laws of the country where\nyou are located before using this eBook.",
    "metadata": {
      "link_texts": [
        "www.gutenberg.org"
      ],
      "link_urls": [
        "https://www.gutenberg.org"
      ],
      "link_start_indexes": [
        -1
      ],
      "languages": [
        "eng"
      ],
      "parent_id": "6c6310b703135bfe4f64a9174a7af8eb",
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  }
]
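Before filtering, it can help to get a quick overview of which element types and metadata keys the parser returned. Below is a small sketch using `collections.Counter`. The helper name `summarize` is my own, and `sample_elements` is a hand-made stand-in with shortened text, not the real `resp.elements`.

```python
from collections import Counter

# Hand-made stand-in for resp.elements (texts shortened, ids omitted).
sample_elements = [
    {"type": "Title", "text": "ICE-HOCKEY",
     "metadata": {"languages": ["eng"]}},
    {"type": "NarrativeText", "text": "Many of the Swiss winter-resorts ...",
     "metadata": {"languages": ["eng"], "parent_id": "abc"}},
    {"type": "NarrativeText", "text": "And in most places hockey ...",
     "metadata": {"languages": ["eng"], "parent_id": "abc"}},
]


def summarize(elements):
    """Count elements per type and how often each metadata key appears."""
    type_counts = Counter(el["type"] for el in elements)
    key_counts = Counter(key for el in elements for key in el["metadata"])
    return type_counts, key_counts


type_counts, key_counts = summarize(sample_elements)
print(type_counts.most_common())
print(key_counts.most_common())
```

On the real data this would be called as `summarize(resp.elements)`.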

3.4 Filtering for elements of type Title whose text contains "hockey"

[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]

The output is:

[{'type': 'Title',
  'element_id': '6cf4a015e8c188360ea9f02a9802269b',
  'text': 'ICE-HOCKEY',
  'metadata': {'category_depth': 0,
   'emphasized_text_contents': ['ICE-HOCKEY'],
   'emphasized_text_tags': ['span'],
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}},
 {'type': 'Title',
  'element_id': '4ef38ec61b1326072f24495180c565a8',
  'text': 'ICE HOCKEY',
  'metadata': {'category_depth': 0,
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}}]
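The output contains two spellings of the same title, "ICE-HOCKEY" and "ICE HOCKEY". An exact string comparison would find only one of them, so one possible approach is to normalize titles before comparing. The helpers `normalize_title` and `find_titles` below are my own sketch, and `sample` is a cut-down copy of the output above plus one non-title element.

```python
import re


def normalize_title(text):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    text = text.casefold().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


def find_titles(elements, name):
    """Return Title elements whose normalized text equals the normalized name."""
    target = normalize_title(name)
    return [
        el for el in elements
        if el["type"] == "Title" and normalize_title(el["text"]) == target
    ]


# Cut-down copy of the two titles above, plus a paragraph that must not match.
sample = [
    {"type": "Title", "element_id": "6cf4a015e8c188360ea9f02a9802269b",
     "text": "ICE-HOCKEY"},
    {"type": "Title", "element_id": "4ef38ec61b1326072f24495180c565a8",
     "text": "ICE HOCKEY"},
    {"type": "NarrativeText", "element_id": "hypothetical-id",
     "text": "ice hockey"},
]

matches = find_titles(sample, "Ice-Hockey")
print([el["element_id"] for el in matches])
```

Whether both titles should be treated as the same chapter depends on the book, so check the matches by hand.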

3.5 Splitting by specified chapters

chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]

# Find the element_id of each chapter title listed above
chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if element["text"] == chapter and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break
            
# Invert the mapping (chapter name -> element_id) for easier lookup later
chapter_to_id = {v: k for k, v in chapter_ids.items()}
# Find every element whose parent_id is the id of ICE-HOCKEY, and show the first one
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

The output is:

{'type': 'NarrativeText',
 'element_id': 'c7c8e2f178cb0dc273ba7e811372640b',
 'text': 'Many of the Swiss winter-resorts can put\ninto the field a very strong ice-hockey team, and fine teams from other\ncountries often make winter tours there; but the ice-hockey which the\nordinary winter visitor will be apt to join in will probably be of the\nmost elementary and unscientific kind indulged in, when the skating day\nis drawing to a close, by picked-up sides. As will be readily\nunderstood, the ice over which a hockey match has been played is\nperfectly useless for skaters any more that day until it has been swept,\nscraped, and sprinkled or flooded; and in consequence, at all Swiss\nresorts, with the exception of St. Moritz, where there is a rink that\nhas been made for the hockey-player, or when an important match is being\nplayed, this sport is supplementary to such others as I have spoken of.\nNobody, that is, plays hockey and nothing else, since he cannot play\nhockey at all till the greedy skaters have finished with the ice.',
 'metadata': {'emphasized_text_contents': ['Many'],
  'emphasized_text_tags': ['span'],
  'languages': ['eng'],
  'parent_id': '6cf4a015e8c188360ea9f02a9802269b',
  'filename': 'winter-sports.epub',
  'filetype': 'application/epub'}}
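The lookup above only finds elements whose direct parent is the chapter title. If a chapter contains sub-headings, their paragraphs would point at the sub-heading instead. Whether this book nests titles that way is an assumption I have not checked. The sketch below follows `parent_id` links upward until it reaches a known chapter. The function `resolve_chapter` and the ids `t1`, `t2`, `p1`, `p2` are hypothetical.

```python
def resolve_chapter(element, elements_by_id, chapter_ids):
    """Follow parent_id links upward until a known chapter title is reached.

    Returns the chapter name, or "" when no ancestor is a known chapter.
    """
    seen = set()  # guards against accidental cycles in parent links
    parent_id = element["metadata"].get("parent_id")
    while parent_id is not None and parent_id not in seen:
        if parent_id in chapter_ids:
            return chapter_ids[parent_id]
        seen.add(parent_id)
        parent = elements_by_id.get(parent_id)
        if parent is None:
            break
        parent_id = parent["metadata"].get("parent_id")
    return ""


# Hypothetical elements: a chapter title, a sub-heading, and two paragraphs.
sample = [
    {"element_id": "t1", "type": "Title", "text": "ICE-HOCKEY", "metadata": {}},
    {"element_id": "t2", "type": "Title", "text": "Rules",
     "metadata": {"parent_id": "t1"}},
    {"element_id": "p1", "type": "NarrativeText", "text": "Six players ...",
     "metadata": {"parent_id": "t2"}},
    {"element_id": "p2", "type": "NarrativeText", "text": "Front matter",
     "metadata": {}},
]
by_id = {el["element_id"]: el for el in sample}
sample_chapter_ids = {"t1": "ICE-HOCKEY"}

print(resolve_chapter(sample[2], by_id, sample_chapter_ids))
```

With the real data, `by_id` would be built from `resp.elements` and `chapter_ids` would be the dictionary created above.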

3.6 Persisting the data

client = chromadb.PersistentClient(path="chroma_tmp", settings=chromadb.Settings(allow_reset=True))
client.reset()  # returns True
collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)
# Store the element data in chromadb; building this can take around five minutes
for element in resp.elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )
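The loop above calls `collection.add` once per element. Since `add` takes lists of documents, ids and metadatas, one option is to send the elements in batches instead. I have not timed this on the book, so treat any speed-up as an assumption. The helper `build_batches` and the batch size are my own choices, and the sample elements are made up.

```python
def build_batches(elements, chapter_ids, batch_size=100):
    """Yield keyword-argument dicts for collection.add, one per batch."""
    for start in range(0, len(elements), batch_size):
        batch = elements[start:start + batch_size]
        yield {
            "documents": [el["text"] for el in batch],
            "ids": [el["element_id"] for el in batch],
            # Same chapter lookup as the per-element loop above.
            "metadatas": [
                {"chapter": chapter_ids.get(el["metadata"].get("parent_id"), "")}
                for el in batch
            ],
        }


# Made-up elements, only to show the shape of the batches.
sample_elements = [
    {"element_id": f"id-{i}", "text": f"text {i}",
     "metadata": {"parent_id": "t1"} if i % 2 == 0 else {}}
    for i in range(5)
]
batches = list(build_batches(sample_elements, {"t1": "ICE-HOCKEY"}, batch_size=2))
print([len(b["ids"]) for b in batches])
```

With the real data, the loop would become `for kwargs in build_batches(resp.elements, chapter_ids): collection.add(**kwargs)`.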
# Fetch a few stored records and take a look
results = collection.peek()
print(results["documents"])

The output is:

['[Image\nunavailable.]', '[Image\nunavailable.]', 'Here is a remarkably varied programme, and one that will obviously\ngive a good spell of regular work to a candidate who intends to grapple\nwith it. It contains more of the material for skating than does the\ncorresponding English second test, in which only the four edges, the\nfour simple turns, and the four changes of edge are introduced, since\nthis International second test comprises as well as those, the four\nloops, and two out of the four brackets. These\nloops, which are most charming and effective figures, have nowadays no\nplace in English skating, since it is quite impossible to execute any of\nthem, as far as is at present known, without breaking the rules for\nEnglish skating, since the unemployed leg (i.e. the one not\ntracing the figure) must be used to get the necessary balance and swing.\nThey belong to a great class of figures like cross-cuts in all their\nvarieties, beaks, pigs-ears, &c., in which the skater nearly, or\nactually, stops still for a moment, and then, by a swing of the body or\nleg, resumes or reverses his movement. By this momentary loss and\nrecovery of balance there is opened out to the skater whole new fields\nof intricate and delightful movements, and the patterns that can be\ntraced on the ice are of endless variety. And here in this second\nInternational test the confines of this territory are entered on by the\nfour loops, which are the simplest of the “check and recovery” figures.\nIn the loops (the shape of which is accurately expressed by their names)\nthe skater does not come absolutely to a standstill, though very nearly,\nand the swing of the body and leg is then thrown forward in front of the\nskate, and this restores to it its velocity, and pulls it, so to speak,\nout of its loop. 
A further extension of this check and resumption of\nspeed occurs in cross-cuts, which do not enter into the International\ntests, but which figure largely in the performance of good skaters. Here\nthe forward movement of the skate (or backward movement, if back\ncross-cuts are being skated) is entirely checked, the skater comes to a\nmomentary standstill and moves backwards for a second. Then the forward\nswing of the body and unemployed leg gives him back his checked and\nreversed movement.', '[Image\nunavailable.]', '(a) A set of combined figures skated with another skater,\nwho will be selected by the judges, introducing the following calls in\nsuch order and with such repetitions as the judges may direct:—', 'CHAPTER\nVII', 'The figures need not be commenced from rest.', 'But when we consider that the first-class skater must be able to\nskate at high speed on any edge, make any turn at a fixed point, and\nleave that fixed point (having made his turn and edge in compliance with\nthe proper form for English skating, without scrape or wavering) still\non a firm and large-circumferenced curve, that he must be able to\ncombine any mohawk and choctaw with any of the sixteen turns, and any of\nthe sixteen turns with any change of edge, and that in combined skating\nhe is frequently called upon to do all these permutations of edge and\nturn, at a fixed point, and in\ntime with his partner, while two other partners are performing the same\nevolution in time with each other, it begins to become obvious that\nthere is considerable variety to be obtained out of these manœuvres. 
But\nthe consideration of combined skating, which is the cream and\nquintessence of English skating, must be considered last; at present we\nwill see what the single skater may be called upon to do, if he wishes\nto attain to acknowledged excellence in his sport.', 'Plate XXXII', 'He delivers the stone: the skip, eagle-eyed, watches the pace of it.\nIt may seem to him to be travelling with sufficient speed to reach the\nspot at which he desires it should rest. In this case he says nothing\nwhatever, except probably “Well laid down.” Smoothly it glides, and in\nall probability he will exclaim “Not a touch”: or (if he is very Scotch,\neither by birth or by infection of curling) “not a cow” (which means not\na touch of the besom). On the other hand he may think that it has been\nlaid down too weakly and will not get over the hog-line. Then he will\nshriek out, “Sweep it; sweep it” (or “soop it; soop it”) “man” (or\n“mon”). On which No. 2 and No. 3 of his side burst into frenzied\nactivity, running by the side of the stone and polishing the surface of\nthe ice immediately in front of it with their besoms. For, however well\nthe ice has been prepared, this zealous polishing assists a stone to\ntravel, and vigorous sweeping of the ice in front of it will give, even\non very smooth and hard ice, several feet of additional travel, and a\nstone that would have been hopelessly hogged will easily be converted\ninto the most useful of stones by diligent sweeping, and will lie a\nlittle way in front of the house where the skip has probably directed it\nto be. If he is an astute and cunning old dog, as all skips should be,\nhe will not want this first stone in the house at all; in fact, if he\nsees it is coming into the house, he will probably say “too strong.”\nYet, since according to\nthe rules only stones inside the house can count for the score, it seems\nincredible at first sight why he should not want every stone to be\nthere. This “inwardness” will be explained later.']

3.7 Querying

result = collection.query(
    query_texts=["How many players are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
print(json.dumps(result, indent=2))

The output is:

{
  "ids": [
    [
      "241221156e35865aa1715aa298bcc78d",
      "7a2340e355dc6059a061245db57f925b"
    ]
  ],
  "distances": [
    [
      0.5229756832122803,
      0.7836341261863708
    ]
  ],
  "metadatas": [
    [
      {
        "chapter": "ICE-HOCKEY"
      },
      {
        "chapter": "ICE-HOCKEY"
      }
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "It is a wonderful and delightful sight to watch the speed and\naccuracy of a first-rate team, each member of which knows the play of\nthe other five players. The finer the team, as is always the case, the\ngreater is their interdependence on each other, and the less there is of\nindividual play. Brilliant running and dribbling, indeed, you will see;\nbut as distinguished from a side composed of individuals, however good,\nwho are yet not a team, these brilliant episodes are always part of a\nplan, and end not in some wild shot but in a pass or a succession of\npasses, designed to lead to a good opening for scoring. There is,\nindeed, no game at which team play outwits individual brilliance so\ncompletely.",
      "And in most places hockey is not taken very seriously: it is a\ncharming and heat-producing scramble to take part in when the out-door\nday is drawing to a close and the chill of the evening beginning to set\nin; there is a vast quantity of falling down in its componence and not\nvery many goals, and a general ignorance about rules. But since a game,\nespecially such a wholly admirable\nand delightful game as ice-hockey, may just as well be played on the\nlines laid down for its conduct as not, I append at the end of this\nshort section a copy of the latest edition of the rules as issued by\nPrince\u2019s Club, London."
    ]
  ],
  "uris": null,
  "data": null,
  "included": [
    "metadatas",
    "documents",
    "distances"
  ]
}
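The query result nests every field in an outer list with one entry per query text, which makes it awkward to read. The sketch below flattens the result for a single query into one row per hit. The helper `flatten_query_result` is my own, and `sample_result` copies the ids and distances from the output above with the document texts shortened.

```python
def flatten_query_result(result, query_index=0):
    """Turn the nested lists for one query into a list of row dicts."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    documents = result["documents"][query_index]
    metadatas = result["metadatas"][query_index]
    return [
        {"id": i, "distance": d, "chapter": m.get("chapter", ""), "text": doc}
        for i, d, doc, m in zip(ids, distances, documents, metadatas)
    ]


# Same ids and distances as the output above; document texts are shortened.
sample_result = {
    "ids": [["241221156e35865aa1715aa298bcc78d", "7a2340e355dc6059a061245db57f925b"]],
    "distances": [[0.5229756832122803, 0.7836341261863708]],
    "metadatas": [[{"chapter": "ICE-HOCKEY"}, {"chapter": "ICE-HOCKEY"}]],
    "documents": [["It is a wonderful and delightful sight ...",
                   "And in most places hockey is not taken very seriously ..."]],
}

rows = flatten_query_result(sample_result)
for row in rows:
    print(f'{row["distance"]:.3f}  {row["chapter"]}  {row["text"][:40]}')
```

A smaller distance means a closer match, so the first hit, which mentions "the other five players", is the better answer to the question.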

3.8 Chunking the document

elements = dict_to_elements(resp.elements)

chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,  # merge sections shorter than 100 characters
    max_characters=3000,
)

JSON(json.dumps(chunks[0].to_dict(), indent=2))

The output is:

{
  "type": "CompositeElement",
  "element_id": "676ccd27-a9e4-46ea-80f1-00be45b60182",
  "text": "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson\n\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online at\n\n\nwww.gutenberg.org. If you are not located\nin the United States, you’ll have to check the laws of the country where\nyou are located before using this eBook.",
  "metadata": {
    "emphasized_text_contents": [
      "Winter Sports in\nSwitzerland"
    ],
    "emphasized_text_tags": [
      "span"
    ],
    "filename": "winter-sports.epub",
    "filetype": "application/epub",
    "languages": [
      "eng"
    ],
    "link_texts": [
      "www.gutenberg.org",
      "www.gutenberg.org"
    ],
    "link_urls": [
      "https://www.gutenberg.org",
      "https://www.gutenberg.org"
    ],
    "orig_elements": "eJzNVNFum0AQ/JUVzw4BG9uQx0htVamqKsVVVYXIOrgFrsF36G4JIVH/vXvYTtPGitSHqn0y3p2dHXZGXD8G2OIONW2VDC4gWJWrRRwV62gRL5ZFhUm1SkQWrxOxFlWKRTCDYIckpCDB+MegFIS1seNWYkcNl2JG4K5rhFMPKLeE97QtjSbe4bh9HXxR/MfCVWcsOVA611eDoge0rdAyuDkxTqLej7pO6AlRqRa12KGXPEx8Z27iC7HrJ5EeQWM3IUTXtYqFKqPPj31eVveixj0x6jq4+c5lv8+PbBqET9Z8w5LgXc/iC7Q14KUxt2AqePUdZlCM8CaEtyFconZG+31HLRtFLQa86vfLZ1gWyVwm1Xy1mCdrkc2jJElRZmm6iBZp+uLyf+UGvqr07daRsKxMS7zft+fp8qnpj7SfGYYhrI/nCY19xtDbdo9piDp3cX5+GtsJ+wfpe25RrjeNcoCF94QfKmOB2LbeobdI6NFo9D9DgxbZpKn7WStCCVfEuXXclbneGUdguGmB1bCfPO2hg7GtBEGgDZQew2hgmxsQ7TTDdYuOrCr9WV2uh0aQM3iHNoSvpoedGHmyG0HRDGp1xyqYZeAqa7V45qVypecz77WziTvmOQh4GcAPquRE+Zcp217iQQ79vAPzGs335xen/JfgfRTWsv13uPH3OxHAdC1kKqMslsssSop4XZSLZFnEkYjLjH34xwE8i/+z/L3gC+F9BSPbzpScDYLW+K8jJ+xU9mYemvfzKM7aFhrB4SDOWYPl7QRuxfCUxNL0muwIU5Jzfdxx4IcCOfs++ErXhzD4D1X4iv03PwDRsgQb"
  }
}

As you can see, the document was chunked successfully.

print(len(elements))  # prints 752
print(len(chunks))    # prints 255

The document had 752 elements in total, and chunking reduced them to 255 chunks.
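To build some intuition for why the count drops, here is a much simplified sketch of title-based chunking in plain Python. It is not the algorithm used by `chunk_by_title` in the library. It only imitates the idea: start a new section at each Title, fold very short sections into their neighbour, and cap the chunk length. The function name and the sample elements are made up.

```python
def chunk_by_title_sketch(elements, max_characters=3000, combine_text_under_n_chars=100):
    """Simplified title-based chunking over element dicts (illustration only)."""
    # 1. Start a new section whenever a Title element appears.
    sections = []
    for el in elements:
        if el["type"] == "Title" or not sections:
            sections.append([])
        sections[-1].append(el["text"])

    # 2. Join each section; fold it into the previous chunk while that
    #    chunk is still shorter than combine_text_under_n_chars.
    chunks = []
    for texts in sections:
        text = "\n\n".join(texts)
        if (chunks
                and len(chunks[-1]) < combine_text_under_n_chars
                and len(chunks[-1]) + 2 + len(text) <= max_characters):
            chunks[-1] += "\n\n" + text
        else:
            chunks.append(text)

    # 3. Hard-split anything still longer than max_characters.
    result = []
    for chunk in chunks:
        for start in range(0, len(chunk), max_characters):
            result.append(chunk[start:start + max_characters])
    return result


# Made-up elements: a short heading, a chapter with text, another chapter.
sample_elements = [
    {"type": "Title", "text": "CHAPTER VII"},
    {"type": "Title", "text": "ICE-HOCKEY"},
    {"type": "NarrativeText", "text": "A" * 150},
    {"type": "Title", "text": "SKI-ING"},
    {"type": "NarrativeText", "text": "B" * 50},
]

sketch_chunks = chunk_by_title_sketch(sample_elements)
print(len(sample_elements), "elements ->", len(sketch_chunks), "chunks")
```

Here the short "CHAPTER VII" heading is merged with the chapter that follows it, so five elements become two chunks. The same kind of merging is what takes the book from 752 elements to 255 chunks.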

4. Summary

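This post looked at metadata. Metadata matters a great deal for extracting data from documents and for splitting them into chunks. Note, however, that elements can be misclassified as Title during recognition, so inspect the results.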