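Hey folks, today let's talk about how to extract useful features from documents with the Doctran library. It's a really smooth technique: thanks to OpenAI's function calling feature, we can pull specific metadata out of a document.
Extracting document metadata is especially helpful for tasks such as:
- Classification: sorting documents into different categories.
- Data mining: extracting structured data for data analysis.
- Style transfer: changing the way text is written so it more closely matches expected user input, which improves vector search results.
Next, let's get hands-on and see how it's done.
First, make sure the necessary libraries are installed and imported: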
%pip install --upgrade --quiet doctran langchain-community python-dotenv
import json
from langchain_community.document_transformers import DoctranPropertyExtractor
from langchain_core.documents import Document
from dotenv import load_dotenv
# Load environment variables (such as OPENAI_API_KEY) from a local .env file
load_dotenv()
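Doctran talks to the OpenAI API, so the key needs to be visible in the environment before the extractor runs. Here is a minimal sketch of a sanity check, assuming the key lives under the conventional `OPENAI_API_KEY` variable. The `require_env` helper is hypothetical and not part of Doctran or LangChain:

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value


# Usage (assumes load_dotenv() has already run):
# require_env("OPENAI_API_KEY")
```

Failing early like this tends to give a friendlier message than an authentication error halfway through an extraction run.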
Input
This is the document we'll extract properties from:
sample_text = """[Generated with ChatGPT]
...
"""
print(sample_text)
Setting the properties to extract
Here we define the properties we want to extract from the document:
documents = [Document(page_content=sample_text)]
properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]
property_extractor = DoctranPropertyExtractor(properties=properties)
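A typo in one of these dicts only shows up once the API call is made, so it can help to lint the definitions first. Below is a rough sketch of such a check. The `check_properties` helper is hypothetical, and the set of allowed types is an assumption based on common JSON Schema type names, not something taken from Doctran's documentation:

```python
# Assumption: JSON Schema style type names; adjust to whatever Doctran accepts.
ALLOWED_TYPES = {"string", "number", "boolean", "array", "object"}


def check_properties(properties):
    """Return a list of human-readable problems found in the property dicts."""
    problems = []
    seen = set()
    for i, prop in enumerate(properties):
        name = prop.get("name")
        label = name or f"property #{i}"
        # Every property in the example above carries these three keys.
        for key in ("name", "description", "type"):
            if not prop.get(key):
                problems.append(f"{label}: missing '{key}'")
        if name is not None:
            if name in seen:
                problems.append(f"{label}: duplicate name")
            seen.add(name)
        if prop.get("type") and prop["type"] not in ALLOWED_TYPES:
            problems.append(f"{label}: unsupported type '{prop['type']}'")
        # An array property needs to describe its elements.
        if prop.get("type") == "array" and "items" not in prop:
            problems.append(f"{label}: array type needs 'items'")
    return problems
```

Calling `check_properties(properties)` on the list defined above should return an empty list; anything else is worth fixing before spending tokens on an extraction.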
Output
After extracting properties with Doctran, the result is returned as a new document, with the extracted properties stored in its metadata:
extracted_document = property_extractor.transform_documents(
    documents, properties=properties
)
print(json.dumps(extracted_document[0].metadata, indent=2))
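The output is shown below. It extracted the email's category, the list of people mentioned, and a simple, easy-to-understand explanation: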
{
  "extracted_properties": {
    "category": "update",
    "mentions": [
      "John Doe",
      "Jane Smith",
      "Michael Johnson",
      "Sarah Thompson",
      "David Rodriguez"
    ],
    "eli5": "This email provides important updates and discussions on various topics..."
  }
}
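Once the properties sit in metadata, the classification use case mentioned at the start is mostly plain dictionary work. Here is a small sketch that groups documents by their extracted category. The `group_by_category` helper and the sample metadata are made up for illustration and only mirror the shape of the output above:

```python
from collections import defaultdict


def group_by_category(metadatas, default="other"):
    """Map each category to the indices of the documents that carry it."""
    groups = defaultdict(list)
    for index, metadata in enumerate(metadatas):
        # Documents without extracted properties fall into the default bucket.
        extracted = metadata.get("extracted_properties", {})
        groups[extracted.get("category", default)].append(index)
    return dict(groups)


# Illustrative metadata only; not real extraction results.
sample = [
    {"extracted_properties": {"category": "update", "mentions": ["John Doe"]}},
    {"extracted_properties": {"category": "action_item", "mentions": []}},
    {},
]
print(group_by_category(sample))
```

With real results, something like `group_by_category([d.metadata for d in extracted_document])` would do the same over the documents returned by the extractor.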
That's it for today's tech share, folks. Hope it helps! If you run into problems during development, feel free to discuss them in the comments~