QA Extraction

watersink

Last modified 2025-04-23 19:47:32

Categories: Large Models, NLP. Tags: deep learning, artificial intelligence, natural language processing, language models

First published 2025-04-23 19:42:36

Article link: https://blog.csdn.net/qq_14845119/article/details/147461695

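Problem definition:

QA extraction, i.e. extracting question (Question) and answer (Answer) pairs from a given text, is an important task in natural language processing (NLP). When a knowledge base is built on a vector store, documents are usually stored in narrative or conversational form, whereas most user queries are phrased as questions. Converting documents into Q&A format before vectorization raises the likelihood of retrieving relevant documents and lowers the risk of retrieving irrelevant ones.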

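Extraction methods:

Rule-based methods:

Hand-written rules match question and answer patterns in the text, for example fixed sentence structures or punctuation such as question marks. Precision is high in narrow domains where the patterns are well defined. The drawbacks are that many rules have to be written by hand and that the approach depends heavily on the format and wording of the text, so it adapts poorly to new material.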
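
As a minimal sketch of the rule-based approach, the snippet below assumes a hypothetical text layout in which a question line starts with `Q:` and the answer line that follows starts with `A:`. The pattern and the function name `extract_qa_by_rule` are illustrative only and do not come from any library.

```python
import re

# Assumed layout (hypothetical): a line starting with "Q:" followed by a line
# starting with "A:". Both ASCII and full-width colons are accepted.
QA_PATTERN = re.compile(
    r"^Q[:：]\s*(?P<q>.+?)\s*\n+A[:：]\s*(?P<a>.+?)\s*$",
    re.MULTILINE,
)

def extract_qa_by_rule(text):
    """Return a list of (question, answer) tuples matched by the fixed pattern."""
    return [(m.group("q"), m.group("a")) for m in QA_PATTERN.finditer(text)]

sample = """Q: What is QA extraction?
A: Extracting question-answer pairs from text.

Some narrative sentence that matches no rule.

Q: Why is it useful?
A: It helps retrieval.
"""

pairs = extract_qa_by_rule(sample)
for question, answer in pairs:
    print(question, "->", answer)
```

The narrative sentence in the middle is silently skipped, which shows both the strength (no false matches) and the weakness (anything outside the expected layout is lost) of fixed rules.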

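Machine-learning methods:

Classical algorithms such as decision trees and support vector machines are trained on annotated text to learn the features of questions and answers. They are more flexible than rules and can handle irregular text, but they need a large amount of labeled data, and training and tuning the models is relatively complex.

Deep-learning methods:

In recent years deep learning has been widely applied to QA extraction. Models such as recurrent neural networks (RNN), long short-term memory networks (LSTM), and attention mechanisms (Attention) learn the semantic features of text automatically and cope better with long sequences and semantically complex questions, and they often perform well on large datasets. They typically require substantial compute and large amounts of training data, however, and are comparatively hard to interpret.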

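Application scenarios:

Information retrieval:

QA extraction helps users find the information they need more accurately. In an enterprise document management system, for example, a user can type a question and quickly get the answer from the relevant documents, which makes retrieval more efficient.

Intelligent customer service:

Frequently asked questions and their answers are extracted from the relevant documents to build a customer-service knowledge base. When a user asks a question, the system can match it quickly and return the answer, improving service efficiency and quality while reducing labor costs.

Knowledge graph construction:

Question-answer pairs extracted from large volumes of text supply entities and relations to a knowledge graph, enriching its content so that it can better support complex knowledge queries and reasoning tasks.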

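Why convert to Q&A format:

Higher relevance: user queries are phrased as questions, so documents stored in Q&A form match them better.

Fewer false hits: explicit questions and answers reduce the false hits caused by fuzzy matching.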
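
To illustrate the matching argument, here is a toy sketch that uses word-overlap (Jaccard) similarity as a crude stand-in for embedding similarity. The sample texts and helper names are made up for illustration; a real system would compare vectors produced by an embedding model, and the numbers here prove nothing in general.

```python
import re

def tokens(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Overlap between the token sets of two texts, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "Who should I report security incidents to?"

# The same fact stored in narrative form and in question form (both invented)
narrative = ("If you come across any potential security risks or incidents, "
             "please report them immediately to our dedicated team.")
qa_form = "Who should security incidents be reported to?"

score_narrative = jaccard(query, narrative)
score_qa = jaccard(query, qa_form)
print(f"narrative: {score_narrative:.2f}, Q&A form: {score_qa:.2f}")
```

In this example the question-form entry scores higher against the query than the narrative sentence, because the two questions share their framing words as well as their content words.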

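Hands-on walkthrough:

Method 1: (based on LangChain and Doctran)

LangChain is a framework for developing applications powered by language models. It provides tools and components that make it easier to integrate language models with other data sources and services when building natural language processing applications. Doctran is a library that uses OpenAI's function calling feature to transform documents. It can convert plain text documents into Q&A format, which makes vector-based retrieval more precise.

Doctran uses a language model to generate question-answer pairs about a document's content: it "interrogates" the document through OpenAI's function calling mechanism and extracts the pairs. The pairs are stored in the document's metadata and can supplement the input to vectorization. Because the Q&A format lines up with queries that users phrase as questions, retrieval relevance improves.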

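# Install the required libraries
# pip install --upgrade --quiet doctran langchain-community python-dotenv

# Import the required modules
import os
import json
from langchain_community.document_transformers import DoctranQATransformer
from langchain_core.documents import Document
from dotenv import load_dotenv

# Load environment variables from a local .env file.
# OPENAI_API_KEY must be defined there or in the shell; never hard-code a real key in source.
load_dotenv()
# Use gpt-4 unless OPENAI_API_MODEL is already set
os.environ.setdefault('OPENAI_API_MODEL', 'gpt-4')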

# Sample document content
sample_text = """Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

...

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev
"""

# Build the document object
documents = [Document(page_content=sample_text)]

# Build the Doctran transformer
qa_transformer = DoctranQATransformer()

# Convert the document to Q&A format
transformed_document = qa_transformer.transform_documents(documents)

# Print the question-answer metadata of the transformed document
print(json.dumps(transformed_document[0].metadata, indent=2))
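
Once the question-answer pairs sit in the metadata, they still have to be turned into texts that can be embedded. The sketch below assumes the metadata uses a `questions_and_answers` key holding a list of dicts with `question` and `answer` fields, which is the shape shown in the LangChain documentation example; check it against your installed version. The helper `qa_metadata_to_texts` and the sample data are hypothetical, and plain dicts are used so the snippet runs without LangChain or an API key.

```python
def qa_metadata_to_texts(metadata):
    """Turn Doctran-style QA metadata into one 'Q: ... A: ...' string per pair."""
    pairs = metadata.get("questions_and_answers") or []  # assumed key name
    texts = []
    for pair in pairs:
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        if question and answer:  # skip incomplete pairs
            texts.append(f"Q: {question}\nA: {answer}")
    return texts

# Hypothetical metadata in the assumed shape
sample_metadata = {
    "questions_and_answers": [
        {"question": "Who was commended for enhancing network security?",
         "answer": "John Doe from the IT department."},
        {"question": "Where should security incidents be reported?",
         "answer": "security@example.com"},
        {"question": "An incomplete pair", "answer": ""},
    ]
}

qa_texts = qa_metadata_to_texts(sample_metadata)
for text in qa_texts:
    print(text)
    print("---")
```

Each resulting string can then be embedded on its own or alongside the original document text, depending on how the vector store is organized.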

Method 2: (based on a self-hosted large model)

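Main steps:

Extract text from PDF or Word files (the code below handles PDF only).
Split the text by the maximum number of characters and tokens the generative model accepts.
Define the prompt for question-answer pair generation.
Define the function that generates question-answer pairs.
Generate question-answer pairs with the generative model and write the results to a txt file.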

Implementation:

import datetime
import os
import fitz  # PyMuPDF
from transformers import BertTokenizer

# pip install pymupdf        (provides the fitz module)
# pip install transformers
# pip install openai==1.61.1

# Extract the text of every page of a PDF file
def extract_text_from_pdf(pdf_path):
    page_texts = []
    with fitz.open(pdf_path) as pdf_document:  # the file is closed on exit
        for page in pdf_document:
            page_texts.append(page.get_text())
    return "".join(page_texts)


# Text splitting function
def split_text(tokenizer, text, max_length, max_tokens):
    """Split text into paragraphs of at most max_length characters and max_tokens tokens.

    Lines are kept whole, so a single line that exceeds a limit becomes its own oversized paragraph.
    """
    paragraphs = []
    current_paragraph = ""
    current_tokens = 0

    for line in text.split("\n"):
        line_tokens = tokenizer.encode(line, add_special_tokens=False)
        if (len(current_paragraph) + len(line) + 1 <= max_length) and (current_tokens + len(line_tokens) + 1 <= max_tokens):
            current_paragraph += line + "\n"
            current_tokens += len(line_tokens) + 1  # +1 for the newline token
        else:
            if current_paragraph.strip():  # avoid emitting an empty paragraph
                paragraphs.append(current_paragraph.strip())
            current_paragraph = line + "\n"
            current_tokens = len(line_tokens) + 1

    if current_paragraph.strip():
        paragraphs.append(current_paragraph.strip())
    return paragraphs
 
 
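# Question-answer pair generation function
def generate_qa(text_content, question_text=None):
    # question_text is unused; it is kept so that existing callers do not break.

    # User message, kept in Chinese to match the source documents. In English:
    # "Here is the content I provide: {}. Command to execute: based on the given content,
    # generate suitable question-answer pairs; the output must be in standard JSON format."
    content = "以下是我给出的内容：{}。需要执行的命令：根据给出的内容，生成合适的问答对，生成的格式必须得是标准json格式".format(text_content)
    # System prompt for question-answer pair generation. In English:
    # 01 You are an expert in processing question-answer datasets.
    # 02 Your task is to generate question-answer pairs from the content I provide.
    # 03 Every generated question must have its answer in the content I provide.
    # 04 You must strictly follow the format of my examples; two examples follow
    #    (one about the coronagraph module, one about the space station optical module).
    prompt = '''
    #01 你是一个问答对数据集处理专家。
    #02 你的任务是根据我给出的内容，生成对应的问答对。
    #03 所生成的问题，必须得在我给出的内容中有对应的答案。
    #04 你必须严格按照我的问答对示例格式来生成,下面是2个问答对示例：
    {"QUESTION": "星冕仪模块三个主要观测目标是什么？", "ANSWER": "星冕仪模块的三个主要观测目标是：1.近邻恒星高对比度成像普查。2.视向速度探测已知系外行星后随观测。3.恒星星周盘高对比度成像监测，并对恒星外星黄道尘强度分布进行定量分析。"}
    {"QUESTION": "空间站光学舱是什么？", "ANSWER": "中国空间站光学舱将是一台 2 米口径的空间天文望远镜，主要工作在近紫外-可见光-近红外波段，兼具大视场巡天和精细观测能力，立足于2020-30 年代国际天文学研究的战略前沿，在科学上具有极强的竞争力，将与欧美同期的大型天文项目并驾齐驱，优势互补，并在若干方向上有所超越，有望取得对宇宙认知的重大突破"}
    '''

    import openai
    # Client for the self-hosted, OpenAI-compatible endpoint; adjust base_url to your deployment
    client = openai.Client(
        base_url='http://10.1.12.10:11435/v1/',
        api_key='None',  # required but ignored
    )
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": prompt,
            },
            {
                "role": "user",
                "content": content,
            }
        ],
        model='qwen2.5:7b',
        max_tokens=4096,
        temperature=0.05,
        top_p=0.9,
    )

    return chat_completion.choices[0].message.content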
 
 
# Write the generated question-answer pairs to a .txt file
def write_to_file(content):
    # Microseconds (%f) keep file names unique when several chunks finish within the same second
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S%f")
    file_name = f"new_file_{timestamp}.txt"
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(content)
    print(f"File '{file_name}' has been created and written.")
 
 
# Read back a text chunk that was saved to a txt file
def read_file(file_name):
    try:
        with open(file_name, "r", encoding='utf-8') as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"File '{file_name}' not found.")
        return None
 
 
# Main routine
def doc_to_qa(pdf_file):
    # Initialize the BERT tokenizer. Its token counts only approximate
    # those of the generation model's own tokenizer.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Extract the PDF text and split it by token and character limits
    pdf_files = [pdf_file]
    max_length = 18000
    max_tokens = 4620
    documents = []
    for pdf in pdf_files:
        text = extract_text_from_pdf(pdf)
        paragraphs = split_text(tokenizer, text, max_length, max_tokens)
        documents.extend(paragraphs)

    # Store each chunk in its own txt file and keep the file names in a list
    file_names = []
    for idx, doc in enumerate(documents, start=1):
        file_name = f"text_file{idx}.txt"
        with open(file_name, "w", encoding="utf-8") as file:
            file.write(doc)
        file_names.append(file_name)

    #print("Text files generated successfully.")
    #print("Generated file names:", file_names)

    for file in file_names:
        text_content = read_file(file)
        if text_content is None:  # chunk file missing, nothing to send to the model
            continue
        #print ('text_content\n', text_content)
        qa_text = generate_qa(text_content=text_content)
        print('qa_text\n', qa_text)
        write_to_file(qa_text)

        # Delete the temporary chunk file
        os.remove(file)


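if __name__ == '__main__':
    base_dir = "../政务/政策-省级政府规章/"
    for name in os.listdir(base_dir):
        if not name.lower().endswith(".pdf"):
            continue  # the extraction step in this script targets PDF files
        full_name = os.path.join(base_dir, name)
        doc_to_qa(full_name)
        print("processed {}".format(name))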
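
The script above writes the model's reply to a txt file verbatim, even though the prompt asks for JSON. A possible follow-up step is to parse the reply into structured pairs. The sketch below is a tolerant parser under stated assumptions: the reply contains one or more JSON objects carrying a question key and an answer key, possibly surrounded by extra text or wrapped in an array. The function name `parse_qa_output` is hypothetical, and the key names are parameters so they can match whatever keys the prompt examples use.

```python
import json

def parse_qa_output(raw, question_key="QUESTION", answer_key="ANSWER"):
    """Collect every JSON object in `raw` that has both the question and answer keys."""
    decoder = json.JSONDecoder()
    pairs = []
    pos = 0
    while True:
        start = raw.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON from here; keep scanning
            continue
        if isinstance(obj, dict) and question_key in obj and answer_key in obj:
            pairs.append({"question": obj[question_key], "answer": obj[answer_key]})
            pos = end
        else:
            pos = start + 1  # a wrapper object; look for pairs nested inside it
    return pairs

# Made-up reply: free text, a bare object, an array, and a truncated object
reply = (
    'Here are the pairs:\n'
    '{"QUESTION": "What is the aperture?", "ANSWER": "2 meters"}\n'
    '[{"QUESTION": "Which bands?", "ANSWER": "Near-UV to near-IR"}]\n'
    '{"QUESTION": "Truncated", "ANSWER": '
)

pairs = parse_qa_output(reply)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

Scanning with `raw_decode` rather than calling `json.loads` on the whole reply lets well-formed pairs survive even when the model adds commentary or the output is cut off by the token limit.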