Optimizing Vector Retrieval in RAG with Title Enhancement

SunnyDoy

Last modified: 2024-06-24 12:52:49


Tags: python, artificial intelligence

First published: 2024-06-24 11:09:38

Article link: https://blog.csdn.net/qq_65019236/article/details/139910729


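I. The Core Idea

A heading summarizes the text beneath it, so it is an important cue for understanding what that text means. If every chunk produced by splitting a document also carries the heading it belongs to, the chunk becomes much easier to interpret on its own. Building on this idea, we add a title-detection step: it decides which chunks are headings and marks them in metadata, then prepends the most recent heading above each chunk to that chunk's text, enriching the chunk with the heading's context.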

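II. Rules for Identifying a Title

1. The proportion of non-alphabetic characters must not be too high

A title cannot be empty: an empty string conveys no information, so it is rejected outright.
The proportion of non-alphabetic characters must not exceed a given threshold. With the default threshold of 0.5, at least half of the non-whitespace characters must be alphabetic.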

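def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """
    Check whether the text contains too few alphabetic characters.

    This helps keep strings such as "-----------BREAK---------" from being
    tagged as a title or narrative text. Whitespace is excluded from the ratio.

    Parameters
    ----------
    text
        The input string to test.
    threshold
        If the share of alphabetic characters among the non-whitespace
        characters is below this value, the function returns True.
        Empty or whitespace-only text returns False.
    """
    if len(text) == 0:
        return False

    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    total_count = len([char for char in text if char.strip()])
    try:
        ratio = alpha_count / total_count
        return ratio < threshold
    except ZeroDivisionError:
        # Whitespace-only text has no characters to measure.
        return False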
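
To make the ratio concrete, here is a small sketch that computes the same quantity for a few sample strings. The helper name `alpha_ratio` and the sample strings are illustrative assumptions, not part of the original code; the 0.5 cut-off is the default threshold used above.

```python
def alpha_ratio(text: str) -> float:
    """Share of alphabetic characters among the non-whitespace characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


samples = [
    "-----------BREAK---------",  # mostly dashes
    "1.2 Method overview",        # mostly letters
    "一、方法思想",                 # CJK characters count as alphabetic
    "2024-06-24",                 # no letters at all
]
for sample in samples:
    ratio = alpha_ratio(sample)
    verdict = "rejected" if ratio < 0.5 else "kept"
    print(f"{sample!r}: ratio={ratio:.2f} -> {verdict}")
```

Note that `str.isalpha()` is true for Chinese characters, which is why a heading such as "一、方法思想" passes this check even though it contains no Latin letters.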

2. The text must not be empty and, because it is a title, must not be too long.

# The text may be at most title_max_word_length long (20 by default).
# Despite the parameter name, len(text) counts characters, not words.
if len(text) > title_max_word_length:
    return False
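
The following sketch shows why counting characters, rather than splitting on spaces, suits Chinese text. The two sample strings are taken from this article, and the limit of 20 mirrors the default above.

```python
heading = "二、判断标题的原则"
sentence = "文本的标题是文章内容的汇总，对于理解对应文本的语义有至关重要的作用。"

for text in (heading, sentence):
    char_count = len(text)          # what the length check actually measures
    word_count = len(text.split())  # whitespace splitting sees one "word"
    verdict = "too long" if char_count > 20 else "ok"
    print(f"chars={char_count}, space-separated words={word_count} -> {verdict}")
```

Because Chinese text is not space-delimited, a word count based on whitespace would report a single word for both strings, so only the character count can tell the heading from the sentence.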

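3. Text that ends in punctuation is not a title

[^\w\s]: matches any single character that is neither a word character (\w: letters, digits, and the underscore) nor whitespace (\s). In practice, this matches punctuation marks and other symbols.

\Z: matches the end of the string. Combined, the pattern matches only when the last character of the text is punctuation; punctuation in the middle of the text does not count.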

import re

ENDS_IN_PUNCT_PATTERN = r"[^\w\s]\Z"
ENDS_IN_PUNCT_RE = re.compile(ENDS_IN_PUNCT_PATTERN)
if ENDS_IN_PUNCT_RE.search(text) is not None:
    return False
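
A quick way to see what this pattern does and does not match is to run it against a few strings. The sample strings below are made up for illustration; the pattern is the one shown above.

```python
import re

ENDS_IN_PUNCT_RE = re.compile(r"[^\w\s]\Z")

# Map each sample to whether we expect a trailing-punctuation match.
samples = {
    "3.1 Data preparation": False,   # punctuation only in the middle
    "To my dearest friends,": True,  # trailing ASCII comma
    "二、判断标题的原则": False,        # the "、" is not the last character
    "标题不可能是空的。": True,         # trailing full-width period
}
for text, expected in samples.items():
    matched = ENDS_IN_PUNCT_RE.search(text) is not None
    assert matched == expected
    print(f"{text!r}: ends in punctuation = {matched}")
```

For `str` patterns in Python 3, `\w` is Unicode-aware, so Chinese characters count as word characters and full-width punctuation such as "。" does not.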

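4. Text ending in certain punctuation marks is rejected, so that salutations are not treated as titles, and a title cannot consist entirely of digits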

if text.endswith((",", ".", "，", "。")):
    return False

if text.isnumeric():
    print(f"Not a title. Text is all numeric:\n\n{text}")
    return False
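
The sketch below probes these two checks in isolation. It relies on the trailing-punctuation pattern from rule 3, and the sample strings are illustrative assumptions. One observation it supports: every suffix in the `endswith` tuple is already matched by that pattern, so when both checks are applied, this one acts as an extra safeguard rather than catching new cases.

```python
import re

ENDS_IN_PUNCT_RE = re.compile(r"[^\w\s]\Z")

# Each suffix in the endswith() tuple is also matched by the regex from rule 3.
for suffix in (",", ".", "，", "。"):
    assert ENDS_IN_PUNCT_RE.search("Dear all" + suffix) is not None

# str.isnumeric() is True only when every character is numeric.
assert "2024".isnumeric()
assert "一二三".isnumeric()       # CJK numerals count as numeric
assert not "3.14".isnumeric()    # "." is not a numeric character
assert not "第3章".isnumeric()    # mixed text is not all numeric
print("all checks passed")
```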

5. A digit must appear near the start of the text, within the first 5 characters by default.

# text[:5] already returns the whole string when it is shorter than 5 characters.
text_5 = text[:5]
has_number_in_text_5 = any(char.isnumeric() for char in text_5)
if not has_number_in_text_5:
    return False
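
As a rough illustration of this rule, the sketch below wraps the check in a hypothetical helper, `has_leading_number` (the name and the `window` parameter are assumptions for this example), and tries it on a few headings.

```python
def has_leading_number(text: str, window: int = 5) -> bool:
    """Return True if any of the first `window` characters is numeric."""
    return any(ch.isnumeric() for ch in text[:window])


assert has_leading_number("1.2 Method overview")
assert has_leading_number("一、方法思想")        # "一" is a numeric character
assert has_leading_number("第3章 实验")          # "3" is the second character
assert not has_leading_number("Introduction")   # unnumbered heading
assert not has_leading_number("Chapter 12")     # digits fall outside the window
print("all checks passed")
```

The last two cases show the trade-off: the rule assumes headings are numbered near the start, so unnumbered headings, or headings whose number comes after a longer word, are not recognized as titles.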

III. Complete Code

from langchain.docstore.document import Document
import re


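def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """
    Check whether the text contains too few alphabetic characters.

    This helps keep strings such as "-----------BREAK---------" from being
    tagged as a title or narrative text. Whitespace is excluded from the ratio.

    Parameters
    ----------
    text
        The input string to test.
    threshold
        If the share of alphabetic characters among the non-whitespace
        characters is below this value, the function returns True.
        Empty or whitespace-only text returns False.
    """
    if len(text) == 0:
        return False

    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    total_count = len([char for char in text if char.strip()])
    try:
        ratio = alpha_count / total_count
        return ratio < threshold
    except ZeroDivisionError:
        # Whitespace-only text has no characters to measure.
        return False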


def is_possible_title(
        text: str,
        title_max_word_length: int = 20,
        non_alpha_threshold: float = 0.5,
) -> bool:
    """
    Check whether the text passes every check for a valid title.

    Parameters
    ----------
    text
        The input text to check.
    title_max_word_length
        The maximum length of a title. Despite the name, it is compared
        against len(text), so it counts characters rather than words.
    non_alpha_threshold
        The minimum share of alphabetic characters the text needs in order
        to be considered a title.
    """

    # Empty text is never a title.
    if len(text) == 0:
        print("Not a title. Text is empty.")
        return False

    # Text that ends in punctuation is not a title.
    ENDS_IN_PUNCT_PATTERN = r"[^\w\s]\Z"
    ENDS_IN_PUNCT_RE = re.compile(ENDS_IN_PUNCT_PATTERN)
    if ENDS_IN_PUNCT_RE.search(text) is not None:
        return False

    # The text may be at most title_max_word_length characters long (20 by default).
    if len(text) > title_max_word_length:
        return False

    # Too small a share of alphabetic characters means the text is not a title.
    if under_non_alpha_ratio(text, threshold=non_alpha_threshold):
        return False

    # NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
    if text.endswith((",", ".", "，", "。")):
        return False

    if text.isnumeric():
        print(f"Not a title. Text is all numeric:\n\n{text}")
        return False

    # A numeric character must appear within the first 5 characters.
    # text[:5] returns the whole string when it is shorter than 5 characters.
    text_5 = text[:5]
    if not any(char.isnumeric() for char in text_5):
        return False

    return True


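def zh_title_enhance(docs: list[Document]) -> list[Document]:
    """Tag title chunks and prepend the latest title to the chunks after it."""
    title = None
    if len(docs) > 0:
        for doc in docs:
            if is_possible_title(doc.page_content):
                doc.metadata['category'] = 'cn_Title'
                title = doc.page_content
            elif title:
                # The prefix reads: "The following text is related to (<title>)."
                doc.page_content = f"下文与({title})有关。{doc.page_content}"
        return docs
    else:
        print("No documents to process.")
        return docs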
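
To see the whole flow on a small input without installing langchain, here is a self-contained sketch. It is not the code above: `Chunk` is a hypothetical stand-in for langchain's `Document`, and `looks_like_title` and `enhance` are condensed stand-ins for `is_possible_title` and `zh_title_enhance` that apply the same rules with the default settings. The sample chunks are taken from this article.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


ENDS_IN_PUNCT_RE = re.compile(r"[^\w\s]\Z")


def looks_like_title(text: str, max_len: int = 20, alpha_threshold: float = 0.5) -> bool:
    """Condensed version of the five rules, using the default settings."""
    if not text or len(text) > max_len:
        return False
    if ENDS_IN_PUNCT_RE.search(text) or text.isnumeric():
        return False
    visible = [ch for ch in text if not ch.isspace()]
    if not visible or sum(ch.isalpha() for ch in visible) / len(visible) < alpha_threshold:
        return False
    return any(ch.isnumeric() for ch in text[:5])


def enhance(chunks: list) -> list:
    """Tag title chunks and prepend the latest title to the chunks after it."""
    title = None
    for chunk in chunks:
        if looks_like_title(chunk.page_content):
            chunk.metadata["category"] = "cn_Title"
            title = chunk.page_content
        elif title:
            chunk.page_content = f"下文与({title})有关。{chunk.page_content}"
    return chunks


chunks = enhance([
    Chunk("一、方法思想"),
    Chunk("文本的标题是文章内容的汇总。"),
    Chunk("二、判断标题的原则"),
    Chunk("非字母字符的比例不能过长。"),
])
for chunk in chunks:
    print(chunk.metadata, chunk.page_content)
```

Each body chunk ends up prefixed with the heading that precedes it, so the text that is later embedded carries the heading's context along with the chunk's own content.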