业务背景
在RAG方案中,知识库中的文本会很大程度上影响召回的准确率,因此我们在切分文本时去除停用词及特殊符号,以提升召回的准确率。
技术细节
- 准备停用词文件及特殊符号文件
特殊符号文件
停用词文件
读取停用词及特殊符号
# Load the punctuation list and the stop-word list once, then merge them into
# a single exclusion list used when cleaning text.
# NOTE(review): these lines call load_txt_data, which is defined below in this
# document — in a real module the definition must precede this code.
punctuations = load_txt_data("docs/punctuations.txt")
stopwords = load_txt_data("docs/query_stopwords.txt")
exclusions = stopwords + punctuations
def load_txt_data(file_path) -> list:
    """Read a UTF-8 text file and return its lines, stripped of whitespace.

    Args:
        file_path: Path relative to the project root (the parent of the
            directory containing this source file). An absolute path also
            works, because ``os.path.join`` discards its base when the
            second argument is absolute.

    Returns:
        list: One ``str`` per line, each stripped of leading/trailing
        whitespace; blank lines become empty strings.
    """
    # Resolve relative paths against the project root: the parent of the
    # directory holding this file.
    current_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    with open(os.path.join(current_dir, file_path), encoding="utf-8") as file:
        # Iterate the file object directly — same result as readlines(),
        # without materializing an intermediate list.
        return [line.strip() for line in file]
- 使用jieba去除停用词
def clean_pages_es(query) -> str:
    """Normalize a query for ES recall: strip whitespace, segment with
    jieba, and remove stop words and special symbols.

    Args:
        query: Raw query text.

    Returns:
        str: The cleaned text, with surviving tokens re-joined without
        separators.
    """
    punctuations = load_txt_data("docs/punctuations.txt")
    stopwords = load_txt_data("docs/query_stopwords.txt")
    # Use a set so each token's membership test is O(1); the original list
    # made the filter below accidentally quadratic.
    exclusions = set(stopwords) | set(punctuations)
    # Remove spaces, tabs, form feeds, newlines, etc.
    text = re.sub(r'\s+', '', query)
    # Segment the text into tokens.
    tokens = jieba.lcut(text)
    # Drop stop words / special symbols and re-join without separators.
    return ''.join(word for word in tokens if word not in exclusions)