Natural Language Processing: Stop Words

Latest recommended article published on 2024-10-18 11:28:30

楚轩QK

Tags: stop words, natural language processing

Article link: https://blog.csdn.net/qq_33343767/article/details/82798617

Introducing Custom Dictionaries and Stop Words

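A custom dictionary is loaded so that the segmenter keeps the phrases we care about intact instead of splitting them into smaller words. A stop-word list serves the opposite purpose: it names the words that only add noise, so that they can be dropped from the segmentation results.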
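The two ideas can be illustrated with a toy example that does not depend on jieba. This is only a sketch: it uses forward maximum matching, which is not the algorithm jieba uses, and `base_vocab`, `user_vocab` and `stop_words` are hypothetical values made up for the illustration.

```python
# Toy illustration only: forward maximum matching, NOT jieba's actual algorithm.
def forward_max_match(text, vocab, max_len=4):
    """Greedily take the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in vocab matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stop_words]


base_vocab = {"语言", "模型", "查询", "文档"}  # hypothetical base dictionary
user_vocab = {"语言模型"}                      # hypothetical custom-dictionary entry
stop_words = {"的", "和", "是"}                # hypothetical stop-word list

text = "语言模型的查询和文档"

without_dict = forward_max_match(text, base_vocab)
with_dict = forward_max_match(text, base_vocab | user_vocab)
filtered = remove_stop_words(with_dict, stop_words)

print(without_dict)
print(with_dict)
print(filtered)
```

With the custom entry, 语言模型 stays in one piece instead of being split into 语言 and 模型, and the stop-word filter then removes 的 and 和 from the token list.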

import re
import jieba
import sqlite3
import pandas as pd
from zhon.hanzi import punctuation   # Chinese punctuation marks

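# jieba can load a custom dictionary: one entry per line, formatted as "word frequency POS-tag"
# (frequency and POS tag are optional, but the order is fixed)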
jieba.load_userdict('data/userdict.txt')

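# Read a Chinese stop-word file and return the stop words as a list (this post uses the HIT stop-word list)
def get_custom_stopwords(stop_words_file):
    '''
    Load the stop-word list, one word per line.
    '''
    # Assumes the stop-word file is UTF-8 encoded
    with open(stop_words_file, encoding='utf-8') as f:
        stopwords = f.read()
    # splitlines() handles both \n and \r\n; empty lines are skipped
    custom_stopwords_list = [i for i in stopwords.splitlines() if i]
    return custom_stopwords_list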
    
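def pd_search(sql="select * from 'asks_answer'", DB_name='***.sqlite3'):
    # Run the query and return the result as a DataFrame; always close the connection
    connection = sqlite3.connect(DB_name)
    try:
        df = pd.read_sql_query(sql, connection)
    finally:
        connection.close()
    return df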

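def chinese_word_cut(mytext):
    """
    Segment the text, then filter out stop words and Chinese punctuation.
    """
    stop_words_file = 'stopwords.txt'
    # A set makes each membership test fast
    stopwords = set(get_custom_stopwords(stop_words_file))
    
    # Segment in search-engine mode, drop stop words, join with spaces, then strip Chinese punctuation
    cutted_word = re.sub(r'[%s]+' % punctuation, '',
                         " ".join([i for i in jieba.cut_for_search(mytext) if i not in stopwords]))
    return cutted_word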

mytext = "语言模型是把查询和文档分别表示成语言模型，通过两个语言模型之间的KL距离来估计两者的相似度。"
test_word = chinese_word_cut(mytext)
test_word

>>>'语言 模型 查询 文档 成 语言 模型 两个 语言 模型 之间 KL 距离 估计 相似度'
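Note that `chinese_word_cut` re-reads the stop-word file on every call, which adds up when it is applied to many rows. One possible way around this, shown here as a standard-library sketch rather than as part of the original code, is to cache the loaded list. The names `load_stopwords` and `filter_tokens` are hypothetical, and the demo writes a throwaway stop-word file instead of using the HIT list.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_stopwords(path):
    """Read one stop word per line and return them as a frozenset (cached per path)."""
    text = Path(path).read_text(encoding="utf-8")
    return frozenset(line for line in text.splitlines() if line)


def filter_tokens(tokens, stopwords_path):
    """Drop every token that appears in the stop-word file."""
    stopwords = load_stopwords(stopwords_path)
    return [t for t in tokens if t not in stopwords]


# Demo with a temporary stop-word file (illustrative contents only).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "stopwords.txt")
    Path(demo_path).write_text("的\n是\n\n和\n", encoding="utf-8")

    demo_result = filter_tokens(["模型", "的", "距离", "和", "相似度"], demo_path)
    hits_before = load_stopwords.cache_info().hits
    filter_tokens(["是"], demo_path)  # second call: served from the cache
    hits_after = load_stopwords.cache_info().hits

print(demo_result)
```

The cache is keyed by the file path, so a changed stop-word file is only picked up after calling `load_stopwords.cache_clear()`.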