Text preprocessing, word-frequency counting, and n-grams

rebirth_2020

Category: NLP

Article link: https://blog.csdn.net/qq_25992377/article/details/90215625

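- Preprocessing Chinese text

Raw text can contain all kinds of unpredictable symbols, so we keep only the characters we care about: Chinese characters, digits, and ASCII letters. Everything else is dropped.

Reference: https://www.jianshu.com/p/093ec1eeccff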

def filter_word(sentence):
    """Keep only Chinese characters, digits, and ASCII letters."""
    kept = []
    for uchar in sentence:
        if (u'\u4e00' <= uchar <= u'\u9fa5'    # common Chinese characters
                or u'\u0030' <= uchar <= u'\u0039'    # digits 0-9
                or u'\u0041' <= uchar <= u'\u005a'    # uppercase A-Z
                or u'\u0061' <= uchar <= u'\u007a'):  # lowercase a-z
            kept.append(uchar)
    # Everything else (punctuation, whitespace, symbols) is discarded
    return "".join(kept)
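
The same filtering can also be sketched with a regular expression. This is a minimal variant that assumes the same three character ranges as above; the names `filter_word_re` and `_UNWANTED` are illustrative and not from the original post.

```python
import re

# Matches any character that is NOT a Chinese character (U+4E00-U+9FA5),
# a digit, or an ASCII letter.
_UNWANTED = re.compile(r"[^\u4e00-\u9fa50-9A-Za-z]")

def filter_word_re(sentence):
    """Remove every character outside the ranges we care about."""
    return _UNWANTED.sub("", sentence)

cleaned = filter_word_re("你好, world! 2019年")
print(cleaned)
```

Note that whitespace is removed as well, so a filter like this is best applied before segmentation, or to each token separately.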

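- Word-frequency counting after segmentation

Many libraries offer word-frequency counting. You can also write it yourself, but scikit-learn's CountVectorizer is more convenient.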

# Note: this code has not been verified.
# `text` is assumed to be a list of segmented sentences (each a list of words).
dict_all = dict()
for sentence in text:
    for one in sentence:
        dict_all[one] = dict_all.get(one, 0) + 1
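
For a hand-written tally, the standard library's `collections.Counter` does the same job as the loop above. The sketch below uses a small made-up `text` sample (not from the original post), assuming each sentence is already a list of tokens.

```python
from collections import Counter

# Hypothetical sample data: each sentence is a list of segmented words.
text = [["我们", "是", "好人"], ["他们", "是", "坏人"]]

# Flatten all sentences and count each word once per occurrence.
word_counts = Counter(word for sentence in text for word in sentence)

# The most frequent words first.
print(word_counts.most_common(3))
```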

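from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=2)  # keep terms found in at least 2 documents
corpus = ["我们 是 好人 大大 他们 坏人 真的", "他们 是 坏人 真的", "aa aa cc"]
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(X)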

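CountVectorizer drops words that are only one character long, because its default token_pattern, r"(?u)\b\w\w+\b", matches only tokens of two or more characters. To keep them, switch to character-level analysis (analyzer="char") or pass a custom token_pattern.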
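
A small sketch of that difference, assuming scikit-learn is installed and the input is already segmented with spaces. The two-sentence corpus and the relaxed pattern r"(?u)\b\w+\b" are illustrative choices, not taken from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["我们 是 好人", "他们 是 坏人"]

# Default settings: single-character tokens such as "是" are not matched.
default_vocab = sorted(CountVectorizer().fit(corpus).vocabulary_)

# Assumed alternative: accept any run of one or more word characters.
keep_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_vocab = sorted(keep_single.fit(corpus).vocabulary_)

print(default_vocab)
print(single_vocab)
```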

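The printed sparse matrix lists entries in the form (document index, term index) count, where the term index is the term's position in the vocabulary built from the whole corpus.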
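
To make those term indices readable, the fitted vectorizer's `vocabulary_` attribute (a mapping from term to column index) can be inverted. The following is a rough sketch assuming scikit-learn is installed and reusing the corpus from this section; `index_to_term` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["我们 是 好人 大大 他们 坏人 真的", "他们 是 坏人 真的", "aa aa cc"]
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries as (document index, term, count).
coo = X.tocoo()
for doc_id, term_id, count in zip(coo.row, coo.col, coo.data):
    print(doc_id, index_to_term[term_id], count)
```

With min_df=2, only terms that occur in at least two documents survive, so the third document contributes no entries.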

- ngram

Depending on how many adjacent words are combined, we get 1-grams, 2-grams, 3-grams, and so on.
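
As a plain-Python illustration of the idea, independent of any library, the sketch below slides a window of n adjacent tokens over a segmented sentence. The helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "我们 是 好人 真的 坏人".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.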

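The code below shows how CountVectorizer handles n-grams. Because ngram_range=(1, 2) sets the lower bound to 1, the result contains the shorter n-grams too: both single words and two-word combinations.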

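from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts both unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
analyze = bigram_vectorizer.build_analyzer()
print(analyze("我们 是 好人 真的 坏人"))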

For a list of already-segmented sentences, extract the bigrams and save the resulting vocabulary to a file.

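# `contents` is assumed to be a list of segmented sentences (space-separated strings)
X_2 = bigram_vectorizer.fit_transform(contents[0:200]).toarray()
# print(X_2)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bi_word_all = bigram_vectorizer.get_feature_names_out()

# Write one n-gram per line; UTF-8 keeps the Chinese text intact
with open("bigram.txt", "w", encoding="utf-8") as file_out:
    for one in bi_word_all:
        file_out.write(one + "\n")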