Includes preprocessing; uses TF-IDF for term weighting.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/05/15 上午10:48
import itertools
import re
import jieba
from six.moves import xrange
from sklearn.feature_extraction.text import TfidfVectorizer
def load_stopwords():
    # Load the stop-word list (one UTF-8 encoded word per line) into a frozenset.
    path = '/Users/fhqplzj/PycharmProjects/data_service/service/dic/why/stopwords'
    with open(path, 'rb') as fin:
        content = fin.read().decode('utf-8')
    return frozenset(content.splitlines())
stopwords = load_stopwords()
# Tokens may contain only ASCII alphanumerics, underscores, or CJK ideographs.
# (u'...' instead of the Python-2-only ur'...' so the pattern also compiles on Python 3.)
chinese = re.compile(u'^[0-9a-zA-Z_\u4e00-\u9fa5]+$')
def filter_func(word):
    # Keep a token only if it matches the character whitelist above
    # and is not a stop word.
    return chinese.match(word) is not None and word not in stopwords
def my_tokenizer(sentence):
    # Segment with jieba, then drop stop words and unwanted tokens.
    # A list (rather than a lazy filter object) keeps Python 2/3 behaviour identical.
    words = jieba.lcut(sentence)
    return [word for word in words if filter_func(word)]
def word_and_weight(corpus):
    vectorizer = TfidfVectorizer(tokenizer=my_tokenizer, norm='l1')
    # Cut off after "tfidf_" in the original; the rest is reconstructed from the
    # function name: fit on the corpus, return the vocabulary and L1-normalised weights.
    tfidf_matrix = vectorizer.fit_transform(corpus)
    return vectorizer.get_feature_names(), tfidf_matrix.toarray()
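# A minimal usage sketch, not part of the original script: the two sentences
# below are made-up sample data, and the loop prints every non-zero term
# weight per document (each row sums to 1 because of norm='l1').
if __name__ == '__main__':
    corpus = [u'我喜欢自然语言处理', u'自然语言处理和机器学习都很有趣']
    words, weights = word_and_weight(corpus)
    for row in weights:
        for idx in xrange(len(words)):
            if row[idx] > 0:
                print('%s\t%.4f' % (words[idx], row[idx]))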