from sklearn.feature_extraction.text import CountVectorizer
corpus=['Jobs was the chairman of Apple Inc., and he was very famous',
'I like to use apple computer',
'And I also like to eat apple']
We now want to vectorize the three sentences in the list above.
vectorizer = CountVectorizer()
print("Document vectorization without stop-word filtering:")
print(vectorizer.fit_transform(corpus).todense())  # show the full (dense) matrix
print(vectorizer.vocabulary_)
The documents are converted into a matrix of 3 rows and 17 columns, where each entry records how many times a word occurs in a document. (Single-letter tokens such as "I" are dropped by CountVectorizer's default tokenizer, which only keeps tokens of two or more characters.)
Next, load the stop-word list from the NLTK package to process the documents further.
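To see which word each of the 17 columns corresponds to, one can list the learned vocabulary in column order. A minimal sketch (it re-fits a fresh CountVectorizer on the same corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Jobs was the chairman of Apple Inc., and he was very famous',
          'I like to use apple computer',
          'And I also like to eat apple']

vec = CountVectorizer()
X = vec.fit_transform(corpus)

# vocabulary_ maps each token to its column index;
# columns are ordered alphabetically by token
for word, col in sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, word, X[:, col].toarray().ravel())
```

Reading the per-column counts this way makes it easy to confirm, for example, that "apple" appears once in every document.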
import nltk
nltk.download("stopwords")
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)
With stop words configured, re-vectorize the documents. Note that stop_words="english" below uses scikit-learn's built-in English stop-word list, which differs slightly from the NLTK list loaded above; to use the NLTK list instead, pass stop_words=stopwords.
vectorizer1 = CountVectorizer(stop_words="english")
print("after stopwords removal:")
print(vectorizer1.fit_transform(corpus).todense())
print(vectorizer1.vocabulary_)
The matrix now has 3 rows and 8 columns.
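To check exactly which of the original 17 words the built-in English stop list removed, one can compare the two vocabularies directly. A small sketch on the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Jobs was the chairman of Apple Inc., and he was very famous',
          'I like to use apple computer',
          'And I also like to eat apple']

vec_all = CountVectorizer().fit(corpus)
vec_en = CountVectorizer(stop_words="english").fit(corpus)

# Tokens in the unfiltered vocabulary that were filtered out as stop words
dropped = sorted(set(vec_all.vocabulary_) - set(vec_en.vocabulary_))
print(dropped)
```

Note that scikit-learn's list is fairly aggressive: it also removes words such as "inc" and "very" that many other stop lists keep.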
In addition, the n-gram setting controls how documents are split into features. The following code counts both single words and pairs of adjacent words as features, i.e. ngram_range=(1, 2).
vectorizer2 = CountVectorizer(ngram_range=(1,2))
print("N-gram mode:")
print(vectorizer2.fit_transform(corpus).todense())
print(vectorizer2.vocabulary_)