You can build a vocabulary of all possible bigrams and trigrams by specifying the ngram_range parameter of CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer
Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_
vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
tf.toarray()
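
As for what you are trying to do: it works if you fit the CountVectorizer on your own vocabulary and then transform the documents. Note that fitting also adds the shorter n-grams contained in each entry to the vocabulary, which is why 'was the' and 'the fallacy' appear in the output.

my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']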
vectorizer = CountVectorizer(ngram_range=(2, 3))
vectorizer.fit_transform(my_vocabulary)
tf = vectorizer.transform(doc_set)
vectorizer.vocabulary_
Out[26]:
{'another reality': 0,
'data tracking': 1,
'more crucial': 2,
'particular premium': 3,
'the fallacy': 4,
'to excel': 5,
'was the': 6,
'was the fallacy': 7}
tf.toarray()
Out[25]:
array([[0, 0, 0, 0, 1, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)
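
If the extra columns for 'was the' and 'the fallacy' are unwanted, one possible alternative (a sketch that goes beyond the approach above) is to pass the list through the vocabulary parameter of CountVectorizer. The names fixed_vectorizer and fixed_tf are made up for this illustration, and the sketch assumes every entry is lowercase and falls within ngram_range.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc_set = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'In the new, stripped-down, every-job-counts business climate, these human',
    'Another reality makes emotional intelligence ever more crucial',
    'The globalization of the workforce puts a particular premium on emotional',
    'As business changes, so do the traits needed to excel. Data tracking',
]
my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium',
                 'to excel', 'data tracking', 'another reality']

# Assumption: a fixed vocabulary keeps exactly these six n-grams as columns,
# in the order given, instead of learning n-grams from the text.
fixed_vectorizer = CountVectorizer(ngram_range=(2, 3), vocabulary=my_vocabulary)
fixed_tf = fixed_vectorizer.fit_transform(doc_set)

print(fixed_vectorizer.vocabulary_)  # maps each n-gram to its column index
print(fixed_tf.toarray())            # one row per document, one column per n-gram
```

With this variant the matrix has one column per entry of my_vocabulary, so the counts line up directly with the list you supplied.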