from sklearn.feature_extraction.text import CountVectorizer

def get_features_by_wordbag():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    # Fit the vectorizer on the training set; this learns the vocabulary.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # Build a second vectorizer with the training vocabulary so the
    # test matrix columns line up with the training matrix columns.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        vocabulary=vocabulary,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    print(vectorizer)
    x_test = vectorizer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
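The key pattern in the function above is fitting the vocabulary on the training set only, then reusing it for the test set. A minimal self-contained sketch of that pattern, with toy corpora standing in for `load_all_files()` (which is assumed to be defined elsewhere):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora standing in for the real train/test files.
train_docs = ['the cat sat', 'the dog barked']
test_docs = ['the cat barked']

# Fit on the training corpus only; this fixes the vocabulary.
train_vec = CountVectorizer(stop_words='english')
x_train = train_vec.fit_transform(train_docs).toarray()

# Reuse the learned vocabulary so test columns align with train columns.
test_vec = CountVectorizer(vocabulary=train_vec.vocabulary_)
x_test = test_vec.transform(test_docs).toarray()

print(x_train.shape[1] == x_test.shape[1])  # same feature dimension
```

Without the shared vocabulary, fitting a fresh vectorizer on the test set would produce a matrix whose columns refer to different words, making train and test features incompatible.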
Bag-of-words model example:
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
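A consequence of the fixed column interpretation is that words never seen during fitting are simply ignored at transform time. A small sketch using the same toy corpus as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# 'something', 'completely', and 'new' are out-of-vocabulary,
# so the resulting row contains only zeros.
row = vectorizer.transform(['Something completely new.']).toarray()[0]
print(row.sum())
```

This is the same behavior the `get_features_by_wordbag` function relies on: test-set words outside the training vocabulary contribute nothing to the feature matrix.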