from sklearn.feature_extraction.text import CountVectorizer

def get_features_by_wordbag():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    # Fit the vectorizer on the training set; this learns the vocabulary.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # Build a second vectorizer with the training vocabulary so the
    # test matrix columns line up with the training matrix columns.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        vocabulary=vocabulary,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    print(vectorizer)
    x_test = vectorizer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
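The key pattern in the function above is fitting the vocabulary on the training set only, then reusing it for the test set. A minimal self-contained sketch of that pattern, with toy corpora standing in for `load_all_files()` (which is assumed to be defined elsewhere):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora standing in for the real train/test files.
train_docs = ['the cat sat', 'the dog barked']
test_docs = ['the cat barked']

# Fit on the training corpus only; this fixes the vocabulary.
train_vec = CountVectorizer(stop_words='english')
x_train = train_vec.fit_transform(train_docs).toarray()

# Reuse the learned vocabulary so test columns align with train columns.
test_vec = CountVectorizer(vocabulary=train_vec.vocabulary_)
x_test = test_vec.transform(test_docs).toarray()

print(x_train.shape[1] == x_test.shape[1])  # same feature dimension
```

Without the shared vocabulary, fitting a fresh vectorizer on the test set would produce a matrix whose columns refer to different words, making train and test features incompatible.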
Bag-of-words model example:
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
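A consequence of the fixed column interpretation is that words never seen during fitting are simply ignored at transform time. A small sketch using the same toy corpus as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# 'something', 'completely', and 'new' are out-of-vocabulary,
# so the resulting row contains only zeros.
row = vectorizer.transform(['Something completely new.']).toarray()[0]
print(row.sum())
```

This is the same behavior the `get_features_by_wordbag` function relies on: test-set words outside the training vocabulary contribute nothing to the feature matrix.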