http://scikit-learn.org/stable/modules/feature_extraction.html
Section 4.2 covers a lot of material, so text feature extraction gets its own post.
1、the bag of words representation
To represent the raw data as numerical feature vectors of fixed length, scikit-learn provides three utilities:
tokenizing: assign each token (character or word; pick the granularity yourself) an integer index id
counting: count how many times each token occurs in each document
normalizing: normalize and weight the importance of each token according to how often it occurs in the samples/documents.
Re-examine what a feature is and what a sample is:
- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
The general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors is called the bag-of-words representation; it completely ignores the relative position information of the words in the document.
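A minimal sketch of this pipeline with CountVectorizer (the toy corpus below is made up for illustration): each row of the resulting matrix is one document (a multivariate sample) and each column holds the occurrence count of one token (a feature).

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]
vectorizer = CountVectorizer()        # tokenizing + counting
X = vectorizer.fit_transform(corpus)  # 3 samples x n_features sparse matrix
print(vectorizer.get_feature_names()) # one token per column (feature)
print(X.toarray())                    # row = document, value = token count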
2、sparsity
The words appearing in any single document are only a tiny fraction of all the words in the corpus, which makes the feature vectors sparse (most values are 0). To keep storage and computation efficient, Python's scipy.sparse package is used.
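A quick way to see this (again a toy corpus, just for illustration): the matrix returned by fit_transform is a scipy.sparse matrix, and only the non-zero counts are stored.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat', 'the dog chased the cat']
X = CountVectorizer().fit_transform(corpus)
print(type(X))       # a scipy.sparse matrix, not a dense numpy array
print(X.nnz)         # number of stored (non-zero) entries
print(X.nnz / float(X.shape[0] * X.shape[1]))  # fraction of non-zero cells

On a real corpus with a large vocabulary this fraction is typically only a few percent.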
3、common vectorizer usage
CountVectorizer implements both tokenizing and counting.
It has many parameters, but the defaults are already quite reasonable and fit most cases; for details see: http://blog.csdn.net/mmc2015/article/details/46866537
The example here illustrates its usage:
http://blog.csdn.net/mmc2015/article/details/46857887
including fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), and so on; see the sketch below.
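A short sketch of those calls (the sentences are placeholders; ngram_range=(1, 2) extracts unigrams and bigrams):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)          # learn the vocabulary and count
print(vectorizer.get_feature_names())         # unigram and bigram feature names
print(vectorizer.vocabulary_.get('second'))   # column index of a given token
X_new = vectorizer.transform(['A brand new document.'])
print(X_new.toarray())                        # tokens unseen during fit are ignored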
4、tf-idf term weighting
Tf-idf addresses the problem that certain words (e.g. “the”, “a”, “is” in English) occur very frequently yet are not the words we actually care about.
The text.TfidfTransformer class implements this normalization:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])
>>> transformer.idf_  # idf_ stores the idf weights learned by fit
array([ 1. ..., 2.25..., 1.84...])
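To see where those idf_ values come from: with the default smooth_idf=True, the idf is computed as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick sanity check for the counts above:

import numpy as np

n = 6                     # number of documents in `counts`
df = np.array([6, 1, 2])  # number of documents containing each of the 3 terms
idf = np.log((1.0 + n) / (1.0 + df)) + 1
print(idf)                # approximately [1.0, 2.25, 1.85], matching transformer.idf_ above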
Another class called TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model.
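A minimal sketch (toy corpus, default parameters):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]
vectorizer = TfidfVectorizer()        # tokenizing + counting + tf-idf weighting
X = vectorizer.fit_transform(corpus)  # sparse matrix of tf-idf weights
print(X.shape)
print(vectorizer.idf_)                # learned idf weight for each feature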