1. Converting dict-format data into features.
Premise: the data is stored as Python dicts. Calling the DictVectorizer class converts it into feature vectors; string-valued features are automatically expanded into multiple feature variables, similar to the one-hot encoding mentioned earlier.
In [226]: measurements = [
...: {'city': 'Dubai', 'temperature': 33.},
...: {'city': 'London', 'temperature': 12.},
...: {'city': 'San Fransisco', 'temperature': 18.},
...: ]
In [227]: from sklearn.feature_extraction import DictVectorizer
In [228]: vec=DictVectorizer()
In [229]: vec.fit_transform(measurements).toarray()
Out[229]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
In [230]: vec.get_feature_names()
Out[230]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
You can also select features directly via vec's restrict method.
In [247]: from sklearn.feature_selection import SelectKBest,chi2
In [248]: z = vec.fit_transform(measurements)
In [249]: support = SelectKBest(chi2, k=2).fit(z, [0, 1, 2])
In [250]: z.toarray()
Out[250]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
In [251]: vec.get_feature_names()
Out[251]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
In [252]: vec.restrict(support.get_support())
Out[252]: DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)
In [253]: vec.get_feature_names()
Out[253]: ['city=San Fransisco', 'temperature']
You can also call inverse_transform to recover the original values.
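A minimal sketch of the round trip with a freshly fitted DictVectorizer (not the restricted one above); one-hot columns come back as 'city=Dubai'-style keys:

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
]
vec = DictVectorizer()
X = vec.fit_transform(measurements)
# inverse_transform maps each row of X back to a feature dict,
# keeping only the nonzero entries
print(vec.inverse_transform(X))
# [{'city=Dubai': 1.0, 'temperature': 33.0}, {'city=London': 1.0, 'temperature': 12.0}]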
2. Feature hashing
When a feature can take a large number of distinct values, and several such features need one-hot encoding, the resulting feature matrix becomes very large and mostly zeros. In that case a hash function can map each feature, based on its name and value, into a matrix with a specified number of columns. Since hashing is a one-way function, FeatureHasher has no inverse_transform method.
from sklearn.feature_extraction import FeatureHasher

# Hash each (feature name, value) pair into one of n_features columns
h = FeatureHasher(n_features=10, input_type='dict')
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(D)
# The sign of the hash decides each entry's sign, so collisions tend to
# cancel out on average; hence the negative values below.
f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
3. Text processing
(1) Count
Each word in the corpus becomes a feature, and the number of times that word appears in a document is its feature value.
In [1]: from sklearn.feature_extraction.text import CountVectorizer
In [2]: vec=CountVectorizer()
In [3]: vec
Out[3]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [4]: corpus = [
...: 'This is the first document.',
...: 'This is the second second document.',
...: 'And the third one.',
...: 'Is this the first document?',
...: ]
In [5]: X=vec.fit_transform(corpus)
In [6]: X.toarray()
Out[6]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
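The columns follow the alphabetically sorted vocabulary; a quick check, continuing the session above:

vec.get_feature_names()
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# e.g. row 0 ('This is the first document.') has a 1 in the columns for
# 'document', 'first', 'is', 'the' and 'this'.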
N-grams can also be used as the bag of words, specified via the ngram_range parameter.
In [21]: bigram_vec=CountVectorizer(ngram_range=(1,3),token_pattern=r'\b\w+\b',
    ...: min_df=1)
In [22]: bigram_vec.fit_transform(corpus).toarray()
Out[22]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
In [23]: analyze=bigram_vec.build_analyzer()
In [24]: analyze('hello a b c')
Out[24]:
[u'hello',
u'a',
u'b',
u'c',
u'hello a',
u'a b',
u'b c',
u'hello a b',
u'a b c']
For detecting character encodings, see the chardet module.
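A minimal sketch of chardet usage, assuming the chardet package is installed ('some_file.txt' is a hypothetical input file):

import chardet

raw = open('some_file.txt', 'rb').read()  # read raw bytes, not decoded text
print(chardet.detect(raw))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99}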
HashingVectorizer = CountVectorizer + FeatureHasher: it tokenizes text the way CountVectorizer does, but hashes the tokens into a fixed-width matrix instead of building a vocabulary.
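A minimal sketch of HashingVectorizer on a small corpus: it is stateless (no vocabulary is stored), so transform can be called directly, and there is no inverse_transform:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
# n_features fixes the matrix width up front instead of growing a vocabulary
hv = HashingVectorizer(n_features=16)
X = hv.transform(corpus)
print(X.shape)  # (2, 16)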
Reference:
http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction