1. Converting dict-format data into features.
Premise: the data is stored as Python dicts. Calling the DictVectorizer class converts it into feature vectors; string-valued features are automatically expanded into multiple feature variables, similar to the one-hot encoding mentioned earlier.
In [226]: measurements = [
...: {'city': 'Dubai', 'temperature': 33.},
...: {'city': 'London', 'temperature': 12.},
...: {'city': 'San Fransisco', 'temperature': 18.},
...: ]
In [227]: from sklearn.feature_extraction import DictVectorizer
In [228]: vec=DictVectorizer()
In [229]: vec.fit_transform(measurements).toarray()
Out[229]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
In [230]: vec.get_feature_names()
Out[230]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
You can also select features directly via vec's restrict method.
In [247]: from sklearn.feature_selection import SelectKBest,chi2
In [248]: z = vec.fit_transform(measurements)
In [249]: support = SelectKBest(chi2, k=2).fit(z, [0, 1, 2])
In [250]: z.toarray()
Out[250]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
In [251]: vec.get_feature_names()
Out[251]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
In [252]: vec.restrict(support.get_support())
Out[252]: DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)
In [253]: vec.get_feature_names()
Out[253]: ['city=San Fransisco', 'temperature']
You can also call inverse_transform to recover the original values.
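A minimal sketch of the round trip with a freshly fitted DictVectorizer (not the restricted one above); one-hot columns come back as 'city=Dubai'-style keys:

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
]
vec = DictVectorizer()
X = vec.fit_transform(measurements)
# inverse_transform maps each row of X back to a feature dict,
# keeping only the nonzero entries
print(vec.inverse_transform(X))
# [{'city=Dubai': 1.0, 'temperature': 33.0}, {'city=London': 1.0, 'temperature': 12.0}]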
2. Feature hashing
When a feature can take a large number of distinct values, and several such features need one-hot encoding, the resulting feature matrix becomes very large and mostly zeros. In that case a hash function can map each feature, based on its name and value, into a matrix with a specified number of columns. Since hashing is a one-way function, FeatureHasher has no inverse_transform method.
from sklearn.feature_extraction import FeatureHasher

# Hash each (feature name, value) pair into one of n_features columns
h = FeatureHasher(n_features=10, input_type='dict')
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(D)
# The sign of the hash decides each entry's sign, so collisions tend to
# cancel out on average; hence the negative values below.
f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
3. Text processing
(1) Count
Each word in the corpus becomes a feature, and the number of times that word appears in a document is its feature value.
In [1]: from sklearn.feature_extraction.text import CountVectorizer
In [2]: vec=CountVectorizer()
In [3]: vec
Out[3]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [4]: corpus = [
...: 'This is the first document.',
...: 'This is the second second document.',
...: 'And the third one.',
...: 'Is this the first document?',
...: ]
In [5]: X=vec.fit_transform(corpus)
In [6]: X.toarray()
Out[6]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
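The columns follow the alphabetically sorted vocabulary; a quick check, continuing the session above:

vec.get_feature_names()
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# e.g. row 0 ('This is the first document.') has a 1 in the columns for
# 'document', 'first', 'is', 'the' and 'this'.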
N-grams can also be used as the bag of words, specified via the ngram_range parameter.
In [21]: bigram_vec=CountVectorizer(ngram_range=(1,3),token_pattern=r'\b\w+\b',
    ...: min_df=1)
In [22]: bigram_vec.fit_transform(corpus).toarray()
Out[22]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
In [23]: analyze=bigram_vec.build_analyzer()
In [24]: analyze('hello a b c')
Out[24]:
[u'hello',
u'a',
u'b',
u'c',
u'hello a',
u'a b',
u'b c',
u'hello a b',
u'a b c']
For detecting character encodings, see the chardet module.
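A minimal sketch of chardet usage, assuming the chardet package is installed ('some_file.txt' is a hypothetical input file):

import chardet

raw = open('some_file.txt', 'rb').read()  # read raw bytes, not decoded text
print(chardet.detect(raw))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99}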
HashingVectorizer = CountVectorizer + FeatureHasher: it tokenizes text the way CountVectorizer does, but hashes the tokens into a fixed-width matrix instead of building a vocabulary.
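A minimal sketch of HashingVectorizer on a small corpus: it is stateless (no vocabulary is stored), so transform can be called directly, and there is no inverse_transform:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
# n_features fixes the matrix width up front instead of growing a vocabulary
hv = HashingVectorizer(n_features=16)
X = hv.transform(corpus)
print(X.shape)  # (2, 16)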
Reference:
http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction