A brief summary of the feature extraction methods I currently use most often.
1. Converting text data into feature vectors (CountVectorizer only considers how often each token occurs in the text)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
wordVectorizer = CountVectorizer(ngram_range=(1, 2))  # count unigrams and bigrams
X_train = wordVectorizer.fit_transform(x_train)  # sparse matrix of raw n-gram counts
wordTransformer = TfidfTransformer()
train_feature = wordTransformer.fit_transform(X_train)  # reweight the counts to TF-IDF
2. Numeric conversion of categorical features (DictVectorizer operates on symbolic, non-numeric but structured feature data, such as dicts or DataFrame rows, turning each category into a 0/1 indicator column.)
The brute-force alternative is to enumerate the key-value mapping by hand.
from sklearn.feature_extraction import DictVectorizer
dict_vec = DictVectorizer(sparse=False)  # sparse=False: return a dense array instead of a sparse matrix
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
print(dict_vec.feature_names_)  # inspect the generated column names
print(X_train)
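A runnable sketch of the DictVectorizer pattern above, with a made-up two-column DataFrame standing in for the real training data:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# toy data, invented for illustration
X_train = pd.DataFrame({'age': [22, 38], 'sex': ['male', 'female']})

dict_vec = DictVectorizer(sparse=False)  # dense ndarray output
features = dict_vec.fit_transform(X_train.to_dict(orient='records'))

print(dict_vec.feature_names_)  # ['age', 'sex=female', 'sex=male']
print(features)  # numeric 'age' passes through; 'sex' becomes 0/1 columns
```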
Also, for reference, sklearn.feature_extraction exports:
__all__ = ['DictVectorizer', 'image', 'img_to_graph', 'grid_to_graph', 'text',
'FeatureHasher']
sklearn.feature_extraction.text exports:
__all__ = ['CountVectorizer',
'ENGLISH_STOP_WORDS',
'TfidfTransformer',
'TfidfVectorizer',
'strip_accents_ascii',
'strip_accents_unicode',
'strip_tags']
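One note on the listing above: TfidfVectorizer combines the CountVectorizer + TfidfTransformer steps from section 1 into a single object, so with default parameters the two routes give the same matrix (toy corpus invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

corpus = ["the cat sat", "the dog sat"]  # toy corpus

# two-step route: counts, then TF-IDF reweighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# one-step equivalent
one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```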
P.S. Reading the source code directly makes all of this clearest.