sklearn feature extraction
Workflow
- Instantiate DictVectorizer
- Call fit_transform(x) to feed in the data and transform it
Installing sklearn
pip install scikit-learn
(The PyPI package is named scikit-learn; it is imported as sklearn.)
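sklearn text feature extraction
CountVectorizer(max_df=1.0,min_df=1,…)
- Returns a term-count matrix
- CountVectorizer.fit_transform(X,y)
  - X: an iterable of text strings (a single bare string must be wrapped in a list)
  - Return value: a sparse matrix
- CountVectorizer.inverse_transform(X)
  - X: an array or a sparse matrix
  - Return value: for each document, the terms with nonzero counts (word order and counts are not recovered)
- CountVectorizer.get_feature_names_out() (get_feature_names() before scikit-learn 1.0)
  - Return value: the list of words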
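from sklearn.feature_extraction.text import CountVectorizer

# Build the vocabulary and count each term in one step
vector = CountVectorizer()
result = vector.fit_transform(["Players, put yo' pinky rings up to the moon,Girls, what y'all trying to do?"])
# Vocabulary terms in column order (get_feature_names() before scikit-learn 1.0)
print(vector.get_feature_names_out())
# toarray must be called, with parentheses, to get the dense count matrix
print(result.toarray())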
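sklearn dictionary feature extraction
Converts each categorical value in the dictionaries into its own feature (one-hot encoding); numeric values are kept as they are
- DictVectorizer.fit_transform(x)
  - x: a dictionary, or an iterable of dictionaries
  - Return value: a sparse matrix
- DictVectorizer.inverse_transform(x)
  - x: an array or a sparse matrix
  - Return value: the data in its pre-transform format
- DictVectorizer.get_feature_names_out() (get_feature_names() before scikit-learn 1.0)
  - Returns the feature (category) names
- DictVectorizer.transform(x)
  - Transforms x using the mapping learned during fitting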
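from sklearn.feature_extraction import DictVectorizer

data = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]
dictVector = DictVectorizer()
# Prints the sparse matrix as (row, column) value entries
print(dictVector.fit_transform(data))
# Column names: 'city=...' for the one-hot columns, 'temperature' for the numeric one
print(dictVector.feature_names_)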
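The sparse printout is hard to read, and transform and inverse_transform are listed above without an example. The following is a minimal sketch of all three; sparse=False is a real DictVectorizer option, while the extra records (including 'Paris') are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

# sparse=False returns a dense NumPy array instead of a sparse matrix
vec = DictVectorizer(sparse=False)
dense = vec.fit_transform(data)
print(vec.feature_names_)
print(dense)

# transform reuses the columns learned during fit; a category that was
# never seen during fit (here 'Paris') gets all zeros in the city columns
new = vec.transform([
    {'city': 'London', 'temperature': 20.},
    {'city': 'Paris', 'temperature': 5.},
])
print(new)

# inverse_transform maps rows back to dicts keyed by feature name
restored = vec.inverse_transform(dense)
print(restored[0])
```

Note that inverse_transform returns keys such as 'city=Dubai' rather than the original {'city': 'Dubai'} pair.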
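tf-idf
tf (term frequency) = number of times the term occurs in a document / total number of terms in that document
idf (inverse document frequency) = log(total number of documents / (number of documents containing the term + 1))
tf-idf = tf × idf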
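The two formulas above can be checked by hand with a few lines of plain Python. This is a rough sketch of those formulas exactly as written in these notes; the helper names and the toy corpus are hypothetical. scikit-learn's TfidfVectorizer uses a different smoothing for idf, so its numbers will not match these.

```python
import math

def tf(term, doc_tokens):
    # occurrences of the term / total number of tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(total documents / (documents containing the term + 1))
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus, already split into tokens (illustrative only)
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
    "a quiet day".split(),
]

print(tf("the", corpus[0]))             # 2 of 6 tokens
print(idf("cat", corpus))               # log(4 / (1 + 1))
print(tf_idf("cat", corpus[0], corpus))
```

With the +1 in the denominator, a term that appears in every document gets a negative idf, which is one reason libraries tend to use other smoothing variants.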