Machine Learning: Feature Extraction
- Dictionary (dict) features:
- Text features:
  - 1. Single characters are not counted. Each distinct word appears once as a feature (feature names are not repeated), while its occurrences within a document are accumulated as the count.
  - 2. A feature that does not appear in a given document is recorded as 0.
  - 3. Likewise, single Chinese characters carry little meaning on their own; long sentences should first be segmented into words (e.g. with jieba) and joined with spaces.
- TF-IDF:
  - TF (term frequency): how often a term occurs in a document.
  - IDF (inverse document frequency): log(total number of documents / number of documents containing the term). The product TF × IDF measures how important a term is to a particular document.
```python
# -*- coding: utf-8 -*-
"""
Created on Fri Nov 20 16:20:41 2020
@author: Yuka
"""
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def dict_vect():
    '''
    Feature extraction for dict-type data.

    Returns
    -------
    None.
    '''
    data = [{'japan': 100}, {'China': 99}, {'American': 85}, {'Korcan': 70}]
    dic = DictVectorizer(sparse=False)
    res = dic.fit_transform(data)
    print(res)
    print(dic.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2


def cut_zh(datas):
    # Segment each Chinese sentence with jieba and join the tokens with spaces.
    res = [" ".join(list(jieba.cut(data))) for data in datas]
    return res


def text_vec():
    '''
    Feature extraction for text data.

    Returns
    -------
    None.
    '''
    data = ['I like you,but i just like you~',
            'If i can contral my bed time,it will make me randical']
    tex = CountVectorizer()
    res = tex.fit_transform(data)
    print(tex.get_feature_names())
    print(res.toarray())
    datas = cut_zh(["我还是好喜欢你,但是也仅仅只是喜欢你而已.....",
                    "我还是好喜欢你,像风走了十万里,不问归期.....",
                    "我还是好喜欢你,想你妈揍你,不讲道理...."])
    res1 = tex.fit_transform(datas)
    print(tex.get_feature_names())
    print(res1.toarray())


def tf_vect():
    tf = TfidfVectorizer()
    datas = cut_zh(["我还是好喜欢你,但是也仅仅只是喜欢你而已.....",
                    "我还是好喜欢你,像风走了十万里,不问归期.....",
                    "我还是好喜欢你,想你妈揍你,不讲道理...."])
    res1 = tf.fit_transform(datas)
    print(tf.get_feature_names())
    print(res1.toarray())


if __name__ == '__main__':
    dict_vect()
    text_vec()
    tf_vect()
```
Dictionary feature output:

```
['American', 'China', 'Korcan', 'japan']
[[  0.   0.   0. 100.]
 [  0.  99.   0.   0.]
 [ 85.   0.   0.   0.]
 [  0.   0.  70.   0.]]
```
English text feature output:

```
['bed', 'but', 'can', 'contral', 'if', 'it', 'just', 'like', 'make', 'me', 'my', 'randical', 'time', 'will', 'you']
[[0 1 0 0 0 0 1 2 0 0 0 0 0 0 2]
 [1 0 1 1 1 1 0 0 1 1 1 1 1 1 0]]
```
Chinese text feature output:

```
['不讲道理', '不问', '仅仅只是', '但是', '十万里', '喜欢', '归期', '而已', '还是']
[[0 0 1 1 0 2 0 1 1]
 [0 1 0 0 1 1 1 0 1]
 [1 0 0 0 0 1 0 0 1]]
```
TF-IDF output:

```
['不讲道理', '不问', '仅仅只是', '但是', '十万里', '喜欢', '归期', '而已', '还是']
[[0.         0.         0.4591149  0.4591149  0.         0.54232132
  0.         0.4591149  0.27116066]
 [0.         0.52004008 0.         0.         0.52004008 0.30714405
  0.52004008 0.         0.30714405]
 [0.76749457 0.         0.         0.         0.         0.45329466
  0.         0.         0.45329466]]
```
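The numbers above can be reproduced by hand. By default `TfidfVectorizer` uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. A sketch checking that formula against the vectorizer, using made-up English documents so no segmenter is needed:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs).toarray()

# Recompute by hand: tf * (ln((1 + n) / (1 + df)) + 1), then L2-normalize rows.
cv = CountVectorizer(vocabulary=vec.vocabulary_)  # same column order
tf = cv.fit_transform(docs).toarray().astype(float)
n = tf.shape[0]
df = (tf > 0).sum(axis=0)           # documents containing each term
idf = np.log((1 + n) / (1 + df)) + 1
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)
print(np.allclose(manual, X))  # True
```

A term present in every document (like "喜欢" or "还是" above) gets the minimum idf of 1, while rarer terms are weighted up, which is exactly the importance ranking TF-IDF is meant to capture.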