Machine Learning (1): Feature Engineering

Article link: https://blog.csdn.net/qq_38936560/article/details/112738378

1. Feature extraction

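In real applications, data is rarely purely numeric. It may be a piece of text, an image, or a video. Feature extraction is the process of converting such data into a purely numeric representation.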

1.1 Dictionary data (the DictVectorizer class):

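get_feature_names(): returns the feature names produced by the transformation (in scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2)
inverse_transform(): converts a one-hot encoded array or sparse matrix back into a list of dictionaries keyed by feature name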

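Sparse matrix representation (DictVectorizer(), sparse=True by default): stores only the non-zero entries, which saves memory and is efficient to read
One-hot encoding (DictVectorizer(sparse=False)): returns a dense array in which each discrete (string) value gets its own 0/1 column, while numeric values such as temperature are passed through unchanged. String values are one-hot encoded in both cases; sparse only controls the output format.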

# Imports
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
# Prepare the data (cities: Beijing, Shanghai, Nanjing; weather: overcast, sunny, rainy)
data = [{'city':'北京','temperature':10,'weather':'阴'},
        {'city':'上海','temperature':20,'weather':'晴'},
        {'city':'南京','temperature':15,'weather':'雨'}]
print(data)
# Output:
#[{'city': '北京', 'temperature': 10, 'weather': '阴'}, 
#{'city': '上海', 'temperature': 20, 'weather': '晴'}, 
#{'city': '南京', 'temperature': 15, 'weather': '雨'}]

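# Sparse matrix representation:
dict1 = DictVectorizer()
data_after1 = dict1.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
print(dict1.get_feature_names_out().tolist())
print(data_after1)
# Output: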
#['city=上海', 'city=北京', 'city=南京', 'temperature', 'weather=晴', 'weather=阴', 'weather=雨']
 # (0, 1)	1.0 
 # (0, 3)	10.0
 # (0, 5)	1.0
 # (1, 0)	1.0
 # (1, 3)	20.0 
 # (1, 4)	1.0
 # (2, 2)	1.0
 # (2, 3)	15.0
 # (2, 6)	1.0

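# One-hot encoding (dense output)
dict2 = DictVectorizer(sparse=False)
data_after2 = dict2.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
print(dict2.get_feature_names_out().tolist())
print(data_after2)
# Output: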
#['city=上海', 'city=北京', 'city=南京', 'temperature', 'weather=晴', 'weather=阴', 'weather=雨']
#[[ 0.  1.  0. 10.  0.  1.  0.]
# [ 1.  0.  0. 20.  1.  0.  0.]
# [ 0.  0.  1. 15.  0.  0.  1.]]
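
The inverse_transform() method listed above is not exercised in the example, so here is a minimal sketch of it, together with converting the sparse result to a dense array. The variable names (records, vec, restored) are illustrative and not from the original post. Note that the restored dictionaries use the generated feature names such as 'city=北京' rather than the original 'city' key, and that zero-valued entries are omitted.

```python
from sklearn.feature_extraction import DictVectorizer

# Two of the records from the example above (illustrative subset).
records = [
    {"city": "北京", "temperature": 10, "weather": "阴"},
    {"city": "上海", "temperature": 20, "weather": "晴"},
]

vec = DictVectorizer()                 # sparse=True by default
X_sparse = vec.fit_transform(records)  # SciPy sparse matrix

# A sparse matrix can be converted to a dense NumPy array when needed.
X_dense = X_sparse.toarray()
print(X_dense.shape)                   # (2, 5)

# inverse_transform maps every row back to a {feature name: value} dict.
restored = vec.inverse_transform(X_sparse)
print(restored[0])
```

The round trip is therefore lossy with respect to the original keys: to recover 'city': '北京' you would have to split the feature name on '=' yourself.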

1.2 Text data

English text: the CountVectorizer class in sklearn.feature_extraction.text
Chinese text: segment the text into words with jieba first

Count method:

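from sklearn.feature_extraction.text import CountVectorizer
# Sparse matrix form
cv = CountVectorizer()
data = ["Hiding from the rain and snow,Trying to forget but I won't let go.","In that misty morning when I saw your smiling face,You only looked at me and I was yours."]
print(cv.fit_transform(data))
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
print(cv.get_feature_names_out().tolist())
# Output:
#  (0, 7)	1  0: index of the document in data; 7: index of the word; 1: number of times the word occurs in that document
#   (0, 5)	1  word 5 occurs once
#   (0, 20)	1  word 20 occurs once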
#   (0, 15)	1
#   (0, 0)	1
#   (0, 18)	1
#   (0, 22)	1
#   (0, 21)	1
#   (0, 4)	1
#   (0, 2)	1
#   (0, 25)	1
#   (0, 9)	1
#   (0, 6)	1
#   (1, 0)	1   
#   (1, 8)	1
#   (1, 19)	1
#   (1, 12)	1
#   (1, 13)	1
#   (1, 24)	1
#   (1, 16)	1
#   (1, 27)	1
#   (1, 17)	1
#   (1, 3)	1
#   (1, 26)	1
#   (1, 14)	1
#   (1, 10)	1
#   (1, 1)	1
#   (1, 11)	1
#   (1, 23)	1
#   (1, 28)	1
# ['and', 'at', 'but', 'face', 'forget', 'from', 'go', 'hiding', 'in', 'let', 'looked', 'me', 'misty', 'morning', 'only', 'rain', 'saw', 'smiling', 'snow', 'that', 'the', 'to', 'trying', 'was', 'when', 'won', 'you', 'your', 'yours']
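
The feature list above contains no 'i', even though "I" occurs in both sentences: the default token pattern of CountVectorizer only keeps tokens of two or more characters, and text is lowercased first. The sketch below shows this on two made-up sentences (docs is a hypothetical toy input, not from the original post) and uses toarray() and vocabulary_ to make the word-to-column mapping easier to read than the sparse printout.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, chosen so the counts are easy to verify by hand.
docs = ["I saw a cat and a dog", "The cat saw the cat"]

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()  # dense array, shape (n_docs, n_terms)

# vocabulary_ maps each term to its column index; columns are in sorted term order.
vocab = cv.vocabulary_
print(sorted(vocab))   # ['and', 'cat', 'dog', 'saw', 'the']
print(counts)          # row 0: [1 1 1 1 0], row 1: [0 2 0 1 2]
```

The single-character tokens "i" and "a" are dropped, and "cat" is counted twice in the second document.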

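from sklearn.feature_extraction.text import CountVectorizer
import jieba
cv = CountVectorizer()
data = ["前尘往事成云烟，消散在彼此眼前。","只是因为在人群中多看了你一眼，再也没能忘掉你容颜。"]
print(cv.fit_transform(data))
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
print(cv.get_feature_names_out().tolist())
# Output: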
#   (0, 1)	1
#   (0, 3)	1
#   (1, 2)	1
#   (1, 0)	1
# ['再也没能忘掉你容颜', '前尘往事成云烟', '只是因为在人群中多看了你一眼', '消散在彼此眼前']

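This result is clearly not what we want. The reason is that sklearn does not segment Chinese text: CountVectorizer only splits on punctuation and whitespace, so each whole clause is treated as a single "word".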

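# Documents before processing: ["前尘往事成云烟，消散在彼此眼前。","只是因为在人群中多看了你一眼，再也没能忘掉你容颜。"]
# Documents after processing: ['前尘往事 成 云烟 ， 消散 在 彼此 眼前 。', '只是 因为 在 人群 中多 看 了 你 一眼 ， 再也 没 能 忘掉 你 容颜 。']
content = []
for i in data:
    con = jieba.cut(i)      # returns a generator of tokens
    con = list(con)
    c = ' '.join(con)       # join the tokens with spaces
    content.append(c)       # append adds the whole string as one element (unlike extend)
# As in English text, words must be separated by spaces; segmentation splits each sentence into words so they can be joined this way
print(cv.fit_transform(content).toarray())
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
print(cv.get_feature_names_out().tolist())
# Output: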
# [[0 0 1 0 0 1 0 0 0 1 0 1 1]
#  [1 1 0 1 1 0 1 1 1 0 1 0 0]]
# ['一眼', '中多', '云烟', '人群', '再也', '前尘往事', '只是', '因为', '容颜', '彼此', '忘掉', '消散', '眼前']
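
Single-character words such as 你 or 在 are missing from the vocabulary above because of the same two-character minimum in the default token pattern. If those words matter, one possible workaround is to pass a custom token_pattern. The sketch below is an assumption-laden illustration: it reuses the segmented strings shown in the comments above instead of calling jieba (so the result does not depend on the jieba version), and the names segmented and cv_single are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-segmented text copied from the jieba output shown above (tokens separated by spaces).
segmented = [
    "前尘往事 成 云烟 ， 消散 在 彼此 眼前 。",
    "只是 因为 在 人群 中多 看 了 你 一眼 ， 再也 没 能 忘掉 你 容颜 。",
]

# The default pattern r"(?u)\b\w\w+\b" requires at least two word characters.
# This pattern accepts tokens of one or more word characters; punctuation is still skipped.
cv_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = cv_single.fit_transform(segmented).toarray()
vocab = cv_single.vocabulary_

print(len(vocab))               # 20 terms instead of 13
print(counts[1, vocab["你"]])   # 2: 你 occurs twice in the second document
```

Whether keeping single-character words helps depends on the task; many of them are function words that a stop-word list would remove anyway.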

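TF-IDF method:
tf (term frequency): how often a term occurs in a document
idf (inverse document frequency): a measure of how important (distinctive) a term is across the whole collection of documents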

$tf=\frac{\text { number of times the term occurs in the document }}{\text { total number of terms in the document }}$

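$idf=\log _{10} \frac{\text { total number of documents }}{\text { number of documents containing the term }}$
For example, if there are 100 documents and the word "python" occurs in 10 of them, then $idf=\log _{10} \frac{100}{10}=1$. The larger the $idf$, the rarer the word is across documents and the more distinctive it is; a word that occurs in every document has $idf=0$.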
$tf\text{-}idf = tf \times idf$
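
The formulas above can be checked with a few lines of plain Python. This is a minimal sketch that follows the base-10 definition given here; the helper names (tf, idf, tf_idf) and the toy corpus are made up for illustration. It assumes every queried term occurs in at least one document, otherwise the division in idf fails.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with a base-10 logarithm, as defined above."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus: 100 tokenized documents, 10 of which contain "python".
corpus = [["python", "is", "fun"]] * 10 + [["java", "is", "verbose"]] * 90

idf_python = idf("python", corpus)           # log10(100 / 10) = 1.0
idf_is = idf("is", corpus)                   # log10(100 / 100) = 0.0
score = tf_idf("python", corpus[0], corpus)  # (1 / 3) * 1.0

print(idf_python, idf_is, score)
```

The word "is" appears in every document and gets a weight of zero, while the rarer "python" gets a positive weight. scikit-learn's TfidfVectorizer uses a different variant by default (natural logarithm, smoothing, and L2 normalization of each row), so its numbers will not match this hand calculation.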