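Introduction
Suppose we have an article like the one below:
Question: how should we extract features from it?
Words as features
Candidate units that could serve as features: sentences, phrases, words, letters
On balance, words are the most suitable unit to use as features.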
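To make "words as features" concrete, here is a minimal sketch that counts word occurrences using only the standard library. The helper `count_words` and the sample sentence are assumptions made for illustration; they are not part of the original notes.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercase word in `text` to its frequency."""
    # \w+ matches runs of letters, digits and underscores; punctuation and spaces split words
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence, used only for illustration
sample = "The cat saw the other cat"
word_counts = count_words(sample)
print(word_counts)
```

Each distinct word becomes one feature, and its count becomes the feature value for that piece of text.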
How to convert text data into feature values in sklearn
# -*- coding: utf-8 -*-
"""
@Time : 2021/3/7 16:13
@Author : yuhui
@Email : 3476237164@qq.com
@FileName: 09_文本特征提取CountVectorizer.py
@Software: PyCharm
"""
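from sklearn.feature_extraction.text import CountVectorizer
data = ["life is short,i like like python",
        "life life is too long,i dislike python"]
# Instantiate a transformer
transfer = CountVectorizer()
# Count how many times each feature word occurs in each sample
data_new = transfer.fit_transform(data)
print(data_new)  # fit_transform returns a sparse matrix
print(data_new.toarray())  # convert the sparse matrix to a dense array
# Inspect the feature names
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
print(transfer.get_feature_names_out().tolist())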
D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
(0, 2) 1
(0, 1) 1
(0, 6) 1
(0, 3) 2
(0, 5) 1
(1, 2) 2
(1, 1) 1
(1, 5) 1
(1, 7) 1
(1, 4) 1
(1, 0) 1
[[0 1 1 2 0 1 1 0]
[1 1 2 0 1 1 0 1]]
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
Process finished with exit code 0
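In the sparse output above, an entry such as `(0, 3) 2` reads as "sample 0, feature column 3, count 2". Note also that the single-letter word `i` does not appear among the feature names: with default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters. The following sketch, which assumes default parameters, inspects both points with `build_analyzer()` and the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = ["life is short,i like like python",
        "life life is too long,i dislike python"]

# The default analyzer lowercases the text and keeps tokens of 2+ word characters
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(data[0])
print(tokens)  # the single-letter "i" is dropped

# vocabulary_ maps each feature word to its column index in the count matrix
vocabulary = CountVectorizer().fit(data).vocabulary_
print(sorted(vocabulary.items(), key=lambda item: item[1]))
```

Column 3 maps to `like`, which is why the first sample has a count of 2 there.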
Think about it: what happens if we switch the data to Chinese?
D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
(0, 1) 1
(1, 0) 1
[[0 1]
[1 0]]
['天安门上太阳升', '我爱北京天安门']
Process finished with exit code 0
In this case we need to segment the raw data into words by hand.
D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
(0, 0) 1
(0, 1) 1
(1, 1) 1
(1, 2) 1
[[1 1 0]
[0 1 1]]
['北京', '天安门', '太阳']
Process finished with exit code 0
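Summary
Basic workflow for text feature extraction:
- Import the library
from sklearn.feature_extraction.text import CountVectorizer
- Instantiate a transformer
transfer=CountVectorizer()
- Call the method
data_new=transfer.fit_transform(data)
- Inspect the attributes
# Inspect the feature names (get_feature_names() was removed in scikit-learn 1.2)
print(transfer.get_feature_names_out().tolist())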
First review
# -*- coding: utf-8 -*-
"""
@Time : 2021/4/8 13:58
@Author : yuhui
@Email : 3476237164@qq.com
@FileName: 09_文本特征提取CountVectorizer_2.py
@Software: PyCharm
"""
from sklearn.feature_extraction.text import CountVectorizer
def text_feature_extraction():
    """Text feature extraction."""
    data = ["life is short,i like like python",
            "life life is too long,i dislike python"]
    transfer = CountVectorizer()
    data_new = transfer.fit_transform(data)
    print(data_new)  # data_new is a sparse matrix
    print(data)
    # Convert the sparse matrix to a dense array
    data_new_matrix = data_new.toarray()
    print(data_new_matrix)
    # Inspect the feature names
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
    print(transfer.get_feature_names_out().tolist())
if __name__ == '__main__':
    text_feature_extraction()
D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer_2.py
(0, 2) 1
(0, 1) 1
(0, 6) 1
(0, 3) 2
(0, 5) 1
(1, 2) 2
(1, 1) 1
(1, 5) 1
(1, 7) 1
(1, 4) 1
(1, 0) 1
['life is short,i like like python', 'life life is too long,i dislike python']
[[0 1 1 2 0 1 1 0]
[1 1 2 0 1 1 0 1]]
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
Process finished with exit code 0