09 - Text Feature Extraction with CountVectorizer

yuhui_2000

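Category: 黑马程序员3天快速入门Python机器学习 (Itheima: 3-Day Quick Start to Python Machine Learning)
Tags: text feature extraction, CountVectorizer, sklearn, Chinese word segmentation preprocessing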

Article link: https://blog.csdn.net/yuhui_2000/article/details/114487347

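Introduction

Suppose we have an article made up of ordinary running text.
[Image: sample article]
Question: how should we extract features from this data?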

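Words as features

Units that could serve as features: sentences, phrases, words, letters

Comparing these options, words are the most suitable unit to use as features.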
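
To make the idea of "words as features" concrete, here is a minimal pure-Python sketch of word counting. The helper name bag_of_words and the two toy sentences are made up for illustration; this is not how sklearn implements it, only the same idea in a few lines.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    # Naive tokenization: lowercase, then split on whitespace only.
    counts = [Counter(doc.lower().split()) for doc in docs]
    # The vocabulary is every distinct word, sorted alphabetically.
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each column corresponds to one word and each row to one document, which is the same layout CountVectorizer produces.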

How to turn text data into feature values with sklearn

# -*- coding: utf-8 -*-

"""
@Time    : 2021/3/7 16:13
@Author  : yuhui
@Email   : 3476237164@qq.com
@FileName: 09_文本特征提取CountVectorizer.py
@Software: PyCharm
"""

from sklearn.feature_extraction.text import CountVectorizer

data=["life is short,i like like python",
"life life is too long,i dislike python"]

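# Instantiate a transformer
transfer=CountVectorizer()

# Call fit_transform to count how often each feature word occurs in each sample
data_new=transfer.fit_transform(data)

print(data_new)  # returns a sparse matrix
print(data_new.toarray())  # convert the sparse matrix to a dense array

# Inspect the feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(transfer.get_feature_names_out().tolist())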

D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
  (0, 2)	1
  (0, 1)	1
  (0, 6)	1
  (0, 3)	2
  (0, 5)	1
  (1, 2)	2
  (1, 1)	1
  (1, 5)	1
  (1, 7)	1
  (1, 4)	1
  (1, 0)	1
[[0 1 1 2 0 1 1 0]
 [1 1 2 0 1 1 0 1]]
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

Process finished with exit code 0
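
Note that the word "i" does not appear among the feature names. CountVectorizer's documented default token_pattern only keeps tokens of two or more word characters. The sketch below applies that same regular expression with the standard re module to the first sample; it illustrates the pattern only and does not reproduce sklearn's full preprocessing.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token is a run of at least two word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "life is short,i like like python"
# CountVectorizer lowercases by default, so do the same here.
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)
```

The single-letter token "i" never matches the pattern, so it is dropped before counting.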

Think about it: what happens if we switch the data to Chinese?

D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
  (0, 1)	1
  (1, 0)	1
[[0 1]
 [1 0]]
['天安门上太阳升', '我爱北京天安门']

Process finished with exit code 0

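CountVectorizer splits text on whitespace and punctuation, so an unsegmented Chinese sentence is treated as one single "word". In this case we need to segment the raw data manually, for example by inserting spaces between words.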

D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer.py
  (0, 0)	1
  (0, 1)	1
  (1, 1)	1
  (1, 2)	1
[[1 1 0]
 [0 1 1]]
['北京', '天安门', '太阳']

Process finished with exit code 0

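Summary

The basic workflow for text feature extraction:

Import the library

from sklearn.feature_extraction.text import CountVectorizer

Instantiate a transformer

transfer=CountVectorizer()

Call the method

data_new=transfer.fit_transform(data)

Inspect the attributes

# Inspect the feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(transfer.get_feature_names_out().tolist())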

First review

# -*- coding: utf-8 -*-

"""
@Time    : 2021/4/8 13:58
@Author  : yuhui
@Email   : 3476237164@qq.com
@FileName: 09_文本特征提取CountVectorizer_2.py
@Software: PyCharm
"""

from sklearn.feature_extraction.text import CountVectorizer

def text_feature_extraction():
    """Text feature extraction"""
    data=["life is short,i like like python",
          "life life is too long,i dislike python"]
    transfer=CountVectorizer()
    data_new=transfer.fit_transform(data)
    print(data_new)
    print(data)

    # data_new is a sparse matrix
    # Convert the sparse matrix to a dense array
    data_new_matrix=data_new.toarray()
    print(data_new_matrix)

    # Inspect the attribute: feature names
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    print(transfer.get_feature_names_out().tolist())

if __name__ == '__main__':
    text_feature_extraction()

D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/09_文本特征提取CountVectorizer_2.py
  (0, 2)	1
  (0, 1)	1
  (0, 6)	1
  (0, 3)	2
  (0, 5)	1
  (1, 2)	2
  (1, 1)	1
  (1, 5)	1
  (1, 7)	1
  (1, 4)	1
  (1, 0)	1
['life is short,i like like python', 'life life is too long,i dislike python']
[[0 1 1 2 0 1 1 0]
 [1 1 2 0 1 1 0 1]]
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

Process finished with exit code 0
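
To read the sparse output: each line "(row, column)  count" says that the sample at index row contains the feature at index column that many times. The pure-Python sketch below copies those entries from the log above and rebuilds the dense matrix by hand; the names entries and dense are illustrative only.

```python
# (row, column) -> count, copied from the printed sparse matrix above.
entries = {
    (0, 2): 1, (0, 1): 1, (0, 6): 1, (0, 3): 2, (0, 5): 1,
    (1, 2): 2, (1, 1): 1, (1, 5): 1, (1, 7): 1, (1, 4): 1, (1, 0): 1,
}
feature_names = ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

n_rows = 2
# Start from all zeros, then fill in only the stored entries.
dense = [[0] * len(feature_names) for _ in range(n_rows)]
for (row, col), count in entries.items():
    dense[row][col] = count

print(dense)
for row, col in [(0, 3), (1, 2)]:
    print(f"sample {row} contains {feature_names[col]!r} {dense[row][col]} time(s)")
```

The rebuilt rows match the array printed by toarray(); the sparse form simply omits the zeros.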
