Python text feature extraction: getting started with machine learning using CountVectorizer and TfidfVectorizer

Latest recommended article published on 2022-08-29 12:05:17

cat12315

Tags: Python text feature extraction

Article link: https://blog.csdn.net/weixin_34943809/article/details/113671195

This article, part of a getting-started series on machine learning, introduces text feature extraction in Python with CountVectorizer and TfidfVectorizer and walks through a concrete example.

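Text feature extraction is the process of converting text data into feature vectors. The most commonly used representation is the bag-of-words model.

Bag of words: word order is ignored, and every word that occurs becomes its own feature column. The set of these distinct words is the vocabulary. Each text is counted against this (often very long) vocabulary, which yields a feature vector with one column per vocabulary word. Words that appear in nearly every text are usually marked as stop words and left out of the feature vector.

scikit-learn offers two main APIs for this: CountVectorizer and TfidfVectorizer.

CountVectorizer: considers only how often each word occurs in a text.

TfidfVectorizer: besides how often a word occurs in a text, it also takes into account how many texts contain that word. This reduces the influence of frequent but uninformative words and brings out more meaningful features. The more text entries there are, the more pronounced the benefit of TF-IDF becomes.

Below, both extraction methods are run with and without stop-word removal, a naive Bayes classifier is trained on each set of features, and the evaluation results are compared.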
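To make the two weighting schemes concrete, here is a minimal pure-Python sketch that builds count vectors and TF-IDF vectors by hand. The three-sentence corpus and the whitespace `tokenize` helper are hypothetical simplifications, not what the article's script uses. The smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the scaling of each row to unit length follow what scikit-learn documents as the TfidfVectorizer defaults, to the best of my understanding; treat the rest as an illustration rather than a reimplementation of the library.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]


def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real tokenizers)."""
    return text.lower().split()


# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in tokenize(doc)})


def count_vector(text):
    """Bag-of-words vector: one count per vocabulary word, word order ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


count_matrix = [count_vector(doc) for doc in docs]

# Document frequency: how many documents contain each vocabulary word.
n_docs = len(docs)
doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_vector(counts):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf_matrix = [tfidf_vector(row) for row in count_matrix]

the_col, mat_col = vocab.index("the"), vocab.index("mat")
print(vocab)
print(count_matrix[0])
print(idf[the_col], idf[mat_col])
```

In this corpus "the" occurs in every document, so its IDF is the minimum possible value of 1.0, while "mat" occurs in only one document and receives a larger per-occurrence weight. That is the sense in which TF-IDF discounts words that are frequent everywhere.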

Written for Python 3; the goal is to learn how to use the API.

Source code (Git): https://github.com/linyi0604/MachineLearning

Code:

from sklearn.datasets import fetch_20newsgroups
# sklearn.cross_validation is no longer available in current scikit-learn releases;
# train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

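'''
Text feature extraction:
    The process of converting text data into feature vectors.
    The most commonly used representation is the bag-of-words model.
    Bag of words:
        Word order is ignored; every word that occurs becomes its own feature column.
        The set of these distinct words is the vocabulary.
        Each text is counted against the (often very long) vocabulary,
        yielding a feature vector with many columns.
        Words that appear in nearly every text are usually marked as stop words
        and left out of the feature vector.

    Two main APIs implement this: CountVectorizer and TfidfVectorizer.
    CountVectorizer:
        Considers only how often each word occurs in a text.
    TfidfVectorizer:
        Besides how often a word occurs in a text, also takes into account
        how many texts contain that word.
        This reduces the influence of frequent but uninformative words
        and brings out more meaningful features.

        The more text entries there are, the more pronounced the benefit of TF-IDF.

    Below, both extraction methods are run with and without stop-word removal,
    a naive Bayes classifier is trained on each, and the results are compared.
'''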

# 1. Download the news data
news = fetch_20newsgroups(subset="all")

# 2. Split into training and test data
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

# 3.1 Extract feature vectors with plain counts (CountVectorizer)
# The default configuration does not remove stop words
count_vec = CountVectorizer()
x_count_train = count_vec.fit_transform(x_train)  # learn the vocabulary on the training set only
x_count_test = count_vec.transform(x_test)        # reuse that vocabulary for the test set
# Remove English stop words
count_stop_vec = CountVectorizer(analyzer='word', stop_words='english')
x_count_stop_train = count_stop_vec.fit_transform(x_train)
x_count_stop_test = count_stop_vec.transform(x_test)

# 3.2 Extract text feature vectors with TfidfVectorizer
# The default configuration does not remove stop words
tfid_vec = TfidfVectorizer()
x_tfid_train = tfid_vec.fit_transform(x_train)
x_tfid_test = tfid_vec.transform(x_test)
# Remove English stop words
tfid_stop_vec = TfidfVectorizer(analyzer='word', stop_words='english')
x_tfid_stop_train = tfid_stop_vec.fit_transform(x_train)
x_tfid_stop_test = tfid_stop_vec.transform(x_test)

# 4. Train a naive Bayes classifier on each set of extracted features and predict
# CountVectorizer features, stop words kept
mnb_count = MultinomialNB()
mnb_count.fit(x_count_train, y_train)  # train
mnb_count_y_predict = mnb_count.predict(x_count_test)  # predict
# CountVectorizer features, stop words removed
mnb_count_stop = MultinomialNB()
mnb_count_stop.fit(x_count_stop_train, y_train)  # train
mnb_count_stop_y_predict = mnb_count_stop.predict(x_count_stop_test)  # predict

# TfidfVectorizer features, stop words kept
mnb_tfid = MultinomialNB()
mnb_tfid.fit(x_tfid_train, y_train)  # train
mnb_tfid_y_predict = mnb_tfid.predict(x_tfid_test)  # predict
# TfidfVectorizer features, stop words removed
mnb_tfid_stop = MultinomialNB()
mnb_tfid_stop.fit(x_tfid_stop_train, y_train)  # train
mnb_tfid_stop_y_predict = mnb_tfid_stop.predict(x_tfid_stop_test)  # predict

# 5. Model evaluation
# classification_report expects (y_true, y_pred); passing the predictions first
# would swap the precision and recall columns.
# Models trained on CountVectorizer features
print("Accuracy with CountVectorizer features (stop words kept):", mnb_count.score(x_count_test, y_test))
print("More detailed evaluation metrics:\n", classification_report(y_test, mnb_count_y_predict))
print("Accuracy with CountVectorizer features (stop words removed):", mnb_count_stop.score(x_count_stop_test, y_test))
print("More detailed evaluation metrics:\n", classification_report(y_test, mnb_count_stop_y_predict))

# Models trained on TfidfVectorizer features
print("Accuracy with TfidfVectorizer features (stop words kept):", mnb_tfid.score(x_tfid_test, y_test))
print("More detailed evaluation metrics:\n", classification_report(y_test, mnb_tfid_y_predict))
print("Accuracy with TfidfVectorizer features (stop words removed):", mnb_tfid_stop.score(x_tfid_stop_test, y_test))
print("More detailed evaluation metrics:\n", classification_report(y_test, mnb_tfid_stop_y_predict))

'''
Recorded output of the original run. In that run the predictions were passed to
classification_report as the first argument, so the precision and recall columns
are swapped relative to the usual (y_true, y_pred) convention, and "support"
counts predicted labels rather than true labels. Accuracy is not affected.

Accuracy with CountVectorizer features (stop words kept): 0.8397707979626485
More detailed evaluation metrics:
              precision    recall  f1-score   support

          0       0.86      0.86      0.86       201
          1       0.86      0.59      0.70       365
          2       0.10      0.89      0.17        27
          3       0.88      0.60      0.72       350
          4       0.78      0.93      0.85       204
          5       0.84      0.82      0.83       271
          6       0.70      0.91      0.79       197
          7       0.89      0.89      0.89       239
          8       0.92      0.98      0.95       257
          9       0.91      0.98      0.95       233
         10       0.99      0.93      0.96       248
         11       0.98      0.86      0.91       272
         12       0.88      0.85      0.86       259
         13       0.94      0.92      0.93       252
         14       0.96      0.89      0.92       239
         15       0.96      0.78      0.86       285
         16       0.96      0.88      0.92       272
         17       0.98      0.90      0.94       252
         18       0.89      0.79      0.84       214
         19       0.44      0.93      0.60        75

avg / total       0.89      0.84      0.86      4712

Accuracy with CountVectorizer features (stop words removed): 0.8637521222410866
More detailed evaluation metrics:
              precision    recall  f1-score   support

          0       0.89      0.85      0.87       210
          1       0.88      0.62      0.73       352
          2       0.22      0.93      0.36        59
          3       0.88      0.62      0.73       341
          4       0.85      0.93      0.89       222
          5       0.85      0.82      0.84       273
          6       0.79      0.90      0.84       226
          7       0.91      0.91      0.91       239
          8       0.94      0.98      0.96       264
          9       0.92      0.98      0.95       236
         10       0.99      0.92      0.95       251
         11       0.97      0.91      0.93       254
         12       0.89      0.87      0.88       254
         13       0.95      0.94      0.95       248
         14       0.96      0.91      0.93       233
         15       0.94      0.87      0.90       250
         16       0.96      0.89      0.93       271
         17       0.98      0.95      0.97       238
         18       0.90      0.84      0.87       200
         19       0.53      0.91      0.67        91

avg / total       0.90      0.86      0.87      4712

Accuracy with TfidfVectorizer features (stop words kept): 0.8463497453310697
More detailed evaluation metrics:
              precision    recall  f1-score   support

          0       0.67      0.84      0.75       160
          1       0.74      0.85      0.79       218
          2       0.85      0.82      0.83       256
          3       0.88      0.76      0.82       275
          4       0.84      0.94      0.89       217
          5       0.84      0.96      0.89       229
          6       0.69      0.93      0.79       192
          7       0.92      0.84      0.88       259
          8       0.92      0.98      0.95       259
          9       0.91      0.96      0.94       238
         10       0.99      0.88      0.93       264
         11       0.98      0.73      0.83       321
         12       0.83      0.91      0.87       226
         13       0.92      0.97      0.95       231
         14       0.96      0.89      0.93       239
         15       0.97      0.51      0.67       443
         16       0.96      0.83      0.89       293
         17       0.97      0.92      0.95       245
         18       0.62      0.98      0.76       119
         19       0.16      0.93      0.28        28

avg / total       0.88      0.85      0.85      4712

Accuracy with TfidfVectorizer features (stop words removed): 0.8826400679117148
More detailed evaluation metrics:
              precision    recall  f1-score   support

          0       0.81      0.86      0.83       190
          1       0.81      0.85      0.83       238
          2       0.87      0.84      0.86       257
          3       0.88      0.78      0.83       269
          4       0.90      0.92      0.91       235
          5       0.88      0.95      0.91       243
          6       0.80      0.90      0.85       230
          7       0.92      0.89      0.90       244
          8       0.94      0.98      0.96       265
          9       0.93      0.97      0.95       242
         10       0.99      0.88      0.93       264
         11       0.98      0.85      0.91       273
         12       0.86      0.93      0.89       231
         13       0.93      0.96      0.95       237
         14       0.97      0.90      0.93       239
         15       0.96      0.70      0.81       320
         16       0.98      0.84      0.90       294
         17       0.99      0.92      0.95       248
         18       0.74      0.97      0.84       145
         19       0.29      0.96      0.45        48

avg / total       0.90      0.88      0.89      4712
'''
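In the recorded reports, the support for a given class changes from one model to the next (for class 2 it is 27, 59, 256 and 257), even though all four models are scored on the same test split. That pattern appears when the predicted labels, rather than the true labels, are passed as the first argument of the report. The sketch below shows the effect on a tiny, made-up two-class example; `precision_recall` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall, support) for one class label."""
    # True positives: positions where both sequences carry the label.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often the label was predicted
    actual = sum(1 for t in y_true if t == label)     # how often the label truly occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall, actual  # 'actual' plays the role of support


# Hypothetical labels for a tiny two-class problem.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

correct_order = precision_recall(y_true, y_pred, label=1)
swapped_order = precision_recall(y_pred, y_true, label=1)

print(correct_order)  # (0.5, 1.0, 2)
print(swapped_order)  # (1.0, 0.5, 4)
```

With the arguments swapped, precision and recall trade places and the support becomes the number of predicted instances of the class. The F1 score and the overall accuracy are symmetric in the two arguments, so they are unchanged.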

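This article was compiled and published by 职坐标; we hope it is helpful. For more, see the 职坐标 artificial intelligence and machine learning channel.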
