Python Text Feature Extraction | Getting Started with Machine Learning, The Road to Machine Learning: Python Text Feature Extraction with CountVectorizer, TfidfVectorizer...

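This article, part of the "Road to Machine Learning" beginner series, introduces text feature extraction in Python with CountVectorizer and TfidfVectorizer. It works through a concrete example and is meant to help readers who are getting started with machine learning.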

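Text feature extraction:
    The process of converting text data into feature vectors.
    The most commonly used representation of text features is the bag-of-words model.

Bag-of-words model:
    Word order is ignored; every word that appears becomes its own feature column.
    The set of these distinct words is the vocabulary.
    Each text can be counted against the (often very long) vocabulary, giving a feature vector with many columns.
    Words that appear in almost every text are usually marked as stop words and left out of the feature vector.
    Two main APIs implement this: CountVectorizer and TfidfVectorizer.

CountVectorizer:
    Considers only how often each word occurs in a text.

TfidfVectorizer:
    Besides how often a word occurs in a text, it also takes into account how many texts contain that word.
    This reduces the influence of frequent but uninformative words and brings out more meaningful features.

By comparison, the more text entries there are, the more noticeable the benefit of TF-IDF tends to be.

Below, both extraction methods are run with and without stop-word removal, a naive Bayes classifier is trained on each, and the evaluation results are compared.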
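To make the two weighting schemes concrete, here is a minimal pure-Python sketch on a made-up three-sentence corpus. It is not how scikit-learn is implemented: tokenization is a plain whitespace split, and the L2 normalization that TfidfVectorizer applies by default is left out. The IDF formula, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn documents for its default settings; the corpus and all variable names are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Simplified tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct word becomes one feature column.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Count vectors: how often each vocabulary word occurs in each document.
count_vectors = [
    [Counter(tokens)[word] for word in vocab] for tokens in tokenized
]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Smoothed inverse document frequency: words found in every document get 1.0,
# rarer words get a larger weight.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

# TF-IDF vectors (without the final L2 normalization).
tfidf_vectors = [
    [count * idf[word] for count, word in zip(row, vocab)]
    for row in count_vectors
]

for word in vocab:
    print(f"{word:>7}  df={df[word]}  idf={idf[word]:.3f}")
```

In this corpus "the" appears in every document, so its IDF is the minimum value and its high raw count is not amplified, while a word such as "mat" that appears in only one document receives a larger weight.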

Python 3; an exercise in using the API.

Source code (git): https://github.com/linyi0604/MachineLearning

Code:

from sklearn.datasets import fetch_20newsgroups
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

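'''
Text feature extraction:
    The process of converting text data into feature vectors.
    The most commonly used representation of text features is the bag-of-words model.
Bag-of-words model:
    Word order is ignored; every word that appears becomes its own feature column.
    The set of these distinct words is the vocabulary.
    Each text can be counted against the long vocabulary, giving a feature vector with many columns.
    Words that appear in almost every text are usually marked as stop words and left out of the feature vector.

Two main APIs implement this: CountVectorizer and TfidfVectorizer.
CountVectorizer:
    Considers only how often each word occurs in a text.
TfidfVectorizer:
    Besides how often a word occurs in a text, it also takes into account how many texts contain that word.
    This reduces the influence of frequent but uninformative words and brings out more meaningful features.

By comparison, the more text entries there are, the more noticeable the benefit of TF-IDF tends to be.

Below, both extraction methods are run with and without stop-word removal,
a naive Bayes classifier is trained on each, and the evaluation results are compared.
'''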

# 1 Download the news data
news = fetch_20newsgroups(subset="all")

# 2 Split into training and test data
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

# 3.1 Extract feature vectors with plain counts (CountVectorizer)
# The default configuration does not remove stop words
count_vec = CountVectorizer()
x_count_train = count_vec.fit_transform(x_train)
x_count_test = count_vec.transform(x_test)
# With stop words removed
count_stop_vec = CountVectorizer(analyzer='word', stop_words='english')
x_count_stop_train = count_stop_vec.fit_transform(x_train)
x_count_stop_test = count_stop_vec.transform(x_test)

# 3.2 Extract text feature vectors with TfidfVectorizer
# The default configuration does not remove stop words
tfid_vec = TfidfVectorizer()
x_tfid_train = tfid_vec.fit_transform(x_train)
x_tfid_test = tfid_vec.transform(x_test)
# With stop words removed
tfid_stop_vec = TfidfVectorizer(analyzer='word', stop_words='english')
x_tfid_stop_train = tfid_stop_vec.fit_transform(x_train)
x_tfid_stop_test = tfid_stop_vec.transform(x_test)

# 4 Train a naive Bayes classifier on each set of extracted features and predict
# Features from plain counts (CountVectorizer)
mnb_count = MultinomialNB()
mnb_count.fit(x_count_train, y_train)   # train
mnb_count_y_predict = mnb_count.predict(x_count_test)   # predict
# With stop words removed
mnb_count_stop = MultinomialNB()
mnb_count_stop.fit(x_count_stop_train, y_train)   # train
mnb_count_stop_y_predict = mnb_count_stop.predict(x_count_stop_test)    # predict

# Features from TfidfVectorizer
mnb_tfid = MultinomialNB()
mnb_tfid.fit(x_tfid_train, y_train)   # train
mnb_tfid_y_predict = mnb_tfid.predict(x_tfid_test)   # predict
# With stop words removed
mnb_tfid_stop = MultinomialNB()
mnb_tfid_stop.fit(x_tfid_stop_train, y_train)   # train
mnb_tfid_stop_y_predict = mnb_tfid_stop.predict(x_tfid_stop_test)    # predict

# 5 Model evaluation
# Note: classification_report expects (y_true, y_pred), in that order.
# Evaluate the models trained on CountVectorizer features
print("Accuracy with CountVectorizer features, stop words kept:", mnb_count.score(x_count_test, y_test))
print("More detailed metrics:\n", classification_report(y_test, mnb_count_y_predict))
print("Accuracy with CountVectorizer features, stop words removed:", mnb_count_stop.score(x_count_stop_test, y_test))
print("More detailed metrics:\n", classification_report(y_test, mnb_count_stop_y_predict))

# Evaluate the models trained on TfidfVectorizer features
print("Accuracy with TfidfVectorizer features, stop words kept:", mnb_tfid.score(x_tfid_test, y_test))
print("More detailed metrics:\n", classification_report(y_test, mnb_tfid_y_predict))
print("Accuracy with TfidfVectorizer features, stop words removed:", mnb_tfid_stop.score(x_tfid_stop_test, y_test))
print("More detailed metrics:\n", classification_report(y_test, mnb_tfid_stop_y_predict))

'''
Recorded output. These reports were produced by calling
classification_report(y_pred, y_true), so the precision and recall columns
are swapped relative to scikit-learn's (y_true, y_pred) convention, and
"support" counts predicted labels rather than true labels.

Accuracy with CountVectorizer features, stop words kept: 0.8397707979626485
More detailed metrics:
              precision    recall  f1-score   support

          0       0.86      0.86      0.86       201
          1       0.86      0.59      0.70       365
          2       0.10      0.89      0.17        27
          3       0.88      0.60      0.72       350
          4       0.78      0.93      0.85       204
          5       0.84      0.82      0.83       271
          6       0.70      0.91      0.79       197
          7       0.89      0.89      0.89       239
          8       0.92      0.98      0.95       257
          9       0.91      0.98      0.95       233
         10       0.99      0.93      0.96       248
         11       0.98      0.86      0.91       272
         12       0.88      0.85      0.86       259
         13       0.94      0.92      0.93       252
         14       0.96      0.89      0.92       239
         15       0.96      0.78      0.86       285
         16       0.96      0.88      0.92       272
         17       0.98      0.90      0.94       252
         18       0.89      0.79      0.84       214
         19       0.44      0.93      0.60        75

avg / total       0.89      0.84      0.86      4712

Accuracy with CountVectorizer features, stop words removed: 0.8637521222410866
More detailed metrics:
              precision    recall  f1-score   support

          0       0.89      0.85      0.87       210
          1       0.88      0.62      0.73       352
          2       0.22      0.93      0.36        59
          3       0.88      0.62      0.73       341
          4       0.85      0.93      0.89       222
          5       0.85      0.82      0.84       273
          6       0.79      0.90      0.84       226
          7       0.91      0.91      0.91       239
          8       0.94      0.98      0.96       264
          9       0.92      0.98      0.95       236
         10       0.99      0.92      0.95       251
         11       0.97      0.91      0.93       254
         12       0.89      0.87      0.88       254
         13       0.95      0.94      0.95       248
         14       0.96      0.91      0.93       233
         15       0.94      0.87      0.90       250
         16       0.96      0.89      0.93       271
         17       0.98      0.95      0.97       238
         18       0.90      0.84      0.87       200
         19       0.53      0.91      0.67        91

avg / total       0.90      0.86      0.87      4712

Accuracy with TfidfVectorizer features, stop words kept: 0.8463497453310697
More detailed metrics:
              precision    recall  f1-score   support

          0       0.67      0.84      0.75       160
          1       0.74      0.85      0.79       218
          2       0.85      0.82      0.83       256
          3       0.88      0.76      0.82       275
          4       0.84      0.94      0.89       217
          5       0.84      0.96      0.89       229
          6       0.69      0.93      0.79       192
          7       0.92      0.84      0.88       259
          8       0.92      0.98      0.95       259
          9       0.91      0.96      0.94       238
         10       0.99      0.88      0.93       264
         11       0.98      0.73      0.83       321
         12       0.83      0.91      0.87       226
         13       0.92      0.97      0.95       231
         14       0.96      0.89      0.93       239
         15       0.97      0.51      0.67       443
         16       0.96      0.83      0.89       293
         17       0.97      0.92      0.95       245
         18       0.62      0.98      0.76       119
         19       0.16      0.93      0.28        28

avg / total       0.88      0.85      0.85      4712

Accuracy with TfidfVectorizer features, stop words removed: 0.8826400679117148
More detailed metrics:
              precision    recall  f1-score   support

          0       0.81      0.86      0.83       190
          1       0.81      0.85      0.83       238
          2       0.87      0.84      0.86       257
          3       0.88      0.78      0.83       269
          4       0.90      0.92      0.91       235
          5       0.88      0.95      0.91       243
          6       0.80      0.90      0.85       230
          7       0.92      0.89      0.90       244
          8       0.94      0.98      0.96       265
          9       0.93      0.97      0.95       242
         10       0.99      0.88      0.93       264
         11       0.98      0.85      0.91       273
         12       0.86      0.93      0.89       231
         13       0.93      0.96      0.95       237
         14       0.97      0.90      0.93       239
         15       0.96      0.70      0.81       320
         16       0.98      0.84      0.90       294
         17       0.99      0.92      0.95       248
         18       0.74      0.97      0.84       145
         19       0.29      0.96      0.45        48

avg / total       0.90      0.88      0.89      4712
'''
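The script above repeats the same fit_transform, transform, fit and predict steps for each of the four configurations. One possible way to bundle a vectorizer and a classifier into a single object, sketched here and not taken from the original source, is scikit-learn's make_pipeline. The tiny training and test texts and their labels are made up purely to keep the sketch self-contained; with real data they would be replaced by the x_train, x_test, y_train and y_test splits. The report call passes the true labels first and the predictions second, which is the order classification_report expects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; stands in for the 20 newsgroups split used above.
train_texts = [
    "the rocket launched into orbit around the moon",
    "nasa plans a new mission to mars",
    "the team scored a goal in the final match",
    "the player hit a home run in the game",
]
train_labels = ["space", "space", "sport", "sport"]
test_texts = ["a rocket mission to the moon", "the team won the match"]
test_labels = ["space", "sport"]

# The pipeline fits the vectorizer on the training texts only and reuses the
# learned vocabulary when transforming the test texts.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)
accuracy = model.score(test_texts, test_labels)

# Argument order matters: true labels first, predicted labels second.
report = classification_report(test_labels, predicted)

print("accuracy:", accuracy)
print(report)
```

Swapping TfidfVectorizer for CountVectorizer, or dropping the stop_words argument, gives the other three configurations compared in this article without changing the rest of the code.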

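This article was compiled and published by 职坐标 (Zhizuobiao); we hope it is helpful. For more, follow the Zhizuobiao artificial intelligence and machine learning channel!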
