Text Classification in Python Using the SVM Algorithm

Latest recommended article published on 2024-01-10 15:26:57

wangyajie_11

Categories: Natural Language Processing, Machine Learning, Python. Tags: Natural Language Processing

Description

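The training set consists of review texts, each labeled with one of three classes: pos, neu, or neg. In train.csv the text is in the content column and the class in the label column; the code reads them from row indices 1 and 2, so index 0 holds a leading column that is not used. You can train an SVC on its own and then predict, or use a Pipeline to chain training and prediction together.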
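The SVC penalty parameter C defaults to 1.0. A larger C penalizes misclassified training samples more heavily and pushes the model toward classifying every training sample correctly; accuracy on the training set is then high, but generalization tends to be weaker. A smaller C penalizes misclassification less and tolerates some errors, which tends to generalize better.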
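For reference, the role of C is explicit in the standard soft-margin SVM objective. The notation here (weights $w$, bias $b$, slack variables $\xi_i$, labels $y_i \in \{-1, +1\}$) is the usual textbook one for the binary, linear case and is not taken from the article:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,\qquad \xi_i \ge 0,\quad i=1,\dots,n
```

Each $\xi_i$ measures how far sample $i$ violates the margin. A larger C makes every unit of slack more expensive, so the optimizer prefers fitting the training points over keeping a wide margin; a smaller C accepts some slack in exchange for a wider margin.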
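Although TF-IDF weights are very widely used, they do not give good performance on every text-weighting problem. On some problems, Boolean weights (1 if a word occurs in a document, 0 if it does not) perform better. This is enabled by passing binary=True to CountVectorizer.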
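A minimal sketch of the difference between count weights and Boolean weights, assuming scikit-learn is installed. The two pre-segmented sentences are made up for illustration and are not from the article's data set:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented sentences (tokens separated by spaces, like jieba output)
docs = ["很好 很好 喜欢", "一般 喜欢"]

count_vec = CountVectorizer()            # raw term counts
bool_vec = CountVectorizer(binary=True)  # 1 if the term occurs in the document, else 0

counts = count_vec.fit_transform(docs).toarray()
flags = bool_vec.fit_transform(docs).toarray()

# Compare the weight of the token that is repeated in the first sentence
print("count weight:", counts[0, count_vec.vocabulary_["很好"]])
print("bool weight: ", flags[0, bool_vec.vocabulary_["很好"]])
```

In the article's pipeline the same switch would be `('vect', CountVectorizer(binary=True))`; whether it helps has to be checked on the actual data. Note that the default `token_pattern` of CountVectorizer only keeps tokens of two or more characters, so single-character words are dropped unless the pattern is changed.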

Experiment

Code

# -*- coding: utf-8 -*-
import csv

import jieba
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

jieba.load_userdict('wordDict.txt')  # custom dictionary used by the segmenter


# Read the training set
def readtrain():
    with open('Train.csv', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        rows = [row for row in reader]
    content_train = [row[1] for row in rows[1:]]  # text content (index 1), header row skipped
    opinion_train = [row[2] for row in rows[1:]]  # class label (index 2), header row skipped
    print('The training set has %s sentences' % len(content_train))
    train = [content_train, opinion_train]
    return train


# Convert a list of UTF-8 byte strings to str; items that are already str are kept
def changeListCode(b):
    a = []
    for i in b:
        a.append(i.decode('utf8') if isinstance(i, bytes) else i)
    return a


# Segment every sentence in the list and join the tokens with spaces
def segmentWord(cont):
    c = []
    for i in cont:
        a = list(jieba.cut(i))
        b = " ".join(a)
        c.append(b)
    return c


# corpus = ["我 来到 北京 清华大学", "他 来到 了 网易 杭研 大厦", "小明 硕士 毕业 与 中国 科学院"]
train = readtrain()
content = segmentWord(train[0])
opinion = train[1]


# Split: the first 7000 sentences for training, the rest for testing
train_content = content[:7000]
test_content = content[7000:]
train_opinion = opinion[:7000]
test_opinion = opinion[7000:]


# Compute the weights
vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
# Build the term-count matrix first, then compute the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))
print(tfidf.shape)


# Standalone training and prediction (disabled)
'''
word = vectorizer.get_feature_names_out()
weight = tfidf.toarray()
# Classifier: fit on the training labels that match the rows of tfidf
clf = MultinomialNB().fit(tfidf, train_opinion)
docs = ["在 标准 状态 下 途观 的 行李厢 容积 仅 为 400 L", "新 买 的 锋驭 怎么 没有 随 车 灭火器"]
new_tfidf = tfidftransformer.transform(vectorizer.transform(docs))
predicted = clf.predict(new_tfidf)
print(predicted)
'''


# Training and prediction in one pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC(C=0.99, kernel='linear'))])
text_clf = text_clf.fit(train_content, train_opinion)
predicted = text_clf.predict(test_content)
print('SVC', np.mean(predicted == np.array(test_opinion)))  # accuracy on the held-out part
print(set(predicted))
# print(metrics.confusion_matrix(test_opinion, predicted))  # confusion matrix



# Parameter tuning with grid search (disabled)
'''
parameters = {'vect__max_df': (0.4, 0.5, 0.6, 0.7), 'vect__max_features': (None, 5000, 10000, 15000),
              'tfidf__use_idf': (True, False)}
grid_search = GridSearchCV(text_clf, parameters, n_jobs=1, verbose=1)
grid_search.fit(content, opinion)
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

'''
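To try the pipeline without Train.csv or jieba, the sketch below runs the same three steps (CountVectorizer, TfidfTransformer, linear SVC) on a tiny made-up corpus and prints the confusion matrix that the listing leaves commented out. The sentences, the labels, and the `labels` ordering are assumptions for illustration only; the numbers it prints say nothing about the real data set.

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical, already segmented toy corpus (not the article's data)
train_docs = [
    "质量 很好 非常 满意",
    "服务 很好 值得 推荐",
    "质量 很差 非常 失望",
    "服务 很差 不会 再买",
    "价格 一般 没有 特别",
    "外观 一般 还算 可以",
]
train_labels = ["pos", "pos", "neg", "neg", "neu", "neu"]
test_docs = ["物流 很好 满意", "包装 很差 失望", "做工 一般"]
test_labels = ["pos", "neg", "neu"]

# Same pipeline layout and SVC settings as in the listing
toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC(C=0.99, kernel="linear")),
])
toy_clf.fit(train_docs, train_labels)
toy_pred = toy_clf.predict(test_docs)

labels = ["neg", "neu", "pos"]  # fixes the row/column order of the matrix
cm = metrics.confusion_matrix(test_labels, toy_pred, labels=labels)
print("accuracy:", np.mean(toy_pred == np.array(test_labels)))
print(cm)  # rows: true class, columns: predicted class
```

A confusion matrix shows which classes are mistaken for which, something a single accuracy figure hides on a three-class problem.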
 

Output (recorded under Python 2)

Building prefix dict from the default dictionary ...
Loading model from cache c:\users\www\appdata\local\temp\jieba.cache
Loading model cost 0.383 seconds.
Prefix dict has been built succesfully.

The training set has 10981 sentences
(7000, 14688)
SVC 0.701582516956
set(['neg', 'neu', 'pos'])
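
One way to look for settings that improve on this accuracy is the grid search that the listing leaves disabled. The sketch below is a toy version, assuming a scikit-learn release in which `GridSearchCV` lives in `sklearn.model_selection`. It searches over `C` and the Boolean-weight switch discussed in the description rather than the article's `max_df`/`max_features` grid, and the corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical segmented corpus: four sentences per class (not the article's data)
docs = [
    "质量 很好 满意", "服务 很好 推荐", "做工 很好 喜欢", "物流 很快 满意",
    "质量 很差 失望", "服务 很差 投诉", "做工 很差 退货", "物流 很慢 失望",
    "质量 一般 还行", "服务 一般 凑合", "做工 一般 还行", "物流 一般 凑合",
]
y = ["pos"] * 4 + ["neg"] * 4 + ["neu"] * 4

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC(kernel="linear")),
])

# Parameter names follow the "<step name>__<parameter>" convention of Pipeline
param_grid = {
    "vect__binary": [False, True],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 keeps two sentences of every class in each fold of this tiny corpus
search = GridSearchCV(pipe, param_grid, cv=2, n_jobs=1)
search.fit(docs, y)

for name in sorted(param_grid):
    print("%s: %r" % (name, search.best_params_[name]))
print("best cross-validated accuracy: %.3f" % search.best_score_)
```

On the real data the search would be fitted on the training portion only, so that the held-out sentences still give an unbiased accuracy estimate.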