Data Mining in Practice (10): Using SVM with sklearn

bb8886

Article link: https://blog.csdn.net/bb8886/article/details/129696501

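This article introduces the basic principles of the support vector machine (SVM) and its main parameters, such as C and the kernel function. It then shows how to classify the MNIST dataset with an SVM and how to use an SVM to tell spam from normal email. For the email task, jieba is used for word segmentation and CountVectorizer to build the dataset; an SVM model is then trained and evaluated, and it classifies the emails with high accuracy.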

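1. Support Vector Machine (SVM)

Introduction

Definition: A support vector machine is, in its basic form, a binary classifier. Suppose we have data from two classes and the two classes happen to be separable by a line: every point on one side of the line belongs to one class and every point on the other side belongs to the other. The SVM's job is to find this line and use it for prediction. This resembles fitting a line in linear regression, except that the SVM looks for the best separating line rather than any line that happens to separate the data.

Best separating line: this is an optimization problem. Among all lines that separate the two classes, the SVM picks the one that maximizes the margin, i.e. the distance from the line to the closest training points of each class (the support vectors).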
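For linearly separable data with labels $y_i \in \{-1, +1\}$, the "best line" is usually written as the optimization problem below. This is the standard textbook hard-margin formulation rather than something taken from this post; $w$, $b$ and $x_i$ are the conventional symbols for the weight vector, the bias and the training samples.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The margin has width $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. The training points for which the constraint holds with equality are the support vectors.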

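Multiclass classification: create several SVM classifiers. One option is a one-vs-rest scheme: train one classifier per class, each time splitting the training data into two groups, the samples of that class and the samples of all other classes. To classify a new sample, pick the class whose classifier matches it best.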
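A minimal sketch of the one-vs-rest idea, assuming scikit-learn's OneVsRestClassifier wrapper and the bundled iris data as a stand-in (neither is used in the original post):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class toy dataset bundled with scikit-learn (stand-in data).
X, y = load_iris(return_X_y=True)

# One binary SVM per class: "this class" vs. "everything else".
ovr = OneVsRestClassifier(SVC(kernel='linear', C=1.0))
ovr.fit(X, y)

print(len(ovr.estimators_))   # number of fitted binary classifiers
print(ovr.predict(X[:3]))     # predicted class of the first three samples
```

Note that sklearn's SVC, when given more than two classes directly, trains its binary classifiers with a one-vs-one scheme internally; the wrapper above is only one way to get an explicit one-vs-rest setup.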

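Parameters

C parameter: controls how strongly misclassified training points are penalized. The larger C is, the narrower the margin, because the model tries to classify as many training points as possible correctly; the smaller C is, the wider the margin, at the cost of leaving some training points misclassified.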
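The effect of C on the margin can be checked on toy data. The sketch below assumes a linear kernel (so that the margin width can be read off as 2 / ||w||) and two overlapping clusters from make_blobs; the dataset and the two C values are illustrative choices, not taken from the post.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some points cannot be classified correctly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The small C should report the wider margin and, typically, more support vectors, since more points are allowed to sit inside the margin.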

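kernel parameter: specifies the kernel function. In sklearn's SVC it accepts 'linear', 'poly', 'rbf' (the default), 'sigmoid', 'precomputed', or a callable.

One limitation of the SVM in its basic form is that it can only classify linearly separable data. What if the data is not linearly separable? SVMs handle nonlinear problems with the kernel trick, which comes down to choosing or constructing a kernel function.

Nonlinear problems: if the data is not linearly separable, it is mapped into a higher-dimensional space, adding pseudo-features until it becomes linearly separable.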
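The idea of adding pseudo-features can be tried by hand. In this sketch, which assumes the concentric-rings toy data from make_circles, a linear SVM does poorly on the two original features but separates the classes once the squared distance from the origin is added as a third feature. The choice of extra feature is an illustration, not the only possible mapping.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_2d = SVC(kernel='linear').fit(X, y)
acc_2d = linear_2d.score(X, y)

# Add one extra feature, the squared distance from the origin.
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel='linear').fit(X_3d, y)
acc_3d = linear_3d.score(X_3d, y)

print(f"linear SVM, original 2 features: {acc_2d:.2f}")
print(f"linear SVM, with x1^2 + x2^2 added: {acc_3d:.2f}")
```

A kernel such as rbf achieves a similar effect without building the extra features explicitly.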

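Common kernel functions: the linear kernel (the dot product of two samples' feature vectors, which leads to a weighted sum of features plus a bias term), the polynomial kernel (the dot product raised to a higher power, for example 2), the Gaussian kernel (rbf), the Laplacian kernel, and the sigmoid kernel. A kernel measures how similar two samples are, and the SVM uses these similarities to classify new data.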
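To make the formulas concrete, the sketch below computes three of these kernels by hand for a single pair of vectors and compares them with the helpers in sklearn.metrics.pairwise. The vectors and the gamma value are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel)

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 0.5]])
gamma = 0.5  # illustrative value

# Linear kernel: plain dot product.
k_lin = x @ z.T
# Polynomial kernel of degree 2: (gamma * <x, z> + coef0) ** 2 with coef0 = 1.
k_poly = (gamma * (x @ z.T) + 1.0) ** 2
# Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2).
k_rbf = np.exp(-gamma * np.sum((x - z) ** 2))

print(k_lin, linear_kernel(x, z))
print(k_poly, polynomial_kernel(x, z, degree=2, gamma=gamma, coef0=1.0))
print(k_rbf, rbf_kernel(x, z, gamma=gamma))
```

Each line should print the hand-computed value next to the matching library result.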

2. Classifying the MNIST Dataset with SVM

Load the MNIST dataset

import numpy as np
X = np.load('./data/dataset.npy')
y = np.load('./data/class.npy')

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=14)
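The two .npy files are local to the author's setup. If they are not at hand, one possible stand-in is the small digits dataset bundled with scikit-learn (8x8 images, much smaller than MNIST). This substitution is an assumption made here for illustration, not part of the original post.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in for ./data/dataset.npy and ./data/class.npy (assumption):
# 1797 handwritten digits, each flattened to 64 pixel features.
X, y = load_digits(return_X_y=True)

# With no test_size given, train_test_split holds out 25% of the samples.
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=14)
print(X.shape, x_train.shape, x_test.shape)
```

Fixing random_state makes the split reproducible, which matters when comparing models across runs.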

Train with SVM

sklearn provides several SVM estimators, including svm.SVR, svm.SVC, svm.OneClassSVM and svm.NuSVR. SVC is the one used for classification here.

from sklearn import svm
predictor = svm.SVC(gamma='scale', C=1.0, decision_function_shape='ovr', kernel='rbf')
predictor.fit(x_train, y_train)
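A short sketch of what decision_function_shape changes, assuming the bundled 10-class digits data as a stand-in for the post's dataset. The parameter affects the shape of the scores returned by decision_function: one column per class with 'ovr', one column per pair of classes with 'ovo'.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 10 classes (stand-in data)

ovr = SVC(kernel='rbf', gamma='scale', decision_function_shape='ovr').fit(X, y)
ovo = SVC(kernel='rbf', gamma='scale', decision_function_shape='ovo').fit(X, y)

# 'ovr' reports one score per class, 'ovo' one score per pair of classes.
shape_ovr = ovr.decision_function(X[:5]).shape
shape_ovo = ovo.decision_function(X[:5]).shape
print(shape_ovr, shape_ovo)
```

With 10 classes there are 10 * 9 / 2 = 45 class pairs, which is where the second shape comes from.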

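Predict and evaluate with the F1 score

# Predict on the test set
result = predictor.predict(x_test)
# Evaluate; f1_score expects the true labels first, then the predictions
from sklearn.metrics import f1_score
print("F-score:{0:.2f}".format(f1_score(y_test, result, average='micro')))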
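The choice of average matters for a multiclass F1 score. As a small sketch with hand-made labels (illustrative values only): for single-label multiclass problems, the micro-averaged F1 score equals plain accuracy, while the macro average weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hand-made labels for three classes (illustrative values only).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

micro = f1_score(y_true, y_pred, average='micro')
macro = f1_score(y_true, y_pred, average='macro')
acc = accuracy_score(y_true, y_pred)
print(f"micro F1={micro:.3f}, macro F1={macro:.3f}, accuracy={acc:.3f}")
```

If per-class performance is of interest, average='macro' or average=None may be more informative than the micro average.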

3. Spam Detection with SVM

Load the dataset

import numpy as np
import pandas as pd
# Path to the spam emails
spam_data_path = './data/spam_5000.utf8'
# Path to the normal (ham) emails
ham_data_path = './data/ham_5000.utf8'

with open(spam_data_path, encoding='utf-8') as f:
    spam_txt_list = f.readlines()
with open(ham_data_path, encoding='utf-8') as f:
    ham_txt_list = f.readlines()

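Load stop words

Function words: human languages contain many function words. Compared with other words, function words carry little meaning of their own. The most common function words are determiners ("the", "a", "an", "that" and "those"), which help describe nouns and express concepts such as location or quantity. Prepositions such as "over", "under" and "above" express the relative position of two words.

Stop words: in information retrieval, function words go by another name, stop words. They are called stop words because text processing stops and discards them as soon as they are encountered. Dropping these words shrinks the index, makes retrieval more efficient, and usually improves retrieval quality as well. Stop-word lists mainly contain English characters, digits, mathematical symbols, punctuation marks, and very frequent single Chinese characters.

# Load the stop words
stop_word_path = './data/stopword.txt'
with open(stop_word_path, encoding='utf-8') as f:
    # Split on whitespace (assumes the file lists stop words separated by
    # newlines or spaces) and keep them in a set, so that `in` matches
    # whole words rather than substrings
    stop_words = set(f.read().split())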
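Whether the stop words are held as one string or as a collection of words changes what the `in` test means in the filtering step. A small sketch with a made-up three-word stop list:

```python
# Contents of a tiny, made-up stop-word file: one word per line.
stop_text = "不是\n的\n了\n"

as_string = stop_text.strip()        # one long string
as_set = set(stop_text.split())      # {'不是', '的', '了'}

token = "是"                          # not listed as a stop word
print(token in as_string)            # True: substring of "不是"
print(token in as_set)               # False: no exact match
```

With a plain string, any token that happens to be part of a longer stop word is filtered out as well; a set restricts the test to exact matches and is also faster to query.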

Word segmentation: splitting a continuous sequence of characters (a sentence) into individual words.

import jieba
a = "请不要把陌生人的些许善意，视为珍稀的瑰宝，却把身边亲近人的全部付出，当做天经地义的事情，对其视而不见"
cut_a = jieba.cut(a)
print(list(cut_a))

Running this prints the sentence as a list of segmented words.

Segment the spam and normal emails with jieba.

# Segment the spam and normal emails with jieba
import jieba
spam_words = []
# Spam emails
for spam_txt in spam_txt_list:
    words = []
    cut_txts = jieba.cut(spam_txt)
    for cut_txt in cut_txts:
        # Keep a token only if it is purely alphabetic (str.isalpha() is also
        # True for Chinese characters), is not a newline, and is not a stop word
        if cut_txt.isalpha() and cut_txt != '\n' and cut_txt not in stop_words:
            words.append(cut_txt)
    # Join the kept words back into a sentence
    sentence = ' '.join(words)
    spam_words.append(sentence)

# Normal (ham) emails
ham_words = []
for ham_txt in ham_txt_list:
    words = []
    cut_txts = jieba.cut(ham_txt)
    for cut_txt in cut_txts:
        if cut_txt.isalpha() and cut_txt != '\n' and cut_txt not in stop_words:
            words.append(cut_txt)
    sentence = ' '.join(words)
    ham_words.append(sentence)

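spam_words is a list with one string per email; each string holds that email's kept words joined by spaces.

Build a word cloud: use WordCloud to build the word cloud. text is a single string, and WordCloud automatically splits it on spaces or commas.

# Build the word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def showWordCloud(text):
    wc = WordCloud(
        background_color='white',
        max_words=200,
        # A font with Chinese glyphs is needed to display Chinese text
        font_path='./data/simhei.ttf',
        min_font_size=15,
        max_font_size=50,
        width=600
    )
    wordcloud = wc.generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Show the word cloud for the spam emails
showWordCloud(" ".join(spam_words))

# Show the word cloud for the normal emails
showWordCloud(" ".join(ham_words))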

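Build the dataset

After the previous steps we have the segmented text of every email. An SVM expects every sample to have the same number of features (the same features, possibly with different values), but for text this is clearly not the case: each email contains different words. Recall that in Data Mining in Practice (7), on social media mining with naive Bayes, we used the DictVectorizer transformer to turn feature dictionaries into a matrix. Here the data is a list of strings, so we use CountVectorizer to turn the list into a matrix.

# Build the dataset
from sklearn.feature_extraction.text import CountVectorizer
data = []
data.extend(ham_words)
data.extend(spam_words)
# binary defaults to False: a word may occur n times in a document;
# with binary=True every nonzero n is set to 1
# max_features: rank all words by corpus frequency in descending order
# and keep only the top max_features as the vocabulary
vectorizer = CountVectorizer(binary=False, max_features=1500)
result = vectorizer.fit_transform(data)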
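What CountVectorizer produces is easiest to see on a tiny example. The sketch below uses three made-up English "emails" in place of the segmented Chinese text; the words and counts are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up, already tokenized "emails" (words joined by spaces).
docs = ["free money now", "meeting now", "free free offer"]

counts = CountVectorizer(binary=False).fit(docs)
print(sorted(counts.vocabulary_.items(), key=lambda kv: kv[1]))
print(counts.transform(docs).toarray())

# binary=True records only presence/absence of each word.
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(flags)

# max_features=2 keeps the two most frequent words in the corpus.
top2 = CountVectorizer(max_features=2).fit(docs)
print(sorted(top2.vocabulary_))
```

One thing to be aware of for Chinese text: the default token_pattern only keeps tokens of two or more word characters, so single-character words are dropped unless a different token_pattern is passed.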

Next we add the matching column names (optional; it does not affect training):

# The vocabulary is a dict: key = word, value = column index
vocabulary = vectorizer.vocabulary_
result = pd.DataFrame(result.toarray())
# Sort the entries by column index in ascending order
colnames = sorted(vocabulary.items(), key=lambda item: item[1])
colname = []
for i in colnames:
    colname.append(i[0])
result.columns = colname
print(result)  # rows [0, 5000) are normal emails, rows [5000, 10001) are spam

Training

# Training
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Label normal emails 1 and spam 0, in the same order as the rows of data
# (normal emails first, then spam)
labels = []
labels.extend(np.ones(len(ham_words)))
labels.extend(np.zeros(len(spam_words)))

x_train, x_test, y_train, y_test = train_test_split(result, labels, random_state=14)
predictor = SVC(gamma='scale', C=1, decision_function_shape='ovr', kernel='rbf')
predictor.fit(x_train, y_train)
predictor_label = predictor.predict(x_test)
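In the steps above the vectorizer sees every email before the data is split. One alternative, sketched here on made-up texts, is to put the vectorizer and the classifier into a single pipeline so that the vocabulary is learned from the training part only. The pipeline and the toy texts are assumptions for illustration and not the post's own setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up, pre-tokenized texts standing in for ham_words / spam_words.
ham = ["meeting schedule project report", "lunch tomorrow team office"] * 10
spam = ["free prize click winner", "cheap loan offer click"] * 10
texts = ham + spam
labels = [1] * len(ham) + [0] * len(spam)   # 1 = normal email, 0 = spam

# The vectorizer is refit on the training part of every fold.
model = make_pipeline(CountVectorizer(),
                      SVC(kernel='rbf', gamma='scale', C=1.0))
scores = cross_val_score(model, texts, labels, cv=5, scoring='f1')
print(scores.mean())
```

On real data the difference from fitting the vectorizer up front is often small, but the pipeline keeps the evaluation free of information from the held-out emails.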

Evaluation

# Evaluation: accuracy on the held-out test set
print("the accuracy is:", np.mean(predictor_label == y_test))
# Evaluate with cross-validation
from sklearn.model_selection import cross_val_score
predictor = SVC(gamma='scale', C=1.0, decision_function_shape='ovr', kernel='rbf')
scores = cross_val_score(predictor, result, labels, scoring='f1')
print("Score: {}".format(np.mean(scores)))
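With scoring='f1', the class labelled 1 is treated as the positive class, which in this labelling means the normal emails. The sketch below, with made-up predictions, shows that the F1 score differs depending on which class counts as positive.

```python
from sklearn.metrics import f1_score

# Illustrative labels using the same coding as the post: 1 = normal, 0 = spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

f1_ham = f1_score(y_true, y_pred)                 # pos_label defaults to 1
f1_spam = f1_score(y_true, y_pred, pos_label=0)   # treat spam as positive
print(f"F1 (normal email as positive): {f1_ham:.3f}")
print(f"F1 (spam as positive):         {f1_spam:.3f}")
```

If spam is the class of interest, one option is to pass make_scorer(f1_score, pos_label=0) from sklearn.metrics as the scoring argument.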

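The script prints the test-set accuracy and the mean cross-validated F1 score.

Evaluating with different numbers of features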

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Label normal emails 1 and spam 0 (normal emails come first in data)
labels = []
labels.extend(np.ones(len(ham_words)))
labels.extend(np.zeros(len(spam_words)))

# The two axes of the plot
scores = []
indexs = []

data = []
data.extend(ham_words)
data.extend(spam_words)

for i in range(1, 3000, 50):
    # Vectorizer restricted to the i most frequent words
    vectorizer = CountVectorizer(binary=False, max_features=i)
    result = vectorizer.fit_transform(data)

    # Split into training and test sets
    train, test, trainlabel, testlabel = train_test_split(result, labels, random_state=14)
    predictor = SVC(gamma='scale', C=1.0, decision_function_shape='ovr', kernel='rbf')
    predictor.fit(train, trainlabel)
    score = predictor.score(test, testlabel)
    scores.append(score)
    indexs.append(i)