Text Classification with LIBSVM (Python)

Latest recommended article published 2024-06-19 23:40:42

城尘丶


Category: Text Data Processing

Article link: https://blog.csdn.net/a524633094/article/details/78310738


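The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik in 1995. It shows particular strengths on small-sample, nonlinear, and high-dimensional pattern recognition problems, and it extends to other machine learning tasks such as function fitting. For SVM theory, see jasper's blog.

LIBSVM is a simple, easy-to-use, fast, and effective software package for SVM pattern recognition and regression, developed by Professor Chih-Jen Lin (林智仁) of National Taiwan University and colleagues. It ships precompiled executables for Windows as well as the source code, which makes it easy to modify and to port to other operating systems. The package needs relatively little parameter tuning: it provides many default parameters, and those defaults solve many problems. It also offers cross validation. LIBSVM handles C-SVM, ν-SVM, ε-SVR, and ν-SVR problems, including multi-class pattern recognition based on the one-against-one approach. For details on the package, see the official LIBSVM documentation.

Before doing text classification you need a dataset. You can collect one via Baidu or download one from Sogou Labs (搜狗实验室); the download link for the corpus used in this post is at the end of the article.

Before writing any code, let's lay out the plan:

Split the dataset into a training set and a test set
Process the data to produce files in the input format the LIBSVM tools require
Train and test with the LIBSVM tools

Now let's carry out each step.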

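1. Determine the training set and the test set
Both the test set and the training set come from the corpus. For the corpus provided here, 50 articles are taken from each category, 500 in total, as the test set. Note that the test set and the training set are mutually exclusive: neither contains articles from the other, and together they make up the whole corpus.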
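A minimal sketch of such a split, written for Python 3 and assuming the corpus has already been listed as a mapping from category name to file names. The function name `split_corpus`, the `test_per_class` parameter, and the fixed random seed are illustrative assumptions and do not appear in the scripts of this post.

```python
import random

def split_corpus(files_by_class, test_per_class=50, seed=0):
    """Split each category's files into disjoint training and test lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    train, test = {}, {}
    for classname, files in files_by_class.items():
        shuffled = sorted(files)       # sort first so the result does not depend on input order
        rng.shuffle(shuffled)
        test[classname] = shuffled[:test_per_class]
        train[classname] = shuffled[test_per_class:]
    return train, test

# Toy corpus: 2 categories with 6 files each, 2 test files per category
corpus = {c: ['%s_%d.txt' % (c, i) for i in range(6)] for c in ('A', 'B')}
train, test = split_corpus(corpus, test_per_class=2)
```

Because every file lands in exactly one of the two slices, the training and test sets cannot overlap and their union is the full corpus.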
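2. Segment every article in the training set into words, filtering out stop words at the same time
Of the two passages below, the first is an original article and the second is the same text after segmentation and stop-word filtering.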

本报讯在辽宁省本溪市太子河中消失了近２０年的鱼儿，最近又游回了这座曾被世界环境组织称为“卫星上看不见”的城市。
经过近１０年环境综合治理和大规模建设生态农业，本溪市民生活质量大为改善。市区已建成３２．８平方公里的烟尘控制区，大气能见度明显好转，天开始变蓝，水开始变清。

本报讯辽宁省本溪市太子河中消失年鱼儿最近游回这座曾世界环境组织称为卫星看不见城市年环境综合治理大规模建设生态农业本溪市民生活质量大为改善市区已建成．平方公里烟尘控制区大气能见度明显好转天变蓝水开始变清
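The filtering half of this step can be sketched without any segmentation library. The example is written for Python 3; the tokens are a hand-segmented fragment of the sample sentence, and the stop-word set is a tiny made-up stand-in for stopword.txt. In the real pipeline the tokens would come from `jieba.cut`, and `filter_tokens` is a hypothetical helper name.

```python
def filter_tokens(tokens, stopwords):
    """Drop empty tokens and stop words, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stopwords]

# Hand-segmented fragment; the stop-word set is illustrative only
tokens = ['最近', '又', '游回', '了', '这座', '城市', '。', ' ']
stopwords = {'又', '了', '。'}
kept = filter_tokens(tokens, stopwords)
print(' '.join(kept))
```

Holding the stop words in a set keeps each membership test cheap, which matters when the list has hundreds of entries and every token is checked.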

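3. Count the overall document frequency DF and each category's total term frequency TF
While segmenting each article and filtering stop words, we also build a "big dictionary" and a set of "small dictionaries". The big dictionary is the overall dictionary formed by the words that appear in all training articles; a small dictionary is the dictionary formed by the words that appear in all articles of one category. The value looked up in the big dictionary is denoted df, and the value looked up in a small dictionary is denoted tf.
tf stands for term frequency; here it is the total number of times a word occurs within one category. idf stands for inverse document frequency, which is based on the reciprocal of the document frequency; the df counted here is therefore the document frequency itself, that is, the number of articles in which a word appears.
Note: make sure you understand what df and tf mean before counting the big and small dictionaries, because the two are counted differently. For the test set, tf is the number of times a word occurs in each article, while df still comes from the training set's big dictionary.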
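The two counting rules can be sketched as follows in Python 3, assuming each article is already a list of tokens. The function name `count_dictionaries` and the toy categories are illustrative assumptions, not taken from the scripts in this post.

```python
from collections import Counter

def count_dictionaries(docs_by_class):
    """Return (df, class_tf).

    df[word]            -> number of training articles containing the word
    class_tf[cls][word] -> total occurrences of the word within category cls
    """
    df = Counter()
    class_tf = {}
    for classname, docs in docs_by_class.items():
        tf = Counter()
        for tokens in docs:
            tf.update(tokens)          # every occurrence counts toward tf
            df.update(set(tokens))     # each article counts at most once toward df
        class_tf[classname] = tf
    return df, class_tf

docs_by_class = {
    'sports': [['ball', 'goal', 'ball'], ['goal', 'team']],
    'tech':   [['chip', 'team'], ['chip', 'chip']],
}
df, class_tf = count_dictionaries(docs_by_class)
```

The only difference between the two dictionaries is the `set(tokens)` call: repeating a word inside one article raises tf but not df.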
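4. Compute the feature vectors with the TF-IDF algorithm
$weight = 1.0 \times tf \times \log_{2}\frac{N}{df}$
Here weight is the weight of each word, tf and df are as defined above, and N is the total number of articles in the training set; when computing features for the test set, N is the total number of articles in the test set.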
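As a worked example of the formula, here is a small Python 3 sketch. The function name `tfidf_weight` and the sample numbers are assumptions chosen for illustration.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """weight = 1.0 * tf * log2(N / df), following the formula above."""
    return 1.0 * tf * math.log(n_docs / df, 2)

# A word seen 3 times that occurs in 5 of 40 articles: 3 * log2(8)
w_rare = tfidf_weight(tf=3, df=5, n_docs=40)
# A word seen 3 times that occurs in all 40 articles: 3 * log2(1)
w_common = tfidf_weight(tf=3, df=40, n_docs=40)
```

A word that appears in every article gets weight 0 no matter how often it occurs, while a word concentrated in few articles is weighted up.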
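Computing the weight of every word finally produces the feature vector file. [Figure in the original post: sample feature vector file]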
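Each line of a LIBSVM input file has the form `<label> <index>:<value> <index>:<value> ...`, with indices in ascending order. The Python 3 sketch below formats one document that way; `format_libsvm_line` is a hypothetical helper, and skipping zero weights is a choice made here (the sparse format allows it), whereas the script later in this post writes every index.

```python
def format_libsvm_line(label, weights):
    """Format one document as '<label> <index>:<value> ...'.

    weights maps a 1-based feature index to its weight; zero weights are
    skipped, which the sparse LIBSVM format allows.
    """
    parts = [str(label)]
    for index in sorted(weights):      # LIBSVM expects indices in ascending order
        if weights[index] != 0:
            parts.append('%d:%.6g' % (index, weights[index]))
    return ' '.join(parts)

line = format_libsvm_line(2, {3: 1.5, 1: 9.0, 2: 0.0})
print(line)
```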
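5. Train with the LIBSVM tools and obtain the results
Before this step, the training set should already have been converted into the training feature file featurefile.txt, and the test set into the test feature file txt_featurefile.txt

1  Scale the training feature file to [0, 1]; svm-scale writes sparse output, so useless values such as 0 are dropped;
svm-scale -l 0 -u 1 -s temp.txt featurefile.txt > feature_scale.txt
2  Scale the feature file built from the test set
svm-scale -l 0 -u 1 -s txt_temp.txt txt_featurefile.txt > txt_feature_scale.txt
3  Same goal as step 2, but where step 2 scales by the range given on the command line, this command reuses the scaling rules that step 1 saved to temp.txt (run only one of steps 2 and 3). Reusing the training ranges keeps both sets on the same scale, which is what the LIBSVM guide recommends.
svm-scale -r temp.txt txt_featurefile.txt > txt_feature_scale.txt
4  Train on the scaled feature file from step 1 to obtain the model file model.txt
svm-train -c 32.0 -g 0.0078125 feature_scale.txt model.txt
5  Use model.txt from step 4 to test the feature file from step 2 or 3, writing the results to result.txt
svm-predict txt_feature_scale.txt model.txt result.txt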
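To see what distinguishes steps 2 and 3, here is a Python 3 sketch of min-max scaling that stores the training ranges and reuses them on test data. It is a simplified illustration, not svm-scale's actual implementation: rows are assumed to be dense dictionaries from feature index to value, and the names `fit_ranges` and `scale_row` are hypothetical.

```python
def fit_ranges(rows):
    """Record the (min, max) of every feature index seen in the training rows."""
    ranges = {}
    for row in rows:
        for index, value in row.items():
            lo, hi = ranges.get(index, (value, value))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges

def scale_row(row, ranges, lower=0.0, upper=1.0):
    """Map each value to [lower, upper] using the stored training ranges."""
    scaled = {}
    for index, value in row.items():
        if index not in ranges:
            continue                   # feature never seen in training: dropped in this sketch
        lo, hi = ranges[index]
        if hi == lo:
            continue                   # constant feature: skipped in this sketch
        scaled[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
    return scaled

train_rows = [{1: 0.0, 2: 4.0}, {1: 10.0, 2: 8.0}]
ranges = fit_ranges(train_rows)
test_scaled = scale_row({1: 5.0, 2: 10.0}, ranges)
```

A test value outside the training range (feature 2 here) maps outside [0, 1]. That is expected when the training ranges are reused, and it keeps a given raw value meaning the same thing to the model in both sets.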

The code is as follows

# Segment the articles of every category, and build each category's dictionaries and the overall dictionary
#coding=GBK
import jieba
import os

# Collect the full path of every txt file in the corpus
def readfullnames():
    fullname_list = []
    for dirname in os.listdir('E:\python\untitled2\建模\训练\文本分类语料库'):
        for filename in os.listdir('E:\python\untitled2\建模\训练\文本分类语料库\\' + dirname):
            # Keep the path as a GBK byte string; mixing it with unicode raises UnicodeDecodeError in Python 2
            fullname = 'E:\python\untitled2\建模\训练\文本分类语料库\\' + dirname + '\\' + filename
            fullname_list.append(fullname)
    return fullname_list

# Build the path of each category's dictionary file (document counts)
def readfillnames():
    fillname_list = []
    for dirname in os.listdir('E:\python\untitled2\建模\训练\解词'):
        fillname = 'E:\python\untitled2\建模\训练\解词\\' + dirname + '\\' + dirname + '.txt'
        # Skip stray .txt files sitting directly in the folder (they would yield '...txt.txt')
        if 'txt.txt' not in fillname:
            fillname_list.append(fillname)
    return fillname_list

# Build the path of each category's term-frequency file
def readcipin_fullnames():
    fillname_list = []
    for dirname in os.listdir('E:\python\untitled2\建模\训练\解词'):
        fillname = 'E:\python\untitled2\建模\训练\解词\\' + dirname + '\\词频' + dirname + '.txt'
        # Skip stray .txt files sitting directly in the folder (they would yield '...txt.txt')
        if 'txt.txt' not in fillname:
            fillname_list.append(fillname)
    return fillname_list

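# Read the stop-word list
def read_stopwords():
    stopwords = set()
    ifs = open('E:\python\untitled2\stopword.txt', 'r')
    for line in ifs.readlines():
        # jieba yields unicode tokens, so decode the stop words for the comparison to work
        # (assumes stopword.txt is GBK-encoded like the corpus)
        stopwords.add(line.strip().decode('gbk'))
    ifs.close()
    return stopwords

# Write a dictionary to the given file, one "word : count" pair per line
def write_words(words_dic, dfs):
    for word, count in words_dic.items():
        # Keys produced by jieba are unicode; encode them so the file stays GBK
        if isinstance(word, unicode):
            word = word.encode('gbk')
        dfs.write('%s : %s\n' % (word, count))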

# Segment every txt file; for each category, count how many articles each word appears in and how often it occurs
def segfile(fullname_list):
    all_stopwords_list = read_stopwords()
    words_dic = {}                        # word -> number of articles in the category that contain it
    all_words = {}                        # word -> total occurrences in the category (term frequency)
    name_temp = None                      # category currently being accumulated

    # Write both dictionaries of one category into its folder
    def flush(dirname):
        dfs = open('E:\python\untitled2\建模\训练\解词\\' + dirname + '\\' + dirname + '.txt', 'w')
        write_words(words_dic, dfs)
        dfs.close()
        ddfs = open('E:\python\untitled2\建模\训练\解词\\' + dirname + '\\词频' + dirname + '.txt', 'w')
        write_words(all_words, ddfs)
        ddfs.close()

    for fullname in fullname_list:
        dirname = fullname.split('E:\python\untitled2\建模\训练\文本分类语料库\\')[1].split('\\')[0]

        # A new category starts: save the finished category's dictionaries and reset the counters
        if name_temp != dirname:
            if name_temp is not None:
                flush(name_temp)
            words_dic.clear()
            all_words.clear()
            name_temp = dirname

        filename = fullname.split('\\')[-1]
        print fullname.decode('gbk') + '==================================================='
        ifs = open(fullname, 'r')
        ofs = open('E:\python\untitled2\建模\训练\解词\\' + dirname + '\\' + filename, 'w')
        words_temp = set()                # distinct words of this article
        for line in ifs.readlines():
            line = line.strip()
            try:
                words = jieba.cut(line.decode('gbk'))
            except UnicodeDecodeError:
                continue                  # skip lines that are not valid GBK

            for w in words:
                if w.strip() == '':
                    continue
                if w in all_stopwords_list:
                    continue
                words_temp.add(w)
                all_words[w] = all_words.get(w, 0) + 1
                ofs.write(w.encode('gbk') + ' ')
            ofs.write('\n')
        ifs.close()
        ofs.close()

        # Each article contributes at most 1 to a word's document count
        for t in words_temp:
            words_dic[t] = words_dic.get(t, 0) + 1

    # Save the dictionaries of the last category
    if name_temp is not None:
        flush(name_temp)

# Build the overall dictionary (df) by summing the per-category dictionaries
def sumdic(fillname_list):
    dic = {}
    for path in fillname_list:
        print "Running, please wait..."
        dfs = open(path, 'r')
        for line in dfs.readlines():
            if ':' not in line:
                continue
            # Split on the last ':' so a word that itself contains ':' still parses
            key, value = line.rsplit(':', 1)
            key = key.strip()
            dic[key] = dic.get(key, 0) + int(value.strip())
        dfs.close()

    # Drop words that appear in fewer than 2 articles (the threshold the code uses)
    for t in dic.keys():
        if dic[t] < 2:
            del dic[t]

    afs = open('E:\python\untitled2\建模\训练\字典.txt','w')
    write_words(dic, afs)
    afs.close()

# Build the overall term-frequency dictionary by summing the per-category term frequencies
def sumcipindic():
    cipin_dic = {}
    cipin_fullnamelist = readcipin_fullnames()
    for path in cipin_fullnamelist:
        print "Please wait..."
        dfs = open(path, 'r')
        for line in dfs.readlines():
            if ':' not in line:
                continue
            key, value = line.rsplit(':', 1)
            key = key.strip()
            cipin_dic[key] = cipin_dic.get(key, 0) + int(value.strip())
        dfs.close()

    afs = open('E:\python\untitled2\建模\训练\词频字典.txt','w')
    write_words(cipin_dic, afs)
    afs.close()

if __name__ == '__main__':

    for dir in os.listdir('E:\python\untitled2\建模\训练\文本分类语料库'):
        if not os.path.exists('E:\python\untitled2\建模\训练\解词\\' + dir):
            os.mkdir('E:\python\untitled2\建模\训练\解词\\' + dir)

    fullname_list = readfullnames()
    fillname_list = readfillnames()
    segfile(fullname_list)
    sumdic(fillname_list)
    sumcipindic()

# Build the feature vector file from the big and small dictionaries
# coding=GBK
import os
import math

# Collect the full path of every segmented article
def get_fullname_list():
    path = r"E:\\python\\untitled2\\建模\\训练\\解词"
    fullname_list = []
    for dirname in os.listdir(path):
        new_path = path + r"\\" + dirname
        for filename in os.listdir(new_path):
            # The category folder also holds its two dictionary files; they are not articles
            if filename in (dirname + '.txt', '词频' + dirname + '.txt'):
                continue
            fullname_list.append(new_path + r'\\' + filename)
    return fullname_list

# Write a dictionary to the given file, one "word : count" pair per line
def write_words(words_dic, dfs):
    for word, count in words_dic.items():
        dfs.write('%s : %s\n' % (word, count))

# Count, for every word, the number of segmented articles that contain it
def getworddict():
    fullname_list = get_fullname_list()
    worddict = {}
    for fullname in fullname_list:
        print fullname
        ifs = open(fullname, 'r')
        wordset = set()
        for line in ifs.readlines():
            # Accumulate the words of every line; plain assignment would keep only the last line
            wordset.update(line.strip().split())
        ifs.close()
        for w in wordset:
            worddict[w] = worddict.get(w, 0) + 1
    dfs = open("E:\python\untitled2\建模\\ffile.txt", 'w')
    write_words(worddict, dfs)
    dfs.close()
    return worddict

# Read a "word : count" file into a dictionary
def read_count_dict(path):
    dic = {}
    afs = open(path, 'r')
    for line in afs.readlines():
        if ':' not in line:
            continue
        # Split on the last ':' so a word that itself contains ':' still parses
        key, value = line.rsplit(':', 1)
        dic[key.strip()] = int(value.strip())
    afs.close()
    return dic

# Read the overall dictionary (df)
# Note: the first script writes 字典.txt and 词频字典.txt under the 训练 folder; copy them here or adjust the paths
def get_worddict():
    return read_count_dict('E:\python\untitled2\建模/字典.txt')

# Read the overall term-frequency dictionary
def get_cipinworddict():
    return read_count_dict('E:\python\untitled2\建模/词频字典.txt')

# Read one category's term-frequency dictionary
# Note: the first script writes these files under the 训练 folder; copy them here or adjust the path
def get_class_cipinworddict(name_temp):
    return read_count_dict('E:\python\untitled2\建模\解词\\' + name_temp + '\\词频' + name_temp + '.txt')

# Assign a number to each category
def create_classname_dict():
    classname_dict = {}
    classname_dict['教师'] = 1
    classname_dict['科技'] = 2
    classname_dict['学生'] = 3
    classname_dict['管理'] = 4
    return classname_dict


# Create the feature file
def create_feature_file():
    classname_dict = create_classname_dict()
    worddict = get_worddict()
    cipin_dict = get_cipinworddict()
    # Sort the vocabulary so that feature indices are identical on every run
    array_worddict = sorted(worddict.keys())
    fullname_list = get_fullname_list()
    ofs = open("E:\python\untitled2\建模\训练\\featurefile.txt", 'w')
    temp_class = ""
    for fullname in fullname_list:
        dirname = fullname.split(r'E:\\python\\untitled2\\建模\\训练\\解词\\')[1].split('\\')[0]
        # Numeric label of the article's category (-1 if the category is not in the table)
        classno = classname_dict.get(dirname, -1)
        if temp_class != dirname:
            class_cipindict = get_class_cipinworddict(dirname)
            temp_class = dirname
            print dirname
        line_out = repr(classno) + ' '
        ifs_curfile = open(fullname, 'r')
        # Count how often each word occurs in this article (tf)
        file_worddict = {}
        for line in ifs_curfile.readlines():
            for w in line.rstrip().split():
                file_worddict[w] = file_worddict.get(w, 0) + 1
        ifs_curfile.close()

        for wordno in range(0, len(array_worddict)):
            w = array_worddict[wordno]
            tf = file_worddict.get(w, 0)
            # ctf: occurrences in this category; nctf: occurrences in all other categories
            # (both are used only by the alternative weighting that is disabled below)
            ctf = class_cipindict.get(w, 0)
            nctf = max(cipin_dict.get(w, ctf) - ctf, 1)
            df = worddict[w]
            # TF-IDF weight; 40.0 is N, the number of training articles, hard-coded here
            weight = 1.0 * tf * math.log((40.0 / df), 2)
            # Alternative weighting (disabled):
            # weight = (ctf/nctf)*math.log((40.0/df),2)
            line_out += repr(wordno + 1) + ':' + repr(weight) + ' '
        ofs.write(line_out.rstrip() + '\n')
    ofs.close()

if __name__ == '__main__':
    create_feature_file()
    #getworddict()
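Once svm-predict has written result.txt, the predictions can be checked against the labels in the test feature file. The Python 3 sketch below assumes svm-predict ran in its default mode, so result.txt holds one predicted label per line, and that the first token of each feature line is the true label. The function name `accuracy` and the sample lines are illustrative assumptions.

```python
def accuracy(feature_lines, predicted_lines):
    """Fraction of documents whose predicted label matches the true label."""
    truth = [line.split()[0] for line in feature_lines if line.strip()]
    preds = [line.strip() for line in predicted_lines if line.strip()]
    if len(truth) != len(preds):
        raise ValueError('feature file and result file differ in length')
    # Compare as numbers so that '1' and '1.0' count as the same label
    correct = sum(1 for t, p in zip(truth, preds) if float(t) == float(p))
    return correct / len(truth)

# Toy data standing in for txt_feature_scale.txt and result.txt
features = ['1 1:0.5 3:0.2', '2 2:0.7', '3 1:0.1', '2 3:0.9']
results = ['1', '2', '1', '2']
acc = accuracy(features, results)
```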

Download the LIBSVM learning materials
Baidu Cloud: link: http://pan.baidu.com/s/1i57suQp password: t109
