Chinese Text Classification

爱科研的徐博士

Category: [Algorithms] Natural Language

Article link: https://blog.csdn.net/u010626937/article/details/53561479

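0 Preface

My company recently needed a simple text classification algorithm, so I am writing these notes. Parts of this article may draw on other people's work, and I thank them here.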

1 Workflow

Text preprocessing
Feature selection
Classifier selection
Model training
Model evaluation
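
The five steps above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions: the documents are hypothetical and already segmented, feature selection keeps the tokens with the highest document frequency, and a small multinomial Naive Bayes classifier is swapped in because the post does not say which classifier was used. All function names are made up for this sketch.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, k):
    """Keep the k tokens that occur in the most documents (document frequency)."""
    # dict.fromkeys de-duplicates a document's tokens while keeping their order
    df = Counter(tok for doc in docs for tok in dict.fromkeys(doc))
    return {tok for tok, _ in df.most_common(k)}


def train_nb(docs, labels, vocab):
    """Train multinomial Naive Bayes with add-one smoothing."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(t for t in doc if t in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            t: math.log((token_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the label with the highest log posterior; unknown tokens are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)


# 1. Preprocessing: hypothetical documents, already cleaned and segmented
train_docs = [["球队", "比赛", "进球"], ["比赛", "冠军", "球员"],
              ["股票", "市场", "上涨"], ["银行", "股票", "利率"]]
train_labels = ["sports", "sports", "finance", "finance"]
test_docs = [["球员", "进球"], ["利率", "上涨"]]
test_labels = ["sports", "finance"]

# 2. Feature selection, 3-4. choose and train the classifier, 5. evaluate
vocab = select_features(train_docs, k=8)
model = train_nb(train_docs, train_labels, vocab)
predictions = [predict(model, doc) for doc in test_docs]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

On real data each step would be replaced by something sturdier (a proper segmenter, a tuned feature selector, a library classifier, a held-out test set), but the order of the steps stays the same.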

2 Text Preprocessing

First, load the files. The file type is up to you. One tip: under Python 2 on Windows, the input file path has to be converted to Unicode, in one of two ways:

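# Option 1 (Python 2 only): decode a UTF-8 byte-string path to unicode
path = unicode(path, 'utf-8')
# Option 2: write the path as a Unicode literal (backslashes escaped)
path = u'E:\\datamining\\...'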
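
In Python 3, `str` is already Unicode, so neither conversion should be needed. The sketch below writes and reads a file whose path contains Chinese characters; the directory and file names are hypothetical and live in a temporary directory.

```python
import os
import tempfile

# Hypothetical directory and file names, used only for this demonstration
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "数据挖掘", "样本.txt")
    os.makedirs(os.path.dirname(path))
    # Passing encoding explicitly avoids depending on the platform default
    with open(path, "w", encoding="utf-8") as f:
        f.write("中文文本分类")
    with open(path, encoding="utf-8") as f:
        content = f.read()
    is_text = isinstance(path, str)  # the path is an ordinary Unicode str

print(content, is_text)
```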

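We want the text to be as clean and easy to process as possible. If a file contains many spaces and line breaks, we can replace them with the empty string.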

string = '....'
string = string.replace('\n', '').replace(' ', '')
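
Chained `replace` calls only remove the characters they name. A minimal alternative, assuming Python 3 and that every kind of whitespace should go, is a single regular expression; `clean_text` is a hypothetical helper name.

```python
import re


def clean_text(text):
    """Remove all whitespace: spaces, tabs, newlines, and the full-width space U+3000."""
    return re.sub(r"\s+", "", text)


sample = "今天 天气\n很好\t。\u3000"
cleaned = clean_text(sample)
print(cleaned)
```

This suits pure Chinese text. If the corpus mixes in English, deleting every space would glue English words together, so replacing whitespace runs with a single space may be the safer choice there.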

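After cleaning, the text has to be segmented into words so that it can later be represented as vectors. I use jieba here; NLPIR also works. (If you know a good Chinese word segmentation tool for Python, please leave a comment. Many thanks!)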

import jieba
string = '.....'
words = jieba.cut(string)  # words is a generator; iterate over it or wrap it in list()
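
Because `jieba.cut` returns a generator, its result can be consumed only once. The sketch below shows the effect; the `except ImportError` branch is a stand-in added purely so the example still runs where jieba is not installed, and it does not perform real segmentation.

```python
try:
    import jieba
    cut = jieba.cut
except ImportError:
    def cut(text):
        # Stand-in (assumption for this sketch): yields one token per character
        return (ch for ch in text)

text = "我来到北京清华大学"
words = cut(text)          # a generator: nothing has been segmented yet
first_pass = list(words)   # consuming it produces the tokens
second_pass = list(words)  # the generator is now exhausted, so this is empty

print(first_pass)
print(second_pass)
```

If the tokens are needed more than once, store them in a list first, for example with `list(jieba.cut(text))` or `jieba.lcut(text)`.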

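After segmentation, stop words still need to be filtered out. Choose a stop word list that fits your project, or start from a general-purpose list and add project-specific stop words to it.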
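
A minimal sketch of that idea, assuming Python 3: load a general stop word list, extend it with project-specific entries, then filter the tokens. The helper names, the word lists, and the tokens are hypothetical; in practice the lines would come from a stop word file.

```python
def load_stopwords(lines):
    """Build a stop word set from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# A general-purpose list (here inline instead of read from a file)
general_stopwords = load_stopwords(["的", "了", "是", "", "在\n"])
# Project-specific additions on top of the general list
project_stopwords = general_stopwords | {"公司"}

tokens = ["我", "在", "公司", "的", "会议室", " ", "开会", "了"]
filtered = remove_stopwords(tokens, project_stopwords)
print(filtered)
```

A set is used because membership tests on it take constant time on average, which matters once the stop word list grows to thousands of entries.
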
Here is someone else's code:

# Written for Python 3.
import os
import re
import string
import time

import jieba

rootpath = "../转换后的文件"
os.chdir(rootpath)                  # relative paths below resolve from this directory
words_list = []                     # segmented text of each file
filename_list = []                  # file name of each sample
category_list = []                  # category label of each sample
all_words = {}                      # full vocabulary {'key': value}
# stop words, one per line
with open('../stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.rstrip() for line in f}
category = os.listdir(rootpath)     # list of categories
delEStr = string.punctuation + ' ' + string.digits
identify = str.maketrans('', '', delEStr)   # table that deletes ASCII punctuation, spaces, digits
###################################
#  Segment text, build vocabulary #
###################################
def fileWordProcess(contents):
    wordsList = []
    contents = re.sub(r'\s+', ' ', contents)   # collapse whitespace runs (incl. newlines, tabs) to one space
    contents = contents.translate(identify)    # delete ASCII punctuation, spaces, and digits
    for seg in jieba.cut(contents):
        if seg not in stopwords:               # remove stop words
            if seg != ' ':                     # remove spaces
                wordsList.append(seg)          # build the file's word list
    file_string = ' '.join(wordsList)
    return file_string

for categoryName in category:                  # loop over category folders; on macOS the first entry is a system file
    if categoryName == '.DS_Store':
        continue
    categoryPath = os.path.join(rootpath, categoryName)   # path of this category
    filesList = os.listdir(categoryPath)                  # all files in this category
    # segment every file
    for filename in filesList:
        if filename == '.DS_Store':
            continue
        starttime = time.perf_counter()
        # assumes UTF-8 files; change the encoding if yours differ
        with open(os.path.join(categoryPath, filename), encoding='utf-8') as f:
            contents = f.read()
        wordProcessed = fileWordProcess(contents)         # segmented content as a space-joined string
        # not done for now: filenameWordProcessed = fileWordProcess(filename)  # segment the file name as a separate feature
        # words_list.append((wordProcessed, categoryName, filename))  # training-set format: [(words of this file, category, file name)]
        words_list.append(wordProcessed)
        filename_list.append(filename)
        category_list.append(categoryName)
        endtime = time.perf_counter()
        print('Category: %s >>>> File: %s >>>> Load time: %.3f' % (categoryName, filename, endtime - starttime))
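
The listing declares `all_words` as the full vocabulary but never fills it. One way it could be populated, assuming Python 3 and that the value stored for each word is its frequency across all files (the listing does not say what the value is), is sketched below. The `words_list` here is a hypothetical stand-in for the space-joined strings built by the loop above.

```python
from collections import Counter

# Hypothetical stand-in for words_list: one space-joined string per file
words_list = ["球队 比赛 进球", "比赛 冠军", "股票 市场 比赛"]

all_words = Counter()
for file_string in words_list:
    # Each entry was joined with single spaces, so split() recovers the tokens
    all_words.update(file_string.split())

vocabulary_size = len(all_words)
most_common = all_words.most_common(1)
print(vocabulary_size, most_common)
```

A `Counter` is a `dict` subclass, so it still matches the `{'key': value}` shape noted in the listing, and its frequencies can feed the feature selection step later.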