Chinese Text Classification in Python

Latest recommended article published on 2024-08-02 17:55:05

XnCSD

Article link: https://blog.csdn.net/xncsd/article/details/86742377

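This article shows how to use Python's scikit-learn library for Chinese text classification. It covers corpus acquisition, data preprocessing (word segmentation and TF-IDF feature extraction), and model training and evaluation, using algorithms that include multinomial naive Bayes, random forest, and logistic regression.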

This article implements Chinese text classification in Python with the scikit-learn module.

Text classification

I. Preprocessing

1. Obtaining the corpus

The corpus is the reduced edition of the Sohu news data from the Sogou corpus: http://www.sogou.com/labs/resource/cs.php.

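Dataset description:
News data from 18 Sohu News channels (domestic, international, sports, society, entertainment, and others) covering June to July 2012. Each record provides the URL and the body text.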

Format:
Each record has the following format:

<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>

Note: HTML tags have been stripped from the content field, which holds the plain body text of the news article.
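
As a minimal sketch of how one record in this format could be read with Python's standard library, the snippet below parses a single `<doc>` element and extracts the hostname from its URL. The sample record is invented for illustration (it is not taken from the corpus), and `parse_record` is a hypothetical helper name, not part of the original script.

```python
from urllib.parse import urlparse
from xml.dom import minidom

# Made-up sample record in the documented format (not real corpus data).
SAMPLE = (
    "<doc>"
    "<url>http://sports.sohu.com/20120601/n000000001.shtml</url>"
    "<docno>0000000000000000-0000000000000000</docno>"
    "<contenttitle>Sample title</contenttitle>"
    "<content>Sample body text</content>"
    "</doc>"
)


def parse_record(xml_text):
    """Return the fields of one <doc> record as a dict (hypothetical helper)."""
    doc = minidom.parseString(xml_text).documentElement
    record = {}
    for tag in ("url", "docno", "contenttitle", "content"):
        nodes = doc.getElementsByTagName(tag)
        # A missing or empty element has no text child, so fall back to "".
        child = nodes[0].firstChild if nodes else None
        record[tag] = child.data if child is not None else ""
    return record


record = parse_record(SAMPLE)
# The hostname identifies the channel the article was published under.
host = urlparse(record["url"]).hostname
print(host, record["contenttitle"])
```

A single record like this is well-formed XML on its own. A file that concatenates many records is not, because it has several top-level elements.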

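After downloading, extract the archive into the SogouCS.reduced folder. The downloaded text is in XML format and has to be parsed into plain text. The parsing follows this blog post: http://www.sohu.com/a/147504203_609569. Note that the raw data has no root node and contains some special characters that must be removed, so a few format-cleanup steps are applied first. The code is shown below; save it as sougou_text.py: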

#!/usr/bin/python
# -*- encoding:utf-8 -*-


import os
from xml.dom import minidom
from urllib.parse import urlparse
import glob
from queue import Queue, Empty
from threading import Thread, Lock

THREADLOCK = Lock()
# Directory where the parsed text files are saved
corpus_dir = './SogouCS.corpus/'


def file_format(from_file, to_file):
    """Clean up the format of a downloaded file; return True on success."""
    try:
        # The raw files must be opened as gb18030
        with open(from_file, 'r', encoding='gb18030') as rf:
            lines = rf.readlines()
        # The XML has no root node, so add one
        lines.insert(0, '<data>\n')
        lines.append('</data>')
        with open(to_file, 'w', encoding='utf-8') as wf:
            for l in lines:
                # A bare '&' is not valid XML, so drop it
                l = l.replace('&', '')
                wf.write(l)
        return True
    except UnicodeDecodeError:
        print("Decoding failed:", from_file)
        return False


def praser_handler(q: Queue):
    # Map each URL hostname to a category name
    dicurl = {
        'auto.sohu.com': 'qiche', 'it.sohu.com': 'hulianwang', 'health.sohu.com': 'jiankang',
        'sports.sohu.com': 'tiyu', 'travel.sohu.com': 'lvyou', 'learning.sohu.com': 'jiaoyu',
        'cul.sohu.com': 'wenhua', 'mil.news.sohu.com': 'junshi', 'business.sohu.com': 'shangye',
        'house.sohu.com': 'fangchan', 'yule.sohu.com': 'yule', 'women.sohu.com': 'shishang',
        'media.sohu.com': 'chuanmei', 'gongyi.sohu.com': 'gongyi', '2008.sohu.com': 'aoyun'}
    while True:
        try:
            # get_nowait avoids blocking forever if another thread took the last file
            file = q.get_nowait()
        except Empty:
            break
        with THREADLOCK:
            print("File: " + file)
        file_code = file.split('.')[-2]
        # Clean up the format in place; skip files that could not be decoded
        if not file_format(file, file):
            continue
        doc = minidom.parse(file)
        root = doc.documentElement
        claimtext = root.getElementsByTagName("content")
        claimurl = root.getElementsByTagName("url")
        textnum = len(claimurl)
        for index in range(textnum):
            # Skip records with an empty body
            if claimtext[index].firstChild is None:
                continue
            url = urlparse(claimurl[index].firstChild.data)
            if url.hostname in dicurl:
                out_dir = corpus_dir + dicurl[url.hostname]
                # exist_ok avoids an error when two threads create the same folder
                os.makedirs(out_dir, exist_ok=True)
                with open(out_dir + "/%s_%d.txt" % (file_code, index), "wb") as fp_in:
                    fp_in.write(claimtext[index].firstChild.data.encode('utf8'))


def sougou_text_praser(org_dir):
    # Process the files with 8 threads
    q = Queue()
    for file in glob.glob(org_dir + '*.txt'):
        q.put(file)
    threads = [Thread(target=praser_handler, args=(q,)) for _ in range(8)]
    for t in threads:
        t.start()
    # Wait until every worker has finished, not just until the queue is empty
    for t in threads:
        t.join()

def files_count(corpus_dir):
    # Count the text files in each category
    folders = os.listdir(corpus_dir)
    total = 0
    for folder in folders:
        if folder.startswith('.DS'):
            continue
        fpath = os.path.join(corpus_dir, folder)
        files = os.listdir(fpath)
        num = len(files)
        total += num
        print(folder