Processing the Fudan University Chinese Text Classification Dataset

王学强_Bryan

Last modified: 2022-05-01 15:43:51

Tags: classification, python, machine learning, natural language processing, Chinese word segmentation

First published: 2022-05-01 12:52:22

Article link: https://blog.csdn.net/qq_45488132/article/details/124524478

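1. Processing the raw data

The raw data is laid out as shown in Figure 1.1: each folder holds a varying number of .txt files, and each file is one document.

Figure 1.1: Raw data layout

First build the list of class folders: use the os module to collect every subfolder of the raw data directory, where each subfolder is one class.

import os

# Walk the data directory; each subfolder is one class
data_dir = '复旦大学中文文本分类数据集/原始数据'
file_list = []
for root, dirs, files in os.walk(data_dir):
    file_list.append(root)

# The first entry is data_dir itself, so drop it
file_list = file_list[1:]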
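As a hedged alternative to slicing the `os.walk` result, the class folders can also be listed directly. The sketch below assumes every immediate subdirectory of the data directory is one class. `list_class_dirs` is a hypothetical helper name and the folder names in the demo are made up; neither comes from the original script.

```python
import os
import tempfile


def list_class_dirs(data_dir):
    """Return the sorted paths of the immediate subdirectories of data_dir."""
    return sorted(
        entry.path for entry in os.scandir(data_dir) if entry.is_dir()
    )


# Small demo on a throwaway directory tree (folder names are made up).
demo_root = tempfile.mkdtemp()
for name in ("C3-Art", "C7-History"):
    os.mkdir(os.path.join(demo_root, name))
# A stray file at the top level is ignored because it is not a directory.
open(os.path.join(demo_root, "readme.txt"), "w").close()

class_dirs = list_class_dirs(demo_root)
class_names = [os.path.basename(p) for p in class_dirs]
print(class_names)
```

Unlike `os.walk`, this does not descend into nested folders, which is only appropriate if the dataset has exactly one level of class folders.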

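Next, iterate over every file in every class folder; one file is one document. The documents of each class are collected into a list, one element per document, and these per-class lists are stored in corpus_list, which therefore holds all documents of all classes.

# Collect the documents of each folder into one list per class
corpus_list = []
for file in file_list:
    temp_corpus_list = []  # documents of the current class
    # Iterate over every document in this class folder
    for corpus in os.listdir(file):
        with open(os.path.join(file, corpus), 'rb') as f:
            # The raw files are GB2312-encoded; bytes that cannot be decoded are dropped
            text = f.read().decode('GB2312', 'ignore')

        # Flatten line breaks and strip ideographic spaces and long runs of spaces
        text = text.replace('\r', ' ').replace('\n', ' ').replace('\u3000', '').replace('             ', '')
        temp_corpus_list.append(text)

    corpus_list.append(temp_corpus_list)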
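Decoding with `'ignore'` silently drops any byte sequence that is not valid GB2312. A possible variation, shown here only as a sketch, is to decode with GB18030 (a superset of GB2312 and GBK) and to collapse whitespace in one pass. `read_text` is a hypothetical helper, and note that it collapses every whitespace run to a single space, which is slightly different from the chained `replace()` calls in the article.

```python
def read_text(raw_bytes):
    """Decode bytes as GB18030 and normalise whitespace."""
    # errors="replace" keeps a visible marker instead of silently dropping bad bytes.
    text = raw_bytes.decode("gb18030", errors="replace")
    # Remove ideographic spaces, then collapse line breaks and space runs to one space.
    text = text.replace("\u3000", "")
    return " ".join(text.split())


# The sample text is made up for the demo and encoded the way the raw files are.
sample = "\u3000\u3000复旦大学\r\n文本分类".encode("gb2312")
cleaned = read_text(sample)
print(cleaned)
```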

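Write the documents to disk by class: each class becomes one .txt file, each line of that file is one document, and documents are separated by \r\n.

# Write each class list to its own file
out_dir = '复旦大学中文文本分类数据集/标准格式数据'
for i in range(len(file_list)):
    # Use the folder name as the class name instead of a hard-coded slice offset
    class_name = os.path.basename(file_list[i])
    print(class_name)
    # newline='' writes '\r\n' verbatim; Windows text mode would otherwise emit '\r\r\n'
    with open(os.path.join(out_dir, class_name + '.txt'), 'w', encoding='utf-8', newline='') as f:
        for each in corpus_list[i]:
            f.write(each + '\r\n')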
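One detail worth checking when writing `\r\n` separators by hand is newline translation in text mode. The sketch below, using a temporary file and made-up documents, shows that opening the file with `newline=''` writes the separator exactly as given, so splitting on `\r\n` later recovers the documents unchanged.

```python
import os
import tempfile

docs = ["first document", "second document"]
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# newline='' turns off newline translation, so '\r\n' is written exactly as given
# on every platform (without it, Windows text mode would emit '\r\r\n').
with open(path, "w", encoding="utf-8", newline="") as f:
    for doc in docs:
        f.write(doc + "\r\n")

with open(path, "rb") as f:
    raw = f.read()

# Splitting on the separator and dropping the trailing empty piece recovers the documents.
recovered = raw.decode("utf-8").split("\r\n")[:-1]
print(recovered)
```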

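The result is the standard-format data shown in Figure 1.2. Each .txt file holds all documents of one class, and the documents are separated by \r\n.

Figure 1.2: Standard-format data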

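2. Word segmentation and stopword removal

With the standard-format data in place, each class needs further processing: read all documents of all classes, segment them into words, and remove stopwords. This article uses the jieba library for segmentation and part-of-speech tagging, and Gensim==3.8.0 to build the dictionary.

The function below lists the per-class .txt files and stores the label-to-class mapping, so that document labels can be matched to class names later.

""" Get the file list and write the label-to-class mapping to a file
    Parameters
    files_path: directory containing the corpus files
    store_path: where to save the label-to-class mapping
    Returns
    file_list: paths of the corpus files, one per class
"""


def get_file_list(files_path, store_path):
    files = os.listdir(path=files_path)

    # Write one 'label-->class file' line per class; the list position is the label
    with open(store_path, 'w', encoding='utf-8') as f:
        for label, each in enumerate(files):
            f.write(str(label) + '-->' + each + '\n')

    return [files_path + '/' + each for each in files]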
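To use the mapping later, the `label-->class` lines have to be read back. The following is a minimal sketch of one way to do that; `parse_label_map` is a hypothetical helper and the file names in the demo are made up.

```python
def parse_label_map(lines):
    """Turn lines like '0-->C3-Art.txt' into {0: 'C3-Art.txt'}."""
    label_map = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first '-->' so file names containing it stay intact.
        label, name = line.split("-->", 1)
        label_map[int(label)] = name
    return label_map


demo_lines = ["0-->C3-Art.txt\n", "1-->C7-History.txt\n"]
label_map = parse_label_map(demo_lines)
print(label_map)
```

With a real mapping file, the lines would come from `open(store_path, encoding='utf-8')`, assuming the file was written as utf-8.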

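The functions below load the stopword list and remove stopwords. Pay attention to the line separator of the stopword file: the one used in this article separates stopwords with \r\n.

""" Load the stopwords
    Parameters
    path: path of the stopword file
    Returns: list of stopwords
"""


def get_stopwords(path):
    # The stopword file is assumed to use '\r\n' line endings; change the separator if yours differs
    with open(path, 'rb') as f:
        return list(set(f.read().decode("utf-8").split("\r\n")))


""" Remove stopwords; every item is a word with its part-of-speech tag
    Parameters
    word_list: list to remove stopwords from (modified in place)
    stopwords: list of stopwords
    Returns
    word_list: the word list after stopword removal
"""


def rm_stopwords(word_list, stopwords):
    # Iterate from the end so that pop() does not shift the indices still to be visited
    for i in reversed(range(len(word_list))):
        word = word_list[i][0]  # each item is a (word, flag) tuple, so [0] is the word
        # Drop leftover separators, stopwords, and pure-digit tokens
        if word in ('\n', '\r') or word in stopwords or word.isdigit():
            word_list.pop(i)
    return word_list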
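For comparison, here is a non-mutating variant of the same filtering rules, written as a sketch rather than a drop-in replacement. `filter_tokens` is a hypothetical name. It builds a new list and converts the stopwords to a set, which makes each membership test constant time on average instead of a scan through a list.

```python
def filter_tokens(tagged_words, stopwords):
    """Return a new list without separators, stopwords, or all-digit tokens."""
    stopword_set = set(stopwords)
    return [
        (word, flag)
        for word, flag in tagged_words
        if word not in ("\n", "\r")
        and word not in stopword_set
        and not word.isdigit()
    ]


# Made-up tagged tokens in the (word, flag) format used in this article.
tagged = [("我", "r"), ("的", "uj"), ("2022", "m"), ("北京", "ns"), ("\r", "x")]
kept = filter_tokens(tagged, stopwords=["的"])
print(kept)
```

Because the input list is left untouched, the caller does not need to copy it before filtering.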

Then, for every class file, segment each document, tag parts of speech, and remove stopwords.

    """ Build the segmented, stopword-filtered bag of words and the dictionary
        Parameters
        file_list: list of class files; one file per class, one document per line
        stopwords: stopword list
        Results (stored on the instance)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
        bow: bag of words
        dictionary: word-to-id dictionary for the whole corpus
    """

    def get_corpus(self, file_list, stopwords):
        # Each file in file_list is one class
        for file in file_list:
            start_time = time()

            # One line is one document; '\r\n' is the separator written during preprocessing
            with open(file, 'rb') as f:
                contents = f.read().decode('utf-8').split('\r\n')[:-1]
            print("Processing " + file + ": " + str(len(contents)) + " documents...")

            for content in contents:
                self.content_list.append(content)
                # Segment and tag parts of speech with jieba.posseg
                word_list = [(w.word, w.flag) for w in pseg.cut(sentence=content)]
                self.jieba_list.append(word_list)

                # Remove stopwords
                # Important: rm_stopwords pops items in place, so it must work on a deep copy;
                # otherwise the list just appended to jieba_list would be modified as well
                temp_rmd_list = rm_stopwords(word_list=copy.deepcopy(word_list), stopwords=stopwords)
                self.rmd_list.append(temp_rmd_list)

                # Update the dictionary and build the bag of words
                words = [w[0] for w in temp_rmd_list]
                self.dictionary.add_documents(documents=[words])
                self.bow.append(self.dictionary.doc2bow(document=words))

                # Class label of this document
                self.flag_list.append(self.flag)

            end_time = time()
            print("Finished class " + str(self.flag + 1) + ", elapsed" + get_time(end_time - start_time))
            # Move on to the next class label
            self.flag += 1

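After these steps we have content_list, jieba_list, rmd_list, flag_list, bow and dictionary. Their formats are:

Data	Description
`content_list`	List of raw document texts
`jieba_list`	List of segmented, POS-tagged documents, e.g. `[('我', 'r'), ('北京', 'ns'), ('天安门', 'ns')]`
`rmd_list`	List of segmented documents after stopword removal; same format as `jieba_list`, minus the stopwords
`flag_list`	Class labels of the documents, e.g. `[0,0,0,1,1,1,2,2]`
`bow`	Bag-of-words model of the corpus, e.g. `[(0, 1),(1, 52)]`, where the first element of each tuple is the word `id` and the second is the word frequency
`dictionary`	The `id` of every word in the corpus, e.g. `{我: 1, 你: 2}`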
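To make the `bow` and `dictionary` formats concrete, here is a pure-Python illustration of the idea. It is not gensim's implementation, and gensim may assign ids in a different order. `build_bow` is a hypothetical helper and the two documents are made up.

```python
from collections import Counter


def build_bow(docs):
    """Assign each new word the next free id and count word ids per document."""
    token2id = {}
    bows = []
    for words in docs:
        for word in words:
            # setdefault only assigns an id the first time a word is seen.
            token2id.setdefault(word, len(token2id))
        counts = Counter(token2id[word] for word in words)
        bows.append(sorted(counts.items()))
    return token2id, bows


docs = [["我", "爱", "北京"], ["北京", "北京", "天安门"]]
token2id, bows = build_bow(docs)
print(token2id)
print(bows)
```

Each document becomes a list of `(id, frequency)` pairs, which is the same shape as the `bow` entries in the table above.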

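Save the bag of words bow and the matching dictionary, and provide a method to load them back. In the bow .txt file, the two elements of a tuple are separated by a comma, tuples are separated by /, and documents are separated by \n.

    """ Save the bag of words to a txt file and the dictionary to a file
        Parameters
        bow_path: where to store bow
        dict_path: where to store dictionary
        The data written is self.bow and self.dictionary
    """

    def save_bow_dict(self, bow_path, dict_path):
        with open(bow_path, 'w', encoding='utf-8') as file:
            for each in self.bow:
                # One document per line, each (id, count) pair written as 'id,count/'
                for each_ in each:
                    file.write(str(each_[0]) + ',' + str(each_[1]))
                    file.write('/')
                file.write('\n')

        self.dictionary.save(fname_or_handle=dict_path)
        print('Saved successfully!')

    """ Load bow and the dictionary back
        Parameters
        bow_path: path of the bag of words
        dict_path: path of the dictionary
        Results (stored on the instance)
        bow: bag of words
        dictionary: dictionary
    """

    def load_bow_dict(self, bow_path, dict_path):
        with open(bow_path, 'rb') as f:
            lines = f.read().decode('utf-8').split('\n')[:-1]
        self.bow = []
        for each in lines:
            # Every pair ends with '/', so the last split piece is empty and is dropped
            lst1 = each.split('/')[:-1]
            lst = []
            for each_ in lst1:
                lst2 = each_.split(',')
                lst2 = (int(lst2[0]), int(lst2[1]))
                lst.append(lst2)
            self.bow.append(lst)

        self.dictionary = corpora.Dictionary.load(fname=dict_path)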
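The text format used for bow can be checked with a small round trip that does not touch the file system. This is a sketch of the same `id,count/` layout; `dump_bow` and `parse_bow` are hypothetical stand-alone helpers, not methods of the class.

```python
def dump_bow(bow):
    """Serialise a list of documents, each a list of (id, count) pairs, to text."""
    return "".join(
        "".join(f"{word_id},{count}/" for word_id, count in doc) + "\n"
        for doc in bow
    )


def parse_bow(text):
    """Inverse of dump_bow."""
    bow = []
    # The text ends with '\n', so the last split piece is empty and is dropped.
    for line in text.split("\n")[:-1]:
        pairs = line.split("/")[:-1]
        bow.append([tuple(int(x) for x in pair.split(",")) for pair in pairs])
    return bow


# Made-up bag of words, including an empty document.
bow = [[(0, 1), (1, 52)], [], [(3, 2)]]
text = dump_bow(bow)
print(repr(text))
```

An empty document is written as an empty line and read back as an empty list, so document positions stay aligned with `flag_list`.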

3. Saving to CSV

Build a DataFrame and save it to a csv file. Mind the encoding: this article uses utf-8-sig.

    """ Convert the collected data to a DataFrame and save it to csv
        Parameters
        path: where to store the csv
        Data used (instance attributes)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
        Result
        info_df: DataFrame holding all of the above
    """

    def to_csv(self, path):
        self.info_df = pd.DataFrame(columns=['语料标签', '语料内容', '分词内容', '去停用词内容'])
        self.info_df['语料标签'] = self.flag_list
        self.info_df['语料内容'] = self.content_list
        self.info_df['分词内容'] = self.jieba_list
        self.info_df['去停用词内容'] = self.rmd_list

        self.info_df.to_csv(path_or_buf=path, index=False, encoding='utf-8-sig')
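One thing to keep in mind when reading such a csv back is that a cell holding a list of tuples comes back as a string. The sketch below illustrates this with the standard-library `csv` module rather than pandas (pandas is expected to write list cells as their string form in a similar way, but that is an assumption here), and restores the list with `ast.literal_eval`. The single row is made up.

```python
import ast
import csv
import io

rows = [
    {"语料标签": 0, "分词内容": [("我", "r"), ("北京", "ns")]},
]

# Write one row; the list cell is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["语料标签", "分词内容"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
cell = loaded[0]["分词内容"]

# ast.literal_eval safely parses the Python literal back into a list of tuples.
restored = ast.literal_eval(cell)
print(type(cell).__name__, restored)
```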

4. Saving to the database

Store the results in the database. Within a tuple, elements are separated by a comma; tuples are separated by a semicolon.

    """ Store the data in the database
        Parameters
        connection: database connection
        table: name of the table to create and write to
        Data used (instance attributes)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
    """

    def to_mysql(self, connection, table):
        # Create a cursor
        cur = connection.cursor()

        # Create the table
        sql_create_table = """
        CREATE TABLE """ + table + """ (
                            语料ID INT NOT NULL,
                            语料标签 INT NOT NULL,
                            语料内容 LONGTEXT NOT NULL,
                            分词内容 LONGTEXT NOT NULL,
                            去停用词内容 LONGTEXT NOT NULL
                        )
                        ENGINE=InnoDB
                        DEFAULT CHARSET=utf8
                        COLLATE=utf8_general_ci;
        """
        cur.execute(sql_create_table)

        # Insert statement; the %s placeholders are left unquoted because pymysql quotes and escapes the values itself
        sql = "insert into " + table + " (语料ID, 语料标签, 语料内容, 分词内容, 去停用词内容) values (%s, %s, %s, %s, %s)"

        try:
            for i in range(len(self.content_list)):
                content = self.content_list[i]  # raw document text
                # Each (word, flag) tuple is written as its str() form; tuples are joined with ';'
                jieba_content = ';'.join(str(each) for each in self.jieba_list[i])
                rmd_content = ';'.join(str(each) for each in self.rmd_list[i])
                flag = self.flag_list[i]  # class label
                # The list index i serves as the document ID
                cur.execute(sql, (i, flag, content, jieba_content, rmd_content))
            connection.commit()
            print(str(len(self.content_list)) + ' rows stored in the database')

        # If anything goes wrong, report the error and roll back
        except Exception as e:
            print(e)
            connection.rollback()
            print("Failed to insert data")
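Parameterized inserts rely on the driver, not the SQL string, to quote values. The sketch below demonstrates the idea with the standard-library `sqlite3` module so that it runs without a MySQL server; this is a swapped-in database, and its placeholder is `?` where pymysql uses `%s`. The table, column names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE corpus (id INTEGER NOT NULL, label INTEGER NOT NULL, content TEXT NOT NULL)"
)

# Values containing quotes and semicolons, which would break naive string formatting.
rows = [(0, 0, "It's a 'quoted' text"), (1, 1, 'say "hi"; DROP TABLE corpus')]

# Placeholders are bare (no surrounding quotes); the driver handles quoting and escaping.
cur.executemany("INSERT INTO corpus (id, label, content) VALUES (?, ?, ?)", rows)
conn.commit()

stored = cur.execute("SELECT id, label, content FROM corpus ORDER BY id").fetchall()
print(stored)
```

The stored rows match the inputs exactly, including the embedded quotes, which is the behaviour one would also want from the MySQL insert.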

5. Main function

All of the operations above are wrapped in a ProcessingAndSave class and called from the main function. Adjust the paths to match your local layout. The database connection parameters host, user, password and database must be set for your own environment; port is 3306 unless you configured it otherwise.

""" Main function
    -------- Before running, check the line separator used for the stopword list (in get_stopwords) and for the corpus files (in get_corpus); they may need to be changed ---------
"""
if __name__ == "__main__":
    # Create the processor
    pas = ProcessingAndSave()

    file_list = get_file_list(files_path='复旦大学中文文本分类数据集/标准格式数据',
                              store_path='复旦大学中文文本分类数据集/预处理后数据/标签对应类别.txt')

    stopwords = get_stopwords('复旦大学中文文本分类数据集/停用词表/stopwords_all.txt')

    # Build the corpus
    pas.get_corpus(file_list=file_list, stopwords=stopwords)

    # Save to a csv file
    pas.to_csv('复旦大学中文文本分类数据集/预处理后数据/info.csv')

    # Save the bag of words and the dictionary
    pas.save_bow_dict(bow_path='复旦大学中文文本分类数据集/models/bow.txt',
                      dict_path='复旦大学中文文本分类数据集/models/dictionary')

    # Save to the database
    connection = pymysql.connect(host="localhost", user="root", password="xxxxxx", database="xxx", port=3306)
    table = '复旦中文语料'
    pas.to_mysql(connection=connection, table=table)

6. Final results

CSV file

SQL file

bow

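7. Notes

This article was written on Windows 10, where the native line ending is \r\n; before splitting file contents on a line separator, test that it matches your files.
Files are saved as utf-8 by default, except the csv file, which uses utf-8-sig; adjust the encoding if you see garbled text.
This article assumes a MySQL database.
When writing to the database, the code assumes the table does not exist yet and creates it; modify the to_mysql function if your table already exists.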
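For the last note, one possible adjustment is to make table creation idempotent with `CREATE TABLE IF NOT EXISTS`, which MySQL also supports. The sketch below shows the behaviour with the standard-library `sqlite3` module as a stand-in for MySQL; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

create_sql = "CREATE TABLE IF NOT EXISTS corpus (id INTEGER NOT NULL, content TEXT NOT NULL)"

# Running the statement twice is harmless: the second call leaves the table and its rows alone.
cur.execute(create_sql)
cur.execute("INSERT INTO corpus (id, content) VALUES (?, ?)", (0, "first run"))
cur.execute(create_sql)

row_count = cur.execute("SELECT COUNT(*) FROM corpus").fetchone()[0]
print(row_count)
```

Whether existing rows should be kept, replaced or appended to on a second run is a separate decision that this sketch does not address.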

8. Resources

9. Full code

处理复旦大学语料库.py

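import os
data_dir = r'NLP/文本聚类/corpus/复旦大学中文文本分类数据集/原始数据'
out_dir = r'NLP/文本聚类/corpus/复旦大学中文文本分类数据集/标准格式数据'

# Walk the data directory; each subfolder is one class
file_list = []
for root, dirs, files in os.walk(data_dir):
    file_list.append(root)

# The first entry is data_dir itself, so drop it
file_list = file_list[1:]

# Collect the documents of each folder into one list per class
corpus_list = []
for file in file_list:
    temp_corpus_list = []  # documents of the current class
    for corpus in os.listdir(file):
        with open(os.path.join(file, corpus), 'rb') as f:
            # The raw files are GB2312-encoded; bytes that cannot be decoded are dropped
            text = f.read().decode('GB2312', 'ignore')

        # Flatten line breaks and strip ideographic spaces and long runs of spaces
        text = text.replace('\r', ' ').replace('\n', ' ').replace('\u3000', '').replace('             ', '')
        temp_corpus_list.append(text)

    corpus_list.append(temp_corpus_list)


# Write each class list to its own file
for i in range(len(file_list)):
    # Use the folder name as the class name instead of a hard-coded slice offset
    class_name = os.path.basename(file_list[i])
    print(class_name)
    # newline='' writes '\r\n' verbatim; Windows text mode would otherwise emit '\r\r\n'
    with open(os.path.join(out_dir, class_name + '.txt'), 'w', encoding='utf-8', newline='') as f:
        for each in corpus_list[i]:
            f.write(each + '\r\n')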

ProcessingAndSave.py

import pandas as pd

from gensim import corpora
import jieba.posseg as pseg
from time import time
import copy

import pymysql
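import os

""" Format a duration
    Parameters
    seconds: elapsed time in seconds
    Returns
    a string expressing the duration in hours, minutes and seconds
"""


def get_time(seconds):
    if seconds <= 60:
        return ' ' + str(round(seconds, 2)) + ' s'
    elif seconds <= 3600:
        minutes = int(seconds // 60)
        seconds = round(seconds % 60, 2)
        return ' ' + str(minutes) + ' min ' + str(seconds) + ' s'
    else:
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        seconds = round(seconds % 60, 2)
        return ' ' + str(hours) + ' h ' + str(minutes) + ' min ' + str(seconds) + ' s'


""" Get the file list and write the label-to-class mapping to a file
    Parameters
    files_path: directory containing the corpus files
    store_path: where to save the label-to-class mapping
    Returns
    file_list: paths of the corpus files, one per class
"""


def get_file_list(files_path, store_path):
    files = os.listdir(path=files_path)

    # Write one 'label-->class file' line per class; the list position is the label
    with open(store_path, 'w', encoding='utf-8') as f:
        for label, each in enumerate(files):
            f.write(str(label) + '-->' + each + '\n')

    return [files_path + '/' + each for each in files]


""" Load the stopwords
    Parameters
    path: path of the stopword file
    Returns: list of stopwords
"""


def get_stopwords(path):
    # The stopword file is assumed to use '\r\n' line endings; change the separator if yours differs
    with open(path, 'rb') as f:
        return list(set(f.read().decode("utf-8").split("\r\n")))


""" Remove stopwords; every item is a word with its part-of-speech tag
    Parameters
    word_list: list to remove stopwords from (modified in place)
    stopwords: list of stopwords
    Returns
    word_list: the word list after stopword removal
"""


def rm_stopwords(word_list, stopwords):
    # Iterate from the end so that pop() does not shift the indices still to be visited
    for i in reversed(range(len(word_list))):
        word = word_list[i][0]  # each item is a (word, flag) tuple, so [0] is the word
        # Drop leftover separators, stopwords, and pure-digit tokens
        if word in ('\n', '\r') or word in stopwords or word.isdigit():
            word_list.pop(i)
    return word_list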


class ProcessingAndSave(object):
    """ Initialisation """

    def __init__(self):
        self.bow = []  # bag of words
        self.dictionary = corpora.Dictionary()  # dictionary
        self.content_list = []  # raw text of each document
        self.jieba_list = []  # segmented word list of each document; items are tuples such as ('我', 'r'), ('天安门', 'ns')
        self.rmd_list = []  # word list of each document after stopword removal; same tuple format
        self.flag_list = []  # class label of each document
        self.flag = 0  # label of the first class

        self.info_df = pd.DataFrame()  # DataFrame holding all the information

    """ Build the segmented, stopword-filtered bag of words and the dictionary
        Parameters
        file_list: list of class files; one file per class, one document per line
        stopwords: stopword list
        Results (stored on the instance)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
        bow: bag of words
        dictionary: word-to-id dictionary for the whole corpus
    """

    def get_corpus(self, file_list, stopwords):
        # Each file in file_list is one class
        for file in file_list:
            start_time = time()

            # One line is one document; '\r\n' is the separator written during preprocessing
            with open(file, 'rb') as f:
                contents = f.read().decode('utf-8').split('\r\n')[:-1]
            print("Processing " + file + ": " + str(len(contents)) + " documents...")

            for content in contents:
                self.content_list.append(content)
                # Segment and tag parts of speech with jieba.posseg
                word_list = [(w.word, w.flag) for w in pseg.cut(sentence=content)]
                self.jieba_list.append(word_list)

                # Remove stopwords
                # Important: rm_stopwords pops items in place, so it must work on a deep copy;
                # otherwise the list just appended to jieba_list would be modified as well
                temp_rmd_list = rm_stopwords(word_list=copy.deepcopy(word_list), stopwords=stopwords)
                self.rmd_list.append(temp_rmd_list)

                # Update the dictionary and build the bag of words
                words = [w[0] for w in temp_rmd_list]
                self.dictionary.add_documents(documents=[words])
                self.bow.append(self.dictionary.doc2bow(document=words))

                # Class label of this document
                self.flag_list.append(self.flag)

            end_time = time()
            print("Finished class " + str(self.flag + 1) + ", elapsed" + get_time(end_time - start_time))
            # Move on to the next class label
            self.flag += 1

    """ Save the bag of words to a txt file and the dictionary to a file
        Parameters
        bow_path: where to store bow
        dict_path: where to store dictionary
        The data written is self.bow and self.dictionary
    """

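    def save_bow_dict(self, bow_path, dict_path):
        with open(bow_path, 'w', encoding='utf-8') as file:
            for each in self.bow:
                # One document per line, each (id, count) pair written as 'id,count/'
                for each_ in each:
                    file.write(str(each_[0]) + ',' + str(each_[1]))
                    file.write('/')
                file.write('\n')

        self.dictionary.save(fname_or_handle=dict_path)
        print('Saved successfully!')

    """ Load bow and the dictionary back
        Parameters
        bow_path: path of the bag of words
        dict_path: path of the dictionary
        Results (stored on the instance)
        bow: bag of words
        dictionary: dictionary
    """

    def load_bow_dict(self, bow_path, dict_path):
        with open(bow_path, 'rb') as f:
            lines = f.read().decode('utf-8').split('\n')[:-1]
        self.bow = []
        for each in lines:
            # Every pair ends with '/', so the last split piece is empty and is dropped
            lst1 = each.split('/')[:-1]
            lst = []
            for each_ in lst1:
                lst2 = each_.split(',')
                lst2 = (int(lst2[0]), int(lst2[1]))
                lst.append(lst2)
            self.bow.append(lst)

        self.dictionary = corpora.Dictionary.load(fname=dict_path)

    """ Convert the collected data to a DataFrame and save it to csv
        Parameters
        path: where to store the csv
        Data used (instance attributes)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
        Result
        info_df: DataFrame holding all of the above
    """

    def to_csv(self, path):
        self.info_df = pd.DataFrame(columns=['语料标签', '语料内容', '分词内容', '去停用词内容'])
        self.info_df['语料标签'] = self.flag_list
        self.info_df['语料内容'] = self.content_list
        self.info_df['分词内容'] = self.jieba_list
        self.info_df['去停用词内容'] = self.rmd_list

        self.info_df.to_csv(path_or_buf=path, index=False, encoding='utf-8-sig')

    """ Store the data in the database
        Parameters
        connection: database connection
        table: name of the table to create and write to
        Data used (instance attributes)
        content_list: raw text of each document
        jieba_list: segmented word list of each document
        rmd_list: word list of each document after stopword removal
        flag_list: class label of each document
    """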

    def to_mysql(self, connection, table):
        # Create a cursor
        cur = connection.cursor()

        # Create the table
        sql_create_table = """
        CREATE TABLE """ + table + """ (
                            语料ID INT NOT NULL,
                            语料标签 INT NOT NULL,
                            语料内容 LONGTEXT NOT NULL,
                            分词内容 LONGTEXT NOT NULL,
                            去停用词内容 LONGTEXT NOT NULL
                        )
                        ENGINE=InnoDB
                        DEFAULT CHARSET=utf8
                        COLLATE=utf8_general_ci;
        """
        cur.execute(sql_create_table)

        # Insert statement; the %s placeholders are left unquoted because pymysql quotes and escapes the values itself
        sql = "insert into " + table + " (语料ID, 语料标签, 语料内容, 分词内容, 去停用词内容) values (%s, %s, %s, %s, %s)"

        try:
            for i in range(len(self.content_list)):
                content = self.content_list[i]  # raw document text
                # Each (word, flag) tuple is written as its str() form; tuples are joined with ';'
                jieba_content = ';'.join(str(each) for each in self.jieba_list[i])
                rmd_content = ';'.join(str(each) for each in self.rmd_list[i])
                flag = self.flag_list[i]  # class label
                # The list index i serves as the document ID
                cur.execute(sql, (i, flag, content, jieba_content, rmd_content))
            connection.commit()
            print(str(len(self.content_list)) + ' rows stored in the database')

        # If anything goes wrong, report the error and roll back
        except Exception as e:
            print(e)
            connection.rollback()
            print("Failed to insert data")


""" Main function
    -------- Before running, check the line separator used for the stopword list (in get_stopwords) and for the corpus files (in get_corpus); they may need to be changed ---------
"""
if __name__ == "__main__":
    # Create the processor
    pas = ProcessingAndSave()

    file_list = get_file_list(files_path='NLP/文本聚类/corpus/复旦大学中文文本分类数据集/标准格式数据',
                              store_path='NLP/文本聚类/corpus/复旦大学中文文本分类数据集/预处理后数据/标签对应类别.txt')

    stopwords = get_stopwords('NLP/文本聚类/corpus/复旦大学中文文本分类数据集/停用词表/stopwords_all.txt')

    # Build the corpus
    pas.get_corpus(file_list=file_list, stopwords=stopwords)

    # Save to a csv file
    pas.to_csv('NLP/文本聚类/corpus/复旦大学中文文本分类数据集/预处理后数据/info.csv')

    # Save the bag of words and the dictionary
    pas.save_bow_dict(bow_path='NLP/文本聚类/corpus/复旦大学中文文本分类数据集/models/bow.txt',
                      dict_path='NLP/文本聚类/corpus/复旦大学中文文本分类数据集/models/dictionary')

    # Save to the database; replace the placeholder password with your own
    connection = pymysql.connect(host="localhost", user="root", password="xxxxxx", database="nlp", port=3306)
    table = '复旦中文语料'
    pas.to_mysql(connection=connection, table=table)

The Fudan University Chinese text classification dataset is a small, lightweight dataset commonly used in text classification and text clustering experiments in natural language processing. This article uses Python to do basic processing of the dataset and stores the results in csv and sql files.