[NLP Project: Text Classification] Splitting the Data into Training, Validation, and Test Sets


The goal of this post is to prepare a custom dataset for use with the Chinese-Text-Classification-PyTorch project.
GitHub: Chinese-Text-Classification

Dataset: binary-labeled text data for sentiment analysis. The review column holds the comment text and the label column marks each sample as positive (1) or negative (0).
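For reference, a minimal sketch of what such a CSV might look like (the two rows are made-up examples; only the column names review and label are taken from the code below):

review,label
这家店的服务态度很好,1
物流太慢了非常失望,0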

I. Splitting the dataset without word segmentation

Read the CSV file with pandas, then split it with sklearn's train_test_split function.

1. Splitting the dataset

Split the data into training, validation, and test sets in an 8:1:1 ratio: the first call holds out 20% of the samples, and the second call splits that held-out 20% in half, giving 10% for validation and 10% for testing:

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    stratify=y,
                                                    random_state=42)

X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                    y_test,
                                                    test_size=0.5,
                                                    shuffle=True,
                                                    stratify=y_test,
                                                    random_state=42)

2. Writing each split to a txt file

Generate three txt files (test.txt, train.txt, and dev.txt) to match the dataset format used by the project.

# write each split to a txt file ('a+' appends, so delete any old files before re-running)
testdir = "./WeiboData/data/test.txt"
traindir = "./WeiboData/data/train.txt"
validdir = "./WeiboData/data/dev.txt"

print(X_test)
print(y_test)
with open(testdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_test,y_test):
        f.write(str(i)+'\t'+str(j)+'\n')

with open(traindir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_train,y_train):
        f.write(str(i)+'\t'+str(j)+'\n')

with open(validdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_valid,y_valid):
        f.write(str(i)+'\t'+str(j)+'\n')

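Each file now holds one sample per line, with the review text and its label separated by a tab, matching the layout of the project's train.txt/dev.txt/test.txt files. A made-up example line from test.txt (the two fields are separated by a tab character):

这家店的服务态度很好	1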

Full code:

import pandas as pd
import jieba
from sklearn.model_selection import train_test_split

data = pd.read_csv(r'D:\Study\PycahrmProjects\sentimentAnalysis\wb_data1_denote1.csv', encoding='utf-8-sig')

X = data['review'].values
y = data.label.values

# 8:1:1 split: 80% train, 10% validation, 10% test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    stratify=y,
                                                    random_state=42)

X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                    y_test,
                                                    test_size=0.5,
                                                    shuffle=True,
                                                    stratify=y_test,
                                                    random_state=42)

print("训练集样本数 = ", len(y_train))
print("训练集中正样本数 = ", len([w for w in y_train if w == 1]))
print("训练集中负样本数 = ", len([w for w in y_train if w == 0]))
print("验证集样本数 = ", len(y_valid))
print("验证集中正样本数 = ", len([w for w in y_valid if w == 1]))
print("验证集中负样本数 = ", len([w for w in y_valid if w == 0]))
print("测试集样本数 = ", len(y_test))
print("测试集中正样本数 = ", len([w for w in y_test if w == 1]))
print("测试集中负样本数 = ", len([w for w in y_test if w == 0]))

# write each split to a txt file ('a+' appends, so delete any old files before re-running)
testdir = "./WeiboData/data/test.txt"
traindir = "./WeiboData/data/train.txt"
validdir = "./WeiboData/data/dev.txt"

print(X_test)
print(y_test)
with open(testdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_test,y_test):
        f.write(str(i)+'\t'+str(j)+'\n')

with open(traindir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_train,y_train):
        f.write(str(i)+'\t'+str(j)+'\n')

with open(validdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_valid,y_valid):
        f.write(str(i)+'\t'+str(j)+'\n')


II. Splitting the dataset with word segmentation

1. Word segmentation

Note that after segmentation the text has to be split back into a per-sample list by line. The tokenizer below joins all samples with '\n' before cleaning and segmenting, so splitting the result on '\n' recovers one segmented string per sample (this assumes no review contains a newline of its own).
Sample code:

# segment each split with jieba
def tokenizer(data):
    # collect the raw text of every sample
    text = []
    for i in range(data.shape[0]):
        text.append(str(data[i]))

    # join the samples with '\n' so they can be split apart again after segmentation
    comment = '\n'.join(text)

    # clean the text: strip digits, punctuation and special symbols with a regex
    import re
    symbols = r"[0-9\!\%\,\。\.\,\、\~\?\(\)\(\)\?\!\“\”\:\:\;\"\"\;\……&\-\_\|\.A.B.C\*\^]"
    comments = re.sub(symbols, '', comment)

    comments_list = jieba.cut(comments)  # accurate mode
    # comments_list = jieba.cut_for_search(comments)  # search-engine mode
    x_train = ' '.join([x for x in comments_list])  # join the segmented words with spaces

    return x_train

# segment each split

X_test = tokenizer(X_test)
X_train = tokenizer(X_train)
X_valid = tokenizer(X_valid)

# split each segmented string back into a per-sample list by line
X_valid = X_valid.split('\n')
X_test = X_test.split('\n')
X_train = X_train.split('\n')
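As a quick sanity check of the per-sample structure, here is a small sketch (the two sentences and the printed result are made-up illustrations; the exact jieba segmentation may differ, and stray spaces can appear around the recovered lines, hence the strip):

import numpy as np

demo = np.array(['今天天气真好', '物流太慢了非常失望'])
seg = tokenizer(demo)                        # one space-joined string, with '\n' kept between samples
lines = [s.strip() for s in seg.split('\n')]
print(lines)                                 # roughly: ['今天 天气 真好', '物流 太 慢 了 非常 失望']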


2. Full code

import pandas as pd
import jieba
from sklearn.model_selection import train_test_split

# segment each split with jieba
def tokenizer(data):
    # collect the raw text of every sample
    text = []
    for i in range(data.shape[0]):
        text.append(str(data[i]))

    # join the samples with '\n' so they can be split apart again after segmentation
    comment = '\n'.join(text)

    # clean the text: strip digits, punctuation and special symbols with a regex
    import re
    symbols = r"[0-9\!\%\,\。\.\,\、\~\?\(\)\(\)\?\!\“\”\:\:\;\"\"\;\……&\-\_\|\.A.B.C\*\^]"
    comments = re.sub(symbols, '', comment)

    comments_list = jieba.cut(comments)  # accurate mode
    # comments_list = jieba.cut_for_search(comments)  # search-engine mode
    x_train = ' '.join([x for x in comments_list])  # join the segmented words with spaces

    return x_train


data = pd.read_csv(r'D:\Study\PycahrmProjects\sentimentAnalysis\wb_data1_denote1.csv', encoding='utf-8-sig')


X = data['review'].values
y = data.label.values

# 5:3:2 split: 50% train, 30% validation, 20% test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.5,
                                                    shuffle=True,
                                                    stratify=y,
                                                    random_state=42)

X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                    y_test,
                                                    test_size=0.4,  # 40% of the held-out half = 20% of all data
                                                    shuffle=True,
                                                    stratify=y_test,
                                                    random_state=42)

print("训练集样本数 = ", len(y_train))
print("训练集中正样本数 = ", len([w for w in y_train if w == 1]))
print("训练集中负样本数 = ", len([w for w in y_train if w == 0]))
print("验证集样本数 = ", len(y_valid))
print("验证集中正样本数 = ", len([w for w in y_valid if w == 1]))
print("验证集中负样本数 = ", len([w for w in y_valid if w == 0]))
print("测试集样本数 = ", len(y_test))
print("测试集中正样本数 = ", len([w for w in y_test if w == 1]))
print("测试集中负样本数 = ", len([w for w in y_test if w == 0]))

# write each split to a txt file ('a+' appends, so delete any old files before re-running)

testdir = "./WeiboData/data/test.txt"
traindir = "./WeiboData/data/train.txt"
validdir = "./WeiboData/data/dev.txt"

print(X_test)
print(y_test)

# segment each split

X_test = tokenizer(X_test)
X_train = tokenizer(X_train)
X_valid = tokenizer(X_valid)

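# split each segmented string back into a per-sample list by line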
X_valid = X_valid.split('\n')
X_test = X_test.split('\n')
X_train = X_train.split('\n')


print(X_test)
print(type(X_test))
print(len(X_test))

with open(testdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_test,y_test):
        f.write(str(i)+'\t'+str(j)+'\n')


with open(traindir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_train,y_train):
        f.write(str(i)+'\t'+str(j)+'\n')


with open(validdir, 'a+', encoding='utf-8-sig') as f:
    for i,j in zip(X_valid,y_valid):
        f.write(str(i)+'\t'+str(j)+'\n')



