[NLP] Tianchi News Text Classification (2): Data Loading and Data Analysis

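This post uses the Pandas library to read and analyze the data of the Tianchi news text classification challenge. The dataset is fairly large: each news article averages 907 characters, and the longest has 57,921. The class distribution is imbalanced, with technology the largest class and horoscope the smallest. Characters such as '3750' are probably punctuation marks, and each article contains 81 sentences on average. Two issues need attention: truncating long articles and handling class imbalance.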


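Preface

The NLP news text classification challenge (competition link).
The previous post on understanding the task already covered reading and analyzing the data, because a first understanding of the problem usually comes only after some analysis. For the sake of a complete workflow, though, this post covers data loading and analysis on its own, implemented with the Pandas library.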

Reading the data

Use the Pandas library to read the data.

# Import the package
import pandas as pd
train_df = pd.read_csv('./data/训练集数据/train_set.csv', sep='\t', nrows=100)

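The read_csv call here takes three arguments:
1. The path of the file to read;
2. The separator sep, the character that separates the columns; the data for this competition is tab-separated (\t);
3. The number of rows to read, nrows; since the dataset is fairly large, it can be set to 100 first for a preview.
After reading, take a quick look at the data.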

# Preview the data
train_df.head()

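[Figure: preview of the loaded data]
The figure above shows the loaded data in tabular form. The first column (label) is the news category, and the second column (text) holds the characters of the news article.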
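The three read_csv arguments described above can be tried without the competition file. The sketch below is only an illustration: it reads a tiny in-memory sample in place of train_set.csv, and the rows, the name sample_tsv, and the assumption that text is a space-separated sequence of character ids are made up for the example.

```python
import io

import pandas as pd

# Toy stand-in for train_set.csv: a header row, then label<TAB>text.
# Assumption: text is a space-separated sequence of anonymized character ids.
# These rows are invented for illustration and are not competition data.
sample_tsv = (
    "label\ttext\n"
    "2\t2967 6758 339 2021 1854\n"
    "11\t4464 486 6352 5619\n"
    "3\t7346 4068 5074\n"
)

# sep='\t' splits each line into columns at the tab character;
# nrows=2 stops after the first two data rows (the header row is not counted).
preview_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", nrows=2)

print(preview_df.shape)           # (2, 2)
print(list(preview_df.columns))   # ['label', 'text']
```

Dropping nrows reads every row, which is what the full analysis needs once the preview looks right.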

Data analysis

Use the Pandas library to analyze how the competition data is distributed.
1. Overall view of the data

train_df = pd.read_csv('./data/训练集数据/train_set.csv',sep='\t')
train_df.info()

[Figure: output of train_df.info()]

train_df.describe()
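The summary at the top of this post mentions article length, class distribution, frequent characters such as '3750', and sentence counts. A minimal sketch of how such statistics could be computed with Pandas follows. It is not necessarily the analysis the post itself ran: the rows in toy_df are invented, the space-separated text format is an assumption, and treating '3750' alone as a sentence delimiter is a rough, hypothetical proxy.

```python
from collections import Counter

import pandas as pd

# Invented rows standing in for the training set; not competition data.
toy_df = pd.DataFrame({
    "label": [0, 0, 1, 0],
    "text": [
        "12 7 3750 5 9 3750",
        "7 10 900 4",
        "3 3750 8",
        "5 6",
    ],
})

# On a frame with mixed dtypes, describe() summarizes only the numeric columns
# by default, so it reports on label and skips text.
print(toy_df.describe().columns.tolist())

# Article length: number of space-separated character ids (assumed format).
toy_df["text_len"] = toy_df["text"].apply(lambda t: len(t.split(" ")))
print(toy_df["text_len"].describe())

# Class distribution: number of rows per label, most frequent first.
label_counts = toy_df["label"].value_counts()
print(label_counts)

# Character frequency over the whole corpus.
char_counts = Counter(" ".join(toy_df["text"]).split(" "))
print(char_counts.most_common(1))

# Rough sentence proxy: count occurrences of the assumed delimiter per article.
assumed_delimiters = {"3750"}  # hypothetical; the only id the summary names
toy_df["approx_sentences"] = toy_df["text"].apply(
    lambda t: sum(tok in assumed_delimiters for tok in t.split(" "))
)
print(toy_df["approx_sentences"].tolist())
```

On the real training set the same steps would run on train_df instead of toy_df. The length statistics help choose a truncation limit, and the label counts show how strong the class imbalance is.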