News Text Classification Competition: Problem Analysis

This post analyzes the problem setting of the Tianchi news text classification competition.

1 Data format

The training set is a CSV file (tab-separated). Read it with pandas and print the first ten rows; the result is shown in the figure.

import pandas as pd

train_df = pd.read_csv(r'train_set.csv', sep='\t')  # fields are tab-separated
print(len(train_df))        # number of rows
print(train_df.head(10))    # first ten rows

Figure: sample data
The training set contains 200,000 rows in total.
The mapping between categories and label values is: {'Tech': 0, 'Stock': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}
The text column is anonymized: every character has been replaced with a numeric id.
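Since the labels are plain integers, it helps to keep the mapping as a dict and invert it so that predictions can be read back as category names. A minimal sketch (the English category names are my own translations of the original Chinese labels):

```python
# Label-to-id mapping from the competition description
# (category names translated from the original Chinese)
label2id = {
    'Tech': 0, 'Stock': 1, 'Sports': 2, 'Entertainment': 3,
    'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7,
    'Home': 8, 'Games': 9, 'Real estate': 10, 'Fashion': 11,
    'Lottery': 12, 'Horoscope': 13,
}

# Inverted mapping: predicted id -> human-readable category
id2label = {v: k for k, v in label2id.items()}
print(id2label[0])  # Tech
```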

2 Class distribution

Use a pandas pivot table to count the number of rows per label.

table = pd.pivot_table(train_df, index=['label'], aggfunc='count')  # row count per label
print(table)

The result is shown below:
Figure: class distribution
The largest class, Tech, has 38,918 rows, 19.5% of the total; the smallest, Horoscope, has only 908.
The classes are clearly imbalanced, which is worth keeping in mind in later steps.
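A quick way to quantify the imbalance is `value_counts(normalize=True)`, which gives each class's share directly. A sketch on a toy frame (the real `train_set.csv` is not loaded here):

```python
import pandas as pd

# Toy stand-in for train_df; the real set has 200,000 rows over 14 labels
df = pd.DataFrame({
    'label': [0, 0, 0, 13, 1, 1],
    'text':  ['57 44', '66 3', '7 9', '1 2', '3 4', '5 6'],
})

# Share of each class; on the real data, label 0 (Tech) would come out near 0.195
share = df['label'].value_counts(normalize=True)
print(share.loc[0])  # 0.5 on this toy frame
```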

3 Text length

Next, analyze the length of each news item.

train_df['text_len'] = train_df['text'].str.len()
# Add a "text_len" column holding the character length of each text
col_mean = train_df[["text_len"]].mean()
col_mean["label"] = "mean"
# Summary row: mean length
col_max = train_df[["text_len"]].max()
col_max["label"] = "max"
# Summary row: maximum length
col_min = train_df[["text_len"]].min()
col_min["label"] = "min"
# Summary row: minimum length
# DataFrame.append was removed in pandas 2.0; build the summary rows
# as a small DataFrame and attach them with pd.concat instead
summary = pd.DataFrame([col_mean, col_max, col_min])
train_df = pd.concat([train_df, summary], ignore_index=True)
print(train_df.tail(6))
# Show the last 6 rows: 3 data rows plus the 3 summary rows

The result:
Figure: text length distribution
The average length is about 4,400, the maximum 283,530, and the minimum 9, so a first judgment is that this is a long-text classification task.
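One caveat: `str.len()` counts characters, spaces included, while the text column holds space-separated word ids, so counting tokens may be the more meaningful length measure. A sketch on toy data; `describe()` also yields the mean/max/min in one call:

```python
import pandas as pd

# Toy stand-in for train_df['text']: space-separated anonymized word ids
df = pd.DataFrame({'text': ['12 23 34', '45 56', '67']})

# Split on whitespace and count tokens instead of characters
df['text_len'] = df['text'].str.split().str.len()

# describe() gives count/mean/min/max etc. in a single call
print(df['text_len'].describe())
```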

4 Approach

Because the data is anonymized, the word ids can on the one hand be treated as a one-hot vocabulary; on the other hand, off-the-shelf pretrained language models cannot be applied directly.
This is my first time entering a competition, so I want to start with classical machine-learning methods such as naive Bayes or XGBoost.
People in the discussion group mentioned that Text GCN works well on this kind of data; it is worth trying later.
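The classical baseline above (naive Bayes over TF-IDF features) can be sketched with scikit-learn. The toy corpus below mimics the competition's space-separated word ids and is only an illustration, not the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus mimicking the anonymized format: space-separated word ids
texts = ['12 23 34', '12 23 45', '67 78 89', '67 89']
labels = [0, 0, 1, 1]

# TF-IDF features + multinomial naive Bayes: a common first baseline
# (the default tokenizer keeps tokens of 2+ characters, which suits these ids)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(['12 23']))  # [0]
```

XGBoost could later be swapped in for the classifier stage; on the real task, fitting the vectorizer over the full 200,000-row corpus is the main cost.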

