NLP News Text Classification Learning Competition - Task2: Data Reading and Data Analysis

Latest recommended article published on 2021-11-01 23:34:15

cxm 17

Category: Datawhale Zero-Basics Introduction

Article link: https://blog.csdn.net/weixin_45415853/article/details/107480713

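This chapter covers data reading and data analysis: the Pandas library is used to read the data, and the competition data is then analyzed.
Datawhale Zero-Basics Introduction to NLP Competitions - Task2: Data Reading and Data Analysis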
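
As a small illustration of the reading step, the sketch below parses a tab-separated sample with Pandas. The two rows are made up for this example; only the column names `label` and `text` and the tab delimiter are taken from the code later in this post.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the assumed layout of the competition file:
# a header line, then "label<TAB>text", where text is space-separated character ids.
sample = "label\ttext\n2\t2967 6758 339 900 2021\n11\t4464 486 3750 6350 648 1519\n"

# io.StringIO lets read_csv parse the string as if it were a file on disk.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t")

# Number of characters (space-separated ids) in each article.
sample_df["text_len"] = sample_df["text"].apply(lambda x: len(x.split(" ")))
print(sample_df[["label", "text_len"]])
```

The same `read_csv` call works on the real file by passing its path instead of the `StringIO` object.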

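Chapter exercises

Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article contain on average?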

import pandas as pd
from collections import Counter
import re

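# sep is the column delimiter; the file is tab-separated, so use '\t'
train_df = pd.read_csv('TRAIN_DATA/train_set.csv', sep='\t')

# Split each article on the punctuation ids; the number of pieces is its sentence count.
# The spaces around each id make the pattern match whole tokens only, so a
# punctuation id at the very start or end of an article does not split it.
train_df['sentence_num'] = train_df['text'].apply(lambda x: len(re.split(' 3750 | 900 | 648 ', x)))
print('The average number of sentence in each news is :', train_df['sentence_num'].mean())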

Result:

The average number of sentence in each news is : 78.09435
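
To see how the space-padded pattern behaves, here is a minimal sketch on made-up strings. `count_by_regex` repeats the split used above; `count_by_tokens` is a hypothetical alternative (not the method used for the result above) that walks the tokens and ignores empty sentences. The two agree on ordinary text but can differ when several punctuation ids appear in a row.

```python
import re

PUNCT = {"3750", "900", "648"}

def count_by_regex(text):
    # Same pattern as above: a punctuation id with one space on each side.
    return len(re.split(' 3750 | 900 | 648 ', text))

def count_by_tokens(text):
    # Alternative: close a sentence at each punctuation id, skipping empty ones.
    count, current = 0, 0
    for tok in text.split():
        if tok in PUNCT:
            if current:
                count += 1
            current = 0
        else:
            current += 1
    # A trailing run of characters without final punctuation is also a sentence.
    return count + (1 if current else 0)

plain = "11 22 900 33 44 3750 55"      # punctuation only between sentences
trailing = "11 22 900 33 44 648"       # punctuation at the very end
tripled = "11 900 648 3750 22"         # three punctuation ids in a row

for text in (plain, trailing, tripled):
    print(repr(text), count_by_regex(text), count_by_tokens(text))
```

In the `tripled` string the regex version counts the middle `648` as a sentence of its own, because the spaces around it are consumed by the neighbouring matches.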

Find the most frequent character in each news category.

# Concatenate all articles of the same class and remove the punctuation ids.
# Note: this pattern also matches these digits inside longer ids (for example
# inside 1900); r'\b(?:3750|900|648)\b' would restrict it to whole tokens.
all_lines_in_a_class = []
for i in range(14):
    line = ' '.join(train_df[train_df['label'] == i]['text'])
    all_lines_in_a_class.append(re.sub('3750|900|648', '', line))

# Count how often each character occurs within each class
print('The character that appear most often in each type of news are:')
for i, line in enumerate(all_lines_in_a_class):
    # split() with no argument drops the empty strings left by the removal
    word_count = Counter(line.split())
    print(i, ':', word_count.most_common(1)[0][0])

Result:

The character that appear most often in each type of news are:
0 : 3370
1 : 3370
2 : 7399
3 : 6122
4 : 4411
5 : 6122
6 : 6248
7 : 3370
8 : 6122
9 : 7328
10 : 3370
11 : 4939
12 : 4464
13 : 4939
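
One caveat of removing punctuation with a plain substring pattern is that it also cuts digits out of longer ids. The sketch below shows the effect on a made-up line and compares it with filtering whole tokens. The ids `1900` and `6480` are hypothetical; whether such ids occur in the competition data is not checked here.

```python
import re
from collections import Counter

PUNCT = {"3750", "900", "648"}

# Hypothetical line mixing punctuation ids with ids that merely contain them.
line = "1900 3370 900 3370 6480 648 1900"

# Substring removal: '1900' becomes '1' and '6480' becomes '0'.
by_substring = Counter(re.sub('3750|900|648', '', line).split())

# Whole-token filtering: only ids that are exactly a punctuation mark are dropped.
by_token = Counter(tok for tok in line.split() if tok not in PUNCT)

print(by_substring)
print(by_token)
```

With whole-token filtering the longer ids keep their identity, so their counts are not merged into unrelated short ids.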

Find the character that appears in the largest number of articles in each news category.

# For each article, keep every character that occurs in it exactly once
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(set(x.split(' '))))

# Concatenate all articles of the same class and remove the punctuation ids.
# Note: as a substring pattern, this also matches inside longer ids.
all_lines_in_a_class = []
for i in range(14):
    line = ' '.join(train_df[train_df['label'] == i]['text_unique'])
    all_lines_in_a_class.append(re.sub('3750|900|648', '', line))

# Count, within each class, the number of articles each character appears in
print('The character that appear in most news in each type are:')
for i, line in enumerate(all_lines_in_a_class):
    # split() with no argument drops the empty strings left by the removal
    word_count = Counter(line.split())
    print(i, ':', word_count.most_common(1)[0])

Result:

The character that appear in most news in each type are:
0 : ('2465', 36438)
1 : ('3370', 33421)
2 : ('7399', 30671)
3 : ('2465', 21340)
4 : ('4853', 14624)
5 : ('6122', 12005)
6 : ('5620', 9466)
7 : ('3370', 8163)
8 : ('6122', 6315)
9 : ('7399', 5511)
10 : ('885', 4794)
11 : ('6122', 2813)
12 : ('3370', 1737)
13 : ('5491', 877)
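
The per-article counting idea can also be written without joining and re-splitting strings, by updating one `Counter` per class with the set of ids in each article. This is a minimal sketch on a made-up corpus, not the code that produced the result above; the `articles` list and its labels are hypothetical.

```python
from collections import Counter

PUNCT = {"3750", "900", "648"}

# Hypothetical toy corpus of (label, text) pairs.
articles = [
    (0, "5 5 5 7 900 8"),
    (0, "7 8 648 8"),
    (0, "7 9 9 9 9"),
    (1, "3 3 4"),
    (1, "4 3750 6"),
]

doc_freq = {}
for label, text in articles:
    # A set keeps each id once per article; punctuation ids are dropped as whole tokens.
    unique_tokens = set(text.split()) - PUNCT
    doc_freq.setdefault(label, Counter()).update(unique_tokens)

for label in sorted(doc_freq):
    print(label, ':', doc_freq[label].most_common(1)[0])
```

Here id `9` occurs four times in class 0 but only in one article, so it counts once, while id `7` appears in all three articles and comes out on top.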