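This chapter covers data loading and data analysis: it uses the Pandas library to read the competition data and then analyzes the characteristics of that data.
Datawhale Beginner NLP Competition - Task2: Data Loading and Data Analysis
Chapter exercises
- Assuming that characters 3750, 900, and 648 are sentence punctuation marks, how many sentences does each news article in the competition data contain on average?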
import pandas as pd
from collections import Counter
import re
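# sep is the character that separates the columns; the file is tab-separated, so use '\t'
train_df = pd.read_csv('TRAIN_DATA/train_set.csv', sep='\t')
# Split each article on the punctuation characters; the number of pieces is its sentence count.
# The spaces around each id make the pattern match whole tokens only, and they exclude
# punctuation at the very start or end of an article, which would create an empty sentence.
train_df['sentence_num'] = train_df['text'].apply(lambda x: len(re.split(' 3750 | 900 | 648 ', x)))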
print('The average number of sentence in each news is :', sum(train_df['sentence_num'])/len(train_df['sentence_num']))
Result:
The average number of sentence in each news is : 78.09435
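To make the behavior of the space-delimited pattern easier to see, here is a minimal sketch on a few made-up articles. The toy strings and the helper name `count_sentences` are assumptions for illustration only; the pattern is the one used in the solution.

```python
import re

# Punctuation ids from the exercise statement, each surrounded by spaces
# so that only whole, interior tokens act as separators.
PUNCT_PATTERN = re.compile(r' 3750 | 900 | 648 ')

def count_sentences(text: str) -> int:
    """Count the pieces produced by splitting on the punctuation tokens."""
    return len(PUNCT_PATTERN.split(text))

# Made-up articles (not competition data).
toy_news = [
    "12 34 3750 56 78 900 90",  # two interior separators, so three pieces
    "12 34 56",                 # no separator, so one piece
    "12 3750 34 648",           # the trailing 648 has no space after it and is not a separator
]

counts = [count_sentences(t) for t in toy_news]
average = sum(counts) / len(counts)
print(counts, average)
```

The third article shows the effect described in the code comment: a punctuation character at the end of the text does not add an empty trailing sentence.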
- Find the character that occurs most often in each news category.
# Concatenate all texts of the same category and strip the punctuation characters.
# \b restricts the match to whole tokens, so e.g. '1900' is not truncated to '1'.
all_lines_in_a_class = []
for i in range(14):
    line = ' '.join(train_df[train_df['label'] == i]['text'])
    all_lines_in_a_class.append(re.sub(r'\b(?:3750|900|648)\b', '', line))
# Count how often each character occurs within each category
print('The character that appear most often in each type of news are:')
for i, line in enumerate(all_lines_in_a_class):
    # split() with no argument drops the empty strings left by the removed punctuation
    word_count = Counter(line.split())
    print(i, ':', word_count.most_common(1)[0][0])
Result:
The character that appear most often in each type of news are:
0 : 3370
1 : 3370
2 : 7399
3 : 6122
4 : 4411
5 : 6122
6 : 6248
7 : 3370
8 : 6122
9 : 7328
10 : 3370
11 : 4939
12 : 4464
13 : 4939
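As an alternative to removing punctuation with a regular expression, the tokens can be filtered after splitting, which cannot accidentally cut digits out of a longer id such as 1900 or 2648. The following is a minimal sketch under that assumption; the helper name `most_common_char` and the toy texts are hypothetical and not part of the original solution.

```python
from collections import Counter

# Punctuation ids assumed by the exercise statement.
PUNCT = {"3750", "900", "648"}

def most_common_char(texts):
    """Return (token, count) for the most frequent non-punctuation token."""
    counter = Counter()
    for text in texts:
        # Compare whole tokens, so '1900' is kept even though it contains '900'.
        counter.update(tok for tok in text.split() if tok not in PUNCT)
    return counter.most_common(1)[0]

# Made-up texts standing in for the articles of one category.
toy_texts = ["1900 3370 900 3370", "3370 2648 648 1900"]
top = most_common_char(toy_texts)
print(top)
```

Updating one `Counter` per article also avoids building a single very large string per category.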
- Find the character that appears in the largest number of articles in each news category.
# Collect the distinct characters of each article, so each character is counted once per article
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(set(x.split())))
# Concatenate all texts of the same category and strip the punctuation characters.
# \b restricts the match to whole tokens, so e.g. '1900' is not truncated to '1'.
all_lines_in_a_class = []
for i in range(14):
    line = ' '.join(train_df[train_df['label'] == i]['text_unique'])
    all_lines_in_a_class.append(re.sub(r'\b(?:3750|900|648)\b', '', line))
# For each category, count the number of articles in which each character appears
print('The character that appear in most news in each type are:')
for i, line in enumerate(all_lines_in_a_class):
    # split() with no argument drops the empty strings left by the removed punctuation
    word_count = Counter(line.split())
    print(i, ':', word_count.most_common(1)[0])
Result:
The character that appear in most news in each type are:
0 : ('2465', 36438)
1 : ('3370', 33421)
2 : ('7399', 30671)
3 : ('2465', 21340)
4 : ('4853', 14624)
5 : ('6122', 12005)
6 : ('5620', 9466)
7 : ('3370', 8163)
8 : ('6122', 6315)
9 : ('7399', 5511)
10 : ('885', 4794)
11 : ('6122', 2813)
12 : ('3370', 1737)
13 : ('5491', 877)
Taking technology news (label 0) as an example: out of just over 38,000 technology articles, the character '2465' appears in 36,438 of them.
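The difference between raw frequency and article coverage can be illustrated on a tiny example. This is a sketch only: the function name `top_document_frequency`, the labels, and the texts are made up, and it uses plain Python collections rather than the DataFrame from the solution.

```python
from collections import Counter, defaultdict

# Punctuation ids assumed by the exercise statement.
PUNCT = {"3750", "900", "648"}

def top_document_frequency(labels, texts):
    """For each label, return (token, n_articles) for the token found in the most articles."""
    doc_freq = defaultdict(Counter)
    for label, text in zip(labels, texts):
        # A set counts each token at most once per article.
        doc_freq[label].update(set(text.split()) - PUNCT)
    return {label: counter.most_common(1)[0] for label, counter in doc_freq.items()}

# Made-up data: '5' occurs three times in total, but only in one article.
labels = [0, 0, 1]
texts = ["5 5 5 7 900", "7 8 648", "9 9 3750 9"]
result = top_document_frequency(labels, texts)
print(result)
```

In category 0 the token '5' has the highest raw count, yet '7' wins on coverage because it appears in both articles.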