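NLP Text Classification Task2: Data Analysis
1. Read the competition data.
2. Visualized the distributions of news length and labels.
3. Analyzed the average number of sentences per news article and the most frequent characters in each news category.
4. Analyzed the keywords in each news category.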
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
data_df = pd.read_csv('data/train_set.csv',sep='\t')
data_df.head()
| | label | text |
|---|---|---|
| 0 | 2 | 2967 6758 339 2021 1854 3731 4109 3792 4149 15... |
| 1 | 11 | 4464 486 6352 5619 2465 4802 1452 3137 5778 54... |
| 2 | 3 | 7346 4068 5074 3747 5681 6093 1777 2226 7354 6... |
| 3 | 2 | 7159 948 4866 2109 5520 2490 211 3956 5520 549... |
| 4 | 3 | 3646 3055 3055 2490 4659 6065 3370 5814 2465 5... |
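Data analysis
1. How long are the news texts in the competition data?
2. How are the labels distributed?
3. How are the characters distributed in the competition data?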
data_df['text_len'] = data_df['text'].apply(lambda x:len(x.split(' ')))
data_df['text_len'].describe()
count 200000.000000
mean 907.207110
std 996.029036
min 2.000000
25% 374.000000
50% 676.000000
75% 1131.000000
max 57921.000000
Name: text_len, dtype: float64
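The summary above shows that the longest text has more than 50,000 characters and the shortest has only 2. To prepare for padding later, check where most text lengths are concentrated.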
_ = plt.hist(data_df['text_len'],bins = 200,range=(0,10000))
plt.xlabel('Text char count')
plt.title('histogram of char count')
Text(0.5, 1.0, 'histogram of char count')
The histogram shows that most texts are shorter than 2,000 characters.
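To turn this observation into a padding length, one option is to measure how many texts fit under a candidate cutoff. The following is a minimal sketch using only the standard library; the helper names `coverage_at` and `smallest_cutoff` and the toy lengths are assumptions made for illustration, not part of the original notebook.

```python
def coverage_at(lengths, max_len):
    """Fraction of texts whose length is at most max_len."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


def smallest_cutoff(lengths, target=0.95):
    """Smallest observed length that covers at least `target` of the texts."""
    ordered = sorted(lengths)
    for i, n in enumerate(ordered, start=1):
        if i / len(ordered) >= target:
            return n
    return ordered[-1]


# Toy lengths for illustration only, not taken from the competition data.
toy_lengths = [2, 120, 374, 676, 900, 1131, 1800, 2500, 9000, 57921]
print(coverage_at(toy_lengths, 2000))     # share of texts no longer than 2000
print(smallest_cutoff(toy_lengths, 0.9))  # cutoff covering 90% of the texts
```

On the real data the same helpers could be called with `data_df['text_len'].tolist()`.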
# News category distribution
data_df['label'].value_counts().plot(kind='bar')
plt.title('news class count')
plt.xlabel('category')
Text(0.5, 0, 'category')
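The plot shows that the classes in the dataset are imbalanced: in the training set, technology news is the most frequent category and horoscope news is the least frequent.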
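The bar chart gives a visual impression of the imbalance; the sketch below shows one way to put numbers on it. The function names and the toy label list are assumptions for illustration only.

```python
from collections import Counter


def class_shares(labels):
    """Share of samples per label, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only, not the competition distribution.
toy_labels = [0] * 6 + [1] * 3 + [13] * 1
print(class_shares(toy_labels))
print(imbalance_ratio(toy_labels))
```

With the DataFrame loaded above, `imbalance_ratio(data_df['label'].tolist())` would give the ratio between the largest and smallest categories.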
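Character distribution statistics
1. Character frequency statistics
2. Sentence distribution statistics
a. Concatenate all texts in the training set, split them into characters, and count how many times each character occurs.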
from collections import Counter
all_lines = ' '.join(list(data_df['text']))
word_count = Counter(all_lines.split(' '))
word_count = sorted(word_count.items(),key=lambda x:x[1],reverse=True)
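print('Number of distinct characters:', len(word_count))
print('Most frequent character:', word_count[0])
print('Least frequent character:', word_count[-1])
Number of distinct characters: 6869
Most frequent character: ('3750', 7482224)
Least frequent character: ('3133', 1)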
word_count[:10]
[('3750', 7482224),
('648', 4924890),
('900', 3262544),
('3370', 2020958),
('6122', 1602363),
('4464', 1544962),
('7399', 1455864),
('4939', 1387951),
('3659', 1251253),
('4811', 1159401)]
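b. Sentence distribution statistics: infer which characters are punctuation marks from how many news texts each character appears in. Each character is counted at most once per text.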
data_df['text_unique'] = data_df['text'].apply(lambda x:' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(data_df['text_unique']))
word_in_line = Counter(all_lines.split(' '))
word_count_in_line = sorted(word_in_line.items(),key = lambda x:x[1],reverse=True)
len(word_in_line)
6869
word_count_in_line[:4]
[('3750', 197997), ('900', 197653), ('648', 191975), ('2465', 177310)]
Out of 200,000 news texts in total, 3750, 900, and 648 appear in about 99.0%, 98.8%, and 96.0% of them, respectively.
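The coverage figures come from dividing the number of texts that contain a character by the total number of texts. A minimal standard-library sketch of that calculation follows; `document_coverage` and the toy texts are assumed names and data used only for illustration.

```python
from collections import Counter


def document_coverage(texts):
    """Map each token to the fraction of texts that contain it at least once."""
    doc_freq = Counter()
    for text in texts:
        # set() ensures a token is counted once per text, however often it repeats.
        doc_freq.update(set(text.split()))
    return {tok: n / len(texts) for tok, n in doc_freq.items()}


# Toy corpus for illustration only.
toy_texts = ["12 3750 34 900", "56 3750 56", "78 900 3750 648", "90 12"]
coverage = document_coverage(toy_texts)
print(coverage["3750"], coverage["900"], coverage["56"])
```

Tokens whose coverage is close to 1.0 are candidates for punctuation, which is the reasoning applied to 3750, 900, and 648 above.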
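The analysis above leads to the following conclusions:
1. The shortest news text has 2 characters and the longest has more than 50,000; most are shorter than 2,000.
2. The label distribution is imbalanced: technology news has close to 40,000 samples, while horoscope news has fewer than 1,000.
3. The training set contains 6,869 distinct characters; 3750, 900, and 648 each appear in roughly 96% to 99% of all news texts.
Assuming that the characters '3750', '900', and '648' are punctuation marks in the news text, analyze how many sentences each news article contains on average.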
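import re
# Split each text on the assumed punctuation characters 3750, 900 and 648.
# The statistics printed below were recorded with the pattern '3760|900|648',
# which does not split on 3750, so rerunning this cell gives higher counts.
data_df['sentence_count_new'] = data_df['text'].apply(lambda x: re.split('3750|900|648', x))
data_df['sentence_count'] = data_df['sentence_count_new'].apply(len)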
data_df['sentence_count'].describe()
count 200000.000000
mean 43.391315
std 48.039653
min 1.000000
25% 15.000000
50% 30.000000
75% 55.000000
max 1867.000000
Name: sentence_count, dtype: float64
_ = plt.hist(data_df['sentence_count'],bins = 200)
plt.xlabel('sentence counts per news')
plt.title('sentence counts')
Text(0.5, 1.0, 'sentence counts')
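The sentence count above splits the raw string with a regular-expression alternation, which can also match inside longer tokens (for example `900` inside `9001`). The sketch below compares that with a token-level split; `split_sentences`, the `PUNCT` set, and the toy text are assumptions for illustration, and the punctuation ids themselves are only the hypothesis stated earlier.

```python
import re

# Assumed punctuation token ids, following the hypothesis in the text.
PUNCT = frozenset({"3750", "900", "648"})


def split_sentences(text, delimiters=PUNCT):
    """Split a space-separated token string into sentences at delimiter tokens."""
    sentences, current = [], []
    for tok in text.split():
        if tok in delimiters:
            # Close the current sentence; consecutive delimiters add nothing.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences


# Toy text for illustration only: "9001" contains "900" as a substring.
toy_text = "12 9001 900 34 648 56"
print(len(re.split("3750|900|648", toy_text)))  # string-level split
print(len(split_sentences(toy_text)))           # token-level split
```

The token-level version also ignores empty sentences produced by adjacent punctuation, so its counts are generally lower than `len(re.split(...))`.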
# Find the most frequent characters in each news category
groups = data_df.groupby('label')
from collections import Counter
for name, group in groups:
    print('label', name)
    # Count every occurrence of each character within this category
    all_lines = ' '.join(list(group['text']))
    word_count = Counter(all_lines.split(' '))
    word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    print(word_count[:5])
label 0
[('3750', 1267331), ('648', 967653), ('900', 577742), ('3370', 503768), ('4464', 307431)]
label 1
[('3750', 1200686), ('648', 714152), ('3370', 626708), ('900', 542884), ('4464', 445525)]
label 2
[('3750', 1458331), ('648', 974639), ('900', 618294), ('7399', 351894), ('6122', 343850)]
label 3
[('3750', 774668), ('648', 494477), ('900', 298663), ('6122', 187933), ('4939', 173606)]
label 4
[('3750', 360839), ('648', 231863), ('900', 190842), ('4411', 120442), ('7399', 86190)]
label 5
[('3750', 715740), ('648', 329051), ('900', 305241), ('6122', 159125), ('5598', 136713)]
label 6
[('3750', 469540), ('648', 345372), ('900', 222488), ('6248', 193757), ('2555', 175234)]
label 7
[('3750', 428638), ('648', 262220), ('900', 184131), ('3370', 159156), ('5296', 132136)]
label 8
[('3750', 242367), ('648', 202399), ('900', 92207), ('6122', 57345), ('4939', 56147)]
label 9
[('3750', 178783), ('648', 157291), ('900', 70680), ('7328', 46477), ('6122', 43411)]
label 10
[('3750', 180259), ('648', 114512), ('900', 75185), ('3370', 67780), ('2465', 45163)]
label 11
[('3750', 83834), ('648', 67353), ('900', 37240), ('4939', 18591), ('6122', 18438)]
label 12
[('3750', 87412), ('4464', 51426), ('3370', 45815), ('648', 37041), ('2465', 36610)]
label 13
[('3750', 33796), ('648', 26867), ('900', 11263), ('4939', 9651), ('669', 8925)]
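Because the assumed punctuation characters dominate the top of every list, one option is to filter them out before ranking. The following sketch does this with the standard library; `top_tokens_by_label`, `PUNCT`, and the toy rows are hypothetical names and data, not the notebook's own code.

```python
from collections import Counter

# Assumed punctuation token ids, following the hypothesis in the text.
PUNCT = frozenset({"3750", "900", "648"})


def top_tokens_by_label(rows, k=3, exclude=PUNCT):
    """Return the k most frequent non-excluded tokens for each label.

    `rows` is an iterable of (label, text) pairs.
    """
    counters = {}
    for label, text in rows:
        counter = counters.setdefault(label, Counter())
        counter.update(tok for tok in text.split() if tok not in exclude)
    return {label: c.most_common(k) for label, c in counters.items()}


# Toy rows for illustration only.
toy_rows = [
    (0, "3370 3750 3370 648 4464"),
    (0, "3370 900 4464 3750"),
    (1, "7399 3750 6122 7399 900"),
]
print(top_tokens_by_label(toy_rows, k=2))
```

On the real data the rows could be supplied as `zip(data_df['label'], data_df['text'])`.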
# The top 3 in each category are still mostly punctuation, so next look at how often characters appear relative to the number of sentences in each category
for name, group in groups:
    print('label', name)
    # Number of news texts in this category that contain each character
    text_unique = group['text'].apply(lambda x: ' '.join(set(x.split(' '))))
    all_word = ' '.join(text_unique)
    word_count = Counter(all_word.split(' '))
    word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    # Sentences per text, split on the assumed punctuation characters.
    # The output below was recorded with the pattern '3760|900|648', which
    # does not split on 3750, so rerunning gives higher totals.
    sentence_counts = group['text'].apply(lambda x: len(re.split('3750|900|648', x)))
    total_sentences = int(sentence_counts.sum())
    print('Total number of sentences:', total_sentences)
    freq = [(x[0], round(x[1] / total_sentences, 2)) for x in word_count[:5]]
    print(freq)
label 0
Total number of sentences: 1650511
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('2465', 0.02), ('6122', 0.02)]
label 1
Total number of sentences: 1333264
[('900', 0.03), ('3750', 0.03), ('3370', 0.03), ('4464', 0.02), ('648', 0.02)]
label 2
Total number of sentences: 1657969
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('7399', 0.02), ('2465', 0.02)]
label 3
Total number of sentences: 858692
[('3750', 0.03), ('900', 0.03), ('648', 0.03), ('2465', 0.02), ('7399', 0.02)]
label 4
Total number of sentences: 450264
[('900', 0.03), ('3750', 0.03), ('4853', 0.03), ('648', 0.03), ('7399', 0.03)]
label 5
Total number of sentences: 669496
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 6
Total number of sentences: 611244
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('5620', 0.02), ('4811', 0.01)]
label 7
Total number of sentences: 465720
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('3370', 0.02), ('1699', 0.02)]
label 8
Total number of sentences: 309819
[('648', 0.02), ('3750', 0.02), ('900', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 9
Total number of sentences: 243676
[('3750', 0.02), ('648', 0.02), ('900', 0.02), ('7399', 0.02), ('6122', 0.02)]
label 10
Total number of sentences: 199761
[('900', 0.02), ('3750', 0.02), ('885', 0.02), ('648', 0.02), ('3686', 0.02)]
label 11
Total number of sentences: 110063
[('3750', 0.03), ('648', 0.03), ('900', 0.03), ('6122', 0.03), ('7539', 0.03)]
label 12
Total number of sentences: 76816
[('3370', 0.02), ('4464', 0.02), ('3750', 0.02), ('900', 0.02), ('7539', 0.02)]
label 13
Total number of sentences: 40968
[('648', 0.02), ('3750', 0.02), ('5491', 0.02), ('2662', 0.02), ('7539', 0.02)]
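Keyword analysis: based on the output above, and setting aside the punctuation marks 3750, 900, and 648:
1. For category 0, 2465 and 6122 appear most frequently.
2. For category 1, 3370 and 4464 appear most frequently.
The remaining categories can be read off in the same way.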
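Reading keywords off the top of each list works, but many characters are frequent in every category. As a hedged alternative (a heuristic swapped in here, not the method used above), the sketch below scores each character by the share of in-category texts that contain it minus the share of texts from other categories that contain it. `doc_share`, `distinctive_tokens`, `PUNCT`, and the toy rows are all assumptions for illustration.

```python
from collections import Counter

# Assumed punctuation token ids, following the hypothesis in the text.
PUNCT = frozenset({"3750", "900", "648"})


def doc_share(texts, exclude=PUNCT):
    """Fraction of texts containing each token, ignoring excluded tokens."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.split()) - exclude)
    return {tok: c / len(texts) for tok, c in counts.items()}


def distinctive_tokens(rows, label, k=2):
    """Rank tokens by in-category share minus out-of-category share.

    `rows` must be a list of (label, text) pairs because it is scanned twice.
    """
    inside = [text for lab, text in rows if lab == label]
    outside = [text for lab, text in rows if lab != label]
    in_share = doc_share(inside)
    out_share = doc_share(outside)
    scored = {tok: s - out_share.get(tok, 0.0) for tok, s in in_share.items()}
    # Sort by descending score, breaking ties by token id for stable output.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Invented rows for illustration only; the token ids carry no meaning.
toy_rows = [
    (0, "101 3750 202 648"),
    (0, "101 900 202 303"),
    (1, "404 3750 505 303"),
    (1, "404 900 101 505"),
]
print(distinctive_tokens(toy_rows, 0))
print(distinctive_tokens(toy_rows, 1))
```

A score near 1.0 means the character appears in almost every text of the category and almost never elsewhere; a score near 0 means it is equally common inside and outside the category.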