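Tianchi Beginner NLP News Text Classification, Task 2: Data Analysis

NLP Text Classification Task 2: Data Analysis

1. Read the competition data

2. Visualize the news length and label distributions

3. Analyze the number of sentences per news item and the most frequent characters in each news category

4. Analyze the keywords of each news category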

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
data_df = pd.read_csv('data/train_set.csv',sep='\t')
data_df.head()
   label                                               text
0      2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2      3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3      2  7159 948 4866 2109 5520 2490 211 3956 5520 549...
4      3  3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

Data analysis

1. How long are the news texts in the competition data?

2. How are the labels distributed?

3. How are the characters distributed?

data_df['text_len'] = data_df['text'].apply(lambda x:len(x.split(' ')))
data_df['text_len'].describe()
count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64

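The longest text contains more than 50,000 characters (57,921) and the shortest only 2. To choose a padding length later, we next check where most text lengths fall.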

_ = plt.hist(data_df['text_len'],bins = 200,range=(0,10000))
plt.xlabel('Text char count')
plt.title('histogram of char count')
Text(0.5, 1.0, 'histogram of char count')

[Figure: histogram of char count, x-axis: Text char count]

The histogram shows that most texts are shorter than 2,000 characters.
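When picking a padding or truncation length from this distribution, it can help to check what share of texts a candidate length would keep intact. The following is a minimal sketch, not part of the original analysis: the helper names `length_coverage` and `smallest_max_len` and the toy lengths are made up for illustration.

```python
import math


def length_coverage(lengths, max_len):
    """Return the fraction of texts whose length is at most max_len."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


def smallest_max_len(lengths, target=0.95):
    """Return the smallest observed length that covers at least `target` of the texts."""
    ordered = sorted(lengths)
    if not ordered:
        raise ValueError("lengths must not be empty")
    # Position of the first element at which the cumulative share reaches `target`.
    # round() guards against floating-point noise such as 7.000000000000001.
    idx = max(0, math.ceil(round(target * len(ordered), 9)) - 1)
    return ordered[idx]


# Hypothetical lengths, only for illustration.
toy_lengths = [2, 374, 676, 907, 1131, 1500, 1900, 2500, 4000, 57921]
print(length_coverage(toy_lengths, 2000))
print(smallest_max_len(toy_lengths, target=0.9))
```

On the real data the same helpers could be called with `data_df['text_len']` instead of the toy list.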

# News category distribution
data_df['label'].value_counts().plot(kind='bar')
plt.title('news class count')
plt.xlabel('category')
Text(0.5, 0, 'category')

[Figure: bar chart "news class count", x-axis: category]

The class distribution is uneven: in the training set, technology news is the most common category and horoscope news the least common.
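To put a number on the imbalance rather than reading it off the bar chart, one option is to compute each label's share and the ratio between the largest and smallest class. This is a small sketch under assumptions: `label_distribution` is a hypothetical helper and the labels below are toy values, not the competition data.

```python
from collections import Counter


def label_distribution(labels):
    """Return (share per label, ratio of largest to smallest class size)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


# Hypothetical labels, only for illustration.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 13]
shares, ratio = label_distribution(toy_labels)
print(shares)
print(ratio)
```

With the real data, `label_distribution(data_df['label'])` would give the same information as `value_counts(normalize=True)` plus the imbalance ratio.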

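Character distribution statistics

1. Character frequency statistics

2. Sentence distribution statistics

a. Concatenate all texts in the training set, split them into characters, and count how many times each character occurs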

from collections import Counter
all_lines = ' '.join(list(data_df['text']))
word_count = Counter(all_lines.split(' '))
word_count = sorted(word_count.items(),key=lambda x:x[1],reverse=True)
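print('Number of distinct characters:', len(word_count))
print('Most frequent character:', word_count[0])
print('Least frequent character:', word_count[-1])
Number of distinct characters: 6869
Most frequent character: ('3750', 7482224)
Least frequent character: ('3133', 1)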
word_count[:10]
[('3750', 7482224),
 ('648', 4924890),
 ('900', 3262544),
 ('3370', 2020958),
 ('6122', 1602363),
 ('4464', 1544962),
 ('7399', 1455864),
 ('4939', 1387951),
 ('3659', 1251253),
 ('4811', 1159401)]
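Joining every text into one string, as in step a, builds a very large temporary string before counting. A possible alternative, sketched here with a hypothetical helper `count_tokens` and toy texts, is to update a single `Counter` one text at a time, which yields the same counts without the intermediate string.

```python
from collections import Counter


def count_tokens(texts):
    """Count token occurrences across all texts, one text at a time."""
    counter = Counter()
    for text in texts:
        # split() without arguments also ignores repeated or trailing spaces.
        counter.update(text.split())
    return counter


# Hypothetical texts, only for illustration.
toy_texts = ["3750 648 3750", "900 3750", "648"]
toy_counts = count_tokens(toy_texts)
print(len(toy_counts))
print(toy_counts.most_common(2))
```

`Counter.most_common(n)` returns the same kind of sorted `(token, count)` list that the `sorted(...)` call above produces, so `count_tokens(data_df['text']).most_common(10)` could stand in for `word_count[:10]`.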

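b. Sentence distribution statistics: use the number of news items in which each character appears to infer which characters are punctuation marks. Each character is counted at most once per news item.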

data_df['text_unique'] = data_df['text'].apply(lambda x:' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(data_df['text_unique']))
word_in_line = Counter(all_lines.split(' '))
word_count_in_line = sorted(word_in_line.items(),key = lambda x:x[1],reverse=True)
len(word_in_line)
6869
word_count_in_line[:4]
[('3750', 197997), ('900', 197653), ('648', 191975), ('2465', 177310)]

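The training set has 200,000 news items. 3750 and 900 each appear in about 99% of them (197,997 and 197,653 items), and 648 appears in about 96% (191,975 items).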
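The coverage figures are simply the document frequency of a character divided by the number of news items. A minimal sketch of that calculation follows; `document_frequency` and `coverage` are hypothetical helper names and the texts are toy data.

```python
from collections import Counter


def document_frequency(texts):
    """Count, for each token, the number of texts that contain it at least once."""
    doc_freq = Counter()
    for text in texts:
        # set() makes sure a token is counted once per text.
        doc_freq.update(set(text.split()))
    return doc_freq


def coverage(texts, top_n=3):
    """Return (token, share of texts containing it) for the top_n tokens."""
    texts = list(texts)
    if not texts:
        raise ValueError("texts must not be empty")
    doc_freq = document_frequency(texts)
    return [(tok, n / len(texts)) for tok, n in doc_freq.most_common(top_n)]


# Hypothetical texts, only for illustration.
toy_texts = ["3750 12 900", "3750 648 7", "3750 900 900", "5 648 3750"]
toy_coverage = dict(coverage(toy_texts, top_n=3))
print(toy_coverage)
```

Applied to the counts printed above, the same division gives 197997 / 200000 ≈ 0.990 for 3750, 197653 / 200000 ≈ 0.988 for 900, and 191975 / 200000 ≈ 0.960 for 648.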

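The analysis above leads to the following conclusions:

1. The shortest news text has 2 characters and the longest more than 50,000; most are shorter than 2,000
2. The label distribution is uneven: technology news has close to 40,000 samples, while horoscope news has fewer than 1,000
3. The training set contains 6,869 distinct characters; 3750 and 900 each appear in about 99% of all news items and 648 in about 96%

Assuming that the characters '3750', '900' and '648' are punctuation marks in the news text, we analyze how many sentences each news item contains on average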

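import re
# Split each text on the assumed punctuation tokens.
# \b restricts matches to whole tokens, so '900' does not match inside a longer token such as '1900'.
# Note: the output shown below was generated with the pattern '3760|900|648' (3760 is a typo for 3750),
# so its numbers do not reflect splits on 3750.
data_df['sentence_count_new'] = data_df['text'].apply(lambda x: re.split(r'\b(?:3750|900|648)\b', x))
data_df['sentence_count'] = data_df['sentence_count_new'].apply(len)
data_df['sentence_count'].describe()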
count    200000.000000
mean         43.391315
std          48.039653
min           1.000000
25%          15.000000
50%          30.000000
75%          55.000000
max        1867.000000
Name: sentence_count, dtype: float64
_ = plt.hist(data_df['sentence_count'],bins = 200)
plt.xlabel('sentence counts per news')
plt.title('sentence counts')
Text(0.5, 1.0, 'sentence counts')

[Figure: histogram "sentence counts", x-axis: sentence counts per news]
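Splitting the raw string with a regular expression can also match a punctuation candidate inside a longer token. An alternative is to work on whole tokens. The sketch below is an assumption-laden illustration, not the original method: `count_sentences` and `PUNCT` are hypothetical names, the text is toy data, and a "sentence" is defined here as a non-empty run of tokens between punctuation tokens, so empty pieces are not counted.

```python
import re

# Assumed punctuation tokens, taken from the hypothesis in the text.
PUNCT = {"3750", "900", "648"}


def count_sentences(text, punct=PUNCT):
    """Count non-empty runs of tokens separated by punctuation tokens."""
    count = 0
    in_sentence = False
    for tok in text.split():
        if tok in punct:
            # A punctuation token closes the current sentence, if any.
            in_sentence = False
        elif not in_sentence:
            # First ordinary token after a boundary starts a new sentence.
            in_sentence = True
            count += 1
    return count


# Hypothetical text: the last token '1900' contains '900' but is not punctuation.
toy_text = "12 34 3750 56 900 648 78 1900"
print(count_sentences(toy_text))
# For comparison: substring-based splitting also cuts inside '1900'
# and counts the whitespace-only piece between '900' and '648'.
print(len(re.split("3750|900|648", toy_text)))
```

On the real data this could be applied as `data_df['text'].apply(count_sentences)`; the resulting counts will differ from the regex-based ones because of the stricter sentence definition.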

# Most frequent characters in each news category
groups = data_df.groupby('label')
from collections import Counter
for name,group in groups:
    print('label',name)
    # Count every character occurrence within this category
    all_lines = ' '.join(list(group['text']))
    word_count = Counter(all_lines.split(' '))
    word_count = sorted(word_count.items(),key = lambda x:x[1],reverse=True)
    print(word_count[:5])
label 0
[('3750', 1267331), ('648', 967653), ('900', 577742), ('3370', 503768), ('4464', 307431)]
label 1
[('3750', 1200686), ('648', 714152), ('3370', 626708), ('900', 542884), ('4464', 445525)]
label 2
[('3750', 1458331), ('648', 974639), ('900', 618294), ('7399', 351894), ('6122', 343850)]
label 3
[('3750', 774668), ('648', 494477), ('900', 298663), ('6122', 187933), ('4939', 173606)]
label 4
[('3750', 360839), ('648', 231863), ('900', 190842), ('4411', 120442), ('7399', 86190)]
label 5
[('3750', 715740), ('648', 329051), ('900', 305241), ('6122', 159125), ('5598', 136713)]
label 6
[('3750', 469540), ('648', 345372), ('900', 222488), ('6248', 193757), ('2555', 175234)]
label 7
[('3750', 428638), ('648', 262220), ('900', 184131), ('3370', 159156), ('5296', 132136)]
label 8
[('3750', 242367), ('648', 202399), ('900', 92207), ('6122', 57345), ('4939', 56147)]
label 9
[('3750', 178783), ('648', 157291), ('900', 70680), ('7328', 46477), ('6122', 43411)]
label 10
[('3750', 180259), ('648', 114512), ('900', 75185), ('3370', 67780), ('2465', 45163)]
label 11
[('3750', 83834), ('648', 67353), ('900', 37240), ('4939', 18591), ('6122', 18438)]
label 12
[('3750', 87412), ('4464', 51426), ('3370', 45815), ('648', 37041), ('2465', 36610)]
label 13
[('3750', 33796), ('648', 26867), ('900', 11263), ('4939', 9651), ('669', 8925)]
# The top 3 in almost every category are the punctuation candidates, so next we look at
# how often characters occur relative to the number of sentences in each category
# Note: the output shown below was generated with the pattern '3760|900|648' (3760 is a typo for 3750)
for name,group in groups:
    print('label',name)
    # Count each character at most once per news item (document frequency)
    text_unique = group['text'].apply(lambda x:' '.join(set(x.split(' '))))
    all_word = ' '.join(text_unique)
    word_count = Counter(all_word.split(' '))
    word_count = sorted(word_count.items(),key = lambda x:x[1],reverse=True)
    # Sentences per news item; local Series avoid assigning columns to the group slice
    sentence_counts = group['text'].apply(lambda x: len(re.split(r'\b(?:3750|900|648)\b', x)))
    total_sentences = int(sentence_counts.sum())
    print('Total sentences:',total_sentences)
    freq = [(x[0], round(x[1]/total_sentences,2)) for x in word_count[:5]]
    print(freq)
label 0
Total sentences: 1650511
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('2465', 0.02), ('6122', 0.02)]
label 1
Total sentences: 1333264
[('900', 0.03), ('3750', 0.03), ('3370', 0.03), ('4464', 0.02), ('648', 0.02)]
label 2
Total sentences: 1657969
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('7399', 0.02), ('2465', 0.02)]
label 3
Total sentences: 858692
[('3750', 0.03), ('900', 0.03), ('648', 0.03), ('2465', 0.02), ('7399', 0.02)]
label 4
Total sentences: 450264
[('900', 0.03), ('3750', 0.03), ('4853', 0.03), ('648', 0.03), ('7399', 0.03)]
label 5
Total sentences: 669496
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 6
Total sentences: 611244
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('5620', 0.02), ('4811', 0.01)]
label 7
Total sentences: 465720
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('3370', 0.02), ('1699', 0.02)]
label 8
Total sentences: 309819
[('648', 0.02), ('3750', 0.02), ('900', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 9
Total sentences: 243676
[('3750', 0.02), ('648', 0.02), ('900', 0.02), ('7399', 0.02), ('6122', 0.02)]
label 10
Total sentences: 199761
[('900', 0.02), ('3750', 0.02), ('885', 0.02), ('648', 0.02), ('3686', 0.02)]
label 11
Total sentences: 110063
[('3750', 0.03), ('648', 0.03), ('900', 0.03), ('6122', 0.03), ('7539', 0.03)]
label 12
Total sentences: 76816
[('3370', 0.02), ('4464', 0.02), ('3750', 0.02), ('900', 0.02), ('7539', 0.02)]
label 13
Total sentences: 40968
[('648', 0.02), ('3750', 0.02), ('5491', 0.02), ('2662', 0.02), ('7539', 0.02)]

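Keyword analysis: setting aside the assumed punctuation marks 3750, 900 and 648, the raw per-category counts printed above show that

1. For label 0, 3370 and 4464 occur most often
2. For label 1, 3370 and 4464 also occur most often
and so on for the other labels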
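Reading keywords off the top-5 lists by eye is error-prone because the punctuation candidates take up most of the slots. One way to make the step explicit is to drop those tokens before counting. This is a sketch only: `top_tokens_by_label` and `PUNCT` are hypothetical names, and the labels and texts are toy data.

```python
from collections import Counter, defaultdict

# Assumed punctuation tokens, taken from the hypothesis in the text.
PUNCT = {"3750", "900", "648"}


def top_tokens_by_label(labels, texts, punct=PUNCT, top_n=2):
    """Return the top_n most frequent non-punctuation tokens for each label."""
    counters = defaultdict(Counter)
    for label, text in zip(labels, texts):
        # Skip the punctuation candidates so they cannot crowd out real keywords.
        counters[label].update(tok for tok in text.split() if tok not in punct)
    return {label: counter.most_common(top_n)
            for label, counter in sorted(counters.items())}


# Hypothetical labels and texts, only for illustration.
toy_labels = [0, 0, 1]
toy_texts = [
    "3750 3370 3370 4464 900",
    "3370 648 4464 4464 3370",
    "6122 6122 3750 7399",
]
toy_keywords = top_tokens_by_label(toy_labels, toy_texts)
print(toy_keywords)
```

With the real data the call would be `top_tokens_by_label(data_df['label'], data_df['text'], top_n=5)`. Raw counts still favor characters that are common everywhere, so a per-category comparison against the overall frequency may be needed to find truly distinctive keywords.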