Tianchi Beginner NLP: News Text Classification, Task 2 -- Data Analysis


佛系


Category: NLP. Tags: python, data analysis, machine learning, data mining, nlp

Article link: https://blog.csdn.net/weixin_41667774/article/details/107523746


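NLP Text Classification Task 2: Data Analysis

1. Read the competition data

2. Visualized the news length and the label distribution

3. Analyzed the average number of sentences per news article and the most frequent characters in each news class

4. Analyzed the keywords of each news class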

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

data_df = pd.read_csv('data/train_set.csv',sep='\t')

data_df.head()

	label	text
0	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3	2	7159 948 4866 2109 5520 2490 211 3956 5520 549...
4	3	3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

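Data analysis

1. How long are the news texts in the competition data?

2. How are the labels distributed?

3. What does the character distribution in the competition data look like?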

data_df['text_len'] = data_df['text'].apply(lambda x:len(x.split(' ')))

data_df['text_len'].describe()

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64

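The statistics above show that the longest text has more than 50,000 characters and the shortest has only 2. To prepare for padding later, we next look at where most text lengths are concentrated.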

_ = plt.hist(data_df['text_len'],bins = 200,range=(0,10000))
plt.xlabel('Text char count')
plt.title('histogram of char count')

Text(0.5, 1.0, 'histogram of char count')

[Figure: histogram of char count; x-axis: Text char count]

The histogram shows that most text lengths fall within 2000 characters.
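To turn this observation into a concrete padding length, one option is to pick the length that covers a chosen share of the texts. The sketch below uses the nearest-rank percentile on a small made-up list of lengths; the helper name `length_at_percentile` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import math


def length_at_percentile(lengths, pct):
    """Return the smallest length that covers at least `pct` percent of the texts."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank method: rank = ceil(pct * n / 100), clamped to [1, n].
    rank = max(1, min(len(ordered), math.ceil(pct * len(ordered) / 100)))
    return ordered[rank - 1]


# Toy lengths standing in for data_df['text_len'] (made-up values).
toy_lengths = [2, 150, 374, 676, 900, 1131, 1500, 1900, 3000, 57921]

# A candidate padding/truncation length that covers 90% of the toy texts.
max_len = length_at_percentile(toy_lengths, 90)
print(max_len)
```

On the real data, the same question can probably be answered directly with `data_df['text_len'].quantile(...)`, which interpolates between values by default and so may return a slightly different number than the nearest-rank method.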

# News class distribution
data_df['label'].value_counts().plot(kind='bar')
plt.title('news class count')
plt.xlabel('category')

Text(0.5, 0, 'category')

[Figure: bar chart of news class count; x-axis: category]

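The chart shows that the class distribution of the competition dataset is imbalanced: in the training set, technology news is the most common class and horoscope news the least common.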
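The imbalance can also be expressed numerically instead of being read off a bar chart. The following is a minimal sketch on a made-up label list; `label_distribution` is a hypothetical helper, and the toy labels do not reflect the real class counts.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the data and the largest/smallest class ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    # Ratio between the most and least frequent class; 1.0 means perfectly balanced.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


# Toy labels standing in for data_df['label'] (made-up values).
toy_labels = [0] * 6 + [1] * 3 + [13] * 1

shares, ratio = label_distribution(toy_labels)
print(shares)
print(ratio)
```

A large ratio is a hint that accuracy alone may be misleading and that a per-class metric is worth tracking.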

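Character distribution statistics

1. Character frequency statistics

2. Per-sentence distribution statistics

a. Concatenate all the texts in the training set, split them into characters, and count how many times each character occurs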

from collections import Counter
all_lines = ' '.join(list(data_df['text']))
word_count = Counter(all_lines.split(' '))
word_count = sorted(word_count.items(),key=lambda x:x[1],reverse=True)
print('Number of distinct characters',len(word_count))
print('Most frequent character',word_count[0])
print('Least frequent character',word_count[-1])

Number of distinct characters 6869
Most frequent character ('3750', 7482224)
Least frequent character ('3133', 1)

word_count[:10]

[('3750', 7482224),
 ('648', 4924890),
 ('900', 3262544),
 ('3370', 2020958),
 ('6122', 1602363),
 ('4464', 1544962),
 ('7399', 1455864),
 ('4939', 1387951),
 ('3659', 1251253),
 ('4811', 1159401)]

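b. Per-sentence distribution statistics: infer which characters are punctuation marks from how widely each character appears across the texts. Here we count, for each character, the number of news articles that contain it.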

data_df['text_unique'] = data_df['text'].apply(lambda x:' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(data_df['text_unique']))

word_in_line = Counter(all_lines.split(' '))
word_count_in_line = sorted(word_in_line.items(),key = lambda x:x[1],reverse=True)

len(word_in_line)

word_count_in_line[:4]

[('3750', 197997), ('900', 197653), ('648', 191975), ('2465', 177310)]

There are 200,000 articles in total; the characters 3750, 900, and 648 appear in roughly 99%, 99%, and 96% of them, respectively.
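The coverage figures come from dividing each character's document count by the number of articles. A minimal sketch of that calculation on made-up texts follows; `document_coverage` is a hypothetical helper name, and the toy texts only mimic the format of the real data.

```python
from collections import Counter


def document_coverage(texts):
    """Return, for each token, the fraction of texts that contain it at least once."""
    doc_freq = Counter()
    for text in texts:
        # set() ensures a token is counted once per text, however often it repeats.
        doc_freq.update(set(text.split()))
    n_texts = len(texts)
    return {token: count / n_texts for token, count in doc_freq.items()}


# Toy texts standing in for data_df['text'] (made-up values).
toy_texts = ["12 900 7 3750", "3750 5 5 900", "8 3750 648", "9 9 9 3750"]

coverage = document_coverage(toy_texts)
print(sorted(coverage.items(), key=lambda item: item[1], reverse=True)[:3])
```

Tokens whose coverage is close to 1.0 are candidates for punctuation or other function characters, which is the reasoning the text applies to 3750, 900, and 648.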

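The analysis above leads to the following conclusions:

1. The shortest news text has length 2 and the longest more than 50,000; most are under 2000.
2. The label distribution is imbalanced: technology news has close to 40,000 samples, while horoscope news has fewer than 1,000.
3. The training set contains 6869 distinct characters; of these, 3750, 900, and 648 appear in about 99%, 99%, and 96% of all news articles, respectively.

Assuming the characters '3750', '900', and '648' are punctuation marks in the news text, we analyze how many sentences each article contains on average.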

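import re
# Split on the assumed punctuation characters '3750', '900', '648'.
# \b keeps the pattern from matching inside longer characters such as '6480' or '1900'.
# Note: the statistics shown below came from a run with the pattern '3760|900|648',
# so this pattern gives different (larger) sentence counts.
data_df['sentence_count_new'] = data_df['text'].apply(lambda x: re.split(r'\b(?:3750|900|648)\b', x))
data_df['sentence_count'] = data_df['sentence_count_new'].apply(len)

data_df['sentence_count'].describe()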

count    200000.000000
mean         43.391315
std          48.039653
min           1.000000
25%          15.000000
50%          30.000000
75%          55.000000
max        1867.000000
Name: sentence_count, dtype: float64
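One detail worth checking when splitting on numeric character IDs with a regular expression: a bare alternation such as `3750|900|648` also matches inside longer IDs like `6480` or `1900`, which inflates the sentence count. The sketch below contrasts that with a token-wise count on a made-up sample; `count_sentences_tokenwise` is a hypothetical helper, and treating these three IDs as punctuation is the same assumption the text makes.

```python
import re

# Assumed punctuation IDs, as hypothesized in the text.
PUNCT = {"3750", "900", "648"}


def count_sentences_tokenwise(text, punct=PUNCT):
    """Count non-empty runs of tokens separated by the assumed punctuation tokens."""
    count = 0
    in_sentence = False
    for token in text.split():
        if token in punct:
            in_sentence = False
        elif not in_sentence:
            # First non-punctuation token after a boundary starts a new sentence.
            in_sentence = True
            count += 1
    return count


# Made-up sample: '6480' and '1900' merely contain punctuation IDs as substrings.
sample = "11 6480 22 900 33 1900 44 648 55"

naive_parts = re.split("3750|900|648", sample)
print(len(naive_parts))                   # the substring matches add extra splits
print(count_sentences_tokenwise(sample))  # only whole tokens act as boundaries
```

Unlike `len(re.split(...))`, the token-wise count ignores empty segments, so consecutive punctuation characters do not add sentences.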

_ = plt.hist(data_df['sentence_count'],bins = 200)
plt.xlabel('sentence counts per news')
plt.title('sentence counts')

Text(0.5, 1.0, 'sentence counts')

[Figure: histogram of sentence counts; x-axis: sentence counts per news]

# Look at the most frequent characters in each news class
groups = data_df.groupby('label')

from collections import Counter
for name,group in groups:
    print('label',name)
    # Count every occurrence of each character within this class
    all_lines = ' '.join(list(group['text']))
    word_count = Counter(all_lines.split(' '))
    word_count = sorted(word_count.items(),key = lambda x:x[1],reverse=True)
    print(word_count[:5])

label 0
[('3750', 1267331), ('648', 967653), ('900', 577742), ('3370', 503768), ('4464', 307431)]
label 1
[('3750', 1200686), ('648', 714152), ('3370', 626708), ('900', 542884), ('4464', 445525)]
label 2
[('3750', 1458331), ('648', 974639), ('900', 618294), ('7399', 351894), ('6122', 343850)]
label 3
[('3750', 774668), ('648', 494477), ('900', 298663), ('6122', 187933), ('4939', 173606)]
label 4
[('3750', 360839), ('648', 231863), ('900', 190842), ('4411', 120442), ('7399', 86190)]
label 5
[('3750', 715740), ('648', 329051), ('900', 305241), ('6122', 159125), ('5598', 136713)]
label 6
[('3750', 469540), ('648', 345372), ('900', 222488), ('6248', 193757), ('2555', 175234)]
label 7
[('3750', 428638), ('648', 262220), ('900', 184131), ('3370', 159156), ('5296', 132136)]
label 8
[('3750', 242367), ('648', 202399), ('900', 92207), ('6122', 57345), ('4939', 56147)]
label 9
[('3750', 178783), ('648', 157291), ('900', 70680), ('7328', 46477), ('6122', 43411)]
label 10
[('3750', 180259), ('648', 114512), ('900', 75185), ('3370', 67780), ('2465', 45163)]
label 11
[('3750', 83834), ('648', 67353), ('900', 37240), ('4939', 18591), ('6122', 18438)]
label 12
[('3750', 87412), ('4464', 51426), ('3370', 45815), ('648', 37041), ('2465', 36610)]
label 13
[('3750', 33796), ('648', 26867), ('900', 11263), ('4939', 9651), ('669', 8925)]

# The output above shows that the top 3 in each class are still mostly punctuation,
# so next we relate how widely each character appears to the class's sentence count.
# Note: the output shown below came from a run with the pattern '3760|900|648';
# with '3750' and word boundaries the sentence totals differ.
punct_pattern = re.compile(r'\b(?:3750|900|648)\b')
for name, group in groups:
    print('label', name)
    # Count, for each character, the number of articles in this class that contain it
    text_unique = group['text'].apply(lambda x: ' '.join(set(x.split(' '))))
    all_word = ' '.join(text_unique)
    word_count = Counter(all_word.split(' '))
    word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    # Total number of sentences in this class (avoids shadowing the built-in sum)
    total_sentences = int(group['text'].apply(lambda x: len(punct_pattern.split(x))).sum())
    print('Total sentence count:', total_sentences)
    freq = [(x[0], round(x[1] / total_sentences, 2)) for x in word_count[:5]]
    print(freq)

label 0
Total sentence count: 1650511
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('2465', 0.02), ('6122', 0.02)]
label 1
Total sentence count: 1333264
[('900', 0.03), ('3750', 0.03), ('3370', 0.03), ('4464', 0.02), ('648', 0.02)]
label 2
Total sentence count: 1657969
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('7399', 0.02), ('2465', 0.02)]
label 3
Total sentence count: 858692
[('3750', 0.03), ('900', 0.03), ('648', 0.03), ('2465', 0.02), ('7399', 0.02)]
label 4
Total sentence count: 450264
[('900', 0.03), ('3750', 0.03), ('4853', 0.03), ('648', 0.03), ('7399', 0.03)]
label 5
Total sentence count: 669496
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 6
Total sentence count: 611244
[('3750', 0.02), ('900', 0.02), ('648', 0.02), ('5620', 0.02), ('4811', 0.01)]
label 7
Total sentence count: 465720
[('900', 0.02), ('3750', 0.02), ('648', 0.02), ('3370', 0.02), ('1699', 0.02)]
label 8
Total sentence count: 309819
[('648', 0.02), ('3750', 0.02), ('900', 0.02), ('6122', 0.02), ('7399', 0.02)]
label 9
Total sentence count: 243676
[('3750', 0.02), ('648', 0.02), ('900', 0.02), ('7399', 0.02), ('6122', 0.02)]
label 10
Total sentence count: 199761
[('900', 0.02), ('3750', 0.02), ('885', 0.02), ('648', 0.02), ('3686', 0.02)]
label 11
Total sentence count: 110063
[('3750', 0.03), ('648', 0.03), ('900', 0.03), ('6122', 0.03), ('7539', 0.03)]
label 12
Total sentence count: 76816
[('3370', 0.02), ('4464', 0.02), ('3750', 0.02), ('900', 0.02), ('7539', 0.02)]
label 13
Total sentence count: 40968
[('648', 0.02), ('3750', 0.02), ('5491', 0.02), ('2662', 0.02), ('7539', 0.02)]

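Keyword analysis: setting aside the assumed punctuation characters 3750, 900, and 648, the output directly above shows that

1. For class 0, 2465 and 6122 rank highest
2. For class 1, 3370 and 4464 rank highest
and so on for the remaining classes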
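Reading keywords off the printed lists gets easier if the assumed punctuation is filtered out before ranking. The following is a minimal sketch on made-up rows; `top_tokens_per_class` is a hypothetical helper, the `(label, text)` rows stand in for the `label` and `text` columns, and the toy values are not taken from the real data.

```python
from collections import Counter

# Assumed punctuation IDs, as hypothesized in the text.
PUNCT = {"3750", "900", "648"}


def top_tokens_per_class(rows, k=2, exclude=PUNCT):
    """Return the k most frequent non-excluded tokens for each label."""
    per_class = {}
    for label, text in rows:
        counter = per_class.setdefault(label, Counter())
        # Skip the assumed punctuation so it cannot crowd out real keywords.
        counter.update(tok for tok in text.split() if tok not in exclude)
    return {label: counter.most_common(k) for label, counter in per_class.items()}


# Toy (label, text) rows in the same format as the training set (made-up values).
toy_rows = [
    (0, "3370 3750 3370 900 4464"),
    (0, "3370 648 4464 12"),
    (1, "6122 3750 6122 7399 900"),
    (1, "7399 6122 648 5"),
]

keywords = top_tokens_per_class(toy_rows)
print(keywords)
```

Raw counts still favor characters that are common in every class, so a weighting such as TF-IDF may separate class-specific keywords more clearly.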
