NLP for Beginners: News Text Classification Competition, Task 2

一阵星星雨

Category: competition. Tags: nlp, data analysis, python

Article link: https://blog.csdn.net/qq_39526018/article/details/107521663

I. Data Loading

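import re
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

# nrows limits the number of rows read. The full dataset is large, so only the first 10,000 rows are loaded, and all statistics below are computed on these 10,000 rows.
train_df = pd.read_csv('../dataset/train_set.csv', sep='\t', nrows=10000)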

II. Sentence Length Analysis

# Length of each text = number of space-separated characters
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].head(5))  # show the first five rows
"""
0    1057
1     486
2     764
3    1570
4     307
Name: text_len, dtype: int64
"""
print(train_df['text_len'].describe())
"""
count    10000.000000
mean       908.766300
std       1033.708515
min         15.000000
25%        375.000000
50%        672.500000
75%       1123.000000
max      44665.000000
Name: text_len, dtype: float64
"""

# Histogram of text lengths
_ = plt.hist(train_df['text_len'], bins=100)
plt.xlabel('Text char count')
plt.ylabel('Number of texts')
plt.title("Histogram of char count")
plt.show()

[Figure: histogram of text lengths]

Analysis: each text contains about 909 characters on average; the shortest text has 15 characters and the longest has 44,665.
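
To make the length computation concrete, here is a minimal sketch on toy data. It uses plain Python instead of pandas so that it runs without the dataset; `text_lengths` and `toy_texts` are hypothetical names and values invented for illustration, written in the same anonymized format (one integer id per character, separated by spaces).

```python
from statistics import median


def text_lengths(texts):
    """Return the number of space-separated characters in each text."""
    return [len(text.split(' ')) for text in texts]


# Hypothetical toy texts, not taken from the competition data.
toy_texts = [
    "2967 6758 339 2021",
    "4464 486 6352",
    "7346 4068 5074 3747 5681 6093",
]

lengths = text_lengths(toy_texts)
print(lengths)                     # [4, 3, 6]
print(median(lengths))             # 4
print(min(lengths), max(lengths))  # 3 6
```

Note that `''.split(' ')` returns `['']`, so an empty text would be counted as length 1 by this approach.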

III. News Category Distribution

# Bar chart of the news category distribution
print(type(train_df['label'].value_counts()))  # value_counts() returns a pandas Series
train_df['label'].value_counts().plot(kind='bar')
plt.xlabel("category")
plt.ylabel("newsCount")
plt.title('Histogram of news class count')
plt.show()

[Figure: bar chart of news counts per category]

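Analysis: the bar chart shows that the news items are unevenly distributed across the 14 categories, with counts decreasing from one category to the next. Category 0 has the most items, followed by category 1, and category 13 has the fewest.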
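
The imbalance is easier to judge as shares than as raw counts. The following is a small sketch on made-up labels; `label_shares` and `toy_labels` are hypothetical and only illustrate the idea. On the real DataFrame, `train_df['label'].value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of all samples, most frequent label first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical toy labels, not the real class distribution.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 13]

shares = label_shares(toy_labels)
print(shares)  # {0: 0.4, 1: 0.3, 2: 0.2, 13: 0.1}
```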

IV. Character Distribution Statistics

# Concatenate all texts and count how often each character occurs
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
print("word_count num is: " + str(len(word_count)))  # number of distinct characters
"""
word_count num is: 5340
"""

# Keep each character only once per text, so the counts below are the number of texts that contain it
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
disrepeat_lines = ' '.join(list(train_df['text_unique']))
disrepeatWord_count = Counter(disrepeat_lines.split(" "))
disrepeatWord_count = sorted(disrepeatWord_count.items(), key=lambda d: int(d[1]), reverse=True)

# The four characters contained in the most texts, with the number of texts containing each
for item in disrepeatWord_count[:4]:
    print(item)
"""
('3750', 9913)
('900', 9898)
('648', 9628)
('6122', 8869)
"""
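
The two counts in this section answer different questions: the first counts every occurrence of a character, while the second (after deduplicating with `set`) counts the number of texts that contain it. Below is a minimal sketch of the difference on toy data; `count_occurrences`, `count_documents` and `toy_texts` are hypothetical names and values used only for illustration.

```python
from collections import Counter


def count_occurrences(texts):
    """Count every occurrence of each character across all texts."""
    counter = Counter()
    for text in texts:
        counter.update(text.split(' '))
    return counter


def count_documents(texts):
    """Count in how many texts each character appears at least once."""
    counter = Counter()
    for text in texts:
        counter.update(set(text.split(' ')))
    return counter


# Hypothetical toy texts: '12' is frequent overall, '900' is spread over more texts.
toy_texts = ["900 12 900 34", "900 56", "12 12 12 12", "900 78"]

print(count_occurrences(toy_texts).most_common(1))  # [('12', 5)]
print(count_documents(toy_texts).most_common(1))    # [('900', 3)]
```

A character that shows up in almost every text, as '3750', '900' and '648' do above (9,913, 9,898 and 9,628 of 10,000 texts), is a plausible punctuation candidate, which is the assumption the next section builds on.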

V. Tasks

1. Count how many sentences each news article contains

# Treat '3750', '900' and '648' as punctuation marks and count them to get the number of sentences per text
train_df['text_sentence'] = train_df['text'].apply(lambda x: sum(token in ('3750', '900', '648') for token in x.split(' ')))
print(train_df['text_sentence'].describe())
"""
count    10000.000000
mean        78.122000
std         85.411257
min          0.000000
25%         27.000000
50%         54.000000
75%        100.000000
max       1518.000000
Name: text_sentence, dtype: float64
"""

# Line plot of the number of sentences in each news article
train_df['text_sentence'].plot(kind='line', color='b')
plt.xlabel("news")
plt.ylabel("sentenceCount")
plt.title("Sentence count per news article")
plt.show()

[Figure: line plot of sentence counts per news article]

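Analysis: counting sentences by splitting on the punctuation characters, each news article contains about 78 sentences on average, and the maximum is 1,518.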
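
The counting rule can be checked on a few toy texts. This is a sketch under the same assumption as above, namely that '3750', '900' and '648' are punctuation marks; `count_sentences` and `toy_texts` are hypothetical names and values.

```python
# Assumption carried over from the text: these three characters are punctuation.
SENTENCE_DELIMITERS = {'3750', '900', '648'}


def count_sentences(text, delimiters=SENTENCE_DELIMITERS):
    """Count the delimiter characters in a text as its number of sentences."""
    return sum(token in delimiters for token in text.split(' '))


# Hypothetical toy texts.
toy_texts = ["11 22 900 33 44 3750 55 648", "11 22 33", "11 900 22"]

counts = [count_sentences(text) for text in toy_texts]
print(counts)  # [3, 0, 1]
```

A text without any of the three characters gets a count of 0, which is consistent with the minimum of 0 in the `describe()` output. A trailing fragment after the last delimiter (the "22" in the third toy text) is not counted as a sentence by this rule.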

2. Count how often each character occurs in each news category

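all_class_lines = []
for i in range(14):
    # Join all news texts that belong to category i
    line = ' '.join(train_df[train_df['label'] == i]['text'])
    # Remove punctuation as whole tokens; without \b, '900' would also be cut out of tokens such as '1900'
    all_class_lines.append(re.sub(r'\b(?:3750|900|648)\b', '', line))
for index, line in enumerate(all_class_lines):
    # For each category, find the character that occurs most often
    line = filter(lambda x: x, line.split(' '))  # drop the empty strings left by the removal
    news_wordCount = Counter(line)
    news_wordCount = sorted(news_wordCount.items(), key=lambda d: int(d[1]), reverse=True)
    print(index, ':', news_wordCount[0])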

"""
0 : ('3370', 23597)
1 : ('3370', 33483)
2 : ('7399', 17031)
3 : ('6122', 8914)
4 : ('4411', 6173)
5 : ('6122', 7826)
6 : ('6248', 9544)
7 : ('3370', 7587)
8 : ('6122', 3215)
9 : ('2465', 2816)
10 : ('3370', 3275)
11 : ('4939', 942)
12 : ('4464', 2828)
13 : ('4939', 538)
"""
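
The per-category count can also be written without building one long string per class. The following is a minimal sketch on toy data, again assuming that '3750', '900' and '648' are punctuation; `top_chars_per_class`, `toy_texts` and `toy_labels` are hypothetical names and values, not part of the original solution.

```python
from collections import Counter, defaultdict

# Assumption carried over from the text: these three characters are punctuation.
PUNCTUATION = {'3750', '900', '648'}


def top_chars_per_class(texts, labels, k=1):
    """Return the k most frequent non-punctuation characters for each label."""
    counters = defaultdict(Counter)
    for text, label in zip(texts, labels):
        # Compare whole tokens, so a character such as '1900' is left intact.
        counters[label].update(
            token for token in text.split(' ') if token and token not in PUNCTUATION
        )
    return {label: counter.most_common(k) for label, counter in sorted(counters.items())}


# Hypothetical toy data with two categories.
toy_texts = ["3370 3370 900 1900", "3370 648 5", "7399 7399 3750 3370"]
toy_labels = [0, 0, 1]

result = top_chars_per_class(toy_texts, toy_labels)
print(result)  # {0: [('3370', 3)], 1: [('7399', 2)]}
```

Passing a larger `k` returns several characters per category, which may be more informative than the single top character when the same character (such as '3370' in the output above) leads in many categories.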