Task 2: Data Reading and Analysis

Latest recommended article published on 2021-12-02 23:48:45

波心冷血


Category column: 天池-新闻文本分类 (Tianchi News Text Classification)

Article link: https://blog.csdn.net/qq_22441151/article/details/107524790


import pandas as pd
import matplotlib.pyplot as plt

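# Pitfall 1: reading the file directly with pd.read_csv(path) raises an error;
# the columns are tab-separated, so pass sep='\t'
train_df = pd.read_csv('./data/train_set.csv', sep='\t')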
train_df.head(10)

	label	text
0	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3	2	7159 948 4866 2109 5520 2490 211 3956 5520 549...
4	3	3646 3055 3055 2490 4659 6065 3370 5814 2465 5...
5	9	3819 4525 1129 6725 6485 2109 3800 5264 1006 4...
6	3	307 4780 6811 1580 7539 5886 5486 3433 6644 58...
7	10	26 4270 1866 5977 3523 3764 4464 3659 4853 517...
8	12	2708 2218 5915 4559 886 1241 4819 314 4261 166...
9	3	3654 531 1348 29 4553 6722 1474 5099 7541 307 ...
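
To illustrate why the separator matters, here is a minimal sketch that reads a small in-memory sample with and without `sep='\t'`. The two-row sample is made up for illustration and only mimics the `label<TAB>text` layout. On this toy input the default call does not raise, but it fails to split the columns.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as train_set.csv: label<TAB>text
sample = "label\ttext\n2\t2967 6758 339\n11\t4464 486 6352\n"

# Default separator is ',': every line ends up in a single column
wrong_df = pd.read_csv(io.StringIO(sample))
print(wrong_df.shape)

# Tab separator: label and text become separate columns
right_df = pd.read_csv(io.StringIO(sample), sep="\t")
print(right_df.shape)
print(list(right_df.columns))
```

For a quick look at a large file, the `nrows` parameter of `pd.read_csv` limits how many rows are loaded.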

Sentence length analysis

train_df['text_len'] = train_df['text'].apply(lambda x:len(x.split(' ')))
print(train_df['text_len'].describe())

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64
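
The summary above (mean around 907, maximum 57,921) suggests a strongly right-skewed length distribution. A histogram makes this easier to see. The sketch below uses a made-up `toy_df` instead of the real training set, and the axis labels are only suggestions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for train_df; each text is a space-separated list of character IDs
toy_df = pd.DataFrame({"text": ["1 2 3", "4 5", "6 7 8 9 10", "11"]})
toy_df["text_len"] = toy_df["text"].str.split(" ").str.len()

# Histogram of the number of characters per text
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(toy_df["text_len"], bins=5)
ax.set_xlabel("Text char count")
ax.set_title("Histogram of char count")

# A high quantile is one possible way to choose a truncation length for a model
print(toy_df["text_len"].quantile(0.95))
```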

News category distribution

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel('category')

Text(0.5, 0, 'category')

[Figure: bar chart of sample counts per news category (output_5_1.png)]

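Label mapping used in the dataset:
{'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}
The statistics show that the class distribution of the competition dataset is imbalanced: Technology, Stocks, and Sports have the most samples, while Lottery and Horoscope have the fewest.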
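
To quantify the imbalance rather than read it off a bar chart, class proportions can be computed and labeled with readable names. This is a minimal sketch: the English names are translations of the labels in the mapping above, and `toy_labels` is made-up data rather than the real label column.

```python
import pandas as pd

# English translations of the label names; list index = numeric label
label_names = [
    "Technology", "Stocks", "Sports", "Entertainment", "Politics", "Society",
    "Education", "Finance", "Home", "Games", "Real estate", "Fashion",
    "Lottery", "Horoscope",
]

# Hypothetical labels standing in for train_df['label']
toy_labels = pd.Series([0, 0, 0, 1, 1, 2, 13], name="label")

# Share of each class, ordered by numeric label, then renamed for readability
proportions = toy_labels.value_counts(normalize=True).sort_index()
proportions.index = [label_names[i] for i in proportions.index]
print(proportions)
```

On the real data, `train_df['label']` would take the place of `toy_labels`.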

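Character frequency analysis

Next, count how many times each character occurs: first concatenate all sentences in the dataset, then split the result into characters and count the occurrences of each one.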

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)

print(len(word_count))

print(word_count[0])

print(word_count[-1])

6869
('3750', 7482224)
('3133', 1)

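The result shows that the training set contains 6,869 distinct characters; the character with ID 3750 occurs most often and the character with ID 3133 least often.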
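
Joining all 200,000 texts into one string builds a very large intermediate object. As an alternative sketch (not the approach used in this post), a `Counter` can be updated one text at a time. The `toy_df` below is made up for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for train_df
toy_df = pd.DataFrame({"text": ["3750 648 900 3750", "900 3750 12", "648 3750"]})

# Update the counter row by row instead of concatenating every text first
word_count = Counter()
for text in toy_df["text"]:
    word_count.update(text.split(" "))

print(len(word_count))            # number of distinct characters
print(word_count.most_common(1))  # most frequent character and its count
```

`Counter.most_common` returns entries sorted by count in descending order, so no separate `sorted` call is needed.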

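from collections import Counter
# Document frequency: deduplicate the characters within each text first,
# so that each character is counted at most once per text
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))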
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)

print(word_count[0])

print(word_count[1])

print(word_count[2])

('3750', 197997)
('900', 197653)
('648', 191975)

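The characters that appear in the most texts are 3750, 900, and 648 (3750 appears in 197,997 of the 200,000 texts). Characters with this kind of frequency are most likely punctuation marks, such as the comma and the period; news text may also ...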
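
If these three IDs are indeed punctuation, they can be used to estimate the number of sentences in each text. The sketch below rests on that assumption, which the data suggests but does not confirm. `PUNCT_IDS`, `count_sentences`, and `toy_df` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Assumption: these IDs are punctuation marks (suggested by document frequency, not confirmed)
PUNCT_IDS = {"3750", "900", "648"}


def count_sentences(text: str) -> int:
    """Count non-empty runs of characters separated by the assumed punctuation IDs."""
    sentences = 0
    current_len = 0
    for token in text.split(" "):
        if token in PUNCT_IDS:
            # Close the current sentence only if it contains at least one character
            if current_len > 0:
                sentences += 1
            current_len = 0
        else:
            current_len += 1
    # A trailing run without closing punctuation still counts as a sentence
    if current_len > 0:
        sentences += 1
    return sentences


# Hypothetical stand-in for train_df
toy_df = pd.DataFrame({"text": ["12 34 3750 56 900 78 648", "3750 3750 9", "5 6 7"]})
toy_df["n_sentences"] = toy_df["text"].apply(count_sentences)
print(toy_df["n_sentences"].tolist())
```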
