import pandas as pd
import matplotlib.pyplot as plt
# Pitfall 1: reading the file directly with pd.read_csv, without sep='\t', causes an error
train_df = pd.read_csv('./data/train_set.csv',sep='\t')
train_df.head(10)
 | label | text |
---|---|---|
0 | 2 | 2967 6758 339 2021 1854 3731 4109 3792 4149 15... |
1 | 11 | 4464 486 6352 5619 2465 4802 1452 3137 5778 54... |
2 | 3 | 7346 4068 5074 3747 5681 6093 1777 2226 7354 6... |
3 | 2 | 7159 948 4866 2109 5520 2490 211 3956 5520 549... |
4 | 3 | 3646 3055 3055 2490 4659 6065 3370 5814 2465 5... |
5 | 9 | 3819 4525 1129 6725 6485 2109 3800 5264 1006 4... |
6 | 3 | 307 4780 6811 1580 7539 5886 5486 3433 6644 58... |
7 | 10 | 26 4270 1866 5977 3523 3764 4464 3659 4853 517... |
8 | 12 | 2708 2218 5915 4559 886 1241 4819 314 4261 166... |
9 | 3 | 3654 531 1348 29 4553 6722 1474 5099 7541 307 ... |
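The pitfall noted in the comment above can be reproduced on a tiny in-memory sample. This is a minimal sketch, assuming the problem comes from the default comma separator; the sample rows and the names `sample_tsv`, `df_default`, and `df_tab` are made up for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as train_set.csv:
# a header line, then label and text separated by a tab.
sample_tsv = "label\ttext\n2\t2967 6758 339\n11\t4464 486 6352\n"

# With the default sep=',' each whole line is parsed as a single column.
df_default = pd.read_csv(io.StringIO(sample_tsv))
print(list(df_default.columns))

# With sep='\t' the label and text columns are split as intended.
df_tab = pd.read_csv(io.StringIO(sample_tsv), sep='\t')
print(list(df_tab.columns))
```

In this sketch, `df_default['text']` would raise a `KeyError`, because with the default separator no column named `text` exists.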
Sentence length analysis
train_df['text_len'] = train_df['text'].apply(lambda x:len(x.split(' ')))
print(train_df['text_len'].describe())
count 200000.000000
mean 907.207110
std 996.029036
min 2.000000
25% 374.000000
50% 676.000000
75% 1131.000000
max 57921.000000
Name: text_len, dtype: float64
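The summary above shows a long right tail: the maximum (57921) is far above the 75th percentile (1131). The following is a minimal sketch of how one might look at that tail with quantiles. It uses invented toy rows rather than the real data, and `toy_df` and `vectorized_len` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for train_df; the texts are invented.
toy_df = pd.DataFrame({'text': ['1 2 3', '4 5', '6 7 8 9 10 11', '12']})

# Same length definition as above: number of space-separated tokens.
toy_df['text_len'] = toy_df['text'].apply(lambda x: len(x.split(' ')))

# Vectorized alternative that yields the same counts.
vectorized_len = toy_df['text'].str.split(' ').str.len()

# Upper quantiles show how long the longest few texts are.
print(toy_df['text_len'].quantile([0.5, 0.9, 0.99]))
```

On the real `train_df`, the same `quantile` call could be applied to the `text_len` column computed above.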
News category distribution
train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel('category')
Text(0.5, 0, 'category')
[Figure: bar chart of the number of news items per category (output_5_1.png)]
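The label mapping in the dataset:
{'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}
The statistics show that the class distribution in the competition dataset is imbalanced: technology, stocks, and sports have the most samples, while lottery and horoscope have the fewest.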
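To put a number on the imbalance rather than reading it off the bar chart, one option is to compute class shares and the ratio between the largest and smallest class. This is a minimal sketch on invented labels; `toy_labels` is a hypothetical stand-in for `train_df['label']`.

```python
import pandas as pd

# Invented labels, only to illustrate the calculation.
toy_labels = pd.Series([0, 0, 0, 1, 1, 2], name='label')

# Absolute count per class, sorted from most to least frequent.
counts = toy_labels.value_counts()

# Share of each class in the whole sample.
shares = toy_labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```

A ratio far above 1 indicates that the rarest classes are heavily outnumbered.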
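Character statistics
Next, count how often each character occurs: concatenate all sentences in the dataset, split the result into characters, and count the occurrences of each one.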
from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)
print(len(word_count))
print(word_count[0])
print(word_count[-1])
6869
('3750', 7482224)
('3133', 1)
The results show that the training set contains 6869 distinct characters in total; character 3750 occurs most often and character 3133 least often.
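Joining every text into one string works, but it builds a very large intermediate string. A possible alternative, sketched here on invented texts, is to update a single `Counter` one text at a time. `toy_texts` is a hypothetical stand-in for `train_df['text']`.

```python
from collections import Counter

# Invented texts in the same anonymized, space-separated format.
toy_texts = ['3750 900 3750', '648 3750', '900 648']

char_count = Counter()
for text in toy_texts:
    # Add this text's characters to the running totals.
    char_count.update(text.split(' '))

# most_common() returns (character, count) pairs, most frequent first.
ranked = char_count.most_common()
print(len(ranked))
print(ranked[0])
```

Here `most_common()` takes the place of the explicit `sorted(..., reverse=True)` call.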
from collections import Counter
# Deduplicate the characters within each text so that each one counts at most once per text
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)
print(word_count[0])
print(word_count[1])
print(word_count[2])
('3750', 197997)
('900', 197653)
('648', 191975)
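The counts above are document frequencies: the number of texts in which each character appears at least once. Dividing by the number of texts gives a coverage ratio; for example, 197997 out of 200000 is about 99%. A minimal sketch of that calculation on invented texts follows. `toy_texts`, `doc_freq`, and `coverage` are hypothetical names.

```python
from collections import Counter

# Invented texts in the same anonymized, space-separated format.
toy_texts = ['3750 900 3750', '3750 648', '3750 900 900', '648']

doc_freq = Counter()
for text in toy_texts:
    # set() removes repeats, so each character counts once per text.
    doc_freq.update(set(text.split(' ')))

# Fraction of texts that contain each character.
coverage = {char: n / len(toy_texts) for char, n in doc_freq.items()}

for char, n in doc_freq.most_common():
    print(char, n, coverage[char])
```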
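The characters that appear in the most texts are 3750, 900, and 648. Each occurs in more than 95% of the 200000 texts, so they are most likely punctuation marks. Common punctuation marks include the comma and the period; news text may also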