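Learning objectives
Learn to read the competition data with Pandas
Analyse how the competition data is distributed
Reading the data
Although the competition data is text and each news article has a different length, it is still stored in CSV format, so Pandas can read it directly.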
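A minimal sketch of the read step, under stated assumptions: the file name `train_set.csv` and the tab separator are hypothetical and not given in the text above, and the inline sample only stands in for the real file so that the snippet runs on its own.

```python
import io
import pandas as pd

# Tiny stand-in for the competition file. The column names match the
# 'label' and 'text' columns used later in this section.
sample = "label\ttext\n2\t3750 648 900\n11\t648 900 3370 6122\n"

# With the real data this might look like (hypothetical file name and separator):
#   df_train = pd.read_csv("train_set.csv", sep="\t")
df_train = pd.read_csv(io.StringIO(sample), sep="\t")

print(df_train.shape)   # (rows, columns)
print(df_train.head())  # first few rows
```

Passing `nrows=` to `pd.read_csv` reads only the first rows, which can be handy for a quick look before loading everything.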
Reading and analysing the data
df_train['text'].map(lambda x: len(x.split())).describe()
count 200000.000000
mean 907.207110
std 996.029036
min 2.000000
25% 374.000000
50% 676.000000
75% 1131.000000
max 57921.000000
Name: text, dtype: float64
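There are 200,000 news articles in total. The average article length is about 907 characters (space-separated tokens), the shortest article has 2 characters and the longest has 57,921.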
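%%time
# Count how often each character (token) occurs across all news articles.
vocab = dict()
for text in df_train['text']:
    for word in text.split():
        if vocab.get(word):
            vocab[word] += 1
        else:
            vocab[word] = 1

# Sort by frequency, most common first, and show the top 10.
chars = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
chars[:10]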
[('3750', 7482224),
('648', 4924890),
('900', 3262544),
('3370', 2020958),
('6122', 1602363),
('4464', 1544962),
('7399', 1455864),
('4939', 1387951),
('3659', 1251253),
('4811', 1159401)]
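The same count can also be written with `collections.Counter` from the standard library, which additionally gives the vocabulary size via `len`. This is only a sketch on toy data: the list `texts` is a made-up stand-in for `df_train['text']`.

```python
from collections import Counter

# Toy stand-in for df_train['text']: each entry is one article whose
# characters are space-separated ids.
texts = [
    "3750 648 900 3750",
    "648 3750 6122",
    "3750",
]

vocab = Counter()
for text in texts:
    vocab.update(text.split())  # add this article's character counts

print(len(vocab))            # number of distinct characters -> 4
print(vocab.most_common(2))  # -> [('3750', 4), ('648', 2)]
```

On the real data, `len(vocab)` is the number of distinct characters in the training set.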
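Conclusions
From the analysis above we can conclude the following:
Each news article contains roughly 900 characters on average, and some articles are much longer;
The news categories are unevenly distributed: the technology category has close to 40,000 samples, while the horoscope category has fewer than 1,000;
The competition data contains 7,000 to 8,000 distinct characters in total;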
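The class distribution mentioned in the list above can be inspected with `Series.value_counts`. The sketch below uses a made-up `labels` series in place of `df_train['label']`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy labels standing in for df_train['label'].
labels = pd.Series([0, 0, 0, 1, 1, 13], name="label")

counts = labels.value_counts()               # samples per class, largest first
share = labels.value_counts(normalize=True)  # the same as fractions of the total

print(counts)
print(share.round(3))
```

A large gap between the first and last entries of `counts` is the kind of imbalance described above.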
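The analysis also suggests the following:
Articles contain many characters on average, so truncation may be needed;
The class imbalance can seriously hurt model accuracy;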
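As an illustration of the truncation point, the sketch below keeps only the first `max_len` characters of an article. The helper `truncate` and the value of `max_len` are hypothetical choices; the `describe()` output earlier (75th percentile at 1131 characters) is one possible guide for picking a cap.

```python
def truncate(text, max_len):
    """Keep at most the first max_len space-separated characters."""
    return " ".join(text.split()[:max_len])

example = "3750 648 900 3370 6122"
print(truncate(example, 3))   # -> 3750 648 900
print(truncate(example, 10))  # shorter than the cap, returned unchanged
```

With a DataFrame, something like `df_train['text'].map(lambda t: truncate(t, 1131))` would apply the cap to every article.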
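Chapter exercises
Assuming that characters 3750, 900 and 648 are sentence-ending punctuation marks, how many sentences does each news article contain on average?
The counts are:
('3750', 7482224),
('648', 4924890),
('900', 3262544),
So each news article consists of about (7482224 + 4924890 + 3262544) / 200000 ≈ 78.3 sentences on average.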
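The arithmetic can be checked with a few lines of Python, using the 200,000 rows reported by `describe()`. The helper `count_sentences` is a hypothetical per-article variant: it counts non-empty runs of characters between punctuation marks, so it can differ slightly from simply counting marks (for example when two marks are adjacent, or when an article does not end with a mark).

```python
# Corpus-level estimate from the three counts listed above.
mark_counts = {"3750": 7482224, "648": 4924890, "900": 3262544}
n_articles = 200000  # row count reported by describe()
avg_sentences = sum(mark_counts.values()) / n_articles
print(round(avg_sentences, 2))  # -> 78.35

def count_sentences(text, marks=("3750", "648", "900")):
    """Count non-empty runs of characters separated by punctuation marks."""
    n, in_sentence = 0, False
    for tok in text.split():
        if tok in marks:
            in_sentence = False   # a mark closes the current sentence
        elif not in_sentence:
            n += 1                # first character of a new sentence
            in_sentence = True
    return n

print(count_sentences("12 34 3750 56 648 78 90 900"))  # -> 3
```

Averaging `count_sentences` over `df_train['text']` would give the per-article figure directly.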
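Finding the most frequent characters in each news category
%%time
# Find the most frequent characters in each news category.
text = df_train['text']
label = df_train['label']
n_rows = 20000  # only the first 20,000 of the 200,000 rows are scanned; use len(df_train) for all of them

# vocab maps each label to a dict of {character: count}.
vocab = dict()
for i in range(n_rows):
    if label[i] not in vocab:
        vocab[label[i]] = dict()
    for word in text[i].split():
        if vocab[label[i]].get(word):
            vocab[label[i]][word] += 1
        else:
            vocab[label[i]][word] = 1

# Print the 10 most frequent characters of each of the 14 categories.
for i in range(14):
    if i in vocab:
        chars = sorted(vocab[i].items(), key=lambda x: x[1], reverse=True)
        print(i)
        print(chars[:10])
Output of the original run. In that run the final loop looked up vocab[label[i]] instead of vocab[i], so block i lists the counts for the category of row i rather than for category i, which is why several blocks are identical: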
0
[('3750', 139292), ('648', 93693), ('900', 59694), ('7399', 34120), ('6122', 32888), ('4939', 32220), ('4704', 29992), ('1667', 28924), ('5598', 27521), ('4464', 24248)]
1
[('3750', 8195), ('648', 6628), ('900', 3716), ('4939', 1828), ('6122', 1754), ('5560', 1742), ('669', 1693), ('4811', 1360), ('4893', 1302), ('2465', 1247)]
2
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
3
[('3750', 139292), ('648', 93693), ('900', 59694), ('7399', 34120), ('6122', 32888), ('4939', 32220), ('4704', 29992), ('1667', 28924), ('5598', 27521), ('4464', 24248)]
4
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
5
[('3750', 18937), ('648', 16408), ('900', 7477), ('7328', 4736), ('6122', 4637), ('2465', 4415), ('7399', 3939), ('3370', 3922), ('4939', 3805), ('5547', 3782)]
6
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
7
[('3750', 18346), ('648', 11677), ('900', 7412), ('3370', 6942), ('2465', 4921), ('5560', 4637), ('4464', 4160), ('3523', 4069), ('3686', 3991), ('6122', 3603)]
8
[('3750', 8865), ('4464', 5036), ('3370', 4522), ('648', 3766), ('900', 3742), ('2465', 3340), ('3659', 3289), ('6065', 2910), ('1667', 2441), ('2614', 2150)]
9
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
10
[('3750', 126914), ('648', 97095), ('900', 58260), ('3370', 49790), ('2465', 30701), ('4464', 30422), ('6122', 27672), ('7399', 26174), ('3659', 25650), ('4939', 24097)]
11
[('3750', 43439), ('648', 26855), ('900', 18554), ('3370', 15777), ('5296', 13209), ('4464', 11495), ('6835', 10591), ('3659', 8632), ('6122', 7500), ('299', 7350)]
12
[('3750', 36330), ('648', 23142), ('900', 18741), ('4411', 11728), ('7399', 8422), ('4893', 7922), ('6122', 7510), ('2400', 6784), ('4464', 6781), ('4853', 6285)]
13
[('3750', 126914), ('648', 97095), ('900', 58260), ('3370', 49790), ('2465', 30701), ('4464', 30422), ('6122', 27672), ('7399', 26174), ('3659', 25650), ('4939', 24097)]
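A more compact way to build the per-category counts is a `defaultdict` of `Counter` objects, keyed directly by the label. This is a sketch on toy data: `rows` is a made-up stand-in for the `(label, text)` pairs of `df_train`.

```python
from collections import Counter, defaultdict

# Toy (label, text) pairs standing in for the rows of df_train.
rows = [
    (0, "3750 648 3750"),
    (1, "900 900 648"),
    (0, "3750 6122"),
]

# per_class[label] counts the characters seen in articles of that label.
per_class = defaultdict(Counter)
for label, text in rows:
    per_class[label].update(text.split())

for label in sorted(per_class):
    print(label, per_class[label].most_common(1))
```

On the real data the loop could iterate over `zip(df_train['label'], df_train['text'])`, which covers every row and avoids positional indexing.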