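NLP Competition - Task2: Data Loading and Data Analysis

Learning objectives

Learn to read the competition data with Pandas
Analyze the distribution patterns of the competition data

Data loading

The competition data is text, and each news article has a variable length, but it is still stored in CSV format. The data can therefore be read directly with Pandas.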
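
A minimal sketch of the loading step. The file name `train_set.csv`, the tab separator, and the helper name `load_train` are assumptions for illustration and are not given in this post; only the `text` and `label` column names come from the code used later. The in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

def load_train(source, sep="\t", nrows=None):
    """Read the training data into a DataFrame.

    source may be a file path or a file-like object.
    The tab separator is an assumption about the file layout.
    """
    return pd.read_csv(source, sep=sep, nrows=nrows)

# Tiny stand-in for the real file: a label column and space-separated character ids.
sample = io.StringIO("label\ttext\n2\t57 44 66 3750\n11\t4464 486 900 648 12\n")
df_sample = load_train(sample)

print(df_sample.shape)
print(df_sample["text"].map(lambda x: len(x.split())).tolist())

# With the real data this would be something like (assumed path):
# df_train = load_train("train_set.csv")
```

Passing `nrows` reads only the first rows, which is handy for a quick look before loading the full file.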

Read the data and analyze it

df_train['text'].map(lambda x: len(x.split())).describe()

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text, dtype: float64

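There are 200,000 news articles in total. The average length is about 907 characters, the shortest article has 2 characters, and the longest has 57,921.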

%%time
from collections import Counter

# Count how often each character occurs across all articles
vocab = Counter()
for text in df_train['text']:
    vocab.update(text.split())

# Characters sorted by frequency, most frequent first
chars = vocab.most_common()
chars[:10]
     
[('3750', 7482224),
 ('648', 4924890),
 ('900', 3262544),
 ('3370', 2020958),
 ('6122', 1602363),
 ('4464', 1544962),
 ('7399', 1455864),
 ('4939', 1387951),
 ('3659', 1251253),
 ('4811', 1159401)]     

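Conclusions

From the analysis above we can conclude the following:

Each news article contains about 900 characters on average (907 in the statistics above), and some articles are much longer;
The class distribution is imbalanced: technology news has close to 40,000 samples, while horoscope news has fewer than 1,000;
The dataset contains 7,000-8,000 distinct characters in total;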
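
The class counts and the vocabulary size behind these conclusions can be computed along the following lines. This is a sketch on a toy DataFrame (`df_demo` is a made-up stand-in, not competition data); on the real data the same two expressions would be applied to `df_train`.

```python
import pandas as pd

# Toy stand-in for df_train with the same 'label' and 'text' columns.
df_demo = pd.DataFrame({
    "label": [0, 0, 0, 1, 13],
    "text": ["1 2 3750", "2 3 648", "3 4", "5 900 1", "6"],
})

# Number of samples per class, largest class first.
class_counts = df_demo["label"].value_counts()

# Number of distinct characters across all articles.
vocab_size = len({tok for text in df_demo["text"] for tok in text.split()})

print(class_counts.to_dict())
print(vocab_size)
```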
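
From this analysis we can also draw the following conclusions:

The average number of characters per article is large, so the text may need to be truncated;

Because the classes are imbalanced, model accuracy can be seriously affected;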
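
Two simple ways to act on these points, sketched under assumptions: truncating each article to its first `max_len` characters, and weighting classes by inverse frequency. Neither is prescribed by this post. The helper names and the choice of `max_len` are hypothetical, and the weighting formula is one common heuristic (the same one scikit-learn uses for `class_weight='balanced'`).

```python
from collections import Counter

def truncate(text, max_len):
    """Keep only the first max_len space-separated characters."""
    return " ".join(text.split()[:max_len])

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {lb: n_samples / (n_classes * c) for lb, c in counts.items()}

print(truncate("1 2 3 4 5", 3))
print(inverse_frequency_weights([0, 0, 0, 1]))
```

Rare classes receive larger weights, so their errors count more in a weighted loss. A cut-off near an upper percentile of the length distribution is one reasonable starting point for `max_len`.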

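Exercises for this chapter

Assuming characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article contain on average?

The counts are:
('3750', 7482224),
('648', 4924890),
('900', 3262544),
so each news article contains on average about (7482224 + 4924890 + 3262544) / 200000 ≈ 78.3 sentences.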
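
The estimate above divides the total punctuation count by the number of articles. A per-article count is a slightly different sketch: it counts non-empty runs of characters separated by the three assumed punctuation marks, so trailing text without a final mark still counts as a sentence and consecutive marks do not inflate the total. The function name `count_sentences` is hypothetical.

```python
# Assumed sentence-ending characters, as stated in the exercise.
PUNCT = {"3750", "900", "648"}

def count_sentences(text):
    """Count non-empty character runs separated by the assumed punctuation."""
    count = 0
    in_sentence = False
    for tok in text.split():
        if tok in PUNCT:
            in_sentence = False
        elif not in_sentence:
            in_sentence = True
            count += 1
    return count

texts = ["1 2 3750 4 648 5", "7 8 900", "3750 3750"]
counts = [count_sentences(t) for t in texts]
print(counts)
print(sum(counts) / len(counts))
```

On the real data the same function could be applied with `df_train['text'].map(count_sentences).mean()`.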

Count the most frequent characters in each news class

%%time
# Count the most frequent characters in each news class
from collections import Counter, defaultdict

# One Counter per label: vocab[label][character] -> number of occurrences
vocab = defaultdict(Counter)
for text, lb in zip(df_train['text'], df_train['label']):
    vocab[lb].update(text.split())

# Print the ten most frequent characters for every label
for lb in sorted(vocab):
    print(lb)
    print(vocab[lb].most_common(10))

Output (recorded from the original run, which scanned only the first 20,000 rows and printed vocab[label[i]] instead of vocab[i]; this is why several classes show identical lists, so rerun the code above to get true per-class results):

0
[('3750', 139292), ('648', 93693), ('900', 59694), ('7399', 34120), ('6122', 32888), ('4939', 32220), ('4704', 29992), ('1667', 28924), ('5598', 27521), ('4464', 24248)]
1
[('3750', 8195), ('648', 6628), ('900', 3716), ('4939', 1828), ('6122', 1754), ('5560', 1742), ('669', 1693), ('4811', 1360), ('4893', 1302), ('2465', 1247)]
2
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
3
[('3750', 139292), ('648', 93693), ('900', 59694), ('7399', 34120), ('6122', 32888), ('4939', 32220), ('4704', 29992), ('1667', 28924), ('5598', 27521), ('4464', 24248)]
4
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
5
[('3750', 18937), ('648', 16408), ('900', 7477), ('7328', 4736), ('6122', 4637), ('2465', 4415), ('7399', 3939), ('3370', 3922), ('4939', 3805), ('5547', 3782)]
6
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
7
[('3750', 18346), ('648', 11677), ('900', 7412), ('3370', 6942), ('2465', 4921), ('5560', 4637), ('4464', 4160), ('3523', 4069), ('3686', 3991), ('6122', 3603)]
8
[('3750', 8865), ('4464', 5036), ('3370', 4522), ('648', 3766), ('900', 3742), ('2465', 3340), ('3659', 3289), ('6065', 2910), ('1667', 2441), ('2614', 2150)]
9
[('3750', 76485), ('648', 48070), ('900', 29621), ('6122', 18428), ('4939', 16759), ('669', 14505), ('7399', 14504), ('4893', 14452), ('803', 13470), ('1635', 13428)]
10
[('3750', 126914), ('648', 97095), ('900', 58260), ('3370', 49790), ('2465', 30701), ('4464', 30422), ('6122', 27672), ('7399', 26174), ('3659', 25650), ('4939', 24097)]
11
[('3750', 43439), ('648', 26855), ('900', 18554), ('3370', 15777), ('5296', 13209), ('4464', 11495), ('6835', 10591), ('3659', 8632), ('6122', 7500), ('299', 7350)]
12
[('3750', 36330), ('648', 23142), ('900', 18741), ('4411', 11728), ('7399', 8422), ('4893', 7922), ('6122', 7510), ('2400', 6784), ('4464', 6781), ('4853', 6285)]
13
[('3750', 126914), ('648', 97095), ('900', 58260), ('3370', 49790), ('2465', 30701), ('4464', 30422), ('6122', 27672), ('7399', 26174), ('3659', 25650), ('4939', 24097)]