Kaggle: Google Quest Q&A Labeling: silver medal methods and takeaways from our first competition

This post summarizes the methods that earned the author a silver medal in the Kaggle Google Quest Q&A Labeling challenge. Using BERT and DistilBERT models, they found it more effective to handle the question title, question body, and answer separately. Snapping the predicted outputs to the discrete rating values that appear in the training set significantly improved model performance, and feature engineering, preprocessing, and model averaging lifted the score further.


Kaggle: Google Quest Q&A Labeling summary

General Part

Congratulations to all winners of this competition. Your hard work paid off!

First, I would like to thank the authors of the following three published notebooks:

https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer
https://www.kaggle.com/abhishek/distilbert-use-features-oof
https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe

These notebooks showed awesome ways to build models, visualize the dataset and extract features from non-text data.

Our initial plan was to feed the question title, question body, and answer all into one BERT-based model. But after analyzing the distribution of question body and answer lengths, we found two major problems:

  1. If we fed all three parts in as input, we had to squeeze the question body and the answer into whatever space remained, because BERT-based models accept at most 512 tokens. That meant trimming away a lot of text, which wasted training data, and the question of how best to trim a long text immediately presented itself (see the sketch after this list).
  2. Roughly half of the question bodies and answers contained code, which caused real trouble during tokenization. The tokenized code looked extremely weird, and we didn’t know whether it would have a large effect on the model predictions.
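
To make these two problems concrete, here is a minimal sketch of the combined-input approach we initially considered. It assumes the HuggingFace transformers API and bert-base-uncased; the segment split and the sample strings are illustrative, not our actual pipeline.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_example(title: str, body: str, answer: str, max_len: int = 512):
    # Segment A: title + body; segment B: answer. "longest_first" trims the
    # longer segment token by token until the pair fits in max_len, i.e. the
    # blind trimming that risks throwing away useful training text.
    return tokenizer(
        title + " " + body,
        answer,
        max_length=max_len,
        truncation="longest_first",
        padding="max_length",
    )

# Problem 2: WordPiece was never trained on source code, so code-heavy text
# shatters into odd-looking subword pieces (output shown is illustrative):
print(tokenizer.tokenize("df['score'] = np.log1p(df['views'])"))
# e.g. ['df', '[', "'", 'score', "'", ']', '=', 'np', '.', 'log', '##1', '##p', ...]
```

Both issues pushed us toward handling the title, body, and answer separately rather than forcing them into a single 512-token sequence.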