Kaggle: Google Quest Q&A Labeling summary
General Part
Congratulations to all winners of this competition. Your hard work paid off!
First, I have to say thanks to the authors of the following three published notebooks:
https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer,
https://www.kaggle.com/abhishek/distilbert-use-features-oof,
https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe.
These notebooks showed awesome ways to build models, visualize the dataset, and extract features from non-text data.
Our initial plan was to feed the question title, question body, and answer all into a BERT-based model. But after analyzing the distribution of the lengths of question bodies and answers, we found two major problems:
- If we fed all three parts in as input, we had to shrink the space given to the question body and the answer because of the input-size limit of BERT-based models. That meant trimming away a lot of text, which wasted training data, and the question of how best to trim a long text immediately presented itself (see the truncation sketch after this list).
- Roughly half of the question bodies and answers contained code, which caused real trouble during tokenization: the tokenized code looked extremely weird, and we didn't know whether it would have a large effect on the model predictions (see the tokenizer example below).
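
To make the length problem concrete, here is a minimal sketch of the naive trimming that BERT's input cap forces, assuming the HuggingFace transformers tokenizer and bert-base-uncased. The title/body/answer strings are hypothetical placeholders, not the competition's actual preprocessing.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

MAX_LEN = 512  # hard input cap for BERT-base, including [CLS]/[SEP] tokens

def encode_example(title, body, answer, max_len=MAX_LEN):
    # Pack title + body into segment A and the answer into segment B.
    # `truncation="longest_first"` removes tokens from whichever segment
    # is currently longer until the pair fits -- a naive strategy that
    # simply throws away everything past the cap.
    return tokenizer(
        title + " " + body,
        answer,
        max_length=max_len,
        truncation="longest_first",
        padding="max_length",
        return_tensors="np",
    )

enc = encode_example(
    "How do I sort a dict by value?",        # hypothetical title
    "I have a dict and want it sorted ...",  # hypothetical body
    "Use sorted(d.items(), key=...) ...",    # hypothetical answer
)
print(enc["input_ids"].shape)  # (1, 512): anything longer was trimmed away
```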
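And to see what "weird" tokenization means in practice, you can run a short code snippet through the same tokenizer. The snippet and the sample pieces in the comment are illustrative only; the point is that BERT's WordPiece vocabulary was built from natural language, not source code.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

snippet = "for i in range(len(arr)): arr[i] += 1"
print(tokenizer.tokenize(snippet))
# WordPiece shatters identifiers and operators into subword fragments,
# e.g. pieces like 'ar', '##r', '+', '=' -- far from meaningful code tokens.
```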