Kaggle: Google Quest Q&A Labeling summary
General Part
Congratulations to all winners of this competition. Your hard work paid off!
First, I have to say thanks to the authors of the following three published notebooks:
https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer,
https://www.kaggle.com/abhishek/distilbert-use-features-oof,
https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe.
These notebooks showed awesome ways to build models, visualize the dataset, and extract features from non-text data.
Our initial plan was to feed the question title, question body, and answer all into a BERT-based model. But after analyzing the distribution of the lengths of question bodies and answers, we found two major problems:
- If we fed all three parts in as input, we had to shrink the space given to the question body and the answer because of the input-size limit of BERT-based models. That meant trimming away a lot of text, which wasted training data, and the question of how best to trim a long text immediately presented itself (see the truncation sketch after this list).
- Roughly half of the question bodies and answers contained code, which caused real trouble during tokenization: the tokenized code looked extremely weird, and we didn't know whether it would have a large effect on the model predictions (see the tokenizer example below).
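
To make the length problem concrete, here is a minimal sketch of the naive trimming that BERT's input cap forces, assuming the HuggingFace transformers tokenizer and bert-base-uncased. The title/body/answer strings are hypothetical placeholders, not the competition's actual preprocessing.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

MAX_LEN = 512  # hard input cap for BERT-base, including [CLS]/[SEP] tokens

def encode_example(title, body, answer, max_len=MAX_LEN):
    # Pack title + body into segment A and the answer into segment B.
    # `truncation="longest_first"` removes tokens from whichever segment
    # is currently longer until the pair fits -- a naive strategy that
    # simply throws away everything past the cap.
    return tokenizer(
        title + " " + body,
        answer,
        max_length=max_len,
        truncation="longest_first",
        padding="max_length",
        return_tensors="np",
    )

enc = encode_example(
    "How do I sort a dict by value?",        # hypothetical title
    "I have a dict and want it sorted ...",  # hypothetical body
    "Use sorted(d.items(), key=...) ...",    # hypothetical answer
)
print(enc["input_ids"].shape)  # (1, 512): anything longer was trimmed away
```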
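And to see what "weird" tokenization means in practice, you can run a short code snippet through the same tokenizer. The snippet and the sample pieces in the comment are illustrative only; the point is that BERT's WordPiece vocabulary was built from natural language, not source code.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

snippet = "for i in range(len(arr)): arr[i] += 1"
print(tokenizer.tokenize(snippet))
# WordPiece shatters identifiers and operators into subword fragments,
# e.g. pieces like 'ar', '##r', '+', '=' -- far from meaningful code tokens.
```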