Quora Insincere Questions Classification on Kaggle: Summary and Reflections

The author took part in the Quora Insincere Questions Classification competition on Kaggle, a first venture into NLP, and summarizes hands-on experience with tokenization, word embeddings, and deep learning models such as LSTM, GRU, attention mechanisms, and capsule networks. Although the final result was not great, the author came to appreciate how central the model itself is in NLP and how much more paper reading and study are needed. The post also shares the competition code and the preprocessing and modeling steps, reflects on how multitasking and slacking off hurt the result, and stresses keeping up enthusiasm for learning and a sense of urgency.

In this Quora text-classification competition I competed solo and, out of roughly 4,000 teams, only reached the top 20% of the leaderboard. Partly this was because it was my first NLP competition and I was a complete beginner, and partly because I slacked off quite a bit along the way, so I want to write down some technical and non-technical takeaways as a reminder to myself.

The task is to use a text training set to predict whether a question posted on Quora is sincere or insincere. Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification

Technical takeaways:

By studying the kernels of various top competitors during the contest, and digging through documentation and papers whenever I only half understood something, I built up a rough picture of NLP text classification and a first acquaintance with tokenization, language models, and word vectors: n-gram language models, pretrained word embeddings, the commonly used LSTM and GRU networks, the various text-preprocessing methods, and model components such as attention layers. A minimal sketch of such a model follows below.
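
To make these pieces concrete, here is a minimal sketch of the kind of bidirectional GRU classifier those kernels build in Keras. It is my own illustration rather than the model I actually submitted, and the vocabulary size, sequence length, and embedding dimension are placeholder values:

from keras.layers import Input, Embedding, Bidirectional, CuDNNGRU, GlobalMaxPooling1D, Dense
from keras.models import Model

max_features, max_len, embed_size = 50000, 70, 300  # placeholder hyper-parameters

inp = Input(shape=(max_len,))
x = Embedding(max_features, embed_size)(inp)  # in practice initialized with pretrained word embeddings
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPooling1D()(x)
out = Dense(1, activation='sigmoid')(x)  # 1 = insincere, 0 = sincere

model = Model(inputs=inp, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])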

Non-technical issues:

An NLP competition is very different from an ordinary data-mining competition. In ordinary data mining the most important thing is digging out good features, and choosing a suitable model comes second; NLP puts far more weight on the model itself, which is why, among the approaches available today, deep learning models are so widely used in NLP. I was also running another data-mining competition at the same time, and splitting my attention made me less focused. Once I got stuck at a certain point I started to slack off, which is another thing I need to fix.

Reflections and takeaways:

The biggest gain is the feeling that I have finally gotten a foothold in NLP, along with an appreciation of how much the latest papers shape NLP modeling. I need to read far more papers, because good NLP models are mostly derived from the existing literature (and of course many strong competitors validate their models in competitions before publishing), which is very different from data-mining features derived from the data itself. I also borrowed heavily from other people's solutions this time; going forward I need to stand on the shoulders of giants and contribute some ideas of my own. After this competition I realized NLP is an enormous rabbit hole with far too much left to learn, so keep at it and keep a sense of urgency.

The competition code and notes are attached below:

Source code: https://github.com/yyhhlancelot/Kaggle_Quora_Insincere_Question_Classification

First, load the packages we need:

import os
import time
import math
import gc
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn import metrics
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Embedding, Dropout, Activation, Conv1D, Conv2D
from keras.layers import CuDNNLSTM, CuDNNGRU, Bidirectional
from keras.layers import MaxPooling1D, MaxPool2D, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Reshape, Flatten, Concatenate, concatenate, add, SpatialDropout1D, BatchNormalization, PReLU
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import *

Preprocessing stage:

Cleaning punctuation:

def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]
    x = str(x)
    # replace characters that usually glue words together with a space
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # keep '&' as its own token
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    # drop common ASCII punctuation and curly quotes entirely
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    # pad any remaining special symbols with spaces so they become separate tokens
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x
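
A quick illustration of what clean_text does (my own example, not from the original code): most ASCII punctuation is stripped, and the remaining special symbols are padded with spaces so they tokenize separately.

print(clean_text("What's the best way to learn NLP?"))
# -> "What s the best way to learn NLP"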

Cleaning numbers with regular expressions:

import re
def clean_numbers(x):
    # mask runs of digits with '#' placeholders, longest runs first so that
    # e.g. a 4-digit year is not split into two 2-digit matches
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
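
With both cleaners defined, they would typically be applied to the question text like this (a sketch assuming the competition's train.csv/test.csv and their question_text column; train_df and test_df are my own variable names):

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
for df in (train_df, test_df):
    # fill missing questions, then normalize punctuation and digit runs
    df['question_text'] = df['question_text'].fillna('_na_').apply(clean_text).apply(clean_numbers)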

Cleaning common misspellings:

def _get_mispell(mispell_dict):
    # build one alternation regex over all dictionary keys so every
    # misspelling can be replaced in a single pass
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color','centre':'center','didnt':'did not','doesnt':'does not',
                'isnt':'is not','shouldnt':'should not','favourite':'favorite','travelling':'traveling',
                'counselling':'counseling','theatre':'theater','cancelled':'canceled','labour':'labor',
                'organisation':'organization','wwii':'world war 2','citicise':'criticize','instagram': 'social medium',
                'whatsapp': 'social medium','snapchat': 'social medium',"ain't": "is not", 
                "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", 
                "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would",
                "he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", 
                "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will","i'll've": "i will have",
                "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", 
                "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                "it's": "it is","let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                "oughtn't": "ought not