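Copyright notice: This is an original article by the blog author. It may be freely reposted, provided the author and source are credited. https://blog.csdn.net/LeYOUNGER/article/details/78667538
Abstract
I recently took part in a Kaggle competition. Here I share the deep learning models I used (in simplified form), in the hope that they help beginners.
(Final result: 23/4512 on the public leaderboard, 87/4512 on the private leaderboard. It overfit, argh T.T)
Competition page: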
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
For the word embeddings used in the models, see the following blog post:
A summary of the various models:
https://www.kaggle.com/jagangupta/lessons-from-toxic-blending-is-the-new-sexy
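The usual recipe for text competitions
- Text preprocessing
- Tokenization
- Feature engineering (bag of words, word embeddings, sentence vectors, etc.)
- Cross-validation (CV) with KFold
- Model ensembling (stacking, weighted blending, etc.)
Deep models used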
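The last two items of the recipe can be illustrated with a small sketch. It is written in plain Python so it runs without any ML library; the helper names `kfold_indices` and `weighted_blend`, the fold count, and the blend weights are illustrative assumptions, not the settings used in the competition.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def weighted_blend(predictions, weights):
    """Blend per-model probability lists using weights normalized to sum to 1."""
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Every sample lands in exactly one validation fold.
folds = list(kfold_indices(n_samples=10, n_splits=3))

# Two hypothetical models, the first trusted three times as much as the second.
blended = weighted_blend([[0.2, 0.8], [0.4, 0.6]], weights=[3, 1])
```

In practice each fold's model would predict its own validation rows, and those out-of-fold predictions are what a stacking model or the blend weights are tuned on.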
TextRNN
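# Imports assume standalone Keras 2.x, which matches the lr= optimizer argument used below.
import keras
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU,
                          GlobalAveragePooling1D, GlobalMaxPooling1D, Dense)
from keras.models import Model

# Attention is a custom layer; MAX_SEQUENCE_LENGTH, EMBEDDING_DIM, word_index and
# get_embeddings() come from the preprocessing step. None of them is defined in this post.
inp = Input(shape=(MAX_SEQUENCE_LENGTH, ))
# Frozen pretrained word embeddings
x_3 = Embedding(len(word_index) + 1,
                EMBEDDING_DIM,
                weights=[get_embeddings()],
                input_length=MAX_SEQUENCE_LENGTH,
                trainable=False)(inp)
x_3 = SpatialDropout1D(0.2)(x_3)
x_3 = Bidirectional(GRU(128, return_sequences=True, dropout=0.2, kernel_initializer='glorot_uniform'),
                    merge_mode='concat')(x_3)
# Summarize the GRU outputs in three ways, then concatenate the summaries
avg_pool_3 = GlobalAveragePooling1D()(x_3)
max_pool_3 = GlobalMaxPooling1D()(x_3)
attention_3 = Attention()(x_3)
x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3])
# One sigmoid output per label (the competition has 6 labels)
x = Dense(6, activation="sigmoid")(x)
adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
sgd = keras.optimizers.SGD(lr=0.001)  # alternative optimizer; not used below
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy', 'binary_crossentropy'])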
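The `Attention` layer in the model above is a custom layer that this post does not define. One common formulation of attention pooling scores each time step, normalizes the scores with a softmax, and returns the weighted sum of the hidden states. The plain-Python sketch below shows that computation only; the function name `attention_pool`, the weight vector `w`, and the bias `b` are assumptions for illustration and may differ from the layer actually used.

```python
import math


def attention_pool(hidden_states, w, b=0.0):
    """Collapse a (time, features) sequence into a single feature vector."""
    # Score each time step: e_t = tanh(h_t . w + b)
    scores = [
        math.tanh(sum(h_i * w_i for h_i, w_i in zip(h, w)) + b)
        for h in hidden_states
    ]
    # Softmax over time steps (shifted by the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the hidden states, one value per feature.
    n_features = len(hidden_states[0])
    pooled = [
        sum(a * h[j] for a, h in zip(alphas, hidden_states))
        for j in range(n_features)
    ]
    return pooled, alphas


# Three time steps with two features each.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, alphas = attention_pool(states, w=[0.5, 0.5])
```

In a trained layer `w` and `b` are learned, so the model decides which time steps matter; here the third step simply scores highest because both of its features are active.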
TextCNN
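# Imports assume standalone Keras 2.x, which matches the lr= optimizer argument used below.
import keras
from keras.layers import (Input, Embedding, SpatialDropout1D, Conv1D, MaxPooling1D,
                          Flatten, Dropout, Dense, PReLU, BatchNormalization)
from keras.models import Model

# MAX_SEQUENCE_LENGTH, EMBEDDING_DIM, word_index and get_embeddings() come from the
# preprocessing step, which is not shown in this post.
inp = Input(shape=(MAX_SEQUENCE_LENGTH, ))
# Frozen pretrained word embeddings
x_3 = Embedding(len(word_index) + 1,
                EMBEDDING_DIM,
                weights=[get_embeddings()],
                input_length=MAX_SEQUENCE_LENGTH,
                trainable=False)(inp)
x_3 = SpatialDropout1D(0.2)(x_3)
# First stage: parallel convolutions with kernel sizes 2 to 6, concatenated on the channel axis
cnn1 = Conv1D(256, 2, padding='same', strides=1, activation='relu')(x_3)
cnn2 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(x_3)
cnn3 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(x_3)
cnn4 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(x_3)
cnn5 = Conv1D(256, 6, padding='same', strides=1, activation='relu')(x_3)
cnn = keras.layers.concatenate([cnn1, cnn2, cnn3, cnn4, cnn5], axis=-1)
# Second stage: three more convolutions, each max-pooled over the time axis.
# pool_size=200 collapses the whole sequence only if MAX_SEQUENCE_LENGTH is 200
# (presumably the case; the post does not state the value).
cnn1 = Conv1D(128, 3, padding='same', strides=1, activation='relu')(cnn)
cnn1 = MaxPooling1D(pool_size=200)(cnn1)
cnn2 = Conv1D(128, 4, padding='same', strides=1, activation='relu')(cnn)
cnn2 = MaxPooling1D(pool_size=200)(cnn2)
cnn3 = Conv1D(128, 5, padding='same', strides=1, activation='relu')(cnn)
cnn3 = MaxPooling1D(pool_size=200)(cnn3)
cnn = keras.layers.concatenate([cnn1, cnn2, cnn3], axis=-1)
# Classifier head
x = Flatten()(cnn)
x = Dropout(0.2)(x)
x = Dense(128, kernel_initializer='he_normal')(x)
x = PReLU()(x)
x = BatchNormalization()(x)
x = Dropout(0.2)(x)
# One sigmoid output per label (the competition has 6 labels)
x = Dense(6, activation="sigmoid")(x)
adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
sgd = keras.optimizers.SGD(lr=0.001)  # alternative optimizer; not used below
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy', 'binary_crossentropy'])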
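The hard-coded `pool_size=200` ties this TextCNN to one sequence length, which is easiest to see by tracing the tensor shapes. The sketch below applies the standard output-length rules for `'same'` and `'valid'` padding and for max pooling with the default stride equal to the pool size. The helper names are hypothetical, and the sequence lengths tried are assumptions, since the post does not state `MAX_SEQUENCE_LENGTH`.

```python
def conv1d_length(length, kernel_size, padding, stride=1):
    """Output length of a 1D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    if padding == "valid":
        return (length - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


def maxpool1d_length(length, pool_size):
    """Output length of max pooling with stride == pool_size and 'valid' padding.

    A result of 0 means the input is shorter than the pool window, in which
    case the real layer cannot be built.
    """
    return max((length - pool_size) // pool_size + 1, 0)


def textcnn_flatten_size(max_len, pool_size=200):
    """Number of features reaching Flatten() in the TextCNN above."""
    # First stage: five 'same' convolutions keep the length; channels are 5 * 256.
    stage1_len = conv1d_length(max_len, 2, "same")
    # Second stage: three 'same' convolutions with 128 filters, each max-pooled.
    stage2_len = conv1d_length(stage1_len, 3, "same")
    pooled_len = maxpool1d_length(stage2_len, pool_size)
    return pooled_len * 3 * 128


size_200 = textcnn_flatten_size(200)  # 384: one pooled step per branch
size_400 = textcnn_flatten_size(400)  # 768: two pooled steps per branch
size_100 = textcnn_flatten_size(100)  # 0: sequence shorter than the pool window
```

Replacing `MaxPooling1D(pool_size=200)` plus `Flatten()` with `GlobalMaxPooling1D()` would remove the dependence on the sequence length, at the cost of deviating from the model as posted.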
RCNN
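# Imports assume standalone Keras 2.x, which matches the lr= optimizer argument used below.
import keras
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU, Conv1D,
                          GlobalAveragePooling1D, GlobalMaxPooling1D, Dense)
from keras.models import Model

# Attention is a custom layer; MAX_SEQUENCE_LENGTH, EMBEDDING_DIM, word_index and
# get_embeddings() come from the preprocessing step. None of them is defined in this post.
inp = Input(shape=(MAX_SEQUENCE_LENGTH, ))
# Frozen pretrained word embeddings
x_4 = Embedding(len(word_index) + 1,
                EMBEDDING_DIM,
                weights=[get_embeddings()],
                input_length=MAX_SEQUENCE_LENGTH,
                trainable=False)(inp)
x_3 = SpatialDropout1D(0.2)(x_4)
# Recurrent part: bidirectional GRU that returns the full sequence
x_3 = Bidirectional(GRU(196, return_sequences=True, dropout=0.2, kernel_initializer='he_normal'),
                    merge_mode='concat')(x_3)
# Convolutional part: linear Conv1D over the GRU outputs ("valid" shortens the sequence by 2)
x_3 = Conv1D(96, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x_3)
avg_pool_3 = GlobalAveragePooling1D()(x_3)
max_pool_3 = GlobalMaxPooling1D()(x_3)
att_3 = Attention()(x_3)
x = keras.layers.concatenate([avg_pool_3, max_pool_3, att_3])
# One sigmoid output per label (the competition has 6 labels)
x = Dense(6, activation="sigmoid")(x)
adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
sgd = keras.optimizers.SGD(lr=0.001)  # alternative optimizer; not used below
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy', 'binary_crossentropy'])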
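The RCNN head reduces the convolution output, a sequence of 96-channel vectors, to fixed-size summaries before the final `Dense` layer. The sketch below reproduces what global average pooling and global max pooling compute, in plain Python on a toy sequence; the function names and the toy values are illustrative assumptions.

```python
def global_avg_pool(sequence):
    """Average each feature over all time steps of a (time, features) sequence."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def global_max_pool(sequence):
    """Take the maximum of each feature over all time steps."""
    return [max(column) for column in zip(*sequence)]


# Toy convolution output: 3 time steps, 2 channels (the real model has 96 channels).
conv_out = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]

# Concatenating the summaries doubles the feature width: [3.0, 2.0, 5.0, 4.0]
features = global_avg_pool(conv_out) + global_max_pool(conv_out)
```

With 96 channels, each pooling yields 96 values. If the custom `Attention` layer also returns one 96-dimensional vector (an assumption, since the layer is not shown), the concatenated input to the last `Dense` layer has 288 features.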
</div>
</article>
<div class="recommend-item-box recommend-box-ident type_blog clearfix" data-track-view='{"mod":"popu_614","con":",https://blog.csdn.net/starzhou/article/details/72623158,BlogCommendFromBaidu_3"}' data-track-click='{"mod":"popu_614","con":",https://blog.csdn.net/starzhou/article/details/72623158,BlogCommendFromBaidu_3"}'>
<div class="content">
<p class="content">
<a href="https://blog.csdn.net/starzhou/article/details/72623158?utm_source=blogxgwz3" target="_blank" title="Kaggle 数据挖掘比赛经验分享 (转载)">
<span class="desc oneline">