[Competition Notes] Kaggle Toxic Comment [multiple binary classifications in Keras, good comment corpora, using pre-trained word vectors]

LeYOUNGER

Article link: https://blog.csdn.net/LeYOUNGER/article/details/78949709

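Summary

I have recently been looking at a Kaggle competition, Toxic Comment:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

The goal of the competition is to decide whether a text comment is toxic.

Toxic comments are further broken down into six categories:
['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

This post mainly shares the new tricks I picked up along the way.

Keras can do several binary classifications at once

The Bi-LSTM baseline [0.051] performs six binary classifications at the same time: the final Dense layer has six sigmoid units, one per category, and the model is trained with binary_crossentropy. I had no idea this was possible!

The code is as follows: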

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000
maxlen = 100

train = pd.read_csv('../data/train/train.csv')
test = pd.read_csv('../data/test/test.csv')
subm = pd.read_csv('../data/sample_submission.csv/sample_submission.csv')
train = train.sample(frac=1)

list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values


tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)

def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model


model = get_model()
batch_size = 32
epochs = 3


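file_path = "weights_base.best.hdf5"
# Alternative: monitor the validation loss instead of the validation accuracy.
# checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# Note: older Keras releases log this metric as 'val_acc'; newer ones name it 'val_accuracy'.
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

# early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
# With epochs=3, patience=20 can never trigger; it only matters for longer runs.
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)


callbacks_list = [checkpoint, early]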
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)
# model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.load_weights(file_path)
y_test = model.predict(X_te)


# Reuse the sample submission already loaded above as `subm`
# and fill in the six predicted probability columns.
subm[list_classes] = y_test

subm.to_csv("baseline.csv", index=False)
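
Each row returned by model.predict holds six independent sigmoid probabilities, one per label, so a comment can end up with several labels or with none. The baseline above writes the raw probabilities to the CSV; the following is only a minimal sketch of how one such row could be read as label names. The 0.5 cut-off, the helper name labels_above_threshold and the probability values are made up for illustration.

```python
# Label order matches list_classes in the baseline above.
LIST_CLASSES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def labels_above_threshold(probabilities, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    if len(probabilities) != len(LIST_CLASSES):
        raise ValueError("expected one probability per label")
    return [name for name, p in zip(LIST_CLASSES, probabilities) if p >= threshold]


# Made-up rows standing in for rows of y_test.
row_multi = [0.93, 0.08, 0.71, 0.02, 0.64, 0.01]
row_clean = [0.03, 0.00, 0.01, 0.00, 0.02, 0.00]

print(labels_above_threshold(row_multi))  # ['toxic', 'obscene', 'insult']
print(labels_above_threshold(row_clean))  # []
```

Because the six outputs are scored independently, the first row gets three labels at once and the second row gets none, which a single softmax output could not express.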

Good comment corpora and related resources

Comments
- YouTube Comments (excellent for supplementing the threat and identity_hate columns)
- Reddit Comments (roughly a terabyte of data, divided by year)
Toxic word dictionary
Pre-trained word embeddings
- Google's word2vec embedding: [Word2Vec] [DownloadLink]
- GloVe word vectors: [Glove]
- Facebook's fastText embeddings: [FastText]
- [DeepMoji]: To understand how language is used to express emotions
Wikipedia
- Wikipedia database reports: https://en.wikipedia.org/wiki/Wikipedia:Database_reports
- Wikimedia logs: https://meta.wikimedia.org/w/index.php?title=Special%3ALog
Other

Using pre-trained word vectors

https://github.com/MoyanZitto/keras-cn/blob/master/docs/legacy/blog/word_embedding.md

Usage is as follows:

GloVe

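import os
import numpy as np

GLOVE_DIR = r'D:\glove.6B'  # raw string so the backslash is kept literally
EMBEDDING_DIM = 100  # must match the dimension of glove.6B.100d.txt
# word_index is assumed to come from the fitted tokenizer, e.g. tokenizer.word_index

embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# Keras word indices start at 1; row 0 is left for padding, hence the + 1.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector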
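
The lookup above silently leaves a zero row for every word that has no pre-trained vector, so it can be worth checking how much of the vocabulary is actually covered. Below is a minimal sketch of the same parse-and-fill logic on a tiny in-memory sample, using only the standard library. The sample vectors, the three-word word_index and the helper names load_vectors and build_matrix are hypothetical stand-ins, not part of the original code.

```python
import io

# Three made-up lines in the GloVe text layout: a token followed by its coefficients.
SAMPLE_GLOVE_TEXT = """the 0.1 0.2 0.3
idiot -0.4 0.5 0.6
hello 0.7 -0.8 0.9
"""

# Hypothetical stand-in for tokenizer.word_index (Keras indices start at 1).
word_index = {"the": 1, "idiot": 2, "stoopid": 3}
EMBEDDING_DIM = 3


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[i] = vector
    return matrix, missing


vectors = load_vectors(io.StringIO(SAMPLE_GLOVE_TEXT))
matrix, missing = build_matrix(word_index, vectors, EMBEDDING_DIM)
coverage = 1 - len(missing) / len(word_index)

print(matrix[2])  # the vector stored for "idiot"
print(missing)    # words that keep an all-zero row
print(round(coverage, 2))
```

Misspelled or obfuscated words, which are common in toxic comments, show up in the missing list; a low coverage figure suggests that a different embedding or some text normalisation may help.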

Google Word2Vec

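import numpy as np
from gensim.models.keyedvectors import KeyedVectors

w2v_bin = r'D:\GoogleNews-vectors-negative300.bin'  # raw string keeps the backslash literal
EMBEDDING_DIM = 300  # the GoogleNews vectors are 300-dimensional
# Named w2v_model so that it does not shadow the Keras `model` defined earlier.
w2v_model = KeyedVectors.load_word2vec_format(w2v_bin, binary=True)

# word_index is assumed to come from the fitted tokenizer, e.g. tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = w2v_model[word] if word in w2v_model else None
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector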

FastText

def get_embeddings_FastText():
    from gensim.models.keyedvectors import KeyedVectors
    w2v_bin = '../pre-trained/FastText_wiki.en/wiki.en.vec'
    model = KeyedVectors.load_word2vec_format(w2v_bin, binary=False)

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

Finally, use it in Keras

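# MAX_SEQUENCE_LENGTH is the padded sequence length (maxlen in the baseline above).
# trainable=False keeps the pre-trained vectors frozen during training.
Embedding(len(word_index) + 1,
          EMBEDDING_DIM,
          weights=[embedding_matrix],
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=False)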

categorical_crossentropy vs. binary_crossentropy

Quoting the explanation from the first-place participant:

In this case, it should be binary_crossentropy and not categorical_crossentropy. categorical_crossentropy assumes that all the probabilities of classes sum to 1 (a multi-class scenario where every sample has exactly 1 class). In this competition, we have a multi-label scenario, because a sample can have any number of classes (or none at all), so binary_crossentropy independently optimises each class.
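
The difference described in the quote can be checked numerically. The sketch below uses only the standard library and made-up logits for one comment with three positive labels. It shows that softmax probabilities are forced to sum to 1, that sigmoid probabilities are not, and how a binary cross-entropy averaged over the six labels is computed. The function names are illustrative and are not Keras APIs.

```python
import math


def sigmoid(z):
    """Squash one logit into (0, 1), independently of the other labels."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Turn all logits into one distribution that sums to 1."""
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_prob):
    """Mean of the per-label log losses; each label is scored on its own."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


# Made-up logits for the six labels of one comment; three labels are positive.
logits = [3.0, -2.0, 2.5, -4.0, 2.0, -3.0]
y_true = [1, 0, 1, 0, 1, 0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print(sum(softmax_probs))   # sums to 1 up to rounding, so three labels cannot all be near 1
print(sum(sigmoid_probs))   # unconstrained; here well above 1
print(binary_crossentropy(y_true, sigmoid_probs))
```

With softmax the three positive labels have to share a total probability of 1, while the sigmoid outputs can each be close to 1, which is what a multi-label target needs.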