Summary
I have recently been looking at a Kaggle competition, Toxic Comment:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
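The goal of the competition is to decide whether a text comment is toxic.
Toxic comments are further broken down into six categories:
['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
This post shares the new tricks I picked up along the way.
Keras can do several binary classifications at once
The Bi-LSTM baseline [0.051] performs six binary classifications at the same time. I had no idea this was even possible!
The code is as follows: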
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint
max_features = 20000
maxlen = 100
train = pd.read_csv('../data/train/train.csv')
test = pd.read_csv('../data/test/test.csv')
subm = pd.read_csv('../data/sample_submission.csv/sample_submission.csv')
train = train.sample(frac=1)
list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)
def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    # Max over the time axis turns the sequence into one fixed-size vector.
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    # Six sigmoid units: one independent probability per label.
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
model = get_model()
batch_size = 32
epochs = 3
file_path = "weights_base.best.hdf5"
# Alternative: monitor the validation loss instead of the validation accuracy.
# checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# Note: newer Keras releases log this metric as 'val_accuracy' rather than 'val_acc'.
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
# early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)
callbacks_list = [checkpoint, early]
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)
# model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)
model.load_weights(file_path)
y_test = model.predict(X_te)
# Reuse the sample submission already loaded into subm, so the file path stays consistent.
subm[list_classes] = y_test
subm.to_csv("baseline.csv", index=False)
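To make the "six binary classifications at once" idea concrete, here is a minimal NumPy sketch (it is not part of the original kernel). The labels and predicted probabilities are made-up values for three hypothetical comments. It shows that the binary cross-entropy averaged over the whole (samples, 6) matrix equals the mean of six separate per-label log losses.

```python
import numpy as np

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical multi-hot labels: a row may contain several 1s, or none at all.
y_true = np.array([
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
], dtype="float64")

# Hypothetical sigmoid outputs of the final Dense(6) layer, one probability per label.
y_pred = np.array([
    [0.90, 0.10, 0.80, 0.05, 0.70, 0.10],
    [0.10, 0.01, 0.05, 0.01, 0.10, 0.02],
    [0.95, 0.60, 0.90, 0.20, 0.85, 0.50],
])

def binary_log_loss(t, p, eps=1e-7):
    """Element-wise binary cross-entropy; eps avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

losses = binary_log_loss(y_true, y_pred)  # shape (3, 6)
per_label = losses.mean(axis=0)           # one log loss per binary task
overall = losses.mean()                   # average over the whole matrix

for name, value in zip(list_classes, per_label):
    print(f"{name:>14}: {value:.4f}")
print("mean of the six per-label losses:", per_label.mean())
print("loss over the whole matrix:      ", overall)
```

The last two printed numbers agree, which is why a single Dense(6, activation="sigmoid") head trained with binary_crossentropy behaves like six binary classifiers sharing one network body.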
A collection of good comment corpora and resources
Comments
- YouTube Comments (excellent for supplementing the threat and identity_hate columns)
- Reddit Comments (roughly a terabyte of data, divided by year)
Toxic word dictionary
- http://www.bannedwordlist.com/
- https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
- https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
- https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/
- https://kaggle2.blob.core.windows.net/forum-message-attachments/4810/badwords.txt
- https://gist.github.com/ryanlewis/a37739d710ccdb4b406d
Pre-trained word embeddings
- Google’s word2vec embedding: [Word2Vec] [DownloadLink]
- GloVe word vectors: [Glove]
- Facebook’s fastText embeddings: [FastText]
- [DeepMoji]: To understand how language is used to express emotions
Wikipedia
- Wikipedia database reports: https://en.wikipedia.org/wiki/Wikipedia:Database_reports
- Wikimedia logs: https://meta.wikimedia.org/w/index.php?title=Special%3ALog
Other
Using pre-trained word vectors
https://github.com/MoyanZitto/keras-cn/blob/master/docs/legacy/blog/word_embedding.md
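They can be used as follows:
GloVe
import os
import numpy as np

# Raw string so the backslash in the Windows path is not read as an escape.
GLOVE_DIR = r'D:\glove.6B'
EMBEDDING_DIM = 100  # must match the vector size of glove.6B.100d.txt
# Assumes word_index = tokenizer.word_index from the fitted Keras Tokenizer.

# Parse the GloVe text file into a {word: vector} dictionary.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# Row i of the matrix holds the vector of the word whose index is i.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector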
Google Word2Vec
from gensim.models.keyedvectors import KeyedVectors

# Raw string so the backslash in the Windows path is not read as an escape.
w2v_bin = r'D:\GoogleNews-vectors-negative300.bin'
# w2v holds the pre-trained vectors; the name avoids clashing with the Keras model.
w2v = KeyedVectors.load_word2vec_format(w2v_bin, binary=True)
# EMBEDDING_DIM must be 300 for these vectors.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]
    # words not found in the embedding index stay all-zeros.
fastText
def get_embeddings_FastText():
    # Relies on the module-level word_index and EMBEDDING_DIM
    # (300 for the wiki.en vectors).
    from gensim.models.keyedvectors import KeyedVectors
    w2v_bin = '../pre-trained/FastText_wiki.en/wiki.en.vec'
    # The .vec file is plain text in word2vec format, hence binary=False.
    model = KeyedVectors.load_word2vec_format(w2v_bin, binary=False)
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        if word in model:
            embedding_matrix[i] = model[word]
        # words not found in the embedding index stay all-zeros.
    return embedding_matrix
Finally, plug the matrix into Keras:
# trainable=False keeps the pre-trained vectors frozen during training.
Embedding(len(word_index) + 1,
          EMBEDDING_DIM,
          weights=[embedding_matrix],
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=False)
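The matrix-building loop is easier to follow on a toy vocabulary. The sketch below is only an illustration: the three-dimensional vectors, the in-memory "file" and the small word_index are all made up, and it needs neither Keras nor a downloaded embedding file. It shows that out-of-vocabulary words keep an all-zero row, and that a frozen embedding layer amounts to a row lookup in this matrix.

```python
import io
import numpy as np

EMBEDDING_DIM = 3  # toy size; the real files above use 100 or 300 dimensions

# Hypothetical stand-in for a GloVe-style text file: "word v1 v2 v3" per line.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "idiot 0.9 0.8 0.7\n"
    "hello 0.4 0.5 0.6\n"
)

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical tokenizer.word_index: indices start at 1, 0 is reserved for padding.
word_index = {"the": 1, "idiot": 2, "xyzzy": 3}

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # "xyzzy" has no vector, so row 3 stays zero

# A frozen embedding layer is a row lookup: token ids in, vectors out.
padded_sequence = np.array([0, 0, 1, 2, 3])  # "the idiot xyzzy", padded at the front
vectors = embedding_matrix[padded_sequence]  # shape (5, EMBEDDING_DIM)
print(vectors)
```

Rows 0 (padding) and 3 (the unknown word) come out as zeros, so the share of words actually found in the pre-trained vocabulary is worth checking before training.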
Categorical_crossentropy VS Binary_crossentropy
Quoting the explanation given by the first-place competitor:
In this case, it should be binary_crossentropy and not categorical_crossentropy. categorical_crossentropy assumes that all the probabilities of classes sum to 1 (a multi-class scenario where every sample has exactly 1 class). In this competition, we have a multi-label scenario, because a sample can have any number of classes (or none at all), so binary_crossentropy independently optimises each class.
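The difference is easy to see numerically. The following NumPy sketch uses made-up logits for one hypothetical comment that is toxic, obscene and insulting at the same time, and compares a softmax output (the assumption behind categorical_crossentropy) with per-label sigmoid outputs (the assumption behind binary_crossentropy).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical logits for a single comment.
# Order: toxic, severe_toxic, obscene, threat, insult, identity_hate
logits = np.array([3.0, -2.0, 3.0, -4.0, 3.0, -3.0])

p_sigmoid = sigmoid(logits)  # each label gets its own probability
p_softmax = softmax(logits)  # probabilities are forced to sum to 1

print("sigmoid:", p_sigmoid.round(3), "sum =", p_sigmoid.sum().round(3))
print("softmax:", p_softmax.round(3), "sum =", p_softmax.sum().round(3))
```

With softmax the three labels that apply have to share the probability mass, so none of them can exceed roughly one third. With sigmoid each of them can be close to 1 independently, which is what a multi-label problem needs.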