Using GloVe pretrained word vectors in Keras to build an embedding matrix, with a Jigsaw Unintended Bias in Toxicity Classification baseline as the example

Walking through a baseline for the Jigsaw Unintended Bias in Toxicity Classification competition, this post shows how to use Keras together with GloVe pretrained word vectors to build an embedding matrix, and then how to build a model on top of it for prediction.

Data Loading

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import gc
import logging
import datetime
import warnings
import pickle
from keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, Dropout, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing import text, sequence
from keras.losses import binary_crossentropy
from keras import backend as K
import keras.layers as L
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers

from keras.models import Model
from keras.optimizers import Adam
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

COMMENT_TEXT_COL = 'comment_text'  # column holding the raw comment text
EMB_MAX_FEAT = 300                 # dimensionality of the GloVe vectors
MAX_LEN = 220                      # pad / truncate every comment to this many tokens
MAX_FEATURES = 100000              # cap on the tokenizer vocabulary
BATCH_SIZE = 512
NUM_EPOCHS = 4
LSTM_UNITS = 128                   # hidden units per LSTM direction
DENSE_HIDDEN_UNITS = 512
NUM_MODELS = 2                     # number of models trained and averaged
EMB_PATHS = [
    # 'data/crawl-300d-2M.vec',
    'data/glove.840B.300d.txt'
]
JIGSAW_PATH = 'data/'              # directory containing train.csv / test.csv


def get_logger():
    FORMAT = '[%(levelname)s]%(asctime)s:%(name)s:%(message)s'
    logging.basicConfig(format=FORMAT)
    logger = logging.getLogger('main')
    logger.setLevel(logging.DEBUG)
    return logger
logger = get_logger()
############################################################################################

def custom_loss(y_true, y_pred):
    # Weighted binary cross-entropy: column 0 of y_true holds the binary
    # target, column 1 holds a per-sample weight that scales each example's loss.
    return binary_crossentropy(K.reshape(y_true[:, 0], (-1, 1)), y_pred) * y_true[:, 1]


def load_data():
    logger.info('Load train and test data')
    train = pd.read_csv(os.path.join(JIGSAW_PATH, 'train.csv'), index_col='id')
    test = pd.read_csv(os.path.join(JIGSAW_PATH, 'test.csv'), index_col='id')
    return train, test
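
load_data returns the raw DataFrames. Because custom_loss above expects its labels in two columns, the targets have to be packed together with per-sample weights before training. Here is a minimal sketch using uniform placeholder weights; the public Jigsaw baselines instead derive the weights from the identity columns:

train, test = load_data()
# Column 0 = binarised toxicity target, column 1 = per-sample loss weight.
targets = (train['target'].values >= 0.5).astype(np.float32)
weights = np.ones(len(train), dtype=np.float32)  # placeholder; real kernels weight by identity columns
y_train_packed = np.vstack([targets, weights]).T  # shape (n_samples, 2)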

Data Preprocessing

def perform_preprocessing(train, test):
    logger.info('data preprocessing')
    punct_mapping = {
        "_": " ",
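The post breaks off in the middle of perform_preprocessing; the punctuation mapping continues beyond what is shown. To cover the embedding-matrix construction promised in the title, here is a minimal sketch in the spirit of the public Jigsaw baselines; build_matrix and run_tokenizer are illustrative names, not the author's exact code:

def build_matrix(word_index, path):
    # Parse the GloVe text file: each line is "word v1 v2 ... v300".
    # Take the last EMB_MAX_FEAT fields as the vector, so the rare
    # multi-token keys in glove.840B.300d.txt do not corrupt it.
    embedding_index = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word = ' '.join(parts[:-EMB_MAX_FEAT])
            embedding_index[word] = np.asarray(parts[-EMB_MAX_FEAT:], dtype='float32')
    # Row i holds the vector for the word the tokenizer mapped to index i;
    # words missing from GloVe keep all-zero rows.
    embedding_matrix = np.zeros((len(word_index) + 1, EMB_MAX_FEAT))
    for word, i in word_index.items():
        vector = embedding_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector
    return embedding_matrix

def run_tokenizer(train, test):
    # Fit a single tokenizer over train + test so word indices are consistent.
    tokenizer = text.Tokenizer(num_words=MAX_FEATURES)
    tokenizer.fit_on_texts(list(train[COMMENT_TEXT_COL]) + list(test[COMMENT_TEXT_COL]))
    x_train = sequence.pad_sequences(tokenizer.texts_to_sequences(train[COMMENT_TEXT_COL]), maxlen=MAX_LEN)
    x_test = sequence.pad_sequences(tokenizer.texts_to_sequences(test[COMMENT_TEXT_COL]), maxlen=MAX_LEN)
    # One matrix per file in EMB_PATHS, concatenated along the feature axis.
    embedding_matrix = np.concatenate([build_matrix(tokenizer.word_index, p) for p in EMB_PATHS], axis=-1)
    return x_train, x_test, embedding_matrix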
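With the embedding matrix in hand, the imports at the top (Bidirectional, CuDNNLSTM, the two global pooling layers) point to the usual two-layer BiLSTM baseline. A hedged sketch of such a model follows; it is not necessarily the author's exact architecture:

def build_model(embedding_matrix):
    words = Input(shape=(MAX_LEN,))
    # Frozen embedding layer initialised from the GloVe matrix built above.
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(words)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    # Summarise the sequence with max- and average-pooling, then two residual
    # dense blocks (2 * 2 * LSTM_UNITS == DENSE_HIDDEN_UNITS == 512, so the add works).
    hidden = concatenate([GlobalMaxPooling1D()(x), GlobalAveragePooling1D()(x)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    result = Dense(1, activation='sigmoid')(hidden)
    model = Model(inputs=words, outputs=result)
    # custom_loss consumes the two-column labels assembled earlier.
    model.compile(loss=custom_loss, optimizer=Adam())
    return model

Note that CuDNNLSTM requires a GPU; on CPU-only machines, L.LSTM is the drop-in alternative.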
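Finally, a hypothetical end-to-end wiring of the pieces above, training NUM_MODELS independent copies and averaging their predictions as that constant suggests; EarlyStopping comes from the imports at the top:

x_train, x_test, embedding_matrix = run_tokenizer(train, test)

test_preds = np.zeros(len(test))
for _ in range(NUM_MODELS):
    model = build_model(embedding_matrix)
    model.fit(x_train, y_train_packed,
              batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
              validation_split=0.1,
              callbacks=[EarlyStopping(monitor='val_loss', patience=1)])
    test_preds += model.predict(x_test, batch_size=BATCH_SIZE)[:, 0] / NUM_MODELS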