Convolutional Neural Networks for Sentence Classification: Code Implementation

This post walks through building a convolutional neural network (CNN) for sentiment analysis with TensorFlow, starting from preprocessed text data. It covers data preprocessing, vocabulary construction, padding, word embeddings, model construction, and training. On the modeling side, the focus is on the Embedding, CNN, Dropout, and fully connected layers.

Preprocessing the Data

Raw Data and the Preprocessed Result

The raw data consists of two text files, one containing positive sentences and the other negative sentences, with the same number of sentences in each. The requirements for a usable preprocessed dataset are:

  1. Every sentence has the same length: each sentence is padded until all sentences are equally long, and every word is represented by its index value in a vocabulary, so the two files can be turned into matrices of the same size (a toy illustration follows this list)
  2. Every sentence gets a label, [0,1] or [1,0], depending on whether it is positive or negative; at the end the two files are merged and shuffled
  3. Many characters in the text also need preprocessing, so the resulting matrix is cleaner and less error-prone
  4. Split into a training set and a test set (train/test set)
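
As a quick toy illustration of requirements 1 and 2 (the words, indices, and sentences below are made up, not taken from the real data):

# toy illustration only; words and vocabulary indices are hypothetical
padded = [["the", "movie", "is", "great"],
          ["boring", "<PAD/>", "<PAD/>", "<PAD/>"]]      # both sentences padded to length 4
vocabulary = {"<PAD/>": 0, "the": 1, "movie": 2, "is": 3, "great": 4, "boring": 5}
x = [[vocabulary[w] for w in s] for s in padded]         # [[1, 2, 3, 4], [5, 0, 0, 0]]
y = [[0, 1], [1, 0]]                                     # first sentence positive, second negative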

Preprocessing Steps

Load Data

Load from file

Specify the path, open the file, and inspect the examples returned by open. Use strip() to remove \n.
Partial code:

# specify the path
positive_data_file = "../data/rt-polaritydata/rt-polarity.pos"
# open the file
positive_examples = list(open(positive_data_file, 'r', encoding='utf-8').readlines())
# show a few sentences
positive_examples[:3]
# remove the trailing \n with strip()
positive_examples = [s.strip() for s in positive_examples]
Split by Words

Use the clean_str() function defined below to handle the special characters in the data.
Partial code:

# merge the two files to simplify the later steps
x_text = positive_examples + negative_examples
# clean every sentence with clean_str()
x_text = [clean_str(string) for string in x_text]
# split every sentence into a list of individual words
x_text = [string.split(" ") for string in x_text]

The clean_str() function:

import re
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
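
A quick sanity check of clean_str() on a made-up sentence (the input string is just an example):

clean_str("I can't wait!")   # returns "i ca n't wait !"
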
Generate Labels

For each sentence in the texts, replace its positive or negative tag with a vector representation and store it in a label list.
Partial code:

import numpy as np
# label the two files: [0,1] means pos and [1,0] means neg, so that argmax(axis=1) later gives pos = 1 and neg = 0
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
# merge them, just like x_text
y = np.concatenate([positive_labels, negative_labels], axis=0)

Padding

Make all sentences the same length so they can be expressed as a matrix later.
Partial code:

# find the longest sentence and get its length
longest_length = max([len(string) for string in x_text])
# padding
padded_sentences = []
padding_word = "<PAD/>"
for string in x_text:
    num_padding = longest_length - len(string)
    padded_sentence = string + [padding_word] * num_padding
    padded_sentences.append(padded_sentence)

Build Vocabulary

Build a vocabulary over the whole dataset; it is essentially a dictionary.
Partial code:

import itertools
from collections import Counter
# build the vocabulary
word_counts = Counter(itertools.chain(*padded_sentences))
# list of words ordered from most to least frequent
vocabulary_inv = [x[0] for x in word_counts.most_common()]
# build a dict mapping each word to an index
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
# build the inverse dict so indices and words can be mapped back and forth
vocabulary_inv = {value: key for key, value in vocabulary.items()}

Map Sentences and Labels to Index

import numpy as np
# sentences map
x = np.array([[vocabulary[word] for word in sentence] for sentence in padded_sentences])
# labels map
y = np.array(y)
y = y.argmax(axis = 1)

Shuffle Data and Split Train/Test Set

# shuffle Data
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
# Split Train/Test Set
training_rate = 0.9
train_len = int(len(x_shuffled)*training_rate)
x_train = x_shuffled[:train_len]
y_train = y_shuffled[:train_len]
x_test = x_shuffled[train_len:]
y_test = y_shuffled[train_len:]

Word Embedding (converting words into vectors; training makes related words end up with related vectors)

After preprocessing we have the training and test sets. Before feeding them into the model there is one more pretraining step: each word is converted into a corresponding vector through model training, using word2vec.
Preload some packages:

from gensim.models import word2vec
from os.path import join, exists, split
import os
import numpy as np
  1. Set some parameters for the word2vec model:
"""
inputs:
sentence_matrix # int matrix: num_sentences x max_sentence_len
vocabulary_inv  # dict {int: str}
num_features    # Word vector dimensionality                      
min_word_count  # Minimum word count                        
context         # Context window size 
"""
  2. Two more parameters:
num_workers = 2 # Number of threads to run in parallel
downsampling = 1e-3  # Downsample setting for frequent words
  3. The sentences are currently stored as a matrix of vocabulary indices, but the word2vec function needs them as lists of actual words, so map them back:
sentences = [[vocabulary_inv[w] for w in s] for s in sentence_matrix]
  4. Pass these parameters into the model to build embedding_model:
embedding_model = word2vec.Word2Vec(sentences, workers=num_workers,
                                    size=num_features, min_count=min_word_count,
                                    window=context, sample=downsampling)

For example, to get the vector of a particular word, say 'rock':

embedding_model.wv['rock'] # the feature vector has 300 dimensions
  5. Words that do not appear in embedding_model are initialized randomly:
# add unknown word vector
embedding_weights = {}
for key, word in vocabulary_inv.items():
    if word in embedding_model.wv:
        embedding_weights[key] = embedding_model.wv[word]
    else:
        embedding_weights[key] = np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
 
  6. Saving the model
    For now I am not sure what this is used for
# save model
model_dir = 'models'
model_name = "{:d}features_{:d}minwords_{:d}context".format(num_features, min_word_count, context)
model_name = join(model_dir, model_name)
model_name
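
The snippet above only builds model_name; presumably it is meant for persisting the trained word2vec model with gensim's save() so it can be reused later. A minimal sketch under that assumption:

# assumed continuation: create the directory if needed and persist the trained model
if not exists(model_dir):
    os.mkdir(model_dir)
print("Saving Word2Vec model '%s'" % split(model_name)[-1])
embedding_model.save(model_name)
# it can later be reloaded with word2vec.Word2Vec.load(model_name)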

Building the Model

Importing the Libraries and Functions

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, GlobalMaxPooling1D, Conv1D, Embedding
from tensorflow.keras.layers import Concatenate
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
np.random.seed(0)

Detailed Explanation of the Model

toy model

Building the model

1. Pretrain the word embeddings

# Word2Vec parameters (see train_word2vec)
embedding_dim = 50
min_word_count = 1
context = 10

#Prepare embedding layer weights for not-static model
embedding_weights = train_word2vec(np.vstack((x_train, x_test)), vocabulary_inv, num_features=embedding_dim,
                                   min_word_count=min_word_count, context=context)

After running this, the following message appears, which answers the earlier question about what saving the pretrained model is for:

Saving Word2Vec model '50feature_1minwords_10context'
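
The train_word2vec function itself is not shown in this post. A minimal sketch of what it might look like, assuming it simply wraps the gensim steps from the Word Embedding section above (only the call signature comes from the code above; the body is an assumption):

def train_word2vec(sentence_matrix, vocabulary_inv, num_features=300, min_word_count=1, context=10):
    """Assumed wrapper around the gensim steps shown earlier; returns {word index: vector}."""
    num_workers = 2       # number of threads to run in parallel
    downsampling = 1e-3   # downsample setting for frequent words
    # map the index matrix back to words, since word2vec expects lists of tokens
    sentences = [[vocabulary_inv[w] for w in s] for s in sentence_matrix]
    embedding_model = word2vec.Word2Vec(sentences, workers=num_workers,
                                        size=num_features, min_count=min_word_count,
                                        window=context, sample=downsampling)
    # index -> vector dict, with random vectors for words missing from the model
    return {key: embedding_model.wv[word] if word in embedding_model.wv
            else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
            for key, word in vocabulary_inv.items()}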

2. Model hyperparameters

# Model Hyperparameters
embedding_dim = 50         # dimension of the word vectors after the embedding layer
filter_sizes = (3, 8)      # two filter sizes; since the filters slide over 1D text, their other dimension is fixed to embedding_dim
num_filters = 10           # number of filters per size, i.e. 10 filters of size 3 and 10 filters of size 8
dropout_prob = (0.5, 0.8)  # dropout probabilities for the Dropout layers, used for regularization
hidden_dims = 50           # output dimension of the first Dense layer, the same as the word vector dimension
# Training Parameters
batch_size = 64   # each batch contains 64 sentences; backpropagation fine-tunes the parameters after every 64 sentences
num_epochs = 5    # go through all the sentences 5 times
# Preprocessing parameters
sequence_length = 400  # maximum number of words in a padded sentence; this is just an initial value and is overwritten below
max_words = 5000
# Word2Vec parameters (see train_word2vec)
min_word_count = 1
context = 10

3. Build the Input layer

sequence_length = x_test.shape[1]  # parameter for the Embedding layer
input_shape = (sequence_length,)   # parameter for building the input layer
input_layer = Input(shape = input_shape, name='input_layer')

4. Build the Embedding layer and load the pretrained weights

For an explanation of embedding_layer.build((None,)), see:
https://stackoverflow.com/questions/57252260/keras-embedding-layer-build

https://keras.io/guides/making_new_layers_and_models_via_subclassing/

# Embedding; this part felt a little hard to understand at first
weights = np.array([v for v in embedding_weights.values()]) # assemble the embedding_weights in one numpy array
embedding_layer = Embedding(input_dim=len(vocabulary_inv),
                           output_dim = embedding_dim,
                           input_length = sequence_length,
                           trainable = True, # so that each word's embedding weights can be fine-tuned later
                           name = 'embedding_layer')
embedding_layer.build((None,)) # if you don't do this, the next step won't work
embedding_layer.set_weights([weights]) # use pre-trained word vector as the weights
embedded = embedding_layer(input_layer)

5. Build the CNN layers, iterating over all the filter sizes

conv_blocks = []
for fs in filter_sizes:
    conv = Conv1D(filters=num_filters, # number of filters for this filter size
                  kernel_size = fs, # window size of each filter, i.e. how many words it covers at once
                  padding = 'valid', # 'valid' padding; Keras also supports 'same' padding
                  strides = 1, # stride 1: after each convolution, slide down by one word and convolve again
                  activation = 'relu', # ReLU activation
                  use_bias = True # the bias term in y = wx + bias
                 )(embedded)
    conv = GlobalMaxPooling1D()(conv) # 1-Max pooling
    conv_blocks.append(conv)

6. Concatenate the outputs of all the filters into a single vector and apply Dropout to it for regularization

concat1max = Concatenate()(conv_blocks)
concat1max = Dropout(dropout_prob[1])(concat1max)

7. The last two FC layers

The first layer outputs something like a word embedding, and the second one maps that embedding to the labels? It feels like this might not be entirely consistent with the original paper.

output_layer = Dense(hidden_dims, activation='relu',
                    kernel_regularizer = regularizers.l2(0.01),
                    bias_regularizer = regularizers.l1(0.01))(concat1max)
output_layer = Dense(1, activation='softmax')(output_layer)

8. Wrap everything into a Model and specify the loss function and other metrics

model = Model(inputs = input_layer, outputs = output_layer)
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])

9. Train the model on the prepared data

model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs,
          validation_data=(x_test, y_test), verbose=2)
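
The EarlyStopping callback imported earlier is never actually used; if desired, it could be passed to fit() like this (the monitor and patience values below are only illustrative):

# hypothetical use of the EarlyStopping callback imported above
early_stopping = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs,
          validation_data=(x_test, y_test), verbose=2,
          callbacks=[early_stopping])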

Model Summary

The layers of the model and their parameters can be inspected with the following code:

model.summary()

The output is:

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_layer (InputLayer)        [(None, 56)]         0                                            
__________________________________________________________________________________________________
embedding_layer (Embedding)     (None, 56, 50)       938250      input_layer[0][0]                
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 54, 10)       1510        embedding_layer[0][0]            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 49, 10)       4010        embedding_layer[0][0]            
__________________________________________________________________________________________________
global_max_pooling1d (GlobalMax (None, 10)           0           conv1d[0][0]                     
__________________________________________________________________________________________________
global_max_pooling1d_1 (GlobalM (None, 10)           0           conv1d_1[0][0]                   
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 20)           0           global_max_pooling1d[0][0]       
                                                                 global_max_pooling1d_1[0][0]     
__________________________________________________________________________________________________
dropout (Dropout)               (None, 20)           0           concatenate[0][0]                
__________________________________________________________________________________________________
dense (Dense)                   (None, 50)           1050        dropout[0][0]                    
__________________________________________________________________________________________________
softmax_output (Dense)          (None, 1)            51          dense[0][0]                      
==================================================================================================
Total params: 944,871
Trainable params: 944,871
Non-trainable params: 0
__________________________________________________________________________________________________

For a quick check of the output, the following code can be used, provided the model has already been built:

from keras import backend as K
get_softmax_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                  [model.layers[-1].output])

layer_output = get_softmax_layer_output([x_train, 1])        

Then layer_output can be printed:

[array([[1.],
        [1.],
        [1.],
        ...,
        [1.],
        [1.],
        [1.]], dtype=float32)]

As you can see, when the last layer has only one neuron and uses softmax, every output is necessarily 1, because each row must sum to 1, which inevitably produces a large loss.
So the last layer should use a sigmoid activation instead, or use two neurons.
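
A minimal sketch of the first fix, keeping the rest of the model and the 0/1 scalar labels produced earlier unchanged:

# replace the final softmax layer with a single sigmoid neuron
output_layer = Dense(hidden_dims, activation='relu',
                     kernel_regularizer = regularizers.l2(0.01),
                     bias_regularizer = regularizers.l1(0.01))(concat1max)
output_layer = Dense(1, activation='sigmoid')(output_layer)

model = Model(inputs = input_layer, outputs = output_layer)
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])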
