CNN for Sentence Classification (TensorFlow)

Article link: https://blog.csdn.net/hwang4_12/article/details/62891484

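A convolutional neural network (CNN) is a special kind of deep neural network. It is special in two respects: first, its neurons are not fully connected to each other; second, some connections between neurons in the same layer share their weights (that is, the weights are identical). This sparse connectivity and weight sharing make the network structure closer to a biological neural network. They also reduce the number of weights and therefore the complexity of the model, which matters a great deal for deep architectures that are hard to train.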
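To make the effect of weight sharing concrete, here is a rough parameter count. It is only a sketch: the sizes (sentence length 56, embedding size 128, filter height 3, 128 filters) are taken from the comments in the code further down, and the two helper functions are hypothetical names introduced for this illustration. It compares one convolutional branch with a fully connected layer that maps the same input to the same number of outputs.

```python
def conv_param_count(filter_size, embedding_size, num_filters):
    """Weights plus biases of one conv layer with a single input channel."""
    weights = filter_size * embedding_size * 1 * num_filters
    biases = num_filters
    return weights + biases


def dense_param_count(sequence_length, embedding_size, filter_size, num_filters):
    """Weights plus biases of a dense layer producing the same number of outputs."""
    n_in = sequence_length * embedding_size
    n_out = (sequence_length - filter_size + 1) * num_filters
    return n_in * n_out + n_out


# Sizes assumed from the comments in the TextCNN code below.
conv_params = conv_param_count(filter_size=3, embedding_size=128, num_filters=128)
dense_params = dense_param_count(sequence_length=56, embedding_size=128,
                                 filter_size=3, num_filters=128)

print("convolution:", conv_params)       # 49280
print("fully connected:", dense_params)  # 49552128
```

Under these assumptions the shared-weight convolution needs roughly a thousand times fewer parameters than the dense alternative.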

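Here I mainly walk through the source code of an implementation of "Convolutional Neural Networks for Sentence Classification", to explain my understanding of how a CNN handles text classification.

Source code: https://github.com/dennybritz/cnn-text-classification-tf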

#! /usr/bin/env python
#encoding:utf-8

import tensorflow as tf
import numpy as np

class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        # embedding_size is 128 in this example

        # Placeholders for input, output and dropout
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
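        # Only the width (sequence_length) is fixed; the height (number of sentences in a batch) is not
        # sequence_length is the length of a sentence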
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # Keeping track of l2 regularization loss (optional)
        l2_loss = tf.constant(0.0)

        # Embedding layer: represent each word as a 128-dimensional vector
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            # A random matrix with vocab_size rows and 128 columns

            self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
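        # The embedding lookup result has shape [None, sequence_length, embedding_size]:
        # 3 dimensions [number of sentences, words per sentence, size of the vector representing each word]
        # After expand_dims, the input to the convolution has shape [None, sequence_length, embedding_size, 1]
        # None is the number of sentences; it is arbitrary and not fixed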

        # Create a convolution + maxpool layer for each filter size
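        # Design of the convolution and pooling layers
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes): # e.g. [3, 4, 5]
            # The three convolutions here run in parallel; they are not chained one after another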
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution Layer
                filter_shape = [filter_size, embedding_size, 1, num_filters] # [3 or 4 or 5, 128, 1, 128]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                # truncated_normal draws random initial values from a truncated normal distribution
                # stddev is the standard deviation; mean defaults to 0
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b") # [0.1 ... 0.1], 128 values
                conv = tf.nn.conv2d(
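                    self.embedded_chars_expanded, # 4-D convolution input
                    W, # 4-D convolution kernel
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv") # computes a 2-D convolution
                # The output has the same dtype as the input
                # conv2d takes a 4-D input: [batch (sentences per training step), height (words per sentence),
                #  width (dimensions of each word vector), channels]
                # tf.expand_dims above adds that last dimension, a single input channel
                # Kernel shape: [filter_height, filter_width, in_channels, out_channels]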
                # Apply nonlinearity
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # After the convolution, add the bias, apply the nonlinearity, then max-pool the result
                # Maxpooling over the outputs
                pooled = tf.nn.max_pool(
                    h, # the result of the convolution
                    ksize=[1, sequence_length - filter_size + 1, 1, 1], # ksize: pooling window size along each dimension
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)
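        # Input:  [num_sentences, 56 (sentence length), 128 (embedding_size), 1 (input channels)]
        # Kernel: [3 or 4 or 5, 128, 1 (input channels), 128 (output channels)]
        # Conv output: [num_sentences, 54 or 53 or 52, 1, 128 (channels)]
        # Pooling window: [1, 54 or 53 or 52, 1, 1]; pooled output: [num_sentences, 1, 1, 128].
        # There are still 128 output channels, but each now holds a single value: 128 pooled features.
        # The kernels of sizes 3, 4 and 5 run in parallel; they are not chained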
        # Combine all the pooled features
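        num_filters_total = num_filters * len(filter_sizes) # 128 × 3
        self.h_pool = tf.concat(pooled_outputs, 3) # TensorFlow >= 1.0 argument order: (values, axis)
        # Concatenate along the last dimension, joining the 128 output channels of every filter size
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
        # -1 stands for the number of sentences; each sentence then has a feature vector of length 128 × 3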

        # Add dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

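        # Final (unnormalized) scores and predictions.
        # This is really just a fully connected layer (unlike the layers before it, it works on
        # features that have been flattened to two dimensions, which the classifier is then trained on)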
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes], # W has shape 384 (128 × 3) × 2 (it is just a fully connected layer)
                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)

            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            # h_drop * W +b
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
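
The shape comments in the class above can be checked with a little arithmetic. The following is a minimal sketch in plain Python, with no TensorFlow involved. `textcnn_shapes` is a hypothetical helper written for this illustration, and the batch size of 64 is an assumed value. It traces the tensor shape through the embedding, the parallel VALID convolutions, the max-pooling, and the final concatenation.

```python
def textcnn_shapes(batch, sequence_length, embedding_size, filter_sizes, num_filters):
    """Return the shape of each intermediate tensor in the TextCNN graph."""
    shapes = {
        "embedded_chars": (batch, sequence_length, embedding_size),
        "embedded_chars_expanded": (batch, sequence_length, embedding_size, 1),
    }
    for filter_size in filter_sizes:
        # VALID padding, stride 1: the kernel spans the full embedding width,
        # so the output width collapses to 1.
        conv_height = sequence_length - filter_size + 1
        shapes["conv-%d" % filter_size] = (batch, conv_height, 1, num_filters)
        # Pooling over the whole conv height leaves one value per filter.
        shapes["pool-%d" % filter_size] = (batch, 1, 1, num_filters)
    num_filters_total = num_filters * len(filter_sizes)
    shapes["h_pool"] = (batch, 1, 1, num_filters_total)
    shapes["h_pool_flat"] = (batch, num_filters_total)
    return shapes


shapes = textcnn_shapes(batch=64, sequence_length=56, embedding_size=128,
                        filter_sizes=[3, 4, 5], num_filters=128)
for name in sorted(shapes):
    print(name, shapes[name])
```

With these sizes the three convolution outputs have heights 54, 53 and 52, and every sentence ends up as a 384-dimensional feature vector, matching the comments in the class.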


--------------------------------------------------------------------------------------------------------------------------

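# Data Preparation
# ==================================================
# (Excerpt from the repository's train.py; it assumes `import data_helpers`,
# `from tensorflow.contrib import learn`, and the command-line FLAGS defined earlier in that file.)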

# Load data
print("Loading data...")
x_text, y = data_helpers.load_data_and_labels(FLAGS.positive_data_file, FLAGS.negative_data_file)

# Build vocabulary
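max_document_length = max([len(x.split(" ")) for x in x_text]) # length, in words, of the longest sentence
# sequence_length above is equal to max_document_length
# Remember that the CNN can only take sentences of equal length as input,
# so first find the length of the longest sentence, then pad the shorter ones with 0 up to that length
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)
# max_document_length: maximum sentence length; longer sentences are truncated, shorter ones are padded
# min_frequency sets the minimum frequency a word needs to enter the vocabulary; vocabulary is an object
# .vocabulary_ exposes many properties of the vocabulary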
x = np.array(list(vocab_processor.fit_transform(x_text)))
# x is the text input converted through the vocabulary: each word is replaced by an integer, padded positions are 0, and the labels are
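
The word-to-ID mapping and zero padding described in the comments above can be sketched in plain Python. This is a simplified stand-in, not the actual `VocabularyProcessor` implementation. The `fit_transform` function below is a hypothetical helper, and it assumes that ID 0 is reserved for padding and that words receive IDs starting at 1 in order of first appearance.

```python
def fit_transform(texts, max_document_length):
    """Map each word to an integer ID and pad or truncate every sentence."""
    vocab = {}  # word -> ID; 0 is reserved for padding (assumption)
    rows = []
    for text in texts:
        ids = []
        # Sentences longer than max_document_length are truncated.
        for word in text.split(" ")[:max_document_length]:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
            ids.append(vocab[word])
        # Shorter sentences are filled with 0 up to the fixed length.
        ids += [0] * (max_document_length - len(ids))
        rows.append(ids)
    return rows, vocab


texts = ["the movie was great", "bad movie"]
max_len = max(len(t.split(" ")) for t in texts)
x, vocab = fit_transform(texts, max_len)
print(x)  # [[1, 2, 3, 4], [5, 2, 0, 0]]
```

Every row has the same length, which is what lets the sentences be stacked into the `[None, sequence_length]` input placeholder.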