Machine Learning Text Classification: A Neural Network Implementation in TensorFlow (Part 1)

zzubqh103

Tags: machine learning, neural networks

Article link: https://blog.csdn.net/qq_36810544/article/details/78698556

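This article assumes you already understand activation functions, loss functions, weights, biases, and backpropagation. It covers only the network structure and the concrete TensorFlow code; if any of these concepts are unclear, please consult other references first. The code has been uploaded to https://github.com/zzubqh/TextCategorization
=============Call me the divider===============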
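The TF-IDF values from the previous article are very high-dimensional, but fortunately there are only 10 samples, so I decided to try TensorFlow on them.
We start with the simplest possible perceptron model and will test a simple network later. The 10*140000 data matrix is fed directly into 10 perceptrons to make predictions. The model looks like this:
[Figure: model diagram]
The input layer is the TF-IDF feature values of the sample set already computed in the Bayes class. There are 140000 features in total, so the weights from the input layer to the output layer form a 140000*10 matrix and the bias is a 10*1 vector. We want the output to be the label of the most probable class, so the activation function is softmax; for a detailed introduction to softmax see https://www.zhihu.com/question/23765351. Finally there is the loss function, which here is cross-entropy, minimized during training: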

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))

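Note: this depends on the TensorFlow version. On older versions the arguments (logits = CombineInputs(X,W,b), labels=Y) did not need to be named, but the latest version requires them to be written this way. In addition, this function requires the labels to be integer class indices starting from 0, so I manually renamed the classes to [0,1,2,…,9].
OK, now we have everything we need and can start training. The code runs 1000 iterations, but the loss actually converges to below 0.09 after a little over 200. On my machine it is quite fast, producing results in about 3 minutes. Here are the results first: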
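
To make the "labels start from 0" requirement concrete, here is a minimal pure-Python sketch of what a sparse softmax cross-entropy computes. It is only an illustration, not TensorFlow's implementation: the function name, the toy logits, and the explicit ValueError check are my own assumptions.

```python
import math

def sparse_softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy from raw scores and integer class indices."""
    losses = []
    for row, label in zip(logits, labels):
        # The label is used directly as a column index, so it must lie in [0, num_classes).
        if not 0 <= label < len(row):
            raise ValueError("label %d is outside [0, %d)" % (label, len(row)))
        m = max(row)  # subtract the max before exp for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[label])  # equals -log(softmax(row)[label])
    return losses

# Toy data (assumed): 2 samples, 3 classes.
logits = [[2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
labels = [0, 2]
losses = sparse_softmax_cross_entropy(logits, labels)
mean_loss = sum(losses) / len(losses)
```

Because the label indexes a column of the score row, class names such as strings or 1-based ids have to be mapped to 0..9 first, which is why the classes were renamed by hand.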
2017-12-02 10:13:20,623 - main - INFO - the class 3 rate is 73.1707334518%
2017-12-02 10:13:23,822 - main - INFO - the class 0 rate is 89.3023252487%
2017-12-02 10:13:27,787 - main - INFO - the class 5 rate is 65.0735318661%
2017-12-02 10:13:29,954 - main - INFO - the class 8 rate is 91.0958886147%
2017-12-02 10:13:33,413 - main - INFO - the class 2 rate is 77.0212769508%
2017-12-02 10:13:38,393 - main - INFO - the class 4 rate is 81.1940312386%
2017-12-02 10:13:41,140 - main - INFO - the class 6 rate is 68.1081056595%
2017-12-02 10:13:45,486 - main - INFO - the class 9 rate is 76.8382370472%
2017-12-02 10:13:49,830 - main - INFO - the class 1 rate is 65.7794654369%
2017-12-02 10:13:53,115 - main - INFO - the class 7 rate is 80.1020383835%
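Given the limited classification power of a single-layer perceptron, this result is fairly satisfactory.
Implementation details:
Define an Inference() function that computes the output of the activation function: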

def Inference(X,W,b):
    return tf.nn.softmax(CombineInputs(X,W,b))

Define a CombineInputs() function that computes X*W+b:

def CombineInputs(X, W, b):
    return tf.matmul(X, W) + b
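
To show the shapes involved, the two functions above can be sketched in plain Python on a tiny example. The sizes here (2 samples, 3 features, 2 classes) and all values are made up for illustration; the article's real data is 10*140000 with 10 classes, and the lowercase function names are hypothetical stand-ins for the TensorFlow versions.

```python
import math

def combine_inputs(X, W, b):
    """Compute X*W + b: (n x d) times (d x k), plus a length-k bias added to every row."""
    n, d, k = len(X), len(W), len(b)
    return [[sum(X[i][j] * W[j][c] for j in range(d)) + b[c] for c in range(k)]
            for i in range(n)]

def softmax(row):
    """Turn one row of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def inference(X, W, b):
    """Apply softmax to each row of X*W + b."""
    return [softmax(row) for row in combine_inputs(X, W, b)]

# Toy sizes (assumed): 2 samples, 3 features, 2 classes.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
W = [[0.5, -0.5],
     [-1.0, 1.0],
     [0.25, 0.0]]
b = [0.0, 0.1]

probs = inference(X, W, b)  # one row of class probabilities per sample
```

Each output row has one probability per class, and the predicted class is the index of the largest entry in that row.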

Define the loss function:

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))

Then predict the output and compute the accuracy:

def Evaluate(sess, X, Y, W,b):
    # tf.argmax replaces the deprecated tf.arg_max; it picks the most probable class per row
    predicted = tf.cast(tf.argmax(Inference(X,W,b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))
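
The accuracy computed by Evaluate() boils down to "take the argmax of each row, compare with the true label, and average". A minimal pure-Python sketch of that idea, with hypothetical helper names and made-up probabilities:

```python
def argmax(row):
    """Index of the largest value in a row (the first one on ties)."""
    return max(range(len(row)), key=lambda c: row[c])

def accuracy(probs, labels):
    """Fraction of rows whose most probable class equals the true label."""
    predicted = [argmax(row) for row in probs]
    hits = [1.0 if p == y else 0.0 for p, y in zip(predicted, labels)]
    return sum(hits) / len(hits)

# Toy data (assumed): 4 samples, 3 classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3],
         [0.4, 0.35, 0.25]]
labels = [0, 2, 1, 1]
rate = accuracy(probs, labels)
```

Here the first three samples are predicted correctly and the last one is not, so the rate is 3 out of 4.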

Finally, the training function:

def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)
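
TensorFlow derives the gradients automatically, but it can help to see what one gradient-descent step does for this softmax model. The sketch below writes the update by hand in plain Python; the function names and the toy data are assumptions for illustration, and only the learning rate 0.01 and the all-zero initial weights come from the article.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def forward(X, W, b):
    """Class probabilities softmax(x*W + b) for every sample x."""
    return [softmax([sum(x[j] * W[j][c] for j in range(len(W))) + b[c]
                     for c in range(len(b))]) for x in X]

def mean_loss(X, Y, W, b):
    """Mean cross-entropy with integer labels."""
    probs = forward(X, W, b)
    return -sum(math.log(p[y]) for p, y in zip(probs, Y)) / len(Y)

def train_step(X, Y, W, b, learning_rate):
    """One batch gradient-descent update, applied in place to W and b."""
    n = float(len(X))
    probs = forward(X, W, b)  # computed once, before any parameter changes
    for x, y, p in zip(X, Y, probs):
        for c in range(len(b)):
            err = (p[c] - (1.0 if c == y else 0.0)) / n  # d(loss)/d(score of class c)
            b[c] -= learning_rate * err
            for j in range(len(W)):
                W[j][c] -= learning_rate * err * x[j]

# Toy data (assumed): 2 samples, 2 features, 2 classes, zero-initialized parameters.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [0, 1]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

initial = mean_loss(X, Y, W, b)
for step in range(200):
    train_step(X, Y, W, b, 0.01)
final = mean_loss(X, Y, W, b)
```

With zero weights every class starts equally likely, so the initial loss is log(2) for two classes; each step then nudges the weights against the gradient and the loss goes down.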

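With the code organized this way, switching to a different activation or loss function later only requires changing the corresponding function; the program framework stays untouched. The program needs to import the NavieBayes and SplitText classes mentioned in the previous articles. The full implementation: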

#-*-coding:utf-8-*-
#-------------------------------------------------------------------------------
# Name:        
# Purpose:
#
# Author:      BQH
#
# Created:     19/09/2017
# Copyright:   (c) BQH 2017
# Licence:     <your licence>
#-------------------------------------------------------------------------------
import os
import time
from time import ctime

import numpy as np
import tensorflow as tf
import yaml
import logging.config
from SplitText import *
from NavieBayes import *

def Inference(X,W,b):
    return tf.nn.softmax(CombineInputs(X,W,b))

def CombineInputs(X, W, b):
    return tf.matmul(X, W) + b

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))

def Evaluate(sess, X, Y, W,b):
    # tf.argmax replaces the deprecated tf.arg_max; it picks the most probable class per row
    predicted = tf.cast(tf.argmax(Inference(X,W,b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))

def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)

def SetupLogging(default_path='logging.yaml', default_level=logging.INFO, env_key='LOG_CFG'):
    """Setup logging configuration"""
    path = default_path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            config = yaml.safe_load(f)
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)

def main():
    SetupLogging()
    logger = logging.getLogger(__name__)
    logger.info( 'start at :{0}'.format(ctime()))
    root_path = os.path.abspath(os.curdir)
    # First build the bag of words for the sample data set
    wordset_fileName = os.path.join(root_path , 'wordSet.txt')
    st = SplitText(root_path)
    if os.path.exists(wordset_fileName) == False:
        st.CreateDataSet(wordset_fileName)
    else:
        st.wordsetvec = st.GetWordVec(wordset_fileName)
    # Build the sample data set needed for the computation
    sampleData = st.CreateSampleData()
    nb = Bayes(sampleData)
    X = nb.dataMatrix
    Y = nb.labelVec

    # Train the model
    with tf.Session() as sess:
        # len(Y) equals the number of classes only because the sample set has 10 rows and 10 classes
        W = tf.Variable(tf.zeros([X.shape[1], len(Y)], dtype = tf.float64), name='weights')
        b = tf.Variable(tf.zeros([len(Y)], dtype = tf.float64), name='bias')
        tf.global_variables_initializer().run()
        totalLoss = Loss(X,Y,W,b)
        trainOp = Train(totalLoss)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        training_steps = 1000
        for step in range(training_steps):
            sess.run([trainOp])
            if step%10 == 0:
                print("loss: {0}".format(sess.run(totalLoss)))
        # Start testing
        testFilePath = os.path.join(root_path, 'tarin_corpus_seg', 'verify')
        for className in os.listdir(testFilePath):
            logger.info('start create class {0} test data...'.format(className))
            testData = []
            startTime = time.time()
            testFileSet = os.listdir(os.path.join(testFilePath, className))
            testFiles = [os.path.join(testFilePath, className, fileName) for fileName in testFileSet]
            # Every file in this directory belongs to the class named by the directory
            trueLabel = [int(className)] * len(testFileSet)
            for testFile in testFiles:
                testContent = st.ReadFile(testFile)
                testContent = testContent.replace("\r\n", "").strip()
                content = testContent.encode('utf-8')
                testVec = st.CreateDataVec(content.split(), '0')
                # Drop the last element, which holds the placeholder label passed above
                testData.append(testVec[0:-1])
            endTime = time.time()
            logger.info('create class {0} data cost {1}s'.format(className, endTime - startTime))

            logger.info('start predict class {0}...'.format(className))
            correctRate = Evaluate(sess,np.array(testData, dtype=float),trueLabel,W,b)
            logger.info('the class {0} correct rate is {1}'.format(className, correctRate))
            logger.info('class {0} predict end!'.format(className))

        coord.request_stop()
        coord.join(threads)

if __name__ == '__main__':
    main()