Machine Learning for Text Classification: A Neural Network Implementation in TensorFlow (Part 1)

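This article assumes you are already familiar with activation functions, loss functions, weights, biases, and the backpropagation algorithm. It only presents the network structure and the concrete tf code; if any of these concepts are unclear, please consult the relevant references first. The code has been uploaded to https://github.com/zzubqh/TextCategorization
=======================================
The TF-IDF values obtained in the previous article have a very high dimensionality, but fortunately there are only 10 rows of data, so I decided to try them out with tensorflow.
We start with the simplest possible perceptron model and will test a simple network later. The 10*140000 data is fed directly into 10 perceptrons to make predictions. The model is as follows: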
[Figure: model diagram]
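The input layer here is the tf-idf feature values of the sample set that were already computed in the Bayes class. There are 140000 features in total, so the weights from the input layer to the output layer form a 140000*10 matrix, and the bias is a 10*1 vector. We want the output to be the label of the most probable class, so the activation function is softmax; for a detailed introduction to softmax, see https://www.zhihu.com/question/23765351. Finally, there is the loss function, for which we minimize the cross-entropy: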

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))

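Note: this depends on the tensorflow version. In earlier versions the arguments (logits = CombineInputs(X,W,b), labels=Y) did not need to be passed by name, but the latest version requires them to be written this way! Also, this function requires the labels to be integers starting from 0, so I manually renamed the classes to [0,1,2,…,9].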
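To make the loss and the 0-based label requirement concrete, here is a minimal sketch in plain Python (not TensorFlow) of what a sparse softmax cross-entropy computes: for each sample, take the softmax of the logits and use the negative log of the probability assigned to the true class. The class names, the `label_of` mapping, and the toy logits are made up for illustration and do not come from the original project.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(logits_batch, labels):
    """Mean of -log(p[label]) over the batch; labels are 0-based class indices."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Hypothetical class names mapped to the 0-based integer labels the loss expects
class_names = ["sports", "finance", "tech"]
label_of = {name: index for index, name in enumerate(class_names)}

# Two toy samples with three class scores each (illustrative values only)
logits_batch = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
labels = [label_of["sports"], label_of["tech"]]
loss = sparse_softmax_cross_entropy(logits_batch, labels)
print("mean cross-entropy: {0}".format(loss))
```

Because the label is used directly as an index into the probability list, a label outside 0..k-1 has no matching output unit, which is why the class names had to be renumbered from 0.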
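OK, everything we need is now in place and training can begin. The code runs 1000 iterations, but the loss has actually dropped below 0.09 after a little over 200. On my machine it runs quickly and produces results in about 3 minutes. Here are the results first: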
2017-12-02 10:13:20,623 - main - INFO - the class 3 rate is 73.1707334518%
2017-12-02 10:13:23,822 - main - INFO - the class 0 rate is 89.3023252487%
2017-12-02 10:13:27,787 - main - INFO - the class 5 rate is 65.0735318661%
2017-12-02 10:13:29,954 - main - INFO - the class 8 rate is 91.0958886147%
2017-12-02 10:13:33,413 - main - INFO - the class 2 rate is 77.0212769508%
2017-12-02 10:13:38,393 - main - INFO - the class 4 rate is 81.1940312386%
2017-12-02 10:13:41,140 - main - INFO - the class 6 rate is 68.1081056595%
2017-12-02 10:13:45,486 - main - INFO - the class 9 rate is 76.8382370472%
2017-12-02 10:13:49,830 - main - INFO - the class 1 rate is 65.7794654369%
2017-12-02 10:13:53,115 - main - INFO - the class 7 rate is 80.1020383835%
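Considering the limited classification ability of a single-layer perceptron, this result is fairly satisfactory.
Implementation details:
Define an Inference() function to compute the output of the activation function: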

def Inference(X,W,b):
    return tf.nn.softmax(CombineInputs(X,W,b))

Define a CombineInputs() function to compute X*W+b:

def CombineInputs(X, W, b):
    return tf.matmul(X, W) + b    
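As a rough illustration of how the shapes fit together, the sketch here mirrors CombineInputs() and Inference() in plain Python on a toy problem. The sizes (2 samples, 3 features, 2 classes) and all numeric values are assumptions chosen for readability; in the article the same pattern applies with 140000 features and 10 classes.

```python
import math

def combine_inputs(X, W, b):
    """Compute X*W + b for X of shape (n, d), W of shape (d, k), b of length k."""
    return [
        [sum(x_j * W[j][c] for j, x_j in enumerate(row)) + b[c] for c in range(len(b))]
        for row in X
    ]

def inference(X, W, b):
    """Apply softmax to each row of logits so that every row sums to 1."""
    result = []
    for logits in combine_inputs(X, W, b):
        shift = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - shift) for z in logits]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

# Toy sizes (assumed for illustration): 2 samples, 3 features, 2 classes
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W = [[0.5, -0.5], [0.0, 1.0], [1.0, 0.0]]
b = [0.1, -0.1]
logits = combine_inputs(X, W, b)
probs = inference(X, W, b)
print(logits)
print(probs)
```

Each row of the result has one entry per class, so the output has shape (number of samples, number of classes).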

Define the loss function:

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))    

Then the prediction, which returns the fraction of correctly classified samples:

def Evaluate(sess, X, Y, W,b):
    predicted = tf.cast(tf.argmax(Inference(X,W,b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))
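The evaluation logic boils down to taking the index of the largest probability in each row and comparing it with the true label. A small plain-Python sketch of that idea follows; the probabilities and labels are invented for illustration.

```python
def evaluate(probabilities, labels):
    """Fraction of rows whose highest-probability class equals the true label."""
    predicted = [row.index(max(row)) for row in probabilities]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / float(len(labels))

# Four toy samples over three classes (illustrative values only)
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
labels = [0, 2, 0, 2]
accuracy = evaluate(probabilities, labels)
print("accuracy: {0}".format(accuracy))
```

Since softmax preserves the ordering of the logits, taking the argmax of the raw logits would give the same predictions.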

Finally, the training function:

def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)
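For readers who want to see what a gradient descent step does for this model, here is a hedged plain-Python sketch of the update for softmax regression with a mean cross-entropy loss. It uses the standard result that the gradient of the loss with respect to each logit is the predicted probability minus the one-hot target. The toy data, the learning rate of 0.5, and the helper names are assumptions for illustration; TensorFlow derives the gradients automatically in the real code.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(X, W, b):
    """Class probabilities for every row of X."""
    return [softmax([sum(x * W[j][c] for j, x in enumerate(row)) + b[c]
                     for c in range(len(b))]) for row in X]

def mean_loss(X, Y, W, b):
    """Mean cross-entropy of the true classes."""
    return sum(-math.log(p[y]) for p, y in zip(forward(X, W, b), Y)) / len(Y)

def train_step(X, Y, W, b, learning_rate):
    """One gradient descent update of W and b, done in place."""
    n = float(len(Y))
    probs = forward(X, W, b)  # computed once, before any parameter changes
    for c in range(len(b)):
        # d(loss)/d(logit_c) per sample: p_c - 1 for the true class, p_c otherwise
        errors = [p[c] - (1.0 if y == c else 0.0) for p, y in zip(probs, Y)]
        b[c] -= learning_rate * sum(errors) / n
        for j in range(len(W)):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n
            W[j][c] -= learning_rate * grad

# Toy data (assumed): 3 samples, 2 features, 2 classes, zero-initialized parameters
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [0, 1, 1]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
loss_before = mean_loss(X, Y, W, b)
for _ in range(200):
    train_step(X, Y, W, b, learning_rate=0.5)
loss_after = mean_loss(X, Y, W, b)
print("loss before: {0}, after: {1}".format(loss_before, loss_after))
```

With zero-initialized weights every class starts with equal probability, so the initial loss equals the log of the number of classes; the updates then push it down.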

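With things defined this way, if a different activation function or loss function is needed later, you only have to change the corresponding function and can leave the program's framework untouched. The program needs to import the NavieBayes and SplitText classes mentioned in the previous articles. The full implementation: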

#-*-coding:utf-8-*-
#-------------------------------------------------------------------------------
# Name:        
# Purpose:
#
# Author:      BQH
#
# Created:     19/09/2017
# Copyright:   (c) BQH 2017
# Licence:     <your licence>
#-------------------------------------------------------------------------------
import logging.config
import os
import time
from time import ctime

import numpy as np
import tensorflow as tf
import yaml
from SplitText import *
from NavieBayes import *

def Inference(X,W,b):
    return tf.nn.softmax(CombineInputs(X,W,b))

def CombineInputs(X, W, b):
    return tf.matmul(X, W) + b

def Loss(X,Y,W,b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits = CombineInputs(X,W,b), labels=Y))

def Evaluate(sess, X, Y, W,b):
    predicted = tf.cast(tf.argmax(Inference(X,W,b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))

def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)

def SetupLogging(default_path='logging.yaml', default_level=logging.INFO, env_key='LOG_CFG'):
    """Setup logging configuration"""
    path = default_path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            config = yaml.safe_load(f)
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)

def main():
    SetupLogging()
    logger = logging.getLogger(__name__)
    logger.info( 'start at :{0}'.format(ctime()))
    root_path = os.path.abspath(os.curdir)
    # First build the bag of words and the sample data set
    wordset_fileName = os.path.join(root_path , 'wordSet.txt')
    st = SplitText(root_path)
    if not os.path.exists(wordset_fileName):
        st.CreateDataSet(wordset_fileName)
    else:
        st.wordsetvec = st.GetWordVec(wordset_fileName)
    # Build the sample data set needed for the computation
    sampleData = st.CreateSampleData()
    nb = Bayes(sampleData)
    X = nb.dataMatrix
    Y = nb.labelVec

    # Train the model
    with tf.Session() as sess:
        # The sample set has 10 rows and there are 10 classes, so len(Y) doubles as the class count
        W = tf.Variable(tf.zeros([X.shape[1], len(Y)], dtype = tf.float64), name='weights')
        b = tf.Variable(tf.zeros([len(Y)], dtype = tf.float64), name='bias')
        tf.global_variables_initializer().run()
        totalLoss = Loss(X,Y,W,b)
        trainOp = Train(totalLoss)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        training_steps = 1000
        for step in range(training_steps):
            sess.run([trainOp])
            if step%10 == 0:
                print("loss: {0}".format(sess.run(totalLoss)))
        # Start testing
        testFilePath = os.path.join(root_path, 'tarin_corpus_seg', 'verify')
        for className in os.listdir(testFilePath):
            logger.info('start create class {0} test data...'.format(className))
            testData = []
            startTime = time.time()
            testFileSet = os.listdir(os.path.join(testFilePath, className))
            testFileLable = [os.path.join(testFilePath, className, fileName) for fileName in testFileSet]
            # Every file in this directory belongs to the same class, so build a flat list of labels
            trueLable = [int(className)] * len(testFileSet)
            for testFile in testFileLable:
                testContent = st.ReadFile(testFile)
                testContent = testContent.replace("\r\n", "").strip()
                content = testContent.encode('utf-8')
                testVec = st.CreateDataVec(content.split(), '0')
                testData.append(testVec[0:-1])
            endTime = time.time()
            logger.info('create class {0} data cost {1}s'.format(className, endTime - startTime))

            logger.info('start predict class {0}...'.format(className))
            correctRate = Evaluate(sess,np.array(testData, dtype=float),trueLable,W,b)
            logger.info('the class {0} correct rate is {1}'.format(className, correctRate))
            logger.info('class {0} predict end!'.format(className))

        coord.request_stop()
        coord.join(threads)

if __name__ == '__main__':
    main()

