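This post assumes you already know what activation functions, loss functions, weights, biases, and backpropagation are. It only covers the network structure and the concrete tf code; if any of those concepts are unclear, please consult the relevant material first. The code has been uploaded to https://github.com/zzubqh/TextCategorization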
============= divider line =============
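The TF-IDF values from the previous post have a very high dimension, but luckily there are only 10 rows of data, so I decided to try tensorflow on them.
I start with the simplest possible perceptron model and will test a simple network later. The 10*140000 data is fed directly into 10 perceptrons to make the prediction. The model is as follows:
The input layer is the tf-idf feature values of the sample set already computed in the Bayes class. There are 140000 features in total, so the weights from the input layer to the output layer form a 140000*10 matrix and the bias is a 10*1 vector. We want the output to be the label of the most probable class, so the activation function is softmax; for a detailed introduction to softmax see https://www.zhihu.com/question/23765351. Finally there is the loss function, which here is cross-entropy, to be minimized: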
def Loss(X, Y, W, b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=CombineInputs(X, W, b), labels=Y))
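Note: because of tensorflow version differences, on earlier versions the arguments (logits = CombineInputs(X,W,b), labels=Y) could be passed without names, but the latest version requires the keyword form shown here. Also, this function requires the labels to be integers starting from 0, so I manually renamed the classes to [0,1,2,…,9].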
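To make the label requirement concrete, here is a minimal pure-Python sketch of what a sparse softmax cross-entropy computes for a single sample: the label is used directly as an index into the logits, which is why it has to be an integer in 0..C-1. The helper names `sparse_softmax_cross_entropy` and `mean_loss` and the toy numbers are my own, for illustration only; this is not the tensorflow implementation.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """Loss for one sample: -log(softmax(logits)[label]).

    `label` is an integer class index, so it must lie in 0..len(logits)-1.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def mean_loss(batch_logits, labels):
    """Average the per-sample losses, like tf.reduce_mean over the batch."""
    losses = [sparse_softmax_cross_entropy(l, y)
              for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Toy batch (made-up values): 2 samples, 3 classes.
batch_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
labels = [2, 0]
loss = mean_loss(batch_logits, labels)

# With all-zero logits over 10 classes every class gets probability 1/10,
# so the loss is ln(10) regardless of the label.
initial_loss_10_classes = sparse_softmax_cross_entropy([0.0] * 10, 3)
```

One consequence worth noting: since the weights and bias in this post start at zero, the very first loss value should be close to ln(10), about 2.30, whatever the data looks like.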
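OK, now everything we need is in place and training can start. The code runs 1000 iterations, but the loss actually converges to below 0.09 after a little over 200. On my machine it is quite fast, about 3 minutes to get a result. Here is the result first: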
2017-12-02 10:13:20,623 - main - INFO - the class 3 rate is 73.1707334518%
2017-12-02 10:13:23,822 - main - INFO - the class 0 rate is 89.3023252487%
2017-12-02 10:13:27,787 - main - INFO - the class 5 rate is 65.0735318661%
2017-12-02 10:13:29,954 - main - INFO - the class 8 rate is 91.0958886147%
2017-12-02 10:13:33,413 - main - INFO - the class 2 rate is 77.0212769508%
2017-12-02 10:13:38,393 - main - INFO - the class 4 rate is 81.1940312386%
2017-12-02 10:13:41,140 - main - INFO - the class 6 rate is 68.1081056595%
2017-12-02 10:13:45,486 - main - INFO - the class 9 rate is 76.8382370472%
2017-12-02 10:13:49,830 - main - INFO - the class 1 rate is 65.7794654369%
2017-12-02 10:13:53,115 - main - INFO - the class 7 rate is 80.1020383835%
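Considering how limited a single-layer perceptron's classification power is, I am actually fairly happy with this result.
Implementation details:
Define an Inference() function that computes the output of the activation function:
def Inference(X, W, b):
    return tf.nn.softmax(CombineInputs(X, W, b))
Define a CombineInputs() function that computes X*W+b:
def CombineInputs(X, W, b):
    return tf.matmul(X, W) + b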
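The two functions above can be sketched in plain Python for one sample, which makes the shapes easier to see: a sample with F features times an F x C weight matrix, plus a length-C bias, gives C logits, and softmax turns them into C probabilities. The names `combine_inputs` and `softmax` and the toy sizes (3 features, 2 classes) are assumptions for illustration; the real model uses 140000 features and 10 classes.

```python
import math

def combine_inputs(x, w, b):
    """Return x*W + b for one sample x (length F), W (F x C), b (length C)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes (made up): 3 features, 2 classes.
x = [1.0, 0.0, 2.0]
w = [[0.5, -0.5],
     [0.1, 0.3],
     [0.2, 0.4]]
b = [0.0, 0.1]

logits = combine_inputs(x, w, b)         # one logit per class
probs = softmax(logits)                  # probabilities that sum to 1
predicted = probs.index(max(probs))      # index of the most probable class
```

Because softmax is monotonic, the most probable class is also the class with the largest logit, so taking the argmax of the logits would give the same prediction.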
Define the loss function:
def Loss(X, Y, W, b):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=CombineInputs(X, W, b), labels=Y))
Then the prediction output:
def Evaluate(sess, X, Y, W, b):
    # tf.argmax replaces the deprecated tf.arg_max
    predicted = tf.cast(tf.argmax(Inference(X, W, b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))
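What Evaluate() computes is simply the fraction of samples whose most probable class matches the true label. A minimal pure-Python sketch, with hypothetical helper names `argmax` and `accuracy` and made-up probabilities:

```python
def argmax(values):
    """Index of the largest element (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(batch_probs, labels):
    """Fraction of samples whose predicted class equals the true label."""
    predicted = [argmax(p) for p in batch_probs]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / float(len(labels))

# Toy batch (made-up values): 4 samples, 3 classes.
batch_probs = [[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4],
               [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]
rate = accuracy(batch_probs, labels)  # the last sample is misclassified
```

When every sample in the batch comes from the same class, as in the per-class test loop of this post, this number is the share of that class's documents that were labelled correctly.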
Finally the training function:
def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)
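For readers who want to see what a gradient descent step does for this kind of model, below is a small pure-Python sketch of full-batch gradient descent on softmax regression. It relies on the standard result that the gradient of the cross-entropy with respect to the logits is (probabilities minus one-hot label). The toy data, the learning rate of 0.5, and all function names are assumptions for illustration; tensorflow derives the gradients automatically and the post itself uses a learning rate of 0.01.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, w, b):
    """Logits x*W + b for one sample."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def mean_loss(xs, ys, w, b):
    """Mean cross-entropy over the batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        total -= math.log(softmax(forward(x, w, b))[y])
    return total / len(xs)

def gradient_step(xs, ys, w, b, learning_rate):
    """One full-batch gradient descent update of w and b, in place."""
    n = float(len(xs))
    num_features, num_classes = len(w), len(b)
    grad_w = [[0.0] * num_classes for _ in range(num_features)]
    grad_b = [0.0] * num_classes
    for x, y in zip(xs, ys):
        probs = softmax(forward(x, w, b))
        for j in range(num_classes):
            # d(loss)/d(logit_j) = p_j - 1 if j is the true class, else p_j
            delta = probs[j] - (1.0 if j == y else 0.0)
            grad_b[j] += delta / n
            for i in range(num_features):
                grad_w[i][j] += x[i] * delta / n
    for i in range(num_features):
        for j in range(num_classes):
            w[i][j] -= learning_rate * grad_w[i][j]
    for j in range(num_classes):
        b[j] -= learning_rate * grad_b[j]

# Toy data (made up): 3 samples, 2 features, 2 classes.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0, 1, 1]
w = [[0.0, 0.0], [0.0, 0.0]]   # zero initialization, as in the post
b = [0.0, 0.0]

initial_loss = mean_loss(xs, ys, w, b)   # ln(2) with all-zero parameters
for _ in range(200):
    gradient_step(xs, ys, w, b, learning_rate=0.5)
final_loss = mean_loss(xs, ys, w, b)
```

The loss is convex in W and b, so with a small enough learning rate each step lowers it, which matches the steadily decreasing loss printed during training.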
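With the code structured this way, switching to a different activation function or loss function later only means changing the corresponding function body; the program skeleton stays untouched. The program needs to import the two classes NavieBayes and SplitText mentioned in the earlier posts. The full implementation: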
#-*-coding:utf-8-*-
#-------------------------------------------------------------------------------
# Name:
# Purpose:
#
# Author: BQH
#
# Created: 19/09/2017
# Copyright: (c) BQH 2017
# Licence: <your licence>
#-------------------------------------------------------------------------------
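import os
import time
from time import ctime
import logging.config

import numpy as np
import tensorflow as tf
import yaml

# os, time and np are imported explicitly here because the code below uses
# them directly instead of relying on the star imports to provide them.
from SplitText import *
from NavieBayes import *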
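def Inference(X, W, b):
    # Softmax over the logits: one probability per class
    return tf.nn.softmax(CombineInputs(X, W, b))

def CombineInputs(X, W, b):
    # Linear part of the perceptron: X*W + b
    return tf.matmul(X, W) + b

def Loss(X, Y, W, b):
    # Mean cross-entropy; Y holds integer class labels starting from 0
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=CombineInputs(X, W, b), labels=Y))

def Evaluate(sess, X, Y, W, b):
    # Fraction of samples whose most probable class equals the true label
    # (tf.argmax replaces the deprecated tf.arg_max)
    predicted = tf.cast(tf.argmax(Inference(X, W, b), 1), tf.int32)
    return sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))

def Train(totalLoss):
    learningRate = 0.01
    return tf.train.GradientDescentOptimizer(learningRate).minimize(totalLoss)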
def SetupLogging(default_path='logging.yaml', default_level=logging.INFO, env_key='LOG_CFG'):
    """Setup logging configuration"""
    path = default_path
    # The environment variable, if set, overrides the default config path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            # safe_load avoids constructing arbitrary Python objects from the file
            config = yaml.safe_load(f)
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)
def main():
    SetupLogging()
    logger = logging.getLogger(__name__)
    logger.info('start at :{0}'.format(ctime()))
    root_path = os.path.abspath(os.curdir)
    # First build the bag of words and the sample data set
    wordset_fileName = os.path.join(root_path, 'wordSet.txt')
    st = SplitText(root_path)
    if not os.path.exists(wordset_fileName):
        st.CreateDataSet(wordset_fileName)
    else:
        st.wordsetvec = st.GetWordVec(wordset_fileName)
    # Build the sample data needed for the computation
    sampleData = st.CreateSampleData()
    nb = Bayes(sampleData)
    X = nb.dataMatrix
    Y = nb.labelVec
    # Train the model
    with tf.Session() as sess:
        # len(Y) is the number of samples; it is used as the number of classes
        # here because the sample set has exactly one row per class (10 rows).
        W = tf.Variable(tf.zeros([X.shape[1], len(Y)], dtype=tf.float64), name='weights')
        b = tf.Variable(tf.zeros([len(Y)], dtype=tf.float64), name='bias')
        # tf.global_variables_initializer replaces the deprecated tf.initialize_all_variables
        tf.global_variables_initializer().run()
        totalLoss = Loss(X, Y, W, b)
        trainOp = Train(totalLoss)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        trainning_steps = 1000
        for step in range(trainning_steps):
            sess.run([trainOp])
            if step % 10 == 0:
                print("loss: {0}".format(sess.run(totalLoss)))
        # Start testing: one sub-directory per class, named by its label
        testFilePath = os.path.join(root_path, 'tarin_corpus_seg', 'verify')
        for className in os.listdir(testFilePath):
            logger.info('start create class {0} test data...'.format(className))
            testData = []
            startTime = time.time()
            testFileSet = os.listdir(os.path.join(testFilePath, className))
            testFileLable = [os.path.join(testFilePath, className, fileName) for fileName in testFileSet]
            # Flat list with one true label per test file
            trueLable = [int(className)] * len(testFileSet)
            for testFile in testFileLable:
                testContent = st.ReadFile(testFile)
                testContent = testContent.replace("\r\n", "").strip()
                content = testContent.encode('utf-8')
                testVec = st.CreateDataVec(content.split(), '0')
                testData.append(testVec[0:-1])
            endTime = time.time()
            logger.info('create class {0} data cost {1}s'.format(className, endTime - startTime))
            logger.info('start predict class {0}...'.format(className))
            correctRate = Evaluate(sess, np.array(testData, dtype=float), trueLable, W, b)
            logger.info('the class {0} correct rate is {1}'.format(className, correctRate))
            logger.info('class {0} predict end!'.format(className))
        coord.request_stop()
        coord.join(threads)
        # The with block closes the session on exit, so no explicit sess.close() is needed

if __name__ == '__main__':
    main()