Neural Networks for Text Sentiment Analysis (Part 3)

Project 3

  • In Project 2 we built and trained a two-layer neural network for sentiment analysis, but its accuracy was unsatisfactory
  • In this Project we will analyze the problem and think about how to improve the accuracy

What exactly went wrong?

  • In short, the input data is too noisy. By noise we mean words that carry no value for the output, such as the, an, of, on and so on. These are neutral words that do not help at all in deciding Positive or Negative, yet they occur extremely frequently, so when we count occurrences they are bound to outnumber the words that actually carry sentiment.
  • When these neutral words make up most of the input, the network gets confused during training. It reasons like this: when label=1 (positive) these words appear many times, and when label=0 (negative) they also appear many times, so are these words positive or negative?
  • Let's look at an example to get a feel for how frequent this noise really is
import numpy as np
import sys
import time
import pandas as pd
# Load the data
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
from collections import Counter
positive_counter = Counter()
negative_counter = Counter()
total_counter = Counter()
for review, label in zip(reviews.values, labels.values):
    words = review[0].split(' ')
    if label[0] == 'positive':
        positive_counter.update(words)
    elif label[0] == 'negative':
        negative_counter.update(words)
    total_counter.update(words)
# Print the most frequent words in each class
print(positive_counter.most_common(10))
print(negative_counter.most_common(10))
  • Looking at the positive_counter and negative_counter statistics, the words that appear most often are all noise. The truly useful words, such as happy, funny, horrible, do not appear very often, yet they are the important ones
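This claim can be made concrete with a simple positive/negative frequency ratio. A minimal sketch with hypothetical counts (the numbers below are made up for illustration; on the real data you would pass positive_counter and negative_counter directly):

```python
from collections import Counter

# Hypothetical counts standing in for the real positive_counter / negative_counter
pos = Counter({'the': 5000, 'happy': 120, 'horrible': 5})
neg = Counter({'the': 4900, 'happy': 10, 'horrible': 150})

def pos_neg_ratio(word, pos, neg):
    # Near 1.0 -> neutral word; far above or below 1.0 -> sentiment-bearing word
    return pos[word] / float(neg[word] + 1)

for w in ('the', 'happy', 'horrible'):
    print(w, round(pos_neg_ratio(w, pos, neg), 3))
```

The ratio for a high-frequency word like the sits near 1.0 despite its huge count, while happy and horrible land far from 1.0, which is exactly why raw counts mislead the network.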

How can we fix it?

  • An intuitive idea is to preprocess the training data: strip out the neutral, useless words and keep only the words that actually matter.
  • That is a good idea, and a later Project will do exactly that, but here we present a different fix: reduce the weight of the neutral words
  • Recall the steps we used to vectorize the text:
    1. Split the input text into individual words
    2. Count the occurrences of each word (count)
    3. Look up each word's position in word2idx and use count as the input
  • This does turn the text into numbers, but a neutral word such as the appears many times in a single review, which leads the network to believe that the is important. So we need to reduce the weight of neutral words
  • The simplest approach is to stop feeding in counts and instead feed in only 0 or 1: 0 means the word does not appear in the review, 1 means it does.
  • The change is then trivial: in update_input_layer(), replace += with =
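The effect of that one-character change can be seen in isolation. A minimal sketch (toy vocabulary and review, not the real dataset):

```python
import numpy as np

word2idx = {'the': 0, 'movie': 1, 'was': 2, 'great': 3}  # toy vocabulary
review = 'the movie was great the the'

# Count-based input (+=): the neutral word 'the' dominates
counts = np.zeros((1, len(word2idx)))
for w in review.split(' '):
    counts[0, word2idx[w]] += 1

# Binary input (=): every word that appears contributes equally
binary = np.zeros((1, len(word2idx)))
for w in review.split(' '):
    binary[0, word2idx[w]] = 1

print(counts)  # [[3. 1. 1. 1.]]
print(binary)  # [[1. 1. 1. 1.]]
```

With binary inputs, 'the' no longer contributes three times as much signal as 'great'.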
class SentimentNetwork(object):
    def __init__(self, reviews, labels, hidden_nodes=10, learning_rate = 0.1):
        """
        参数:
            reviews(dataFrame), 用于训练
            labels(dataFrame), 用于训练
            hidden_nodes(int), 隐层的个数
            learning_rate(double),学习步长
        """

        np.random.seed(1)

        self.pre_process_data(reviews, labels)

        self.init_network(len(self.review_vocab), hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        """
        预处理数据,统计reviews中出现的所有单词,并且生成word2index
        """

        # 统计reviews中出现的所有单词,
        review_vocab = set()
        for review in reviews.values:
            word = review[0].split(' ')
            review_vocab.update(word)

        self.review_vocab = list(review_vocab)

        # Collect all labels that appear (here there are only two: 'positive' and 'negative')
        label_vocab = set()
        for label in labels.values:
            label_vocab.add(label[0])
        self.label_vocab = list(label_vocab)

        # Build word2idx: assign each word its own index
        self.word2idx = dict()
        for idx, word in enumerate(self.review_vocab):
            self.word2idx[word] = idx

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        """
        初始化网络的参数
        """
        self.learning_rate = learning_rate
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        self.weights_0_1 = np.random.normal( 0.0, self.input_nodes**-0.5, (self.input_nodes, self.hidden_nodes) )
        self.weights_1_2 = np.random.normal( 0.0, self.hidden_nodes**-0.5, (self.hidden_nodes, self.output_nodes) )

        self.layer_0 = np.zeros((1, self.input_nodes))

    def update_input_layer(self, review):
        """
        对review进行数字化处理,并将结果存放到self.layer_0中,也就是输入层
        """
        self.layer_0 *= 0
        for word in review.split(' '):
            if word.lower() in self.word2idx:
                idx = self.word2idx[word.lower()]
                # Set the position of each word that appears to 1;
                # we no longer use the occurrence count as the input
                self.layer_0[0,idx] = 1
                # self.layer_0[0,idx] += 1

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self, output):
        return output * (1 - output)

    def get_target_for_label(self,label):
        if label == 'positive':
            return 1
        else:
            return 0

    def train(self, training_reviews, training_label):
        assert(len(training_reviews) == len(training_label))

        correct_so_far = 0

        start = time.time()

        # Training loop
        for i in range(len(training_reviews)):
            review = training_reviews.iloc[i,0]
            label = training_label.iloc[i,0]

            self.update_input_layer(review)

            layer_1_i = np.dot( self.layer_0, self.weights_0_1 )
            layer_1_o = layer_1_i  # the hidden layer uses no activation function (identity)

            layer_2_i = np.dot( layer_1_o, self.weights_1_2 )
            layer_2_o = self.sigmoid( layer_2_i )

            layer_2_error = layer_2_o - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2_o)

            layer_1_error = np.dot( layer_2_delta, self.weights_1_2.T )
            layer_1_delta = layer_1_error
            # Weight updates
            self.weights_1_2 -= np.dot(layer_1_o.T, layer_2_delta) * self.learning_rate
            self.weights_0_1 -= np.dot(self.layer_0.T, layer_1_delta) * self.learning_rate

            if(layer_2_o >= 0.5 and label=='positive'):
                correct_so_far += 1
            elif(layer_2_o < 0.5 and label=='negative'):
                correct_so_far += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")

    def test(self, testing_reviews, testing_labels):
        assert(len(testing_reviews) == len(testing_labels))

        correct = 0

        start = time.time()

        for i in range(len(testing_reviews)):
            review = testing_reviews.iloc[i,0]
            label = testing_labels.iloc[i,0]

            pred = self.run(review)
            if pred == label:
                correct += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0

            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")


    def run(self, review):
        self.update_input_layer(review)

        layer_1_i = np.dot( self.layer_0, self.weights_0_1 )
        layer_1_o = layer_1_i  # identity activation, as in train()

        layer_2_i = np.dot( layer_1_o, self.weights_1_2 )
        layer_2_o = self.sigmoid( layer_2_i )            
        if layer_2_o >= 0.5:
            return 'positive'
        else:
            return 'negative'
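For reference, the forward pass that run() performs can be reproduced in a few standalone lines. A sketch with toy dimensions (assumed for illustration, not the real vocabulary size):

```python
import numpy as np

np.random.seed(1)

# Toy dimensions mirroring init_network (assumed for illustration)
input_nodes, hidden_nodes, output_nodes = 4, 3, 1
weights_0_1 = np.random.normal(0.0, input_nodes ** -0.5, (input_nodes, hidden_nodes))
weights_1_2 = np.random.normal(0.0, hidden_nodes ** -0.5, (hidden_nodes, output_nodes))

layer_0 = np.array([[1.0, 0.0, 1.0, 1.0]])  # binary input vector
layer_1 = np.dot(layer_0, weights_0_1)      # identity activation on the hidden layer
layer_2 = 1 / (1 + np.exp(-np.dot(layer_1, weights_1_2)))  # sigmoid output

prediction = 'positive' if layer_2[0, 0] >= 0.5 else 'negative'
print(layer_2.shape, prediction)
```

The sigmoid squashes the output into (0, 1), and thresholding at 0.5 turns it into a class label.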

Training

mlp = SentimentNetwork(reviews, labels, learning_rate=0.1)
mlp.train(reviews[:-1000], labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):103.6 #Correct:1952 #Trained:2501 Training Accuracy:78.0%
Progress:20.8% Speed(reviews/sec):104.3 #Correct:3999 #Trained:5001 Training Accuracy:79.9%
Progress:31.2% Speed(reviews/sec):104.4 #Correct:6119 #Trained:7501 Training Accuracy:81.5%
Progress:41.6% Speed(reviews/sec):103.3 #Correct:8277 #Trained:10001 Training Accuracy:82.7%
Progress:52.0% Speed(reviews/sec):102.9 #Correct:10435 #Trained:12501 Training Accuracy:83.4%
Progress:62.5% Speed(reviews/sec):102.9 #Correct:12574 #Trained:15001 Training Accuracy:83.8%
Progress:72.9% Speed(reviews/sec):102.8 #Correct:14689 #Trained:17501 Training Accuracy:83.9%
Progress:83.3% Speed(reviews/sec):102.1 #Correct:16861 #Trained:20001 Training Accuracy:84.3%
Progress:93.7% Speed(reviews/sec):102.0 #Correct:19051 #Trained:22501 Training Accuracy:84.6%
Progress:99.9% Speed(reviews/sec):101.9 #Correct:20371 #Trained:24000 Training Accuracy:84.8%

Testing

mlp.test(reviews[-1000:], labels[-1000:])

Progress:99.9% Speed(reviews/sec):942.9 #Correct:858 #Tested:1000 Testing Accuracy:85.8%

End Project 3

  • Wow, the accuracy broke 80%! A very simple change gave us a substantial improvement
  • In the next Project we will speed this network up so that training runs much faster