Sentiment Analysis with Neural Networks (Part 1)


This post walks through sentiment classification with a neural network. It is based on the Week 2 project of Udacity's Deep Learning Foundations; the original project classifies English reviews, and here I adapt it to Chinese. The first step is Chinese word segmentation, done with jieba (结巴).

import jieba
seg = '使用结巴来对中文进行分词'
seg_list = jieba.cut(seg)
print("/ ".join(seg_list))
使用/ 结巴/ 来/ 对/ 中文/ 进行/ 分词
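A side note on the jieba API: jieba.cut returns a generator that can only be consumed once, so when a plain list is needed (as in the cw helper further down) it has to be materialized; jieba.lcut is a convenience wrapper that returns a list directly. A minimal check:

tokens_a = list(jieba.cut(seg))   # materialize the generator
tokens_b = jieba.lcut(seg)        # returns a list directly
assert tokens_a == tokens_b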

What the classification is based on


One idea is to count, separately for the positive and the negative reviews, how many times each word appears. In theory some words should lean clearly toward one class or the other; let's verify that below.
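Before running this on the real corpus, here is a minimal toy sketch of the idea (the two sentences below are made up purely for illustration):

from collections import Counter

toy_pos = jieba.lcut('这本书非常好,我很喜欢')   # made-up positive sentence
toy_neg = jieba.lcut('质量太差了,非常失望')     # made-up negative sentence
print(Counter(toy_pos).most_common())
print(Counter(toy_neg).most_common())
# sentiment-bearing tokens such as '喜欢' and '失望' each show up in only one of the two counters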

import pandas as pd
import numpy as np

# read the labeled corpora: one review per row, no header row
neg = pd.read_excel('data/neg.xls', header=None)
pos = pd.read_excel('data/pos.xls', header=None)
neg.head()
                                0
0	做为一本声名在外的流行书,说的还是广州的外企,按道理应该和我的生存环境差不多啊。但是一看之下...
1	作者有明显的自恋倾向,只有有老公养不上班的太太们才能像她那样生活。很多方法都不实用,还有抄袭...
2	作者完全是以一个过来的自认为是成功者的角度去写这个问题,感觉很不客观。虽然不是很喜欢,但是,...
3	作者提倡内调,不信任化妆品,这点赞同。但是所列举的方法太麻烦,配料也不好找。不是太实用。
4	作者的文笔一般,观点也是和市面上的同类书大同小异,不推荐读者购买。
pos.head(6)
                                0
0	做父母一定要有刘墉这样的心态,不断地学习,不断地进步,不断地给自己补充新鲜血液,让自己保持一...
1	作者真有英国人严谨的风格,提出观点、进行论述论证,尽管本人对物理学了解不深,但是仍然能感受到...
2	作者长篇大论借用详细报告数据处理工作和计算结果支持其新观点。为什么荷兰曾经县有欧洲最高的生产...
3	作者在战几时之前用了"拥抱"令人叫绝.日本如果没有战败,就有会有美军的占领,没胡官僚主义的延...
4	作者在少年时即喜阅读,能看出他精读了无数经典,因而他有一个庞大的内心世界。他的作品最难能可贵...
5	作者有一种专业的谨慎,若能有幸学习原版也许会更好,简体版的书中的印刷错误比较多,影响学者理解...

pos['mark'] = 1
neg['mark'] = 0  # label the training corpora: 1 = positive, 0 = negative
pn = pd.concat([pos, neg], ignore_index=True)  # merge the two corpora
neglen = len(neg)
poslen = len(pos)  # number of reviews in each corpus

cw = lambda x: list(jieba.cut(x))  # segmentation helper: text -> list of tokens
pn['words'] = pn[0].apply(cw)


# shuffle the rows so positive and negative reviews are mixed
pn = pn.reindex(np.random.permutation(pn.index))
pn.head()
from collections import Counter


positive_counts = Counter()   # word -> count in positive reviews
negative_counts = Counter()   # word -> count in negative reviews
total_counts = Counter()      # word -> count over the whole corpus
len(pn['words'])
pn['words'][1][:10]
['作者', '真有', '英国人', '严谨', '的', '风格', ',', '提出', '观点', '、']

Now let's count how many times each word appears in each class.


for i in range(len(pn['words'])):
    if pn['mark'][i] == 1:
        for word in pn['words'][i]:
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in pn['words'][i]:
            negative_counts[word] += 1
            total_counts[word] += 1
positive_counts.most_common(10)
[(',', 63862),
 ('的', 48811),
 ('。', 25667),
 ('了', 14110),
 ('是', 10775),
 ('我', 9578),
 ('很', 8270),
 (',', 6682),
 (' ', 6354),
 ('也', 6307)]
negative_counts.most_common(10)
[(',', 42831),
 ('的', 28859),
 ('。', 16847),
 ('了', 13476),
 (',', 8462),
 ('是', 7994),
 ('我', 7841),
 (' ', 7528),
 ('!', 7084),
 ('不', 5821)]
pos_neg_ratios = Counter()
# for words that occur often enough, compare how often they appear in positive vs. negative reviews
for term, cnt in list(total_counts.most_common()):
    if cnt > 100:
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio
list(reversed(pos_neg_ratios.most_common()))[0:30]
[('上当', 0.014285714285714285),
 ('不买', 0.037267080745341616),
 ('最差', 0.04580152671755725),
 ('抵制', 0.057034220532319393),
 ('退货', 0.0707070707070707),
 ('死机', 0.07075471698113207),
 ('太差', 0.0728476821192053),
 ('退', 0.07920792079207921),
 ('极差', 0.08421052631578947),
 ('论语', 0.08849557522123894),
 ('恶心', 0.0896551724137931),
 ('很差', 0.09166666666666666),
 ('招待所', 0.09243697478991597),
 ('投诉', 0.09433962264150944),
 ('垃圾', 0.10138248847926268),
 ('没法', 0.125),
 ('几页', 0.12903225806451613),
 ('糟糕', 0.13333333333333333),
 ('脏', 0.13978494623655913),
 ('维修', 0.14606741573033707),
 ('晕', 0.15151515151515152),
 ('严重', 0.16),
 ('不值', 0.16161616161616163),
 ('浪费', 0.16793893129770993),
 ('失望', 0.16888045540796964),
 ('差', 0.16926503340757237),
 ('页', 0.18562874251497005),
 ('郁闷', 0.19730941704035873),
 ('根本', 0.20512820512820512),
 ('后悔', 0.20574162679425836)]
# squash the ratios with a log so they are centered on 0:
# a neutral word (ratio ≈ 1) maps to ≈ 0, a positive-leaning word to a positive value
# and a negative-leaning word to a negative value; the +0.01 guards against log(0)
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))
pos_neg_ratios.most_common(10)
[('结局', 3.7954891891721947),
 ('命运', 3.1986731175506815),
 ('成长', 3.0002674287193822),
 ('人们', 2.9885637840753785),
 ('快乐', 2.968080742223481),
 ('人类', 2.8332133440562162),
 ('自由', 2.6996819514316934),
 ('小巧', 2.57191802677763),
 ('世界', 2.5416019934645457),
 ('幸福', 2.5403579543242145)]
Words with a clear emotional color, such as 好, 不错 and 喜欢, score strongly on the positive side; in the top-10 output above, 快乐 and 幸福 are good examples.

list(reversed(pos_neg_ratios.most_common()))[:10]
[('上当', -3.7178669909871886),
 ('不买', -3.0519411931108684),
 ('最差', -2.8859540494394644),
 ('抵制', -2.7025520357679857),
 ('退货', -2.5169290903564066),
 ('死机', -2.5163389039584163),
 ('太差', -2.4907515123361046),
 ('退', -2.4167854452210129),
 ('极差', -2.3622233593137767),
 ('论语', -2.3177436534248872)]

We now have a rough picture: after segmentation, reviews labeled positive and negative really do differ, with some words appearing far more often in one class than in the other. For comparison, a neutral function word such as 的 has a ratio of 48811 / (28859 + 1) ≈ 1.69 and a log score of only about 0.53, much closer to zero than the strongly polarized words listed above.


vocab = set(total_counts.keys())   # every distinct token in the corpus
vocab_size = len(vocab)

Assigning each word an index

The idea now is simply to number all vocab_size distinct tokens, i.e. to set up a vector of length vocab_size, so that every review can then be represented as one such vector (a bag-of-words encoding).

layer_0 = np.zeros((1, vocab_size))   # one bag-of-words vector, one slot per vocabulary word
word2index = {}

for i, word in enumerate(vocab):
    word2index[word] = i              # assign every word a fixed position in the vector

def update_input_layer(review):
    global layer_0
    # clear out the previous state: reset the layer to all 0s
    layer_0 *= 0
    for word in review:
        layer_0[0][word2index[word]] += 1

update_input_layer(pn['words'][5])
print(layer_0)
[[ 0.  0.  0. ...,  0.  0.  0.]]
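As a quick sanity check (just a sketch built from the variables defined above), we can invert word2index and look at which positions of layer_0 ended up nonzero:

index2word = {i: w for w, i in word2index.items()}   # invert the word -> index mapping
nonzero_cols = np.nonzero(layer_0)[1]                # vector positions that were incremented
print([(index2word[i], layer_0[0][i]) for i in nonzero_cols[:10]])
# each pair is (word, number of times it occurs in pn['words'][5])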

The full code follows:

import time
import sys
import numpy as np

# A simple feed-forward network for sentiment classification
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
        '''
        Parameters:
        reviews (DataFrame column)  - segmented reviews used for training
        labels (DataFrame column)   - training labels (1 = positive, 0 = negative)
        hidden_nodes (int)          - number of nodes in the hidden layer
        learning_rate (float)       - learning rate
        '''

        # set our random number generator 
        # np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
        
        
    def pre_process_data(self, reviews, labels):
        '''
        Pre-process the data: collect every word that appears in reviews and build word2index.
        '''
        # collect every distinct word that appears in the reviews
        review_vocab = set()
        for review in reviews:
            for word in review:
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        
#  collect the distinct labels (here there are only two: 1 and 0)
#         label_vocab = set()
#         for label in labels:
#             label_vocab.add(label)
        
#         self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
#         self.label_vocab_size = len(self.label_vocab)
        # build word2index: give every word its own "house number" (index)
        self.word2index = {} 
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
#         self.label2index = {}
#         for i, label in enumerate(self.label_vocab):
#             self.label2index[label] = i
         
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.hidden_nodes,self.input_nodes))
    
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.output_nodes, self.hidden_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((input_nodes,1))
    
        
    def update_input_layer(self,review):
        '''
        Encode a segmented review and store the result in self.layer_0 (the input layer).
        Note: unlike the standalone version above, this marks presence (1) instead of counting occurrences.
        '''
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review:
            if(word in self.word2index.keys()):
                self.layer_0[self.word2index[word]][0] = 1
                
#     def get_target_for_label(self,label):
#         if(label == 'POSITIVE'):
#             return 1
#         else:
#             return 0
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def train(self, training_reviews, training_labels):
        
        assert(len(training_reviews) == len(training_labels))
        
        correct_so_far = 0
        
        start = time.time()
        
        for i in range(len(training_reviews)):
            
            review = training_reviews[i]
            label = training_labels[i]
            
            #### Implement the forward pass here ####
            ### Forward pass ###

            # Input Layer
            self.update_input_layer(review)
            layer_0 = self.layer_0
            # Hidden layer
            layer_1 = self.weights_0_1.dot(self.layer_0)

            # Output layer
            layer_2 = self.sigmoid(self.weights_1_2.dot(layer_1))

            #### Implement the backward pass here ####
            ### Backward pass ###

            # TODO: Output error
            layer_2_error = layer_2 - label # output error: prediction minus target
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # TODO: Backpropagated error
            layer_1_error = self.weights_1_2.T.dot(layer_2_delta) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # TODO: Update the weights
            self.weights_1_2 -= layer_2_delta.dot(layer_1.T) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= layer_1_delta.dot(layer_0.T) * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i / float(len(training_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct_so_far)
                             + " #Trained:" + str(i + 1)
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i + 1))[:4]
                             + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        
        correct = 0
        
        start = time.time()
        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            reviews_per_second = i / float(time.time() - start)
            
            sys.stdout.write("\rProgress:" + str(100 * i / float(len(testing_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct) + " #Tested:" + str(i + 1)
                             + " Testing Accuracy:" + str(correct * 100 / float(i + 1))[:4] + "%")
            
    def run(self, review):
        
        # Input Layer
#         print(review)
        self.update_input_layer(review)
#         print(self.layer_0.shape)
#         print(self.weights_0_1.shape)
#         print(np.dot(self.weights_0_1,self.layer_0))
        # Hidden layer
        layer_1 = self.weights_0_1.dot(self.layer_0)

        # Output layer
        layer_2 = self.sigmoid(self.weights_1_2.dot(layer_1))
#         print(layer_2)  # debug: at one point this kept printing 0.5
        if(layer_2[0] > 0.5):
            return 1
        else:
            return 0
reviews = pn['words'].values
labels = pn['mark'].values
# train on everything except the last 1000 reviews (held out for testing)
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],learning_rate=0.01)
mlp.train(reviews[:-1000],labels[:-1000])
Progress:0.0% Speed(reviews/sec):0.0 #Correct:0 #Trained:1 Training Accuracy:0.0%
Progress:12.4% Speed(reviews/sec):91.76 #Correct:1911 #Trained:2501 Training Accuracy:76.4%
Progress:24.8% Speed(reviews/sec):103.7 #Correct:3993 #Trained:5001 Training Accuracy:79.8%
Progress:37.3% Speed(reviews/sec):108.4 #Correct:6117 #Trained:7501 Training Accuracy:81.5%
Progress:49.7% Speed(reviews/sec):110.7 #Correct:8285 #Trained:10001 Training Accuracy:82.8%
Progress:62.1% Speed(reviews/sec):112.5 #Correct:10450 #Trained:12501 Training Accuracy:83.5%
Progress:74.6% Speed(reviews/sec):113.0 #Correct:12654 #Trained:15001 Training Accuracy:84.3%
Progress:87.0% Speed(reviews/sec):113.4 #Correct:14849 #Trained:17501 Training Accuracy:84.8%
Progress:99.4% Speed(reviews/sec):113.0 #Correct:17055 #Trained:20001 Training Accuracy:85.2%
Progress:99.9% Speed(reviews/sec):113.0 #Correct:17147 #Trained:20105 Training Accuracy:85.2%
mlp.test(reviews[-1000:],labels[-1000:])
Progress:15.2% Speed(reviews/sec):745.0% #Correct:136 #Tested:153 Testing Accuracy:88.8%
Progress:32.6% Speed(reviews/sec):806.8% #Correct:292 #Tested:327 Testing Accuracy:89.2%
Progress:52.3% Speed(reviews/sec):862.9% #Correct:461 #Tested:524 Testing Accuracy:87.9%
Progress:70.9% Speed(reviews/sec):877.4% #Correct:628 #Tested:710 Testing Accuracy:88.4%
Progress:88.2% Speed(reviews/sec):874.0% #Correct:782 #Tested:883 Testing Accuracy:88.5%
Progress:99.9% Speed(reviews/sec):876.2% #Correct:886 #Tested:1000 Testing Accuracy:88.6%
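With the model trained, a new sentence can be classified by segmenting it the same way and passing the token list to run (a quick sketch; the sentence is made up and the prediction naturally depends on the trained weights):

new_review = list(jieba.cut('这本书内容充实,读起来很有收获'))
print(mlp.run(new_review))   # 1 -> predicted positive, 0 -> predicted negative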

Summary

That's everything for this post on sentiment analysis. A quick recap:

1. We first observed that word frequencies differ between the two opinion classes, so the words a segmented review contains can be used to judge its overall sentiment.

2. We then gave every token an index, turning each review into a fixed-length vector.

3. Next we built the neural network (the usual routine).

4. From there we can keep looking at how to make the computation faster, for example by dropping very rare words, as well as words that appear in both positive and negative reviews and are therefore not very discriminative (a small sketch of this idea follows the list).

5. Finally, the weights of the trained network carry meaning of their own: words can be grouped by their weights, and words expressing the same opinion naturally cluster together.
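As a rough sketch of point 4 (the min_count name and threshold are my own, not from the original project), very rare words could be dropped before building word2index, shrinking the input layer:

min_count = 10   # hypothetical threshold; would need tuning on a validation set
pruned_vocab = [w for w, c in total_counts.items() if c >= min_count]
print(len(total_counts), '->', len(pruned_vocab))   # how much smaller the input layer becomes
# SentimentNetwork.pre_process_data could then build word2index from pruned_vocab
# instead of from every word it encounters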

Looking at the limitations of the approach above: on the input side we only used a simple bag-of-words encoding, which ignores both word order and the fact that different words can share the same meaning. The next post will use an RNN and word2vec to improve on this.
