Amazon Food Reviews Sentiment Analysis: A Walkthrough of the Approach

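This post works with the Amazon food review data on Kaggle and applies NLP text preprocessing: lower-casing, expanding contractions, removing stop words, and so on. It uses the ConceptNet Numberbatch word embeddings and builds a Seq2seq model whose encoder is a bidirectional LSTM and whose decoder uses an attention mechanism. Reviews that contain too many UNK tokens, or whose summary contains an UNK, are filtered out. The model is implemented in TensorFlow; note the importance of setting dtype to float32.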


This post covers an NLP competition on Kaggle. Link:
https://www.kaggle.com/snap/amazon-fine-food-reviews

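The code in this post comes from Currie32's project on GitHub. This is a study-notes post, not original work. Project link:
https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews/blob/master/summarize_reviews.ipynb
You can also download the dataset and run it locally to see the results for yourself.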

Goal: generate a short text summary from each review. Examples:
Description(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
Summary(1): great coffee
Description(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won’t either!
Summary(2): omg gross gross

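Test environment: Python 3.5.6, TensorFlow 1.12+. Note that in practice, if your TensorFlow version is lower than 1.1, a fair number of changes are needed. Someone posted code adapted for version 1.0 below the original post.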

For the Seq2seq part, another author has written a tutorial. GitHub link:
https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/contrib_seq2seq

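Now to the main content. The first step is text preprocessing, which consists of:

Converting to lower case
Expanding contractions, e.g. don't becomes do not
Removing unwanted characters and words
Removing stop words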

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

# `contractions` is a dict mapping contractions to their expanded forms
# (e.g. "don't" -> "do not"); it must be defined before calling clean_text.

def clean_text(text, remove_stopwords=True):
    """Lower-case, expand contractions, strip unwanted characters, optionally drop stop words."""

    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms
    text = " ".join(contractions.get(word, word) for word in text.split())

    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    # Strip <br /> before the character class below, which would otherwise remove its "/"
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = " ".join(w for w in text.split() if w not in stops)

    return text

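Note that this requires importing the stopwords corpus from nltk. The function is worth saving for later: it can be reused as-is for preprocessing other English text.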
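The function above relies on a `contractions` dictionary that is not shown in this post. Here is a minimal sketch of what it might look like. The three entries and the `expand_contractions` helper are illustrative assumptions, not the table from the original notebook.

```python
# Hypothetical, abbreviated mapping for illustration only.
contractions = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
}

def expand_contractions(text):
    """Lower-case the text and replace each known contraction, token by token."""
    words = text.lower().split()
    return " ".join(contractions.get(word, word) for word in words)

print(expand_contractions("I don't like it"))  # i do not like it
```

Because the lookup is done on whitespace-separated tokens, a contraction written with a typographic apostrophe (don’t) or followed directly by punctuation (don't,) will not match the dictionary key and passes through unchanged.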

For word embeddings, the author did not use the more common GloVe but ConceptNet Numberbatch, which incorporates GloVe and performs somewhat better than GloVe alone.

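import numpy as np

# Load the Numberbatch vectors into a dict: word -> float32 vector.
# The path below is from the original author's machine; point it at your own download.
embeddings_index = {}
with open('/Users/Dave/Desktop/Programming/numberbatch-en-17.02.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding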

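Setting dtype to float32 here matters: NumPy's default float type is float64, which does not match the float32 used on the TensorFlow side. You can see this in the input placeholder definitions further down, so it is worth keeping in mind.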
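To make the dtype point concrete, here is a small sketch that parses two made-up lines in the same "word v1 v2 ..." layout from an in-memory buffer instead of the real file. The sample words and values are assumptions for illustration; real Numberbatch vectors are much longer.

```python
import io
import numpy as np

# Two invented lines standing in for the embedding file.
sample = io.StringIO("coffee 0.1 -0.2 0.3\ncheese 0.0 0.5 -0.4\n")

embeddings_index = {}
for line in sample:
    values = line.rstrip().split(' ')
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(embeddings_index['coffee'].dtype)    # float32
print(np.asarray([0.1, -0.2, 0.3]).dtype)  # float64, NumPy's default for Python floats
```

Keeping the vectors in float32 from the start avoids a cast later, when the embedding matrix is handed to a float32 graph.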

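Naturally, some words in the reviews are not covered by these embeddings. Such words are mapped to an UNK key and given a random word vector. To keep the data meaningful, the author's rule is: if a review contains more than one UNK, or its summary contains any UNK, the example is dropped.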
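The filtering rule can be sketched in a few lines. The helper names (`count_unk`, `keep_pair`), the limit parameters, and the toy vocabulary below are assumptions made for this illustration. The thresholds follow the rule described above: at most one UNK in the review, none in the summary.

```python
def count_unk(tokens, vocab):
    """Number of tokens that are missing from the embedding vocabulary."""
    return sum(1 for token in tokens if token not in vocab)

def keep_pair(text_tokens, summary_tokens, vocab,
              unk_text_limit=1, unk_summary_limit=0):
    """Keep a (review, summary) pair only if both UNK counts are within their limits."""
    return (count_unk(text_tokens, vocab) <= unk_text_limit
            and count_unk(summary_tokens, vocab) <= unk_summary_limit)

vocab = {"great", "coffee", "good", "price", "cheese", "worst"}  # toy vocabulary

pairs = [
    (["great", "coffee", "good", "price"], ["great", "coffee"]),  # kept: no UNK
    (["worst", "cheese", "zzz"], ["worst", "cheese"]),            # kept: one UNK in review
    (["worst", "zzz", "qqq"], ["worst", "cheese"]),               # dropped: two UNKs in review
    (["great", "coffee"], ["omg", "coffee"]),                     # dropped: UNK in summary
]
kept = [pair for pair in pairs if keep_pair(pair[0], pair[1], vocab)]
print(len(kept))  # 2
```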

Next comes the model itself. The model inputs consist of several parameters, defined as follows:

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

def model_inputs():
    """Create the placeholders that feed the model."""
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')
    # Parentheses are required for a return value that spans two lines
    return (input_data, targets, lr, keep_prob, summary_length,
            max_summary_length, text_length)
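The `text_length` and `summary_length` placeholders presumably carry the true, unpadded length of each sequence, since the reviews and summaries in a batch have to be padded to a common length before they fit a `[None, None]` integer tensor. A minimal sketch of that padding step follows. `pad_batch` and `PAD_ID` are hypothetical names chosen for this example.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest length in the batch; also return true lengths."""
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    return padded, lengths

PAD_ID = 0  # hypothetical id reserved for the padding token
texts = [[5, 8, 2], [7], [4, 9]]
padded_texts, text_length = pad_batch(texts, PAD_ID)
print(padded_texts)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(text_length)   # [3, 1, 2]
```

The padded lists would be fed to `input_data` (or `targets`) and the length list to the matching length placeholder.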

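For the encoder, the author uses a bidirectional LSTM and concatenates the forward and backward outputs at every time step. The concatenated result is the encoder's output.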
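The per-time-step concatenation can be illustrated without TensorFlow. The sketch below deliberately swaps the LSTM for a toy scalar recurrent cell, so it only shows the data flow (two passes, re-aligned and paired per step), not the author's actual encoder. All names here are assumptions for the example.

```python
import math

def run_rnn(inputs, weight=0.5):
    """Toy scalar recurrent cell: h_t = tanh(x_t + weight * h_{t-1}). Not an LSTM."""
    state, outputs = 0.0, []
    for x in inputs:
        state = math.tanh(x + weight * state)
        outputs.append(state)
    return outputs

def bidirectional_outputs(inputs):
    """Run the cell left-to-right and right-to-left, then pair the outputs per time step."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]  # re-align so index t refers to the same token
    return [[f, b] for f, b in zip(forward, backward)]

enc_output = bidirectional_outputs([0.2, -0.1, 0.4])
print(len(enc_output), len(enc_output[0]))  # 3 2
```

Each time step ends up with twice the hidden size of a single direction. In TensorFlow 1.x the same pattern is usually written with `tf.nn.bidirectional_dynamic_rnn`, whose forward and backward outputs are then joined with `tf.concat` along the last axis.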

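tf.variable_scope is used to make sure the weights are reused. It declares the scope that variables live in. If you have not used it before, the TensorFlow documentation covers it. Pay attention to how it differs from name_scope when used with get_variable(): variables created through get_variable() ignore name_scope but honor variable_scope.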

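For the decoder, the author uses Bahdanau attention. The decoder output is obtained from the attention wrapper and the decoder state. Both the review text and the summary need to be padded.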
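For readers who have not met Bahdanau (additive) attention before, here is a plain-Python sketch of a single decoder step: score each encoder output against the current decoder state, normalize the scores with a softmax, and take the weighted sum as the context vector. The weight matrices `W_h`, `W_s`, the vector `v`, and all numbers are made-up values for illustration. In the real model they are learned parameters inside the TensorFlow attention mechanism.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(enc_outputs, dec_state, W_h, W_s, v):
    """Return (weights, context) for one decoder step using additive scoring."""
    projected_state = matvec(W_s, dec_state)
    scores = []
    for h in enc_outputs:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_h, projected_state)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over encoder time steps (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(enc_outputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_outputs)) for d in range(dim)]
    return weights, context

enc_outputs = [[0.1, 0.3], [0.7, -0.2], [0.0, 0.5]]  # one vector per encoder time step
dec_state = [0.2, -0.1]
W_h = [[0.5, -0.3], [0.1, 0.8]]
W_s = [[0.2, 0.4], [-0.6, 0.1]]
v = [1.0, -1.0]
weights, context = additive_attention(enc_outputs, dec_state, W_h, W_s, v)
print(len(weights), len(context))  # 3 2
```

The weights always sum to 1, so the context vector is a convex combination of the encoder outputs. With padded inputs, the scores at padding positions would additionally need to be masked out before the softmax, which this sketch does not do.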

That covers the overall flow. For the remaining details, see the code. Questions and discussion are welcome~