Deep Learning in Practice (5): Seq2Seq Sequence Generation Model for Text Summarization: Seq2Seq + Attention

Copyright notice: This is the author's original article and may not be reproduced without the author's permission.


1. Project Introduction

1.1 Data

  • Dataset: 500,000 Amazon reviews

1.2 Project Steps

  • Data preprocessing
  • Build the Seq2Seq model
  • Train the network
  • Evaluate the results

2. Code Implementation

import pandas as pd
import numpy as np
import re
import tensorflow as tf
import time

from nltk.corpus import stopwords
# Note: the two imports below are TensorFlow 1.x internal modules
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
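The stopword list used during cleaning below must be available locally. If NLTK's English stopwords have not been downloaded yet, a one-time setup sketch is:

import nltk

# Download the English stopword list used by stopwords.words("english")
nltk.download('stopwords')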

2.1 Load the Dataset

reviews = pd.read_csv('./data/reviews.csv') 

A preview of the data:

[Figure: first few rows of the reviews DataFrame]
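The original screenshots are not reproduced here; a minimal sketch to get the same preview locally (the column set is inferred from the drop() call below, plus Summary and Text):

# Shape and first few rows of the raw reviews table
print(reviews.shape)
print(reviews.head())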

# Check for any null values
reviews.isnull().sum()


2.2 Data Preprocessing

2.2.1 Feature Processing

# Remove null values and unneeded features

# Drop rows that contain null values
reviews = reviews.dropna()

# Drop columns that are not needed
reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time'], axis=1)


2.2.2 Lowercasing, Expanding Contractions, and Removing Stopwords (stopwords are removed from the review texts only)

contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'll": "i will",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "needn't": "need not",
    "oughtn't": "ought not",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there had",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "where'd": "where did",
    "where's": "where is",
    "who'll": "who will",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are"
}
def clean_text(text, remove_stopwords=True):
    '''Remove unwanted characters, optionally remove stopwords, and format the text to create fewer null word embeddings'''
    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms
    new_text = []
    for word in text.split():
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)

    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    # Strip HTML line breaks before removing punctuation, so '<br />' still matches
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = " ".join(w for w in text.split() if w not in stops)

    return text
  • We remove the stopwords from the review texts because they do not add much value when training the model.
  • However, we keep them in the summaries so that the generated summaries read more like natural phrases.
# Clean the summaries and texts
clean_summaries = []
for summary in reviews['Summary']:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print('Summaries are complete.')

clean_texts = []
for text in reviews['Text']:
    clean_texts.append(clean_text(text))
print('Texts are complete.')
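As a quick sanity check (an illustrative sketch, not from the original post), we can print a couple of cleaned summary/text pairs to confirm that lowercasing, contraction expansion, and stopword removal behave as expected:

# Inspect a few cleaned examples
for i in range(2):
    print('Clean summary:', clean_summaries[i])
    print('Clean text:   ', clean_texts[i])
    print()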

2.2.3 Word Counts

# Count the number of occurrences of each word in a set of texts
def count_words(count_dict, text):
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1
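A minimal usage sketch of count_words (the dictionary name word_counts is illustrative): tally a single vocabulary over both the cleaned summaries and the cleaned texts.

# Build one vocabulary count covering summaries and texts
word_counts = {}
count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
print("Size of Vocabulary:", len(word_counts))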