Machine Learning Mastery Deep Learning for NLP Tutorials (Part 6)

Author: 绝不原创的飞龙

Published 2024-09-17 00:17:19


License: CC BY-NC-SA 4.0 / Proudly powered by Google Translate

Article link: https://blog.csdn.net/wizardforcel/article/details/142309307


Original: Machine Learning Mastery

License: CC BY-NC-SA 4.0

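A Gentle Introduction to Neural Machine Translation

Original: machinelearningmastery.com/introduction-neural-machine-translation/

One of the earliest goals for computers was the automatic translation of text from one language to another.

Given the fluidity of human language, automatic or machine translation is perhaps one of the most challenging artificial intelligence tasks. Classically, rule-based systems were used for this task; they were replaced by statistical methods in the 1990s. More recently, deep neural network models have achieved state-of-the-art results in a field aptly named neural machine translation.

In this post, you will discover the challenge of machine translation and the effectiveness of neural machine translation models.

After reading this post, you will know:

Machine translation is challenging given the inherent ambiguity and flexibility of human language.
Statistical machine translation replaces classical rule-based systems with models that learn to translate from examples.
Neural machine translation models fit a single model rather than a pipeline of fine-tuned models and currently achieve state-of-the-art results.

Let's get started.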


A Gentle Introduction to Neural Machine Translation
Photo by Fabio Achilli, some rights reserved.

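What is Machine Translation?

Machine translation is the task of automatically converting source text in one language to text in another language.

In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.

Page 98, Deep Learning, 2016.

Given a sequence of text in a source language, there is no single best translation of that text into another language. This is because of the natural ambiguity and flexibility of human language. This makes automatic machine translation difficult, perhaps one of the most difficult problems in artificial intelligence:

The fact is that accurate translation requires background knowledge in order to resolve ambiguity and establish the content of the sentence.

Page 21, Artificial Intelligence, A Modern Approach, 3rd Edition, 2009.

Classical machine translation methods often involve rules for converting text in the source language to the target language. The rules are often developed by linguists and may operate at the lexical, syntactic, or semantic level. This focus on rules gives the name to this area of study: Rule-based Machine Translation, or RBMT.

RBMT is characterized by the explicit use and manual creation of linguistically informed rules and representations.

Page 133, Handbook of Natural Language Processing and Machine Translation, 2011.

The key limitations of the classical machine translation approaches are both the expertise required to develop the rules and the vast number of rules and exceptions required.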

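What is Statistical Machine Translation?

Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text from a source language to a target language given a large corpus of examples.

This task of using a statistical model can be stated formally as follows:

Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize Pr(S|T).

A Statistical Approach to Machine Translation, 1990.

This formal specification makes explicit the goal of maximizing the probability of the output sequence given the input sequence of text. It also makes explicit the notion of a suite of candidate translations and the need for a search process, or decoder, to select the single most likely translation from the model's output probability distribution.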
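The decoder idea above can be sketched in a few lines. This is only a toy illustration, not the method of any particular SMT system: the candidate sentences and every probability below are invented, and the split of Pr(S|T) into a language-model term Pr(S) and a translation-model term Pr(T|S) is the usual Bayes' rule rewrite rather than something stated in the quote itself.

```python
# Toy "decoder": choose the candidate S that maximizes Pr(S) * Pr(T | S).
# Every number here is made up purely for illustration.
candidates = {
    "the cat sleeps": {"lm": 0.020, "tm": 0.30},
    "cat the sleeps": {"lm": 0.001, "tm": 0.35},
    "the cat is sleeping": {"lm": 0.015, "tm": 0.25},
}

def score(probs):
    # By Bayes' rule, Pr(S | T) is proportional to Pr(S) * Pr(T | S);
    # Pr(T) is the same for every candidate, so it can be ignored.
    return probs["lm"] * probs["tm"]

# exhaustive search over the (tiny) candidate set
best = max(candidates, key=lambda s: score(candidates[s]))
print(best)
```

A real decoder cannot enumerate every candidate sentence, which is why the search process is treated as a component in its own right.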

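Given a text in the source language, what is the most probable translation in the target language? […] How should one construct a statistical model that assigns high probabilities to "good" translations and low probabilities to "bad" translations?

Page xiii, Syntax-based Statistical Machine Translation, 2017.

The approach is data-driven, requiring only a corpus of examples with both source and target language text. This means linguists are no longer required to specify the rules of translation.

This approach does not need a complex ontology of interlingua concepts, nor does it need handcrafted grammars of the source and target languages, nor a hand-labeled treebank. All it needs is data: sample translations from which a translation model can be learned.

Page 909, Artificial Intelligence, A Modern Approach, 3rd Edition, 2009.

The statistical approach to machine translation quickly outperformed the classical rule-based methods and became the de facto standard set of techniques.

Since the inception of the field at the end of the 1980s, the most popular models for statistical machine translation have been sequence-based. In these models, the basic units of translation are words or sequences of words […] These kinds of models are simple and effective, and they work well for many language pairs

Syntax-based Statistical Machine Translation, 2017.

The most widely used techniques were phrase-based and focused on translating sub-sequences of the source text piecewise.

Statistical Machine Translation (SMT) has been the dominant translation paradigm for decades. Practical implementations of SMT are generally phrase-based systems (PBMT) which translate sequences of words or phrases where the lengths may differ

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.

Although effective, statistical machine translation methods suffered from a narrow focus on the phrases being translated, losing the broader nature of the target text. The hard focus on data-driven approaches also meant that methods may have ignored important syntactic distinctions known to linguists. Finally, the statistical approaches required careful tuning of each module in the translation pipeline.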

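What is Neural Machine Translation?

Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model for machine translation.

The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

Unlike the traditional phrase-based translation system which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.

Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

As such, neural machine translation systems are said to be end-to-end systems, as only one model is required for the translation.

The strength of NMT lies in its ability to learn directly, in an end-to-end fashion, the mapping from input text to associated output text.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.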

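Encoder-Decoder Model

Multilayer Perceptron neural network models can be used for machine translation, although the models are limited by a fixed-length input sequence where the output must be the same length.

These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allows for variable-length input and output sequences.

An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder-decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

Key to the encoder-decoder architecture is the ability of the model to encode the source text into an internal fixed-length representation called the context vector. Interestingly, once encoded, different decoding systems could be used, in principle, to translate the context into different languages.

… one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the "context" C. […] A second model, usually an RNN, then reads the context C and generates a sentence in the target language.

Page 461, Deep Learning, 2016.

For more on the encoder-decoder recurrent neural network architecture, see the post:

Encoder-Decoder Long Short-Term Memory Networks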

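Encoder-Decoders with Attention

Although effective, the encoder-decoder architecture has problems with long sequences of text to be translated.

The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence.

The solution is the use of an attention mechanism that allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.

Using a fixed-sized representation to capture all the semantic details of a very long sentence […] is very difficult. […] A more efficient approach, however, is to read the whole sentence or paragraph […], then to produce the translated words one at a time, each time focusing on a different part of the input sentence to gather the semantic details that are required to produce the next output word.

Page 462, Deep Learning, 2016.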
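To make "focusing on a different part of the input sentence" concrete, here is a minimal sketch of one attention step in plain Python. It is an illustration under simplifying assumptions, not the mechanism of any cited system: the scoring function is a plain dot product (the attention papers typically learn a small scoring network instead), and the vectors and the function names softmax() and attend() are invented for the example.

```python
import math

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # score each encoder state against the current decoder state (dot product)
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # normalize the scores into attention weights that sum to 1
    weights = softmax(scores)
    # the context vector is the attention-weighted average of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# three hypothetical encoder states (one per source word) and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

Because a fresh context vector is computed for every output word, the decoder is no longer forced to rely on one fixed-length summary of the whole source sentence.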

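The encoder-decoder recurrent neural network architecture with attention is currently the state of the art on some benchmark problems for machine translation. This architecture is used at the heart of the Google Neural Machine Translation system, or GNMT, used in the Google Translate service (https://translate.google.com).

… current state-of-the-art machine-translation systems are powered by models that employ attention.

Page 209, Neural Network Methods in Natural Language Processing, 2017.

For more on attention, see the post:

Attention in Long Short-Term Memory Recurrent Neural Networks

Although effective, neural machine translation systems still suffer from some issues, such as scaling to larger vocabularies of words and the slow speed of training the models. These are the current areas of focus for large production neural translation systems, such as the Google system.

Three inherent weaknesses of Neural Machine Translation […]: its slower training and inference speed, ineffectiveness in dealing with rare words, and sometimes failure to translate all words in the source sentence.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.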

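Summary

In this post, you discovered the challenge of machine translation and the effectiveness of neural machine translation models.

Specifically, you learned:

Machine translation is challenging given the inherent ambiguity and flexibility of human language.
Statistical machine translation replaces classical rule-based systems with models that learn to translate from examples.
Neural machine translation models fit a single model rather than a pipeline of fine-tuned models and currently achieve state-of-the-art results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.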

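What Is Natural Language Processing?

Original: machinelearningmastery.com/natural-language-processing/

Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.

In this post, you will discover what natural language processing is and why it is so important.

After reading this post, you will know:

What natural language is and how it is different from other types of data.
What makes working with natural language so challenging.
Where the field of NLP came from and how it is defined by modern practitioners.

Let's get started.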


What Is Natural Language Processing?
Photo by pedrik, some rights reserved.

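Natural Language

Natural language refers to the way we humans communicate with each other.

Namely, speech and text.

We are surrounded by text.

Think about how much text you see each day:

Signs
Menus
Email
SMS
Web pages
and so much more…

The list is endless.

Now think about speech.

As a species, we may speak to each other more than we write. It may even be easier to learn to speak than to write.

Voice and text are how we communicate with each other.

Given the importance of this type of data, we must have methods to understand and reason about natural language, just as we do for other types of data.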

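Challenge of Natural Language

Working with natural language data is not a solved problem.

It has been studied for half a century, and it is really hard.

It is hard from the standpoint of the child, who must spend many years acquiring a language … it is hard for the adult language learner, it is hard for the scientist who attempts to model the relevant phenomena, and it is hard for the engineer who attempts to build systems that deal with natural language input or output. These tasks are so hard that Turing could rightly make fluent conversation in natural language the centerpiece of his test for intelligence.

Page 248, Mathematical Linguistics, 2010.

Natural language is hard primarily because it is messy. There are few rules.

And yet we can easily understand each other most of the time.

Human language is highly ambiguous … It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language.

Page 1, Neural Network Methods in Natural Language Processing, 2017.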

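From Linguistics to Natural Language Processing

Linguistics

Linguistics is the scientific study of language, including its grammar, semantics, and phonetics.

Classical linguistics involved devising and evaluating rules of language. Great progress was made on formal methods for syntax and semantics, but for the most part, the interesting problems in natural language understanding resist clean mathematical formalisms.

Broadly, a linguist is anyone who studies language, but perhaps more colloquially, a self-defining linguist may be more focused on being out in the field.

Mathematics is the tool of science. Mathematicians working on natural language may refer to their study as mathematical linguistics, focusing exclusively on the use of discrete mathematical formalisms and theory for natural language (e.g. formal languages and automata theory).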

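Computational Linguistics

Computational linguistics is the modern study of linguistics using the tools of computer science. Yesterday's linguist may be today's computational linguist, as the use of computational tools and thinking has overtaken most fields of study.

Computational linguistics is the study of computer systems for understanding and generating natural language. … One natural function for computational linguistics would be the testing of grammars proposed by theoretical linguists.

Pages 4-5, Computational Linguistics: An Introduction, 1986.

Large data and fast computers mean that new and different things can be discovered from large datasets of text by writing and running software.

In the 1990s, statistical methods and statistical machine learning began to, and eventually did, replace the classical top-down rule-based approaches to language, primarily because of their better results, speed, and robustness. The statistical approach to studying natural language now dominates the field; it may define the field.

Data-driven methods for natural language processing have now become so popular that they must be considered mainstream approaches to computational linguistics. … A strong contributing factor to this development is undoubtedly the increased amount of available electronically stored data to which these methods can be applied; another factor might be a certain disenchantment with approaches relying exclusively on hand-crafted rules, due to their observed brittleness.

Page 358, The Oxford Handbook of Computational Linguistics, 2005.

The statistical approach to natural language is not limited to statistics per se; it also includes the advanced inference methods used in applied machine learning.

… understanding natural language requires large amounts of knowledge about morphology, syntax, semantics, and pragmatics as well as general knowledge about the world. Acquiring and encoding all of this knowledge is one of the fundamental impediments to developing effective and robust language systems. Like the statistical methods … machine learning methods offer the promise of automating the acquisition of this knowledge from annotated or unannotated language corpora.

Page 377, The Oxford Handbook of Computational Linguistics, 2005.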

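Statistical Natural Language Processing

Computational linguistics also became known by the name natural language processing, or NLP, to reflect the more engineering-based or empirical approach of the statistical methods.

The statistical dominance of the field also often leads to NLP being described as Statistical Natural Language Processing, perhaps to distance it from the classical computational linguistics methods.

I view computational linguistics as having both a scientific and an engineering side. The engineering side of computational linguistics, often called natural language processing (NLP), is largely concerned with building computational tools that do useful things with language, e.g., machine translation, summarization, question-answering, etc. Like any engineering discipline, natural language processing draws on a variety of different scientific disciplines.

How the Statistical Revolution Changes (Computational) Linguistics, 2009.

Linguistics is a large topic of study, and although the statistical approach to NLP has shown great success in some areas, there is still room for, and great benefit from, the classical top-down methods.

Roughly speaking, statistical NLP associates probabilities with the alternatives encountered in the course of analyzing an utterance or a text and accepts the most probable outcome as the correct one. … Not surprisingly, words that name phenomena that are closely related in the world, or our perception of it, frequently occur close to one another so that crisp facts about the world are reflected in somewhat fuzzier facts about texts. There is much room for debate in this view.

Page xix, The Oxford Handbook of Computational Linguistics, 2005.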

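Natural Language Processing

As machine learning practitioners interested in working with text data, we are concerned with the tools and methods from the field of Natural Language Processing.

We have seen the path from linguistics to NLP in the previous section. Now, let's take a look at how modern researchers and practitioners define what NLP is all about.

In perhaps one of the more widely used textbooks, written by top researchers in the field, the authors refer to the subject as "linguistic science", permitting discussion of both classical linguistics and modern statistical methods.

The aim of a linguistic science is to be able to characterize and explain the multitude of linguistic observations circling around us, in conversations, writing, and other media. Part of that has to do with the cognitive side of how humans acquire, produce, and understand language, part of it has to do with understanding the relationship between linguistic utterances and the world, and part of it has to do with understanding the linguistic structures by which language communicates.

Page 3, Foundations of Statistical Natural Language Processing, 1999.

They go on to focus on inference through the use of statistical methods in natural language processing.

Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.

Page 191, Foundations of Statistical Natural Language Processing, 1999.

In their text on applied natural language processing, the authors and contributors to the popular NLTK Python library for NLP describe the field broadly as using computers to work with natural language data.

We will take Natural Language Processing, or NLP for short, in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.

Page ix, Natural Language Processing with Python, 2009.

Statistical NLP has turned another corner and is now strongly focused on the use of deep learning neural networks to both perform inference on specific tasks and develop robust end-to-end systems.

In one of the first textbooks dedicated to this emerging topic, Yoav Goldberg succinctly defines NLP as automatic methods that take natural language as input or produce natural language as output.

Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text as input, and algorithms that produce natural looking text as outputs.

Page xvii, Neural Network Methods in Natural Language Processing, 2017.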

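Summary

In this post, you discovered what natural language processing is and why it is so important.

Specifically, you learned:

What natural language is and how it is different from other types of data.
What makes working with natural language so challenging.
Where the field of NLP came from and how it is defined by modern practitioners.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.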

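Oxford Course on Deep Learning for Natural Language Processing

Original: machinelearningmastery.com/oxford-course-deep-learning-natural-language-processing/

Deep learning methods achieve state-of-the-art results on a suite of natural language processing problems.

What makes this exciting is that single models are trained end-to-end, replacing a suite of specialized statistical models.

The University of Oxford in the UK teaches a course on Deep Learning for Natural Language Processing, and much of the material for this course is available online for free.

In this post, you will discover the Oxford course on Deep Learning for Natural Language Processing.

After reading this post, you will know:

What the course covers and its prerequisites.
A breakdown of the lectures and how to access the slides and videos.
A breakdown of the course projects and where to access the materials.

Let's get started.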


Oxford Course on Deep Learning for Natural Language Processing
Photo by Martijn van Sabben, some rights reserved.

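Overview

This post is divided into 4 parts; they are:

Course Summary
Prerequisites
Lecture Breakdown
Projects

Course Summary

The course is titled "Deep Learning for Natural Language Processing" and is taught at the University of Oxford (UK). It was last taught in early 2017.

What is great about this course is that it is run and taught by people from DeepMind. Notably, the lecturer is Phil Blunsom.

The focus of the course is on statistical methods for natural language processing, specifically neural networks that achieve state-of-the-art results on NLP problems.

From the course:

This will be an applied course focusing on recent advances in analysing and generating speech and text using recurrent neural networks. We introduce the mathematical definitions of the relevant machine learning models and derive their associated optimisation algorithms.

Prerequisites

The course is designed for undergraduate and graduate students.

The course assumes some background in the following subjects:

Probability.
Linear Algebra.
Continuous Mathematics.
Basic Machine Learning.

If you are a practitioner interested in deep learning for NLP, you may have different goals and requirements from the material.

For example, you may want to focus on the methods and applications rather than the foundational theory.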

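Lecture Breakdown

The course comprises 13 lectures, although the first and second lectures are each split into two parts.

The complete lecture breakdown is provided below.

The GitHub repository for the course provides links to the slides, Flash videos, and reading for each lecture.

I would recommend watching the videos via this unofficial YouTube playlist.

Below is the course overview slide from the first lecture.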


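Lecture Overview for the Oxford Course on Deep Learning for Natural Language Processing

Note that there are many guest lecturers covering a range of topics, most of whom are from DeepMind.

Lecture 1a - Introduction [Phil Blunsom]
Lecture 1b - Deep Neural Networks [Wang Ling]
Lecture 2a - Word Level Semantics [Ed Grefenstette]
Lecture 2b - Overview of the Practicals [Chris Dyer]
Lecture 3 - Language Modelling and RNNs Part 1 [Phil Blunsom]
Lecture 4 - Language Modelling and RNNs Part 2 [Phil Blunsom]
Lecture 5 - Text Classification [Karl Moritz Hermann]
Lecture 6 - Deep NLP on Nvidia GPUs [Jeremy Appleyard]
Lecture 7 - Conditional Language Models [Chris Dyer]
Lecture 8 - Generating Language with Attention [Chris Dyer]
Lecture 9 - Speech Recognition (ASR) [Andrew Senior]
Lecture 10 - Text to Speech (TTS) [Andrew Senior]
Lecture 11 - Question Answering [Karl Moritz Hermann]
Lecture 12 - Memory [Ed Grefenstette]
Lecture 13 - Linguistic Knowledge in Neural Networks

Have you watched these lectures? What did you think?
Let me know in the comments below.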

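Projects

The course includes 4 practical projects that you may wish to attempt in order to confirm your knowledge of the material.

Each project has its own GitHub project containing a description and the relevant starter material.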

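Summary

In this post, you discovered the Oxford course on Deep Learning for Natural Language Processing.

Specifically, you learned:

What the course covers and its prerequisites.
A breakdown of the lectures and how to access the slides and videos.
A breakdown of the course projects and where to access the materials.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.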

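How to Prepare a French-to-English Dataset for Machine Translation

Original: machinelearningmastery.com/prepare-french-english-dataset-machine-translation/

Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.

Neural machine translation systems such as encoder-decoder recurrent neural networks are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on source and target language.

Standard datasets are required to develop, explore, and familiarize yourself with how to develop neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Let's get started.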


How to Prepare a French-to-English Dataset for Machine Translation
Photo by Giuseppe Milo, some rights reserved.

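Tutorial Overview

This tutorial is divided into 5 parts; they are:

Europarl Machine Translation Dataset
Download French-English Dataset
Load Dataset
Clean Dataset
Reduce Vocabulary

Python Environment

This tutorial assumes you have a Python SciPy environment installed with Python 3.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda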

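Europarl Machine Translation Dataset

Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.

It comprises the proceedings of the European Parliament, hence the contraction Europarl as the name of the dataset.

The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.

It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union

Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was led by Philipp Koehn, author of the book "Statistical Machine Translation".

The dataset was made available for free to researchers on the website "European Parliament Proceedings Parallel Corpus 1996-2011", and often appears as part of machine translation challenges, such as the Machine Translation task in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprising data from 1996 to 2011.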

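Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

Sentences: 2,007,723
French words: 51,388,643
English words: 50,196,035

You can download the dataset from here:

Parallel corpus French-English (194 Megabytes)

Once downloaded, you should have the file "fr-en.tgz" in your current working directory.

You can unpack this archive file using the tar command, as follows: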

tar zxvf fr-en.tgz

You will now have two files, as follows:

English: europarl-v7.fr-en.en (288M)
French: europarl-v7.fr-en.fr (331M)

Below is a sample of the English file.

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

Below is a sample of the French file.

Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.

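Load Dataset

Let's start off by loading the data files.

We can load each file as a string. Because the files contain Unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8, which easily handles the Unicode characters in both files.

The function below, named load_doc(), will load a given file and return it as a blob of text.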

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

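Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file on newline characters. The function to_sentences() below will split a loaded document.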

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

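Later, when preparing our model, we will need to know the length of the sentences in the dataset. We can write a short function to calculate the lengths, in words, of the shortest and longest sentences.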

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)

We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

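Running the example summarizes the number of lines, or sentences, in each file and the length in words of the shortest and longest lines in each file.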

English data: sentences=2007723, min=0, max=668
French data: sentences=2007723, min=0, max=693

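Importantly, we can see that the number of lines, 2,007,723, matches the expectation. The minimum length of 0 also tells us that each file contains at least one empty line.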
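A reported minimum of 0 words means some lines contain no tokens at all. The tutorial does not remove them, but if you wanted to, one possible approach is sketched below. The helper name drop_empty_pairs() and the sample lines are hypothetical; the key point is that a pair must be dropped from both sides at once so the two files stay aligned line for line.

```python
# drop sentence pairs where either side has no tokens, keeping the two lists aligned
def drop_empty_pairs(src_lines, tgt_lines):
    if len(src_lines) != len(tgt_lines):
        raise ValueError('both sides must have the same number of lines')
    # split() returns an empty list for empty or whitespace-only lines
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if s.split() and t.split()]
    src = [s for s, _ in kept]
    tgt = [t for _, t in kept]
    return src, tgt

# made-up aligned sample: the second and third pairs each have an empty side
english = ['resumption of the session', '', 'madam president']
french = ['reprise de la session', 'ligne sans equivalent', '   ']
en, fr = drop_empty_pairs(english, french)
print(len(en), len(fr))
```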

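Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Looking at some samples of text, some minimal text cleaning may include:

Tokenizing text by white space.
Normalizing case to lowercase.
Removing punctuation from each word.
Removing non-printable characters.
Converting accented French characters to their unaccented Latin equivalents.
Removing words that contain non-alphabetic characters.

These are just some basic operations as a starting point; you may know of or require more elaborate data cleaning operations.

The function clean_lines() below implements these cleaning operations. Some notes:

We use the unicode API to normalize unicode characters, which converts accented French characters to their Latin equivalents.
We use an inverse regex match to retain only those characters in words that are printable.
We use a translation table to translate characters as-is, but exclude all punctuation characters.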

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned
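The Unicode normalization step is the least obvious part of clean_lines(), so here it is in isolation. This is a small sketch using the same normalize() and encode() calls as the function above; the helper name to_ascii() and the sample words are only for illustration.

```python
from unicodedata import normalize

# strip accents: decompose characters, then drop everything that is not ASCII
def to_ascii(text):
    # NFD splits an accented letter into a base letter plus a combining mark
    decomposed = normalize('NFD', text)
    # the combining marks are not ASCII, so 'ignore' silently discards them
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('européen'))
print(to_ascii('déjà'))
# a character with no decomposition is dropped entirely, not replaced
print(to_ascii('vœux'))
```

Note the side effect shown by the last call: a ligature such as "œ" has no base-letter decomposition, so the whole character disappears. That is one way a token like "vux" can end up in the cleaned French text.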

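Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later and in the future.

Reusing the loading and splitting functions developed in the previous sections, the complete example is listed below.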

import string
import re
from pickle import dump
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	# the with block closes the file even if pickling fails
	with open(filename, 'wb') as file:
		dump(sentences, file)
	print('Saved: %s' % filename)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
# spot check
for i in range(10):
	print(sentences[i])

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
# spot check
for i in range(10):
	print(sentences[i])

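After running, the clean sentences are saved in the english.pkl and french.pkl files respectively.

As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.

English: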

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago

French:

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine

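My reading of French is very limited, but at least as far as the English is concerned, further improvements could be made, such as dropping or concatenating the hanging "s" tokens left behind when apostrophes are stripped (for example, "minute s silence").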

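Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task is proportional to the size of the vocabularies, which in turn impacts model training time and the size of the dataset required to make the model viable.

In this section, we will reduce the vocabulary of both the English and French text and mark all out-of-vocabulary (OOV) words with a special token.

We can start by loading the pickled clean lines saved from the previous section. The load_clean_sentences() function below will load and return a list for a given filename.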

# load a clean dataset
def load_clean_sentences(filename):
	# the with block closes the file once the list is loaded
	with open(filename, 'rb') as file:
		return load(file)

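Next, we can count the occurrences of each word in the dataset. For this, we can use a Counter object, which is a Python dictionary keyed on words that updates a count each time a new occurrence of a word is added.

The function to_vocab() below creates a vocabulary for a given list of sentences.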

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab
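For readers new to Counter, the short sketch below shows the behaviour to_vocab() relies on. The two sample sentences are made up.

```python
from collections import Counter

vocab = Counter()
for line in ['the session is open', 'the session is closed']:
    # update() takes an iterable of tokens and adds one to each token's count
    vocab.update(line.split())

# words that were never seen report a count of 0 instead of raising KeyError
print(vocab['the'], vocab['open'], vocab['missing'])
print(len(vocab))
```

One thing to watch: update() counts the items of whatever iterable it is given, so passing the unsplit string instead of line.split() would count individual characters rather than words.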

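We can then process the created vocabulary and remove all words from the Counter that occur fewer times than a specific threshold.

The trim_vocab() function below does this, accepting a minimum occurrence count as a parameter and returning the trimmed vocabulary as a set of words.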

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)

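Finally, we can update the sentences, removing all words that are not in the trimmed vocabulary and marking their removal with a special token, in this case the string "unk".

The update_dataset() function below does this and returns a list of updated lines that can then be saved to a new file.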

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines
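To see the three steps working together before running them on two million lines, here is a compact sketch on a made-up three-sentence corpus. It follows the same logic as to_vocab(), trim_vocab(), and update_dataset(), with a threshold of 2 chosen only so that the effect is visible on such a small sample.

```python
from collections import Counter

lines = ['the cat sat', 'the cat ran', 'the dog sat']

# step 1: count every token in the corpus
counts = Counter(token for line in lines for token in line.split())

# step 2: keep only words seen at least twice
keep = {word for word, count in counts.items() if count >= 2}

# step 3: replace every out-of-vocabulary word with the marker 'unk'
updated = [' '.join(tok if tok in keep else 'unk' for tok in line.split())
           for line in lines]

print(sorted(keep))
print(updated)
```

Here "ran" and "dog" each occur once, so both become "unk". One caveat of this scheme is that a genuine word spelled "unk" in the corpus would be indistinguishable from the marker.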

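We can tie all of this together to reduce the vocabulary for both the English and French datasets and save the results to new data files.

We will use a minimum occurrence count of 5, but you are free to explore other minimum counts suitable for your application.

The complete code example is listed below.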

from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
	# the with block closes the file once the list is loaded
	with open(filename, 'rb') as file:
		return load(file)

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	# the with block closes the file even if pickling fails
	with open(filename, 'wb') as file:
		dump(sentences, file)
	print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

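First, the size of the English vocabulary is reported, followed by the updated size. The updated dataset is saved to the file "english_vocab.pkl", and a spot check of some updated examples, with out-of-vocabulary words replaced by "unk", is printed.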

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl

We can see that the vocabulary shrank by more than half, from 105,357 words to 41,746 words.

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago

The same procedure is then performed on the French dataset, saving the result to the file "french_vocab.pkl".

French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl

We see a similar reduction in the size of the French vocabulary.

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine

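Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data for modeling.

Specifically, you learned:

The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.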

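How to Prepare Movie Review Data for Sentiment Analysis

Original: machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/

Text data preparation is different for each problem.

Preparation starts with simple steps, like loading data, but quickly gets difficult with cleaning tasks that are very specific to the data you are working with. You need help knowing where to begin and in what order to work through the steps from raw data to data ready for modeling.

In this tutorial, you will discover how to prepare movie review text data for sentiment analysis, step by step.

After completing this tutorial, you will know:

How to load text data and clean it to remove punctuation and other non-words.
How to develop a vocabulary, tailor it, and save it to file.
How to prepare movie reviews using cleaning and a pre-defined vocabulary and save them to new files ready for modeling.

Let's get started.

Update Oct/2017: Fixed a small bug when skipping non-matching files, thanks Jan Zett.
Update Dec/2017: Fixed a small typo in the full example, thanks Ray and Zain.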


How to Prepare Movie Review Data for Sentiment Analysis
Photo by Kenneth Lu, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

Movie Review Dataset
Load Text Data
Clean Text Data
Develop Vocabulary
Save Prepared Data

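1. Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as "v2.0".

The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the "polarity dataset".

Our data contains 1000 positive and 1000 negative reviews all written before 2000, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat, for example:

The dataset comprises only English reviews.
All text has been converted to lowercase.
There is white space around punctuation like periods, commas, and brackets.
Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of the high 70s to the low 80s in percent accuracy (e.g. 78% to 82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of the mid-80s if we were looking to use this dataset in experiments with modern methods.

… depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB)

After unpacking the file, you will have a directory called "txt_sentoken" with two sub-directories, "neg" and "pos", containing the text of the negative and positive reviews. Reviews are stored one per file, with a naming convention from cv000 to cv999 for each of neg and pos.

Next, let's look at loading the text data.

2. Load Text Data

In this section, we will look at loading individual text files, then processing the directories of files.

We will assume that the review data is downloaded and available in the current working directory in the folder "txt_sentoken".

We can load an individual text file by opening it, reading in the ASCII text, and closing the file. This is standard file handling. For example, we can load the first negative review file "cv000_29416.txt" as follows: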

# load one file
filename = 'txt_sentoken/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()

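This loads the ASCII document as text and preserves any white space, like new lines.

We can turn this into a function called load_doc() that takes the filename of the document to load and returns the text.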

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

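We have two directories, each with 1,000 documents. We can process each directory in turn by first getting a list of the files in the directory using the listdir() function, then loading each file in turn.

For example, we can load each document in the negative directory, using the load_doc() function to do the actual loading.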

from os import listdir

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# specify directory to load
directory = 'txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
	# skip files that do not have the right extension
	if not filename.endswith(".txt"):
		continue
	# create the full path of the file to open
	path = directory + '/' + filename
	# load document
	doc = load_doc(path)
	print('Loaded %s' % filename)

Running this example prints the filename of each review after it is loaded.

...
Loaded cv995_23113.txt
Loaded cv996_12447.txt
Loaded cv997_5152.txt
Loaded cv998_15691.txt
Loaded cv999_14636.txt

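We can turn the processing of the documents into a function as well and use it as a template later for developing a function to clean all documents in a folder. For example, below we define a process_docs() function to do the same thing.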

from os import listdir

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load all docs in a directory
def process_docs(directory):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load document
		doc = load_doc(path)
		print('Loaded %s' % filename)

# specify directory to load
directory = 'txt_sentoken/neg'
process_docs(directory)
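A variation on the loading loop is sketched below for readers who want a deterministic file order and automatic file closing. It is not the tutorial's code: the function name load_docs(), the explicit utf-8 encoding, and the temporary directory with three tiny files are all assumptions made so the example can run on its own. The reason for sorted() is that listdir() returns names in arbitrary order.

```python
import os
import tempfile

# load every .txt file in a directory into a dict of filename -> text
def load_docs(directory):
    docs = {}
    # sort the names so files are always processed in the same order
    for filename in sorted(os.listdir(directory)):
        # skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(directory, filename)
        # the with block closes the file even if reading raises an error
        with open(path, 'r', encoding='utf-8') as handle:
            docs[filename] = handle.read()
    return docs

# build a throwaway directory with made-up files to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [('b.txt', 'second'), ('a.txt', 'first'), ('notes.md', 'skip me')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as handle:
            handle.write(body)
    docs = load_docs(tmp)

print(list(docs))
```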

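Now that we know how to load the movie review text data, let's look at cleaning it.

3. Clean Text Data

In this section, we will look at what data cleaning we might want to do to the movie review data.

We will assume that we will be using a bag-of-words model or perhaps a word embedding that does not require too much preparation.

Split into Tokens

First, let's load one document and look at the raw tokens split by white space. We will use the load_doc() function developed in the previous section. We can use the split() function to split the loaded document into tokens separated by white space.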

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

Running the example gives a long list of raw tokens from the document.

...
'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento', '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']

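Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

Remove punctuation from words (e.g. "what's").
Remove tokens that are just punctuation (e.g. "-").
Remove tokens that contain numbers (e.g. "10/10").
Remove tokens that have one character (e.g. "a").
Remove tokens that don't have much meaning (e.g. "and").

Some ideas:

We can filter out punctuation from tokens using the string translate() function.
We can remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
We can remove English stop words using the list loaded using NLTK.
We can filter out short tokens by checking their length.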
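Three of these ideas can be tried on a handful of tokens without loading a review or installing NLTK. The sketch below uses a few made-up tokens of the kind seen in the raw list above; stop word removal is left out because it needs the NLTK corpus.

```python
import string

# translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

raw = ["where's", '-', '7/10', 'a', 'nightmare']

# 1. strip punctuation from inside each token
stripped = [w.translate(table) for w in raw]
# 2. keep only purely alphabetic tokens (drops empty strings and digits)
alpha = [w for w in stripped if w.isalpha()]
# 3. drop single-character tokens
longer = [w for w in alpha if len(w) > 1]

print(stripped)
print(alpha)
print(longer)
```

The order matters: stripping punctuation first turns "-" into an empty string and "7/10" into "710", and the isalpha() check then removes both.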

Below is an updated version of the example that cleans this review.

from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

Running the example gives a much cleaner looking list of tokens.

...
'explanation', 'craziness', 'came', 'oh', 'way', 'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'wheres', 'joblo', 'coming', 'nightmare', 'elm', 'street', 'blair', 'witch', 'crow', 'crow', 'salvation', 'lost', 'highway', 'memento', 'others', 'stir', 'echoes']

We can put this into a function called clean_doc() and test it on another review, this time a positive one.

from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut.

...
'comic', 'oscar', 'winner', 'martin', 'childs', 'shakespeare', 'love', 'production', 'design', 'turns', 'original', 'prague', 'surroundings', 'one', 'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']

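There are many more cleaning steps we could take; I leave them to your imagination.

Next, let's look at how we can manage a preferred vocabulary of tokens.

4. Develop Vocabulary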

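When working with predictive models of text, such as a bag-of-words model, there is pressure to reduce the size of the vocabulary.

The larger the vocabulary, the sparser the representation of each word or document.

Part of preparing text for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model.

We can do this by loading all of the documents in the dataset and building a set of words. We may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to file for later use, such as filtering words in new documents in the future.

We can keep track of the vocabulary in a Counter, which is a dictionary of words and their counts with some additional convenience functions.

We need to develop a new function to process a document and add it to the vocabulary. The function needs to load a document by calling the previously developed load_doc() function. It needs to clean the loaded document using the previously developed clean_doc() function, then add all of the tokens to the Counter and update the counts. We can do this last step by calling the update() function on the Counter object.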
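To make the role of the Counter concrete, here is a minimal sketch of how update() accumulates counts across documents. The token lists are made up for illustration and stand in for the output of clean_doc().

```python
from collections import Counter

vocab = Counter()
# each call adds the tokens of one (already cleaned) document;
# these token lists are invented for the example
vocab.update(['film', 'good', 'film'])
vocab.update(['film', 'bad'])

# number of distinct words seen so far
print(len(vocab))            # 3
# the most frequent word and its count
print(vocab.most_common(1))  # [('film', 3)]
```

Because update() adds to existing counts rather than replacing them, the same Counter can be passed from document to document.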

Below is a function called add_doc_to_vocab() that takes a document filename and a Counter vocabulary as arguments.

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

Finally, we can take the template above for processing all documents in a directory, called process_docs(), and update it to call add_doc_to_vocab().

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

We can put all of this together and develop a full vocabulary from all documents in the dataset.

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

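Running the example creates a vocabulary from all documents in the dataset, including both positive and negative reviews.

We can see that there are just over 46,000 unique words across all reviews, and that the top 3 words are 'film', 'one', and 'movie'.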

46557
[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]

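Perhaps the least common words, those that appear only once across all reviews, are not predictive? Perhaps some of the most common words are not useful either?

These are good questions and should be tested with a specific predictive model.

Generally, words that appear only once or a few times across 2,000 reviews are probably not predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model.

We can do this by stepping through the words and their counts and keeping only those with a count at or above a chosen threshold. Here we will use 5 occurrences.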

# keep tokens with a minimum occurrence count of 5
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

This reduces the vocabulary from 46,557 to 14,803 words, a huge drop. Perhaps a minimum of 5 occurrences is too aggressive; you can experiment with different values.
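One way to experiment with different values is to tabulate the vocabulary size for several thresholds before committing to one. The sketch below does this on a toy Counter; the helper name vocab_size() and the counts are assumptions made for illustration, and in practice you would pass in the Counter built from the reviews.

```python
from collections import Counter

# toy counts standing in for the real vocabulary Counter
vocab = Counter({'film': 9, 'plot': 5, 'hokum': 2, 'symphony': 1})

# hypothetical helper: number of words kept for a given minimum count
def vocab_size(vocab, min_occurrence):
    return sum(1 for count in vocab.values() if count >= min_occurrence)

sizes = {t: vocab_size(vocab, t) for t in [1, 2, 5, 10]}
for threshold, size in sizes.items():
    print(threshold, size)
```

On the toy data this prints 4, 3, 2, and 0 kept words for thresholds 1, 2, 5, and 10 respectively.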

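We can then save the chosen vocabulary of words to a new file. I like to save the vocabulary as ASCII with one word per line.

Below is a function called save_list() that saves a list of items, in this case tokens, to file, one per line.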

def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

The complete example for defining and saving the vocabulary is listed below.

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens with a minimum occurrence count of 5
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

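Running this final snippet after creating the vocabulary will save the chosen words to file.

It is a good idea to take a look at, and even study, your chosen vocabulary in order to get ideas for better preparing this data, or text data in the future.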

hasnt
updating
figuratively
symphony
civilians
might
fisherman
hokum
witch
buffoons
...

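Next, we can look at using the vocabulary to create a prepared version of the movie review dataset.

5. Save Prepared Data

We can use the data cleaning and the chosen vocabulary to prepare each movie review and save the prepared versions of the reviews ready for modeling.

This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

We can start off by loading the vocabulary from 'vocab.txt'.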

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

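Next, we can clean the reviews, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews in a new file.

One approach could be to save all the positive reviews in one file and all the negative reviews in another file, with the filtered tokens separated by white space and each review on its own line.

First, we can define a function to process a document, clean it, filter it, and return it as a single line that could be saved in a file. Below is the doc_to_line() function, which takes a filename and the vocabulary (as a set) as arguments.

It calls the previously defined load_doc() function to load the document and clean_doc() to tokenize it.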

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

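Next, we can define a new version of process_docs() that steps through all reviews in a folder and converts them to lines by calling doc_to_line() for each document. A list of lines is then returned.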

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

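We can then call process_docs() for the directories of both positive and negative reviews, then call save_list() from the previous section to save each list of processed reviews to a file.

The complete code listing is provided below.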

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs('txt_sentoken/neg', vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
save_list(positive_lines, 'positive.txt')

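Running the example saves two new files, 'negative.txt' and 'positive.txt', that contain the prepared negative and positive reviews respectively.

The data is ready for use in a bag-of-words or even a word embedding model.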

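Extensions

This section lists some extensions that you may wish to explore.

Stemming. We could reduce each word in the documents to its stem using a stemming algorithm like the Porter stemmer.
N-Grams. Instead of working with individual words, we could work with a vocabulary of word pairs, called bigrams. We could also investigate the use of larger groups, such as triplets (trigrams) and more (n-grams).
Encode Words. Instead of saving tokens as-is, we could save the integer encoding of the words, where the index of the word in the vocabulary represents a unique integer for the word. This would make it easier to work with the data when modeling.
Encode Documents. Instead of saving tokens in documents, we could encode the documents using a bag-of-words model, encoding each word as a boolean present/absent flag or using more sophisticated scoring, such as TF-IDF.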
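As a rough sketch of the last two extensions, the snippet below integer-encodes one prepared review line and also builds a binary present/absent bag-of-words vector for it. The tiny vocabulary, the review line, and the choice to start indices at 1 (leaving 0 free, for example for padding) are assumptions made for illustration, not part of the tutorial.

```python
# hypothetical tiny vocabulary and one prepared review line
vocab = ['bad', 'film', 'good', 'plot']
line = 'good film good plot'

# map each vocabulary word to a unique integer, starting at 1
ordered = sorted(vocab)
word_to_int = {word: i + 1 for i, word in enumerate(ordered)}

# integer encoding: one integer per token, in document order
encoded = [word_to_int[w] for w in line.split() if w in word_to_int]

# binary bag-of-words: 1 if the vocabulary word occurs in the review, else 0
present = set(line.split())
bow = [1 if word in present else 0 for word in ordered]

print(encoded)  # [3, 2, 3, 4]
print(bow)      # [0, 1, 1, 1]
```

The integer encoding keeps word order and repetition, while the bag-of-words vector has a fixed length equal to the vocabulary size and discards order.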

If you try any of these extensions, I would love to know.
Share your results in the comments below.

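Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Dataset

APIs

Summary

In this tutorial, you discovered step-by-step how to prepare movie review text data for sentiment analysis.

Specifically, you learned:

How to load text data and clean it to remove punctuation and other non-words.
How to develop a vocabulary, tailor it, and save it to file.
How to prepare movie reviews using cleaning and a predefined vocabulary, and save them to new files ready for modeling.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.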

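How to Prepare News Articles for Text Summarization

Original: machinelearningmastery.com/prepare-news-articles-text-summarization/

Text summarization is the task of creating a short, accurate, and fluent summary of an article.

The CNN news story dataset is a popular, freely available dataset for text summarization experiments with deep learning methods.

In this tutorial, you will discover how to prepare the CNN news dataset for text summarization.

After completing this tutorial, you will know:

About the CNN news dataset and how to download the story data to your workstation.
How to load the dataset and split each article into story text and highlights.
How to clean the dataset ready for modeling and save the cleaned data to file for later use.

Let's get started.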

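How to Prepare News Articles for Text Summarization
Photo by DieselDemon, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

CNN News Story Dataset
Inspect the Dataset
Load Data
Data Cleaning
Save Clean Data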

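CNN News Story Dataset

The DeepMind Q&A Dataset is a large collection of news articles from CNN and the Daily Mail with associated questions.

The dataset was developed as a question answering task for deep learning and was presented in the 2015 paper "Teaching Machines to Read and Comprehend".

The dataset has since been used in text summarization, where sentences from the news articles are summarized. Notable examples are the papers:

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016.
Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Kyunghyun Cho, an academic at New York University, has made the dataset available for download:

DeepMind Q&A Dataset

In this tutorial, we will work with the CNN dataset, specifically the download of the ASCII text of the news stories available here:

cnn_stories.tgz (151 megabytes)

This dataset contains roughly 93,000 news articles, where each article is stored in a single ".story" file.

Download this dataset to your workstation and unzip it. Once downloaded, you can unzip the archive on your command line as follows: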

tar xvf cnn_stories.tgz

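This will create a cnn/stories/ directory filled with .story files.

For example, we can count the number of story files on the command line, from inside that directory, as follows: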

ls -ltr | wc -l

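This reports 92,580 lines. Note that `ls -l` also prints a leading "total" line, so the figure corresponds to the 92,579 story files that are loaded later in this tutorial.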
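The same count can be taken from Python, which avoids counting anything other than the story files themselves. The sketch below is illustrative only: count_stories() is a helper name introduced here, and it is demonstrated on a throwaway temporary directory so that it runs without the dataset. Point it at 'cnn/stories' to count the real files.

```python
import os
import tempfile

# hypothetical helper: count files whose names end in '.story'
def count_stories(directory):
    return sum(1 for name in os.listdir(directory) if name.endswith('.story'))

# demonstrate on a temporary directory with two stories and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ['a.story', 'b.story', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()
    n_stories = count_stories(tmp)

print(n_stories)  # 2
```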

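Inspect the Dataset

Using a text editor, review some of the stories and note down some ideas for preparing this data.

For example, below is an example of a story, with the body truncated for brevity.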

(CNN) -- If you travel by plane and arriving on time makes a difference, try to book on Hawaiian Airlines. In 2012, passengers got where they needed to go without delay on the carrier more than nine times out of 10, according to a study released on Monday.

In fact, Hawaiian got even better from 2011, when it had a 92.8% on-time performance. Last year, it improved to 93.4%.

[...]

@highlight

Hawaiian Airlines again lands at No. 1 in on-time performance

@highlight

The Airline Quality Rankings Report looks at the 14 largest U.S. airlines

@highlight

ExpressJet and American Airlines had the worst on-time performance

@highlight

Virgin America had the best baggage handling; Southwest had lowest complaint rate

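I note that the general structure of the dataset is to have the story text followed by a number of "highlight" points.

Reviewing articles on the CNN website, I can see that this pattern is still common.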

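Example of a CNN news article with highlights, from cnn.com

The ASCII text does not include the article titles, but we can use these human-written "highlights" as multiple reference summaries for each news article.

I can also see that many articles start with source information, presumably the CNN office that produced the story; for example: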

(CNN) --
Gaza City (CNN) --
Los Angeles (CNN) --

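These can be removed completely.

Data cleaning is a challenging problem and must be tailored to the specific application of the system.

If we are generally interested in developing a news article summarization system, then we may clean the text in order to simplify the learning problem by reducing the size of the vocabulary.

Some data cleaning ideas for this data include:

Normalize case to lowercase (e.g. "An Italian").
Remove punctuation (e.g. "on-time").

We could also further reduce the vocabulary to speed up testing models, such as:

Remove numbers (e.g. "93.4%").
Remove low-frequency words like names (e.g. "Tom Watkins").
Truncate stories to the first 5 or 10 sentences.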

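Load Data

The first step is to load the data.

We can start by writing a function to load a single document given a filename. The data has some unicode characters, so we will load the dataset by forcing the encoding to be UTF-8.

The function below, named load_doc(), loads a single document as text given a filename.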

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

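Next, we need to step over each filename in the stories directory and load it.

We can use the listdir() function to list all filenames in the directory, then load each one in turn. The function below, named load_stories(), implements this behavior and provides a starting point for preparing the loaded documents.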

# load all stories in a directory
def load_stories(directory):
	for name in listdir(directory):
		filename = directory + '/' + name
		# load document
		doc = load_doc(filename)

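Each document can be separated into the news story text and the highlights, or summary text.

The split between the two is the first occurrence of the '@highlight' token. Once split, we can organize the highlights into a list.

The function below, named split_story(), implements this behavior and splits a given loaded document text into a story and a list of highlights.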

# split a document into news story and highlights
def split_story(doc):
	# find first highlight
	index = doc.find('@highlight')
	# split into story and highlights
	story, highlights = doc[:index], doc[index:].split('@highlight')
	# strip extra white space around each highlight
	highlights = [h.strip() for h in highlights if len(h) > 0]
	return story, highlights
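To see what split_story() returns, the sketch below repeats the function and runs it on a shortened, made-up document in the .story layout. The sample text is an assumption for illustration only.

```python
# split a document into news story and highlights (same logic as above)
def split_story(doc):
    index = doc.find('@highlight')
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # the split yields an empty first element, which is filtered out here
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# shortened, invented document in the .story layout
doc = "(CNN) -- Story text here.\n\n@highlight\n\nFirst point\n\n@highlight\n\nSecond point"
story, highlights = split_story(doc)
print(repr(story))   # '(CNN) -- Story text here.\n\n'
print(highlights)    # ['First point', 'Second point']
```

One caveat worth keeping in mind: if a document contained no '@highlight' marker at all, find() would return -1 and the slices would no longer separate story from highlights correctly. This sketch, like the function above, does not handle that case.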

We can now update the load_stories() function to call split_story() for each loaded document and then store the results in a list.

# load all stories in a directory
def load_stories(directory):
	all_stories = list()
	for name in listdir(directory):
		filename = directory + '/' + name
		# load document
		doc = load_doc(filename)
		# split into story and highlights
		story, highlights = split_story(doc)
		# store
		all_stories.append({'story':story, 'highlights':highlights})
	return all_stories

Tying all of this together, the complete example of loading the entire dataset is listed below.

from os import listdir

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a document into news story and highlights
def split_story(doc):
	# find first highlight
	index = doc.find('@highlight')
	# split into story and highlights
	story, highlights = doc[:index], doc[index:].split('@highlight')
	# strip extra white space around each highlight
	highlights = [h.strip() for h in highlights if len(h) > 0]
	return story, highlights

# load all stories in a directory
def load_stories(directory):
	stories = list()
	for name in listdir(directory):
		filename = directory + '/' + name
		# load document
		doc = load_doc(filename)
		# split into story and highlights
		story, highlights = split_story(doc)
		# store
		stories.append({'story':story, 'highlights':highlights})
	return stories

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))

Running the example prints the number of loaded stories.

Loaded Stories 92579

We can now access the loaded story and highlight data, for example:

print(stories[4]['story'])
print(stories[4]['highlights'])

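Data Cleaning

Now that we can load our story data, we can pre-process the text by cleaning it.

We can process the stories line by line and apply the same cleaning operations to each highlight line.

For a given line, we will perform the following operations:

Remove the CNN office information.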

# strip source cnn office if it exists
index = line.find('(CNN) -- ')
if index > -1:
	line = line[index+len('(CNN)'):]

Split the line into tokens on white space:

# tokenize on white space
line = line.split()

Normalize the case to lowercase.

# convert to lower case
line = [word.lower() for word in line]

Remove all punctuation characters from each token (Python 3 specific).

# prepare a translation table to remove punctuation
table = str.maketrans('', '', string.punctuation)
# remove punctuation from each token
line = [w.translate(table) for w in line]

Remove any words that have non-alphabetic characters.

# remove tokens with numbers in them
line = [word for word in line if word.isalpha()]

Putting this all together, below is a new function named clean_lines() that takes a list of lines of text and returns a list of clean lines of text.

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare a translation table to remove punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# strip source cnn office if it exists
		index = line.find('(CNN) -- ')
		if index > -1:
			line = line[index+len('(CNN)'):]
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [w.translate(table) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	# remove empty strings
	cleaned = [c for c in cleaned if len(c) > 0]
	return cleaned

We can call this on a story by first converting it to a list of lines. The function can be called directly on the list of highlights.

example['story'] = clean_lines(example['story'].split('\n'))
example['highlights'] = clean_lines(example['highlights'])
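As a small worked example, the sketch below repeats clean_lines() and applies it to a two-sentence story assembled from the sample article shown earlier. The shortened story string is an assumption made for illustration.

```python
import string

# clean a list of lines (same steps as the function above)
def clean_lines(lines):
    cleaned = list()
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize, lowercase, strip punctuation, keep alphabetic tokens only
        line = [word.lower() for word in line.split()]
        line = [w.translate(table) for w in line]
        line = [word for word in line if word.isalpha()]
        cleaned.append(' '.join(line))
    # remove empty strings
    return [c for c in cleaned if len(c) > 0]

# shortened, invented story text; the blank line between sentences is dropped
story = "(CNN) -- If you travel by plane, try Hawaiian Airlines.\n\nLast year, it improved to 93.4%."
cleaned = clean_lines(story.split('\n'))
print(cleaned)
```

This prints ['if you travel by plane try hawaiian airlines', 'last year it improved to']. The leftover "--" after the office prefix becomes an empty token once punctuation is stripped and is then discarded by the isalpha() check, as is the number "93.4%".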

The complete example of loading and cleaning the dataset is listed below.

from os import listdir
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a document into news story and highlights
def split_story(doc):
	# find first highlight
	index = doc.find('@highlight')
	# split into story and highlights
	story, highlights = doc[:index], doc[index:].split('@highlight')
	# strip extra white space around each highlight
	highlights = [h.strip() for h in highlights if len(h) > 0]
	return story, highlights

# load all stories in a directory
def load_stories(directory):
	stories = list()
	for name in listdir(directory):
		filename = directory + '/' + name
		# load document
		doc = load_doc(filename)
		# split into story and highlights
		story, highlights = split_story(doc)
		# store
		stories.append({'story':story, 'highlights':highlights})
	return stories

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare a translation table to remove punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# strip source cnn office if it exists
		index = line.find('(CNN) -- ')
		if index > -1:
			line = line[index+len('(CNN)'):]
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [w.translate(table) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	# remove empty strings
	cleaned = [c for c in cleaned if len(c) > 0]
	return cleaned

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))

# clean stories
for example in stories:
	example['story'] = clean_lines(example['story'].split('\n'))
	example['highlights'] = clean_lines(example['highlights'])

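Note that the story is now stored as a list of clean lines, nominally separated by sentences.

Save Clean Data

Finally, now that the data has been cleaned, we can save it to file.

An easy way to save the cleaned data is to pickle the list of stories and highlights.

For example: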

# save to file
from pickle import dump
dump(stories, open('cnn_dataset.pkl', 'wb'))

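This will create a new file named cnn_dataset.pkl with all of the cleaned data. The file is about 374 megabytes in size.

We can then load it later and use it with a text summarization model as follows: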

# load from file
from pickle import load
stories = load(open('cnn_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
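The save and load steps can be checked together with a small round trip. The sketch below is an illustration under assumptions: it uses a one-item stand-in for the cleaned stories and a temporary directory rather than the real cnn_dataset.pkl, and it opens the files with `with` blocks so that they are closed promptly.

```python
import os
import tempfile
from pickle import dump, load

# one-item stand-in for the list of cleaned stories (invented content)
stories = [{'story': ['some cleaned line'], 'highlights': ['a highlight']}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cnn_dataset.pkl')
    # save to file
    with open(path, 'wb') as f:
        dump(stories, f)
    # load from file
    with open(path, 'rb') as f:
        restored = load(f)

print(restored == stories)  # True
```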

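Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to prepare the CNN news dataset for text summarization.

Specifically, you learned:

About the CNN news dataset and how to download the story data to your workstation.
How to load the dataset and split each article into story text and highlights.
How to clean the dataset ready for modeling and save the cleaned data to file for later use.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.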

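How to Prepare a Photo Caption Dataset for Training a Deep Learning Model

Original: machinelearningmastery.com/prepare-photo-caption-dataset-training-deep-learning-model/

Automatic photo captioning is a problem where a model must generate a human-readable textual description given a photograph.

It is a challenging problem in artificial intelligence that requires both image understanding from the field of computer vision and language generation from the field of natural language processing.

It is now possible to develop your own image caption models using deep learning and freely available datasets of photos and their descriptions.

In this tutorial, you will discover how to prepare photos and textual descriptions ready for developing a deep learning automatic photo caption generation model.

After completing this tutorial, you will know:

About the Flickr8K dataset, comprised of more than 8,000 photos and up to 5 captions for each photo.
How to generally load and prepare photo and text data for modeling with deep learning.
How to specifically encode data for two different types of deep learning models in Keras.

Let's get started.

Update Nov/2017: Fixed small typos in the code in the "Whole Description Sequence Model" section. Thanks Moustapha Cheikh and Matthew.
Update Feb/2019: Provided direct links for the Flickr8k_Dataset dataset, as the official site was taken down.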

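How to Prepare a Photo Caption Dataset for Training a Deep Learning Model
Photo by beverlyislike, some rights reserved.

Tutorial Overview

This tutorial is divided into 9 parts; they are:

Download the Flickr8K Dataset
How to Load Photographs
Pre-Calculate Photo Features
How to Load Descriptions
Prepare Description Text
Whole Description Sequence Model
Word-By-Word Model
Progressive Loading
Pre-Calculate Photo Features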

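Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed. You can use Python 2, but you may need to change some of the examples.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda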

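Download the Flickr8K Dataset

The Flickr8K dataset is a good dataset to use when getting started with image captioning.

The reason is that it is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The definitive description of the dataset is in the 2013 paper "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics".

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

…

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.

The dataset is available for free. You must complete a request form, and the links to the dataset will be emailed to you. I would love to link to them for you, but the email explicitly requests: "Please do not redistribute the dataset".

You can use the link below to request the dataset:

Dataset Request Form

Within a short time, you will receive an email that contains links to two files:

Flickr8k_Dataset.zip (1 gigabyte): an archive of all photographs.
Flickr8k_text.zip (2.2 megabytes): an archive of all text descriptions for the photographs.

UPDATE (Feb/2019): The official site seems to have been taken down (although the form still works). Here are some direct download links from my datasets GitHub repository:

Download the datasets and unzip them into your current working directory. You will have two directories:

Flicker8k_Dataset: contains 8092 photos in JPEG format.
Flickr8k_text: contains a number of files with different sources of descriptions for the photographs.

Next, let's look at how to load the images.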

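How to Load Photographs

In this section, we will develop some code to load the photos for use with the Keras deep learning library in Python.

The image filenames are unique image identifiers. For example, here is a sample of image filenames: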

990890291_afc72be141.jpg
99171998_7cc800ceef.jpg
99679241_adc853a5c0.jpg
997338199_7343367d7f.jpg
997722733_0cb5439472.jpg

Keras provides the load_img() function, which can be used to load an image file as pixel data.

from keras.preprocessing.image import load_img
image = load_img('990890291_afc72be141.jpg')

The pixel data needs to be converted to a NumPy array for use in Keras.

We can use the img_to_array() Keras function to convert the loaded data.

from keras.preprocessing.image import img_to_array
image = img_to_array(image)

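We may want to use a pre-defined feature extraction model, such as a state-of-the-art deep image classification network trained on ImageNet. The Oxford Visual Geometry Group (VGG) model is popular for this purpose and is available in Keras.

If we decide to use this pre-trained model as a feature extractor in our model, we can preprocess the pixel data for the model by using the preprocess_input() function in Keras, for example: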

from keras.applications.vgg16 import preprocess_input

# reshape data into a single sample of an image
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
# prepare the image for the VGG model
image = preprocess_input(image)

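We may also want to force the loaded photo to have the same pixel dimensions as the VGG model expects, which are 224 x 224 pixels. We can do that in the call to load_img(), for example: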

image = load_img('990890291_afc72be141.jpg', target_size=(224, 224))

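We may want to extract the unique image identifier from the image filename. We can do that by splitting the filename string on the '.' (period) character and retrieving the first element of the resulting list: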

image_id = filename.split('.')[0]
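For the filenames in this dataset, splitting on the period is enough. As a hedged aside, the sketch below compares it with os.path.splitext(), a standard-library alternative that is not used in the tutorial; the two only differ when a name contains more than one period, and the second filename here is invented to show that.

```python
from os.path import splitext

filename = '990890291_afc72be141.jpg'
id_split = filename.split('.')[0]
id_splitext = splitext(filename)[0]
print(id_split)     # 990890291_afc72be141
print(id_splitext)  # 990890291_afc72be141

# invented name with an extra period, to show where the approaches diverge
odd_name = 'photo.v2.jpg'
print(odd_name.split('.')[0])   # photo
print(splitext(odd_name)[0])    # photo.v2
```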

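We can tie all of this together and develop a function that, given the name of the directory containing the photos, loads and pre-processes all of the photos for the VGG model and returns them in a dictionary keyed on their unique image identifiers.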

from os import listdir
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

def load_photos(directory):
	images = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get image id
		image_id = name.split('.')[0]
		images[image_id] = image
	return images

# load images
directory = 'Flicker8k_Dataset'
images = load_photos(directory)
print('Loaded Images: %d' % len(images))

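Running this example prints the number of loaded images. It takes a few minutes to run.

Loaded Images: 8091

If you do not have the RAM to hold all images (about 5GB by my estimation), then you can add an if-statement to break out of the loop early after 100 images have been loaded, for example: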

if (len(images) >= 100):
	break

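Pre-Calculate Photo Features

It is possible to use a pre-trained model to extract the features from photos in the dataset and store the features to file.

This is an efficiency measure: the language part of the model, which turns features extracted from the photo into textual descriptions, can be trained standalone from the feature extraction model. The benefit is that the very large pre-trained model does not need to be loaded, held in memory, and used to process each photo while training the language model.

Later, the feature extraction model and the language model can be put back together to make predictions on new photos.

In this section, we will extend the photo loading behavior developed in the previous section to load all photos, extract their features using a pre-trained VGG model, and store the extracted features to a new file that can be loaded and used to train the language model.

The first step is to load the VGG model. This model is provided directly in Keras and can be loaded as follows. Note that this will download the 500-megabyte model weights to your computer, which may take a few minutes.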

from keras.applications.vgg16 import VGG16
from keras.layers import Input
# load the model
in_layer = Input(shape=(224, 224, 3))
model = VGG16(include_top=False, input_tensor=in_layer, pooling='avg')
print(model.summary())

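This will load the VGG 16-layer model.

By setting include_top=False, the two dense output layers as well as the classification output layer are removed from the model. The output from the final pooling layer is taken as the features extracted from the image.

Next, we can walk over all images in the directory of images as in the previous section and call the predict() function on the model for each prepared image to get the extracted features. The features can then be stored in a dictionary keyed on the image id.

The complete example is listed below.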

from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.layers import Input

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	in_layer = Input(shape=(224, 224, 3))
	model = VGG16(include_top=False, input_tensor=in_layer)
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

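The example may take some time to complete, perhaps one hour.

After all features are extracted, the dictionary is stored in the file 'features.pkl' in the current working directory.

These features can then be loaded later and used as input for training a language model.

You could experiment with other types of pre-trained models in Keras.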

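How to Load Descriptions

It is important to take a moment to talk about the descriptions; there are a number available.

The file Flickr8k.token.txt contains a list of image identifiers (used in the image filenames) and tokenized descriptions. Each image has multiple descriptions.

Below is a sample of the descriptions from the file showing 5 different descriptions for a single image.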

1305564994_00513f9a5b.jpg#0 A man in street racer armor be examine the tire of another racer 's motorbike .
1305564994_00513f9a5b.jpg#1 Two racer drive a white bike down a road .
1305564994_00513f9a5b.jpg#2 Two motorist be ride along on their vehicle that be oddly design and color .
1305564994_00513f9a5b.jpg#3 Two person be in a small race car drive by a green hill .
1305564994_00513f9a5b.jpg#4 Two person in race uniform in a street car .

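The file ExpertAnnotations.txt indicates which of the descriptions for each image were written by "experts" and which were written by crowdsource workers asked to describe the image.

Finally, the file CrowdFlowerAnnotations.txt provides the frequency of crowd workers indicating whether captions suit each image. These frequencies can be interpreted probabilistically.

The authors of the paper describe the annotations as follows:

… annotators were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects). We collected multiple captions for each image because there is a considerable degree of variance in the way many images can be described.

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.

There are also lists of the photo identifiers to use in a train/test split so that you can compare your results with those reported in the paper.

The first step is to decide which captions to use. The simplest approach is to use the first description for each photograph.

First, we need a function to load the entire annotations file ('Flickr8k.token.txt') into memory. Below is a function to do this called load_doc() that, given a filename, returns the document as a string.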

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

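We can see from the sample of the file above that we need only split each line by white space and take the first element as the image identifier and the rest as the image description. For example: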

# split line by white space
tokens = line.split()
# take the first token as the image id, the rest as the description
image_id, image_desc = tokens[0], tokens[1:]

We can then clean up the image identifier by removing the file extension and the description number.

# remove filename from image id
image_id = image_id.split('.')[0]

We can also join the description tokens back together into a string for later processing.

# convert description tokens back to string
image_desc = ' '.join(image_desc)
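Applied to the first sample line shown earlier, these three steps can be traced as follows. This is only a walk-through of the snippets above on one line of the file.

```python
# first sample line from Flickr8k.token.txt, as shown above
line = "1305564994_00513f9a5b.jpg#0 A man in street racer armor be examine the tire of another racer 's motorbike ."

# split line by white space
tokens = line.split()
# take the first token as the image id, the rest as the description
image_id, image_desc = tokens[0], tokens[1:]
# remove the file extension and description number ('.jpg#0') from the id
image_id = image_id.split('.')[0]
# convert description tokens back to string
image_desc = ' '.join(image_desc)

print(image_id)    # 1305564994_00513f9a5b
print(image_desc)
```

The description number ("#0") disappears together with the extension because it follows the period in the first token.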

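We can put all of this together into a function.

Below is the load_descriptions() function, which takes the loaded file, processes it line by line, and returns a dictionary mapping image identifiers to their first description.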

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# store the first description for each image
		if image_id not in mapping:
			mapping[image_id] = image_desc
	return mapping

filename = 'Flickr8k_text/Flickr8k.token.txt'
doc = load_doc(filename)
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

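Running the example prints the number of loaded image descriptions.

Loaded: 8092

There are other ways to load descriptions that may turn out to be more accurate for the data.

Use the above example as a starting point and let me know what you come up with.
Post your approach in the comments below.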
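One such alternative, sketched here as an assumption rather than as the tutorial's method, is to keep every description for each image instead of only the first. The function name load_all_descriptions() is introduced for this example, and it is run on two lines taken from the sample above rather than on the full file.

```python
# hypothetical variant: map each image id to a list of all its descriptions
def load_all_descriptions(doc):
    mapping = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        # skip blank lines and lines with an id but no description
        if len(tokens) < 2:
            continue
        image_id = tokens[0].split('.')[0]
        image_desc = ' '.join(tokens[1:])
        # append to the list for this image, creating it on first use
        mapping.setdefault(image_id, []).append(image_desc)
    return mapping

doc = ("1305564994_00513f9a5b.jpg#1 Two racer drive a white bike down a road .\n"
       "1305564994_00513f9a5b.jpg#4 Two person in race uniform in a street car .\n")
descriptions = load_all_descriptions(doc)
print(len(descriptions))                           # 1
print(len(descriptions['1305564994_00513f9a5b']))  # 2
```

Keeping all descriptions gives a model several reference captions per photo, at the cost of a slightly more involved data structure downstream.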

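Prepare Description Text

The descriptions are tokenized; this means that each token is comprised of words separated by white space.

It also means that punctuation is separated out as its own tokens, such as periods ('.') and the apostrophe-s suffix ('s).

It is a good idea to clean up the description text before using it in a model. Some data cleaning ideas include:

Normalize the case of all tokens to lowercase.
Remove all punctuation from tokens.
Remove all tokens that contain one or fewer characters (after punctuation is removed), e.g. 'a' and hanging 's' characters.

We can implement these simple cleaning operations in a function that cleans each description in the dictionary loaded in the previous section. Below is the clean_descriptions() function, which cleans each loaded description.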

# clean description text
def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)
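To see the effect on a single caption, the sketch below applies the same three cleaning steps to the first sample description. The single-string helper clean_description() is an assumption introduced for this illustration; the tutorial's function works on the whole dictionary in place.

```python
import string

# hypothetical single-description version of the cleaning steps above
def clean_description(desc):
    table = str.maketrans('', '', string.punctuation)
    # tokenize, lowercase, and strip punctuation from each token
    tokens = [w.lower().translate(table) for w in desc.split()]
    # remove empty and single-character tokens such as 'a' and hanging 's'
    tokens = [w for w in tokens if len(w) > 1]
    return ' '.join(tokens)

desc = "A man in street racer armor be examine the tire of another racer 's motorbike ."
cleaned = clean_description(desc)
print(cleaned)
```

This prints "man in street racer armor be examine the tire of another racer motorbike": the leading "A", the detached "'s", and the final period are all removed.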

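We can then save the clean text to file for later use by our model.

Each line will contain the image identifier followed by the clean description. Below is the save_doc() function for saving the cleaned descriptions to file.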

# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

Putting this all together with the loading of descriptions from the previous section, the complete example is listed below.

import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# store the first description for each image
		if image_id not in mapping:
			mapping[image_id] = image_desc
	return mapping

# clean description text
def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)

# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
# save descriptions
save_doc(descriptions, 'descriptions.txt')

Running the example first loads 8,092 descriptions, cleans them, summarizes the vocabulary of 4,484 unique words, then saves them to a new file called 'descriptions.txt'.

Loaded: 8092
Vocabulary Size: 4484

Open the new file 'descriptions.txt' in a text editor and review the contents. You should see readable descriptions of photos, ready for modeling.

...
3139118874_599b30b116 two girls pose for picture at christmastime
2065875490_a46b58c12b person is walking on sidewalk and skeleton is on the left inside of fence
2682382530_f9f8fd1e89 man in black shorts is stretching out his leg
3484019369_354e0b88c0 hockey team in red and white on the side of the ice rink
505955292_026f1489f2 boy rides horse

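The vocabulary is still relatively large. To make modeling easier, especially the first time around, I would recommend reducing the vocabulary further by removing words that appear only once or twice across all descriptions.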
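
The suggestion above could be sketched as follows. The function name remove_rare_words(), the min_count parameter, and the three-caption sample are assumptions made for illustration; a threshold of 3 drops words seen only once or twice.

```python
from collections import Counter

# Sketch: drop words that occur fewer than min_count times across all captions.
# remove_rare_words() and min_count are hypothetical names for this example.
def remove_rare_words(descriptions, min_count=3):
    # count every word over the whole set of descriptions
    counts = Counter(w for d in descriptions.values() for w in d.split())
    trimmed = dict()
    for key, desc in descriptions.items():
        kept = [w for w in desc.split() if counts[w] >= min_count]
        trimmed[key] = ' '.join(kept)
    return trimmed

sample = {'a': 'dog runs on grass', 'b': 'dog runs in snow', 'c': 'dog runs fast'}
trimmed = remove_rare_words(sample, min_count=3)
print(trimmed)
```

On real data some captions may end up very short or empty after trimming, so it may be worth checking for and dropping those before modeling.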

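Whole Description Sequence Model

There are many ways to model the caption generation problem.

One naive way is to create a model that outputs the entire textual description in a one-shot manner.

This is a naive model because it puts a heavy burden on the model to both interpret the meaning of the photograph and generate words, then arrange those words into the correct order.

This is not unlike the language translation problem tackled with an Encoder-Decoder recurrent neural network, where the entire translated sentence is output one word at a time given an encoding of the input sequence. Here, we would use an encoding of the image to generate the output sentence instead.

The image may be encoded using a pre-trained model used for image classification, such as the VGG model trained on the ImageNet dataset mentioned above.

The output of the model would be a probability distribution over the words in the vocabulary for each position in the sequence. The sequence would be as long as the longest photo description.

The descriptions would therefore need to be integer encoded first, where each word in the vocabulary is assigned a unique integer and sequences of words are replaced with sequences of integers. The integer sequences would then need to be one hot encoded to represent the idealized probability distribution over the vocabulary for each word in the sequence.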
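
As a toy illustration of the encoding just described, the following sketch uses plain NumPy on two made-up captions. It numbers words alphabetically and reserves 0 for padding; that numbering is an assumption chosen for simplicity, and the Keras tools used later assign integers differently.

```python
import numpy as np

# Sketch: integer encode, pad, and one hot encode two made-up captions.
captions = ['boy rides horse', 'dog runs']

# integer encode: 0 is reserved for padding, words are numbered from 1
words = sorted(set(' '.join(captions).split()))
word_index = {w: i + 1 for i, w in enumerate(words)}
vocab_size = len(word_index) + 1
sequences = [[word_index[w] for w in c.split()] for c in captions]

# pad every sequence with 0 values after the words
max_length = max(len(s) for s in sequences)
padded = np.array([s + [0] * (max_length - len(s)) for s in sequences])

# one hot encode: each integer selects one row of the identity matrix
y = np.eye(vocab_size)[padded]
print(padded.tolist())  # [[1, 4, 3], [2, 5, 0]]
print(y.shape)          # (2, 3, 6)
```

The result has the form [samples, sequence length, features], with exactly one 1 in each feature vector.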

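We can use tools in Keras to prepare the descriptions for this type of model.

The first step is to load the mapping of image identifiers to clean descriptions stored in 'descriptions.txt'.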

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))

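Running this snippet loads the 8,092 photo descriptions into a dictionary keyed on image identifiers. These identifiers can then be used to load each photo file as the corresponding input to the model.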

Loaded 8092

Next, we need to extract all of the description text so that we can encode it.

# extract all text
desc_text = list(descriptions.values())

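We can use the Keras Tokenizer class to consistently map each word in the vocabulary to an integer. First the object is created, then it is fit on the description text. The fit tokenizer can later be saved to file so that predictions can be decoded back to vocabulary words consistently.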

from keras.preprocessing.text import Tokenizer
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
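
The remark above about saving the fit tokenizer could be sketched with the standard pickle module. To keep the snippet runnable without Keras, a plain dict named word_index stands in for tokenizer.word_index, and the file name word_index.pkl is a made-up example; pickling the Tokenizer object itself is a common alternative, but that is an assumption here rather than something this tutorial shows.

```python
import os
import pickle
import tempfile

# Sketch: save a word-to-integer mapping and use it later to decode predictions.
# word_index is a stand-in for tokenizer.word_index (assumption for this example).
word_index = {'boy': 1, 'rides': 2, 'horse': 3}

# save the mapping to a file (hypothetical file name)
path = os.path.join(tempfile.gettempdir(), 'word_index.pkl')
with open(path, 'wb') as f:
    pickle.dump(word_index, f)

# later: load the mapping and invert it to map integers back to words
with open(path, 'rb') as f:
    restored = pickle.load(f)
index_word = {i: w for w, i in restored.items()}
decoded = ' '.join(index_word[i] for i in [1, 2, 3])
print(decoded)  # boy rides horse
```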

Next, we can use the fit tokenizer to encode the photo descriptions into sequences of integers.

# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)

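The model will require all output sequences to have the same length for training. We can achieve this by padding all encoded sequences to the length of the longest encoded sequence. We pad the sequences with 0 values after the list of words. Keras provides the pad_sequences() function to pad the sequences.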

from keras.preprocessing.sequence import pad_sequences
# pad all sequences to a fixed length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

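Finally, we can one hot encode the padded sequences so that there is one sparse vector for each word in the sequence. Keras provides the to_categorical() function to perform this operation.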

from keras.utils import to_categorical
# one hot encode
y = to_categorical(padded, num_classes=vocab_size)

Once encoded, we can ensure that the sequence output data has the right shape for the model.

y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)

Putting all of this together, the complete example is listed below.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# pad all sequences to a fixed length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
# one hot encode
y = to_categorical(padded, num_classes=vocab_size)
y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)

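Running the example first prints the number of loaded image descriptions (8,092 photos), the dataset vocabulary size (4,485 words), and the length of the longest description (28 words), then finally the shape of the data for fitting a prediction model, in the form [samples, sequence length, features].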

Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(8092, 28, 4485)

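As mentioned, outputting the entire sequence may be challenging for the model.

We will look at a simpler model in the next section.

Word-By-Word Model

A simpler model for generating a caption for photographs is to generate one word given both the image and the last word generated as input.

This model would then have to be called recursively to generate each word in the description, with previous predictions as input.

Using the words as input gives the model a forced context for predicting the next word in the sequence.

This is the model used in prior research, such as:

Show and Tell: A Neural Image Caption Generator, 2015.

A word embedding layer can be used to represent the input words. Like the feature extraction model for the photos, this too can be pre-trained, either on a large corpus or on the dataset of all descriptions.

The model would take a full sequence of words as input; the length of the sequence would be the maximum length of descriptions in the dataset.

The model must be started with something. One approach is to surround each photo description with special tags that signal the start and end of the description, such as 'STARTDESC' and 'ENDDESC'.

For example, the description: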

boy rides horse

Would become:

STARTDESC boy rides horse ENDDESC

And would be fed to the model with the same image input, resulting in the following input-output word sequence pairs:

Input (X), 						Output (y)
STARTDESC, 						boy
STARTDESC, boy,					rides
STARTDESC, boy, rides, 			horse
STARTDESC, boy, rides, horse	ENDDESC
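
The table above can be reproduced with a few lines of plain Python. This is only a sketch on word strings; the helper name make_pairs() is made up, and the tutorial's own code performs the same split on integer encoded sequences.

```python
# Sketch: wrap a description in start/end tags and split it into
# (input words, output word) pairs. make_pairs() is a hypothetical helper.
def make_pairs(description):
    tokens = ['STARTDESC'] + description.split() + ['ENDDESC']
    # every prefix of the sequence predicts the word that follows it
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_pairs('boy rides horse')
for in_words, out_word in pairs:
    print(in_words, '->', out_word)
```

A description of n words yields n + 1 pairs once the two tags are added.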

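The data preparation would begin much the same as described in the previous section.

Each description must be integer encoded. After encoding, the sequences are split into multiple input and output pairs, and only the output word (y) is one hot encoded. This is because the model is only required to predict the probability distribution of one word at a time.

The code is the same up to the point where we calculate the maximum length of sequences.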

...
descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)

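Next, we split each integer encoded sequence into input and output pairs.

Let's step through a single sequence called seq at index i, where i >= 1.

First, we take the first i words (seq[:i]) as the input sequence and the word that follows them (seq[i]) as the output word.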

# split into input and output pair
in_seq, out_seq = seq[:i], seq[i]

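Next, the input sequence is padded to the maximum length of input sequences. Pre-padding is used (the default) so that new words appear at the end of the sequence rather than at the beginning of the input.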

# pad input sequence
in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
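
To show what pre-padding means compared with post-padding, here is a plain-Python sketch. The pad() helper is a simplified stand-in written for this illustration, not the Keras function; it assumes the sequence is no longer than maxlen and does no truncation.

```python
# Sketch: pre-padding vs. post-padding on a toy integer sequence.
# pad() is a hypothetical stand-in for illustration, not pad_sequences().
def pad(seq, maxlen, padding='pre'):
    zeros = [0] * (maxlen - len(seq))
    # 'pre' puts zeros in front, 'post' puts them after the words
    return zeros + seq if padding == 'pre' else seq + zeros

pre = pad([7, 3], 5)
post = pad([7, 3], 5, padding='post')
print(pre)   # [0, 0, 0, 7, 3]
print(post)  # [7, 3, 0, 0, 0]
```

With pre-padding, the most recent word always sits in the last position of the input, whatever the length of the prefix.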

The output word is one hot encoded, much like in the previous section.

# encode output sequence
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]

We can put all of this together into a complete example that prepares description data for the word-by-word model.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)

X, y = list(), list()
for img_no, seq in enumerate(sequences):
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# split into input and output pair
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		X.append(in_seq)
		y.append(out_seq)

# convert to numpy arrays
X, y = array(X), array(y)
print(X.shape)
print(y.shape)

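Running the example prints the same statistics, but also prints the size of the resulting encoded input and output sequences.

Note that the image inputs must follow exactly the same ordering, with the same photo shown for every example drawn from a single description. One way to do this would be to load the photo and store a copy of it for each example prepared from a single description.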

Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(66456, 28)
(66456, 4485)

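Progressive Loading

The Flickr8K dataset of photos and descriptions can fit into RAM if you have a lot of RAM (e.g. 8 gigabytes or more), and most modern systems do.

This is fine if you want to fit a deep learning model using the CPU.

Alternately, if you want to fit a model using a GPU, you will not be able to fit the data into the memory of an average GPU video card.

One solution is to progressively load the photos and descriptions as they are needed by the model.

Keras supports progressively loaded datasets through the fit_generator() function on the model. A generator is the term used to describe a function that returns batches of samples for the model to train on. This can be as simple as a standalone function, the name of which is passed to the fit_generator() function when fitting the model.

As a reminder, a model is fit for multiple epochs, where one epoch is one pass through the entire training dataset, such as all photos. One epoch is comprised of multiple batches of examples, and the model weights are updated at the end of each batch.

A generator must create and yield one batch of examples. For example, the average sentence length in the dataset is 11 words; that means that each photo will result in 11 examples for fitting the model, and two photos will result in about 22 examples on average. A good default batch size for modern hardware may be 32 examples, so that is about 2-3 photos' worth of examples.

We can write a custom generator to load a few photos and return the samples as a single batch.

Let's assume we are working with the word-by-word model described in the previous section, which expects a sequence of words and a prepared image as input and predicts a single word.

Let's design a data generator that, given a loaded dictionary of image identifiers to clean descriptions, a fit tokenizer, and a maximum sequence length, will load one image's worth of examples for each batch.

A generator must loop forever and yield each batch of samples. If generators and yield are new concepts for you, consider reading this article:

Python Generators

We can loop forever with a while loop and, within it, loop over each image in the image directory. For each image filename, we can load the image and create all of the input-output sequence pairs from the image's description.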
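
The loop-forever pattern described above can be seen in isolation in the following sketch. The function name cycle_batches() and the placeholder strings are made up for illustration; a real generator would yield arrays instead of names.

```python
# Sketch: a generator that loops forever over a fixed list of items.
# cycle_batches() and the photo names are hypothetical placeholders.
def cycle_batches(items):
    # outer loop never ends; each inner pass is one sweep over the data
    while True:
        for item in items:
            yield item

gen = cycle_batches(['photo_a', 'photo_b'])
# next() resumes the function until it reaches the next yield
first_three = [next(gen) for _ in range(3)]
print(first_three)  # ['photo_a', 'photo_b', 'photo_a']
```

Because the generator never finishes on its own, the caller decides how many batches to draw, which is why the number of steps per epoch has to be given when fitting.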

Below is the data generator function.

def data_generator(mapping, tokenizer, max_length):
	# loop for ever over images
	directory = 'Flicker8k_Dataset'
	while 1:
		for name in listdir(directory):
			# load an image from file
			filename = directory + '/' + name
			image, image_id = load_photo(filename)
			# create word sequences
			desc = mapping[image_id]
			in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
			yield [[in_img, in_seq], out_word]

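You could extend it to take the name of the dataset directory as a parameter.

The generator returns an array containing the inputs (X) and output (y) for the model. The input is itself an array with two items: the input images and the encoded word sequences. The outputs are one hot encoded words.

You can see that it calls a function called load_photo() to load a single photo and return the pixels and the image identifier. This is a simplified version of the photo loading function developed at the beginning of this tutorial.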

# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)[0]
	# get image id
	image_id = filename.split('/')[-1].split('.')[0]
	return image, image_id

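Another function named create_sequences() is called to create sequences of images, input sequences of words, and output words, which we then yield to the caller. This function includes everything discussed in the previous section, and also creates copies of the image pixels, one for each input-output pair created from the photo's description.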

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store one copy of the image for each input-output pair
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return Ximages, XSeq, y

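Prior to preparing the model that uses the data generator, we must load the clean descriptions, prepare the tokenizer, and calculate the maximum sequence length. All three must be passed to data_generator() as parameters.

We use the same load_clean_descriptions() function developed previously and a new create_tokenizer() function that simplifies the creation of the tokenizer.

Tying all of this together, the complete data generator is listed below, ready to be used to train a model.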

from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = list(descriptions.values())
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)[0]
	# get image id
	image_id = filename.split('/')[-1].split('.')[0]
	return image, image_id

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return [Ximages, XSeq, y]

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, tokenizer, max_length):
	# loop for ever over images
	directory = 'Flicker8k_Dataset'
	while 1:
		for name in listdir(directory):
			# load an image from file
			filename = directory + '/' + name
			image, image_id = load_photo(filename)
			# create word sequences
			desc = descriptions[image_id]
			in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
			yield [[in_img, in_seq], out_word]

# load mapping of ids to descriptions
descriptions = load_clean_descriptions('descriptions.txt')
# integer encode sequences of words
tokenizer = create_tokenizer(descriptions)
# pad to fixed length
max_length = max(len(s.split()) for s in list(descriptions.values()))
print('Description Length: %d' % max_length)

# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

A data generator can be tested by calling the next() function.

We can test the generator as follows.

# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

Running the example prints the shape of the input and output examples for a single batch (e.g. 13 input-output pairs):

(13, 224, 224, 3)
(13, 28)
(13, 4485)

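The generator can be used to fit a model by calling the fit_generator() function on the model (instead of fit()) and passing in the generator.

We must also specify the number of steps or batches per epoch. We could estimate this as (10 x training dataset size), perhaps 70,000 if 7,000 images are used for training.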

# define model
# ...
# fit model
model.fit_generator(data_generator(descriptions, tokenizer, max_length), steps_per_epoch=70000, ...)
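
As a rough way to read the steps_per_epoch value above: data_generator() yields one batch per photo, so one sweep over the photos takes as many steps as there are photos. The short calculation below assumes the generator is restricted to the 7,000 training photos mentioned in the text; the variable names are made up for illustration.

```python
# Sketch: relate steps_per_epoch to sweeps over the training photos.
n_train_photos = 7000    # training photos, as in the text
batches_per_photo = 1    # data_generator() yields once per photo
steps_per_epoch = 70000  # value passed to fit_generator() above

# steps needed for one full sweep over the training photos
steps_per_sweep = n_train_photos * batches_per_photo
# number of sweeps that fit into one Keras epoch with these settings
sweeps_per_epoch = steps_per_epoch // steps_per_sweep
print(steps_per_sweep, sweeps_per_epoch)  # 7000 10
```

Under these assumptions, each Keras epoch with 70,000 steps corresponds to about ten sweeps over the training photos.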

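Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Flickr8K Dataset

API

Summary

In this tutorial, you discovered how to prepare photos and textual descriptions ready for developing an automatic photo caption generation model.

Specifically, you learned:

About the Flickr8K dataset, comprised of more than 8,000 photos and up to 5 captions for each photo.
How to generally load and prepare photo and text data for modeling with deep learning.
How to specifically encode data for two different types of deep learning models in Keras.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.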