Machine Learning: Image-to-Audio Captioning

Machine learning (ML) has spread into many different fields and disciplines. Dipping your toes into a new field is the best way to grow and learn new things. The following is a summary of how researchers have applied machine learning to generate audio descriptions directly from images.

Image2speech: Automatically generating audio descriptions of images

Researchers generate audio captions for images without using text as an intermediate form.

Clustergen, VGG16, LSTM, MSCOCO, SPEECH-COCO, XNMT, Flickr, Flickr-Audio

Photo by Daniel Sandvik on Unsplash

Abstract and Introduction

The researchers propose a new area of artificial intelligence, image2speech. They define the primary task as:

“An Image2Speech system should generate a spoken description of an image directly, without first generating text.”

Previously, image2speech problems were broken into two steps: image2txt, then txt2speech. However, several languages have no written form (e.g. the Algerian, Moroccan, and Levantine varieties of Arabic, among others), so the traditional two-step image2speech pipeline cannot serve those languages.

Methods

The researchers used standard open-source libraries, software, and data to develop their model. Figure 1 shows how these technologies fit together.

  1. VGG16 (extracts CNN features from the image)

  2. XNMT (generates speech units from the CNN features)

  3. Clustergen (converts speech units into audio)

  4. Kaldi and Eesen (automatic speech recognition; transcribe audio into speech units)

  5. Flickr image dataset and Flickr-Audio

  6. MSCOCO and SPEECH-COCO

Figure 1. Experimental methods for Image2Speech. Source: Author

The Image2Speech system is composed of three separate networks (VGG16, XNMT, and Clustergen). The whole network is trained on image and audio description pairs.

The authors use VGG16 to extract image features; XNMT then translates those features into speech units; finally, Clustergen turns the speech units into audio.

“XNMT (the eXtensible Machine Translation Toolkit) is specialized in training sequence to sequence neural networks.”

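The three-stage data flow can be sketched in code. This is a minimal, self-contained stand-in, not the authors' actual stack: the VGG16 encoder is replaced by simple average pooling, the XNMT sequence model by a toy recurrent decoder with fixed random weights, and Clustergen by a stub that tags each unit ID. Only the shape of the pipeline matches the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HID_DIM, VOCAB = 16, 32, 8  # toy sizes; the real systems are far larger

# Fixed random decoder weights (stand-in for a trained XNMT model)
W_in = rng.normal(size=(HID_DIM, FEAT_DIM))
W_h = rng.normal(size=(HID_DIM, HID_DIM))
W_out = rng.normal(size=(VOCAB, HID_DIM))

def extract_features(image):
    """Stand-in for VGG16: pool an HxWx3 array into a small feature vector."""
    pooled = image.mean(axis=(0, 1))             # average over spatial dims -> (3,)
    return np.tile(pooled, FEAT_DIM)[:FEAT_DIM]  # repeat/truncate to FEAT_DIM

def generate_units(features, max_len=5):
    """Stand-in for XNMT: greedily decode a sequence of speech-unit IDs."""
    h = np.tanh(W_in @ features)
    units = []
    for _ in range(max_len):
        units.append(int(np.argmax(W_out @ h)))  # pick the most likely unit
        h = np.tanh(W_h @ h)                     # advance the recurrent state
    return units

def synthesize(units):
    """Stand-in for Clustergen: pretend each unit maps to a waveform chunk."""
    return " ".join(f"<unit {u}>" for u in units)

image = rng.random((8, 8, 3))  # fake 8x8 RGB image
audio = synthesize(generate_units(extract_features(image)))
```

Swapping any stage for a real model (a pretrained VGG16, a trained sequence-to-sequence network, a unit-selection synthesizer) would preserve the same interfaces.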
Data

The data comes from two different sources. Each source provides an image dataset (Flickr and MSCOCO) and a matching audio dataset (Flickr-Audio and SPEECH-COCO). Thus each image is accompanied by a text caption and an audio reading of that caption.

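The pairing structure can be sketched as follows. The file names and records below are invented for illustration; the real Flickr and MSCOCO corpora have their own layouts. The key point is that the text caption only exists to produce the audio recording and is never shown to the Image2Speech model.

```python
# Invented example records; the actual dataset layouts differ.
captions = {
    "img_001.jpg": ["A girl going into a wooden building.",
                    "A little girl climbing into a wooden playhouse."],
}
audio = {
    "img_001.jpg": ["img_001_cap0.wav", "img_001_cap1.wav"],
}

# Each (image, audio recording) pair is one training example for the
# Image2Speech system; the captions dict is only used upstream, to
# generate the recordings.
pairs = [(img, wav)
         for img, wavs in audio.items()
         for wav in wavs]
```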
Example image from the dataset (public domain)

Captions for the image above were generated by workers from Amazon Mechanical Turk:

  1. A child in a pink dress is climbing up a set of stairs in an entryway.
  2. A girl going into a wooden building.
  3. A little girl climbing into a wooden playhouse.
  4. A little girl climbing the stairs to her playhouse.
  5. A little girl in a pink dress going into a wooden cabin.

MSCOCO is the largest image2txt and txt2speech dataset available. It is so large that the researchers were unable to include all of it during training.

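When a corpus is too large to train on in full, a common workaround is to train on a fixed random subset. This is a generic sketch, not necessarily how the authors subsampled MSCOCO; the ID format and subset size here are made up.

```python
import random

# Hypothetical example IDs standing in for the full MSCOCO index
all_ids = [f"coco_{i:06d}" for i in range(100_000)]

random.seed(42)  # fixed seed so the chosen subset is reproducible
subset = random.sample(all_ids, k=10_000)  # sample 10% without replacement
```

Fixing the seed matters: it makes training runs comparable, since every run sees the same subset.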
Results

After all of the models were trained, they achieved a 78.8% phone error rate. The authors described the output as:

“Not perfectly natural but is composed of intelligible words arranged into intelligible sentences”

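The phone error rate quoted above is the standard metric for this task: the Levenshtein edit distance between the predicted and reference phone sequences, divided by the reference length. A minimal implementation (the phone sequences in the example are made up for illustration):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between phone sequences / reference length."""
    d = list(range(len(hyp) + 1))  # DP row: distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution (0 if match)
    return d[-1] / len(ref)

# Made-up phone sequences for illustration
ref = ["DH", "AH", "K", "AE", "T"]
hyp = ["DH", "AH", "K", "AH", "T"]
per = phone_error_rate(ref, hyp)  # 1 substitution / 5 reference phones = 0.2
```

By this measure, a 78.8% rate means most predicted phones do not align with the reference, which is why the authors emphasize intelligibility to human listeners rather than the raw score.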
Conclusions

The authors defined a new area of artificial intelligence that is more challenging and restrictive than the traditional two-step image2speech approach. Generating speech from an image without any intermediate text is a unique problem with new applications. In this paper, the authors presented the first model of its kind and hope that others will be encouraged to pursue building more interesting models.

Reference Paper

Mark Hasegawa-Johnson, Alan Black, Lucas Ondel, Odette Scharenborg, Francesco Ciannella, “Image2speech: Automatically generating audio descriptions of images,” ICNLSSP (2017). http://www.cs.cmu.edu/~awb/papers/hasegawajohnson17icnlssp.pdf

Translated from: https://towardsdatascience.com/machine-learning-image-to-audio-captioning-964dc0f63df9
