Localized Narratives: The Latest and Greatest in Image Captioning by Google


A couple of months back, the research team at Google announced a brand new way to develop datasets for learning tasks using vision, tracing, and speech, called Localized Narratives. As you read on, you’ll see how this new annotation protocol works and how it opens up new research opportunities in the world of machine learning and AI.


Localized Narratives is a new protocol for dataset generation developed by Google. It aims to provide highly accurate and rich datasets that, once generated, can cater to over 15 use cases.


How is the dataset made?

The human annotators are asked to talk about the image while hovering their mouse over it. They have to do it in such a way that the mouse hovers over each region of the image as they are talking about that region.


Image by Jordi Pont-Tuset

For example, here the voice and the transcription describe the dried grass while the annotator is talking about the grass. He then moves on to the woman and describes her clothing and the objects she holds in her hands. Lastly, the sky and the woman’s hair are described. The mapping can be seen from how the color of the mouse trace corresponds with the captions as each region is described.


Once the annotator describes the image as a voice clip, he is asked to transcribe it word by word. While this part may seem redundant at first, we have to appreciate the research effort put in so that any error in transcription is avoided; this step can be automated once automatic speech recognition gets better. The transcription is now accurate, but without any mapping to the mouse traces. To address this, the researchers performed a sequence-to-sequence alignment between the automatic and manual transcriptions, which leads to accurate and temporally synchronized captions. The resulting dataset thus has, for each image, the audio describing it, the transcription of that audio, and the mouse traces, all in sync.

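The alignment idea can be pictured with a toy sketch: an automatic recognizer produces word-level timestamps but makes mistakes, while the manual transcription is correct but untimed; aligning the two word sequences transfers the timestamps onto the correct words. Everything below (the words, the timestamps, and the use of `difflib`) is illustrative, not the paper's actual alignment algorithm, and unmatched words are simply left untimed here.

```python
import difflib

# Hypothetical ASR output: (word, start_time, end_time) in seconds.
asr_words = [("in", 0.0, 0.2), ("this", 0.2, 0.4), ("image", 0.4, 0.8),
             ("i", 0.9, 1.0), ("sea", 1.0, 1.3), ("dried", 1.4, 1.8),
             ("grass", 1.8, 2.2)]
# The annotator's manual, error-free (but untimed) transcription.
manual = ["in", "this", "image", "i", "see", "dried", "grass"]

def align(manual, asr_words):
    """Transfer ASR timestamps onto the manual transcription by
    matching the two word sequences; words the ASR got wrong
    ("sea" vs "see") end up with no timestamp in this toy version."""
    asr_tokens = [w for w, _, _ in asr_words]
    matcher = difflib.SequenceMatcher(a=manual, b=asr_tokens)
    timed = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            timed[i] = (manual[i], asr_words[j][1], asr_words[j][2])
    return [timed.get(i, (w, None, None)) for i, w in enumerate(manual)]

for word, start, end in align(manual, asr_words):
    print(word, start, end)
```

A production system would also interpolate times for the unmatched words; this sketch only shows why the manual and automatic transcriptions are complementary.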

What makes it so special?

Localized Narratives uses a combination of text, speech, and mouse traces to describe an image. What makes this form of annotation special is that there is a region of the picture mapped to every word spoken by the narrator. Unlike many other captioning methods, nouns are not the only focus of the captions: verbs, prepositions, etc. are given equal importance and have a region within the image associated with them. The whole process is based on how people usually describe things: by pointing and explaining. As such, it comes easily to the annotators. While the transcription adds some length to the process, the resulting data is rich in explanation and free of errors, so the ratio of time spent to data collected is very favorable.


Photo by Karina Carvalho on Unsplash

Where is it useful?

While the primary use case may be image captioning, localized narratives serve a wide variety of tasks. Localized narratives provide four synchronized modalities: the image, the text, the recording, and the grounding (the mouse trace). This opens the way to a huge number of use cases for the dataset by combining these four modalities in different orders. For example, with the image as input and text as output, the dataset is ideal for image captioning or paragraph generation. Used the other way around, with text as input and the image as output, it can power a text-to-image generator. In this way, the researchers have identified 15 popular use cases that can put this massive dataset to use.

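As a sketch of this input/output pairing, one could tabulate tasks as subsets of the four modalities. The modality and task names below are my own shorthand, not an official schema:

```python
# The four synchronized modalities in a Localized Narratives annotation.
MODALITIES = {"image", "text", "speech", "trace"}

# A few of the tasks enabled by choosing different input/output subsets
# (inputs, outputs); names are illustrative.
TASKS = {
    "image_captioning":            ({"image"},          {"text"}),
    "controlled_image_captioning": ({"image", "trace"}, {"text"}),
    "text_to_image_generation":    ({"text"},           {"image"}),
    "speech_recognition":          ({"speech"},         {"text"}),
}

def check(task):
    """Verify a task draws only on the four available modalities and
    never uses the same modality as both input and output."""
    inputs, outputs = TASKS[task]
    assert inputs <= MODALITIES and outputs <= MODALITIES
    assert not (inputs & outputs), "modality cannot be input and output"
    return inputs, outputs

for name in TASKS:
    ins, outs = check(name)
    print(f"{name}: {sorted(ins)} -> {sorted(outs)}")
```

Any other valid (inputs, outputs) split over the four modalities defines a further task in the same way, which is where the "15 use cases" figure comes from.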

Tasks where localized narratives can be used

The results so far

The researchers carried out some of the aforementioned tasks to see how the new dataset performs.


Controlled Image Captioning


Given both an image and a mouse trace, the goal is to produce an image caption that matches the mouse trace; that is, it describes the image regions covered by the trace, in the order of the trace. The grounding and the image are given as inputs, and the text caption is the expected output.

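A hypothetical preprocessing step for such a model might normalize the raw trace into image-relative coordinates and time deltas before feeding it in alongside the image features. This is only a sketch of the idea, with an invented input format, not the authors' pipeline:

```python
def trace_to_features(trace, width, height):
    """Turn a mouse trace of (x_px, y_px, t_sec) points into
    [0, 1]-ranged (x, y, dt) triples, where dt is the time elapsed
    since the previous point. Input format is illustrative."""
    feats = []
    t_prev = trace[0][2]
    for x, y, t in trace:
        feats.append((x / width, y / height, t - t_prev))
        t_prev = t
    return feats

# A tiny made-up trace over a 200x100 image.
trace = [(100, 50, 0.00), (120, 60, 0.05), (160, 80, 0.10)]
for triple in trace_to_features(trace, width=200, height=100):
    print(triple)
```

Normalizing by image size keeps the trace representation independent of resolution; the time deltas preserve the pacing that links each trace segment to the words spoken over it.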

Image by Jordi Pont-Tuset

Anyone reading the captions can see how the captioning improves when mouse traces are provided. Some key observations:


  1. Since the mouse trace focuses on a smaller region inside the image, the captions have much more detail. Many more features are described in the last two captions compared to the first. Conditioning on the mouse trace has helped to cover the image more completely.

  2. The mouse trace provides a richer caption in the order that the user intends. From the second and third images above, we can see that different captions have been generated for different traces over the image. Mouse traces aid in producing a much more specific caption and give the user a say in what the generated caption will look like.


While the resulting captions are not necessarily better, they are sure to be more complete and in line with how the user wants the image described.


Image generation


Image by Jordi Pont-Tuset

Image generation uses a segmentation map, a region-labeled data file that indicates where each object is supposed to go, to generate an image. The researchers demonstrated how localized narratives help with this task using a state-of-the-art pre-trained model, showing how they can make the interface for the task much more user-friendly. An incremental generation can be developed as shown above.


Here the user specifies only a boat first, followed by water, a person, an umbrella, and a mountain. Notice how the water reflects the mountain and the boat, and how the boat opens up when the person is added.

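The incremental flow can be mimicked with a toy segmentation map, where each new utterance paints another labeled region on top of what is already there. The class ids, rectangle coordinates, and tiny 8×8 grid are all invented for illustration; a real system would derive free-form regions from the mouse trace:

```python
# Toy class vocabulary; ids are arbitrary.
CLASSES = {"background": 0, "boat": 1, "water": 2, "person": 3}

def blank_map(h, w):
    """An h x w segmentation map initialized to background."""
    return [[CLASSES["background"]] * w for _ in range(h)]

def add_region(seg, cls, top, left, bottom, right):
    """Paint rows top..bottom-1, cols left..right-1 with a class id.
    Later strokes overwrite earlier ones, as in incremental editing."""
    for r in range(top, bottom):
        for c in range(left, right):
            seg[r][c] = CLASSES[cls]
    return seg

seg = blank_map(8, 8)
seg = add_region(seg, "water", 4, 0, 8, 8)   # bottom half becomes water
seg = add_region(seg, "boat", 3, 2, 5, 6)    # boat straddles the waterline
seg = add_region(seg, "person", 2, 3, 3, 4)  # person just above the boat
for row in seg:
    print(row)
```

Feeding each successive map to a segmentation-to-image model (rather than only the final one) is what produces the incremental generation shown in the figure.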

Image generation tasks are limited to nouns for now. With Localized Narratives, grounding is provided for verbs and adjectives as well. This opens up new possibilities in research and can go a long way toward helping image generation in the future.


These are just two use cases where localized narratives helped with popular machine learning tasks. As mentioned earlier, much better models for many different tasks will hopefully be developed using localized narratives.


How can I use localized narratives in my project?

Google has already annotated 849k images with localized narratives. Localized narratives for popular image datasets like COCO, Flickr30k, and ADE20k, and for part of the Open Images dataset, have already been made available. If your project uses any of these datasets, you can find the localized narratives for them here.

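At the time of writing, the released annotations are distributed as JSON Lines (one narrative per line). The record below mimics the documented fields (`caption`, `timed_caption`, `traces`, and so on), but every value in it is invented; check the project's download page for the authoritative schema before relying on these names:

```python
import json

# One abridged, made-up record in the style of the released files.
line = json.dumps({
    "dataset_id": "mscoco_val2017",
    "image_id": "137576",
    "annotator_id": 93,
    "caption": "In this image I can see dried grass and a woman.",
    "timed_caption": [{"utterance": "In this image",
                       "start_time": 0.0, "end_time": 0.9}],
    "traces": [[{"x": 0.41, "y": 0.55, "t": 0.12},
                {"x": 0.43, "y": 0.56, "t": 0.18}]],
})

def load_narratives(lines):
    """Parse JSON Lines into dicts, one localized narrative each."""
    return [json.loads(l) for l in lines]

rec = load_narratives([line])[0]
print(rec["caption"])
print(len(rec["traces"][0]), "trace points in the first segment")
```

In practice you would stream the downloaded `.jsonl` files line by line instead of building records in memory; the per-word timing in `timed_caption` plus the normalized trace coordinates are what let you join words to image regions.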

Project idea 😋

Google has annotated all these datasets and made them available for public use. However, if your data does not fall in the image categories of these popular datasets, there is, as of now, no way to generate localized narratives datasets for task-specific use. This presents an opportunity to create an annotation application that can help an annotator generate such datasets. I’m hoping someone reading this goes ahead and builds one. If you do, let me know, and be sure to make it open source.


If you are interested in reading more about localized narratives, the research paper is available here. A video by the author describing how it works is available here. All images in this post were sourced from the research paper, and all rights belong to their respective owners.


Peace✌.️


Translated from: https://medium.com/swlh/localized-narratives-the-latest-and-greatest-in-image-captioning-by-google-8e47fc3dfa24
