Both AI and interactive storytelling are complex and unpredictable systems. As we deepened Marrow’s design process, the challenge of combining those two systems into one coherent experience became apparent. On the one hand, as the authors, we developed AI systems and real-time interactions to lead the flow of the experience. On the other hand, we wished to tell a story that also provokes the participants’ imagination and emotions.
Marrow is a story about the possibility of mental illness in machine learning models, focusing mainly on Generative Adversarial Networks (GANs). We question what kind of mental disorders could emerge in advanced AIs, and invite participants to play in an interactive theater scene operated by a GAN. Together they play as one dysfunctional family of AIs. Since we are dealing with very abstract and complex concepts, we wanted to explore multiple ways to communicate the story, beyond just dialogue between the family members. Our tactic was to make the room more ‘alive,’ reflecting the embodied models’ mental state. We wanted to dissolve the barriers between the participants and the environment, and to slowly immerse them in an unfamiliar, magical experience within that room. The room and the dinner scene were an invitation to let go and indulge in an emotional affair with three other strangers.
In practice, this meant that we had to implement GAN networks that frequently interacted with the environment and with the participants. Since GAN’s training process does not happen in real-time, this became a challenge of manipulating the output of a pre-trained GAN network in response to real-time changes in the environment. To explain our solution, we first need to look at the difference between standard GANs and conditional GANs.
Standard vs. Conditional GANs
In its basic form, a GAN is trained to produce new images that are visually similar to the training set. If we used a dataset of faces, it would generate new faces. If we trained it on cats, it would render new cats. It maintains variability (not producing the same image every time) by taking an input ‘noise’ vector (essentially a series of random numbers) and using it as the basis for the output image. Thus, if we want to connect the GAN’s output to changes in the environment, we need to manipulate the noise vector based on those changes. However, as we showed in our previous post, there is hardly any control over what kind of change would emerge as a result of changing the noise vector.
We could link different variables in the physical room (such as the participants’ position, the position of objects, and mood analysis of the participants) to the generated output, but the lack of precise control over the output results in a tenuous connection to the environment.
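For illustration, here is a minimal sketch in PyTorch of what driving an unconditional GAN from room data could look like. The generator, its dimensions, and the room variables are hypothetical placeholders, not the Marrow code.

```python
import torch

LATENT_DIM = 128

class Generator(torch.nn.Module):
    """Stand-in for a pre-trained generator: noise vector -> RGB image tensor."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(LATENT_DIM, 3 * 64 * 64), torch.nn.Tanh())

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

generator = Generator().eval()   # in practice, load pre-trained weights here

def room_state_to_noise(participant_x, participant_y, mood):
    """Fold a few room measurements into the first noise dimensions;
    the remaining dimensions stay random so the output keeps varying."""
    z = torch.randn(1, LATENT_DIM)
    z[0, 0] = participant_x * 2.0 - 1.0   # normalized position -> [-1, 1]
    z[0, 1] = participant_y * 2.0 - 1.0
    z[0, 2] = mood                        # e.g. a sentiment score in [-1, 1]
    return z

with torch.no_grad():
    frame = generator(room_state_to_noise(0.4, 0.7, -0.3))
# As noted above, there is little control over *what* changes in `frame`
# when these values shift, only that something will.
```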
That is where conditional GANs enter the picture. Instead of training on one set of images, we train the network on pairs consisting of an image and a label (a numerical input), conditioning it to generate one type of image when presented with a specific kind of label. That grants the user full control over how the GAN generates its output for a particular input. The result still varies along with the noise vector, as in the original GAN. However, now the author can create meaningful interactions with the environment. One of the most famous conditional GANs is Pix2Pix.
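As a hedged illustration (not the Marrow implementation), a conditional generator simply receives the label alongside the noise, for example through a learned embedding; the class names and dimensions below are made up.

```python
import torch

LATENT_DIM, NUM_CLASSES, EMBED_DIM = 128, 4, 32
CLASSES = {"mother": 0, "father": 1, "child": 2, "table": 3}  # hypothetical labels

class ConditionalGenerator(torch.nn.Module):
    """Toy conditional generator: (noise, label) -> RGB image tensor."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(NUM_CLASSES, EMBED_DIM)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(LATENT_DIM + EMBED_DIM, 3 * 64 * 64), torch.nn.Tanh())

    def forward(self, z, label):
        cond = self.embed(label)                        # label -> learned embedding
        return self.net(torch.cat([z, cond], dim=1)).view(-1, 3, 64, 64)

g = ConditionalGenerator()
z = torch.randn(1, LATENT_DIM)
label = torch.tensor([CLASSES["table"]])
image = g(z, label)   # same label with different z -> varied images of one concept
```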
It is a general-purpose image-to-image translator. It can be conditioned on any type of image to generate another. It analyzes the pixels in both images, learning how to convert from one color to another. Pix2Pix is used in a variety of ways, such as transforming sketches into paintings and colormaps into photos. We also used it in our prototype to convert a human’s colored pose analysis into a generated person, trained on stock images of families.
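To give a sense of what that prototype input might look like, here is a small sketch (not the actual Marrow code) of drawing a colored pose map from body keypoints. The keypoints are hard-coded here, whereas in practice they would come from a pose-estimation step; the colors are arbitrary labels that the paired Pix2Pix training learns to associate with body parts.

```python
from PIL import Image, ImageDraw

keypoints = {"head": (128, 40), "neck": (128, 70), "hip": (128, 150),
             "l_hand": (80, 120), "r_hand": (176, 120)}          # toy values
limbs = [("head", "neck", (255, 0, 0)), ("neck", "hip", (0, 255, 0)),
         ("neck", "l_hand", (0, 0, 255)), ("neck", "r_hand", (255, 255, 0))]

canvas = Image.new("RGB", (256, 256), "black")
draw = ImageDraw.Draw(canvas)
for a, b, color in limbs:
    draw.line([keypoints[a], keypoints[b]], fill=color, width=8)
canvas.save("pose_colormap.png")   # this image becomes the Pix2Pix input
```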
GauGAN
Pix2Pix’s strength, being a generic translator from any image to any image, is also its weakness. Relying only on color misses out on metadata that one could feed into the network. The algorithm looks only at shapes and colors. It cannot differentiate between a dinner plate and a flying saucer if they look visually similar in the photo. That is what the researchers at NVIDIA addressed when they created GauGAN. Named after the post-Impressionist painter Paul Gauguin, GauGAN also creates realistic images from colormaps. However, instead of learning pixel values, it learns the semantic data of the image. The project is also known as SPADE: Semantic Image Synthesis with Spatially-Adaptive Normalization. Instead of learning where green and blue are in the picture, GauGAN learns where there are grass and sky. That is possible because the images used in the training set, such as the generic database COCO-Stuff, contain semantic classifications of the different elements in the picture. The researchers were then able to demonstrate GauGAN’s capability by crafting an interactive painting tool where colors are not just colors but carry meanings. When you paint green into the source sketch, you are telling GauGAN that here lies grass. Try it yourself here.
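A minimal sketch of the kind of input GauGAN/SPADE consumes, as opposed to Pix2Pix’s plain colors: a per-pixel map of class IDs, typically one-hot encoded before it enters the generator. The class IDs below are illustrative, not the exact COCO-Stuff values, and the generator call is left as a placeholder.

```python
import numpy as np
import torch
import torch.nn.functional as F

SKY, GRASS, SEA = 156, 123, 154          # illustrative COCO-Stuff-style class IDs
H, W, NUM_CLASSES = 256, 256, 182

label_map = np.full((H, W), GRASS, dtype=np.int64)
label_map[: H // 2, :] = SKY             # top half of the "painting" is sky
label_map[3 * H // 4 :, :] = SEA         # a band of sea along the bottom

# SPADE takes the label map as a one-hot tensor and normalizes its feature
# maps spatially according to it ("spatially-adaptive normalization").
one_hot = F.one_hot(torch.from_numpy(label_map), NUM_CLASSES)
one_hot = one_hot.permute(2, 0, 1).unsqueeze(0).float()   # (1, C, H, W)
# image = spade_generator(one_hot)       # pre-trained SPADE generator (not shown)
```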
Connecting GauGAN to a real-time 360 environment
GauGAN can generate photorealistic images from hand-drawn sketches. Our goal was to have it interact with a real-time physical environment. Solving this was like putting together pieces of a puzzle:
1. We know that NVIDIA trained GauGAN on semantic data: they used the DeepLab v2 network to analyze the COCO-Stuff database and produce labels.
2. We know that DeepLab v2 can segment a camera stream in real-time.
3. 1+2: If we feed DeepLab’s output of a camera stream directly to GauGAN, we should get its mirrored state of reality.
The code itself was relatively straightforward and mostly had to do with format conversions between the two networks. We also upgraded DeepLab’s webcam code to stream from our 360 camera: RICOH THETA Z1. The segmentation networks are so robust that we could feed the widened stitched image straight to segmentation and generation. The result was surprisingly accurate.
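For reference, here is a simplified sketch of that loop, not the repository code: torchvision’s DeepLab v3 stands in for the DeepLab v2/COCO-Stuff model we actually used, a local webcam stands in for the stitched THETA stream, and the SPADE generation call is left as a placeholder.

```python
import cv2
import torch
import torch.nn.functional as F

segmenter = torch.hub.load("pytorch/vision", "deeplabv3_resnet50",
                           weights="DEFAULT").eval()
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
NUM_CLASSES = 182   # COCO-Stuff classes expected by SPADE; the stand-in
                    # segmenter only outputs 21 VOC classes, which we gloss over

cap = cv2.VideoCapture(0)               # stitched 360 stream in the installation
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        labels = segmenter((x - mean) / std)["out"].argmax(1)   # (1, H, W) IDs
    one_hot = F.one_hot(labels, NUM_CLASSES).permute(0, 3, 1, 2).float()
    # mirrored = spade_generator(one_hot)   # GAN's "mirror" of the room (not shown)
```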
Manipulating GAN’s reality
We now had a generated mirror image, depicting GAN’s (COCO-Stuff) version of whatever the camera was witnessing in the room. But we wanted more; we wanted a space that changes according to the story and resembles the characters’ state of mind. We looked for ways to generate visuals that would connect to the story-world, to find meanings in between the words and lure the participants into continuing to act: moving objects around, watching the reflection, and wondering what this is all about.
We realized that we could interfere in the process of perception and generation. Right after DeepLab labels the camera stream, why not replace those labels with something else? For example, let’s map any recognized bowl to a sea.
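A sketch of that interference step, with illustrative class IDs: rewrite selected classes in the label map after segmentation and before it reaches GauGAN.

```python
import numpy as np

BOWL, SEA = 51, 154                      # illustrative COCO-Stuff-style class IDs

def remap_labels(label_map: np.ndarray, mapping: dict) -> np.ndarray:
    """Replace every pixel of a source class with a target class."""
    out = label_map.copy()
    for src, dst in mapping.items():
        out[label_map == src] = dst
    return out

toy_scene = np.random.randint(0, 182, (256, 256))   # stand-in for DeepLab output
dreamed = remap_labels(toy_scene, {BOWL: SEA})       # every bowl becomes a sea
```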
We started looking for patterns that our characters’ stories could surface and that the physical space could support in visual form: a face, a landscape, an object, a flower. Stories are recognizable patterns, and in those patterns, we find meaning. They are the signal within the noise.
When we finally got to the lab space to test it all, we discovered the effect of the physical setting. We started playing by arranging (and rearranging) strange elements and exploring the results we could achieve. We developed a scripting platform that lets us easily map objects to other objects. We could mask certain objects from the scene, select multiple objects at once, or invert the selection to map everything other than the objects specified. For example: ‘dinner table,’ ‘table,’ ‘desk,’ ‘desk stuff,’ ‘floor,’ ‘bed,’ ‘car’ suddenly became the same item and were mapped into a sea, while everything else was discarded, even though we didn’t have a car, plastic, or bed in the space. Or ‘frisbee,’ ‘paper,’ ‘mouse,’ ‘metal,’ ‘rock,’ ‘bowl,’ ‘wine glass,’ ‘bottle’ all mapped to ‘rock.’ Again, it is interesting to note that we didn’t have a mouse, frisbee, metal, rock, or paper in the real scene, but the network detected them. Therefore, we needed to consider them as well.
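A sketch of the scripting idea; the rule schema, class names, and IDs are illustrative, not the project’s actual configuration format.

```python
import numpy as np

VOID = 0   # "unlabeled"/discarded class (illustrative ID)

def apply_rule(label_map, ids, sources, target, invert=False, discard_rest=False):
    """Map every pixel of the source classes to the target class.
    `invert` maps everything *except* the sources instead;
    `discard_rest` blanks out all pixels the rule did not touch."""
    selected = np.isin(label_map, [ids[n] for n in sources])
    if invert:
        selected = ~selected
    out = label_map.copy()
    out[selected] = ids[target]
    if discard_rest:
        out[~selected] = VOID
    return out

ids = {"dining table": 67, "desk": 69, "floor": 114, "bed": 59, "car": 3,
       "sea": 154}                                 # illustrative class IDs
scene = np.random.randint(0, 182, (256, 256))      # toy stand-in for DeepLab output

# 'Dining table', 'desk', 'floor', 'bed', and 'car' all become one sea,
# and everything else in the frame is discarded.
dreamed = apply_rule(scene, ids,
                     sources=["dining table", "desk", "floor", "bed", "car"],
                     target="sea", discard_rest=True)
```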
If that wasn’t enough, we discovered that changes in the lights, shadows, and camera angle generated different labels every time, which messed up our mapping. In an interactive storytelling framework, this felt both incredible and horrific. We had a little less than ten days before the opening to refine the space and debug the technology, while understanding the range of possibilities we could create with what we had just developed.
We played together with our network; with little control over the visuals, we looked to visualize the story of our characters’ inner world.
Slowly, we started to learn the system: what works, what doesn’t, how to clean the scene, how to stabilize the lighting. We also decided to project both stages of the process: DeepLab’s colored segmentation analysis and GAN’s generated output. Gradually, the physical environment became more immersive and could link with the words of the story.
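A small sketch of composing those two projected stages side by side; the random palette stands in for the piece’s actual colors, and the display layout was of course specific to the installation.

```python
import cv2
import numpy as np

palette = np.random.RandomState(0).randint(0, 255, (182, 3), dtype=np.uint8)

def side_by_side(label_map: np.ndarray, generated_bgr: np.ndarray) -> np.ndarray:
    """Colored segmentation on the left, GAN output on the right."""
    seg_vis = palette[label_map]                            # (H, W, 3) colored labels
    seg_vis = cv2.resize(seg_vis, generated_bgr.shape[1::-1])
    return np.hstack([seg_vis, generated_bgr])

# cv2.imshow("Marrow projection", side_by_side(labels, generated))
```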
Reflections
- The pre-trained SPADE/GauGAN network generates images at a low 256x256 resolution. It was hard to engage people with these kinds of visuals and make them understand what they were seeing. Achieving a higher resolution would have required us to invest more resources into training, which wasn’t possible at the time.
- Because GauGAN is semantically aware, the context of images matters a lot. For example, mapping a desk to a sea, while leaving the concrete wall in the background, generates a murky lake or a pond. But map the wall into a blue sky, and now the sea looks more like an ocean.
- Because of this context-awareness, it was also hard to convey meaning with isolated objects. The images usually looked best when we showed them in their entirety.
While we still feel that there is a lot more room for experimentation and polishing the images around our story, the results give us the first glimpse of GAN’s “consciousness” as a perceiving entity that generates its inner world. Such a process resonates with the philosophy of human consciousness.
Immanuel Kant’s transcendental philosophy speaks of the act of synthesis: Our representations act together to mold one unified consciousness. In modern neuroscience, we speak of the Neural Correlates of Consciousness that describe the neural activity required for consciousness, not as a discrete feedforward mechanism of object recognition, but a long sustained feedback wave of a unified experience. That is also the type of experience we wished to design in Marrow’s room, where the final ‘editing’ happens in the participant’s mind.
One thing we are sure will not bring harm to this creative work is for more people to use it. You will not know what you are doing unless you make it many times, especially in this complicated type of project. Just make, make, make.
Here is the project’s Open Source GitHub repository. Please share with us what you are making and thinking!
The development phase was done in collaboration with sound artist Philippe Lambert and animator Paloma Dawkins, in co-production with NFB Interactive and Atlas V.