Listening to the Pixels

INSIDE AI

Human perception is multidimensional: a balanced combination of hearing, vision, smell, touch, and taste. Recently, many research efforts have tried to improve machine perception by transitioning from single-modality learning to multimodality learning. If you are wondering what a modality is, it is a single independent channel of sensory input/output between a computer and a human (vision is one modality, audio is another). In this blog, we will talk about using audio and visual information (the two most important perceptual modalities in our daily life) to make machine perception smarter without using any labeled data, i.e., through self-supervision.

What is the problem we are solving?

When you hear the voice of a person you know, can you recall their face? Or can you recall a person's voice upon seeing their face? This shows how humans can "hear faces" and "see voices" by cultivating a mental picture or an acoustic memory of a person. The question is: can you teach a machine to do that?

Our world generates a rich source of auditory and visual signals. The visual signals are a result of light reflections, whereas the sounds originate from object motions and vibrations of the surrounding air. Often correlated at the time of naturally occurring events, these two modalities combine to jointly affect human perception. In response to this perceptual input, humans show a remarkable ability to connect and integrate signals from these two modalities. As a matter of fact, the interplay among senses is one of the most ancient schemes that explains how the human brain's sensory organization works to understand the complex interactions of the physical world. Inspired by our capability of interpreting sound sources from how objects move visually, we can create learning models that learn to perform this interpretation on their own.


While auditory scene analysis has mostly been studied in the fields of environmental sound source separation and recognition, the natural synchronization between sound and vision can provide a rich self-supervisory signal for grounding auditory signals in visual signals, which is all we need for self-supervision to work its magic.

In this blog, we will learn how to leverage this cross-modal context as a self-supervisory signal to extract information beyond the limits established by individual modalities. We will acknowledge the importance of temporal features that are based on significant changes in each modality and design a probabilistic formalism that can identify temporal coincidences between these features to yield visual localization and cross-modal association.


The intuitive solution

The most intuitive solution that comes to mind is to design a probabilistic formalism that can exploit the inherent coherence of audio-visual signals from large quantities of unlabelled videos to learn sound localization and separation. This can be done with a computational model that learns the relationship between visuals and sounds in an unsupervised way by recognizing objects from the sounds they make, localizing them in images, and separating the audio component coming from each object. With this inspiration in mind, many researchers have developed models that can effectively perform sound localization and sound recognition. We will also work our way towards one such solution, which performs sound source separation and visual localization by distinguishing the components of a sound and their association with the corresponding objects.

The solution we will work on is two-fold. First, we will use a simple architecture that relies on static visual information to learn the cross-modal context. Next, we will take a step further and include the motion cues of the video in our solution. The motion signals are of crucial importance for learning audio-visual correspondences. This becomes clear if we consider sound production from two similar-looking objects: imagine two artists playing a violin duet. From a single picture alone, it is impossible even for humans to separate the melody from the harmony. However, if we observe the movements of the artists for a while and try to match these motion cues with the musical beats, we can probably make a reasonable guess based on this motion-beat correspondence. This illustrates how important the temporal repetition of motion is for the complex multi-modal reasoning behind sound source separation, even for humans. Our aim is to computationally mimic this ability to reason about the synergy between audio, visual, and motion signals.

[Figure: pixel-level sound embedding visualizations from the model (source)]

Computational models of this relationship can serve as a fundamental building block for many applications, such as combining videos with automatically generated ambient sound for better immersion in VR, or enabling equal accessibility by linking sound with visual signals for visually impaired people.

The Approaches

For our initial approach, we will construct a three-component network, as suggested in [1], that processes video frames and audio signals separately and then jointly processes their features in an audio synthesizer network.


The first component, the Video Analysis Network (VAN), takes the video frames as input and extracts appearance features. For the feature extraction part, we will use a dilated ResNet-18 model with an input size of T×H×W×3 and an output stride of 16, followed by a temporal max-pooling layer that outputs a K-channel feature map. In the code snippet below, you can find the PyTorch code for a VAN.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from functools import partial


class DilatedResnet18MaxPool(nn.Module):
    def __init__(self, fc_dim=64, conv_size=3):
        super(DilatedResnet18MaxPool, self).__init__()


        orig_resnet = torchvision.models.resnet18(pretrained=True)
        orig_resnet.layer4.apply(
                partial(self._nostride_dilate, dilate=2))


        self.features = nn.Sequential(
            *list(orig_resnet.children())[:-2])


        self.fc = nn.Conv2d(
            512, fc_dim, kernel_size=conv_size, padding=conv_size//2)


    def _nostride_dilate(self, m, dilate):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            # the convolution with stride
            if m.stride == (2, 2):
                m.stride = (1, 1)
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate//2, dilate//2)
                    m.padding = (dilate//2, dilate//2)
            # other convolutions
            else:
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate, dilate)
                    m.padding = (dilate, dilate)


    def forward(self, x, pool=True):
        x = self.features(x)
        x = self.fc(x)


        if not pool:
            return x


        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), x.size(1))
        return x


    def forward_multi(self, x, pool=True):
        (B, C, T, H, W) = x.size()
        x = x.permute(0, 2, 1, 3, 4).contiguous()
        x = x.view(B*T, C, H, W)


        x = self.features(x)
        x = self.fc(x)


        (_, C, H, W) = x.size()
        x = x.view(B, T, C, H, W)
        x = x.permute(0, 2, 1, 3, 4)


        if not pool:
            return x


        x = F.adaptive_max_pool3d(x, 1)
        x = x.view(B, C)
        return x

The second component, the Audio Analysis Network (AAN), takes the sound mixture as input and applies the Short-Time Fourier Transform (STFT) on a log-frequency scale to obtain a sound spectrogram. The spectrogram is then fed to a U-Net that yields K feature maps representing different components of the input audio mixture. In the code snippet below, you can find the PyTorch code for an AAN.

import torch
import torch.nn as nn
import torch.nn.functional as F


# Code from Hang Zhao (@hangzhaomit)
class Unet(nn.Module):
    def __init__(self, fc_dim=64, num_downs=5, ngf=64, use_dropout=False):
        super(Unet, self).__init__()


        # construct unet structure
        unet_block = UnetBlock(
            ngf * 8, ngf * 8, input_nc=None,
            submodule=None, innermost=True)
        for i in range(num_downs - 5):
            unet_block = UnetBlock(
                ngf * 8, ngf * 8, input_nc=None,
                submodule=unet_block, use_dropout=use_dropout)
        unet_block = UnetBlock(
            ngf * 4, ngf * 8, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            ngf * 2, ngf * 4, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            ngf, ngf * 2, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            fc_dim, ngf, input_nc=1,
            submodule=unet_block, outermost=True)


        self.bn0 = nn.BatchNorm2d(1)
        self.unet_block = unet_block


    def forward(self, x):
        x = self.bn0(x)
        x = self.unet_block(x)
        return x




# Defines the submodule with skip connection.
# X -------------------identity---------------------- X
#   |-- downsampling -- |submodule| -- upsampling --|
class UnetBlock(nn.Module):
    def __init__(self, outer_nc, inner_input_nc, input_nc=None,
                 submodule=None, outermost=False, innermost=False,
                 use_dropout=False, inner_output_nc=None, noskip=False):
        super(UnetBlock, self).__init__()
        self.outermost = outermost
        self.noskip = noskip
        use_bias = False
        if input_nc is None:
            input_nc = outer_nc
        if innermost:
            inner_output_nc = inner_input_nc
        elif inner_output_nc is None:
            inner_output_nc = 2 * inner_input_nc


        downrelu = nn.LeakyReLU(0.2, True)
        downnorm = nn.BatchNorm2d(inner_input_nc)
        uprelu = nn.ReLU(True)
        upnorm = nn.BatchNorm2d(outer_nc)
        upsample = nn.Upsample(
            scale_factor=2, mode='bilinear', align_corners=True)


        if outermost:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3, padding=1)


            down = [downconv]
            up = [uprelu, upsample, upconv]
            model = down + [submodule] + up
        elif innermost:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3,
                padding=1, bias=use_bias)


            down = [downrelu, downconv]
            up = [uprelu, upsample, upconv, upnorm]
            model = down + up
        else:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3,
                padding=1, bias=use_bias)
            down = [downrelu, downconv, downnorm]
            up = [uprelu, upsample, upconv, upnorm]


            if use_dropout:
                model = down + [submodule] + up + [nn.Dropout(0.5)]
            else:
                model = down + [submodule] + up


        self.model = nn.Sequential(*model)


    def forward(self, x):
        if self.outermost or self.noskip:
            return self.model(x)
        else:
            return torch.cat([x, self.model(x)], 1)
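
The U-Net above operates on an already-computed spectrogram. As a rough illustration of the preprocessing step described earlier (an STFT magnitude re-sampled on a log-frequency axis), here is a minimal sketch; the function name, STFT parameters, and the grid_sample-based warping are my own illustrative choices, not the exact pipeline from the paper:

import math

import torch
import torch.nn.functional as F


def log_freq_spectrogram(wav, n_fft=1022, hop=256, out_bins=256):
    """A minimal sketch: STFT magnitude re-sampled on a log-frequency axis."""
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag = spec.abs().unsqueeze(0).unsqueeze(0)            # (1, 1, F, T)
    n_freq, n_frames = mag.shape[-2:]

    # Sampling grid: linear in time, logarithmic in frequency, in [-1, 1].
    t = torch.linspace(-1.0, 1.0, n_frames)
    f = torch.logspace(0.0, math.log10(n_freq), out_bins) / n_freq   # (0, 1]
    f = 2.0 * f - 1.0
    grid_x = t.view(1, n_frames).expand(out_bins, n_frames)
    grid_y = f.view(out_bins, 1).expand(out_bins, n_frames)
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)        # (1, out_bins, T, 2)

    return F.grid_sample(mag, grid, align_corners=True)[0, 0]        # (out_bins, T)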

The third component, the Audio Synthesizer Network (ASN), takes the extracted pixel-level appearance features and audio features as input and predicts a vision-based binary spectrogram mask. The number of predicted binary masks depends on the number of sound sources to be separated from the input mixture. These binary masks are multiplied with the input spectrogram to separate each sound component; each predicted magnitude is then combined with the phase of the input and passed through an inverse STFT to retrieve the final audio waveform. In the code snippet below, you can find the PyTorch code for an ASN.

import torch
import torch.nn as nn
import torch.nn.functional as F


# Code from Hang Zhao (@hangzhaomit)
class InnerProd(nn.Module):
    def __init__(self, fc_dim):
        super(InnerProd, self).__init__()
        self.scale = nn.Parameter(torch.ones(fc_dim))
        self.bias = nn.Parameter(torch.zeros(1))


    def forward(self, feat_img, feat_sound):
        sound_size = feat_sound.size()
        B, C = sound_size[0], sound_size[1]
        feat_img = feat_img.view(B, 1, C)
        z = torch.bmm(feat_img * self.scale, feat_sound.view(B, C, -1)) \
            .view(B, 1, *sound_size[2:])
        z = z + self.bias
        return z


    def forward_nosum(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        feat_img = feat_img.view(B, C)
        z = (feat_img * self.scale).view(B, C, 1, 1) * feat_sound
        z = z + self.bias
        return z


    # inference purposes
    def forward_pixelwise(self, feats_img, feat_sound):
        (B, C, HI, WI) = feats_img.size()
        (B, C, HS, WS) = feat_sound.size()
        feats_img = feats_img.view(B, C, HI*WI)
        feats_img = feats_img.transpose(1, 2)
        feat_sound = feat_sound.view(B, C, HS * WS)
        z = torch.bmm(feats_img * self.scale, feat_sound) \
            .view(B, HI, WI, HS, WS)
        z = z + self.bias
        return z




class Bias(nn.Module):
    def __init__(self):
        super(Bias, self).__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        # self.bias = nn.Parameter(-torch.ones(1))


    def forward(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        feat_img = feat_img.view(B, 1, C)
        z = torch.bmm(feat_img, feat_sound.view(B, C, H * W)).view(B, 1, H, W)
        z = z + self.bias
        return z


    def forward_nosum(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        z = feat_img.view(B, C, 1, 1) * feat_sound
        z = z + self.bias
        return z


    # inference purposes
    def forward_pixelwise(self, feats_img, feat_sound):
        (B, C, HI, WI) = feats_img.size()
        (B, C, HS, WS) = feat_sound.size()
        feats_img = feats_img.view(B, C, HI*WI)
        feats_img = feats_img.transpose(1, 2)
        feat_sound = feat_sound.view(B, C, HS * WS)
        z = torch.bmm(feats_img, feat_sound) \
            .view(B, HI, WI, HS, WS)
        z = z + self.bias
        return z
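
To see how the synthesizer fits in at inference time, here is a rough usage sketch built on the InnerProd module above: the synthesizer output is thresholded into a binary mask, applied to the mixture magnitude, recombined with the mixture phase, and inverted back to a waveform. The tensor shapes, the 0.5 threshold, and the STFT parameters are illustrative assumptions, not values from the original implementation:

import torch

B, C, F_bins, T_frames = 1, 64, 256, 256
# Dummy stand-ins for outputs of the networks above (shapes are illustrative).
feat_img = torch.randn(B, C)                                     # pooled VAN feature
feat_sound = torch.randn(B, C, F_bins, T_frames)                 # AAN feature maps
spec_mix = torch.randn(B, F_bins, T_frames, dtype=torch.cfloat)  # mixture STFT

synthesizer = InnerProd(fc_dim=C)
logits = synthesizer(feat_img, feat_sound)                       # (B, 1, F, T)
mask = (torch.sigmoid(logits) > 0.5).float().squeeze(1)          # vision-based binary mask

# Apply the mask to the mixture magnitude and reuse the mixture phase.
spec_sep = torch.polar(spec_mix.abs() * mask, torch.angle(spec_mix))

# Back to the time domain with an inverse STFT (parameters are illustrative).
n_fft = 2 * (F_bins - 1)
wav = torch.istft(spec_sep, n_fft=n_fft, hop_length=n_fft // 4,
                  window=torch.hann_window(n_fft))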

Now, as I mentioned earlier, this solution might not be enough for separating sounds coming from visually similar objects, as the appearance features may get fooled during the synthesis phase. Therefore, we need another network to analyze the motion of the sound-producing objects. This additional module was proposed by Zhao et al. [2].

The fourth component, the Motion Analysis Network (MAN), takes the video frames as input and predicts a dense trajectory feature map in three major steps. In the first step, we can use a dense optical flow estimator like PWC-Net (lightweight design and fast speed) to extract dense optical flow vectors for the input frames. In the next step, the network uses the extracted dense optical flow to predict dense trajectories. To understand this in basic terms, let's assume a pixel's spatial location at time t is I_t = (x_t, y_t) and the dense optical flow is ω_t = (u_t, v_t). Then for time t+1, the estimated position will be I_(t+1) = (x_(t+1), y_(t+1)) = (x_t, y_t) + ω_t(x_t, y_t). The concatenation of these estimated coordinates (I_t, I_(t+1), I_(t+2), …) is the full trajectory of a pixel. In the third step, the estimated dense trajectories are fed to a CNN model to extract deep features of these trajectories. The choice of CNN is not fixed and can be arbitrary. Zhao et al. [2] propose to use an I3D model, which is well known for capturing spatiotemporal features. I3D has a compact design that inflates a 2D CNN into 3D to bootstrap 3D filters from pre-trained 2D filters.
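To make the trajectory-propagation step more concrete, here is a minimal PyTorch sketch that accumulates dense trajectories from precomputed optical flow (the flow itself is assumed to come from an estimator such as PWC-Net; the function name and tensor layout are illustrative assumptions):

import torch
import torch.nn.functional as F


def accumulate_trajectories(flows):
    """A minimal sketch of dense-trajectory accumulation from precomputed flow.

    flows: (T, 2, H, W) forward optical flow between consecutive frames.
    Returns (T + 1, 2, H, W) with the tracked (x, y) position of every pixel
    at each time step.
    """
    T, _, H, W = flows.shape
    # Starting positions: the pixel grid itself, in (x, y) order.
    xs = torch.arange(W, dtype=torch.float32).view(1, W).expand(H, W)
    ys = torch.arange(H, dtype=torch.float32).view(H, 1).expand(H, W)
    pos = torch.stack([xs, ys], dim=0)                   # (2, H, W)
    trajectory = [pos]
    for t in range(T):
        # Sample the flow at the current (possibly sub-pixel) positions;
        # grid_sample expects normalized coordinates in [-1, 1].
        norm_x = 2.0 * pos[0] / (W - 1) - 1.0
        norm_y = 2.0 * pos[1] / (H - 1) - 1.0
        grid = torch.stack([norm_x, norm_y], dim=-1).unsqueeze(0)    # (1, H, W, 2)
        flow_at_pos = F.grid_sample(flows[t:t + 1], grid, align_corners=True)[0]
        pos = pos + flow_at_pos                          # I_(t+1) = I_t + w_t(I_t)
        trajectory.append(pos)
    return torch.stack(trajectory, dim=0)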


The question that still remains unanswered is how to incorporate these trajectory features into our initial model framework. To do so, we first have to fuse them with the appearance features generated by the first component (VAN). A simple way to do this fusion is to extract an attention map from the appearance features by convolving them down to a single channel and activating the values with a sigmoid function to obtain a spatial attention map. This attention map can then be multiplied with the trajectory features to focus only on important trajectories, followed by the concatenation of the appearance and trajectory features. After this step, we can either use these fused features directly in place of the old appearance features, or take it a step further and align the visual and sound features in time by applying Feature-wise Linear Modulation (FiLM) on the sound features and feeding the fused result into the Audio-U-Net decoder (as suggested by Zhao et al. [2]). In the second case (using FiLM), we no longer need the audio synthesizer network and can rewrite the U-Net decoder to directly predict the binary masks.
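A minimal sketch of the attention-based fusion described above might look like the following; the channel sizes and the 1×1 convolution are illustrative assumptions rather than the authors' exact configuration:

import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """A minimal sketch of the appearance/trajectory fusion described above."""

    def __init__(self, app_channels=64):
        super().__init__()
        # Collapse the appearance features to a single-channel spatial attention map.
        self.attn_conv = nn.Conv2d(app_channels, 1, kernel_size=1)

    def forward(self, appearance, trajectory):
        # appearance: (B, C_app, H, W), trajectory: (B, C_traj, H, W)
        attn = torch.sigmoid(self.attn_conv(appearance))        # (B, 1, H, W)
        weighted_traj = attn * trajectory                       # keep only important trajectories
        fused = torch.cat([appearance, weighted_traj], dim=1)   # (B, C_app + C_traj, H, W)
        return fused


# e.g. fused = AttentionFusion(app_channels=64)(appearance_feats, trajectory_feats)

The fused feature map can then either replace the appearance features fed to the synthesizer or be aligned with the sound features via FiLM, as discussed above.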

The Self-supervised framework

In this blog section, we will discuss two major training frameworks that are necessary for training a model to learn the cross-modal context in a self-supervised way.


Mix and Separate Framework (MSF)


The mix-and-separate training framework is a procedure that artificially creates a complex auditory scene for the model under training. MSF forces the model to analyze randomly generated complex auditory scenes and sets up a situation in which it has to separate and ground the mixed sounds. Since the generated mixtures are not directly present in the training data, MSF also acts as an automatic data augmentation scheme. It leverages the fact that audio signals are additive, so we can mix sounds from different video samples to generate a complex auditory signal for the model input. At the same time, the framework creates a self-supervised learning objective: using the visual input associated with the sound mixture, the model must separate the mixture and restore each sound to the original waveform it had before the addition. In practice, we randomly sample N video clips from the training set and, in the simplest case, mix the sound components of any two of them, serving the model with the audio mixture and their respective frames. It is important to note that although the training targets are well defined during training, the process is still unsupervised, as we do not use any data labels or data sampling assumptions.
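As a concrete illustration, the sketch below prepares one mix-and-separate training sample from two waveforms. The STFT parameters and the dominance-based binary-mask rule are my own illustrative choices; the actual pipeline also re-samples the spectrogram on a log-frequency scale, which is omitted here:

import torch


def make_mix_and_separate_sample(audio_a, audio_b, n_fft=1022, hop=256):
    """A minimal sketch of mix-and-separate data preparation.

    audio_a, audio_b: mono waveforms (1-D tensors) from two different videos.
    """
    # Audio is additive, so the mixture is simply the sum of the two sources.
    mixture = audio_a + audio_b

    window = torch.hann_window(n_fft)

    def mag(x):
        return torch.stft(x, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True).abs()

    mag_mix = mag(mixture)
    mag_a, mag_b = mag(audio_a), mag(audio_b)

    # Self-supervised targets: binary masks marking where each source dominates.
    mask_a = (mag_a >= mag_b).float()
    mask_b = 1.0 - mask_a
    return mixture, mag_mix, (mask_a, mask_b)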

Curriculum Learning (CL)

By definition, curriculum learning is a training strategy in which the model starts out with only easy examples of a task and the task difficulty is then gradually increased. CL is a smarter sampling technique that can replace the purely random sampling of the MSF. Inspired by the observation that models trained on a single class of instruments suffer from overfitting due to class imbalance, we can use a multi-stage training curriculum that starts by sampling easily separable sound sources. Such a curriculum helps bootstrap the model with a good weight initialization for better convergence on the harder cases.
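A two-stage curriculum could be sampled roughly as follows; the stage boundary and the "different instruments first" rule are illustrative assumptions about what "easily separable" means, not the authors' exact schedule:

import random


def sample_training_pair(solos_by_class, epoch, easy_epochs=20):
    """A minimal sketch of a two-stage curriculum for mix-and-separate training.

    solos_by_class: dict mapping an instrument class to a list of its solo clips.
    In the easy stage we only mix clips from two different classes (easier to
    separate); later we also allow same-class mixtures.
    """
    classes = list(solos_by_class.keys())
    if epoch < easy_epochs:
        cls_a, cls_b = random.sample(classes, 2)      # different instruments only
    else:
        cls_a, cls_b = random.choices(classes, k=2)   # same instrument allowed
    clip_a = random.choice(solos_by_class[cls_a])
    clip_b = random.choice(solos_by_class[cls_b])
    return clip_a, clip_b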

Note: The learning targets (spectrogram masks) can be either binary or ratio masks. For binary masks, we use a per-pixel sigmoid cross-entropy loss; for ratio masks, we use a per-pixel L1 loss. Also, due to possible interference between sources, the values of a ground-truth ratio mask do not necessarily stay within the range [0, 1].
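A minimal sketch of these two objectives, assuming the network outputs raw (pre-activation) mask predictions:

import torch.nn.functional as F


def mask_loss(pred_masks, gt_masks, binary=True):
    """Sketch of the two learning objectives mentioned in the note above.

    pred_masks: raw (pre-activation) mask predictions, e.g. shape (B, N, F, T).
    gt_masks:   ground-truth masks of the same shape (binary or ratio values).
    """
    if binary:
        # Per-pixel sigmoid cross-entropy for binary masks.
        return F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    # Per-pixel L1 loss for ratio masks; no sigmoid, since ratio values
    # can fall outside [0, 1] due to interference between sources.
    return F.l1_loss(pred_masks, gt_masks)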

The Mathematics under the Hood

In deep learning applications, we often tend to rely on the network to learn the mathematical model on its own, but if we peek under the hood, we will find numerous interesting mathematical facts.


In the case of cross-modal association, we assume that each modality generates significant events (onsets). If the generated onsets coincide in time repeatedly (the movement of the guitar strings coincides with a sound), then the two modalities are assumed to be correlated. In mathematical terms, the more coincidences there are, the higher the likelihood of cross-modal correspondence; conversely, if the onset coincidences are few, the cross-modal correspondence likelihood is also low.


To understand the process as a likelihood-matching algorithm, we must assume that all the onsets of each modality are independent and mutually exclusive. Let the video onset be a binary variable V_on and the audio onset be a binary variable A_on (I am using binary values just for the sake of explanation). Now, if we pre-train our network on an optimization (likelihood) function of the form

L = [((A_on)^T ✕ V_on)-(I^T ✕ V_on)]


which increases as the coincidences increase, we can better explain the likelihood maximization behind the cross-modal association. Assuming that the onsets are random variables that are statistically independent of each other and follow the laws of probability, we can say that L = ∏(P^(onset_match) ✕ (1-P)^(onset_mismatch)), or for a single instance, L(i) = P^(onset_match) ✕ (1-P)^(onset_mismatch). Taking the log, we can rewrite this as:

Log(L(i)) = onset_match ✕ log(P) + onset_mismatch ✕ log(1-P)


Finally, an onset_match occurs when V_on and A_on are both 1 or both 0, which gives onset_match = V_on ✕ A_on + (1-V_on) ✕ (1-A_on). Therefore, when our network optimizes for cross-modal correspondence modeling, it is indirectly maximizing the matching likelihood of features from the cross-modal sources.
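As a toy sanity check on the formula above, the snippet below computes this log-likelihood for two binary onset sequences (my own illustrative example, with an assumed coincidence probability P = 0.9):

import numpy as np


def onset_log_likelihood(v_on, a_on, p=0.9):
    """Toy log-likelihood of cross-modal correspondence from binary onsets."""
    v_on, a_on = np.asarray(v_on), np.asarray(a_on)
    # onset_match = 1 when both onsets agree ({1,1} or {0,0}), else 0.
    onset_match = v_on * a_on + (1 - v_on) * (1 - a_on)
    onset_mismatch = 1 - onset_match
    # log L = sum_t [ match_t * log(P) + mismatch_t * log(1 - P) ]
    return np.sum(onset_match * np.log(p) + onset_mismatch * np.log(1 - p))


# A well-synchronized pair scores higher than a mismatched one.
print(onset_log_likelihood([1, 0, 1, 0], [1, 0, 1, 0]))   # ≈ -0.42
print(onset_log_likelihood([1, 0, 1, 0], [0, 1, 0, 1]))   # ≈ -9.21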

Note: Due to the limitation of expressing complex mathematical equations in the blog paragraphs, I have simplified the notations to be easily formattable in paragraph format. “^” stands for power, “T” for matrix transpose, and “I” for the identity matrix.


Conclusion

In this blog, we discussed how to build a system that can learn from unlabeled videos to separate auditory signals and locate them in the visual input. We started with a simple architecture and showed how the initial system can be enhanced to model the cross-modal context more accurately, even when the sound sources are visually similar. To conclude, the desire to understand the world from the human perspective has drawn the attention of the deep learning community to audio-visual learning, and this type of learning will not only help solve many existing problems but will also lay the foundations for the future development of self-supervised learning and its applications to real-world problems.

My blogs are a reflection of what I worked on and simply convey my understanding of these topics. My interpretation of deep learning can be different from yours, but my interpretation can only be as inerrant as I am.

Translated from: https://towardsdatascience.com/listening-to-the-pixels-6d929b76eeb7
