What Does Matrix Similarity Represent? Representational Similarity Analysis

This post explores the concept of matrix similarity in data science: it measures how similar two matrices are and is an important analysis tool in machine learning and artificial intelligence. Representational similarity analysis can reveal hidden patterns and relationships in the structure of data.


TL;DR: In today’s blog post we discuss Representational Similarity Analysis (RSA), how it might improve our understanding of the brain, as well as recent efforts by Samy Bengio’s and Geoffrey Hinton’s groups to systematically study representations in Deep Learning architectures. So let’s get started!


The brain processes sensory information in a distributed and hierarchical fashion. The visual cortex (the most studied object in neuroscience), for example, sequentially extracts low-to-high level features. Photoreceptors, by way of bipolar and ganglion cells, project to the lateral geniculate nucleus (LGN). From there on a cascade of computational stages sets in. Throughout the different stages of the ventral visual stream ("what", as opposed to the dorsal "how"/"where" stream), V1 → V2 → V4 → IT, the activity patterns become more and more tuned towards the task of object recognition. While neuronal tuning in V1 is mostly associated with rough edges and lines, IT demonstrates more abstract conceptual representational power. This hierarchy has been a big inspiration to the field of computer vision and the development of Convolutional Neural Networks (CNNs).


In the neurosciences, on the other hand, there has been a long-standing history of spatial filter bank models (Sobel filters, etc.) which have been used to study activation patterns in the visual cortex. Until recently, these were the state-of-the-art models of visual perception. This was mainly due to the fact that the computational model had to be somehow compared to brain recordings, so the model space to investigate was severely restricted. Enter: RSA. RSA was first introduced by Kriegeskorte et al. (2008) to bring together the cognitive and computational neuroscience communities. It provides a simple framework to compare different activation patterns (not necessarily in the visual cortex; see figure below). More specifically, fMRI voxel-based GLM estimates or multi-unit recordings can be compared between different conditions (e.g. the stimulus presentation of a cat and a truck). These activation measures are represented as vectors, and we can compute distance measures between such vectors under different conditions. This can be done for many different stimuli, and each pair of conditions fills one entry of the so-called representational dissimilarity matrix (RDM).

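To make the RDM construction concrete, here is a minimal sketch in Python/NumPy. It assumes we already have one activation vector per stimulus condition (e.g. voxel betas or spike counts stacked into a matrix); the variable names are illustrative, not taken from any particular toolbox.

```python
import numpy as np

def representational_dissimilarity_matrix(activations):
    """Compute an RDM from a (n_conditions, n_features) activation matrix.

    Entry (i, j) is the correlation distance (1 - Pearson r) between the
    activation patterns evoked by condition i and condition j.
    """
    activations = np.asarray(activations, dtype=float)
    # np.corrcoef correlates the rows, i.e. the per-condition patterns.
    return 1.0 - np.corrcoef(activations)

# Toy usage: 4 stimulus conditions (e.g. cat, truck, face, house),
# each represented by a 500-dimensional pattern (voxels or units).
rng = np.random.default_rng(0)
patterns = rng.normal(size=(4, 500))
rdm = representational_dissimilarity_matrix(patterns)
print(rdm.shape)  # (4, 4), zeros on the diagonal
```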

[Figure from the original post]

Since the original introduction of RSA, it has received a lot of attention, and many prominent neuroscientists such as James DiCarlo, David Yamins, Niko Kriegeskorte and Radek Cichy have been combining RSA with Convolutional Neural Networks in order to study the ventral visual system. The beauty of this approach is that the dimensionality of the feature vector does not matter, since it is reduced to a single distance value which is then compared between different modalities (i.e. brain and model). Cadieu et al. (2014), for example, claim that CNNs are the best model of the ventral stream. To do so, they extract features from the penultimate layer of an ImageNet-pretrained AlexNet and compare the features with multi-unit recordings of IT in two macaques. In a decoding exercise they find that the AlexNet features have more predictive power than simultaneously recorded V4 activity. Quite an amazing result (in terms of prediction). Another powerful study by Cichy et al. (2016) combined fMRI and MEG to study visual processing through time and space. A CNN knows neither the notion of time nor of tissue, and a layer of artificial neurons can hardly be viewed as analogous to a layer in the neocortex. However, the authors found that the sequence of extracted features mirrored the measured neural activation patterns in space (fMRI) and time (MEG).

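The comparison between a model (e.g. an AlexNet layer) and a brain region then happens at the level of the RDMs themselves: one typically correlates the upper triangles of the two matrices. A minimal sketch, assuming `rdm_brain` and `rdm_model` were built as above and using Spearman rank correlation (one common choice; the cited studies use related variants):

```python
import numpy as np
from scipy.stats import spearmanr

def rsa_score(rdm_a, rdm_b):
    """Second-order similarity: rank-correlate the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)  # exclude the zero diagonal
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho

# e.g. compare IT recordings against features from a pretrained CNN layer:
# score = rsa_score(rdm_brain, rdm_model)
```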

These results are spectacular not because CNNs “are so similar” to the brain, but because of the complete opposite. CNNs are trained by minimizing a normative cost function via backprop and SGD. Convolutions are biologically implausible operations, and CNNs process millions of image arrays during training. The brain, on the other hand, exploits inductive biases acquired through genetic manipulation as well as unsupervised learning in order to detect patterns in naturalistic images. However, backprop + SGD and thousands of years of evolution seem to have come up with similar solutions. But ultimately, we as researchers are interested in understanding the causal mechanisms underlying the dynamics of the brain and of deep architectures. How much can RSA help us with that?


All measures computed within RSA are correlational. The RDM entries are based on correlational distances, and the R-squared captures the variation of the neural RDM explained by the variation in the model RDM. Ultimately, it is hard to extract any causal insights. The claim that CNNs are the best model of how the visual cortex works is cheap; it really does not help a lot. CNNs are trained via backpropagation and encapsulate a huge inductive bias in the form of kernel weight sharing across all neurons involved in a single processing layer. But the brain cannot implement these exact algorithmic details (and has probably found smarter solutions than Leibniz’s chain rule). However, there has been a bunch of recent work (e.g. by Blake Richards, Walter Senn, Tim Lillicrap, Richard Naud and others) exploring the capability of neural circuits to approximate a normative, gradient-driven cost function optimization. So ultimately, we might not be that far off.


Until then, I firmly believe that one has to combine RSA with the scientific method of experimental intervention. As in economics, we are in need of quasi-experimental causality by means of controlled manipulation. And that is exactly what has been done in two recent studies by Bashivan et al. (2019) and Ponce et al. (2019)! More specifically, they use generative procedures based on Deep Learning to generate a set of stimuli. The ultimate goal thereby is to provide a form of neural control (i.e. to drive firing rates of specific neural sites). Specifically, Ponce et al. (2019) show how to close the loop between generating a stimulus from a Generative Adversarial Network, reading out neural activity, and altering the input noise to the GAN in order to drive the activity of single units as well as populations. The authors were able to identify replicable abstract tuning behavior of the recording sites. The biggest strength of using flexible function approximators lies in their capability to articulate patterns which we as experimenters are not able to put into words.

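Conceptually, the closed loop in Ponce et al. (2019) looks roughly like the evolutionary search sketched below. Both `generator` (a pretrained GAN image generator) and `record_firing_rate` (the electrophysiology read-out) are placeholders for lab- and model-specific code, and the selection/mutation scheme is a simplification of their genetic algorithm.

```python
import numpy as np

def neural_control_loop(generator, record_firing_rate,
                        latent_dim=256, pop_size=32, n_generations=100):
    """Evolve GAN latent codes so the generated images drive a neural site.

    generator(z)            -> image synthesized from latent code z (placeholder)
    record_firing_rate(img) -> measured response of the target site (placeholder)
    """
    rng = np.random.default_rng(0)
    population = rng.normal(size=(pop_size, latent_dim))

    for _ in range(n_generations):
        images = [generator(z) for z in population]
        scores = np.array([record_firing_rate(img) for img in images])

        # Keep the latent codes that elicited the strongest responses ...
        parents = population[np.argsort(scores)[-pop_size // 4:]]
        # ... and refill the population with mutated copies of them.
        children = parents[rng.integers(len(parents), size=pop_size)]
        population = children + 0.1 * rng.normal(size=children.shape)

    return population  # codes for the most "preferred" synthetic stimuli
```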

For many Deep Learning architectures, weight initialization is crucial for successful learning. Furthermore, we still don’t really understand inter-layer representational differences. RSA provides an efficient and easy-to-compute quantity that can measure robustness to such hyperparameters. At the last NeurIPS conference, Samy Bengio’s group (Morcos et al., 2018) introduced projected weighted canonical correlation analysis (PWCCA) to study differences in generalization as well as narrow and wide networks (a minimal CCA-style sketch follows the list below). Based on a large Google-style empirical analysis, they came up with the following key insights:


  1. Networks which are capable of generalization converge to more similar representations. Intuitively, overfitting can be achieved in many different ways. The network is essentially “under-constrained” by the training data and can do whatever it wants outside of that part of the space. Generalization requires exploiting patterns related to the true underlying data-generating process, and this can only be done by a more restricted set of architecture configurations.


  2. The width of a network is directly related to representational convergence. More width = more similar representations (across networks). The authors argue that this is evidence for the so-called lottery ticket hypothesis: empirically, it has been shown that wide-and-pruned networks perform a lot better than networks that were narrow from the beginning. This might be due to different sub-networks of the large-width network being initialized differently. The pruning procedure is then able to simply identify the sub-configuration with the best initialization, while the narrow network only has a single initialization from the get-go.


  3. Different initializations and learning rates can lead to distinct clusters of representational solutions. The clusters generalize similarly well. This might indicate that the loss surface has multiple qualitatively indistinguishable local minima. What ultimately drives cluster membership has not yet been identified.

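For reference, the CCA machinery underlying these comparisons is simple to sketch. The snippet below computes the mean canonical correlation between two representation matrices (rows = input examples, columns = neurons); PWCCA as used by Morcos et al. additionally weights the canonical directions by how much of the representation they account for, which is omitted here for brevity.

```python
import numpy as np

def mean_cca_similarity(x, y):
    """Mean canonical correlation between two (n_examples, n_neurons) matrices.

    This is the unweighted baseline; PWCCA adds projection weighting on top.
    Assumes n_examples >= n_neurons and full-rank activations.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the two representation subspaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of Qx^T Qy are the canonical correlations.
    canonical_corrs = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return canonical_corrs.mean()

# e.g. compare the same layer of two independently trained networks:
# similarity = mean_cca_similarity(acts_net_a_layer3, acts_net_b_layer3)
```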

A recent extension by Geoffrey Hinton’s Google Brain group (Kornblith et al., 2019) uses centered kernel alignment (CKA) in order to scale CCA-style comparisons to larger vector dimensions (numbers of artificial neurons). Personally, I really enjoy this work since computational models give us scientists the freedom to turn all the knobs. And there are quite a few in DL (architecture, initialization, learning rate, optimizer, regularizers). Networks are white boxes, as Kriegeskorte says. So if we can’t succeed in understanding the dynamics of a simple Multi-Layer Perceptron, how are we ever going to succeed with the brain?

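Linear CKA itself reduces to a few lines; the sketch below follows the formula from Kornblith et al. (2019) for activation matrices with examples as rows. Only the linear-kernel variant is shown (the paper also defines an RBF-kernel version).

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between (n_examples, n_features) matrices.

    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering,
    so the score is invariant to isotropic scaling and orthogonal transforms.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denominator = (np.linalg.norm(x.T @ x, ord="fro")
                   * np.linalg.norm(y.T @ y, ord="fro"))
    return numerator / denominator

# Unlike CCA, CKA stays well-behaved when layers are wider than the number
# of examples, which is what makes it practical for large networks.
# similarity = linear_cka(acts_layer_a, acts_layer_b)
```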

All in all, I am a huge fan of every scientific development trying to shine some light on approximations of Deep Learning in the brain. However, Deep Learning is not a causal model of computation in the brain. Arguing that the brain as well as CNNs perform similar sequential operations in time and space is a limited conclusion. In order to gain true insights, the loop has to be closed. Using generative models to design stimuli is therefore an exciting new endeavour in neuroscience. But if we want to understand the dynamics of learning, we have to go further than that. How are loss functions and gradients represented? How does the brain overcome the need for separate training and prediction phases? Representations are only a very indirect peephole for answering these fundamental questions. Going forward, I believe there is a lot to gain (from the modeller’s perspective) from skip and recurrent connections as well as Bayesian DL via dropout sampling. But that is the story of another blog post.


Translated from: https://towardsdatascience.com/representational-similarity-analysis-f2252291b393
