Learning low-level vision (1999)

Learning low-level vision
William T. Freeman and Egon C. Pasztor
MERL, a Mitsubishi Electric Res. Lab
201 Broadway, Cambridge, MA 02139
freeman, pasztor@merl.com




We show a learning-based method for low-level vision problems: estimating scenes from images. We generate a synthetic world of scenes and their corresponding rendered images. We model that world with a Markov network, learning the network parameters from the examples. Bayesian belief propagation allows us to efficiently find a local maximum of the posterior probability for the scene, given the image. We call this approach VISTA: Vision by Image/Scene Training.


We apply VISTA to the "super-resolution" problem (estimating high frequency details from a low-resolution image), showing good results. For the motion estimation problem, we show figure/ground discrimination, solution of the aperture problem, and filling-in arising from application of the same probabilistic machinery.


1 Introduction


We seek machinery for learning low-level vision problems, such as motion analysis, inferring shape and albedo from a photograph, or extrapolating image detail. For these problems, given image data, we want to estimate an underlying scene. The scene quantities to be estimated might be projected object velocities, surface shapes and reflectance patterns, or missing high frequency details.


Low-level vision problems are typically underconstrained, so Bayesian [3,23,37] and regularization techniques [31] are fundamental. There has been much work and progress (for example, [23,25,15]), but difficulties remain in working with complex, real images. Typically, prior probabilities or constraints are made up, rather than learned. A general machinery for a learning-based solution to low-level vision problems would have many applications.


A recent research theme has been to learn the statistics of natural images. Researchers have related those statistics to properties of the human visual system [28,2,36], or have used statistical methods with biologically plausible image representations to analyse and synthesize realistic image textures [14,8,41,36]. These methods may help us understand the early stages of representation and processing, but unfortunately, they don't address how a visual system might interpret images, i.e., estimate the underlying scene.


We want to combine the two research themes of scene estimation and statistical learning. We study the statistical properties of a synthetically generated, labeled world of images with scenes, to learn how to infer scenes from images. Our prior probabilities can then be rich ones, learned from the training data. Several researchers have applied related learning approaches to low-level vision problems, but restricted themselves to linear models [21,16], too weak for many applications. Our approach is similar in spirit to relaxation labelling [33, 22], but our Bayesian propagation algorithm is more efficient and we utilize large sets of labelled training data.

We interpret images by modeling the relationship between local regions of images and scenes, and between neighboring local scene regions. The former allows initial scene estimates; the latter allows the estimates to propagate. We train from image/scene pairs and apply the Bayesian machinery of graphical models [29, 5, 20]. We were inspired by the work of Weiss [38], who pointed out the speed advantage of Bayesian methods over conventional relaxation methods for propagating local measurement information. For a related approach, but with heuristically derived propagation rules, see [34].

We call our approach VISTA, Vision by Image/Scene TrAining. It is a general machinery that may apply to various problems. We illustrate it for estimating missing image details, and estimating motion.


2 Markov network


For given image data, y, we seek to estimate the underlying scene, x (we omit the vector symbols for notational simplicity). We first calculate the posterior probability, P(x|y) = cP(x, y). For this analysis, we ignore the normalization, c = 1/P(y), a constant over x. Under two common loss functions [3], the best scene estimate, $\hat{x}$, is the mean (minimum mean squared error, MMSE) or the mode (maximum a posteriori, MAP) of the posterior probability.
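The difference between the two estimates can be illustrated with a minimal sketch, assuming a scalar scene variable restricted to a small grid of values. The function name `mmse_and_map` and the numbers are hypothetical, chosen only for illustration; they do not come from the paper.

```python
def mmse_and_map(values, joint):
    """Return (MMSE, MAP) estimates of x from unnormalized P(x, y).

    values: candidate scene values x (illustrative assumption: a scalar grid).
    joint:  P(x, y) evaluated at each candidate, for one fixed observation y.
    """
    total = sum(joint)                       # equals P(y), i.e. 1/c
    posterior = [p / total for p in joint]   # P(x | y) = c P(x, y)
    # MMSE estimate: the mean of the posterior.
    mmse = sum(x * p for x, p in zip(values, posterior))
    # MAP estimate: the mode of the posterior.
    best = max(range(len(values)), key=lambda i: posterior[i])
    return mmse, values[best]


values = [0.0, 1.0, 2.0, 3.0]
joint = [1.0, 0.2, 0.2, 0.6]   # made-up, bimodal, unnormalized
mmse_estimate, map_estimate = mmse_and_map(values, joint)
```

For a bimodal posterior such as this one, the mean falls between the two modes while the mode picks the single most probable value, which is why the choice of loss function matters.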


In general, $\hat{x}$ can be difficult to compute [23] without approximations. We make the Markov assumption: we divide both the image and scene into patches, and assign one node of a Markov network [13, 29, 20] to each patch. Given the variables at intervening nodes, two nodes of a Markov network are statistically independent. We connect each scene patch to its corresponding image patch, and to its nearest neighbors, Fig. 1. Solving a Markov network involves a learning phase, where the parameters of the network connections are learned from training data, and an inference phase, when the scene corresponding to particular image data is estimated.


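For networks without loops, the Markov assumption leads to simple "message-passing" rules for computing the MAP and MMSE estimates [29, 39, 20]. Writing those estimates for $x_j$ by marginalizing (MMSE) or taking the argmax (MAP) over the other variables gives:

$$\hat{x}_{j\,\mathrm{MMSE}} = \int_{x_j} x_j \int_{\text{all } x_i,\ i \neq j} P(x, y)\, dx \qquad (1)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \max_{\text{all } x_i,\ i \neq j} P(x, y) \qquad (2)$$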


For a Markov random field, the joint probability over the scenes x and images y can be written as [4,13,12]:

$$P(x, y) = \prod_{\text{neighboring } i,j} \Psi(x_i, x_j) \prod_k \Phi(x_k, y_k) \qquad (3)$$


where we have introduced pairwise compatibility functions, Ψ and Φ, described below. The factorized structure of Eq. (3) allows the integrals and argmax operations of Eqs. (1) and (2) to pass through to the compatibility function factors with the appropriate arguments. For a network without loops, the resulting expression can be computed using repeated, local computations [29,39,20], summarized below: the MMSE estimate at node j is

$$\hat{x}_{j\,\mathrm{MMSE}} = \int_{x_j} x_j \, \Phi(x_j, y_j) \prod_k L_{kj} \, dx_j \qquad (4)$$


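where k runs over all scene node neighbors of node j. We calculate $L_{kj}$ from:

$$L_{kj} = \int_{x_k} \Psi(x_k, x_j) \, \Phi(x_k, y_k) \prod_{l \neq j} \tilde{L}_{lk} \, dx_k \qquad (5)$$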


where $\tilde{L}_{lk}$ is $L_{lk}$ from the previous iteration. The initial $\tilde{L}_{lk}$'s are 1. After at most one iteration of Eq. (5) per scene node $x_i$, Eq. (4) gives Eq. (1). The MAP estimate equation, Eq. (2), yields analogous formulae, with the integral $\int_{x_k}$ of Eq. (5) replaced by $\arg\max_{x_k}$, and $\int_{x_j} x_j$ of Eq. (4) replaced by $\arg\max_{x_j}$. For linear topologies, these propagation rules are equivalent to well-known Bayesian inference methods, such as the Kalman filter and the forward-backward algorithm for Hidden Markov Models [29, 26, 38, 20, 11].
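The local propagation rules can be sketched for the simplest loop-free case, a chain of scene nodes with a few discrete states each, so that the integrals become sums. This is an illustration under stated assumptions, not the authors' implementation: the names `chain_marginals` and `brute_force_marginals`, the table layout of `phi` and `psi`, and the numbers are all hypothetical. Messages are passed in one forward and one backward sweep rather than by the iterate-from-ones schedule described above; on a chain both schedules reach the same fixed point.

```python
from itertools import product


def chain_marginals(phi, psi):
    """Posterior marginals on a chain via message passing.

    phi[i][a]    = Phi(x_i = a, y_i) for the observed y_i
    psi[i][a][b] = Psi(x_i = a, x_{i+1} = b)
    """
    n, s = len(phi), len(phi[0])
    fwd = [[1.0] * s for _ in range(n)]   # message into node i from node i-1
    bwd = [[1.0] * s for _ in range(n)]   # message into node i from node i+1
    for i in range(1, n):
        for b in range(s):
            fwd[i][b] = sum(psi[i - 1][a][b] * phi[i - 1][a] * fwd[i - 1][a]
                            for a in range(s))
    for i in range(n - 2, -1, -1):
        for a in range(s):
            bwd[i][a] = sum(psi[i][a][b] * phi[i + 1][b] * bwd[i + 1][b]
                            for b in range(s))
    marginals = []
    for i in range(n):
        # Local evidence times all incoming messages, then normalize.
        belief = [phi[i][a] * fwd[i][a] * bwd[i][a] for a in range(s)]
        z = sum(belief)
        marginals.append([v / z for v in belief])
    return marginals


def brute_force_marginals(phi, psi):
    """Same marginals by summing the factorized joint over every configuration."""
    n, s = len(phi), len(phi[0])
    totals = [[0.0] * s for _ in range(n)]
    for x in product(range(s), repeat=n):
        p = 1.0
        for i in range(n):
            p *= phi[i][x[i]]
        for i in range(n - 1):
            p *= psi[i][x[i]][x[i + 1]]
        for i in range(n):
            totals[i][x[i]] += p
    z = sum(totals[0])
    return [[v / z for v in row] for row in totals]


phi = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]      # made-up local evidence
smooth = [[0.8, 0.2], [0.2, 0.8]]               # made-up neighbor compatibility
psi = [smooth, smooth]
bp = chain_marginals(phi, psi)
exact = brute_force_marginals(phi, psi)
```

The two results agree up to floating-point error, while the message-passing version costs time linear in the number of nodes instead of exponential. An MMSE estimate at a node is then the sum of each state's value weighted by its marginal.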


Finding the posterior probability distribution for a grid-structured Markov network with loops is computationally expensive and a variety of approximations have been proposed [13,12,20]. Strong empirical results in "Turbo codes" [24,27] and recent theoretical work [39, 40] provide support for a very simple approximation: applying the propagation rules derived above even in a network with loops. Table 1 summarizes results from [40]: (1) for Gaussian processes, the MMSE propagation scheme will converge only to the true posterior means. (2) Even for non-Gaussian processes, if the MAP propagation scheme converges, it finds at least a local maximum of the true posterior probability.


2.1 Learning the compatibility functions


One can measure the marginal probabilities relating local scenes, $x_i$, and images, $y_i$, as well as neighboring local scenes, $x_i$ and $x_j$. Iterated Proportional Fitting (e.g., [18]) is a scheme to iteratively modify the compatibility functions until the empirically measured marginal statistics agree with those predicted by the model, Eq. (3). For the problems presented here, we found good results by using the marginal statistics measured from the training data, without modifications by iterated proportional fitting. Based on a factorization described in [10, 9], for a message from scene nodes j to k, we used $\Psi(x_j, x_k) = P(x_j, x_k) / P(x_k)$ and $\Phi(x_j, y_j) = P(y_j \mid x_j)$. We fit the probabilities with mixtures of Gaussians.
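As a rough illustration of measuring compatibility functions directly from training statistics, the sketch below estimates Ψ as P(x_j, x_k) / P(x_k) and Φ as P(y_j | x_j) from co-occurrence counts. It swaps in simple histograms over discrete labels where the paper fits mixtures of Gaussians to continuous patches; the function names, labels, and data are hypothetical.

```python
from collections import Counter


def learn_psi(neighbor_pairs):
    """Psi(x_j, x_k) = P(x_j, x_k) / P(x_k) from (x_j, x_k) neighbor samples."""
    n = len(neighbor_pairs)
    joint = Counter(neighbor_pairs)
    marginal_k = Counter(xk for _, xk in neighbor_pairs)
    return {(xj, xk): (count / n) / (marginal_k[xk] / n)
            for (xj, xk), count in joint.items()}


def learn_phi(scene_image_pairs):
    """Phi(x_j, y_j) = P(y_j | x_j) from (x_j, y_j) samples."""
    joint = Counter(scene_image_pairs)
    marginal_x = Counter(x for x, _ in scene_image_pairs)
    return {(x, y): count / marginal_x[x] for (x, y), count in joint.items()}


# Made-up training samples with discrete scene and image labels.
neighbors = [("edge", "edge"), ("edge", "edge"), ("flat", "edge"), ("flat", "flat")]
renderings = [("edge", "sharp"), ("edge", "sharp"), ("edge", "blurry"), ("flat", "blurry")]
psi = learn_psi(neighbors)
phi = learn_phi(renderings)
```

Pairs that never occur in the training samples are simply absent from the returned dictionaries; a practical system would need smoothing or a continuous density model, which is one motivation for a Gaussian mixture fit.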


An alternate method, which we find gives comparable results, not shown here, is to use scene and image patches which spatially overlap their neighbors. We assume a Gaussian noise penalty on the multiple observations of the same pixels in the overlap region, yielding $\Psi(x_k, x_j) = \exp\left(-|d_k - d_j|^2 / 2\sigma^2\right)$, where $d_k$ and $d_j$ are the corresponding values of the scenes described at nodes k and j in their region of common support, and σ is a penalty parameter.
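A minimal sketch of this overlap-based compatibility follows, assuming the Gaussian penalty takes the form exp(-|d_k - d_j|^2 / (2 sigma^2)) and that each patch's overlap pixels are given as a flat list. The function name and the pixel values are hypothetical.

```python
import math


def overlap_compatibility(d_k, d_j, sigma):
    """Gaussian penalty on disagreement between two patches in their overlap.

    d_k, d_j: pixel values each patch assigns to the shared overlap region.
    sigma:    penalty parameter; smaller values punish disagreement harder.
    """
    squared_diff = sum((a - b) ** 2 for a, b in zip(d_k, d_j))
    return math.exp(-squared_diff / (2.0 * sigma ** 2))


# Made-up overlap pixels for one patch at node k and two candidates at node j.
d_k = [0.5, 0.7, 0.9]
agreeing = [0.5, 0.7, 0.9]
disagreeing = [0.1, 0.2, 0.3]
high = overlap_compatibility(d_k, agreeing, sigma=0.2)
low = overlap_compatibility(d_k, disagreeing, sigma=0.2)
```

Candidates that agree exactly in the overlap get compatibility 1, and the value decays smoothly toward 0 as the disagreement grows.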


2.2 Probability Representation


Inspired by the success of [17, 8], we use a sample-based representation for inference. We describe the posterior probability as a set of weights on scenes observed in the training set. Given an image to analyze, for each node we collect a set of 10 or 20 "scene candidates" from the training data which have image data closely matching the local observation. We evaluate the posterior probability only at those scene values. The propagation algorithms, Eq. (5) and (4), then are discrete matrix calculations. This simplification focuses the computation on only those scenes which render to the observed image data.
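Candidate collection can be sketched as a nearest-neighbor lookup over the training pairs. This is an illustration only: the squared Euclidean distance, the exhaustive sort, and the names below are assumptions, since the text says only that the candidates' image data should closely match the local observation.

```python
def scene_candidates(observed_patch, training_pairs, n_candidates=10):
    """Return the scenes whose training image patches best match the observation.

    training_pairs: list of (image_patch, scene_patch) tuples; patches are
                    flat sequences of numbers (hypothetical layout).
    """
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Rank all training pairs by how closely their image patch matches.
    ranked = sorted(training_pairs,
                    key=lambda pair: squared_distance(pair[0], observed_patch))
    return [scene for _, scene in ranked[:n_candidates]]


# Made-up training set: scene patches are stood in for by labels.
training = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"),
            ((0.1, 0.0), "C"), ((5.0, 5.0), "D")]
candidates = scene_candidates((0.0, 0.1), training, n_candidates=2)
```

With a fixed number of candidates per node, the compatibility between two neighboring nodes becomes a small matrix, and each message update reduces to a matrix-vector product.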

3 Super-resolution

For the super-resolution problem, the input image is a low-resolution image. The scene to be estimated is a higher resolution image. A good solution to this problem would allow pixel-based images to be handled in a relatively resolution-independent manner. Applications could include enlargement of digital or film photographs, upconversion of video from NTSC format to HDTV, or image compression.

At first, the task may seem impossible: the high resolution data is not there. However, we can see edges in the low-resolution image that we know should remain sharp at the next resolution level. Furthermore, based on the successes of recent texture synthesis methods [14,8,41,36], we might expect to handle textured areas well, too.

Others [35] have used a Bayesian method, making up the prior probability. In contrast, the Markov network learns the relationship between sharp and blurred images from large amounts of training data, and achieves better results. Among the non-Bayesian methods, fractal image representation [32] (Fig. 8c) only gathers training data from the one image, while selecting the nearest neighbor from training data



