6D姿态估计从0单排——看论文的小鸡篇——Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estima...

we employ a convolutional auto-encoder that has been trained on a large collection of random local patches. We demonstrate that neural networks coupled with a local voting-based approach can be used to perform reliable 3D object detection and pose estimation under clutter and occlusion. To this end, we deeply learn descriptive features from local RGB-D patches and use them afterwards to create hypotheses in the 6D pose space.
1492311-20190312220902832-1258265570.png
we train a convolutional autoencoder (CAE) from scratch using random patches from RGB-D images with the goal of descriptor regression. With this network we create codebooks from synthetic patches sampled from object views where each codebook entry holds a local 6D pose vote. In the detection phase we sample patches in the input image on a regular grid, compute their descriptors and match them against codebooks with an approximate k-NN search.
1492311-20190312220912025-672641034.png

  1. Local Patch Representation: Given an object appearance, the idea is to separate it into its local parts and let them vote independently. To represent an object locally, we render it from many viewpoints sampled on an icosahedron, and densely extract a set of scale-independent RGB-D patches. we take depth z at the patch center point and compute the patch pixel size such that the patch corresponds to a fixed metric size m :\(patch_{size} = \frac{m}{z}\cdot f\) with f being the focal length of the rendering camera. we de-mean the depth values with z and clamp them to \(\pm m\) to confine the patch locally not only along x and y, but also along z. Finally, we normalize color and depth to \([−1, 1]\) and resize the patches to \(32\times 32\). Our method instead does not need to model the background at all.
  2. Network Training: we decided to train from scratch(从头开始) due to multiple reasons:
    • Not many works have incorporated depth as an additional channel, depth noise
    • first to focus on local RGB-D patches of small-scale objects
    • To robustly train deep architectures, a high amount of training samples is needed.
      We randomly sample local patches from the LineMOD dataset. Furthermore, these samples were augmented such that each image got randomly flipped and its color channels permutated. Our network aims to learn a mapping from the high-dimensional patch space to a much lower feature space of dimensionality \(F\), and we employ a autoencoder (AE) and a convolutional autoencoder (CAE) to accomplish this task.
      1492311-20190312220923621-1690926343.png
  3. Constrained Voting: A problem that is often encountered in regression tasks is the unpredictability of output values. In our case, this can be caused by unseen object parts, general background appearance or sensor noise in color or depth. we render an object from many views and store local patches y of this synthetic data in a database. For each \(y\), we compute its feature \(f(y)\in \mathbb{R}^F\) and store it together with a local vote \((t_x,t_y,t_z,\alpha,\beta,\gamma)\) describing the patch 3D center point offset to the object centroid and the rotation with respect to the local object coordinate frame.we take each sampled 3D scene point \(s = (s_x, s_y, s_z)\) with associated patch \(x\), compute its deep-regressed feature \(f(x)\) and retrieve k (approximate) nearest neighbors \(y_1,..., y_k\). Each neighbor casts then a global vote \(v(s,y) = (t_x + s_x, t_y + s_y, t_z + t_y, \alpha,\beta, \gamma)\) with an associated weight \(w(v) = e^{\left\|f(x)-f(y)\right\|}\) based on the feature distance.
    1492311-20190312220931475-695065286.png
    Vote filtering: Casting votes can lead to a very crowded vote space that requires refinement in order to keep detection computationally feasible. We thus employ a three-stage filtering: in the first stage we subdivide the image plane into a 2D grid and throw each vote into the cell the projected centroid points to. We suppress all cells that hold less than k votes and extract local maxima after bilinear filtering over the accumulated weights of the cells. Each local mode collects the votes from its direct cell neighbors and performs mean shift with a flat kernel, first in translational and then in quaternion space.
    1492311-20190312220940658-661033118.png

转载于:https://www.cnblogs.com/LeeGoHigh/p/10520036.html

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值