[Stereo_cnn][cvpr16] Improved Stereo Matching with Constant Highway Networks

(Unfinished / work in progress)

Notes

Environment Installation

To check whether libpng is installed, see https://askubuntu.com/questions/806160/how-to-check-if-libpng-and-png-is-installed

```bash
sudo apt-get install libpng++-dev
```

Contributions

1) A new highway network architecture (an inner-outer residual net, i.e. a multilevel ResNet) for patch matching, including **multilevel constant highway gating and scaling layers** that control the receptive field of the network.
2) Training the cost-computation network with a hybrid loss, to make better use of the decision and description networks, which compress the input patches / extract features.
3) Computing the disparity image with a global disparity network (a CNN) instead of winner-take-all (WTA).
4) Using reflective learning to measure the correctness (confidence) of the CNN output.
5) Using this confidence score for better outlier detection and correction during refinement.
6) Ranked 1st on KITTI.
7) Reducing the fast MC-CNN's runtime to under 5 s per image.
8) Open-source code.

  • We studied MC-CNN and its modified versions.
  • ResNet, a special case of the Highway Network, achieves state-of-the-art results on the most competitive CV tasks. Its success is usually attributed to the ability to train very deep networks, but that explanation does not carry over to stereo matching: ResNet can be viewed as an ensemble of networks that share weights, whereas the networks used in stereo matching are much shallower.
  • We also add scaling layers to control the receptive field (a minimal sketch of such a constant-gated block follows this list).
  • Sect. 6 shows that our model outperforms ImageNet-winning architectures such as DenseNet on this task.
  • To estimate the confidence of a stereo match, we propose a CNN whose confidence indication is trained with a reflective loss.
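A minimal PyTorch sketch of the constant highway idea, reconstructed from the description above; the block width, kernel sizes, and gate initialization are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConstantHighwayBlock(nn.Module):
    """Residual block with learned *constant* highway gates:
    y = g_skip * x + g_res * f(x), where g_skip and g_res are scalars
    learned per block rather than computed from the input (which is
    what distinguishes this from the original Highway Network)."""

    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Start as a plain residual unit; training moves the gates.
        self.g_skip = nn.Parameter(torch.ones(1))
        self.g_res = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.g_skip * x + self.g_res * self.f(x)
```

The "multilevel" variant presumably wraps groups of such blocks in an outer skip connection with its own constant gate, which, per contribution 1 above, is what lets the gates control the effective receptive field.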
Stereo Matching Pipeline
  1. Rectification / stereo calibration
  2. Cost computation

[Figure from the paper "Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning": the cost-computation architecture.]

We use a description network with the highway architecture to compute feature maps of the left and right input patches separately. These feed two pathways: the first concatenates them and passes them through a fully-connected decision network trained with a cross-entropy loss; the second applies a hinge loss directly to the dot product (cosine similarity) of the two representations (similar to a triplet loss).
The inputs are patches from the left and right images, processed as follows:
1) Compute a feature map for each patch through two description networks with identical structure.
2) From the two feature maps, the similarity (and the loss) is computed in two ways:

- A. Concatenate the two vectors and pass them through several FC layers plus a sigmoid, computing the cross-entropy loss over the network's outputs. [Is each training example one positive and one negative pair, or several of each?]
- B. Compute the dot product of the two feature vectors and use its negation as the loss (via the hinge criterion below).

$$loss = \alpha \cdot XEnt(v^+, v^-) + (1 - \alpha) \cdot Hinge(s^+, s^-)$$
$$loss = \alpha \cdot \big( -\log(v^+) - \log(1 - v^-) \big) + (1 - \alpha) \cdot \max(0,\ m + s^- - s^+)$$
$u$ is the output of the description network, and $s = u_l^T u_r$.
$v$ is the sigmoid output of the decision network.
Positive and negative samples are drawn as follows:

Since the ground-truth disparity $d$ is known, we know the position of the correct match; every other position is a non-match. Let $\langle P^L_{n \times n}(p), P^R_{n \times n}(q) \rangle$ be a pair of patches from the left and right images, where $P^L_{n \times n}(p)$ is the $n \times n$ patch centered at pixel $p = (x, y)$ of the left image and $d$ is that pixel's disparity. A negative example is a pair of patches, one from each image, with:

$$p = (x, y), \quad q = (x - d + o_{neg},\ y), \quad o_{neg} \in [dataset\_neg\_low,\ dataset\_neg\_high]$$

A positive example:
$$p = (x, y), \quad q = (x - d + o_{pos},\ y), \quad o_{pos} \in [dataset\_pos\_low,\ dataset\_pos\_high]$$
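A minimal sketch of this sampling rule, assuming the ground-truth disparity $d$ at $(x, y)$ is known; the offset ranges stand in for the $dataset\_{pos,neg}\_{low,high}$ hyperparameters above, and the concrete numbers are illustrative MC-CNN-style defaults, not necessarily the paper's:

```python
import random

def sample_pair(x, y, d, positive,
                pos_range=(0.0, 0.5), neg_range=(1.5, 6.0)):
    """Pick center coordinates (p, q) for a training patch pair.

    x, y      -- pixel coordinates in the left image
    d         -- ground-truth disparity at (x, y)
    positive  -- True for a matching pair, False for a non-matching one
    """
    lo, hi = pos_range if positive else neg_range
    o = random.uniform(lo, hi)
    if random.random() < 0.5:      # the offset can fall on either side
        o = -o
    p = (x, y)
    q = (x - d + o, y)             # same row, shifted off the true match
    return p, q
```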

$s^+ = (u_l^T u_r)^+$, where $u_l, u_r$ come from a pair $p, q$ satisfying the positive-example relation.
$v^+$ is the sigmoid output of the decision network for a positive patch pair.
??? How is the final loss actually computed? For training, it feels like the hinge part could be ignored and the network trained with the cross-entropy loss alone.

In the experiments, $\alpha = 0.8$ and $m = 0.2$.
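A minimal PyTorch sketch of this hybrid loss, assuming `v_pos`, `v_neg` are the decision network's sigmoid outputs for a positive and a negative pair and `s_pos`, `s_neg` the corresponding dot products (function and argument names are mine):

```python
import torch

def hybrid_loss(v_pos, v_neg, s_pos, s_neg, alpha=0.8, m=0.2):
    """loss = alpha * XEnt(v+, v-) + (1 - alpha) * Hinge(s+, s-)."""
    eps = 1e-7  # numerical safety for the logs
    # Binary cross-entropy: target 1 for the positive pair, 0 for the
    # negative one.
    xent = -(torch.log(v_pos + eps) + torch.log(1.0 - v_neg + eps))
    # Hinge: the positive similarity should beat the negative by m.
    hinge = torch.clamp(m + s_neg - s_pos, min=0.0)
    return alpha * xent.mean() + (1.0 - alpha) * hinge.mean()
```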
The network's output is thus a 3-D cost map $C(p, d)$ of size max_disparity × height × width, where $p$ is a pixel and $d$ a candidate disparity.
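At test time the description network can be run densely over both full images, and $C(p, d)$ filled with dot products between the left feature map and the right feature map shifted by each candidate disparity. A minimal sketch of this similarity pathway (the decision-network pathway would instead run its FC layers at every position); tensor layout and names are mine:

```python
import torch

def cost_volume(feat_l, feat_r, max_disp):
    """Dot-product match scores between dense feature maps.

    feat_l, feat_r -- (C, H, W) description-network outputs
    Returns a (max_disp, H, W) volume; larger = better match, so
    negate it wherever a cost (smaller = better) is expected.
    """
    C, H, W = feat_l.shape
    vol = torch.zeros(max_disp, H, W)
    for d in range(max_disp):
        # Left pixel (x, y) is compared against right pixel (x - d, y).
        vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, : W - d]).sum(0)
    return vol
```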
3. Disparity computation
[Figure: the disparity computation network (the original image, hosted on imgbb.com, is private and no longer viewable).]

In MC-CNN, the final disparity is obtained by WTA (winner-take-all): for each pixel, take the $d$ with the lowest cost as the output, $D(p) = \operatorname{argmin}_d C(p, d)$, which yields the disparity map.
Here, instead, a CNN (the global disparity network) produces the disparity map. It takes the cost map as input, passes it through 4 convolutional layers and 3 FC layers, and is additionally trained with a reflective loss, which makes it possible to estimate the confidence of each prediction.
As the figure shows, FC3 plays two roles: on one hand it is used to compute a weighted cross-entropy loss that yields the final $d$; on the other hand it feeds FC4 and FC5, which compute a binary cross-entropy loss (a reflective loss, since its labels depend not only on the ground truth but also on the network's changing predictions [isn't that trivially true? even plain MSE depends on both]) to produce the confidence. $\operatorname{argmax}_i y_i$ is compared with $y_{GT}$: if they differ by at most one pixel, the sample is labeled positive, otherwise negative (the intuition behind this is still unclear to me).
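A sketch of the global disparity network as described above, with FC3 feeding both the disparity scores and the FC4/FC5 confidence head. The 4-conv + 3-FC layout follows the text; channel counts, kernel sizes, and the way per-pixel cost vectors are fed to the FC part are my assumptions:

```python
import torch
import torch.nn as nn

class GlobalDisparityNet(nn.Module):
    """4 conv layers + 3 FC layers over the cost map, per the text;
    widths and kernel sizes here are illustrative, not the paper's."""

    def __init__(self, max_disp=128, hidden=384):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(max_disp, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc1 = nn.Linear(64, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, max_disp)  # scores y_i over disparities
        # FC4/FC5: the confidence head on top of FC3 (reflective loss).
        self.fc4 = nn.Linear(max_disp, hidden)
        self.fc5 = nn.Linear(hidden, 1)

    def forward(self, cost):                 # cost: (N, max_disp, H, W)
        h = self.conv(cost)                  # (N, 64, H, W)
        h = h.permute(0, 2, 3, 1)            # per-pixel feature vectors
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        y = self.fc3(h)                      # per-pixel disparity scores
        conf = torch.sigmoid(self.fc5(torch.relu(self.fc4(y))))
        return y, conf.squeeze(-1)
```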
The reflective loss and the weighted cross-entropy loss are summed in a 15:85 ratio to form the final training loss.
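A minimal sketch of this combined loss with the reflective labels described above; the plain `F.cross_entropy` stands in for the paper's weighted cross-entropy (which, as I understand it, also gives partial weight to disparities near the ground truth), and names and shapes are mine:

```python
import torch
import torch.nn.functional as F

def gdn_loss(y, conf, d_gt, lam=0.85):
    """Weighted cross-entropy + reflective loss, mixed 85:15.

    y    -- (N, max_disp) per-pixel disparity scores (FC3)
    conf -- (N,) confidence in (0, 1) (sigmoid of FC5)
    d_gt -- (N,) ground-truth disparity indices (long)
    """
    # Stand-in for the paper's *weighted* cross-entropy over disparities.
    xent = F.cross_entropy(y, d_gt)
    # Reflective label: 1 if the current prediction lands within one
    # pixel of the ground truth, else 0. It depends on the network's
    # own (changing) output, hence "reflective".
    pred = y.argmax(dim=1)
    correct = ((pred - d_gt).abs() <= 1).float()
    refl = F.binary_cross_entropy(conf, correct)
    return lam * xent + (1.0 - lam) * refl
```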
(Experiments show this approach works particularly well in occluded regions.)
4. Disparity map refinement
The disparity map produced by the algorithm above is still fairly coarse, and further refinement is needed to obtain the final result. The main refinement methods are:

  • CBCA (cross-based cost aggregation) iterations - incorporate neighboring pixels' information by averaging the cost while respecting depth discontinuities.
  • SGM (semi-global matching) - enforce smoothness constraints. (CBCA + SGM iterations help refine D(p), but some cases remain beyond them, such as reflective and sparsely textured regions, occluded and distorted areas, and illumination changes.)
  • Left-right consistency check (a minimal sketch follows this list)
  • Sub-pixel interpolation
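A minimal sketch of the left-right consistency check referenced in the list, assuming `disp_l` and `disp_r` are disparity maps computed with the left and right image, respectively, as reference; the 1-pixel threshold is a common default, not necessarily the paper's:

```python
import numpy as np

def lr_consistency_mask(disp_l, disp_r, tau=1.0):
    """Mark pixels whose left and right disparities disagree.

    disp_l[y, x] should match disp_r[y, x - disp_l[y, x]]; pixels
    violating this by more than tau are flagged as outliers
    (typically occlusions or mismatches) for later correction.
    """
    H, W = disp_l.shape
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    x_r = np.clip(xs - np.rint(disp_l).astype(int), 0, W - 1)
    disagreement = np.abs(disp_l - disp_r[ys, x_r])
    return disagreement > tau   # True = outlier
```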
Whole architecture
Highway network
Reflective loss