What drew my attention to this paper: it is end-to-end, and it performs left-right consistency checking directly in the loss; in effect, the usual post-processing is folded into end-to-end training. So I pay particular attention to its loss.
Most algorithms I encounter are supervised, meaning they need fairly high-quality ground truth.
Another contribution of this paper is that it can still train without ground truth: disparity is produced from an image reconstruction loss alone. The results are reportedly even better than supervised methods (overselling?).
To summarize: the architecture follows DispNet, training is unsupervised (no ground-truth depth information required), and left-right consistency is built directly into the loss.
Depth Estimation as Image Reconstruction
Depth prediction can be cast as a regression problem: given the left and right images of a stereo pair, regress the left image into the right one.
Loss module (figure caption from the paper): the module outputs the left and right disparity maps; the loss combines a smoothness term, a reconstruction term, and a left-right disparity consistency term, and the module is repeated at each of the four output scales. C: convolution, UC: up-convolution, S: bilinear sampling, US: up-sampling, SC: skip connection. Note that each module includes an upsampling step.
3. Method
The paper proposes a novel depth-training loss based on left-right consistency checking, so the network needs no supervision signal.
3.1 Depth Estimation as Image Reconstruction
In other words, the supervised regression problem is turned into an unsupervised reconstruction problem: warp the left image to the right view and the right image to the left view, then minimize the reconstruction error.
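To make the warping concrete, here is a minimal NumPy sketch of 1-D horizontal warping; warp_horizontal is a hypothetical, simplified stand-in for the repo's bilinear_sampler_1d_h, and it assumes a rectified pair (matches lie on the same scanline) and a float image:

import numpy as np

def warp_horizontal(img, disp):
    # reconstruct the left view: sample the right image at x - disp(y, x),
    # linearly interpolating between the two nearest columns
    h, w, _ = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            xs = x - disp[y, x]                              # fractional source column
            x0 = int(np.clip(np.floor(xs), 0, w - 1))
            x1 = min(x0 + 1, w - 1)
            a = float(np.clip(xs - np.floor(xs), 0.0, 1.0))  # blend weight
            out[y, x] = (1 - a) * img[y, x0] + a * img[y, x1]
    return out

Calling warp_horizontal(right_img, disp_left) gives a reconstructed left image, and its difference from the true left image is what the appearance loss below penalizes.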
3.2 Depth Estimation Network
The distinctive part of this paper is that from a single left image the network simultaneously produces two disparity maps (left-to-right and right-to-left) and makes them mutually consistent.
That is, a single left image yields both a left disparity map and a right disparity map.
This is the network structure; the idea comes from DispNet.
The code is as follows:
# from the monodepth repo (TF 1.x); imports added for completeness
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim

def conv(self, x, num_out_layers, kernel_size, stride, activation_fn=tf.nn.elu):
    # explicit 'SAME'-style padding, then a 'VALID' convolution
    p = np.floor((kernel_size - 1) / 2).astype(np.int32)
    p_x = tf.pad(x, [[0, 0], [p, p], [p, p], [0, 0]])
    return slim.conv2d(p_x, num_out_layers, kernel_size, stride, 'VALID', activation_fn=activation_fn)

def conv_block(self, x, num_out_layers, kernel_size):
    # two convs; the second has stride 2, so each block halves the resolution
    conv1 = self.conv(x, num_out_layers, kernel_size, 1)
    conv2 = self.conv(conv1, num_out_layers, kernel_size, 2)
    return conv2

def upsample_nn(self, x, ratio):
    # nearest-neighbor upsampling by an integer ratio
    s = tf.shape(x)
    h = s[1]
    w = s[2]
    return tf.image.resize_nearest_neighbor(x, [h * ratio, w * ratio])

def upconv(self, x, num_out_layers, kernel_size, scale):
    # upsample, then convolve (instead of a transposed convolution)
    upsample = self.upsample_nn(x, scale)
    conv = self.conv(upsample, num_out_layers, kernel_size, 1)
    return conv
def build_vgg(self):
    # set convenience functions
    conv = self.conv
    if self.params.use_deconv:
        upconv = self.deconv
    else:
        upconv = self.upconv

    with tf.variable_scope('encoder'):
        conv1 = self.conv_block(self.model_input, 32, 7) # H/2
        conv2 = self.conv_block(conv1, 64, 5)            # H/4
        conv3 = self.conv_block(conv2, 128, 3)           # H/8
        conv4 = self.conv_block(conv3, 256, 3)           # H/16
        conv5 = self.conv_block(conv4, 512, 3)           # H/32
        conv6 = self.conv_block(conv5, 512, 3)           # H/64
        conv7 = self.conv_block(conv6, 512, 3)           # H/128

    with tf.variable_scope('skips'):
        skip1 = conv1
        skip2 = conv2
        skip3 = conv3
        skip4 = conv4
        skip5 = conv5
        skip6 = conv6

    with tf.variable_scope('decoder'):
        upconv7 = upconv(conv7, 512, 3, 2) # H/64
        concat7 = tf.concat([upconv7, skip6], 3)
        iconv7 = conv(concat7, 512, 3, 1)

        upconv6 = upconv(iconv7, 512, 3, 2) # H/32
        concat6 = tf.concat([upconv6, skip5], 3)
        iconv6 = conv(concat6, 512, 3, 1)

        upconv5 = upconv(iconv6, 256, 3, 2) # H/16
        concat5 = tf.concat([upconv5, skip4], 3)
        iconv5 = conv(concat5, 256, 3, 1)

        upconv4 = upconv(iconv5, 128, 3, 2) # H/8
        concat4 = tf.concat([upconv4, skip3], 3)
        iconv4 = conv(concat4, 128, 3, 1)
        self.disp4 = self.get_disp(iconv4)
        udisp4 = self.upsample_nn(self.disp4, 2)

        upconv3 = upconv(iconv4, 64, 3, 2) # H/4
        concat3 = tf.concat([upconv3, skip2, udisp4], 3)
        iconv3 = conv(concat3, 64, 3, 1)
        self.disp3 = self.get_disp(iconv3)
        udisp3 = self.upsample_nn(self.disp3, 2)

        upconv2 = upconv(iconv3, 32, 3, 2) # H/2
        concat2 = tf.concat([upconv2, skip1, udisp3], 3)
        iconv2 = conv(concat2, 32, 3, 1)
        self.disp2 = self.get_disp(iconv2)
        udisp2 = self.upsample_nn(self.disp2, 2)

        upconv1 = upconv(iconv2, 16, 3, 2) # H
        concat1 = tf.concat([upconv1, udisp2], 3)
        iconv1 = conv(concat1, 16, 3, 1)
        self.disp1 = self.get_disp(iconv1)
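get_disp is not shown in the snippet above; in the repo it is, roughly, a 2-channel sigmoid convolution scaled by 0.3, so the predicted disparity is capped at 30% of the image width and a single input yields both disparity channels:

def get_disp(self, x):
    # two output channels: left and right disparity; sigmoid keeps values in
    # (0, 1), scaled so the maximum disparity is 0.3 * image width
    disp = 0.3 * self.conv(x, 2, 3, 1, tf.nn.sigmoid)
    return disp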
Interestingly, upconv is implemented by first resizing the feature map to 2x scale and then applying a convolution (a common alternative to transposed convolution that helps avoid checkerboard artifacts).
Training Loss
The loss has three components: ap, ds, and lr. C_ap measures how closely the reconstructed image matches the training input, C_ds constrains the predicted disparities to be smooth, and C_lr favors left-right disparity consistency. Note that only the left image passes through the CNN, even though both left and right images take part in every major step of the loss.
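As defined in the paper, the total loss at each output scale s is a weighted sum over both views (the final training loss sums this over the four scales):

C_s = \alpha_{ap} (C^l_{ap} + C^r_{ap}) + \alpha_{ds} (C^l_{ds} + C^r_{ds}) + \alpha_{lr} (C^l_{lr} + C^r_{lr})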
Appearance Matching Loss
The reconstructed image is obtained with a spatial transformer network (STN), which uses the disparity map to fetch the corresponding pixel values from the opposite image of the stereo pair. The STN uses bilinear sampling, so each output pixel is a weighted sum of four input pixels, keeping the sampling fully differentiable.
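The paper's appearance matching term combines SSIM and L1, with \alpha = 0.85:

C^l_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1 - \alpha) \, \| I^l_{ij} - \tilde{I}^l_{ij} \|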
# IMAGE RECONSTRUCTION
# L1
self.l1_left = [tf.abs( self.left_est[i] - self.left_pyramid[i]) for i in range(4)]
self.l1_reconstruction_loss_left = [tf.reduce_mean(l) for l in self.l1_left]
self.l1_right = [tf.abs(self.right_est[i] - self.right_pyramid[i]) for i in range(4)]
self.l1_reconstruction_loss_right = [tf.reduce_mean(l) for l in self.l1_right]
# SSIM
self.ssim_left = [self.SSIM( self.left_est[i], self.left_pyramid[i]) for i in range(4)]
self.ssim_loss_left = [tf.reduce_mean(s) for s in self.ssim_left]
self.ssim_right = [self.SSIM(self.right_est[i], self.right_pyramid[i]) for i in range(4)]
self.ssim_loss_right = [tf.reduce_mean(s) for s in self.ssim_right]
# WEIGHTED SUM
self.image_loss_right = [self.params.alpha_image_loss * self.ssim_loss_right[i] + (1 - self.params.alpha_image_loss) * self.l1_reconstruction_loss_right[i] for i in range(4)]
self.image_loss_left = [self.params.alpha_image_loss * self.ssim_loss_left[i] + (1 - self.params.alpha_image_loss) * self.l1_reconstruction_loss_left[i] for i in range(4)]
self.image_loss = tf.add_n(self.image_loss_left + self.image_loss_right)
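The SSIM helper is also not shown; the repo computes a simplified SSIM from 3x3 average-pooled local statistics and returns (1 - SSIM) / 2 clipped to [0, 1] so it can be minimized directly. Roughly:

def SSIM(self, x, y):
    C1 = 0.01 ** 2
    C2 = 0.03 ** 2
    # local means and (co)variances via 3x3 average pooling
    mu_x = slim.avg_pool2d(x, 3, 1, 'VALID')
    mu_y = slim.avg_pool2d(y, 3, 1, 'VALID')
    sigma_x  = slim.avg_pool2d(x ** 2, 3, 1, 'VALID') - mu_x ** 2
    sigma_y  = slim.avg_pool2d(y ** 2, 3, 1, 'VALID') - mu_y ** 2
    sigma_xy = slim.avg_pool2d(x * y, 3, 1, 'VALID') - mu_x * mu_y
    SSIM_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    SSIM_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    # return a dissimilarity in [0, 1], lower is better
    return tf.clip_by_value((1 - SSIM_n / SSIM_d) / 2, 0, 1)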
Disparity Smoothness Loss
This term encourages the predicted disparities to be locally smooth, so a penalty is placed on the disparity gradients. The weight of this cost depends on the image gradients: it is an edge-aware term, so disparity discontinuities are penalized less where the image itself has edges.
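In the paper this is the edge-aware smoothness term:

C^l_{ds} = \frac{1}{N} \sum_{i,j} |\partial_x d^l_{ij}| \, e^{-\|\partial_x I^l_{ij}\|} + |\partial_y d^l_{ij}| \, e^{-\|\partial_y I^l_{ij}\|}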
# gradient helpers (defined elsewhere in the repo):
def gradient_x(self, img):
    # horizontal forward difference
    return img[:, :, :-1, :] - img[:, :, 1:, :]

def gradient_y(self, img):
    # vertical forward difference
    return img[:, :-1, :, :] - img[:, 1:, :, :]

def get_disparity_smoothness(self, disp, pyramid):
    disp_gradients_x = [self.gradient_x(d) for d in disp]
    disp_gradients_y = [self.gradient_y(d) for d in disp]
    image_gradients_x = [self.gradient_x(img) for img in pyramid]
    image_gradients_y = [self.gradient_y(img) for img in pyramid]
    # edge-aware weights: smoothness counts less where the image has edges
    weights_x = [tf.exp(-tf.reduce_mean(tf.abs(g), 3, keep_dims=True)) for g in image_gradients_x]
    weights_y = [tf.exp(-tf.reduce_mean(tf.abs(g), 3, keep_dims=True)) for g in image_gradients_y]
    smoothness_x = [disp_gradients_x[i] * weights_x[i] for i in range(4)]
    smoothness_y = [disp_gradients_y[i] * weights_y[i] for i in range(4)]
    return smoothness_x + smoothness_y
# DISPARITY SMOOTHNESS
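# the 1/2**i factor down-weights the smoothness penalty at coarser pyramid scales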
self.disp_left_loss = [tf.reduce_mean(tf.abs(self.disp_left_smoothness[i])) / 2 ** i for i in range(4)]
self.disp_right_loss = [tf.reduce_mean(tf.abs(self.disp_right_smoothness[i])) / 2 ** i for i in range(4)]
self.disp_gradient_loss = tf.add_n(self.disp_left_loss + self.disp_right_loss)
Left-Right Disparity Consistency Loss
Not much to explain here: each predicted disparity map should agree with the other view's disparity map projected into it. Straight to the code:
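In the paper this is an L1 penalty between each predicted disparity map and the projection of the other one:

C^l_{lr} = \frac{1}{N} \sum_{i,j} | d^l_{ij} - d^r_{ij + d^l_{ij}} |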
# LR CONSISTENCY
self.lr_left_loss = [tf.reduce_mean(tf.abs(self.right_to_left_disp[i] - self.disp_left_est[i])) for i in range(4)]
self.lr_right_loss = [tf.reduce_mean(tf.abs(self.left_to_right_disp[i] - self.disp_right_est[i])) for i in range(4)]
self.lr_loss = tf.add_n(self.lr_left_loss + self.lr_right_loss)
# LR CONSISTENCY: project each disparity map into the other view
with tf.variable_scope('left-right'):
    self.right_to_left_disp = [self.generate_image_left(self.disp_right_est[i], self.disp_left_est[i]) for i in range(4)]
    self.left_to_right_disp = [self.generate_image_right(self.disp_left_est[i], self.disp_right_est[i]) for i in range(4)]
def generate_image_left(self, img, disp):
    # sample img (right view) at x - disp to reconstruct the left view
    return bilinear_sampler_1d_h(img, -disp)

def generate_image_right(self, img, disp):
    # sample img (left view) at x + disp to reconstruct the right view
    return bilinear_sampler_1d_h(img, disp)
I ran it on the training data; the results are quite decent, especially for an unsupervised method.