(Open source) ECCV 2020 classic vision BEV algorithm: LSS explained, with code

自动驾驶小学生

Last modified: 2024-05-15 15:30:34

Category: Paper Notes. Tags: BEV, LSS

First published: 2023-03-06 21:58:43

Article link: https://blog.csdn.net/cg129054036/article/details/129371802

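This post walks through LSS (Lift, Splat, Shoot), a classic vision-based BEV algorithm published at ECCV 2020. The method performs semantic segmentation in bird's-eye view by explicitly estimating a discrete depth distribution for each image pixel. The key question is how 2D image features are converted into BEV features.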

Project page: https://nv-tlabs.github.io/lift-splat-shoot/

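0. Project structure

The project layout is shown below and is very compact: the source files are data.py, explore.py, models.py, tools.py and train.py, with main.py as the entry point. The two files to focus on are explore.py and models.py.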

.
├── imgs
│   ├── check.gif
│   └── eval.gif
├── LICENSE
├── main.py
├── model525000.pt
├── nuscenes -> /root/bev_baseline/nuscenes
├── README.md
└── src
    ├── data.py
    ├── explore.py
    ├── __init__.py
    ├── models.py
    ├── tools.py
    └── train.py

The codebase is also very small: as the cloc output below shows, there are only about 960 lines of Python code.

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           7            309            339            963
Markdown                         1             19              0             55
HTML                             1              0              0              1
-------------------------------------------------------------------------------
SUM:                             9            328            339           1019
-------------------------------------------------------------------------------

1. main.py

First, main.py. It uses the fire library to parse command-line arguments and dispatch to the matching function.

from fire import Fire
import src

if __name__ == '__main__':
    Fire({
        'lidar_check': src.explore.lidar_check,
        'cumsum_check': src.explore.cumsum_check,
        'train': src.train.train,
        'eval_model_iou': src.explore.eval_model_iou,
        'viz_model_preds': src.explore.viz_model_preds,
    })

2. explore.py

To evaluate the model, run the command below. It calls eval_model_iou, which takes the dataset version (here the nuScenes mini split), the checkpoint file (model525000.pt), and the dataset path.

python3 main.py eval_model_iou mini --modelf=model525000.pt --dataroot=./nuscenes

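The function also accepts the following parameters:

H=900, W=1600: original image size
resize_lim=(0.193, 0.225): range of the resize factor
final_dim=(128, 352): final image size after preprocessing
bot_pct_lim=(0.0, 0.22): range of the fraction of the image cropped away at the bottom
rot_lim=(-5.4, 5.4): range of the image rotation angle used during training
rand_flip=True: whether to flip the image randomly
xbound=[-50.0, 50.0, 0.5]: range and cell size along x (meters)
ybound=[-50.0, 50.0, 0.5]: range and cell size along y (meters)
zbound=[-10.0, 10.0, 20.0]: range and cell size along z (meters)
dbound=[4.0, 45.0, 1.0]: range and bin size along the depth direction (meters)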
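
To make the bound triples concrete, here is a minimal pure-Python sketch of how a [lower, upper, step] triple can be turned into a cell size, the center of the first cell, and a cell count. The helper name gen_grid is hypothetical; it is only meant to reproduce the values [0.5, 0.5, 20], [-49.75, -49.75, 0] and [200, 200, 1] quoted in the model code comments, not to stand in for the repository's gen_dx_bx.

```python
def gen_grid(xbound, ybound, zbound):
    """Hypothetical helper: per-axis cell size, first-cell center, and cell count."""
    bounds = [xbound, ybound, zbound]
    dx = [b[2] for b in bounds]                       # cell size
    bx = [b[0] + b[2] / 2.0 for b in bounds]          # center of the first cell
    nx = [int((b[1] - b[0]) / b[2]) for b in bounds]  # number of cells
    return dx, bx, nx


dx, bx, nx = gen_grid([-50.0, 50.0, 0.5], [-50.0, 50.0, 0.5], [-10.0, 10.0, 20.0])

# Depth bins from dbound=[4.0, 45.0, 1.0]: start inclusive, end exclusive.
d_lo, d_hi, d_step = 4.0, 45.0, 1.0
depth_bins = [d_lo + i * d_step for i in range(int((d_hi - d_lo) / d_step))]

print(dx, bx, nx, len(depth_bins))
```

With these defaults the BEV grid has 200 x 200 cells of 0.5 m and a single 20 m tall cell along z, and the depth axis has 41 bins from 4 m to 44 m.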

import torch
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
from PIL import Image
import matplotlib.patches as mpatches
import os

from .data import compile_data
from .tools import (ego_to_cam, get_only_in_img_mask, denormalize_img,
                    SimpleLoss, get_val_info, add_ego, gen_dx_bx,
                    get_nusc_maps, plot_nusc_map)
from .models import compile_model

def eval_model_iou(version,
                modelf,
                dataroot='/data/nuscenes',
                gpuid=-1,

                H=900, W=1600,
                resize_lim=(0.193, 0.225),
                final_dim=(128, 352),
                bot_pct_lim=(0.0, 0.22),
                rot_lim=(-5.4, 5.4),
                rand_flip=True,

                xbound=[-50.0, 50.0, 0.5],
                ybound=[-50.0, 50.0, 0.5],
                zbound=[-10.0, 10.0, 20.0],
                dbound=[4.0, 45.0, 1.0],

                bsz=4,
                nworkers=10,
                ):
    grid_conf = {
        'xbound': xbound,
        'ybound': ybound,
        'zbound': zbound,
        'dbound': dbound,
    }
    data_aug_conf = {
                    'resize_lim': resize_lim,
                    'final_dim': final_dim,
                    'rot_lim': rot_lim,
                    'H': H, 'W': W,
                    'rand_flip': rand_flip,
                    'bot_pct_lim': bot_pct_lim,
                    'cams': ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
                             'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'],
                    'Ncams': 5,
                }
    trainloader, valloader = compile_data(version, dataroot, data_aug_conf=data_aug_conf,
                                          grid_conf=grid_conf, bsz=bsz, nworkers=nworkers,
                                          parser_name='segmentationdata')

    device = torch.device('cpu') if gpuid < 0 else torch.device(f'cuda:{gpuid}')
    model = compile_model(grid_conf, data_aug_conf, outC=1)    
    print('loading', modelf)
    
    # load on GPU
    # model.load_state_dict(torch.load(modelf))
    
    # load on CPU
    model.load_state_dict(torch.load(modelf, map_location = torch.device('cpu')))
    model.to(device)

    # loss_fn = SimpleLoss(1.0).cuda(gpuid)
    loss_fn = SimpleLoss(1.0)
    model.eval()
    
    val_info = get_val_info(model, valloader, loss_fn, device)
    
    print(val_info)
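
As the function name suggests, the reported metric is an intersection-over-union. The sketch below shows one common way to compute a binary IoU from raw logits in pure Python. The helper binary_iou and the threshold of 0 on the logits are assumptions made for illustration; the listing above does not show how get_val_info computes its numbers.

```python
def binary_iou(logits, targets, threshold=0.0):
    """Hypothetical helper: IoU between thresholded logits and a 0/1 target mask."""
    pred = [l > threshold for l in logits]   # assumed decision rule: logit > 0
    tgt = [t > 0.5 for t in targets]
    intersect = sum(1 for p, t in zip(pred, tgt) if p and t)
    union = sum(1 for p, t in zip(pred, tgt) if p or t)
    # Convention chosen here: an empty union counts as a perfect match.
    return intersect / union if union > 0 else 1.0


# Four BEV cells flattened into a list: one true positive,
# one false positive, one false negative, one true negative.
iou = binary_iou([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 1])
print(iou)
```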

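3. models.py

3.1 LSS model initialization

The grid parameters and data augmentation parameters are passed to the network. Here outC is 1, i.e. a single class is predicted.
Initialization computes the grid layout (dx, bx, nx), sets the image downsampling factor (16) and the image feature dimension (64), builds the frustum, and constructs CamEncode and BevEncode.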

def compile_model(grid_conf, data_aug_conf, outC):
    return LiftSplatShoot(grid_conf, data_aug_conf, outC)


class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        # grid configuration
        self.grid_conf = grid_conf
        # data augmentation configuration
        self.data_aug_conf = data_aug_conf  
        
        # compute the grid layout
        dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
                                              self.grid_conf['ybound'],
                                              self.grid_conf['zbound'],
                                              )  
        self.dx = nn.Parameter(dx, requires_grad=False)  # [0.5,0.5,20]
        self.bx = nn.Parameter(bx, requires_grad=False)  # [-49.75,-49.75,0]
        self.nx = nn.Parameter(nx, requires_grad=False)  # [200,200,1]

        self.downsample = 16  # downsampling factor
        self.camC = 64  # image feature dimension
        self.frustum = self.create_frustum()  # frustum: DxfHxfWx3(41x8x22x3)
        self.D, _, _, _ = self.frustum.shape  # D: 41
        self.camencode = CamEncode(self.D, self.camC, self.downsample)
        self.bevencode = BevEncode(inC=self.camC, outC=outC)

        # toggle using QuickCumsum vs. autograd
        self.use_quickcumsum = True

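3.1.1 create_frustum: generating the frustum point cloud

create_frustum builds the frustum and returns a tensor of shape $D \times H \times W \times 3$. The tensor stores the frustum point coordinates, i.e. the usual $(u, v, d)$ triples: pixel column, pixel row, and depth.

The depth axis $D$ takes the values [ 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.], i.e. 41 discrete depth values;
the height axis $H$ takes the values [0.0000, 18.1429, 36.2857, 54.4286, 72.5714, 90.7143, 108.8571, 127.0000], i.e. 8 evenly spaced rows over the image height (16x downsampling);
the width axis $W$ takes the values [0.0000, 16.7143, 33.4286, 50.1429, 66.8571, 83.5714, 100.2857, 117.0000, 133.7143, 150.4286, 167.1429, 183.8571, 200.5714, 217.2857, 234.0000, 250.7143, 267.4286, 284.1429, 300.8571, 317.5714, 334.2857, 351.0000], i.e. 22 evenly spaced columns over the image width (16x downsampling).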

def create_frustum(self):
        # input image size after preprocessing  ogfH:128  ogfW:352
        ogfH, ogfW = self.data_aug_conf['final_dim'] 
        
        # feature map size after 16x downsampling  fH: 8  fW: 22
        fH, fW = ogfH // self.downsample, ogfW // self.downsample
        
        # self.grid_conf['dbound'] = [4, 45, 1]
        # discretize the depth direction  ds: DxfHxfW(41x8x22)
        # ds: tensor([ 4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
        # 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31.,
        # 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.])
        ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
        
        # D: 41, the number of depth bins
        D, _, _ = ds.shape 
        
        """
        1. torch.linspace(0, ogfW - 1, fW, dtype=torch.float)
        tensor([0.0000, 16.7143, 33.4286, 50.1429, 66.8571, 83.5714, 100.2857,
                117.0000, 133.7143, 150.4286, 167.1429, 183.8571, 200.5714, 217.2857,
                234.0000, 250.7143, 267.4286, 284.1429, 300.8571, 317.5714, 334.2857,
                351.0000])

        2. torch.linspace(0, ogfH - 1, fH, dtype=torch.float)
        tensor([0.0000, 18.1429, 36.2857, 54.4286, 72.5714, 90.7143, 108.8571,
                127.0000])
        """
        
        # 22 evenly spaced pixel columns between 0 and 351  xs: DxfHxfW(41x8x22)
        xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW) 
        
        # 8 evenly spaced pixel rows between 0 and 127  ys: DxfHxfW(41x8x22)
        ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW) 

        # D x H x W x 3
        # stack into a grid of (u, v, d) coordinates
        frustum = torch.stack((xs, ys, ds), -1)  
        return nn.Parameter(frustum, requires_grad=False)
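
The sampling grid above can be reproduced without torch. The following is a minimal sketch that rebuilds the same (u, v, d) grid with nested lists; the linspace helper is written here only for illustration and assumes endpoints are included, which matches the values printed in the docstring above.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, both endpoints included."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


final_h, final_w, downsample = 128, 352, 16
fH, fW = final_h // downsample, final_w // downsample  # 8, 22

us = linspace(0, final_w - 1, fW)      # pixel columns
vs = linspace(0, final_h - 1, fH)      # pixel rows
ds = [float(d) for d in range(4, 45)]  # depth bins 4..44

# frustum[d][v][u] is one (u, v, d) point: D x fH x fW x 3 overall.
frustum = [[[(u, v, d) for u in us] for v in vs] for d in ds]

print(len(frustum), len(frustum[0]), len(frustum[0][0]))
print(frustum[0][0][1])
```

Note that the spacing between neighbouring samples is 351/21 and 127/7 pixels rather than exactly 16, because the grid spans the first to the last pixel.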

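3.1.2 CamEncode initialization

Initialization of the image feature extractor. The backbone is efficientnet-b0, whose last two feature maps have shapes $(bs, 112, H/16, W/16)$ and $(bs, 320, H/32, W/32)$. The two are fused: the deeper map is upsampled and concatenated with the shallower one, giving $(bs, 432, H/16, W/16)$, which the convolutions in the Up module then map to 512 channels.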

import torch
from torch import nn
from efficientnet_pytorch import EfficientNet
from torchvision.models.resnet import resnet18

from .tools import gen_dx_bx, cumsum_trick, QuickCumsum


class Up(nn.Module):
    def __init__(self, in_channels, out_channels, scale_factor=2):
        super().__init__()

        self.up = nn.Upsample(scale_factor=scale_factor, mode='bilinear',
                              align_corners=True) # upsample BxCxHxW->BxCx2Hx2W (for scale_factor=2)

        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),                # inplace=True operates in place to save memory
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x1, x2):
        x1 = self.up(x1)
        x1 = torch.cat([x2, x1], dim=1)
        return self.conv(x1)


class CamEncode(nn.Module):  
    def __init__(self, D, C, downsample):
        super(CamEncode, self).__init__()
        self.D = D
        self.C = C
                
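        # use efficientnet to extract features
        self.trunk = EfficientNet.from_pretrained("efficientnet-b0")   
        
        # upsampling module with 320+112 input channels and 512 output channels
        self.up1 = Up(320+112, 512)     
        
        # the 105 output channels split into two parts: the first 41 are logits over the 41 discrete depth bins;
        # the remaining 64 are the context (semantic) feature at each feature-map location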
        self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)

3.1.3 BevEncode initialization

The BEV feature network. The backbone is resnet18; the outputs of layer1 and layer3 are fused.

class BevEncode(nn.Module):
    def __init__(self, inC, outC):
        super(BevEncode, self).__init__()
        # level0: (bs, 64, 100, 100)
        # level1: (bs, 128, 50, 50)
        # level2: (bs, 256, 25, 25)
        
        # use the first 3 stages of resnet as the backbone
        trunk = resnet18(pretrained=False, zero_init_residual=True)
        self.conv1 = nn.Conv2d(inC, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = trunk.bn1
        self.relu = trunk.relu

        self.layer1 = trunk.layer1
        self.layer2 = trunk.layer2
        self.layer3 = trunk.layer3

        self.up1 = Up(64+256, 256, scale_factor=4)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, outC, kernel_size=1, padding=0),
        )

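3.2 LSS forward pass

The forward pass takes the following inputs:

imgs (x in the code): surround-view camera images, shape (bs, N, 3, H, W), where N is the number of cameras;
rots: rotation matrices from the camera frame to the ego-vehicle frame, shape (bs, N, 3, 3);
trans: translation vectors from the camera frame to the ego-vehicle frame, shape (bs, N, 3);
intrins: camera intrinsics, shape (bs, N, 3, 3);
post_rots: rotation matrices introduced by image augmentation, shape (bs, N, 3, 3);
post_trans: translation vectors introduced by image augmentation, shape (bs, N, 3).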

 def forward(self, x, rots, trans, intrins, post_rots, post_trans):
        # x:[4,6,3,128,352]
        # rots: [4,6,3,3]
        # trans: [4,6,3]
        # intrins: [4,6,3,3]
        # post_rots: [4,6,3,3]
        # post_trans: [4,6,3]
        
        # lift the images into BEV  x: B x C x 200 x 200 (B x 64 x 200 x 200)
        x = self.get_voxels(x, rots, trans, intrins, post_rots, post_trans)
        
        x = self.bevencode(x)  # BEV feature extraction with resnet18  x: B x 1 x 200 x 200
        print("pred: x[0,0,1,1:10]", x[0,0,1,1:10])
        
        return x

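3.2.1 get_geometry (coordinate transformation)

The frustum points are transformed from the image frame to the ego-vehicle frame: the augmentation is undone first, the pixels are unprojected with the inverse intrinsics, and a rigid transform with the camera-to-ego extrinsics follows.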

  def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
 
        """
        Determine the (x,y,z) locations (in the ego frame) of the points in the point cloud.
        Returns B x N x D x H/downsample x W/downsample x 3
        """
        B, N, _ = trans.shape  # B: batch size    N: number of cameras

        # undo post-transformation
        # B x N x D x H x W x 3
        # undo the pixel changes introduced by data augmentation and preprocessing
        points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
        points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))


        # cam_to_ego
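        # image frame -> normalized camera frame -> camera frame -> ego frame
        # (blog author's reading) because the transform is linear, the de-normalization
        # (multiplying by depth) is done in the image frame, and the inverse intrinsics
        # then map the result back to the camera frame
        points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                            points[:, :, :, :, :, 2:3]
                            ), 5)  # turn pixel coordinates (u,v,d) into (du,dv,d)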
        
        # d[u,v,1]^T=intrins*rots^(-1)*([x,y,z]^T-trans)
        combine = rots.matmul(torch.inverse(intrins))
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
        
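        # add the translation to finish converting d[u,v,1]^T to [x,y,z]^T in the ego frame
        points += trans.view(B, N, 1, 1, 1, 3)  
        
        # (bs, N, depth, H, W, 3): physical meaning
        # for every surround-view image in the batch, the ego-frame position of each
        # feature-map location at each candidate depth
        # B x N x D x H x W x 3 (4 x 6 x 41 x 8 x 22 x 3)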
        return points
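
The chain in get_geometry can be followed for a single point by hand. The sketch below is a pure-Python illustration under stated assumptions: a pinhole intrinsic matrix given by fx, fy, cx, cy (so its inverse can be written in closed form), an augmentation that is a pure resize by scale followed by a crop offset, and made-up calibration values. The function unproject and all numbers are hypothetical.

```python
def matvec(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def unproject(u, v, d, fx, fy, cx, cy, rot, trans, scale=1.0, crop=(0.0, 0.0)):
    """Map an augmented pixel (u, v) at depth d to ego-frame (x, y, z)."""
    # 1) undo augmentation (assumed: aug_pixel = scale * orig_pixel - crop)
    u0 = (u + crop[0]) / scale
    v0 = (v + crop[1]) / scale
    # 2) + 3) scale by depth and apply the inverse pinhole intrinsics
    cam = [(u0 - cx) * d / fx, (v0 - cy) * d / fy, d]
    # 4) rigid transform from camera frame to ego frame
    ego = matvec(rot, cam)
    return [e + t for e, t in zip(ego, trans)]


# Made-up calibration: camera looks along ego x; camera axes are
# (x right, y down, z forward), ego axes are (x forward, y left, z up).
rot = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
trans = [1.5, 0.0, 1.6]
calib = dict(fx=1000.0, fy=1000.0, cx=800.0, cy=450.0, rot=rot, trans=trans)

# The principal point at 10 m depth lands 10 m in front of the camera.
p_center = unproject(800.0, 450.0, 10.0, **calib)
# The same physical pixel after a 0.22 resize and a 48-pixel crop from the top.
p_aug = unproject(176.0, 51.0, 10.0, scale=0.22, crop=(0.0, 48.0), **calib)
print(p_center, p_aug)
```

Both calls should give approximately the same ego point, which is the purpose of undoing the augmentation before unprojecting.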

3.2.2 get_cam_feats (extracting image features)

The image feature extractor returns a tensor of shape $B \times N \times D \times fH \times fW \times C$.

 def get_cam_feats(self, x): 
        """
        Return B x N x D x H/downsample x W/downsample x C
        """
        B, N, C, imH, imW = x.shape  # B: 4  N: 6  C: 3  imH: 128  imW: 352

        x = x.view(B*N, C, imH, imW)  # merge the B and N dimensions  x: 24 x 3 x 128 x 352    
        x = self.camencode(x) # encode the images  x: B*N x C x D x fH x fW(24 x 64 x 41 x 8 x 22)
        
        # split the first dimension back into B and N  x: B x N x C x D x fH x fW(4 x 6 x 64 x 41 x 8 x 22)
        x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample) 
        x = x.permute(0, 1, 3, 4, 5, 2)  # x: B x N x D x fH x fW x C(4 x 6 x 41 x 8 x 22 x 64)
        
        return x

The CamEncode forward pass is:

def forward(self, x):
	# depth: B*N x D x fH x fW(24 x 41 x 8 x 22)  
	# x: B*N x C x D x fH x fW(24 x 64 x 41 x 8 x 22)
	depth, x = self.get_depth_feat(x) 
	
	return x

def get_depth_feat(self, x):
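	# extract features with efficientnet  x: 24 x 512 x 8 x 22 
	x = self.get_eff_depth(x)    
	# Depth
	# 1x1 convolution to change the channel count  x: 24 x 105(C+D) x 8 x 22
	x = self.depthnet(x)   

	# the first D channels are depth logits; softmax over them  depth: 24 x 41 x 8 x 22
	depth = self.get_depth_dist(x[:, :self.D]) 
	
	# outer product of the depth distribution and the context features builds the feature point cloud  new_x: 24 x 64 x 41 x 8 x 22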
	# depth.unsqueeze(1): (24,1,41,8,22)
	# x[:, self.D:(self.D + self.C)].unsqueeze(2) :(24,64,1,8,22)
	new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)  
	
	return depth, new_x
	
def get_depth_dist(self, x, eps=1e-20):
	return x.softmax(dim=1)          # softmax over the depth channels: per-pixel probability of each depth bin

def get_eff_depth(self, x):   # extract features with efficientnet
	# adapted from https://github.com/lukemelas/EfficientNetPyTorch/blob/master/efficientnet_pytorch/model.py#L231
	endpoints = dict()
	
	# Stem
	x = self.trunk._swish(self.trunk._bn0(self.trunk._conv_stem(x)))
	prev_x = x
	
	# Blocks
	for idx, block in enumerate(self.trunk._blocks):
	    drop_connect_rate = self.trunk._global_params.drop_connect_rate
	    if drop_connect_rate:
	        drop_connect_rate *= float(idx) / len(self.trunk._blocks) # scale drop connect_rate
	    x = block(x, drop_connect_rate=drop_connect_rate)
	    if prev_x.size(2) > x.size(2):
	        endpoints['reduction_{}'.format(len(endpoints)+1)] = prev_x
	    prev_x = x
	
	# Head
	endpoints['reduction_{}'.format(len(endpoints)+1)] = x
	# upsample the reduction_5 feature map and fuse it with reduction_4
	x = self.up1(endpoints['reduction_5'], endpoints['reduction_4'])   
	return x
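
The "lift" step in get_depth_feat is easiest to see for a single feature-map location. The pure-Python sketch below applies a softmax to the depth logits and takes the outer product with the context vector, using toy sizes (3 depth bins, 2 feature channels) instead of 41 and 64. The names softmax and lift_pixel are illustrative only, and the result is laid out as D x C per pixel, whereas the tensor code above keeps C before D.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_logits, context):
    """Outer product: one copy of the context vector per depth bin, weighted by its probability."""
    depth = softmax(depth_logits)
    return [[a * c for c in context] for a in depth]  # D x C


depth_logits = [0.0, 2.0, 0.0]  # toy example: the middle bin is most likely
context = [1.0, -2.0]
lifted = lift_pixel(depth_logits, context)

for row in lifted:
    print(row)
```

Because the depth weights sum to 1, summing the lifted features over the depth axis recovers the original context vector; a confident depth prediction places almost all of the feature at one depth, while an uncertain one spreads it along the ray.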

3.2.3 voxel_pooling (building the BEV features)

This step uses QuickCumsum; the author gives a pseudocode explanation at https://github.com/nv-tlabs/lift-splat-shoot/issues/14. The result is a feature map of shape $B \times 64 \times 200 \times 200$.
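
The idea behind sum pooling with a cumulative sum can be sketched on scalars: once the points are sorted by voxel rank, take a running sum, keep only the last entry of each run of equal ranks, and subtract neighbouring kept entries. The pure-Python function below is an illustration of that general prefix-sum trick, written under the assumption that the input is already sorted by rank; it is not the repository's QuickCumsum, which is a custom autograd function operating on tensors.

```python
def cumsum_pool(values, ranks):
    """Sum values that share a rank; ranks must be sorted in ascending order."""
    if not values:
        return [], []
    # running (prefix) sum over all points
    csum, running = [], 0.0
    for v in values:
        running += v
        csum.append(running)
    # keep the last point of every run of identical ranks
    last = len(ranks) - 1
    kept = [i for i in range(len(ranks)) if i == last or ranks[i] != ranks[i + 1]]
    kept_csum = [csum[i] for i in kept]
    # the difference of neighbouring kept prefix sums is the sum within each run
    pooled = [kept_csum[0]] + [kept_csum[i] - kept_csum[i - 1]
                               for i in range(1, len(kept_csum))]
    return pooled, [ranks[i] for i in kept]


pooled, pooled_ranks = cumsum_pool([1, 2, 3, 4, 5], [0, 0, 1, 3, 3])
print(pooled, pooled_ranks)
```

The appeal of this formulation is that it needs only a sort, one cumulative sum and one subtraction, with no per-voxel loop and no padding for voxels that hold different numbers of points.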

 def voxel_pooling(self, geom_feats, x):	
	# geom_feats: (B x N x D x fH x fW x 3), point coordinates in the ego frame
	# x: (B x N x D x fH x fW x C), image point-cloud features
	
	# B: 4  N: 6  D: 41  H: 8  W: 22  C: 64
	B, N, D, H, W, C = x.shape  
	# the flattened feature point cloud has B*N*D*H*W points
	Nprime = B*N*D*H*W  # Nprime: 173184
	
	# flatten x
	# one row of C features per point
	x = x.reshape(Nprime, C)  
	
	# flatten indices
	# shift the [-50,50] / [-10,10] ranges to [0,100] / [0,20], then divide by the cell size
	# and truncate: ego-frame coordinates become integer voxel indices
	geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()  
	
	# flatten the voxel indices as well, geom_feats: (B*N*D*H*W, 3)
	geom_feats = geom_feats.view(Nprime, 3) 
	
	# batch index of every point
	# (Nprime, 1)
	batch_ix = torch.cat([torch.full([Nprime//B, 1], ix, device=x.device, dtype=torch.long) for ix in range(B)]) 
	
	# geom_feats: B*N*D*H*W x 4(173184 x 4), geom_feats[:,3] is the batch index
	geom_feats = torch.cat((geom_feats, batch_ix), 1) 
	
	# filter out points that are outside box
	# keep only points inside the grid  x: 0~199  y: 0~199  z: 0
	kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
	    & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
	    & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
	x = x[kept]  # x: 168648 x 64

	geom_feats = geom_feats[kept]
	
	ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
	    + geom_feats[:, 1] * (self.nx[2] * B)\
	    + geom_feats[:, 2] * B\
	    + geom_feats[:, 3]  
	sorts = ranks.argsort()
	x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]  # sort by rank so that points in the same voxel are adjacent
	
	# cumsum trick
	if not self.use_quickcumsum:
	    x, geom_feats = cumsum_trick(x, geom_feats, ranks)
	else:
	    x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)  # one summed entry per voxel per batch sample  x: 29072 x 64  geom_feats: 29072 x 4
	
	# griddify (B x C x Z x X x Y)
	# final: B x 64 x 1 x 200 x 200
	final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)  
	
	# scatter x into final at its voxel coordinates
	final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x  
	
	# collapse Z
	# remove the z dimension
	final = torch.cat(final.unbind(dim=2), 1) 
	
	# final: B x 64 x 200 x 200
	return final
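
The index arithmetic in voxel_pooling can be checked for individual points. The sketch below is a pure-Python illustration using the grid values from the comments in the model code (dx = [0.5, 0.5, 20], bx = [-49.75, -49.75, 0], nx = [200, 200, 1]); the helper names voxel_index, in_grid and rank are hypothetical. It uses int(), which truncates toward zero in the same way as a float-to-long conversion in torch.

```python
DX = [0.5, 0.5, 20.0]        # cell size per axis
BX = [-49.75, -49.75, 0.0]   # center of the first cell per axis
NX = [200, 200, 1]           # number of cells per axis


def voxel_index(point):
    """Ego-frame (x, y, z) in meters -> integer voxel index (truncated toward zero)."""
    return [int((p - (b - d / 2.0)) / d) for p, b, d in zip(point, BX, DX)]


def in_grid(idx):
    """True if the voxel index lies inside the 200 x 200 x 1 grid."""
    return all(0 <= i < n for i, n in zip(idx, NX))


def rank(idx, batch, B):
    """Unique integer per (voxel, batch sample), used as the sort key."""
    return idx[0] * (NX[1] * NX[2] * B) + idx[1] * (NX[2] * B) + idx[2] * B + batch


origin = voxel_index([0.0, 0.0, 0.0])   # the ego vehicle sits in the grid center
corner = voxel_index([49.9, -50.0, 1.0])
far = voxel_index([60.0, 0.0, 0.0])     # beyond the 50 m range

print(origin, in_grid(origin), rank(origin, batch=1, B=4))
print(corner, in_grid(corner))
print(far, in_grid(far))
```

Points that fall in the same voxel of the same batch sample receive the same rank, which is what lets the sort place them next to each other before pooling.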

3.2.4 BevEncode forward pass

The final output has shape $B \times 1 \times 200 \times 200$.

  def forward(self, x):  
        x = self.conv1(x)  
        x = self.bn1(x)
        x = self.relu(x)

        x1 = self.layer1(x)  # x1: 4 x 64 x 100 x 100
        x = self.layer2(x1)  # x: 4 x 128 x 50 x 50
        x = self.layer3(x)  # x: 4 x 256 x 25 x 25
        
        # upsample x by 4x and concatenate it with x1  x: 4 x 256 x 100 x 100
        x = self.up1(x, x1)  
        
        # 2x upsampling -> 3x3 conv -> 1x1 conv  x: 4 x 1 x 200 x 200
        x = self.up2(x) 

        return x
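
The spatial sizes in the comments above can be checked with the standard convolution output-size formula. The sketch below assumes the usual resnet18 settings for layer2 and layer3 (a 3x3 convolution with stride 2 and padding 1 in the first block of each stage, stride 1 in layer1); these values come from the torchvision architecture rather than from the listing itself, so treat them as assumptions.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


bev = 200                           # input BEV grid: 200 x 200
stem = conv_out(bev, 7, 2, 3)       # conv1: 7x7, stride 2, padding 3
layer1 = stem                       # assumed stride 1: size unchanged
layer2 = conv_out(layer1, 3, 2, 1)  # assumed 3x3, stride 2, padding 1
layer3 = conv_out(layer2, 3, 2, 1)  # assumed 3x3, stride 2, padding 1
up1 = layer3 * 4                    # Up(scale_factor=4), concatenated with layer1
up2 = up1 * 2                       # final 2x upsampling

print(stem, layer1, layer2, layer3, up1, up2)
```

The 4x upsampling brings the layer3 output back to the layer1 resolution so the two can be concatenated (64 + 256 channels), and the last 2x upsampling restores the 200 x 200 grid.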