ResNet
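Paper: https://arxiv.org/pdf/1512.03385.pdf
ResNet (deep residual network) was proposed in 2015. It addresses the degradation problem, where accuracy gets worse as more layers are stacked; its shortcut connections also ease gradient flow, which helps against vanishing and exploding gradients (see the figure below):
The paper proposes the following structure:
The shortcut x on the right is added to the original stack of layers, so the output becomes H(X)=F(X)+X. When the network is too deep, if F(X)=0 then H(X)=X, so the deep network can behave like a shallower one.
Why use a residual structure to build the identity mapping? As depth increases, if the newly added layers do nothing, the model should at least not get worse. For a neural network, however, doing nothing (the identity mapping) is exactly the hard part: because of the nonlinear layers, every layer loses some information. In other words, making a plain stack of layers fit the identity X is hard, whereas fitting F(X)=0 is much easier.
This raises a problem: how can X and F(X) be added when the input and output channel counts differ? In that case a convolution (a 1×1 convolution in the paper) is applied to X so that its channel count matches that of F(X).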
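The two points above (F(X)=0 gives the identity, and a 1×1 convolution fixes a channel mismatch) can be illustrated with a toy NumPy sketch. It is not the ResNet implementation: the residual branch is reduced to a single per-pixel linear map with ReLU, batch norm and the final ReLU are left out, and the name `residual_block` and all shapes are made up for illustration.

```python
import numpy as np

def residual_block(x, w_f, w_proj=None):
    """Toy residual block on a (C_in, H, W) array.

    w_f:    (C_out, C_in) weights of the residual branch F, applied per pixel
            (equivalent to a 1x1 convolution), followed by ReLU.
    w_proj: optional (C_out, C_in) weights of a 1x1 projection on the shortcut,
            needed only when C_out != C_in.
    """
    # Residual branch F(X): per-pixel linear map + ReLU.
    f = np.maximum(np.einsum('oc,chw->ohw', w_f, x), 0.0)
    # Shortcut branch: identity, or a 1x1 projection to match channel counts.
    shortcut = x if w_proj is None else np.einsum('oc,chw->ohw', w_proj, x)
    return f + shortcut

x = np.random.rand(4, 8, 8)

# Case 1: F(X) = 0, so the block reduces to the identity mapping H(X) = X.
out_identity = residual_block(x, w_f=np.zeros((4, 4)))

# Case 2: the channel count changes from 4 to 6, so the shortcut is projected.
out_proj = residual_block(x, w_f=np.random.randn(6, 4), w_proj=np.random.randn(6, 4))

print(np.allclose(out_identity, x), out_proj.shape)
```

With all residual weights at zero the block passes its input through unchanged, which is why adding such blocks should not hurt a network that is already deep enough.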
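The two forms of the ResNet block:
The ResNet50 used by FOTS uses the right-hand form: 1×1 convolutions reduce the channel count, which greatly reduces the number of parameters in the 3×3 convolution.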
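To make the saving concrete, here is a small weight count, assuming the 256-channel input and 64-channel bottleneck width shown in the paper's illustration of the bottleneck block. Biases and batch-norm parameters are ignored, and `conv_params` is a helper introduced only for this sketch.

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

# Two 3x3 convolutions applied directly to a 256-channel input.
plain_256 = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64 channels, 1x1 restores 64 -> 256.
bottleneck_256 = (conv_params(1, 256, 64)
                  + conv_params(3, 64, 64)
                  + conv_params(1, 64, 256))

print(plain_256)       # 1179648
print(bottleneck_256)  # 69632
```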
ResNet network structure:
Code:
'''
Take the outputs of the four ResNet50 stages as the encoder and keep them for feature fusion.
'''
import pretrainedmodels

# In the model's __init__:
bbNet = pretrainedmodels.__dict__['resnet50'](pretrained='imagenet')
self.backbone = bbNet

# Method of the same class:
def __foward_backbone(self, input):
    conv2 = None
    conv3 = None
    conv4 = None
    output = None
    for name, layer in self.backbone.named_children():
        input = layer(input)
        if name == 'layer1':
            conv2 = input
        elif name == 'layer2':
            conv3 = input
        elif name == 'layer3':
            conv4 = input
        elif name == 'layer4':
            output = input
            # Stop before the classification head (pooling and fully connected layers).
            break
    return output, conv4, conv3, conv2
Unet
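Network structure diagram:
The left side is the encoder, which uses convolution and downsampling (pooling). The features taken before each downsampling step are copied and cropped, then sent to the right side. The right side is the decoder, which uses convolution and upsampling (transposed convolution); after each upsampling step, the result is concatenated with the features sent over from the left side.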
'''
self.__foward_backbone: resnet50, returns output, conv4, conv3, conv2
self.mergeLayers0,1,2,3: concat + conv
self.__unpool: upsampling
'''
f = self.__foward_backbone(input)
g = [None] * 4
h = [None] * 4
# bottom level
h[0] = self.mergeLayers0(f[0])
g[0] = self.__unpool(h[0])
# i = 2
h[1] = self.mergeLayers1(g[0], f[1])
g[1] = self.__unpool(h[1])
# i = 3
h[2] = self.mergeLayers2(g[1], f[2])
g[2] = self.__unpool(h[2])
# i = 4
h[3] = self.mergeLayers3(g[2], f[3])
#g[3] = self.__unpool(h[3])
# final stage
final = self.mergeLayers4(h[3])
final = self.bn5(final)
final = F.relu(final)
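The shape flow of this upsample-and-concatenate loop can be traced with a small NumPy sketch. It rests on several assumptions: upsampling is nearest-neighbour by a factor of 2, the convolution inside each merge layer is omitted (so channels only accumulate), and the channel counts are scaled down from ResNet50's 2048/1024/512/256. The helpers `unpool` and `merge` are stand-ins, not the methods from the code above.

```python
import numpy as np

def unpool(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge(upsampled, skip):
    """Concatenate along the channel axis (the conv that follows is omitted)."""
    return np.concatenate([upsampled, skip], axis=0)

# Hypothetical encoder outputs for a 64x64 input, at strides 32, 16, 8 and 4,
# ordered like the return value (output, conv4, conv3, conv2).
f = [np.zeros((32, 2, 2)), np.zeros((16, 4, 4)),
     np.zeros((8, 8, 8)), np.zeros((4, 16, 16))]

h = f[0]
for skip in f[1:]:
    h = merge(unpool(h), skip)
    print(h.shape)
```

Each iteration doubles the spatial size so that it matches the next encoder feature map; that match is what makes the channel-wise concatenation possible.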
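Background
Upsampling: in contrast to downsampling, which reduces the image size, upsampling is used to enlarge it. Common upsampling methods include transposed convolution, bilinear interpolation, and unpooling. Transposed convolution is introduced first.
Transposed Convolution (Deconvolution)
A transposed convolution is also a convolution and needs kernel_size, Stride, and padding to be set. The difference is that the input feature map is preprocessed first. The process can be split into two steps:
1. Interpolate the original feature map: along both width and height, insert (Stride−1) zero-valued points between every two neighbouring pixels. The new feature map has height H’=H+(Stride−1)∗(H−1) and width W’=W+(Stride−1)∗(W−1).
2. Convolve the interpolated feature map. The output feature map has height (H−1)∗Stride−2∗padding+Size, where Size is the kernel size; the width follows the same formula with W in place of H.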
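The two steps can be sketched in NumPy for a single-channel map. This is a minimal illustration, not how a deep learning framework implements it: it assumes a square kernel, padding no larger than Size−1, and no output_padding or dilation. The function name `transposed_conv2d` is made up for this sketch.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1, padding=0):
    """Transposed convolution of a single-channel (H, W) map, done in two steps."""
    h, w = x.shape
    k = kernel.shape[0]  # square kernel assumed
    # Step 1: insert (stride - 1) zeros between neighbouring pixels.
    up = np.zeros((h + (stride - 1) * (h - 1), w + (stride - 1) * (w - 1)))
    up[::stride, ::stride] = x
    # Step 2: stride-1 convolution over the zero-inserted map, surrounded by a
    # border of (k - 1 - padding) zeros, using the kernel flipped in both directions.
    border = k - 1 - padding
    up = np.pad(up, border, mode='constant')
    flipped = kernel[::-1, ::-1]
    out_h, out_w = up.shape[0] - k + 1, up.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(up[i:i + k, j:j + k] * flipped)
    return out

x = np.arange(9.0).reshape(3, 3)
y = transposed_conv2d(x, np.ones((4, 4)), stride=2, padding=1)
print(y.shape)  # (6, 6), i.e. (3 - 1) * 2 - 2 * 1 + 4
```

With Stride=2, padding=1 and Size=4 the output is exactly twice the input size, which is a common configuration for 2× upsampling.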
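# Transposed convolution
nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1)
'''
in_channels (int) - number of channels in the input
out_channels (int) - number of channels produced by the convolution
kernel_size (int or tuple) - size of the convolution kernel
stride (int or tuple, optional) - stride of the convolution; roughly the factor by which the input is enlarged
padding (int or tuple, optional) - dilation * (kernel_size - 1) - padding zeros are added to each side of the input, so the output height and width shrink by 2*padding
output_padding (int or tuple, optional) - extra size added to one side of each output dimension, so the output height and width grow by output_padding
groups (int, optional) - number of blocked connections from input channels to output channels
bias (bool, optional) - if bias=True, a learnable bias is added to the output
dilation (int or tuple, optional) - spacing between kernel elements
'''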
# Up/down-sampling function
torch.nn.functional.interpolate(input, size=None, scale_factor=None, mode='nearest', align_corners=None)
'''
Parameters:
- input (Tensor): input tensor
- size (int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int]): output spatial size.
- scale_factor (float or Tuple[float]): multiplier for the spatial size.
- mode (string): interpolation algorithm: nearest, linear, bilinear, bicubic, trilinear, area. Default: nearest.
- align_corners (bool, optional): if align_corners=True, the corner pixels of input and output are aligned, preserving the values at those pixels. Only has an effect for mode=linear, bilinear, bicubic and trilinear. Default: False.
'''
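The effect of align_corners is easiest to see in one dimension. The pure-Python sketch below follows the coordinate mapping that, to my understanding, PyTorch uses for linear modes; treat it as an illustration rather than a reference implementation. The function name `linear_resize_1d` is made up for this sketch.

```python
def linear_resize_1d(values, out_size, align_corners=False):
    """Resize a 1-D list of floats with linear interpolation."""
    in_size = len(values)
    result = []
    for i in range(out_size):
        if align_corners:
            # The end points of input and output coincide exactly.
            src = i * (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
        else:
            # Pixels are treated as unit cells whose centres are aligned;
            # negative source positions are clamped to 0.
            src = max((i + 0.5) * in_size / out_size - 0.5, 0.0)
        lo = min(int(src), in_size - 1)
        hi = min(lo + 1, in_size - 1)
        frac = src - lo
        result.append(values[lo] * (1 - frac) + values[hi] * frac)
    return result

print(linear_resize_1d([0.0, 10.0], 4, align_corners=True))
print(linear_resize_1d([0.0, 10.0], 4, align_corners=False))  # [0.0, 2.5, 7.5, 10.0]
```

With align_corners=True the four outputs are spread evenly from 0 to 10, while with align_corners=False the samples sit at cell centres, so the values near the border are pulled towards the edge pixels.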
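Generating bboxes
The bbox in the FOTS algorithm is the RBOX from the EAST algorithm, i.e. a rotated rectangle. It has five parameters: the four distances di from the pixel position to the top, right, bottom, and left edges of the rectangle, and the rotation angle θ.
The function that generates the four bbox vertex coordinates from the five RBOX parameters is: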
import numpy as np

def restore_rectangle_rbox(origin, geometry):
    '''
    :param origin: N*2 array of (x, y) pixel positions
    :param geometry: N*5 array [d1, d2, d3, d4, theta], distances to top, right, bottom, left, then the angle
    :return: N*4*2 array of box vertices
    '''
    d = geometry[:, :4]
    angle = geometry[:, 4]
    # for angle >= 0
    origin_0 = origin[angle >= 0]
    d_0 = d[angle >= 0]
    angle_0 = angle[angle >= 0]
    if origin_0.shape[0] > 0:
        # (0,-H),(W, -H),(W, 0),(0, 0),(Left, -bottom)
        p = np.array([np.zeros(d_0.shape[0]), -d_0[:, 0] - d_0[:, 2],
                      d_0[:, 1] + d_0[:, 3], -d_0[:, 0] - d_0[:, 2],
                      d_0[:, 1] + d_0[:, 3], np.zeros(d_0.shape[0]),
                      np.zeros(d_0.shape[0]), np.zeros(d_0.shape[0]),
                      d_0[:, 3], -d_0[:, 2]])
        p = p.transpose((1, 0)).reshape((-1, 5, 2))  # N*5*2
        rotate_matrix_x = np.array([np.cos(angle_0), np.sin(angle_0)]).transpose((1, 0))
        rotate_matrix_x = np.repeat(rotate_matrix_x, 5, axis = 1).reshape(-1, 2, 5).transpose((0, 2, 1))  # N*5*2
        rotate_matrix_y = np.array([-np.sin(angle_0), np.cos(angle_0)]).transpose((1, 0))
        rotate_matrix_y = np.repeat(rotate_matrix_y, 5, axis = 1).reshape(-1, 2, 5).transpose((0, 2, 1))
        p_rotate_x = np.sum(rotate_matrix_x * p, axis = 2)[:, :, np.newaxis]  # N*5*1
        p_rotate_y = np.sum(rotate_matrix_y * p, axis = 2)[:, :, np.newaxis]  # N*5*1
        p_rotate = np.concatenate([p_rotate_x, p_rotate_y], axis = 2)  # N*5*2
        # Offset that moves the rotated reference point onto the pixel position.
        p3_in_origin = origin_0 - p_rotate[:, 4, :]
        new_p0 = p_rotate[:, 0, :] + p3_in_origin  # N*2
        new_p1 = p_rotate[:, 1, :] + p3_in_origin
        new_p2 = p_rotate[:, 2, :] + p3_in_origin
        new_p3 = p_rotate[:, 3, :] + p3_in_origin
        new_p_0 = np.concatenate([new_p0[:, np.newaxis, :], new_p1[:, np.newaxis, :],
                                  new_p2[:, np.newaxis, :], new_p3[:, np.newaxis, :]], axis = 1)  # N*4*2
    else:
        new_p_0 = np.zeros((0, 4, 2))
    # for angle < 0
    origin_1 = origin[angle < 0]
    d_1 = d[angle < 0]
    angle_1 = angle[angle < 0]
    if origin_1.shape[0] > 0:
        p = np.array([-d_1[:, 1] - d_1[:, 3], -d_1[:, 0] - d_1[:, 2],
                      np.zeros(d_1.shape[0]), -d_1[:, 0] - d_1[:, 2],
                      np.zeros(d_1.shape[0]), np.zeros(d_1.shape[0]),
                      -d_1[:, 1] - d_1[:, 3], np.zeros(d_1.shape[0]),
                      -d_1[:, 1], -d_1[:, 2]])
        p = p.transpose((1, 0)).reshape((-1, 5, 2))  # N*5*2
        rotate_matrix_x = np.array([np.cos(-angle_1), -np.sin(-angle_1)]).transpose((1, 0))
        rotate_matrix_x = np.repeat(rotate_matrix_x, 5, axis = 1).reshape(-1, 2, 5).transpose((0, 2, 1))  # N*5*2
        rotate_matrix_y = np.array([np.sin(-angle_1), np.cos(-angle_1)]).transpose((1, 0))
        rotate_matrix_y = np.repeat(rotate_matrix_y, 5, axis = 1).reshape(-1, 2, 5).transpose((0, 2, 1))
        p_rotate_x = np.sum(rotate_matrix_x * p, axis = 2)[:, :, np.newaxis]  # N*5*1
        p_rotate_y = np.sum(rotate_matrix_y * p, axis = 2)[:, :, np.newaxis]  # N*5*1
        p_rotate = np.concatenate([p_rotate_x, p_rotate_y], axis = 2)  # N*5*2
        p3_in_origin = origin_1 - p_rotate[:, 4, :]
        new_p0 = p_rotate[:, 0, :] + p3_in_origin  # N*2
        new_p1 = p_rotate[:, 1, :] + p3_in_origin
        new_p2 = p_rotate[:, 2, :] + p3_in_origin
        new_p3 = p_rotate[:, 3, :] + p3_in_origin
        new_p_1 = np.concatenate([new_p0[:, np.newaxis, :], new_p1[:, np.newaxis, :],
                                  new_p2[:, np.newaxis, :], new_p3[:, np.newaxis, :]], axis = 1)  # N*4*2
    else:
        new_p_1 = np.zeros((0, 4, 2))
    return np.concatenate([new_p_0, new_p_1])
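For a single pixel, the geometry of the angle >= 0 branch can be written without any array reshaping. The scalar sketch below is a simplification for illustration only: it ignores the negative-angle branch, and the name `rbox_to_quad` does not appear in the original code.

```python
import math

def rbox_to_quad(x, y, top, right, bottom, left, angle):
    """Restore the four corners of one RBOX for angle >= 0 (simplified sketch)."""
    w, h = left + right, top + bottom
    # Corners and the pixel itself in the box's local frame, whose origin is
    # the bottom-left corner and whose y axis points down as in image coordinates.
    local = [(0.0, -h), (w, -h), (w, 0.0), (0.0, 0.0), (left, -bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = [(cos_a * px + sin_a * py, -sin_a * px + cos_a * py)
               for px, py in local]
    # Translate so that the rotated pixel lands on (x, y).
    dx, dy = x - rotated[4][0], y - rotated[4][1]
    return [(px + dx, py + dy) for px, py in rotated[:4]]

print(rbox_to_quad(100, 50, top=10, right=30, bottom=5, left=20, angle=0.0))
# [(80.0, 40.0), (130.0, 40.0), (130.0, 55.0), (80.0, 55.0)]
```

With a zero angle the result is the axis-aligned box from (x − left, y − top) to (x + right, y + bottom), listed from the top-left corner in clockwise order.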
Building on this function, the bbox vertex coordinates are obtained from the score map and the geo map:
import time

import cv2
import lanms
import numpy as np

def detect(score_map, geo_map, score_map_thresh = 0.5, box_thresh = 0.1, nms_thres = 0.2, timer = None):
    '''
    restore text boxes from score map and geo map
    :param score_map: 1 channel
    :param geo_map: 5 channel
    :param timer: dict that collects timings; a new one is created if None
    :param score_map_thresh: threshold for score map
    :param box_thresh: threshold for boxes
    :param nms_thres: threshold for nms
    :return: (boxes, timer); boxes is None when nothing survives nms
    '''
    if timer is None:
        timer = {}
    if len(score_map.shape) == 4:
        score_map = score_map[0, :, :, 0]
        geo_map = geo_map[0, :, :, ]
    # filter the score map
    xy_text = np.argwhere(score_map > score_map_thresh)
    # sort the text boxes via the y axis
    xy_text = xy_text[np.argsort(xy_text[:, 0])]
    # restore
    start = time.time()
    text_box_restored = Toolbox.restore_rectangle_rbox(xy_text[:, ::-1] * 4, geo_map[xy_text[:, 0], xy_text[:, 1], :])  # N*4*2
    # print('{} text boxes before nms'.format(text_box_restored.shape[0]))
    boxes = np.zeros((text_box_restored.shape[0], 9), dtype = np.float32)
    boxes[:, :8] = text_box_restored.reshape((-1, 8))
    boxes[:, 8] = score_map[xy_text[:, 0], xy_text[:, 1]]
    timer['restore'] = time.time() - start
    # nms part
    start = time.time()
    # boxes = nms_locality.nms_locality(boxes.astype(np.float64), nms_thres)
    boxes = lanms.merge_quadrangle_n9(boxes.astype('float32'), nms_thres)
    timer['nms'] = time.time() - start
    if boxes.shape[0] == 0:
        return None, timer
    # here we filter some low score boxes by the average score map, this is different from the original paper
    for i, box in enumerate(boxes):
        mask = np.zeros_like(score_map, dtype = np.uint8)
        cv2.fillPoly(mask, box[:8].reshape((-1, 4, 2)).astype(np.int32) // 4, 1)
        boxes[i, 8] = cv2.mean(score_map, mask)[0]
    boxes = boxes[boxes[:, 8] > box_thresh]
    return boxes, timer
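The first few lines of `detect` (thresholding, sorting, and the switch from (row, col) to (x, y)) are easy to misread, so here is a tiny NumPy example with a made-up 2×3 score map. The factor 4 is taken from the code above and presumably reflects a score map at 1/4 of the input resolution.

```python
import numpy as np

score_map = np.array([[0.1, 0.9, 0.2],
                      [0.7, 0.3, 0.8]])

# np.argwhere returns one (row, col) pair, i.e. (y, x), per pixel above the threshold.
xy_text = np.argwhere(score_map > 0.5)
# Sort the candidate pixels by their y coordinate.
xy_text = xy_text[np.argsort(xy_text[:, 0])]
# Reverse each pair to (x, y) and scale back to input-image pixels.
origins = xy_text[:, ::-1] * 4
# Fancy indexing picks the score of every kept pixel.
scores = score_map[xy_text[:, 0], xy_text[:, 1]]

print(origins.tolist())
print(scores.tolist())
```

Each row of `origins` is the pixel position passed as `origin` to `restore_rectangle_rbox`, and `scores` fills the ninth column of `boxes`.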