Understanding LipNet

just-solo

Article link: https://blog.csdn.net/justsolow/article/details/105279988

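After reading the paper all the way through, I did not find much in the way of major novelty, so let's go straight to the network and the code.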
Figure: LipNet architecture.
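The input is a sequence of T frames. In practice T is usually taken as twice the maximum label length plus one, i.e. T = 2L + 1. This is the length of the CTC label sequence with a blank inserted before, between and after every character, so any label of length L can be aligned.

The frames are processed by three STCNN layers (spatiotemporal convolutional neural networks), each followed by a spatial max-pooling layer. The extracted features are upsampled in time and processed by a Bi-LSTM. A GRU also works: it is faster with no loss in accuracy, and the code below uses bidirectional GRUs. Each time step of the LSTM output then passes through a two-layer feed-forward network and a softmax.

The end-to-end model is trained with CTC (see my other blog post for a detailed explanation of CTC). In the code below, the temporal stride is 1 in every layer, so the number of time steps stays T. A single linear layer plays the role of the feed-forward network.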
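The 2L + 1 figure and the CTC objective can be illustrated with a small pure-Python sketch. This is not taken from the LipNet code. The blank index 0 (the default `blank` of PyTorch's `nn.CTCLoss`) and the letter indices (a=1, ..., z=26) are assumptions for illustration, and the helper names are hypothetical. The sketch does three things:

- builds the blank-extended label, which has length 2L + 1;
- computes the fewest frames CTC strictly needs, which is L plus one per pair of identical adjacent characters;
- applies CTC's collapse rule (merge repeats, then drop blanks), which turns a per-frame path back into a label.

```python
# Sketch: why T = 2L + 1 frames is always enough for CTC.
# Assumption: index 0 is the CTC blank (PyTorch nn.CTCLoss default).
BLANK = 0


def extend_with_blanks(label):
    """Insert a blank before, between and after every symbol: length 2L + 1."""
    extended = [BLANK]
    for symbol in label:
        extended += [symbol, BLANK]
    return extended


def min_frames(label):
    """Fewest frames CTC needs: one per symbol, plus one blank between each
    pair of identical adjacent symbols (otherwise they would be merged)."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def ctc_collapse(path):
    """CTC's many-to-one mapping: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return out


if __name__ == "__main__":
    label = [12, 15, 15, 11]          # hypothetical indices for "look" (a=1, ..., z=26)
    print(extend_with_blanks(label))  # [0, 12, 0, 15, 0, 15, 0, 11, 0]
    print(len(extend_with_blanks(label)), min_frames(label))  # 9 5
    # One valid 5-frame alignment: the blank keeps the two 15s apart.
    print(ctc_collapse([12, 15, 0, 15, 11]))  # [12, 15, 15, 11]
```

The strict minimum is at most 2L - 1, so T = 2L + 1 frames is always enough. Greedy decoding of the model's per-time-step argmax uses the same collapse rule.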
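That covers the network. Now on to the code, which is fairly simple and easy to follow at a glance. The less familiar parts have detailed comments.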

import torch 
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
import math
import numpy as np


class LipNet(torch.nn.Module):
    def __init__(self, dropout_p=0.5):
        super(LipNet, self).__init__()
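        # Reference notes on the layer signatures used below:
        # nn.Conv3d(in_channels, out_channels, kernel_size, stride=1, padding=0,
        #           dilation=1, groups=1, bias=True) (main arguments), e.g.
        #   conv = nn.Conv3d(in_channels=2, out_channels=6, kernel_size=(2, 1, 1),
        #                    stride=1, padding=0, dilation=1, groups=1, bias=False)
        # max_pool3d(input, ksize, strides, padding, data_format='NDHWC', name=None)
        # is the TensorFlow signature; this code uses PyTorch's
        # nn.MaxPool3d(kernel_size, stride). All tuples below are ordered (T, H, W).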
        self.conv1 = nn.Conv3d(3, 32, (3, 5, 5), (1, 2, 2), (1, 2, 2))
        self.pool1 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
        
        self.conv2 = nn.Conv3d(32, 64, (3, 5, 5), (1, 1, 1), (1, 2, 2))
        self.pool2 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
        
        self.conv3 = nn.Conv3d(64, 96, (3, 3, 3), (1, 1, 1), (1, 1, 1))     
        self.pool3 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
        
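        # GRU input = C*H*W after pool3. For 64x128 (HxW) frames: conv1 (stride 2)
        # -> 32x64, then pool1/pool2/pool3 each halve it -> 16x32 -> 8x16 -> 4x8: 96*4*8.
        self.gru1  = nn.GRU(96*4*8, 256, 1, bidirectional=True)
        # Bidirectional, so each time step carries 2*256 = 512 features
        self.gru2  = nn.GRU(512, 256, 1, bidirectional=True)

        # 27 character classes + 1 CTC blank
        self.FC    = nn.Linear(512, 27+1)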
        self.dropout_p  = dropout_p

        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(self.dropout_p)        
        self.dropout3d = nn.Dropout3d(self.dropout_p)  
        self._init()
    
    def _init(self):
        # Weight initialization (see my other blog post: https://blog.csdn.net/justsolow/article/details/105137595)
        
        init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        init.constant_(self.conv1.bias, 0)
        
        init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        init.constant_(self.conv2.bias, 0)
        
        init.kaiming_normal_(self.conv3.weight, nonlinearity='relu')
        init.constant_(self.conv3.bias, 0)        
        
        init.kaiming_normal_(self.FC.weight, nonlinearity='sigmoid')
        init.constant_(self.FC.bias, 0)
        

        # Notes on other torch.nn.init schemes:
        # Orthogonal matrix: (semi)orthogonal initialization, from "Exact solutions
        # to the nonlinear dynamics of learning in deep linear neural networks"
        # (Saxe et al., 2013).
        #   torch.nn.init.orthogonal_(tensor, gain=1)
        #   nn.init.orthogonal_(w)   # for a 2x3 w, e.g.
        #   tensor([[ 0.5786, -0.5642, -0.5890],
        #           [-0.7517, -0.0886, -0.6536]])
        # Uniform distribution U(a, b):
        #   torch.nn.init.uniform_(tensor, a=0, b=1)
        #   nn.init.uniform_(w)
        #   tensor([[ 0.0578,  0.3402,  0.5034],
        #           [ 0.7865,  0.7280,  0.6269]])
        # (The printed values are random samples and differ on each run.)
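        # GRU weights stack the 3 gates (reset, update, new) along dim 0, 256 rows each.
        # Input-to-hidden: uniform on [-sqrt(3)*stdv, sqrt(3)*stdv], whose std is stdv;
        # hidden-to-hidden: orthogonal per gate; input biases: zero.
        # stdv is Glorot-style: sqrt(2 / (fan_in + hidden)), fan_in = gru1 input 96*4*8.
        for m in (self.gru1, self.gru2):
            stdv = math.sqrt(2 / (96 * 4 * 8 + 256))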
            for i in range(0, 256 * 3, 256):
                init.uniform_(m.weight_ih_l0[i: i + 256],
                            -math.sqrt(3) * stdv, math.sqrt(3) * stdv)
                init.orthogonal_(m.weight_hh_l0[i: i + 256])
                init.constant_(m.bias_ih_l0[i: i + 256], 0)
                init.uniform_(m.weight_ih_l0_reverse[i: i + 256],
                            -math.sqrt(3) * stdv, math.sqrt(3) * stdv)
                init.orthogonal_(m.weight_hh_l0_reverse[i: i + 256])
                init.constant_(m.bias_ih_l0_reverse[i: i + 256], 0)
        
        
    def forward(self, x):
        
        x = self.conv1(x)
        x = self.relu(x)
        x = self.dropout3d(x)
        x = self.pool1(x)
        
        x = self.conv2(x)
        x = self.relu(x)
        x = self.dropout3d(x)        
        x = self.pool2(x)
        
        x = self.conv3(x)
        x = self.relu(x)
        x = self.dropout3d(x)        
        x = self.pool3(x)
        
        # (B, C, T, H, W) -> (T, B, C, H, W)
        x = x.permute(2, 0, 1, 3, 4).contiguous()
        # (T, B, C, H, W) -> (T, B, C*H*W): nn.GRU (batch_first=False) expects
        # input of shape (seq_len, batch, features), so time must come first
        x = x.view(x.size(0), x.size(1), -1)
        
        self.gru1.flatten_parameters()
        self.gru2.flatten_parameters()
        
        x, h = self.gru1(x)        
        x = self.dropout(x)
        x, h = self.gru2(x)   
        x = self.dropout(x)
                
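        x = self.FC(x)  # (T, B, 28): raw scores per time step
        # -> (B, T, 28). No softmax here: for nn.CTCLoss, apply log_softmax(dim=-1)
        # and permute back to (T, B, 28), the layout CTCLoss expects.
        x = x.permute(1, 0, 2).contiguous()
        return x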
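The GRU input size 96*4*8 can be checked by tracing each layer's (T, H, W) output size. The sketch below uses the standard Conv/MaxPool formula floor((n + 2p - k) / s) + 1, with kernel, stride and padding values copied from `LipNet.__init__`. The 64x128 frame size and the 75-frame clip length are assumptions, since the post does not state the input resolution. The helper names are hypothetical.

```python
# Sketch: trace LipNet's feature-map sizes to check the GRU input size 96*4*8.
# Assumption: 64 x 128 (H x W) input frames and a 75-frame clip; change as needed.


def out_size(size, kernel, stride, padding):
    """Output length of one Conv/MaxPool dimension (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), each tuple ordered (T, H, W), from LipNet.__init__
LAYERS = [
    ("conv1", (3, 5, 5), (1, 2, 2), (1, 2, 2)),
    ("pool1", (1, 2, 2), (1, 2, 2), (0, 0, 0)),
    ("conv2", (3, 5, 5), (1, 1, 1), (1, 2, 2)),
    ("pool2", (1, 2, 2), (1, 2, 2), (0, 0, 0)),
    ("conv3", (3, 3, 3), (1, 1, 1), (1, 1, 1)),
    ("pool3", (1, 2, 2), (1, 2, 2), (0, 0, 0)),
]


def trace(t, h, w, channels=96):
    """Return (time steps, flattened C*H*W features) after pool3."""
    dims = (t, h, w)
    for name, k, s, p in LAYERS:
        dims = tuple(out_size(d, k[i], s[i], p[i]) for i, d in enumerate(dims))
        print(f"{name}: T={dims[0]}, H={dims[1]}, W={dims[2]}")
    t_out, h_out, w_out = dims
    return t_out, channels * h_out * w_out


if __name__ == "__main__":
    print(trace(75, 64, 128))  # (75, 3072): T unchanged, 96*4*8 = 3072
    print(trace(75, 50, 100))  # (75, 1728): a 50x100 crop gives 96*3*6
```

With 64x128 frames, the flattened feature size is 3072 = 96*4*8, which matches `nn.GRU(96*4*8, ...)`. T is unchanged because every temporal stride is 1. A 50x100 crop would instead give 96*3*6 = 1728, so the GRU input size must be adjusted whenever the frame size changes.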