Street View Character Recognition Baseline

CoderPro

Article link: https://blog.csdn.net/qq_38896666/article/details/106236042

Street View Character Recognition

Understanding the Task and Possible Solutions

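Task: given a street-view image containing characters, correctly recognize the digits in it.
Based on this requirement, there are three possible approaches: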

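Fixed-length character recognition:
Because an image contains at most 6 digits, the task can be treated as fixed-length recognition. Each label is padded to 6 positions, with X filling the positions that hold no digit. The problem then becomes 6 classification tasks, one per position, each over 11 classes (digits 0-9 plus X); a prediction of X means that position contains no digit. (The baseline code below uses a maximum length of 5 and encodes X as class 10.)
Variable-length character recognition:
Use a character recognition model to recognize the variable-length string directly.
Object detection:
Use a detector such as YOLO or SSD to locate the bounding box of each character, then use a model to recognize the digit inside each box.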
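
The fixed-length formulation can be illustrated with a small sketch of the label encoding. The helper names `encode_label` and `decode_label` and the constants are hypothetical, chosen for illustration; class index 10 stands in for X, as in the baseline code.

```python
PAD_CLASS = 10  # assumed class index for "no digit" (the X in the text)
MAX_LEN = 6     # maximum number of characters stated in the text


def encode_label(digits, max_len=MAX_LEN, pad=PAD_CLASS):
    """Truncate or pad a list of digits to exactly max_len class indices."""
    digits = list(digits)[:max_len]
    return digits + [pad] * (max_len - len(digits))


def decode_label(classes, pad=PAD_CLASS):
    """Drop padding positions and join the remaining digits into a string."""
    return "".join(str(c) for c in classes if c != pad)


print(encode_label([1, 9]))
print(decode_label([1, 9, 10, 10, 10, 10]))
```

Encoding followed by decoding returns the original digit string, which is what makes a per-position classifier usable for strings of different lengths.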

Baseline based on PyTorch

Using the baseline for this task, this section shows how to build a CNN model with PyTorch.
Building a CNN model generally involves the following steps:

Build your own dataset and dataloader to read data during training
Build the CNN model
Train and validate the model
Test the model

Building your own dataset and dataloader

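torchvision ships several public datasets that can be loaded directly, but we often need to train on our own data, so we have to define our own dataset.
PyTorch datasets come in two styles: map-style and iterable-style. Most of the code I have seen uses the map style: given an index, the dataset returns the corresponding image and label. To define one, subclass torch.utils.data.Dataset and override its __getitem__ and __len__ methods. Taking the baseline for this task as an example: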

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class SVHNDataset(Dataset):
    def __init__(self, img_path, img_label, transform=None):
        self.img_path = img_path      # list of image file paths
        self.img_label = img_label    # list of digit lists, one per image
        self.transform = transform

    def __getitem__(self, index):
        img = Image.open(self.img_path[index]).convert('RGB')

        if self.transform is not None:
            img = self.transform(img)

        # Pad the label to a fixed length of 5; class 10 means "no digit".
        # np.int64 gives the integer dtype that CrossEntropyLoss expects for targets.
        lbl = np.array(self.img_label[index], dtype=np.int64)
        lbl = list(lbl) + (5 - len(lbl)) * [10]
        return img, torch.from_numpy(np.array(lbl[:5], dtype=np.int64))

    def __len__(self):
        return len(self.img_path)

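__getitem__ returns the data we need, and __len__ returns the size of the dataset. Note that torchvision transforms operate on PIL images, so the images are read with PIL rather than OpenCV. To process an image with OpenCV, first convert it to a NumPy array.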

Many PyTorch dataset definitions call super() at the top of __init__. This is recommended when the class takes part in multiple inheritance; see this blog post: super usage

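In __getitem__, the labels of images with fewer than 5 characters are padded with class 10, which turns the task into fixed-length recognition. Samples in a batch are stacked row by row, so each row of the target tensor holds the 5 labels of one image.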
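
To show how a map-style dataset is consumed by a dataloader, here is a minimal sketch. `ToyDigitsDataset` is a hypothetical stand-in for SVHNDataset that returns random tensors instead of reading image files, so it runs without any data on disk; the image size is an assumption.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDigitsDataset(Dataset):
    """Hypothetical stand-in for SVHNDataset that yields random images."""

    def __init__(self, labels, max_len=5, pad=10):
        self.labels = labels
        self.max_len = max_len
        self.pad = pad

    def __getitem__(self, index):
        img = torch.rand(3, 60, 120)  # fake RGB image, channels x height x width
        lbl = list(self.labels[index])[:self.max_len]
        lbl = lbl + [self.pad] * (self.max_len - len(lbl))
        return img, torch.tensor(lbl, dtype=torch.long)

    def __len__(self):
        return len(self.labels)


toy_loader = DataLoader(
    ToyDigitsDataset([[1, 9], [2, 3, 5], [7], [4, 0, 6, 8]]),
    batch_size=2,
    shuffle=False,
)
images, targets = next(iter(toy_loader))
print(images.shape, targets.shape)
```

The dataloader stacks individual samples along a new first dimension, so `targets[:, k]` selects the label of position k for every image in the batch.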

Building the CNN model

import torch.nn as nn
from torchvision import models

class SVHN_Model1(nn.Module):
    def __init__(self):
        super(SVHN_Model1, self).__init__()

        # ImageNet-pretrained ResNet-18 (newer torchvision versions prefer the weights= argument)
        model_conv = models.resnet18(pretrained=True)
        model_conv.avgpool = nn.AdaptiveAvgPool2d(1)
        # Drop the final fc layer and keep everything up to the global pooling
        model_conv = nn.Sequential(*list(model_conv.children())[:-1])
        self.cnn = model_conv

        # One 11-way classifier per character position
        self.fc1 = nn.Linear(512, 11)
        self.fc2 = nn.Linear(512, 11)
        self.fc3 = nn.Linear(512, 11)
        self.fc4 = nn.Linear(512, 11)
        self.fc5 = nn.Linear(512, 11)

    def forward(self, img):
        feat = self.cnn(img)                  # (N, 512, 1, 1)
        feat = feat.view(feat.shape[0], -1)   # (N, 512)
        c1 = self.fc1(feat)
        c2 = self.fc2(feat)
        c3 = self.fc3(feat)
        c4 = self.fc4(feat)
        c5 = self.fc5(feat)
        return c1, c2, c3, c4, c5

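Here we adapt ResNet-18 for fine-tuning: its original final fully connected layer is removed and replaced by five parallel 11-way fully connected layers, one per character position. The rest of the network keeps its structure and starts from the pretrained weights.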
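
The head structure can be checked in isolation. The sketch below is not the baseline's code: it assumes a 512-dimensional feature vector per image (the size the baseline feeds into its linear layers) and uses a hypothetical `MultiHead` module that holds the five classifiers in an nn.ModuleList instead of five separate attributes.

```python
import torch
import torch.nn as nn


class MultiHead(nn.Module):
    """Hypothetical module: one linear classifier per character position."""

    def __init__(self, in_features=512, num_positions=5, num_classes=11):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(in_features, num_classes) for _ in range(num_positions)]
        )

    def forward(self, feat):
        # Every head sees the same shared feature vector
        return tuple(head(feat) for head in self.heads)


feat = torch.randn(4, 512)   # fake backbone features for a batch of 4 images
outputs = MultiHead()(feat)
print(len(outputs), outputs[0].shape)
```

Each of the five outputs has shape (batch size, 11), which matches what the training code expects when it unpacks the model output into five tensors.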

Model training and validation

def train(train_loader, model, criterion, optimizer):
    # Switch the model to training mode
    model.train()
    train_loss = []
    
    for i, (input, target) in enumerate(train_loader):
        if use_cuda:
            input = input.cuda()
            target = target.cuda()
            
        c0, c1, c2, c3, c4 = model(input)
        loss = criterion(c0, target[:, 0]) + \
                criterion(c1, target[:, 1]) + \
                criterion(c2, target[:, 2]) + \
                criterion(c3, target[:, 3]) + \
                criterion(c4, target[:, 4])
        
        # loss /= 6
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if i % 100 == 0:
            print(loss.item())
        
        train_loss.append(loss.item())
    return np.mean(train_loss)

def validate(val_loader, model, criterion):
    # Switch the model to evaluation mode
    model.eval()
    val_loss = []

    # Do not record gradient information
    with torch.no_grad():
        for i, (input, target) in enumerate(val_loader):
            if use_cuda:
                input = input.cuda()
                target = target.cuda()
            
            c0, c1, c2, c3, c4 = model(input)
            loss = criterion(c0, target[:, 0]) + \
                    criterion(c1, target[:, 1]) + \
                    criterion(c2, target[:, 2]) + \
                    criterion(c3, target[:, 3]) + \
                    criterion(c4, target[:, 4])
            # loss /= 6
            val_loss.append(loss.item())
    return np.mean(val_loss)

def predict(test_loader, model, tta=10):
    model.eval()
    test_pred_tta = None

    # Number of test-time augmentation (TTA) passes
    for _ in range(tta):
        test_pred = []

        with torch.no_grad():
            for i, (input, target) in enumerate(test_loader):
                if use_cuda:
                    input = input.cuda()

                c0, c1, c2, c3, c4 = model(input)
                # Move outputs to the CPU before converting to NumPy
                output = np.concatenate([
                    c0.cpu().numpy(),
                    c1.cpu().numpy(),
                    c2.cpu().numpy(),
                    c3.cpu().numpy(),
                    c4.cpu().numpy()], axis=1)
                test_pred.append(output)

        # Stack all batches into one (N, 55) array and accumulate across passes
        test_pred = np.vstack(test_pred)
        if test_pred_tta is None:
            test_pred_tta = test_pred
        else:
            test_pred_tta += test_pred

    return test_pred_tta
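
The accumulation in predict can be sketched with NumPy alone. In this illustration `noisy_predict` is a hypothetical stand-in for one forward pass over randomly augmented inputs, and the array sizes are made up; only the summing pattern mirrors the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
base_logits = rng.normal(size=(3, 55))  # 3 images, 5 positions x 11 classes


def noisy_predict():
    """Hypothetical stand-in for one augmented forward pass."""
    return base_logits + rng.normal(scale=0.1, size=base_logits.shape)


tta = 10
total = None
for _ in range(tta):
    pred = noisy_predict()
    total = pred if total is None else total + pred

mean_logits = total / tta
sum_classes = total.reshape(3, 5, 11).argmax(axis=2)
mean_classes = mean_logits.reshape(3, 5, 11).argmax(axis=2)
print(np.array_equal(sum_classes, mean_classes))
```

Because dividing by the number of passes does not change which class has the largest score, summing the outputs as predict does gives the same decisions as averaging them.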

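Here we wrap training, validation, and prediction in separate functions. Note that validation and prediction run under with torch.no_grad(), which skips recording gradient information and so reduces memory use and speeds up inference.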
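
The loss used in train and validate is the sum of five cross-entropy terms, one per character position. A minimal sketch with random stand-in logits and targets (all sizes and values here are made up for illustration) shows the computation and how it relates to a single cross-entropy over all positions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_positions, num_classes = 4, 5, 11

# Fake model output: a tuple of five (N, 11) logit tensors
logits = tuple(torch.randn(batch_size, num_classes) for _ in range(num_positions))
# Fake padded targets: (N, 5) class indices in [0, 10]
target = torch.randint(0, num_classes, (batch_size, num_positions))

criterion = nn.CrossEntropyLoss()

# Same pattern as the training code: one term per position, summed
loss = sum(criterion(logits[k], target[:, k]) for k in range(num_positions))

# Equivalent view: one cross-entropy over all N * 5 (image, position) pairs
stacked = torch.stack(logits, dim=1).reshape(-1, num_classes)
flat_loss = criterion(stacked, target.reshape(-1))
print(loss.item(), flat_loss.item() * num_positions)
```

With the default mean reduction, the summed loss equals the flattened loss multiplied by the number of positions, so the two differ only by a constant factor.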
We can then train the model:

model = SVHN_Model1()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), 0.001)
best_loss = 1000.0

use_cuda = False
if use_cuda:
    model = model.cuda()

for epoch in range(2):
    train_loss = train(train_loader, model, criterion, optimizer)
    val_loss = validate(val_loader, model, criterion)

    # img_label holds the raw labels, before any padding is added
    val_label = [''.join(map(str, x)) for x in val_loader.dataset.img_label]
    val_predict_label = predict(val_loader, model, 1)
    # Every 11 columns hold the scores of one character position;
    # the index of the largest score is the predicted class
    val_predict_label = np.vstack([
        val_predict_label[:, :11].argmax(1),
        val_predict_label[:, 11:22].argmax(1),
        val_predict_label[:, 22:33].argmax(1),
        val_predict_label[:, 33:44].argmax(1),
        val_predict_label[:, 44:55].argmax(1),
    ]).T  # the transpose matters: each row must hold the labels of one image
    val_label_pred = []
    for x in val_predict_label:
        val_label_pred.append(''.join(map(str, x[x!=10])))

    # Whole-image accuracy; both arrays must have the same length
    val_char_acc = np.mean(np.array(val_label_pred) == np.array(val_label))

    print('Epoch: {0}, Train loss: {1} \t Val loss: {2}'.format(epoch, train_loss, val_loss))
    print(val_char_acc)
    # Save the weights whenever the validation loss improves
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(model.state_dict(), './model.pt')

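A few more notes on the code; the main points are already marked in the comments above. Why use map and join? map converts each predicted class index to a character, and join concatenates those characters into a string. The competition counts an image as wrong if any single character is wrong, so we compare whole strings: the dataset labels are joined into strings in the same way, the two strings are tested for equality, and np.mean over the results gives the accuracy.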
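
The decoding and accuracy steps can be tried on a tiny hand-made example. `decode_predictions` is a hypothetical helper that reshapes the (N, 55) score array instead of slicing it five times; the scores are one-hot rows built from known class sequences, so the result is easy to check by hand.

```python
import numpy as np


def decode_predictions(logits, num_positions=5, num_classes=11, pad=10):
    """Turn an (N, num_positions * num_classes) score array into digit strings."""
    per_position = logits.reshape(len(logits), num_positions, num_classes)
    classes = per_position.argmax(axis=2)  # (N, num_positions)
    return ["".join(map(str, row[row != pad])) for row in classes]


# Scores for two images, built by one-hot encoding known class sequences
wanted = np.array([[1, 9, 10, 10, 10], [2, 3, 5, 10, 10]])
logits = np.eye(11)[wanted].reshape(2, 55)

pred = decode_predictions(logits)
truth = ["19", "236"]  # the second image differs in its last digit
accuracy = np.mean(np.array(pred) == np.array(truth))
print(pred, accuracy)
```

The second image has two of its three digits right but still counts as wrong, which is exactly the whole-string criterion described above.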

Testing the model

test_path = glob.glob('../input/test_a/*.png')
test_path.sort()
# Placeholder labels, since SVHNDataset requires a label list
test_label = [[1]] * len(test_path)
print(len(test_path), len(test_label))

test_loader = torch.utils.data.DataLoader(
    SVHNDataset(test_path, test_label,
                transforms.Compose([
                    transforms.Resize((64, 128)),
                    transforms.RandomCrop((60, 120)),
                    # transforms.ColorJitter(0.3, 0.3, 0.2),
                    # transforms.RandomRotation(5),
                    transforms.ToTensor(),
                    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])), 
    batch_size=40, 
    shuffle=False, 
    num_workers=10,
)

test_predict_label = predict(test_loader, model, 1)

test_label = [''.join(map(str, x)) for x in test_loader.dataset.img_label]
test_predict_label = np.vstack([
    test_predict_label[:, :11].argmax(1),
    test_predict_label[:, 11:22].argmax(1),
    test_predict_label[:, 22:33].argmax(1),
    test_predict_label[:, 33:44].argmax(1),
    test_predict_label[:, 44:55].argmax(1),
]).T

test_label_pred = []
for x in test_predict_label:
    test_label_pred.append(''.join(map(str, x[x!=10])))
    
import pandas as pd
df_submit = pd.read_csv('../input/test_A_sample_submit.csv')
df_submit['file_code'] = test_label_pred
df_submit.to_csv('resnet18.csv', index=False)
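
The training loop saves the weights with the lowest validation loss to ./model.pt, while the test code above uses the model as it stands after the last epoch. If the saved weights are wanted for prediction, they can be loaded back first. The sketch below shows the save and load round trip on a small nn.Linear, a hypothetical stand-in for SVHN_Model1, using a temporary directory instead of the real path.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for SVHN_Model1

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "model.pt")
    torch.save(model.state_dict(), ckpt_path)

    # Create a fresh model of the same architecture, then load the weights
    restored = nn.Linear(4, 2)
    state = torch.load(ckpt_path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # switch to evaluation mode before predicting
print(torch.equal(restored.weight, model.weight))
```

For the baseline, the same two calls would be applied to an SVHN_Model1 instance and the ./model.pt file before calling predict.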