The "Zhizhi" (知之) Large Model for Education -- Shandong University School of Software, 2024 Project Training (Part 5)

Wang from our group proposed integrating several small tools into Zhizhi, and Jiang is responsible for the "formula image to LaTeX code" feature. I worked with Jiang to implement it.

I was responsible for the early-stage survey of candidate tools and techniques and, once we settled on deep learning in the middle stage, for model training and hyperparameter tuning. Jiang was responsible for deciding which technique to adopt, writing and debugging the code, and later optimizing recognition accuracy (for example, studying how image resolution affects recognition accuracy).

1. Survey

The survey found two main techniques for converting formula images to LaTeX code: traditional image recognition based on cv2 (OpenCV), and data-driven formula recognition algorithms based on deep learning.

1.1 Traditional Image Recognition

Idea:

Prepare a digit template image. After finding the contour of each digit, sort the contours by their x coordinates so that each template contour corresponds to the right digit; do the same for the test image, then match the test contours against the templates in a loop to obtain the recognition result.

Code:

import cv2
import numpy as np

def sort_contours(cnts):
    # sort contours left-to-right by the x coordinate of their bounding boxes
    boundingBoxes = [cv2.boundingRect(c) for c in cnts]
    (cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes), key=lambda b: b[1][0], reverse=False))
    return cnts

# Template: load the digit template image and binarize it
tempimg = cv2.imread('./numbertemp.png')
refimg = cv2.cvtColor(tempimg, cv2.COLOR_BGR2GRAY)
refimg = cv2.threshold(refimg, 10, 255, cv2.THRESH_OTSU)[1]
refimg = cv2.bitwise_not(refimg)
contours, hierarchy = cv2.findContours(refimg.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# cv2.drawContours(tempimg, contours, -1, (0, 0, 255), 3)
# cv2.imshow('tempimg', tempimg)
# cv2.waitKey(0)
contours = sort_contours(contours)
digits = {}  # index -> template digit ROI
for (i, c) in enumerate(contours):
    (x, y, w, h) = cv2.boundingRect(c)
    roi = refimg[y:y + h, x:x + w]
    roi = cv2.resize(roi,(57,88))
    # cv2.imshow('temproi', roi)
    # cv2.waitKey(0)
    digits[i] = roi  # store the template ROI for this digit


# Test image: same preprocessing as the template
testimg = cv2.imread('./test2.jpg')
trefimg = cv2.cvtColor(testimg, cv2.COLOR_BGR2GRAY)
trefimg = cv2.threshold(trefimg, 10, 255, cv2.THRESH_OTSU)[1]
trefimg = cv2.bitwise_not(trefimg)
testcontours, testhierarchy = cv2.findContours(trefimg.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# cv2.drawContours(testimg, testcontours, -1, (0, 0, 255), 3)
# cv2.imshow('testimg', testimg)
# cv2.waitKey(0)
testcontours = sort_contours(testcontours)
groupOutput = []
for c in testcontours:
    (x, y, w, h) = cv2.boundingRect(c)
    roi = trefimg[y:y + h, x:x + w]

    # check whether the symbol is '-' or '.'
    # count the white pixels
    white_pixels = np.count_nonzero(roi == 255)
    # ratio of white area within the bounding box
    white_ratio = white_pixels / (roi.shape[0] * roi.shape[1])
    if white_ratio > 0.8:
        if roi.shape[1] / roi.shape[0] >= 2:
            groupOutput.append('-')
        else:
            groupOutput.append('.')
        continue

    roi = cv2.resize(roi,(57,88))
    scores = []
    for (digit, digitROI) in digits.items():
        result = cv2.matchTemplate(roi, digitROI, cv2.TM_CCOEFF)
        (_, score, _, _) = cv2.minMaxLoc(result)
        scores.append(score)
    groupOutput.append(str(np.argmax(scores)))  # best-matching template digit

# Output result
s = ''.join(groupOutput)  # join the list into a string
result = float(s)
formatted_result = format(result, '.3f')  # format the result to three decimal places
print(formatted_result)

Note that although the code above recognizes digits successfully, it is still some distance from full formula recognition. The many symbols that appear in formulas would each need an OCR template as well. Moreover, since we want LaTeX output, even after extracting the symbols from the image we still have to reconstruct their spatial relationships and convert them into LaTeX format, as illustrated below.
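To illustrate this difficulty, here is a minimal sketch of a rule-based layout step (not our final method). It assumes per-symbol recognition results are already available as a hypothetical list of (character, bounding box) pairs, and it handles only one case, superscripts, by comparing vertical centers of adjacent boxes; real formulas (fractions, subscripts, radicals, nesting) need far more rules, which is one reason we turned to deep learning.

# Sketch: turn recognized symbols plus their bounding boxes into a LaTeX string,
# handling only the superscript case. `symbols` is a hypothetical list of
# (char, (x, y, w, h)) tuples in image coordinates (y grows downward).
def boxes_to_latex(symbols):
    symbols = sorted(symbols, key=lambda s: s[1][0])  # left-to-right by x
    latex = []
    for i, (ch, (x, y, w, h)) in enumerate(symbols):
        if i > 0:
            _, (px, py, pw, ph) = symbols[i - 1]
            # if this symbol sits clearly above the previous symbol's center,
            # treat it as a superscript of the previous symbol
            if (y + h / 2) < py + 0.3 * ph:
                latex.append('^{' + ch + '}')
                continue
        latex.append(ch)
    return ''.join(latex)

# e.g. boxes recognized from an image of "x^2 + 1"
print(boxes_to_latex([('x', (0, 40, 20, 30)), ('2', (22, 20, 12, 15)),
                      ('+', (40, 45, 20, 20)), ('1', (65, 40, 15, 30))]))
# -> x^{2}+1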

1.2 Deep Learning Approach

On the deep learning side, Jiang suggested using an encoder-decoder architecture. I was responsible for surveying suitable encoder and decoder model structures.

1.2.1 Encoder: CNN

Code:

import torch
from torch import nn


class FashionMNISTModelV2(nn.Module):
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels=input_shape,
                      out_channels=hidden_units,
                      kernel_size=3, 
                      stride=1,
                      padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=hidden_units,
                      out_channels=hidden_units,
                      kernel_size=3,
                      stride=1,
                      padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2,
                         stride=2) 
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=hidden_units*7*7,
                      out_features=output_shape)
        )

    def forward(self, x: torch.Tensor):
        x = self.block_1(x)
        # print(x.shape)
        x = self.block_2(x)
        # print(x.shape)
        x = self.classifier(x)
        # print(x.shape)
        return x

This is a very simple CNN skeleton. If we end up choosing a CNN, the concrete model structure will be implemented by Jiang.
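For reference, a minimal usage check of the skeleton above. The shapes are illustrative: they assume single-channel 28x28 inputs, which is what the hidden_units*7*7 size in the classifier implies (two 2x2 max pools reduce 28x28 to 7x7).

model = FashionMNISTModelV2(input_shape=1, hidden_units=10, output_shape=10)
dummy = torch.randn(4, 1, 28, 28)   # batch of 4 grayscale 28x28 images
logits = model(dummy)
print(logits.shape)                 # torch.Size([4, 10])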

1.2.2 Encoder: ViT

Code:

import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange


def pair(t):
    # return (t, t) when a single value is given, otherwise keep the tuple
    return t if isinstance(t, tuple) else (t, t)
 
 
class PreNorm(nn.Module):
    # apply a LayerNorm before running fn
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)
 
 
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        # feed-forward network = two fully connected layers
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
 
    def forward(self, x):
        return self.net(x)
 
 
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)
 
        self.heads = heads
        self.scale = dim_head ** -0.5   # scaling factor
 
        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
 
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()
 
    def forward(self, x):
        # x: [bs, 197, 1024]   197 = 1 Cls token + 196 patches; each patch is mapped to a 1024-dim vector
        # self.to_qkv(x) maps x to vectors of length 1024*3
        # chunk: qkv is a tuple of length 3, each element of shape [1, 197, 1024]
        # a single Linear on x produces q, k and v, which are then split into 3 chunks
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # split q, k and v into heads
        # q: [1, 16, 197, 64]  k: [1, 16, 197, 64]  v: [1, 16, 197, 64]
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
        # q times k transposed, divided by the square root of d_k
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        # softmax gives each token's attention weights over the other tokens
        attn = self.attend(dots)
        # * v  [1, 16, 197, 64]
        out = torch.matmul(attn, v)
        # [1, 197, 1024]
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
 
 
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):  # stack depth encoder blocks
            self.layers.append(nn.ModuleList([
                # each encoder block = Attention (Multi-Head Attention) + FeedForward (MLP)
                # PreNorm: apply a LayerNorm before fn (Attention / FeedForward)
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
 
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
 
 
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)   # 224*224
        patch_height, patch_width = pair(patch_size)   # 16 * 16
 
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
 
        num_patches = (image_height // patch_height) * (image_width // patch_width)  # number of tokens: 14x14 = 196
        patch_dim = channels * patch_height * patch_width  # 3x16x16 = 768, dimension of a flattened patch
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),   # flatten every patch into a 768-dim vector
            nn.Linear(patch_dim, dim),                                                                  # map to the encoder dimension: 768 -> 1024
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # position encodings for all patch tokens plus the Cls token
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))   # initial parameters of the Cls token
        self.dropout = nn.Dropout(emb_dropout)                  # a Dropout usually follows the embedding
 
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)   # encoder
 
        self.pool = pool
        self.to_latent = nn.Identity()
 
        self.mlp_head = nn.Sequential(   # classification head on the Cls token
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )
 
    def forward(self, img):
        # img: [1, 3, 224, 224] -> x: [1, 196, 1024]
        # build the patch embedding of each image:
        # split the image into patches, flatten each patch across its 3 channels
        # into a 768-dim vector, then use a linear layer to map 768 -> 1024
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape  # b = 1   n = 196

        # create one Cls token per image  [1, 1, 1024]
        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        # [1, 197, 1024]   concatenate the Cls token with the patch embeddings
        x = torch.cat((cls_tokens, x), dim=1)
        # add the position encodings to the Cls token and patch embeddings
        x += self.pos_embedding[:, :(n + 1)]
        # a Dropout follows the embedding
        x = self.dropout(x)

        # feed the final embeddings into the encoder  x: [1, 197, 1024] -> [1, 197, 1024]
        x = self.transformer(x)

        # self.pool = 'cls', so take the first output for classification [1, 1024]
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)  # identity mapping [1, 1024]

        # Cls head, multi-class output [1, cls_num]
        return self.mlp_head(x)

ViT has advantages over CNNs in extracting image features, such as capturing global information. I therefore also suggested that Jiang consider ViT.
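For reference, a minimal instantiation check of the ViT above; the hyperparameters here are illustrative placeholders (dim_head defaults to 64), not the ones we finally use.

vit = ViT(image_size=224, patch_size=16, num_classes=10,
          dim=1024, depth=6, heads=16, mlp_dim=2048)
img = torch.randn(1, 3, 224, 224)    # one 224x224 RGB image
out = vit(img)
print(out.shape)                     # torch.Size([1, 10])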

1.2.3 Decoder: MLP

Code:

import torch
from torch import nn

class MLP(nn.Module):
    # label_encoder is assumed to be a fitted sklearn LabelEncoder defined elsewhere;
    # len(label_encoder.classes_) gives the number of output classes
    def __init__(self, input_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, len(label_encoder.classes_))

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

This is the most traditional neural network. It can serve as a decoder, but achieving good results typically requires quite a few layers. A quick usage check follows.
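A minimal usage sketch, assuming label_encoder is a LabelEncoder fitted on a hypothetical symbol vocabulary and that the encoder outputs 1024-dimensional feature vectors:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder().fit(['0', '1', '+', '-', '=', 'x'])  # hypothetical vocabulary

mlp = MLP(input_size=1024)       # e.g. a 1024-dim feature vector from the encoder
feature = torch.randn(8, 1024)   # batch of 8 encoded feature vectors
logits = mlp(feature)
print(logits.shape)              # torch.Size([8, 6])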

1.2.4 Decoder: Transformer

Code:


import torch
from torch import nn

# Note: TransformerEncoder, TransformerDecoder, WordEmbedding, PositionEncoding
# and Generator are helper classes defined in the source article linked below.
class Transformer(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(Transformer, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        self.reset_parameters()
          
    def forward(self, src, tgt, src_mask, tgt_mask):  
        "Take in and process masked src and target sequences."  
        return self.generator(self.decode(self.encode(src, src_mask),   
                src_mask, tgt, tgt_mask))  
      
    def encode(self, src, src_mask):  
        return self.encoder(self.src_embed(src), src_mask)  
      
    def decode(self, memory, src_mask, tgt, tgt_mask):  
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)  
      
    @classmethod  
    def from_config(cls,src_vocab,tgt_vocab,N=6,d_model=512, d_ff=2048, h=8, dropout=0.1):  
        encoder = TransformerEncoder.from_config(N=N,d_model=d_model,  
                  d_ff=d_ff, h=h, dropout=dropout)  
        decoder = TransformerDecoder.from_config(N=N,d_model=d_model,  
                  d_ff=d_ff, h=h, dropout=dropout)  
        src_embed = nn.Sequential(WordEmbedding(d_model, src_vocab), PositionEncoding(d_model, dropout))  
        tgt_embed = nn.Sequential(WordEmbedding(d_model, tgt_vocab), PositionEncoding(d_model, dropout))  
          
        generator = Generator(d_model, tgt_vocab)  
        return cls(encoder, decoder, src_embed, tgt_embed, generator)  
      
    def reset_parameters(self):  
        for p in self.parameters():  
            if p.dim() > 1:  
                nn.init.xavier_uniform_(p)
(Code source: https://blog.csdn.net/qq_29788741/article/details/132072630; the helper classes referenced above are defined there.)

Thanks to self-attention, a Transformer decoder captures global information in the hidden vectors more effectively.
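To make the encoder-decoder idea concrete, here is a minimal wiring sketch for image-to-LaTeX generation. The class name FormulaToLatex, the CNN layers, the hyperparameters and the 120-token vocabulary are all illustrative assumptions, not our final implementation; token position encodings are omitted for brevity.

import torch
from torch import nn

class FormulaToLatex(nn.Module):
    """Sketch: a CNN encoder produces a sequence of visual features,
    and a Transformer decoder generates LaTeX tokens autoregressively."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        # encoder: a small CNN; the output channels become the feature dimension
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, d_model, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img, tgt_tokens):
        # img: [B, 1, H, W] -> memory: [B, (H/4)*(W/4), d_model]
        feat = self.encoder(img)
        memory = feat.flatten(2).transpose(1, 2)
        tgt = self.tok_embed(tgt_tokens)               # [B, T, d_model]
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                        # [B, T, vocab_size]

model = FormulaToLatex(vocab_size=120)
logits = model(torch.randn(2, 1, 64, 256), torch.randint(0, 120, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 120])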

2. Model Training and Hyperparameter Tuning

This part is mostly repetitive work. We train with different hyperparameter settings until we obtain the lowest loss.

The parameters we tune here are the learning rate and the learning-rate decay ratio. Others, such as the batch size and the hidden-vector dimension, are fixed to values chosen from experience.

The learning rate and its related parameters matter most for model convergence, so tuning only them lets us reach a relatively optimal model with less repetitive work (see the sketch after the parameter list below).

Best parameter combination:

lr = 0.15

lr.decay = 0.7
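A minimal sketch of this tuning loop, assuming an ExponentialLR-style decay where lr.decay corresponds to the scheduler's gamma; build_model and train_one_epoch are hypothetical placeholders, not our actual training code:

import torch

# hypothetical helpers: build_model() returns a fresh model,
# train_one_epoch(model, optimizer) runs one epoch and returns the average loss
def tune(build_model, train_one_epoch, epochs=30):
    best = (None, float('inf'))
    for lr in (0.05, 0.1, 0.15, 0.2):
        for decay in (0.5, 0.7, 0.9):
            model = build_model()
            optimizer = torch.optim.SGD(model.parameters(), lr=lr)
            # multiply the learning rate by `decay` after every epoch
            scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=decay)
            for _ in range(epochs):
                loss = train_one_epoch(model, optimizer)
                scheduler.step()
            if loss < best[1]:
                best = ((lr, decay), loss)
    return best  # for us this ended up at lr = 0.15, decay = 0.7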

Loss curve:

The loss decreases steadily and the model converges.
