MiniGPT-4 Paper Reading + Colab Deployment

1. Deploying MiniGPT-4 on Colab

Code:
https://github.com/Czi24/Awesome-MLLM-LLM-Colab/tree/master/MLLM/MiniGPT-4-colab
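
For reference, a typical Colab cell following the upstream MiniGPT-4 repo's README might look like the sketch below. This is a sketch under assumptions: the linked notebook may organize the steps differently, and `eval_configs/minigpt4_eval.yaml` must be edited to point at your Vicuna weights and the pretrained MiniGPT-4 checkpoint.

```
# Hypothetical Colab cell; the linked notebook may differ in details.
# Clone the upstream MiniGPT-4 repo and install its dependencies.
!git clone https://github.com/Vision-CAIR/MiniGPT-4.git
%cd MiniGPT-4
!pip install -q -r requirements.txt  # assumes a pip requirements file; the repo may use a conda environment.yml instead

# Launch the Gradio demo (flags follow the upstream README); the config
# must reference your Vicuna weights and MiniGPT-4 checkpoint paths.
!python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```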

2. Paper Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the use of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as detailed image description generation and website creation from handwritten drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. In our experiments, we found that pre-training on raw image-text pairs alone could produce unnatural language outputs that lack coherence, including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to fine-tune the model with a conversational template. This step proved crucial for improving the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs.

3. Model Architecture

ViT & Q-Former + Linear + Vicuna

MiniGPT-4 consists of a vision encoder (a pretrained ViT together with a Q-Former), a single linear projection layer, and the Vicuna large language model. Only the linear projection layer needs to be trained to align the visual features with Vicuna.
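
Below is a minimal PyTorch sketch of this design, assuming a Q-Former output width of 768 and a Vicuna hidden size of 5120 (the Vicuna-13B value); the module names and dimensions are illustrative assumptions, not the repo's actual identifiers. The key point is that only the linear projection receives gradients.

```
import torch
import torch.nn as nn

class MiniGPT4Sketch(nn.Module):
    """Toy sketch of MiniGPT-4's wiring: frozen ViT + Q-Former ->
    trainable linear projection -> frozen Vicuna (dims are assumptions)."""

    def __init__(self, vit_qformer: nn.Module, vicuna: nn.Module,
                 qformer_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        self.vit_qformer = vit_qformer  # frozen visual encoder + Q-Former
        self.vicuna = vicuna            # frozen LLM
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the only trained part

        # Freeze everything except the projection layer.
        for p in self.vit_qformer.parameters():
            p.requires_grad = False
        for p in self.vicuna.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # Q-Former query tokens, e.g. shape (batch, 32, 768).
        img_tokens = self.vit_qformer(image)
        # Project into the LLM embedding space: (batch, 32, 5120).
        img_embeds = self.proj(img_tokens)
        # Prepend the image tokens to the text embeddings and run the LLM
        # (a HuggingFace-style causal LM that accepts inputs_embeds).
        inputs = torch.cat([img_embeds, text_embeds], dim=1)
        return self.vicuna(inputs_embeds=inputs)
```

Freezing both the vision side and the LLM keeps the trainable parameter count tiny, which is what makes training on roughly 5M pairs computationally cheap.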

4. Model Training

Stage 1:
Pre-training on roughly 5M image-text pairs

Stage 2:
Fine-tuning on 3,500 high-quality instruction examples

  1. Select 5,000 images from the Conceptual Captions dataset and generate detailed descriptions for them with the stage-1 model. These automatically generated descriptions contain noise and incoherence, such as repeated words or sentences, fragmented sentences, and irrelevant content.
  2. Refine the data with ChatGPT (a hypothetical sketch follows the prompt below), finally keeping 3,500 image-text pairs.
    System prompt:

Fix the error in the given paragraph. Remove any repeating sentences, meaningless characters, not English sentences, and so on. Remove unnecessary repetition. Rewrite any incomplete sentences. Return directly the results without explanation. Return directly the input paragraph if it is already correct without explanation.
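
A hypothetical sketch of this refinement step, using the OpenAI chat API with the system prompt above, is shown below; the model name, helper function, and final filtering are assumptions (the paper additionally verified the refined captions manually).

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Fix the error in the given paragraph. Remove any repeating sentences, "
    "meaningless characters, not English sentences, and so on. Remove "
    "unnecessary repetition. Rewrite any incomplete sentences. Return "
    "directly the results without explanation. Return directly the input "
    "paragraph if it is already correct without explanation."
)

def refine_caption(raw_caption: str) -> str:
    """Ask ChatGPT to clean one noisy stage-1 caption with the paper's prompt."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the paper only says "ChatGPT"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_caption},
        ],
    )
    return resp.choices[0].message.content

raw_captions = {  # toy example; in practice the 5,000 stage-1 outputs
    "img_001": "The image shows a dog the image shows a dog running in",
}
# Refine every caption; a manual verification pass then keeps ~3,500 pairs.
refined = {img_id: refine_caption(cap) for img_id, cap in raw_captions.items()}
```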

5. Summary

  1. Hallucination: long captions suffer from hallucination more severely than short captions.
  2. Limited spatial understanding.

