Exporting CLIP Models to ONNX

openai CLIP

GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Model weights:

https://huggingface.co/openai/clip-vit-base-patch32 
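The checkpoint can be fetched into the local Hugging Face cache ahead of time; a minimal sketch, assuming the huggingface_hub package is available (from_pretrained in the script below also downloads the weights automatically on first use):

from huggingface_hub import snapshot_download

# download config, tokenizer files and weights into the local HF cache
snapshot_download(repo_id="openai/clip-vit-base-patch32")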

from PIL import Image
import torch
import torch.nn as nn
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # or a local directory containing the downloaded weights
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_path = "000000039769.jpg"  # any local test image works here
image = Image.open(image_path)

inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)

print("inputs:", inputs.keys())

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
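
A quick check of the PyTorch result before export. Note that with a single caption the softmax over dim=1 is trivially 1.0; pass several candidate texts to the processor to get a meaningful distribution:

print("probs:", probs)  # tensor([[1.]]) for a single caption; a distribution over captions otherwise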


class ImgModelWrapper(nn.Module):
    def __init__(self, model):
        super(ImgModelWrapper, self).__init__()
        self.model = model

    def forward(self, pixel_values):
        image_features = self.model.get_image_features(pixel_values=pixel_values)
        return image_features


class TxtModelWrapper(nn.Module):
    def __init__(self, model):
        super(TxtModelWrapper, self).__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        text_features = self.model.get_text_features(input_ids=input_ids, attention_mask=attention_mask)
        return text_features


img_model = ImgModelWrapper(model)
txt_model = TxtModelWrapper(model)


torch.onnx.export(img_model,  # model being run
                  (inputs.pixel_values,),  # model input (or a tuple for multiple inputs)
                  "clip_img.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=15,          # the ONNX version to export the model to
                  do_constant_folding=False,  # whether to execute constant folding for optimization
                  input_names=['pixel_values'],   # the model's input names
                  # output_names=['output'],  # the model's output names
                  # dynamic_axes={'pixel_values': {0: 'batch', 2: 'height', 3: 'width'}},
                  )

torch.onnx.export(txt_model,  # model being run
                  (inputs.input_ids, inputs.attention_mask),  # model input (or a tuple for multiple inputs)
                  "clip_txt.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=15,          # the ONNX version to export the model to
                  do_constant_folding=False,  # whether to execute constant folding for optimization
                  input_names=['input_ids', 'attention_mask'],   # the model's input names
                  # output_names=['output'],  # the model's output names
                  dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'}, 
                                'attention_mask': {0: 'batch', 1: 'seq'}},
                  )
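
After the export it is worth checking that the ONNX graphs reproduce the PyTorch features. Below is a minimal sketch using ONNX Runtime (assuming the onnxruntime package is installed; the feed names match the input_names used above):

import numpy as np
import onnxruntime as ort

img_sess = ort.InferenceSession("clip_img.onnx", providers=["CPUExecutionProvider"])
txt_sess = ort.InferenceSession("clip_txt.onnx", providers=["CPUExecutionProvider"])

# run the exported graphs on the same preprocessed inputs
onnx_img_feat = img_sess.run(None, {"pixel_values": inputs.pixel_values.numpy()})[0]
onnx_txt_feat = txt_sess.run(None, {"input_ids": inputs.input_ids.numpy(),
                                    "attention_mask": inputs.attention_mask.numpy()})[0]

# reference features from the wrapped PyTorch modules
with torch.no_grad():
    torch_img_feat = img_model(inputs.pixel_values).numpy()
    torch_txt_feat = txt_model(inputs.input_ids, inputs.attention_mask).numpy()

print("image feature max diff:", np.abs(onnx_img_feat - torch_img_feat).max())
print("text feature max diff:", np.abs(onnx_txt_feat - torch_txt_feat).max())

Note that the exported models return raw embeddings; to reproduce logits_per_image, L2-normalize both feature vectors and scale their dot product by model.logit_scale.exp().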

Chinese-CLIP can be exported in the same way; a sketch follows the repository link below.

GitHub - OFA-Sys/Chinese-CLIP: Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
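
The same wrapper trick carries over. A hedged sketch, assuming a transformers release that ships ChineseCLIPModel/ChineseCLIPProcessor and using the OFA-Sys/chinese-clip-vit-base-patch16 checkpoint as an example (the wrapper classes and the test image from the script above are reused):

from transformers import ChineseCLIPModel, ChineseCLIPProcessor

cn_model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
cn_processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

# "一张猫的照片" = "a photo of a cat"
cn_inputs = cn_processor(text=["一张猫的照片"], images=image, return_tensors="pt", padding=True)

# the wrappers only rely on get_image_features / get_text_features,
# which ChineseCLIPModel also provides
cn_img_model = ImgModelWrapper(cn_model)
cn_txt_model = TxtModelWrapper(cn_model)

torch.onnx.export(cn_img_model, (cn_inputs.pixel_values,), "chinese_clip_img.onnx",
                  export_params=True, opset_version=15,
                  input_names=['pixel_values'])

torch.onnx.export(cn_txt_model, (cn_inputs.input_ids, cn_inputs.attention_mask),
                  "chinese_clip_txt.onnx",
                  export_params=True, opset_version=15,
                  input_names=['input_ids', 'attention_mask'],
                  dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                                'attention_mask': {0: 'batch', 1: 'seq'}})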

 
