Multimodal Generative Models: Latest List for 2024

rockingdingo

Last modified 2024-02-24 20:15:18

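Tags: artificial intelligence, multimodal, image generation, video generation

First published 2024-02-24 20:14:43

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. Reprints must include a link to the original source and this notice.

Article link: https://blog.csdn.net/rockingdingo/article/details/136275490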

Introduction to Multimodal Generative Models: Model Architecture, Key Features, and Code

1. Multimodal Generative Models: Latest List for 2024

Model	Year	Developer	Modality	Architecture	Key Features
Sora	2024	OpenAI	Video, Text	Diffusion Transformer (DiT)	Generative Modeling, Text-to-Video
Gemini 1.5	2024	Google	Video, Text, Audio	Image Encoder: ViT; Text Encoder: Transformer	Generative Modeling, Long Context Window
BLIP-2	2023	Salesforce Research	Image, Text	Q-Former (bridges the modality gap); Image Encoder: ViT-L/ViT-g; LLM: OPT/FlanT5	Generative Modeling, Image-to-Text, Visual Question Answering, Image-to-Text Retrieval
GPT-4V	2023	OpenAI	Image, Text	LLM: GPT-4	Generative Modeling, Multimodal LLM, Visual Question Answering
LLaVA	2023	UW-Madison, Microsoft Research, Columbia University	Image, Text	LLM: Vicuna; Image Encoder: CLIP ViT-L	Generative Modeling, Visual Instruction Tuning
KOSMOS-2	2023	Microsoft	Image, Text	Vision Encoder; LLM: 24-layer MAGNETO Transformer	Multimodal Grounding, Language Understanding and Generation
PaLM-E	2023	Google	Image, Text	Image Encoder: ViT	Multimodal Language Model
BLIP	2022	Salesforce Research	Image, Text	Image Encoder: ViT-B, ViT-L; Text Encoder: BERT-Base	Generative Modeling, Bootstrapping, VQA, Caption Generation
Flamingo	2022	DeepMind	Image, Text	Gated Cross-Attention; Perceiver Resampler; Vision Encoder: NFNet-F6	VQA, Interleaved Visual and Textual Data
unCLIP (DALL-E 2)	2022	OpenAI	Image, Text	CLIP ViT-L; Diffusion Prior/Autoregressive Prior	Generative Modeling, Text-to-Image, Image Generation, Diffusion Models
BEiT-3	2022	Microsoft	Image, Text	Multiway Transformer (ViT-giant scale)	Object Detection, Visual Question Answering, Image Captioning
CLIP	2021	OpenAI	Image, Text	Text Encoder: Transformer; Image Encoder: ResNet/ViT	Multimodal Alignment, Zero-Shot Learning
ALIGN	2021	Google	Image, Text	Image Encoder: EfficientNet; Text Encoder: BERT	Multimodal Alignment, Image-Text Retrieval
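
The "Multimodal Alignment" and "Zero-Shot Learning" entries for CLIP and ALIGN can be illustrated with a small sketch. It assumes the two encoders have already produced embeddings in a shared space. The 3-dimensional vectors, the label prompts, and the helper names below are hypothetical and chosen only for illustration; the temperature of 0.07 is likewise an assumed value, not a setting taken from this article.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return one probability per candidate prompt for a single image embedding."""
    img = l2_normalize(image_emb)
    logits = []
    for t in text_embs:
        txt = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)  # temperature is an assumed value
    return softmax(logits)

# Hypothetical toy embeddings; real encoders output hundreds of dimensions.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

No classifier is trained here: the candidate classes are expressed as text prompts, and the prediction is whichever prompt embedding lies closest to the image embedding.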

2. Common Tasks for Multimodal Generative Models

Image Captioning
Image-Text Retrieval
Text-to-Image
Text-to-Video
Visual Question Answering
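
As one concrete case from the list above, image-text retrieval is commonly scored with Recall@K: the fraction of images whose paired caption appears among the K most similar captions. The sketch below assumes precomputed embeddings where image i is paired with caption i. The function names and the 2-dimensional vectors are hypothetical and serve only to show the computation.

```python
def dot(a, b):
    """Dot product used as the similarity score between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

def rank_captions(image_emb, caption_embs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [dot(image_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def recall_at_k(image_embs, caption_embs, k):
    """Fraction of images whose paired caption (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        if i in rank_captions(img, caption_embs)[:k]:
            hits += 1
    return hits / len(image_embs)

# Hypothetical toy embeddings; image i is paired with caption i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.9]]
caption_embs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.5]]

r1 = recall_at_k(image_embs, caption_embs, k=1)
r2 = recall_at_k(image_embs, caption_embs, k=2)
print(r1, r2)
```

In this toy data the third image ranks another caption above its own, so it counts as a miss at K=1 and a hit at K=2.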

3. Related Links

Reference: Multimodal Generative Models: Latest List for 2024 (Zhihu)
