Seeing and Understanding:
Bridging Vision with Chemical Knowledge Via ChemVLM
https://huggingface.co/AI4Chem/ChemVLM-26B
https://arxiv.org/pdf/2408.07246
Abstract
In this technical report, we propose ChemVLM, the first open-source multimodal large language model dedicated to the field of chemistry, designed to address the incompatibility between chemical image understanding and text analysis. Built upon the ViT-MLP-LLM architecture, we leverage ChemLLM-20B as the foundational large language model, endowing our model with robust capabilities for understanding and utilizing chemical text knowledge. Additionally, we employ InternViT-6B as a powerful image encoder. We have curated high-quality data from the chemical domain, including molecules, reaction formulas, and chemistry examination data, and compiled these into a bilingual multimodal question-answering dataset. We evaluate our model on multiple open-source benchmarks and three custom evaluation sets. Experimental results demonstrate that our model achieves excellent performance, securing state-of-the-art results in five of the six tasks evaluated.
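For a concrete picture of the ViT-MLP-LLM design named in the abstract, the sketch below shows the general pattern: a ViT image encoder, an MLP projector that maps visual features into the language model's embedding space, and an LLM that attends over the combined sequence. This is a minimal illustration under stated assumptions; the class name, layer sizes, and two-layer projector are illustrative, not ChemVLM's released implementation.

```python
# Minimal sketch of the ViT-MLP-LLM pattern (illustrative, not ChemVLM's actual code).
import torch
import torch.nn as nn

class VitMlpLlm(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. an InternViT-6B-like ViT
        # MLP projector: maps visual patch embeddings into the LLM token space.
        # A two-layer MLP is an assumption; depth/activation may differ in practice.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # e.g. a ChemLLM-20B-like decoder

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the chemistry image (molecule drawing, reaction scheme, ...).
        patch_embeds = self.vision_encoder(pixel_values)  # (B, N, vision_dim)
        visual_tokens = self.projector(patch_embeds)      # (B, N, llm_dim)
        # Prepend the projected visual tokens to the text embeddings so the
        # LLM processes one interleaved sequence.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```

The design choice worth noting is that only the lightweight projector bridges the two pretrained components, which lets a strong domain LLM and a strong vision encoder be combined without retraining either from scratch.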
Introduction
Motivation:
Large language models (LLMs) have been widely adopted across scientific domains for their great potential to accelerate scientific understanding and discovery [1]. While these models offer exciting possibilities for advancing research, they: