Exploring InternLM-XComposer: A Multimodal Large Model in Practice

寻道AI小兵

Published on 2024-09-02 08:00:00


Article link: https://blog.csdn.net/xiaobing259/article/details/141788418


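Introduction

Amid the current wave of artificial intelligence, multimodal learning, which bridges vision and language, is showing distinctive promise. 浦语·灵笔 InternLM-XComposer, released by Shanghai AI Laboratory, is a leading example in this area. This article covers InternLM-XComposer's technical features, benchmark results, application scenarios, and how to deploy it locally for inference.

I. Overview

InternLM-XComposer is a multimodal large vision-language model developed by Shanghai AI Laboratory, with strong capabilities in image-text understanding and generation. It supports high-resolution image understanding, multi-turn multi-image dialogue, fine-grained video understanding, webpage generation, and high-quality interleaved text-image article composition, and it performs well on multiple benchmarks.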

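II. Features

InternLM-XComposer-2.5 offers the following features:

Long-context input and output: handles contexts of up to 96K tokens. According to the project README, the model is trained with 24K interleaved image-text contexts and extends to 96K through RoPE extrapolation, which suits tasks that need extensive input and output.
Ultra-high-resolution image understanding: a native 560×560 ViT vision encoder supports high-resolution images of any aspect ratio.
Fine-grained video understanding: a video is treated as an ultra-high-resolution composite image made of tens to hundreds of frames, so the details of each frame are captured.
Multi-turn multi-image dialogue: supports free-form conversations over several images and several turns, interacting naturally with people.
Webpage generation: creates webpages from text-image instructions by writing the HTML, CSS, and JavaScript source code.
High-quality text-image article composition: uses specially designed Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques to noticeably improve the quality of the written content.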
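
To make the high-resolution item more concrete, here is a minimal sketch of how an image of arbitrary aspect ratio could be covered by 560×560 tiles. The function name `tile_grid` and the simple ceiling-based layout are assumptions for illustration only; the model's real partitioning logic lives in its remote code and may differ.

```python
import math

TILE = 560  # side length of the ViT input, as stated in the feature list


def tile_grid(width: int, height: int, tile: int = TILE) -> tuple[int, int]:
    """Return (columns, rows) of tile-sized patches needed to cover an image.

    Illustrative assumption: plain ceiling division, no resizing or padding rules.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows


if __name__ == "__main__":
    cols, rows = tile_grid(1920, 1080)
    print(f"1920x1080 -> {cols} columns x {rows} rows = {cols * rows} tiles")
```

Under this assumption a 1920×1080 image needs 4 columns and 2 rows, which gives a rough sense of how the visual token count grows with resolution.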

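III. Model Evaluation

InternLM-XComposer-2.5 was evaluated on 28 multimodal benchmarks. It outperforms existing open-source state-of-the-art models on 16 of them, and on 16 key tasks it performs comparably to GPT-4V and Gemini Pro.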

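IV. Application Scenarios

Application scenarios for InternLM-XComposer-2.5 include, but are not limited to:

Video content understanding and generation: describes video frames in detail, much like a professional commentator, which supports video content analysis and generation. It has broad prospects in film production, video surveillance, and similar fields.
Multi-image multi-turn dialogue systems: understands and responds to several images across a multi-turn conversation, which suits customer service and interactive applications and gives users a smarter, more convenient experience.
Web design: generates a webpage design from a text description, which suits rapid prototyping and web development and saves time and cost.
Text-image article composition: writes articles that combine text and images, which suits news, blogs, and educational materials.
Academic research: the model weights are fully open, permitting academic research and free commercial use, which supports research in multimodal learning and AI and encourages academic exchange and technical innovation.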

V. Local Deployment and Inference

1. Download the model

# Download the model weights from ModelScope into cache_dir
from modelscope import snapshot_download
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b', cache_dir='/root/autodl-tmp', revision='master')
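
Before loading the model it can help to confirm that the download finished. The snippet below is a minimal sketch: `missing_files` is a hypothetical helper, and the list of required file names is an assumption (a Transformers checkpoint directory normally contains `config.json`; extend the tuple with whatever files your checkpoint actually ships).

```python
import os

# Assumed for illustration: files a usable checkpoint directory normally contains.
REQUIRED_FILES = ("config.json",)


def missing_files(model_dir: str, required=REQUIRED_FILES) -> list[str]:
    """Return the names in `required` that are absent from `model_dir`."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    # Path used by the scripts in this article: cache_dir joined with the model id.
    model_dir = "/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b"
    absent = missing_files(model_dir)
    print("missing:", ", ".join(absent) if absent else "none")
```

Passing the `model_dir` value returned by `snapshot_download` avoids hard-coding the path a second time.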


2. Install dependencies

Install the required dependencies with the following commands:

pip install transformers
pip install accelerate
pip install decord
pip install einops
pip install sentencepiece
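
A quick way to check the environment before running the scripts is to look up each module without importing it. This is a minimal sketch using the standard library's `importlib.util.find_spec`; `find_missing` is a hypothetical helper name, and `torch` is included because every script in this article imports it even though it is not in the pip list above.

```python
import importlib.util

# Top-level modules the scripts in this article rely on.
# `torch` is imported by every script but is not installed by the commands above.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "accelerate",
    "decord",
    "einops",
    "sentencepiece",
]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```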

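3. Video understanding

The video understanding capability works like a professional video analyst: it digs into the information in a video and gives users a more complete, in-depth analysis of its content. The script below loads the model, asks for a detailed description of a video, and then asks a follow-up question that reuses the history from the first turn.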

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_name = '/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b'
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Here are some frames of a video. Describe this video in detail'
image = ['/root/autodl-tmp/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The video opens with a shot of an athlete, dressed in a red and yellow uniform with the word "CHINA" emblazoned across the front, preparing for a race. 
#The athlete, Liu Xiang, is seen in a crouched position, focused and ready, with the Olympic rings visible in the background, indicating the prestigious setting of the Olympic Games. As the race commences, the athletes are seen sprinting towards the hurdles, their determination evident in their powerful strides. 
#The camera captures the intensity of the competition, with the athletes' numbers and times displayed on the screen, providing a real-time update on their performance. The race reaches a climax as Liu Xiang, still in his red and yellow uniform, triumphantly crosses the finish line, his arms raised in victory. 
#The crowd in the stands erupts into cheers, their excitement palpable as they witness the athlete's success. The video concludes with a close-up shot of Liu Xiang, still basking in the glory of his victory, as the Olympic rings continue to symbolize the significance of the event.

query = 'tell me the athlete code of Liu Xiang'
image = ['/root/autodl-tmp/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, history=his, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The athlete code of Liu Xiang, as displayed on his uniform in the video, is "1363".
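
The two calls above follow one pattern: each turn returns a response plus a history object, and the next turn passes that history back in. The sketch below factors the pattern into a loop. `run_turns` is a hypothetical helper, and it assumes only what the script above shows, namely that the chat function accepts a `history` keyword and returns a `(response, history)` pair.

```python
from typing import Any, Callable, List, Tuple

# A chat function takes (query, media, **kwargs) and returns (response, history).
ChatFn = Callable[..., Tuple[str, Any]]


def run_turns(chat_fn: ChatFn, queries: List[str], media: List[str],
              **gen_kwargs) -> List[str]:
    """Ask `queries` in order, feeding each turn's history into the next turn."""
    history = None
    responses = []
    for query in queries:
        kwargs = dict(gen_kwargs)
        if history is not None:
            # Only later turns pass history, mirroring the script above.
            kwargs["history"] = history
        response, history = chat_fn(query, media, **kwargs)
        responses.append(response)
    return responses
```

With the model loaded as above, `chat_fn` could be `functools.partial(model.chat, tokenizer)`, and the generation options (`do_sample=False, num_beams=3, use_meta=True`) would be passed as `gen_kwargs`, still inside the `torch.autocast` context.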

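4. Multi-image multi-turn dialogue

Multi-image multi-turn dialogue lets the model interact naturally with people, understanding and responding to several images over the course of a conversation. Each image is referenced in the query with an <ImageHere> placeholder, and a follow-up turn appends a new image to the same list.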

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_name = '/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b'
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Image1 <ImageHere>; Image2 <ImageHere>; Image3 <ImageHere>; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'
image = ['./examples/cars1.jpg',
        './examples/cars2.jpg',
        './examples/cars3.jpg',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#To analyze the advantages and disadvantages of each car, we need to consider factors such as brand reputation, performance, design, cost, and maintenance. 
#1. Mercedes-Benz: - Advantages: Known for its luxury and reliability, the Mercedes-Benz brand offers a high level of comfort, advanced technology, and superior craftsmanship. The vehicle in the image appears to be an SUV, which is versatile for both city driving and off-road conditions. - Disadvantages: Typically, Mercedes-Benz vehicles are more expensive compared to other brands, and they may require more frequent maintenance due to their luxury status. 
#2. Bugatti: - Advantages: Bugatti is renowned for producing some of the fastest and most powerful cars in the world. The vehicle in the image is a sports car, likely offering an exhilarating driving experience with its high-performance engine and advanced aerodynamics. - Disadvantages: Bugatti cars are extremely expensive, making them less accessible to the average consumer. They also require specialized knowledge for maintenance and may not be suitable for everyday driving due to their high-performance nature. 
#3. BMW: - Advantages: BMW is a well-respected manufacturer known for its innovative engineering and design. The car in the image appears to be an electric vehicle (EV), which offers advantages such as lower emissions, reduced noise pollution, and potentially lower operating costs. - Disadvantages: EVs like the one in the image may have a shorter range compared to traditional gasoline-powered vehicles, limiting their suitability for long-distance travel. 
#Additionally, the initial cost of purchasing an EV can be higher than that of a conventional car. In conclusion, each car has its own set of advantages and disadvantages. The Mercedes-Benz offers luxury and versatility, the Bugatti provides high performance and exclusivity, and the BMW represents innovation in electric vehicle technology. The choice ultimately depends on the buyer's priorities, such as comfort, performance, environmental impact, and budget.

query = 'Image4 <ImageHere>; How about the car in Image4'
image.append('./examples/cars4.jpg')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, history=his, use_meta=True)
print(response)
#The car in Image4 is a red sports car, which appears to be a Ferrari. Ferrari is a renowned Italian manufacturer known for producing some of the most iconic and high-performance sports cars in the world. - Advantages: Ferrari vehicles are synonymous with speed, luxury, and engineering excellence. 
#The car in the image likely offers an exhilarating driving experience with its powerful engine, advanced aerodynamics, and high-quality craftsmanship. The red color adds to the car's aesthetic appeal, making it stand out on the road. - Disadvantages: Ferrari cars are extremely expensive, making them less accessible to the average consumer. 
#They also require specialized knowledge for maintenance and may not be suitable for everyday driving due to their high-performance nature. In conclusion, the Ferrari in Image4 represents a pinnacle of automotive engineering and design, offering unmatched performance and luxury. 
#However, its high cost and specialized maintenance requirements make it less practical for everyday use compared to the other vehicles in the images.
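
The queries in this section follow a fixed pattern: one `ImageN <ImageHere>;` tag per image, then the question. A small helper can build that prefix so the tag count always matches the image list. `build_image_query` is a hypothetical name; the tag format is copied from the queries above.

```python
def build_image_query(question: str, num_images: int, start: int = 1) -> str:
    """Prefix `question` with one 'ImageN <ImageHere>;' tag per image.

    `start` is the number of the first image, so a follow-up turn that adds
    a fourth image can use start=4.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    tags = [f"Image{i} <ImageHere>;" for i in range(start, start + num_images)]
    return " ".join(tags + [question])


if __name__ == "__main__":
    print(build_image_query("How about the car in Image4", 1, start=4))
```

For the first turn above, `build_image_query(question, len(image))` reproduces the three-image prefix.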

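5. High-resolution image understanding

High-resolution image understanding works like a high-powered microscope: it brings out the fine details of an image and gives users more accurate, complete information about it. The example below analyzes a dense infographic.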

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_name = '/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b'
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detail manner'
image = ['/root/autodl-tmp/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The infographic is a visual representation of various facts about Dubai. It begins with a statement about Palm Jumeirah, highlighting it as the largest artificial island visible from space. It then provides a historical context, noting that in 1968, there were only a few cars in Dubai, contrasting this with the current figure of more than 1.5 million vehicles. 
#The infographic also points out that Dubai has the world's largest Gold Chain, with 7 of the top 10 tallest hotels located there. Additionally, it mentions that the crime rate is near 0%, and the income tax rate is also 0%, with 20% of the world's total cranes operating in Dubai. Furthermore, it states that 17% of the population is Emirati, and 83% are immigrants.
#The Dubai Mall is highlighted as the largest shopping mall in the world, with 1200 stores. The infographic also notes that Dubai has no standard address system, with no zip codes, area codes, or postal services. It mentions that the Burj Khalifa is so tall that its residents on top floors need to wait longer to break fast during Ramadan. 
#The infographic also includes information about Dubai's climate-controlled City, with the Royal Suite at Burj Al Arab costing $24,000 per night. Lastly, it notes that the net worth of the four listed billionaires is roughly equal to the GDP of Honduras.

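6. Instruction-to-webpage generation

Instruction-to-webpage generation works like a professional web designer: it quickly produces a webpage from the user's instructions, making web design faster and more convenient.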

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_name = '/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b'
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer = tokenizer

query = 'A website for Research institutions. The name is Shanghai AI lab. Top Navigation Bar is blue.Below left, an image shows the logo of the lab. In the right, there is a passage of text below that describes the mission of the laboratory.There are several images to show the research projects of Shanghai AI lab.'
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.write_webpage(query, seed=202, task='Instruction-aware Webpage Generation', repetition_penalty=3.0)
print(response)
# see the Instruction-aware Webpage Generation.html
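
The closing comment in the script points to an HTML file named after the task. If you want to write the result to a path of your own choosing, a sketch like the following would do it, assuming `response` is the generated HTML as a string. `save_webpage` is a hypothetical helper and not part of the model's API.

```python
from pathlib import Path


def save_webpage(html: str,
                 path: str = "Instruction-aware Webpage Generation.html") -> Path:
    """Write generated HTML to `path` as UTF-8 and return the resolved path."""
    target = Path(path)
    # Create parent directories so nested output paths work on the first run.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target.resolve()
```

Calling `save_webpage(response, "out/lab.html")` after the generation step would then leave a file that can be opened directly in a browser.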

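7. Text-image article writing

Text-image article writing works like a skilled author: it composes vivid articles that combine text and images, giving readers a richer reading experience.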


import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_name = '/root/autodl-tmp/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b'
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer = tokenizer

# The prompt is kept in Chinese: it is an essay task built around the film 《长安三万里》,
# asking for an essay of at least 800 characters on "失意" (setback) and "诗意" (poetry).
query = '阅读下面的材料，根据要求写作。 电影《长安三万里》的出现让人感慨，影片并未将重点全落在大唐风华上，也展现了恢弘气象的阴暗面，即旧门阀的资源垄断、朝政的日益衰败与青年才俊的壮志难酬。高适仕进无门，只能回乡沉潜修行。李白虽得玉真公主举荐，擢入翰林，但他只是成为唐玄宗的御用文人，不能真正实现有益于朝政的志意。然而，片中高潮部分《将进酒》一节，人至中年、挂着肚腩的李白引众人乘仙鹤上天，一路从水面、瀑布飞升至银河进入仙宫，李白狂奔着与仙人们碰杯，最后大家纵身飞向漩涡般的九重天。肉身的微贱、世路的坎坷，拘不住精神的高蹈。“天生我材必有用，千金散尽还复来。” 古往今来，身处闲顿、遭受挫折、被病痛折磨，很多人都曾经历了人生的“失意”，却反而成就了他们“诗意”的人生。对正在追求人生价值的当代青年来说，如何对待人生中的缺憾和困顿?诗意人生中又有怎样的自我坚守和自我认同?请结合“失意”与“诗意”这两个关键词写一篇文章。 要求:选准角度，确定立意，明确文体，自拟标题;不要套作，不得抄袭;不得泄露个人信息;不少于 800 字。'
with torch.autocast(device_type='cuda', dtype=torch.float16):
    # write_artical is the method name as spelled in the model's code
    response = model.write_artical(query, seed=8192)
print(response)
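
The prompt asks for at least 800 characters, so a quick length check on the result can be useful. The sketch below counts only CJK ideographs, which is one possible reading of the requirement (punctuation, Latin letters, and whitespace are excluded). Both function names are hypothetical.

```python
def count_cjk(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def meets_length(text: str, minimum: int = 800) -> bool:
    """Return True if `text` contains at least `minimum` CJK ideographs."""
    return count_cjk(text) >= minimum
```

After generation, `meets_length(response)` gives a simple pass/fail signal; if it fails, rerunning with a different `seed` is one option.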

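Conclusion

As an advanced tool in the multimodal field, InternLM-XComposer has made technical breakthroughs and opened up new possibilities for text-image creation, content understanding, and other areas. As the technology continues to improve, there is good reason to expect it to play an increasingly important role in future AI applications.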
