Phi-3.5-vision-instruct is a model in Microsoft's recently released Phi-3.5 series, focused on multimodal tasks and, in particular, visual reasoning.
The model covers a broad range of capabilities: general image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video-clip summarization. This makes it well suited to many AI-driven applications, and it shows notable performance gains on image- and video-related benchmarks.
Architecturally, Phi-3.5-vision-instruct is a 4.2-billion-parameter system that combines an image encoder, a connector, a projector, and the Phi-3 Mini language model. It was trained on 256 NVIDIA A100-80G GPUs over 6 days.
On the MMMU benchmark (Massive Multi-discipline Multimodal Understanding), Phi-3.5-vision scores 43.0, an improvement over the previous version that reflects its stronger handling of complex image-understanding tasks.
GitHub project: https://github.com/microsoft/Phi-3CookBook
I. Environment Setup
1. Python environment
Python 3.10 or newer is recommended.
2. Installing pip packages
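You can confirm the interpreter version before proceeding:
python --version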
pip install torch==2.3.0+cu118 torchvision==0.18.0+cu118 torchaudio==2.3.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --upgrade transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation
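Before downloading the model, it is worth confirming that the GPU build of PyTorch and a recent transformers release are actually installed. A minimal sanity check (Phi-3.5-vision needs a fairly recent transformers version; no exact minimum is pinned here):

import torch
import transformers

print("torch:", torch.__version__)                   # should report a +cu118 build
print("CUDA available:", torch.cuda.is_available())  # should be True on a GPU machine
print("transformers:", transformers.__version__)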
3. Model download:
git lfs install
git clone https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct
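If git lfs is inconvenient, the same weights can also be fetched with the ModelScope Python SDK; a minimal sketch, assuming the SDK is installed via pip install modelscope:

# Download Phi-3.5-vision-instruct through the ModelScope SDK.
# snapshot_download returns the local directory containing the weights.
from modelscope import snapshot_download

model_dir = snapshot_download('LLM-Research/Phi-3.5-vision-instruct')
print(model_dir)  # pass this path as --model_path below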
II. Functional Testing
1. Running the test:
(1) Calling the model from Python code
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
import argparse


class VisionInstructModel:
    """Thin wrapper around Phi-3.5-vision-instruct for single-image description."""

    def __init__(self, model_path, local_image_path, torch_dtype='auto'):
        self.model_path = model_path
        self.local_image_path = local_image_path
        self.torch_dtype = torch_dtype
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self._load_model_and_processor()

    def _load_model_and_processor(self):
        # trust_remote_code is required: the model ships its own processing/modeling code.
        self.processor = AutoProcessor.from_pretrained(self.model_path, trust_remote_code=True)
        # flash_attention_2 needs the flash-attn package and a supported GPU;
        # on CPU-only machines, switch this to 'eager'.
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            trust_remote_code=True,
            torch_dtype=self.torch_dtype,
            _attn_implementation='flash_attention_2'
        ).to(self.device)

    def _prepare_input(self, prompt, image_path):
        image = Image.open(image_path)
        return self.processor(prompt, image, return_tensors="pt").to(self.device)

    def generate_response(self, prompt, max_new_tokens=1000):
        inputs = self._prepare_input(prompt, self.local_image_path)
        generate_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=self.processor.tokenizer.eos_token_id
        )
        # Drop the prompt tokens so only the newly generated tokens are decoded.
        generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
        response = self.processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        return response

    def describe_image(self):
        # Phi-3.5-vision chat format: <|user|> ... <|image_1|> ... <|end|> <|assistant|>
        user_prompt = '<|user|>\n'
        assistant_prompt = '<|assistant|>\n'
        prompt_suffix = "<|end|>\n"
        prompt = f"{user_prompt}<|image_1|>\nDescribe the picture{prompt_suffix}{assistant_prompt}"
        response = self.generate_response(prompt)
        print("response:", response)
        return response


def main(model_path, image_path):
    model = VisionInstructModel(model_path, image_path, torch_dtype='bfloat16')
    model.describe_image()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run VisionInstructModel to describe an image.")
    parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
    parser.add_argument("--image_path", type=str, required=True, help="Path to the image file.")
    args = parser.parse_args()
    main(args.model_path, args.image_path)
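Save the script (for example as phi35_vision_demo.py; the filename is arbitrary) and point it at the downloaded weights and a local test image:

python phi35_vision_demo.py --model_path ./Phi-3.5-vision-instruct --image_path ./test.jpg

The model also accepts multi-image prompts, with each image referenced as <|image_1|>, <|image_2|>, and so on. A minimal sketch, reusing the processor and model loaded by the class above; a.jpg and b.jpg are placeholder filenames:

# Multi-image prompt sketch; both image names below are placeholders.
from PIL import Image

m = VisionInstructModel("./Phi-3.5-vision-instruct", None, torch_dtype='bfloat16')
processor, model = m.processor, m.model
images = [Image.open("a.jpg"), Image.open("b.jpg")]
prompt = ("<|user|>\n<|image_1|>\n<|image_2|>\n"
          "Summarize what the two pictures have in common<|end|>\n<|assistant|>\n")
inputs = processor(prompt, images, return_tensors="pt").to(m.device)
out = model.generate(**inputs, max_new_tokens=500,
                     eos_token_id=processor.tokenizer.eos_token_id)
# Decode only the newly generated tokens, mirroring generate_response above.
print(processor.batch_decode(out[:, inputs['input_ids'].shape[1]:],
                             skip_special_tokens=True)[0])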
To be continued...
For more details, follow: 杰哥新技术