MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. It is built in an end-to-end fashion on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, for a total of 8B parameters.
Code Examples
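All of the snippets below assume that `model` and `tokenizer` have already been loaded. A minimal setup sketch, following the usual Hugging Face pattern for this repo (the `init_tts` call and argument choices here are assumptions based on the official model card, not verified against every release):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-o 2.6; trust_remote_code is required because the repo defines custom model classes
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',   # or 'flash_attention_2' if available
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Initialize the TTS head before any speech-generating call (voice creation / cloning)
model.init_tts()
```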
Voice Instruction
MiniCPM-o-2.6 can also perform voice instruction (a.k.a. voice creation). You can describe a voice in detail, and the model will generate speech that matches the description. For more instruction-to-speech example prompts, see https://voxinstruct.github.io/VoxInstruct/.
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'
msgs = [{'role': 'user', 'content': [instruction]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
Voice Cloning
MiniCPM-o-2.6 can also perform zero-shot text-to-speech (a.k.a. voice cloning). In this mode, the model behaves like a TTS model.
import librosa

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True)  # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)
Handling Various Audio Understanding Tasks
MiniCPM-o-2.6 can also handle a variety of audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. For audio-to-text tasks, you can use the following prompts:
ASR with ZH (same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
Speaker Analysis: Based on the speaker’s content, speculate on their gender, condition, age range, and health status.
General Audio Caption: Summarize the main content of the audio.
General Sound Scene Tagging: Utilize one keyword to convey the audio’s content or the associated scene.
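To avoid retyping these, the prompts above can be collected in a small dictionary (a convenience sketch; the key names are my own, the prompt strings are from the list above):

```python
# Task-name -> prompt mapping for the audio understanding tasks listed above
# (key names are my own invention; prompt strings are the documented ones)
AUDIO_PROMPTS = {
    'asr_zh': '请仔细听这段音频片段,并将其内容逐字记录。',
    'asr_en': 'Please listen to the audio snippet carefully and transcribe the content.',
    'speaker_analysis': ("Based on the speaker's content, speculate on their gender, "
                         'condition, age range, and health status.'),
    'audio_caption': 'Summarize the main content of the audio.',
    'scene_tagging': "Utilize one keyword to convey the audio's content or the associated scene.",
}

# Pick a task and append the trailing newline the example below expects
task_prompt = AUDIO_PROMPTS['asr_en'] + '\n'
print(task_prompt)
```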
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts.
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
Vision-Only Mode
MiniCPM-o-2_6 uses the same inference approach as MiniCPM-V-2_6.
# test.py
from PIL import Image

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
Chat with Multiple Images
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
In-Context Few-Shot Learning
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
Chat with Video
from decord import VideoReader, cpu  # pip install decord
from PIL import Image

MAX_NUM_FRAMES = 64  # if cuda OOM set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
{'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params = {}
params["use_image_id"] = False
# params["max_slice_nums"] = 2  # use 1 if cuda OOM and video resolution > 448*448
params["max_slice_nums"] = 1  # in my testing this had to be set to 1 directly, otherwise an error is raised
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
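The frame-sampling strategy inside `encode_video` (sample about one frame per second, then cap at `MAX_NUM_FRAMES` by uniform subsampling) can be illustrated in isolation; the numbers below are just an example, not from the source:

```python
def uniform_sample(seq, n):
    # Pick n indices spread evenly across seq, offset to the middle of each gap
    gap = len(seq) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [seq[i] for i in idxs]

# e.g. a 300-second video sampled at 1 fps yields 300 candidate frames;
# capping at 64 keeps roughly every 4.7th frame
frame_idx = list(range(300))
kept = uniform_sample(frame_idx, 64)
print(len(kept), kept[:3])  # 64 frames, starting near the middle of the first gap
```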
For more usage details, visit GitHub.
Inference with llama.cpp
MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our llama.cpp fork and its README for more details.
Int4 Quantized Version
Download the int4 quantized version to reduce GPU memory usage (7GB): MiniCPM-o-2_6-int4.
Wrapping Up
With such a comprehensive walkthrough of openbmb/MiniCPM-o-2_6, don't forget to like, follow, and bookmark 🔥🔥🔥🔥🔥.