[Part II] A GPT-4o-Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone

MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, for a total of 8B parameters.
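The code examples below all assume an already-loaded `model` and `tokenizer`. As a minimal sketch, loading roughly follows the Hugging Face model card (the repo id, `trust_remote_code`, and `init_tts` are taken from that card; adjust dtype and device to your hardware):

```python
MODEL_ID = 'openbmb/MiniCPM-o-2_6'  # Hugging Face repo id

def load_model():
    """Load MiniCPM-o 2.6 and its tokenizer (sketch; assumes a CUDA GPU)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,   # the repo ships custom modeling code
        torch_dtype=torch.bfloat16,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.init_tts()              # required before calls with generate_audio=True
    return model, tokenizer

if __name__ == '__main__':
    model, tokenizer = load_model()
```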

Code Examples

Speech Instruction
MiniCPM-o-2.6 also supports speech instruction (a.k.a. voice creation). You can describe a voice in detail, and the model will generate speech matching that description. For more "instruction-to-speech" example prompts, see https://voxinstruct.github.io/VoxInstruct/.

instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)

Voice Cloning

MiniCPM-o-2.6 also supports zero-shot text-to-speech (a.k.a. voice cloning). In this mode, the model behaves like a TTS model.

import librosa

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)

Handling Various Audio Understanding Tasks

MiniCPM-o-2.6 can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. For audio-to-text tasks, you can use the following prompts:

ASR with ZH (same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
Speaker Analysis: Based on the speaker’s content, speculate on their gender, condition, age range, and health status.
General Audio Caption: Summarize the main content of the audio.
General Sound Scene Tagging: Utilize one keyword to convey the audio’s content or the associated scene.
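The prompt list above can be collected into a small helper for building the `msgs` list that `model.chat` expects. The task keys and the helper name here are this article's own labels, not part of the model API:

```python
# Map each audio-understanding task to its prompt (prompts as listed above).
AUDIO_TASK_PROMPTS = {
    'asr_zh': '请仔细听这段音频片段,并将其内容逐字记录。',
    'asr_en': 'Please listen to the audio snippet carefully and transcribe the content.',
    'speaker_analysis': ("Based on the speaker's content, speculate on their gender, "
                         'condition, age range, and health status.'),
    'audio_caption': 'Summarize the main content of the audio.',
    'scene_tagging': ("Utilize one keyword to convey the audio's content "
                      'or the associated scene.'),
}

def build_audio_msgs(task, audio_input):
    """Build the msgs list expected by model.chat for an audio-to-text task."""
    prompt = AUDIO_TASK_PROMPTS[task] + '\n'
    return [{'role': 'user', 'content': [prompt, audio_input]}]
```

For example, `msgs = build_audio_msgs('asr_en', audio_input)` reproduces the message used in the transcription example below.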

import librosa

task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)

Vision-Only Mode

MiniCPM-o-2_6 uses the same inference methods as MiniCPM-V-2_6.

# test.py
from PIL import Image

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Chat with Multiple Images

Python code for running MiniCPM-o 2.6 with multiple image inputs:

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-Context Few-Shot Learning

Python code for running MiniCPM-o 2.6 with few-shot inputs:

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

Chat with Video

Python code for running MiniCPM-o 2.6 with video input:

from PIL import Image
from decord import VideoReader, cpu  # pip install decord

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path ="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]
# Set decode params for video
params={}
params["use_image_id"] = False
#params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448
params["max_slice_nums"] = 1 # in my tests this had to be 1, otherwise an error was raised
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
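The frame-selection logic in `encode_video` is independent of decord and can be checked on plain lists: `uniform_sample` splits the index list into `n` equal buckets and keeps the element at the middle of each, so long videos are capped at `MAX_NUM_FRAMES` evenly spread frames:

```python
def uniform_sample(l, n):
    # Same helper as in encode_video: split l into n equal-width buckets
    # and take the element at the middle of each bucket.
    gap = len(l) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [l[i] for i in idxs]

# A 100-frame index list capped at 8 frames keeps these indices:
frame_idx = list(range(100))
print(uniform_sample(frame_idx, 8))  # → [6, 18, 31, 43, 56, 68, 81, 93]
```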

For more usage details, see the GitHub repository.

Inference with llama.cpp

MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. For more details, see our llama.cpp fork and its readme.

Int4 Quantized Version

Download the int4 quantized version for lower GPU memory usage (7 GB): MiniCPM-o-2_6-int4.

Wrapping Up


With a walkthrough of openbmb/MiniCPM-o-2_6 this comprehensive, don't forget to like, follow, and bookmark 🔥🔥🔥🔥🔥.
