[Part II] A GPT-4o-Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone

MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, for a total of 8B parameters.
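The code examples below all assume an already-loaded `model` and `tokenizer`. As a minimal sketch, loading roughly follows the Hugging Face model card (the repo id, `trust_remote_code`, and `init_tts` are taken from that card; adjust dtype and device to your hardware):

```python
MODEL_ID = 'openbmb/MiniCPM-o-2_6'  # Hugging Face repo id

def load_model():
    """Load MiniCPM-o 2.6 and its tokenizer (sketch; assumes a CUDA GPU)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,   # the repo ships custom modeling code
        torch_dtype=torch.bfloat16,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.init_tts()              # required before calls with generate_audio=True
    return model, tokenizer

if __name__ == '__main__':
    model, tokenizer = load_model()
```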

Code Examples

Speech Instruction
MiniCPM-o-2.6 also supports speech instruction (a.k.a. voice creation). You can describe a voice in detail, and the model will generate speech matching that description. For more "instruction-to-speech" example prompts, see https://voxinstruct.github.io/VoxInstruct/.

instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)

Voice Cloning

MiniCPM-o-2.6 also supports zero-shot text-to-speech (a.k.a. voice cloning). In this mode, the model behaves like a TTS model.

import librosa

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)

Handling Various Audio Understanding Tasks

MiniCPM-o-2.6 can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. For audio-to-text tasks, you can use the following prompts:

ASR with ZH (same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
Speaker Analysis: Based on the speaker’s content, speculate on their gender, condition, age range, and health status.
General Audio Caption: Summarize the main content of the audio.
General Sound Scene Tagging: Utilize one keyword to convey the audio’s content or the associated scene.
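The prompt list above can be collected into a small helper for building the `msgs` list that `model.chat` expects. The task keys and the helper name here are this article's own labels, not part of the model API:

```python
# Map each audio-understanding task to its prompt (prompts as listed above).
AUDIO_TASK_PROMPTS = {
    'asr_zh': '请仔细听这段音频片段,并将其内容逐字记录。',
    'asr_en': 'Please listen to the audio snippet carefully and transcribe the content.',
    'speaker_analysis': ("Based on the speaker's content, speculate on their gender, "
                         'condition, age range, and health status.'),
    'audio_caption': 'Summarize the main content of the audio.',
    'scene_tagging': ("Utilize one keyword to convey the audio's content "
                      'or the associated scene.'),
}

def build_audio_msgs(task, audio_input):
    """Build the msgs list expected by model.chat for an audio-to-text task."""
    prompt = AUDIO_TASK_PROMPTS[task] + '\n'
    return [{'role': 'user', 'content': [prompt, audio_input]}]
```

For example, `msgs = build_audio_msgs('asr_en', audio_input)` reproduces the message used in the transcription example below.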

import librosa

task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)

Vision-Only Mode

MiniCPM-o-2_6 uses the same inference methods as MiniCPM-V-2_6.

# test.py
from PIL import Image

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Chat with Multiple Images

Python code for running MiniCPM-o 2.6 with multiple image inputs:

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-Context Few-Shot Learning

Python code for running MiniCPM-o 2.6 with few-shot inputs:

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

Chat with Video

Python code for running MiniCPM-o 2.6 with video input:

from PIL import Image
from decord import VideoReader, cpu  # pip install decord

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path ="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]
# Set decode params for video
params={}
params["use_image_id"] = False
#params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448
params["max_slice_nums"] = 1 # in my tests this had to be 1, otherwise an error was raised
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
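The frame-selection logic in `encode_video` is independent of decord and can be checked on plain lists: `uniform_sample` splits the index list into `n` equal buckets and keeps the element at the middle of each, so long videos are capped at `MAX_NUM_FRAMES` evenly spread frames:

```python
def uniform_sample(l, n):
    # Same helper as in encode_video: split l into n equal-width buckets
    # and take the element at the middle of each bucket.
    gap = len(l) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [l[i] for i in idxs]

# A 100-frame index list capped at 8 frames keeps these indices:
frame_idx = list(range(100))
print(uniform_sample(frame_idx, 8))  # → [6, 18, 31, 43, 56, 68, 81, 93]
```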

For more usage details, see the GitHub repository.

Inference with llama.cpp

MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. For more details, see our llama.cpp fork and its readme.

Int4 Quantized Version

Download the int4 quantized version for lower GPU memory usage (7 GB): MiniCPM-o-2_6-int4.

Wrapping Up


With a walkthrough of openbmb/MiniCPM-o-2_6 this comprehensive, don't forget to like, follow, and bookmark 🔥🔥🔥🔥🔥.
