Large Model Series: OpenAI Tips: Processing and Narrating Video with GPT-4 Vision and the TTS API

数智笔记

Last modified: 2023-12-31 20:13:34


First published: 2023-12-31 20:13:20

Article link: https://blog.csdn.net/wjjc1017/article/details/135319138


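This article shows how to use GPT's vision capabilities to describe a video and how to combine GPT with the TTS API to generate a voiceover for it. Frames are read with OpenCV, and the whole process is shown as a complete worked example.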


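This notebook demonstrates how to use GPT's vision capabilities with video. GPT-4 cannot take video as input directly, but we can use vision together with the new 128K context window to describe the static frames of a whole video in a single request. We will walk through two examples:

Using GPT-4 to get a description of a video
Using GPT-4 and the TTS API to generate a voiceover for a video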

# Import the required libraries
from IPython.display import display, Image, Audio  # display images and audio inside a Jupyter notebook
import cv2  # OpenCV, used to read the video
import base64  # encode binary data as ASCII text
import time  # used to pause between frames
from openai import OpenAI  # client for the OpenAI API
import os  # used to read environment variables
import requests  # used to send HTTP requests to the API

client = OpenAI()  # create the OpenAI client (reads OPENAI_API_KEY from the environment)

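Using GPT's vision capabilities to get a description of a video

First, we use OpenCV to extract frames from a nature video containing bison and wolves: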

# Import OpenCV
import cv2
# Import base64
import base64

# Open the video file
video = cv2.VideoCapture("data/bison.mp4")

# List that will hold the base64 encoding of every frame
base64Frames = []

# Read the video frame by frame
while video.isOpened():
    # Read one frame
    success, frame = video.read()
    # Stop when no more frames can be read
    if not success:
        break
    # Encode the frame as a JPEG
    _, buffer = cv2.imencode(".jpg", frame)
    # Base64-encode the JPEG bytes and append the result to the list
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

# Release the video file
video.release()

# Print the number of frames read
print(len(base64Frames), "frames read.")

618 frames read.
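
The loop above keeps all 618 frames in memory, although only a fraction of them is sent to the model later. As a minimal sketch, assuming a fixed stride is an acceptable way to thin out the video, the sampling can be done while reading. The helper name `read_sampled_frames` and its `encode_jpeg` parameter are hypothetical and not part of the original notebook; the capture object only needs a `read()` method that returns `(success, frame)`, which is what `cv2.VideoCapture` provides.

```python
import base64


def read_sampled_frames(capture, encode_jpeg, step=50):
    """Read frames from `capture`, keeping every `step`-th one as base64 text.

    `capture` is any object whose read() returns (success, frame).
    `encode_jpeg` is a callable that turns one frame into JPEG bytes.
    Both names are illustrative assumptions, not part of the notebook.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    frames = []
    index = 0
    while True:
        success, frame = capture.read()
        if not success:  # end of video or read error
            break
        if index % step == 0:  # keep frames 0, step, 2*step, ...
            frames.append(base64.b64encode(encode_jpeg(frame)).decode("utf-8"))
        index += 1
    return frames


# Possible use with OpenCV (requires cv2 and the video file):
# video = cv2.VideoCapture("data/bison.mp4")
# sampled = read_sampled_frames(
#     video, lambda f: cv2.imencode(".jpg", f)[1].tobytes(), step=50
# )
# video.release()
```

This keeps the same frames that a later slice such as `base64Frames[0::50]` would select, but it only encodes the frames that are kept.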

Display the frames to make sure we read them correctly:

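# Create a display handle that can be updated in place
display_handle = display(None, display_id=True)

# Iterate over every encoded frame in base64Frames
for img in base64Frames:
    # Decode the base64 text back into JPEG bytes and wrap them in an Image object
    image_data = base64.b64decode(img.encode("utf-8"))
    image = Image(data=image_data)

    # Update the display handle to show the current frame
    display_handle.update(image)

    # Pause briefly so the frames play back like a video
    time.sleep(0.025)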
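
Watching the frames is one check; outside a notebook, a programmatic check can be handy as well. The sketch below is an assumption-laden illustration rather than part of the original notebook: it decodes each base64 string and tests whether the bytes start with the JPEG start-of-image marker (`FF D8 FF`). The helper names `looks_like_jpeg` and `count_invalid_frames` are hypothetical. It confirms only that the encoding round-trips to something JPEG-shaped, not that the picture itself is meaningful.

```python
import base64

# JPEG files begin with the start-of-image marker FF D8, followed by FF
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(b64_frame):
    """Return True if the base64 string decodes to bytes with a JPEG header."""
    try:
        raw = base64.b64decode(b64_frame, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return raw.startswith(JPEG_MAGIC)


def count_invalid_frames(b64_frames):
    """Count the frames that do not decode to JPEG-looking bytes."""
    return sum(not looks_like_jpeg(frame) for frame in b64_frames)


# Possible use with the list built earlier:
# print(count_invalid_frames(base64Frames), "invalid frames")
```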

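Once we have the video frames, we craft a prompt and send a request to GPT (note that we don't need to send every frame for GPT to understand what is happening):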

# Define the message list; each message has a role and a content list
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            # Send only every 50th frame
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::50]),
        ],
    },
]

# Request parameters: the model, the messages, and the maximum number of tokens to generate
params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
}

# Send the chat completion request with these parameters
result = client.chat.completions.create(**params)

# Print the generated description
print(result.choices[0].message.content)

"🐺 Survival of the Fittest: An Epic Tale in the Snow ❄️ - Witness the intense drama of nature as a pack of wolves face off against mighty bison in a harsh winter landscape. This raw footage captures the essence of the wild where every creature fights for survival. With each frame, experience the tension, the strategy, and the sheer force exerted in this life-or-death struggle. See nature's true colors in this gripping encounter on the snowy plains. 🦬"

Remember to respect wildlife and nature. This video may contain scenes that some viewers might find intense or distressing, but they depict natural animal behaviors important for ecological studies and understanding the reality of life in the wilderness.
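
How many frames to send is partly a budget question, since every image consumes part of the context window. The sketch below counts the frames selected by a stride slice and gives a rough per-image token estimate. The constants (85 base tokens, 170 tokens per 512-pixel tile, a 2048-pixel bounding box and a 768-pixel short side) follow the image accounting that OpenAI's vision guide described for GPT-4-class models; treat them as assumptions that may not hold for other models or for the `resize` field used in the request above. The helper names are hypothetical, and this sketch never upscales small images.

```python
import math


def frames_sent(total_frames, step):
    """Number of frames selected by the slice frames[0::step]."""
    return len(range(0, total_frames, step))


def estimate_image_tokens(width, height, detail="high"):
    """Rough token cost of one image under the assumed accounting rules."""
    if detail == "low":
        return 85  # assumed flat cost for low-detail images
    # Step 1: shrink to fit inside a 2048 x 2048 square (never enlarge)
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so that the shorter side is at most 768 pixels
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512-pixel tiles and add the base cost
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


# Example: budget for the request above, assuming hypothetical 1280x720 frames
n = frames_sent(618, 50)
print(n, "frames,", n * estimate_image_tokens(1280, 720), "image tokens (estimate)")
```

A quick estimate like this shows why sampling every 50th frame fits comfortably in a 128K window, while sending all 618 frames at high detail likely would not.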

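Using GPT-4 and the TTS API to generate a voiceover for a video

Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT to give us a short script: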

# Define the message list used to prompt the model
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            # Send only every 60th frame
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::60]),
        ],
    },
]

# Request parameters for the API call
params = {
    "model": "gpt-4-vision-preview",  # the vision-capable model
    "messages": PROMPT_MESSAGES,  # the prompt messages defined above
    "max_tokens": 500,  # generate at most 500 tokens
}

# Send the request to generate the script
result = client.chat.completions.create(**params)

# Print the generated script
print(result.choices[0].message.content)

In the vast, white expanse of the northern wilderness, a drama as old as time unfolds. Here, amidst the silence of the snow, the wolf pack circles, their breaths visible as they cautiously approach their formidable quarry, the bison. These wolves are practiced hunters, moving with strategic precision, yet the bison, a titan of strength, stands resolute, a force to be reckoned with.

As tension crackles in the frozen air, the wolves close in, their eyes locked on their target. The bison, wary of every movement, prepares to defend its life. It's a perilous dance between predator and prey, where each step could be the difference between life and death.

In an instant, the quiet of the icy landscape is shattered. The bison charges, a desperate bid for survival as the pack swarms. The wolves are relentless, each one aware that their success depends on the strength of the collective. The bison, though powerful, is outnumbered, its massive form stirring up clouds of snow as it struggles.

It's an epic battle, a testament to the harsh realities of nature. In these moments, there is no room for error, for either side. The wolves, agile and tenacious, work in unison, their bites a chorus aiming to bring down the great beast. The bison, its every heaving breath a testament to its will to survive, fights fiercely, but the odds are not in its favor.

With the setting sun casting long shadows over the snow, the outcome is inevitable. Nature, in all its raw beauty and brutality, does not show favor. The wolves, now victors, gather around their prize, their survival in this harsh climate secured for a moment longer. It's a poignant reminder of the circle of life that rules this pristine wilderness, a reminder that every creature plays its part in the enduring saga of the natural world.
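
Both requests above pass frames as `{"image": ..., "resize": 768}` entries, a shorthand that the preview model accepted when the notebook was written. The Chat Completions reference describes a typed format instead, with `text` and `image_url` content parts, where a base64 frame is wrapped in a data URL. The sketch below builds such a message; the helper name `build_vision_message` is hypothetical, and whether a given model accepts either format is something to verify against the current documentation. The `gpt-4-vision-preview` name used above was a preview model and may no longer be available.

```python
def build_vision_message(prompt, b64_frames, detail="low"):
    """Build a single user message with one text part and one part per frame.

    `b64_frames` are base64-encoded JPEG strings, as produced earlier.
    `detail` is passed through as the image detail hint ("low", "high", "auto").
    """
    content = [{"type": "text", "text": prompt}]
    for frame in b64_frames:
        content.append(
            {
                "type": "image_url",
                "image_url": {
                    # Data URL wrapping the base64 JPEG bytes
                    "url": f"data:image/jpeg;base64,{frame}",
                    "detail": detail,
                },
            }
        )
    return [{"role": "user", "content": content}]


# Possible use with the client created earlier (the model name is a placeholder
# for whichever vision-capable model is available to the account):
# result = client.chat.completions.create(
#     model="<vision-capable-model>",
#     messages=build_vision_message("Describe these frames.", base64Frames[0::60]),
#     max_tokens=500,
# )
```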

Now we can pass the script to the TTS API, which will generate an mp3 of the voiceover:

import requests  # used to send the HTTP request
import os  # used to read the API key from the environment

# Send a POST request that converts the script to speech
response = requests.post(
    "https://api.openai.com/v1/audio/speech",  # speech endpoint URL
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",  # request header carrying the API key
    },
    json={
        "model": "tts-1-1106",  # the speech model to use
        "input": result.choices[0].message.content,  # the text to speak
        "voice": "onyx",  # the voice to use
    },
)

audio = b""  # start with empty audio data
# Read the response body chunk by chunk and append it to the audio data
for chunk in response.iter_content(chunk_size=1024 * 1024):
    audio += chunk

Audio(audio)  # play the audio in the notebook
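
The script generated here is short, but a longer narration could exceed the input length that the speech endpoint accepts. The sketch below assumes the 4096-character limit stated in the API reference for the `input` field and splits a script at sentence boundaries so that each piece stays under that limit. The helper name `split_for_tts` is hypothetical, and the sentence splitting is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace).

```python
import re

MAX_TTS_CHARS = 4096  # assumed per-request limit for the "input" field


def split_for_tts(text, limit=MAX_TTS_CHARS):
    """Split `text` into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible; a single sentence
    longer than `limit` is cut at the limit as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in a chunk on its own
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Small demonstration with an artificially low limit
print(split_for_tts("One. Two! Three?", limit=10))
```

Each chunk could then be sent as the `input` field of the same request shown above, with the resulting audio pieces played or saved in order.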