Introduction to GPT-4o
GPT-4o (“o” for “omni”) is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.
Background
Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model trained across text, vision, and audio. This unified approach ensures that all inputs—whether text, visual, or auditory—are processed cohesively by the same neural network.
Current API Capabilities
Currently, the API supports text and image inputs only, with text outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.
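Since the API currently accepts text and image inputs together (with text outputs), a request payload mixing both modalities is built from typed content parts. A minimal sketch of such a message list as plain Python data (the URL is a placeholder, not a real image):

```python
# A user message combining a text part and an image part, matching the
# current {text, image} -> {text} modality support. The URL is illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        ],
    }
]
```

This `messages` list can be passed directly to a chat completions request once the client is configured, as shown in the steps that follow.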
Getting Started
Install OpenAI SDK for Python
%pip install --upgrade openai --quiet
Configure the OpenAI client and submit a test request
To set up the client, we first need to create an API key to use with our requests. Skip these steps if you already have an API key.
You can get an API key by following these steps:
- Create a new project
- Generate an API key in your project
- (RECOMMENDED, BUT NOT REQUIRED) Setup your API key for all projects as an env var
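If you set the key as an environment variable, it helps to fail fast with a clear message when it is missing. A small sketch of that check (the helper name `get_api_key` is ours, not part of the SDK):

```python
import os

def get_api_key(env=os.environ):
    """Return the API key from the environment, raising a clear error if unset."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```

Passing the environment as a parameter keeps the helper easy to test; in normal use you would simply call `get_api_key()`.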
Once we have this set up, let's start with a simple text input to the model for our first request. We'll use both system and user messages, and we'll receive a response from the assistant role.
from openai import OpenAI
import os

# Set the API key and model name
MODEL = "gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},  # <-- This is the system message that provides context to the model
        {"role": "user", "content": "Hello! Could you solve 2+2?"},  # <-- This is the user message for which the model will generate a response
    ]
)

print("Assistant: " + completion.choices[0].message.content)
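To carry a conversation across multiple turns, the assistant's reply is appended to the message list before the next request. A minimal sketch using plain dictionaries (no API call; the reply string here is illustrative, where in practice it comes from `completion.choices[0].message.content`):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
    {"role": "user", "content": "Hello! Could you solve 2+2?"},
]

# Hypothetical reply; in a real session this is the model's generated text.
assistant_reply = "2 + 2 = 4"

# Append the assistant turn, then the next user turn, before the next request.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Now multiply that by 3."})
```

The model is stateless between requests, so the full history must be resent each time context from earlier turns is needed.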
Image Processing
GPT-4o can directly process images and take intelligent actions based on the image.