How to Make the Most of GPT-4o

By Samuel Adebayo

Last Updated

Jun 5, 2024

Omnimodal Interactions: GPT-4o excels in understanding and generating text, audio, and images, enhancing human-computer communication.
Rapid Response: Matches human conversational speed with responses as quick as 232 milliseconds.
Global Language Support: OpenAI’s new encoder significantly improves the processing of non-English languages.
Audio and Visual Mastery: Directly processes audio and visually interprets data with advanced comprehension abilities.
Efficiency and Accessibility: Boasts 2x the speed and efficiency at half the cost of its predecessor, GPT-4 Turbo.

Introduction

In AI (Artificial Intelligence), each new iteration of generative models marks a leap forward in the possibilities of HCI (human-computer interaction). OExplicitly declaring the API key as a string in the notebook is a quick and easy way to use it, but it is not the best practice. Consider defining api_key via an environment variable. To learn more about how you can instantiate your OpenAI API Key, click here.

📩 Sending a Simple Request

When interacting with the OpenAI API, the most fundamental method involves sending messages to GPT for the LLM to return an output prompt.

Each message has several main components that should be declared before sending the request, including:

messages: The primary input consists of a list of message objects. For our demo, we’ll call this list, which holds all the message objects as messages. These message objects should contain the following:
role: This can assume one of three roles – system, user, or assistant. Each role differs in the ongoing conversation between you and the LLM. For instance, a system role will indicate that you mention instructions for GPT to adhere to.
content: Represented as a string, the content serves as the text input for the system to process and respond to.

Let’s start with a simple example. In this example, we’re simply asking GPT how they are doing:penAI's latest model, GPT-4o ("o" stands for "omni"), is no exception. It’s designed to process and understand a blend of text, audio, image, and video inputs with better context and faster than GPT-4.

The promise of GPT-4o lies not solely in its omnimodal capabilities but also in its approach to crafting more natural interactions between humans and machines. The demo below showcases the real-time GPT-4o voice assistant to interpret audio and video.

This article explores GPT-4o's latest benefits and features and shows how to quickly integrate it and power your application with the model's features.

Overview of GPT-4o

So, what makes GPT-4o so impressive? Unlike previous models, GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time conversations.

GPT-4o has a 128K context and an October 2023 knowledge cutoff. Even more impressive is that there has been no quality reduction compared to previous models.

Text Evaluation

GPT-4o matches GPT-4 Turbo's performance on text in English and code, with significant improvements on text in non-English languages. Across all tests, GPT-4o maintained the lead except DROP.

Audio

GPT-4o can process audio. Unlike other models, GPT-4o does not require audio-to-text models, such as Whisper, to parse the audio into the model as text. GPT-4o can capture nuances and tone in speech for more personalized and quality responses.

*GPT-4o surpasses OpenAI’s current speech-to-text model, Whisper.*

Additionally, automatic speech recognition (ASR) has been improved across multiple languages, not just English.

*Audio Speech Recognition (ASR) is significantly improved against Whisper across many languages.*

Comprehension

GPT-4o beat GPT-4 in the M3Exam, a dataset of multiple-choice questions on various knowledge, despite requiring less computing. This evaluation is meant to test the model without any task-specific training.

For other evaluations, such as understanding charts (ChartQA), documents (DocVQA), and others, GPT-4o shows it can visually comprehend and extract information from visual graphs to reason, not just detect what is in the image.

*GPT-4o scores higher compared to its competitors for vision comprehension, document search, sentiment analysis, and more.*

Costs

Across the board, OpenAI has improved its tokenization efficiency, allowing all models across their ecosystem to benefit from the cost savings.

Gujarati 4.4x fewer tokens (from 145 to 33	Telugu 3.5x fewer tokens (from 159 to 45)	Tamil 3.3x fewer tokens (from 116 to 35)	Marathi 2.9x fewer tokens (from 96 to 33)	Marathi 2.9x fewer tokens (from 96 to 33)
Urdu 2.5x fewer tokens (from 82 to 33)	Arabic 2.0x fewer tokens (from 53 to 26)	Persian 1.9x fewer tokens (from 61 to 32)	Russian 1.7x fewer tokens (from 39 to 23)	Korean 1.7x fewer tokens (from 45 to 27)
Vietnamese 1.5x fewer tokens (from 46 to 30)	Chinese 1.4x fewer tokens (from 34 to 24)	Japanese 1.4x fewer tokens (from 37 to 26)	Turkish 1.3x fewer tokens (from 39 to 30)	Italian 1.2x fewer tokens (from 34 to 28)
German 1.2x fewer tokens (from 34 to 29)	Spanish 1.1x fewer tokens (from 29 to 26)	Portuguese 1.1x fewer tokens (from 30 to 27)	French 1.1x fewer tokens (from 31 to 28)	English 1.1x fewer tokens (from 27 to 24)

GPT-4o is also cheaper than its competitors and OpenAI’s previous models. GPT-4o is 2x faster, half the price, and has 5x higher rate limits than GPT-4 Turbo. To learn more about OpenAI’s pricing, check the pricing page.

Model	Input (Per 1M Tokens)	Output (Per 1M Tokens)	Cost Compared to GPT4-o (In/Out)
gpt-4o	$5.00	$15.00	0 / 0
gpt-4o-2024-05-13	$5.00	$15.00	0 / 0
gpt-4-turbo	$10.00	$30.00	2x / 2x
gpt-4-turbo-2024-04-09	$10.00	$30.00	2x / 2x
gpt-4	$30.00	$60.00	6x / 4x
gpt-4-32k	$60.00	$120.00	12x / 8x
gpt-3.5-turbo-0125	$0.50	$1.50	-10x / -10x
gpt-3.5-turbo-instruct	$1.50	$2.00	-7.5x / -7.5x

⚡Getting Started with GPT-4o

Getting started with GPT-4o is as simple as interfacing with any other OpenAI model. The API used is the same as that of the other OpenAI models.

This walkthrough will teach you how to send a simple API request to GPT-4o with Python. You will see how simple the API is and learn to input different modalities into the API call. If you wish to follow along in our notebook, click here!

📥Installation

Before we begin, if you don't have an API key, follow either link to generate an API key below (also remember to fill your OpenAI account with API credits):

Next, install some dependencies for our demo.

!pip install --upgrade openai --quiet

# If you wish to try the video demo, install these too.
!apt -qq update && apt -qq install -y ffmpeg
!pip install opencv-python moviepy --quiet

Import the necessary packages and declare a variable for GPT-4o. Also, import your api_key and instantiate the OpenAI client. As a side note, you can select from other models.

import json
from openai import OpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt
from termcolor import colored

GPT_MODEL = "gpt-4o"
API_KEY = "your_api_key_here"
client = OpenAI(api_key=API_KEY)

Explicitly declaring the API key as a string in the notebook is a quick and easy way to use it, but it is not the best practice. Consider defining api_key via an environment variable. To learn more about how you can instantiate your OpenAI API Key, click here.

📩 Sending a Simple Request

When interacting with the OpenAI API, the most fundamental method involves sending messages to GPT for the LLM to return an output prompt.

Each message has several main components that should be declared before sending the request, including:

messages: The primary input consists of a list of message objects. For our demo, we’ll call this list, which holds all the message objects as messages. These message objects should contain the following:
role: This can assume one of three roles – system, user, or assistant. Each role differs in the ongoing conversation between you and the LLM. For instance, a system role will indicate that you mention instructions for GPT to adhere to.
content: Represented as a string, the content serves as the text input for the system to process and respond to.

Let’s start with a simple example. In this example, we’re simply asking GPT how they are doing:

completion = client.chat.completions.create(
  model=GPT_MODEL,
  messages=[
    {"role": "user", "content": "How are you doing today?"}
  ]
)
print(completion.choices[0].message.content)

Output:

As an artificial intelligence, I don't have feelings, but I'm here and ready to assist you! How can I help you today?

You can define the system's role by modifying the role.

completion = client.chat.completions.create(
  model=GPT_MODEL,
  messages=[
    {"role": "system", "content": "Pretend you are a customer support representative that is having an amazing day!"},
    {"role": "user", "content": "How are you doing today?"}
  ]
)

print(completion.choices[0].message.content)

Output:

Hi there! I'm doing fantastic, thank you for asking! How can I make your day a bit brighter? 😊

As a side note, you can stream the responses instead of waiting for the entire response to be returned.

stream = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "Pretend you are a customer support representative that is having an amazing day!"},
        {"role": "user", "content": "How are you doing today? Can you explain to me who OpenAI is?"}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

🖼️ Image Processing

Now that we know how to send simple text requests, we can build on this by learning how to send images.

GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:

Base64 Encoded
URL

Let’s take a picture to test GPT-4o’s ability to interpret and reason with the image provided.

from IPython.display import Image, display, Audio, Markdown
import base64
import requests

IMAGE_URL = "https://useruploads.socratic.org/7KXtBxYSVKTQVJQljQ4e_Pythagorean-Theorem.jpg"

# Preview image for context
display(Image(IMAGE_URL))

💾 Base64

In this example, we’ve created a function called encode_image_from_url(). This function takes an image URL online, downloads it, and encodes it to base64. The encoded image is fed into the API request through the messages parameter.

# Open the image file and encode it as a base64 string
def encode_image_from_url(image_url):
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception("Failed to retrieve image")

base64_image = encode_image_from_url(IMAGE_URL)

response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "If a=3 and b=4, what's c?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

🔗 URL Image Processing

Instead of parsing the image to base64, the API can directly take in image URLs natively.

response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "If a=3 and b=4, what's c?"},
            {"type": "image_url", "image_url": {
                "url": IMAGE_URL}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

Both responses will output a similar response:

To find $ c $ in a right triangle where $ a = 3 $ and $ b = 4 $, you can use the Pythagorean theorem:

\[ c^2 = a^2 + b^2 \]

Substitute the given values:

\[ c^2 = 3^2 + 4^2 \]

\[ c^2 = 9 + 16 \]

\[ c^2 = 25 \]

Now, take the square root of both sides to solve for $ c $:

\[ c = \sqrt{25} \]

\[ c = 5 \]

So, $ c = 5 $.

📽️ Video Processing

GPT-4o currently does not have a way to parse video directly. However, videos are essentially images stitched together.

We have also mentioned that GPT-4o does not currently support audio at the time of writing (May 2024). However, it is still possible to use OpenAI's Whisper model for audio-to-text conversion and parse the video into images. Once these two steps are done, we can feed audio and video into our GPT request.

⚡Deepgram's Whisper Cloud is a fully managed API that gives you access to Deepgram's version of OpenAI’s Whisper model.

The future benefit of GPT-4o is the ability to consider audio and video natively, allowing the model to respond to both modalities in context. This is especially important in scenarios where something is displayed on video that isn't explained via audio and vice versa.

Before we begin, download the video locally. You’ll use the OpenAI DevDay Keynote Recap video.

!wget -q https://github.com/openai/openai-cookbook/raw/main/examples/gpt4o/data/keynote_recap.mp4 -O keynote_recap.mp4

📹Video Processing Setup

Before sending a request to GPT, parse the video to a series of images.

import os
import cv2
from moviepy.editor import VideoFileClip
import time
import base64

VIDEO_PATH = "/content/keynote_recap.mp4"

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame=0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=10)

Let's display the video to ensure we can parse it into base64:

## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)

Now, let’s parse the audio for our transcript:

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)

print("Transcript: ", transcription.text + "\n\n")

Finally, we can pass the video and audio to our request:

## Generate a summary with visual and audio
response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content":"You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
            ],
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)

Check out the complete notebook for the output of the request.

Conclusion

Apart from the increase in speed, GPT-4o can maintain the same quality responses as its predecessors. By breaking down the barriers between different forms of input and output, GPT-4o lays the groundwork for a future where AI can act as a more holistic, integrated assistant.

Whether through interpreting the emotional nuance in a voice or discerning the details in a video, GPT-4o's design encapsulates the ambition of creating an AI not just as a tool but as a versatile companion in the digital age.

FAQs

What is GPT-4o?

GPT-4o is the latest AI model developed by OpenAI, characterized by its “omnimodal” abilities to understand and generate responses across text, audio, images, and eventually video, facilitating natural human-computer interactions.

How does GPT-4o improve upon previous versions like GPT-4?

GPT-4o offers significantly reduced response times for audio interactions, enhanced comprehension of non-English languages, direct audio processing without auxiliary models, and improved understanding of visual content, all while being more cost-efficient.

Can GPT-4o process video inputs?

While GPT-4o is primarily designed for text, audio, and image inputs, it can understand video content by interpreting it through sampled frames. Direct video processing capabilities are anticipated in the future.

Is GPT-4o available for commercial use?

Yes, GPT-4o is accessible through OpenAI's API, allowing developers and businesses to integrate its advanced capabilities into their applications. It offers a streamlined approach to integrating AI into various services and products.

How can developers start using GPT-4o?

Developers can start using GPT-4o by accessing it through OpenAI’s API. Integration requires an API key, which can be obtained from OpenAI’s website. The API documentation provides comprehensive guidance for making requests to GPT-4o.

Can GPT-4o understand and generate content in multiple languages?

Yes, one of GPT-4o's key advancements is its significantly enhanced capabilities for non-English languages, offering more inclusive global communication capabilities.

Does GPT-4o support real-time audio processing?

GPT-4o is optimized for quick audio input processing, delivering near real-time responses comparable to human reaction times in conversations.

How to Make the Most of GPT-4o

Table of Contents

Table of Contents

Introduction

📩 Sending a Simple Request

Overview of GPT-4o

Text Evaluation

Audio

Comprehension

Costs

⚡Getting Started with GPT-4o

📥Installation

📩 Sending a Simple Request

🖼️ Image Processing

💾 Base64

🔗 URL Image Processing

📽️ Video Processing

📹Video Processing Setup

Conclusion

FAQs

Unlock language AI at scale with an API call.

Unlock language AI at scale with an API call.