Introduction
In AI (Artificial Intelligence), each new iteration of generative models marks a leap forward in the possibilities of HCI (human-computer interaction). OpenAI's latest model, GPT-4o ("o" stands for "omni"), is no exception. It's designed to process and understand a blend of text, audio, image, and video inputs with better context handling and faster responses than GPT-4.
The promise of GPT-4o lies not solely in its omnimodal capabilities but also in its approach to crafting more natural interactions between humans and machines. The demo below showcases the real-time GPT-4o voice assistant to interpret audio and video.
This article explores GPT-4o's latest benefits and features and shows how to quickly integrate it and power your application with the model's features.
Overview of GPT-4o
So, what makes GPT-4o so impressive? Unlike previous models, GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation.
GPT-4o has a 128K-token context window and an October 2023 knowledge cutoff. Even more impressive is that there has been no quality reduction compared to previous models.
Text Evaluation
GPT-4o matches GPT-4 Turbo's performance on text in English and code, with significant improvements on text in non-English languages. Across all tests, GPT-4o maintained the lead except on DROP.
Audio
GPT-4o can process audio. Unlike other models, GPT-4o does not require audio-to-text models, such as Whisper, to parse the audio into the model as text. GPT-4o can capture nuances and tone in speech for more personalized and quality responses.
Additionally, automatic speech recognition (ASR) has been improved across multiple languages, not just English.
Comprehension
GPT-4o beat GPT-4 on M3Exam, a benchmark of multiple-choice questions spanning various knowledge domains, despite requiring less compute. This evaluation tests the model without any task-specific training.
For other evaluations, such as understanding charts (ChartQA), documents (DocVQA), and others, GPT-4o shows it can visually comprehend and extract information from visual graphs to reason, not just detect what is in the image.
Costs
Across the board, OpenAI has improved its tokenization efficiency, allowing all models across their ecosystem to benefit from the cost savings.
GPT-4o is also cheaper than its competitors and OpenAI’s previous models. GPT-4o is 2x faster, half the price, and has 5x higher rate limits than GPT-4 Turbo. To learn more about OpenAI’s pricing, check the pricing page.
Recommended: From Turing To GPT-4: 11 Papers that Shaped AI's Language Journey.
⚡Getting Started with GPT-4o
Getting started with GPT-4o is as simple as interfacing with any other OpenAI model. The API used is the same as that of the other OpenAI models.
This walkthrough will teach you how to send a simple API request to GPT-4o with Python. You will see how simple the API is and learn to input different modalities into the API call. If you wish to follow along in our notebook, click here!
📥Installation
Before we begin, if you don't have an API key, follow either link below to generate one (and remember to add API credits to your OpenAI account):
Next, install some dependencies for our demo.
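In a notebook, the install step might look like the sketch below; the exact package list is an assumption (openai for the API client, requests for downloading files, and opencv-python plus moviepy for the video section later):

```python
# Install the packages used in this demo.
# The package list is an assumption -- adjust it to match your own notebook.
%pip install --quiet openai requests opencv-python moviepy
```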
Import the necessary packages and declare a variable for GPT-4o. Also, import your api_key and instantiate the OpenAI client. As a side note, you can select from other models.
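A minimal setup cell could look like the following sketch; the variable names (api_key, MODEL) are assumptions rather than the article's exact code:

```python
from openai import OpenAI

# Quick-and-dirty for a demo notebook: paste your key directly as a string.
# (See the note below -- an environment variable is the better practice.)
api_key = "sk-..."   # hypothetical placeholder, replace with your own key
MODEL = "gpt-4o"     # you can swap in another OpenAI model here

client = OpenAI(api_key=api_key)
```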
Explicitly declaring the API key as a string in the notebook is a quick and easy way to use it, but it is not the best practice. Consider defining api_key via an environment variable. To learn more about how you can instantiate your OpenAI API Key, click here.
📩 Sending a Simple Request
When interacting with the OpenAI API, the most fundamental method involves sending messages to GPT and having the LLM return a response.
Each message has several main components that should be declared before sending the request, including:
messages: The primary input is a list of message objects. For our demo, we'll call the list that holds all the message objects messages. Each message object should contain the following:
role: This can be one of three roles – system, user, or assistant. Each role plays a different part in the ongoing conversation between you and the LLM. For instance, the system role is where you provide instructions for GPT to adhere to.
content: Represented as a string, the content serves as the text input for the system to process and respond to.
Let’s start with a simple example. In this example, we’re simply asking GPT how they are doing:
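A minimal sketch of that request, assuming the client and MODEL variables from the setup cell above:

```python
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! How are you doing?"},
    ],
)

# The generated reply lives in the first choice's message content.
print(completion.choices[0].message.content)
```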
Output:
As an artificial intelligence, I don't have feelings, but I'm here and ready to assist you! How can I help you today?
You can steer the model's behavior by modifying the system role's content, as shown below.
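For example, a more upbeat system message (the exact wording here is an assumption) changes the tone of the reply:

```python
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        # The system message steers the model's persona and tone.
        {"role": "system",
         "content": "You are a cheerful assistant who loves to brighten people's day."},
        {"role": "user", "content": "Hello! How are you doing?"},
    ],
)

print(completion.choices[0].message.content)
```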
Output:
Hi there! I'm doing fantastic, thank you for asking! How can I make your day a bit brighter? 😊
As a side note, you can stream the responses instead of waiting for the entire response to be returned.
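A sketch of the streaming variant, using the same client and MODEL assumptions:

```python
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Hello! How are you doing?"}],
    stream=True,  # tokens are yielded as they are generated
)

for chunk in stream:
    # Each chunk carries a small delta of the response text.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```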
🖼️ Image Processing
Now that we know how to send simple text requests, we can build on this by learning how to send images.
GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:
Base64 Encoded
URL
Let’s take a picture to test GPT-4o’s ability to interpret and reason with the image provided.
💾 Base64
In this example, we’ve created a function called encode_image_from_url(). This function takes an image URL online, downloads it, and encodes it to base64. The encoded image is fed into the API request through the messages parameter.
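A sketch of that helper and request follows; the image URL and the math-tutor prompt are assumptions chosen to match the right-triangle example discussed below:

```python
import base64
import requests

def encode_image_from_url(url: str) -> str:
    """Download an image and return it as a base64-encoded string."""
    response = requests.get(url)
    response.raise_for_status()
    return base64.b64encode(response.content).decode("utf-8")

IMAGE_URL = "https://example.com/right_triangle.png"  # hypothetical image URL
base64_image = encode_image_from_url(IMAGE_URL)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Find c in the triangle shown."},
                # Base64-encoded images are passed as data URLs.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)
```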
🔗 URL Image Processing
Instead of encoding the image to base64, the API can also take in image URLs directly.
Both approaches produce a similar response:
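The URL variant is nearly identical; only the image_url field changes (again assuming the same hypothetical IMAGE_URL from the previous step):

```python
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Find c in the triangle shown."},
                # The API fetches the image from the URL itself.
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        },
    ],
)

print(response.choices[0].message.content)
```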
To find \( c \) in a right triangle where \( a = 3 \) and \( b = 4 \), you can use the Pythagorean theorem:
\[ c^2 = a^2 + b^2 \]
Substitute the given values:
\[ c^2 = 3^2 + 4^2 \]
\[ c^2 = 9 + 16 \]
\[ c^2 = 25 \]
Now, take the square root of both sides to solve for \( c \):
\[ c = \sqrt{25} \]
\[ c = 5 \]
So, \( c = 5 \).
📽️ Video Processing
GPT-4o currently does not have a way to parse video directly. However, a video is essentially a sequence of image frames, so we can sample frames and send those instead.
As noted, the API does not currently accept audio input at the time of writing (May 2024). However, it is still possible to use OpenAI's Whisper model for audio-to-text conversion and to convert the video into a series of images. Once these two steps are done, we can feed both the transcript and the frames into our GPT request.
The future benefit of GPT-4o is the ability to consider audio and video natively, allowing the model to respond to both modalities in context. This is especially important in scenarios where something is displayed on video that isn't explained via audio and vice versa.
Before we begin, download the video locally. You’ll use the OpenAI DevDay Keynote Recap video.
📹Video Processing Setup
Before sending a request to GPT, parse the video to a series of images.
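A sketch of the frame-sampling step using OpenCV; the file name and sampling rate are assumptions:

```python
import base64
import cv2  # opencv-python

VIDEO_PATH = "keynote_recap.mp4"  # hypothetical local file name

def extract_frames(video_path: str, seconds_per_frame: float = 2.0) -> list:
    """Sample the video and return a list of base64-encoded JPEG frames."""
    frames = []
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(int(fps * seconds_per_frame), 1)

    for index in range(0, total_frames, step):
        video.set(cv2.CAP_PROP_POS_FRAMES, index)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    return frames

base64_frames = extract_frames(VIDEO_PATH, seconds_per_frame=2.0)
print(f"Extracted {len(base64_frames)} frames")
```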
Let's display the video to ensure we can parse it into base64:
Now, let’s parse the audio for our transcript:
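A sketch of the transcription step; it assumes moviepy (1.x import path) is available to pull the audio track out of the video before sending it to Whisper:

```python
from moviepy.editor import VideoFileClip  # in moviepy 2.x: from moviepy import VideoFileClip

AUDIO_PATH = "keynote_recap.mp3"  # hypothetical output file name

# Extract the audio track from the video.
clip = VideoFileClip(VIDEO_PATH)
clip.audio.write_audiofile(AUDIO_PATH, bitrate="32k")
clip.close()

# Transcribe the audio with Whisper.
with open(AUDIO_PATH, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text[:500])
```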
Finally, we can pass the video and audio to our request:
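A sketch of the combined request; the prompt wording is an assumption:

```python
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system",
         "content": "Summarize the video using its sampled frames and audio transcript."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are sampled frames from the video:"},
                # Spread every sampled frame into the message as a low-detail image.
                *[
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{frame}",
                                   "detail": "low"}}
                    for frame in base64_frames
                ],
                {"type": "text", "text": f"The audio transcript is: {transcription.text}"},
            ],
        },
    ],
    temperature=0,
)

print(response.choices[0].message.content)
```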
Check out the complete notebook for the output of the request.
Conclusion
Apart from the increase in speed, GPT-4o can maintain the same quality responses as its predecessors. By breaking down the barriers between different forms of input and output, GPT-4o lays the groundwork for a future where AI can act as a more holistic, integrated assistant.
Whether through interpreting the emotional nuance in a voice or discerning the details in a video, GPT-4o's design encapsulates the ambition of creating an AI not just as a tool but as a versatile companion in the digital age.
FAQs
What is GPT-4o?
GPT-4o is the latest AI model developed by OpenAI, characterized by its “omnimodal” abilities to understand and generate responses across text, audio, images, and eventually video, facilitating natural human-computer interactions.
How does GPT-4o improve upon previous versions like GPT-4?
GPT-4o offers significantly reduced response times for audio interactions, enhanced comprehension of non-English languages, direct audio processing without auxiliary models, and improved understanding of visual content, all while being more cost-efficient.
Can GPT-4o process video inputs?
While GPT-4o is primarily designed for text, audio, and image inputs, it can understand video content by interpreting it through sampled frames. Direct video processing capabilities are anticipated in the future.
Is GPT-4o available for commercial use?
Yes, GPT-4o is accessible through OpenAI's API, allowing developers and businesses to integrate its advanced capabilities into their applications. It offers a streamlined approach to integrating AI into various services and products.
How can developers start using GPT-4o?
Developers can start using GPT-4o by accessing it through OpenAI’s API. Integration requires an API key, which can be obtained from OpenAI’s website. The API documentation provides comprehensive guidance for making requests to GPT-4o.
Can GPT-4o understand and generate content in multiple languages?
Yes, one of GPT-4o's key advancements is its significantly enhanced capabilities for non-English languages, offering more inclusive global communication capabilities.
Does GPT-4o support real-time audio processing?
GPT-4o is optimized for quick audio input processing, delivering near real-time responses comparable to human reaction times in conversations.