
The Tests that Tricked GPT-4o

By Zian (Andy) Wang
Published Jun 4, 2024
Updated Jun 13, 2024

A little over a week ago, OpenAI announced the latest generation in the GPT model family, GPT-4o. It is not quite GPT-5, but GPT-4o still marks a massive leap in the capabilities and accessibility of Large Language Models.

The release was followed by numerous demonstrations posted by OpenAI, from improved foreign language and translation capabilities to significantly reduced response latency.

Expanded Modalities

The “o” in GPT-4o stands for “Omni”, representing a model that encompasses every modality, from text and images to files and audio. ChatGPT has been multimodal for some time, beginning in early 2023. However, its multimodal capabilities were limited, with much of the processing relying on user-created plugins and external support.

Additionally, although users could already talk to ChatGPT and receive spoken responses on the mobile app prior to the release of GPT-4o, there was noticeable latency. This latency occurred because the user’s audio had to be converted to text before being passed to the model, and the model’s text output then had to be converted back to audio.

GPT-4o removes this additional layer of conversion by accepting audio directly as input to the Large Language Model behind the scenes. The same is true for images.

Furthermore, GPT-4o’s ability to access the internet has improved significantly: it can now browse several web pages in the span of seconds, whereas GPT-4 could take up to 10 seconds to read a single website.

Increased Accessibility

Along with the expanded and enhanced modalities, GPT-4o is also much more accessible. OpenAI released a Mac app with complete access to GPT-4o’s features, along with support for real-time audio chats.

In addition, the Mac app provides convenience features that aren’t available in the web app. Pressing Option+Space on the keyboard brings up a dialogue box similar to Spotlight search on the Mac, with an option to upload a file or simply ask a question. As a neat extra, the app can directly take a screenshot of the window currently open on your screen and send it to GPT-4o, saving the time and hassle of snipping the picture yourself.

However, the Mac app does come with some caveats. First, it can only be installed on Apple Silicon machines. Second, the app is only accessible to ChatGPT Plus subscribers.

Luckily, one does not need to be a subscriber to try out GPT-4o. For the first time ever, the best chat model in the lineup is available for free, subject to rate limits.

Finally, the increased accessibility doesn’t stop there. GPT-4o is superior to GPT-4 in foreign language communication and translation across the board.

Combined with the reduced latency and direct integration of audio into GPT-4o, real-time translation during conversation is entirely possible and seamlessly integrated as OpenAI has demonstrated.

Along with increased accessibility, GPT-4o’s character is much more human, with OpenAI’s real-time audio conversation demo displaying a playful, flirty personality. The audio output isn’t just plain text-to-speech but is filled with emotional cues such as sighs and laughs.

The realistic audio even sparked a controversy because of the voice’s striking similarity to Scarlett Johansson’s performance as an AI in the movie “Her” (2013). The voice was eventually taken down by OpenAI to avoid further legal problems.

Putting GPT-4o to the Test

One of the major limitations of Large Language Models is their lack of spatial and visual awareness, as they are primarily trained on text. While text descriptions of images can convey meaning, they differ significantly from human visual perception. This is similar to the philosophical thought experiment known as Mary’s Room or the Knowledge Argument.

In this thought experiment, Mary is a scientist who knows everything about the science of color but has lived her entire life in a black-and-white room. Despite her extensive knowledge, she has never experienced color firsthand. If she were to see color for the first time, would she learn something new?

This experiment highlights the gap between knowing about something and experiencing it. Similarly, Large Language Models can process and generate text but lack the experiential understanding that comes with visual and spatial awareness. Just as Mary would gain new insight by seeing color, language models are limited in comprehending and generating content that relies on visual context.

With that being said, Language Models with vision have significantly improved at interpreting physical objects in images. I put GPT-4o through some vision-based tests to see just how good its “eyes” are.

One of the more interesting tests of vision, both for human eyes and possibly for Language Models, is the classic “Where’s Waldo”. “Where’s Waldo” (or “Where’s Wally”) is a British children’s book series of visual puzzles. Each puzzle is a detailed illustration crowded with people doing different things, and the reader is tasked with spotting Waldo, who always wears a red-and-white striped shirt, blue pants, and a bobble hat.

Let’s put GPT-4o to the test.

Not bad. Waldo is indeed located in the bottom right, above the security checks.

Notice that there are a lot of red herrings in the image, such as a child in her mother’s arms near the security check lines who wears a red-and-white striped shirt and blue pants, but no hat.

Now, can GPT-4o connect the dots and place a mark where Waldo is?

Close, but not quite there. It looks like GPT-4o can describe Waldo’s rough location, but it has difficulty translating that description into coordinates and mapping them onto the picture.
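To make the gap between a rough description and a precise mark concrete, here is a minimal sketch. It is not how GPT-4o works internally; `region_to_box`, `box_center`, the 3x3 grid, and the 900x600 image size are all hypothetical choices made for illustration. It maps a phrase like “bottom right” to a pixel box.

```python
# Hypothetical helper: turn a coarse verbal location into a pixel box.
# The 3x3 grid and the small vocabulary are illustrative assumptions.

ROWS = {"top": 0, "middle": 1, "bottom": 2}
COLS = {"left": 0, "center": 1, "right": 2}


def region_to_box(description, width, height):
    """Return (x0, y0, x1, y1) for a phrase such as 'bottom right'."""
    words = description.lower().split()
    row = next((ROWS[w] for w in words if w in ROWS), 1)  # default: middle row
    col = next((COLS[w] for w in words if w in COLS), 1)  # default: center column
    cell_w, cell_h = width / 3, height / 3
    return (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)


def box_center(box):
    """Return the midpoint of a box, the best single guess for a marker."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


box = region_to_box("bottom right", width=900, height=600)
print(box)              # (600.0, 400.0, 900.0, 600.0)
print(box_center(box))  # (750.0, 500.0)
```

Even a correct “bottom right” only narrows the search to one ninth of the picture. A marker placed at the center of that box can still be hundreds of pixels away from a figure as small as Waldo, which is consistent with the near miss above.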

A natural follow-up question, given the detailed illustrations of “Where’s Waldo”, is how many people are actually present in the picture. For a small section of the illustration, the task is almost trivial for humans. Let’s see how GPT-4o does.

Close, but again, incorrect. I counted 17 people including the blurry figure in the yellow and blue boat.

Tests like these aren’t pointless. In fact, the ability to accurately detect, count, and find people in a large crowd is readily applicable. If Large Language Models improve to the point of reliably detecting people in large, detailed images based on descriptions, it would be a huge leap for analyzing hundreds if not thousands of hours of footage, whether for spotting stolen items, identifying criminal suspects, or just finding things in general.

Now let’s see if GPT-4o can visualize through words and text. I asked it to generate a 3D model of a cube using JavaScript.

When I tried to increase the complexity by asking GPT-4o to render the cube with one of its corners sliced off to expose a cross section, the model continually failed, no matter how much prompting and tinkering I did. Generating code doesn’t directly reflect GPT-4o’s vision abilities, since it requires the extra step of converting what something should look like into code. Let’s see if it can generate an image of what the cube should look like.

This isn’t entirely wrong, but just not what we’re looking for. GPT-4o did literally remove a corner of the cube, as opposed to geometrically “slicing” the corner off. 
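For reference, the geometry of a proper “slice” is small enough to write down. The sketch below is my own illustration, not code produced by GPT-4o; the function name `slice_cube_corner` and the choice of the corner (1, 1, 1) on a unit cube are assumptions. A plane cuts the three edges meeting at that corner, the corner is discarded, and a triangular cross section takes its place.

```python
from itertools import product


def slice_cube_corner(t=0.5):
    """Vertices of the unit cube after slicing off the corner at (1, 1, 1).

    The cutting plane is x + y + z = 3 - t with 0 < t < 1. It crosses each
    of the three edges meeting at (1, 1, 1) at distance t from that corner.
    """
    if not 0 < t < 1:
        raise ValueError("t must be strictly between 0 and 1")
    corner = (1.0, 1.0, 1.0)
    # The seven remaining cube corners all lie below the plane, so they stay.
    all_corners = [tuple(float(c) for c in v) for v in product((0, 1), repeat=3)]
    kept = [v for v in all_corners if v != corner]
    # One new vertex on each edge that used to end at the removed corner.
    cut = [(1.0 - t, 1.0, 1.0), (1.0, 1.0 - t, 1.0), (1.0, 1.0, 1.0 - t)]
    return kept, cut


kept, cut = slice_cube_corner(0.5)
print(len(kept) + len(cut))  # 10 vertices in total
print(cut)                   # the triangular cross section
```

The sliced solid has 10 vertices and 7 faces: the three faces that touched the removed corner become pentagons, the other three stay square, and the cut itself adds one triangle.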

Despite its remarkable advancements, GPT-4o’s vision capabilities, as demonstrated in the “Where’s Waldo” tests, reveal a nuanced challenge: the difference between recognizing patterns and understanding context. While it can locate elements based on descriptions, the precision required for accurate spatial mapping remains elusive. Furthermore, the 3D geometry tests show that GPT-4o loosely understands visual concepts but is unable to present those insights accurately in its outputs. There is still plenty of room for growth.

Next, I decided to put GPT-4o to the test by challenging it to code a simple game—Flappy Bird in JavaScript. The choice of Flappy Bird was deliberate: it’s a game simple enough to be a reasonable request yet complex enough to test a model’s understanding of game mechanics, physics, and real-time interaction.

Astonishingly, GPT-4o succeeded on the first try. This was a stark contrast to my previous experiences with GPT-4, which required multiple attempts to get the game running at all, and even then the result was far from working. The difference in performance was immediately noticeable.

The result is pretty impressive for a Large Language Model. Though the visuals would benefit from some improvements, I did prompt the model to use only basic shapes, so it can’t be blamed.
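The mechanics such a game has to get right are compact. The following is a minimal sketch of the core loop, written in Python for illustration rather than taken from GPT-4o’s JavaScript output; the constants, the `Bird` and `Pipe` classes, and the `step` and `hits` functions are all hypothetical names and values.

```python
from dataclasses import dataclass

GRAVITY = 0.5          # downward acceleration per frame (illustrative value)
FLAP_VELOCITY = -8.0   # upward kick applied on a flap (illustrative value)


@dataclass
class Bird:
    x: float = 50.0    # left edge in pixels
    y: float = 200.0   # top edge in pixels, measured from the top of the screen
    vy: float = 0.0    # vertical velocity in pixels per frame
    size: float = 20.0


@dataclass
class Pipe:
    x: float           # left edge of the pipe
    width: float
    gap_top: float     # y coordinate where the opening starts
    gap_bottom: float  # y coordinate where the opening ends


def step(bird, flap=False):
    """Advance the bird by one frame: optional flap, then gravity."""
    if flap:
        bird.vy = FLAP_VELOCITY
    bird.vy += GRAVITY
    bird.y += bird.vy


def hits(bird, pipe, screen_height=400.0):
    """True if the bird leaves the screen or touches the pipe outside its gap."""
    if bird.y < 0 or bird.y + bird.size > screen_height:
        return True
    overlaps_x = bird.x + bird.size > pipe.x and bird.x < pipe.x + pipe.width
    in_gap = bird.y >= pipe.gap_top and bird.y + bird.size <= pipe.gap_bottom
    return overlaps_x and not in_gap


bird = Bird()
step(bird)             # falls: vy = 0.5, y = 200.5
step(bird, flap=True)  # flaps: vy = -7.5, y = 193.0
print(bird.y, bird.vy)
```

A full game wraps this in a render loop, scrolls the pipes leftward, and keeps score, but gravity, the flap impulse, and the collision check are the parts that decide whether the game feels right.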

Surprisingly, Language Models that can solve college-level math problems often fail at elementary-level spelling tests. One of my previous articles outlined the hilarious behavior of ChatGPT-3.5 and Google Bard, which were unable to count the number of "E"s in the word “ketchup” despite many tries. You would think that Large Language Models would be better at the task nearly two years later, after many iterations from different research labs and companies. I thought so as well. But just to make sure, I asked the same question to GPT-4o.

Looks like GPT-4o needs to go back to Kindergarten.

Okay, at least it can correct itself. Better than your little brother ChatGPT.

But personally, I would not trust someone who can’t spell to handle my customer service, sort through my data, or manage my business.
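For contrast, the same question is a one-liner in ordinary code, which works on individual characters. A commonly cited explanation for the model’s trouble is that it processes text as subword tokens rather than letters, though that is not something these tests establish. The helper name `count_letter` below is just an illustrative choice.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("ketchup", "E"))  # 1
```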


In conclusion, GPT-4o represents a significant leap forward for Large Language Models, offering a more integrated, responsive, and versatile AI than its predecessors. Its expanded modalities and accessibility features not only enhance user interaction but also broaden the potential applications across different fields. As the tests have shown, though, the model excels in some areas while continuing to struggle in others, such as precise visual mapping, complex spatial relationships, and even letter counting. These limitations are not just hurdles but also opportunities, pointing to where AI still has to evolve to handle complex human tasks. GPT-4o makes it clear that the journey toward a more intuitive and capable AI companion is well underway, with more sophisticated advancements likely to follow.
