TL;DR

  • Multimodal AI is a leap forward in machine learning. It enables models to understand and generate information across different data types.
  • Foundational concepts such as data modalities, representation learning, and fusion techniques are crucial for building effective multimodal systems.
  • Architectures like early and late fusion models, transformer-based multi-modal models, generative multi-modal models, and multi-modal language models provide the framework for integrating diverse data sources.
  • Real-world applications range from enhancing AR experiences, as seen in the Pokémon Go case study code walkthrough in this guide, to improving accessibility, content creation, and user interaction across various domains.

Lots of data in different modalities (text, video, images, audio, etc.) is available today. As a result, there is a growing demand for systems that can make sense of it. This has given rise to multimodal AI, which aims to create systems capable of understanding and interacting with the world in a more nuanced and human-like way.

Multimodal artificial intelligence (AI) integrates different data types, such as text, images, and audio, to comprehensively understand the world. Integrating multiple modalities allows AI systems to achieve a level of understanding and functionality closer to human cognition.

Humans do not rely solely on text or speech to communicate; visual cues, sounds, and context play crucial roles in interpreting information. Multi-modal AI models mirror this complexity, improving accuracy, robustness, and the ability to handle real-world variability in data.

For example, a multimodal AI system in healthcare might analyze medical images, clinical notes, and patient voice recordings to provide a more accurate diagnosis than a system relying on a single data source. Such systems mimic human intelligence more closely and excel in tasks where comprehensive understanding is crucial.

Figure (a) demonstrates an AI model processing audio and image inputs through a cross-attention mechanism. Figure (b) shows a multimodal multitask AI model that combines surgical images and texts as inputs to output interpretations and responses to questions about medical images. | Image Source.

However, developing these systems presents unique challenges, including the need for advanced techniques to align and fuse heterogeneous data sources and handle instances where some modalities may be incomplete or missing.

In this article, you will:

  • Learn the fundamentals of multimodal AI

  • Fine-tune a multimodal AI model to generate descriptions of Pokémon Go images in a case study.

Foundational Concepts in Multi-Modal Architectures

Data Modalities

Data modalities are the various forms or types of data that machine learning (ML) models can process and analyze. These include:

  • Text: Written or spoken language can be processed using natural language processing (NLP) techniques (tokenization, part-of-speech tagging, and semantic analysis).

  • Images: This category includes visual data that can be analyzed using computer vision (CV) techniques (image classification, object detection, and scene recognition).

  • Audio: Sound recordings, including speech, music, or environmental sounds, processed using audio analysis techniques (sound classification, audio event detection, etc.). In multimodal contexts, audio can complement visual data to understand the scene better.

  • Video: A sequence of images (frames) over time, often with accompanying audio. It requires techniques that handle both spatial and temporal data. In multimodal systems, video can be pivotal for tasks requiring dynamic context, like activity recognition or interactive media.

  • Others: This includes a variety of other data types, each with unique processing needs. Sensor data can provide environmental context or physical parameters, while time series data offers insights into trends or patterns. These modalities can enrich AI systems by providing additional perspectives and data dimensions.

Meta AI uses a multi-modal data input framework (six modalities) that integrates various data types for advanced AI processing. | Image Source.

Representation Learning for Different Modalities

Representation learning involves transforming raw data into a format (features) that models can effectively work with. For different modalities, this involves the following (a short code sketch follows the list):

  • Text: Convert words into vectors using word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings from models like BERT or GPT.

  • Images: Using convolutional neural networks (CNNs) to extract features from raw pixel data.

  • Audio: Transforming sound waves into spectrograms or using Mel-frequency cepstral coefficients (MFCCs) as features.

  • Video: Combining image and audio processing techniques, often with recurrent neural networks (RNNs) or three-dimensional (3D) CNNs to capture temporal dynamics.
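
To make this concrete, here is a minimal sketch of per-modality feature extraction in Python. It assumes the transformers, torchvision, librosa, and Pillow packages are installed, and the file names screenshot.png and clip.wav are placeholders for your own data:

import torch
import librosa
from PIL import Image
from transformers import AutoTokenizer, AutoModel
from torchvision import models, transforms

# Text: contextual embeddings from a pre-trained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("A Pikachu appears in the park", return_tensors="pt")
with torch.no_grad():
    text_features = text_encoder(**tokens).last_hidden_state.mean(dim=1)  # shape (1, 768)

# Images: features from a CNN backbone with its classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
image = preprocess(Image.open("screenshot.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    image_features = resnet(image)  # shape (1, 2048)

# Audio: Mel-frequency cepstral coefficients (MFCCs) from a waveform.
waveform, sample_rate = librosa.load("clip.wav", sr=16000)
mfcc_features = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)  # shape (13, time_frames)

Each extractor yields a feature tensor that downstream multimodal components, such as the fusion modules described next, can consume.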

This image shows a language model that encodes different input modalities, such as text, images, audio, and video. The model helps with semantic understanding and alignment and generates multimodal output. | Image Source.

Multimodal Fusion Techniques

Fusion techniques integrate data from multiple modalities to improve model performance using complementary information across different data types. Fusion can occur at different stages:

  • Early Fusion: Combining features from different modalities at the input level before processing.

  • Intermediate Fusion: Integrating features at some hidden layer within the model, allowing for interaction between modalities during processing.

  • Late Fusion: Combining the outputs of separate models for each modality, often through averaging or voting mechanisms.

You will learn more about early and late fusion in the next section.

Multimodal AI Architectures

Multi-modal architectures are designed to process and integrate information from multiple data sources or modalities. These architectures are crucial for tasks that require understanding complex inputs combining different data types.

Below, you will learn the key aspects of multimodal architectures, including:

  • Early vs. late fusion models.

  • Transformer-based multimodal models.

  • Generative multimodal models.

  • Multimodal language models.

Workflow: Several unimodal neural networks encode input modalities. Features are combined using a fusion module and fed into a classification network for prediction. | Image Source.

Early Fusion vs. Late Fusion Models

Early and late fusion are two primary strategies for combining information from different modalities in multi-modal learning systems.

In early fusion, the raw data from different modalities is combined at the input level before processing occurs. This approach requires aligning and pre-processing the data from different modalities, which can be challenging due to differences in data formats, resolutions, and sizes. 

Early fusion lets the model learn joint representations from the raw data, which could help it understand more complex interactions between modalities earlier in the processing chain.

Late fusion involves processing each modality separately and combining the outputs later, such as during decision-making or output generation. This approach can be more robust to differences in data formats and modalities but may lead to the loss of important information that could have been captured through early interaction between modalities.
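
To make the contrast concrete, here is a minimal PyTorch sketch of the two strategies operating on pre-extracted features. The feature dimensions and the two-class output are illustrative assumptions:

import torch
import torch.nn as nn

IMG_DIM, TXT_DIM, NUM_CLASSES = 2048, 768, 2  # illustrative feature sizes

class EarlyFusionClassifier(nn.Module):
    """Concatenates modality features at the input, then processes them jointly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, 512), nn.ReLU(),
            nn.Linear(512, NUM_CLASSES),
        )

    def forward(self, img_feats, txt_feats):
        return self.net(torch.cat([img_feats, txt_feats], dim=-1))

class LateFusionClassifier(nn.Module):
    """Processes each modality separately and averages the output logits."""
    def __init__(self):
        super().__init__()
        self.img_head = nn.Sequential(nn.Linear(IMG_DIM, 512), nn.ReLU(),
                                      nn.Linear(512, NUM_CLASSES))
        self.txt_head = nn.Sequential(nn.Linear(TXT_DIM, 512), nn.ReLU(),
                                      nn.Linear(512, NUM_CLASSES))

    def forward(self, img_feats, txt_feats):
        return 0.5 * (self.img_head(img_feats) + self.txt_head(txt_feats))

# Intermediate fusion would instead merge the two streams at a hidden layer
# inside a single network, somewhere between these two extremes.
img_feats, txt_feats = torch.randn(4, IMG_DIM), torch.randn(4, TXT_DIM)
print(EarlyFusionClassifier()(img_feats, txt_feats).shape)  # torch.Size([4, 2])
print(LateFusionClassifier()(img_feats, txt_feats).shape)   # torch.Size([4, 2])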

Transformer-based Multimodal Models

Transformer models have achieved significant success in various machine-learning tasks. Their ability to handle sequential data and capture long-range dependencies makes them well-suited for multi-modal applications. 

Transformer-based multimodal models can take in and process information from various modalities. They use self-attention mechanisms to determine the importance of each modality's contributions to the current task.
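
As a simplified illustration, the sketch below uses PyTorch's built-in multi-head attention as a cross-attention layer in which text tokens attend over image patches; the embedding size, head count, and sequence lengths are illustrative assumptions:

import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8  # illustrative sizes
cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

text_tokens = torch.randn(1, 12, embed_dim)    # 12 text tokens act as queries
image_patches = torch.randn(1, 49, embed_dim)  # a 7x7 grid of image patches acts as keys/values

# Each text token attends over all image patches, producing text representations
# that are grounded in the visual input.
fused, attn_weights = cross_attention(query=text_tokens,
                                      key=image_patches,
                                      value=image_patches)
print(fused.shape)         # torch.Size([1, 12, 512])
print(attn_weights.shape)  # torch.Size([1, 12, 49])

Stacking such layers, together with self-attention within each modality, is a common recipe behind transformer-based multimodal models.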

The VATT model uses self-supervised multimodal learning to create accurate image captions without explicit supervision. Its ability to learn from multiple modalities simultaneously has led to impressive results in image captioning. | Image Source.

These models have been applied across various multi-modal tasks, including image captioning, visual question answering, and text-to-image generation.

Generative Multi-modal Models

Generative multi-modal models are designed to generate new data or outputs by learning the joint distribution of data from multiple modalities. Some deep generative models you can use for multi-modal learning are variational autoencoders (VAEs) and generative adversarial networks (GANs). 

These models can perform tasks such as cross-modal generation (e.g., generating images from textual descriptions), data augmentation, and unsupervised representation learning, taking into account the heterogeneous nature of multi-modal data.
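
As a rough illustration of cross-modal generation, the sketch below shows a GAN-style generator that maps a noise vector plus a text embedding to an image tensor. The dimensions and the 64x64 output resolution are illustrative assumptions, and a real system would use a convolutional or diffusion-based decoder:

import torch
import torch.nn as nn

NOISE_DIM, TEXT_DIM = 100, 768  # illustrative sizes

class TextConditionedGenerator(nn.Module):
    """GAN-style generator conditioned on a text embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + TEXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64), nn.Tanh(),  # RGB 64x64 image in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # Conditioning on the text is as simple as concatenating the two vectors.
        z = torch.cat([noise, text_embedding], dim=-1)
        return self.net(z).view(-1, 3, 64, 64)

generator = TextConditionedGenerator()
noise = torch.randn(4, NOISE_DIM)
text_embedding = torch.randn(4, TEXT_DIM)  # e.g., the output of a sentence encoder
print(generator(noise, text_embedding).shape)  # torch.Size([4, 3, 64, 64])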

The image shows a model that generates varied outputs from different input types, using a U-Net and cross-attention mechanisms. | Image Source.

Multimodal Language Models

Multi-modal language models extend the capabilities of traditional language models by integrating additional modalities, such as visual or auditory information, into the language processing tasks. 

These models can understand and generate content that spans different data types, enabling applications like generating descriptive text for images (image captioning), improving language understanding with visual context, and enhancing conversational AI systems with the ability to process and respond to multi-modal inputs.
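
For a quick taste of what a pre-trained vision-language model can do out of the box, here is a minimal sketch using the transformers image-to-text pipeline. The BLIP checkpoint named here is just one example of such a model:

from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("https://images.pokemontcg.io/pop6/2_hires.png")
print(result)  # a list with one {'generated_text': ...} caption; the exact wording will vary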

The unimodal model generates text results from text instructions, while the multimodal model can generate various output modalities, such as text descriptions, images, and audio, from different inputs, such as text instructions and images. | Image Source.

Code Walkthrough: Multimodal AI Case Study

In this section, you code along with a case study on fine-tuning the IDEFICS 9B Large Language Model (LLM) to generate Pokémon Go image descriptions. 

Task Description:

This task showcases the practical application of multimodal AI, where the model is adapted to understand and describe the augmented reality (AR) elements of Pokémon characters within real-world settings captured in the game's screenshots. 

The fine-tuning process involves creating a specialized dataset that pairs Pokémon Go images with corresponding textual descriptions.

Data Description:

This dataset serves as the foundation for fine-tuning the IDEFICS 9B model, which has been pre-trained on a diverse array of internet text and image data, so that it learns to recognize and articulate the unique features of Pokémon Go's gameplay and AR components.
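
Below is a hypothetical sketch of how such image-caption pairs can be shaped into the interleaved image-and-text prompts that IDEFICS expects during fine-tuning. The dataset name and column names are placeholders; see the Colab Notebook for the exact preprocessing used:

from datasets import load_dataset

# Placeholder dataset and column names; substitute your own image-caption data.
dataset = load_dataset("your-username/pokemon-go-captions", split="train")

# Each training example interleaves the image with an instruction-style caption target.
train_prompts = [
    [example["image"], f"Question: What's on the picture? Answer: {example['caption']}"]
    for example in dataset
]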

🚨Find the complete code walkthrough in this Colab Notebook.

Task Outcome:

The outcome of this fine-tuning is a multimodal model that can provide players with descriptive narratives of their in-game experiences, improving user engagement by adding an AI-powered dimension to the gameplay. 

Send a prompt to the fine-tuned model that takes in two modalities: an image and a text prompt:

# The prompt interleaves two modalities: an image (referenced by URL) and a text question.
url = "https://images.pokemontcg.io/pop6/2_hires.png"

prompts = [
    url,
    "Question: What's on the picture? Answer:",
]

Call the `do_inference()` function from the Colab Notebook, passing a `max_new_tokens` argument to limit the length of the model’s response:

do_inference(model, processor, prompts, max_new_tokens=100)
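
If you are not following along in the notebook, here is a minimal sketch of what such an inference helper typically looks like with the transformers IDEFICS API; the notebook’s own `do_inference()` may differ in its details:

# Hypothetical reimplementation of the notebook's helper, for illustration only.
def do_inference(model, processor, prompts, max_new_tokens=50):
    inputs = processor(prompts, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_text[0])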

Here is a sample output (also in the Colab Notebook):

#### OUTPUT; DO NOT COPY ####

Question: What's on the picture? Answer: This is ['Lucario-GX', 'Lucario']. A Basic Pokemon Card of type Fire with the title Lucario-GX and 90 HP of rarity Rare Holo from the set Unbound Legends and the flavor text: It's a Pokemon that can use its tail as a weapon. It's a Pokemon that can use its tail as a weapon. It evolves from Lucario when it is traded with

This case study demonstrates the capabilities of multi-modal AI in interpreting complex visual and textual data. It hints at the potential for applying similar technology to other AR applications and interactive experiences.

Challenges in Multi-Modal Learning

The following are some of the challenges that multi-modal learning faces:

  • Heterogeneity: Different modalities have different data formats, scales, and distributions, making it challenging to integrate them effectively.

  • Alignment: It can be challenging to ensure that data from different modalities correspond to the same phenomena or events, especially when temporal dynamics are involved.

  • Missing Modalities: Some modalities may be missing or incomplete in real-world applications, requiring models to handle such inconsistencies (see the sketch after this list).

  • Complexity: Multimodal models are often more complex and computationally intensive than unimodal models, making training and deployment challenging.
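
One common mitigation for the missing-modality problem is to substitute a learned placeholder embedding whenever an input is unavailable. The sketch below illustrates the idea with made-up feature dimensions and a two-class output:

import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    """Image-plus-audio classifier that tolerates a missing audio input."""
    def __init__(self, img_dim=2048, audio_dim=128, hidden=256, num_classes=2):
        super().__init__()
        # Learned placeholder used whenever the audio modality is unavailable.
        self.missing_audio = nn.Parameter(torch.zeros(audio_dim))
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feats, audio_feats=None):
        if audio_feats is None:
            audio_feats = self.missing_audio.expand(img_feats.size(0), -1)
        return self.classifier(torch.cat([img_feats, audio_feats], dim=-1))

model = RobustFusion()
print(model(torch.randn(4, 2048), torch.randn(4, 128)).shape)  # both modalities present
print(model(torch.randn(4, 2048)).shape)                       # audio missing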

Conclusion: Multimodal AI in Action

Throughout this blog post, you have learned how multi-modal AI combines different data types to expand what machine learning models can do. From the ideas behind multi-modal learning to the architectures that make these systems possible, the ability to process and understand text, images, audio, and video simultaneously is a significant step forward in AI.

The case study on fine-tuning the IDEFICS 9B multimodal model to generate descriptions of Pokémon Go images vividly illustrates the practical applications of multimodal AI. It demonstrates the technical feasibility and value of integrating visual and textual data to create more immersive and interactive experiences.

Nice! Thank you for making it to the end of the post. Take a look at the recommended reading to learn more.

Resources for Further Learning

The following resources can help you learn more about multi-modal AI:
