What multimodal AI really looks like in practice
A vast amount of data in different modalities (text, video, images, audio, etc.) is available today, and there is growing demand for systems that can make sense of it. This demand has given rise to multimodal AI, which aims to create systems capable of understanding and interacting with the world in a more nuanced and human-like way.
Multimodal artificial intelligence (AI) integrates different data types, such as text, images, and audio, to comprehensively understand the world. Integrating multiple modalities allows AI systems to achieve a level of understanding and functionality closer to human cognition.
Humans do not rely solely on text or speech to communicate; visual cues, sounds, and context play crucial roles in interpreting information. Multi-modal AI models mirror this complexity, improving accuracy, robustness, and the ability to handle real-world variability in data.
However, developing these systems presents unique challenges, including the need for advanced techniques to align and fuse heterogeneous data sources and handle instances where some modalities may be incomplete or missing.
In this article, you will:
Learn the fundamentals of multimodal AI
Fine-tune a multimodal AI model to generate descriptions of Pokémon Go images in a case study.
Foundational Concepts in Multi-Modal Architectures
Data Modalities
Data modalities are the various forms or types of data that machine learning (ML) models can process and analyze. These include:
Text: Written or spoken language can be processed using natural language processing (NLP) techniques (tokenization, part-of-speech tagging, and semantic analysis).
Images: This category includes visual data that can be analyzed using computer vision (CV) techniques (image classification, object detection, and scene recognition).
Audio: Sound recordings, including speech, music, or environmental sounds, processed using audio analysis techniques (sound classification, audio event detection, etc.). In multimodal contexts, audio can complement visual data to understand the scene better.
Video: A sequence of images (frames) over time, often with accompanying audio. It requires techniques that handle both spatial and temporal data. In multimodal systems, video can be pivotal for tasks requiring dynamic context, like activity recognition or interactive media.
Others: This includes a variety of other data types, each with unique processing needs. Sensor data can provide environmental context or physical parameters, while time series data offers insights into trends or patterns. These modalities can enrich AI systems by providing additional perspectives and data dimensions.
Representation Learning for Different Modalities
Representation learning involves transforming raw data into a format (features) that models can effectively work with. For different modalities, this involves the following (a brief code sketch follows the list):
Text: Convert words into vectors using word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings from models like BERT or GPT.
Images: Using convolutional neural networks (CNNs) to extract features from raw pixel data.
Audio: Transforming sound waves into spectrograms or using Mel-frequency cepstral coefficients (MFCCs) as features.
Video: Combining image and audio processing techniques, often with recurrent neural networks (RNNs) or three-dimensional (3D) CNNs to capture temporal dynamics.
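To make these feature extraction steps concrete, here is a brief, hedged sketch in Python. The model choices (BERT, ResNet-18, MFCCs) and file names are illustrative assumptions, not requirements.

```python
# Illustrative per-modality representation learning; file names are placeholders.
import torch
import librosa
from PIL import Image
from torchvision import models, transforms
from transformers import AutoTokenizer, AutoModel

# Text: contextual embeddings from a pre-trained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("A Pikachu appears in the park.", return_tensors="pt")
text_features = text_encoder(**tokens).last_hidden_state.mean(dim=1)  # shape: (1, 768)

# Images: features from a CNN backbone with its classification head removed.
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
image_features = cnn(image)  # shape: (1, 512)

# Audio: Mel-frequency cepstral coefficients as a compact spectral representation.
waveform, sample_rate = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)  # shape: (13, time_frames)
```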
Multimodal Fusion Techniques
Fusion techniques integrate data from multiple modalities to improve model performance using complementary information across different data types. Fusion can occur at different stages:
Early Fusion: Combining features from different modalities at the input level before processing.
Intermediate Fusion: Integrating features at some hidden layer within the model, allowing for interaction between modalities during processing.
Late Fusion: Combining the outputs of separate models for each modality, often through averaging or voting mechanisms.
You will learn more about early and late fusion in the next section.
Multimodal AI Architectures
Multi-modal architectures are designed to process and integrate information from multiple data sources or modalities. These architectures are crucial for tasks that require understanding complex inputs that combine different data types.
Below, you will learn the key aspects of multimodal architectures, including:
Early vs. late fusion models.
Transformer-based multimodal models.
Generative multimodal models.
Multimodal language models.
Early Fusion vs. Late Fusion Models
Early and late fusion are two primary strategies for combining information from different modalities in multi-modal learning systems.
In early fusion, the raw data from different modalities is combined at the input level before processing occurs. This approach requires aligning and pre-processing the data from different modalities, which can be challenging due to differences in data formats, resolutions, and sizes.
Early fusion lets the model learn joint representations from the raw data, which could help it understand more complex interactions between modalities earlier in the processing chain.
Late fusion involves processing each modality separately and combining the outputs later, such as during decision-making or output generation. This approach can be more robust to differences in data formats and modalities but may lead to the loss of important information that could have been captured through early interaction between modalities.
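As a toy illustration of the difference, here is a minimal PyTorch sketch. The feature dimensions and the two-class output are arbitrary assumptions for illustration only.

```python
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 768, 512, 2

class EarlyFusionClassifier(nn.Module):
    """Concatenate modality features at the input and learn a joint representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + IMAGE_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, text_feat, image_feat):
        return self.net(torch.cat([text_feat, image_feat], dim=-1))

class LateFusionClassifier(nn.Module):
    """Process each modality separately and average the per-modality predictions."""
    def __init__(self):
        super().__init__()
        self.text_head = nn.Sequential(nn.Linear(TEXT_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
        self.image_head = nn.Sequential(nn.Linear(IMAGE_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))

    def forward(self, text_feat, image_feat):
        return (self.text_head(text_feat) + self.image_head(image_feat)) / 2

text_feat, image_feat = torch.randn(4, TEXT_DIM), torch.randn(4, IMAGE_DIM)
print(EarlyFusionClassifier()(text_feat, image_feat).shape)  # torch.Size([4, 2])
print(LateFusionClassifier()(text_feat, image_feat).shape)   # torch.Size([4, 2])
```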
Transformer-based Multimodal Models
Transformer models have achieved significant success in various machine-learning tasks. Their ability to handle sequential data and capture long-range dependencies makes them well-suited for multi-modal applications.
Transformer-based multimodal models can take in and process information from various modalities. They use self-attention mechanisms to determine the importance of each modality's contributions to the current task.
These models have been applied across various multi-modal tasks, including image captioning, visual question answering, and text-to-image generation.
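For example, CLIP is a widely used transformer-based multimodal model. The sketch below uses the publicly available Hugging Face checkpoint to score how well candidate captions match an image; the image path and captions are illustrative.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")
captions = ["a dog playing in a park", "a city skyline at night"]

# The processor tokenizes the text and preprocesses the image into one batch.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```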
Generative Multi-modal Models
Generative multi-modal models are designed to generate new data or outputs by learning the joint distribution of data from multiple modalities. Some deep generative models you can use for multi-modal learning are variational autoencoders (VAEs) and generative adversarial networks (GANs).
These models can perform tasks such as cross-modal generation (e.g., generating images from textual descriptions), data augmentation, and unsupervised representation learning, taking into account the heterogeneous nature of multi-modal data.
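As a minimal sketch of the idea, the toy multimodal VAE below maps two modalities into a shared latent space and decodes each modality back out. The dimensions are illustrative, and real systems operate on raw pixels and tokens rather than fixed-size feature vectors.

```python
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, LATENT_DIM = 768, 512, 32

class MultimodalVAE(nn.Module):
    """Toy joint VAE: one shared latent space for both modalities."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(TEXT_DIM + IMAGE_DIM, 2 * LATENT_DIM)  # outputs mean and log-variance
        self.text_decoder = nn.Linear(LATENT_DIM, TEXT_DIM)
        self.image_decoder = nn.Linear(LATENT_DIM, IMAGE_DIM)

    def forward(self, text_feat, image_feat):
        stats = self.encoder(torch.cat([text_feat, image_feat], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.text_decoder(z), self.image_decoder(z), mu, logvar

vae = MultimodalVAE()
text_recon, image_recon, mu, logvar = vae(torch.randn(4, TEXT_DIM), torch.randn(4, IMAGE_DIM))
# Training would minimize a reconstruction loss per modality plus a KL term on (mu, logvar).
```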
Multimodal Language Models
Multi-modal language models extend the capabilities of traditional language models by integrating additional modalities, such as visual or auditory information, into the language processing tasks.
These models can understand and generate content that spans different data types, enabling applications like generating descriptive text for images (image captioning), improving language understanding with visual context, and enhancing conversational AI systems with the ability to process and respond to multi-modal inputs.
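For instance, image captioning with an off-the-shelf multimodal model can be sketched as follows, using the BLIP checkpoint from Hugging Face as a stand-in; the image path is illustrative.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("street_scene.jpg").convert("RGB")

# Unconditional captioning: the model generates a description from the image alone.
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```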
Code Walkthrough: Multimodal AI Case Study
In this section, you will code along with a case study on fine-tuning IDEFICS 9B, an open multimodal (visual language) model, to generate Pokémon Go image descriptions.
Task Description:
This task showcases the practical application of multimodal AI, where the model is adapted to understand and describe the augmented reality (AR) elements of Pokémon characters within real-world settings captured in the game's screenshots.
The fine-tuning process involves creating a specialized dataset that pairs Pokémon Go images with corresponding textual descriptions.
Data Description:
This dataset serves as the foundation for fine-tuning the IDEFICS 9B model, which has been pre-trained on a diverse array of internet text and image data, so that it learns to recognize and articulate the unique features of Pokémon Go's gameplay and AR components.
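The exact training code lives in the Colab Notebook. As a rough, hedged sketch, a parameter-efficient fine-tuning setup for IDEFICS 9B could look like the following; the LoRA settings, hyperparameters, output path, and dataset handling here are assumptions, not the notebook's actual values.

```python
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

checkpoint = "HuggingFaceM4/idefics-9b"  # assumed base checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Attach LoRA adapters so only a small fraction of the 9B parameters are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# train_dataset is assumed to yield batches with input_ids, attention_mask,
# pixel_values, image_attention_mask, and labels built by the processor
# from (image, description) pairs.
training_args = TrainingArguments(
    output_dir="idefics-9b-pokemon-go",  # assumed output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,
)
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```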
Task Outcome:
The outcome of this fine-tuning is a multimodal model that can provide players with descriptive narratives of their in-game experiences, improving user engagement by adding an AI-powered dimension to the gameplay.
Send a prompt to the fine-tuned model that takes in two modalities: an image and a text prompt.
Call the `do_inference()` function in the Colab Notebook, passing a `max_token` argument to limit the length of the model's response:
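If you do not have the notebook open, here is a hedged sketch of what such a helper might wrap, assuming it uses the standard transformers generation API for IDEFICS. The checkpoint path, image file, prompt text, and the internals of `do_inference()` are assumptions, not the notebook's exact code.

```python
import torch
from PIL import Image
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceM4/idefics-9b"  # swap in the fine-tuned checkpoint path
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)

def do_inference(image, text_prompt, max_token=64):
    # A single prompt mixing two modalities: an image followed by a text instruction.
    prompts = [[image, text_prompt]]
    inputs = processor(prompts, return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_token)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Illustrative call: an in-game screenshot plus a short instruction.
# screenshot = Image.open("pokemon_go_screenshot.png").convert("RGB")
# print(do_inference(screenshot, "Describe this Pokémon Go scene."))
```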
You can find a sample output in the Colab Notebook.
This case study demonstrates the capabilities of multi-modal AI in interpreting complex visual and textual data. It hints at the potential for applying similar technology to other AR applications and interactive experiences.
Challenges in Multi-Modal Learning
The following are some of the challenges that multi-modal learning faces:
Heterogeneity: Different modalities have different data formats, scales, and distributions, making it challenging to integrate them effectively.
Alignment: It can be challenging to ensure that data from different modalities correspond to the same phenomena or events, especially when temporal dynamics are involved.
Missing Modalities: Some modalities may be missing or incomplete in real-world applications, requiring models to handle such inconsistencies.
Complexity: Multimodal models are often more complex and computationally intensive than unimodal models, making training and deployment challenging.
Conclusion: Multimodal AI in Action
Throughout this blog post, you have learned about multimodal AI and how combining different data types can fundamentally change machine learning models. Processing and understanding various types of data (text, images, audio, and video) simultaneously is a significant step forward for AI, both in the ideas behind multimodal learning and in the complex architectures that make these systems possible.
The case study on fine-tuning the IDEFICS 9B multimodal model to generate descriptions of Pokémon Go images vividly illustrates the practical applications of multimodal AI. It demonstrates the technical feasibility and value of integrating visual and textual data to create more immersive and interactive experiences.
Nice! Thank you for making it to the end of the post. Take a look at the recommended reading to learn more.
Resources for Further Learning
The following resources can help you learn more about multi-modal AI:
"Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville: This book provides a comprehensive introduction to deep learning, including foundational concepts that apply to multi-modal learning.
"Attention Is All You Need" by Vaswani et al.: This seminal paper introduces the transformer model, foundational for many multimodal architectures.
Hugging Face's Transformers library: This open-source library provides pre-trained models and tools for natural language processing (NLP) and multimodal learning. Its documentation and tutorials are excellent resources for practical implementation.
"Multimodal Machine Learning: A Survey and Taxonomy" by Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency: This survey paper offers an in-depth look at multimodal machine learning, covering key concepts, challenges, and future directions.