Multimodal AI Models and Modalities

Last Updated: Jun 24, 2024

Introduction

There is a lot of data available to us today, both in digital and physical forms. As a result, there is a growing demand for more systems that can make sense of this vast amount of information. This has given rise to multimodal AI, a cutting-edge field that aims to create systems capable of understanding and interacting with the world in a more nuanced and human-like way.

Multimodal AI integrates different data types, such as text, images, and audio, to build a more comprehensive understanding of the world. By seamlessly combining these diverse data streams, multimodal AI systems can mimic human cognition, bringing us closer than ever to creating AI that truly understands and interacts with us.

Cross-modality builds on multimodal AI and takes this integration a step further. It involves both the parallel use of different data types and the translation between them, such as generating images from text descriptions or synthesizing speech from text. This cross-modal communication is key to developing AI that can not only understand concepts but also translate and express them across different senses, which improves its ability to interact in complex environments and perform tasks that require deeper understanding.

An example of how multimodality can be used in healthcare. Image from Multimodal biomedical AI (Acosta et al., Nature Medicine 2022)

Multimodal AI Models

Models like Mistral, ImageBind, and LLaVA are making significant contributions to multimodal AI research, and this glossary explores their applications and performance benchmarks.

Mistral

Mistral is an open-source large language model (LLM) developed by Mistral AI that can handle very long text sequences efficiently and quickly. Mistral stands out due to its architecture, which allows for faster inference with fewer parameters, making it suitable for applications that require processing large text sequences.

Mistral 7B is a dense, decoder-only transformer that works on text alone. It appears in this glossary because multimodal models such as BakLLaVA use it as their language backbone. The mixture-of-experts (MoE) design belongs to Mistral AI's separate Mixtral 8x7B model, not to Mistral 7B.

Mistral Architectural Details (Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7B. ArXiv. /abs/2310.06825)

Architectural Components of Mistral

Mistral's architecture primarily consists of the following components:

Self-Attention Layer: Implemented through Sliding Window Attention, Grouped Query Attention, and Rolling Buffer KV Cache. The Sliding Window Attention, combined with the KV Cache, contributes to Mistral's speed and ability to handle large sequences.
Feed Forward Layer (SiLU): Uses the SiLU activation function for improved accuracy and efficiency.
RMS Norm: Uses Root Mean Square Normalization (RMSNorm), which is computationally simple and efficient.
Transformer Decoder Layers: Mistral includes 'N' Transformer Decoder Layers, where 'N' equals 32, indicating the depth of the model's architecture.
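
The attention and normalization components above can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions, not Mistral's implementation: the function and class names are hypothetical, the window here counts the current token, and real models apply these operations to tensors across many attention heads.

```python
import math


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


class RollingKVCache:
    """Fixed-size cache: the entry for position i lives in slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window

    def put(self, position, kv):
        # Older entries are overwritten once the buffer wraps around.
        self.slots[position % self.window] = (position, kv)

    def positions(self):
        return sorted(slot[0] for slot in self.slots if slot is not None)


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


cache = RollingKVCache(window=3)
for pos in range(5):
    cache.put(pos, kv=f"kv{pos}")
print(sliding_window_mask(5, 3)[4])  # which keys position 4 may attend to
print(cache.positions())             # positions still held in the cache
```

Because the cache never grows beyond the window size, its memory use stays constant however long the sequence becomes.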

Features and Capabilities of Mistral

Efficient Long Sequence Handling: Ability to process very long text sequences.

Fast Inference: The architecture allows for faster inference, making it suitable for real-time applications.

LLaVA

LLaVA, which stands for Large Language and Vision Assistant, is a multimodal model developed to enhance visual and textual data integration. It combines a vision encoder with a large language model, Vicuna, to enable visual and language comprehension.

LLaVA is engineered to comprehend images and text and to respond in text; it does not process audio. It shows strong multimodal chat capabilities and reported state-of-the-art accuracy on Science QA (question answering) at the time of publication.

LLaVA network architecture (Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. ArXiv. /abs/2304.08485)

Architectural Components of LLaVA

The architecture primarily consists of the following:

Transformer Architecture: LLaVA is based on the transformer architecture, a deep learning model that utilizes a self-attention mechanism to weigh the importance of different parts of the input data.
Auto-regressive Language Model: It uses autoregressive techniques to predict the next word in a sequence based on the words that have come before it.
Vision Encoder: For visual content processing, LLaVA uses the pre-trained CLIP visual encoder ViT-L/14, which extracts visual features from input images.
Language Model: LLaVA uses Vicuna, an instruction-tuned model built on LLaMA, which the authors chose for its strong instruction-following ability among open-source language-only models.
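
One way to picture how the vision encoder and the language model connect: LLaVA maps image features into the language model's word-embedding space with a trainable projection, then feeds them to the model together with the text tokens. The sketch below uses toy dimensions and hypothetical names (`project`, `build_input_sequence`); it shows only the shape bookkeeping, not trained weights or the real model code.

```python
def project(features, weight):
    """Multiply each feature vector (length d_in) by weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(f[k] * weight[k][j] for k in range(len(f))) for j in range(d_out)]
        for f in features
    ]


def build_input_sequence(image_features, text_embeddings, weight):
    """Place projected visual tokens in front of the text token embeddings."""
    visual_tokens = project(image_features, weight)
    return visual_tokens + text_embeddings


# Toy example: 2 image patches of width 3, language embedding width 2.
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = build_input_sequence(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))
```

After projection, visual tokens and text tokens have the same width, so the autoregressive language model can attend over them as a single sequence.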

Features and Capabilities

Impressive Chat Capabilities: LLaVA shows chat behavior that at times resembles OpenAI's multimodal GPT-4 on unseen images and instructions, and it reports state-of-the-art accuracy on Science QA.
Visual Instruction Tuning: The model uses visual instruction tuning, which involves fine-tuning a large language model to understand and execute instructions based on visual cues.

Efficiency: LLaVA-1.5 completes training of its 13B model in about a day on a single node with 8 A100 GPUs, making it highly efficient to train.

ImageBind

Meta created ImageBind, an advanced AI model that can comprehend and combine data from various modalities to produce a unified representation space. In this space, data from different modalities—like images, text, and audio—are converted into a format that can be processed and understood uniformly by the model. This model can process data from six distinct modalities: images, text, audio, depth images, thermal images, and Inertial Measurement Units (IMU).

ImageBind achieves a more comprehensive and holistic understanding of the world using these modalities. This improves its ability to analyze and interpret complex datasets.

The graphic illustrates a multi-modal data input framework used by Meta AI, integrating various data types such as text, image/video, depth, heat map, audio, and IMU (Inertial Measurement Unit) inputs for advanced AI processing (Meta AI Blog, May 2023, ImageBind: Holistic AI learning across six modalities https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/)

Architectural Components of ImageBind

The architectural components of ImageBind are as follows:

Separate Encoders: Utilizes individual encoders for each modality: image, text, audio, thermal image, depth image, and IMU.
Linear Projection Heads: Adds a modality-specific linear projection head to each encoder to obtain fixed-dimensional embeddings.
Normalization and InfoNCE Loss: Embeddings are normalized and used for training in the InfoNCE loss function.
Joint Embedding Space: Creates a unified representation space for direct comparison and combination of different modalities.
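
The normalization and InfoNCE steps above can be sketched in a few lines. This is a minimal, one-directional version under stated assumptions: the function names and the temperature value are illustrative, the embeddings are toy vectors, and ImageBind's actual training code is not reproduced here. A symmetric loss can be obtained by also calling the function with the arguments swapped and adding the two results.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[2.0, 0.1], [0.1, 3.0]]   # each row points the same way as its image
audio_shuffled = [[0.1, 3.0], [2.0, 0.1]]  # pairs deliberately mismatched

print(info_nce(image_embeddings, audio_aligned))
print(info_nce(image_embeddings, audio_shuffled))
```

The loss is small when each embedding is closest to its own pair and large when the pairs are mismatched, which is what pulls matching items from different modalities together in the joint space.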

Features and Capabilities

Cross-Modal Understanding: ImageBind captures relationships between data from its supported modalities. This enables tasks such as cross-modal retrieval, composing modalities with embedding arithmetic, and cross-modal detection and generation. It also allows for novel applications like generating images from sounds or combining inputs from different modalities to create new, derivative works.
Zero-Shot and Few-Shot Learning: Because all modalities share one embedding space, ImageBind can generalize from seen to unseen categories without explicit paired examples for every case, hence the "zero-shot" capability. For "few-shot" learning, the model can adapt to new tasks from a handful of examples thanks to the rich representations learned during training.

Emergent Capabilities: ImageBind's performance and emergent capabilities improve with the strength of the image encoder. This indicates that enhancing the quality of visual features can boost recognition performance, even in non-visual modalities. The model sets a new state-of-the-art benchmark, outperforming specialist-supervised models in various tasks.

Gen-2

Runway Research created Gen-2, a generative AI model that uses latent diffusion to learn from extensive video datasets and generate high-quality video. Building on its predecessor, Gen-1, it synthesizes realistic and temporally consistent videos from text or images.

Architectural Components of Gen-2

Although Runway has not yet released a paper for Gen-2, we can infer its architecture from one of their recent works, Structure and Content-Guided Video Synthesis with Diffusion Models:

CLIP Embeddings: Used for content representation, which increases sensitivity to semantic and stylistic features.
Spatio-Temporal Latent Diffusion: Models the relationships between frames in a video.
Temporal Layers: Extend image-based U-Net architectures to video.
Autoencoder: Processes each video frame independently but in the context of the video's overall structure.
Per-Example Loss: Optimizes the model during training, with adaptations for text-prompt-driven video editing during inference.
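
To make the diffusion training idea concrete, here is a minimal sketch of one noising step and a per-example loss. It assumes the common noise-prediction objective, which may differ from the parameterization Runway actually uses; the function names are hypothetical, and the "latent" is a plain list of numbers rather than an encoded video.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward diffusion: blend a clean latent with Gaussian noise.

    alpha_bar near 1 keeps most of the signal; near 0 leaves mostly noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(x0, noise)
    ]
    return xt, noise


def per_example_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise for one example."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)


rng = random.Random(0)
clean_latent = [1.0, -2.0, 0.5]
noisy_latent, noise = add_noise(clean_latent, alpha_bar=0.5, rng=rng)

# A real model would predict the noise from noisy_latent, the timestep, and the
# conditioning (for example CLIP embeddings); here an all-zero guess stands in.
print(per_example_loss([0.0, 0.0, 0.0], noise))
```

During training, the loss is averaged over many examples and noise levels; at inference, the learned denoiser is applied repeatedly to turn noise into a video latent.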

Features and Capabilities

Video Synthesis: It can synthesize videos from text or images, transforming static inputs into dynamic video content.
Latent Diffusion Technique: Uses latent diffusion to learn from video datasets, contributing to the high quality of the generated videos.

Control and Expressiveness: Offers tools like the Runway Motion Brush for added expressiveness and control over the generated videos.

CLIP

CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI that bridges the gap between visual and textual data. It is designed to understand and categorize images by leveraging natural language descriptions. This model represents a significant shift from traditional approaches, which typically require extensive labeled datasets for each new task. Instead, CLIP can generalize across various tasks without task-specific training data.

Architectural Components of CLIP

CLIP's architecture consists of the following components:

Contrastive Learning Framework: CLIP uses a contrastive learning approach to align text and image representations in a shared embedding space.
Dual Encoders: It consists of two separate encoders, one for processing images and another for processing text.
Vision Transformer (ViT): The best-performing image encoders are Vision Transformers that process visual inputs; ResNet-based variants were also released.
Transformer-Based Text Encoder: The text encoder is a transformer-based model that processes textual inputs.

Features and Capabilities

Zero-Shot Learning: One of the most notable features of CLIP is its ability to perform "zero-shot" learning. This means that once trained, CLIP can be applied to new tasks without any additional fine-tuning simply by providing relevant textual descriptions of the task at hand.
Versatility: CLIP's ability to relate images and text makes it highly versatile. It can be used directly for image classification and image-text retrieval, and as a component of larger systems for tasks such as object detection and image captioning.

Semantic Understanding: Through its contrastive learning approach, CLIP gains a deep semantic understanding of the content within images and text, enabling it to perform tasks that require a nuanced understanding of visual and textual data.
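
Zero-shot classification with a CLIP-style model amounts to comparing one image embedding against the embeddings of several text prompts. The sketch below assumes the embeddings have already been produced by the two encoders; the vectors are made up for illustration, and the function names and the scale factor are assumptions rather than CLIP's API.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, scale=100.0):
    """Return the best-matching prompt and a probability for every prompt."""
    labels = list(prompt_embeddings)
    logits = [scale * cosine(image_embedding, prompt_embeddings[label]) for label in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for encoder outputs.
image = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
label, probabilities = zero_shot_classify(image, prompts)
print(label)
```

Adding a new class only requires writing a new prompt and embedding it, which is why no task-specific training data is needed.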

Flamingo

Flamingo is a Visual Language Model (VLM) developed by DeepMind, designed to perform tasks that require understanding visual and textual information.

Flamingo stands out as a cutting-edge advancement because it integrates the capabilities of vision and language models, enabling it to process and generate responses based on a combination of textual and visual inputs. This integration allows Flamingo to excel in various tasks, such as answering questions about images, generating textual descriptions of visual content, and engaging in dialogues that require understanding visual context.

Examples of inputs and outputs obtained from 80B parameter Flamingo model (Weights & Biases Blog, Dec 2022, DeepMind Flamingo: A Visual Language Model for Few-Shot Learning)

Architectural Components of Flamingo

The primary architectural components of Flamingo can be summarized as follows:

Pretrained Vision-Only and Language-Only Models: Flamingo bridges powerful pretrained vision-only and language-only models.
Interleaved Cross-Attention Layers: Cross-attention layers are interleaved with language-only self-attention layers (frozen) to align visual and textual information.
Perceiver-Based Architecture: Transforms input sequence data (videos) into a fixed number of visual tokens.
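
The idea of turning a variable number of visual features into a fixed number of visual tokens can be sketched with a single cross-attention step. This is a deliberately reduced illustration: the real module uses learned latent queries, learned projections, and several layers, and the names below (`resample`, `latents`) are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def resample(latents, features):
    """Cross-attend each latent query to all features.

    The output always has len(latents) rows, however many features come in.
    """
    dim = len(latents[0])
    output = []
    for query in latents:
        scores = [
            sum(a * b for a, b in zip(query, f)) / math.sqrt(dim) for f in features
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return output


latents = [[1.0, 0.0], [0.0, 1.0]]          # fixed number of queries
short_clip = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]
long_clip = short_clip * 10                  # ten times as many frame features

print(len(resample(latents, short_clip)), len(resample(latents, long_clip)))
```

Because the language model always receives the same number of visual tokens, the cost of the cross-attention layers does not grow with the length of the video.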

Features and Capabilities

Chain-of-Thought Reasoning: Flamingo can perform complex reasoning tasks by generating intermediate textual explanations that bridge the gap between visual inputs and the final response, facilitating more nuanced and accurate outputs.
Few-Shot Learning: Flamingo can adapt to new tasks with minimal examples, outperforming models that require extensive fine-tuning on large datasets.

State-of-the-Art Performance: Flamingo has achieved new state-of-the-art results on multiple benchmarks, including visual question-answering and captioning tasks.

CogVLM

CogVLM (Cognitive Visual Language Model) is an open-source visual language foundation model developed to enhance visual and textual data integration. It bridges the gap between vision and language understanding. Unlike traditional models that use a shallow alignment method, CogVLM achieves a deep fusion of visual and language features without compromising performance on NLP tasks.

It has demonstrated state-of-the-art performance on numerous classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, and others, showcasing its effectiveness across various applications.

Qwen-VL-Plus

Alibaba Cloud created Qwen-VL, a large-scale vision-language model; Qwen-VL-Plus is an improved version. It is designed to perceive and understand text and images, with notable gains in high-resolution recognition, text analysis, and image reasoning.

Qwen-VL-Plus can efficiently extract information from tables and documents and reformat this information. It also has an efficient mechanism for identifying and converting dense text, which is very effective in dealing with documents that contain a lot of information. It supports images with extreme aspect ratios, ensuring the flexibility to process diverse visual content.

PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and/or images) and addresses tasks through multimodal language modeling. (Google AI Blog, March 2023, https://blog.research.google/2023/03/palm-e-embodied-multimodal-language.html)

Architectural Components of Qwen-VL-Plus

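The components listed here follow the Q-Former design introduced with BLIP-2, in which learnable queries bridge a frozen image encoder and a language model. The open Qwen-VL model uses a related but simpler adapter, a single cross-attention layer with learnable query embeddings, and details specific to Qwen-VL-Plus may differ: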

Q-Former: A trainable BERT encoder with a causal language modeling head, akin to GPT, designed to bridge the modality gap between visual and textual information.
Image Transformer: Interacts with a frozen image encoder to extract visual features.
Text Transformer: Functions as both a text encoder and decoder, processing and generating text.
Learnable Query Embeddings: A fixed number of trainable query embeddings for modality alignment, interacting with each other and with frozen image features through self-attention and cross-attention layers.
Cross-Attention Layers: Integrated into every two layers of BERT, randomly initialized, and crucial for modality alignment.
BERTbase Initialization: Q-Former is initialized with BERTbase pre-trained weights, while cross-attention layers are randomly initialized.

Features and Capabilities

High-Resolution Recognition: It supports high-definition images with resolutions above one million pixels and images of various aspect ratios.
Text Analysis: It significantly improves text processing in images, especially in terms of recognizing Chinese and English text.
Image Reasoning Capabilities: It substantially boosts image-related reasoning capabilities.

Detailed Recognition Capabilities: It greatly enhances the recognition, extraction, and analysis of details within images and texts.

SeamlessM4T

SeamlessM4T is a collection of models developed to provide high-quality translation and enable communication across different linguistic communities through speech and text. It is designed to handle multiple tasks without relying on separate models for each task.

The tasks supported by SeamlessM4T include speech-to-speech translation (S2ST), speech-to-text translation (S2TT), text-to-speech translation (T2ST), text-to-text translation (T2TT), and automatic speech recognition (ASR). Each task has its dedicated sub-model, but the SeamlessM4TModel can perform all the above tasks.

The diagram illustrates a two-tiered framework for speech and text processing. The first tier contains the pre-trained models: SeamlessM4T-NLLB, a T2TT encoder-decoder; w2v-BERT 2.0, a speech encoder; T2U, a text-to-unit encoder-decoder; and a vocoder for speech resynthesis. The second tier, called Multitask UnitY, integrates a Conformer speech encoder with a length adaptor. (Meta AI Blog, August 2023 https://ai.meta.com/blog/seamless-m4t/)

Architectural Components of SeamlessM4T

The architectural components of SeamlessM4T are as follows:

Sequence-to-Sequence Models: Two seq2seq models enable tasks like Speech-to-Speech Translation (S2ST), Speech-to-Text Translation (S2TT), Text-to-Speech Translation (T2ST), Text-to-Text Translation (T2TT), and Automatic Speech Recognition (ASR).
Shared Configuration Parameters: Includes hidden size, initializer range, and layer norm epsilon, standardizing the dimensionality and initialization across sub-models.
UnitY Integration for S2ST: Utilizes UnitY for speech-to-speech translation, addressing error propagation and domain mismatch issues common in cascaded systems.
Dedicated Encoders for Each Modality: Features unique encoders for text and speech modalities, ensuring effective processing of multimodal inputs.
HiFi-GAN Inspired Vocoder: For speech output, it uses a vocoder based on the HiFi-GAN architecture, enhancing speech generation quality.
Fairseq2 for Efficient Modeling: Leverages the redesigned fairseq2 for a lightweight and efficient sequence modeling toolkit, improving performance and efficiency.
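
The way one model covers five tasks can be pictured as choosing an encoder by input modality and adding speech synthesis stages only when the output is speech. The sketch below is a schematic of that routing, not the SeamlessM4T code: the stage names are hypothetical labels for the components listed above.

```python
# Each task maps to (input modality, output modality).
TASKS = {
    "S2ST": ("speech", "speech"),
    "S2TT": ("speech", "text"),
    "T2ST": ("text", "speech"),
    "T2TT": ("text", "text"),
    "ASR": ("speech", "text"),
}


def pipeline(task):
    """List the stages a request passes through for the given task."""
    source, target = TASKS[task]
    stages = [f"{source}_encoder", "text_decoder"]
    if target == "speech":
        # Speech output is produced from discrete units by a vocoder.
        stages += ["text_to_unit", "vocoder"]
    return stages


for name in TASKS:
    print(name, "->", " > ".join(pipeline(name)))
```

Every task shares the text decoder, which is what lets a single set of weights serve translation and recognition without a separate model per task.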

Features and Capabilities

Multitasking: It can perform various translation and recognition tasks across different modalities using a single model.
Multimodal Translation: The model excels in translating and transcribing speech and text across multiple languages, providing a unified solution for multimodal translation.
Support for Multiple Languages: SeamlessM4T supports nearly 100 languages, making it a comprehensive multilingual translation and transcription model.

Unified Multilingual Model: It operates as a unified model, directly producing accurate translation results without needing intermediate models.

BakLLaVA

BakLLaVA is a Large Multimodal Model (LMM) developed collaboratively by LAION, Ontocord, and Skunkworks AI. BakLLaVA utilizes a Mistral 7B base and is augmented with the LLaVA 1.5 architecture, showcasing its capabilities in processing and generating content across different modalities.

Architectural Components of BakLLaVA

The primary components of BakLLaVA's architecture are:

Mistral 7B Base: BakLLaVA uses a Mistral 7B base, which is a foundational component of its architecture.
LLaVA 1.5: It incorporates the LLaVA 1.5 architecture, which pairs a vision encoder with a language model; in BakLLaVA, Mistral 7B takes the place of the Vicuna model used in LLaVA 1.5.

Features and Capabilities

Content Generation: BakLLaVA generates text grounded in both image and text inputs, such as captions, descriptions, and answers to questions about an image.
Accessibility: The model can be run on devices with adequate GPU resources, making it accessible to many users and applications.

PaLM-E

PaLM-E (Pathways Language Model-Embodied) is an advanced multimodal language model. It was created to simplify combining visual and textual data with continuous embodied observations like images, state estimates, or other sensor modalities. PaLM-E can do many things, like plan sequential robotic manipulation, answer visual questions, and caption scenes. It does this by directly integrating real-world continuous sensor modalities into language models.

Architectural Components of PaLM-E

PaLM-E's architecture primarily consists of the following components:

Dense Decoder-Only Transformer Model: PaLM-E is based on a dense decoder-only Transformer architecture, which is a type of deep learning model that utilizes self-attention mechanisms.
Unified Embedding Space: Continuous inputs are mapped into a space that resembles "words," allowing both word and image embeddings to have the same dimensionality and be fed into the language model.
Pre-trained Component Initialization: PaLM-E is initialized with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, ViT), updating all its parameters during training.
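
The unified embedding space can be illustrated by replacing placeholders in a text prompt with vectors derived from observations, so that words and sensor inputs end up in one sequence of equal-width vectors. The sketch assumes the observations have already been encoded and projected to the language model's width; the `<obs>` placeholder, the embedding table, and the function name are all hypothetical.

```python
def embed_prompt(tokens, word_table, observations, placeholder="<obs>"):
    """Build one embedding sequence from words and encoded observations.

    Each placeholder consumes the next observation, which may expand to
    several vectors (for example, one per image patch).
    """
    remaining = iter(observations)
    sequence = []
    for token in tokens:
        if token == placeholder:
            sequence.extend(next(remaining))
        else:
            sequence.append(word_table[token])
    return sequence


word_table = {
    "pick": [0.1, 0.0],
    "up": [0.0, 0.1],
    "the": [0.2, 0.2],
    "block": [0.3, 0.1],
    "in": [0.0, 0.4],
}
camera_image = [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]  # three projected patch vectors
robot_state = [[0.9, 0.1]]                            # one projected state vector

tokens = ["pick", "up", "the", "block", "in", "<obs>", "<obs>"]
sequence = embed_prompt(tokens, word_table, [camera_image, robot_state])
print(len(sequence))
```

The language model then treats every row the same way, whether it came from a word or from a sensor reading.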

Features and Capabilities

Embodied Reasoning: PaLM-E can address a variety of embodied reasoning tasks from different observation modalities in multiple embodiments, showcasing its ability to understand and interact with the physical world.
State-of-the-Art Performance: It achieves state-of-the-art performance on tasks like OK-VQA, demonstrating its capabilities as a visual-language generalist while retaining generalist language capabilities with increasing scale.
Positive Transfer: Exhibits "positive transfer," benefiting from diverse joint training across internet-scale language, vision, and visual-language domains.

Generalization: Demonstrates the ability to generalize to tasks it has not been explicitly trained on, such as planning and executing long-horizon tasks with minimal examples.

Gemini

Gemini is a suite of advanced multimodal models by Google, adept at processing and understanding various data types like images, audio, and text. The models range from Gemini 1.0 Ultra, tailored for complex tasks and reasoning, to the Pro version, which balances performance with scalable deployment, down to the Nano model designed for on-device applications, each addressing diverse needs and capabilities.

Gemini 1.5, a subsequent iteration, builds upon this with enhanced speed and efficiency using the MoE architecture. It also extends the context window to as many as 1 million tokens in production and 10 million in research. A token is the smallest unit of data the model processes, and the context window is the number of tokens the model can process at once. A longer window helps the model retain the context of lengthy texts, videos, and audio.

Gemini supports interleaved sequences of text, image, audio, and video as inputs (illustrated by tokens of different colors in the input sequence). It can output responses with interleaved image and text. (Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., Petrov, S., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., . . . Vinyals, O. (2023). Gemini: A Family of Highly Capable Multimodal Models. ArXiv. /abs/2312.11805)

Architectural Components of Gemini

Gemini's architecture primarily consists of the following components:

Mixture-of-Experts (MoE) Architecture: Gemini 1.5 models incorporate the MoE architecture, which divides the model into smaller "expert" neural networks that activate selectively based on the input, enhancing efficiency.
Transformer Foundation: At its core, Gemini 1.0 models build on the Transformer architecture to process the input data.
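
Selective activation of experts is usually implemented with a small gating network that scores the experts and runs only the top few. The sketch below shows generic top-k routing for a single input vector; Gemini's actual routing is not public in this detail, so the gate weights, the experts, and the value of `top_k` are illustrative assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, gate_weights, experts, top_k=2):
    """Route x to the top_k highest-scoring experts and mix their outputs."""
    scores = [sum(a * b for a, b in zip(x, w)) for w in gate_weights]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    mix = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for weight, index in zip(mix, chosen):
        expert_output = experts[index](x)  # only the chosen experts are evaluated
        output = [o + weight * v for o, v in zip(output, expert_output)]
    return output, chosen


experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [3.0 * a for a in v],
    lambda v: [100.0 * a for a in v],
]
gate_weights = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

output, chosen = moe_layer([1.0, 0.0], gate_weights, experts)
print(chosen, output)
```

Only the selected experts run for a given input, so the compute per token stays well below what the total parameter count would suggest.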

Features and Capabilities

Three Size Variants: Gemini 1.0 is optimized for three sizes - Ultra, Pro, and Nano - to cater to various tasks, from complex computational needs to on-device applications.
Pretraining and Fine-tuning: The model employs strategies like pretraining on large datasets and fine-tuning on specific tasks to improve performance and versatility.
TPU Accelerators for Efficiency: Gemini 1.0 was trained on Google's TPU v4 and TPU v5e accelerators, and Google reports that it is more efficient to serve than previous models.
Long-Context Understanding: Gemini 1.5 Pro introduces an experimental feature for understanding long contexts, improving the model's ability to process and generate coherent outputs over extended sequences.

Integration Across Google Products: Gemini is integrated across Google's products, enhancing services like Search, Ads, Chrome, and Duet AI with its advanced AI capabilities.

Multimodal AI Benchmarks and Metrics

Multimodal AI models are evaluated on how well they process data from different modalities, such as text, images, and audio. This evaluation is essential for determining a model's ability to handle tasks that require understanding complex inputs.

An image showing the domains and modalities covered by the MULTIBENCH benchmark, which spans 15 datasets, ten modalities, more than 20 prediction tasks, and six research areas such as multimodal sensing and human-machine interaction. It is designed to assess system performance across diverse domains and modalities, complexity during training and inference, and robustness to noise and missing modalities, emphasizing the need for adaptable and resilient systems in real-world applications. (Liang, P. P., Lyu, Y., Fan, X., Wu, Z., Cheng, Y., Wu, J., Chen, L., Wu, P., Lee, M. A., Zhu, Y., Salakhutdinov, R., & Morency, L. (2021). MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. ArXiv. /abs/2107.07502)

Metrics and benchmarks are tailored to reflect model performance in accuracy, robustness, and efficiency across different tasks and modalities. The key components include:

Mutual Information Divergence (MID)

MID is introduced as a comprehensive metric for evaluating multimodal generative models, especially in text-to-image generation and image captioning tasks. It uses negative Gaussian cross-mutual information based on CLIP features to evaluate the coherence between text and image modalities.

According to its authors, MID is more consistent across benchmarks, more sample-efficient, and more robust to the choice of CLIP model than competing metrics.
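
For reference, the Gaussian quantities behind a metric of this kind can be written out. Assume the CLIP features of images x and texts y are modeled as jointly Gaussian, with z = [x; y] their concatenation and Σ_x, Σ_y, Σ_z the corresponding covariance matrices. The notation is chosen here for illustration and may differ from the MID paper.

```latex
I(X;Y) = \frac{1}{2}\,\log\frac{\det\Sigma_x \,\det\Sigma_y}{\det\Sigma_z}

\mathrm{PMI}(x;y) = I(X;Y) + \frac{1}{2}\left( D_M^2(x) + D_M^2(y) - D_M^2(z) \right),
\qquad
D_M^2(v) = (v-\mu_v)^{\top}\,\Sigma_v^{-1}\,(v-\mu_v)
```

The first line is the standard mutual information of two jointly Gaussian variables, and the second follows from taking the log-ratio of the joint density to the product of the marginals. How the published metric averages the point-wise term over samples is described in the MID paper and is not reproduced here.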

MULTIBENCH

MULTIBENCH is a vast benchmark designed to test multimodal models across various tasks, modalities, and fields. It emphasizes generalization, the complexity of training and inference, and robustness against disturbances or absent modalities. Encompassing 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas, it offers an exhaustive framework for evaluating multimodal learning.

MM-SHAP

A metric grounded in Shapley values, MM-SHAP is performance-agnostic and aims to quantify the contribution of each modality in vision and language models. In contrast to accuracy-focused metrics, it measures how much each modality affects model predictions. This helps detect unimodal collapse, where a model relies almost entirely on one modality, and so supports more reliable multimodal systems.
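
The idea of attributing a prediction to modalities with Shapley values can be shown on a toy example. The sketch computes exact Shapley values for three input tokens and then each modality's share of the total absolute contribution. The scoring function is invented for illustration, the names are hypothetical, and real vision-language models need sampling-based approximations because exact enumeration grows factorially.

```python
from itertools import permutations


def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        present = frozenset()
        for player in order:
            with_player = present | {player}
            totals[player] += value_fn(with_player) - value_fn(present)
            present = with_player
    return {p: total / len(orderings) for p, total in totals.items()}


def modality_shares(phi, text_tokens, image_tokens):
    """Share of total absolute contribution from text and from image tokens."""
    text = sum(abs(phi[t]) for t in text_tokens)
    image = sum(abs(phi[t]) for t in image_tokens)
    return text / (text + image), image / (text + image)


# Toy model: the score is a weighted count of the tokens that are not masked.
WEIGHTS = {"text:cat": 0.6, "text:sofa": 0.2, "patch:0": 0.2}


def toy_model_score(present):
    return sum(WEIGHTS[p] for p in present)


text_tokens = ["text:cat", "text:sofa"]
image_tokens = ["patch:0"]
phi = shapley_values(text_tokens + image_tokens, toy_model_score)
text_share, image_share = modality_shares(phi, text_tokens, image_tokens)
print(round(text_share, 3), round(image_share, 3))
```

A text share far above the image share, as in this toy case, would indicate that the model leans mostly on the textual input regardless of how accurate its predictions are.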

MMBench

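MMBench assesses the diverse capabilities of vision-language models. It includes a carefully curated dataset of multiple-choice questions and introduces a CircularEval strategy, in which each question is presented several times with its options circularly shifted and counts as correct only if the model answers correctly every time. MMBench also uses ChatGPT to map free-form predictions onto the predefined choices, which allows models' abilities to be assessed across several dimensions.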
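
A circular evaluation loop of this kind can be sketched as follows: the options are rotated once per pass, and the question counts as solved only if the model picks the correct option in every rotation. The model interface used here, a function that returns the index of its chosen option, is an assumption for illustration and not MMBench's API.

```python
def circular_eval(model, question, choices, answer_index):
    """Return True only if the model is right under every circular shift."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # After rotating left by `shift`, the correct option moves to this slot.
        correct_position = (answer_index - shift) % n
        if model(question, rotated) != correct_position:
            return False
    return True


def always_first(question, options):
    """A position-biased 'model' that always picks the first option."""
    return 0


def knows_answer(question, options):
    """A 'model' that finds the right option wherever it appears."""
    return options.index("Paris")


question = "What is the capital of France?"
choices = ["Paris", "Rome", "Madrid", "Berlin"]

print(circular_eval(always_first, question, choices, answer_index=0))
print(circular_eval(knows_answer, question, choices, answer_index=0))
```

The position-biased model would score as correct in a single-pass evaluation of this question, but it fails once the options are rotated, which is the kind of shortcut the strategy is meant to expose.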

AutoML Multimodal Benchmark

This benchmark concentrates on models that process tabular datasets containing numerical, categorical, and textual columns. Its goal is to evaluate their proficiency in merging and handling information from these disparate data types.