in Blog

January 03, 2024

Multimodal Models: Integrating Text, Image, and Sound in AI


Artur Haponik

CEO & Co-Founder

Reading time:

17 minutes

AI has come a long way, but it’s not quite there yet. Despite the numerous advantages presented by machine learning models, the fact that they’re limited to just one modality significantly limits their capabilities, particularly when it comes to gaining natural intelligence.

In a bid to address this issue and create new models that can perform a wider range of tasks at greater accuracy, data scientists and developers have recently come up with multimodal machine learning models that incorporate the capabilities of regular, unimodal LLMs with multimodal capabilities, enabling them to process data from multiple sources.

This article will explore the intricacies of multimodal models and multimodal learning, particularly around how they incorporate different modalities, pushing the boundaries of deep learning and taking AI a step closer to gaining natural intelligence.


What are multimodal models?

Traditional ML models were designed to process data from one modality, such as speech recognition or image identification. However, in real-world applications, there’s data coming from different sources in different modalities, making it more difficult to process and analyze without a unified view.

Multimodal models overcome this limitation by integrating data from multiple modalities to create more accurate, well-rounded, and informative models.

As such, multimodal models are simply a combination of multiple unimodal neural networks. These neural networks process each modality separately before the multimodal infrastructure combines and averages their output to make more accurate predictions.

Multimodal infrastructures achieve this through multimodal learning, which typically deals with the fusion and analysis of data from multiple, diverse modalities such as image, text, video, and audio.

At its core, multimodal learning seeks to create a shared representation system capable of capturing complementary information from multiple, diverse modalities effectively. By employing a shared representation, multimodal models can perform a wide variety of tasks, including natural language processing, speech recognition, image captioning, and much more – all from a single model.

A typical multimodal model consists of multiple neural networks. Each neural network in the infrastructure is specialized in analyzing a specific modality. After individual analysis, the multimodal infrastructure combines the outputs of these neural networks through data fusion techniques like hybrid fusion, early fusion, or late fusion to create a unified representation of the data.

Early fusion typically involves aggregating data from different modalities and concentrating it into one input vector before feeding it into the network. Conversely, late fusion involves training different neural networks separately on specific modalities and then combining the outputs of each neural network at a later stage. Hybrid fusion takes things up a notch by combining the elements of early and late fusion to create a more flexible, adaptable model.

Read more about Multimodal AI Models: Understanding Their Complexity

How do multimodal models work?

To understand how multimodal models work, you first need to understand multimodal learning. As stated earlier, these models consist of multiple neural networks designed to process each modality separately. Take an audiovisual multimodal model, for instance. Such a model could have two unimodal neural networks: One for processing audio inputs and another for video inputs. These modalities process data in a process called encoding.

In a multimodal infrastructure, unimodal encoding is typically followed by information extraction and fusion. The fusion process can be achieved through various techniques ranging from the relatively simple concatenation method to more complex techniques like attention mechanisms. This makes multimodal fusion a crucial factor in multimodal model functionality. After fusion, a ‘decision’ network takes the encoded information as input and processes it according to the task it’s trained on.

Here’s how each stage in the multimodal architecture works:

Encoding stage

As ‘unified’ systems, multimodal models first need to aggregate information from the various modalities before they can process it. Therefore, the encoder first extracts relevant features from each modality’s input data and aggregates them into a unified representation that can be easily processed by subsequent layers in the system.

The encoder is generally composed of multiple layers of neural networks that leverage nonlinear transformation methods to extract abstract features from the input data.

The encoder’s input can consist of data from multiple modalities, such as audio, text, and images, which are typically processed individually. Additionally, each modality comes equipped with its own encoder that transforms input data into a set of feature vectors. The model then combines each encoder’s output into a unified representation, capturing each modality’s relevant information.

Encoders typically use concatenation and attention mechanism techniques to combine the outputs of individual encoders with the ultimate goal of capturing the underlying structures and relationships of the input data from different modalities. Ultimately, this allows the model to generate new outputs based on the input or make more accurate predictions.

Fusion module

The primary role of the fusion module is to combine information from multiple modalities into a unified representation that can be utilized for downstream tasks like regression, classification, or generation.

The fusion module doesn’t take a singular approach to data combination. Instead, it takes a different approach depending on the task at hand and the specific architecture of the system.

For instance, the fusion module can apply a weighted sum of modalities feature, which enables it to learn the weights during training. Alternatively, the fusion module can concentrate on the features of the modalities and learn a joint representation by passing them through a neural network. It can also use attention mechanisms to learn the most suitable modality to attend to at each step.

As evidenced by the instances above, despite taking different implementation methods, the fusion module has one primary role – to use information captured from multiple modalities to create a more informative and robust representation that can be applied in downstream tasks.

This unique mode of operation comes in handy in applications like video analysis, where it is important to combine audio and visual cues for a more comprehensive analysis and better model performance.


The primary role of the classification module is to use the unified representations generated by the previous module to make accurate decisions and predictions. Like the fusion module, the classification module can take a different approach depending on the type of data being processed and the task at hand.

For instance, the classification module can take the form of a neural network that allows it to pass the unified representations through a single or multiple connected layers before making a final prediction. The layers applied can include dropouts and nonlinear activation functions to other techniques designed to improve generalization performance and prevent overdrafts.

Likewise, the output of the classification layer comes down to the task at hand. For instance, when applied to a multimodal sentiment analysis task, the module produces a binary decision indicating whether the input is negative or positive. Conversely, when applied to a multimodal image captioning task, the classification module might produce a sentence describing the nature of the input as an output.

Developers typically apply a supervised learning approach when training the classification module. In this approach, developers use input modalities and their corresponding targets and labels to optimize the model’s parameters. Developers also use gradient-based optimization techniques like the stochastic gradient descent technique or other suitable variants.

Multimodal learning techniques

Despite requiring significantly less training data than their unimodal counterparts, multimodal models have an intricate training process that includes several techniques, including:

  • Fusion-based approach
  • Late fusion
  • Alignment-based approach

Multimodal learning techniques

Fusion-based approach

As the name suggests, the approach typically entails integrating multiple modalities into a unified representation space. Once integrated, the representations can be integrated further to form a unified modality-invariant representation capable of capturing the modalities’ semantic information.

Depending on where the fusion occurs, the fusion-based technique can be divided further into early and mid-fusion techniques.

One of the most common examples of the fusion-based technique is text captioning. In text captioning, the model encodes the visual features of an image and the semantic information contained in the corresponding text into a unified representation space, then fused to create a unified modality-invariant representation capable of capturing each modality’s semantic information.

The model typically captures the semantic content of the text using a (RNN) Recurrent Neural Network, while a (CNN) Convolutional Neural Network is used to extract the visual features contained in the image.

After extraction, the model encodes both modalities into a unified representation space where it fuses them using element-wise manipulation or concatenation to generate a unified modality-invariant representation. The unified representation can then be used to generate a relatively accurate caption for the image.

To this effect, developers can use open-source datasets like the Flickr30k dataset. The database contains about 31K images, with each image containing five captions. The images typically feature everyday scenes and are captioned by five different individuals to provide more diverse captions. [1]

The multimodal nature of the dataset makes it especially suitable for applying the fusion-based technique for text and image captioning. Developers achieve this by extracting relevant visual-based features from Convolutional Neural Networks using techniques like bag-of-words representations and word embeddings for features in the text. The fused representation generated from the process can be used to create more informative and accurate captions.

Late fusion

The late fusion technique typically involves combining individual predictions from each modality to generate a final prediction. The late fusion approach is especially useful when dealing with modalities that provide complimentary information or when they are not directly related.

Late fusion is commonly employed in emotion recognition and analysis in music. In a typical emotion recognition task, a multimodal model must be able to recognize the emotional context of the music using the piece’s lyrics and audio features.

This approach is especially suitable for emotion recognition in music since it combines predictions from different unimodal models, which are trained on different modalities to generate an accurate prediction.

The model typically extracts audio features through techniques like Mel-frequency cepstral coefficients (MFCCs) and encodes the lyrics using word embeddings and bag-of-words techniques.

The DEAM dataset, for instance, is commonly used to facilitate research on analysis and emotion recognition in music. The dataset is made up of a collection containing more than 2000 songs and features both the lyrics and audio features of the music. [2]

The features of the audio in the dataset have various descriptors, including spectral contrast, rhythm features, and MFCCs. Similarly, the model represents the lyrics using word embedding and bag-of-words techniques.

When applied to the dataset, late fusion can make a model more accurate by combining multiple predictions from diverse models that are trained on individual modalities. For instance, developers can train one model to analyze and predict a song’s emotional content using lyrics and another model to predict a song’s emotional content by analyzing audio features.

Alignment-based approach

As the name suggests, the alignment-based approach involves directly comparing different modalities by aligning them. The primary goal of this approach is to create modality-constant representations of the data, comparable across different modalities.

This approach is particularly effective when the different modalities involved share a direct relationship. A good example of this is in audio-visual speech recognition, popularly referred to as sign language recognition.

The alignment-based approach is especially effective in sign language recognition since it aligns the temporal information from the video with the information contained in audio modalities.

A typical model designed to facilitate sign language recognition must be able to recognize the gestures in sign language and translate them into text, do the same for the audio, and align the information for accurate analysis.

The RWTH-PHOENIX-Weather 2014T dataset[3], for instance, contains numerous video recordings of German sign language. The mere fact that it contains both audio and visual modalities makes it perfect for training multimodal machine learning and deep learning models that require an alignment-based approach.

Challenges in multimodal learning

Developing and training a multimodal model is an intricate process, to say the least. The fact that multimodal models combine multiple modalities, which need to be integrated and analyzed by other systems in the architecture before the model can make a prediction, means that the development process involves a lot of moving parts, which may pose a few challenges along the way.

Some of the most common challenges experienced in the development and utilization of multimodal models include:

Challenges in multimodal learning


Multimodal representation, in simple terms, is the process of representing data from different modalities in the form of a tensor or vector. However, this data often contains both complementary and redundant information. There’s also the issue of bias, where the model prioritizes one modality over the other, making it incredibly difficult to accurately and effectively capture and analyze the unified data.

In a bid to curb this issue, developers typically employ several strategies, including:

Joint representation

In joint representations, data from multiple modalities is typically combined into a unified space. The data is then projected into a separate space using coordinated representations through structure constraint techniques like partial order or similarity techniques like Euclidean distance.

However, the joint representation technique is only suitable when all necessary modalities are present at the inferred time.

Coordinated representation

Unlike joint representation, coordinated representation can work even when some modalities are not present at the inference time. The approach typically works by maintaining separate encodings for each modality. It also applies attention mechanisms or alignment techniques to ensure that all representations are related and contain the same contextual information.


Transferring knowledge from one modality to another can be quite challenging. Unfortunately, developers often need to transfer knowledge from resource-rich modalities when resources are limited either due to noisy input, lack of annotated data, or unreliable labels.

To this effect, developers employ several techniques including parallel, non-parallel, and hybrid co-learning approaches.

In the parallel co-learning technique, the training data must have similar instances in the modalities involved. For instance, in audio/visual speech recognition, the model must have a corresponding speech signal and a video of a person speaking. When applied correctly, the technique is especially suitable for creating lip-reading models whereby developers transfer information from a speech-recognition neural network.

Conversely, the non-parallel co-learning technique does not require modalities to have direct links between the samples contained in their datasets.

As the name suggests, the hybrid approach bridges modalities through a shared dataset or modality. For instance, if a model is given a parallel dataset containing a collection of images and their corresponding English captions along with another parallel dataset containing French and English documents, the model can apply the hybrid approach to retrieve images from a French caption effectively.


Developers often need to align different modalities to create a coherent representation. Take image captioning tasks, for instance. In such tasks, the model needs to describe an image in natural language.

As such, the model recognizes both the objects and context in the image and then generates a natural language description conveying the meaning of the image. To achieve this, the model needs to align textual and visual modalities to create a coherent representation that captures the relationships between the data.


Fusion, in multimodal models, is the process of combining information from multiple modalities for regression or classification. One of the most common examples of multimodal fusion is multimodal sentiment analysis where the model fuses visual, acoustic, and language modalities to predict the sentiment. [4]

By leveraging multiple modalities, multimodal models are better able to capture complementary information and provide more accurate predictions. What’s even more impressive is the technique can be used even when one of the modalities is missing.

Unfortunately, the fusion technique presents a few challenges, the most notable of which is overfitting. [5] overfitting happens when multiple modalities generalize at different rates, thus making it harder to apply a joint training strategy. The issue can also emanate when the different modalities aren’t temporally aligned, causing them to produce different levels and types of noise at different points in time.

Multimodal fusion typically takes two approaches: model-based and model-agnostic approach. The model-based approach deals with the overall construction of modalities, including their neural networks. Conversely, the model-agnostic approach is applied when the model training techniques employed don’t necessarily depend on a specific machine learning method.

Deep neural networks can learn from a huge amount of data and have optimal end-to-end training capabilities. This has made them particularly useful in data fusion. However, issues related to their predictions’ interpretability pose significant challenges to models utilizing the fusion approach.

Multimodal learning in computer vision

The past decade has seen significant advancements in computer vision technologies. These advancements have been further enhanced by multimodal learning, which combines information from multiple modalities, including text, images, and speech.

This unique approach has enabled significant advancements in several areas, including:

  • Text-to-image generation
  • Visual question answering, and
  • Natural language for visual reasoning

Text-to-image generation

As the name suggests, text-to-image generation involves training models to generate images based on textual descriptions. This way, the model can understand natural language and use the information to generate images that accurately represent the context in the input text.

Take the DALL-E and Stable Diffusion models, for instance. The former, DALL-E, uses a combination of a generative neural network architecture and a transformer-based language model to generate images based on textual descriptions.

DALL-E also utilizes a discrete latent space, which enables it to learn a more controllable and structured representation of generated images.

Similarly, the stable diffusion model architecture uses a diffusion process, which typically involves adding noise to the first generated image iteratively, and then subsequently removing the noise. This unique approach enables models to generate incredibly high-quality images that accurately match the textual prompt.

Visual question answering

VQA is a revolutionary technology that enables models to answer questions based on visual inputs. Deep learning infrastructures and techniques like the transformer architecture have propelled the technology to new heights, allowing a diverse range of applications, including web searches.

Natural language for visual reasoning

Like visual question answering, natural language for question answering (NLVR) involves training models to understand and reason about the visual scenes described in natural language descriptions.

When training a model for NLVR tasks, developers often give a model a textual description of a scene along with two images: One that accurately fits the description and one that doesn’t. As such, the model’s goal is to identify the image that matches the textual description.

The complexities involved in NLVR tasks require models to understand and conceptualize visual information and linguistic structures to make a correct prediction. This can present a wide variety of challenges, including challenges in recognizing objects and their corresponding properties, understanding spatial relations, and understanding the semantics of natural language.

BEiT-3, for example, employs a transformer-based architecture trained on large datasets of natural text and images, including data from Conceptual Captions and ImageNet datasets.

Its complex architecture enables it to handle both visual and natural language information, including information from complex visual scenes and linguistic structures.

Although it has a relatively similar architecture to other transformer-based models like GPT-3 and BERT, BEiT-3 comes with a few modifications, including an encoder and decoder, which process textual and visual inputs to produce an accurate output.

Final thoughts

Multimodal learning is poised to revolutionize how AI models are trained and developed. By combining information from multiple modalities, machine learning models leveraging the technology can get more contextual information from an input, thus enabling them to produce more accurate outputs and predictions.

Despite the numerous challenges experienced in multimodal learning, it’s important to remember that it’s still a new technology that’s evolving at a phenomenal rate.

As researchers and developers work to eliminate some of these bottlenecks, we expect to see the development of more accurate, better-performing models with potential applications spanning multiple industries. Contact Generative AI development company, where innovation meets tailored solutions for your industry.


[1] Shannon.cs. Illinois, Denotation Graph. URL: Accessed on December 27, 2023
[2] DEAM. URL: Accessed on December 27, 2023
[3] Rwth Phoenix Weather 2014. URL: Accessed on December 27, 2023
[4] ABS. URL: Accessed on December 27, 2023
[5] Over lifting and Underlifting in ML. URL: Accessed on December 27, 2023


Generative AI