What is Multimodal Deep Learning and What are the Applications?

Thanks to recent advances in deep neural networks, multimodal technologies have made possible advanced, intelligent processing of all kinds of unstructured data, including images, audio, video, PDFs, and 3D meshes. Multimodal deep learning allows for a more holistic understanding of data, as well as increased accuracy and efficiency.

Jina AI is an MLOps platform for building multimodal AI applications in the cloud. Users can turn their data and a few lines of code into a production-ready service without dealing with infrastructure complexity or scaling hassles.

But first, what is multimodal deep learning? And what are its applications?

What does “modal” mean?
The term “modal” refers to the human senses: sight, hearing, touch, taste, and smell. We use it here to mean data modality, that is, the kind of data you’re working with: text, images, video, and so on.

Sometimes people use the terms “multimodal” and “unstructured data” interchangeably, because multimodal data such as images, audio, and free text usually lacks a readily machine-readable structure. The two terms are not synonyms, though. Multimodal data is data that spans multiple modalities, while unstructured data is a catch-all term for any type of data that doesn’t have a readily machine-readable structure.

Real-world data is multimodal
In the early days of AI, research typically focused on one modality at a time. Some works dealt with written language, others with images, or speech. As a result, AI applications were almost always limited to a specific modality: A spam filter works with text. A photo classifier handles images. A speech recognizer deals with audio.

But real-world data is often multimodal. Video is usually accompanied by an audio track and may even have text subtitles. Social media posts, news articles, and any internet-published content routinely mix text with images, videos, and audio recordings. The need to manage and process this data is one factor motivating the development of multimodal AI.

Multimodal vs cross-modal
“Multimodal” and “cross-modal” are two more terms that are often confused with each other, but they don’t mean the same thing:

Multimodal deep learning is a relatively new field that is concerned with algorithms that learn from data of multiple modalities. For example, a human can use both sight and hearing to identify a person or object, and multimodal deep learning is concerned with developing similar abilities for computers.

Cross-modal deep learning is an approach to multimodal deep learning where information from one modality is used to improve performance in another. For example, if you see a picture of a bird, you might be able to identify it by its song when you hear it.

AI systems that are designed to work with multiple modalities are said to be “multimodal”. The term “cross-modal” is more accurate when referring narrowly to AI systems that integrate different modalities and use them together.

Multimodal deep learning applications
Multimodal deep learning has a broad array of potential uses. Among the applications already available:

Automatically generating descriptions of images, such as captions for blind and visually impaired users.
Searching for images that match text queries (e.g. “find me a picture of a blue dog”).
Generative art systems that create images from text descriptions (e.g. “make a picture of a blue dog”).

All these applications rely on two pillar technologies: search and creation.

Neural search
The core idea behind neural search is to leverage state-of-the-art neural network models to build every component of a search system. In short, neural search is deep neural network-powered information retrieval.
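
To make the retrieval step concrete, here is a minimal sketch of how a text query could be matched against images once both have been mapped into the same embedding space. It uses plain NumPy rather than any particular search framework. The file names, the 3-dimensional vectors, and the `normalize` and `search` helpers are all hypothetical and chosen for illustration; a real system would get its embeddings from a trained neural encoder.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def search(query: np.ndarray, index: np.ndarray, top_k: int = 2) -> list:
    """Return (row, cosine similarity) pairs for the top_k rows closest to query."""
    scores = normalize(index) @ normalize(query[None, :])[0]
    best = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in best]


# Hypothetical image embeddings (assumption: made-up 3-d vectors,
# standing in for the output of an image encoder).
image_names = ["blue_dog.jpg", "red_car.jpg", "blue_sky.jpg"]
image_embeddings = np.array([
    [0.9, 0.8, 0.1],
    [0.1, 0.0, 0.9],
    [0.8, 0.1, 0.2],
])

# Hypothetical embedding of the text query "a blue dog"
# (assumption: produced by a text encoder sharing the same space).
query_embedding = np.array([1.0, 0.9, 0.0])

results = search(query_embedding, image_embeddings, top_k=2)
for row, score in results:
    print(f"{image_names[row]}: {score:.3f}")
```

Ranking by cosine similarity is only one common choice. Production systems typically replace the exhaustive comparison with an approximate nearest-neighbor index once the collection grows large.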

Below is an example of an embedding space generated by DocArray and used for content-based image retrieval.

If you want to read more, please check here.