Core AI: Multi-modal AI & applications
How multi-modal AI enables multiple new applications, market landscape & industry predictions
In the next posts, we'll delve into Core AI topics, starting with multi-modal AI:
what it is
current market landscape
main applications
industry adoption predictions
Multi-modal AI
Foundation models (the core models behind AI development) can take the following data modalities as inputs and/or outputs:
📝 Text
🖼️ Image
</> Code
🎥 Video
🎵 Sound
💬 Speech
📦 3D
🤖 Robot state
Multi-modal AI models can accept or produce multiple data modalities beyond text (e.g., images or video) as inputs or outputs.
Common combinations are:
📝 + 🖼️ → 📝 : Multi-modal LLMs
📝 + 🖼️ + 🤖 → 📝 : Multi-modal LLMs for robotics
📝 → </> : Text to Code
📝 → 🖼️ : Text to Image
📝 → 🎥 : Text to Video
📝 → 🎵 : Text to Sound
📝 → 💬 : Text to Speech
🖼️ → 📦 : Image to 3D
📝 → 📦 : Text to 3D
Current market landscape & trends
The "better, faster, cheaper" wave opens the floodgates for application development, and multi-modal AI plays a crucial role in expanding the use cases and applications. In fact, recent studies have shown that technology can achieve human-level performance in some capabilities 40 years faster than previously expected (Sources: McKinsey & Company, Sequoia Capital).
Let's break down each foundation model type, its main applications, and its inputs/outputs. Some applications below still have a lot of room for improvement and are tagged [research phase].
Main applications
📝 → 📝 : Large Language Models
Large Language Models (LLMs) have commonly been used to write content, power chatbots and assistants, simplify search, and perform analysis or synthesis. The main data modality for LLMs is text.
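As a concrete illustration, here is a minimal sketch of calling a hosted LLM as an assistant through the OpenAI Python SDK; the model name and prompts are illustrative placeholders, and any comparable LLM API would work similarly.

```python
# Minimal sketch: using a hosted LLM as an assistant (OpenAI Python SDK v1).
# Assumes the OPENAI_API_KEY environment variable is set; model name is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; swap in whichever LLM you have access to
    messages=[
        {"role": "system", "content": "You are a concise research assistant."},
        {"role": "user", "content": "Summarize the main applications of multi-modal AI in three bullets."},
    ],
)
print(response.choices[0].message.content)
```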
📝 + 🖼️ → 📝 : Multi-modal LLMs
The latest LLMs, such as GPT-4, are multi-modal AI models that outperform their predecessors, especially in reasoning capabilities.
LINGO-1 is a multi-modal AI model that provides commentary on the driver's behavior or the driving scene.
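To make the text + image → text pattern concrete, here is a hedged sketch of sending an image alongside a question to a vision-capable GPT-4 variant via the OpenAI Python SDK; the model identifier and image URL are illustrative placeholders and change over time.

```python
# Sketch: text + image in, text out, via a multi-modal LLM (OpenAI Python SDK v1).
# The model name and image URL below are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative; vision-capable model names evolve
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the driving scene in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/dashcam_frame.jpg"}},
            ],
        }
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```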
📝 + 🖼️ + 🤖 → 📝 : Multi-modal LLMs for robotics [research phase]
PaLM-E is a foundation model trained on images, text and robotic state data. It can control a robotic arm in real time.
📝 → 🎵 : Text to Sound [research phase]
New models from Google, Meta and the open source community significantly advance the quality of controllable music generation.
MusicLM (Google), samples: https://google-research.github.io/seanet/musiclm/examples/
MusicGen (Meta), samples: https://ai.honu.io/papers/musicgen/
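MusicGen ships with Meta's open-source audiocraft library, so a text-to-sound call can be sketched locally; the model size, prompt, and output settings below are illustrative choices, and the exact API may shift between audiocraft releases.

```python
# Sketch: text-to-music with MusicGen via Meta's audiocraft library (pip install audiocraft).
# Model size, prompt, and duration are illustrative choices.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate

wav = model.generate(["lo-fi beat with warm piano and soft drums"])  # batch of one prompt

# Write the first (and only) sample to disk with loudness normalization.
audio_write("musicgen_sample", wav[0].cpu(), model.sample_rate, strategy="loudness")
```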
📝 → 🖼️ : Text to Image
There are lots of applications here and many commercially available models, such as Dall-E 3 (a multi-modal AI model) and the Imagen model family. These models are already integrated into multiple products, for example Imagen's integration in Google Cloud Vertex AI and Google Slides, and Dall-E 3's availability to ChatGPT Plus and enterprise customers.
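Since Dall-E 3 is also exposed through the OpenAI API, a text-to-image request can be sketched as below; the prompt and size parameters are illustrative, and this is a minimal sketch rather than the product integrations described above.

```python
# Sketch: text-to-image with Dall-E 3 via the OpenAI Images API (Python SDK v1).
# Prompt and size are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="An isometric illustration of a multi-modal AI pipeline, flat pastel style",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```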
This year, we have also seen new methods enabling co-pilot-style image generation and editing, such as products by Genmo AI that combine an image-generation co-pilot with text-guided semantic editing.
📝 → 🎥 : Text to Video [research phase]
The race in text-to-video model development continues, with NVIDIA demonstrating high-resolution video generation (up to 1280 x 2048!).
Other text-to-video foundation models include Phenaki (by Google) and Make-a-Video (by Meta).
📝 → </> : Text to Code
GPT-4 Code Interpreter is the leading foundation model here. Text-to-code use cases include generating code from text prompts, completing code, and even debugging.
There are also other research developments and recently open-sourced models, such as Code Llama, that outperform GPT-4 on specific benchmark datasets.
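Because Code Llama is open-sourced, a text-to-code completion can be sketched with the Hugging Face transformers library; the checkpoint name and prompt are illustrative, and even the 7B variant needs a sizeable GPU or patience on CPU.

```python
# Sketch: code completion with Code Llama via Hugging Face transformers.
# Checkpoint and prompt are illustrative; larger variants exist (13B, 34B).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```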
📝 → 💬 : Text to Speech [research phase]
SeamlessM4T is a multi-modal AI model used for translation and transcription. The model supports nearly 100 languages for input (speech + text), 100 languages for text output and 35 languages (plus English) for speech output. However, it is only available for non-commercial use.
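As a rough sketch, SeamlessM4T checkpoints have been integrated into the Hugging Face transformers library; the class name, checkpoint id, and language codes below are assumptions based on that integration and may differ between releases, so check the transformers documentation.

```python
# Sketch: text-to-speech translation with SeamlessM4T via Hugging Face transformers.
# Class name, checkpoint id, and language codes are assumptions; verify against the docs.
from transformers import AutoProcessor, SeamlessM4TModel

checkpoint = "facebook/hf-seamless-m4t-medium"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4TModel.from_pretrained(checkpoint)

inputs = processor(text="Multi-modal AI is expanding quickly.", src_lang="eng", return_tensors="pt")
audio = model.generate(**inputs, tgt_lang="fra")[0]  # French speech as a waveform tensor
print(audio.shape)
```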
📝 → 📦 : Text to 3D [research phase]
3D is a particularly challenging domain for generative AI models. Compared to image or even video data, 3D datasets are scarce. Additionally, 3D generation involves not only shape but also other aspects such as texture and orientation, which are hard to capture in a text representation. Therefore, image-to-3D and text-to-3D use cases are still in the research stage.
An example is Shap-E (by OpenAI), which can generate photorealistic and highly detailed 3D objects directly from short written descriptions.
There is significant potential with further development of these models especially for gaming and AR/VR use cases.
Industry adoption predictions
The recent and ongoing foundation model developments enable an array of opportunities across multiple industries and business functions. According to a recent study by McKinsey & Company, high tech and banking will see the most significant impact among industries, while marketing and sales, customer operations, and product/R&D will be affected the most among business functions.
Adoption rates will vary depending on, among other factors, the scale of an industry's revenue, its agility in software development, and its investment in personnel training and AI skills development.
Given the rapid development of AI models, we will need to closely track new advances to stay up to date and to identify the highest-ROI opportunities to bring to market.