AI usage has surged in recent years, with millions of users signing up for platforms like ChatGPT. We have come a long way from the room-sized computers of the 1960s that struggled with basic arithmetic: today, advanced AI models can process high-resolution images on a smartphone. Despite these advances, however, we have not yet reached anything like super-intelligence in computer systems.
How do we solve this problem?
One answer is the multimodal approach: train a single model to perform multiple tasks at once. This improves operational scalability and lowers the implementation cost of intelligent systems. The trade-off is inefficiency during training, since the model must handle a diverse range of data and balance a separate loss function for each task. To overcome this, such models are trained on extensive amounts of data for hundreds of hours, so that they can learn the intricate patterns across all of the tasks.
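To make the multitask idea concrete, here is a minimal, hypothetical PyTorch sketch: a shared encoder feeds one head per task, and the per-task losses are combined into a single weighted objective. All names, dimensions, and weights are illustrative, not Meta's implementation.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Toy multitask model: one shared encoder feeding one head per task."""
    def __init__(self, input_dim=80, hidden_dim=256, vocab_size=1000, num_speakers=50):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.text_head = nn.Linear(hidden_dim, vocab_size)       # e.g. speech-to-text
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)  # e.g. speaker ID

    def forward(self, features):
        hidden, _ = self.encoder(features)   # (batch, frames, hidden_dim)
        pooled = hidden.mean(dim=1)          # crude utterance-level pooling
        return self.text_head(hidden), self.speaker_head(pooled)

model = MultiTaskModel()
features = torch.randn(4, 120, 80)           # fake batch: (batch, frames, mel bins)
token_logits, speaker_logits = model(features)

# One loss per task, combined into a single training objective.
token_targets = torch.randint(0, 1000, (4, 120))
speaker_targets = torch.randint(0, 50, (4,))
ce = nn.CrossEntropyLoss()
loss = ce(token_logits.reshape(-1, 1000), token_targets.reshape(-1)) \
     + 0.5 * ce(speaker_logits, speaker_targets)  # per-task weights are tuning knobs
loss.backward()
```

Balancing those per-task weights is exactly where the inefficiency mentioned above creeps in: tune them badly and one task starves the others.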
What is new this time?
Meta has been doing consistent research on content moderation, video tagging, speech-to-text, and language translation for years now. SeamlessM4T is their recent launch: a single multimodal AI model that covers automatic speech recognition, speech-to-text translation, text-to-speech translation, and speech-to-speech translation across up to 100 languages.
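If you want to try it yourself, here is a hedged sketch based on the usage examples Meta published alongside their seamless_communication package. Treat the model card names, task strings, and the predict signature as assumptions; they may have changed since launch.

```python
import torch
from seamless_communication.models.inference import Translator

# Model and vocoder card names follow the launch release of the
# seamless_communication package; they may differ in newer versions.
translator = Translator(
    "seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=torch.device("cuda:0"),
)

# Speech-to-speech translation: English audio in, French text + waveform out.
translated_text, wav, sample_rate = translator.predict(
    "input_english.wav",  # hypothetical input file
    "s2st",               # task string: speech-to-speech translation
    "fra",                # target language
)

# The same model also handles text-to-text translation.
translated_text, _, _ = translator.predict(
    "Hello, world!", "t2tt", "fra", src_lang="eng"
)
print(translated_text)
```

The notable design choice is that one model and one call signature serve every task; only the task string and the input modality change.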
Not just that: Meta has also open-sourced the dataset behind this state-of-the-art model. SeamlessAlign is the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments.
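"Mined" here means automatically pairing speech segments with matching text across languages by embedding both into a shared space and keeping the closest matches. The snippet below is a simplified, hypothetical illustration of that idea with placeholder embeddings, not Meta's actual pipeline.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in the real pipeline these come from a joint
# speech/text embedding model trained to map both modalities into one space.
speech_emb = F.normalize(torch.randn(1000, 512), dim=1)  # 1000 speech segments
text_emb = F.normalize(torch.randn(2000, 512), dim=1)    # 2000 text sentences

# For each speech segment, find the closest text sentence in the shared
# space and keep only pairs whose similarity clears a threshold.
scores = speech_emb @ text_emb.T         # cosine similarities (unit vectors)
best_score, best_idx = scores.max(dim=1)
keep = best_score > 0.6                  # threshold chosen arbitrarily here
pairs = list(zip(torch.nonzero(keep).flatten().tolist(),
                 best_idx[keep].tolist()))
print(f"mined {len(pairs)} candidate speech-text pairs")
```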
Meta still regards this work as only a big first step, given that existing speech-to-speech and speech-to-text systems cover only a small fraction of the world's languages.
Do check out the cool demo video provided by the Meta research team here.
Related previous work from Meta
