Meta Unveils Voicebox: Cutting-Edge AI Model Revolutionizes Speech Generation and Audio Editing

USA: Meta has unveiled Voicebox, an AI model for speech generation tasks such as editing, sampling, and stylizing. It can generate high-quality audio clips and edit pre-recorded audio, for example by removing unwanted noise while preserving the style of the original recording. Voicebox is multilingual and can produce speech in six languages.

Similar to generative systems for images and text, Voicebox generates outputs in a wide range of styles. However, instead of creating pictures or written passages, it focuses on producing exceptional audio clips. This AI tool can either generate outputs from scratch or modify existing samples provided to it.

Voicebox can handle a range of speech-related tasks, including speech synthesis, audio editing, noise removal, diverse sample generation, and style conversion. What sets it apart is its approach to learning, which relies solely on raw audio and the accompanying transcriptions. It is built on a technique called Flow Matching, which Meta reports outperforms diffusion models.

In terms of performance, Voicebox surpasses other models such as VALL-E and YourTTS. In zero-shot text-to-speech, Voicebox outperforms VALL-E, the current state-of-the-art English model, on both intelligibility (a word error rate of 1.9% versus VALL-E's 5.9%) and audio similarity (0.681 versus 0.580), while running up to 20 times faster.

Additionally, Voicebox surpasses YourTTS for cross-lingual style transfer, reducing the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
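For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript of the generated speech into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition and the size of the reductions quoted above; it is not Meta's evaluation code, and the function names and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical example: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

# Reductions implied by the figures quoted in the article.
print(f"vs. VALL-E:  {relative_reduction(5.9, 1.9):.0%}")
print(f"vs. YourTTS: {relative_reduction(10.9, 5.2):.0%}")
```

A lower word error rate means a speech recognizer transcribes the generated audio more accurately, which is why it serves as a proxy for intelligibility.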

Meta trained the model on more than 50,000 hours of recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese, the six languages in which Voicebox can synthesize speech. During training, the model learns to predict a speech segment given the surrounding speech and the corresponding transcript.

One notable feature of Voicebox is its ability to infill speech from context, enabling it to generate segments within an audio recording without recreating the entire input. It can also replicate the style of a given audio sample for text-to-speech generation.

The applications of Voicebox are numerous and promising. This multipurpose generative AI model could provide natural-sounding voices for future virtual assistants or non-player characters in the Metaverse. It has the potential to simplify audio track editing for content creators, allow individuals to speak foreign languages using their own voice, and enable visually impaired people to have written messages read aloud in the voices of their friends through AI technology.

Despite these possibilities, Voicebox is not currently available to the general public. Meta has shared only audio samples and a research paper outlining the methodology and results. The company says it is taking this cautious approach because of the risk of misuse that releasing the model or its code would carry.
