Conformer-2: Revolutionizing Speech Recognition
Conformer-2 is an AI model for automatic speech recognition. It builds on its predecessor, Conformer-1, with improvements in transcribing proper nouns and alphanumerics and in robustness to noise.
Overview
Conformer-2 was trained on 1.1M hours of English audio data, and the size of this dataset is a significant factor in its capabilities. It extends the work of Conformer-1 and improves on it in three areas: proper nouns, alphanumerics, and robustness to noise. Specifically, it achieves a 31.7% improvement on alphanumerics, a 6.8% improvement on Proper Noun Error Rate, and a 12.0% improvement in robustness to noise.
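To make the percentages above concrete, here is a minimal sketch of how such an improvement is typically computed. It assumes the figures are relative reductions in an error rate, which the article does not state explicitly. The function name and the error rates below are hypothetical and chosen only to show the arithmetic.

```python
def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error rate relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, not measurements of either Conformer model.
baseline = 0.200
improved = 0.150
print(f"{relative_improvement(baseline, improved):.1%}")  # 25.0%
```

Under this reading, a 31.7% improvement on alphanumerics would mean the new error rate is about 68.3% of the old one, not that accuracy rose by 31.7 percentage points.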
Names and numbers are common failure points for speech recognition models, so Conformer-2's improvements in these areas make it a more reliable choice. In real-world scenarios such as transcribing podcasts or call center conversations, these gains translate into more consistent and accurate transcripts.
Core Features
A key feature of Conformer-2 is its use of model ensembling. Conformer-1 used noisy student-teacher training with a single "teacher" model; Conformer-2 instead uses multiple strong teacher models to produce labels. Because the labels no longer depend on the mistakes of any one teacher, the resulting model is more robust, handles a wider range of data, and is less likely to fail in unseen situations.
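The article does not say how the teachers' outputs are combined, so the following is only an illustrative sketch of one simple way to turn several teacher transcripts into a single label: pick the transcript that is closest, in word-level edit distance, to all the others. The function names and the sample transcripts are hypothetical, and this is not presented as the method actually used for Conformer-2.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def consensus_label(teacher_transcripts):
    """Return the transcript with the smallest total distance to the others."""
    tokenized = [t.lower().split() for t in teacher_transcripts]
    totals = [
        sum(edit_distance(hyp, other) for other in tokenized)
        for hyp in tokenized
    ]
    best = min(range(len(totals)), key=totals.__getitem__)
    return teacher_transcripts[best]


# Hypothetical outputs from three teacher models for the same audio clip.
teachers = [
    "call flight ua 482 at gate b12",
    "call flight ua 482 at gate b12",
    "call flight you a 482 at gate be 12",
]
print(consensus_label(teachers))
```

Here the third teacher's mistakes on the alphanumerics are outvoted by the two teachers that agree, which is the intuition behind using several teachers instead of one.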
Another notable aspect is the scaling of both data and model parameters. Inspired by research showing that large language models are often undertrained for their size, Conformer-2 increases the model size to 450M parameters and the training set to 1.1M hours of audio. This scaling contributes to its better overall performance.
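As a rough illustration of the data-to-parameter balance, the two figures stated above can be combined into a single ratio. This ratio is not a metric the article reports. It is only a back-of-the-envelope calculation, and the function name is hypothetical.

```python
def audio_seconds_per_parameter(audio_hours, parameters):
    """Seconds of training audio available per model parameter."""
    return audio_hours * 3600 / parameters


AUDIO_HOURS = 1.1e6  # training audio, as stated in the article
PARAMETERS = 450e6   # model size, as stated in the article

ratio = audio_seconds_per_parameter(AUDIO_HOURS, PARAMETERS)
print(f"{ratio:.1f} seconds of audio per parameter")  # 8.8 seconds of audio per parameter
```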
Basic Usage
Conformer-2 is straightforward to try. In the Playground, upload a file or enter a YouTube link to get a transcription in a few clicks. To integrate it into your product, reach out to the sales team for more details. The API also offers a new parameter, speech_threshold, which sets a threshold for the proportion of speech an audio file must contain to be processed; this helps control costs for files that contain little speech.
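As a sketch of how speech_threshold might be passed to the API, the snippet below builds a transcription request without sending it. Only the parameter name speech_threshold comes from the article. The endpoint URL, the audio_url field, the Authorization header, and the assumption that the threshold is a proportion between 0 and 1 are all placeholders to be checked against the API reference.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL from the provider's API reference.
API_URL = "https://api.example.com/v2/transcript"


def build_transcript_request(audio_url, api_key, speech_threshold=None):
    """Build (but do not send) a transcription request.

    Assumes a JSON body with an `audio_url` field and an optional
    `speech_threshold` field expressed as a proportion in [0, 1].
    """
    payload = {"audio_url": audio_url}
    if speech_threshold is not None:
        if not 0.0 <= speech_threshold <= 1.0:
            raise ValueError("speech_threshold must be between 0 and 1")
        payload["speech_threshold"] = speech_threshold
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_transcript_request(
    "https://example.com/audio.mp3", "YOUR_API_KEY", speech_threshold=0.5
)
print(request.get_method(), json.loads(request.data))
```

Validating the threshold on the client side keeps an out-of-range value from reaching the API. To actually submit the request, pass it to urllib.request.urlopen once the real endpoint and credentials are in place.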