
company news

Speechly Introduces New Conformer Speech Recognition Model and Expanded Whisper Offering

Antti Ukkonen

Feb 17, 2023

3 min read

This week the Speechly team released two product updates: a new Conformer AI model that updates our original LSTM models, and an expanded Whisper solution with coverage for 99 languages.


Speechly has introduced a new conformer AI model as an update to our original LSTM models. A key benefit of the Speechly Conformer RNN-Transducer model is improved computational efficiency. This is particularly true for real-time transcription, where it can save as much as 50% in computational resources. In addition, Speechly’s new models can achieve these benefits along with higher accuracy as measured by a lower word error rate (WER). 

This week we also released an updated Whisper solution with coverage for 99 languages. Whisper is a transformer-based speech recognition model developed by OpenAI. We have been testing and optimizing our Whisper infrastructure for months to augment customer deployments. Speechly is now offering this as a hosted option that gives our customers additional speech recognition capabilities.

What is a Conformer Transducer Model?

A conformer transducer model is a type of deep neural network that combines aspects of convolutional neural networks and transformer models. This lets the model focus on the parts of an audio input that are most relevant to transcription or another natural language processing task. In particular, these models capture both short- and long-term dependencies in speech, which often improves accuracy: convolution handles local acoustic patterns, while self-attention handles relationships across the whole utterance. Moreover, unlike transformer-based encoder-decoder architectures such as Whisper, the conformer transducer naturally lends itself to real-time streaming transcription.
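To make the architecture concrete, here is a minimal sketch of a conformer encoder built with torchaudio. It is illustrative only, not Speechly's production model, and the hyperparameters are placeholder values chosen for the example.

```python
# Illustrative sketch: a small Conformer encoder via torchaudio, not Speechly's model.
# Hyperparameters below are placeholder values.
import torch
import torchaudio

encoder = torchaudio.models.Conformer(
    input_dim=80,                    # e.g. 80-dim log-mel filterbank features
    num_heads=4,                     # self-attention heads capture long-range context
    ffn_dim=256,                     # feed-forward layer size inside each block
    num_layers=4,                    # stacked conformer blocks
    depthwise_conv_kernel_size=31,   # depthwise convolution captures local, short-range context
)

# A batch of 2 utterances, up to 300 frames of 80-dim features each.
features = torch.rand(2, 300, 80)
lengths = torch.tensor([300, 240])

# Each conformer block mixes convolution (local patterns) with
# self-attention (global dependencies), the combination described above.
encoded, encoded_lengths = encoder(features, lengths)
print(encoded.shape)  # torch.Size([2, 300, 80])
```

In a full RNN-Transducer, this encoder output is combined with a prediction network and a joint network so that text tokens can be emitted incrementally as audio streams in, which is what makes the architecture a natural fit for real-time transcription.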

The benefits of conformer transducer models are realized irrespective of the runtime environment. They can be deployed both as part of our cloud/on-premise product and as part of our on-device offering. We can build smaller or larger versions of the same conformer transducer depending on resource availability, with the smaller models being more efficient on smaller devices. In either case, these models offer significant performance improvements over earlier technologies, such as LSTMs, in terms of transcript accuracy.

Why Did Speechly Build a Conformer Model?

Speechly researchers developed a conformer model, in part, to improve computational efficiency. Conformer models are generally more computationally efficient than LSTM models thanks to parallelization and efficient handling of variable-length input sequences. Unlike LSTMs, which process audio one frame after another, conformer models can process input sequences in parallel across multiple computation units, such as GPUs. This allows them to perform more computations simultaneously and speeds up both training and inference.

The benefits of custom speech recognition models are widely recognized. Using a conformer model accelerates training and reduces training costs. Live streaming inference for speech recognition is also more computationally efficient with a conformer model and has lower latency. LSTM models generally require more memory and more parameters to reach the same level of accuracy.

What is Whisper, and Why Does Speechly Offer it?

Whisper is a transformer-based speech recognition model developed for transcription tasks. It has the added benefit of translating speech into English when needed. OpenAI introduced Whisper in late 2022, and Speechly immediately began testing it for a variety of tasks. It has some limitations compared to customized AI models, but it performs many tasks on par with leading cloud speech recognition solutions at a far more attractive price point.
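As a quick illustration of what the model does, the sketch below uses OpenAI's open-source whisper Python package directly rather than Speechly's hosted offering; the audio file name is a placeholder.

```python
# Illustrative sketch using the open-source "openai-whisper" package,
# not Speechly's hosted Whisper option. "meeting.mp3" is a placeholder file.
import whisper

model = whisper.load_model("base")

# Transcribe in the spoken language (Whisper detects the language automatically).
result = model.transcribe("meeting.mp3")
print(result["language"], result["text"])

# Or translate non-English speech into English text.
translated = model.transcribe("meeting.mp3", task="translate")
print(translated["text"])
```

Running this locally is straightforward for experiments; operating it at production scale, with GPUs, batching, and monitoring, is where the hosted option described below comes in.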

However, Speechly also learned during our testing and deployments that setting up and managing Whisper infrastructure can be a complex undertaking. Given its useful features and these challenges in deployment and operations, Speechly decided to offer Whisper as a supplement to our existing on-device, on-premise, and cloud models. Today, Whisper is available only for cloud or on-premise deployment.

Try Speechly’s New Models Today

Both the new Conformer RNN-T and Whisper models are available today from Speechly’s dashboard. If you have any questions, you can learn more in Speechly’s documentation, or feel free to ask us a question anytime. 

Speechly continues to invest heavily in research and development to improve accuracy, latency, and cost efficiency. We look forward to hearing your feedback on the new models and continuing our research to update, refine, and enhance our speech recognition products.
