Speechly has introduced a new conformer AI model as an upgrade to our original LSTM models. A key benefit of the Speechly Conformer RNN-Transducer (RNN-T) model is improved computational efficiency, particularly for real-time transcription, where it can reduce compute costs by as much as 50%. The new models also deliver higher accuracy, as measured by a lower word error rate (WER).
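For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. A minimal pure-Python sketch (for illustration only, not Speechly's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed with dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word -> 1/6
```

A lower WER means fewer word-level mistakes per reference word, which is why it is the standard way to compare speech recognition models.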
This week we also released an updated Whisper solution with coverage for 99 languages. Whisper is a transformer-based speech recognition model developed by OpenAI. We have spent months testing and optimizing our Whisper infrastructure to augment customer deployments, and Speechly now offers it as a hosted option that gives our customers additional speech recognition capabilities.
What is a Conformer Transducer Model?
A conformer transducer model is a deep neural network that combines convolutional layers, which capture local acoustic patterns, with transformer-style self-attention, which captures global context. This enables the model to focus on the parts of an audio input that are most relevant to a transcription or other natural language processing task. In particular, these models capture both short- and long-range dependencies in speech, which often improves accuracy. Moreover, unlike transformer-based encoder-decoder architectures such as Whisper, the conformer transducer lends itself naturally to real-time streaming transcription.
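The core idea can be sketched in a few lines of numpy. The following is a deliberately simplified toy (single attention head, an averaging kernel in place of learned convolution weights, no feed-forward modules or normalization), not Speechly's production architecture, but it shows the two complementary operations a conformer block applies to the same sequence of feature frames:

```python
import numpy as np

def self_attention(x):
    # Global context: every frame attends to every other frame.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def depthwise_conv(x, kernel_size=3):
    # Local patterns: each feature channel is convolved along the time axis.
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    kernel = np.ones(kernel_size) / kernel_size   # toy averaging kernel
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + kernel_size] * kernel[:, None]).sum(axis=0)
    return out

def conformer_block(x):
    # Attention captures long-range structure, convolution short-range
    # structure; each is added back via a residual connection.
    x = x + self_attention(x)
    x = x + depthwise_conv(x)
    return x

frames = np.random.randn(50, 16)   # 50 time steps, 16 features per frame
out = conformer_block(frames)
print(out.shape)                   # (50, 16): sequence length is preserved
```

Because the block maps a sequence of frames to a sequence of the same length, blocks can be stacked to form the encoder of a transducer model.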
The benefits of conformer transducer models hold irrespective of the runtime environment. They can be deployed both as part of our cloud/on-premise product and as part of our on-device offering. We can build smaller or larger versions of the same conformer transducer depending on resource availability, with smaller models running more efficiently on constrained devices. In every case, these models offer significant improvements in transcript accuracy over earlier technologies such as LSTMs.
Why Did Speechly Build a Conformer Model?
Speechly researchers developed a conformer model, in part, to improve computational efficiency. Conformer models are generally more efficient than LSTM models because their computations parallelize well: they can process all frames of an input sequence at once on hardware such as GPUs, rather than stepping through the sequence one frame at a time. This allows them to perform more computations simultaneously and speeds up both training and inference.
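The difference between the two computation patterns is easy to see in a toy numpy sketch (arbitrary shapes and weights, not actual model code). A recurrent layer must walk through time step by step, because each state depends on the previous one, while an attention- or convolution-style layer covers every time step with a single batched matrix product:

```python
import numpy as np

T, d = 200, 32                      # 200 time steps, 32 features
x = np.random.randn(T, d)
W = np.random.randn(d, d) * 0.1     # shared toy weight matrix

# Recurrent pattern: an inherently sequential loop over time steps.
h = np.zeros(d)
recurrent_states = []
for t in range(T):
    h = np.tanh(x[t] + h @ W)       # step t needs the state from step t-1
    recurrent_states.append(h)

# Parallel pattern: one matrix product handles every time step at once,
# which maps directly onto GPU-style batched hardware.
parallel_out = np.tanh(x @ W)

print(len(recurrent_states), parallel_out.shape)   # 200 (200, 32)
```

The sequential loop cannot be parallelized across time because of the state dependency, whereas the single matmul exposes all T steps to the hardware simultaneously; this is the structural reason conformer training and inference scale better than LSTMs.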
The benefits of custom speech recognition models are widely recognized, and a conformer model makes them cheaper to realize: training is faster and less costly, and live streaming inference is more computationally efficient with lower latency. LSTM models, by contrast, generally require more memory and more parameters to achieve the same level of accuracy.
What Is Whisper, and Why Does Speechly Offer It?
Whisper is a transformer-based encoder-decoder model developed by OpenAI for speech recognition and transcription tasks, with the added benefit of translating speech into English when needed. OpenAI introduced Whisper in late 2022, and Speechly immediately began testing it for a variety of tasks. It has some limitations compared to customized AI models, but it performs many tasks on par with leading cloud speech recognition solutions at a far more attractive price point.
However, Speechly also learned during our testing and deployment that setting up and managing Whisper infrastructure can be a complex undertaking. Given its useful features and these challenges in deployment and operations, Speechly decided to offer Whisper as a supplement to our existing on-device, on-premise, and cloud models. Whisper today is only available for cloud or on-prem deployment.
Try Speechly’s New Models Today
Both the new Conformer RNN-T and Whisper models are available today from Speechly’s dashboard. You can learn more in Speechly’s documentation, or feel free to ask us a question anytime.
Speechly continues to invest heavily in research and development to improve accuracy, latency, and cost efficiency. We look forward to hearing your feedback on the new models and continuing our research to update, refine, and enhance our speech recognition products.