
When to Run Speech to Text On-Device or On-Premise vs in the Cloud

Antti Ukkonen

Sep 06, 2022

4 min read

When deciding whether to deploy Speech to Text technology On-Device or in the Cloud, you should consider Cost, Speed, and Privacy.

Speech to Text technology can be deployed in the Cloud, On-Device, or On-Premise (on a company's own server or private cloud). Each option has pros and cons that affect the Cost, Speed, and Privacy of the experience you build. In this post, we cover the differences between Cloud, On-Device, and On-Premise Speech to Text deployment, and the scenarios where you should consider ditching the Cloud for an On-Device or On-Premise deployment.

Speech to Text: On-Device vs On-Premise vs Cloud

Whether you run Speech to Text On-Device, On-Premise, or in the Cloud, the core outcome remains the same: Speech to Text enables developers to convert audio to text. Use cases include Transcription for Video Calls and Moderation for Video Game chats, among many others.

Speech to Text can be deployed in multiple ways, and the most common is in the Cloud. In this setup, audio is converted into text with the help of a cloud provider such as Google or Amazon. The audio is captured on the user's device and sent to the cloud, where it is transcribed and the developer's logic decides what to do with the transcription. The result is then sent back to the user's device.

Speech to Text can also be deployed On-Device or On-Premise. Here, Transcription takes place directly on the user's device running the application, or within a company's private server stack or private cloud. The use cases are similar, since at the core there is still the conversion of audio into text, but deploying this way comes with some additional benefits to consider.

Learn more about running Speech to Text On-Device or On-Premise with Cloud-grade performance

When to run Speech to Text On-Device or On-Premise

Running Speech to Text On-Device or On-Premise has three main benefits: Cost, Speed, and Privacy.


Cost

Most Speech to Text or Speech Recognition solutions are Cloud-based products. Running Speech to Text in the Cloud requires sending large amounts of audio over the internet to be processed. For use cases where there is a lot of audio to transcribe, such as a Video Call or Stream, the cost can climb fast, making Speech to Text an unviable feature. Running Speech to Text directly on the user's device or On-Premise can bring the cost down by up to 10x, depending on the provider.
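
To make the cost argument concrete, here is a minimal Python sketch of the arithmetic. It assumes a simple flat price per hour of audio, which is only one possible pricing model. The function names and every number in it are hypothetical placeholders, not real prices from any provider. The ratio is chosen only to illustrate what "up to 10x" would look like.

```python
def monthly_transcription_cost(audio_hours: float, price_per_hour: float) -> float:
    """Cost of transcribing `audio_hours` of audio at a flat per-hour price."""
    if audio_hours < 0 or price_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return audio_hours * price_per_hour


def cost_ratio(cloud_price_per_hour: float, local_price_per_hour: float) -> float:
    """How many times more expensive the cloud price is than the local price."""
    if local_price_per_hour <= 0:
        raise ValueError("local price must be positive")
    return cloud_price_per_hour / local_price_per_hour


# Placeholder inputs for illustration only; substitute your provider's pricing.
audio_hours = 10_000          # hypothetical monthly audio volume
cloud_price = 1.00            # hypothetical cloud price per audio hour
local_price = 0.10            # hypothetical on-device/on-premise price per audio hour

cloud_cost = monthly_transcription_cost(audio_hours, cloud_price)
local_cost = monthly_transcription_cost(audio_hours, local_price)
print(f"cloud: {cloud_cost:.2f}  local: {local_cost:.2f}  "
      f"ratio: {cost_ratio(cloud_price, local_price):.1f}x")
```

Because the cost scales linearly with audio volume in this model, the absolute gap between the two deployments grows with every additional hour of audio, which is why high-volume use cases feel the difference first.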


Speed

Another key pitfall with many cloud-based Speech Recognition providers is the inability to deliver real-time Speech to Text. Even at current network speeds, sending audio to the cloud and getting text back adds a noticeable lag in the majority of Speech to Text products, which greatly disrupts the end user experience. Running Speech to Text On-Device or On-Premise increases the speed of transcription, since the audio never has to leave the end user's device or the company's product ecosystem.


Privacy

The final, but arguably most important, reason to run Speech to Text On-Device or On-Premise is Privacy. We live in a world where consumers' attention to privacy is at an all-time high. Even the idea of technology listening in order to complete tasks like transcription can make people uncomfortable.

Running On-Device or On-Premise allows companies to build experiences that leverage Speech to Text while giving users confidence that their valuable Voice Data remains private, either by never leaving their device or by staying secure with the company delivering the experience.

Speech to Text Accuracy: On-Device vs On-Premise vs Cloud

Speech to Text technology is powered by large Machine Learning models, which has historically made it difficult to deliver the same accuracy On-Device or On-Premise as in the Cloud. Until recently, running Speech to Text anywhere but in the Cloud meant a significant drop in accuracy, because these environments usually required smaller and less sophisticated Speech Recognition models.

However, at Speechly, the Speech to Text models used by the On-Device and On-Premise solutions are the same as the ones used in our Cloud-based offering. This means you can get 95%+ accuracy with Speech to Text Transcription whether you run in the Cloud, On-Device, or On-Premise.

Building On-Device & On-Premise Speech to Text

There are still use cases for Speech to Text technology where a cloud-based deployment makes sense. These scenarios vary, but they usually share one characteristic: a lower overall volume of Voice Data. In other words, there is only a small amount of audio to be transcribed at any given time, such as simple Voice Search inputs on a website.

In high-volume scenarios, such as Transcribing a Video Call or Moderating a Voice Chat in an online game, deploying Speech to Text either On-Device or On-Premise can bring you Cost, Speed, and Privacy benefits. It is important to keep these factors in mind when choosing a Speech to Text technology partner.
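
The rule of thumb in this section can be sketched as a small Python helper. This is only an illustration of the reasoning in this post, not a Speechly API. The `UseCase` fields, the function name, and especially the volume threshold are hypothetical and should be tuned to your own product and pricing.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    audio_hours_per_month: float   # expected Voice Data volume
    needs_real_time: bool          # is noticeable lag unacceptable?
    audio_is_sensitive: bool       # must audio stay on the device or in-house?


def suggest_deployment(use_case: UseCase,
                       high_volume_threshold_hours: float = 1000.0) -> str:
    """Suggest a deployment based on the Cost, Speed, and Privacy factors.

    The default threshold is an arbitrary placeholder, not a recommendation.
    """
    high_volume = use_case.audio_hours_per_month >= high_volume_threshold_hours
    if high_volume or use_case.needs_real_time or use_case.audio_is_sensitive:
        return "on-device or on-premise"
    return "cloud"


# Low-volume Voice Search inputs on a website
print(suggest_deployment(UseCase(5, needs_real_time=False, audio_is_sensitive=False)))

# High-volume Voice Chat moderation in an online game
print(suggest_deployment(UseCase(50_000, needs_real_time=True, audio_is_sensitive=True)))
```

Treating any one of the three factors as sufficient is a deliberate simplification. In practice you would weigh them against each other and against the engineering effort of each deployment.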


Photo by Juairia Islam Shefa on Unsplash
