Speech recognition technology has come a long way in recent years, which has raised interest in deploying on-device solutions as an alternative to cloud-based ones. The main difference is that cloud-based solutions must send the audio over the network to a remote server for processing, whereas on-device implementations process the audio locally, so it never has to travel over the internet to reach expensive computing resources.
This difference has far-reaching implications. You might be surprised that on-device speech recognition accuracy can be comparable to the cloud for many use cases, with the added benefits of improved privacy and lower cost.
Higher Privacy + Lower Cost
The main benefits of on-device speech recognition over cloud-based solutions are privacy and lower cost, especially when very large volumes of audio must be transcribed. If the audio is never uploaded to the cloud, the risk of sensitive information being leaked is substantially reduced. Cloud-based solutions also come with infrastructure costs that can be avoided with an on-device solution.
Additionally, on-device speech recognition doesn't require an internet connection. This can be a major advantage in situations where security policies may prevent public access to the internet, such as factory floors or hospitals.
But how accurate is on-device speech recognition compared to cloud-based solutions? The short answer is that it can be just as accurate, but this depends on the type of device in question.
Is There an Accuracy Tradeoff?
Accuracy in speech recognition is typically measured using a manually transcribed evaluation corpus, which is a collection of recorded speech samples together with their correct transcripts. The most common measure of accuracy is the Word Error Rate (WER), which compares the automatically generated transcript of a recorded sample to the correct reference by counting how many word-level changes (substitutions, deletions, and insertions) are needed to make the two match, and dividing that count by the number of words in the reference. A lower WER indicates a higher level of accuracy.
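To make the WER definition concrete, here is a minimal Python sketch. It assumes transcripts are plain strings that can be split on whitespace, and the function name word_error_rate is ours, chosen for illustration. Real evaluation pipelines usually normalize text first (casing, punctuation, numbers), which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("lights" -> "light")
# against a five-word reference.
wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(f"WER: {wer:.0%}")
```

Because insertions also count as errors, WER can exceed 100% when the generated transcript is much longer than the reference.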
Speech recognition is based on machine learning models that are trained using large amounts of speech data. To make full use of such datasets, the model itself must be large. The size of the model directly affects its accuracy, with larger models generally being more accurate. However, larger models also require more resources, both in terms of processing power and memory usage.
Thus there is a trade-off between accuracy and available resources. Typically cloud-based speech recognition solutions have more resources available, can hence use larger models, and are thus capable of providing high accuracy. But what is the situation with on-device speech recognition?
The answer is that it depends on what type of device one is considering, and whether the device must do other processing while speech recognition is running. Importantly, most modern mobile phones have the resources to run fairly large speech recognition models, especially if the device can focus only on transcription. And if the target device has fewer resources, it is possible to train a custom model that is small enough to fit on the device, without compromising too much on accuracy.
Practical Considerations for On-Device Speech Recognition
The precise speech recognition task may play a role in your choice of solution. If the task is to transcribe a local audio file, e.g. an interview recording, it is desirable that the processing runs faster than real-time, meaning that transcribing a 10-minute recording would take substantially less than 10 minutes. On the other hand, real-time transcription, where the transcript is generated at the same time the user speaks, may require fewer resources from the device as there is less audio to be processed per unit of time.
Consider that a mid-tier Android phone released in 2021 (Samsung A22 5G) is perfectly capable of running Speechly's large, cloud-grade speech recognition model faster than real-time when no other computationally heavy processing is running concurrently. The device can transcribe a 10-minute audio file in about 2-3 minutes. On the other hand, the same device can easily handle real-time speech recognition using the same large model, even if there is a graphics-heavy 3D game running in the foreground. And crucially, using this model, the on-device WER would be exactly the same as the WER of Speechly's cloud solution!
You could argue that the Samsung A22 is a fairly powerful device. However, even a Raspberry Pi 4 is capable of real-time transcription with the same large model, and this consumes only about half of the available CPU resources (2 of its 4 cores).
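"Faster than real-time" is often expressed as a real-time factor (RTF): processing time divided by audio duration. The Python sketch below applies that ratio to the 10-minute file figures quoted above. The term RTF and the function names are our own additions for illustration and are not part of any Speechly API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real-time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_live_audio(rtf: float) -> bool:
    """Real-time transcription only needs the device to keep pace with the speaker."""
    return rtf <= 1.0


audio_seconds = 10 * 60  # the 10-minute file from the example above

# "About 2-3 minutes" of processing, taken as a lower and an upper bound.
for processing_minutes in (2, 3):
    rtf = real_time_factor(processing_minutes * 60, audio_seconds)
    print(f"{processing_minutes} min processing: RTF = {rtf:.2f}, "
          f"keeps up with live audio: {keeps_up_with_live_audio(rtf)}")
```

For file transcription the goal is an RTF well below 1.0, while live transcription only requires staying at or below 1.0. That is why the same device can run the same model in real time with resources to spare.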
Practical Solutions for On-Device Speech Recognition
One place we have been asked to deploy on-device speech recognition is in the video game industry. Users typically have a PC, console, or mobile phone that has plenty of computational power and memory to run a speech recognition model in real time. This saves costs for the game maker, because the audio does not have to be processed on their cloud servers, and it provides the added benefits of greater user privacy and lower latency. If a user does face an issue such as toxic behavior in voice chat, the data can be automatically uploaded to the cloud for use in the moderation investigation.
The accuracy of on-device speech recognition is not really a matter of on-device vs cloud, but more about model size and resource usage. Many devices, especially reasonably modern mobile phones, have sufficient resources to run relatively large models. Therefore, on-device accuracy can be as good as in the cloud! And of course Speechly’s on-device models can be adapted to specific use cases and vocabularies in the same way as our cloud solution.
To learn more about on-device speech recognition, check out our on-device docs or reach out to our product team at any time.