voice tech

Overlooking Great Voice Technology Use Cases

Collin Borns

Mar 23, 2022

5 min read

Modern Voice APIs and Artificial Intelligence (AI) have created new ways for voice technology to enhance user and employee experiences that go beyond device control alone.

When the Voice Tech industry discusses use cases for Voice Technology, the primary focus is on using your voice to control some sort of device. However, this focus neglects one essential question: is this where the best use cases for Voice Technology exist today? In reality, there are many use cases beyond simply controlling the devices around us. They can be found wherever conversations are taking place, such as video conferencing, online multiplayer game voice chats, and customer service call centers, yet they are not discussed nearly as much. Why is this the case, and why should the Voice Tech industry be paying closer attention to Human to Human Interaction (HHI) scenarios? I will answer both of those often overlooked questions and describe use cases I believe the Voice Tech industry should be paying more attention to.

Why the Voice Tech industry has been primarily focused on using Voice to control devices

When you think of Voice Technology, the first thing that comes to mind is probably one of the popular Big Tech Voice Assistants, such as Apple's Siri, Amazon's Alexa, or Google Assistant. This is not a surprise. Before the introduction of Siri and later Alexa, most Voice Tech conversations were taking place in R&D labs rather than in the real world. In fact, it wasn't until 2018 that Microsoft reached Human Parity with Automatic Speech Recognition (ASR) and the following year that Microsoft also reached Human Parity with Natural Language Processing (NLP).

As Voice Technology started to show more promise in the mid-2010s, Big Tech companies also invested massively in their respective Voice Assistant platforms. For example, as of 2018 Amazon had over 10,000 people working on Alexa alone and had also stood up a $200M venture fund to spur investment in companies building supporting services and applications for the platform. Amazon is arguably the Big Tech company least afraid to start new projects and then shut them down, as it did with the Fire Phone, yet companies like Amazon have persisted in investing in their Voice Assistant platforms. This is despite a stall in adoption across the Voice Assistant market.

When you look specifically at the sheer investment into Voice Assistant experiences by the biggest technology companies in the world, it's easy to understand how they control the majority of mindshare on "the best" use cases for Voice Technology. The question then arises: has this caused the wider Voice Technology industry to overlook great use cases for the underlying Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) technologies that help power Voice Assistants? I believe it has, particularly within the context of HHI scenarios.

Why should we pay attention to where Voice Technology can add value in Human to Human Interactions?

Humans have been interacting and transacting online since the early days of the internet. With the recent COVID-19 pandemic, however, we saw a massive acceleration in digital transformation across all industries. Satya Nadella, the CEO of Microsoft, said, "We've seen two years' worth of digital transformation in two months. From remote teamwork and learning, to sales and customer service, to critical cloud infrastructure and security…" Alongside the explosion of remote work and services like video conferencing, we also saw big changes on the consumer side with the rise of a new space called Social Audio, pioneered by Clubhouse and quickly followed by Big Tech companies like Twitter, Facebook and Spotify.

The digital transformation scenarios mentioned above share one relevant attribute: Human to Human communication taking place in a digital world. With this comes a massive amount of audio data that presents both challenges and opportunities for the developers of these experiences.

Challenges and Use Cases for Voice Tech in an Audio Rich World

On Device Transcription

First, I want to start with a use case that has been a mainstay for ASR technology but has faced some unique challenges as the world has gone more remote: Transcription. Traditionally, ASR for transcription has been run in the cloud. However, as usage of experiences that leverage ASR has continued to grow, so have the associated costs, making the cloud less attractive or even unviable from a business perspective. One solution to this cost problem is running ASR on-device rather than in the cloud. While that is easy to write down in a strategy deck, the technical challenges of running ASR on-device are less straightforward. Solving them, though, brings a benefit beyond cost alone: faster transcription.


The next challenge I would like to address is moderation. Whether you look at newer domains where conversations are taking place, such as Social Audio or Virtual Reality, or more established domains like Social Media or Online Gaming, moderation has always been a hot topic. Even Microsoft has had to shut down part of its "Social Metaverse" due to an inability to properly moderate the conversations that were taking place.

This is a problem that cannot be solved simply by adding more human moderation, for two distinct reasons. First, humans come with bias, and history has shown that it is nearly impossible to eliminate that bias from moderation decisions, no matter what you do. Second, there is the sheer scale of audio data that needs to be monitored. Although apps like Clubhouse have seen a decline in downloads over the last few months, they have created a new category altogether, one that has captured the attention of Big Tech as at least a feature of their platforms, and they saw the number of rooms created daily grow to 700k. Real-time ASR and NLU technologies are a great tool for augmenting human moderation, specifically by calling out things like hate speech, harassment, and profanity in audio data.
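To make the idea of augmenting human moderators concrete, here is a minimal sketch in Python. It assumes an ASR system has already turned audio into text segments, and it stands in for the NLU step with a simple word list. The names `Segment`, `Flag`, `FLAGGED_TERMS`, `flag_segment`, and `review_queue` are hypothetical and chosen only for illustration. They are not part of any real product or API, and a production system would use a trained classifier rather than keyword matching.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical placeholder terms per category. A real system would rely on
# a trained NLU model; this word list only illustrates the flow.
FLAGGED_TERMS: Dict[str, Set[str]] = {
    "profanity": {"heck", "darn"},
    "harassment": {"loser"},
}


@dataclass
class Segment:
    """One piece of transcribed speech (assumed to come from an ASR system)."""
    speaker: str
    text: str


@dataclass
class Flag:
    """A segment routed to a human moderator, with the matched category."""
    speaker: str
    category: str
    text: str


def flag_segment(segment: Segment) -> List[Flag]:
    """Return one Flag per category whose terms appear in the segment."""
    words = set(re.findall(r"[a-z']+", segment.text.lower()))
    return [
        Flag(segment.speaker, category, segment.text)
        for category, terms in FLAGGED_TERMS.items()
        if words & terms
    ]


def review_queue(segments: List[Segment]) -> List[Flag]:
    """Collect flags for human review; the code never takes action itself."""
    queue: List[Flag] = []
    for segment in segments:
        queue.extend(flag_segment(segment))
    return queue
```

The design choice worth noting is that the code only builds a review queue. The final decision stays with a human moderator, so the technology narrows down what people need to listen to rather than replacing them.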


The final challenge I would like to discuss can be generally described as Assistance. Specifically, I am referring to Customer Support and Sales scenarios. These domains have also seen a rapid rise in the amount of conversational data that is captured as more of our professional work and everyday consumer habits have continued to shift into a more digital and remote world.

In both Customer Support and Sales, voice technology has been a tool for post-call analysis, used to train agents to perform better on future calls. However, advancements in voice technology now let us take the next step and surface these insights in real time rather than after a conversation has already ended. The benefit here is clear: agents can learn during their calls while also having an assistance tool to better serve customers and prospects.
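The difference between post-call analysis and real-time assistance can be sketched in a few lines of Python. The example assumes transcript segments arrive one at a time from a streaming ASR system, and it uses a hypothetical cue-to-hint table (`HINTS`) in place of a real NLU model. The function name `live_hints` and the hint texts are illustrative assumptions, not any vendor's API.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical cue phrases mapped to suggestions for the agent.
HINTS: Dict[str, str] = {
    "cancel": "Offer retention options",
    "refund": "Open the refund policy",
}


def live_hints(segments: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (segment index, hint) as soon as a cue appears in a segment.

    Because this is a generator, each hint is available while the call is
    still in progress instead of after the full transcript is complete.
    """
    for index, text in enumerate(segments):
        lowered = text.lower()
        for cue, hint in HINTS.items():
            if cue in lowered:
                yield index, hint
```

For example, iterating over `live_hints(["Hi there", "I want to cancel my plan"])` produces the retention hint as soon as the second segment is processed, without waiting for the call to end.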

Now what?

Should the Voice Technology industry stop paying attention to using voice to control the devices around us? No! Rather, we should constantly challenge ourselves to think outside the box about what the best use cases are for modern Voice Technologies. The best use cases might be closer than you think.

Did any new use cases come to mind for Voice Tech while you were reading this? Let us know your ideas or general thoughts from this post on Twitter @SpeechlyAPI.

Cover photo by Cam Adams on Unsplash

About Speechly

Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), super accurate custom models for any domain, and privacy and scalability for hundreds of thousands of hours of audio.
