
voice tech

Web Speech API Alternatives for Voice User Interfaces

Ottomatias Peura

Dec 17, 2020

6 min read

Speechly provides an alternative to the Web Speech API for React that works in all modern browsers and is optimized for real-time voice user interfaces


The most used tools for voice user interfaces in the browser are the Web Speech API and the SpeechRecognition API, but both technologies have major limitations.

First, the Web Speech API is only available in Chrome. The SpeechRecognition API is also available in Firefox and some derivatives of these browsers, but the limited support makes them unfeasible for production use in any real-life application.

Support for WebSpeech API is limited

Second, the Web Speech API and SpeechRecognition API provide only a transcription of the user's speech. They don't provide any context or meaning (natural language understanding) for this input. For use cases that only need the transcript this isn't an issue, but for more complicated user tasks and for building user interfaces, natural language understanding needs to be solved somehow.

Speechly is the first developer tool built from the ground up for building voice user interfaces. Our Spoken Language Understanding API integrates speech recognition (ASR) and natural language understanding (NLU) into a single API for low latency and improved accuracy.

In addition to wide browser support, Speechly is available for touch screen platforms (Android, iOS and React Native), which makes building cross-platform applications very simple. This makes Speechly the best WebSpeech API alternative for voice user interfaces.

One important aspect when comparing voice APIs is, of course, speech recognition accuracy. Speechly benefits from the fact that it's always configured for a certain use case, and this configuration is used to bias the speech recognition model.

Biasing helps Speechly correctly catch product names, professional lingo and other harder words. Even without biasing, our speech recognition accuracy is on par with Google's WebSpeech API, as you can see in the video below.

In the video, a standard, non-biased Speechly model runs simultaneously with the Google WebSpeech API test, and both are transcribing Steve Jobs' keynote speech at the first iPhone launch event.

What is natural language understanding and why do I need it?

Natural language understanding is a branch of machine learning that enables computer systems to extract meaning from text or speech input. It reduces natural language into structured data that typically consists of intents and entities (slots) that modify these intents.

While this might sound complicated, a simple example clarifies it. If the user says something like "Show t-shirts", the user intent is probably something like "show_products" and it has an entity "t-shirt". Naturally, the user might also say something like "Show jeans". In this case, the intent would be the same, "show_products", but the entity would be "jeans".

If we are 100% sure that our users will always use either of these two utterances in exactly this format, we can use a very simple regular expression as our natural language understanding algorithm.
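As a toy sketch of that idea (the intent and entity names below are made up for this post, not part of any real schema), such a regex-based "NLU" could look like this:

```typescript
// Toy rule-based "NLU" that only understands "Show t-shirts" / "Show jeans".
// The intent and entity names are illustrative, not from any real schema.
interface ParseResult {
  intent: string;
  entities: { type: string; value: string }[];
}

function parseUtterance(utterance: string): ParseResult | null {
  const match = utterance.trim().match(/^show\s+(t-shirts|jeans)$/i);
  if (!match) {
    return null; // anything else is simply not understood
  }
  return {
    intent: "show_products",
    entities: [{ type: "category", value: match[1].toLowerCase() }],
  };
}

console.log(parseUtterance("Show t-shirts"));
// -> { intent: "show_products", entities: [{ type: "category", value: "t-shirts" }] }
console.log(parseUtterance("Do you have any tees?"));
// -> null: the brittle pattern breaks as soon as the wording changes
```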

But most often this is not the case. Rather, the user can express this same intent in many different ways. Maybe they say something like "I'd like to see turtlenecks" or "Do you have any tees?"

A good natural language understanding algorithm can extract the meaning from all of these utterances and always return the same intent and entity, no matter how the user expresses themselves.

The WebSpeech and SpeechRecognition APIs don't have any natural language understanding capabilities, and if you need that, you'll need to start learning spaCy or some other natural language understanding tool. This increases development time significantly and adds complexity.

Why Spoken Language Understanding?

Now, as we've learned, a voice user interface needs two distinct parts: speech recognition to transform the user's speech into text, and natural language understanding to extract meaning (intents and entities) from that text. The WebSpeech and SpeechRecognition APIs only offer speech recognition.

If you have ever used Google Assistant, Alexa, or Siri, you've probably noticed that the text transcript appears in near real-time as the user speaks, but once the user stops speaking there is a small delay before the action happens. This delay is where the natural language understanding happens and the action the user wanted is performed.

Speechly is a Spoken Language Understanding API that provides both of these functions in a fully streaming fashion. When the user starts talking, the API begins returning both the transcript and the "meaning", i.e. the intents and entities for this input. This makes applications built with Speechly very responsive and quick to react to user input.

In fact, Speechly returns both interim and final results for both the transcript and for intents and entities for even faster feedback.
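To give a feel for what consuming such streaming results looks like, here is a simplified sketch. The Segment shape and the callback below are illustrative stand-ins rather than the exact types of the Speechly client library; see the client library documentation for the real API.

```typescript
// Simplified, illustrative shapes for streaming SLU results (not the exact
// types of the Speechly client library).
interface Entity {
  type: string;      // e.g. "category"
  value: string;     // e.g. "t-shirts"
  isFinal: boolean;  // interim values may still change
}

interface Segment {
  transcript: string;                            // transcript so far
  intent: { intent: string; isFinal: boolean };  // current best guess of the intent
  entities: Entity[];
  isFinal: boolean;                              // true once nothing will change anymore
}

// Called every time the segment is updated while the user is speaking.
function onSegmentChange(segment: Segment): void {
  // Interim results: update the UI immediately, don't wait for the utterance to end.
  renderTranscript(segment.transcript);

  if (segment.intent.intent === "show_products") {
    const category = segment.entities.find((e) => e.type === "category");
    if (category) {
      showProducts(category.value); // visual feedback while the user is still talking
    }
  }

  if (segment.isFinal) {
    // Final results: safe to commit, e.g. update application state for good.
    commitQuery(segment);
  }
}

// Stub UI hooks so the sketch stands on its own.
function renderTranscript(text: string): void { console.log("transcript:", text); }
function showProducts(category: string): void { console.log("showing:", category); }
function commitQuery(segment: Segment): void { console.log("final:", segment.transcript); }
```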

Unlike the SpeechRecognition or WebSpeech API, the Speechly browser-client is supported by all modern browsers on mobile and desktop. You can also use Speechly on iOS and Android, and we are adding more client libraries in the future. You can find the up-to-date status of all our client libraries here.

The streaming fashion of Speechly enables natural end-user utterances such as "Show me t-shirts... sorry I mean jeans". With most other voice UI APIs, this kind of query fails because of endpointing (or a failure in natural language understanding): the system interprets the small pause in the middle as the end of the utterance and starts processing the first part without taking the last part into account.

Another important thing that streaming enables is real-time visual feedback. With our example utterance "Show t-shirts", the application can show the t-shirts as soon as the user has stopped speaking. This encourages the user to go on, and they can continue with something like "for men... in size large".
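One simple way to handle both the mid-utterance correction and the follow-up refinement is to fold streamed entities into a filter object where the newest value per entity type wins. A minimal sketch, again with made-up entity types:

```typescript
// Sketch: fold streamed entities into a product filter; for each entity type
// the most recent value wins. Entity types here are made up for illustration.
interface StreamedEntity { type: string; value: string; }
type ProductFilter = Record<string, string>; // e.g. { category: "jeans", size: "large" }

function applyEntities(filter: ProductFilter, entities: StreamedEntity[]): ProductFilter {
  const updated = { ...filter };
  for (const entity of entities) {
    // "Show me t-shirts... sorry I mean jeans": the later "jeans" entity
    // simply overwrites the earlier "t-shirts" value of the same type.
    updated[entity.type] = entity.value;
  }
  return updated;
}

let filter: ProductFilter = {};
filter = applyEntities(filter, [{ type: "category", value: "t-shirts" }]);
filter = applyEntities(filter, [{ type: "category", value: "jeans" }]);
filter = applyEntities(filter, [
  { type: "gender", value: "men" },
  { type: "size", value: "large" },
]);
console.log(filter); // { category: "jeans", gender: "men", size: "large" }
```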

Configuring the natural language understanding model on Speechly is very simple and can be done either in our web dashboard or by using our command line tools. The former works great for simple projects and initial models and the latter is better for projects with several developers collaborating on the same model.
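Conceptually, configuring the model means giving it example utterances annotated with the intents and entities you want it to recognize. The snippet below only illustrates that idea as plain data; it is not Speechly's actual configuration format, which is described in the Speechly documentation.

```typescript
// Illustrative example utterances annotated with intents and entities. This is
// NOT Speechly's configuration format, only the underlying idea as plain data.
const exampleUtterances = [
  {
    utterance: "show me t-shirts",
    intent: "show_products",
    entities: [{ type: "category", value: "t-shirts" }],
  },
  {
    utterance: "do you have any jeans for men in size large",
    intent: "show_products",
    entities: [
      { type: "category", value: "jeans" },
      { type: "gender", value: "men" },
      { type: "size", value: "large" },
    ],
  },
];

console.log(exampleUtterances.length, "example utterances configured");
```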

Spoken Language Understanding demo

Here's a quick demo showing a web application built with Speechly Spoken Language Understanding in action:

As you can see from the demo, real-time visual feedback is the key to natural voice user interfaces. We believe that the lack of real-time feedback is the reason why the "iPhone moment" has not yet happened for voice UIs. This kind of real-time feedback can't be achieved with either the WebSpeech or SpeechRecognition API.

You can also see the difference in responsiveness by checking out this GitHub project that uses the WebSpeech API for a chess game, and then comparing it to this video, which shows a similar (albeit simpler!) chess game built with our JavaScript client.

Just like the iPhone succeeded with the touch screen because of its very responsive and intuitive user experience, voice UIs need the same responsiveness and intuitiveness to really succeed.

Alternatives for WebSpeech API

Amazon Transcribe

Amazon Transcribe is Amazon's speech-to-text API, and it suffers from the same limitations as the WebSpeech API and SpeechRecognition API.

While it does offer accurate speech recognition, it does not have natural language understanding capabilities, which makes it slow and non-responsive for voice user interfaces.

IBM Watson Speech to Text

IBM Watson Speech to Text is another paid speech-to-text API that does not include NLU capabilities.

Microsoft Bing Speech API

Microsoft Bing Speech API is Microsoft's answer to speech recognition, but unfortunately it does not support natural language understanding either.

Assembly AI

Assembly AI offers great features for speech to text, including profanity filters and multiple models for different accents. It's a bit cheaper than the other alternatives, but it does not support NLU, either.

Speechly

Speechly offers a fully streaming, real-time Spoken Language Understanding API for integrating responsive voice user interfaces into any web application.

Conclusions

Building voice user interfaces for browser applications can't be done without natural language understanding capabilities. While it is possible to use one tool for speech recognition and another for NLU, doing so adds complexity and most probably increases latency to the point where real-time visual feedback is not achievable.

This makes Speechly the only available tool that enables complex voice user interfaces in the browser with a single API and with wide support for different browsers.

If you are interested in building real-time voice user interfaces for React or JavaScript, you can start using Speechly by completing our tutorials. You can follow either the React tutorial or JavaScript tutorial depending on the platform you are developing on.

If you want to learn more about what kind of applications Speechly enables, you can refer to our Use cases section.
