The Reactive Voice UI paradigm is especially well suited to web and mobile environments, or really any experience with a touch screen.
Many digital voice experiences in today's marketplace follow the legacy voice assistant model popularized by Apple's Siri back in 2011. While these assistant-driven voice UIs optimize for conversation, Reactive Voice UIs optimize for task completion. The bulk of voice assistant usage over the last decade has been reserved for single-utterance requests like "play music" or "turn off the lights." In fact, the 2020 Smart Audio Report found that the top five tasks requested of voice assistants were to play music, get the weather, set a timer, check the time, and tell a joke. This is not surprising, given the time and effort required to accomplish more complex tasks in a turn-based, conversational experience.
What do we mean by turn-based? The person and the AI assistant must take turns speaking in order to be understood. The person speaks, and only when they've stopped talking does the natural language understanding (NLU) kick in to process the spoken input, determine the intent, and return a text-to-speech response. Each party must wait for the other before they can move forward. It's one-way communication, with each side blocked while waiting on the other.
As someone tries to uncover what the assistant can do, and stumbles through everything it can't, resetting quickly becomes a time-intensive back-and-forth conversation. The frustration this creates is so much a part of our cultural zeitgeist that Googling "yelling at Alexa" returns thousands of results.
It's a sequential, waterfall-style process where the value is delivered only at the very end. Likewise, errors happen silently throughout the process, only to be surfaced at the end, when it's too late to recover from them.
In legacy voice assistant experiences, voice serves as both the UI and the operating system. In the Reactive Voice UI design philosophy, voice functions as one feature alongside other modalities. The idea is not to replace an existing graphical user interface (GUI) with voice, but to complement or augment it along the parts of the user journey where typing and swiping would otherwise be tedious, such as searching or entering complex information.
Reactive Voice UIs are characterized by multimodal UI mechanics that enable voice input to generate a visual output. That means that the person can speak and see their words generate a reaction within the visual element directly, without an "assistant" managing the experience.
Whether using voice to search within a site or voice picking to manage inventory, the UI and the user are able to communicate with one another in both directions simultaneously. Despite not being a "conversation," it's a much more natural (and faster) way to communicate and get things done.
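The contrast between the two models can be sketched in a few lines. This is a minimal illustration with hypothetical names, not any real assistant or Speechly API: the turn-based assistant responds once, only after the full utterance, while the reactive, streaming UI reacts to every partial transcript as it arrives.

```python
def turn_based(words):
    """Legacy model: wait for the full utterance, then respond once."""
    utterance = " ".join(words)           # user must finish speaking first
    return [f"response to: {utterance}"]  # single response, delivered at the end

def streaming(words):
    """Reactive model: update the visual UI on each partial transcript."""
    heard = []
    updates = []
    for word in words:
        heard.append(word)
        updates.append(f"UI shows: {' '.join(heard)}")  # immediate visual feedback
    return updates

words = ["show", "red", "sneakers", "under", "fifty", "dollars"]
print(len(turn_based(words)))  # 1 -> value (and any error) surfaces only at the end
print(len(streaming(words)))   # 6 -> errors are visible, and correctable, early
```

The difference in the number of feedback opportunities is the whole point: in the streaming model a misheard word is on screen immediately, so the user can correct it before it compounds.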
Managing the Limitations of Artificial Intelligence in Voice Technologies
AI-based product experiences have one thing in common: when they work, they're magical; when they fail, they fail catastrophically. Take autonomous driving: a magical experience is being driven safely from home to work, while a catastrophic one is being driven off a cliff. With voice, a magical experience might be nailing a complex pizza order in one go. A catastrophic experience might be accidentally sending (and paying for) twenty pizzas to your old address.
Humans, for that matter, aren't that great at understanding spoken language either. The typical human word error rate is around five percent, which means that with ten-word sentences, roughly every other sentence will contain at least one misheard word. Verbal communication is error-prone to begin with.
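A quick back-of-the-envelope check of that claim. The five percent word error rate and ten-word sentence length come from the text; treating errors as independent is a simplification.

```python
wer = 0.05          # per-word probability of mishearing (5%)
sentence_len = 10   # words per sentence

# Expected misheard words per sentence:
expected_errors = wer * sentence_len                # 0.5, i.e. one every other sentence

# Probability that a given sentence contains at least one misheard word:
p_at_least_one = 1 - (1 - wer) ** sentence_len

print(expected_errors)           # 0.5
print(round(p_at_least_one, 2))  # 0.4
```

Either way you slice it, roughly every other ten-word sentence goes partly misunderstood, even between humans.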
However, what's different for humans is that these errors in understanding don't cost very much, as humans are able to quickly recover from them.
One big reason voice AI experiences are still struggling to grow market share is that mistakes in understanding often feel too costly. We can tackle this in two ways: either by making the AI smarter and smarter, to the point where it no longer makes mistakes, or by making the failures cost less. We at Speechly believe the latter, more pragmatic approach is the way to go.
If you're familiar with the principles of modern software delivery, this will likely be familiar to you:
Succeeding in communication is about short cycles, incremental delivery, being iterative, failing fast, getting feedback, delivering value early, transparency and adaptation.
These same core principles apply to Reactive Voice UIs.
With Spoken Language Understanding™ technology and a visual Reactive Voice UI displayed on a screen, you can maintain a fast feedback loop. That means that you're able to deliver value early and enable quick recovery from errors in understanding. This keeps errors from compounding, helps build trust with the user, and makes the experience feel seamless.
How to Convert an Alexa Skill into a Multi-Modal Experience
It takes an immense amount of conversational design and development work to build voice assistant experiences. These voice-only experiences are not bad; they are just limiting from the user's perspective. We built Speechly to offer an alternative, one that starts with a focus on the user. If you've built an Alexa Skill and want to expand its reach beyond the Amazon ecosystem, we created a simple conversion tool that lets you create a new Speechly application from an existing Alexa skill in a few simple steps. With it, you can easily turn your Alexa skill into a streaming Speechly voice application and use it to enable Reactive Voice UI experiences across the web and in mobile apps.
It's free to start building: https://docs.speechly.com/basics/getting-started/
If you have questions, a specific use case or a POC you want to try out, our inbox is open.
Cover photo by Tiger Lily on Pexels