There are a few similarities that arise across useful Multi-Modal Voice Interface use cases. As a product builder or team member, being able to identify these scenarios can lead to lucrative opportunities to give users a 10x experience using Spoken Language technology and plant the seed for more sophisticated experiences in the future. In this blog post I am going to discuss the value of Multi-Modal Voice Interfaces, the Scenarios where they thrive, and a few Examples of these interfaces in action.
Value of Multi-Modal Voice Interfaces
The most valuable aspect of a Multi-Modal Voice Interface is that it allows developers to truly leverage the power of real-time Spoken Language Understanding (SLU) technology. At a high level, this works by streaming Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) simultaneously, whereas standard voice interfaces and conversational experiences usually run them one after the other. That one-after-the-other process tends to encourage shorter user utterances, like common smart speaker Voice Assistant commands, because users cannot tell whether or not the Voice AI is understanding them.
Streaming SLU alongside a screen allows developers to give visual cues to a user, much like the way we communicate with each other on a daily basis. Whether in person or on a Zoom call, humans use visual cues to signal a whole array of meanings to a friend or colleague, depending on the conversation. For example, when giving a demo or pitch of a new product, a presenter is always looking for a head nod from the crowd. That head nod gives the presenter valuable information on whether their product or service is actually relevant to that target group of users. Streaming SLU gives Multi-Modal Voice Interfaces that head nod. By understanding a user in real time, a developer can show visual cues that let the user know they are being understood.
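To make the "head nod" idea concrete, here is a minimal sketch of streaming understanding. It is not Speechly's API or any real SLU model: the keyword tables, `PartialResult`, `stream_understanding`, and `render_cue` are all hypothetical names I made up for illustration, and a real system would use a trained model working on audio rather than a word list. The point is only that the interpretation is updated after every recognized word, so the screen has something to show before the user finishes speaking.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical toy vocabulary (assumption): a real SLU model is trained,
# not driven by keyword lookups like these.
INTENT_WORDS = {"show": "search", "find": "search", "buy": "purchase"}
COLORS = {"red", "blue", "black"}
PRODUCTS = {"sneakers", "jacket", "jeans"}


@dataclass
class PartialResult:
    """Snapshot of what has been heard and understood so far."""
    words: List[str] = field(default_factory=list)
    intent: Optional[str] = None
    entities: Dict[str, str] = field(default_factory=dict)


def stream_understanding(words: Iterable[str]) -> Iterator[PartialResult]:
    """Yield an updated interpretation after every recognized word.

    `words` stands in for the incremental output of a streaming ASR.
    """
    heard: List[str] = []
    intent: Optional[str] = None
    entities: Dict[str, str] = {}
    for word in words:
        token = word.lower()
        heard.append(token)
        if intent is None and token in INTENT_WORDS:
            intent = INTENT_WORDS[token]
        if token in COLORS:
            entities["color"] = token
        if token in PRODUCTS:
            entities["product"] = token
        # Yield a copy so earlier snapshots are not mutated later.
        yield PartialResult(list(heard), intent, dict(entities))


def render_cue(result: PartialResult) -> str:
    """Build the text a UI could display as a visual cue for this snapshot."""
    parts = ["heard: " + " ".join(result.words)]
    if result.intent:
        parts.append("intent: " + result.intent)
    parts.extend(f"{name}: {value}" for name, value in result.entities.items())
    return " | ".join(parts)


if __name__ == "__main__":
    for snapshot in stream_understanding("Show me blue sneakers".split()):
        print(render_cue(snapshot))
```

A one-after-the-other pipeline would only produce the final snapshot once the whole utterance had been transcribed. Here the interface can highlight the "blue" filter the moment that word is spoken, which is the visual cue described above.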
Use Voice, Swipe or Type - Whatever is Right
Voice alone as an input can be a fantastic experience in certain situations like home automation, asking basic questions, or starting timers. However, the addition of a screen alongside Spoken Language technology allows voice, swipe, and type inputs to each be used where they make the most sense for the end user. The reality is that forcing a voice-only or conversational experience can be stressful for a user who is not used to that type of experience.
It is better to approach products from the perspective of how you can best solve a user or market problem. For some problems, a conversational experience might make sense. However, some experiences simply cannot be forced into a solely conversational mold. Using Spoken Language to control technology can be a fantastic way to supplement swiping and typing with an additional input method that can be 3x as efficient as typing on a mobile phone.
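One way to picture voice as a supplement rather than a replacement is to have every input method update the same state. The sketch below is only an illustration under my own assumptions: `SearchState`, its `apply` method, and the source labels are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Input methods this sketch accepts (assumption for illustration).
ALLOWED_SOURCES = {"voice", "touch", "keyboard"}


@dataclass
class SearchState:
    """Single source of truth for the UI, whatever input method was used."""
    filters: Dict[str, str] = field(default_factory=dict)
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def apply(self, source: str, name: str, value: str) -> None:
        """Set one filter and record which input method set it."""
        if source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown input source: {source}")
        self.filters[name] = value
        self.history.append((source, name, value))


if __name__ == "__main__":
    state = SearchState()
    state.apply("voice", "category", "jackets")  # spoken: "show me jackets"
    state.apply("touch", "size", "M")            # tapped a size chip
    state.apply("keyboard", "brand", "Acme")     # typed into a text field
    state.apply("voice", "size", "L")            # spoken correction
    print(state.filters)
```

Because all three input methods write to the same filters, a user can start with voice, fix a detail with a tap, and never be locked into one mode.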
Ideal Scenarios for Multi-Modal Interfaces
Information Heavy Tasks
Scenarios that require users to engage in complex searches or repetitive inputs into a system can be a great opportunity for a Multi-Modal Voice Interface. There are a few reasons this can be a valuable scenario. Although ASR has achieved near-human parity, creating enjoyable end user products that use voice interfaces can be a challenge because of the complexities that come with language in different contexts. The best way to ensure a good Multi-Modal Voice Interface experience is to give the SLU model as much contextual data as possible.
Context is also important for the end user experience. Although spoken language is a great input for technology, most users are not familiar with the voice modality for everyday experiences outside of basic Voice Assistant controls. If a user already understands the context and jargon around a particular process, the unfamiliarity of the voice modality is easier to overcome because it sits within a familiar experience.
Speed & Value
I have already mentioned that the voice modality can be 3x more efficient than typing on a mobile phone. However, speed alone is not a good reason to build a new product. Speed is important when you can attribute it to actual end user or business value. The reality is that many businesses have Omnichannel experiences with complex customer journeys and employee responsibilities. This provides ample opportunity to assess where a Multi-Modal Voice Interface might be a good fit across an organization. I will explore examples of this later in the post.
Existing Digital Experience
I understand a certain percentage of people who read this post will assume that “Multi-Modal Voice Interface” refers to a voice-enabled device with a screen from a company like Amazon or Google. However, Multi-Modal Voice Interfaces also apply to contexts outside of the smart speaker Voice Assistant ecosystems. As the point above about speed suggests, businesses should look inward when assessing where to implement Spoken Language technology, as opposed to outward at unproven emerging platforms. Building on existing digital experiences is a better way for you to control the user voice experience and plant the seed for more sophisticated experiences down the road.
It’s not hard to understand why having full control of your brand, user experience, and data would be valuable when building a completely new way for users to interface with your company or product. Best practices are still being defined across the different sectors that are applying Spoken Language technology. Giving product teams full control of brand, user experience, and data allows them to plant the seed with Spoken Language technology and iterate over time to build the best user experience. We can speculate over best practices, but we have only begun to scratch the surface of what is possible with modern-day Multi-Modal Voice Interfaces that leverage SLU. This provides a real opportunity to define the future of what user experience looks like with Spoken Language technology.
Use Case Examples
I believe the three use cases I discuss below are great examples that check the box for each of the three Scenarios mentioned above.
Voice Commerce Search, Filtering and Purchasing
E-Commerce has completely changed the way we buy things. The COVID-19 pandemic accelerated E-Commerce growth by up to 6 years. Despite this growth, E-Commerce product search, filtering and purchasing is outdated. Users are required to search and filter by tediously entering data into complex menu hierarchies. This scenario is a perfect example of how Natural Language Voice Search could provide an efficient experience that results in both user value and business value. Users can find items more efficiently with their voice, resulting in less churn and more items being added to their carts. This translates into a direct benefit for the business. It's a win-win, making E-Commerce a great place to integrate a Multi-Modal Voice Interface.
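As a rough illustration of what replaces the menu hierarchy, the sketch below turns one spoken sentence into the structured filters a search page would otherwise collect click by click. Everything in it is an assumption for illustration: the tiny catalog, the keyword sets, and the `parse_query` and `search` helpers are hypothetical, and a production system would rely on a trained SLU model rather than regular expressions.

```python
import re
from typing import Dict, List

# Hypothetical sample catalog and vocabulary (assumptions for illustration).
CATALOG: List[Dict[str, object]] = [
    {"name": "Trail Runner", "category": "sneakers", "color": "red", "price": 89.0},
    {"name": "City Walker", "category": "sneakers", "color": "red", "price": 129.0},
    {"name": "Rain Shell", "category": "jacket", "color": "blue", "price": 95.0},
]
COLORS = {"red", "blue", "black"}
CATEGORIES = {"sneakers", "jacket", "jeans"}


def parse_query(utterance: str) -> Dict[str, object]:
    """Extract color, category, and a price ceiling from a transcript."""
    text = utterance.lower()
    filters: Dict[str, object] = {}
    for word in re.findall(r"[a-z]+", text):
        if word in COLORS:
            filters["color"] = word
        elif word in CATEGORIES:
            filters["category"] = word
    # Matches phrases such as "under 100" or "under $100".
    price = re.search(r"under\s+\$?(\d+(?:\.\d+)?)", text)
    if price:
        filters["max_price"] = float(price.group(1))
    return filters


def search(catalog: List[Dict[str, object]], filters: Dict[str, object]) -> List[Dict[str, object]]:
    """Return catalog items that satisfy every filter that was spoken."""
    results = []
    for item in catalog:
        if "color" in filters and item["color"] != filters["color"]:
            continue
        if "category" in filters and item["category"] != filters["category"]:
            continue
        if "max_price" in filters and item["price"] > filters["max_price"]:
            continue
        results.append(item)
    return results


if __name__ == "__main__":
    spoken = "Show me red sneakers under 100 dollars"
    filters = parse_query(spoken)
    print(filters)
    print([item["name"] for item in search(CATALOG, filters)])
```

One sentence sets three filters at once, which is where the efficiency gain over nested menus comes from. On a multi-modal screen, the same filters would appear as visible chips that the user can still adjust by tapping.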
Professional Data Capture and Entry
Everybody knows the value of data in business today. For this reason, a lot of attention is spent on how to acquire the most accurate data in the most efficient way possible. In certain professions, such as healthcare, real estate, finance or law, a lot of legal and paperwork requirements come with the day-to-day operations of the business. Multi-Modal Voice Interfaces are perfect in these scenarios. Professionals in jobs that require detail-oriented processes to be followed often become intimately familiar with the paperwork, data collection, and data input that is required of them. Completing these processes more efficiently lets a professional book more business, which can benefit both that individual and the business's top-line revenue.
Voice-Directed Warehousing
Warehouses are an ideal place to implement Spoken Language technology on devices like a phone, a tablet, or a screen on a piece of machinery. Voice-Directed Warehousing is the process of managing a warehouse worker or machine by using a Multi-Modal Voice Interface. Letting warehouse employees use Spoken Language as an interface to these devices allows for more efficient and accurate data capture, while the hands-free operation creates safer warehouse environments. Workers are more efficient due to the 3x speed of voice as an input. We have seen that warehouses that leverage SLU technology like Speechly are able to quickly adapt the ASR to their unique acoustic ecosystems, allowing for higher quality data capture. The safer environment is a byproduct of workers spending minimal time in front of a screen and maximum time with their eyes up and alert. Overall, it's hard to argue with the value of Voice-Directed Warehousing for employees and business owners alike.
As you can see, opportunities exist across industries to start leveraging Multi-Modal Voice Interfaces today to create better user experiences. If you are ready to plant the Voice Tech seed with your users, reach out to Speechly.