Bring Multi-Modality to Voice Commerce

Collin Borns

Feb 02, 2021

Voice commerce should not mean eCommerce on a smart speaker, but rather a multi-modal experience supported by voice.

Voice Commerce, or V-Commerce, comes up frequently in discussions of the opportunities in voice technology. That is not surprising, since voice technology news frequently predicts that V-Commerce will be an $80B opportunity by 2023. To everyday users of voice technology, this may seem optimistic given the challenges that V-Commerce experiences face today. Given these challenges, I believe companies will need to embrace Multi-Modal Voice Commerce: voice experiences that work alongside existing digital experiences.

Challenges Facing Voice Commerce Today

With any emerging technology there are bound to be problems, or opportunities, that come along with it. Voice technology is no different. A handful of problems recur with voice technology in general, and they are usually magnified when applied to V-Commerce. Three common problems that users cite with V-Commerce are the lack of a screen, fear of being misunderstood, and concerns about privacy.

Lack of Screen

A common complaint from users who try V-Commerce experiences is that many voice-enabled devices do not have a screen. Without a screen to reassure users that their utterances are being understood and executed properly, it’s hard to imagine purchases going beyond basic everyday items and reorders.

Problems with Accuracy

Another common problem with V-Commerce is users' fear of being misunderstood. Frequent users of voice-enabled experiences are OK with a voice assistant failing to understand a simple request, such as a question or a song request. However, users are likely to scrutinize the risk of being misunderstood far more heavily when making a financial transaction.

Privacy Issues

The final problem I want to address is privacy, which comes up frequently with V-Commerce. According to Voicebot, one-third of U.S. adults are concerned about smart speakers recording them and will not purchase a device, double the share in 2018. Setting aside potential smart speaker owners, existing smart speaker owners also have heightened concerns about privacy. Just like the fear of being misunderstood, privacy concerns are heightened in V-Commerce because it revolves around a financial transaction.

Many of these problems can be better addressed by adding a screen to a voice experience, which creates Multi-Modal Voice Commerce. First, I will discuss a few general reasons why I think Multi-Modal Voice Commerce is the future of V-Commerce. I will then explain why I think businesses interested in creating valuable end-user voice experiences should ditch smart speakers and start building voice features in their own digital domains.

3 Reasons for Multi-Modal Voice Commerce

Buying is Visual

Humans have always looked for ways to improve how we transact and trade. We have progressed from making and trading our own goods, to Main Street mom-and-pop businesses, to large-scale retail enterprises, to immersive E-Commerce stores. Although humans have consistently innovated in how we purchase products, one thing has remained constant: the visual component of purchasing goods. Humans are skeptical, and it’s human nature to want to see and scrutinize an object we are interested in purchasing. Another interesting fact: we process visual information in a fraction of the time it takes to process other modalities.

This observation gives E-Commerce oriented businesses an opportunity to lean into and leverage existing digital assets to create voice experiences. Product teams spend countless hours optimizing and perfecting both mobile and web experiences where customers are already spending time. By going all in on voice assistants as a Voice Commerce strategy, you leave out a major part of what makes online stores successful: images. Rather, businesses should enhance their current online stores by leveraging the voice modality and give customers a truly value-add experience.

Real-Time Validation

Most voice experiences today, even Multi-Modal experiences on popular Voice Assistants like Alexa or Google Assistant that show a transcript of what you are saying, are turn-based experiences and lack real-time validation that the user is actually being understood. By “turn-based experience” I mean one in which Automatic Speech Recognition (ASR) first produces a transcription of what was said, and Natural Language Processing (NLP) then interprets the intent of the user. The real opportunity with Multi-Modal Voice Commerce lies in Spoken Language Understanding (SLU).

Spoken Language Understanding is slightly different from the turn-based approach I mentioned above, but it can lead to a drastically different experience for users. SLU does ASR and NLP simultaneously, in real time. This not only lets users see an actual transcript of what they are saying, but also lets a designer use the screen to show, in real time, whether or not the system is understanding the user's intent. This gives users the comfort of knowing that they are being understood, and it also results in longer utterances.
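
To make the difference concrete, here is a minimal toy sketch in Python. It is not Speechly's API or any real SLU model: the keyword sets, the `PartialResult` class, and the `turn_based` and `streaming` functions are all hypothetical names invented for illustration, and a list of words stands in for live audio. The point is only the shape of the interaction, where the streaming version hands the UI an updated snapshot after every word instead of once at the end.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List

# Hypothetical toy vocabulary; a real SLU model would learn this from data.
COLORS = {"red", "blue", "black"}
PRODUCTS = {"sneakers", "jeans", "jackets"}


@dataclass
class PartialResult:
    """Snapshot of what the system has understood so far."""
    transcript: str = ""
    entities: Dict[str, str] = field(default_factory=dict)


def _tag(token: str, entities: Dict[str, str]) -> None:
    """Record the token as an entity if it matches the toy vocabulary."""
    word = token.lower()
    if word in COLORS:
        entities["color"] = word
    elif word in PRODUCTS:
        entities["product"] = word


def turn_based(words: Iterable[str]) -> PartialResult:
    """Turn-based: wait for the whole utterance, then interpret it once."""
    tokens = list(words)
    result = PartialResult(transcript=" ".join(tokens))
    for token in tokens:
        _tag(token, result.entities)
    return result


def streaming(words: Iterable[str]) -> Iterator[PartialResult]:
    """Streaming: yield an updated snapshot after every word."""
    tokens: List[str] = []
    entities: Dict[str, str] = {}
    for token in words:
        tokens.append(token)
        _tag(token, entities)
        # Copy the dict so earlier snapshots are not mutated later.
        yield PartialResult(" ".join(tokens), dict(entities))


# A UI could redraw on each snapshot while the user is still speaking.
for snapshot in streaming("show me blue sneakers".split()):
    print(snapshot.transcript, snapshot.entities)
```

In this sketch the color entity is available as soon as "blue" is heard, before the utterance is finished, which is the moment a screen could highlight the matching filter. Both functions end with the same final understanding; they differ only in when the user gets feedback.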

Multi-Modal Voice Commerce also lets users verify whether or not they are being listened to. With privacy being a top concern of both potential and existing voice technology users, Product teams need to pay careful attention to how they address it. Visual components, such as a Microphone On/Off button, are a good remedy for privacy concerns with voice technology.

Efficient for Users

According to a study from Stanford, speech recognition is 3x faster than typing into a smartphone. There have been many predictions about what voice experiences might become in the future, and I am as excited as everyone else about that future, but there is one thing I am sure of about voice technology: voice is the most efficient way to interact with technology. This makes existing digital experiences, such as E-Commerce websites and mobile applications, the perfect domain for a Voice User Interface. Users can make purchase decisions based on products they can actually see, while searching, filtering, and checking out more efficiently with a Multi-Modal Voice Interface.
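
As a rough illustration of the filtering step, the sketch below applies filters that a voice interface might extract from a spoken request to an ordinary product list. Everything here is an assumption made for the example: the `Product` class, the `CATALOG` entries, the filter keys, and the `apply_filters` function are hypothetical and do not come from any real store or SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Product:
    name: str
    category: str
    color: str
    price: float


# Hypothetical catalog used only for illustration.
CATALOG = [
    Product("Trail Runner", "sneakers", "blue", 89.0),
    Product("City Classic", "sneakers", "black", 120.0),
    Product("Court Low", "sneakers", "blue", 140.0),
    Product("Rain Shell", "jackets", "blue", 95.0),
]


def apply_filters(catalog: List[Product], filters: Dict[str, object]) -> List[Product]:
    """Return the products that satisfy every filter present in `filters`."""
    results = []
    for product in catalog:
        if "category" in filters and product.category != filters["category"]:
            continue
        if "color" in filters and product.color != filters["color"]:
            continue
        if "max_price" in filters and product.price > filters["max_price"]:
            continue
        results.append(product)
    return results


# Filters as they might be extracted from
# "show me blue sneakers under 100 dollars".
spoken_filters = {"category": "sneakers", "color": "blue", "max_price": 100}
for product in apply_filters(CATALOG, spoken_filters):
    print(product.name, product.price)
```

One sentence sets three filters at once here, which on a typical store page would take several taps or a typed query, while the results still appear on screen for the user to inspect.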

Invest: Multi-Modal Voice Interface vs. Voice Assistant Platforms

There is a difference between Voice Assistant platforms, such as Google Assistant or Amazon Alexa, and companies like Speechly that enable developers to easily embed Voice User Interfaces in existing websites and apps. When it comes to building voice technology that is useful for users in E-Commerce, I believe it is better to approach voice as a modality to build immersive Multi-Modal experiences rather than an emerging platform opportunity. Approaching voice technology through this lens first provides the opportunity to immediately build features with value by bringing efficiency to your users.

See our Voice Search & Filtering Demo below:

Plant the Voice Tech Seed

Starting with a V-Commerce use case like the Search and Filtering Demo above not only provides immediate value to users, but also plants the seed for future innovation around Multi-Modal Voice Interfaces. Searching or filtering products using your voice may seem simple in nature, but do not underestimate the power of user behavior change. With any user behavior change comes massive opportunities for innovation. Giving users a feature that is easy to digest and provides immediate value gives Product teams the opportunity to offer more sophisticated features down the line.

V-Commerce may have its problems, but I believe many of these problems are less concerning if we approach voice technology as a modality for building efficient Multi-Modal experiences. Multi-Modal Voice Commerce allows companies and brands that are interested in voice technology to “walk before they run” by starting with features that make sense to users and make the purchase journey more efficient. Giving customers true value through voice technology, from day one, is the only way to lay the foundation for building more sophisticated voice experiences in the future.
