Bring Multi-Modality to Voice Commerce

Collin Borns

Feb 02, 2021

6 min read

Voice commerce should not mean eCommerce on a smart speaker, but rather a multi-modal experience supported by voice.

Voice Commerce, or V-Commerce, comes up frequently in discussions of the opportunities in voice technology. This is not surprising, since voice technology news frequently predicts that V-Commerce will be an $80B opportunity by 2023. To everyday users of voice technology, that may seem optimistic given the challenges V-Commerce experiences face today. Because of these challenges, I believe companies will need to embrace Multi-Modal Voice Commerce: voice experiences that work alongside existing digital experiences.

Challenges Facing Voice Commerce Today

With any emerging technology there are bound to be problems, or opportunities, that come along with it. Voice technology is no different. A handful of problems recur with voice technology in general, and they are usually magnified when applied to V-Commerce. Three that users commonly cite are the lack of a screen, the fear of being misunderstood, and concerns about privacy.

Lack of Screen

A common complaint from users who try V-Commerce experiences is that many voice-enabled devices do not have a screen. Without a screen to reassure users that their utterances are being understood and executed properly, it’s hard to imagine purchases going beyond basic everyday items and reorders.

Problems with Accuracy

Another common problem with V-Commerce is users' fear of being misunderstood. Frequent users of voice-enabled experiences tolerate a voice assistant that fails to understand a simple request, such as a question or a song request. However, users are likely to scrutinize the risk of being misunderstood far more heavily when making a financial transaction.

Privacy Issues

The final problem I want to address that comes up frequently with V-Commerce is privacy. According to Voicebot, ⅓ of U.S. adults are concerned about smart speakers recording them and will not purchase a device, double the share in 2018. Setting aside potential smart speaker owners, privacy concerns are also heightened among existing smart speaker owners. As with the fear of being misunderstood, privacy concerns are heightened in V-Commerce because it revolves around a financial transaction.

Many of these problems can be better addressed by adding a screen to a voice experience, creating Multi-Modal Voice Commerce. First, I will discuss a few general reasons why I think Multi-Modal Voice Commerce is the future of V-Commerce. I will then explain why I think businesses interested in creating valuable end-user voice experiences should ditch smart speakers and start building voice features in their own digital domains.

3 Reasons for Multi-Modal Voice Commerce

Buying is Visual

Humans have always looked for ways to improve how we transact and trade. We have progressed from making and trading our own goods, to Main Street mom-and-pop businesses, to large-scale retail enterprises, to immersive E-Commerce stores. Although humans have consistently innovated how we purchase products, one thing that has remained constant is the visual component of purchasing goods. Humans are skeptical, and it’s human nature to want to see and scrutinize an object we are interested in purchasing. We also process visual information in a fraction of the time it takes to process other modalities.

This observation gives E-Commerce oriented businesses an opportunity to lean into and leverage existing digital assets to create voice experiences. Product teams spend countless hours optimizing and perfecting both mobile and web experiences where customers are already spending time. By going all in on voice assistants as a Voice Commerce strategy, you leave out a major part of what makes online stores successful: images. Rather, businesses should enhance their current online stores by leveraging the voice modality and give customers a truly value-add experience.

Real-Time Validation

Most voice experiences today are turn-based and lack real-time validation that the user is actually being understood. This includes even Multi-Modal experiences on popular Voice Assistants like Alexa or Google Assistant that show a transcript of what you are saying. By “turn-based experience” I mean one in which Automatic Speech Recognition (ASR) first produces a transcription of what was said, and Natural Language Processing (NLP) then interprets the intent of the user. The real opportunity with Multi-Modal Voice Commerce lies in Spoken Language Understanding (SLU).

Spoken Language Understanding differs only slightly from the turn-based approach described above, but it can lead to a drastically different experience for users. SLU does ASR and NLP simultaneously, in real time. Users can see a transcript of what they are saying as they say it, and a designer can use the screen to show, in real time, whether the system is understanding the user's intents. This gives users the comfort of knowing they are being understood, and it also results in longer utterances.
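To make the difference concrete, here is a minimal sketch in Python. It is not Speechly's API or any real SLU model: the vocabulary, the keyword-matching `understand` function, and the names `turn_based` and `streaming` are all hypothetical, chosen only to contrast when the interpretation becomes available to the screen.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical toy vocabulary; a real SLU model would learn this from data.
SEARCH_WORDS = {"show", "find", "search"}
COLORS = {"red", "blue", "black"}
PRODUCTS = {"shoes", "jacket", "jeans"}


def understand(words: List[str]) -> Dict[str, str]:
    """Toy intent and entity extraction by keyword matching (an assumption)."""
    result: Dict[str, str] = {}
    for word in words:
        if word in SEARCH_WORDS:
            result["intent"] = "search"
        elif word in COLORS:
            result["color"] = word
        elif word in PRODUCTS:
            result["product"] = word
    return result


def turn_based(word_stream: Iterable[str]) -> Dict[str, str]:
    """Wait for the complete transcript, then run understanding once."""
    transcript = list(word_stream)  # nothing reaches the UI until the turn ends
    return understand(transcript)


def streaming(word_stream: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield an updated interpretation after every word so the UI can redraw."""
    heard: List[str] = []
    for word in word_stream:
        heard.append(word)
        yield {"transcript": " ".join(heard), **understand(heard)}


if __name__ == "__main__":
    utterance = "show me red shoes".split()
    print("turn-based:", turn_based(utterance))
    for update in streaming(utterance):
        print("streaming: ", update)
```

In the turn-based version the screen has nothing to show until the user stops talking. In the streaming version each partial update can drive the interface immediately, for example by highlighting the "red" filter the moment that word is heard.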

Multi-Modal Voice Commerce also lets users verify whether or not they are being listened to. With privacy a top concern for both potential and existing voice technology users, Product teams need to pay careful attention to how they address it. A visual component, such as a Microphone On/Off button, is a good remedy for privacy concerns with voice technology.

Efficient for Users

According to a study from Stanford, speech recognition is 3x faster than typing into a smartphone. There have been many predictions about what voice experiences might become in the future, and I am as excited as everyone else about that future, but there is one absolute fact about voice technology: voice is the most efficient way to interact with technology. This makes existing digital experiences, such as E-Commerce websites and mobile applications, the perfect domain for a Voice User Interface. Users can make purchase decisions based on products they can actually see, while searching, filtering, and checking out more efficiently with a Multi-Modal Voice Interface.
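As a rough illustration of voice-driven filtering, the sketch below maps a single spoken request onto several product filters at once. Everything in it is an assumption made for the example: the `Product` fields, the sample `CATALOG`, and the simple word and regular-expression matching in `filter_by_utterance`. A production system would rely on a trained language understanding model rather than keyword lookup.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Product:
    name: str
    category: str
    color: str
    price: float


# Hypothetical catalog used only for this example.
CATALOG = [
    Product("Trail Runner", "shoes", "red", 89.0),
    Product("City Sneaker", "shoes", "black", 120.0),
    Product("Rain Jacket", "jacket", "red", 150.0),
]


def first_match(words: set, candidates: set) -> Optional[str]:
    """Return a candidate that appears in the utterance, if any."""
    return next((c for c in sorted(candidates) if c in words), None)


def filter_by_utterance(utterance: str, catalog: List[Product]) -> List[Product]:
    """Apply category, color, and 'under <price>' filters found in the text."""
    text = utterance.lower()
    words = set(re.findall(r"[a-z]+", text))

    category = first_match(words, {p.category for p in catalog})
    color = first_match(words, {p.color for p in catalog})
    price_match = re.search(r"under \$?(\d+(?:\.\d+)?)", text)
    max_price = float(price_match.group(1)) if price_match else None

    return [
        p for p in catalog
        if (category is None or p.category == category)
        and (color is None or p.color == color)
        and (max_price is None or p.price <= max_price)
    ]


if __name__ == "__main__":
    for product in filter_by_utterance("Show me red shoes under $100", CATALOG):
        print(product.name, product.price)
```

One sentence sets three filters that would otherwise take several taps, while the matching products stay visible on screen. Note that this toy matcher only recognizes exact words, so a plural like "jackets" would not match the "jacket" category.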

Invest: Multi-Modal Voice Interface vs. Voice Assistant Platforms

There is a difference between Voice Assistant platforms, such as Google Assistant or Amazon Alexa, and companies like Speechly that enable developers to easily embed Voice User Interfaces in existing websites and apps. When it comes to building voice technology that is useful for users in E-Commerce, I believe it is better to approach voice as a modality to build immersive Multi-Modal experiences rather than an emerging platform opportunity. Approaching voice technology through this lens first provides the opportunity to immediately build features with value by bringing efficiency to your users.

See our Voice Search & Filtering Demo below:

Plant the Voice Tech Seed

Starting with a V-Commerce use case like the Search and Filtering Demo above not only provides immediate value to users, but also plants the seed for future innovation around Multi-Modal Voice Interfaces. Searching or filtering products using your voice may seem simple in nature, but do not underestimate the power of user behavior change. With any user behavior change comes massive opportunities for innovation. Giving users a feature that is easy to digest and provides immediate value gives Product teams the opportunity to offer more sophisticated features down the line.

V-Commerce may have its problems, but I believe many of these problems are less concerning if we approach voice technology as a modality for building efficient Multi-Modal experiences. Multi-Modal Voice Commerce allows companies and brands that are interested in voice technology to “walk before they run” by starting with features that make sense to users and make the purchase journey more efficient. Giving customers true value through voice technology, from day one, is the only way to lay the foundation for building more sophisticated voice experiences in the future.

If you are interested in turning your E-Commerce store into a Voice Commerce powerhouse, leave your email address and one of our industry professionals will contact you.

About Speechly

Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), highly accurate custom models for any domain, and privacy and scalability for hundreds of thousands of hours of audio.
