Speechly is joining Roblox

We are excited to announce that Speechly is joining Roblox!

We founded Speechly in 2016 with the belief that voice was the future of online interaction. Our mission was to enable more delightful computer voice interactions and to empower and enhance communication between people. We are proud of the speech recognition product and solutions our team created, which enabled fast, real-time, and cost-efficient voice moderation and helped developers reduce toxic behavior in online communities.

Roblox is building the leading platform for 3D immersive communication and connection. Every day, 65.5 million daily active users of all ages come to Roblox to be together and to experience and create memories with friends. With the addition of new voice features, including voice chat, Roblox is solving new challenges, including moderating spoken language in real time.

Safety and civility are foundational to Roblox. We are excited to be joining a company dedicated to safety and civility and to use our AI expertise to evolve traditional methods of moderation to meet the scale, real-time and dynamic needs of a user generated content (UGC) platform. It’s the same focus we have had at Speechly. We share Roblox’s vision for safe and civil immersive communication and bringing more dynamic and nuanced interactivity to the platform through safe and civil voice features.

We want to thank the developers who have trusted Speechly for voice moderation needs, and our friends, family and customers who have been with us through this journey.

We hope to see you all on Roblox soon!


## VR and Voice Chat are a Perfect Match

Earlier this month, Apple introduced the Vision Pro at their 2023 Worldwide Developers Conference, reinvigorating the conversation around virtual (VR), augmented (AR), and mixed (MR) realities. With Meta’s Quest 3, the sequel to the world’s most popular headset, [due out this fall](https://www.meta.com/quest/quest-3/?utm_source=gg&utm_medium=ps&utm_campaign=20194373502&utm_term=meta%20quest%203%20vr&utm_content=660532752883&utm_funnel=dcap&gclid=CjwKCAjwhJukBhBPEiwAniIcNSVtFQPUgt4BfpUvQIo0rfVCTe5aEujDD8Aq3eIeEV1ow7ESwP2E2xoCylsQAvD_BwE&gclsrc=aw.ds), it’s an exciting time for developers to bring new virtual world experiences to market.
In the last few years, popular multiplayer games such as IRL Studios’ [Gym Class](https://www.oculus.com/experiences/quest/3661420607275144/), Ramen VR’s [Zenith: The Last City](https://zenithmmo.com/), and Big Box VR’s [Population: One](https://www.oculus.com/experiences/quest/2564158073609422/) have reached millions of players and critical acclaim. Social connected experiences like VRChat’s [VRChat Plus](https://hello.vrchat.com/) and [Rec Room](https://recroom.com/) are incredibly popular with millions of active users a month. VR as a platform continues to attract consumers: the market is on track to [grow 50% in 2023](https://en-gb.workplace.com/blog/the-future-of-vr#:~:text=According%20to%20Deloitte%2C%20the%20VR,7%20billion%20in%20global%20revenues.).

Each of the above titles shares a common thread: real-time communication. Voice chat is a fantastic way to meet this need, seamlessly fitting into the interaction paradigms of VR, preserving immersion, and boasting a low learning curve. Voice chat is simply a must-have for collaborative VR experiences.
Hosted solutions, where vendors operate and maintain the services, strike a good balance between flexibility, capability, and total cost of ownership. Additionally, the Unity engine remains an exceptional choice for VR developers due to its extensive cross-device support and rapid iteration capabilities. We’ll briefly explore the advantages and weaknesses of four popular Unity-compatible hosted solutions.

## Voice Chat Features to Consider

Voice chat support requires careful planning. Like most design considerations, the earlier in the project life cycle you address it, the easier it is to adapt and adjust. A common mistake is for a developer to put “voice chat integration” on a schedule late in production, only to find the chosen solution is hard to integrate cleanly.
So before you begin, carefully consider these areas:

* **Solution Architecture**. Is it peer to peer (p2p) or client to server? Each comes with constraints such as total number of voice participants, user bandwidth requirements, and moderation compatibility. Generally, p2p solutions have lower costs of entry but struggle to work well in common network setups like NATs with firewalls. Client/server setups are much more robust but require significant server bandwidth and computation resources, which raises costs. In return, they open up centralized features such as recording, transcription, or moderation.
* **Spatial Audio**. Spatialized audio modifies voices depending on the location and environment in a 3D virtual space. Voice chat solutions vary in terms of their support for 3D audio, ranging from non-existent to unlimited mixing and matching of spatialized and non-spatialized voices. The right voice chat experience is highly dependent on the style of VR application under consideration, so a clear understanding of your project’s desired user experience is paramount.
* **Developer Experience**. What do the APIs look like? Are they easy to adapt to your project? What are the paradigms and abstractions in play? VR platforms are varied, so a good solution abstracts that away and gives the developer a clean and consistent experience. The best way to reduce risk early on is to look closely at samples and tutorials and understand how the paradigms map to your project. For example, an MMO experience might have a population center where hundreds of users can talk. If the voice chat solution has a participant cap of 8 participants per group, that implies a creative implementation and larger level of effort than a solution that supports unlimited participants.
* **Total Cost of Ownership**. A simple fact of live services is they incur ongoing maintenance and costs. Voice chat vendors typically price on usage, which means the more popular your game, the more it will cost you. Relatedly, a good vendor will provide robust Service Level Agreements (SLAs) that promise uptimes, maintenance windows, defect resolutions, live support, issue turnaround times, and more. A great approach is to prepare a list of questions or concerns to put to a vendor’s pre-sales or developer support team. This will give you an excellent sense of customer care patterns, response times, and general levels of comfort dealing with a given solution.
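
To make the architecture and participant-cap trade-offs above more concrete, here is a rough back-of-the-envelope sketch in Python. It is a simplified model, not a description of how any particular vendor works: it assumes a full-mesh p2p topology and a forwarding server that does no audio mixing, and the function names (`p2p_mesh`, `client_server`, `groups_needed`) are hypothetical helpers invented for illustration.

```python
def p2p_mesh(participants: int) -> dict:
    """Assumed full-mesh p2p: every client sends its voice to every other client."""
    others = participants - 1
    return {
        "upload_streams_per_client": others,
        "download_streams_per_client": others,
        # Each pair of participants shares one connection.
        "total_connections": participants * others // 2,
    }


def client_server(participants: int) -> dict:
    """Assumed forwarding server: each client uploads once, the server fans out."""
    others = participants - 1
    return {
        "upload_streams_per_client": 1,
        # Assumes the server forwards individual streams rather than mixing them.
        "download_streams_per_client": others,
        "total_connections": participants,
        "server_outbound_streams": participants * others,
    }


def groups_needed(users: int, cap: int) -> int:
    """Minimum number of voice groups when each group holds at most `cap` users."""
    return -(-users // cap)  # ceiling division


if __name__ == "__main__":
    print(p2p_mesh(8))
    print(client_server(8))
    # A 200-user population center with a cap of 8 participants per group.
    print(groups_needed(200, 8))
```

Under these assumptions, client upload bandwidth grows with group size in the p2p case, while the client/server case moves that cost to the server. A low participant cap turns a single crowded space into many separate groups that your game logic has to manage.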

## Popular Voice Chat Solution Pros and Cons

With the above considerations in mind, let’s look more closely at 4 popular services on the market today. All solutions are client-server architectures, with support for spatial audio.

| Vendor                                              | Pros                                                                                                                                                                                                                                                                                                                                                                                                                                      | Cons                                                                                                                                                                                                                                                                                    |
| --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Photon Engine](https://www.photonengine.com/voice) | \- Simple API that is friendly to Unity development best practices like prefabs and drag-and-drop development <br/><br/> - A simple but realistic sample <br/><br/> - All relevant documentation is online and publicly accessible <br/><br/> - Full integration into the Unity audio subsystem <br/><br/> - Web-based service health dashboard                                                                                                   | \- The free tier is very limited with a hard cap <br/><br/> - Only Magic Leap and HoloLens VR platform support (at the time of this writing) <br/><br/> - No explicit uptime or maintenance window guarantees                                                                               |
| [Vivox](https://developer.vivox.com)                | \- Nearly unlimited number of users and user groups <br/><br/> - Web-based developer portal with usage data, SDK downloads, and more <br/><br/> - Generous free usage tier of up to 5000 peak users a month <br/><br/> - Robust SLA with 99.9% uptime <br/><br/> - Support through forums and help desk, with paid professional support available                                                                                                 | \- Developer portal is behind a login, restricting search engine results for documentation and best practices <br/><br/> - Custom audio capture and playback offers limited integration into Unity’s audio subsystem <br/><br/> - No explicit VR support <br/><br/> - Unity sample is limited  |
| [Agora](https://console.agora.io/)                  | \- Large number of supported platforms <br/><br/> - Step-by-step instructions with cut and paste ready code for common tasks <br/><br/> - Simple pricing model with per minute costs, with discounts available <br/><br/> - Robust knowledge base and active forums                                                                                                                                                                             | \- No ready-to-use sample <br/><br/> - Unity integration lacks prefabs, editor support, and other ecosystem comforts <br/><br/> - No explicit VR support                                                                                                                                    |
| [Normcore](https://normcore.io/dashboard/login)     | \- Unity exclusive means top-notch integration and development experience for Unity developers <br/><br/> - Explicit XR support and Unity audio integration with detailed information in articles like XR Avatars and Voice Chat <br/><br/> - Documentation is simple, search engine indexed, and accepts community fixes <br/><br/> - Robust web dashboard for tracking usage and app integrations <br/><br/> - Support through email or Discord | \- No explicit uptime or SLA <br/><br/> - Free tier is limited to 30 users, 10 rooms, and 1 hour which is unsuitable for production use                                                                                                                                                   |

## Get Started Now

These four solutions are solid options to consider, and are fast and easy to use in experiments. Virtual experiences are better in every way with voice communication, and it’s never been easier to get started with a third-party solution. With voice technology in place, exciting capabilities like real-time transcription, tonal analysis, recording, or other moderation techniques are achievable and lay a foundation for immersive, exciting virtual worlds.

### About the Author

*\* Matt is a veteran technology leader in and out of the gaming industry with contributions to games like Red Dead Redemption and Marvel Puzzle Quest. His most recent stint was at Vivox, a Unity Technologies brand, helping to bring voice chat to mobile and VR platforms. He writes about these topics as well as practical leadership lessons at [thelead.beehiiv.com](https://thelead.beehiiv.com/).*


Voice chat has become an expected feature in virtual reality (VR) experiences. However, there are important factors to consider when picking the best solution to power your experience. This post will compare the pros and cons of the 4 leading VR voice chat solutions to help you make the best selection possible for your game or social experience.

4 Voice Chat Solutions for Virtual Reality

SOC 2 Type II certification is an industry recognized standard that evaluates the security, availability, processing integrity, confidentiality, and privacy practices of service organizations. Achieving this certification involves an extensive audit process performed by an independent third-party auditor, ensuring that Speechly has implemented and maintained effective controls over an extended period.

The audit was conducted by [Prescient Assurance](https://www.prescientassurance.com/) from February to May 2023. Their audit of Speechly's systems and processes found no exceptions to the SOC 2 Type II industry standards as defined by the American Institute of Certified Public Accountants (AICPA).

## What does this mean for our customers?

With the SOC 2 Type II certification, Speechly ensures that client data is handled with the utmost care and protection. Clients can trust in Speechly's commitment to upholding industry-leading security standards and maintaining the confidentiality, integrity, and availability of their data. The five SOC 2 principles are as follows:

1. **Security** - Protect systems, data, and information from unauthorized access, disclosure, or destruction. Implement measures such as access controls, encryption, network security, and incident response.
1. **Availability** - Ensure systems and services are available for operation and use as agreed upon. Maintain adequate system uptime, minimize downtime, and establish disaster recovery and business continuity plans.
1. **Processing Integrity** - Process data accurately, completely, and in a timely manner. Ensure consistent, accurate, and reliable data processing without unauthorized alterations or omissions.
1. **Confidentiality** - Protect sensitive information from unauthorized disclosure. Implement controls to safeguard confidential data and restrict access to authorized individuals or systems.
1. **Privacy** - Comply with applicable privacy laws and regulations when collecting, using, retaining, disclosing, and disposing of personal information. Establish and adhere to privacy policies and procedures to protect individuals' personal data.

If you would like to request a copy of our SOC 2 Type II attestation or report, please contact us [via email](mailto:info@speechly.com). You can also read more about [Security at Speechly](/security).

Speechly has recently received SOC 2 Type II certification. This certification demonstrates Speechly's unwavering commitment to maintaining robust security controls and protecting client data.

Speechly Has Received SOC 2 Type II Certification

A report by the NYU Stern Center for Business and Human Rights highlights the prevalence of misogyny, racism, and other extreme ideologies in video game voice chat. The study suggests that even though the individuals spreading hate speech are a minority, they have a significant impact on gamer culture and real-life experiences. 

A key reason for the significant impact is the susceptibility of gamers to such viewpoints, particularly due to the presence of impressionable young people and the lack of [proactive moderation](https://www.speechly.com/products/moderation) in online games.

This was followed by an analysis of what game makers are doing or not doing about the problem, as the case may be. The [report](https://bhr.stern.nyu.edu/tech-gaming-report) reads somewhat like an indictment of the game makers' lack of action to address the problem. That seems a bit unfair, even though the core themes are on point. Still, it is a well researched and comprehensive analysis of the problem, and many of the findings align with Speechly’s own industry research. 

## Extremist Behavior is Common

The survey was conducted in the U.S., UK, France, Germany, and South Korea. A key finding is that 51% of online players encountered extremist statements in multiplayer games over the past year. This tracks closely with Speechly’s [consumer survey data](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report), where we found that 53% of gamers had experienced a toxic incident in text chat and 49% in voice chat.

![Experiences of Severe Harassment - Charts](/uploads/experiences-of-severe-harassment-charts-min.png "Experiences of Severe Harassment - Charts")

NYU says gaming companies claim to have taken steps to combat hateful content, including prohibiting extremist material and implementing detection systems to remove prohibited content. However, the report argues that the fast-paced nature of games and the sheer number of players make it challenging to monitor and regulate unlawful or inappropriate behavior effectively. It also suggests that many large gaming companies have been slow to take adequate steps to prevent the misuse of their platforms by bad actors, even though they have marketed and profited from features that are easily exploited without robust content moderation mechanisms.

## The Inadequacy of Reactive Moderation

Reactive moderation, which relies on user reports to identify and address problematic content, is an essential component of content moderation, according to the report. However, it is a flawed approach on its own because many users are either unable or unwilling to report troubling incidents. 

NYU’s survey found that only 38% of respondents who experienced severe harassment while playing online games reported the incidents to game publishers or developers. This result is very close to the 36% figure that Speechly found in a larger [U.S. survey](https://www.speechly.com/voice-chat-toxicity-report). 

This reliance on user reports is problematic and highlights the need for companies to improve their moderation capacity and investigate the reasons behind the low reporting rates. Speechly’s analysis also points to the fact that the underreporting situation is even worse than it might appear. The players that have reported incidents don’t report every incident. In reality, somewhere between 82% and 91% of incidents are never reported, and that means game makers have no idea they have even occurred. 

With that said, shifting to proactive moderation is a challenging proposition for many game makers. NYU’s report is very good at laying out the goal and reasons for pursuing proactive moderation. However, this change has been slow to materialize due to the historical prevalence of solutions that are high-cost and inadequate. This is a key reason that game makers recruited Speechly to help solve these problems. 

Speechly strongly recommends game makers implement proactive voice chat moderation. We also want to see them adopt solutions that actually perform well and don’t create a lot of additional work sorting through false positives and missed incidents. And it is in everyone’s interest to have solutions that are cost-efficient and don’t completely undermine the existing video game economic model. 

## NYU’s Recommendations 

To effectively combat extremist content, NYU recommends gaming companies invest in a combination of reactive and proactive moderation measures. Reactive moderation should involve the timely and reliable review of user-flagged content backed by clear explanations and appropriate enforcement actions. They say companies should leverage tools like AI-powered moderation platforms to scale up their reactive moderation efforts. However, certain issues can only be effectively managed by human reviewers, so companies must ensure they have enough in-house staff to promptly and reliably respond to user reports.

Proactive moderation, which involves detecting prohibited content before it is published or in real-time during a voice chat, is crucial in addressing the challenges of gaming platforms. On this point, the researchers suggest companies invest in automated detection systems and employ human investigators who use state-of-the-art tools, including large multilingual pre-trained datasets of extremist vocabulary. 

Implementing proactive enforcement in gaming platforms is particularly challenging due to the instant and ephemeral nature of interactions. However, NYU researchers believe the industry should increase its investment in real-time or near-real-time proactive moderation technology. This would involve utilizing advanced tools and techniques to detect and address extremist content as it happens.

There seems to be a growing consensus that eradicating extremist content from gaming platforms is a complex undertaking. The industry may not be able to fully eradicate extremist encounters in its games, but it can take steps to better address the spread of extremist propaganda and promote a safer and more inclusive gaming environment.

This is exactly what we do here at Speechly. We build [custom AI](https://www.speechly.com/blog/the-5-ai-technologies-you-need-for-voice-chat-moderation-in-games) speech recognition and natural language understanding models to proactively identify toxic behavior in voice chat for game makers. We can conduct a quick analysis of your voice chat toxicity incidents, help you define a plan to mitigate the problem, and implement both reactive and proactive [voice chat moderation solutions](https://www.speechly.com/products/moderation). 

If you would like to learn more about our work with some of the highest-profile game makers, click the Contact Us button below to see a demo. You can also review a case study [here](https://www.speechly.com/solutions/success-stories/gymclass).

A recent NYU report exposes how extremist actors exploit online game communication features. In this blog we expand on NYU's data and recommendations for maintaining safety and security in online gaming communities.

Countering Extremism in Online Games - New NYU Report

Xbox’s Transparency Report from the first half of 2022 reported over seven million actions taken against players for inappropriate content, comments, or activity. Last week’s [blog](https://www.speechly.com/blog/what-you-can-learn-from-the-player-journey-outlined-in-xbox-s-transparency-report) post focused on the qualitative elements and the moderation process details included in the report. Today we go deep into the data.  

The key headline is that incident actions rose over previous periods while player reports (complaints) fell. Xbox presents this as largely the result of increased proactive moderation activities. That is likely true in large part. However, there are some questions we had about the data that revealed interesting insights you will not find in the Transparency Report or the media coverage. 

1. What does the report tell us about how Xbox is managing player moderation, trust, and safety today? 
2. What does the data tell us about trust and safety problems in gaming, and what is being measured?
3. What do the headlines miss about the story that is not being told?

![Enforcements Proactive vs Reactive](/uploads/enforcements-proactive-vs-reactive.png "Enforcements Proactive vs Reactive")

## What is Meant by “Proactive Moderation”

Before we get into the data, some additional context will be helpful. Xbox highlights in the report that it has shifted more of its moderation activity to proactive methods to complement legacy reactive processes. The report comments:

> “To reduce the risk of toxicity and prevent our players from being exposed to inappropriate content, we use proactive measures that identify and stop harmful content before it impacts players. For example, proactive moderation allows us to find and remove inauthentic accounts so we can improve the experiences of real players. For years at Xbox, we’ve been using a set of content moderation technologies to proactively help us address policy-violating text, images, and video shared by players on Xbox… If content that violates our policies is detected, it can be proactively blocked or removed.”

You can see in the text that Xbox has implemented proactive moderation to identify “inauthentic accounts” and to screen “text, images, and video.” Note that voice chat and audio are not mentioned. This is not surprising. Tooling for text and video policy violations is mature and could be considered a basic standard of care. It is good that Xbox is working on improving these, but the absence of any mention of voice chat or audio suggests that voice moderation still relies largely on “reactive moderation” practices.

> “Proactive blocking and filtering are only one part of the process in reducing toxicity on our service. Xbox offers robust reporting features, in addition to privacy and safety controls and the ability to mute and block other players; however, inappropriate content can make it through the systems and to a player.”  

The “controls” offer “Child, Teen, and Adult” settings as well as some customization options. Muting and blocking offer additional controls:

> “If another player engages in abusive or inappropriate in-game or chat voice communications, you can mute that player. This prevents them from speaking to you in-game or in a chat session.
>
> “Blocking another player prevents you from receiving that player’s messages, game invites, and party invites. It also prevents the player from seeing your online activity and removes them from your friends list, if they were on it.”

## What the Data Show

The most striking data from Xbox’s transparency report is the 10x rise in proactive moderation enforcement. The figure in the first half of 2022 was 4.78 million compared to 461,000 in the previous six-month period. 

Interestingly, the reactive moderation figures did not show significant change and were slightly up from the previous period. The second half of 2021 showed 2.24 million reactive moderation enforcements, and the figure rose to 2.53 million in the first half of 2022. This most likely means that the rise in “proactive moderation enforcement” is not displacing existing “reactive moderation enforcement” but instead represents issues that previously went unnoticed. 

Looking into the data further, you see that 91% of all proactive enforcement was related to “Cheating / inauthentic accounts.” Only about 6% was related to toxic behavior.

![Proactive Enforcements by Policy Area](/uploads/proactive-enforcements-by-policy-area.png "Proactive Enforcements by Policy Area")

Cheating is clearly a big issue for many game makers that undermines the experience for honest players. It is good that Xbox is making progress on this issue. However, you should not look at the proactive moderation numbers and conclude that big strides have been made in reducing toxicity. Proactively identifying 199,000 sexual content incidents, 54,000 harassment and bullying incidents, and 46,000 unwanted profanity incidents may also indicate important progress. However, it seems likely that Xbox is seeing only a small percentage of the toxic behavior problems. 
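
The ratios quoted in this section can be checked with a few lines of Python. This is only a sketch of the arithmetic using the figures cited above. It assumes that the three incident categories listed (sexual content, harassment and bullying, unwanted profanity) are the ones that make up the roughly 6% toxic behavior share, and the variable names are illustrative.

```python
# Proactive enforcement counts as quoted from the Xbox report.
proactive_h1_2022 = 4_780_000
proactive_h2_2021 = 461_000

# Assumption: these three categories make up the "toxic behavior" share.
toxicity_incidents = {
    "sexual content": 199_000,
    "harassment and bullying": 54_000,
    "unwanted profanity": 46_000,
}

growth = proactive_h1_2022 / proactive_h2_2021
toxic_total = sum(toxicity_incidents.values())
toxic_share = toxic_total / proactive_h1_2022

print(f"Growth in proactive enforcement: {growth:.1f}x")         # 10.4x
print(f"Toxicity-related proactive incidents: {toxic_total:,}")  # 299,000
print(f"Share of proactive enforcement: {toxic_share:.1%}")      # 6.3%
```

The result lines up with the roughly 10x rise and the roughly 6% toxic behavior share described above.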

Speechly’s [consumer](/reports/voice-chat-toxicity-report) and industry research suggests that only 10% - 18% of voice chat toxic behavior incidents are reported by players. That means 82% - 90% is completely invisible to game platforms. Proactive monitoring is the only way to have a full view of the scale, scope, and nature of the problem. And it appears that no proactive moderation is in place for voice chat today. What if Xbox could make the same progress on voice chat toxic behavior as it appears to have achieved with cheating? That could have a profound impact on player safety and experience. 

Xbox offers some insight into the scale of the problem in another chart from the report. The company says that proactive moderation only accounts for 5% of incidents related to “drugs, profanity, hate speech, harassment or bullying, spam, advertising, or solicitation.” So, proactive moderation of toxic behavior is a subset of that 5% category.

![Percentage of Proactive Enforcements by Policy Area](/uploads/percentage-of-proactive-enforcements-by-policy-area.png "Percentage of Proactive Enforcements by Policy Area")

The implication here is that Xbox may, in theory, have visibility of up to 10% - 22% of toxic behavior in voice chat and is still missing the vast majority of the incidents. With that said, the data recorded in the chart is almost certainly related to only text chat and content uploads where the company has automated monitoring tools.  

## Communications Has the Most Reports

Xbox’s transparency report also shows that communications, such as voice and text chat, represent 46% of all player reports complaining about the activity of another user. Cheating and other “Conduct” related reports only represent 43% of complaints.

![Player Reports by Content Type](/uploads/player-reports-by-content-type.png "Player Reports by Content Type")

This offers insight into the consumer perspective on what has the biggest impact on player experience. They report “Communications” incidents–which are largely related to toxic behavior–at a higher rate than cheating. 

It may be that these problems are easier for them to identify, but what is reported is the tip of the iceberg. Consider this: if all of the toxic behavior incidents were reported, this base figure would be 5-10x higher and potentially represent 81% - 89% of all complaints. 
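
The 81% - 89% range above follows from a simple extrapolation, sketched here in Python. It assumes that only "Communications" reports are scaled by the 5-10x multiplier while all other complaint types stay at their reported level. The function name `extrapolated_share` is a hypothetical helper, not something taken from the Xbox report.

```python
# Share of player reports that relate to communications, per the Xbox report.
COMMUNICATIONS_SHARE = 0.46
OTHER_SHARE = 1 - COMMUNICATIONS_SHARE


def extrapolated_share(multiplier: float) -> float:
    """Communications share of all complaints if its volume grew by `multiplier`.

    Assumes every other complaint type stays at its reported level.
    """
    scaled = COMMUNICATIONS_SHARE * multiplier
    return scaled / (scaled + OTHER_SHARE)


for multiplier in (5, 10):
    print(f"{multiplier}x -> {extrapolated_share(multiplier):.0%}")
# 5x -> 81%
# 10x -> 89%
```

The multiplier itself is the inverse of the reporting rate. If only 10% of incidents are reported, full reporting would produce 10 times as many reports.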

In addition, it appears that player reports are declining. Xbox showed that only 33 million reports were submitted in the first half of 2022. That is down from nearly 60 million in late 2020, 52 million in the first half of 2021, and 42 million in the second half of 2021. Is this a good thing?

![Player Reports](/uploads/player-reports.png "Player Reports")

Note that this decline began before Xbox’s reported rise in proactive enforcement. So, it is hard to attribute the decline in reports to proactive enforcement. What seems more likely is that Xbox’s proactive enforcement is finding issues that previously went unreported. That is a good outcome. It is unclear what led to the reporting decline, but this could mean that Xbox has less visibility into incidents than in previous years.

## What Transparency Reports Mean for Gaming

Transparency reports are based on measuring the problems and explaining the process the game maker uses to address incidents. This is an important development for the industry. Games are now significant social experiences, and everyone from social advocates to regulators is interested in learning more about the scale, scope, and nature of the problems that show up in games. 

The first step is to measure the problem so you can make thoughtful steps to address issues that exist today and identify new issues when they arise. Xbox is taking the proactive step of providing measurement transparency, and we expect its efforts to represent the early stages of a new trend. The game industry will benefit further when more game makers follow suit. 

Speechly can help you measure the problem of voice chat toxicity in order to develop a plan to mitigate the impact or provide an accurate representation in your transparency report. If you would like to learn more, you can contact us anytime [here](https://www.speechly.com/contact).

The 2023 Xbox Transparency Report is (likely) around the corner. Our first blog broke down how the moderation process works at Xbox, but this blog will take a deep dive into the data from the inaugural report comparing Reactive vs Proactive moderation.

What You Can Learn from The Data in Xbox’s Transparency Report

We are expecting Xbox’s second Transparency Report sometime this month. That will cover the second half of 2022 and is expected to offer insight into two and a half years of data. However, it is worth looking back at the first report, which began with the statement:

> "At Xbox, we put the player at the center of everything we do – and this includes our practices around trust and safety. With more than 3 billion players around the world, vibrant online communities are growing and evolving every day, and it is our role to foster spaces that are safe, positive, inclusive, and inviting for all players, from the first-time gamer to the seasoned competitor."

![Player Journey](/uploads/player-journey.png "Player Journey")

## The Player Journey

Xbox outlined the player journey at the end of the report, but we wanted to highlight it first because it is among the least understood aspects of trust and safety processes for game makers. In a perfect world, the Xbox experience would involve four steps:

1. Console set up
2. Account creation
3. Onboarding
4. Gameplay

The first three steps are done only once, followed by a lifetime of enjoyable gameplay. Of course, that is not reality, and the next step is ominously listed as “incident occurs.” This is followed by branching logic that indicates whether you caused the incident or witnessed the incident. 

If you were the cause of the problem, the logic suggests you will receive some sort of enforcement which is indicated as a suspension. Xbox does not indicate that there may be permanent suspensions but focuses on the temporary penalties where the guilty can wait until they have fulfilled their debt to society and then resume gameplay. 

Alternatively, the accused can file an appeal which then proceeds to an investigation and a judgment issued by Xbox moderators. The player is notified of the judgment, and if cleared, they can renew gameplay. The diagram does not show the path of a rejected appeal, but most appeals will circle back to the “wait it out” step. Xbox reported that just 6.5% of all appeals in the January to June 2022 period led to an overturned enforcement decision. 

The appeals process is particularly important. A [study of online gamer experience](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) by Speechly and Voicebot Research found that over half of players say they have been falsely reported.

![Voice Chat Users Reporting False Accusations of Toxic Behavior](/uploads/voice-chat-users-reporting-false-accusations-of-toxic-behavior.png "Voice Chat Users Reporting False Accusations of Toxic Behavior")

## The Moderation Process

Victims or witnesses of these “incidents” can “take action” by filing a report. That is followed by an investigation and judgment by a moderator. 

Xbox points out in several places that it has implemented proactive content moderation tools to help “address policy-violating text, images, and video shared by players.” Proactive in Xbox terms means to prevent “players from being exposed to inappropriate content.” That means a suspension or another action can be taken as the result of a player report or an automated identification. According to the report:

> "Most often, this comes in the form of removing the offending content from the service and issuing the associated account a temporary 3-day, 7-day, 14-day, or permanent suspension. The length of the suspension is primarily based on the offending content, with repeated violations resulting in lengthier suspensions, an account being permanently banned from the service, or a potential device ban…
>
> At Xbox, violations of CSEAI (child sexual exploitation and abuse imagery), grooming of children for sexual purposes, or TVEC (terrorist and violent extremist content) will result in removal of the content and a permanent suspension to the account, even if it is a first offense. These types of cases, along with threats to life (self, others, public) and other imminent harms are immediately investigated and escalated to law enforcement, as necessary."

Xbox indicates in its support documents that individual games may impose other types of penalties based on their own policies and enforcement mechanisms that are “independent from Xbox.” These are called game-specific suspensions. 

The Appeals for Case Review process can lead to an enforcement decision being “confirmed, modified, or overturned.” Xbox says that it investigates every report and that just having a lot of reports against you will not lead to automatic suspension.

## What Data is Available for Investigations?

A key challenge with these assertions is that they assume moderators have data to consult during the investigation other than the player report. If the incident involved text chat, the moderator is likely to have evidence to consult in the chat logs. There may also be data associated with gameplay telemetry to support player complaints. And, if the incident occurs in party chat and is reported, Xbox may have a recording available for use during the investigation. However, if it occurs during in-game voice chat, they do not.

This is a key challenge. Voice chat has become a key vector for toxic behavior in online games. However, moderators rarely have audio recordings to consult during the investigation process. This can lead to missed, incorrect, and uneven enforcement. 

Very few companies have ever shared their player journey with a moderation process included. It’s as if Xbox assumes every player will face some sort of incident or be the perpetrator. This is a good assumption, given that over half of all players experience toxic behavior in voice or text chat, and there are other channels of violation that push these figures up further.

In our next post, we will break down Xbox’s numbers and how they shed additional light on the nature of the incidents, the company’s process for identifying them, and overall trends. 

Transparency reports are based on measuring the problems and explaining the process the game maker uses to address incidents. Speechly can help you measure the problem in order to develop a plan to mitigate the impact of toxic incidents or provide an accurate representation in your transparency report. If you would like to learn more, you can contact us anytime [here](https://www.speechly.com/contact).

Xbox will likely release their second Transparency Report this month. To prepare for the release, this blog post digs into the first Transparency Report, bringing specific attention to the Player Journey and the Moderation Process at Xbox.

What You Can Learn from The Player Journey Outlined in Xbox’s Transparency Report

Voice chat has become very popular in games. The rise of multiplayer games has influenced this trend, but the larger factor has been the evolution of online games into social experiences. Nearly half of players say they like games better with voice chat, and over 68% use the feature.

Game makers also appreciate voice chat because it can help build stronger bonds between players and deliver better retention, longer sessions, and more frequent play. For many game makers, this translates into higher average revenue per user (ARPU). 

However, we also know that voice chat is the biggest source of [toxic behavior](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report), far exceeding text chat, in-game play, and user-generated content. Those toxic incidents have direct negative impacts on the players, playing time, playing frequency, and sentiment toward the game. While game makers have had tools for monitoring toxic behavior in games and through text chat for some time, voice chat is largely a free-for-all. 

The solution to this is AI-enabled [voice chat moderation](https://www.speechly.com/blog/combating-voice-chat-toxicity-in-vr-games-speechly-and-gym-class). It is not uncommon for game industry professionals to be aware of some AI-based tools in use for text chat moderation, and others may have heard of similar technology for voice chat. However, most people don’t know that effective voice chat monitoring for gaming requires multiple AI-based capabilities. 

## Beyond Keywords to Conversational Context

Traditional moderation tools favor simplistic approaches such as keyword spotting. This technique is common for text chat moderation, and some people have tried it for voice chat. It can help you flag or redact a known list of bad words, but you are constantly chasing the novel bad word and just trying to catch up with the bad actors. And this approach will miss many toxic incidents while flagging some that are actually benign. 

The issue is context. The same word could be viewed as toxic or appropriate depending on what is happening in the conversation or the game. For example, "I'm going to kill you!” could be flagged as a threat of physical violence. However, this could also be a key objective of the game based on combat. Similarly, someone may say they are going to plant a bomb at the courthouse. Should the game maker notify law enforcement, or is it a known strategy in the video game? 

Both of these comments could result in a false positive result. That is when the system identifies a toxic incident when the comments are perfectly within the bounds of acceptable behavior for the game. False positives can be detrimental to a game because they can lead to false accusations and unjust penalties for players that are acting entirely in good faith. 
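To make the limitation concrete, here is a minimal sketch of keyword spotting applied to the two examples above. The blocklist and the `keyword_flags` function are hypothetical and exist only for illustration; they do not represent any particular moderation product.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKLIST = {"kill", "bomb"}


def keyword_flags(utterance: str) -> list[str]:
    """Return the blocklisted words found in an utterance, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [word for word in words if word in BLOCKLIST]


# Both lines can be ordinary callouts in a combat game, yet both are flagged.
print(keyword_flags("I'm going to kill you!"))            # ['kill']
print(keyword_flags("Plant the bomb at the courthouse"))  # ['bomb']

# A hostile remark that avoids listed words passes unflagged.
print(keyword_flags("Nobody wants you here, just quit"))  # []
```

The first two results are false positives in a combat game, and the third is a missed incident. A word list alone has no way to tell these cases apart because it never sees the game or conversation context.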

Misunderstandings can also arise from sarcasm, cultural differences, and misinterpreted accents. This is why custom AI models are often essential for voice chat moderation. It is also why a single technique generally doesn’t get the job done. 

## 5 Key Voice Chat Monitoring Technologies

Speechly was asked by several large game makers how to address voice chat toxicity without missing true positives or generating false positives. As we analyzed the problem, we were able to identify five techniques in three categories that can help to finally fill the voice chat moderation gap. 

The first category is accurately identifying what was said. This revolves around the transcript and identifying entity labels. The second category is related to meaning and includes semantic labels and tone-of-voice labels. Finally, there are other signals that are not words and are known as audio event labels. 

1. **Transcripts:** Transcripts are a written record of the conversation that took place during the voice chat. They allow moderators to review the conversation and identify any inappropriate behavior or rule violations that may have occurred. Transcripts are also used for additional AI-based analysis of what was said. 
2. **Entity Labels:** Entity labels refer to identifying and labeling specific people, places, organizations, and other topics mentioned in the conversation. They help moderation systems automatically identify and categorize potentially harmful or inappropriate content that violates the platform's policies. 
3. **Semantic Labels:** Semantic labels help the moderation system better understand the context and the meaning of the conversation to identify any potentially harmful or inappropriate content. They can also be used to help avoid false positives that might arise from considering individual words alone. 
4. **Tone-of-Voice Labels:** Tone-of-voice labels help the moderation system better understand the way things are said by a user. This can be useful in identifying when someone is becoming agitated or upset, which could lead to rule violations or inappropriate behavior. It can also help identify when a user is simply joking or using sarcasm.
5. **Audio Event Labels:** Audio event labels refer to labeling specific sounds or events that occur during the conversation. Audio Event Labels help provide further contextual information to the moderation system that goes past the spoken word alone and identify issues that would otherwise go unnoticed.
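One way to picture how the five signals fit together is as a single record per utterance that downstream rules can inspect. The sketch below is an assumption made for illustration: the `UtteranceAnalysis` class, the label strings, and the `needs_review` rule are all hypothetical and are not Speechly's data model.

```python
from dataclasses import dataclass, field


@dataclass
class UtteranceAnalysis:
    """Signals for one voice chat utterance, one field per technique."""
    transcript: str                                               # what was said
    entity_labels: list[str] = field(default_factory=list)       # people, places, organizations
    semantic_labels: list[str] = field(default_factory=list)     # meaning in context
    tone_labels: list[str] = field(default_factory=list)         # how it was said
    audio_event_labels: list[str] = field(default_factory=list)  # non-speech sounds


def needs_review(analysis: UtteranceAnalysis) -> bool:
    """Hypothetical rule: escalate threats unless context marks them as gameplay or joking."""
    is_threat = "threat" in analysis.semantic_labels
    is_gameplay = "game_objective" in analysis.semantic_labels
    is_joking = "joking" in analysis.tone_labels
    return is_threat and not (is_gameplay or is_joking)


combat_callout = UtteranceAnalysis(
    transcript="I'm going to kill you!",
    semantic_labels=["threat", "game_objective"],
    tone_labels=["excited"],
)
real_threat = UtteranceAnalysis(
    transcript="I know where you live",
    semantic_labels=["threat"],
    tone_labels=["angry"],
)

print(needs_review(combat_callout))  # False
print(needs_review(real_threat))     # True
```

The point of the design is that no single field decides the outcome. The transcript alone would flag the combat callout, while the semantic and tone labels supply the context that clears it.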

It is understandable that [game makers](https://www.speechly.com/blog/combating-voice-chat-toxicity-in-vr-games-speechly-and-gym-class) would first look to what worked for them in text chat when considering how to address toxicity in voice chat. They usually figure out quickly that these techniques fall far short of meeting their moderation and mitigation objectives. The optimal solution involves a portfolio of [AI-based features](https://www.speechly.com/products/moderation) used in concert. 

If you would like to learn more about any of these AI-driven techniques, feel free to contact our product team using our [Contact Form](https://www.speechly.com/contact?ref=https://www.speechly.com/).

The rapid rise of multiplayer online gaming has resulted in video games becoming social experiences. Voice chat has become an important communication channel to facilitate this social experience, but also the top channel for toxic behavior. Luckily there are 5 AI technologies to help overcome this toxicity.

The 5 AI Technologies You Need for Voice Chat Moderation in Games

Speechly was recently featured in [The Wall Street Journal](https://www.wsj.com/articles/ai-bots-listen-in-on-the-toxic-world-of-videogame-voice-chat-e0260392?mod=e2tw). Although it was an honor to receive recognition from such a respected source, what's even more noteworthy is that a prominent general business news publisher is shedding light on the issue of toxicity in video game voice chats. The fact that people outside the online gaming community are taking notice of this problem is significant.

At the same time, it is clear that there is little understanding both inside and outside the gaming industry about the scale, scope, and nature of the toxicity problem in voice chat. The article says new AI technology can mute or ban players automatically, but this is an oversimplification of the problem, and the rules behind this type of solution are not trivial. A silver lining here is the suggested muting feature and interest in becoming more proactive about moderating voice chat toxicity. 

## Wait! How Bad is This Problem?

Most people are surprised to learn that nearly [72% of in-game voice chat users](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) have experienced a toxic incident. At the same time, nearly two-thirds of gamers that have experienced toxic behavior have never reported an incident, and even those that have do not report the events every time. According to The Wall Street Journal’s Sarah Needleman:

> “Traditionally, game companies have relied on players to report problems in voice chat, but many don’t bother and each one requires investigating.”

Since game makers only know about toxic behavior when a player submits a complaint, few have any idea about how bad the problem is or how it manifests in their game community. 

## Measure First, Mute Later

![Speechly Voice Analyzer - Dashboard](/uploads/speechly-voice-analyzer-dashboard.png "Speechly Voice Analyzer - Dashboard")

It’s not that surprising that people focus on real-time event flagging and the ability to intervene quickly. These features were not practical until just recently, and it is kind of magical to have the system do everything for you. However, we typically suggest that game makers first measure and analyze their voice chat for toxic behavior before implementing these solutions.

Our goal is to help game makers solve this problem cost-efficiently using the latest AI innovations, some of which are found in Speechly patents. However, we don’t assume the best course of action will be automated muting or banning toxic players or streamlining the investigation process for moderators. We look at the data, customize our AI models to the game and its specific toxicity problems, and apply them in the most effective way to meet the game maker's objectives. This could be a fully automated voice chat moderation solution, a tool to help flag and provide additional context for human moderators, or a combination of the two.

If you are a game maker and would like to talk to us about your voice chat audio data, we’d like to hear from you. Also, if you are interested in learning more about our work in Voice Chat Moderation, check out the following content:

* A Case Study on analyzing and mitigating voice chat toxic behavior for Gym Class, a leading game in the Meta Quest store, can be found [here](https://www.speechly.com/blog/combating-voice-chat-toxicity-in-vr-games-speechly-and-gym-class). 
* A demo of real-time voice toxicity monitoring in action is available [here](https://demos.speechly.com/moderation/index.html?ref=https://www.speechly.com/demos). 
* Feel free to just ask any question [here](https://www.speechly.com/contact). 

You can also learn about the gamer perspective on in-game voice chat toxicity in our 60-page report with over 40 charts and diagrams. [Download now](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report).

Speechly was recently featured in The Wall Street Journal. While it was an honor to be recognized by a prestigious publication, it is even more notable that voice chat toxicity in video games is being covered by a leading general business news publisher. People outside of online games are noticing the problem with voice chat in games.

Speechly in The Wall Street Journal - Awareness Rises for Voice Chat Toxicity in Games

While OpenAI has published Whisper accuracy numbers for some English open source data sets, there is relatively little information on performance for other languages. Furthermore, the most common open source benchmarks, such as Common Voice and LibriSpeech, are rather clean audio, captured in relatively good acoustic conditions, and contain well articulated speech. Transcription in real life use cases is typically messier. The audio often has poor acoustic conditions and articulation, thick accents, hesitation, overlapping speech, and so on.  These factors all made it attractive to conduct a more robust analysis of Whisper performance across model sizes, languages, and audio quality.

To test the models, we manually transcribed 5 hours' worth of YouTube videos in different languages to establish the ground truth. YouTube videos naturally contain the aforementioned “messiness,” and therefore the word error rates (WER) obtained with YouTube are perhaps a better proxy than an open source benchmark for what you might expect from typical in-the-wild transcription scenarios. We used the YouTube data to test different-sized Whisper multilingual speech recognition models, comparing their transcripts to the ground truths to calculate WER. We also computed the relative word error rate reduction between Whisper small and medium, denoted WERR: S → M.
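For readers unfamiliar with the metrics, the sketch below shows one common way to compute them. It assumes the usual definitions: WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and the relative reduction is (WER small − WER medium) / WER small. The function names are illustrative, and the text normalization applied before scoring in the actual evaluation is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(wer_small: float, wer_medium: float) -> float:
    """Assumed definition of WERR: S -> M, the share of small-model errors removed by medium."""
    return (wer_small - wer_medium) / wer_small


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333

# German in the results table: small 0.21, medium 0.18.
print(round(relative_wer_reduction(0.21, 0.18), 2))  # 0.14
```

Because WER counts insertions, it can exceed 1.0 when a model outputs far more words than the reference contains. Reductions recomputed from the rounded table values may also differ slightly from the published column, which was presumably derived from unrounded rates.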

The resulting word error rates are presented in the table below:

|            | large | medium | small | base  | tiny | WERR: S → M  |
| ---------- | ----- | ------ | ----- | ----- | ---- | ------------ |
| English    | 0.15  | 0.17   | 0.17  | 0.20  | 0.23 | 0.00         |
| Italian    | 0.16  | 0.17   | 0.22  | 0.33  | 0.46 | 0.24         |
| German     | 0.18  | 0.18   | 0.21  | 0.27  | 0.37 | 0.14         |
| Spanish    | 0.19  | 0.19   | 0.20  | 0.28  | 0.37 | 0.07         |
| French     | 0.26  | 0.26   | 0.29  | 0.37  | 0.47 | 0.09         |
| Portuguese | 0.25  | 0.28   | 0.28  | 0.39  | 0.48 | 0.02         |
| Japanese*  | 0.29  | 0.30   | 0.34  | 0.44  |      | 0.11         |
| Danish     | 0.30  | 0.30   | 0.41  | 0.64  | 0.83 | 0.25         |
| Swedish    | 0.29  | 0.31   | 0.38  | 0.51  | 0.64 | 0.19         |
| Indonesian | 0.31  | 0.31   | 0.38  | 0.52  |      | 0.17         |
| Greek      | 0.29  | 0.31   | 0.44  | 0.62  | 0.79 | 0.29         |
| Chinese*   | 0.33  | 0.33   | 0.35  | 0.44  |      | 0.06         |
| Thai*      | 0.34  | 0.34   | 0.52  | 0.59  | 0.71 | 0.34         |
| Tagalog    | 0.36  | 0.37   | 0.48  | 0.70  | 0.87 | 0.24         |
| Korean     | 0.40  | 0.40   | 0.44  | 0.51  |      | 0.09         |
| Norwegian  | 0.42  | 0.42   | 0.46  | 0.75  | 0.93 | 0.09         |
| Finnish    | 0.41  | 0.43   | 0.53  | 0.70  | 0.85 | 0.19         |
| Arabic     | 0.52  | 0.53   | 0.61  | 0.75  | 0.88 | 0.14         |
| Hindi      | 0.60  | 0.67   | 1.04  | 1.08  |      | 0.35         |

_* Character error rate instead of word error rate._

The top-performing languages for Whisper transcription accuracy are English, Italian, German, and Spanish. Mid-performing languages include French, Portuguese, and Japanese, while the worst-performing languages are Arabic and Hindi.

It is worth noting that the small model often offers the best value for money. There are only slight gains in running the large or medium models in most languages. However, there are some exceptions where the medium model does provide relevant accuracy gains. Languages such as Italian, Danish, Greek, Thai, Tagalog, and Finnish show a noticeable improvement in accuracy when using the medium model compared to the small model.

Additionally, the large model does not provide significant accuracy gains over the medium or small models for most languages. This suggests that, in general, the small and medium models offer the best balance between cost and performance.

_* Actually, Whisper does offer Dutch, but we just couldn't resist the temptation_ 😎

OpenAI has generated a lot of interest in its Whisper automatic speech recognition (ASR) system since launching the open source model in September 2022. However, there is little data about Whisper's in-the-wild performance across languages and models. To fill this gap, we tested several Whisper models against manually transcribed YouTube videos for 19 different languages.

Analyzing OpenAI's Whisper ASR Accuracy: Word Error Rates Across Languages and Model Sizes

It is no secret that video game voice chat is a channel for toxic behavior. However, there was an absence of in-depth data about the problem. Speechly knew from its work monitoring and analyzing voice chat for game makers that the data showed a different, larger, and more nuanced issue than industry leaders recognized. 

To gain a broader perspective, Speechly commissioned consumer research to find out how toxic behavior in voice chat impacts player experience and perceptions. Otto Söderlund, CEO of Speechly, was recently interviewed by Voicebot Research about the findings. The video below includes multiple charts, data points, and insights that tell a more comprehensive story of the good, the bad, and the ugly of online game voice chat.

<YouTube videoId="0_UwUq-pe38" />

The full Voice Chat Toxicity Report for Online Games includes over 40 charts and 50 pages of analysis. You can download a copy by clicking the button below. 

<Button href="/reports/voice-chat-toxicity-report" variant="accent">Click Here to Download Report</Button>

Speechly recently released the Voice Chat Toxicity Report for Online Games, a survey of over 1,000 online gamers on consumer perspectives and sentiment toward toxicity in video games. In this interview, Otto Söderlund (Co-Founder and CEO at Speechly) and Bret Kinsella (Founder of Voicebot.ai) sat down to dig into the key results.

Voice Chat Toxicity in Games: What the Data Say

The Game Developers Conference (GDC) 2023 was packed with announcements, sessions, and game industry professionals. It also offered me the chance to speak with more than two dozen people about their views on the persistent issue of toxicity in online games. 

Developers, product managers, and trust and safety professionals are increasingly aware of the negative impact that toxic behavior has on player experience, particularly in voice and text chat communications. Reports from ADL, Pew, The Wall Street Journal, and [Speechly](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) all confirm the problem is significant and complex. 

However, opinions differ about the best approach to address the issue. Last week, most of the perspectives I heard fell into one of four categories: toxic islands, user control, moderator assistance, and [proactive moderation](https://www.speechly.com/products/moderation). 

## 1. Create Toxic Islands

The toxic island approach is based on three ideas: persistently toxic players are few, they are likely to be okay with toxicity from others, and keeping them away from mainstream players will reduce overall harm. When players receive multiple reports of toxic behavior, they are sequestered on a “toxic island” and can only play with others who also have the same stigma.

While this approach doesn’t reduce the incidence of toxicity, it decreases the frequency that mainstream players are subjected to bad behavior. The toxicity still exists. It is just more likely to be directed at other players that are also labeled as toxic. And it will almost certainly reduce the number of complaints submitted, which is a relief to many moderation and customer service teams. This method has also been used by some game makers to address alleged cheaters.
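As a rough illustration of the mechanics, the sketch below splits players into two matchmaking pools by report count. The threshold of three reports, the pool names, and the `assign_pools` function are assumptions made for this example; the people I spoke with did not describe a specific rule.

```python
# Hypothetical cutoff: the source only says "multiple reports".
REPORT_THRESHOLD = 3


def assign_pools(report_counts: dict[str, int],
                 threshold: int = REPORT_THRESHOLD) -> dict[str, list[str]]:
    """Place each player in the 'island' pool at or above the threshold, else 'mainstream'."""
    pools: dict[str, list[str]] = {"mainstream": [], "island": []}
    for player, reports in report_counts.items():
        pool = "island" if reports >= threshold else "mainstream"
        pools[pool].append(player)
    return pools


reports = {"ava": 0, "ben": 5, "cho": 1, "dev": 3}
print(assign_pools(reports))
# {'mainstream': ['ava', 'cho'], 'island': ['ben', 'dev']}
```

Matchmaking would then draw sessions only from within a single pool. A rule based on raw report counts inherits every weakness of player reporting, so a real system would more plausibly count confirmed violations than unverified reports.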

## 2. Let the User Control the Experience

Some developers believe that the problem can be solved with more user controls. They argue that encouraging users to play exclusively with friends and enabling them to mute toxic players in other circumstances provides an adequate response to toxicity that is found in gaming communications.

This is attractive to some game titles because simply adding muting functionality is far easier than managing moderation technology and human-in-the-loop processes. Granted, selective muting can significantly impact the game experience for every player in a session if everyone is not hearing the same communications. It also raises the question: should gamers be solely responsible for managing toxicity while playing games online?

## 3. Use AI to Assist Moderators

Others argue that AI solutions could be employed to improve the moderation process. This group believes, for a variety of reasons, it is impractical to conduct proactive voice chat monitoring and raises concerns about false positives that could bog down the moderation process. 

However, they also acknowledge that customer service and community management teams are often overwhelmed by user reports, which can lead to lengthy enforcement times, inconsistent enforcement, and little impact on the behavior of rule-breaking players. Their argument favors AI solutions to analyze voice chat exchanges and help human moderators become more efficient, accurate, and consistent in their decision-making.

A key starting point is to record and/or transcribe voice chats in a way that protects player privacy but also provides data that can be analyzed by specially trained AI tools. The data are also important because otherwise there is no evidence for moderators to use in the complaint investigation process.

## 4. Use AI to Proactively Flag Toxic Behavior

A fourth group suggests that the problem can only be solved with proactive monitoring and enforcement. They point out that a key challenge in combating toxicity is the slow feedback loop of the moderation process. Today, game makers only know about voice chat toxicity if a complaint is submitted, which means they are always going to be two steps behind the bad actors. 

Monitoring could be implemented in real-time, similar to text chat. However, the process is a bit different, as the redaction of bad words is not practical for voice chat communications. In addition, voice chat toxicity can only be accurately identified when the context of the game and conversation is taken into account. This approach is about rapid intervention and also about having a complete view of the scale, scope, and nature of toxicity present in the game’s community.

AI can be used to automate moderation decisions, such as automatically muting or kicking a player from a game. Proactive moderation also helps flag toxic behavior that should be reviewed by moderators.

This is important as almost all voice chat moderation practices today rely on user-generated complaints to kick off an investigation. We know from [our research](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) that only a small percentage of toxic incidents are followed by a complaint. That means a lot of the bad behavior falls through the cracks and is never acted upon without some form of proactive monitoring.

## Where We Are Headed

Combating cheating remains the top priority for game studios. Addressing toxicity in chat communications has emerged as a close second. Game developers have invested heavily in anti-cheating technologies and policies, as well as text chat moderation, but voice chat moderation has received significantly less attention. 

That inaction is often based on the game makers not knowing what course of action to take, lack of tools to proactively address the problem, and concerns about cost. The result is a [voice chat moderation gap](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation) for online games. 

Many companies at GDC wanted to talk to Speechly because several top studios engaged us over the past year to help [address those issues](https://www.speechly.com/blog/combating-voice-chat-toxicity-in-vr-games-speechly-and-gym-class). We have learned a lot about the gaps and the nature of the problem from our work and many conversations with industry professionals. If you are considering how to best combat voice chat toxicity in your game, we would be happy to share our learnings.

The Game Developers Conference (GDC) was back in full force for 2023. This came with new product announcements, plenty of content to consume, and nearly 30k gaming industry professionals all gathered in San Francisco. This also gave me the opportunity to gather a wide-ranging perspective on the persistent issue of toxicity in online games.

Four Perspectives on How to Combat Voice Chat Toxicity in Games: A Look Back at GDC 2023

Gym Class VR is a basketball game that was preparing to launch on Meta Quest. In addition to fun game mechanics, it also has a voice chat feature that makes it a social experience. However, there was also some evidence of toxic behavior emerging in voice chat and the company didn’t know whether the problem was widespread or mostly isolated incidents. 

The team was also very serious about building a healthy social space as part of the game experience. To do this, it needed a way to measure the problem and put controls in place to weed out toxic behavior. 

Gym Class had tried popular cloud service providers for speech recognition to see if it could establish a baseline measurement for toxic behavior. However, at a market price of $1 per hour for transcription, those solutions turned out to be cost-prohibitive. So, Gym Class began looking for a new solution.

## The Problem with Toxic Behavior

Toxic behavior can create a cascading series of problems if it is not addressed. Initially, it can undermine the game experience for a few players. If left unchecked, over time, it can shape the game’s community culture and leave some players feeling unwelcome and uncomfortable. 

In addition, it is harder to get five-star reviews when a few bad actors are undermining the game experience. Worse still, these bad experiences can lead directly to one-star reviews, which can turn prospective players off before even trying the game. This is particularly frustrating for game makers when bad reviews emerge that have nothing to do with the game itself but are driven by a handful of toxic players. 

Data from Apptentive and other providers show that there is a direct correlation between app store star ratings and new user acquisition. This is partially driven by how users search for new games on their own and also by how the app stores rank game titles. Star ratings matter. Gym Class knew that lowering toxicity would meet their goals from a game culture standpoint and also could translate into better star ratings. 

## The Proactive Imperative

Gym Class’ goals were pretty straightforward. The company first needed to measure the level of toxicity in the game. That understanding could then be used to decrease the amount of toxicity, improve player experience, and support a successful launch in the Meta Quest app store. 

Most game makers today handle voice chat moderation strictly through a complaint-led model. That means they are only aware of toxicity that is reported. A recent consumer survey of U.S. online gamers found that only about 36% of victims of toxic incidents originating in voice chat have ever filed a complaint. Even those who have filed a complaint don’t do it for every incident. 

![Voice Chat Toxicity - Filed Complaint](/uploads/voice-chat-toxicity-victims-filed-complaint.png "Voice Chat Toxicity - Filed Complaint")

The implication for Gym Class was clear. A complaint-led process would miss the vast majority of incidents. The company would need to take a proactive approach that involved monitoring voice chat sessions for toxicity. This would enable Gym Class to more effectively measure the problem and figure out the best way to eliminate toxic behavior where practical and mitigate the impact when it did occur. 

## The On-device Solution

Gym Class already knew it needed a highly accurate automated speech recognition (ASR) solution. Speech recognition and transcription accuracy are the first steps in any monitoring of natural language conversations. The company also wanted to ensure it correctly identified toxic incidents by taking context into account, so it didn’t miss cleverly disguised toxic behavior. And the context-based analysis was important to mitigate the likelihood of false positive events which arise when a benign statement is flagged as toxic.
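The two kinds of error described above, missed toxic incidents and benign statements flagged as toxic, are commonly summarized as recall and false positive rate. The sketch below uses the standard textbook definitions; the counts are hypothetical and are not taken from Gym Class data, and the exact definition behind any published benchmark figure is an assumption here.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of genuinely toxic incidents that the system flagged."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of benign statements that were wrongly flagged as toxic."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical evaluation set: 100 toxic and 1,000 benign utterances.
example_recall = recall(true_positives=78, false_negatives=22)
example_fpr = false_positive_rate(false_positives=2, true_negatives=998)

print(f"recall: {example_recall:.1%}")
print(f"false positive rate: {example_fpr:.1%}")
```

A high recall means fewer toxic incidents slip through, while a low false positive rate means fewer players are wrongly flagged.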

Given that Gym Class has several unique aspects of its VR game mechanics and culture, it was going to need a custom AI model to drive high accuracy. It also became clear that the only economically feasible solution would be to run the monitoring on user devices as part of the downloaded app. 

If you run transcription through a cloud provider, you are paying for all of the data processing. For any individual gamer utterance, it may not be exorbitantly expensive, but the costs add up quickly for any game with a significant user base and frequent voice chat use. The cloud provider option mentioned earlier added $1 of cost for every player hour. 
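To see how quickly that adds up, here is a rough back-of-the-envelope sketch. The $1 per player hour figure comes from the text above; the player count and daily voice chat time are hypothetical values chosen only for illustration.

```python
CLOUD_COST_PER_PLAYER_HOUR_USD = 1.00  # figure quoted in this article


def monthly_cloud_transcription_cost(
    daily_active_players: int,
    voice_hours_per_player_per_day: float,
    days: int = 30,
) -> float:
    """Estimate monthly cloud transcription spend in USD."""
    player_hours = daily_active_players * voice_hours_per_player_per_day * days
    return player_hours * CLOUD_COST_PER_PLAYER_HOUR_USD


# Hypothetical game: 10,000 daily players, 30 minutes of voice chat each.
monthly_cost = monthly_cloud_transcription_cost(10_000, 0.5)
print(f"${monthly_cost:,.0f} per month")
```

Under these assumed numbers, transcription alone would cost $150,000 per month before any moderation work is done.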

However, if you run the speech recognition locally on the user device, you only need to send messages to the game makers’ servers when an incident is detected. This turns out to be an order of magnitude less expensive than using a cloud provider. The approach also means proactive monitoring is suddenly economically feasible. 
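The on-device pattern can be sketched roughly as follows. This is a minimal illustration and not Speechly's implementation: `monitor_on_device`, `IncidentReport`, and the toy word-list detector are hypothetical names, and a real system would use a context-aware model rather than a word list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class IncidentReport:
    player_id: str
    transcript: str


def monitor_on_device(
    player_id: str,
    utterances: Iterable[str],
    is_toxic: Callable[[str], bool],
    send_report: Callable[[IncidentReport], None],
) -> int:
    """Check each locally transcribed utterance; upload only flagged ones.

    Returns the number of reports sent to the server.
    """
    sent = 0
    for text in utterances:
        if is_toxic(text):
            send_report(IncidentReport(player_id, text))
            sent += 1
    return sent


# Toy stand-in for a real toxicity model (assumption, for illustration only).
BLOCKLIST = {"badword"}


def toy_is_toxic(text: str) -> bool:
    return any(word in BLOCKLIST for word in text.lower().split())


outbox: List[IncidentReport] = []  # stands in for the game maker's server
sent = monitor_on_device(
    "player-1",
    ["nice shot", "you badword", "pass the ball"],
    toy_is_toxic,
    outbox.append,
)
print(f"{sent} of 3 utterances left the device")
```

Only the flagged utterance is sent, which is why server-side costs scale with incidents rather than with total voice chat hours.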

## Comparing ASR Models

Gym Class asked Speechly to benchmark several ASR solutions to assess cost and performance. This evaluation included two cloud providers, a Whisper on-prem deployment, one Speechly on-prem model, and one Speechly on-device model. Speechly was the only ASR evaluated on-device as the cloud providers do not offer this option, and the Whisper model was too large to be feasible for the use case requirements. The results showed a strong rationale for implementing an on-device solution.

|                    | Recall % | False positive | Model size | Cost / audio hour |
| ------------------ | -------- | -------------- | ---------- | ----------------- |
| Google             | 69.9%    | 0.2%           | N/A        | Highest cost      |
| Azure              | 75.5%    | 0.3%           | N/A        | Highest cost      |
| Whisper on-prem    | 76.9%    | 0.3%           | 1400 MB    | 70% lower         |
| Speechly on-prem   | 77.9%    | 0.2%           | 70 MB      | 90% lower         |
| Speechly on-device | 77.9%    | 0.2%           | 70 MB      | 95% lower         |

The results made clear that Speechly’s custom ASR model, both on-device and on-prem, provided better accuracy in terms of Recall (i.e. identifying true positive toxic behavior). False positives were near zero and on par with or below other AI model implementations. In addition, Speechly costs were 90% to 95% lower than cloud deployments and one-third to one-sixth the cost of an OpenAI Whisper implementation. 
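The one-third to one-sixth comparison follows directly from the cost column in the table. A short check, assuming each "% lower" figure is measured against the same highest-cost cloud baseline:

```python
from fractions import Fraction

# Cost reductions versus the highest-cost cloud option, from the table above.
reduction = {
    "whisper_on_prem": Fraction(70, 100),
    "speechly_on_prem": Fraction(90, 100),
    "speechly_on_device": Fraction(95, 100),
}

# Remaining cost as a fraction of the cloud baseline.
relative_cost = {name: 1 - r for name, r in reduction.items()}

# Speechly cost expressed as a fraction of the Whisper on-prem cost.
vs_whisper = {
    name: relative_cost[name] / relative_cost["whisper_on_prem"]
    for name in ("speechly_on_prem", "speechly_on_device")
}

for name, ratio in vs_whisper.items():
    print(f"{name}: {ratio} of Whisper on-prem cost")
```

The on-prem model comes out at 1/3 and the on-device model at 1/6 of the Whisper cost.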

Cost is a key barrier to using cloud providers for these types of applications. However, the analysis for Gym Class revealed that the generalized cloud models also had lower accuracy, and Azure showed a higher false positive rate. It is hard for generalized AI models to compete with customized models in terms of accuracy for use cases as specific as a particular game. In the end, Speechly’s custom speech recognition models offered higher accuracy in addition to being smaller and more cost-efficient.

On-device deployments with this level of accuracy were not an option for game makers just a couple of years ago. The reason is the package size for a custom speech recognition model was simply too large to include in a game’s executable file. Recent advances in speech recognition around optimization for application size and processing requirements have made this approach viable today for everything from PCs and consoles down to mobile devices and VR headsets. 

Speechly has been at the forefront of this change, particularly in deploying custom ASR models directly on devices. Our research has focused on building better-than-cloud-grade speech recognition that can be deployed on-device, on-prem, or in the cloud. 

## The Results

Implementing Speechly’s voice chat monitoring solution enabled Gym Class to proactively address toxic behavior and execute a successful Meta Quest store launch. And it was made possible at a reasonable cost. 

Gym Class VR’s toxic incident rate and complaints today are just a fraction of what they were before the solution was implemented, and the company has a clear method for measuring incident rates. Now, the game can be judged on the merits of its mechanics and how fun it is for players without the risk of a few users undermining the experience for everyone else. 

Today, Gym Class VR has a 4.9-star rating on the Meta Quest store and over 28,000 positive reviews. Data from VRDB.app show it became the highest-rated experience in the entire store in March 2023. You should give it a try.

Gym Class VR is a basketball game that was preparing to launch on Meta Quest after a very successful Beta. Voice chat is an important social element of the game, but the team noticed evidence of toxic behavior emerging. After trying speech recognition from cloud service providers, they quickly learned this was a cost-prohibitive approach and turned to Speechly.

Combating Voice Chat Toxicity in VR Games: Speechly and Gym Class

Speechly just published the results of a [national consumer survey](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) of online gamers about their experiences with toxic behavior in voice and text chat. Offensive names, trolling, bullying, and other annoying behavior topped the list for the broadest impact in both spoken and written communications in games. 

However, you may note that gamers have experienced these bad behaviors 50% to 200% more often in voice chat for each category. And how players react to these incidents differs by the type of offense.

<Button href="/reports/voice-chat-toxicity-report" variant="accent">Download the Voice Chat Toxicity Report for Online Games</Button>

![Toxic Behavior Incidents by Offense Category](/uploads/toxic-behavior-incidents-by-offense-category.png "Toxic Behavior Incidents by Offense Category")

This variance in the frequency of toxic behavior clearly influenced other results as well. Gamers rated voice chat toxicity as significantly worse than text chat, and they experienced an average of 35% more incidents per victim.

## Player Impact

Over two-thirds of gamers change their behavior immediately after experiencing a toxic incident in voice chat. About 40% turned off voice chat, and 28% stopped playing that day. Only 29% said their gameplay was unaffected by the incident. 

![Player Behavior After Toxic Incident](/uploads/player-behavior-after-toxic-incident.png "Player Behavior After Toxic Incident")

The longer-term impact is even more troubling for game makers. Almost 39% of players say they reduced play or quit using the game after experiencing toxic behavior in voice chat. These incidents clearly have significant impacts on victims’ behavior. The incidents also impact players’ perception of the game.

![Player Usage after Toxic Incident](/uploads/player-usage-after-toxic-incident.png "Player Usage after Toxic Incident")

## Different Behavior, Different Impacts

Some game makers have told Speechly that they do not differentiate between different forms of toxic behavior, preferring to treat every incident as equally bad. However, players clearly don’t react the same way in response to these incidents. The type of toxicity matters. 

For example, gamers revealed that reduced play or game abandonment is far more common after stalking and sexual harassment incidents than for name-calling, trolling, and the use of explicit language. While 38.7% of victims of any type of toxic behavior in voice chat will reduce play or abandon the game after the incident, the figure is 52.5% for stalking and 50.6% for sexual harassment.

![Top 5 Incident Categories](/uploads/top-5-incident-categories.png "Top 5 Incident Categories")

We also see differences in player perception of games depending on the type of toxic incident. In general, most victims of toxic behavior differentiate between the bad actors and the game. The optimistic way to interpret the data is that only 17.3% of players are “less likely” or “much less likely” to recommend a game after experiencing toxic behavior in voice chat. However, given how important word-of-mouth promotion can be for game success, even this figure is surely troubling. And it is far worse for sexual harassment.

![Negative Perception Incident Categories](/uploads/negative-perception-incident-categories.png "Negative Perception Incident Categories")

Nearly 28% of victims of sexual harassment in voice chat say they are less likely to recommend a game. The figure is 23.8% for offensive names and 23.1% for bullying. Voice chat toxicity is clearly bad for games regardless of the category of offense. But, it is also worth noting that the behaviors and attitudes of the victims differ depending on the type of toxic incident. 

This doesn’t suggest that game makers should combat some forms of toxic behavior and ignore others. Instead, the findings indicate that game makers should become more proactive in identifying both the incidence of toxic behavior and the frequency of offense types. Victims of the bad behaviors have different reactions and also different expectations about how the game maker should respond depending on what happened. 

## Data-Driven Understanding

The data referenced above are included in a new 60-page report developed by Speechly and Voicebot Research. The report includes over 40 charts and diagrams and is free to download. 

Speechly commissioned the [research](https://www.speechly.com/blog/voice-chat-is-popular-with-gamers-its-also-the-top-source-of-toxic-behavior-new-report) after learning that game makers had very little information about their players’ voice chat experience beyond the complaints submitted and some social media posts and game reviews. We hope that the new report can be helpful for game developers working to improve player experience by reducing the incidence of toxic behavior in voice chat.

<Button href="/reports/voice-chat-toxicity-report" variant="accent">Download the Voice Chat Toxicity Report for Online Games</Button>

Speechly surveyed over 1000 online gamers about toxic behavior in voice and text chat. The results show offensive names, trolling, bullying and annoying behavior top the list with the broadest impact. However, these behaviors are between 50%-200% more frequent in voice chat.

The Dirty Dozen - The Impact of 12 Types of Toxic Behavior in Online Game Voice Chat

In a new national consumer survey, about half of U.S. online game players said they had experienced a toxic behavior incident in voice chat. They also said that voice chat was the channel with the biggest toxic behavior problem, beating out text-chat, in-game play, and user-generated content (UGC) by significant margins.

<Button href="/reports/voice-chat-toxicity-report" variant="accent">Click Here to Download Report</Button>

![Gamer Experience With Toxic Behavior in Voice Chat](/uploads/gamer-experience-with-toxic-behavior-in-voice-chat-sq1200-3.6.23.png "Gamer Experience With Toxic Behavior in Voice Chat")

### Data-Driven Understanding

The data are included in a [60-page report](https://get.speechly.com/voice-chat-toxicity-report-for-online-games/) developed by Speechly and Voicebot Research that includes nearly 50 charts and diagrams. Speechly commissioned the research after learning from game makers that they had very little information about user voice chat experience beyond player submitted complaints and anecdotal evidence posted in social media. We hope that the new report can be helpful for game developers working to improve player experience by reducing the incidence of toxic behavior.

### Voice Chat Toxicity in Perspective

Several previous studies have identified voice chat as a current and growing problem for harassment and toxic behavior in online games. However, when we set out to learn more, it became clear that surveys typically only asked about overall voice chat incidents and did not dig deeper to understand the nuances of the problem.

For example, we wanted to differentiate between harassment and other forms of toxic behavior. Harassment has a legal definition that varies around the world but it is directed and intentional. In addition, there are several forms of harassment ranging from bullying and griefing to sexual harassment and doxing.

Toxic behavior is a broader category that includes activities that undermine the player experience more generally but may not be directed at a specific player or overtly intended to do harm. Sexually explicit language, swear words, terms considered to be hate speech, some forms of trolling, and other categories combined with other forms of harassment all negatively impact player experience. It turns out that these problems are worst in voice chat.

![Toxic Behavior by Engagement Channel](/uploads/toxic-behavior-by-game-engagement-channel.png "Toxic Behavior by Engagement Channel")

We asked online gamers to rate three categories of UGC, in-game play, and voice and text chat separately on a 0-5 scale indicating the severity of the toxicity problem. In-game play was rated second worst to voice chat, followed by text chat in third. The different UGC categories were all rated similarly. This is an important finding because it goes beyond simply identifying the rate of toxicity and also captures player sentiment about severity by communication channel.

In industry interviews we found that some players even opt out of voice chat altogether in these game formats and use text chat, a ping system, or no communications at all while playing. Why? They are avoiding toxic behavior common to voice chat channels.

We also found that many users change their behavior after experiencing toxic behavior incidents, and these changes directly undermine game maker key performance indicators such as session length, session frequency, and retention. However, despite the issues, the mere presence of voice chat tends to boost these key metrics and average revenue per user. This means game makers have a significant incentive to offer a voice chat feature, and to actively police bad behavior.

### Voice Chat Remains Popular

Despite the issues, voice chat is a widely adopted feature. More than two-thirds of online game players say they use voice chat and the majority of those say they employ voice chat regularly.

![Gamer Voice Chat Use](/uploads/gamer-voice-chat-use.png "Gamer Voice Chat Use")

Gamers are also using voice chat because they like it. About 48% said they like games more when using voice chat while about 30% are neutral or unsure and about 22% say it has no impact. A plurality of users, 27.5%, strongly agreed that they liked games more when using voice chat.

![Gamer Sentiment About Voice Chat](/uploads/gamer-sentiment-about-voice-chat.png "Gamer Sentiment About Voice Chat")

The interest in voice chat goes beyond team-based games where player coordination is important. Over 53% of voice chat users say they like the feature for connecting with friends. This compares to just 44% that list “improved gameplay coordination” as an important benefit.

It is clear that online gaming has become a key social channel for many players. Voice chat is a great way to connect with friends, get to know new people, and improve the overall game experience. It has positive benefits for both players and game makers. However, the downside of toxic behavior is noteworthy.

We hope the data presented in the report prove useful to game makers as they plan their initiatives to combat voice chat toxicity.

<Button href="/reports/voice-chat-toxicity-report" variant="accent">Click Here to Download Report</Button>

Speechly commissioned a survey of a nationally representative sample of over 1000 gamers. The survey found that nearly 70% of gamers have used voice chat at least once. Of those, 72% said they've experienced a toxic incident. Read more today in the Full Report.

Voice Chat is Popular with Gamers - It's also the Top Source of Toxic Behavior - New Report

Speechly has introduced a new conformer AI model as an update to our original LSTM models. A key benefit of the Speechly [Conformer RNN-Transducer model](https://docs.speechly.com/basics/models) is improved computational efficiency. This is particularly true for real-time transcription, where it can save as much as 50% in computational resources. In addition, Speechly’s new models can achieve these benefits along with higher accuracy as measured by a lower word error rate (WER). 

This week we also released an updated Whisper solution with coverage for 99 languages. Whisper is a transformer-based large language model for speech recognition developed by OpenAI. We have been testing and optimizing our Whisper infrastructure for months to augment customer deployments. Speechly is now offering this as a [hosted option](https://www.speechly.com/products/hosted-whisper) for our customers to provide additional speech recognition capabilities. 

## What is a Conformer Transducer Model?

A conformer transducer model is a type of deep neural network that combines aspects of convolutional networks and transformer models. This enables the model to focus on specific parts of an audio input that are most relevant to a transcription or other natural language processing task. In particular, these models are able to identify short and long-term dependencies in speech, which often helps to improve accuracy. Moreover, unlike transformer-based encoder-decoder architectures such as Whisper, the conformer transducer naturally lends itself to real-time streaming transcription.

The benefits of conformer transducer models are realized irrespective of the runtime environment. They can be deployed both as part of our cloud/on-premise product as well as our on-device offering. We can build small or larger versions of the same conformer transducer depending on resource availability, with smaller models being more efficient on smaller devices. In any case, these models offer significant performance improvements over earlier technologies, such as LSTMs, in terms of transcript accuracy.

## Why did Speechly build a conformer model?

Speechly researchers developed a conformer model, in part, to improve computational efficiency. Conformer models are generally more computationally efficient than LSTM models due to parallelization and the ability to handle variable input sequence lengths. Conformer models can process input sequences in parallel across multiple computation units, such as GPUs. This allows them to perform more computations simultaneously and speed up both training and inference times. 

The benefits of custom speech recognition models are widely recognized. Using a conformer model can accelerate training for the models and reduce training costs. In addition, live streaming inference for speech recognition is also more computationally efficient with a conformer model and provides reduced latency. LSTM models generally require more memory and parameters to achieve the same level of accuracy.

## What is Whisper, and Why Does Speechly Offer it?

Whisper is a transformer-based large language model developed for speech recognition and transcription tasks. It has the added benefit of providing translation when needed. OpenAI introduced Whisper in late 2022, and Speechly immediately began testing it for a variety of tasks. It has some limitations compared to customized AI models but performs many tasks to a level on par with leading cloud speech recognition solutions at a far more attractive price point. 

However, Speechly also learned during our testing and deployment that setting up and managing Whisper infrastructure can be a complex undertaking. Given its useful features and these challenges in deployment and operations, Speechly decided to offer Whisper as a supplement to our existing on-device, on-premise, and cloud models. Whisper today is only available for cloud or on-prem deployment. 

## Try Speechly’s New Models Today

Both the new Conformer RNN-T and Whisper models are available today from [Speechly’s dashboard](https://docs.speechly.com/basics/getting-started). If you have any questions, you can learn more in Speechly’s [documentation](https://docs.speechly.com/basics/models), or feel free to [ask us a question](https://www.speechly.com/contact?ref=https://www.speechly.com/) anytime. 

Speechly continues to invest heavily in research and development to improve accuracy, latency, and cost efficiency. We look forward to hearing your feedback on the new models and continuing our research to update, refine, and enhance our speech recognition products.

This week the Speechly team released two new product updates. These updates include a new conformer AI model as an update to our original LSTM models and an updated Whisper solution with coverage for 99 languages.

Speechly Introduces New Conformer Speech Recognition Model and Expanded Whisper Offering

Speech recognition technology has come a long way in recent years and that has raised more interest in deploying [on-device solutions](https://www.speechly.com/blog/when-to-run-speech-to-text-on-device-or-on-premise-vs-in-the-cloud) as an alternative to cloud-based solutions. The main difference is that cloud-based solutions must send the audio over the network to a remote server for processing, while the audio is processed locally for [on-device](https://docs.speechly.com/features/on-device/) implementations and never has to travel the internet to access expensive computing resources.

This difference has far-reaching implications. You might be surprised that on-device speech recognition accuracy can be comparable to the cloud for many use cases, but with the added benefits of improved privacy and lower cost.

## Higher Privacy + Lower Cost

The main benefits of on-device speech recognition over cloud-based solutions are privacy and lower costs, especially when very large volumes of audio must be transcribed. If the audio is never uploaded to the cloud, the risks of sensitive information being leaked are substantially reduced. Cloud-based solutions also come with infrastructure costs that can be avoided with an on-device solution.

Additionally, on-device speech recognition doesn't require an internet connection. This can be a major advantage in situations where security policies may prevent public access to the internet, such as factory floors or hospitals.

But how accurate is on-device speech recognition compared to cloud-based solutions? The short answer is that it can be just as accurate, but this depends on the type of device in question.

## Is There an Accuracy Tradeoff?

Accuracy in speech recognition is typically measured using a manually transcribed evaluation corpus, which is a collection of recorded speech samples together with the correct transcript. The most common measure of accuracy is the Word Error Rate (WER), which compares the transcription of a recorded sample to the correct transcript by calculating how many changes one has to make to the automatically generated transcript so that it matches the correct reference. A lower WER indicates a higher level of accuracy.
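As a concrete illustration of the measure described above, the following is a minimal sketch of WER computed as a word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference. The function name and example sentences are made up for illustration, and production toolkits typically also normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("pass me the ball", "pass the ball please")
print(f"WER: {wer:.2f}")
```

Here one word is missing and one is added, so two edits against a four-word reference give a WER of 0.50.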

Speech recognition is based on machine learning models that are trained using large amounts of speech data. To make full use of such datasets, the model itself must be large. The size of the model directly affects its accuracy, with larger models being more accurate. However, larger models also require more resources, both in terms of processing power and memory usage.

Thus there is a [trade-off](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation) between accuracy and available resources. Typically cloud-based speech recognition solutions have more resources available, can hence use larger models, and are thus capable of providing high accuracy. But what is the situation with on-device speech recognition?

The answer is that it depends on what type of device one is considering, and if the device must do some other processing while speech recognition is running. Importantly, most modern mobile phones have the resources to run fairly large speech recognition models, especially if the device can focus only on transcription. And if the target device has fewer resources, it is possible to train a custom model that is small enough to fit on the device, without compromising too much on accuracy.

## Practical Considerations for On-Device Speech Recognition

The precise speech recognition task may play a role in your solution decision. If the task is to transcribe a local audio file, e.g. an interview recording, it is desirable that the processing runs faster than real-time, meaning that transcribing a 10-minute recording would take substantially less than 10 minutes. On the other hand, real-time transcription, where the transcript is generated at the same time the user speaks, may require fewer resources from the device as there is less audio to be processed per unit of time.

Consider that a mid-tier Android phone released in 2021 (Samsung A22 5G) is perfectly capable of running Speechly’s large, cloud-grade speech recognition model faster than real-time when no other computationally heavy processing is running concurrently. The device can transcribe a 10-minute audio file in about 2-3 minutes. On the other hand, the same device can easily handle real-time speech recognition using the same large model, even if there is a graphics-heavy 3D game running in the foreground. And crucially, using this model, the on-device WER would be exactly the same as the WER of Speechly’s cloud solution!
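"Faster than real-time" is often expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the device keeps ahead of the audio. A small sketch using the figures quoted above (the function name is illustrative):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real-time."""
    return processing_seconds / audio_seconds


audio_seconds = 10 * 60  # the 10-minute file from the example above

rtf_by_minutes = {}
for processing_minutes in (2, 3):
    rtf = real_time_factor(processing_minutes * 60, audio_seconds)
    rtf_by_minutes[processing_minutes] = rtf
    print(f"{processing_minutes} min of processing: RTF {rtf:.1f}, "
          f"about {1 / rtf:.1f}x faster than real-time")
```

A 2-3 minute processing time corresponds to an RTF of roughly 0.2 to 0.3.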

You could argue that the Samsung A22 is a fairly powerful device. However, even a Raspberry Pi 4 is capable of real-time transcription with the same large model, and this consumes only about half of the available CPU resources (2 cores).

## Practical Solutions for On-Device Speech Recognition

One place we have been asked to deploy on-device speech recognition is in the [video game industry](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation). Users typically have a PC, console, or mobile phone that has plenty of computational power and memory to run a speech recognition model in real time. This saves cost for the game maker because they are not processing all of that data in their cloud servers while providing the added benefit of greater user privacy and lower latency. If the user does face an issue such as toxic behavior in [voice chat](https://www.speechly.com/products/moderation), the data can be automatically uploaded to the cloud for use during the moderation investigation.

The accuracy of on-device speech recognition is not really a matter of on-device vs cloud, but more about model size and resource usage. Many devices, especially reasonably modern mobile phones, have sufficient resources to run relatively large models. Therefore, accuracy of [on-device](https://docs.speechly.com/features/on-device/) can be as good as in the cloud! And of course Speechly’s on-device models can be adapted to specific use-cases and vocabularies in the same way as our cloud solution.

To learn more about on-device speech recognition, check out our [on-device docs](https://docs.speechly.com/features/on-device/) or reach out to our [product team](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) at any time.

Speech recognition technology has come a long way in recent years leading to more interest in deploying on-device solutions as an alternative to cloud-based solutions. Why? The benefits of improved privacy and lower cost without any impact in accuracy.

On-device vs Cloud Speech Recognition: Comparing Privacy, Cost, and Accuracy

## ADL Report: Voice Chat Remains a Top Channel for Online Harassment

While ADL’s [annual report](https://www.speechly.com/blog/adl-report-online-harassment-in-games-is-bad-and-getting-worse) about harassment in multiplayer games showed a significant problem worsening, it also highlighted that voice chat is once again a leading channel for these incidents. In-match voice chat has consistently been cited by over 40% of U.S. adults as a source of toxic behavior. It was the top channel for harassment in 2019-2021 and fell just one point behind Gameplay in the 2022 survey.

![Harassment of Adults, by Communication Mode](/uploads/harassment-of-adults-by-communication-mode.png 'Harassment of Adults, by Communication Mode')

In-match voice chat also notably exceeds the reports of in-match text chat for each of the survey years. This is consistent with other primary research data Speechly has reviewed. One reason we suspect that voice chat harassment exceeds text chat is that games are far more likely to have automated tools to mitigate the impact of the latter.

The ADL survey differentiates between in-match and out-of-match voice chat channels. It finds that out-of-match voice chat is not quite as toxic but is still an issue cited by one-in-four gamers.

## Voice Chat is also Problematic for Kids

ADL data also show that voice chat is the leading channel for harassment of 13-17-year-old kids while playing games. Forty-five percent of kids responded that they had been harassed in voice chat, compared with 43% for gameplay and 39% for text chat. Gameplay did not change from the 2021 report, but voice chat rose six full percentage points. Text chat incidents rose more modestly.

![Ages 13-17 Harassment Channels](/uploads/ages-13-17-harassment-channels.png 'Ages 13-17 Harassment Channels')

In 2022, ADL also included data for 10-12-year-old children. Fifty-one percent say they had experienced harassment through in-match text chat, 46% through gameplay, and 41% for in-match voice chat. It may be that younger players are less comfortable making harassing statements via voice chat, or fewer are allowed to use voice chat while playing games. Regardless, the presence of harassment in online games is substantial across channels and age groups.

## Visibility into Harassment is Important

Consumer surveys are beginning to paint a more accurate picture of how widespread harassment is in online games. Most games today use complaint-led reporting for voice chat harassment. Speechly has found that about 70% of players that have experienced toxic behavior in a game’s voice chat have never reported an incident. Even the victims that have reported incidents have not reported every incident.

Game makers have very low visibility into the breadth and depth of these issues. Reports such as ADL’s and another that will be published in February offer much-needed insight.

Many game makers do have visibility into harassment that takes place during gameplay. Even if they don’t regularly monitor these incidents, they typically can assess them by reviewing log data. Similarly, many game makers have at least basic filtering tools for text chat, and some are assessing context after complaints are submitted. This doesn’t necessarily surface the extent of the problem, but the data is available to go into a deeper analysis, and some game companies do this regularly.

## Voice Chat Moderation Gap

Voice chat in gaming is generally a black hole for data. Few game makers today are recording voice chat audio, fewer still are [transcribing](https://www.speechly.com/products/moderation) the chats, and even fewer have the means to algorithmically analyze the data when it is available. This has led to a voice chat moderation gap that appears to be growing.

Game makers tell us that voice chat is important for improved gameplay experience, session frequency, and player retention. Industry data back up these contentions. However, voice chat also presents a significant [risk factor](https://www.speechly.com/blog/adl-report-online-harassment-in-games-is-bad-and-getting-worse). When harassment does occur, players reduce play, change their gameplay behavior, and some abandon specific games altogether.

Granted, there are technical and [cost hurdles](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) for recording voice chat audio, transcribing the conversations accurately, and analyzing them effectively. Speechly was recruited by several game makers to help overcome these obstacles. [Reach out](https://www.speechly.com/contact) to our product team if you would like to learn more.

Also, if you would like to read a more detailed breakdown of these challenges, I recommend you check out some of our earlier blog posts on these very topics.

#### [Why Games Need Better Voice Chat Moderation](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation)

#### [3 Common Voice Chat Moderation Mistakes](https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes)

#### [On-device Speech Recognition for Voice Moderation](https://www.speechly.com/blog/speechly-introduces-a-solution-to-the-voice-chat-moderation-gap-at-voice-summit-2022)

The annual ADL report about harassment in multiplayer video games showed a significant problem worsening. Voice Chat is once again a leading channel for concern.

ADL Report: Voice Chat Remains a Top Channel for Online Harassment

ADL’s annual report about harassment in online multiplayer games once again painted a negative picture of player experience. [Harassment](https://www.speechly.com/blog/online-harassment-stats) of 13-17-year-olds rose 6% over the past 12 months. Sixty-six percent of these young people had experienced at least one harassment event. The figure is even worse for preteens in the 10-12-year-old category. Seventy percent have been victims of harassment while playing online games.

The experience of adult game players was about the same as that of children. Sixty-seven million U.S. adults have experienced harassment in online games. While that metric is flat from the previous year, ADL researchers noted that severe harassment increased from 71% to 77%. Doxing alone rose 6%.

## Beyond the Incident, Harassment Impacts Gamer Behavior 

The impact of harassment goes beyond the incident. Among 13-17 year-olds, 30% said they quit playing certain online games. That figure was up from 28% in 2021. Thirty-five percent said they avoid certain games because of harassment, up from 26%. A little over a third said they changed how they play. 

![In Game and Offline Impact on Young People ADL](/uploads/in-game-and-offline-impact-on-young-people-adl.png "In Game and Offline Impact on Young People ADL")

ADL’s findings also revealed that the harassment is not limited to a few games with toxic communities. Nearly half of “young gamers experienced harassment in every game we included in this survey,” concluded the report.

Among adults, 33% said they quit playing certain games due to a harassment incident, and 32% said they avoid certain games. These figures both represent increases between 2021 and 2022. 

![Harassment Impact on Gameplay ADL](/uploads/harassment-impact-on-gameplay-adl.png "Harassment Impact on Gameplay ADL")

## In-Game Harassment at the Boiling Point

Online harassment is clearly bad for game players and it is also negatively impacting game makers. A few bad actors are causing a tremendous amount of reduced play and user churn from game titles. 

It is hard for game titles to break through the clutter and capture new players in an increasingly crowded marketplace. The last thing any game maker wants is for a few bad actors to cause players to depart for other titles. And, they don’t want to have a reputation for a toxic environment that keeps some players from ever trying out the game in the first place. 

If this problem were easy to solve, it would have been solved already. Online games have become social experiences and that means player interaction is inevitable. It also means that these interactions can turn into toxic incidents. The risk is particularly acute when [voice chat](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation) is involved. We will talk about that topic in our next post.

Combating online harassment with improved [moderation tools](https://www.speechly.com/products/moderation) and practices is clearly on the agenda for nearly every game maker in 2023. It is clear from our conversations with game makers that they recognize the existing infrastructure simply doesn’t protect players from toxicity and doesn’t do enough, fast enough to mitigate the impact.

Studies such as [ADL](https://www.speechly.com/products/moderation)’s are helping to highlight the problem that game makers already saw in their complaint files and comms logs. Now everyone knows that the problem is not limited by title or genre. In 2023, you can see the movement across the industry to become more proactive in bringing back civility and fair play in online games.

ADL's annual report about harassment in online multiplayer games paints a negative picture for young people and adults alike. Is 2023 the year the gaming industry will start to overcome these challenges?

ADL Report: Online Harassment In Games is Bad and Getting Worse

TLDR;

* The most popular voice assistants (Alexa, Siri, Google) all use half-duplex architectures, which means the user and the assistant must take turns to speak – you cannot interrupt
* This turn-taking requirement limits the versatility of half-duplex systems because neither party can act until the other has finished their turn; this slows system response times and can be tedious for the user when the voice assistant misinterprets the request
* Half-duplex architectures also limit use cases such as real-time proactive [voice chat moderation](https://www.speechly.com/products/moderation); that is why you often see these solutions showing significant processing delays and only after-the-fact issue flagging - transcription processing followed by text analysis is a half-duplex system design
* Full-duplex architectures enable bi-directional communication as both parties are always listening even when speaking or acting
* Full-duplex systems are less common today but offer valuable features because they employ real-time understanding where the system begins predicting the user intent from the very first word uttered
* This means that users can speak to correct the AI’s understanding as soon as it is apparent there is an issue which enables more efficient interactions
* It also means full-duplex can perform actual proactive moderation of live voice chat because the system is not batching text to be analyzed but instead analyzing the meaning of the user speech in parallel with transcription
* This truly [proactive moderation](https://www.speechly.com/products/moderation) feature can make a big difference when toxic material is uploaded or toxic behavior is occurring – the difference between a few seconds and a couple of minutes can have a big impact as Twitch recently learned

Whether you call it speech AI, conversational AI, voice AI, or prefer some [other term](https://docs.speechly.com/glossary), you most likely assume it means turn-based communication. That is not surprising. General purpose voice assistants such as Alexa and Siri are rooted in this model and that is the primary point of reference for most people familiar with conversational interactions or even chatbots.

It begins with a human on their turn making a request to a leading voice assistant. The voice assistant waits for the human to complete the request (i.e. utterance). On its turn, the voice assistant starts by processing the full statement through speech recognition and natural language understanding and then responds. That response might be via text-to-speech, an audible sound, an image, or by completing a task. However, you typically cannot engage the AI again until it completes its response. Only then can the human user engage again, followed by the voice assistant.

The technical term for these turn-based conversational models is half-duplex. Only one of the two communicating parties can communicate at a time. That means one-at-a-time, turn-taking communication – human, then AI, then human, then AI, and so on. That isn’t very humanlike.

Humans typically use what is called full-duplex communication when interacting with each other. While half-duplex systems only enable information to travel in one direction at a time, full-duplex communications enable simultaneous information flow in multiple directions. No one is required to wait for their turn. This is a key engineering differentiator for Speechly and one reason why customers seek us out for capabilities that they cannot implement with half-duplex systems.

Few people realize how radically full-duplex communications transform what is possible in conversational interactions. The real-time nature of a full-duplex architecture also enables other novel use cases, such as real-time voice chat moderation. Full-duplex voice AI was also recently in the news, but more on that in a minute.

## Don’t Take Turns, Barge In

When was the last time you attempted to interrupt Alexa or Siri when they were speaking? How did that work out for you? These assistants will drone on to complete what they believe their task to be, and you simply have to wait. There is no concept of “barging in” while the other party is talking or processing information and deciding what to do. This limitation is true even though a barge-in could help the conversation more efficiently and accurately meet the user’s goal.

The one way you can barge in on Alexa or Siri is to utter their wake word. However, that essentially resets the context, and the user is required to start over instead of building upon the progress the conversation has made toward the goal. This is relevant for information sharing and task completion. Let’s consider some full-duplex communication examples.

#### Full-duplex information sharing example:

| Human                                          | Another human                                                          |
| ---------------------------------------------- | ---------------------------------------------------------------------- |
| 1. "Who were the lead actors in Blade Runner?" | 2. There was Rutger Hauer, Harrison...                                 |
| 3. "No. In the sequel."                        | 4. Harrison Ford again, Ryan Gosling, Ana de Armas, and Robin Wright.  |

| Human                                | A full-duplex AI                                                                                                       |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| 1. "What is the weather like today?" | 2. The weather in New York City is...                                                                                  |
| 3. "I mean in Boston."               | 4. The weather in Boston is clear and 68 degrees right now with a high of 75 and cloud cover forming in the afternoon. |

#### Full-duplex task completion example:

| Human                               | Another human                                                          |
| ----------------------------------- | ---------------------------------------------------------------------- |
| 1. "Can you hand me a plate?"       | 2. The person begins to hand over a green plate.                       |
| 3. "The white one would be better." | 4. The person takes back the green plate and hands over a white plate. |

| Human                          | A full-duplex AI                                           |
| ------------------------------ | ---------------------------------------------------------- |
| 1. "Show me basketball shoes." | 2. A variety of shoes begin to populate a screen.          |
| 3. "Show me only Nike shoes."  | 4. All non-Nike shoes are removed from the screen.         |
| 5. "Only red."                 | 6. All shoes that are not red are removed from the screen. |

The misunderstanding of user intent may be due to incomplete information from the requesting party. However, in each case, the requesting party can easily refine their request based on the first indications of activity by the responding party.

Many of these scenarios can be frustrating when using a half-duplex system because you have to wait for an incorrect task to be completed or for the system to deliver the wrong information before starting the query again with more detail. This waiting and inability to introduce real-time collaboration to reach the conversation’s goal is just as annoying when speaking with a human as with an AI.

Half-duplex systems can get the job done in many cases. Users simply have to adjust their expectations and accept a certain amount of inefficiency and frustration from time to time. It also means that these systems cannot fulfill the requirements for many real-time interactions.

## Real-time Understanding

The secret behind full-duplex natural language processing ([NLP](https://docs.speechly.com/glossary)) is real-time understanding. As soon as a user begins to speak, the system begins predicting their intent and starts taking action. It doesn’t wait until the user finishes speaking. That means a correct early prediction could actually fulfill a request before it is fully expressed.
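As a rough illustration of that difference, the sketch below contrasts a turn-based handler, which predicts once after the full utterance, with an incremental handler that re-predicts after every word. The `INTENT_KEYWORDS` table and all function names are hypothetical and are not Speechly’s API; a production system would rely on a trained language-understanding model rather than keyword lookup.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical keyword-to-intent table, used only for illustration.
INTENT_KEYWORDS = {
    "weather": "get_weather",
    "shoes": "search_products",
    "play": "play_media",
}


def predict_intent(words: List[str]) -> Optional[str]:
    """Return the first intent whose keyword appears in the words heard so far."""
    for word in words:
        intent = INTENT_KEYWORDS.get(word.lower().strip("?.,!"))
        if intent is not None:
            return intent
    return None


def half_duplex(words: Iterable[str]) -> Tuple[Optional[str], int]:
    """Wait for the whole utterance, then predict once."""
    heard = list(words)
    return predict_intent(heard), len(heard)


def full_duplex(words: Iterable[str]) -> Iterator[Tuple[Optional[str], int]]:
    """Re-predict after every word so action can start before the user finishes."""
    heard: List[str] = []
    for word in words:
        heard.append(word)
        yield predict_intent(heard), len(heard)


def words_until_intent(words: Iterable[str]) -> Optional[int]:
    """Number of words the incremental handler needed before it had an intent."""
    for intent, count in full_duplex(words):
        if intent is not None:
            return count
    return None
```

For the utterance "Show me basketball shoes in red please", `half_duplex` returns its prediction only after all seven words, while `words_until_intent` reports that the incremental handler already had an intent after the fourth word. Anything said after that point can be treated as a refinement instead of a new request.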

If you have multimodal feedback such as a screen, full-duplex also provides another powerful feature. As the system is visually fulfilling the request, the user can see what the AI is doing and correct an inaccurate understanding. You can see a video example below.

<YouTube videoId="xI68NT8D1m8" />

## Full-Duplex Application for Voice Chat and UGC Moderation

Real-time functionality also enables novel use cases such as voice chat [moderation](https://www.speechly.com/products/moderation) for [games](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation) and social networks. If there is toxic behavior or harassment in progress, you want to identify it immediately and begin taking action. Otherwise, you have to wait until after the perpetrator stops speaking to begin your analysis.

You see this in many voice chat moderation implementations today. Many are only able to review the transcript of voice chat long after the conversation is over. It is an audit-based reactive approach. Others attempt to provide information sooner but often with a 30-second to multi-minute delay. Again, that delay typically means the conversation may be over or may have had additional time to escalate with a more severe negative impact. This delayed reaction is better than the audit-based approach, but it undermines a platform’s ability to be proactive and quickly mitigate harm.
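To make the effect of that delay concrete, here is a minimal sketch comparing the earliest moment a windowed transcribe-then-analyze pipeline can raise a flag with the earliest moment a streaming pipeline can. The window length and processing times are illustrative assumptions, not measurements of any product, and both function names are hypothetical.

```python
import math


def batch_flag_time(event_s: float, window_s: float, processing_s: float) -> float:
    """Earliest time (seconds) a windowed pipeline can flag an event.

    Analysis cannot start until the window containing the event has closed.
    """
    window_end = (math.floor(event_s / window_s) + 1) * window_s
    return window_end + processing_s


def streaming_flag_time(event_s: float, processing_s: float) -> float:
    """Earliest time (seconds) a streaming pipeline can flag the same event."""
    return event_s + processing_s


# Illustrative numbers only: a toxic remark 12 seconds into a chat.
batch = batch_flag_time(event_s=12.0, window_s=30.0, processing_s=5.0)
streaming = streaming_flag_time(event_s=12.0, processing_s=0.5)
print(f"batch: {batch:.1f} s, streaming: {streaming:.1f} s")
```

Under these assumed numbers the batch pipeline cannot react until 35 seconds into the chat, while the streaming pipeline can react at 12.5 seconds. The gap grows with the window length, which is why audit-style review arrives too late to interrupt an incident in progress.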

This also impacts use cases for user-generated content (UGC). After toxic or inappropriate content is uploaded or the live stream starts, the race is on. The longer the content is available for consumption, the more users are likely to see it, and the more negative effects accumulate.

Speechly’s Otto Söderlund highlighted what is at stake and how latency can impact serious situations in a recent speech at the [Voice 2022 conference](https://www.speechly.com/blog/speechly-introduces-a-solution-to-the-voice-chat-moderation-gap-at-voice-summit-2022). He commented:

> “Speed can be really critical in detecting harmful content online. Consider, for example, the Buffalo shootings. It took Twitch two minutes to actually cut down the live stream of the shootings that were broadcasting on their platform. You think, ‘That is not a long time. It is pretty fast.’ Right? But, it wasn’t fast enough to prevent a viral spreading of those videos to the wider public.”

<YouTube videoId="R6qNXwuos2c" opts={{ playerVars: { start: 275 } }} />

The reason most platforms have moderation is to protect the users from harmful content. Whether it is voice chat or UGC, speed matters. A full-duplex NLP architecture is the only way to provide true real-time proactive moderation of conversations and content. The latency of half-duplex systems introduces higher risk because they are always several steps behind the bad actors.

## Full-Duplex in the News

However, almost no enterprises, game makers, or social platforms are even aware of the distinction between full and half-duplex NLP architectures. They typically don’t even know full-duplex features are an option. That is because users have been conditioned by the expectations set by the half-duplex feature constraints of the general purpose voice assistants combined with a disincentive for vendors to expose the limitations of their technology architecture.

One company that recently joined the full-duplex tribe is [SoundHound](https://voicebot.ai/2022/11/21/soundhound-unveils-multimodal-dynamic-interaction-feature-for-business-smart-displays/). The company demonstrated full-duplex online form filling for restaurant orders. This is a great full-duplex use case because the user can see immediately when the AI makes a mistake in the form entry and begin to take corrective action. This may not be as high stakes as some of the moderation use cases, but it definitely can provide a better user experience and higher throughput for order processing.

It was the higher stakes issues of content and voice chat moderation that led several game makers, metaverse virtual worlds, and social media services to ask Speechly for assistance. Their “transcribe and best-efforts response” approaches introduced risk for users and the companies themselves, while also carrying very high costs. Speechly’s full-duplex architecture, plus the real-time natural language understanding engine, plus the ability to deploy on devices, in the cloud, or as a hybrid, turned out to be a unique solution mix to address an intractable problem.

Our expectation is that full-duplex conversational systems will continue to see adoption growth because they are better for users and enable real-time use cases where speed is of the essence. Having other companies like SoundHound discuss this alternative approach is sure to draw more attention to what full-duplex can do as well as the limitations of half-duplex systems. We also expect this to become a standard requirement for most voice chat moderation solutions going forward.

Let us know if you have any questions about full-duplex NLP and how the technology may be a better fit for your conversational AI or voice chat [moderation](https://www.speechly.com/products/moderation) use case.

The most popular voice assistants (Alexa, Siri, Google) use half-duplex architectures, meaning the user and assistant must take turns to speak. Full-duplex systems, however, employ real-time understanding where the system begins predicting the user intent from the very first word uttered, unlocking the ability for proactive content moderation.

The Hidden Power of Full-Duplex AI for Voice Assistants and Voice Chat Moderation

## Speechly Introduces a Solution to the Voice Chat Moderation Gap at Voice Summit 2022

Voice chat has quickly become a popular feature in [games](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation), [metaverse](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse) virtual worlds, and social media networks. Consumers like how it adds to the experience. The creators of these spaces like how it drives higher user loyalty, session frequency, and ARPU. However, there is a downside.

Voice chat can sometimes become a vector for [harassment](https://www.speechly.com/blog/adl-report-online-harassment-in-games-is-bad-and-getting-worse) and toxic behavior, and it presents challenges that text moderation solutions cannot adequately address due to latency, inaccuracy, privacy, and the high cost of off-the-shelf speech recognition solutions. That has led to a voice chat moderation gap.

<YouTube videoId="R6qNXwuos2c" />

## Filling the Voice Chat Moderation Gap

Otto Söderlund, CEO and co-founder of Speechly, addressed this challenge at Voice Summit 2022. He recounted how companies in the gaming industry explained this challenge and asked, “How do we ensure our experiences are safe?”

Söderlund breaks down the technical requirements as speed, privacy, and accuracy. Speed enables the platform to intervene before a situation escalates and even cut off a problematic feed automatically. Accuracy is critical, so you don’t miss incidents (i.e., false negative errors) or intervene when it is unwarranted (i.e., false positive errors) and undermine user loyalty. Privacy is essential from both a regulatory perspective and for users’ peace of mind.
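The two error types map onto the standard precision and recall metrics, which the following sketch computes from raw counts. The class name and the example counts are made up for illustration and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationCounts:
    true_positives: int   # toxic clips correctly flagged
    false_positives: int  # harmless clips flagged by mistake
    false_negatives: int  # toxic clips that were missed

    @property
    def precision(self) -> float:
        """Share of flagged clips that really were toxic."""
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        """Share of toxic clips that were flagged."""
        toxic = self.true_positives + self.false_negatives
        return self.true_positives / toxic if toxic else 0.0


# Made-up counts: 80 correct flags, 20 wrongful flags, 40 missed incidents.
counts = ModerationCounts(true_positives=80, false_positives=20, false_negatives=40)
print(f"precision={counts.precision:.2f}, recall={counts.recall:.2f}")
```

Low precision means innocent users are being reprimanded, and low recall means incidents are slipping through, so a moderation system generally needs to be judged on both numbers together.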

Speechly’s solution meets these requirements directly while also addressing the other challenge many companies face when monitoring voice chat channels – cost. The solution can be deployed [on-premise or on device](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy), and savings can be as high as 90% over cloud service provider fees.

Let me know what you think about the voice chat moderation gap and Otto’s presentation. You can DM me or tag me on Twitter [@CollinBorns](https://twitter.com/collinborns) or [LinkedIn](https://www.linkedin.com/in/collinborns/). Check out this [link](https://www.speechly.com/products/moderation) to learn more about Speechly’s solution for voice chat moderation.

Voice chat is a popular feature in games, the metaverse, and social media networks but it comes with challenges like harassment and toxic behavior. This post breaks down our keynote at VOICE 22 exploring how Speechly helps solve these issues.

Speechly Introduces a Solution to the Voice Chat Moderation Gap at VOICE 2022

Voice chat is very popular with both users and the creators of games, social media platforms, and metaverse spaces. People get more out of the experience when they connect directly with other users, and their behavior shows this through longer sessions, more frequent usage, and higher retention. Those same metrics are attractive to platform and application creators because they drive higher average revenue per user (ARPU).

However, the introduction of voice chat comes with the risk of [harassment](https://www.speechly.com/blog/online-harassment-stats) for users. The [Anti-Defamation League](https://www.speechly.com/blog/adl-report-online-harassment-in-games-is-bad-and-getting-worse) (ADL) reported in 2021 that as many as 27% of gamers who were subject to online harassment in a game stopped playing. Similar patterns are emerging for social and metaverse applications. So, there is a dilemma. Users like voice chat. Application providers like the user metrics that voice chat delivers. At the same time, harassment delivered through voice chat can undo many of those benefits and undermine a brand’s reputation.

The solution to this problem is not a secret. Moderation has been a hot topic in social media and [gaming](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation) for nearly two decades. The difference is that most moderation techniques have developed around text moderation or asynchronous communication. Voice is real-time, can be [costly](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) to convert to text accurately, and manifests different problems.

Social media, along with some games and [metaverse](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse) virtual worlds, also support user-generated audio and video content sharing, which can be vectors of toxic behavior and harassment. And there is the issue of brand safety if the space is ad supported. All of these scenarios require accurate, fast, and cost-effective solutions that efficiently convert voice content into text so analysis can identify harassment and other problematic content.

<WhitePaperBanner
  title="4 Key Takeaways on Voice Moderation in Online Gaming & the Metaverse"
  description="We interviewed 20+ experts in the Online Gaming & Metaverse space. Here are the key takeaways for Voice Moderation."
  filePath="/uploads/4-takeaways-on-voice-moderation-in-online-gaming-the-metaverse\.pdf"
/>

## Voice Chat

Voice Chat is the most recognized risk vector where [proactive moderation](https://www.speechly.com/products/moderation) is needed. It is also the hardest to monitor and often the most personal for the victim. Users that exhibit toxic behavior or engage in harassment through voice chat are often targeting an individual or group. The victims are not collateral damage. They are the intended target for abuse. That means the speed of response can really make a difference.

> The ADL’s [2020 report](https://www.adl.org/resources/report/free-play-hate-harassment-and-positive-social-experience-online-games-2020) on online harassment in games indicated that about half of the harassment took place in voice chat during gameplay. Text chat harassment during gameplay totaled 39%. Outside of matches, voice chat again exceeded text chat for bad behavior, 28% to 22%.

Given that voice chat is real-time communication, it requires a solution that transcribes conversations quickly, so you can identify and intervene before toxic behavior escalates. The transcription also must be [accurate](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation). Otherwise, you may face too many false positives where you reprimand innocent users or too many false negatives where you miss the toxic behavior altogether. Both of these outcomes negatively impact users.

Finally, there is the issue of [cost](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy). Transcribing all voice chat through a cloud service quickly becomes very expensive. The result is that many applications and platform providers avoid full coverage with voice chat monitoring and recording. Instead, most moderate voice chat only on an exception basis which makes it harder to identify perpetrators and protect users from abuse. There are new, more economical solutions to this problem but awareness is limited. These factors make voice chat both a significant risk vector for abuse and more challenging to moderate than text-based communications.

## Video and Audio

Allowing users to add video and audio content is a core use case of social media. These formats are actively encouraged as they are known to generate more user engagement and session time. Consumer video consumption has risen steadily in recent years. According to a report by [Wyzowl](https://blog.hubspot.com/marketing/state-of-video-marketing-new-data), consumers watched an average of 10.5 hours of video per week in 2018. That figure rose to nearly 20 hours as of 2022.

![](/uploads/average-hours-video-watched.png 'Average Hours Video Watched')

eMarketer data show an upward trend in online audio consumption as well. Digital audio listening grew from 1 hour and 14 minutes per day in 2018 to 1 hour and 37 minutes in 2022.

![](/uploads/average-time-spent-digital-audio.png 'Average Time Spent Digital Audio')

Of course, this presents a challenge. How do you know when a user uploads a video or audio that it does not contain objectionable audio content? The obvious answer is to conduct similar moderation steps as you would for voice chat. Transcribe the audio and then run it through your moderation text-analysis tool.

The question is how many application providers are actually doing this effectively. Many wait for user complaints about the offending material before reactively assessing the content. However, you can do this today by proactively transcribing the audio and scanning it for objectionable material. This may spare many users from being subjected to offensive material before the reactive, complaint-led model kicks in. Wouldn’t it be better to quarantine items and address these issues before they become problems?
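A proactive flow of this kind can be sketched as a screening step that runs before content is published. Everything below is an assumption for illustration: `screen_uploads`, the `Upload` type, and the two stand-in callables are hypothetical, and a real deployment would plug in an actual speech recognizer and a text-analysis model instead of the toy lambdas.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Upload(NamedTuple):
    upload_id: str
    audio: bytes


def screen_uploads(
    uploads: Iterable[Upload],
    transcribe: Callable[[bytes], str],
    is_objectionable: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split uploads into (published, quarantined) IDs before anyone sees them."""
    published: List[str] = []
    quarantined: List[str] = []
    for upload in uploads:
        transcript = transcribe(upload.audio)
        if is_objectionable(transcript):
            quarantined.append(upload.upload_id)
        else:
            published.append(upload.upload_id)
    return published, quarantined


# Stand-ins for illustration: the "audio" is just UTF-8 text, and the
# analysis is a single placeholder term rather than a real model.
fake_transcribe = lambda audio: audio.decode("utf-8")
fake_is_objectionable = lambda text: "placeholder-slur" in text.lower()

published, quarantined = screen_uploads(
    [
        Upload("clip-1", b"great match everyone"),
        Upload("clip-2", b"you are a PLACEHOLDER-SLUR"),
    ],
    fake_transcribe,
    fake_is_objectionable,
)
```

With these stand-ins, `clip-1` is published and `clip-2` is held for review. Passing the transcriber and the analyzer in as callables keeps the screening logic independent of whichever speech recognition and moderation services are actually used.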

## Brand Safety

This is the least understood of the voice moderation vectors but may become one of the most important, particularly in ad-supported social media and metaverse environments. A report from [GumGum and Digiday Media](https://gumgum.com/guides/brandrx-how-to-limit-brand-safety-risks#form) found that, “75 percent of brands reported at least one brand-unsafe exposure. And it’s not all about reputation and social media backlash: These incidents can do profound damage, leading to brand confusion and, in extreme cases, loss of revenue.” The top issue cited by advertisers was hate speech.

![](/uploads/brand-safety-top-issues.png 'Brand Safety Top Issues')

From terrorist messages and the war in Ukraine to concerns about COVID misinformation, politically charged topics, and juxtaposition to hate speech, brand safety is a rising concern among advertisers. More than two-thirds of companies say they have faced a known brand safety issue and 70% of marketers are taking the matter seriously. In 2017, [JP Morgan Chase](https://www.nytimes.com/2017/03/29/business/chase-ads-youtube-fake-news-offensive-videos.html) famously reduced their advertising from over 400,000 websites to just 5,000 to reduce brand safety risk.

To identify hate speech or reference to controversial topics in user generated content, you need to have cost-effective automated transcription that takes context into account. Context is very important because keywords alone don’t indicate whether the content is being presented in a problematic way. Moderation techniques applied to an accurate and timely transcript of the audio content can offer brand safety assurance while also protecting users from objectionable material.

## Reactive vs Proactive Moderation

Content moderation practices have emerged from a reactive model. After someone flags content as objectionable, mitigation steps are taken. That flag sometimes originates from an internal moderation team. Very often, the flag comes from a user in the form of a complaint. In both cases, many users are typically subjected to the offensive material before it is removed.

[Proactive moderation](https://www.speechly.com/products/moderation) of real-time and recorded audio content is uncommon largely because the technology required has historically been inadequate to effectively address the problem, and it was also prohibitively expensive. That is changing. New speech recognition technologies are more accurate, faster, and can be delivered more cost effectively than in the past. This situation presents an opportunity to implement automated and proactive moderation. The result will be fewer negative incidents impacting users, their experience, and the brand’s reputation.

## Solution

The difference between voice and chat [moderation](https://www.speechly.com/products/moderation) is not widely understood. As you can see, the differences don’t end with conversations. You may allow users to submit short- or long-form text that you can proactively scan for objectionable user-generated content. This process becomes more complicated when that content is audio or video. The impact of missed content ranges from offending users to alienating advertisers. There are also situations where you can generate false positives that will undermine your relationship with users. Neither of these scenarios is desirable.

Of course, we are not writing this just so people will know about where voice moderation is needed. Speechly has developed technology specifically for [voice moderation](https://www.speechly.com/products/moderation) that meets the demands of high accuracy, speed, and low cost. For the first time, this will enable application and platform providers to proactively monitor all voice chat and recorded audio. This proactive stance can avoid user problems, reduce complaints, and generate a stronger overall business. [Reach out](https://www.speechly.com/contact) to our product team if you would like to learn more.

*Photo by [Pavan Trikutam](https://unsplash.com/es/@ptrikutam?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/three?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)*

Voice chat is very popular with both users and the creators of games, social media platforms, and metaverse spaces. However, the introduction of voice chat comes with the risk of harassment for users.

3 Vectors of Voice Chat Moderation

## TLDR;

* Riot Games, Roblox, Sony, Turtle Rock Studios, and other game makers are recording voice chats and using the data in their moderation. 
* Voice chat is popular with consumers because it adds value to the experience. It is popular with game makers because it drives higher retention and revenue. 
* However, voice chat is also a key vector for online [harassment](https://www.speechly.com/blog/online-harassment-stats), which undermines the gamer experience and leads to game abandonment.
* The solution is to implement better voice chat [moderation](https://www.speechly.com/products/moderation) but the tools in use today typically suffer from low accuracy, high cost, and high latency; a new technical approach is needed to fill the voice chat moderation gap.   

<WhitePaperBanner
  title="4 Key Takeaways on Voice Moderation in Online Gaming & the Metaverse"
  description="We interviewed 20+ experts in the Online Gaming & Metaverse space. Here are the key takeaways for Voice Moderation."
  filePath="/uploads/4-takeaways-on-voice-moderation-in-online-gaming-the-metaverse\.pdf"
/>

Riot Games began recording voice chats in its Valorant title in July 2022 to better monitor the communication channel for toxic behavior and to investigate reported incidents. Roblox also informed users earlier this year that it is recording voice chats and maintains the recordings for [seven days](https://en.help.roblox.com/hc/en-us/articles/5704050147604-Spatial-Voice-Recording-Frequently-Asked-Questions) unless there is a complaint filed. 

Sony said in [2020](https://blog.playstation.com/2020/10/16/details-on-new-voice-chat-functionality-coming-to-ps5/) that it is recording all PlayStation voice chats on a 5-minute rolling basis. It is up to the user to select up to a 40-second clip that includes the offensive behavior and submit it to Sony’s moderation team for review.

Whether it is these games, Back 4 Blood, or others, the intent is the same. Game makers want to offer voice chat because it is a feature that players enjoy. However, they are concerned about the impact that toxic behavior and harassment have on the game-playing experience and how it can lead players to shorten their play sessions, play less frequently, and avoid specific game titles altogether.

Game makers are applying a variety of approaches to combat voice chat abuse. However, our conversations with game makers suggest there is not only room for improvement; the existing techniques for voice chat moderation introduce new problems because they are inaccurate, expensive, and slow. 

### The Case for Adding Voice Chat

Some people may suggest that you could avoid this problem altogether by simply sticking to text chat alone or offering no in-game communications. But simple solutions aren’t always practical. Gamers like voice chat, and its absence could undermine a game’s competitiveness while shortchanging the in-game experience. An article in [Hackernoon](https://hackernoon.com/why-every-multiplayer-game-needs-in-game-voice-chat) shared statistics from Tencent Cloud that reported:

> “[Over 90%](https://www.tencentcloud.com/resources/whitepaper/100203/?ref=hackernoon.com) of Chinese gamers prefer to interact with other players in an experience. 90.6% of consumers use the built-in voice chat function when playing a game, with 38.4% saying that they use the voice chat function often. When a title doesn’t have an in-game voice communication system in place, 73.7% of these players say they turn to a third-party service instead.”

It’s not just Chinese gamers that like voice chat. As I wrote in an earlier [post](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse):

> An Oxford Academic study from 2007 found that “voice chat leads to stronger bonds and deeper empathy than text chat.” As Subspace put it in 2021, “Voice deepens the immersive world, helps forge social bonds, and strengthens online play.”

You can also see it in user behavior related to voice chat. Tod Bouris, the former director of customer success at Vivox and now a manager at Unity, said at a 2019 [conference](https://www.youtube.com/watch?v=IwZawWFVoPs): 

> “The metrics show that people who use communications during their gaming game more and more often than those that don’t. And we also find this holds true for any platform, any game type. So, voice \[chat] is really a social element that adds stickiness and retention to your games that you can’t get from something else.”

Bouris also said that voice chat users [spent twice the amount of time](https://www.slideshare.net/unity3d/using-vivox-to-connect-your-players-text-and-voice-comms-unite-copenhagen-2019) playing as non-voice users and were five times more likely to be playing after five weeks. The data suggest the case for including voice chat in games is strong and getting stronger. Given this situation, game makers are turning to voice chat moderation rather than eliminating voice chat from their experiences.

### Voice Chat Moderation Challenges

If a game maker has other communications, such as text chat, they typically have some type of moderation solution to monitor for abusive behavior. Many companies believe that they can just add an off-the-shelf transcription solution to a voice chat and then feed the text into their existing moderation solutions. This is where complications begin. 

Most general purpose automated speech recognition (ASR) solutions will not recognize a significant portion of the game-related nomenclature and slang. That often leads to transcript errors which means the text analysis will suffer from a high frequency of false negatives (i.e. missing something that should be flagged) and false positives (i.e. flagging something that is not a policy violation). These error types lead to different problems that are costly to resolve, can lead to missed issues, and don’t live up to the goal of the moderation policy. 
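To make the two error types concrete, here is a minimal sketch that measures both rates for a set of moderation decisions. The `Decision` record and the sample data are hypothetical and made up purely for illustration; a real evaluation would compare pipeline output against human-reviewed labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One moderated utterance (hypothetical record for illustration)."""
    flagged: bool    # what the automated pipeline decided
    violation: bool  # what a human reviewer concluded


def error_rates(decisions):
    """Return (false_negative_rate, false_positive_rate).

    False negative rate: share of real violations the pipeline missed.
    False positive rate: share of benign utterances it flagged anyway.
    """
    violations = [d for d in decisions if d.violation]
    benign = [d for d in decisions if not d.violation]
    missed = sum(1 for d in violations if not d.flagged)
    wrongly_flagged = sum(1 for d in benign if d.flagged)
    fn_rate = missed / len(violations) if violations else 0.0
    fp_rate = wrongly_flagged / len(benign) if benign else 0.0
    return fn_rate, fp_rate


# Made-up sample: 4 real violations (1 missed), 6 benign (2 flagged).
sample = (
    [Decision(flagged=True, violation=True)] * 3
    + [Decision(flagged=False, violation=True)]
    + [Decision(flagged=True, violation=False)] * 2
    + [Decision(flagged=False, violation=False)] * 4
)
fn_rate, fp_rate = error_rates(sample)
print(f"false negative rate: {fn_rate:.2f}")
print(f"false positive rate: {fp_rate:.2f}")
```

Tracking the two rates separately matters because they carry different costs: a missed violation harms players, while a wrongful flag harms trust and adds review work.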

Part of the solution is to use a [custom-trained ASR model](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation). That will help reduce transcription errors. Very often, game makers also need to have a refined natural language understanding (NLU) model to provide context that is often unclear from the text alone. Voice chat provides more robust data signals that can help differentiate between abusive and collegial verbal exchanges, which can further reduce the occurrence of false positives and negatives. 

Another challenge is [cost](https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes). Transcribing every voice chat in the cloud can run up large computational processing bills very quickly. Try it sometime. Few organizations can afford this and, as a result, scale back their voice chat moderation plans or cancel them altogether. This can be mitigated by running some or all of the ASR transcription locally on the users’ devices. 

Finally, there is the issue of speed. Voice chat is a real-time activity, while most moderation is conducted in after-the-fact audits or based on user-submitted complaints. That means the moderation is really adjudication. It takes place after the abusive or harassing behavior is over. Real-time solutions can flag these problematic conversations while they are in process and potentially mitigate the negative effects on the victims and prevent further spreading or virality. [On-device speech recognition](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) is one method to significantly speed up speech transcription, which is the first step in real-time monitoring.
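The difference between real-time flagging and after-the-fact review can be sketched in a few lines. The example below assumes transcript segments arrive one at a time with a timestamp; the `monitor` function, the placeholder word list, and the sample segments are all hypothetical, and the word list is only a stand-in for a real, context-aware classifier.

```python
from typing import Callable, Iterable, Iterator, Tuple


def monitor(segments: Iterable[Tuple[float, str]],
            is_violation: Callable[[str], bool]) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, text) for each violating segment as it arrives.

    Because this is a generator, a caller can act on the first alert
    without waiting for the conversation to end.
    """
    for timestamp, text in segments:
        if is_violation(text):
            yield timestamp, text


# HYPOTHETICAL classifier: a placeholder word list standing in for a
# real, context-aware model.
BLOCKED = {"badword"}


def simple_classifier(text: str) -> bool:
    return any(word in BLOCKED for word in text.lower().split())


# Made-up transcript segments: (seconds into the chat, recognized text).
stream = [
    (1.0, "nice shot"),
    (4.5, "you are a badword"),
    (9.0, "push left"),
    (12.0, "badword again"),
]

alerts = monitor(stream, simple_classifier)
first_alert = next(alerts)  # available 4.5 s in, not after the session
print("first alert at", first_alert[0], "seconds:", first_alert[1])
```

In an audit-style workflow the same check would run only after the full recording exists, so the earliest possible response comes after the session has ended.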

### Where Voice Chat Moderation is Headed

Game makers increasingly need to provide voice chat and need better tools for their moderation program. Speechly focuses on voice chat monitoring and provides the ASR and optional NLU as an API. Game makers can just connect to Speechly’s API and feed their existing moderation solutions directly with higher quality data. Speechly can also quickly train [custom ASR models](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation) and run the solution [on user devices](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) to deliver higher accuracy, lower cost, and lower latency. 

We didn’t originally set out to solve this problem. Some game makers were using our API to power the speech recognition for non-player characters and asked if our technology could help. It turns out the Speechly architecture and performance are well suited to address the voice chat moderation gap. As we began assisting game makers, we learned even more about the real requirements behind the problem, and we refined our API solution for moderation. You can learn more [here](https://www.speechly.com/products/moderation). 

Let us know if you have questions about how we help game makers tackle the voice chat moderation gap and also if you have any particular requirements that you would like to see added to our products. We are certain voice chat moderation is a problem worth solving for the benefit of gamers and game makers alike.

Major gaming studios like Riot Games, Roblox and Sony are recording voice chats for moderation, but the tools for content moderation today typically suffer from low accuracy, high cost, and high latency. A new technical approach is needed to fill the voice chat moderation gap.   

Why Games Need Better Voice Chat Moderation

At Speechly, we are known for the speed of our Automatic Speech Recognition and Natural Language Understanding. This is all thanks to our [Streaming API](https://docs.speechly.com/reference/streaming-api).

However, as our user base has continued to grow so has the demand for new product features. One of the most requested features to date is the ability to use Speechly for transcribing large amounts of pre-recorded audio or video content.

Given that demand, we are excited to release the [Speechly Batch API](https://docs.speechly.com/reference/batch-api) for Enterprise users.

## Transcribe Large Amounts of Pre-Recorded Audio and Video with the Speechly Batch API

The Speechly Batch API enables users to easily and privately send large sets of pre-recorded audio or video files to Speechly for Speech Recognition. This makes it easy to complete tasks like [Transcription](https://www.speechly.com/products/transcription), [Moderation](https://www.speechly.com/products/moderation) or other types of Speech Analysis on large amounts of off-line audio or video data.

To use the Speechly Batch API, users simply submit the audio and read the results after it is processed. The throughput of the Speechly Batch API supports processing thousands of hours of audio per hour. You can submit the audio directly to Speechly or give the Google Storage or Amazon S3 file URL, making it easy to scale.
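The submit-then-read workflow can be illustrated with a small in-memory simulation. Everything in this sketch is hypothetical: `FakeBatchService`, its method names, the status strings, and the bucket URLs are invented for illustration and are not taken from the Speechly Batch API reference, which remains the authoritative source for the real calls.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FakeBatchService:
    """In-memory stand-in for a batch transcription service.

    HYPOTHETICAL: the method names and statuses are illustrative and
    are not taken from the Speechly Batch API reference.
    """
    _jobs: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def submit(self, source_url):
        """Queue one audio file and return a job id."""
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"source": source_url, "status": "PENDING", "text": None}
        return job_id

    def process_all(self):
        """Simulate the service finishing every queued job."""
        for job in self._jobs.values():
            job["status"] = "DONE"
            job["text"] = f"<transcript of {job['source']}>"

    def result(self, job_id):
        """Return (status, transcript); transcript is None until done."""
        job = self._jobs[job_id]
        return job["status"], job["text"]


service = FakeBatchService()
# Placeholder bucket URLs for illustration.
urls = ["s3://example-bucket/call-001.wav", "gs://example-bucket/call-002.wav"]
job_ids = [service.submit(u) for u in urls]

# Right after submission nothing is ready yet.
pending = [service.result(j)[0] for j in job_ids]

service.process_all()  # a real client would poll or wait for completion
for job_id in job_ids:
    status, text = service.result(job_id)
    print(job_id, status, text)
```

The point of the pattern is that submission and retrieval are decoupled, so a client can queue many files at once and collect transcripts whenever they are ready.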

With the Speechly Batch API you can expect the same performance and accuracy that you would expect from the Streaming API. We also offer Enterprise customers data annotation services for continued monitoring and improvement in the performance of your Speech Recognition.

When deploying the Speechly Batch API, customers have the ability to deploy On-Premise or in a Private Cloud. These deployment options help customers leverage Speech Recognition in the most secure and cost-effective manner possible.

## Example Scenarios for the Speechly Batch API

The use cases for the Speechly Streaming API and the Speechly Batch API are the same. The difference is that the Streaming API is for live, online scenarios, while the Batch API is for offline scenarios. This means you should use the Streaming API if you need to process speech in real time as it’s being recorded. Use the Batch API if your audio or video has already been recorded and you need to process it later. Below are a few examples where the Speechly Batch API is ideal:

**User Generated Content Monitoring** - Massive amounts of podcasts and video are being recorded and uploaded every day. Speechly makes it easy to transcribe any user generated audio or video content uploaded to your platform for use cases such as [Moderation](https://www.speechly.com/products/moderation), Content Categorization or Content Indexing, Measuring Brand Awareness and Strategic Ad Placement.

**Meeting Analysis** - The Speechly Batch API makes it simple to convert meetings into [transcriptions](https://www.speechly.com/products/transcription) for later analysis. This can be used for use cases such as creating Meeting Summaries or making it easy to search back through past discussions for specific information.

**Customer Support** - Speechly can quickly and accurately transcribe recorded customer calls. This can be used for use cases such as Measuring Agent Performance and Agent Training or extracting & documenting relevant customer support information. This is different from monitoring a conversation in Real-Time, where the goal could be to offer the Agent relevant information as the call is happening. This would be another example of where the Speechly Streaming API would be better suited.

If you are interested in learning more about how to use the Speechly Batch API, you can read more in our [Documentation](https://docs.speechly.com/features/on-premise/). If you would like to learn more about getting access to the [Speechly Batch API](https://docs.speechly.com/reference/batch-api), [Contact Us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/new-feature-release-batch-api-for-transcribing-pre-recorded-audio) today.

Today we are excited to announce the Speechly Batch API for Transcribing massive amounts of pre-recorded audio or video content.

New Feature Release: Batch API for Transcribing Pre-Recorded Audio

Voice chat has become an essential feature in many games and social media platforms. Axlebolt Studios [found](https://unity.com/case-study/axlebolt-standoff-2#voice-chat-mobile) that adding voice chat to Standoff 2 improved 90-day retention rates by over 60%.

Tod Bouris of Vivox showed figures dramatically higher at [Unite Copenhagen](https://www.youtube.com/watch?v=IwZawWFVoPs) in 2019. “The metrics show that people that use communications during their gaming, game more and more often than those who don’t. Voice is a social element that adds stickiness and retention to your games that you can’t get from anywhere else.” Voice chat users, he said, spent twice the amount of time playing as non-voice users and were five times more likely to be playing after five weeks.

An Oxford Academic study from 2007 found that “voice chat leads to stronger bonds and deeper empathy than text chat.” As Subspace put it in 2021, “Voice deepens the immersive world, helps forge social bonds, and strengthens online play.”

Given the benefits and user expectations, it is no wonder that online games, social media platforms, and metaverses are providing voice chat. However, there is a [downside](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation).

<WhitePaperBanner
  title="4 Key Takeaways on Voice Moderation in Online Gaming & the Metaverse"
  description="We interviewed 20+ experts in the Online Gaming & Metaverse space. Here are the key takeaways for Voice Moderation."
  filePath="/uploads/4-takeaways-on-voice-moderation-in-online-gaming-the-metaverse\.pdf"
/>

## The Toxicity Highway

We know from [research](https://www.speechly.com/blog/online-harassment-stats) by Pew, ADL, and others that voice chat is also a primary source of toxic behavior and harassment in online games. These negative behaviors lead between a quarter and a third of users to reduce their gameplay or avoid specific games altogether.

The answer to this problem is, of course, moderation. But this is where the complications emerge. Voice chat is more complex than text chat because you first need to capture and analyze the voice data before you can apply the text analysis tools. This is where companies make mistakes that undermine their key objectives.

## Voice Chat Moderation Mistake 1 - The Human Touch

There may always be a human element in [voice chat moderation](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation), but you just cannot scale humans cost-effectively enough to monitor all chats or user generated content at all times. One human-led approach is to simply record chats and then only review them when a complaint is filed. This is also time intensive, provides no early warning system, introduces individual bias into the evaluation process, and hiring typically cannot keep up with usage growth.

Human-led moderation only seems to work at a very small scale with volunteer community monitoring. Smaller Discord servers are an example of this. Anything else requires automation of the moderation process.

## Voice Chat Moderation Mistake 2 - The Sloppy Transcription

Not all automated voice chat monitoring will give you the same results. The most common mistake that arises with these solutions is poor transcriptions. After you have transcribed the voice chat into text, you can more rapidly and cost-effectively analyze the data for toxicity and harassment or validate complaints.

What happens when the transcriptions are poor? A poor transcription can result in errors of omission and commission. The omission errors miss the bad behavior because they either misunderstand words or the context of the comments. The commission errors flag normal behavior as bad. Other terms for these are [False Negatives](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation) and [False Positives](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation), respectively.

[Automatic speech recognition](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary) (ASR) is not easy to do well. Users are often speaking quickly, may have accents, and use industry or company specific jargon in their speech. Plus, the audio quality is not always good, there may be noise in the background, and it is highly variable across users and environments. All of these factors make it hard to produce a high-quality transcript which can undermine the automated analysis of the data.

These same challenges may lead to errors of commission – identifying problems where they don’t exist. This can also be a big issue. If you incorrectly flag a user or take action against a user without cause, it undermines their loyalty and often leads to negative comments both within and outside of the game or community. Plus, inadvertently flagging a benign comment for human moderation review adds cost unnecessarily.

Few ASR solutions are designed for all of these challenges. Producing high-fidelity transcripts and analyses often requires you to train the speech recognition model for a particular use case. The off-the-shelf cloud speech recognition models typically have significant limitations in customization in addition to one more big problem.

## Voice Chat Moderation Mistake 3 - The Money Pit 

It is clear that human-led moderation practices don’t scale very well in terms of cost. The same is true for many cloud-based ASRs. One moderately sized [metaverse](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse) with 50,000+ daily active users (DAUs) saw a bill of nearly $15,000 for one day of cloud-based voice chat transcription from a popular provider. And this pricing rate doesn’t decline significantly as scale rises.
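A quick back-of-the-envelope calculation shows how a bill like this adds up. The sketch below rounds the figures quoted above to 50,000 DAUs and $15,000 per day. The per-minute price is a purely hypothetical assumption chosen for illustration, not any provider's actual rate, and the implied talk time is an average over all DAUs.

```python
def daily_transcription_cost(daily_active_users, minutes_per_user, price_per_minute):
    """Cost in dollars of transcribing all voice chat for one day."""
    return daily_active_users * minutes_per_user * price_per_minute


# Rounded figures from the example above.
DAUS = 50_000
DAILY_BILL = 15_000.0

# Implied spend per user per day.
cost_per_user = DAILY_BILL / DAUS

# ASSUMPTION: a hypothetical cloud price of $0.02 per audio minute.
ASSUMED_PRICE_PER_MINUTE = 0.02
implied_minutes = cost_per_user / ASSUMED_PRICE_PER_MINUTE

print(f"cost per user per day: ${cost_per_user:.2f}")
print(f"implied talk time per user: {implied_minutes:.0f} minutes")

# With flat per-minute pricing, cost grows linearly with usage.
for daus in (50_000, 100_000, 500_000):
    cost = daily_transcription_cost(daus, implied_minutes, ASSUMED_PRICE_PER_MINUTE)
    print(f"{daus:>7,} DAUs -> ${cost:,.0f} per day")
```

Under flat per-minute pricing, doubling the active user base or the average talk time doubles the bill, which is why the cost becomes a planning problem so quickly.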

Relying solely on cloud-based transcription can become prohibitively expensive very quickly. People using voice chat are active users, and that means they talk a lot, and that means your transcription costs rise rapidly. One company we spoke with found that their efforts to “optimize” cost led to poorer results and other hidden expenses, so they abandoned moderation altogether. We know that is not a good idea. So, the question is, how do you provide high-quality automated [moderation tools](https://www.speechly.com/products/moderation) that minimize errors and cost while also providing a healthy voice chat experience?

## Avoiding Voice Chat Moderation Mistakes

Speechly offers solutions that run in the [cloud](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy), [on device](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy), or in a [hybrid model](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy). In addition, our accuracy is higher than leading ASR solutions such as Google, even before the optimization of custom model training which is a feature we also offer. These elements combine to provide a dramatic improvement in both cost and accuracy.

Many games, social networks, and metaverses have learned some hard lessons about text chat moderation. Our hope is that you can avoid having to learn the new lessons of voice chat moderation by avoiding three of the common mistakes.

How did we learn about these? We met with over 20 professionals working in online gaming, the metaverse and social media earlier this year to learn about their challenges. They were interested in using Speechly for some other features, and it turned out to be an optimal solution for automating [voice chat moderation](https://www.speechly.com/products/moderation). And they liked the fact that we can provide the data to the existing moderation software without having to duplicate systems or replace them.

Let us know if you have faced these mistakes in the past or have any questions. You can reach out to us using our [Contact Us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes) form or try out our API for free by [Signing Up](https://api.speechly.com/dashboard/#/signup?ref=https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes) for an account.

*Photo by George Becker from [Pexels](https://www.pexels.com/photo/1-1-3-text-on-black-chalkboard-374918/)*

Voice chat has become an essential feature in many games and social media platforms making Moderation a critical thing to get right. 

3 Common Voice Chat Moderation Mistakes

Admittedly, we didn’t know much about this at Speechly before 2022. Our main focus has been on building a fast, accurate, and efficient voice user interface for applications. However, during our time in the Y Combinator accelerator, several companies suggested they could use our technology to help them moderate voice chat and content in [games](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation), [metaverses](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse), and other applications. So, we began studying the problem.

What we found was a compelling issue with plenty of challenges and few good options. There were several solutions for text chat moderation, but companies had learned that [voice moderation](https://www.speechly.com/products/moderation) was either prohibitively expensive or too inaccurate to implement with any confidence. That led us to apply Speechly to the voice moderation problem.

In our own efforts to understand the problem better, we came across a number of interesting studies that we thought might be useful if you are also researching online harassment. Below is a compilation of research on the topic. We will update this post as new studies are released and as we learn about earlier data.

Let us know what we missed. Please drop a link using our [Contact Us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/online-harassment-stats) form, and we will add it here. We hope the data here proves useful in your work!

## Pew Research on Online Harassment

Pew Research has been [tracking online harassment](https://www.pewresearch.org/internet/2021/01/13/the-state-of-online-harassment/) since 2014 and has some useful trendline information along with insightful recent findings.

![](/uploads/majority-say-online-harassment-is-a-major-problem.png)

![](/uploads/2-3-adults-under-30-harassed-online.png)

![](/uploads/online-harassment-of-women-more-severe.png)

![](/uploads/under-30-more-likely-harassed-online.png)

![](/uploads/41-americans-experienced-online-harassment.png)

![](/uploads/women-sexual-harassment-doubled-2017.png)

![](/uploads/women-vs-men-most-recent-harassment-experience.png)

![](/uploads/most-recent-harrassment-online-location.png)

![](/uploads/online-harassment-major-problem.png)

![](/uploads/1-3-americans-support-suing-sites-of-harassment.png)

### **TLDR Takeaways from Pew Research:**

* 41% of U.S. adults have personally experienced online harassment, and 25% have experienced more severe harassment.
* The majority of younger adults have encountered harassment online.
* While men are slightly more likely to experience harassment online, women are more likely to be upset about it and think it's a major problem.
* Experience with certain types of online abuse varies by age, gender, race or ethnicity.
* Roughly four-in-ten Americans have personally experienced online harassment.
* Women are more likely than men to be sexually harassed online. The share of women reporting online sexual harassment has doubled since 2017.
* Social media is the most common location for online harassment. However, for men harassment is more likely to take place in online gaming.
* Younger adults are more likely to have been harassed online while gaming, in a text or messaging app, or on an online dating site than on a forum, on social media, or in personal email.
* 55% of Americans consider online harassment to be a major problem.
* Users are becoming critical of the job social media companies are doing to address online harassment.

## ADL on Online Harassment in Social Media

The Anti-Defamation League (ADL) has also conducted extensive research into online harassment. The organization produces two reports, one for online harassment overall with a particular focus on social media, and another for gaming. The [2022](https://www.adl.org/resources/report/online-hate-and-harassment-american-experience-2022) and earlier reports for social media shed new light on the problem.

![](/uploads/online-harassment-percentage-since-2020.png)

![](/uploads/stop-or-reduced-usage-from-online-harassment.png)

![](/uploads/platforms-where-harassment-took-place-vs-use.png)

![](/uploads/men-vs-women-online-harassment-experience-ever.png)

![](/uploads/social-platforms-where-harassment-took-place-2020-2022.png)

![](/uploads/worried-of-being-targeted-online.png)

![](/uploads/action-taken-by-platforms-online-harassment.png)

![](/uploads/online-harassment-experience-over-last-12-months.png)

### TLDR Takeaways from ADL Online Hate and Harassment Survey:

* Online harassment has remained stable across platforms since 2020 despite tech companies' public commitments to improve safety on their platforms.
* In response to being harassed, almost a third of users (29%) stopped or reduced their use of platforms altogether, especially Facebook (19% reduced their use, and 10% stopped altogether).
* Youth are more likely to report harassment on Instagram and Snapchat vs Facebook than adults.
* The data suggests that harassment was less common on Twitter than on Facebook, more common than on YouTube or Reddit, and comparable to the likelihood of being harassed on Instagram.
* Comparing harassment vs use, you are most likely to experience harassment on Facebook (81%) followed by Twitter (44%), Snapchat (41%), Instagram (37%), Discord (35%), TikTok (31%), Twitch (26%), YouTube (21%), and Reddit (18%).
* More than a third (37%) of women reported being harassed at some point compared to 43% of men.
* Of those worried about future harassment, 62% were worried about being harassed for their political views, 53% for their physical appearance, 47% for their race or ethnicity, 44% for their religion, and 43% for their gender.
* Of the respondents who faced physical threats, 53% said they reported the content; only half of those reports led to any action by the platform.
* A third of respondents who were harassed reported being called offensive names.

## ADL on Online Harassment in Gaming

ADL also has extensive data on online gaming experiences. These include [survey results](https://www.adl.org/hateisnogame) that express both positive and negative experiences.

![](/uploads/positive-experience-in-multiplayer-games-last-6-months.png)

![](/uploads/positive-gaming-experiences-last-6-months.png)

![](/uploads/severe-abuse-online-is-getting-worse.png)

![](/uploads/disruptive-behavior-in-gaming.png)

![](/uploads/influence-of-disruptive-behavior-in-gaming.png)

![](/uploads/controversial-topic-reaction-online.png)

![](/uploads/adult-safety-settings-for-children.png)

### **TLDR Takeaways from ADL Hate is No Game Survey:**

* The vast majority of young gamers—more than nine out of ten—reported some form of positive social experience in online multiplayer games.
* Online games at their best can function as social platforms connecting people and building communities for a multitude of lived experiences.
* For the third consecutive year, harassment in online games has not decreased. Five out of six adult gamers experience harassment in online multiplayer games—more than 80 million American adults.
* Three in five young people experienced harassment in online multiplayer games—nearly 14 million young gamers.
* Over a quarter of young gamers who experienced harassment in online multiplayer games quit specific games.
* A third of young gamers changed how they play, including not speaking in voice chat and altering their usernames. Voice chat is notorious for being a significant locus of in-game abuse.
* The most common responses to exposure to extremism and disinformation were ignoring it (18%) and reporting or blocking the players involved (17%).
* Less than half of parents or guardians surveyed reported having implemented the safety controls in online multiplayer games that were analyzed in this survey.

## The Experience of Women Gamers

You can see from some of the data above that women’s experience with harassment while gaming differs from men’s. Some [examples](https://www.ivint.org/gaming-hidden-sexism-and-harassment/) of [additional data](https://www.statista.com/statistics/232383/gender-split-of-us-computer-and-video-gamers/) are included below.

![](/uploads/gender-gap-in-online-gaming.png)

![](/uploads/abuse-of-female-gamers-by-males.png)

![](/uploads/share-of-gamers-male-vs-female.png)

![](/uploads/59-women-mask-gender-gaming-online.png)

Source: GamesIndustry.biz

### TLDR Takeaways on Experience of Women Gamers:

* Women make up a large share of online gamers: the split is 52% male and 48% female.
* Although many women play games, the majority of game developers are men: 76% male and 22% female.
* In the U.K. online harassment from men towards women is causing women to stop playing games altogether.
* Online harassment towards women is so common that 59% of women mask their gender while playing games online.

Photo by Andrea Piacquadio: Pexels

Online harassment is as old as the internet. However, where it was once rare and infrequent, it is now increasingly common. The data all points in one direction and is compiled here.

Online Harassment Statistics that Matter for 2022

Speech to Text technology can be deployed in various ways, such as in the Cloud, On-Device, or On-Premise (Server or Private Cloud). However, there are various Pros and Cons in how you deploy that can affect the Cost, Speed, and Privacy of the experience you build. In this post, we will cover the differences between Cloud, On-Device, and On-Premise Speech to Text deployment and scenarios where you should consider ditching the Cloud for an [On-Device](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) or On-Premise deployment.

## Speech to Text: On-Device vs On-Premise vs Cloud

Whether you are running [Speech to Text](/blog/nlu-voice-speech-recognition-terms-glossary) On-Device, On-Premise, or in the Cloud, the core outcome remains the same. Speech to Text enables developers to convert audio to text for various use cases, such as [Transcription](/products/transcription) for Video Calls or [Moderation](/products/moderation) for Video Game chats, among many others.

Speech to Text can be deployed in multiple ways. The most common is in the Cloud. This means that audio is converted into text with the help of a cloud provider such as Google or Amazon: the audio is captured on a user's device and sent to the cloud for transcription, the developer's logic decides what to do with the transcript, and the result is sent back to the user's device.

Other ways of deploying Speech to Text include On-Device or On-Premise. This simply means that Transcription is taking place directly on the user's device running the application or within a company's private server stack or private cloud. While the use cases for On-Device or On-Premise Speech to Text are similar in nature, meaning at the core there is still the conversion of audio into text, deploying in this fashion comes with some additional benefits to consider.

*[Learn more about running Speech to Text On-Device or On-Premise with Cloud-grade performance](/contact?ref=https://www.speechly.com/blog/when-to-run-speech-to-text-on-device-or-on-premise-vs-in-the-cloud)*

## When to run Speech to Text On-Device or On-Premise

Running Speech to Text On-Device or On-Premise has 3 main benefits: Cost, Speed, & Privacy.

### Cost

Most Speech to Text or Speech Recognition solutions are Cloud based products. However, running Speech to Text in the Cloud requires sending large amounts of audio over the internet to be processed. For use cases where there is a lot of audio to be transcribed, like a Video Call or Stream, the cost can climb fast, making Speech to Text an unviable feature. With the ability to run Speech to Text directly on the user's device or On-Premise, the cost can come down by up to 10x depending on the provider.

### Speed

Another key pitfall with many cloud based Speech Recognition providers is the inability to deliver real time Speech to Text. Even at current network speeds, sending information to and from the cloud introduces a noticeable lag in the majority of Speech to Text products, which greatly disrupts the end user experience. Running Speech to Text On-Device or On-Premise is a great way to increase the speed of transcription, since the transcription process never has to leave the end user's device or the company's product ecosystem.

### Privacy

The final, but arguably most important reason to run Speech to Text On-Device or On-Premise is Privacy. We live in a world where consumers' attention to privacy is at an all time high. Even the concept of technology listening to complete tasks like transcription can make people uncomfortable.

Running On-Device or On-Premise allows companies to build experiences that leverage Speech to Text while giving users confidence that their valuable Voice Data is remaining private, either by never leaving their device or by remaining secure with the company delivering the experience.

## Speech to Text Accuracy: On-Device vs On-Premise vs Cloud

Speech to Text technology is powered by large Machine Learning models which historically has made it difficult to deliver the same accuracy in On-Device or On-Premise experiences vs in the Cloud. Until recently, running Speech to Text anywhere but in the Cloud meant a significant drop in accuracy performance as this environment usually required running smaller and less sophisticated Speech Recognition models.

However, at Speechly the Speech to Text models used by the On-Device and On-Premise solution are the same as the ones used in our Cloud Based offering. This means you can get 95%+ accuracy with Speech to Text Transcription in the Cloud, On-Device, or On-Premise.

## Building On-Device & On-Premise Speech to Text

There are still use cases for Speech to Text technology where a cloud based deployment makes sense. These [scenarios](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) vary, but they usually share one characteristic: a low overall volume of Voice Data. This simply means that there is a small amount of information to be transcribed at any given time, such as simple Voice Search inputs on a website.

When it comes to high volume scenarios, such as Transcribing a Video Call or [Moderating](https://www.speechly.com/products/moderation) a Voice Chat in an [online game](https://www.speechly.com/blog/why-games-need-better-voice-chat-moderation), deploying Speech to Text either On-Device or On-Premise can bring you Cost, Speed, and Privacy benefits. It is important to keep these factors in mind when finding a Speech to Text technology partner.

[Contact our Product Team](https://www.speechly.com/contact) if you would like to learn more about running Speech to Text On-Device or On-Premise.

*Photo by [Juairia Islam Shefa](https://unsplash.com/@juairiaa?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/computer-cell-phone?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)*

When deciding to deploy Speech to Text technology On-Device vs in the Cloud you should consider Cost, Speed, & Privacy.

When to Run Speech to Text On-Device or On-Premise vs in the Cloud

As the internet has matured, so have the available content types that users can enjoy and approaches for moderating online activity. Most of the early content online was dominated by text and visual based content where audio played a smaller role. Flash forward to the 2020s and you see a very different ecosystem that is dominated by images, videos, and even real time communication channels such as Clubhouse, Twitter spaces or the various [Metaverse](https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse) universes like Roblox that enable Voice Chat.

Since [voice-based moderation](https://www.speechly.com/blog/the-case-for-real-time-voice-chat-moderation-technology-in-the-metaverse) has received less attention over the years than text-based moderation, this post will outline how to use [Speech to Text](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary) technology to efficiently and effectively moderate Voice Chats and Voice-based content online. We will focus on keyword detection based moderation, in which the task is to find specific instances of bad words (or short phrases) in a user’s speech.

<WhitePaperBanner
  title="4 Key Takeaways on Voice Moderation in Online Gaming & the Metaverse"
  description="We interviewed 20+ experts in the Online Gaming & Metaverse space. Here are the key takeaways for Voice Moderation."
  filePath="/uploads/4-takeaways-on-voice-moderation-in-online-gaming-the-metaverse\.pdf"
/>

## Foundation of Voice Moderation: Efficient Speech to Text

In order to build a [Voice Moderation](/products/moderation) solution, you will need to use Speech to Text technology. Speech to Text does exactly as the name implies: it converts a user's speech into text that can be used for various downstream tasks, such as Moderation. It’s important to know that Speech to Text is a Machine Learning technology, which means that even when it is trained for a specific industry, there can still be instances where the system will make a mistake. This means that on occasion it will “hear” something other than what was actually said.

These Speech to Text mistakes may require special attention in the context of Voice Moderation, because inappropriately implemented moderation can be detrimental to the user experience. The moderation system should not miss obvious cases of toxicity, but accusing innocent users of bad behavior too often can be even more harmful. For this reason, it is important for businesses to consider the tradeoffs in building a Moderation solution that is super strict vs more tolerant. Doing this requires an understanding of Speech to Text False Negatives and False Positives.

## What are Speech to Text False Negatives and False Positives?

False Negatives and False Positives are the most important variables to consider when using Speech to Text technology to build a Voice Moderation solution.

* **False Negative**: A scenario where the Speech to Text misses a “bad” word/phrase and transcribes it as something “good”.
* **False Positive**: A scenario where the Speech to Text transcribes a “good” word/phrase as something “bad”.

Regardless of how accurate the Speech to Text system is, it is impossible to completely avoid False Negatives and False Positives. However, what businesses do have control over is the sensitivity of their Speech to Text. Choosing that sensitivity means weighing moderation tradeoffs between False Negatives and False Positives.

![Confusion Matrix for Binary Classification](/uploads/false-positive-false-negative-matrix.png 'Confusion Matrix for Binary Classification')
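To make the four cells of the matrix concrete, here is a minimal sketch of keyword detection based moderation scored against ground truth. The block list, the sample transcripts, and the function names are all hypothetical and chosen only for illustration; they are not part of any Speechly API.

```python
import re
from collections import Counter

# Hypothetical block list; a real deployment would maintain its own.
BLOCKED_TERMS = {"darn", "heck"}

def is_flagged(transcript: str) -> bool:
    """Return True if any blocked term appears as a whole word."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return any(word in BLOCKED_TERMS for word in words)

def confusion_counts(samples):
    """Count outcomes for (transcript, actually_bad) pairs.

    transcript is what the Speech to Text system produced;
    actually_bad is whether the speaker really said a blocked term.
    """
    counts = Counter()
    for transcript, actually_bad in samples:
        flagged = is_flagged(transcript)
        if flagged and actually_bad:
            counts["true_positive"] += 1
        elif flagged and not actually_bad:
            counts["false_positive"] += 1
        elif not flagged and actually_bad:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts

samples = [
    ("well darn it", True),          # bad word said and transcribed correctly
    ("that was a good yarn", True),  # speaker said "darn", STT heard "yarn"
    ("let me heck the map", False),  # speaker said "check", STT heard "heck"
    ("nice shot", False),            # clean speech, clean transcript
]
counts = confusion_counts(samples)
print(dict(counts))
```

In this toy data each of the four outcomes occurs once. The False Negative and the False Positive both come from a transcription mistake, not from the keyword matching itself.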

## Tradeoffs between Speech to Text False Positives and False Negatives with Voice Moderation

There is a direct tradeoff between Speech to Text False Negatives and False Positives. For a given system, reducing one generally increases the other. To better understand this, let's look at an example using a Human Moderator and 2 different scenarios.

### Situation

When the Moderator hears profanity, assume there are 2 levels of confidence to gauge how well they heard the word: ***HIGH*** and ***LOW***.

* If confidence is ***HIGH***, they are more likely to have heard correctly.
* If confidence is ***LOW***, they are more likely to have heard incorrectly.

The Moderator has 2 Possible Scenarios:

* Scenario 1) Only flag when confidence is ***HIGH***.
* Scenario 2) Flag when confidence is ***HIGH*** or ***LOW***.

### Scenario 1 - HIGH Confidence

In this scenario, the Moderator will flag ***less often*** as they will be more selective in the flagging process.

* This means there will be ***More False Negatives*** since some cases that should have been flagged will be missed.
* This means there will be ***Fewer False Positives*** as the Moderator tends to only flag correct cases.

### Scenario 2 - HIGH or LOW Confidence

In this scenario, the Moderator will flag ***more frequently*** as they will be less strict in the flagging process.

* This means that there will be ***Fewer False Negatives*** since the Moderator is more actively flagging even when not 100% certain of what they heard.
* This means there will be ***More False Positives*** since the Moderator likely made more mistakes with the lower confidence threshold.

### Decision - Weighing False Negatives and False Positives

In this situation, the business has to decide whether to be more or less strict in their moderation flagging process.

* If they favor a ***less strict*** policy, they will follow Scenario 1, where the Moderator must have a high degree of confidence before flagging.
* If they favor a ***more strict*** policy, they will follow Scenario 2, where the Moderator flags even at low confidence so that as little profanity as possible makes it through.

The tradeoff decision made in this Human Moderator example is exactly how you should look at tradeoffs with Speech to Text for Moderation!
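The two scenarios can be sketched in a few lines of code. This is a toy model under stated assumptions: every item is something the Moderator thinks might be profanity, each carries a HIGH or LOW confidence label, and the ground truth column is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    confidence: str      # "HIGH" or "LOW": how sure the Moderator is
    was_profanity: bool  # ground truth: was profanity really said?

def evaluate(detections, flag_levels):
    """Count mistakes when flagging only the given confidence levels."""
    false_positives = sum(
        1 for d in detections
        if d.confidence in flag_levels and not d.was_profanity
    )
    false_negatives = sum(
        1 for d in detections
        if d.confidence not in flag_levels and d.was_profanity
    )
    return {"false_positives": false_positives,
            "false_negatives": false_negatives}

# Hypothetical data: HIGH confidence is usually right, LOW often is not.
detections = [
    Detection("HIGH", True), Detection("HIGH", True), Detection("HIGH", False),
    Detection("LOW", True), Detection("LOW", True),
    Detection("LOW", False), Detection("LOW", False), Detection("LOW", False),
]

scenario_1 = evaluate(detections, {"HIGH"})         # flag only when HIGH
scenario_2 = evaluate(detections, {"HIGH", "LOW"})  # flag when HIGH or LOW
print("Scenario 1:", scenario_1)
print("Scenario 2:", scenario_2)
```

With this data, Scenario 1 makes 1 False Positive and 2 False Negatives, while Scenario 2 makes 4 False Positives and 0 False Negatives. The exact numbers depend entirely on the invented data; the direction of the tradeoff is the point.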

## Finding the Right Balance of False Negatives and False Positives for Your Business

For each level of False Negatives there is a matching level of False Positives. This is called the **Operating Point** of the system. Finding the ideal Operating Point with Speech to Text is the main goal when it comes to Moderation, however where this Operating Point will exist is determined by the context of an individual business and their goals.

For example, if no profanity is tolerated, you might opt for a higher False Positive rate, with each flag then verified by a human in the loop. Alternatively, if you would like to fully automate moderation and do not have the resources to verify every flag with a human, you would opt for a lower False Positive rate, though you may have to tolerate more False Negatives as a consequence.
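One way to picture choosing an Operating Point is to sweep a detection threshold and keep the setting with the fewest False Negatives that still fits your False Positive budget. The sketch below assumes the system emits a numeric confidence score per detection, which is a common design but not something this post specifies; the scores, thresholds, and budgets are made up.

```python
def rates_at(threshold, scored):
    """Return (fp_rate, fn_rate) when flagging every score >= threshold.

    scored is a list of (score, is_bad) pairs with known ground truth.
    """
    bad = [score for score, is_bad in scored if is_bad]
    good = [score for score, is_bad in scored if not is_bad]
    fn_rate = sum(score < threshold for score in bad) / len(bad)
    fp_rate = sum(score >= threshold for score in good) / len(good)
    return fp_rate, fn_rate

def pick_operating_point(scored, thresholds, max_fp_rate):
    """Pick the threshold with the lowest FN rate within the FP budget."""
    best = None
    for threshold in thresholds:
        fp_rate, fn_rate = rates_at(threshold, scored)
        if fp_rate <= max_fp_rate and (best is None or fn_rate < best[2]):
            best = (threshold, fp_rate, fn_rate)
    return best  # (threshold, fp_rate, fn_rate), or None if nothing fits

# Hypothetical labeled detections: (confidence score, was it really bad?)
scored = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.70, False), (0.50, False), (0.30, False), (0.10, False),
]
thresholds = [0.2, 0.45, 0.65, 0.9]

# Humans verify every flag, so many false alarms are acceptable.
with_human_review = pick_operating_point(scored, thresholds, max_fp_rate=0.75)
# Fully automated, so false alarms must stay rare.
fully_automated = pick_operating_point(scored, thresholds, max_fp_rate=0.25)
print(with_human_review, fully_automated)
```

The human-in-the-loop budget selects the lowest threshold and misses nothing in this data, while the automated budget is pushed to a higher threshold and accepts a 50% False Negative rate.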

In order to use Speech to Text for [Voice Moderation](https://www.speechly.com/products/moderation), you need a solution that can be adjusted to target the specific type of language that is relevant to your business and use case. By using a dedicated Speech to Text model that is trained with data that is relevant for your use case, Speechly can reduce false negatives without adversely affecting the false positive rate.

If you would like to learn more about how Speechly’s Speech to Text technology can be adapted to target specific jargon from your business or industry, reach out to the team with our [Contact Us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation) form.

*Cover Photo by Karolina Grabowska: [Pexels](https://www.pexels.com/photo/person-holding-tuning-pegs-4472108/)*

When moderating Voice-based content, there is a tradeoff between false alarms and real cases where moderation is required. Balancing these depends on your goals.

How to Fine Tune Speech to Text for Voice Moderation

### TLDR;

* Voice chat has become a key feature for creating social connections in online games and is expected to be even more important in metaverse environments, some of which will have more significant social features than games do today.
* Voice chat users are known to access games more frequently and engage in longer sessions. This makes them very valuable as voice chatters are both more active users and are creating user-generated content that extends the experience offered by the game or metaverse.
* However, voice chat is the top channel for [online harassment](https://www.speechly.com/blog/online-harassment-stats) in online games, and we have already seen that behavior translate directly over to metaverses.
* Online harassment has been rising in frequency and severity over the past five years.
* This takes a toll on the users and also has very real costs for the gaming platform. Nearly one-third of consumers subjected to online harassment avoid games with reputations for toxicity, and more than one-in-four have left games due to bad behavior by other users.
* Moderating [voice chat](https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes) is more difficult and costly than text chat.
* New techniques around [Speech to Text](https://www.speechly.com/blog/how-to-fine-tune-speech-to-text-for-voice-moderation) (STT) training and [on-device](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy) processing are poised to revolutionize voice chat moderation and become standard features for emerging metaverses. Without these technologies, metaverses are at undue risk that a few bad actors will undermine adoption before it even gets started.

<WhitePaperBanner
  title="4 Key Takeaways on Voice Moderation in Online Gaming & the Metaverse"
  description="We interviewed 20+ experts in the Online Gaming & Metaverse space. Here are the key takeaways for Voice Moderation."
  filePath="/uploads/4-takeaways-on-voice-moderation-in-online-gaming-the-metaverse\.pdf"
/>

### Online Harassment is Getting More Severe

“Roughly four-in-ten Americans have experienced online harassment,” says the [Pew Research Center](https://www.pewresearch.org/internet/2021/01/13/the-state-of-online-harassment/).  While the total share of consumers reporting a harassment experience didn’t change between 2017 and 2021, the research organization says, “more severe encounters have become more common.” This includes significant rises in physical threats, stalking, and sexual harassment.

![](/uploads/pi_2021.01.13_online-harrasment_0-01-1.webp)

While the overall harassment numbers from Pew were flat at 41% since 2017, users experiencing the more severe behaviors rose from 18% to 25%. Studies show that users actively avoid online games and stop using the games where harassment is common. Metaverse spaces share many characteristics of games and that will invariably include harassment. Oh, and the number one channel for harassment in games is … you guessed it … voice chat.

### The Problem for Metaverse Developers

This presents a problem for metaverse developers. According to the [Anti-Defamation League (ADL)](https://www.adl.org/resources/report/hate-no-game-harassment-and-positive-social-experiences-online-games-2021), “Abusers, who use voice chat in online games to target individuals, often evade detection because the tools and techniques to detect hate and harassment within a game’s voice chat lag behind those that moderate text communication.”

The challenges associated with moderating voice chat range from privacy to the options and timeliness of enforcing policies. These issues also exist for text chat moderation. However, voice chat has two other big issues in terms of accuracy and cost. Voice chat must first be accurately transcribed from speech to text before analysis can take place. Transcription technology has advanced considerably outside the gaming world over the past decade, but it is not optimized for monitoring online harassment in dynamic digital environments.

You might think that a simple solution would be just to turn off text and voice chat to avoid these problems. However, that runs counter to the evolution of gaming becoming more social.

### Voice Chat Improves the User Experience

Studies show that there are many positive psychological effects of adding social communication to games that enhance the overall experience. Games without chat are at a disadvantage to those with it. This is likely to have an even greater impact in metaverses where gaming is not the central activity. Social connectivity will be an essential feature. That means voice chat moderation must be a top priority for metaverse builders.

An Oxford Academic study from 2007 found that “voice chat leads to stronger bonds and deeper empathy than text chat.” As Subspace put it in 2021, “Voice deepens the immersive world, helps forge social bonds, and strengthens online play.”

![](/uploads/axelbolt_chart_retention-vivox_0.webp)

To put more concrete numbers to this, [Axlebolt Studios](https://unity.com/case-study/axlebolt-standoff-2) found that 90-day player retention rose by 63% after implementing voice chat. The average revenue per user (ARPU) also rose by 12%. Axlebolt Studios’ Salah Sivushkove told Unity in an interview, “voice chat is unequivocally a must-have mechanic for multiplayer games.”

### What Metaverses Must Learn from Multiplayer Games

The ADL [revealed](https://www.adl.org/resources/report/hate-no-game-harassment-and-positive-social-experiences-online-games-2021) concerning results for online gaming in 2021. A national survey found that “Five out of six adults (83%) ages 18-45 experienced harassment in online multiplayer games—representing over 80 million adult gamers.” That figure has risen for three straight years. In addition, “Three out of five young people (60%) ages 13-17 experienced harassment in online multiplayer games—representing nearly 14 million young gamers.”

The ADL’s [2020 report](https://www.adl.org/resources/report/free-play-hate-harassment-and-positive-social-experience-online-games-2020) indicated that about half of the harassment took place in voice chat during gameplay, while text chat harassment during gameplay totaled 39%. Outside of matches, voice chat again saw more bad behavior than text chat, at 28% to 22%.

![](/uploads/free-to-play-800-6.png)

That same year 28% of online gamers avoided specific games due to their reputation for online harassment and 22% stopped playing. By 2021, those figures rose to 30% and 27% respectively. This means there is a very real economic cost of toxic behavior to game makers beyond the human impact.

Riot Games recognized this problem could no longer be ignored for its Valorant title last year. The company [announced](https://www.makeuseof.com/riot-games-recording-valorant-voice-chats/) that it was changing the terms of service and would begin recording and analyzing voice chats. A blog post announcing the change included the comments:

“Disruptive behavior on voice comms is a huge pain point for a lot of players. And we believe one of the ways to combat it is by providing quick and accurate ways to report abuse or harassment so we know when to take action. We also need clear evidence to verify violations of behavioral policies before we take action and to help us share with players on why a particular behavior may have resulted in a penalty.”

Unfortunately, we already know that this isn’t just an issue for online games. A woman in the UK claimed earlier this year that she had been verbally and sexually harassed in Meta’s Horizon Worlds metaverse. She wrote on [Medium](https://medium.com/kabuni/fiction-vs-non-fiction-98aa0098f3b0), “Within 60 seconds of joining — I was verbally and sexually harassed — 3–4 male avatars, with male voices…A horrible experience that happened so fast and before I could even think about putting the safety barrier in place. I froze.”

### Voice Chat in the Metaverse

[Motley Fool](https://www.fool.com/investing/2022/04/13/prediction-5-billion-people-will-be-in-the-metaver/) says, “there are approximately 400 million users of metaverse and metaverse-like worlds. By 2030, Citi predicts there could be up to 5 billion.” Some parts of the online gaming world already fall into that metaverse-like category. They offer insight into how the metaverse spaces will evolve and challenges they are sure to face.

Communication in the metaverse will not be typing and text-first. It will be speaking and voice-first. You are using your hands to navigate metaverse virtual worlds. The only practical way to communicate most of the time is by voice. Otherwise, you wind up with a start-and-stop experience where you are constantly waiting for someone to finish typing or stop moving so they can type and communicate. Even non-gaming metaverse experiences are subject to these interactive dynamics.

Moreover, the need for policing may not be limited to humans in the metaverse. Recall Microsoft’s infamous Tay chatbot deployed to Twitter. “Microsoft is battling to control the public relations damage done by its ‘millennial’ chatbot, which turned into a genocide-supporting Nazi less than 24 hours after it was let loose on the internet,” said an article in USA Today.

Many metaverses are deploying non-player characters (NPCs), and some of those have learning engines that could eventually generate conversations that stray from the normal script and head into “Microsoft Tay territory.” This would be a disaster because the metaverse wouldn’t even be able to assign the blame to a user. The question for metaverse builders is how to manage the risk.

### The Challenge

The key issues of moderating voice chat boil down to accuracy and cost. Both relate to the conversion of Speech to Text that takes place before any analysis can be conducted. If the transcript is inaccurate, then you risk missing harassing behavior or mistakenly labeling non-harassing voice chat as toxic and alienating an innocent user. Neither of these outcomes is good for users or the metaverse community.

#### STT Accuracy Challenge Examples

Speechly recently participated in the famous Y Combinator tech startup accelerator. Over the course of that program, Speechly leaders had the opportunity to interview executives at 70 companies and ask them about their current challenges. Several specific themes around Speech to Text accuracy emerged as important.

* “Moderation only works if the system is able to accurately monitor the conversations that are taking place.”
* “General accuracy isn’t good enough…Businesses need the ability to build a Speech to Text model for their specific use case."

#### STT Cost Challenge Examples

We also learned that Speech to Text solutions are too costly for many games and metaverses. While offerings from Google and Amazon are easy to access, their cost can be prohibitive at scale. Transcription is an added cost beyond what you encounter with text chat and can exceed $10,000 per day even for relatively small user bases of under 100k DAU.

* “Cost is problematic for us because of the amount of data processed on a daily basis.”
* “The cost \[of STT] was so limiting that we had to optimize our approach to the point that it became not worth it.”
* “The cost of Speech to Text technology is so high that it has forced companies that deal with moderation problems to get creative in how and when they use Speech to Text and therefore have to be creative on what to monitor rather than being able to monitor everything. Or, they have to forgo voice moderation altogether.”
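To see how per-minute pricing adds up at scale, here is a back-of-the-envelope estimate. All three inputs are hypothetical values picked for illustration; they are not quotes from any provider or measurements from any game.

```python
def daily_transcription_cost(daily_active_users: int,
                             voice_minutes_per_user: float,
                             price_per_minute_usd: float) -> float:
    """Estimate daily Speech to Text spend if every voice minute is transcribed."""
    return daily_active_users * voice_minutes_per_user * price_per_minute_usd

# Hypothetical inputs: 100k DAU, 20 minutes of speech each, $0.006 per minute.
cost = daily_transcription_cost(
    daily_active_users=100_000,
    voice_minutes_per_user=20,
    price_per_minute_usd=0.006,
)
print(f"Estimated daily cost: ${cost:,.0f}")
```

Under these assumptions the estimate comes to roughly $12,000 per day. Because the cost is a simple product, halving either the minutes transcribed or the per-minute price halves the bill, which is why teams end up choosing what to monitor instead of monitoring everything.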

It is a difficult situation for game and metaverse builders. They need to have voice chat as a feature. It is essential that voice chat is moderated. The cost of traditional Speech to Text transcription solutions is too high to be economically viable. Something has to give.

### The Solution

At Speechly, we have worked alongside select partners to address the challenges around Accuracy and Cost with Speech to Text when using it for Voice Chat Moderation. With our API, developers have the ability to train Speech to Text models for their specific use case rather than using a generalized model. This helps create Voice Moderation that is accurate for the words that matter for the use case at hand.

Speechly can also be deployed [On-Device or On-Premise](https://www.speechly.com/blog/on-device-vs-cloud-speech-recognition-comparing-privacy-cost-and-accuracy). This results in a more cost-effective and private solution. Running Speech to Text On-Device or On-Premise can be up to 100x cheaper than the cloud. Also, this option is private by design as the Voice Chat data is never required to leave the user's device or your company's data center.

If you are interested in learning more about how to use Speech to Text technology for Voice Chat Moderation, [contact us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/why-voice-chat-moderation-can-make-or-break-a-metaverse) to become a partner!

### What's Next

Metaverse adoption is taking off. With more than 400 million users today and the expectation of five-to-tenfold growth, every problem is going to require a scalable solution. Gaming has provided metaverse builders with an early warning about what to expect. Voice chat will be a key feature for metaverse spaces but it comes with the risk of harassment and toxicity. Those risks have very real consequences in terms of user adoption, experience, and retention. The significant challenges in implementing [voice chat moderation](https://www.speechly.com/blog/3-common-voice-chat-moderation-mistakes) that go beyond text chat are a complicating factor. However, there are new technologies and techniques arising just in time for the Cambrian metaverse explosion.

*Cover Photo by [Jessica Lewis](https://unsplash.com/@jessicalewiscreative?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/virtual-reality?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)*

Voice chat is the top channel for online harassment in games, and that will translate directly over to metaverses if unaccounted for.

Why Voice Chat Moderation Can Make-or-Break a Metaverse

<WhitePaperBanner
  title="Speech as an Accelerant – Are Voice Assistants Necessary?"
  description="Learn more about how Voice Assistants are holding back Voice-enabled experiences."
  filePath="speechacclerant_opus_final_v2.pdf"
/>

A handful of voice assistants introduced by tech giants have become household names, but their high profile has not been accompanied by reliable results and repeated use. The lack of ubiquitous voice apps and the generally blasé uptake by consumers led [Opus Research](https://opusresearch.net/wordpress/) and Speechly to ask the question: *are Voice Assistants necessary*? The outcome is a white paper that examines the common attributes of successful voice apps that make it easy for individuals to use very brief utterances to accomplish complex, but everyday, tasks.

## White Paper Topics:

* **Learnings from Successful & Unsuccessful Voice Experiences**
* **Impact of Speed with Voice User Interface (UI) Features**
* **Making more of Search and Navigation**
* **Preparing for Voice in the Metaverse**

A few Big Tech voice assistants have become household names, but their high profile has not been accompanied by reliable results and repeated use. Why?

Speech as an Accelerant – Are Voice Assistants Necessary?


Google Assistant first broke out of the phone and into the home via the Smart Speaker in [2016](https://www.androidpolice.com/google-shutting-down-assistant-conversational-actions-app-actions-for-android/), following the Made by Google event. They also opened up access to 3rd party developers to build Conversational Actions for the smart speakers they released.

Conversational Actions, also known as Google Actions or Voice Apps, will [“sunset” or shut down](https://developers.google.com/assistant/ca-sunset) in 1 year, on June 13, 2023. This will effectively end the development of 3rd party voice experiences on smart speakers and smart displays.

## What does this mean for Voice Tech at Google?

Developers of Conversational Actions are [being encouraged by Google](https://developers.googleblog.com/2022/06/Helping-Developers-Create-Meaningful-Voice-Interactions-with-Android.html) to transition their Voice Apps into Android Apps which they can then Voice-Enable using App Actions with Android. App Actions enable users to say voice commands to quickly access Android App functionality. This is powered by Google Assistant’s intent mapping and Natural Language Understanding (NLU).

Google is also moving from a “Voice First” to a “Voice Forward” narrative. This appears to shift their Voice Tech strategy away from conversation and toward getting tasks done by using Voice alongside the screens found on many Android devices. In the blog post covering [Creating Voice Apps for Android](https://developers.googleblog.com/2022/06/Helping-Developers-Create-Meaningful-Voice-Interactions-with-Android.html), Rebecca Nathenson, Director of Product Management, writes, “Whether someone asks Assistant to start a workout, order food, or schedule a grocery pickup, we know users are looking for ways to get things done more naturally using voice.” The focus is on getting things done by Voice rather than on starting a conversation, a clear contrast to the “Voice First” narrative that has dominated the headlines for years.

## Why is the Shutdown of Conversational Actions happening?

The [Google Developers blog](https://developers.googleblog.com/2022/06/Helping-Developers-Create-Meaningful-Voice-Interactions-with-Android.html) continues and gives further background on why Conversational Actions are being shut down. Nathenson writes, “While Conversational Actions were an excellent way to experiment with voice, the ecosystem has evolved significantly over the last 5 years and we’ve heard some important feedback: users want to engage with their favorite apps using voice, and developers want to build upon their existing investments in Android. In response to that feedback, we’ve decided to focus our efforts on making App Actions with Android the best way for developers to create deeper, more meaningful voice-forward experiences.”

Bret Kinsella of Voicebot.ai also [spoke on this topic](https://voicebot.ai/2022/06/13/google-assistant-actions-voice-apps-to-sunset-focus-shifts-to-android-apps/) saying, “This transition has been in process since at least Google I/O 2019 whether that was widely recognized or not. It was clear then that Google Actions were being subordinated within the Android ecosystem and that all of the incentives and organizational structure would drive an Android-first approach. This was bound to leave voice app developers in a difficult spot…”.

## Validation of Voice UI as a Feature vs Conversational Voice UIs

At Speechly, we have built our technology from the ground up to support [Voice UIs as a Feature](https://www.speechly.com/blog/voice-uis-as-a-feature-vs-conversational-voice-uis) in Mobile and Web experiences vs the traditional Conversational Voice UI experience you find with Voice Assistants. We have always believed that Voice UIs as a Feature are an [inevitable step](https://www.speechly.com/blog/the-inevitability-of-voice-ui-features) in the [Evolution of UIs](https://www.speechly.com/blog/evolution-of-uis), before we will see Voice Assistants take root. Right now, Voice Features are delivering the most value in [Mobile Apps](https://www.speechly.com/blog/the-fastest-ui-for-the-web-and-mobile) and will continue to do so until Voice UIs become an expected modality for users.

It's great to see that Big Tech companies like Google, who have made massive investments in assistants, are realizing this. Voice UIs are at their best when adding efficiency in getting things done. This is only hindered by forcing users into a back and forth conversation with their technology.

## How can Android Developers easily Voice-Enable their Mobile App?

With the [Speechly API Android Client](https://github.com/speechly/android-client), Mobile App developers can easily build responsive, voice-enabled applications. With just a few lines of code you can add [Voice UI Features](https://www.speechly.com/demos) to your application such as Voice Search, Filtering, Form Filling, Input, or Command & Control.

If you are an Android Developer interested in Voice-Enabling your application, check out the [Speechly API Android Client on Github](https://github.com/speechly/android-client).

If you have a new use case in mind or need help Voice-Enabling your Android Application, reach out to the Speechly Team on our [Contact Us page](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/conversational-actions-shutdown-and-shift-focus-to-voice-ui-in-android-mobile-apps).

_Cover photo by Thomas Kolnowski on Unsplash_


Google Actions, or Voice Apps, will be shut down in 2023 while Google encourages developers to Voice-Enable Android Apps.

Conversational Actions Shutdown and Shift Focus to Voice UI in Android Mobile Apps


The User Interface experience for the web and mobile has changed very little in the past decade. We are still clicking and typing, tapping and swiping. Users can only do what a designer decides to make available. And, they are forced into a specific sequence of steps to accomplish any task.

Voice User Interfaces (Voice UIs) were supposed to change all of this because they are fast, flexible, and enable complex inputs with nearly zero effort. It has not happened to date because Voice Assistants were thrust upon us as the one and only Voice UI. It is not the right model for many websites and mobile apps. However, there is a better model for Voice UIs.

What is the goal of a UI? Typically, it is to help the user fulfill their intent efficiently and consistently. However, it can be hard to put the right information and controls in front of every user to meet their exact need at a given moment with a standard visual interface. We now have the technology to move from Designer-Constrained experiences to User-Driven experiences by leveraging Voice UIs as a Feature of our experiences. Download this white paper to learn more!

## What You Will Learn

- How Voice User Interface (UI) Features Outperform Voice Assistants
- Limits of the Voice Assistant Model
- Evolution of UIs
- How Voice Helps Create The Fastest UI for the Web & Mobile
- The Inevitability of Voice UI Features

<WhitePaperBanner
  title="Voice&nbsp;UIs as a Feature vs Conversational Voice&nbsp;UIs"
  description="Learn how Voice UI features are outperforming Voice Assistants."
  filePath="/uploads/speechly-whitepaper-voice-uis-vs-conversational-voice-uis.pdf"
/>


Check out our most recent white paper exploring the benefits of Voice UIs as a Feature in Mobile & Web experiences vs traditional Conversational Voice UIs found with Voice Assistants.

Voice UIs as a Feature vs Conversational Voice UIs


Voice Assistant use has increased considerably over the past decade and has introduced many consumers to voice interaction. However, this model is actually not widely deployed. It seems very common because the tech giants have all introduced Voice Assistants and they are widely distributed through many of the most popular consumer devices, such as smartphones. In practice, far more Voice User Interfaces (UIs) in use today are deployed as a Feature.


## Use Cases of Voice UIs as a Feature

### Navigation

Navigation apps are a good example where Voice is a UI Feature. There is no attempt at conversation and yet nearly all of them now have the ability to accept requests by voice. The natural language inputs are followed by a visual response. That could be information about the intended route, a map, or both.


### Banking

Banking apps provide another example. Erica from Bank of America, Eno from Capital One, U.S. Bank’s smart assistant, and Fargo from Wells Fargo all respond to inputs requested by Voice. None of them conduct multi-turn voice conversations. Why didn’t these companies simply replicate the model used by Alexa and Google Assistant? They didn’t need to. The Voice Input combined with a visual response was what added the most value to users.


### Music Streaming

Consider also the music streaming apps. Amazon Music, Apple Music, Pandora, and Spotify all offer Voice Search as a Feature for finding songs and to activate simple controls. While all of these implementations aside from Spotify are backed by Voice Assistants, they are still employed as simple, reactive Voice UIs. The audio response is typically just saying what song is about to play and then playing it - there is no back and forth conversation.


### Smart TVs

Voice UIs on smartphones work in a very similar way to Voice UIs for smart TVs. Cable and streaming television services are all adding Voice Search and Voice Commands for navigation and control. These solutions don’t pretend to make conversation. They respond to a request by loading information on a screen or by changing a television control based on the Voice Command. Voice UIs that simply respond to voice requests are actually far more common than Voice Assistants as a new Conversational Channel.


## Why Voice UI Features Have Outpaced Voice Assistants

The more pertinent question is why so many of these companies are implementing Voice UIs of any kind. When it comes to mobile apps and mobile web, voice is more effective than touch-and-type interfaces. There is limited screen real estate to place buttons and thumb-typing is terribly inefficient. This is particularly true when you need to offer open-ended input (e.g. navigation/mapping and music streaming) or have a lot of features that are difficult to display in a simple menu (e.g. banking apps).

Voice UIs are also increasingly viewed as an accessibility feature for healthcare-related apps. It is a safety feature in cars where automakers want to offer an increasing number of features but endless menus can distract drivers. Touch and typing interfaces are not going away. However, they are increasingly being augmented by Voice UIs because of constraints, convenience, and changing consumer preferences.

On that final point, smart speakers, smart home products, and automobiles are playing a big role. There was a time when using your voice to control devices was uncommon. That is no longer true. Many people are using voice both at home and in the car. This is planting new habits around the use and expectation of Voice UIs in everyday digital interactions. It is no longer news when a smart home device or car adds a Voice Assistant. Nearly all of them support Voice UIs.


## Building Voice Features in Mobile and Web

This trend is continuing to expand into the mobile app space as well as the mobile web space, which is constantly adding new features to reach near parity with native mobile apps. And users want Voice UIs as a Feature as opposed to Voice Assistants, because they are a better match for user needs.

Speechly’s Voice Interface solution is aligned with these trends. If you want a full Voice Assistant, there are many solutions available from Big Tech, startups, and open source frameworks. However, the attention of developers, app and web publishers has shifted to solutions that can enable Voice UIs as a Feature.

Speechly offers a fast, accurate, and simple Voice UI API to build Voice Features fast. With Speechly any developer can be a Voice Developer.

In addition, Speechly delivers best-in-class responsiveness in terms of speed and has introduced a full duplex Voice Interface Solution for mobile, web, gaming, and the metaverse. There is no longer a need to wait for a Voice Assistant response. With Speechly, you can build solutions that simply react in real time to Voice Commands. And we are committed to providing the best Voice Interface solution for mobile app and web developers.


### Voice UIs as a Feature vs Conversational Voice UIs

There are limitations with the Voice Assistant model; however, there are tangible opportunities for Voice UIs as a Feature in our Web, Mobile, Gaming, and Metaverse applications. If these opportunities are of interest to you, consider checking out our full white paper on “Voice UIs as a Feature vs Conversational Voice UIs”.

<WhitePaperBanner
  title="Voice&nbsp;UIs as a Feature vs Conversational Voice&nbsp;UIs"
  description="Learn how Voice UI features are outperforming Voice Assistants."
  filePath="/uploads/speechly-whitepaper-voice-uis-vs-conversational-voice-uis.pdf"
/>

_Cover photo by Veri Ivanova on Unsplash_

Voice Assistants have received a lot of attention over the last few years, but Voice UIs as a Feature have delivered all the value and will continue to do so in Mobile, Web, Gaming, and the Metaverse.

The Inevitability of Voice UI Features


Online video chat applications are used by millions of people every day. It would often be handy to go back to what somebody said earlier, yet few video chat applications offer real-time transcription.

In this tutorial we'll learn how to create a WebRTC video chat application that transcribes the users’ speech in real-time using the Speechly Browser Client. We'll also cover how to use a `MediaStream` instead of the users’ microphone as well as how to use our new VAD (Voice Activity Detector) feature for a completely hands-free experience.

This guide uses vanilla JavaScript, HTML and CSS. The same core concepts apply whether you choose to use React, Vue or any other framework.

## Step 1: Creating a WebRTC Video Chat

This guide uses [Muaz Khan’s WebRTC Meeting](https://github.com/muaz-khan/WebRTC-Experiment/tree/master/meeting) example as a starting point. Kudos to him for making this available!

### Set up the project with the following files

```bash
speechly-webrtc/
├── index.html
├── style.css
└── main.js
```

### Create the basic scaffolding

In `index.html` create a basic HTML document and add the following content. It's very basic for now; we'll expand it later when adding more features.

```html
<div class="app">
  <div id="lobby" class="lobby">
    <h2>Start a meeting</h2>
    <button id="new-room">New meeting</button>
  </div>
  <div id="room" class="room">
    <div class="room-container">
      <div class="room-header">
        <h2 id="room-name"></h2>
        <button id="leave-room">Leave</button>
      </div>
      <div id="streams" class="room-streams"></div>
    </div>
  </div>
</div>
```

### Link the libraries

In `index.html` add the following script tags before the closing `</body>` tag.

```html
<script src="https://webrtc.github.io/adapter/adapter-latest.js"></script>
<script src="https://cdn.webrtc-experiment.com/CodecsHandler.js"></script>
<script src="https://cdn.webrtc-experiment.com/IceServersHandler.js"></script>
<script src="https://cdn.webrtc-experiment.com/meeting.js"></script>
<script type="module" src="main.js"></script>
```

### Set up the video chat

In `main.js` start by setting up the meeting object and a few HTML elements:

```js
const meeting = new Meeting();

const roomName = document.getElementById("room-name");
const streamsContainer = document.getElementById("streams");
const newRoomBtn = document.getElementById("new-room");
const leaveRoomBtn = document.getElementById("leave-room");
```

Handle new streams, placing your own video first:

```js
meeting.onaddstream = function(e) {
  if (e.type == "local") streamsContainer.insertBefore(e.video, streamsContainer.firstChild);
  if (e.type == "remote") streamsContainer.appendChild(e.video);
};
```

Handle signaling using a WebSocket:

```js
meeting.openSignalingChannel = function(onmessage) {
  let channel = location.href.replace(/\/|:|#|%|\.|\[|\]/g, "");
  let websocket = new WebSocket("wss://muazkhan.com:9449/");
  websocket.onopen = function() {
    websocket.push(JSON.stringify({
      open: true,
      channel: channel
    }));
  };
  websocket.push = websocket.send;
  websocket.send = function(data) {
    if (websocket.readyState != 1) {
      return setTimeout(function() {
        websocket.send(data);
      }, 300);
    }
    websocket.push(JSON.stringify({
      data: data,
      channel: channel
    }));
  };
  websocket.onmessage = function(e) {
    onmessage(JSON.parse(e.data));
  };
  return websocket;
};
```
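
To see what the `channel` value looks like, here is a small standalone sketch of the same `replace` call. The `channelFromUrl` helper and the example URLs are hypothetical and used only for illustration; the regular expression is the one from the snippet above.

```js
// Hypothetical helper: derives the signaling channel name from a page URL
// using the same regular expression as openSignalingChannel above.
function channelFromUrl(url) {
  // Strips "/", ":", "#", "%", ".", "[" and "]" from the address
  return url.replace(/\/|:|#|%|\.|\[|\]/g, "");
}

// The example URL is an assumption (a typical local development address)
console.log(channelFromUrl("http://localhost:5500/index.html"));
// prints: httplocalhost5500indexhtml
```

Because the channel is derived from the page address, everyone who opens the same URL ends up on the same signaling channel.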

Handle users leaving the room:

```js
meeting.onuserleft = function(userid) {
  let video = document.getElementById(userid);
  if (video) video.parentNode.removeChild(video);
};
```

Bind buttons for creating a new meeting and leaving the meeting:

```js
newRoomBtn.onclick = function() {
  let name = Math.random().toString(36).slice(2, 10);
  meeting.setup(name);
  roomName.textContent = name;
};

leaveRoomBtn.onclick = function() {
  meeting.leave();
  location.reload();
};
```

### Try it out

Start a local development server of your choice (I'm using the [Live Server](https://marketplace.visualstudio.com/items?itemName=ritwickdey.LiveServer) extension for Visual Studio Code) and view it in your browser.

On the newly created page, go ahead and start a new meeting. The browser will ask for permission to use your microphone and camera. Press **Allow** and you should have something like this running.

![video chat 1](/uploads/speechly-webrtc-1.png)

### Showing/hiding of the meeting room and lobby

Now that the basic meeting is working, let's show/hide the lobby and meeting room as needed. Start by defining a few HTML elements along with a helper function.

```js
const meetingLobby = document.getElementById("lobby");
const meetingRoom = document.getElementById("room");

function setupMeetingRoom(roomid) {
  meetingLobby.style.display = "none";
  meetingRoom.style.display = "grid";
  roomName.textContent = roomid;
}
```

Then update the `newRoomBtn.onclick` event to use the new helper function.

```js
newRoomBtn.onclick = function() {
  let name = Math.random().toString(36).slice(2, 10);
  meeting.setup(name);
  setupMeetingRoom(name);
};
```

Then add some basic styles. Remember to import the style sheet by using `<link href="style.css" rel="stylesheet" />`.

```css
.app {
  font-family: sans-serif;
  position: absolute;
  top: 0;
  right: 0;
  bottom: 0;
  left: 0;
}

.lobby {
  display: flex;
  flex-direction: column;
  justify-content: center;
  max-width: 300px;
  height: 100%;
  margin: auto;
}

.room {
  display: none;
  grid-template-columns: 2fr 1fr;
  height: 100%;
}
```

Now the meeting room should be hidden by default and only the lobby should be visible. Once you join a meeting the lobby is hidden and the meeting room is shown.

## Step 2: Adding Speechly

Now that the basic video chat app is working, let's add the [Speechly Browser Client](https://www.npmjs.com/package/@speechly/browser-client) and start transcribing speech.

### Create a new Speechly application

Head over to the [Speechly Dashboard](https://api.speechly.com/dashboard/), log in or create an account, and create a new application using the **Empty** template.

![dashboard new app](/uploads/speechly-webrtc-2.png)

### Deploy the application

Since we're only interested in transcribing speech, there's no need for us to write any SAL configuration. Just hit **Deploy** and copy the **App ID**; you'll need it shortly.

![dashboard deploy](/uploads/speechly-webrtc-3.png)

### Integrate the Speechly Browser Client

In `main.js` import the Speechly Browser Client and create an instance of it.

```js
import { BrowserClient } from "//unpkg.com/@speechly/browser-client?module=true"

const speechly = new BrowserClient({
  appId: "YOUR-APPID-FROM-DASHBOARD",
  debug: true,
  logSegments: true,
  vad: {
    enabled: true,
    noiseGateDb: -24.0
  }
});
```

The VAD is one of our latest features and it's perfect for this case. It takes care of connecting to and disconnecting from the Speechly backend based on the input level of your microphone. It has a [lot more options](https://github.com/speechly/speechly/blob/main/libraries/browser-client/docs/interfaces/client.VadOptions.md) than shown here, but for now you just need to enable it and set a sensible threshold to get started.
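
To build some intuition for the `noiseGateDb` threshold, the sketch below computes the level of an audio buffer in decibels relative to full scale and compares it against a gate. This only illustrates the general idea of a level-based gate; it is not Speechly's actual VAD implementation, and `levelDb` and `isAboveGate` are hypothetical helpers.

```js
// Illustration only: a generic level-based gate, not Speechly's VAD code.
// `samples` is assumed to be an array of floats in the range [-1, 1].
function levelDb(samples) {
  let sumOfSquares = 0;
  for (const s of samples) sumOfSquares += s * s;
  const rms = Math.sqrt(sumOfSquares / samples.length);
  // 20 * log10(rms): 0 dB is full scale, quieter signals are negative
  return 20 * Math.log10(rms);
}

function isAboveGate(samples, noiseGateDb) {
  return levelDb(samples) > noiseGateDb;
}

const speechLike = new Array(480).fill(0.1);     // about -20 dB
const backgroundHum = new Array(480).fill(0.01); // about -40 dB

console.log(isAboveGate(speechLike, -24.0));    // true
console.log(isAboveGate(backgroundHum, -24.0)); // false
```

In a gate like this, a higher (less negative) threshold ignores more background noise but may also miss quiet speakers.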

Then in the `meeting.onaddstream` method, make the function `async` and attach the `MediaStream` to the Speechly Browser Client. At the same time, hide the video controls since they aren't really that useful here.

```js
meeting.onaddstream = async function(e) {
  await speechly.attach(e.stream);
  e.video.controls = false;
  if (e.type == "local") streamsContainer.insertBefore(e.video, streamsContainer.firstChild);
  if (e.type == "remote") streamsContainer.appendChild(e.video);
};
```

Now when you start a meeting and open your console, you should see speech segments being logged!

### Showing the transcript in the UI

First create an element that will display the transcript and place it next to the `room-container` div.

```html
<div id="room" class="room">
  <div class="room-container">
    <div class="room-header">
      <h2 id="room-name"></h2>
      <button id="leave-room">Leave</button>
    </div>
    <div id="streams" class="room-streams"></div>
  </div>
  <div id="transcript" class="room-transcript"></div>
</div>
```

Then use the `onSegmentChange` method to listen to speech segments and show them in the newly created container once the segment is marked as `isFinal`. As the segment contains an array of uppercase words, they first need to be lowercased and then joined into a string.

```js
const transcriptContainer = document.getElementById("transcript");

speechly.onSegmentChange(segment => {
  if (segment && segment.isFinal) {
    let text = segment.words.map(w => w.value.toLowerCase()).join(" ");
    let div = document.createElement("div");
    div.textContent = text;
    transcriptContainer.appendChild(div);
  }
});
```
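
The word-joining step can be pulled into a small helper and tried outside the browser. The mock segment below is an assumption based only on the fields used in this guide (`words`, `value`, `isFinal`); real segments carry more data. `segmentToText` is a hypothetical helper name.

```js
// Hypothetical helper: turns a segment's uppercase words into one lowercase string.
function segmentToText(segment) {
  return segment.words
    .filter(w => w && w.value) // defensively skip missing entries
    .map(w => w.value.toLowerCase())
    .join(" ");
}

// Mock segment, shaped after the fields used in this guide (an assumption)
const mockSegment = {
  isFinal: true,
  words: [{ value: "HELLO" }, { value: "FROM" }, { value: "SPEECHLY" }]
};

console.log(segmentToText(mockSegment)); // prints: hello from speechly
```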

Then add a few styles to separate the segments from each other and make the transcript div scrollable.

```css
.room-container {
  padding: 32px;
}

.room-transcript {
  overflow-y: auto;
  display: flex;
  flex-direction: column;
  gap: 12px;
  background-color: #f5f5f5;
  padding: 32px;
}

.room-transcript div {
  background-color: #e5e5e5;
  padding: 10px 12px;
  line-height: 1.25;
}
```

### Optional: showing tentative segments

The above solution adds a segment to the transcript when the segment is marked as `isFinal`. If you wish to make the transcript even more real-time, you can show the tentative segment and keep on updating that until the segment is marked as `isFinal`.

```js
speechly.onSegmentChange(segment => {
  if (segment) {
    let text = segment.words.map(w => w.value.toLowerCase()).join(" ");
    let id = segment.contextId + "-" + segment.id;

    // Reuse the existing div for this segment, or create one the first time
    let segmentDiv = document.getElementById(id);
    if (!segmentDiv) {
      segmentDiv = document.createElement("div");
      segmentDiv.id = id;
      transcriptContainer.appendChild(segmentDiv);
    }

    // Tentative and final updates both overwrite the same div
    segmentDiv.textContent = text;
  }
});
```
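
The update-or-append logic is easier to reason about without the DOM. The sketch below keeps one entry per segment in a `Map`, keyed the same way as the element `id` above, so tentative updates overwrite the entry and new segments are appended. The `upsertSegment` name and the mock segments are assumptions for illustration.

```js
// DOM-free sketch: one transcript entry per segment, keyed by contextId and id.
// A Map preserves insertion order, so entries stay in arrival order.
const transcript = new Map();

function upsertSegment(segment) {
  const key = segment.contextId + "-" + segment.id;
  const text = segment.words.map(w => w.value.toLowerCase()).join(" ");
  // set() overwrites an existing key in place and appends a new one
  transcript.set(key, text);
}

// Mock segments (assumed shape): a tentative update, its final version,
// then the first update of a new segment
upsertSegment({ contextId: "ctx", id: 0, isFinal: false, words: [{ value: "GOOD" }] });
upsertSegment({ contextId: "ctx", id: 0, isFinal: true, words: [{ value: "GOOD" }, { value: "MORNING" }] });
upsertSegment({ contextId: "ctx", id: 1, isFinal: false, words: [{ value: "EVERYONE" }] });

console.log([...transcript.values()]); // [ 'good morning', 'everyone' ]
```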

### Try it out

By now, you should have a video chat application that transcribes speech in real-time!

![video chat 2](/uploads/speechly-webrtc-4.png)

## Step 3: Multiple Users in the Same Meeting

### Allowing others to join the meeting

First add a button for copying the current meeting id as well as a place for the joining user to input it.

```html
<div class="app">
  <div id="lobby" class="lobby">
    <h2>Start a meeting</h2>
    <button id="new-room">New meeting</button>
    <br/>
    <h2>Join a meeting</h2>
    <div class="lobby-join">
      <input type="text" id="meeting-room-id" placeholder="Meeting ID">
      <button id="join-room">Join</button>
    </div>
  </div>
  <div id="room" class="room">
    <div class="room-container">
      <div class="room-header">
        <h2 id="room-name"></h2>
        <button id="copy-room">Copy</button>
        <button id="leave-room">Leave</button>
      </div>
      <div id="streams" class="room-streams"></div>
    </div>
    <div id="transcript" class="room-transcript"></div>
  </div>
</div>
```

Then add two new pieces: an `onmeeting` handler that is called for each meeting room that is discovered, and a call to `meeting.check()` that starts looking for created meeting rooms. You'll also need to bind the click events of the two new buttons.

```js
const copyRoomBtn = document.getElementById("copy-room");
const joinRoomBtn = document.getElementById("join-room");

meeting.onmeeting = function(room) {
  if (!room) return;

  joinRoomBtn.onclick = function() {
    let id = document.getElementById("meeting-room-id").value;
    if (id !== room.roomid) return;
    meeting.meet(room);
    setupMeetingRoom(room.roomid);
  };
};

meeting.check();

copyRoomBtn.onclick = function() {
  navigator.clipboard.writeText(roomName.textContent);
};
```

When `meeting.onmeeting` reports a room, the Join button is bound to compare the entered ID against that room's `roomid`. If they match, the user joins that meeting.
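
If several meetings can be announced on the same signaling channel, one possible variation is to remember every announced room and look up the entered ID when the Join button is clicked. The sketch below shows only the bookkeeping; `createRoomRegistry` and the mock rooms are hypothetical, and only the `roomid` field is taken from the code above.

```js
// Hypothetical bookkeeping for announced rooms, keyed by their roomid.
function createRoomRegistry() {
  const rooms = new Map();
  return {
    // Call this from meeting.onmeeting for every announced room
    remember(room) {
      if (room && room.roomid) rooms.set(room.roomid, room);
    },
    // Returns the matching room, or undefined when the ID is unknown
    find(id) {
      return rooms.get(id.trim());
    }
  };
}

const registry = createRoomRegistry();
registry.remember({ roomid: "k3j9x2ab" }); // mock rooms for illustration
registry.remember({ roomid: "p0q8w7er" });

console.log(registry.find(" k3j9x2ab ").roomid); // k3j9x2ab
console.log(registry.find("unknown"));           // undefined
```

In the click handler you would then call `meeting.meet(room)` and `setupMeetingRoom(room.roomid)` only when `find` returns a room.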

### Finishing touches

Now that multiple video streams appear in the room, the layout looks a bit off. Let's fix that by adjusting the header and displaying videos in a grid.

```css
.room-header {
  display: flex;
  align-items: center;
  gap: 8px;
  margin-bottom: 32px;
}

.room-header h2 {
  margin: 0;
}

.room-streams {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(40%, 1fr));
  gap: 16px;
}
```

Users might join using their phone, in which case their video will most likely be in portrait mode, which causes a layout issue. Also, you might notice that your own video appears to be mirrored. Luckily there's an easy solution for both cases.

```css
video {
  object-fit: contain;
  background-color: #000;
  width: 100%;
  height: 100%;
  max-height: 70vh;
}

video#self {
  transform: rotateY(180deg);
}
```

The end result should look like this:

![video chat 3](/uploads/speechly-webrtc-5.png)

### Caveat

While it's quite easy to support multiple participants in a meeting, this guide does not try to cover transcribing the speech of multiple participants. In a case where another user joins the meeting, the Speechly Browser Client will use the `MediaStream` of whoever joined the meeting last and display their transcript. If you wish to compile a transcript from all parties, the best way to go about this is to:

1. Capture the local `MediaStream`
1. Broadcast the segments using websocket to all parties
1. Order the segments and compile a transcript on the client side
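
The last step, ordering and compiling, could look roughly like the sketch below. It assumes each broadcast message carries a speaker name, a timestamp, and the final text; these field names and the `compileTranscript` helper are assumptions for illustration, not part of the Speechly API or the signaling protocol used above.

```js
// Hypothetical message shape: { speaker, timestamp, text }
function compileTranscript(messages) {
  return [...messages] // copy so the caller's array is not reordered
    .sort((a, b) => a.timestamp - b.timestamp)
    .map(m => `${m.speaker}: ${m.text}`)
    .join("\n");
}

// Messages may arrive out of order over the network
const received = [
  { speaker: "Bob", timestamp: 2000, text: "hi alice" },
  { speaker: "Alice", timestamp: 1000, text: "hello everyone" },
  { speaker: "Alice", timestamp: 3000, text: "shall we start" }
];

console.log(compileTranscript(received));
// Alice: hello everyone
// Bob: hi alice
// Alice: shall we start
```

Timestamps taken on different machines are only roughly comparable, so a real implementation may want a shared clock or server-assigned ordering.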

## Conclusion

You can find [the final project here](https://github.com/speechly/webrtc-speechly).

In this tutorial, you learned how to build a WebRTC video chat that uses Speechly to provide real-time transcription. You also learned how to use a `MediaStream` as an audio source and how to use our new VAD feature. For the next step, consider deploying the application ([Netlify](https://www.netlify.com/) and [Vercel](https://vercel.com/) are both great options) and trying it out with your friends.

We hope you enjoyed this tutorial. Feel free to reach out to us on our [Github Discussions](https://github.com/speechly/speechly/discussions) page or via Intercom if you have any questions. Thank you!


Learn how to build a WebRTC video chat application that uses the Speechly Browser Client to transcribe audio from a MediaStream.

Create a WebRTC Video Chat App With Speechly Transcription


Voice can be the most efficient UI for Web and Mobile, but the back and forth conversational model needs to be ditched. The conversational model embodied by Voice Assistants does not take full advantage of the screens that are native to web and mobile experiences. When you lean into the screen while building a Voice-enabled experience, you can build fast and easy to use UIs that blend Voice and Touch/Type.

## Leveraging Screens with Voice UIs

Web and mobile User Interfaces (UIs) have a screen and that affordance should be used to maximum effect. Voice input is very efficient, but an audio response like you find with Voice Assistants is not. The screen is more efficient for the output because it can react immediately to inputs and can convey far more information without overwhelming the user. To get a sense of how much faster a Voice UI as a Feature can be compared to the Voice Assistant model, consider the use case of placing an order with Best Buy.

## Online Shopping Use Case: Best Buy

![Online shopping use case](/uploads/bestbuy-speechly-usecase.png)

The graphic above shows four Best Buy purchase scenarios: using Speechly to enable Voice as a Feature, the Mobile App, the Website, and the Alexa app. With twelve words spoken, Speechly can return a highly specific response in just nine seconds. It involves only a voice command from the user and an immediate visual response on the screen. By contrast, the same request took 2 minutes and 4 seconds as an Alexa skill and included 19 turns in the conversation and five errors (see video below). This shows in practice the speed that Voice UIs as a Feature can unlock compared to Voice Assistants, which is the main value proposition of Voice-enabled experiences.

<YouTube videoId="XJ4BnEIiAjo" />

But is a Voice UI as a Feature better than an existing touch or mouse click interface? Even these highly streamlined Graphical User Interface (GUI) experiences can’t compete with a full-duplex Voice UI. Tapping in the Best Buy mobile app took 33 seconds to complete the same purchase task and a mouse-driven GUI on the web took 37 seconds. That is roughly four times slower than the nine seconds of direct voice input.
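
The ratios follow directly from the timings quoted in this post. A quick check, using only those numbers (9 seconds by voice, 33 seconds in the mobile app, 37 seconds on the web, and 2:04 with the Alexa skill):

```js
// Timings quoted in this post, in seconds
const voiceFeature = 9;
const timings = { mobileApp: 33, website: 37, alexaSkill: 2 * 60 + 4 };

for (const [name, seconds] of Object.entries(timings)) {
  // How many times longer each flow takes than the voice feature
  console.log(name, (seconds / voiceFeature).toFixed(1) + "x");
}
// mobileApp 3.7x
// website 4.1x
// alexaSkill 13.8x
```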

## Value from Getting Voice UIs Right

When users can consistently get what they want by simply uttering a few words, they are likely to reward the company or brand with higher sales, more loyalty, and higher customer satisfaction. Also, by helping customers quickly get where they want to go, you see fewer abandoned carts and page bounces, resulting in higher conversion. The Speechly approach with Voice UIs as a Feature helps unlock and deliver those benefits consistently.

And using Voice to Filter through products is only one use case of a Voice UI. Other great features that extend beyond E-commerce include Voice Search, Voice Form Filling, Voice Input, and Voice Command & Control, which can all be demoed at [Speechly.com/demos](https://demos.speechly.com/fashion/).

If you would like to learn more about the Speechly outlook on Voice UIs as a Feature vs Voice Assistants, download our full white paper on “Voice UIs as a Feature vs Conversational Voice UIs”.

<WhitePaperBanner
  title="Voice&nbsp;UIs as a Feature vs Conversational Voice&nbsp;UIs"
  description="Learn how Voice UI features are outperforming Voice Assistants."
  filePath="/uploads/speechly-whitepaper-voice-uis-vs-conversational-voice-uis.pdf"
/>

_Cover photo by Florian Steciuk on Unsplash_

Abandoning the Voice Assistant model for a Voice UI as a Feature results in the most efficient UI since the Touchscreen.

The Fastest UI for the Web and Mobile


We have seen the User Interface (UI) for computers evolve rapidly since [IBM punched cards](https://www.ibm.com/ibm/history/ibm100/us/en/icons/punchcard/) became the dominant input/output medium for computers back in the 1960s. Fast forward to the 2020s and we see a world dominated by two UIs: point-and-click alongside typing for the computer and touch-and-swipe on the smartphone. While Voice Assistants have made an attempt to become the next UI in the evolution, this post will show how the next evolution is more likely to be Voice as a UI feature in computer and smartphone experiences.

## Punch Cards, Graphical User Interfaces and Touchscreens

User interfaces have evolved very logically over time. Punched Cards gave way to switches which in turn handed off to the first real open-ended input mechanism, typing. This was the dominant paradigm for nearly 20 years before the invention of the Graphical User Interface (GUI) and the mouse. However, it was a decade before that interface became common and nearly twenty years before the GUI clearly took over computing interfaces. That was nearly a 40-year run.

![Evolution on UIs vs Computing](/uploads/evolution-uis-computing.png)

Touch interaction became the next big transition. It was first demonstrated in the 1960s, but it would also be 40 years before it found a true home in the smartphone. That was over a decade ago, in 2007 with the iPhone, and since that time there have been two parallel dominant UIs: point-and-click alongside typing for the computer and touch-and-swipe on the smartphone.

## Voice UI As The Next Evolution?

It seems obvious that voice technology will usher in the next important UI revolution. Thirty years ago Nuance debuted the Dragon dictation system. This type of technology eventually wound up in enterprise call centers and even automobiles within a decade. However, it still is not common for computers in either the consumer or business sectors. This is despite the fact that [speech is 3-to-5 times faster than typing](https://hci.stanford.edu/research/speech/paper/speech_paper.pdf) and it enables users to efficiently get what they want no matter what buttons or menus are available.

The real breakthrough came in 2011 with the introduction of Siri on the Apple iPhone. This new Voice Assistant category was seen as the next UI evolution. Siri and its competitors from the likes of Google and Samsung promised a human-like interaction when handling user requests. That proved to be a promise Big Tech could not keep. Complaints about how the Voice Assistants didn’t work created a stigma around the solutions that took many years to shed. The Voice Assistant providers found themselves in the “Habitability Gap”, a term coined by Roger K. Moore. The “Habitability Gap” describes a scenario where the closer a solution gets to humanlike conversational ability, the more usable it is, until it reaches a point where the assistant can no longer meet user requirements and interactions fall apart.

## Voice UIs as a Feature

Responding to consumer criticism, the leading smartphone companies focused their attention on Voice-enabling command-and-control features such as initiating a phone call, setting a calendar appointment, and asking for directions. These narrowly defined use cases proved far easier to execute consistently and began rebuilding consumer confidence in the interface. This is notable. In order to succeed, the technology had to revert backward along the flexibility continuum. The technology had improved a great deal, but was not ready to cross the gap.

Amazon’s introduction of Alexa in 2014 and Google’s alternative two years later muddied the waters further. Alexa was introduced to support an entirely new device without a screen. It needed to be more capable and conversational because there was no screen to fall back on when the user became stuck. Google Assistant followed and also decided to employ the same UI for smart speakers and Android-based smartphones.

This furthered the rise of half-duplex systems where only one participant in a conversation, the human or the machine, can act at a time, while the other waits. Despite bold promises of humanlike conversational experiences and encouragement of third parties to build these experiences, consumers use the features that provide the most value consistently. These are simple request-and-response interactions on smart speakers such as requesting music from a streaming service or radio station, asking simple questions, and setting timers.

![Smart Speaker Use Case Frequency January 2021](/uploads/smart-speaker-usecase-jan21.png)

This trend was also evident on smartphones where Alexa, Bixby, Google Assistant, and Siri jockeyed to be the favored Voice Assistant. The top Voice Assistant use cases to emerge on smartphones are asking questions, placing a phone call, sending a text, getting directions, and setting timers and alarms. Lofty Voice Assistant expectations that often land in the “Habitability Gap” have seen far less use than popular request-and-response features that live in the space just before the habitability cliff.

## Solving User Problems With Voice

Even though consumers were clearly showing the technology providers what they wanted, the Voice Assistant stack was built to support far greater flexibility than was required. The Voice Assistants were over-engineered for the tasks consumers wanted to employ. It is no wonder that many website, web app, and mobile developers looked at Voice Assistants as overly complex and inadvertently applied that sentiment to the viability of all Voice UI Features.

The logical evolution from click and touch is to a much simpler Voice UI solution that helps users actually find the information they need and complete their intended tasks more efficiently, rather than making Voice into a new channel or platform. The ability to support natural language input and accurately identify user intent was the critical innovation. Multi-turn conversations turned out to be superfluous.

If you would like to learn more about the Speechly outlook on Voice UIs as a Feature vs a Channel, download our full white paper on “Voice UIs as a Feature vs Conversational Voice UIs”.

<WhitePaperBanner
  title="Voice&nbsp;UIs as a Feature vs Conversational Voice&nbsp;UIs"
  description="Learn how Voice UI features are outperforming Voice Assistants."
  filePath="/uploads/speechly-whitepaper-voice-uis-vs-conversational-voice-uis.pdf"
/>

_Cover photo by Eugene Zhyvchik on Unsplash_

From Punched Cards to Touch Screens, User Interfaces have evolved significantly. Will Voice be next in this evolution?

Evolution of UIs


We have seen the adoption of Voice Assistants start to flatten in 2022, with [adoption around 50-60%](https://voicebot.ai/2022/04/15/voice-assistant-adoption-clustering-around-50-of-the-population/) of the US population for smart speakers, smartphones, and in the car. While Voice Assistants face various challenges to continued growth, the main inhibitor lies in the User Experience itself. Users do not want to have “Conversations” with their technology; they want to get tasks done. This limitation is the core emphasis of this post.

## Voice Assistant Challenges

Turn-based conversations favored by the popular voice assistants like Amazon Alexa, Google Assistant, or Apple Siri can sometimes be more convenient than manual input methods, but they are often too slow and tedious. The assistant must wait for you to finish speaking before it can determine your request and then defaults to a spoken response. You then must wait for the assistant to stop talking before moving on to another step.

This makes for a terrible User Experience when you recognize that the assistant has misunderstood the intent (what the user is trying to accomplish with the Voice Command), but still must wait until it finishes speaking before making the correction. And barging into the "conversation" while the voice assistant is speaking to clarify the original request is rarely successful. Instead, you typically wind up starting over with your initial command, albeit with a slight tweak in how you say the command.
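The turn-taking behavior described above can be pictured as a tiny state machine. This is an illustrative model only, not how any particular assistant is implemented; the `TurnBasedAssistant` class and its state names are hypothetical.

```js
// Illustrative model of a turn-based voice assistant (one speaker at a time).
// All names here are hypothetical; no real assistant exposes this API.
class TurnBasedAssistant {
  constructor() {
    this.state = 'listening'; // either 'listening' or 'speaking'
    this.log = [];
  }

  // The user says something. Speech is only accepted while listening.
  userSpeaks(utterance) {
    if (this.state === 'speaking') {
      // Barge-in: the assistant is still talking, so the input is dropped.
      this.log.push(`ignored: ${utterance}`);
      return false;
    }
    this.log.push(`heard: ${utterance}`);
    this.state = 'speaking'; // the assistant now takes its turn
    return true;
  }

  // The assistant finishes its spoken response and hands the turn back.
  responseFinished() {
    this.state = 'listening';
  }
}

const assistant = new TurnBasedAssistant();
assistant.userSpeaks('play jazz');             // accepted
assistant.userSpeaks('no, play classic jazz'); // dropped: assistant is speaking
assistant.responseFinished();
assistant.userSpeaks('play classic jazz');     // accepted on the next turn
console.log(assistant.log);
```

The correction only gets through once the assistant has finished its turn, which is the delay the paragraph above describes.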

So it is no big surprise that, on both smart speakers and smartphones, the most popular use cases for voice assistants are predominantly user-directed requests. Smart speaker users ask questions, ask to play music, ask for the weather, or ask to set a timer or alarm. Smartphone voice assistant users ask to initiate a call, or for directions, or music. Many pose questions expecting a straightforward response or answer. These are all single-turn interactions, meaning there is no back-and-forth conversation required or desired. A simple request is spoken and the device completes the task. That's it.

## What does this tell us?

We rarely need to _Converse_ with our digital devices through multiple turns to complete our task of interest. The real focus should be on _Spoken Commands_. When pairing these Spoken Commands with a website or mobile app, think of “Voice as a Feature” vs an Assistant and you are likely to deliver a better user experience.

This insight is why at Speechly we developed the model of building Voice UIs as a Feature that blends alongside existing ways of interacting with a screen such as typing, tapping, or swiping.
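One way to picture "Voice as a Feature" is a single-turn dispatcher that maps a recognized intent straight onto an action the UI already supports. The sketch below is hypothetical: the intent names, the shape of the `command` object, and the `handlers` table are assumptions made for illustration and are not part of any Speechly API.

```js
// Hypothetical single-turn command dispatcher: one spoken command in,
// one UI action out, with no follow-up dialogue.
const handlers = new Map([
  // Each handler receives the entities extracted from the utterance.
  ['filter_products', (entities) => `filter:${entities.color ?? 'any'}`],
  ['add_to_cart', (entities) => `cart:${entities.product ?? 'unknown'}`],
]);

function handleCommand(command) {
  const handler = handlers.get(command.intent);
  if (!handler) {
    // Unknown intents do nothing, so the user can keep tapping or typing.
    return 'noop';
  }
  return handler(command.entities ?? {});
}

console.log(handleCommand({ intent: 'filter_products', entities: { color: 'blue' } }));
console.log(handleCommand({ intent: 'book_flight', entities: {} }));
```

Because an unrecognized command simply does nothing, the existing touch and keyboard interactions remain the fallback rather than a clarifying dialogue.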

## Aligning Limitations with Expectations

Academic research backs up this observation. Many people are familiar with the uncanny valley. The uncanny valley describes the relation between how humanlike an artifact is and someone's emotional response to it. The emotional response increases as human likeness increases up to a point where it suddenly plummets.

![Uncanny Valley](/uploads/uncanny-valley.png)

Roger K. Moore of the University of Sheffield's Speech and Hearing Research Group cited materials from Mike Phillips during a 2006 IEEE workshop that found a similar pattern with conversational capability. This model compared the usability of a voice interactive solution with its flexibility in conversational dialogue. The closer the solutions get to humanlike conversational ability, the more usable they are until a certain point where the assistants cannot meet the user requirements. He calls this the “Habitability Gap”. Moore commented:

> "There appears to be a non-linear relationship between flexibility and usability... As flexibility increases with advancing technology, so usability increases until users no longer know what they can and cannot say, at which point usability tumbles and interaction falls apart."

![Habitability Gap](/uploads/habitability-gap.png)

## Voice UIs as a Feature vs Conversational Voice UIs

There are limitations with the Voice Assistant model. However, there are tangible opportunities for Voice as a Feature in our Web and Mobile applications. If these opportunities are of interest to you, consider checking out our full white paper on “Voice UIs as a Feature vs Conversational Voice UIs”.

<WhitePaperBanner
  title="Voice&nbsp;UIs as a Feature vs Conversational Voice&nbsp;UIs"
  description="Learn how Voice UI features are outperforming Voice Assistants."
  filePath="/uploads/speechly-whitepaper-voice-uis-vs-conversational-voice-uis.pdf"
/>

_Cover photo by Sigmund on Unsplash_

Voice Assistants ushered in a wave of excitement around Voice Technology, but years later the limits of Voice Assistants are clear.

Limits of the Voice Assistant Model


## What's new in Speechly Browser Client v2.0

The main new feature with the Speechly Browser Client v2.0 is the capability to flexibly choose an audio source for the client via the [Media Capture and Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Media_Streams_API). This is a significant evolution to the client and it's also a breaking change.

Previously the only way to provide the client with audio was to use the device’s default microphone. It was challenging to control which microphone was used if there were multiple microphones available. When initializing the Speechly Browser Client, the client behind the scenes silently chose the first audio capture device it found.

Moreover, if you already had audio available, for example as a live `MediaStream` or in an audio file, and thus had no need for a microphone, you had to resort to elaborate workarounds.

The Speechly Browser Client v2.0 fixes these issues.

Using the default microphone is still straightforward. It exists in a separate `BrowserMicrophone` class that you can initialize and attach to the client, and everything works as before. However, a nice bonus of separating the default microphone implementation from the client is that you ask for the microphone permission only when the microphone is initialized, instead of when the client is created!

Furthermore, you can now attach any existing `MediaStream` object from which the audio will be read. This makes it easy to integrate Speechly into, for example, WebRTC applications that expose incoming and outgoing audio as `MediaStream`s.
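As a minimal sketch of that idea, the helper below attaches an existing `MediaStream` to the client and starts it. `attachStream` is a hypothetical helper name, not part of the library; `client.attach` and `client.start` are the calls shown in the upgrade section of this post, and `getAudioTracks()` is part of the standard `MediaStream` interface.

```js
// Hypothetical helper: feed an already-existing MediaStream to the client.
// `client` is expected to be a BrowserClient and `stream` a MediaStream.
async function attachStream(client, stream) {
  // Guard against video-only streams, which carry no audio to process.
  if (stream.getAudioTracks().length === 0) {
    throw new Error('MediaStream has no audio tracks');
  }
  await client.attach(stream);
  await client.start();
}
```

In a WebRTC application you might call `attachStream(client, event.streams[0])` from an `RTCPeerConnection` `track` event handler, though where the stream comes from depends on your application.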

Finally, to make things easier when dealing with audio files, we've added an `uploadAudioData` function which decodes the given audio data and uploads it to the API. This currently works with popular file types such as WAV, MP3, M4A and others.

Check out [Speechly Browser Client v2.0 on NPM](https://www.npmjs.com/package/@speechly/browser-client).

## How to upgrade to Speechly Browser Client v2.0

### Install the package

```bash
# Using Yarn
yarn add @speechly/browser-client

# Using NPM
npm install --save @speechly/browser-client
```

### Updates to microphone

Speechly Browser Client v2.0 extracts the microphone to a separate class and as a result the initialization looks a bit different. Also note that `startContext` and `stopContext` have been renamed to `start` and `stop`.

```js
// Before
import { Client } from '@speechly/browser-client';

const client = new Client({ appId: 'your-app-id' });
await client.initialize();

client.onSegmentChange((segment) => {
  console.log(
    'Received new segment from the API:',
    segment.intent,
    segment.entities,
    segment.words,
    segment.isFinal,
  );
});

await client.startContext();
setTimeout(async function () {
  await client.stopContext();
}, 3000);
```

```js
// After
import { BrowserClient, BrowserMicrophone } from '@speechly/browser-client';

const client = new BrowserClient({ appId: 'your-app-id' });
const microphone = new BrowserMicrophone();
await microphone.initialize(); // must be called from a user-triggered event!
await client.attach(microphone.mediaStream);

client.onSegmentChange((segment) => {
  console.log(
    'Received new segment from the API:',
    segment.intent,
    segment.entities,
    segment.words,
    segment.isFinal,
  );
});

await client.start();
setTimeout(async function () {
  await client.stop();
}, 3000);
```
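The comment in the snippet above notes that `microphone.initialize()` has to run inside a user-triggered event. One way to arrange that, sketched here on the assumption that `button` is a DOM element (or anything else with `addEventListener`), is to do the setup in a click handler. `enableVoiceOnClick` is a hypothetical helper, not part of the library.

```js
// Hypothetical helper: defer microphone setup until the user clicks, so the
// setup happens in response to a user action instead of on page load.
function enableVoiceOnClick(button, client, microphone) {
  let attached = false;
  button.addEventListener('click', async () => {
    if (attached) return; // the microphone is already set up
    attached = true;
    try {
      await microphone.initialize();
      await client.attach(microphone.mediaStream);
    } catch (err) {
      attached = false; // allow another attempt on the next click
      console.error('Microphone setup failed:', err);
    }
  });
}
```

After the first successful click you can call `client.start()` and `client.stop()` as in the example above.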

### Usage with audio files

You can now use the new `uploadAudioData` function to send the contents of an audio file, as an `ArrayBuffer`, directly to the client without using the microphone.

```js
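import { BrowserClient } from '@speechly/browser-client';

const client = new BrowserClient({ appId: 'your-app-id' });
// Fetch the audio file and read its raw bytes as an ArrayBuffer
const response = await fetch('url-to-audio-file');
const buffer = await response.arrayBuffer();
// The client decodes the audio data and uploads it to the API
await client.uploadAudioData(buffer);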
```

For more details, check out our [GitHub repository](https://github.com/speechly/speechly/tree/main/libraries/browser-client). Happy developing!


The Speechly Browser Client v2.0 is now available on NPM. In this post we’ll cover the major changes as well as how to upgrade to the new version.

Speechly Browser Client v2.0 Released


The world of AR/VR, Gaming, and Metaverse experiences continues to grow, so we are excited to release our Speechly Client Library for Unity. This release is the result of growing demand for being able to easily add a Real-Time Voice UI as a Feature to these experiences to make them more interactive and easier to navigate.

## How it Works

This client library streams audio from a Unity or .NET app to the Speechly cloud Voice API and provides a C# API for receiving real-time Speech-to-Text (STT) transcription and Natural Language Understanding (NLU) results. This makes it easy to enable users to Command and Control the environment around them using their Voice.

Check out the [Speechly Client for Unity](https://docs.speechly.com/client-libraries/usage/?platform=Unity) in the Speechly Documentation to start developing your voice-enabled gaming or metaverse experience today. For more technical details, check out our [GitHub repository](https://github.com/speechly/speechly-unity-dotnet).

## See it in Action

<YouTube videoId="1gfrjOl7cgs" />

## What is Speechly?

Speechly is a Voice API that makes it easy to build voice features into games, AR/VR/XR experiences, mobile apps and websites. If you would like to learn more about Speechly or get access to our API, you can [sign up](https://api.speechly.com/dashboard/#/signup) for a free account or reach out to us [here](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/client-library-for-unity-to-easily-add-voice-command-control).


With the Speechly Client Library for Unity, developers can now easily add Voice UIs as a Feature to their experiences built with Unity.

Client Library for Unity to Easily Add Voice Command & Control


When the Voice Tech industry discusses use cases for Voice Technology, the primary focus is on using your voice to control some sort of device. However, this focus neglects one essential question: is this where the best use cases for Voice Technology exist today? Facing that question honestly reveals many use cases beyond simply controlling the devices around us. These can be found in contexts where conversations are taking place, such as video conferencing, online multiplayer game voice chats, and customer service call centers, to name a few, and they are not discussed nearly as much. Why is this the case, and why should the Voice Tech industry be paying closer attention to Human to Human Interaction (HHI) scenarios? I will answer both of those often overlooked questions and provide use cases I believe the Voice Tech industry should be paying more attention to.

## Why the Voice Tech industry has been primarily focused on using Voice to control devices

When you think of Voice Technology, the first thing that comes into your mind is probably one of the popular Big Tech Voice Assistants, such as Apple's Siri, Amazon's Alexa, or Google Assistant. This is not a surprise. If you look at the world of Voice Technology prior to the introduction of Siri and later Alexa, most Voice Tech conversations were taking place in the world of R&D labs vs in the real world. In fact, it wasn't until 2018 that [Microsoft reached Human Parity](https://www.microsoft.com/en-us/research/uploads/prod/2018/04/5ad57de362ee0-5ad57de362f1dHuman-Parity-with-CNTK-@-GTC-Mar-2018-v2.pdf.pdf) with Automatic Speech Recognition (ASR) and the following year that Microsoft also reached [Human Parity with Natural Language Processing](https://www.microsoft.com/en-us/research/blog/machine-reading-systems-are-becoming-more-conversational/) (NLP).

As Voice Technology started to show more promise back in the mid-2010s, you also saw massive investment from Big Tech companies into their respective Voice Assistant platforms. For example, as of 2018 Amazon had over [10,000 people](https://voicebot.ai/2018/11/15/amazon-alexa-headcount-surpasses-10000-employees-here-is-the-growth-rate/) working on Alexa alone and also stood up a $200M Venture Fund to spur investment into companies building supporting services and applications for the platform. Amazon is arguably the Big Tech company most willing to start new projects and shut them down when they fail, as it did with the Fire Phone, yet it and its peers have remained persistent in investing in their Voice Assistant platforms. This is despite a [stall of adoption](https://voicebot.ai/2022/03/02/the-rise-and-stall-of-the-u-s-smart-speaker-market-new-report/) in the Voice Assistant market.

When you look specifically at the sheer investment into Voice Assistant experiences by the biggest technology companies in the world, it's easy to understand how they control the majority of mindshare on "the best" use cases for Voice Technology. The question then arises: has this caused the wider Voice Technology industry to overlook great use cases for the underlying Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) technologies that help power Voice Assistants? I believe it has, particularly within the context of HHI scenarios.

## Why should we pay attention to where Voice Technology can add value in Human to Human Interactions?

Humans have been interacting and transacting online since the early days of the internet. However, with the recent COVID-19 pandemic, we saw a massive acceleration in digital transformation across all industries. Satya Nadella, the CEO of Microsoft, [said](https://www.microsoft.com/en-us/microsoft-365/blog/2020/04/30/2-years-digital-transformation-2-months/), "We've seen two years' worth of digital transformation in two months. From remote teamwork and learning, to sales and customer service, to critical cloud infrastructure and security…". Alongside the explosion of remote work and services like video conferencing, we also saw big changes on the consumer side of things with the rise of a new space called [Social Audio](https://www.highfidelity.com/blog/most-popular-social-audio-apps), paved initially by Clubhouse and quickly followed by Big Tech companies like Twitter, Facebook and Spotify.

A key detail with the digital transformation scenarios previously mentioned is that they have a common and relevant attribute: Human to Human communication taking place in a digital world. With this also comes a massive amount of audio data that presents both challenges and opportunities for the developers of these experiences.

## Challenges and Use Cases for Voice Tech in an Audio Rich World

### On Device Transcription

First, I want to start with a use case that has been a mainstay for ASR technology but has seen some unique challenges as the world has gone more remote: transcription. Traditionally, ASR technology has been run in the cloud for transcription. However, as the usage of experiences that leverage ASR has continued to grow, so have the costs associated with these experiences, making running in the cloud less attractive or viable from a business perspective. One solution to this cost problem is running ASR on-device instead of in the cloud. While that is easy to write down in a strategy deck, the technical challenges of running ASR on-device are less straightforward. Solving this challenge, though, comes with an additional benefit beyond cost alone: faster transcription.

### Moderation

The next challenge I would like to address is moderation. Whether looking at new domains where conversations are taking place, such as Social Audio or Virtual Reality, or more established domains like Social Media or Online Gaming, moderation has always been a hot topic. Even Microsoft has had to [shut down](https://www.cnet.com/tech/computing/microsoft-shutters-part-of-its-social-metaverse-for-safety-reasons/) part of their "Social Metaverse" due to an inability to properly moderate the conversations that were taking place.

This is a problem that cannot be solved simply by adding more human moderation, for two distinct reasons. First, humans come with bias, and no matter what you do to reduce it, history has proven it's nearly impossible to eliminate bias from moderation decisions. The second is the sheer scale of audio data that needs to be monitored. Although apps like Clubhouse have seen a decline in downloads over the last few months, they have created a new category altogether, one that has captured the attention of Big Tech as at least a feature of their platforms, and Clubhouse saw the number of rooms created daily grow to [700k](https://influencermarketinghub.com/clubhouse-stats/#toc-19). Real-time ASR and NLU technologies are a great tool for augmenting human moderation, specifically for calling out things like hate speech, harassment, and profanity in audio data.
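To make the idea of augmenting human moderators a little more concrete, here is a deliberately simplified sketch. It uses plain keyword matching on transcript words as a stand-in for real ASR and NLU models, and everything in it (the `flagSegment` function, the segment shape, the placeholder term list) is hypothetical.

```js
// Simplified stand-in for automated moderation: flag transcript segments that
// contain terms from a blocklist so a human moderator can review them.
// A real system would rely on ASR/NLU models rather than keyword matching.
const blockedTerms = new Set(['badword', 'slur']); // placeholder terms

function flagSegment(segment) {
  const hits = segment.words
    // Normalize case and strip punctuation before matching.
    .map((word) => word.toLowerCase().replace(/[^a-z']/g, ''))
    .filter((word) => blockedTerms.has(word));
  return { speaker: segment.speaker, flagged: hits.length > 0, hits };
}

const transcript = [
  { speaker: 'user-1', words: ['Nice', 'shot!'] },
  { speaker: 'user-2', words: ['You', 'BADWORD,', 'leave'] },
];

// Only flagged segments are queued for human review.
const reviewQueue = transcript.map(flagSegment).filter((result) => result.flagged);
console.log(reviewQueue);
```

The point of the design is that software narrows the stream down to the segments worth a look, while the final decision stays with a person.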

### Assistance

The final challenge I would like to discuss can be generally described as Assistance. Specifically, I am referring to Customer Support and Sales scenarios. These domains have also seen a rapid rise in the amount of conversational data that is captured as more of our professional work and everyday consumer habits have continued to shift into a more digital and remote world.

With both Customer Support and Sales, voice technology has been a tool for post-call analysis to better train agents on how to improve performance in future calls. However, advancements in voice technology have enabled us to take the next step and make these insights happen in Real-time as opposed to after a conversation has already ended. The benefit here is clear: agents can now learn during their calls while also having an Assistance tool to better serve customers and prospects.

## Now what?

Should the Voice Technology industry shift its focus to ignore using voice to control the devices around us? No! Rather we should constantly challenge ourselves to think outside of the box on what the best use cases are for modern day Voice Technologies. The best use cases might be closer than you may think.

Did any new use cases come to mind for Voice Tech while you were reading this? Let us know your ideas or general thoughts from this post on Twitter [@SpeechlyAPI](https://twitter.com/SpeechlyAPI).

_Cover photo by Cam Adams on Unsplash_

Modern Voice APIs and Artificial Intelligence (AI) have created new ways for voice technology to enhance user and employee experiences that go beyond device control alone.

Overlooking Great Voice Technology Use Cases


COVID-19 has resulted in a massive shift in user behavior. Primarily, our lives are becoming more digital in both everyday consumer life as well as in the enterprise as the world was forced to adopt “Remote Work”. This reality has led to a massive increase in the amount of audio data the world is generating on a daily basis at both work and home. Here is a list of 30 data points to show the impact of COVID-19 on remote work and to put into perspective just how much audio data is generated by the world.

## COVID-19 impact on Remote Work & Time Spent Online

### Remote Work

##### One of the most noticeable impacts of the COVID-19 pandemic was the quick shift to remote work.

* 20% to 25% of the workforces in advanced economies could work from home between three and five days a week. This represents four to five times more remote work than before the pandemic. - [McKinsey](https://www.mckinsey.com/featured-insights/future-of-work/the-future-of-work-after-covid-19)
* Employers started adjusting their workplaces to fit a new hybrid working model as nearly 70% of full-time U.S. workers have worked remotely, with many continuing to do so. - [Owl Labs](https://owllabs.com/state-of-remote-work/2021/)
* 90% of respondents that worked from home during the pandemic said they were as productive -- or more -- when compared to the office. - [Owl Labs](https://owllabs.com/state-of-remote-work/2021/)

### Time Spent Online

##### Another noticeable impact of the pandemic was an increase in time spent online and the importance of the internet.

* 90% of adults say the internet has been essential or important to them personally during the pandemic. - [Pew Research](https://www.pewresearch.org/internet/2021/09/01/the-internet-and-the-pandemic/)
* 40% of adults have used the internet in a new way during the pandemic. - [Pew Research](https://www.pewresearch.org/internet/2021/09/01/the-internet-and-the-pandemic/)
* 72% of parents say their kids are spending more time in front of screens than prior to the pandemic. - [Pew Research](https://www.pewresearch.org/internet/2021/09/01/the-internet-and-the-pandemic/)

## Showing the Rise in Audio Data

### Social Audio and Call Center Data

##### Audio data has risen across consumer and enterprise use cases through new categories such as Social Audio and existing channels like Call Centers.

* Clubhouse surpassed 10 million registered users as of February 13, 2021. - [Voicebot](https://voicebot.ai/2021/02/23/clubhouse-surpasses-10-million-users-after-musk-zuckerberg-rogan-and-mrbeast-join-and-starts-drawing-more-scrutiny/)
* The rapid growth of Social Audio on Clubhouse caught the attention of Big Tech companies like Twitter, Facebook, Spotify, Reddit, Amazon, and LinkedIn, who have all added Social Audio as a feature of their own platforms. - [High Fidelity](https://www.highfidelity.com/blog/most-popular-social-audio-apps), [Social Media Today](https://www.socialmediatoday.com/news/linkedin-launches-test-of-audio-rooms-announces-new-formats-for-live-event/616776/), [Voicebot](https://voicebot.ai/2022/03/08/amazons-new-amp-app-spins-a-social-audio-radio-dj-mash-up-with-amazon-music-catalog-as-the-soundtrack/), [Reddit](https://www.reddit.com/talk)
* During the pandemic, Discord users rose from 56M in 2019 to 140M in 2021, with peak concurrent users hitting 10.6M in 2020. - [Business of Apps](https://www.businessofapps.com/data/discord-statistics/)
* Discord users spend 4 billion minutes in conversation daily. - [The Verge](https://www.theverge.com/2020/6/30/21308194/discord-gaming-users-safety-center-video-voice-chat)
* Overall call volume to contact centers jumped over 600% from normal levels, while agent call capacity dropped by 20% during COVID-19. - [Deloitte](https://www2.deloitte.com/il/en/pages/innovation/Solutions/Innovation_Challenges_Challenge_1.html)
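To put one of these figures into perspective, the 4 billion daily minutes reported for Discord can be converted into hours and years of continuous audio. The sketch below is only unit conversion on the cited number; the `audioVolume` function name is made up for illustration.

```js
// Convert a daily volume of conversation minutes into larger units.
function audioVolume(minutesPerDay) {
  const hours = minutesPerDay / 60;
  const days = hours / 24;
  const years = days / 365;
  return { hours, days, years };
}

const discord = audioVolume(4e9); // 4 billion minutes per day
console.log(Math.round(discord.hours)); // roughly 66.7 million hours
console.log(Math.round(discord.years)); // roughly 7,610 years
```

In other words, on that figure a single platform generates several thousand years' worth of conversation every day.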

### Video Conferencing and Social Video Data

##### Video usage has also surged in new and traditional channels such as videos on Social Media or Video Conferencing - all of which come with a multitude of Audio data.

* Zoom had over 300 million meeting participants per day in 2020. - [GetVoIP](https://getvoip.com/blog/2020/07/07/video-conferencing-stats/#:~:text=62%25%20of%20companies%20use%20three,active%20daily%20users%20in%202021.)
* Google Meet had over 100 million daily meeting participants in 2020. - [GetVoIP](https://getvoip.com/blog/2020/07/07/video-conferencing-stats/#:~:text=62%25%20of%20companies%20use%20three,active%20daily%20users%20in%202021.)
* Microsoft Teams had 250 million active daily users in 2021. - [GetVoIP](https://getvoip.com/blog/2020/07/07/video-conferencing-stats/#:~:text=62%25%20of%20companies%20use%20three,active%20daily%20users%20in%202021.)
* Cisco WebEx currently has over 324 million users. - [GetVoIP](https://getvoip.com/blog/2020/07/07/video-conferencing-stats/#:~:text=62%25%20of%20companies%20use%20three,active%20daily%20users%20in%202021.)
* 500 million viewers watch 100 million hours of video content on Facebook daily. - [99Firms](https://99firms.com/blog/facebook-video-statistics/#gref)
* As of Q4 2021, TikTok had 1 billion MAUs. - [Business of Apps](https://www.businessofapps.com/data/tik-tok-statistics/)
* As of 2019, 500 hours of video are uploaded to YouTube every minute worldwide. - [Tubefilter](https://www.tubefilter.com/2019/05/07/number-hours-video-uploaded-to-youtube-per-minute/)
* Twitch has 140 million monthly active users and 30 million daily active users. - [Earthweb](https://earthweb.com/twitch-statistics/)
* Twitch creators broadcast 88.7 million hours in January 2021 alone. - [Influencer Marketing Hub](https://influencermarketinghub.com/twitch-stats/#toc-8)

### Multiplayer Games Data

##### Multiplayer gaming continued to grow through 2022 across many major brands. A crucial component of any Multiplayer game is reliable Voice Chat which is why this makes our list.

* PUBG has over 520M monthly active users and over 43M daily active users, as of March 2022. - [ActivePlayer](https://activeplayer.io/pubg/)
* Minecraft has over 172M monthly active users and over 15M peak players in a day, as of March 2022. - [ActivePlayer](https://activeplayer.io/minecraft/)
* Fortnite has over 271M monthly active users and over 24M peak players in a day, as of March 2022. - [ActivePlayer](https://activeplayer.io/fornite/#:~:text=In%202021%2C%20our%20records%20shows,and%2025%20million%20daily%20players.)
* Roblox has over 213M monthly active users and over 21M max players in a day, as of March 2022. - [ActivePlayer](https://activeplayer.io/roblox/)
* Apex Legends has over 119M monthly active users and over 11M max daily players, as of March 2022. - [ActivePlayer](https://activeplayer.io/apex-legends-live-player-count-and-statistics/#:~:text=In%202019%20Apex%20Legends%20garnered,players%20daily%20across%20all%20platforms.)
* League of Legends has over 128M monthly active users and over 11M peak players in a day, as of March 2022. - [ActivePlayer](https://activeplayer.io/league-of-legends/)

### Metaverse and AR/VR Data

##### The Metaverse is currently the hottest topic among Big Tech companies. Whether you are using your Voice to control the AR/VR environment around you or to communicate with a friend, Voice is a necessary component of the Metaverse.

* Meta’s social VR platform Horizon has 300,000 monthly active users. - [The Verge](https://www.theverge.com/2022/2/17/22939297/meta-social-vr-platform-horizon-300000-users)
* Meta is building their own Voice Assistant specifically for the Metaverse. - [Vox](https://www.vox.com/recode/22948097/mark-zuckerberg-voice-assistant-metaverse-ai-announcement)
* Rec Room has 37M users as of December 2021, 450% higher than the previous year. - [TechCrunch](https://techcrunch.com/2021/12/20/rec-room-raises-145m-at-a-3-5b-valuation-for-its-user-generated-immersive-gaming-platform/)
* Other companies like Mozilla and Microsoft have also recognized the opportunity that lies within the Metaverse and have created their own platforms, Hubs and Mesh. - [Mozilla](https://hubs.mozilla.com/), [Microsoft](https://www.microsoft.com/en-us/mesh)

## Challenges and Opportunities with Audio Data

The rise in audio data can lead to challenges like the cost of being able to moderate discussions happening on social media, as well as opportunities like being able to provide tools to create a more efficient customer service experience. Do you think there are any other challenges or opportunities that arise alongside the growth in Audio Data? Let us know your thoughts on Twitter [@SpeechlyAPI](https://twitter.com/SpeechlyAPI).

_Cover photo by Kristin Wilson on Unsplash_

How societal shifts at home and in the office are impacting the amount of audio data generated in 2022.

30 Data Points that Prove Audio Data is on the Rise


Before we jump into voice technology specifically, let’s start with technology more broadly. While a lot of digital innovation is driven by some combination of technology and user need, it’s very easy to fall into the trap of innovation for innovation’s sake. See: [Smalt](https://www.theverge.com/circuitbreaker/2017/8/3/16088526/smalt-smart-salt-shaker-app-alexa-smartphone), a smart salt shaker, [Juicero](https://www.theguardian.com/technology/2017/sep/01/juicero-silicon-valley-shutting-down), a $400 Wi-Fi connected juicer, and of course, the [CueCat](https://en.wikipedia.org/wiki/CueCat), a barcode scanner that required a lot of personal information in order to serve you advertisements. While these all sound like something you might have laughed at in a SkyMall or Sharper Image catalog, they collectively raised over $300m - and none of them exist today.

In the highly popular [Design Thinking](https://www.interaction-design.org/literature/article/5-stages-in-the-design-thinking-process) methodology, before you can solve a problem you need to define it. The first step of that process is to Empathize. This crucial starting point requires immersion into the human experience, interviewing and observing subjects. The goal is to challenge your own assumptions and see people holistically in their environments to better understand the issues they experience. It’s only after this immersion that you move on to the next step, which is to Define the Problem.


## Define the Customer Problem Before Developing the Solution

The best products all start with a clear, singular focus on a well-defined problem. From [Duo Security](https://duo.com/)’s focus on making companies more secure by making security easier for employees, to [Lemonade](https://www.lemonade.com/)’s focus on simple, accessible, and fast insurance coverage, there’s power in remaining focused on solving problems. Duo’s founders knew that convoluted and fragmented security solutions that users couldn’t (or wouldn’t) use effectively put companies at risk. Lemonade realized that confusing insurance policies with convoluted sign up experiences made getting insurance something to dread, and led to analysis paralysis across multiple different policies. Both of these companies have used technology and good user experiences to grow their businesses. And while they both exist in vastly different categories, they have one big thing in common: they both dove deep into the needs of their users when designing their products.

Most companies do just a handful of things; whether you’re a restaurant focused on making and serving food, or a doctor’s office focused on patient care, the customer comes with a specific end-goal in mind. By talking to and observing your customers, pain points in the overall customer journey can emerge, and they often don’t look like what you’d imagined.

An important thing to remember when you’re looking to define the customer problem is that it doesn’t need to be earth shattering. It could just be a seemingly small pain point that they encounter frequently, or something that happens infrequently but creates a big, negative impact.

As an example, in the quick service restaurant industry, the carryout experience is full of tension. A customer arrives at the location, unsure of whether or not their food is ready. They enter the store and wait to be acknowledged. Their food sits under a heat lamp while they try to get someone’s attention. By the time they get the food, it doesn’t taste fresh. They reach out to management complaining that the food wasn’t fresh. Management sees food freshness as a problem and works to fix it - when in reality, the problem was that the customer didn’t know when to arrive to get their food, and they didn’t know how to announce themselves in order to quickly receive their order upon arrival. Those problems can both be solved by some scalable combination of operations and technology.


## Identify Pain Points Easily Solved Using Voice Technology

Whether the pain point is tied to the limitations of a touch only interface, like multiple steps to search or filter, or one that comes out of the usage of voice, like being the target of harassment in an online voice chat, voice technology offers new ways to provide clear and immediate value to the user.


### The ability to search by voice in a web or mobile application.

Allowing people to tap to activate a microphone and speak their request rather than type it manually can save time, and cut down on typos in the search experience. The added benefit of voice search with a screen is that you have the ability to display prompts that can help the user by setting expectations and scoping the context of your UI.


### Voice filtering to more quickly sift through information.

Allowing the user’s voice to command and control aspects of the visual interface takes the experience to the next level. Being able to speak naturally and ask the UI to return complex queries like, “Show me women’s shoes in a size nine, in black…actually white. And sort them to show the best deal first,” without tapping to select and deselect multiple categories and sorting orders, hits at a common digital experience pain point. You can try this experience out in our [e-commerce demo](https://demos.speechly.com/fashion/) to see just how much faster it really is.
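To make the idea concrete, here is a minimal sketch of how a spoken request like the one above could be turned into filter state. It assumes the speech layer has already produced a list of (type, value) pairs in spoken order; the `FilterState` class, the entity names, and the `apply_entities` helper are all hypothetical illustrations, not part of Speechly's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FilterState:
    """Filters and sort order currently applied to a product list."""
    filters: Dict[str, str] = field(default_factory=dict)
    sort_by: Optional[str] = None


def apply_entities(state: FilterState, entities: List[Tuple[str, str]]) -> FilterState:
    """Apply (type, value) pairs in spoken order; a later value replaces an earlier one."""
    for entity_type, value in entities:
        if entity_type == "sort":
            state.sort_by = value
        else:
            state.filters[entity_type] = value
    return state


# Hypothetical entities for the utterance quoted above, in the order they were spoken.
entities = [
    ("category", "women's shoes"),
    ("size", "9"),
    ("color", "black"),
    ("color", "white"),  # "actually white" overrides the earlier color
    ("sort", "best_deal"),
]

state = apply_entities(FilterState(), entities)
print(state.filters)
print(state.sort_by)
```

Because the pairs are applied in order, the spoken correction needs no special handling: the last color mentioned simply wins, which is the same behavior a shopper would expect from tapping a different color swatch.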


### Using voice transcription and Reactive Voice UI to assist call center agents with form completion.

While many call center software platforms offer agents the ability to save common information that allows them to autocomplete forms, many involve lengthier forms that require a healthy amount of manual input. Whether it’s a request for a detailed medical history or an open-ended query from a homeowner looking for help from a contractor, agents are often acting as stenographers, transcribing the caller’s information while trying to provide them with good service. Leveraging voice technology to run in the background and manage the transcription and information input automatically saves time and lets the agent focus on making a human to human connection with the caller. It’s using AI to improve the customer experience.

If you’re unsure if using voice for transcription and form completion to support agent assistance makes sense for your company, we encourage you to examine how your callers are interacting with your agents, how long it's taking them to complete manual data input, the quality of the information being input, and whether there are any complaints about the experience.


### Voice list building, or adding items to a cart using voice commands.

Imagine for a moment that the word “list” doesn’t immediately bring to mind a notepad and pen. In the digital context, a list can be anything from a collection of to-do’s to a detailed food order in a cart. One of the most common “analog” voice list building experiences is placing an order at a drive-thru. When efficiency and ease of use are top of mind, using voice to add items to a list (or cart) is a natural way to improve a digital experience.


### AI powered voice moderation to improve online community experiences

Whether in the metaverse or online multiplayer games, voice chat is a popular method for communication and collaboration. It has also, unfortunately, been a popular method for harassing strangers online. People targeted by the harassment find themselves with negative associations tied to the online game or community, and some may abandon them altogether to avoid further harm. For the companies that rely on community members and players, this represents an [existential threat to their business.](https://www.speechly.com/blog/the-case-for-real-time-voice-chat-moderation-technology-in-the-metaverse) Implementing voice technology behind the scenes that can leverage AI to help support moderation efforts is not only scalable, it offers real-time recognition and understanding that has a direct impact on real people.


## From Ideation to Implementation of Voice Technology Solutions

Once the customer problems and pain points have been defined, the ideation and prototyping can begin. It’s important to take the idea of a paper prototype and adjust it for the voice experience. In practice, that can look like faking an experience and testing it with users to determine if it’s worth building.

Another option is to build a quick prototype using a simple API like [Speechly’s API](https://api.speechly.com/dashboard/#/signup), which allows for quick builds and deployment into existing tech stacks. That means less time dealing with incompatibility and more time focused on testing and iteration across multiple different AI powered voice technology solutions.

Whatever direction you go in, remember to center the user in everything that you do. They are ultimately the ones who decide whether what you’ve built offers them any value.


Using Design Thinking and AI to improve customer experience

How to Use Voice Technology to Solve Customer Problems


We’re excited to share that we’ve partnered with Wix.com to bring Speechly’s voice technology to their 200 million users worldwide.

Speechly collaborated closely with the Wix.com team on the design of a voice enabled feature package that can easily be integrated into [WixStores](https://www.wix.com/app-market/wix-stores) sites. This collaboration makes it possible for WixStores shoppers to experience machine learning powered voice enabled features like voice search and voice filtering, all customized for each site.


## Adding a Voice User Interface to Wix Sites

Step by step tutorials have been put together to add a [voice search bar](https://www.wix.com/velo/blog/post/real-time-voice-to-text-search-with-velo-and-speechly) and to add [custom Reactive Voice UI components](https://www.wix.com/velo/blog/post/voice-commerce-is-the-future-of-e-commerce).

<YouTube videoId="XR_2E6FEQ7o" />

Beyond the “behind the scenes” technology, the package includes things like the customizable Push to Talk UI component, which makes it simple for shoppers to activate the voice feature as a part of their shopping experience.

Speechly’s proprietary Spoken Language Understanding® technology powers the experiences by extracting the meaning and intentions of the user’s spoken language, and turning it into useful data - without forcing them to alter the way they naturally speak. That means that Wix sites will now be able to map speech to intents in real-time, allowing visitors to see their words instantly generate reactions within the site.


## Why Using a Voice API Makes Sense

From the ease of integration across multiple browser types to the ease of scalability and simplicity of integration into an existing tech stack, [voice APIs](https://api.speechly.com/dashboard/) are great options for developers looking to add voice features and functionality to their digital experiences.

Wix users can experience it for themselves using the Speechly for [Wix Stores Velo package](https://www.wix.com/velo/example/Speechly-Integration).

_Cover photo by Vojtech Okenka on Pexels_

Partnership between Speechly and Wix.com makes it easy for eCommerce Wix Stores to add voice technology to their sites

Powering the Future of Voice Commerce with Wix.com


## Moderation needs in the metaverse

As Meta continues forward with their commitment to the growth of the metaverse, they’re also grappling with the reality that harassment in VR could turn mainstream consumers away. Their incoming CTO, Andrew Bosworth, referred to this as an [“existential threat”](https://www.businessinsider.com/facebook-meta-andrew-bosworth-vr-toxic-metaverse-2021-11) to their plans for the metaverse expansion.

The threat is a very real one. Microsoft recently shuttered elements of their [AltspaceVR public social hubs](https://www.cnet.com/tech/computing/microsoft-shutters-part-of-its-social-metaverse-for-safety-reasons/) and made plans to increase moderation to ensure that the platform is safe. Voice chat has been used to [sexually harass players using Oculus](https://www.cnet.com/tech/gaming/features/as-facebook-plans-the-metaverse-it-struggles-to-combat-harassment-in-vr/) for gaming. The potential for harm in these new spaces is obvious and the need for effective moderation solutions is clear.


## Real time voice audio offers new opportunities for harassment - and for AI powered moderation solutions

It’s important to note that this isn’t an issue that can be easily solved; Mike Masnick, founder of Techdirt, wrote about what he calls [Masnick’s Impossibility Theorem](https://www.techdirt.com/2019/11/20/masnicks-impossibility-theorem-content-moderation-scale-is-impossible-to-do-well/). He argues that, “content moderation at scale is impossible to do well.” (It’s worth calling out that he still feels it’s something that needs to be done.)

What’s interesting about moderation in the metaverse is that you have multiple different modalities at play. People can talk to each other _and_ they can interact with each other through simulated touching and gesturing. Moderation must be occurring across both modalities in order to be effective and solutions for both should be flexible enough to allow them to work together in parallel, to provide additional context and improve the quality of the moderation efforts.

When people talk to each other, they’re listening not just to the words being said but to the way that they’re said. They observe the body language of the speaker. They know the context of the relationship with the speaker. All of these things factor into the way that the words spoken are processed and understood by the listener. For moderation purposes the **understanding** of all of these things together is key, and it has to be done accurately and **quickly**. Why? Because a recent survey found that [60% of kids and 83% of adults have experienced harassment](https://www.forbes.com/sites/jemimamcevoy/2021/09/15/60-of-kids-experience-harassment-while-gaming-online-and-many-are-exposed-to-white-supremacy-survey-finds/?sh=39cfc0151f4e) in online multiplayer games. That is a huge human impact and the online gaming voice experiences offer a lot of parallels to the metaverse experience but now with new, more interactive, ways to cause harm.

This potential for harm is something that all of the big players in building out elements of the metaverse are aware of. If their platform does not have technology in place to help identify, investigate, and intervene in situations like this, their platform becomes a tool that harms people. That’s not good for people and it’s not good for business.

This space is interesting to the team at Speechly because the challenge posed is one that [our technology](/products/interfaces/) is uniquely positioned to help address. Ideally the technology would be deployed as a flexible chat moderation API with a custom model to suit each specific community and environment. The ability to simultaneously run automated speech recognition and natural language processing means that we’re able to help moderation systems respond faster, and with more accuracy.


## How Artificial Intelligence (AI) can support voice chat moderation

If you’ve ever read a transcript of a conversation, you know that it can leave a lot to be desired. The ability to create these transcriptions in real time as people are speaking is at the heart of what is needed for successful voice chat moderation. Then you add in the layers that bring it to life and the context and understanding necessary to determine if something was said that should be escalated.

AI powered models built around signals like sentiment, volume fluctuations, and tone can help establish the context of what was said. Remember that in the metaverse, unless someone is streaming and recording the experience, spoken harassment leaves no “evidence” behind. There’s no comment to screenshot, no profile to click to better identify the harasser. The experiences often move quickly and the harasser can quickly move on without any intervention. Unless. Unless there’s an AI layer built in to help identify, intercept, and intervene in real time.
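As a rough illustration of that AI layer, the sketch below scans transcript segments as they arrive and keeps any that should be escalated, along with who spoke and when. It is only a stand-in: a real system would rely on trained models for sentiment, tone, and context rather than a keyword list, and the `Segment` class, the `flag_segments` helper, and the placeholder term are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Segment:
    """One transcribed chunk of speech from a voice chat."""
    speaker: str
    text: str
    start_s: float  # seconds since the session started


def flag_segments(segments: Iterable[Segment], blocked_terms: Set[str]) -> List[Segment]:
    """Return the segments containing a blocked term, preserving speaker and timing."""
    flagged = []
    for seg in segments:
        # Normalize each word so punctuation and casing do not hide a match.
        words = {w.strip(".,!?").lower() for w in seg.text.split()}
        if words & blocked_terms:
            flagged.append(seg)
    return flagged


# Hypothetical stream of segments; the blocked term is a harmless placeholder.
stream = [
    Segment("player_1", "nice shot", 12.0),
    Segment("player_2", "you are a PLACEHOLDER_INSULT!", 13.5),
]

flagged = flag_segments(stream, {"placeholder_insult"})
for seg in flagged:
    print(seg.speaker, seg.start_s, seg.text)
```

The useful part is less the matching rule than the record it produces: each flagged segment carries a speaker and a timestamp, which gives moderators the evidence that spoken harassment otherwise lacks.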


## The future of voice chat moderation

As companies continue their push into new forms of multimodal online experiences in the metaverse, the need for effective moderation will only grow. The types of harassment will shift and expand along with the capabilities of the metaverse and the technology to monitor and moderate it will need to expand alongside it.

The sooner that AI powered models are deployed, the smarter and more effective the technology will become, and the better everyone’s experiences will be.

_Cover photo by Julia M Cameron on Pexels_

Why the future of the metaverse is dependent upon robust voice chat moderation APIs and AI technologies

The Case for Real Time Voice Chat Moderation Technology in the Metaverse


The [Reactive Voice UI](https://www.speechly.com/blog/voice-user-interfaces-examples/) paradigm is especially well suited for the web and mobile environments - really any experience with a touch screen.

In the marketplace today, many digital voice experiences follow the legacy voice assistant model, like the one popularized by Apple's Siri back in 2011. While these assistant driven voice UIs optimize for conversation, Reactive Voice UIs optimize for task completion. The bulk of voice assistant usage over the last decade has been reserved for single-utterance requests like, "play music" or "turn off the lights." In fact, the [2020 Smart Audio Report](https://www.nationalpublicmedia.com/insights/reports/smart-audio-report/) found that the top five tasks requested of voice assistants were to play music, get the weather, set a timer, check the time, and tell a joke. This is not surprising, given the amount of effort and time required to accomplish more complex tasks in a turn-based, conversational experience.

What do we mean by turn-based? The person and the AI assistant must take turns speaking in order to be understood. The person speaks, and only when they've stopped talking does the NLU kick in to process the spoken language input, determine the intent, and then return with a text-to-speech response. They each must wait for the other to respond before they're able to move forward. It's one-way, asynchronous communication.

As someone is trying to uncover what the assistant can do and is stumbling through everything it _can't_ do, it quickly becomes a time intensive back and forth conversation to reset. The frustration this can create is so much a part of our cultural zeitgeist that Googling "yelling at Alexa" returns thousands of results.

It's a sequential waterfall process where the value is delivered at the very end. Likewise, any errors happen silently throughout the process, only to be surfaced at the end when it is too late to recover from them.

In the legacy voice assistant experiences, voice serves as both the UI and operating system. In the Reactive Voice UI design philosophy, voice serves a function as a feature alongside other modalities. The idea is not to substitute an existing Graphical User Interface (GUI) with voice but rather to complement or augment it along the parts of the user journey where a type and swipe input would otherwise be tedious. These include tasks like searching and inputting complex information.

Reactive Voice UIs are characterized by [multimodal UI](https://waracle.com/blog/mobile-app-development/why-multimodal-ui-is-the-future-of-mobile/) mechanics that enable voice input to generate a visual output. That means that the person can speak and see their words generate a reaction within the visual element directly, without an "assistant" managing the experience.

Whether using voice to search within a site, or [voice picking](https://www.speechly.com/solutions/logistics/) to manage inventory, the UI and the user are able to communicate with one another in both directions simultaneously. Despite not being a "conversation" it's a much more natural (and fast) way to communicate and get things done.
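The difference between the two models comes down to when the user gets feedback. The sketch below is a simplified illustration, not how any particular assistant or the Speechly client is implemented: both function names are hypothetical, and each returns the sequence of screen updates a user would see for the same utterance.

```python
from typing import List


def turn_based_updates(words: List[str]) -> List[str]:
    """Turn-based: nothing is shown until the whole utterance has been heard."""
    return [" ".join(words)] if words else []


def reactive_updates(words: List[str]) -> List[str]:
    """Reactive: the display refreshes after every word, so a
    misrecognition is visible as soon as it happens."""
    return [" ".join(words[: i + 1]) for i in range(len(words))]


utterance = "show me black shoes".split()
print(turn_based_updates(utterance))
print(reactive_updates(utterance))
```

With the turn-based version there is a single update at the very end, so an early mistake only surfaces once the user has finished speaking. The reactive version produces one update per word, giving the user a chance to notice and correct a mistake mid-sentence.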

## Managing the Limitations of Artificial Intelligence in Voice Technologies

AI based product experiences have one thing in common: when they work, they're magical. When they fail, they fail catastrophically. Take autonomous driving: a magical experience is being driven safely from home to work. A catastrophic experience would be being driven off a cliff. With voice, a magical experience might be nailing a complex pizza order in one go. A catastrophic experience might be accidentally sending (and paying for) twenty pizzas to be delivered to your old address.

If you look at humans, we aren't that great at understanding spoken language. The typical human word error rate is around five percent. That means that if a sentence is ten words long you will have, on average, one word misunderstood in every other sentence. This tells you that verbal communication is very error prone to begin with.
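The arithmetic behind that figure is easy to check. The sketch below assumes, for simplicity, that every word is misunderstood independently with the same five percent probability; real speech errors tend to cluster, so treat the numbers as a rough guide. The function names are just illustrative.

```python
def expected_errors(wer: float, words: int) -> float:
    """Average number of misunderstood words in a sentence of the given length."""
    return wer * words


def prob_at_least_one_error(wer: float, words: int) -> float:
    """Chance that a sentence contains one or more misunderstood words,
    assuming each word fails independently."""
    return 1 - (1 - wer) ** words


wer, words = 0.05, 10
print(expected_errors(wer, words))                    # half a word per sentence
print(round(prob_at_least_one_error(wer, words), 2))  # roughly 0.4
```

Half a misunderstood word per sentence is the same as one misunderstood word every two sentences on average, and under the independence assumption roughly four in ten sentences contain at least one error.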

However, what's different for humans is that these errors in understanding don't cost very much, as humans are able to quickly recover from them.

One big reason that voice AI experiences are still struggling to grow market share is that the mistakes in understanding often feel too costly. We can tackle this in two ways: either by trying to make the AI smarter and smarter, to the point where they won't make mistakes, or by trying to make the failures cost less. We at Speechly believe that the latter, more pragmatic approach, is the way to go.

If you're familiar with the principles of [modern software delivery](https://medium.com/the-value-maximizers/what-is-agile-ac19f79de430), this will likely be familiar to you:

> Succeeding in communication is about short cycles, incremental delivery, being iterative, failing fast, getting feedback, delivering value early, transparency and adaptation.

These same core principles apply to Reactive Voice UIs.

With Spoken Language Understanding™ technology and a visual Reactive Voice UI displayed on a screen, you can maintain a [fast feedback loop](https://demos.speechly.com/fashion/). That means that you're able to deliver value early and enable quick recovery from errors in understanding. This keeps errors from compounding, helps build trust with the user, and makes the experience feel seamless.

## How to Convert an Alexa Skill into a Multi-Modal Experience

It takes an immense amount of work from a [conversational design](https://marvelapp.com/blog/principles-of-conversational-design/) and development perspective to build voice assistant experiences. These voice-only experiences are not bad - they are just limiting, from a user perspective. We built Speechly to offer up an alternative, one that starts with a focus on the user. If you've built an Alexa Skill and are interested in expanding its reach beyond the Amazon ecosystem, we created a [simple conversion tool](https://www.speechly.com/blog/introducing-alexa-to-speechly-conversion/) that lets you create a new Speechly application from an existing Alexa skill in a few simple steps. With these features folks can easily turn their Alexa skill into a streaming Speechly voice application that can then be used to enable Reactive Voice UI experiences across the web and in mobile apps.

It's free to start building: [https://docs.speechly.com/basics/getting-started/](https://docs.speechly.com/basics/getting-started/)

If you have questions, a specific use case or a POC you want to try out, our [inbox is open](mailto:hello@speechly.com).

_Cover photo by Tiger Lily on Pexels_


From Voice Search to Voice Picking, Reactive Voice UIs shine in web and mobile applications

The Importance of a Reactive Voice UI When Using Voice as an Interface 


In the past five years we've seen tremendous technological advancement in the voice and [Natural Language Understanding (NLU)](/products/interfaces/) space. In 2016 we saw [speech recognition reach human parity](https://www.theverge.com/2016/10/18/13326434/microsoft-speech-recognition-human-parity) in some of the classical conversation speech recognition benchmarks. Alexa launched and Google introduced their assistant smart speaker. Speechly was founded with the idea that the asynchronous turn-based conversational model could be improved upon.

The advancement has continued, growing by leaps and bounds in less than a decade. We now have superhuman accuracy in Automatic Speech Recognition (ASR) as well as Natural Language Understanding (NLU) for many of the most well known ASR and NLU benchmarks.

However, despite these technological advancements, voice as a User Interface (UI) and as a UI modality has yet to live up to its promises. Most people still only use voice technology as a way to hear the weather, turn off the lights in their home, or to voice search short queries in a browser.

The reason? While the technology has advanced, the user experiences have primarily remained the same, trapped in the context of a conversational assistant style experience. The end result is a gap between what the technology is capable of, what people want, and what current day voice UIs actually deliver.

![phones](/uploads/phones.png)

_Even something as ubiquitous as touchscreen technology didn't see widespread adoption until the introduction of the iPhone, which made the experience feel natural and intuitive. iPhone image credit: [Rafael Fernandez](https://commons.wikimedia.org/wiki/User:TheGoldenBox), [IPhone 1st Gen](https://commons.wikimedia.org/wiki/File:IPhone_1st_Gen.svg), [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)_

For a modality to take off, it has to feel effortless - magical, even. Most of the technology and design approach today doesn't meet expectations.

That's a controversial hot take for a company rooted in the voice technology industry - but the only way that we advance and grow the industry is by looking at it objectively and working to build better, higher quality experiences.

So what is quality? The classical definition is that a product is of high quality when performance meets expectations. Let's look at why the improvement in technological quality has yet to result in improvements in user perception of quality.

## Voice Enabled AI Problems Require Human Focused Solutions

People are very good at detecting fakes, and the closer something comes to resembling human behavior, the more the small deviations from this behavior start to feel disturbing. It's that shift into the uncanny valley where the creepiness outweighs the cool.

Many of the voice experiences over the last decade are reliant upon the assistant persona to manage the handoff into any third party applications. This ties the success of the voice channel to the voice assistant persona's ability to manage against the uncanny valley feeling. Voice experiences have been built as one-off applications, often with little to no visual elements. The tech has been focused on trying to make the AI feel like a human by forcing users into a conversation, with the idea that it will feel natural.

Humans cannot speak and listen at the same time. Therefore any conversational communication happening in these legacy assistant experiences is one direction at a time, and not simultaneous. As smart as the AI gets and as "human" as the technology is made to feel, it still doesn't resonate as a good experience because of how slowly information is exchanged.

On the other hand, it is easy for humans to process visual information and speak at the same time. When sighted people speak to each other, they're often watching for visual cues from the other person to show that they understand, or have a question, as they speak. With newer technology that leverages standard Graphical User Interface (GUI) elements, you can build a voice enabled experience _for a human_ that includes a visible reaction on a screen to show understanding. In the voice-only experiences, this visual feedback is typically missing.

When you add voice to the visual UI you are making the machine more powerful, and the experience more intuitive. With Reactive Voice UIs the [visual interface](https://www.interaction-design.org/literature/topics/ui-design) and the user communicate with one another in both directions simultaneously. The communication is fast. The experience feels natural.

From a user perspective, it's the difference between trying to interact with a peculiarly behaving almost-human and controlling a highly functioning machine. From the designer's perspective, it means having access to another tool in the [toolkit](https://dreamy-cori-a02de1.netlify.app/design-philosophy/set-right-context/) to help drive UI design forward.

## The Role of a Screen in a Voice User Interface

By leveraging the screen and existing user interface design best practices, building with voice starts to feel much more accessible and intuitive to both the user and the designer.

Screens are incredibly helpful when it comes to setting expectations and scoping the context of your UI. Voice-only experiences often give users analysis paralysis because there's no intuitive way to understand what they can do or say, and there's a limited understanding of what features are supported. Users are left to guess at what is possible, which means that they encounter a long list of things that are _not_ possible along the way.

When voice-only assistants present the experience as infinite, it quickly becomes clear just how limited it can be.

Outside of voice, most applications are designed to do just a few things but to do those few things, and communicate what they are and how to use them, very well. If we apply that same idea to experiences with voice, the voice UI should use existing UI conventions and the visual elements of the screen to communicate the scope of what is possible to the user.

## The Impact of Voice Technology & User Interaction Design Principles on Voice UI Adoption in the Marketplace

The gap between user expectations and the value delivered in many voice experiences can be significantly reduced by applying the design principles of [Reactive Voice UIs](https://www.speechly.com/blog/ui-components-for-voice-uis-in-the-web/) to the design and development of voice experiences. These principles help set expectations properly up front, and improve the delivery of value by mapping it directly back to the user interface.

With Reactive Voice UIs, the designer builds in visual feedback elements that help the user better understand how and when to use voice for a more efficient experience. This can look like commonly understood elements such as a "Push to Talk" microphone button or an overlay component that provides feedback about the voice input.

![speechly ui](/uploads/speechly-ui.png)

When these features are combined with a new technology called [Spoken Language Understanding](https://www.youtube.com/watch?v=NEYKiM9Ta9s)™ (SLU), it allows the user to speak and have the UI instantly map their words to actions within the UI. In practice, SLU and a Reactive Voice UI come together to create experiences like this:

<YouTube videoId="xI68NT8D1m8" />

It feels almost…magical.

_Cover photo by Andrea Piacquadio on Pexels_

How Reactive Voice UIs & Spoken Language Understanding™ improve voice user experiences from legacy voice tech, grow feature adoption on web and mobile

Next Generation Voice User Interface Design & Development


After years spent focused on our core technology and the recent introduction of our commercial product, we are thrilled to announce that we’ve been selected to join Y Combinator’s Winter 2022 batch.

Y Combinator (YC) is a startup accelerator headquartered in California. They got their start in 2005 and have since helped launch more than 3,000 companies, including Stripe, Reddit, DoorDash, Coinbase, Dropbox, and Instacart. Twice a year they select a handful of startups to invest in and support. The acceptance rate for the YC accelerator is between 1.5% and 2%, so to be selected is a huge honor and a testament to the work the team has put in over the years.

All companies get a [YC company profile listing](https://www.ycombinator.com/companies/speechly) and access to an incredible network of founders and investors.

With our inclusion in the YC Winter ‘22 batch, Speechly is well positioned to continue to grow our footprint and expand our impact by bringing more experiences “powered by Speechly” to the marketplace. We’ll continue to work on our core technology and make improvements to the developer experience within our product. Our commitment to our users and to helping more people speak and feel heard will now have the added support of the resources unlocked by YC.

It’s hard to believe that just a couple of years ago we were preparing to launch our commercial product into a world being completely reshaped by a pandemic. In the weeks and months since, we’ve found incredible levels of support from the voice technology community, our users, and of course, our greatest cheerleaders - our friends and family.

We would not be where we are today without the partnership, trust, and support of each of those people. We are grateful and looking forward to the next few months, as we dive headfirst into the work ahead.

We’ll be sharing our journey here on the blog, on [The Speechly Podcast](https://anchor.fm/the-speechly-podcast/episodes/The-Speechly-Podcast---Introduction-e15htlq), and of course [Twitter](https://twitter.com/speechlyapi), [LinkedIn](https://www.linkedin.com/company/speechly/), and [YouTube](https://www.youtube.com/speechlyapi). As always, it’s free to [start building](https://api.speechly.com/dashboard) with our API and you can connect directly with our product team on [GitHub](https://github.com/speechly/).

If you’re interested in being a part of the team, [check out our open roles](https://www.speechly.com/careers/).


Speechly’s Reactive Voice User Interface API has been selected by the coveted startup accelerator

Speechly Joins Y Combinator’s Winter 2022 Batch


For the first time since 2019, the Speechly team was able to attend an in-person [Voice Summit](https://www.voicesummit.ai/) on December 7th and 8th at the Renaissance Arlington Capital View Hotel in Arlington, Virginia. Over 400 people attended in person, with even more tuning in virtually to talk about voice technology and the broader voice industry.

Speechly CEO Otto Söderlund flew in from Helsinki to do a live demo

[![Image](/uploads/tweet1.png)](https://twitter.com/OttoSoderlund/status/1468238216688017415?s=20&t=48o-gpPK0M3ifbfZad1Lpg)

and present a Keynote that included a [surprise announcement](https://www.speechly.com/press/press-releases/speechly-releases-alexa-conversion-feature/)

[![Image](/uploads/tweet2.png)](https://twitter.com/SpeechlyAPI/status/1468611489363902471?s=20&t=afrqKzgerLtED9YGjL8RaA)

From the simple things like being able to enjoy our morning coffee together

[![Image](/uploads/tweet3.png)](https://twitter.com/SpeechlyAPI/status/1468587452101189638?s=20&t=dSUidPVn1W3Sp0y0-m81NQ)

to spending quality time connecting with people

[![Image](/uploads/tweet4.png)](https://twitter.com/OttoSoderlund/status/1468402326813786113?s=20&t=WsxOgUlZTerj3abvBA17kw)

it was a great venue for deep discussion about some of the themes we’re seeing in the voice industry.

## Top themes from Voice21

**There’s been a clear shift from voice as a novelty to voice as a utility.**

This is something that JP from [Vixen Labs](https://vixenlabs.co/) talked about in his presentation: the idea that voice was for “play things.” It kept popping up in conversations about how the industry has changed recently. Discussions about voice now center around use cases, user experience, and business objectives. Users are looking for more from voice, and companies are looking for new ways to offer value - often in the form of increased efficiency or ease of use. That means that voice is not “just” a vehicle for a back and forth conversation with an assistant or bot. It’s a means to an end.

And to that end, it’s **less about new channels, and more about improving the experience on existing ones**.

Developing and driving usage of new, voice-only channels can be difficult and expensive - especially if you step outside of the Big Tech assistant ecosystem. Driving usage of new voice features on existing channels requires less effort, and often drives higher usage and reduced churn on those channels. Why? At the end of the day, people are results-oriented - when they realize [how much faster they can get things done with their voice](https://www.youtube.com/watch?v=xI68NT8D1m8), everything else becomes frustratingly slow by comparison. When the experience is better, people come back to it.

All of that means that the market is maturing, and with that maturation we shift into optimization.

**Developers are looking for foundational technologies that will help them optimize and streamline their tech stacks.**

Building legacy voice experiences has been either tightly tied to Big Tech device-centric platforms or a cobbled-together mashup of Natural Language Understanding, Speech Recognition, and CSS. These legacy tools have provided a strong foundation from a hands-on learning perspective, but they’ve also required long ramp-up times and capital investment, and they come with trade-offs in performance or user experience. As voice continues to prove itself as a viable feature and user expectations continue to grow, the trade-offs become less viable.

We left the event energized by the conversations and excited about where the voice industry is heading. See you in 2022!


Top voice industry themes in 2021, from voice user experience to voice technology.

Voice Summit Reflections, December 2021


Everyone who has built a robust natural language user interface, be it a voice UI or a chatbot, knows that getting all the details right can take quite a bit of work and iteration. Also, while most tools for building natural language UIs are designed around the same principles (intents and entities/slots), getting to grips with the intricacies of each platform can be similar to learning a new programming language.

Hence, to lower the barrier of entry to Speechly from legacy voice UIs, we are excited to introduce a new feature that lets you create a new Speechly application from an existing Alexa skill in a few simple steps. With this feature, folks can easily turn their Alexa skill into a streaming Speechly voice application that can then be used to enable Reactive Voice UI experiences on the Web.

## Here’s how it works

1. Export your Alexa Interaction Model (as a JSON file) from the Alexa developer console.
2. Create a new application on the Speechly Dashboard.
3. Upload the Alexa Interaction Model into the new application. It will be converted to a Speechly configuration that you can edit as if it was written for Speechly from scratch.
4. Click “deploy”, and you’re done. In a few minutes the application is trained and ready for you to try out in the Speechly Dashboard.
5. Use the Speechly Client Libraries and Web UI Components to integrate the Speechly application into your own website UI and business logic.

<YouTube videoId="0t2tijg9UgI" />

## Supported Alexa features

At the time of launch, we support the following Alexa slot types:

1. `AMAZON.NUMBER` (mapped to `SPEECHLY.NUMBER`)
1. `AMAZON.ORDINAL` (mapped to `SPEECHLY.SMALL_ORDINAL_NUMBER`)
1. `AMAZON.DATE` (mapped to `SPEECHLY.DATE`)
1. `AMAZON.TIME` (mapped to `SPEECHLY.TIME`)
1. `AMAZON.PhoneNumber` (mapped to `SPEECHLY.PHONE_NUMBER`)
   (Note that there may be small differences in the way the returned slot values are normalized.)
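
The mapping above can be written down as a small lookup table, which is handy for checking an exported Interaction Model before uploading it. The sketch below is not Speechly's converter: the helper name `classify_slot_types` and the example skill are hypothetical, and it assumes the exported JSON follows the usual Alexa layout (`interactionModel.languageModel.intents[].slots[].type`). It only reports which slot types appear in the launch list; it makes no claim about how other types are handled.

```python
import json

# Built-in slot types listed as supported at launch, with their Speechly counterparts.
ALEXA_TO_SPEECHLY = {
    "AMAZON.NUMBER": "SPEECHLY.NUMBER",
    "AMAZON.ORDINAL": "SPEECHLY.SMALL_ORDINAL_NUMBER",
    "AMAZON.DATE": "SPEECHLY.DATE",
    "AMAZON.TIME": "SPEECHLY.TIME",
    "AMAZON.PhoneNumber": "SPEECHLY.PHONE_NUMBER",
}


def classify_slot_types(interaction_model: dict) -> dict:
    """Group the slot types a model uses into mapped, unlisted built-in, and custom."""
    language_model = interaction_model.get("interactionModel", {}).get("languageModel", {})
    mapped, unlisted, custom = {}, set(), set()
    for intent in language_model.get("intents", []):
        for slot in intent.get("slots") or []:
            slot_type = slot["type"]
            if slot_type in ALEXA_TO_SPEECHLY:
                mapped[slot_type] = ALEXA_TO_SPEECHLY[slot_type]
            elif slot_type.startswith("AMAZON."):
                unlisted.add(slot_type)  # built-in type that is not in the launch list
            else:
                custom.add(slot_type)    # type defined by the skill author
    return {
        "mapped": mapped,
        "unlisted_builtin": sorted(unlisted),
        "custom": sorted(custom),
    }


# Hypothetical example model, trimmed to the fields the helper reads.
example = json.loads("""
{"interactionModel": {"languageModel": {"intents": [
  {"name": "BookTableIntent", "slots": [
    {"name": "guests", "type": "AMAZON.NUMBER"},
    {"name": "day", "type": "AMAZON.DATE"},
    {"name": "city", "type": "AMAZON.US_CITY"},
    {"name": "cuisine", "type": "CuisineType"}]},
  {"name": "AMAZON.HelpIntent", "samples": []}
]}}}
""")

report = classify_slot_types(example)
print(report)
```

Running a check like this before step 3 gives an early hint about which slots may need manual attention in the converted configuration.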

Good to note: as per Speechly’s single-turn Reactive Voice UI paradigm, the conversion feature does not support dialogue management features that Alexa skills sometimes rely on. The resulting Speechly application can recognize the intents and slots as they appear in the Alexa Interaction Model. However, Speechly will not trigger follow-up questions, for instance, if the user’s utterance is missing a required slot.

If you are new to Speechly, take a look at our [Design Philosophy Guide](/blog/voice-application-design-guide/). There you will find guidelines and best practices for creating websites & applications with Reactive Voice UIs.

## Let us know how we can make this better

We believe that this feature is useful for those of you who have heavily invested in developing Alexa skills, but would still like to give Speechly a try. This will be especially useful when you are interested in evaluating Speechly’s NLU accuracy on your application. (An arguably necessary step to take when considering any new voice UI platform.)

The feature is now released as a public beta, and we are happy to receive feedback on how to make it more useful for you. Likewise, please let us know if you would need support for other platforms in addition to Alexa.

[Start developing today](https://api.speechly.com/dashboard/) with Speechly, see [our demos](https://demos.speechly.com/fashion/), and check out [our documentation](https://docs.speechly.com).

Amazon Alexa and all related logos and motion marks are trademarks of Amazon.com, Inc. or its affiliates.

_Cover photo by Jessica Lynn Lewis on Pexels_


Create a new Speechly application from an existing Alexa skill in a few simple steps.

Introducing the Alexa-to-Speechly conversion feature


For the past 5 years at Speechly, we have been researching and developing tools to easily add Fast, Accurate, and Simple Voice User Interfaces (Voice UIs) in Mobile, Web, and Ecommerce experiences. In this article, we'll introduce the concepts and guidelines we've found effective in creating Multi-Modal Voice experiences that enable users to complete tasks efficiently and effectively.

At Speechly, we approach Voice as an Interface. We believe Voice UIs should blend alongside existing modalities - like typing, tapping, and swiping - and take advantage of a visual display for providing real-time feedback to the user. As a result, a Speechly powered website/app can be controlled with both the [Voice UI](/blog/what-is-voice-user-interface/) and the Graphical User Interface (GUI), allowing the user to choose the best input method for the occasion. You can also think of a Voice UI as a controller for app actions which makes it [retrofittable to an existing application](/blog/voice-user-interfaces-for-react/).

We contrast the “Speechly Model” to the popular “Voice Assistant Model” for Voice UIs seen in products like Apple’s Siri, Google’s Assistant and Amazon’s Alexa. All of these experiences are conversational in nature, optimized for hands-free use with voice, and overlook the best uses of a Voice UI in a Multi-Modal context.

## Setting the right context

### 1. Don't build a Voice Assistant

Voice Assistants are digital assistants that are built for “Conversational Experiences” - where the user speaks a Voice Command and the system typically utters back a Voice Response. Certain hands-free scenarios can be a good fit for the Voice Assistant model, such as IVR within Contact Centers, but it is not the best model when a user has access to a screen.

Instead of back and forth “Conversational Experiences”, Multi-Modal voice experiences should be based on real-time visual feedback. As the user speaks, the user interface should be instantaneously updated.

### 2. Design for Command & Control

When humans talk with each other, we do more than transmit information by using words. We use different tones and emotions to give different meanings to our words depending on the context of a situation. This is very human-like, but not the way we want to communicate with a computer.

With a Multi-Modal Voice UI, speech has only one function: Command and Control the system to do what the user wants. Be clear that the user is talking with a computer; don’t try to imitate a human. In most cases, the application should not answer in natural language. It should react by updating the user interface, just like when clicking a button or making a search.

### 3. Give visual guidance on what the user can say

<YouTube videoId="XWqHV1a32LM" />

An issue commonly described by users of Voice UIs is the uncertainty related to what commands are supported. Within the Voice Assistant context, this arises from the mission of General Voice Assistant platforms to create an all knowing Assistant.

Understanding the supported functionality with traditional GUIs is less of a problem. Placing a button in the user's shopping cart that reads “Proceed to Checkout” is a very strong signal to the user that checkout is supported and by pressing the button the user will indeed proceed to the checkout process. This aspect is missing from Voice-Only solutions and is a strong benefit for Multi-Modal Voice UIs.

### 4. Use voice ONLY for the tasks it's good for

<YouTube videoId="xI68NT8D1m8" />

Good design is about providing the user with the easiest tools for completing a task.

Voice works great for use cases such as [Voice Search](/blog/voice-search/) – “Show me the nearest seafood restaurants with three or more stars”, [Voice Input](https://youtu.be/XJ4BnEIiAjo?t=313) – “Add milk, bread, chicken and potatoes”, and [Voice Command & Control](https://www.youtube.com/watch?v=vD0gleP7Sxc&ab_channel=Speechly) - “Show sports news” or “Turn off all lights except the bedroom”.

On the other hand, touch is often the better option for quickly selecting from a couple of options.

There’s no need to replace your current user interface with an Assistant-based Voice UI. Instead, evaluate which tasks in your application are the most tedious today and would be easiest to do by voice. A Multi-Modal Voice UI should blend as a UI Feature alongside existing modalities like typing, tapping, or swiping.

## Receiving commands from the user

### 5. Onboard the user

When a user sees a Voice UI for the first time, they will need some guidance on how to use it.

Guidance tips should be placed close to where the visual feedback will appear. You can hide the tips after the user has tried the Voice UI.

### 6. Avoid using a wake word

While voice assistants use a wake word so that they can be activated from a distance, your mobile or desktop application doesn’t need one. The hands-free scenario is less relevant than you might initially think, as the user is already holding the device or is within close proximity to it. Skipping the Wake Word also avoids the privacy risks that are inherent to it.

### 7. Use a Push-to-Talk button

Push-to-Talk (Button on Screen or Physical Key/Button on Device) is the best way to operate a microphone in an application with a Multi-Modal Voice UI. When the user is required to press a button while talking, it’s completely clear when the application is listening. This also decreases latency by making endpointing very explicit, eliminating the possibility of endpoint false positives (the system stops listening prematurely) and false negatives (the system does not finalize the request after the user has finished the command).

On the desktop you can use the spacebar for activating the microphone.

You can also add a slide as an optional gesture to lock the microphone for a longer period of time. WhatsApp has a good implementation of the design in their app.
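
One way to picture the explicit endpointing described above is a tiny state machine in which the button alone decides when audio capture starts and stops. The sketch below is illustrative only: `PushToTalk` and its `on_start`/`on_stop` callbacks are hypothetical names, not part of any Speechly library, and the optional slide-to-lock gesture is modeled as a plain `lock()` call.

```python
from typing import Callable


class PushToTalk:
    """Hypothetical push-to-talk controller: the button is the only endpointing signal."""

    def __init__(self, on_start: Callable[[], None], on_stop: Callable[[], None]) -> None:
        self.on_start = on_start
        self.on_stop = on_stop
        self.listening = False
        self.locked = False

    def press(self) -> None:
        if self.locked:
            # A press while locked ends the locked recording.
            self.locked = False
            self._stop()
        else:
            self._start()

    def lock(self) -> None:
        # Slide gesture: keep listening after the button is released.
        if self.listening:
            self.locked = True

    def release(self) -> None:
        if not self.locked:
            self._stop()

    def _start(self) -> None:
        if not self.listening:
            self.listening = True
            self.on_start()

    def _stop(self) -> None:
        if self.listening:
            self.listening = False
            self.on_stop()


events = []
button = PushToTalk(lambda: events.append("start"), lambda: events.append("stop"))

# Plain hold-to-talk: listening lasts exactly as long as the press.
button.press()
button.release()

# Slide to lock: releasing the button no longer ends the utterance.
button.press()
button.lock()
button.release()
still_listening = button.listening

# A second press finishes the locked recording.
button.press()
button.release()
print(events, still_listening)
```

Because every start and stop is triggered by the user, there is no silence timeout to tune, which is where the premature or missing endpoints mentioned above would otherwise come from.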

### 8. Signal clearly when the microphone button is pushed down

To make sure the user knows that the application is listening, signal clearly when the microphone button is pushed down. This is especially important when using the Push-to-Talk pattern.

You can use sound, animation, tactile feedback (vibration) or a combination to signal the activation. On a handheld touch screen device, make sure that the activated microphone icon is visible from behind the thumb when Push-to-Talk is activated.

## Giving feedback to the user

### 9. Use non-interruptive modalities for feedback

Non-interruptive modalities include haptic, non-linguistic auditory, and perhaps most importantly visual feedback. Using these modalities, the application can react fast and without interruption to the user. For instance, in the case of “I’m interested in t-shirts,” the UI would swiftly show the most popular t-shirt products, instantly enabling the user to continue with a refining utterance such as, “do you have Boss.” This narrows down the displayed products to show only the Boss branded t-shirts.

On the other hand, using a voice response makes this experience complicated for the user, as any ongoing user utterance will be abruptly interrupted. Voice Response is also a slow channel for transmitting information, and for returning users, hearing the same speech synthesis can lead to a worse user experience over time.

### 10. Minimize latency with Spoken Language Understanding

One important part of user experience is the perceived responsiveness of the application. Designers use tricks such as lazy loading, doing tasks in the background, visual illusions and preloading of content to make their applications seem faster, and this should be done with Voice UIs, too.

In Voice-Enabled applications, immediate UI reaction is even more important. Immediate UI reaction encourages the user to use longer utterances and to continue the voice experience. In case of an error, it enables the user to recover fast.

### 11. Steer user’s gaze and minimize visual unrest

When using voice effectively the user can control the UI an order of magnitude faster compared to tapping and clicking. This means there can be a lot of visual activity happening in the UI. It is important that the user can keep up with these UI reactions and understand the feedback.

Typically UI reactions manifest themselves in some sort of visual cues, micro animations and transitions. There is an instinctive inclination in the human visual cognition system to move visual focus to where movement is happening.

Therefore it is an antipattern to scatter UI reactions all over the visual field of the user, e.g. streaming transcription animation at the top of the screen and other UI reactions at the bottom of the screen. This will result in the user's gaze bouncing back and forth on the screen, making it nearly impossible to understand what is happening in the UI.

For this reason it is important to do one of two things. The first option is to centralize all visual UI reactions near one focal point, meaning that both the transcript and the visual transitions resulting from the Voice commands are shown very close to each other. The other option is to steer the user's gaze linearly on the screen with a cascade of animations happening either top to bottom or left to right.

Also, while a Voice UI needs to be as close to real-time as possible, you need to minimize flicker and visual unrest. You can use placeholder images and elements to make sure the application looks smooth and reacts fast.

## Recovering from mistakes

### 12. Show the transcript

Text transcription of a user’s voice input is the most important form of feedback in case of an error. Lack of action tells the user their input was not correctly understood, but when the error is in the Speech Recognition, the transcript lets them quickly understand what went wrong.

Transcripts can also be valuable for the user when everything goes right. They tell the user they are being understood and encourage them to continue with longer utterances. If you are using Speechly, you can use the tentative transcript to minimize feedback latency.

### 13. Produce results fast, but offer opportunity to correct

Natural Language Understanding is hard for many reasons. In addition to the Speech Recognition failing, the user can hesitate or mix up their words. This can lead to errors, just like a misclick can lead to errors with a GUI.

While there are multiple ways to reduce the number of errors, the most important thing is to offer the user an opportunity to correct themselves quickly. Produce the best guess for the correct action as quickly as possible and let the user refine that selection by either voice or touch.

### 14. Have an intent for verbal corrections

When users give long Voice Commands there is a higher likelihood that the user will make an error in their speech. This is not a problem if the users get real-time feedback and can correct themselves naturally.

Multimodality enables users to use the GUI to correct themselves, but make sure to include an intent for verbal corrections as well. This makes it possible for users to say something like “Show me green, sorry I mean red t-shirts” without “failure”.
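
A minimal sketch of how a verbal correction might be applied once the utterance has been understood. It assumes, hypothetically, that the language understanding step already yields an ordered list of events, each either an entity (with a `type` and a `value`) or a correction marker produced by a dedicated correction intent for phrases like “sorry I mean”. The `Event` class and `apply_corrections` function are illustrative names, not part of the Speechly API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # "entity" or "correction"
    type: str = ""   # entity type, e.g. "color"
    value: str = ""  # entity value, e.g. "green"


def apply_corrections(events: list[Event]) -> dict[str, list[str]]:
    """Collect values per entity type; an entity right after a correction
    marker replaces the latest value of the same type instead of adding one."""
    result: dict[str, list[str]] = {}
    correcting = False
    for event in events:
        if event.kind == "correction":
            correcting = True
            continue
        values = result.setdefault(event.type, [])
        if correcting and values:
            values[-1] = event.value  # overwrite the value being corrected
        else:
            values.append(event.value)
        correcting = False
    return result


# "Show me green, sorry I mean red t-shirts"
utterance = [
    Event("entity", "color", "green"),
    Event("correction"),
    Event("entity", "color", "red"),
    Event("entity", "product", "t-shirts"),
]

filters = apply_corrections(utterance)
print(filters)
```

With real-time feedback, the UI can briefly show the green filter and then swap it for red as soon as the correction arrives, so the user sees the fix happen rather than a failed command.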

### 15. Offer alternative ways to complete tasks without Voice

Another way to make corrections is with touch/click. Touch/click corrections are done best by offering the user a short list of viable options based on what they have said or done earlier.

If your user is filling a form by using voice commands, for example, they might only need to correct one field. It can be the most intuitive to tap the correct field and make the correction by using touch. Make sure you support both ways for corrections!

The big issue with voice assistants is that they are hard to use by touch. While voice is a great UI for many use cases, sometimes it’s not feasible. This is why all features in your application should be usable with both voice and touch. For example, you can use traditional search filtering with dropdown menus and include a microphone for using the filters by voice. This enables users to choose the modality that is best for the task at hand.

_Originally published November 27, 2020, updated November 10, 2021_


In this article, we'll introduce the guidelines and best practices for creating websites & applications with Multi-Modal Voice User Interfaces.

Speechly Guidelines for Creating Productive Voice-Enabled Apps


**Voice User Interfaces (Voice UIs) often refer to UIs that use voice both for user input and output. Voice UIs are typically built to enable a more efficient user experience. However, we frequently run into problems with voice-only UIs that result in confusion and frustration for users.**

At Speechly, we believe that many of the problems that exist with Voice UIs today can be mitigated or completely eliminated by adopting a multi-modal design philosophy. This means leveraging all the available modalities (voice, visual, touch) of the user's context to make the user interaction as easy and smooth as possible. One of the most fascinating platforms for multi-modal Voice UIs is the web, but if you look for design patterns for adding voice features to web applications, you will quickly realize a lack of quality resources.

To make designing and developing Voice UIs on the web easier, we are excited to release some of our research on this topic as a set of ready-made UI components. These components can be used to give visual cues to users that the Voice UI is working as expected.

<YouTube videoId="HIgl7B37Jlc" />

## 4 UI Components for Voice UIs

- **[Push-To-Talk Button](https://docs.speechly.com/client-libraries/ui-components/push-to-talk-button/)** is a holdable switch for controlling the Voice User Interface.
- **[Big Transcript](https://docs.speechly.com/client-libraries/ui-components/big-transcript/)** is an overlay-style component that displays the real-time speech-to-text transcript and feedback to the user.
- **[Transcript Drawer](https://docs.speechly.com/client-libraries/ui-components/transcript-drawer/)** is an alternative for Big Transcript that slides down from the top of the viewport. It displays usage tips along with the real-time speech-to-text transcript and feedback.
- **[Intro Popup](https://dreamy-cori-a02de1.netlify.app/ui-components/intro-popup/)** is an overlay-style popup that is automatically displayed when the user first interacts with Push-To-Talk Button. It displays a customizable introduction text that briefly explains the voice features that microphone permissions are needed for. Intro Popup also automatically appears to help recover from common problems.

## Our Multi-Modal Design Philosophy helps design better voice-enabled user interfaces

We believe most of the problems that face Voice UIs can be overcome with a multi-modal design philosophy. Below is the multi-modal design philosophy we embody at Speechly. This [Design Philosophy Guide](https://dreamy-cori-a02de1.netlify.app/design-philosophy/) should be used as a complementary resource with the UI components above when designing or developing a Voice UI.

### Chapter 1: Setting the right context

- Resist the temptation to build an assistant
- Design the interactions around _command & control_, not conversation
- Give visual guidance on what the user can say
- Use voice ONLY for the tasks it is good for

### Chapter 2: Receiving commands from the user

- Onboard the user
- When a pressable button is available, a wake word is not needed
- Prefer a push-to-talk button mechanism
- Signal clearly when the microphone button is pushed down

### Chapter 3: Giving feedback to the user

- Use non-interruptive modalities for feedback
- Minimize latency with [streaming Spoken Language Understanding](/products/interfaces/)
- Steer user’s gaze and visual attention
- Minimize visual unrest in triggered events

### Chapter 4: Recovering from mistakes

- Show the text transcript in real time
- Enable corrections both verbally and by using touch
- Offer an alternative way to complete the task using touch

## Free Voice UI Components for Download

You can find more information about these UI components inside our documentation. If you would like to access the Speechly UI component design files, they are now available in Figma and Sketch for download.

- [Speechly UI components and layout templates in Figma](https://www.figma.com/file/CqXMKQX6LNSnSai00P5xbz)
- [Download Speechly UI Components and layout templates for Sketch](https://speechly.github.io/speechly-ui-assets/speechly-ui.sketch)
- [UI components in docs.speechly.com](https://docs.speechly.com/client-libraries/ui-components/)

If you have any questions on how to best take advantage of our Voice UI components, please feel free to reach out to the team at design@speechly.com.


Ready-made UI components make development of Voice UIs faster.

UI Components for Voice UIs in the Web


**This year’s Interspeech conference was held at the beginning of September in Brno, Czech Republic. Typically there would be some 2000 attendees at the conference, but due to Covid-19, most of the attendees joined virtually this year. I was there on site with 350+ other researchers, and here are my impressions of the scientific catering in the field of automatic speech recognition (ASR).**

## New datasets

ASR is a data-heavy field. Industry leaders are using tens of thousands of hours of transcribed speech to train their models, but most ASR research has relied on much smaller publicly available corpora. Only very recently have opportunities for using larger non-proprietary speech corpora emerged. This year Facebook published Multilingual LibriSpeech (MLS), but that is limited to read-speech data. Now at Interspeech, two large ASR corpora were published, which extend the available domains:

- SPGISpeech offers 5000 hours of financial calls with rich formatting: [https://www.isca-speech.org/archive/pdfs/interspeech_2021/oneill21_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/oneill21_interspeech.pdf)

- GigaSpeech is a 10,000h multi-domain corpus drawing data from audiobooks, podcasts, and YouTube videos: [https://www.isca-speech.org/archive/pdfs/interspeech_2021/chen21o_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/chen21o_interspeech.pdf)

Also worth checking is Facebook's research paper which showed that an ASR model trained on publicly available corpora, combined with fine-tuning to target data, works well for real-world tasks:

- _Rethinking Evaluation in ASR: Are Our Models Robust Enough?, Tatiana Likhomanenko (Facebook, USA), Qiantong Xu (Facebook, USA), Vineel Pratap (Facebook, USA), Paden Tomasello (Facebook, USA), Jacob Kahn (Facebook, USA), Gilad Avidov (Facebook, USA), Ronan Collobert (Facebook, USA) and Gabriel Synnaeve (Facebook, France):_ [https://www.isca-speech.org/archive/pdfs/interspeech_2021/likhomanenko21_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/likhomanenko21_interspeech.pdf)

## wav2vec

wav2vec is an unsupervised (or "self-supervised" as they like to call it) method for learning speech representations. Its latest incarnation, wav2vec 2.0, is gaining popularity: at Interspeech there were 10 papers mentioning wav2vec in the title, and many more that used it in their experiments. The benefit of using such pre-trained representations is a drastic drop in the training data requirements, which makes them attractive for limited-resource scenarios.

A nice analysis on the nature of wav2vec 2.0 was provided by Facebook in their paper:

- _Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training, Wei-Ning Hsu (Facebook, USA), Anuroop Sriram (Facebook, USA), Alexei Baevski (Facebook, USA), Tatiana Likhomanenko (Facebook, USA), Qiantong Xu (Facebook, USA), Vineel Pratap (Facebook, USA), Jacob Kahn (Facebook, USA), Ann Lee (Facebook, USA), Ronan Collobert (Facebook, USA), Gabriel Synnaeve (Facebook, France) and Michael Auli (Facebook, USA)_\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/hsu21_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/hsu21_interspeech.pdf)

## Trends in ASR

Rather than introducing a multitude of complex new network architectures, this year's focus appeared to be more on the practical side of ASR: Reducing the streaming latency, fitting the models on-device, and overall reducing computation and memory footprint. Various improvements were presented towards these goals, especially for transformer-transducer models.

For examples, see the following papers:

- _Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning, Songjun Cao (Tencent, China), Yueteng Kang (Tencent, China), Yanzhe Fu (Tencent, China), Xiaoshuo Xu (Tencent, China), Sining Sun (Tencent, China), Yike Zhang (Tencent, China) and Long Ma (Tencent, China)_\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/cao21b_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/cao21b_interspeech.pdf)

- _An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling, Tara N. Sainath (Google, USA), Yanzhang He (Google, USA), Arun Narayanan (Google, USA), Rami Botros (Google, USA), Ruoming Pang (Google, USA), David Rybach (Google, USA), Cyril Allauzen (Google, USA), Ehsan Variani (Google, USA), James Qin (Google, USA), Quoc-Nam Le-The (Google, USA), Shuo-Yiin Chang (Google, USA), Bo Li (Google, USA), Anmol Gulati (Google, USA), Jiahui Yu (Google, USA), Chung-Cheng Chiu (Google, USA), Diamantino Caseiro (Google, USA), Wei Li (Google, USA), Qiao Liang (Google, USA) and Pat Rondon (Google, USA)_\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/sainath21_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/sainath21_interspeech.pdf)

- _Reducing Streaming ASR Model Delay with Self Alignment, Jaeyoung Kim (Google, USA), Han Lu (Google, USA), Anshuman Tripathi (Google, USA), Qian Zhang (Google, USA) and Hasim Sak (Google, USA)_\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/kim21j_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/kim21j_interspeech.pdf)

Looking for something more exotic? Check out the research on non-autoregressive ASR:

- _An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition, Ruchao Fan (University of California at Los Angeles, USA), Wei Chu (PAII, USA), Peng Chang (PAII, USA), Jing Xiao (PAII, USA) and Abeer Alwan (University of California at Los Angeles, USA)_\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/fan21b_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/fan21b_interspeech.pdf)

RNN transducers are still going strong. Several publications adopted the Hybrid Autoregressive Transducer (HAT) approach for combining external language models with the end-to-end model.

- Speechly's approach, [“Fast Text-Only Domain Adaptation of an RNN-Transducer Prediction Network”](https://www.isca-speech.org/archive/pdfs/interspeech_2021/pylkkonen21_interspeech.pdf), was officially published at Interspeech and is a lighter-weight solution, but more about that later!

## And more...

Interspeech is a lot more than just an ASR conference, too much to cover in a single blog post. One interesting and timely topic was the COVID-19 challenge: detecting infection based on cough and speech samples!

- _The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates, Björn W. Schuller et al._\
[https://www.isca-speech.org/archive/pdfs/interspeech_2021/schuller21_interspeech.pdf](https://www.isca-speech.org/archive/pdfs/interspeech_2021/schuller21_interspeech.pdf)

You can browse the full list of publications at [https://www.isca-speech.org/archive/interspeech_2021/index.html](https://www.isca-speech.org/archive/interspeech_2021/index.html)


Our report from Interspeech, the largest scientific conference focusing on speech science and technology.

Interspeech 2021: Take-aways on Automatic Speech Recognition


## What is the Web Speech API?

The [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API) is an experimental browser standard that enables web developers to effortlessly process voice input from their users. Its simple API can turn on the device's microphone and apply a speech-to-text algorithm to convert whatever the user says into text that the web app can process. At first glance, it seems to open the door to voice-enabled web apps.

## The problem

However, [browser support for this API](/blog/web-speech-api-alternative/) is limited. At the time of writing, the majority of its support is centralised in browsers made by Google, who authored much of the API's [specification](https://wicg.github.io/speech-api/). Indeed, the only browsers that do support it are owned by big tech companies that have the scale to afford to include a free speech-to-text service. Apple has [recently joined](https://firt.dev/ios-14.5/#speech-recognition-api) Google in offering a Siri-based equivalent in Safari.

This has a couple of consequences. Firstly, web apps that use this API have a fragmented experience across browsers. One [example](https://github.com/brave/brave-browser/issues/3725#issuecomment-555733958) is Duolingo, which only offers its voice exercises on Chrome. Indeed, even amongst the browsers that do offer the API, the speech-to-text algorithm differs between them, resulting in different transcriptions and different user experiences between browsers. For example, these are some ways different implementations of the API could yield different results:

1. A word may be transcribed correctly by some implementations and incorrectly by others.
2. A word may be transcribed incorrectly in different ways.
3. A word that is recognised correctly may still be formatted differently. A typical point of contention is how to format units, dates, numbers, times, etc. For example, "1cm", "1 cm", "1 centimetre", "1 centimeter", "8 December 2019", "8 Dec 2019", "08/12/2019", "08.12.19", "one million", "1000000", "1 000 000", "1,000,000", and so on.
4. Implementations across browsers are upgraded on different cadences, so something that worked previously in one browser might stop working after its next upgrade.
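
To make the formatting problem in item 3 concrete, here is a minimal sketch of how an app might normalise transcripts before comparing them. `normaliseTranscript` and its alias table are hypothetical helpers of our own, not part of the Web Speech API, and they cover only a few unit and number formats (spelled-out numbers like "one million" and dates are left untouched).

```js
// Hypothetical alias table: spelled-out units mapped to one canonical form.
const UNIT_ALIASES = new Map([
  ['centimetre', 'cm'],
  ['centimeter', 'cm'],
  ['centimetres', 'cm'],
  ['centimeters', 'cm'],
]);

// Reduce a transcript to a canonical form so that differently formatted
// results from different speech-to-text engines compare equal.
function normaliseTranscript(text) {
  return text
    .toLowerCase()
    // Drop thousands separators: "1,000,000" and "1 000 000" -> "1000000"
    .replace(/(\d)[ ,](?=\d{3}\b)/g, '$1')
    // Separate a number from a known unit: "1cm" -> "1 cm"
    .replace(/(\d)(cm|mm|km)\b/g, '$1 $2')
    .split(/\s+/)
    .map((word) => UNIT_ALIASES.get(word) ?? word)
    .join(' ')
    .trim();
}

console.log(normaliseTranscript('1cm') === normaliseTranscript('1 Centimeter'));
```

A real app would need a much larger table, and ideally a speech service that lets you configure the output format in the first place.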

Secondly, there is a trust factor. Developers using the API probably don't realise that they are sending their users' voice data to a service owned by a big tech company like Google. They may assume that the transcription algorithm runs on the device when it is in fact performed in the cloud. Owning the browser and the speech recognition service also gives these companies the power to make arbitrary changes to the API, including turning it off, as well as lock out other browser vendors. An [example](https://github.com/brave/brave-browser/issues/3725#issuecomment-555694620) is Brave, a browser based on Chromium, which is unable to use Google's speech recognition service due to restrictions imposed by Google. Such restrictions widen the feature gap between browsers like Chrome and the rest of the field.

## The solution

Browsers do not have to be limited to using the speech recognition services owned by Google and Apple. There are more widely supported browser standards like the [Media Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Media_Streams_API) that can enable developers to stream audio data from a microphone to _any_ service. The Web Speech API can be replicated by building code on top of these APIs, escaping the vendor lock-in imposed by the browser's native choice of speech recognition service. Indeed, it can be replicated on browsers that _don't_ support it in the first place.
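
As a rough illustration of what "building code on top of these APIs" involves: audio captured in the browser arrives as floating-point samples, while many speech services accept 16-bit PCM. The function below is a minimal sketch of that conversion step. The name `floatTo16BitPCM` is our own, and the target format is an assumption, since each service documents its own requirements.

```js
// Convert Web Audio samples (floats in the range [-1, 1]) to 16-bit signed PCM.
function floatTo16BitPCM(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples don't wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive ranges differ by one step in 16-bit audio.
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

const pcm = floatTo16BitPCM(new Float32Array([0, 1, -1, 2]));
console.log(Array.from(pcm)); // [0, 32767, -32768, 32767]
```

In a browser, the samples would come from a stream obtained with `navigator.mediaDevices.getUserMedia({ audio: true })` and routed through the Web Audio API, and the resulting buffer could then be sent, for example over a WebSocket, to the speech recognition service of your choice.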

Code that implements missing browser functionality like this is called a [polyfill](https://developer.mozilla.org/en-US/docs/Glossary/Polyfill). The good news is that there exists a polyfill for the Web Speech API that uses Speechly’s speech recognition service under the hood. Any web app using this polyfill would be able to provide a consistent voice-enabled user experience across all browsers, using an API that the developer has chosen, can configure, and can trust.

The code for the polyfill can be found [here](https://github.com/speechly/speech-recognition-polyfill). It can be used in isolation, but if you are using React to build your web app, we recommend you combine it with [react-speech-recognition](https://github.com/JamesBrill/react-speech-recognition) for the simplest set-up.

## An example using React

The repositories both include examples of the two libraries working together and full API documentation, but we'll repeat the basic example here to give you a taste.

First, start developing with Speechly and get an app ID. You can find a quick guide for that [here](https://dreamy-cori-a02de1.netlify.app/examples/stt-only/).

Next, install the two libraries in your React app:

```bash
npm i --save @speechly/speech-recognition-polyfill
npm i --save react-speech-recognition
```

We're going to make a simple push-to-talk button component. When held down, it will display a transcript from the microphone. When the button is released, transcription will end. Using your Speechly app ID, create a React component like the following:

```js
import React from 'react';
import { createSpeechlySpeechRecognition } from '@speechly/speech-recognition-polyfill';
import SpeechRecognition, { useSpeechRecognition } from 'react-speech-recognition';

const appId = '<INSERT_SPEECHLY_APP_ID_HERE>';
const SpeechlySpeechRecognition = createSpeechlySpeechRecognition(appId);
SpeechRecognition.applyPolyfill(SpeechlySpeechRecognition);

const Dictaphone = () => {
  const { transcript, listening } = useSpeechRecognition();
  const startListening = () => SpeechRecognition.startListening({ continuous: true });

  return (
    <div>
      <p>Microphone: {listening ? 'on' : 'off'}</p>
      <button
        onTouchStart={startListening}
        onMouseDown={startListening}
        onTouchEnd={SpeechRecognition.stopListening}
        onMouseUp={SpeechRecognition.stopListening}
      >Hold to talk</button>
      <p>{transcript}</p>
    </div>
  );
};
export default Dictaphone;
```

Run your web app, hold down the button and speak into your microphone (you may need to give the browser permission to use the microphone first). You should see your speech transcribed like this:

![image](/uploads/speechly-polyfill-example.png)

Give it a try and let us know how you get on! If you have any feedback on either library, raise a GitHub issue on the [polyfill repository](https://github.com/speechly/speech-recognition-polyfill) or [react-speech-recognition](https://github.com/JamesBrill/react-speech-recognition).


Speechly enables full browser compatibility for Web Speech API using a polyfill.

Speechly enables full browser compatibility for Web Speech API


## What is "Field Service" and "Field Service Management"?

[Field Service](https://www.skedulo.com/what-is-field-service-management/) consists of “workers in the field” who perform various types of tasks. These can include things such as installation, repair, or maintenance of hardware. However, as technology continues to evolve we see the field service landscape evolving as well and expanding into other verticals such as education and healthcare. For the purpose of this post I am going to refer to “Field Service” as any service or task that is performed outside of company walls for a customer where a worker is required to physically visit the customer to complete the job.

Field Service Management (FSM) is simply how an organization decides to manage all of the different pieces of completing work in the field. This is no small task and requires organizations to track a lot of data to ensure they are delivering a quality service to their customers. Data that is frequently tracked in FSM can be seen in the list below:

- Job Start, Completion, & Travel Time
- Equipment Used
- Employee Performance
- Customer Feedback & Surveys

## Challenges in Field Service Management

As you can see in the list above there is a lot of communication between field and office workers that needs to take place alongside a lot of data collection. This puts a lot on the plate of workers in the field who must relay this information, but are also required to complete the field service of interest while delighting customers in the process. This leads to a myriad of challenges in FSM. I am going to look closer at the challenges with Data Accuracy and 1st Time Fix Rate as I believe these major challenges can be solved with elegant implementation of [voice technology](/categories/voice-tech/) in existing FSM software.

## Challenge #1 - Data Accuracy

According to a [PwC survey](https://online.hbs.edu/blog/post/data-driven-decision-making) of over 1,000 senior executives, highly data-driven organizations are three times more likely to report significant improvements in decision making. However, data is only valuable if it is accurate and accessible. In-the-field workforces have been around for decades and as a result many teams have relied on manual methods for gathering and managing data. These processes usually involve legacy tech such as spreadsheets or quite literally writing notes on paper. With this type of process it is easy for data to be inaccurate.

As a result, companies from startups to big technology companies have built FSM software that puts mobile technology at the center of the data management strategy. While these products have improved data collection for workers in the field, largely by automating steps in the collection process, there are still [scenarios that require manual input](https://www.speechly.com/blog/ideal-scenarios-multimodal-user-interfaces/) from in-the-field workers. Examples include gathering customer feedback or inputting context-specific information that relates to the job at hand. It is important that these data points are also tracked with a high level of scrutiny: they can play a significant role in data-driven decision making, and because they are entered manually they are prone to inaccuracies.

## Challenge #2 - Fix Rate

The [Fix Rate](https://fsd.servicemax.com/2015/04/13/first-time-fix-rate-field-service-metrics-that-matter/) looks at how successful in-the-field workers are at completing their tasks when they show up to a service call. The goal is to have a high first-time fix rate, where you are able to solve the problem the first time you engage with a customer. According to Aberdeen Group, the “best-in-class” field service organizations achieve an 88% average first-time fix rate, with “average” organizations achieving 80% and “laggards” coming in at 63%. This should be a top priority for field service organizations as improving this metric can result in improvements to the company bottom line.
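
To make the metric concrete, first-time fix rate is the share of service calls resolved on the first visit. The sketch below assumes a hypothetical job record with a `visits` count; real FSM software will have its own data model.

```js
// jobs: array of hypothetical records such as { id: 'A-1', visits: 1 },
// where `visits` is how many visits it took to resolve the job.
function firstTimeFixRate(jobs) {
  if (jobs.length === 0) return 0;
  const fixedFirstTime = jobs.filter((job) => job.visits === 1).length;
  // Multiply before dividing to keep the percentage exact for whole-number results.
  return (fixedFirstTime * 100) / jobs.length;
}

const jobs = [
  { id: 'A-1', visits: 1 },
  { id: 'A-2', visits: 1 },
  { id: 'A-3', visits: 2 },
  { id: 'A-4', visits: 1 },
  { id: 'A-5', visits: 1 },
];
console.log(firstTimeFixRate(jobs)); // 80
```

In this made-up sample, four of five jobs were fixed on the first visit, which matches the 80% that Aberdeen Group reports for "average" organizations.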

One of the leading causes of a poor first-time fix rate is a lack of education or skills to complete the task at hand. When an in-the-field worker is confused about the task they have to complete, it can result in rebooking the service or spending too much time at a particular jobsite. Either way, lack of skills can be a real challenge when trying to improve a company's first-time fix rate. Organizations should look to better educate their employees, but they should also explore how they can avoid rework on the spot when inevitable problems come up for in-the-field workers.

## Field Service Management Needs Innovation

Customers need to be front of mind when completing field service work; however, organizations also need to be mindful of the employees completing the work. One way of doing this is making sure organizations offer tools that enable workers to complete their job as efficiently and effectively as possible. There is a myriad of FSM software options in the market, but despite these tools there is still a lot of innovation needed to make sure these solutions actually meet the needs of in-the-field workers. [Emergence Capital conducted a survey](https://www.prnewswire.com/news-releases/the-state-of-technology-for-the-deskless-workforce-2020-301192698.html) called “The State of Technology for the Deskless Workforce 2020”. There are two key insights I want to highlight: A) 60% of deskless workers believe there is room for improvement in the technology they use to perform their jobs. B) 75% of deskless workers spend most of their time using a piece of technology that isn’t crafted for their use case.

## 3 Main Benefits of Voice Technology in Field Service Management

Voice technology is a perfect feature to improve existing FSM software. Since voice technology can easily be embedded into existing mobile applications, product teams can create Natural User Interfaces that take advantage of normal mobile features like swipe or type, but also allow for voice input where it makes sense for the end user. You can see the value of having voice input alongside a mobile app in the demo video below:

<YouTube videoId="Gis1qGtclsY" />

There are three main benefits that can be realized with an elegant integration of voice technology in existing FSM software. The first benefit comes from the efficiency of voice as an input vs typing. Smartphone speech recognition is 3x as fast as human typing; however, the real efficiency with voice technology comes from eliminating the need for a conversational experience as seen with [popular voice assistants](https://www.speechly.com/blog/real-time-voice-user-interfaces/) like Alexa or Siri. Forcing a conversational experience on the end user eliminates the value that comes from having a mobile screen at our fingertips. The efficiency of voice tech can be unlocked by understanding where there is actual [efficiency with using voice](https://www.speechly.com/blog/advantages-of-voice-user-interfaces/) as an input vs just tapping something on your screen.

The second benefit with voice technology plays off of the value from efficiency. This benefit is improving the quality of data you are capturing. While many different pieces of data can be tracked automatically, there are still different data points that will need to be updated manually by in-the-field workers. To expand on the insights from the Emergence Deskless Workforce research mentioned above, the leading cause of dissatisfaction stems from inefficiencies with the technology workers use. If workers are not enthusiastic or engaged with the software they use for data collection you are destined to get inaccurate data. There is a big incentive for workers to find technology that actually makes them more efficient and voice technology provides this incentive if implemented properly. This mix of efficiency and engaged users leads to a recipe for accurate data collection.

The third and final benefit focuses on improving the first-time fix rate using voice technology. Organizations with an in-the-field workforce are ultimately going to be performing some type of in person service call, maintenance, check-in, etc. Voice technology enables companies to have an additional modality to access company documentation, instruction manuals, best practices, or any other resource of interest that might help an in-the-field worker complete their task the first time they show up at a service call. Voice technology allows workers to use natural language to both find answers to questions efficiently and also take advantage of the mobile screen for instances where the worker may need step-by-step instruction or visuals to reference.


Voice Technology alongside existing Field Service Management (FSM) mobile solutions can help solve common problems found in FSM

Using Voice Technology in Field Service Management


Voice technology is moving fast and old insights go stale quickly. Here we have collected some of the freshest and most interesting insights from various voice-related surveys and research.

The pandemic has certainly affected voice tech, too, and germ-free is quickly becoming more important than overall hands-free.

![Reasons for using voice](/uploads/reasons-for-using-voice.png)

## 2020

Adobe published their **2020 Voice Survey** which contains a lot of interesting data points about voice tech. You can find the full report [here](https://xd.adobe.com/ideas/principles/emerging-technology/voice-technologys-role-in-rapidly-changing-world/). Some of the most interesting insights include:

- 31% of voice users count sanitation as a benefit of voice tech

- 37% would use a voice user interface to check their bank balance

- 29% would use voice to book a doctor's appointment (in fact, we have created a configuration for such a use case. You can see a demo [here](https://www.youtube.com/watch?v=tSi7vJuIyT0))

- 28% would request grocery delivery by voice (you can read more about voice and grocery shopping [here](https://www.speechly.com/blog/grocery-ecommerce-user-experience/))

- 18% of users currently use voice with health and fitness apps

- 86% say voice tech could make business/events more sanitary. This figure has increased significantly due to the pandemic.

- As you can imagine, home automation is a common use case for voice: 56% of users would open a door with voice, 55% would use elevator controls and 49% vending machine controls. We have built our own home automation demo, too, and you can try it out [here](https://home-automation-app-demo.herokuapp.com/)

- Only one in four use voice beyond simple searches (52% for maps/driving, 51% for text and chat and 46% for music control). This might partly have to do with the fact that voice search is not available in most applications and websites. Read about our approach to voice search [here](/blog/voice-search/)

- 57% say that a better understanding of voice tech would cause them to use voice more. We see voice still having a kind of chicken-and-egg problem, where efficient and natural user experiences are still rare and hence not many developers think of adding voice features to their apps. Once we start seeing the first examples that work really well, voice will quickly become a dominant user interaction modality

- 62% feel awkward using voice with others present. This is why voice-only solutions do not work. The user should have the ability to switch between voice and other interaction modalities seamlessly.

- Skill discovery is still a problem with legacy voice platforms, and the number of users who do not know how to complete a task with voice tech went up 14% year-over-year. Onboarding users to voice is very important

- 49% of respondents to the Adobe survey predict that by 2025 voice will be better able to suit their needs

eMarketer released some interesting data points. In their [research](https://www.emarketer.com/content/voice-assistant-use-reaches-critical-mass) the key takeaway is that voice assistants have reached critical mass:

- There are 128 million monthly active users of voice assistants in the US. This amounts to almost 40% of all US internet users and one third of the overall population

- Smartphones and smart speakers are the most popular voice assistant devices

In the report, eMarketer mentions:

> Over time, we expect the number of voice-assistant users to rise even further as the software finds its way into more devices, including cars, wearables, smart TVs, appliances and other connected gadgets.

Supporting eMarketer's research, [Gartner's Intelligence Report: Optimizing Voice Search and Features for Mobile Commerce](https://emtemp.gcom.cloud/ngw/globalassets/en/marketing/documents/intelligence-report-optimizing-voice-search-and-features-for-mobile-commerce-excerpt.pdf) finds that:

- 26% use smart speakers and voice assistants at least once a week

- 39% of the US population uses voice features on their smartphones

- But still, only 15% of brands analyzed provide voice search on mobile apps. Based on our research, this has to do with the fact that there haven't previously been good developer tools available for such experiences.

- While mobile commerce and voice is a match made in heaven, 90% of consumers do not use a smart speaker to shop. This probably has to do with problems in product discovery with voice-only assistants

- Hands-free is obviously one great benefit of voice tech, and 32% of respondents reported being interested in hands-free voice tech.

SoundHound conducted [research on the business value of voice assistants](https://voices.soundhound.com/pdfs/voice-AI-research-report.pdf). Some interesting facts in that research include:

- 84% of respondents are deploying voice assistants to mobile apps, compared to 54% on smart speakers. Voice is moving away from dedicated devices and will rather become a part of all digital experiences.

- The most common business functions to offer voice solutions are customer service (81%), sales (52%), store operations and marketing (38%). IT and HR have the lowest voice adoption with 6% and 3% respectively.

- Perceived end-user benefits of voice include better customer experience, customer satisfaction, hands-free access and speed. You can read about the advantages of voice user interfaces in [this article](/blog/advantages-of-voice-user-interfaces/)

![End user benefits for voice](/uploads/end-user-benefits-for-voice.png)

- [Statista](https://www.statista.com/statistics/1134244/barriers-to-voice-technology-adoption-worldwide/) researched barriers to voice adoption. It's interesting to see that up to 73% of respondents see accuracy as the biggest barrier to adopting voice tech. 38% of respondents see complexity of deployment and integration as a barrier.

Speechly can help on both fronts. Without customization, our speech recognition technology is [on par with the leading voice providers](https://www.youtube.com/watch?v=1hcdCrFl-MQ&ab_channel=Speechly), and by customizing the models for the specific acoustic environment or vocabulary, we can get even better results.

![Barriers for voice adoption](/uploads/barriers-for-voice-adoption.png)

- Another [study by Statista](https://www.statista.com/statistics/973815/worldwide-digital-voice-assistant-in-use/) mentions that there are 4.2 billion voice assistant devices in use worldwide.

- [Gartner](https://www.gartner.com/en/newsroom/press-releases/2019-01-09-gartner-predicts-25-percent-of-digital-workers-will-u#:~:text=By%202021%2C%20Gartner%2C%20Inc.,than%202%20percent%20in%202019.) predicts that one fourth of digital workers will use a virtual employee assistant by 2021.

- [Capgemini](https://www.capgemini.com/ch-en/2020/07/the-day-after-rethinking-business/) mentions that 77% of consumers intend to increase their use of touchless technologies, and that the rate of use of online sales will increase by 30% by the end of 2020. This is further evidence that voice will play a larger role in the post-pandemic world

## Other facts

- In 2019, voice search was worth $2 Billion USD according to [WebFX](https://www.webfx.com/voice-search-content-optimization-service.html#:~:text=Transparent%20voice%20search%20optimization%20services,services%20that%20maximize%20your%20revenue.). Once voice search enters mobile devices and websites, this figure will most probably be a lot bigger

- In fact, already two years ago, one third of the US population was using voice search according to [eMarketer](https://www.emarketer.com/content/us-voice-assistant-users-2019)

- [BrightLocal](https://www.brightlocal.com/research/voice-search-for-local-business-study/) finds that men are 30% more likely than women to use voice search. It might have to do with the fact that voice recognition has had [significant race and gender biases](https://hbr.org/2019/05/voice-recognition-still-has-significant-race-and-gender-biases)

- [Juniper Research estimated in 2018](https://www.juniperresearch.com/press/press-releases/digital-voice-assistants-in-use-to-8-million-2023) that by 2023 there will be 8 billion digital voice assistants in use. The same report finds that voice commerce will reach $80 billion per annum by the same year

- According to consulting powerhouse [PwC](https://www.pwc.com/us/en/services/consulting/library/consumer-intelligence-series/voice-assistants.html), 57% of users use voice monthly rather than typing when searching for something. No wonder, as voice is up to four times faster than typing on a smartphone

- [Narvar](https://see.narvar.com/2018-04-09-GLO-EN-WebContent-eBook-Connecting-with-Shoppers-Consumer-Report_2018-Report-Connecting-with-Shoppers-EN---PDF-LP.html) finds the age group 45-60 to be the fastest growing segment for voice shopping. This is in line with our own research on users of a Speechly-powered grocery shopping application. For that application, the share of users over 40 years old was significantly higher than for legacy shopping experiences without voice functionalities


The freshest and most interesting voice technology insights for 2021

Insights and Statistics on Voice Technology


Finland is home to many highly successful tech companies and startups ranging from gaming companies such as [Rovio](https://www.rovio.com/) and [Supercell](https://www.supercell.com/) to delivery company [Wolt](https://wolt.com/en) or IoT company [UROS](https://uros.com/).

In fact, Finland has been leading Europe in the amount of venture capital raised for two consecutive years. Last year, foreign investments in Finnish startups increased by a whopping 48%.

This is why we are humbled to be selected for the [Top 10 Most Promising Startups in Finland](https://www.talouselama.fi/uutiset/talouselama-valitsi-10-lupaavinta-startup-yritysta-paatimme-katsoa-pitkalle-tulevaisuuteen-katso-tasta-koko-lista/bd08f4d5-3d62-4ecf-b726-d27b818a221f?ref=linkedin:5277) by Talouselämä, the leading business media outlet here in Finland.

Other companies that made the list include:

- Boksi.com - automating influencer marketing
- Infinited Fiber Company - turning trash into premium textile fiber
- Origin by Ocean - saving the environment by refining seaweed
- QHeat - developing sustainable heating solutions
- ReceiptHero - turning old-fashioned receipts digital
- Spacent - creating technology for dynamic workplaces

The three companies that retained their spot on the list are:

- IQM Quantum Computers - building quantum computers, such as Finland's first
- Aiven - open source database infrastructure management
- Varjo - building high-performance XR headsets

Speechly is building developer tools for making applications smarter with voice functionalities. Our unique approach enables real-time visual feedback from the moment the user starts speaking. This allows the user to verify that their input is correctly understood, encouraging them to go on with the voice experience or enabling them to correct themselves in case of an error.

You can read more about the benefits of our technology [here](https://www.speechly.com/blog/advantages-of-voice-user-interfaces/) or see examples of what you can build with Speechly [here](https://www.speechly.com/blog/voice-user-interfaces-examples/).

If you are interested in hearing how our technology can improve your application, [reach out to us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/most-promising-startups-finland).


Talouselämä, a leading Finnish business media outlet, selected ten of the most promising startups in Finland for 2021. Speechly was one of them.

Speechly Selected as One of The Most Promising Startups in Finland 2021


Success in eCommerce requires that your customers not only find what they are looking for but also find products that they didn't know they need. Good search is the key in both scenarios.

## Slow product discovery is one of the top frustrations for ecommerce shoppers

Some customers enter your store knowing exactly what they are looking for, whereas others only have a vague idea of what they want and need some inspiration. In traditional knowledge management terms this is referred to as _findability_ and _discoverability_.

In an e-commerce context, _findability_ refers to the ease of finding a specific product that the customer is looking for, whereas _discoverability_ refers to the ease of finding potential product alternatives that meet a vaguer idea of the need. Why is this important? Because, to simplify, e-commerce shoppers are in one of two modes while shopping, and both need to be catered for in an optimal way.

<table>
  <tr>
    <td className="font-bold">Journey</td>
    <td className="font-bold">User task</td>
    <td className="font-bold">Buying behavior</td>
    <td className="font-bold">Time spent</td>
  </tr>
  <tr>
    <td>Product Discovery</td>
    <td>Look for ideas and alternatives</td>
    <td>Impulsive, opportunistic</td>
    <td>More</td>
  </tr>
  <tr>
    <td>Deal Finding</td>
    <td>Look for best deal (price, availability, delivery time)</td>
    <td>Rational, effective</td>
    <td>Less</td>
  </tr>
</table>

**TABLE:** _Typical journeys that e-commerce shoppers are on while looking for products_.

In Journey 1 (Product Discovery), customers are browsing and want to see relevant options as efficiently as possible. They have not yet made the final buying decision, but might act impulsively. In Journey 2 (Deal Finding), customers have already made the buying decision and are looking for the place where they can get the specific product as cheaply and quickly as possible.

Ensuring that customers find the right products to fit their needs from your catalog as effectively as possible is a top priority for e-commerce merchants and relevant for both Journeys. As a matter of fact, according to studies the slowness of [finding the right product](/blog/voice-search/) is one of the top frustrations for ecommerce shoppers.

_“32% of UK shoppers feel that it takes a long time to find what you want”_

## The dominant product discovery method, product search combined with filters, is a major driver of churn

Search alone is a powerful tool if the user knows exactly what they want (Journey 2), but if they are still considering the options (Journey 1), their search terms are by definition vaguer, resulting in a large, varied set of search results.

So ecommerce merchants have added filters to let users refine the results of these vaguer searches. This is the most common approach to product discovery, and most modern online stores use it. However, [according to Forrester](https://www.forrester.com/report/Googleize+Your+SiteSearch+Experience/-/E-RES124541), a bad product search experience still accounts for 68% of churn, so something must be broken.

Let's go through the most common product search usability problems in touch screen and web applications.

### Product search problem #1: Too many filters

![Screenshot of an eCommerce store with too many search filters](/uploads/image1.png)

Illustration of an online store that has thirty filters

If the product search has too many filters, users will have trouble browsing through them, even if all of them are relevant. In addition, users might look for a filter under a different name than the one the store uses, which means they fail to find the correct filter even though the store has it.

### Product search problem #2: Too many filter values

![Screenshot of an ecommerce store with too many search filter values](/uploads/image2.png)

Illustration of an online store filter with so many values that the filter needs its own search box

Sometimes the search filters themselves have too many values, and selecting the correct one becomes tedious. For example, let's suppose the user can select a color as a search filter. On the one hand, adding all possible colors can help the user find the exact color they are looking for. On the other hand, finding the correct shade of blue becomes difficult.

Hence, there's a balance between adding more filter values for more granular search or removing filter values for simplified search.

### Product search problem #3: Confusing categorization

![Screenshot of confusing search categories](/uploads/image3.png)

Illustration of confusing categories

Sometimes the categories themselves can be confusing because of naming conventions or other issues. In the example, there are too many brands for the search to be effective. Sometimes the problem is synonyms and naming.

For instance, should a user find USB memory sticks under category "Storage", "Memory", "Accessories" or maybe "Other"?

## Voice search and filtering can radically improve the product discovery experience

There must be a better way. What if instead of pointing, clicking and writing, **users could just say what they are looking for**? And immediately see the results updated in real time in front of their eyes. And what if correcting or changing your mind would be easy and smooth? Here’s a demo of how that can work in practice:

<YouTube videoId="xI68NT8D1m8" />

Using voice to complement visual product browsing makes discovery significantly [easier, faster and more enjoyable](/blog/improve-workforce-efficiency-voice-uis/). Users get less frustrated and browse through more products. Voice powered product discovery and search experience can help e-commerce players improve their conversion and retention, and differentiate from their competitors.

One clear benefit of voice search is that it supports synonyms and different ways of expressing the same thing naturally and intuitively. There's no need to give a single name for a category, but the same category can have many names.

If the user is looking for "pants", but the designer has named this category "trousers", voice search can support both names. A typical graphical user interface can't support both, because this would lead to our problem #2: too many filter values.
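To make the synonym idea concrete, here is a minimal sketch of how several spoken names can resolve to one catalog category. The synonym table, the category names and the `resolve_category` helper are all hypothetical and only illustrate the principle; a production voice search would typically learn these variations from example utterances rather than from a hand-written table.

```python
# Hypothetical synonym table: each canonical category lists the names users might say.
CATEGORY_SYNONYMS = {
    "trousers": ["trousers", "pants", "slacks"],
    "sneakers": ["sneakers", "trainers", "running shoes"],
}


def build_lookup(synonyms):
    """Invert the table so every spoken variant points to its canonical category."""
    lookup = {}
    for canonical, variants in synonyms.items():
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


LOOKUP = build_lookup(CATEGORY_SYNONYMS)


def resolve_category(spoken_term):
    """Return the canonical category for a spoken term, or None if it is unknown."""
    return LOOKUP.get(spoken_term.strip().lower())


if __name__ == "__main__":
    for term in ["pants", "Trousers", "running shoes", "hats"]:
        print(term, "->", resolve_category(term))
```

The visual interface still shows a single "trousers" filter, while the voice layer accepts any of the variants, so the filter list does not grow.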

If you want to learn more about voice-enabled product search and filtering, contact us! You can find more cool examples of voice user interfaces [here](/blog/voice-user-interfaces-examples/).


Designing voice-first applications requires new approaches to UX and UI design. In this post, we'll go through some best practices for designing voice-driven user interfaces.

How e-Commerce Players Can Improve Product Discovery with On-site Voice Search


When companies integrate voice technology, most default to thinking of generalized Voice Assistant platforms like Amazon Alexa or Google Assistant. This mindshare has been earned by Big Tech and the innovation they have pushed forward in voice technology. However, a lot has changed since Siri was announced in [2011](https://www.apple.com/newsroom/2011/10/04Apple-Launches-iPhone-4S-iOS-5-iCloud/) or Alexa in [2014](https://techcrunch.com/2014/11/06/amazon-echo/).

A major evolution is the alternative options businesses have to integrate [voice technology](/categories/voice-tech/), outside of the Big Tech providers. The purpose of this post is to understand the true value of domain ownership when integrating voice technology. In order to do this we need to first understand some of the risks associated with building voice technology alongside Big Tech and two optimal alternatives for integrating voice technology.

## Risk to Build

There are risks that need to be paid attention to when assessing the value of [building a voice experience](/blog/advantages-of-voice-user-interfaces/) using a legacy technology [Voice Assistant platform](/blog/why-smart-speakers-are-not-the-future-of-voice/) like Amazon Alexa or Google Assistant. The obvious threat that comes to mind is data. Handing over relevant customer data to the largest companies in the world with nearly infinite resources should be enough reason to raise caution.

Companies should also consider the control of their experience with their end customer. Generalized Assistant platforms offer voice technology to developers, so long as they are willing to operate within their specified domains of interest like smart speakers and smart home automation. While building out these new platforms might be a top priority for a company like Amazon, that does not mean they are the best place to build a valuable voice experience for your customer. Product teams should ultimately decide where voice technology fits within existing digital experiences to solve an end user problem.

A company’s brand also deserves attention. Generalized Voice Assistant platforms put a gatekeeper in front of your brand in the form of a [wake word](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary/). It is challenging to build brand awareness when you have to insert another company's name before getting to your brand's experience. There is also long-term risk in developing a new user behavior for your product and potentially associating that product with a Big Tech company.

The risks mentioned above should be considered by companies looking to integrate voice technology, from startups to Fortune 500 brands. While the underlying voice technology large internet companies have brought to market and popularized is truly innovative, the reality is there are other providers outside of these players that companies should be aware of and consider.

## Alternatives

There are a few pieces of core technology that power the voice experiences people are familiar with today, such as Speech-To-Text (STT) for quickly speaking text messages or Natural Language Understanding (NLU) that tries to actually understand the intent of what a user is saying. The two most relevant alternatives to building a voice experience on a Generalized Voice Assistant platform are Independent or “Owned Assistants” and Voice Manipulated User Interfaces or “Voice UI”. These alternatives leverage a handful of different core voice technologies, such as STT and NLU mentioned above.

Owned Assistants are similar to generalized assistants in terms of the end user experience. They provide a conversational experience, but the assistant lives within an existing mobile application or website. This can be a good alternative for brands who want a conversational experience without the risks mentioned above. However, there are natural complexities with approaching voice technology in a “conversational” manner.

The other alternative to a generalized assistant is a Voice UI. With a Voice UI, voice input can be used as an additional option to manipulate the graphical elements of a UI just like clicking, tapping, swiping and typing. Looking at voice technology from this perspective, you can think about where voice as an input can be a tool in solving end user problems without ignoring the other benefits of a rich UI.

<YouTube videoId="xI68NT8D1m8" />

The goal of this post is not to debate Owned Assistants vs Voice UIs, but rather to focus on the value of domain ownership when integrating voice technology.

## Domain Ownership Value

**Brand Control**

Using a wake word to launch a voice assistant was popularized by Apple’s Siri. One could argue simple access to voice control via a wake word has been one of the key drivers in voice technology adoption across the globe. However, just because something has always been done a certain way does not mean it has to be done that way forever. Integrating voice into your existing domains allows you to control your brand messaging. Some scenarios may call for a branded wake word while others may benefit from voice input alongside a tap of a screen or touch of a button. Companies should think carefully about what it means to train their end users to say a Big Tech brand's wake word while building out new user behavior with voice technology.

**Experience Control**

If you look at existing digital domains like a website or mobile application as a starting point for integrating voice technology, it gives you the ability to better control the end user experience. We are still in the early innings as it relates to voice as an input for technology. Sure, we are familiar with Interactive Voice Response (IVR) call center experiences of “Say 1 for X”, but the underlying technology powering voice experiences today has come a long way. Pair this innovation in voice technology with a rich UI and it makes for a perfect relationship.

Innovation in far-field voice technology, specifically with the announcement of Alexa in 2014, pulled general attention away from designing best practices for voice as an input in websites and mobile applications and more towards figuring out smart speakers. There is a real opportunity for companies to reset their outlook on voice technology and realize the power of voice within existing digital domains. These domains provide a perfect opportunity for companies to establish best practices for solving end user problems with a Voice UI or Owned Assistant while also giving them the ability to fully control the evolution of their end users' experience with voice technology.

**Data Control**

Building quality end user experiences with voice technology requires a lot of data. Handing over valuable customer data to take advantage of an emerging platform can be a tough decision to make. Integrating voice technology into existing domains allows companies to maintain control of their valuable customer data, while still being able to take advantage of the innovation in voice technology.

This applies to an initial integration of voice technology, but also is important when thinking into the future. It is hard to guess the trajectory of any emerging technology or user behavior. However, you have a better chance of iterating and using the technology to build a valuable experience if you have full access to the usage data. Business interests in creating valuable end user experiences and Big Tech’s interest in solving Generalized AI are at odds when it comes to this valuable user data. The only way to get truly raw access to this end user data is by integrating voice technology within an existing digital domain.

Speechly is paving the way in giving product teams the ability to [unlock end user value](/blog/improve-workforce-efficiency-voice-uis/) with Voice UIs. If you are interested in learning more please [reach out to us](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/value-of-domain-ownership-voice-technology).


Learn the true value of domain ownership when using voice technologies

Value of Domain Ownership With Voice Technology


There are a few similarities that arise across useful Multi-Modal Voice Interface use cases. As a product builder or team member, being able to identify these scenarios can lead to lucrative opportunities to give users a 10x experience using Spoken Language technology and plant the seed to more sophisticated experiences in the future. In this blog post I am going to discuss the value of Multi-Modal Voice Interfaces, the Scenarios where they thrive, and give a few Examples of these interfaces in action.

## Value of Multi-Modal Voice Interfaces

**Real-Time**

The most valuable aspect of a Multi-Modal Voice Interface is the fact that it allows for developers to truly leverage the power of real-time Spoken Language Understanding ([SLU](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary/)) technology. At a high level, this works by streaming Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) simultaneously, something that is usually done in a one-after-the-other fashion in standard voice interfaces or conversational experiences. This one-after-the-other process tends to encourage shorter user utterances, like common smart speaker Voice Assistant commands, since users are unaware of whether or not the Voice AI is understanding them.

[Streaming SLU](https://www.speechly.com/blog/real-time-voice-user-interfaces/) alongside a screen allows for developers to give visual cues to a user, much like the way we communicate with each other on a daily basis. Whether in person or on a Zoom call, humans give different visual cues to signal to a friend or colleague a whole array of different meanings based on the conversation. For example, when giving a demo or pitch of a new product a presenter is always looking for a head nod from the crowd. That head nod gives the presenter valuable information on if their product or service is actually relevant to that target group of users. Streaming SLU gives Multi-Modal Voice Interfaces that head nod. By understanding a user in real time, a developer can give the user different visual cues to let the user know they are being understood.
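The difference between streaming and one-after-the-other processing can be sketched in a few lines. In the toy example below, a keyword lookup stands in for a real SLU model, and `VOCABULARY` and `understand_incrementally` are hypothetical names, not part of any Speechly API. The point is only that the understanding is updated after every word, so the interface has something to show (the "head nod") long before the utterance ends.

```python
# Toy vocabulary standing in for a trained SLU model (hypothetical values).
VOCABULARY = {
    "blue": ("color", "blue"),
    "red": ("color", "red"),
    "jeans": ("category", "jeans"),
    "jackets": ("category", "jackets"),
}


def understand_incrementally(words):
    """Yield (word, entities_so_far) after every incoming word.

    A turn-based system would wait for the whole utterance and answer once;
    here each word immediately updates the state a UI could render.
    """
    entities = {}
    for word in words:
        match = VOCABULARY.get(word.lower())
        if match is not None:
            entity_type, value = match
            entities[entity_type] = value  # a later value overrides an earlier one
        yield word, dict(entities)


if __name__ == "__main__":
    utterance = "show me blue no red jeans".split()
    for word, entities in understand_incrementally(utterance):
        print(f"{word:>6} -> {entities}")
```

Because a later value replaces an earlier one, the user can correct themselves mid-sentence ("blue, no, red") and watch the visual state follow along.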

**Use Voice, Swipe or Type - Whatever is Right**

Voice alone as an input can be a fantastic experience in certain situations like home automation, asking basic questions, or starting timers. However, the addition of a screen alongside Spoken Language technology allows for voice, swipe, and type inputs to thrive where they make the most sense for the end user. The reality is forcing a voice input only, or conversational experience, can be stressful for a user that is not used to that type of experience.

It is better to approach products from the perspective of how you can best solve a user or market problem. For some problems, a conversational experience might make sense. However, some experiences simply cannot be forced into a solely conversational mold. Using Spoken Language to control technology can be a fantastic way to supplement swipe and typing with an additional input method that is [3x as efficient](https://news.stanford.edu/2016/08/24/stanford-study-speech-recognition-faster-texting/).

## Ideal Scenarios for Multi-Modal Interfaces

**Information Heavy Tasks**

Scenarios that require users to engage in complex searches or repetitive inputs into a system can be a great opportunity for a Multi-Modal Voice Interface. There are a few reasons this can be a valuable scenario. Although ASR has achieved near [human parity](https://arxiv.org/abs/1610.05256), creating enjoyable end user products that use voice interfaces can be a challenge due to the different complexities that come with language in different contexts. The best way to ensure a good Multi-Modal Voice Interface Experience is to have as much contextual data as possible to give to the SLU model.

This is also important for the end user experience. Although spoken language is a great input for technology, most users are not familiar with the voice modality for everyday experiences outside of basic Voice Assistant controls. If a user already understands the context and jargon around a particular process, the unfamiliarity of the voice modality is easier to overcome within that familiar experience.

**Speed & Value**

I have already discussed how the voice modality can be 3x more efficient than typing on a mobile phone. However, speed alone is not a good reason to build a new product. Speed is important when you can attribute it to actual end user or business value. The reality is many businesses have Omnichannel experiences with complex customer journeys and employee responsibilities. I will explore further examples of this later in the post. This reality provides ample opportunity to assess where a Multi-Modal Voice Interface might be a good fit across an organization.

**Existing Digital Experience**

I understand a certain percentage of people who read this post will assume that “Multi-Modal Voice Interface” refers to a voice-enabled device with a screen from a company like Amazon or Google. Multi-Modal Voice Interfaces can apply to contexts outside of the smart speaker Voice Assistant ecosystems. As the point above about speed suggests, businesses should look inward when assessing where to implement Spoken Language technology as opposed to outward at unproven emerging platforms. Existing digital experiences are a better way for you to control the user voice experience and plant the seed for more sophisticated experiences down the road.

It’s not hard to understand why having full control of your brand, user experience, and data would be valuable when building a completely new way for users to interface with your company or product. The reality is, best practices are still being defined across different sectors that are applying Spoken Language technology. Giving product teams full control of brand, user experience, and data allows them to plant the seed with Spoken Language technology and iterate over time to build the best user experience. We can speculate over best practices, but the reality is we have not started to scratch the surface on what is possible with modern day Multi-Modal Voice Interfaces that leverage SLU. This provides a real opportunity to define the future on what user experience looks like with Spoken Language technology.

## Use Case Examples

I believe that the three use cases that I discuss below are great examples that check the box for each of the 3 Scenarios mentioned above.

**Voice Commerce Search, Filtering and Purchasing**

E-Commerce has completely changed the way we buy things. With the COVID-19 pandemic, E-Commerce growth accelerated up to [6 years](https://www.forbes.com/sites/johnkoetsier/2020/06/12/covid-19-accelerated-e-commerce-growth-4-to-6-years/?sh=325b2141600f). Despite this growth, E-Commerce product search, filtering and purchasing is outdated. Users are required to search and filter by inputting tedious amounts of data into complex menu hierarchies. This scenario is a perfect example of how [Natural Language Voice Search](https://www.youtube.com/watch?v=xI68NT8D1m8&feature=emb_logo) could provide an efficient experience that results in both user value and business value. The user can use their voice to find items more efficiently resulting in less churn and more items being added. This correlates to a direct benefit for the business. It's a win-win making E-Commerce a great place to integrate a [Multi-Modal Voice Interface](https://www.speechly.com/blog/bring-multimodality-voice-commerce/).

**Workforce Efficiency**

Everybody knows the value of data in business today. For this reason, there is a lot of attention spent on how to acquire the most accurate data in the most efficient way possible. In certain professions, such as healthcare, [real estate](https://www.speechly.com/blog/improve-workforce-efficiency-voice-uis/), finance or law, there are a lot of legal and paperwork requirements that come with the day-to-day operations of the business. Multi-Modal Voice Interfaces are perfect in these scenarios. Professionals in jobs that require detail-oriented processes to be followed often become intimately familiar with the paperwork, data collection, and data input that is required of them. Being able to complete these processes more efficiently results in a professional being able to book more business, which can benefit both that individual and the business's top-line revenue.

**Voice-Directed Warehousing**

Warehouses are an ideal place to implement Spoken Language technology leveraging devices like a phone, tablet, or screen on a piece of machinery. [Voice-Directed Warehousing](https://www.speechly.com/blog/voice-picking/) is the process of managing a warehouse worker or machine by using a Multi-Modal Voice Interface. Allowing warehouse employees to use Spoken Language as an interface to these devices allows for more efficient and accurate data capture while creating safer warehouse environments due to the hands-free ability. Workers are more efficient due to the 3x speed of voice as an input. We have seen warehouses that leverage SLU technology like Speechly are able to quickly adapt the ASR to their unique acoustic ecosystems allowing for higher quality data capture. The safe environment is a byproduct of allowing workers to have minimal time in front of a screen and maximum time with their eyes being up and alert. Overall, it's hard to argue with the value of Voice-Directed Warehousing for both employees and business owners alike.

As you can see there are opportunities that exist across industries to start leveraging Multi-Modal Voice Interfaces today to create better user experiences. If you are ready to Plant the Voice Tech seed with your users, [reach out to Speechly](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/ideal-scenarios-multimodal-user-interfaces).


Identify best use cases for multimodal user interfaces and create 10x user experiences

Ideal Scenarios for Multimodal User Interfaces


### In this post

In this post, we'll go through some use cases and examples for natural voice user interfaces in various domains and user tasks.

Not many apps or websites employ a voice user interface yet because of a lack of developer tools for building them. These examples use Speechly Spoken Language Understanding technology for a natural voice UI, enhancing the touch screen user experience with voice functionalities.

- Form filling with voice
- Voice in eCommerce and search filtering with voice
- Adding items from a big inventory, such as grocery eCommerce
- Professional applications
- Information heavy data input
- Voice in VR/AR and gaming
- Web applications with voice user interfaces
- Speechly’s speech recognition accuracy

Speechly offers a unique tool for building [real-time multimodal voice user interfaces](/blog/what-is-voice-user-interface/). Our technology can be applied to any industry or domain to enhance current touch user interfaces with voice functionalities.

Speechly offers a Spoken Language Understanding API that returns user intents and entities in real time for user voice input. This approach enables end users to see the result of their voice commands visually as they speak instead of the traditional smart speaker paradigm that is based on turn-based queries.

Real-time visual feedback is the key to efficient and intuitive user interfaces, because it allows users to multitask. Instead of saying something and waiting for the answer, end users can speak in a stream of consciousness fashion, correcting themselves if needed. At the same time, the visual feedback encourages users to go on with the voice experience.

We have collected some examples of user tasks that can be solved more effectively with our voice technology. In short, a voice user interface works great if:

- Your users know what they want to achieve
- Data quality is important
- User tasks are repetitive

Let's see our examples.

### Form filling by voice

Voice is a great solution for information heavy, repetitive tasks such as [form filling](/blog/turn-any-web-form-into-a-voice-form/). Filling forms on a mobile device can be cumbersome because of difficult typing and common usability issues on different screen sizes and mobile browsers.

In our demo, we enhanced an existing HTML form with voice functionalities. The form can be manipulated by using touch and voice simultaneously. The end user can use natural language to fill the form and gets instantaneous feedback on the form.

By seeing the form, the user knows exactly what kind of questions they need to answer, and they can fill the form in any order and by using any interaction modality.
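As a rough sketch of the idea, the snippet below applies recognized entities to form fields in whatever order they arrive. The field names and the `(entity_type, value)` pairs are hypothetical stand-ins for whatever an SLU service would actually return; this is not the code behind the demo.

```python
# Hypothetical form fields; entity types are assumed to share these names.
FORM_FIELDS = ("name", "email", "phone")


def apply_entities(form, entities):
    """Return a copy of the form with recognized entities written into matching fields.

    Entities may arrive in any order, and a later entity for the same field
    overwrites the earlier value, which is how a spoken correction would land.
    """
    updated = dict(form)
    for entity_type, value in entities:
        if entity_type in FORM_FIELDS:
            updated[entity_type] = value
    return updated


if __name__ == "__main__":
    form = {field: "" for field in FORM_FIELDS}
    # The user types their name by hand, then speaks the rest in one go.
    form["name"] = "Jane Doe"
    spoken = [
        ("phone", "555 0100"),
        ("email", "jane@example.com"),
        ("phone", "555 0199"),   # the user corrects the phone number
        ("weather", "sunny"),    # unrelated entity, ignored
    ]
    print(apply_entities(form, spoken))
```

Typed and spoken input end up in the same form state, which is what lets the user mix touch and voice freely.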

<YouTube videoId="XWqHV1a32LM" />

### eCommerce search filtering

[Search](/blog/voice-search/) is one of the most important parts of an eCommerce customer experience. Up to 30% of eCommerce visitors will use search for navigating, and a user who doesn't find what they are looking for is a lost customer.

A major share of Google searches are already done by using voice, but very few eCommerce sites offer a similar experience.

Speechly makes it simple to add voice functionalities to eCommerce stores. Again, the user can use natural language to search for products and, unlike traditional categories, voice categories naturally support synonyms. No matter if the user asks for pants or trousers, they find what they are looking for.

It's also important that the user interface updates in real time. This enables users to correct themselves in case of an error and encourages them to go on with the voice experience.

<YouTube videoId="xI68NT8D1m8" />

### Grocery shopping

[Grocery shopping](/blog/grocery-ecommerce-user-experience/) is a special kind of shopping experience, because the user wants to add a lot of familiar products from a large inventory to their shopping cart as easily as possible.

The traditional user experience requires a lot of repeated searches and selections, but voice enables the user to just say out loud the products they want and see them added to their cart. If they need to change a certain product, for example by swapping one brand of milk for another, they can do it easily by just clicking the product.

<YouTube videoId="yzdSVV4xjb0" />

### Professional applications

Professional applications are a great use case for voice functionalities, because the language used in these settings is accurate and commonly shared by everyone.

Speechly can be used to create an efficient user experience for [professional applications](/blog/improve-workforce-efficiency-voice-uis/) in many industries and domains. In this example, airline maintenance workers can easily report anomalies and defects in an airplane cabin.

You can also read about our offering for [warehouse professionals and logistics](/blog/voice-picking/).

<YouTube videoId="kIjR-TWatFI" />

### Voice in VR/AR

Virtual reality environments can offer a very immersive experience that can showcase for instance real estate locations easily and accurately, even amidst pandemic situations.

However, the first-time user experience in these environments suffers from clunky hand controllers that are unintuitive and hard to use. Learning these controllers takes time away from the actual experience.

Voice, on the other hand, is a very intuitive and natural way to interact in a virtual reality environment. Speechly created a virtual reality environment with our partner ZOAN that improves the first time user experience significantly.

<YouTube videoId="QIsld57q1cw" />

### Information heavy data input

Speechly can be used to [improve form filling](/blog/turn-any-web-form-into-a-voice-form/) when efficiency and data quality are important.

The following demo showcases a CRM task in which a sales professional can input sales data by using voice. This leads to better data quality and improved data collection.

CRM is a great example of how voice can improve data input. The quality of the data is very important and data input is done in a repetitive way. Similar examples include health apps such as meal tracking and fitness tracking and other professional applications.

<YouTube videoId="6GcgPcMOuQk" />

### Web applications with voice UIs

Unlike [most other solutions](/blog/web-speech-api-alternative/), Speechly is supported by all modern browsers and can be used to create awesome voice experiences for web.

In our demo application, we created a simple photo editing application that is controlled solely by voice. It supports natural language and the user can see the effects being applied to the image in near real time.

<YouTube videoId="EvKWlOlwLHY" />

### Speech recognition accuracy

Speechly is not optimized for pure speech recognition. Our models are configured for a certain use case and we use this configuration to bias the speech recognition model. This helps improve the accuracy.

However, our speech recognition accuracy is still on par with general purpose speech recognition software such as Google Cloud Speech.
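One generic way to picture use-case biasing is rescoring: the recognizer proposes several candidate transcripts, and candidates containing words from the configured domain get a small bonus. The sketch below illustrates that general idea only; it is an assumption for illustration and not a description of how Speechly's models work internally. The scores, vocabulary and `bonus` value are made up.

```python
def pick_biased_hypothesis(hypotheses, domain_vocabulary, bonus=0.5):
    """Pick the best transcript after rewarding in-domain words.

    hypotheses: list of (text, score) pairs, where a higher score is better.
    domain_vocabulary: set of lowercase words expected in this use case.
    bonus: score added for every in-domain word (hypothetical tuning knob).
    """
    def biased_score(item):
        text, score = item
        hits = sum(1 for word in text.lower().split() if word in domain_vocabulary)
        return score + bonus * hits

    return max(hypotheses, key=biased_score)[0]


if __name__ == "__main__":
    # Made-up candidates: the acoustically best guess is not the sensible one.
    candidates = [
        ("add too leaders of milk", -1.0),
        ("add two liters of milk", -1.2),
    ]
    grocery_words = {"two", "liters", "milk"}
    print(pick_biased_hypothesis(candidates, grocery_words))
```

Without the bias the first candidate would win on raw score; with a grocery vocabulary configured, the second one does.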

In this demo video, Speechly and Google's [Web Speech API](/blog/web-speech-api-alternative/) are transcribing Steve Jobs's keynote from the first iPhone release event.

<YouTube videoId="1hcdCrFl-MQ" />

You can try out our general accuracy [here](https://api.speechly.com/dashboard/#/playground/ead4b9e7-e5c4-48ed-9dae-3c530916ed76?language=en-US). Do note that the ASR accuracy improves significantly when the models are configured for your use case.

## Conclusions

Voice GUIs can be used to improve user experience in a wide variety of applications and domains. If you want to hear how your application's user experience can be improved with modern voice technologies, submit your email and we'll contact you as soon as possible.

If you are still not convinced, here's what our customers think of working with us. You can also read more about the [advantages of voice user interfaces](/blog/advantages-of-voice-user-interfaces/).

<YouTube videoId="etvskJy3hqw" />


Reactive voice user interfaces enable intuitive and efficient experiences that improve key metrics

Examples of Natural Voice User Interfaces


[Voice user interfaces](https://www.speechly.com/blog/what-is-voice-user-interface/) allow users to interact with a computer system or application by using voice and speech commands. Voice user interfaces make use of speech recognition and natural language understanding technologies.

The obvious advantage of a voice user interface is that it gives users a hands-free, distraction-free way to use an application while still focusing most of their attention on another task. However, that's not the only or even the main advantage of a [well-designed voice user interface](https://www.speechly.com/blog/voice-application-design-guide/).

The main advantages of voice UIs include:

### Speed

According to a Stanford study, speaking is about three times faster than typing on a touch screen device. This makes voice a great input method for information heavy tasks, such as [filling complex forms](https://www.speechly.com/blog/turn-any-web-form-into-a-voice-form/) and [searching from a large inventory of items](https://www.speechly.com/blog/voice-search/).

### Intuitiveness

Even after using dozens of different email clients, finding certain rarely used features such as vacation responder or signature will be somewhat difficult on a new system. The user knows that the feature is somewhere, but it's impossible to know where it is before browsing through many different menus and options.

Voice, on the other hand, is very different. The user can just say something like "change my signature" and they'll find the setting they are looking for immediately. Many cars already benefit from these kinds of voice features and [over a third](https://www.statista.com/statistics/957808/us-consumers-using-automotive-voice-assistants-by-frequency/) of US driver's license holders use these features monthly.

### Flexibility

Voice user interfaces can support many ways of expressing the same thing. Let's get back to the vacation responder example mentioned before. The user might call the feature either an "Out of Office message" or a "Vacation responder". If they are looking for a "Vacation responder" in the menus, they might miss the "Out of Office responder" item even if they saw it.

The designer has to decide on a name for the feature and stick with it. Some users will find it the most natural name, while others would prefer the alternative. This is not the case with voice UIs.

A voice user interface can support dozens of synonyms and ways of expressing the same thing. No matter how the user expresses their wish, the user interface will react accordingly.
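
As a rough illustration of the idea, the following Python sketch maps several phrasings to a single feature. The intent name, the phrase table, and the `resolve_intent` helper are hypothetical and exist only for this example. A real Spoken Language Understanding model would learn such variations from training data rather than from a hand-written table.

```python
from typing import Optional

# Hypothetical phrase table: every phrasing maps to one intent name.
SYNONYMS = {
    "open_vacation_responder": [
        "vacation responder",
        "out of office",
        "auto reply",
        "automatic reply",
    ],
}


def resolve_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose phrasing appears in the utterance."""
    text = utterance.lower()
    for intent, phrases in SYNONYMS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None
```

With this table, "Turn on my Out of Office message" and "set up a vacation responder" both resolve to the same intent, which is the behavior a fixed menu label cannot offer.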

### Multi-tasking

Voice UIs enable the user to focus their attention on another task. This is especially useful when driving a car or [a forklift](https://www.speechly.com/blog/voice-picking/) as it improves safety and productivity.

It can also help users multitask inside an application. For example, in gaming, players can change a camera or switch weapons without navigating deep menus.

### Accessibility

While accessibility is essential for those suffering from various impairments, it is beneficial for all of us. Groups of people who can depend on voice features include people with disabilities that make the use of keyboard and mouse impossible, people with chronic conditions such as Repetitive Stress Injuries, who want to limit their use of keyboard and mouse, and people with cognitive disabilities.

Some examples of well-designed voice user interfaces include our [fashion eCommerce demo](https://demos.speechly.com/fashion/) and

## Disadvantages of voice user interfaces

Voice user interfaces, especially when not implemented correctly, have some disadvantages, too. These disadvantages do not prevent the use of voice UIs, but they are something that a product team should be aware of.

### Privacy

People might not be willing to speak in public spaces, either out of consideration for others or for privacy reasons.

Privacy might be an issue also because of the news regarding major tech companies and smart speakers. While this is not an issue with voice UIs as such, it is something that should be taken into account by being as open as possible with how the user data is being handled.

### Personal preferences

Some users may not like talking to a computer or just prefer texting. These preferences can be static or context-dependent. For example, a user might prefer texting over voice when searching for health-related information but prefer voice when searching for hotels.

### Not suitable for all user tasks

While voice can be the fastest and most suitable interaction modality for many user tasks, it's not a silver bullet for all of them. Selecting an item from a short list is probably easiest by touch, and drawing is most certainly easier with a mouse or touch. Voice, on the other hand, is especially great for selecting from a large inventory of items and for inputting information-heavy data such as most forms.

## Multi-modality and combining GUI and VUI

One important goal of a product owner or a designer is to make the product or application as easy and intuitive to use as possible by leveraging all relevant technologies and design methodologies.

Adding voice as one tool in the toolbox can yield good results. Voice should not be added to a product just because it can be done, but rather because it is the best way to solve certain user tasks.

Just like voice-only is rarely the best way to approach a design problem, most often it's not GUI-only either. A touch screen with a voice modality is a great combination for creating [efficient and easy-to-use user interfaces](/blog/improve-workforce-efficiency-voice-uis/).

Most applications can leverage voice modality in some features. Most applications will also need a screen. GUI and VUI should not be seen as alternatives, but rather as enhancements that can improve each other.

One of the biggest problems with smart speakers is the lack of a touch screen. That's why selecting an item on a smart speaker is very cumbersome.

GUIs, on the other hand, have deficiencies of their own. As screen real estate is limited, new features are either hidden behind nested menus or the UI gets cluttered with buttons. And to put it bluntly, [GUIs are not human-compatible](https://www.wired.com/1993/06/1-6-guis/). Even if we have more or less gotten used to them, there's nothing intuitive or easy in many common GUI design patterns such as hamburger menus or double-clicking.

We at Speechly are proponents of efficient user interfaces. We think that user interfaces should be designed to be powerful tools that help users achieve their goals quickly. This is especially important with applications that are used often, and most product owners know that retaining users is the holy grail of any successful application.

If the user knows what they want to achieve, they can most probably say it out loud faster than they can browse through menus and click buttons. Especially so if what they are trying to achieve is information-heavy. Think of something like purchasing weekly [groceries](/blog/grocery-ecommerce-user-experience/): searching and selecting repeatedly is slow compared to just saying out loud all the items you want to add to your shopping cart.

### Full-duplex data processing

Human brains process information in two distinct systems: a visuo-spatial system that's in charge of visual and spatial information and a linguistic system that takes care of speech information.

Because these systems are different, it's rather easy to drive a car and speak at the same time. When doing it, we simply employ both of these systems.

However, it's not possible to do two things at the same time in either of these systems. This is why it's not smart to drive a car and text simultaneously, and why a discussion where two people are talking simultaneously is next to impossible to follow.

A graphical user interface without voice features or a voice-only user interface such as a smart speaker is limited by this. If a user asks something from a smart speaker, they'll have to wait patiently until the smart speaker has finished answering. This is especially cumbersome if the answer is lengthy. If they could ask something and see the answer on their screen, they could immediately start refining the question.

## Examples

Examples of how [multi-modal voice user interfaces](/blog/bring-multimodality-voice-commerce/) leverage the best parts of traditional graphical user interfaces and voice features can be seen below.

**Voice search in eCommerce**

<YouTube videoId="xI68NT8D1m8" />

**Voice forms**

<YouTube videoId="XWqHV1a32LM" />

If you want to try out the fashion demo yourself, you can access the demo [here](https://demos.speechly.com/fashion/).

## Conclusions

Voice can improve user experience and make human-computer interaction more efficient. However, it should not be thought of as an alternative to current graphical user interfaces, but rather as an enhancement to them.

Combining the best parts of graphical user interfaces and voice user interfaces enables efficient, intuitive, and easy-to-use user interfaces while not sacrificing anything from the current user interface.

By using a real-time Spoken Language Understanding API such as Speechly, designers can enhance their current user interfaces with voice functionalities. Speechly can be applied to any industry or domain and with our design guidelines and developer support, teams can build awesome user experiences in a short time.

If you are interested in improving your current applications' user experience, [leave your email address](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/advantages-of-voice-user-interfaces) and our industry specialist will contact you as soon as possible.


Voice has certain advantages over traditional user interfaces. Leveraging these advantages can create unique user experiences in any industry or domain.

Advantages of Voice User Interfaces


Voice picking has been employed for decades. But only recently have technologies such as Speechly’s Spoken Language Understanding enabled intuitive and accurate real-time voice-user interfaces that maximize efficiency with minimal customization and development time.

## What is voice picking?

Voice-directed warehousing (VDW), voice picking, pick by voice, voice-enabled warehouses, and speech-based picking all refer to the same thing. It’s a paperless, hands-free, and eyes-free computer system that employs voice commands for warehouse processes.

Warehouses have been frontrunners in using voice technology to [improve workforce efficiency](https://www.speechly.com/blog/improve-workforce-efficiency-voice-uis/). Voice picking has been used at least since the early 1990s, but recently the technology has matured enough to make other technologies almost obsolete. The market is expected to grow significantly over the coming years, due to decreasing costs and improved accuracy.

Voice-directed warehouses typically use a [multimodal](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary/) voice user interface that can both direct the operator and take commands from the operator using voice. However, many warehouse operators use the voice-user interface only for data input from the operator to the system, and the information from the system to the operator is shown visually on their screen.

Voice-directed warehousing is well-suited for keeping an operator's hands and eyes free, allowing them to focus more on the task at hand.

It can be used in all kinds of storage environments; freezing and noisy environments are not a problem for Speechly’s voice technology. It can be used in warehouses with a large and small number of SKUs alike, and can be adapted to any process.

## How does voice picking work?

In a voice-directed warehouse (VDW), operators are equipped with a device, often a mobile phone, tablet, or a voice-dedicated terminal and headset. Typically, the headset is equipped with noise-canceling features for better performance in loud environments. In addition to voice and touch, the device can also support RFID and barcodes for increased efficiency in certain situations.

Modern voice picking employs speech recognition and natural language understanding technologies for improved accuracy and intuitiveness. Speechly has a unique approach to these technologies by combining these processes into a Spoken Language Understanding system that returns accurate results for voice commands in real time.

When an employee starts their shift, orders are imported from the host system — such as an ERP (Enterprise Resource Planning software) or a WMS (Warehouse Management System) — to the device and then processed. After processing and sequencing, the instructions for what item the operator should pick and where to find it are either spoken out loud (with a text-to-speech system) or shown on the screen.

When the operator is in the correct location, they confirm that they are picking the correct items by checking in to the location. After that, they confirm the products by speaking the product code or another identifier printed on the product. The operator also confirms the quantity they are about to pick. In case of incorrect or inaccurate confirmation, the voice application can correct the operator multi-modally.
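
The confirmation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Speechly's implementation: the `PickTask` fields, the `confirm_pick` helper, and the wording of the correction prompts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PickTask:
    location: str      # check-in code for the shelf location (hypothetical format)
    product_code: str  # identifier printed on the product
    quantity: int      # number of items to pick


def confirm_pick(task: PickTask, location: str, product_code: str, quantity: int) -> List[str]:
    """Compare what the operator said against the task and return correction prompts."""
    corrections = []
    if location != task.location:
        corrections.append(f"Wrong location, go to {task.location}")
    if product_code != task.product_code:
        corrections.append(f"Wrong product, expected {task.product_code}")
    if quantity != task.quantity:
        corrections.append(f"Wrong quantity, pick {task.quantity}")
    return corrections
```

An empty list means the pick is confirmed. Otherwise, each prompt could be spoken through text-to-speech, shown on the screen, or both.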

Depending on the implementation, location information, RFID and other technologies can be used to optimize the route and maximize the efficiency of the operator.

Typical voice commands that operators use include product code strings, quantities, and locations. The operator can also slow down or speed up the voice user interface. A [well-designed multimodal voice user interface](https://www.speechly.com/blog/voice-application-design-guide/) is the key to the highest efficiency.

## Benefits of voice picking

Benefits of using [voice in a warehouse setting](/solutions/logistics) include, but are not limited to:

- **Faster and more efficient picking**
  Picking is the most expensive and labor-intensive warehouse process. It can constitute more than half the cost of a typical distribution center. Voice-user interfaces increase the hourly pick rate.
- **More accurate reporting and better data quality**
  Anomalies — such as broken or missing items — can be reported in real time, resulting in better data quality and cost savings.
- **Safer warehouse environments**
  Safety is a priority in an efficient logistics facility. Hands-free and distraction-free operation of voice-user interfaces reduces injuries and accidents.
- **No need for printing and distributing picking documents on paper**
  Because orders are imported directly from the ERP or WMS to the employees' mobile devices, operators are ready to start picking right after they start their shift.
- **Decreased training time for new employees**
  Unlike traditional barcode and RFID scanners and hard-to-use enterprise software, voice-user interfaces are intuitive and require less than a day of training time for new employees. This can be a great benefit in warehouses with many seasonal employees.
- **Improved efficiency due to operators being able to do two things at once**
  Voice-directed warehouses enable operators to spend up to 95% of their work time picking, rather than reporting and searching for documents.
- **Improved customer satisfaction due to no incorrect shipping**
  Incorrect shipping is costly and reduces customer satisfaction. With voice picking, mistakes are massively reduced.
- **Effectiveness in cold environments**
  Traditional user interfaces are hard to use in cold storage and in environments that require operators to wear gloves.
- **Happier employees**
  Simplified operations lead to happy, productive employees and decreased employee turnover.

Unlike some older voice systems currently employed in warehouses, [voice user interfaces](/solutions/logistics) built with Speechly require no per-user training of the speech recognition models. The model is trained once, and it will work for all old and new employees.

All interactions between the system and the operator can be tracked. This enables management to track progress in real time and provides an audit trail for resolving anomalies.

Voice picking can be easily integrated into any WMS and ERP with productivity increases of up to 40%. Because of easy implementation and major productivity increases, typical voice projects in warehouses have a relatively short payback period of about 6 to 12 months. Due to improved data quality, it enables warehouse management to track and analyze progress and reallocate resources in almost real time.

The technology doesn’t have to be limited to just picking, though — voice can be used in most other warehouse processes, such as cross-picking, quality control, packing, sortation, replenishment, receiving, and put-away.

## How to get started with Speechly in warehouses

Speechly’s Spoken Language Understanding technology offers industry-leading accuracy without the need for special hardware. Our technology works for all accents and can be adapted to all processes. Typically, a POC that integrates with the current ERP or WMS and supports the most common warehouse processes can be built in less than a month.

Our pricing is competitive and is based on the amount of audio data sent to our API. Typical costs for using our API in a warehouse setting are a few thousand euros per month. Speechly works on all mobile devices and can be used in custom hardware, too.

If you’re interested in learning more about how voice technology can help your logistics workforce be more efficient and improve your business data quality, [leave your email address](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/voice-picking) and our industry specialist will contact you with more details.


Learn how voice picking and voice-directed warehousing with real-time Spoken Language Understanding improves efficiency and key metrics in your warehouse

Voice Picking with Modern Technologies


[Voice user interfaces](https://www.speechly.com/blog/what-is-voice-user-interface/) for most people mean smart speakers that turn a regular home into a fun sci-fi inspired command center. And while these smart speakers are fun and popular, one thing that doesn't come to mind when thinking about them is improved workforce efficiency.

However, voice features should not be thought of primarily as a [smart home gimmick](https://www.speechly.com/blog/why-smart-speakers-are-not-the-future-of-voice/). For smaller and larger corporations alike, they can be a great way to improve employee efficiency in [all industries and domains](https://www.speechly.com/use-cases/).

Voice user interfaces can improve workforce efficiency by making data collection easier and faster. They can also help professionals in logistics, factories and workshops control equipment and machinery more easily and safely.

## Improved data quality brings clear benefits

Back in the days before computers, most professionals reported by using paper and pen. This works well for short durations and small amounts of data. However, the demand for more and better data has increased tremendously and data has become a key driver for many industries.

The better the data quality is, the bigger the benefit is. Let's consider a maintenance operation in a factory that is in charge of keeping all the machines up and running. The main objective for them is to report and fix the broken equipment as efficiently as possible.

If employees are able to report accurately what was broken, where, how long it took to fix the problem and at what cost, they are able to learn and improve their processes. With data, they'll be able to maintain the equipment in a way that is cheaper and more efficient.

The number one thing that enables them to get good and accurate data is quick and efficient data collection, preferably right on the spot where their employees are working. Voice input can be [about three times faster than typing](https://hci.stanford.edu/research/speech/Ubicomp18_pdf.pdf) on a touch screen. This means that employees using voice instead of touch could input roughly three times as much data, or spend about one third of the time they would otherwise spend on data collection.
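
The arithmetic behind this kind of estimate is simple enough to sketch in Python. The `voice_entry_minutes` helper is hypothetical, and the speed-up factor is left as a parameter because real-world gains depend on the task, the environment, and recognition accuracy.

```python
def voice_entry_minutes(typing_minutes: float, speedup: float) -> float:
    """Estimate the time needed by voice for data that takes typing_minutes to type."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return typing_minutes / speedup
```

For example, an hour of daily typing with an assumed speed-up factor of 3 would shrink to 20 minutes of speaking.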

Voice user interfaces can be used when maintenance personnel are crouching in hard-to-reach areas, while salespeople are driving back from a sales meeting or while a forklift operator is collecting pallets. If they chose to type instead, data collection would be delayed and the quality of the data would suffer.

## Repetitive and information-heavy tasks

<YouTube videoId="6GcgPcMOuQk" />

**Inputting data to a CRM is a repetitive and information-heavy task that can be made more efficient with a voice UI**

A majority of professionals spend a big chunk of their day [filling out the same form](https://www.speechly.com/blog/turn-any-web-form-into-a-voice-form/) or application. However, the specific information they fill into these forms is very important.

Let's consider a real estate agent. When they get a new customer, they'll typically fill in some kind of form that asks for the property type, asking price, name of the customer and other relevant information. This is only the first form in a long list of forms that goes along with the real estate industry. An average realtor can have up to 30 customers per year, so filling all these forms can become cumbersome. In Florida, for example, realtors have [over a hundred different forms](https://www.floridarealtors.org/tools-research/form-descriptions) they might need for completing a sale.

When they start selling, they'll hopefully get offers and counter offers that also require some data input. After the price has been agreed upon, there can be even more forms about escrow and the list goes on.

An experienced realtor knows all these forms by heart and knows exactly what kind of information they'll need to input, but they still have to go through each of them manually.

This kind of form filling is a part of many professionals' daily work. Almost all professionals fill in some type of sales document, anomaly report, review, or daily report, and almost no one enjoys it. Still, getting them right is of immense value to the business.

By enhancing CRMs, ERPs and other professional systems with voice interface functionalities, this data can be collected faster and with better accuracy.

## Accurate and efficient slang makes accurate and efficient user interfaces

One reason why a voice interface is a great solution for professional use is that in many areas of expertise there's a very specific lingo that everyone is familiar with using.

For instance, "[primp beemer](https://www.mentalfloss.com/article/77618/17-secret-slang-terms-your-doctor-might-be-using)" might not mean much to most of us, but in doctors' lingo it would mean a woman who is pregnant for the first time and is obese.

This kind of slang has evolved just because it is important for professionals in any domain to communicate in an efficient and accurate manner. From a voice user interface perspective, it serves exactly the same purpose and makes building complex voice functionalities [a lot simpler](/blog/advantages-of-voice-user-interfaces/) than in cases where the context is more open ended and free form.

Let's consider the retail business, for example. A consumer looking for the smaller Series 6 Apple Watch with GPS and cellular could ask something like "the new Apple smart watch with mobile data and with the smaller screen" or "Apple Watch Series 6 40mm cellular". However, a professional salesperson would probably refer to it as "Watch6,3", the official identifier code for the device, or by the product code "M0DV3" if referring to a specific color.

This is important from a voice user interface perspective because [if the context is accurate, it makes the voice functionalities efficient and accurate](https://www.speechly.com/blog/voice-application-design-guide/). It can be difficult to teach the underlying [Spoken Language Understanding](https://www.speechly.com/blog/nlu-voice-speech-recognition-terms-glossary/) model to understand all the possible ways a user can refer to a specific product or other phenomenon, but it's fairly easy to cover the standard ways professionals refer to the same product or phenomenon.

## How to evaluate whether voice data entry is a good fit

Voice is a great solution for data entry in almost all cases. If one or more of the following statements is true for you, then voice data entry could be a valuable option for your organization to consider.

- Data is business critical and can create a competitive edge
- Data collection is a repetitive task
- Employees are familiar with the data collection process, e.g. they know what kind of data they need to input
- Data collection happens or can happen in the field, rather than at the office
- Data is quantitative and structured rather than qualitative and unstructured
- Current data collection processes could be improved

Voice can also be a useful modality for professional apps, if employees are doing repetitive command and control tasks such as in factories, logistics and back office.

Speechly has experience voice-enabling various professional applications in many industries. Our technology can be applied to any [web, mobile or desktop application](https://docs.speechly.com/client-libraries/).

[Contact our expert](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/improve-workforce-efficiency-voice-uis) to learn more about how your business data entry can be improved by using modern voice interface technologies.


Voice user interfaces can improve workforce efficiency by enabling professionals to complete their tasks faster and more accurately

Improve Workforce Efficiency with Voice User Interfaces


Voice Commerce, or V-Commerce, is a topic that comes up frequently when discussing the opportunities with voice technology. This is not surprising due to the fact that voice technology news frequently predicts that V-Commerce will be an [$80B opportunity](https://www.fool.com/investing/2019/02/22/amazon-google-win-voice-commerce-market.aspx) by 2023. For everyday users of voice technology, this may seem optimistic when you think of all the different challenges that face V-Commerce experiences today. Given these challenges, I believe companies will need to embrace Multi-Modal Voice Commerce, or voice experiences that go alongside existing Digital Experiences.

## Challenges Facing Voice Commerce Today

With any emerging technology there are bound to be problems, or opportunities, that come along with it. Voice technology is no different. There are a handful of recurring problems that arise with voice technology in general and these problems are usually magnified when applied to V-Commerce. Three common problems that users reference within V-Commerce are the actual lack of a screen, fear of being misunderstood, and concerns with privacy.

### Lack of Screen

A common complaint from users that try different V-Commerce experiences is the [fact](https://www.dailymail.co.uk/sciencetech/article-2542583/Scientists-record-fastest-time-human-image-takes-just-13-milliseconds.html) that many voice-enabled devices do not have a screen. Without a screen to give users a sense of comfort that their utterances are being understood and executed properly, it’s hard to imagine purchases being made outside of basic everyday items and reorders.

### Problems with Accuracy

Another common problem with V-Commerce is the users' fear of being misunderstood. Frequent users of voice-enabled experiences are OK with a voice assistant that fails to understand a simple request, such as a question or a song request. However, the risk of being misunderstood while making a financial transaction is likely to be more heavily scrutinized by users.

### Privacy Issues

The final problem I want to address that comes up frequently with V-Commerce is privacy. [According to Voicebot](https://voicebot.ai/2020/05/11/privacy-concerns-rise-significantly-as-1-in-3-consumers-cite-it-as-reason-to-avoid-smart-speakers/), one third of U.S. adults are concerned about smart speakers recording them and will not purchase a device, double the share in 2018. Potential smart speaker owners aside, there are also heightened privacy concerns among existing smart speaker owners. Just like users' fear of being misunderstood, privacy concerns with V-Commerce are heightened because it revolves around a financial transaction.

Many of these problems can be better addressed with the addition of a screen to a voice experience creating Multi-Modal Voice Commerce. First, I will discuss a few general reasons why I think Multi-Modal Voice Commerce is the future of V-Commerce. I will then go into why I think businesses interested in creating valuable end user voice experiences should ditch smart speakers and start building voice features in their own digital domains.

## 3 Reasons for Multi-Modal Voice Commerce

### Buying is Visual

Humans have always looked for ways to improve how we transact and trade. We have progressed from making and trading our own goods, to Main Street mom-and-pop businesses, to large scale retail enterprises, to immersive E-Commerce stores. Although humans have consistently innovated how we purchase products, one variable that has remained consistent is the visual component of purchasing goods. Humans are skeptical and it’s human nature to want to see and better scrutinize an object we are interested in purchasing. Another interesting fact: we process visual information in a fraction of the time compared to other modalities.

This observation gives E-Commerce oriented businesses an opportunity to lean into and leverage existing digital assets to create voice experiences. Product teams spend countless hours optimizing and perfecting both mobile and web experiences where customers are already spending time. By going all in on voice assistants as a Voice Commerce strategy, you leave out a major part of what makes online stores successful: images. Rather, businesses should enhance their current online stores by leveraging the voice modality and give customers a truly value-add experience.

### Real-Time Validation

Most voice experiences today, even Multi-Modal experiences on popular Voice Assistants like Alexa or Google Assistant that show a transcript of what you are saying, are [turn-based experiences](https://www.speechly.com/blog/why-smart-speakers-are-not-the-future-of-voice/) and lack real-time validation that the user is actually being understood. By “turn-based experience” I mean one in which Automatic Speech Recognition (ASR) first produces a transcription of what was said, and Natural Language Processing (NLP) then interprets the intent of the user. The real opportunity with Multi-Modal Voice Commerce relies on Spoken Language Understanding (SLU).

Spoken Language Understanding is slightly different from the turn-based approach I mentioned above, but can lead to a drastically different experience for users. SLU does ASR and NLP simultaneously in real time. This not only allows users to see an actual transcript of what they are saying, but also allows a designer to take advantage of a screen to illustrate whether or not the system is understanding the user’s intents in real time. This gives users the comfort of knowing that they are being understood, and it also results in longer utterances.
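
To make the difference concrete, here is a toy Python sketch of incremental understanding. It is not Speechly's API or model: the keyword sets and the `stream_understanding` generator are hypothetical stand-ins that only show how an interpretation can be updated after every word instead of once at the end of a turn.

```python
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical keyword rules standing in for a trained SLU model.
INTENT_WORDS = {"show": "search", "find": "search", "buy": "purchase"}
COLORS = {"red", "blue", "black"}
PRODUCTS = {"sneakers", "jeans", "jackets"}


def stream_understanding(words: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield an updated interpretation after every word, not only at the end."""
    transcript: List[str] = []
    intent: Optional[str] = None
    entities: Dict[str, str] = {}
    for word in words:
        token = word.lower()
        transcript.append(token)
        if token in INTENT_WORDS:
            intent = INTENT_WORDS[token]
        elif token in COLORS:
            entities["color"] = token
        elif token in PRODUCTS:
            entities["product"] = token
        # Each partial result could drive an immediate UI update.
        yield {
            "transcript": " ".join(transcript),
            "intent": intent,
            "entities": dict(entities),
        }
```

Feeding it the words of "Show me blue sneakers" yields four partial results. The intent is already known after the first word, so a screen could start reacting before the user has finished speaking.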

Multi-Modal Voice Commerce also lets users verify whether or not they are being listened to. With privacy being a top concern of both potential and existing voice technology users, Product teams need to pay careful attention to how they address privacy. Visual components, such as a [Microphone On/Off button](https://www.speechly.com/blog/voice-application-design-guide/), are a good remedy for privacy concerns with voice technology.

### Efficient for Users

According to a study from [Stanford](https://news.stanford.edu/2016/08/24/stanford-study-speech-recognition-faster-texting/#:~:text=%E2%80%9CWe%20knew%20speech%20recognition%20is,who%20helped%20run%20the%20experiments.), speech recognition is 3x faster than typing into a smartphone. There have been many predictions on what voice experiences might become in the future, and I am as excited as everyone else about that future, but there is one absolute fact about voice technology. Voice is the most efficient way to interact with technology. This makes existing digital experiences, such as E-Commerce websites and mobile applications, the perfect domain for a Voice User Interface. Users are able to make purchase decisions based on products they can actually see, but are able to do things such as [search, filter](https://www.speechly.com/blog/voice-search/), and checkout more efficiently with a Multi-Modal Voice Interface.

## Invest: Multi-Modal Voice Interface vs. Voice Assistant Platforms

There is a difference between Voice Assistant platforms, such as Google Assistant or Amazon Alexa, and companies like Speechly that enable developers to easily embed Voice User Interfaces in existing websites and apps. When it comes to building voice technology that is useful for users in E-Commerce, I believe it is better to approach voice as a modality to build immersive Multi-Modal experiences rather than an emerging platform opportunity. Approaching voice technology through this lens first provides the opportunity to immediately build features with value by bringing efficiency to your users.

See our Voice Search & Filtering Demo below:

<YouTube videoId="xI68NT8D1m8" />

### Plant the Voice Tech Seed

Starting with a V-Commerce use case like the Search and Filtering Demo above not only provides immediate value to users, but also plants the seed for future innovation around Multi-Modal Voice Interfaces. Searching or filtering products using your voice may seem simple in nature, but do not underestimate the power of user behavior change. With any user behavior change comes massive opportunities for innovation. Giving users a feature that is easy to digest and provides immediate value gives Product teams the opportunity to offer more sophisticated features down the line.

V-Commerce may have its problems, but I believe many of these problems are less concerning if we approach voice technology as a modality to build efficient Multi-Modal experiences. Multi-Modal Voice Commerce allows companies and brands that are interested in voice technology to “Walk before they run” by starting with features that make sense to users and make the purchase journey more efficient. Giving customers true value through voice technology, from day 1, is the only way to lay the foundation for building more sophisticated voice experiences in the future.

If you are interested in turning your E-Commerce store into a Voice Commerce powerhouse, [leave your email address](https://www.speechly.com/contact?ref=https://www.speechly.com/blog/bring-multimodality-voice-commerce) and our industry professional will contact you.


Voice commerce should not mean eCommerce on a smart speaker, but rather a multi-modal experience supported by voice.

Bring Multi-Modality to Voice Commerce


Speechly has now existed for about five years. We are a team of 13 experienced software developers and machine learning experts, and for most of those five years we’ve been operating in stealth mode, focusing on building our core technologies. Now it’s time to tell what we’ve achieved so far.
We are building a [developer tool](/develop/) for improving the touch screen user experience with voice functionalities. We don’t believe that smart speakers and voice assistants are the best use case for voice; rather, voice should be thought of as an add-on to the user interfaces of current mobile applications and websites. Voice is a modality, not a complete user interface.

Touch screen user interfaces definitely need improvements: while selecting from a few options is easy, selecting, for example, 30 items from an inventory of 20,000 is pretty cumbersome.

Typing is notoriously hard, too. Most humans speak about three times faster, and with fewer errors, than they type. In short, voice is a great solution for information-heavy tasks. While there are good solutions for speech recognition, there are really no tools that enable developers to build the kind of user interfaces we’ve envisioned for voice.

2020 was the first year we really published something out in the wild. We’ve built our technology for the past five years, and Speechly is finally at a stage where a developer can configure a model, integrate it into their application and build an awesome voice user interface. In this post, I’ll summarize our achievements.

## 1 Spoken Language Understanding accuracy matching Google

We run our own ASR and NLU technologies that provide both transcript and meaning (intents and entities) in real-time. During 2020 we achieved significant increases in both ASR and NLU accuracy.

We evaluate the accuracy of our engine by transcribing the data we receive with both our own and with [Google Cloud Speech API](https://cloud.google.com/speech-to-text). Based on our results, our Spoken Language Understanding is in a typical voice user interface task **15% more accurate** than Google.

Because ASR is a hard task, this is not to claim that our technology is better than Google in all cases. It means that when building voice user interfaces, Speechly outperforms Google in most cases, even without training the model separately for a certain use case.

In a real case, Speechly can further be optimized by using the actual user data for retraining the model. This improves the accuracy typically by another 10-15%.

## 2 Client libraries for most important web and mobile platforms

During 2020, we published three client libraries that make integrating Speechly into an application simple and fast. Handling the gRPC API, real-time audio streaming and, of course, parsing the results is a cumbersome task, and the client libraries take most of that workload off developers.

Our [browser-client](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage) can be used in all web applications in modern browsers, and our [React client](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage/?platform=React) makes development with React even easier.

For iOS, we released the [iOS client](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage/?platform=iOS), and our [Android client](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage/?platform=Android) will be published very soon. After that, developers can easily build a unified voice user interface on all major platforms.

We have created a simple tutorial application for all of the client libraries for a gradual learning curve on all platforms.

## 3 Demos showcasing our technology

Speechly is a tool for building real-time voice functionalities that [integrate seamlessly](/blog/voice-application-design-guide/) into existing touch or web user interfaces.

We [don’t think](/blog/real-time-voice-user-interfaces/) smart speakers or “voice-only” solutions are the best way to use voice; instead, we advocate multimodality and real-time visual feedback.

Some of the demos we built in 2020 include a [fashion eCommerce voice search filtering](https://demos.speechly.com/fashion/) app and a classic [home automation](https://home-automation-app-demo.herokuapp.com/) app. Try them out!

<YouTube videoId="xI68NT8D1m8" />

## 4 Speechly Annotation Language features for configuring voice user interfaces

Our Speechly Annotation Language (SAL) is a syntax we use to annotate example utterances that are used to train our models. In 2020 we added many new features to SAL:

- Alphanumeric sequences
- Dates
- Permutations
- Multi-intent utterances
- Variables
- Canonical entities for easy handling of synonyms
- Lookup tables for handling large inventories

With these features, developers and designers can create complex voice user interfaces with a minimal amount of example utterances. Because the same model can be used on all platforms, the user experience is unified.

## 5 Improved latency in our gRPC API

When the iPhone nailed the user experience with the touch screen, one of the key features was the very responsive user interface that reacted immediately to user input. This is a key issue also for voice user interfaces.

We’ve improved our latency in 2020 significantly and now we can proudly say that our API is real-time with tail latency of under 200 milliseconds.

Low latency is the key to an intuitive user experience in two ways: first, it enables users to correct themselves naturally by using voice, and second, it encourages them to go on with the voice experience.

Compare this to the traditional smart speaker user experience, which starts with uttering a wake word that sometimes fails. Once the wake word is recognized and the user starts speaking, they’ll know whether they were understood only after they have stopped speaking and the system has processed the input. If the answer is wrong, the user needs to start again from the beginning.

## 6 Speechly Dashboard

In March we published the first version of the Speechly Dashboard, a web application for building and configuring Spoken Language Understanding models with the Speechly Annotation Language.

The Dashboard supports nearly all Speechly features and it’s the fastest way for getting up to speed with our technology. Hundreds of developers have already created their models and tried them out in the Speechly Playground.

## 7 Other achievements

We’ve also published our [developer documentation](https://docs.speechly.com), many [videos on using our technology](https://www.youtube.com/channel/UCvEar86D8CnD3zDJSydOdbA) and [command line tools](https://docs.speechly.com/features/cli) for integrating Speechly into multi-user development workflows.

We renewed our website to better position our product and hired many new developers and machine learning experts. Our founders have been interviewed on many industry-leading podcasts, and we were nominated as one of Europe’s Hottest Startups.

If you want to work with us and build awesome developer tools for next-generation voice user interfaces, please check our [careers page](https://www.speechly.com/careers/).

## What next?

Overall, we are pretty happy with our 2020. We’ve now built a technology stack that enables efficient user interfaces that improve user experience significantly. In 2021 we will focus on showing the world some cool examples of our technology.


Our team made some great progress in 2020 in enabling all developers to become voice developers.

What Did the Speechly Team Achieve in 2020?


To demonstrate writing voice-enabled apps in practice, we’ll build a smart home controller app that responds with real-time visual feedback to spoken commands like:

_"Switch off the radio in the living room."_

_"Turn on the lights in the bedroom."_

### Design goals

The app is going to be built on two design pillars in particular:

- Responsiveness, so that the user is confident that the app follows the user’s speech.

- Robustness, so that the app behaves nicely even if the interpretation of the user’s intents changes during the sentence.

### Understanding real-time speech recognition

By definition, speech-to-text systems give you text (transcript) to work with. Real-time systems provide partial sentences as soon as words are recognised and continue to refine the transcript as the speech progresses.

In addition to this, Speechly, for one, goes a step further by providing you with tagged keywords (entities) and the intent of the sentence as soon as they are recognised. At first, they would be tagged as tentative, and later turn to final.

The above example sentences would be deconstructed as follows:

_"Switch off the **radio** in the **living room**"_

→ Intent: turn_off, entities: radio (of type device), living room (of type room)

The tags are defined in a voice interface configuration in Speechly Annotation Language (SAL). While I’ve pre-configured the keywords for this example, you can learn how to create your own voice interfaces **[here](https://docs.speechly.com/reference/sal/)**.

While we're getting a constant stream of words as the sentence is being uttered, the speaker’s true intentions are only confirmed at the end of sentence (which Speechly calls a final segment).
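To make the tentative and final updates concrete, here is a minimal TypeScript sketch of what such a stream might look like for the first example sentence. The `Segment` and `Entity` shapes and the `entityValue` helper are simplified, hypothetical stand-ins for illustration, not the client library's actual types.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface Entity {
  type: string; // e.g. "device" or "room"
  value: string;
  isFinal: boolean; // false while the entity is still tentative
}

interface Segment {
  intent: { intent: string; isFinal: boolean };
  entities: Entity[];
  isFinal: boolean; // true once the sentence has ended
}

// Returns the lower-cased value of the first entity of the given type, if any.
function entityValue(segment: Segment, type: string): string | undefined {
  return segment.entities.find((e) => e.type === type)?.value.toLowerCase();
}

// Mid-sentence: "Switch off the radio..." (the room has not been heard yet).
const tentativeSegment: Segment = {
  intent: { intent: "turn_off", isFinal: false },
  entities: [{ type: "device", value: "radio", isFinal: false }],
  isFinal: false,
};

// End of sentence: "...in the living room." Everything is now final.
const finalSegment: Segment = {
  intent: { intent: "turn_off", isFinal: true },
  entities: [
    { type: "device", value: "radio", isFinal: true },
    { type: "room", value: "living room", isFinal: true },
  ],
  isFinal: true,
};

console.log(entityValue(tentativeSegment, "room")); // undefined so far
console.log(entityValue(finalSegment, "room")); // "living room"
```

An app can act on the tentative values right away, for example to highlight the radio, as long as it is prepared for them to change before the segment becomes final.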

### Responsiveness

Let’s assume the user would say _"Turn the kitchen lights... on"_.

If we waited until the end of the sentence before providing any feedback to the user, the time from manipulation (speech) to the desired effect could increase to several seconds, making the user experience unresponsive and clumsy.

To mitigate this, we will highlight the objects mentioned in the sentence (appliances like _"lights"_ and rooms) to give the user an early confirmation of where and what changes would happen. The speaker may even use this near real-time information to alter their spoken command.

If we wanted, we could even take a more forward leaning approach by actually toggling the lights when we have enough information about the user's intent.

### Robustness

To deliver the robust experience users expect, we need to be prepared for the (luckily, rare) occasion that the intent changes as the speech progresses, sometimes at the very last moment. What if the user would have finished the above sentence with an _"...off"_?

We'll accommodate such changes (big or small) by storing a copy of the app state when the user starts a new sentence. Then we simply recalculate the new state over and over again using the information we receive as the user speaks. Operating on the whole app state allows speech to control all aspects of the application: adding, removing and modifying data on any number of objects are all handled in a similar manner. Recalculating from the last stable state becomes especially important should the interpretation of the user’s intent change in the middle of the sentence. Finally, we store the last tentative state as the new stable state for upcoming sentences.

This approach assumes that your app state is fairly compact, so that you can efficiently create a new copy of the entire app state whenever new information becomes available from the speech-to-text engine, which may occur up to 10 times a second. Also, the user interface needs to be fast enough to keep up with the state updates so that they won't choke performance.
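The stable/tentative bookkeeping described above can be sketched without any framework. This is a minimal illustration under assumptions: the `Rooms` type, the `Update` shape and the `applyUpdate` and `handleUpdate` helpers are hypothetical names, not part of Speechly's API.

```typescript
// Hypothetical types for this sketch.
type Rooms = Record<string, Record<string, boolean>>;

interface Update {
  room?: string;
  device?: string;
  powerOn: boolean;
  isFinal: boolean; // true when the sentence has ended
}

interface Tracker {
  stable: Rooms; // state as of the last finished sentence
  tentative: Rooms; // stable state plus the sentence in progress
}

// Always derives the new state from the stable copy, never from the
// previous tentative one, so a changed interpretation leaves no residue.
function applyUpdate(stable: Rooms, update: Update): Rooms {
  const { room, device, powerOn } = update;
  if (!room || !device) return stable;
  const devices = stable[room];
  if (devices === undefined || devices[device] === undefined) return stable;
  return { ...stable, [room]: { ...devices, [device]: powerOn } };
}

function handleUpdate(tracker: Tracker, update: Update): Tracker {
  const tentative = applyUpdate(tracker.stable, update);
  // Only a final update promotes the tentative state to stable.
  return { stable: update.isFinal ? tentative : tracker.stable, tentative };
}

// "Turn the kitchen lights... on", corrected at the last moment to "off".
const initial: Rooms = { kitchen: { lights: false, radio: false } };
let tracker: Tracker = { stable: initial, tentative: initial };
tracker = handleUpdate(tracker, { room: "kitchen", device: "lights", powerOn: true, isFinal: false });
tracker = handleUpdate(tracker, { room: "kitchen", device: "lights", powerOn: false, isFinal: true });
console.log(tracker.stable["kitchen"]?.["lights"]); // false
```

Because every update is computed from the stable copy, the tentative "on" never reaches the stable state once the sentence ends with "off".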

### Demonstrating the key concepts in a home automation app

Let's create a sample app to see how it all comes together.

I'm assuming that you have some experience with React so you probably already have [node/npm](https://nodejs.org/) installed. If you want to run the demo, prepare a React TypeScript project, but with the contents of src/App.tsx replaced with this **[Gist](https://gist.github.com/arzga/da22da22782e0b79c2271ed0f206d6df)** like so:

```bash
npx create-react-app home-automation --template typescript
cd home-automation
# Download and replace src/App.tsx with the Home Automation app
curl https://gist.githubusercontent.com/arzga/da22da22782e0b79c2271ed0f206d6df/raw > src/App.tsx
# Install dependencies
npm install
npm install @speechly/react-client @speechly/react-ui
npm start
```

Before walking through the code, a word about some of the choices I’ve made:

- I used TypeScript with React to help me avoid the dumbest mistakes with the code. More info about TypeScript in React **[here](https://create-react-app.dev/docs/adding-typescript/)**.
- I'm using a home automation voice interface defined in Speechly’s SAL syntax and pre-deployed so we can just use it in the following code. More about that **[here](https://docs.speechly.com/reference/sal/)**.
- Some inline styling is visible in some of the snippets. A remnant of my StyledComponents experiments, but without the dependency...

#### Rendering the app

The main render function is probably pretty much what you’d expect it to be in a React app. The whole of the app is wrapped in a `<SpeechProvider>` which connects to the Speechly cloud services and enables speech-to-text for any contained component. The appId points to a pre-configured voice interface that defines the keywords and phrases you can use in this app.

```js
export default function App() {
  return (
    <div className="App">
      <SpeechProvider
        appId="a14e42a3-917e-4a57-81f7-7433ec71abad"
        language="en-US"
      >
        <BigTranscriptContainer>
          <BigTranscript />
        </BigTranscriptContainer>
        <SpeechlyApp />
        <PushToTalkButtonContainer>
          <PushToTalkButton captureKey=" " />
        </PushToTalkButtonContainer>
      </SpeechProvider>
    </div>
  );
}
```

#### Data model for the app state

The app has a monolithic state object (think of a React/Redux store). The data model is just a collection of rooms with device states in them. It’s worth noting that the names match those of the entities defined in the voice interface configuration mentioned above. This way the entities returned by the speech-to-text API are easy to connect with the model.

```js
const DefaultAppState = {
  rooms: {
    'living room': {
      radio: false,
      television: false,
      lights: false,
    },
    bedroom: {
      radio: false,
      lights: false,
    },
    kitchen: {
      radio: false,
      lights: false,
    },
  },
};
```

#### Interpreting the speech segment

The details of the state manipulation logic reside in `alterAppState`, which takes the segment and last “stable” appstate and returns a new app state object with segment information reflected on it.

`selectedRoom` and `selectedDevice` are used to highlight, in the user interface, the objects the user is talking about.

```js
const alterAppState = useCallback(
  (segment: SpeechSegment): AppState => {
    switch (segment.intent.intent) {
      case 'turn_on':
      case 'turn_off':
        // Get values for room and device entities.
        const room = segment.entities
          .find((entity) => entity.type === 'room')
          ?.value.toLowerCase();
        const device = segment.entities
          .find((entity) => entity.type === 'device')
          ?.value.toLowerCase();
        setSelectedRoom(room);
        setSelectedDevice(device);
        // Set desired device powerOn based on the intent
        const isPowerOn = segment.intent.intent === 'turn_on';
        if (
          room &&
          device &&
          appState.rooms[room] !== undefined &&
          appState.rooms[room][device] !== undefined
        ) {
          return {
            ...appState,
            rooms: {
              ...appState.rooms,
              [room]: { ...appState.rooms[room], [device]: isPowerOn },
            },
          };
        }
        break;
    }
    return appState;
  },
  [appState],
);
```

#### The stable and the tentative app state

As the user speaks, the `useEffect` below fires as a response to changed words, entities and intent in `segment`. The new `tentativeAppState` is then resolved by calling `alterAppState`. Upon the end of the sentence (indicated by the `segment.isFinal` flag) the last `tentativeAppState` is stored as the new “stable” `appState`.

```js
function SpeechlyApp() {
  const { segment } = useSpeechContext();
  const [tentativeAppState, setTentativeAppState] = useState<AppState>(DefaultAppState);
  const [appState, setAppState] = useState<AppState>(DefaultAppState);
  const [selectedRoom, setSelectedRoom] = useState<string | undefined>();
  const [selectedDevice, setSelectedDevice] = useState<string | undefined>();

  // This effect is fired whenever there's a new speech segment available
  useEffect(() => {
    if (segment) {
      let alteredState = alterAppState(segment);
      // Set current app state
      setTentativeAppState(alteredState);
      if (segment.isFinal) {
        // Store the final app state as basis of next utterance
        setAppState(alteredState);
        setSelectedRoom(undefined);
        setSelectedDevice(undefined);
      }
    }
    // eslint-disable-next-line react-hooks/exhaustive-deps
  }, [segment]);
...
```

#### Rendering the app state

The remaining part is rendering the rooms as boxes with devices in them. The renderer uses the information in both `appState` and `tentativeAppState` to highlight changes to the device states. The selected room and device are also visualised during the utterance.

```js
return (
    <div
      style={{
        display: "flex",
        height: "100vh",
        flexDirection: "row",
        justifyContent: "center",
        alignItems: "center",
        alignContent: "center",
        flexWrap: "wrap",
      }}
    >
      {Object.keys(appState.rooms).map((room) => (
        <div
          key={room}
          style={{
            width: "12rem",
            height: "12rem",
            padding: "0.5rem",
            borderWidth: "2px",
            borderStyle: "solid",
            borderColor: selectedRoom === room ? "cyan" : "black",
          }}
        >
          {room}
          <div
            style={{
              paddingTop: "1rem",
              display: "flex",
              flexDirection: "row",
              justifyContent: "start",
              alignItems: "start",
              flexWrap: "wrap",
            }}
          >
            {Object.keys(appState.rooms[room]).map((device) => (
              <div
                key={device}
                style={{
                  flexBasis: "5rem",
                  margin: "0.2rem",
                  padding: "0.2rem",
                  background:
                    selectedDevice === device &&
                    (!selectedRoom || selectedRoom === room)
                      ? "cyan"
                      : "lightgray",
                }}
              >
                {device}
                <br />
                {appState.rooms[room][device] ? (
                  tentativeAppState.rooms[room][device] ? (
                    <span style={{ color: "green" }}>On</span>
                  ) : (
                    <span style={{ color: "red" }}>Turning off...</span>
                  )
                ) : !tentativeAppState.rooms[room][device] ? (
                  <span style={{ color: "red" }}>Off</span>
                ) : (
                  <span style={{ color: "green" }}>Turning on...</span>
                )}
              </div>
            ))}
          </div>
        </div>
      ))}
    </div>
  );
}
```

That’s it!

If you created the React app and downloaded the Gist, you should be able to run it, hold the mic button (or hold down the space bar) and try saying combinations of “turn on”, “turn off”, “lights”, “radio”, “television” and rooms like “living room”, “bedroom” and “kitchen”.

Hopefully you now have an idea of how you can integrate a voice interface into your React app. The next step would be creating something of your own. A good starting point would be thinking of what kind of phrases you’d like to use and sketching them out in Speechly Dashboard.

### Footnotes

- You'll notice that nothing will happen if you leave out a part of the sentence. This example can (and probably should) be improved by allowing the user to specify the key information (the room, device and power state) spread over multiple utterances. This would make the voice experience more flexible and more pleasant to use.
- alterAppState is very reducer-like. It could actually be a reducer, but then it would not be able to directly trigger side effects such as animations or transitions (which are not showcased in this example).
- For multimodal use, the example could be improved by also calling setAppState at the start of a new utterance. The current approach, which uses setAppState only at the end of the utterance, will not work gracefully if the widgets can also be manipulated with touch or mouse, as the old app state is restored upon starting a new utterance. Any app state changes made using the GUI would be lost.
- Please note that it's currently possible that the transitional state visualisation may go unnoticed if the tentative period is very short. Improved visualisation may use something like react-spring to launch a clearly visible effect upon a state change, which would ensure that the user has the time to see it.
- If you're interested, there's an article specifically about guidelines for high productivity voice apps **[here](https://www.speechly.com/blog/voice-application-design-guide/)**.
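
As a rough sketch of the first footnote, key information spread over several utterances could be accumulated into a pending command until it is complete. The `PendingCommand` shape and the `mergeCommand` and `isComplete` helpers are hypothetical and not part of the demo app.

```typescript
// Hypothetical shape: each field stays undefined until the user has said it.
interface PendingCommand {
  room?: string;
  device?: string;
  powerOn?: boolean;
}

interface CompleteCommand {
  room: string;
  device: string;
  powerOn: boolean;
}

// Newer information overrides older; missing fields keep their earlier value.
function mergeCommand(pending: PendingCommand, update: PendingCommand): PendingCommand {
  return {
    room: update.room ?? pending.room,
    device: update.device ?? pending.device,
    powerOn: update.powerOn ?? pending.powerOn,
  };
}

// A command can be executed once all three pieces are known.
function isComplete(command: PendingCommand): command is CompleteCommand {
  return (
    command.room !== undefined &&
    command.device !== undefined &&
    command.powerOn !== undefined
  );
}

// Utterance 1: "The kitchen lights." Utterance 2: "Turn them on."
const afterFirst = mergeCommand({}, { room: "kitchen", device: "lights" });
console.log(isComplete(afterFirst)); // false: the power state is still unknown
const afterSecond = mergeCommand(afterFirst, { powerOn: true });
console.log(isComplete(afterSecond)); // true
```

Only a complete command would then be applied to the app state; an incomplete one could be used to highlight the room and device while waiting for the rest.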

Happy hacking!

Ari


Learn the best practices for handling speech input in a Speechly React app.

Handling Speech Input in a React App


The most used tool for voice user interfaces on the browser is the Web Speech API and SpeechRecognition API, but there are major limitations with both technologies.

First, Web Speech API is [only available](https://caniuse.com/speech-recognition) for Chrome. SpeechRecognition API is also available for Firefox and some derivatives of these, but the low support makes them infeasible for production use in any real-life application.

![Support for WebSpeech API is limited](/uploads/webspeech-api-support.png)
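
Because of this uneven support, code that relies on the browser API typically has to feature-detect it first. A minimal sketch, assuming the constructor is exposed either as `SpeechRecognition` or under the `webkitSpeechRecognition` prefix; the `SpeechGlobals` type and `findSpeechRecognition` helper are hypothetical names used only for illustration:

```typescript
// Hypothetical minimal view of the global scope for this sketch.
type RecognitionConstructor = new () => unknown;

interface SpeechGlobals {
  SpeechRecognition?: RecognitionConstructor;
  webkitSpeechRecognition?: RecognitionConstructor;
}

// Returns the recognition constructor if the environment exposes one.
function findSpeechRecognition(scope: SpeechGlobals): RecognitionConstructor | undefined {
  return scope.SpeechRecognition ?? scope.webkitSpeechRecognition;
}

// In a browser you would pass `window as unknown as SpeechGlobals`.
// Here plain objects simulate a supporting and a non-supporting browser.
class FakeRecognition {}
const supported = findSpeechRecognition({ webkitSpeechRecognition: FakeRecognition });
const unsupported = findSpeechRecognition({});

console.log(supported !== undefined); // true
console.log(unsupported !== undefined); // false
```

When the helper returns `undefined`, the application needs a fallback such as hiding the voice feature, which is exactly the kind of branching that limited browser support forces on developers.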

Second, Web Speech API and SpeechRecognition API provide only the transcription of the user's speech. They don't provide any context or meaning (natural language understanding) for this input. That's not an issue for use cases that only need the transcription, but for more complicated user tasks and for building user interfaces, natural language understanding needs to be solved somehow.

Speechly is the first developer tool built from the ground up for building [voice user interfaces](/blog/what-is-voice-user-interface/). Our Spoken Language Understanding API integrates [speech recognition](/blog/nlu-voice-speech-recognition-terms-glossary/#s) (ASR) and natural language understanding (NLU) into a single API for low latency and improved accuracy.

In addition to wide browser support, Speechly is available for touch screen platforms (Android, iOS and React Native) which makes building cross-platform applications very simple. This makes Speechly the best WebSpeech API alternative for voice user interfaces.

One important aspect when comparing voice APIs is of course the speech recognition accuracy. Speechly benefits from the fact that it's always configured for a certain use case and this configuration is used to bias the speech recognition model.

Biasing helps Speechly correctly catch product names, professional lingo and other harder words. Even without biasing, our speech recognition accuracy is on par with Google's WebSpeech API, as you can see in the video below.

In the video, a standard, non-biased Speechly model is running simultaneously with the Google WebSpeech API test, and both are transcribing Steve Jobs' keynote speech at the first iPhone launch event.

<YouTube videoId="1hcdCrFl-MQ" />

## What is natural language understanding and why do I need that?

Natural language understanding is a branch of machine learning that enables computer systems to extract meaning from text or speech input. It reduces natural language into structured data that typically consists of intents and entities (slots) that modify these intents.

While this might sound complicated, a simple example clarifies it. If the user says something like "Show t-shirts", the user intent is probably something like "show_products" and it has an entity "t-shirt". Naturally, the user might also say something like "Show jeans". In this case, the intent would be the same, "show_products", but the entity would be "jeans".

If we are 100% sure that our users will always use either of these two utterances in exactly this format, we can use a very simple regular expression as our natural language understanding algorithm.

But most often this is not the case. Rather, the user can express this same intent in many different ways. Maybe they say something like "I'd like to see turtlenecks" or "Do you have any tees?"

A good natural language understanding algorithm can extract the meaning out of all these utterances and always return with the same intent and entity, no matter how the user expresses themselves.
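
To illustrate the limits of the regular expression approach, here is a minimal sketch. The `parseShowCommand` helper and its return shape are hypothetical; it only understands utterances of the exact form `Show <product>` and does no normalization such as singularizing the entity.

```typescript
interface ParsedUtterance {
  intent: string;
  entity: string;
}

// Matches "show" followed by a product name, case-insensitively.
const SHOW_PATTERN = /^show\s+(.+)$/i;

function parseShowCommand(utterance: string): ParsedUtterance | undefined {
  const product = SHOW_PATTERN.exec(utterance.trim())?.[1];
  if (!product) return undefined;
  return { intent: "show_products", entity: product.toLowerCase() };
}

// The two fixed phrasings are understood...
console.log(parseShowCommand("Show t-shirts")); // intent "show_products", entity "t-shirts"
console.log(parseShowCommand("Show jeans")); // intent "show_products", entity "jeans"

// ...but natural variations of the same intent are not.
console.log(parseShowCommand("I'd like to see turtlenecks")); // undefined
console.log(parseShowCommand("Do you have any tees?")); // undefined
```

Covering every phrasing with additional patterns quickly becomes unmanageable, which is why a trained natural language understanding model is used instead.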

WebSpeech and SpeechRecognition APIs don't have any natural language understanding capabilities and if you need that, you'll need to start learning [SpaCy](https://spacy.io/) or some other natural language understanding tool. This increases development time significantly and adds complexity.

## Why Spoken Language Understanding?

Now as we've learned, a voice user interface needs two distinct parts: speech recognition to transform user speech into text and natural language understanding to extract meaning (intents and entities) from that text. WebSpeech and SpeechRecognition APIs only offer speech recognition.

If you have ever used Google Assistant, Alexa, or Siri, you've probably seen that while the text transcript appears in near real-time while the user speaks, when the user stops speaking there is a small delay after which the action happens. This is where the natural language understanding happens and the action that the user wanted is performed.

Speechly is a Spoken Language Understanding API that provides both of these functions in a fully streaming fashion. When the user starts talking, the API begins returning both the transcript and the "meaning", i.e. the intents and entities, for this input. This makes applications built with Speechly very responsive and fast to react to user input.

In fact, Speechly returns both interim and final results for both the transcript and for intents and entities for even faster feedback.

Unlike SpeechRecognition or WebSpeech API, Speechly [browser-client](https://docs.speechly.com/client-libraries/usage) is supported by [all modern browsers](https://docs.speechly.com/client-libraries/supported-browsers/) on mobile and desktop. You can also use Speechly for iOS and Android and we are adding more client libraries in the future, too. You can find all our client libraries [here](https://docs.speechly.com/client-libraries) for up-to-date status.

The streaming fashion of Speechly enables natural end-user utterances such as "Show me t-shirts... sorry I mean jeans". For most other voice UI APIs, this kind of query fails because of end pointing (or failure in natural language understanding): the system recognizes the small pause in between as a signal for the end and starts processing the first part of the utterance without taking into account the last part.

Another important thing that streaming enables is real-time visual feedback. If we think about our example utterance "Show t-shirts" it can show the t-shirts as soon as the user has stopped speaking. This encourages the user to go on and they can continue with something like "for men... in size large".

Configuring the natural language understanding model on Speechly is very simple and can be done either in our web dashboard or by using our command line tools. The former works great for simple projects and initial models and the latter is better for projects with several developers collaborating on the same model.

## Spoken Language Understanding demo

Here's a quick demo showing a web application built with Speechly Spoken Language Understanding in action:

<YouTube videoId="xI68NT8D1m8" />

As you can see from the demo, real-time visual feedback is the key to natural voice user interfaces. We believe that the lack of real-time feedback is the reason why [the "iPhone moment" has not happened yet](/blog/real-time-voice-user-interfaces) for voice UIs. This kind of real-time feedback can't be done with either WebSpeech or SpeechRecognition API.

You can also see the difference in responsiveness by checking out [this GitHub project](https://jnguyen9763.github.io/chesswithspeech/) that uses WebSpeech API for a chess game. Then compare it to this video, which shows a similar (albeit simpler!) chess game built with our JavaScript client.

<YouTube videoId="yKXuqR6swBE" />

Just like the iPhone succeeded with the touch screen because of its very responsive and intuitive user experience, voice UIs need the same responsiveness and intuitiveness to really succeed.

## Alternatives for WebSpeech API

### Amazon Transcribe

[Amazon Transcribe](https://aws.amazon.com/transcribe/) is Amazon's speech-to-text API that suffers from the same limitations as WebSpeech API and SpeechRecognition API.

While it does offer accurate speech recognition, it does not have natural language understanding capabilities, which makes it slow and non-responsive for voice user interfaces.

### IBM Watson Speech to Text

[IBM Watson Speech to Text](https://www.ibm.com/cloud/watson-text-to-speech) is another paid for speech-to-text API that does not include NLU capabilities.

### Microsoft Bing Speech API

[Microsoft Bing Speech API](https://azure.microsoft.com/en-us/services/cognitive-services/speech-services/) is Microsoft's answer to speech recognition, but unfortunately does not support natural language understanding either.

### Assembly AI

[Assembly AI](https://www.assemblyai.com/) offers great features for speech to text, including profanity filters and multiple models for different accents. It's a bit cheaper than the other alternatives, but does not support NLU, either.

### Speechly

[Speechly](https://www.speechly.com/) offers a fully streaming, real-time Spoken Language Understanding API for integrating responsive voice user interfaces into any web application.

## Conclusions

Building voice user interfaces for browser applications can't be done without natural language understanding capabilities. While it is possible to use one tool for speech recognition and another for NLU, doing so adds complexity and most probably increases latency to the point where real-time visual feedback is not achievable.

This makes Speechly the only available tool that enables complex voice user interfaces in the browser with a single API and with wide support for different browsers.

If you are interested in building real-time voice user interfaces for React or JavaScript, you can start using Speechly by completing our tutorials. You can follow either the [React tutorial](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage/?platform=React) or [JavaScript tutorial](https://dreamy-cori-a02de1.netlify.app/client-libraries/usage) depending on the platform you are developing on.

If you want to learn more about what kind of applications Speechly enables, you can refer to our [Use cases](https://www.speechly.com/use-cases/) section.


Speechly provides an alternative to the Web Speech API for React that works in all modern browsers and is optimized for real-time voice user interfaces.

Web Speech API Alternatives for Voice User Interfaces


## What’s wrong with current voice UIs

There is a lot of positive momentum around [voice interfaces](/blog/what-is-voice-user-interface/). We’ve all seen the stats: the number of smart speakers in US households has risen steadily for the past five years. The share of [voice searches](/blog/voice-search/) of all search engine traffic has soared. Yet, the type of revolution that the touch screen gave birth to, after the launch of the iPhone, hasn’t really happened for voice — despite all the hype. Why is that?

Anyone who has a bit of experience with voice interfaces knows that you can do simple things like setting an alarm, switching on the lights, or playing your favorite music on Spotify pretty easily. However, if you try to do anything more sophisticated, say, [order pizza](https://www.speechly.com/use-cases/) for your eight best friends, reserve an intercontinental flight for a family of five, or buy a new party dress online, the chances are that you're going to fail miserably. For more demanding tasks, and for most real-world tasks, the user experience with voice just isn’t there yet, at least not the way it is for the touch screen.

Still, contrary to what people might think, the problem no longer really lies in speech recognition or natural language understanding accuracy. Even for fairly open-ended domains, both speech recognition and natural language understanding accuracy are pretty close to human parity.

The problem lies rather in the way these systems give feedback to the user. Typically, when a voice command is uttered, the modern [voice assistants](/blog/why-smart-speakers-are-not-the-future-of-voice/) wait until the user has stopped talking (using the technique called endpointing) before they start processing it. This works great for short things like “turn on the lights” or “Play X on Spotify”. However, for more complex tasks this is a disaster.

Imagine if you need to express something that requires a longer explanation. When looking for a new t-shirt, a person might be tempted to say something like “I’m interested in t-shirts for men ...in color red, blue or orange, let’s say Boss ...no wait, I mean Hilfiger ...maybe size medium or large ...and something that’s on sale and can be shipped by tomorrow ...and I’d like to see the cheapest options first.”

When uttering something this long and winding to a traditional voice UI, most likely something will go wrong, resulting in the familiar, “Sorry, I didn’t quite get that.” Having just made the extended effort of explaining your intent to the system, this is an extremely frustrating experience. Or even worse, the endpointing might trigger a false positive halfway through the utterance, causing the voice assistant to prematurely resolve the intent and start a voice synthesis response, interrupting the speaker violently and irritatingly.
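To make the false-positive problem concrete, here is a minimal sketch of a silence-based endpointer. It is an illustration only: the `Frame` type, the `findEndpoint` and `framesFrom` helpers, the 100 ms frame size, and the 500 ms silence threshold are all assumptions made for this example, not how any particular voice assistant is implemented.

```typescript
// Hypothetical audio frame: `speech` says whether a voice activity
// detector heard speech during that frame.
interface Frame {
  timeMs: number;
  speech: boolean;
}

// Returns the time at which a silence-based endpointer would close the
// utterance, or null if the silence never reaches the threshold.
function findEndpoint(frames: Frame[], silenceThresholdMs: number): number | null {
  let silenceStart: number | null = null;
  let heardSpeech = false;
  for (const frame of frames) {
    if (frame.speech) {
      heardSpeech = true;
      silenceStart = null; // speech resets the silence timer
    } else if (heardSpeech) {
      if (silenceStart === null) silenceStart = frame.timeMs;
      if (frame.timeMs - silenceStart >= silenceThresholdMs) return frame.timeMs;
    }
  }
  return null;
}

// Builds frames at 100 ms steps from a pattern where "s" is speech and "." is silence.
function framesFrom(pattern: string): Frame[] {
  return pattern.split("").map((ch, i) => ({ timeMs: i * 100, speech: ch === "s" }));
}

// A short command followed by silence: the endpoint is where we want it.
const shortCommand = framesFrom("ssss......");

// A longer utterance with a thinking pause in the middle: the endpointer
// fires during the pause, before the user has finished.
const longUtterance = framesFrom("ssss......ssssss......");
```

With a 500 ms threshold, `findEndpoint` returns the same endpoint for both inputs, because it cannot tell a thinking pause from the end of the utterance. Raising the threshold reduces such false positives but makes every short command feel slower.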

In the following sections of this article, we will introduce the powerful techniques of streaming spoken language understanding and reactive [multi-modal voice user interfaces](/blog/voice-application-design-guide/) that address the very problem the current generation of voice UIs suffers from.

## Streaming Spoken Language Understanding

Contrary to the traditional voice systems that rely on endpointing to trigger a response, the systems using streaming spoken language understanding actively try to comprehend the user intent from the very moment the user starts to talk. The idea is that as soon as the user says something actionable, the UI is able to instantly react to it.

<YouTube videoId="tSi7vJuIyT0" />

The benefit here is that if the system does not understand the user, the UI will instantly signal this back. This way the UI will fail fast but also recover quickly, as the user can immediately stop, correct, and continue. On the other hand, if the system does understand the user, this information is also conveyed immediately. This gives reassurance to the user that their message is going through, and that they can continue their expression. As long as the system understands, the user can just go on and on, which means that longer and more complex utterances can be supported.
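The fail-fast behavior described above can be sketched as a function that is called once per recognized word and returns feedback right away. This is a toy illustration: the `Feedback` type, the `interpretWord` function, and the word lists are hypothetical stand-ins for a real spoken language understanding model and are not part of any actual API.

```typescript
// Hypothetical feedback events; the names are illustrative only.
type Feedback =
  | { kind: "understood"; slot: string; value: string }
  | { kind: "not_understood"; word: string };

// Toy vocabulary standing in for a real language understanding model.
const SLOT_BY_WORD = new Map<string, string>([
  ["t-shirts", "product"],
  ["red", "color"],
  ["blue", "color"],
  ["medium", "size"],
  ["large", "size"],
]);

// Words that carry no actionable meaning in this toy domain.
const FILLER_WORDS = new Set<string>(["i'm", "interested", "in", "color", "size", "or", "and"]);

// Called once per recognized word, while the user is still talking.
function interpretWord(rawWord: string): Feedback | null {
  const word = rawWord.toLowerCase();
  if (FILLER_WORDS.has(word)) return null; // nothing actionable yet
  const slot = SLOT_BY_WORD.get(word);
  if (slot === undefined) {
    return { kind: "not_understood", word }; // fail fast so the user can correct
  }
  return { kind: "understood", slot, value: word }; // reassure the user immediately
}

// Usage: feed words as they arrive instead of waiting for the end of the utterance.
for (const word of ["I'm", "interested", "in", "t-shirts", "in", "color", "red"]) {
  const feedback = interpretWord(word);
  if (feedback !== null) console.log(feedback);
}
```

The point of the sketch is the call pattern rather than the lookup table: feedback is produced per word, so the UI never has to wait for an endpoint before reacting.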

Moreover, the immediate feedback from both the small failures and successes of the UI can be combined so that users can correct either themselves or the UI on the fly, e.g., “I’m interested in Boss, no, I mean Tommy Hilfiger”. This paves the way for the UI to support not only more sophisticated and complex workflows but also a more stream-of-consciousness way of expressing the user’s intent. This is more natural for humans and requires much less effort than the very specific way in which current voice UIs require utterances to be given.

## Reactive and non-interruptive multi-modal voice user interfaces

The most prominent feedback modality of current voice interfaces is voice synthesis. As a feedback mechanism, however, this works poorly, as any ongoing user utterance will be abruptly interrupted, a problem commonly exhibited by contemporary voice UIs. As a feedback mechanism, voice is also a pretty narrow band. Instead, the feedback should be given with a non-interruptive modality. Such modalities include haptic, non-linguistic auditory, and perhaps most naturally and expressively, visual feedback. Using these modalities, the UI can react fast and without interrupting the user. For instance, in the case of “I’m interested in t-shirts,” the UI would swiftly show the most popular t-shirt products, instantly enabling the user to continue with a refining utterance, “do you have Boss.” This further narrows down the displayed products to show only the Boss-branded t-shirts.
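The t-shirt example can be sketched as a filter state that is updated after every refining utterance, with the visible product list recomputed each time. Everything here is an assumption for illustration: the `Product` and `Filters` types, the `refine` and `visibleProducts` functions, and the made-up three-item catalog.

```typescript
interface Product {
  name: string;
  category: string;
  brand: string;
}

// The filters the user has expressed so far; each one is optional.
type Filters = Partial<Pick<Product, "category" | "brand">>;

// Made-up catalog used only for this example.
const CATALOG: Product[] = [
  { name: "Classic Tee", category: "t-shirt", brand: "Boss" },
  { name: "Logo Tee", category: "t-shirt", brand: "Tommy Hilfiger" },
  { name: "Slim Jeans", category: "jeans", brand: "Boss" },
];

// Merges a new refinement into the current filters. A later value for the
// same field replaces the earlier one, which is how a spoken correction works.
function refine(filters: Filters, update: Filters): Filters {
  return { ...filters, ...update };
}

// Recomputes what the visual display should show for the current filters.
function visibleProducts(catalog: Product[], filters: Filters): Product[] {
  return catalog.filter(
    (p) =>
      (filters.category === undefined || p.category === filters.category) &&
      (filters.brand === undefined || p.brand === filters.brand)
  );
}

// "I'm interested in t-shirts" ... "do you have Boss" ... "no, I mean Tommy Hilfiger"
let filters: Filters = {};
filters = refine(filters, { category: "t-shirt" });
const afterCategory = visibleProducts(CATALOG, filters);
filters = refine(filters, { brand: "Boss" });
const afterBrand = visibleProducts(CATALOG, filters);
filters = refine(filters, { brand: "Tommy Hilfiger" });
const afterCorrection = visibleProducts(CATALOG, filters);
```

Because the display is a pure function of the filters, each partial utterance can update the screen silently, and a correction is just another update rather than a restart.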

<YouTube videoId="xI68NT8D1m8" />

This iteration loop resembles a setting familiar to everybody: human face-to-face communication, which is, in effect, a reactive, multimodal communication setup. In fact, it is a common misconception that human face-to-face conversation is primarily turn-based (or half-duplex) in the same way that chatbots or voice assistants are: first I say something, then you say something, then I say something again, and so on. That is not the whole picture. Human face-to-face conversation is very much full-duplex: as one person talks, the listener gives feedback with nods, facial expressions, gestures, and interjections like aha and mhm. Furthermore, if the person listening doesn't understand what is being said, they are likely to start making more or less subtle facial expressions to signal their lack of comprehension. This is the tight full-duplex feedback loop that makes human face-to-face communication so efficient. The same efficiency is exhibited in reactive multi-modal voice user interfaces!

## Voice is not the UI, voice is a modality!

At the height of the voice assistant hype ushered in by Amazon Alexa, many probably heard the catchphrase “Voice is the UI!”. This article disagrees. Voice is a modality! By augmenting a UI with voice in combination with other available modalities, the result can be an extremely efficient UI.

> The perfect interface is when you can use touch and voice seamlessly and choose the best option for the context, sometimes interchangeably

This efficiency comes from how the modalities work together, not from voice alone. Voice is, for instance, a very efficient method for inputting rich information. For scrolling, swiping, pointing, or selecting between a couple of valid alternatives, touch is probably better. For displaying complex multidimensional information, the visual display is unbeatable. Combining all of these modalities in a smart way is the killer app.

## What was the revolutionary thing in the iPhone?

Circling back to where we started, the iPhone moment. What made the iPhone so powerful? Well, the very intuitive swipe and pinch gestures with which the user could effortlessly control their phone. However, when the iPhone came out, the touch screen wasn't a new thing. There had been prior touch screen devices. They just sucked! You could swipe or press, and a second or two later something would happen.

What the iPhone brought to the mix was the extremely fast feedback that its touch screen could provide to the user, resulting in the very intuitive, fluid, and satisfying user experience of controlling your phone. Voice UIs based on streaming spoken language understanding are a similar type of revolution. Streaming spoken language understanding offers extremely fast, fluid, and intuitive feedback to the user, akin to what the iPhone brought to controlling devices back in 2007. This time, however, voice is at the center, providing user experiences that we haven’t seen before and ushering in the iPhone moment for voice at last.


The extremely fast feedback that the iPhone touch screen provided to the user, resulting in a very responsive and intuitive user experience, is still missing from current voice user interfaces.

Why Hasn’t the iPhone Moment Happened Yet for Voice UIs


Speechly has published a white paper on improving user experience in grocery eCommerce. This blog post shares the key findings of the paper. You can download the white paper as a PDF at the bottom of this page.

## The grocery industry is facing one of the most fundamental industry shifts in history

Many grocery retailers in the Western world are facing stagnating sales growth and falling prices. Total sales have grown only 2% yearly for the past decade. Increasing competition from discount chains, together with rising labor and commodity costs, has contributed to the evaporation of over half of the combined profit of publicly traded grocery retailers between 2012 and 2017.

### More than half of the grocery sector's profits have vanished since 2012

![Economic value add of publicly traded grocery retailers](/uploads/retail-losses-chart.png)
_CHART: Economic value add of publicly traded grocery retailers, billions USD. (Source: [McKinsey](https://www.mckinsey.com/industries/retail/our-insights/reviving-grocery-retail-six-imperatives))_

According to [eMarketer](https://www.emarketer.com/content/grocery-ecommerce-2019), online food and beverages are the fastest-growing ecommerce product category in Europe, with almost 20% growth. The COVID-19 pandemic has only increased the pace of growth, with many retailers reporting three-digit growth rates.

Still, the same research estimates that groceries will be among the least penetrated markets for years to come. The UK, France, South Korea, Japan and China are a bit more developed, but the share of eCommerce is still nowhere greater than 10% of total sales.

![Market share of grocery retail channels 2016 vs 2025](/uploads/chart-marketshare-grocery.png)
_CHART: Traditional supermarkets are losing market share to new channels_

In this paper we will argue that the biggest reason retail ecommerce has not grown faster is bad [customer experience](/blog/voice-application-design-guide/), especially in the shopping cart creation phase. The [most efficient way to improve customer experience](/blog/advantages-of-voice-user-interfaces/) in cart building is voice technology.

## Three reasons that still constrain e-commerce sales in grocery retail

Not many millennials would even consider buying flight tickets from a brick-and-mortar store. Still, they do their grocery shopping pretty much the way their grandparents did. Almost half of all Britons have never bought groceries online. There are three major reasons why consumers prefer the offline experience over the online one for groceries.

**Delivery costs**; the average cost of collecting and delivering the items is around 8-13 EUR per basket. Usually about 80% of this cost is transferred to the customer.

For example, Amazon spends a whopping 25% of its total revenue on shipping and fulfillment costs.

**Lack of trust**; consumers are accustomed to selecting the freshest foods themselves. They don’t necessarily trust collectors to do that for them. The same goes for replacement products; while the customer is perfectly okay if their 1.5 litres of Coca-Cola is replaced with 1 litre of Coca-Cola, they might not be okay with it being replaced by Pepsi of any amount.

**Bad user experience**; a typical supermarket has about 30,000 to 50,000 SKUs. While customers might enjoy browsing through dozens of red dresses before making a purchase decision, that’s usually not the case with milk or other low-value products. The current customer experience in creating a grocery shopping cart is cumbersome and requires a lot of scrolling and clicking through.

In fact, 85% of online grocery shoppers use the same store they used the first time because it’s a lot easier to modify previous orders than to start from scratch.

Voice is a great addition to the grocery eCommerce experience, as purchasers know the products they are looking for and don't need to [search and browse](/blog/voice-search/).

<YouTube videoId="yzdSVV4xjb0" />

_Voice is a natural addition to grocery eCommerce shopping cart creation_

## Case: Walmart is using shoppable recipes to engage consumers

![Walmart logo](/uploads/walmart-logo.png)

Walmart is collaborating with [Tasty](https://www.businessinsider.sg/walmart-partners-with-buzzfeeds-tasty-2017-12/), a Buzzfeed spinoff that makes recipes and cooking videos. Users of the Tasty iPhone app can find over 4,000 recipes and filter them by ingredients, dietary needs or difficulty. The app has over 300,000 reviews with an average rating of 4.9 and is among the top 20 of all food apps on the App Store.

Now, with the Walmart collaboration, users can also add the ingredients of all Tasty recipes to a Walmart shopping cart and either do in-store pickup or have them delivered home.

Sumaiya Balbale, the vice president of e-commerce, mobile, and digital marketing at Walmart, said that while “consumers today are shopping very differently”, Walmart wants to find new ways to get its products in front of shoppers.

Tasty has almost 100 million Facebook followers and over 60 million viewers for its recipe videos, who can now conveniently fulfill their grocery needs from 2,200 Walmart stores right after being inspired by a cooking video.

## Read more by downloading the white paper

This is the end of the preview of our white paper. Download the white paper and get access to conclusions and other data points!

<WhitePaperBanner
  title="Voice & the Grocery Industry"
  description="Learn how voice is changing the way people shop for groceries."
  filePath="/uploads/grocery-voice-speechly-web.pdf"
/>


Speechly published a white paper on how voice technology can benefit the grocery retail market. The white paper is available completely for free, no strings attached!
