Combating Voice Chat Toxicity in VR Games: Speechly and Gym Class
Mar 20, 2023
Gym Class VR is a basketball game that was preparing to launch on Meta Quest after a very successful Beta. Voice chat is an important social element of the game, but the team noticed evidence of toxic behavior emerging. After trying speech recognition from cloud service providers, they quickly learned this was a cost-prohibitive approach and turned to Speechly.
Gym Class VR is a basketball game that was preparing to launch on Meta Quest. In addition to fun game mechanics, it also has a voice chat feature that makes it a social experience. However, there was also some evidence of toxic behavior emerging in voice chat and the company didn’t know whether the problem was widespread or mostly isolated incidents.
The team was also very serious about building a healthy social space as part of the game experience. To do this, it needed a way to measure the problem and put controls in place to weed out toxic behavior.
Gym Class had tried popular cloud service providers for speech recognition to see if it could establish a baseline measurement for toxic behavior. However, at a market price of $1 per hour for transcription, those solutions turned out to be cost-prohibitive. So, Gym Class began looking for a new solution.
The Problem with Toxic Behavior
Toxic behavior can create a cascading series of problems if it is not addressed. Initially, it can undermine the game experience for a few players. If left unchecked, over time, it can shape the game’s community culture and leave some players feeling unwelcome and uncomfortable.
In addition, it is harder to get five-star reviews when a few bad actors are undermining the game experience. Worse still, these bad experiences can lead directly to one-star reviews, which can turn prospective players off before even trying the game. This is particularly frustrating for game makers when bad reviews emerge that have nothing to do with the game itself but are driven by a handful of toxic players.
Data from Apptentive and other providers show that there is a direct correlation between app store star ratings and new user acquisition. This is partially driven by how users search for new games on their own and also by how the app stores rank game titles. Star ratings matter. Gym Class knew that lowering toxicity would meet its goals from a game culture standpoint and could also translate into better star ratings.
The Proactive Imperative
Gym Class’ goals were pretty straightforward. The company first needed to measure the level of toxicity in the game. That understanding could then be used to decrease the amount of toxicity, improve player experience, and support a successful launch in the Meta Quest app store.
Most game makers today handle voice chat moderation strictly through a complaint-led model. That means they are only aware of toxicity that is reported. A recent consumer survey of U.S. online gamers found that only about 36% of victims of toxic incidents originating in voice chat have ever filed a complaint. Even those who have filed a complaint don’t do it for every incident.
The implication for Gym Class was clear. A complaint-led process would miss the vast majority of incidents. The company would need to take a proactive approach that involved monitoring voice chat sessions for toxicity. This would enable Gym Class to more effectively measure the problem and figure out the best way to eliminate toxic behavior where practical and mitigate the impact when it did occur.
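The reasoning above can be made concrete with a small back-of-the-envelope sketch. It uses the 36% survey figure from this article; the assumption that incidents are spread evenly across victims, the function name, and the 50% reporting rate are ours and purely illustrative.

```python
# Share of victims who have ever filed a complaint (survey figure cited in this article).
EVER_REPORTED_SHARE = 0.36


def complaint_coverage(report_rate_among_reporters: float) -> float:
    """Estimate the fraction of incidents that reach moderators via complaints.

    Assumption (ours): incidents are spread evenly across victims, so the
    share of victims who report scales directly to a share of incidents.
    """
    if not 0.0 <= report_rate_among_reporters <= 1.0:
        raise ValueError("report rate must be between 0 and 1")
    return EVER_REPORTED_SHARE * report_rate_among_reporters


# Best case: everyone who has ever complained reports every single incident.
best_case = complaint_coverage(1.0)
# Hypothetical case: those victims report only half of their incidents.
half_case = complaint_coverage(0.5)

print(f"best case: {best_case:.0%}, hypothetical: {half_case:.0%}")
```

Even in the best case under this simplification, complaints would surface roughly a third of incidents, and any realistic reporting rate pushes that lower.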
The On-device Solution
Gym Class already knew it needed a highly accurate automated speech recognition (ASR) solution. Accurate speech recognition and transcription are the first step in monitoring any natural language conversation. The company also wanted to ensure it correctly identified toxic incidents by taking context into account, so it didn’t miss cleverly disguised toxic behavior. Context-based analysis was also important for reducing false positives, which arise when a benign statement is flagged as toxic.
Given that Gym Class has several unique aspects of its VR game mechanics and culture, it was going to need a custom AI model to drive high accuracy. It also became clear that the only economically feasible solution would be to run the monitoring on user devices as part of the downloaded app.
If you run transcription through a cloud provider, you are paying for all of the data processing. For any individual gamer utterance, it may not be exorbitantly expensive, but the costs add up quickly for any game with a significant user base and frequent voice chat use. The cloud provider option mentioned earlier added $1 of cost for every player hour.
However, if you run the speech recognition locally on the user device, you only need to send messages to the game makers’ servers when an incident is detected. This turns out to be an order of magnitude less expensive than using a cloud provider. The approach also means proactive monitoring is suddenly economically feasible.
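The flow described above, where speech is processed locally and only detected incidents are sent to the server, might look roughly like the sketch below. This is not Speechly's API: every name here (`IncidentReport`, `monitor_on_device`, the placeholder classifier and its word list) is hypothetical, and the keyword check is only a stand-in for the custom, context-aware model the article describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class IncidentReport:
    """Hypothetical payload sent to the game maker's servers."""
    player_id: str
    transcript: str


def monitor_on_device(
    player_id: str,
    transcripts: Iterable[str],
    is_toxic: Callable[[str], bool],
    send_report: Callable[[IncidentReport], None],
) -> int:
    """Classify locally transcribed utterances and upload only flagged ones."""
    sent = 0
    for text in transcripts:
        if is_toxic(text):
            send_report(IncidentReport(player_id, text))
            sent += 1
    return sent


# Placeholder classifier for the demo only; a real system would use a
# context-aware model rather than a word list.
PLACEHOLDER_TERMS = {"badword"}


def placeholder_classifier(text: str) -> bool:
    return any(word in PLACEHOLDER_TERMS for word in text.lower().split())


outbox: List[IncidentReport] = []  # stands in for the network call
utterances = ["nice shot", "pass the ball", "you badword", "good game"]
sent = monitor_on_device("player-1", utterances, placeholder_classifier, outbox.append)

print(f"{sent} of {len(utterances)} utterances left the device")
```

In this toy run, only the flagged utterance is reported; the rest never leave the device, which is where the cost savings come from.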
Comparing ASR Models
Gym Class asked Speechly to benchmark several ASR solutions to assess cost and performance. This evaluation included two cloud providers, an on-prem Whisper deployment, a Speechly on-prem model, and a Speechly on-device model. Speechly was the only ASR evaluated on-device because the cloud providers do not offer this option and the Whisper model was too large to be feasible for the use case requirements. The results showed a strong rationale for implementing an on-device solution.
[Table: ASR benchmark results, including cost per audio hour]
The results made clear that Speechly’s custom ASR model, both on-device and on-prem, provided better accuracy in terms of Recall (i.e., the share of truly toxic incidents that were identified). False positives were near zero and on par with or below other AI model implementations. In addition, Speechly costs were 90% to 95% lower than cloud deployments and one-third to one-sixth the cost of an OpenAI Whisper implementation.
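To see what those percentages mean at scale, here is a rough cost sketch. The $1 per hour cloud price and the 90% to 95% reduction come from this article; applying the reduction to that $1 baseline is our assumption, and the 100,000 monthly player hours is a made-up volume for illustration.

```python
CLOUD_COST_PER_HOUR = 1.00   # USD per audio hour, cloud price quoted in this article
REDUCTIONS = (0.90, 0.95)    # cost reduction range reported for Speechly
PLAYER_HOURS = 100_000       # hypothetical monthly voice chat volume


def total_cost(player_hours: float, cost_per_hour: float) -> float:
    """Total transcription cost for a given volume of audio."""
    return player_hours * cost_per_hour


def reduced_rate(baseline_rate: float, reduction: float) -> float:
    """Per-hour rate after applying a fractional cost reduction."""
    return baseline_rate * (1.0 - reduction)


cloud = total_cost(PLAYER_HOURS, CLOUD_COST_PER_HOUR)
low, high = sorted(
    total_cost(PLAYER_HOURS, reduced_rate(CLOUD_COST_PER_HOUR, r))
    for r in REDUCTIONS
)

print(f"cloud: ${cloud:,.0f} per month")
print(f"with a 90-95% reduction: ${low:,.0f} to ${high:,.0f} per month")
```

Under these assumptions, the monthly bill drops from about $100,000 to somewhere between $5,000 and $10,000.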
Cost is a key barrier to using cloud providers for these types of applications. However, the analysis for Gym Class revealed that the generalized cloud models also had lower accuracy, and Azure showed a higher false positive rate. It is hard for generalized AI models to compete with customized models on accuracy for use cases as specific as a particular game. In the end, Speechly’s custom speech recognition models offered higher accuracy in addition to being smaller and more cost-efficient.
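For readers less familiar with the two accuracy measures used in this comparison, the sketch below shows how recall and false positive rate are computed from labeled evaluation data. The formulas are the standard definitions; the counts are invented for illustration and are not Gym Class benchmark results.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of truly toxic utterances that the system flagged."""
    actual_positives = true_positives + false_negatives
    return true_positives / actual_positives if actual_positives else 0.0


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of benign utterances that the system wrongly flagged."""
    actual_negatives = false_positives + true_negatives
    return false_positives / actual_negatives if actual_negatives else 0.0


# Hypothetical evaluation set: 100 toxic and 900 benign utterances.
tp, fn = 90, 10   # toxic utterances caught vs. missed
fp, tn = 2, 898   # benign utterances wrongly flagged vs. correctly passed

model_recall = recall(tp, fn)
model_fpr = false_positive_rate(fp, tn)

print(f"recall: {model_recall:.1%}, false positive rate: {model_fpr:.2%}")
```

A good moderation model pushes recall up while keeping the false positive rate close to zero, since every false positive risks penalizing a player who did nothing wrong.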
On-device deployments with this level of accuracy were not an option for game makers just a couple of years ago. The reason is that the package size for a custom speech recognition model was simply too large to include in a game’s executable file. Recent advances in optimizing speech recognition for application size and processing requirements have made this approach viable today for everything from PCs and consoles down to mobile devices and VR headsets.
Speechly has been at the forefront of this change, particularly in deploying custom ASR models directly on devices. Our research has focused on building better-than-cloud-grade speech recognition that can be deployed on-device, on-prem, or in the cloud.
Implementing Speechly’s voice chat monitoring solution enabled Gym Class to proactively address toxic behavior and execute a successful Meta Quest store launch. And it was made possible at a reasonable cost.
Gym Class VR’s toxic incident rate and complaints are just a fraction today compared to before the solution was implemented, and the company has a clear method for measuring incident rates. Now, the game can be judged on the merits of its mechanics and how fun it is for players without the risk of a few users undermining the experience for everyone else.
Today, Gym Class VR has a 4.9-star rating on the Meta Quest store and over 28,000 positive reviews. Data from VRDB.app show it became the highest-rated experience in the entire store in March 2023. You should give it a try.
Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), highly accurate custom models for any domain, and the privacy and scalability needed for hundreds of thousands of hours of audio.