As the internet has matured, so have the content types that users can enjoy and the approaches for moderating online activity. Early online content was dominated by text and images, with audio playing a smaller role. Fast forward to the 2020s and you see a very different ecosystem, dominated by images, videos, and even real-time communication channels such as Clubhouse, Twitter Spaces, and Metaverse platforms like Roblox that enable Voice Chat.
Since voice-based moderation has received less attention over the years than text-based moderation, this post will outline how to use Speech to Text technology to efficiently and effectively moderate Voice Chats and voice-based content online. We will focus on keyword-detection-based moderation, in which the task is to find specific instances of bad words (or short phrases) in a user's speech.
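At its simplest, keyword-detection moderation scans the transcript produced by Speech to Text for entries in a blocklist. Here is a minimal sketch in Python; the word list and function names are hypothetical, for illustration only:

```python
import re

# Hypothetical blocklist for illustration; a real one would be curated
# for the business and use case.
BLOCKLIST = {"darn", "heck"}

def flag_transcript(transcript: str, blocklist: set) -> list:
    """Return the blocklisted words found in a Speech to Text transcript."""
    # Lowercase and split into word tokens before matching.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return [tok for tok in tokens if tok in blocklist]

hits = flag_transcript("well darn that was close", BLOCKLIST)
print(hits)  # ['darn']
```

Note that this only works as well as the transcript it receives: if the Speech to Text mishears a word, the keyword matcher inherits that mistake, which is exactly the failure mode discussed below.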
Foundation of Voice Moderation: Efficient Speech to Text
In order to build a Voice Moderation solution, you will need to use Speech to Text technology. Speech to Text does exactly as the name implies: it converts a user's speech into text that can be used for various downstream tasks, such as Moderation. It's important to know that Speech to Text is a Machine Learning technology, which means that even when it is trained for a specific industry, there can still be instances where the system makes a mistake. This means that on occasion it will "hear" something other than what was actually said.
These Speech to Text mistakes may require special attention in the context of Voice Moderation, because inappropriately implemented moderation can be detrimental to the user experience. The moderation system should not miss obvious cases of toxicity, but accusing innocent users of bad behavior too often can be even more harmful. For this reason, it is important for businesses to consider the tradeoffs in building a Moderation solution that is super strict vs more tolerant. Doing this requires an understanding of Speech to Text False Negatives and False Positives.
What are Speech to Text False Negatives and False Positives?
False Negatives and False Positives are the most important variables to consider when using Speech to Text technology to build a Voice Moderation solution.
- False Negative: A scenario where the Speech to Text misses a "bad" word/phrase and transcribes it as something "good".
- False Positive: A scenario where the Speech to Text transcribes a "good" word/phrase as something "bad".
Regardless of how accurate the Speech to Text system is, it is impossible to completely avoid False Negatives and False Positives. However, what businesses do have control over is the sensitivity of their Speech to Text. With this decision in Speech to Text sensitivity come moderation tradeoffs to consider between False Negatives and False Positives.
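The two error types can be counted directly once you know, for each flagging decision, whether the content was actually bad and whether it was flagged. A small sketch, with hypothetical example data:

```python
def count_errors(events):
    """Count errors from a list of (actually_bad, flagged) boolean pairs.

    False Negative: content was bad but was not flagged.
    False Positive: content was not bad but was flagged.
    """
    false_negatives = sum(1 for bad, flagged in events if bad and not flagged)
    false_positives = sum(1 for bad, flagged in events if not bad and flagged)
    return false_negatives, false_positives

# One example of each outcome: hit, miss, false alarm, correct pass.
events = [(True, True), (True, False), (False, True), (False, False)]
print(count_errors(events))  # (1, 1)
```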
Tradeoffs between Speech to Text False Positives and False Negatives with Voice Moderation
There is a direct tradeoff between Speech to Text False Negatives and False Positives. If you reduce one variable, the other is going to increase. To better understand this, let's look at an example using a Human Moderator and 2 different scenarios.
When the Moderator hears profanity, assume there are 2 levels of confidence to gauge how well they heard the word: HIGH and LOW.
- If confidence is HIGH, they are more likely to have heard correctly.
- If confidence is LOW, they are more likely to have heard incorrectly.
The Moderator has 2 Possible Scenarios:
- Scenario 1) Only flag when confidence is HIGH.
- Scenario 2) Flag when confidence is HIGH or LOW.
Scenario 1 - HIGH Confidence
In this scenario, the Moderator will flag less often as they will be more selective in the flagging process.
- This means there will be more False Negatives, since some cases that should have been flagged will be missed.
- This means there will be fewer False Positives, as the Moderator tends to only flag correct cases.
Scenario 2 - HIGH or LOW Confidence
In this scenario, the Moderator will flag more frequently as they will be less strict in the flagging process.
- This means there will be fewer False Negatives, since the Moderator is actively flagging even when not 100% certain about what they heard.
- This means there will be more False Positives, since the Moderator likely made more mistakes with the lower confidence threshold.
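The two scenarios above amount to choosing a confidence threshold for flagging. A minimal sketch, using synthetic confidence scores invented for illustration:

```python
def rates(detections, threshold):
    """Count errors for a threshold over (confidence, actually_bad) pairs.

    Flag whenever confidence >= threshold.
    """
    fn = sum(1 for conf, bad in detections if bad and conf < threshold)
    fp = sum(1 for conf, bad in detections if not bad and conf >= threshold)
    return fn, fp

# Synthetic (confidence, actually_bad) pairs for illustration only.
detections = [(0.9, True), (0.6, True), (0.4, True),
              (0.8, False), (0.5, False), (0.2, False)]

# Scenario 1: strict, HIGH-confidence-only flagging.
print(rates(detections, 0.7))  # (2, 1) -> more misses, fewer false alarms
# Scenario 2: lenient, flag even on LOW confidence.
print(rates(detections, 0.3))  # (0, 2) -> fewer misses, more false alarms
```

Lowering the threshold trades False Negatives for False Positives, exactly as in the Human Moderator example.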
Decision - Weighing False Negatives and False Positives
In this situation, the business has to decide whether to be more or less strict in their moderation flagging process.
- If they favor a less strict policy, they will follow Scenario 1, where the Moderator must have a high degree of confidence before flagging.
- If they favor a more strict policy, they will follow Scenario 2, where the Moderator is less selective in flagging but ensures no profanity makes it through.
The tradeoff decision made in this Human Moderator example is exactly how you should look at tradeoffs with Speech to Text for Moderation!
Finding the Right Balance of False Negatives and False Positives for Your Business
For each level of False Negatives there is a matching level of False Positives. This pair is called the Operating Point of the system. Finding the ideal Operating Point with Speech to Text is the main goal when it comes to Moderation; however, where this Operating Point lies is determined by the context of an individual business and its goals.
For example, if no profanity is tolerated, you might opt for a higher False Positive rate, where flagging is then verified by a human in the loop. Alternatively, if you would like to fully automate moderation and do not have the resources to verify every flag with a human, you would opt for a smaller False Positive rate (but may have to tolerate more False Negatives as a consequence).
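Choosing an Operating Point can be framed as a small search: among candidate thresholds, pick the one that minimizes False Negatives while keeping False Positives within whatever budget the business can afford to review. A sketch, again with synthetic scores:

```python
def pick_operating_point(detections, fp_budget):
    """Return (threshold, fn, fp) minimizing False Negatives
    subject to False Positives <= fp_budget.

    detections: list of (confidence, actually_bad) pairs; flag when
    confidence >= threshold.
    """
    best = None
    for t in sorted({conf for conf, _ in detections}):
        fn = sum(1 for c, bad in detections if bad and c < t)
        fp = sum(1 for c, bad in detections if not bad and c >= t)
        if fp <= fp_budget and (best is None or fn < best[1]):
            best = (t, fn, fp)
    return best

# Synthetic (confidence, actually_bad) pairs for illustration only.
detections = [(0.9, True), (0.6, True), (0.4, True),
              (0.8, False), (0.5, False), (0.2, False)]

# If we can afford to review at most 1 false alarm:
print(pick_operating_point(detections, fp_budget=1))  # (0.6, 1, 1)
```

A team with human reviewers in the loop would simply raise `fp_budget`; a fully automated pipeline would shrink it and accept the extra False Negatives.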
In order to use Speech to Text for Moderation, you need a solution that can be adjusted to target the specific type of language that is relevant to your business and use case. By using a dedicated Speech to Text model that is trained with data that is relevant for your use case, Speechly can reduce false negatives without adversely affecting the false positive rate.
If you would like to learn more about how Speechly’s Speech to Text technology can be adapted to target specific jargon from your business or industry, reach out to the team with our Contact Us form.
Cover Photo by Karolina Grabowska: Pexels