As the internet has matured, so have the content types that users can enjoy and the approaches for moderating online activity. Early online content was dominated by text and images, with audio playing a smaller role. Fast forward to the 2020s and you see a very different ecosystem, dominated by images, videos, and even real-time communication channels such as Clubhouse, Twitter Spaces, and Metaverse platforms like Roblox that enable Voice Chat.
Since voice-based moderation has received less attention over the years than text-based moderation, this post will outline how to use Speech to Text technology to efficiently and effectively moderate Voice Chats and voice-based content online. We will focus on keyword-detection-based moderation, in which the task is to find specific instances of bad words (or short phrases) in a user's speech.
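At its simplest, keyword-detection moderation scans the transcript produced by Speech to Text for entries in a blocklist. Here is a minimal sketch in Python; the word list and function names are hypothetical, for illustration only:

```python
import re

# Hypothetical blocklist for illustration; a real one would be curated
# for the business and use case.
BLOCKLIST = {"darn", "heck"}

def flag_transcript(transcript: str, blocklist: set) -> list:
    """Return the blocklisted words found in a Speech to Text transcript."""
    # Lowercase and split into word tokens before matching.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return [tok for tok in tokens if tok in blocklist]

hits = flag_transcript("well darn that was close", BLOCKLIST)
print(hits)  # ['darn']
```

Note that this only works as well as the transcript it receives: if the Speech to Text mishears a word, the keyword matcher inherits that mistake, which is exactly the failure mode discussed below.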
Foundation of Voice Moderation: Efficient Speech to Text
In order to build a Voice Moderation solution, you will need to use Speech to Text technology. Speech to Text does exactly as the name implies: it converts a user's speech into text that can be used for various downstream tasks, such as Moderation. It's important to know that Speech to Text is a Machine Learning technology, which means that even when it is trained for a specific industry, there can still be instances where the system makes a mistake. This means that on occasion it will "hear" something other than what was actually said.
These Speech to Text mistakes may require special attention in the context of Voice Moderation, because inappropriately implemented moderation can be detrimental to the user experience. The moderation system should not miss obvious cases of toxicity, but accusing innocent users of bad behavior too often can be even more harmful. For this reason, it is important for businesses to consider the tradeoffs in building a Moderation solution that is super strict vs more tolerant. Doing this requires an understanding of Speech to Text False Negatives and False Positives.
What are Speech to Text False Negatives and False Positives?
False Negatives and False Positives are the most important variables to consider when using Speech to Text technology to build a Voice Moderation solution.
- False Negative: A scenario where the Speech to Text misses a "bad" word/phrase and transcribes it as something "good".
- False Positive: A scenario where the Speech to Text transcribes a "good" word/phrase as something "bad".
Regardless of how accurate the Speech to Text system is, it is impossible to completely avoid False Negatives and False Positives. However, what businesses do have control over is the sensitivity of their Speech to Text. With this decision in Speech to Text sensitivity come moderation tradeoffs to consider between False Negatives and False Positives.
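The two error types can be counted directly once you know, for each flagging decision, whether the content was actually bad and whether it was flagged. A small sketch, with hypothetical example data:

```python
def count_errors(events):
    """Count errors from a list of (actually_bad, flagged) boolean pairs.

    False Negative: content was bad but was not flagged.
    False Positive: content was not bad but was flagged.
    """
    false_negatives = sum(1 for bad, flagged in events if bad and not flagged)
    false_positives = sum(1 for bad, flagged in events if not bad and flagged)
    return false_negatives, false_positives

# One example of each outcome: hit, miss, false alarm, correct pass.
events = [(True, True), (True, False), (False, True), (False, False)]
print(count_errors(events))  # (1, 1)
```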
Tradeoffs between Speech to Text False Positives and False Negatives with Voice Moderation
There is a direct tradeoff between Speech to Text False Negatives and False Positives. If you reduce one variable, the other is going to increase. To better understand this, let's look at an example using a Human Moderator and 2 different scenarios.
When the Moderator hears profanity, assume there are 2 levels of confidence to gauge how well they heard the word: HIGH and LOW.
- If confidence is HIGH, they are more likely to have heard correctly.
- If confidence is LOW, they are more likely to have heard incorrectly.
The Moderator has 2 Possible Scenarios:
- Scenario 1) Only flag when confidence is HIGH.
- Scenario 2) Flag when confidence is HIGH or LOW.
Scenario 1 - HIGH Confidence
In this scenario, the Moderator will flag less often as they will be more selective in the flagging process.
- This means there will be more False Negatives, since some cases that should have been flagged will be missed.
- This means there will be fewer False Positives, as the Moderator tends to only flag correct cases.
Scenario 2 - HIGH or LOW Confidence
In this scenario, the Moderator will flag more frequently as they will be less strict in the flagging process.
- This means there will be fewer False Negatives, since the Moderator is actively flagging even when not 100% certain about what they heard.
- This means there will be more False Positives, since the Moderator likely made more mistakes with the lower confidence threshold.
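The two scenarios above amount to choosing a confidence threshold for flagging. A minimal sketch, using synthetic confidence scores invented for illustration:

```python
def rates(detections, threshold):
    """Count errors for a threshold over (confidence, actually_bad) pairs.

    Flag whenever confidence >= threshold.
    """
    fn = sum(1 for conf, bad in detections if bad and conf < threshold)
    fp = sum(1 for conf, bad in detections if not bad and conf >= threshold)
    return fn, fp

# Synthetic (confidence, actually_bad) pairs for illustration only.
detections = [(0.9, True), (0.6, True), (0.4, True),
              (0.8, False), (0.5, False), (0.2, False)]

# Scenario 1: strict, HIGH-confidence-only flagging.
print(rates(detections, 0.7))  # (2, 1) -> more misses, fewer false alarms
# Scenario 2: lenient, flag even on LOW confidence.
print(rates(detections, 0.3))  # (0, 2) -> fewer misses, more false alarms
```

Lowering the threshold trades False Negatives for False Positives, exactly as in the Human Moderator example.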
Decision - Weighing False Negatives and False Positives
In this situation, the business has to decide whether to be more or less strict in their moderation flagging process.
- If they favor a less strict policy, they will follow Scenario 1, where the Moderator must have a high degree of confidence before flagging.
- If they favor a more strict policy, they will follow Scenario 2, where the Moderator is less selective in flagging but ensures no profanity makes it through.
The tradeoff decision made in this Human Moderator example is exactly how you should look at tradeoffs with Speech to Text for Moderation!
Finding the Right Balance of False Negatives and False Positives for Your Business
For each level of False Negatives there is a matching level of False Positives. This pair is called the Operating Point of the system. Finding the ideal Operating Point with Speech to Text is the main goal when it comes to Moderation; however, where this Operating Point lies is determined by the context of an individual business and its goals.
For example, if no profanity is tolerated, you might opt for a higher False Positive rate, where flagging is then verified by a human in the loop. Alternatively, if you would like to fully automate moderation and do not have the resources to verify every flag with a human, you would opt for a smaller False Positive rate (but may have to tolerate more False Negatives as a consequence).
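Choosing an Operating Point can be framed as a small search: among candidate thresholds, pick the one that minimizes False Negatives while keeping False Positives within whatever budget the business can afford to review. A sketch, again with synthetic scores:

```python
def pick_operating_point(detections, fp_budget):
    """Return (threshold, fn, fp) minimizing False Negatives
    subject to False Positives <= fp_budget.

    detections: list of (confidence, actually_bad) pairs; flag when
    confidence >= threshold.
    """
    best = None
    for t in sorted({conf for conf, _ in detections}):
        fn = sum(1 for c, bad in detections if bad and c < t)
        fp = sum(1 for c, bad in detections if not bad and c >= t)
        if fp <= fp_budget and (best is None or fn < best[1]):
            best = (t, fn, fp)
    return best

# Synthetic (confidence, actually_bad) pairs for illustration only.
detections = [(0.9, True), (0.6, True), (0.4, True),
              (0.8, False), (0.5, False), (0.2, False)]

# If we can afford to review at most 1 false alarm:
print(pick_operating_point(detections, fp_budget=1))  # (0.6, 1, 1)
```

A team with human reviewers in the loop would simply raise `fp_budget`; a fully automated pipeline would shrink it and accept the extra False Negatives.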
In order to use Speech to Text for Moderation, you need a solution that can be adjusted to target the specific type of language that is relevant to your business and use case. By using a dedicated Speech to Text model that is trained with data that is relevant for your use case, Speechly can reduce false negatives without adversely affecting the false positive rate.
If you would like to learn more about how Speechly’s Speech to Text technology can be adapted to target specific jargon from your business or industry, reach out to the team with our Contact Us form.
Cover Photo by Karolina Grabowska: Pexels