- The most popular voice assistants (Alexa, Siri, Google) all use half-duplex architectures, which means the user and the assistant must take turns to speak – you cannot interrupt
- This turn-taking requirement limits the versatility of half-duplex systems because neither party can act until the other has finished their turn; this slows system response times and can be tedious for the user when the voice assistant misinterprets the request
- Half-duplex architectures also limit use cases such as real-time proactive voice chat moderation; that is why you often see these solutions showing significant processing delays and only after-the-fact issue flagging - transcription processing followed by text analysis is a half-duplex system design
- Full-duplex architectures enable bi-directional communication as both parties are always listening even when speaking or acting
- Full-duplex systems are less common today but offer valuable features because they employ real-time understanding where the system begins predicting the user intent from the very first word uttered
- This means that users can speak to correct the AI’s understanding as soon as it is apparent there is an issue which enables more efficient interactions
- It also means full-duplex can perform actual proactive moderation of live voice chat because the system is not batching text to be analyzed but instead analyzing the meaning of the user speech in parallel with transcription
- This truly proactive moderation feature can make a big difference when toxic material is uploaded or toxic behavior is occurring – the difference between a few seconds and a couple of minutes can have a big impact as Twitch recently learned
Whether you call it speech AI, conversational AI, voice AI, or prefer some other term, you most likely assume it means turn-based communication. That is not surprising. General-purpose voice assistants such as Alexa and Siri are rooted in this model, and they are the primary point of reference for most people familiar with conversational interactions or even chatbots.
It begins with a human, on their turn, making a request to a leading voice assistant. The voice assistant waits for the human to complete the request (i.e., the utterance). On its turn, the voice assistant starts by processing the full statement through speech recognition and natural language understanding and then responds. That response might be via text-to-speech, an audible sound, an image, or by completing a task. However, you typically cannot engage the AI again until it completes its response. Only then can the human user engage again, followed once more by the voice assistant.
The technical term for these turn-based conversational models is half-duplex. Only one of the two communicating parties can communicate at a time. That means one-at-a-time, turn-taking communication: human, then AI, then human, then AI, and so on. That isn’t very humanlike.
Humans typically use what is called full-duplex communication when interacting with each other. While half-duplex systems only enable information to travel in one direction at a time, full-duplex communications enable simultaneous information flow in multiple directions. No one is required to wait for their turn. This is a key engineering differentiator for Speechly and one reason why customers seek us out for capabilities that they cannot implement with half-duplex systems.
Few people realize how radically full-duplex communications transform what is possible in conversational interactions. The real-time nature of a full-duplex architecture also enables other novel use cases, such as real-time voice chat moderation. Full-duplex voice AI was also recently in the news, but more on that in a minute.
Don’t Take Turns, Barge In
When was the last time you attempted to interrupt Alexa or Siri when they were speaking? How did that work out for you? These assistants will drone on to complete what they believe their task to be, and you simply have to wait. There is no concept of “barging in” while the other party is talking or processing information and deciding what to do. This limitation is true even though a barge-in could help the conversation more efficiently and accurately meet the user’s goal.
The one way you can barge in on Alexa or Siri is to utter their wake word. However, that essentially resets the context, and the user is required to start over instead of building upon the progress the conversation has made toward the goal. This is relevant for information sharing and task completion. Let’s consider some full-duplex communication examples.
Full-duplex information sharing example:
|1. "Who were the lead actors in Blade Runner?"||2. There was Rutger Hauer, Harrison...|
|3. "No. In the sequel."||4. Harrison Ford again, Ryan Gosling, Ana de Armas, and Robin Wright.|
|Human||A full-duplex AI|
|1. "What is the weather like today?"||2. The weather in New York City is...|
|3. "I mean in Boston."||4. The weather in Boston is clear and 68 degrees right now with a high of 75 and cloud cover forming in the afternoon.|
Full-duplex task completion example:
|1. "Can you hand me a plate?"||2. The person begins to hand over a green plate.|
|3. "The white one would be better."||4. The person takes back the green plate and hands over a white plate.|
|Human||A full-duplex AI|
|1. "Show me basketball shoes."||2. A variety of shoes begin to populate a screen.|
|3. "Show me only Nike shoes."||4. All non-Nike shoes are removed from the screen.|
|5. "Only red."||6. All shoes that are not red are removed from the screen.|
The misunderstanding of user intent may be due to incomplete information from the requesting party. However, in each case, the requesting party can easily refine their request based on the first indications of activity by the responding party.
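The shoe example above can be sketched as a session that keeps its context between utterances, so each correction narrows the current result instead of restarting the query. This is only an illustrative sketch: the `Shoe` and `Session` names and the catalog contents are hypothetical and do not describe any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Shoe:
    name: str
    brand: str
    color: str


# Hypothetical catalog, invented purely for illustration.
CATALOG = [
    Shoe("Model A", "Nike", "red"),
    Shoe("Model B", "Nike", "white"),
    Shoe("Model C", "Adidas", "red"),
    Shoe("Model D", "Adidas", "black"),
]


@dataclass
class Session:
    """Holds the filters gathered so far so a correction refines the result."""

    filters: dict = field(default_factory=dict)

    def refine(self, **new_filters: str) -> list:
        # Merge the new constraint into the existing context instead of resetting it.
        self.filters.update(new_filters)
        return [
            shoe for shoe in CATALOG
            if all(getattr(shoe, key) == value for key, value in self.filters.items())
        ]


session = Session()
all_shoes = session.refine()              # "Show me basketball shoes."
nike_only = session.refine(brand="Nike")  # "Show me only Nike shoes."
red_nike = session.refine(color="red")    # "Only red."
print([shoe.name for shoe in red_nike])
```

A wake-word interruption, by contrast, would behave like creating a new `Session()` for every utterance, which discards the filters already applied.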
Many of these scenarios can be frustrating when using a half-duplex system because you have to wait for an incorrect task to be completed or for the system to deliver the wrong information before starting the query again with more detail. This waiting and inability to introduce real-time collaboration to reach the conversation’s goal is just as annoying when speaking with a human as with an AI.
Half-duplex systems can get the job done in many cases. Users simply have to adjust their expectations and accept a certain amount of inefficiency and frustration from time to time. The turn-taking constraint also means that these systems cannot fulfill the requirements for many real-time interactions.
The secret behind full-duplex natural language processing (NLP) is real-time understanding. As soon as a user begins to speak, the system begins predicting their intent and starts taking action. It doesn’t wait until the user finishes speaking. That means a correct early prediction could actually fulfill a request before it is fully expressed.
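To make the idea of predicting intent before the utterance ends more concrete, here is a minimal sketch that re-evaluates a prediction after every incoming word. The keyword table and function names are assumptions made for illustration; a production system would use a trained language understanding model rather than keyword matching.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical keyword-to-intent table, standing in for a real model.
INTENT_KEYWORDS = {
    "weather": "get_weather",
    "shoes": "search_products",
    "actors": "lookup_cast",
}


def predict_intent(words: List[str]) -> Optional[str]:
    """Return the first intent whose keyword appears in the words heard so far."""
    for word in words:
        intent = INTENT_KEYWORDS.get(word.lower().strip("?.,!"))
        if intent is not None:
            return intent
    return None


def streaming_intents(word_stream: Iterable[str]) -> Iterator[Tuple[int, Optional[str]]]:
    """Yield (words_heard, current_prediction) after every incoming word."""
    heard: List[str] = []
    for word in word_stream:
        heard.append(word)
        yield len(heard), predict_intent(heard)


utterance = "What is the weather like today?".split()
updates = list(streaming_intents(utterance))
# Number of words heard when a prediction first became available.
first_hit = next(count for count, intent in updates if intent is not None)
print(f"Intent known after {first_hit} of {len(utterance)} words")
```

In this toy run the intent is available after the fourth of six words, which is the point at which a full-duplex system could begin acting while the user is still speaking.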
If you have multimodal feedback such as a screen, full-duplex also provides another powerful feature. As the system is visually fulfilling the request, the user can see what the AI is doing and correct an inaccurate understanding. You can see a video example below.
Full-Duplex Application for Voice Chat and UGC Moderation
Real-time functionality also enables novel use cases such as voice chat moderation for games and social networks. If there is toxic behavior or harassment in progress, you want to identify it immediately and begin taking action. Without real-time analysis, you have to wait until after the perpetrator stops speaking to begin your analysis.
You see this in many voice chat moderation implementations today. Many are only able to review the transcript of voice chat long after the conversation is over. It is an audit-based, reactive approach. Others attempt to provide information sooner but often with a 30-second to multi-minute delay. Again, that delay typically means the conversation may be over or may have had additional time to escalate with a more severe negative impact. This delayed reaction is better than the audit-based approach, but it undermines a platform’s ability to be proactive and quickly mitigate harm.
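The difference between the two approaches can be illustrated with a small sketch that compares when a problem term would be flagged. Everything here is an assumption made for illustration: the blocklist, the timestamps, and the function names are hypothetical, processing time is ignored, and real moderation would analyze meaning with a trained model rather than match single words.

```python
from typing import List, Optional, Tuple

TimedWord = Tuple[float, str]  # (seconds since the speaker started, word)

# Hypothetical blocklist standing in for a real toxicity classifier.
BLOCKED_TERMS = {"threat"}


def batch_flag_time(words: List[TimedWord]) -> Optional[float]:
    """Transcribe-then-analyze: nothing is flagged until the last word arrives."""
    if words and any(word in BLOCKED_TERMS for _, word in words):
        return words[-1][0]
    return None


def streaming_flag_time(words: List[TimedWord]) -> Optional[float]:
    """Analyze in parallel with transcription: flag as soon as the term is heard."""
    for timestamp, word in words:
        if word in BLOCKED_TERMS:
            return timestamp
    return None


# Illustrative utterance with made-up timestamps.
chat = [
    (0.5, "this"), (1.0, "is"), (1.5, "a"), (2.0, "threat"),
    (15.0, "and"), (30.0, "more"),
]
streamed = streaming_flag_time(chat)
batched = batch_flag_time(chat)
print(f"streaming flag at {streamed}s, batch flag at {batched}s")
```

In this toy example the streaming check flags the term 2 seconds in, while the batch check cannot flag it until the speaker finishes at 30 seconds.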
This also impacts use cases for user-generated content (UGC). After toxic or inappropriate content is uploaded or the live stream starts, the race is on. The longer the content is available for consumption, the more users are likely to see it, and the more negative effects accumulate.
Speechly’s Otto Söderlund highlighted what is at stake and how latency can impact serious situations in a recent speech at the Voice 2022 conference. He commented:
“Speed can be really critical in detecting harmful content online. Consider, for example, the Buffalo shootings. It took Twitch two minutes to actually cut down the live stream of the shootings that were broadcasting on their platform. You think, ‘That is not a long time. It is pretty fast.’ Right? But, it wasn’t fast enough to prevent a viral spreading of those videos to the wider public.”
The reason most platforms have moderation is to protect the users from harmful content. Whether it is voice chat or UGC, speed matters. A full-duplex NLP architecture is the only way to provide true real-time proactive moderation of conversations and content. The latency of half-duplex systems introduces higher risk because they are always several steps behind the bad actors.
Full-Duplex in the News
Despite these advantages, almost no enterprises, game makers, or social platforms are aware of the distinction between full-duplex and half-duplex NLP architectures. They typically don’t even know full-duplex features are an option. There are two reasons for this. Users have been conditioned by the expectations set by the half-duplex constraints of general-purpose voice assistants, and vendors have little incentive to expose the limitations of their own technology architecture.
One company that recently joined the full-duplex tribe is SoundHound. The company demonstrated full-duplex online form filling for restaurant orders. This is a great full-duplex use case because the user can see immediately when the AI makes a mistake in the form entry and begin to take corrective action. This may not be as high stakes as some of the moderation use cases, but it definitely can provide a better user experience and higher throughput for order processing.
It was the higher stakes issues of content and voice chat moderation that led several game makers, metaverse virtual worlds, and social media services to ask Speechly for assistance. Their “transcribe and best-efforts response” approaches introduced risk for users and the companies themselves, while also carrying very high costs. Speechly’s full-duplex architecture, plus the real-time natural language understanding engine, plus the ability to deploy on devices, in the cloud, or as a hybrid, turned out to be a unique solution mix to address an intractable problem.
Our expectation is that full-duplex conversational systems will continue to see adoption growth because they are better for users and enable real-time use cases where speed is of the essence. Having other companies like SoundHound discuss this alternative approach is sure to draw more attention to what full-duplex can do as well as the limitations of half-duplex systems. We also expect full-duplex to become a standard requirement for most voice chat moderation solutions going forward.
Let us know if you have any questions about full-duplex NLP and how the technology may be a better fit for your conversational AI or voice chat moderation use case.