Voice is the new touch screen
When the iPhone entered the market, touch screens, with their intuitive and easy-to-learn usage, replaced the physical keyboard as our main point of interaction with hand-held devices such as mobile phones. Touch screens felt incredibly natural to users.
Instead of using buttons to control a device, users quickly understood, largely without hints or instructions, that swiping left goes back and swiping right goes forward, following the left-to-right (sinistrodextral) logic most users are accustomed to. Considered a revolutionary innovation, touch screens quickly spread beyond handheld devices to all kinds of user interfaces, such as car dashboards, shopping centre monoliths, and thermostats.
This same phenomenon can be seen happening today with voice-enabled services and devices. Voice as an addition to the current touch-based user interfaces is the next frontier of how humans interact with the technology around them.
Transferring money to a friend isn’t considered a particularly difficult task with modern digital banking services. However, even this simple task via a bank’s mobile app requires the following steps: opening the application, tapping to make a transfer, choosing a friend from a list, entering the sum to be transferred, setting the date of the transfer, and lastly, confirming the payment.
This same process could be completed via voice simply by saying, “Transfer 100 euros to Mike tomorrow”, and confirming the payment on the touch screen. Voice-enabled services make human interaction with devices more convenient, easier, faster, and, most importantly, incredibly natural and human-like.
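The single-command flow above can be sketched in code. The following toy rule-based parser is purely illustrative (the intent and entity names are invented and do not reflect the actual Speechly API); it shows the kind of structured result a voice command should yield for a one-tap confirmation:

```python
import re

def parse_transfer(utterance: str) -> dict:
    """Toy rule-based parser: map a spoken transfer command to an
    intent plus entities. A real SLU engine does this statistically."""
    pattern = (r"transfer (?P<amount>\d+) (?P<currency>euros?|dollars?) "
               r"to (?P<recipient>\w+)(?: (?P<date>today|tomorrow))?")
    match = re.fullmatch(pattern, utterance.lower().rstrip("."))
    if match is None:
        return {"intent": "unknown", "entities": {}}
    entities = {k: v for k, v in match.groupdict().items() if v is not None}
    return {"intent": "transfer", "entities": entities}

# One utterance replaces five taps; the entities are ready for a
# single touch-screen confirmation.
result = parse_transfer("Transfer 100 euros to Mike tomorrow")
```

The point is the shape of the output, not the parsing technique: everything the multi-step app flow collects arrives in one structured result.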
Not only are voice-enabled services and products significantly improving customer experience and efficiency, but the input of data via voice can also power mission-critical tasks where safety is a top priority. As speaking does not disrupt actions or require eye contact, the use of voice in such tasks not only enables opportunities for new capabilities, but also crucially increases safety.
Technological advancement has always focused on bringing devices closer to users, so that human-computer interaction (HCI) becomes as natural and human-like as possible. Voice is the technology that makes the next major step in user interface design possible.
Status quo of voice-powering solutions
The voice-tech industry already boasts a wide variety of solutions for building voice capabilities. The most common voice horizontals include:
Speech-to-text (STT) software recognises spoken language and converts it into text using automatic speech recognition (ASR) technology. Speech-to-text is a relatively basic technology that draws on linguistics, computer science, and electrical engineering.
Speech-to-text solutions convert speech audio into phonemes, then words, then sentences. Language models define how words come together in sentences, and that is where problems with speech-to-text solutions commonly occur: out-of-vocabulary queries and specialised lexicons tend to be hard to recognise, and homophones are difficult for STT engines to disambiguate. After transcription, the text still needs a natural language understanding (NLU) system to extract the intent of an utterance. Cascading the two stages compounds their errors and results in poorer performance, especially when domain-specific vocabulary is involved.
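To make the error-compounding point concrete, here is a toy cascaded pipeline. Both stages are deliberately simplistic stand-ins, not real ASR or NLU components: a homophone mistake in the first stage guarantees a wrong intent in the second, and the NLU has no access to the audio to recover it.

```python
def toy_asr(phonemes: list) -> str:
    """Stage 1: map phoneme strings to words with a tiny lexicon.
    'flower' and 'flour' share the phonemes /f l aw r/, so the
    lexicon can only ever pick one of them."""
    lexicon = {"f l aw r": "flour", "b ay": "buy"}
    return " ".join(lexicon.get(p, "<unk>") for p in phonemes)

def toy_nlu(text: str) -> str:
    """Stage 2: keyword-spotting NLU run on the transcript only."""
    if "flower" in text:
        return "order_flowers"
    if "flour" in text:
        return "add_grocery_item"
    return "unknown"

# The user asked to buy a *flower*, but stage 1 emits the homophone
# "flour", and stage 2 has no way to recover the original meaning.
intent = toy_nlu(toy_asr(["b ay", "f l aw r"]))
```

An end-to-end system avoids this failure mode by extracting the intent from the audio directly instead of from an intermediate transcript.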
Voice ecosystems allow developers to create voice-based technologies for use within third-party ecosystems, driven by their use in connected speaker products, smartphones, and computers.
Ecosystems such as Amazon Alexa, Google Assistant, and Apple Siri are great examples. Privacy has become an increasingly important concern with such ecosystems. There are thousands of apps available on smart speakers, which leads to brands getting lost: the ecosystem, rather than your application, becomes the point of interaction for user engagement. We’ve written before on why smart speakers are not the future of voice.
Conversational AI refers to the use of messaging apps, speech-based assistants, and chatbots to automate communication and create personalized customer experiences. In another blog post, Tips for a beginner in voice UI design, we stated: “your users don’t want to converse, they want to command”. Conversational AI is a promising technology, yet its usage tends to be more entertaining than useful and efficient. In current use cases, a simple task like placing an order can turn into a lengthy conversation with a device, instead of one command that is easy for a user to formulate and easy for a machine to understand.
Speechly’s approach to the future of voice
Speechly develops technology with which developers and companies can easily add voice-enabled UI features to their existing products and services. Speechly is an end-to-end spoken language understanding (SLU) API. Our technology combines speech recognition and natural language understanding. End-to-end SLU allows us to extract the meaning of a user’s utterances in real time for increased accuracy and efficiency.
Real-time speech processing also improves user experience: while talking, a user gets simultaneous feedback, and a device can start working on a task even before the user finishes an utterance. We call Speechly a speech-to-intent solution, which emphasises its nature: extracting meaning from speech in real time.
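The streaming behaviour can be illustrated with a small sketch (the intent names and word-by-word input are invented for illustration): a tentative intent is emitted after every word, so a client can react before the utterance ends.

```python
def tentative_intent(transcript: str) -> str:
    """Toy intent classifier over a growing partial transcript."""
    if transcript.startswith("turn on"):
        return "lights_on"
    if transcript.startswith("turn off"):
        return "lights_off"
    if transcript.startswith("turn"):
        return "pending"
    return "unknown"

def stream_intents(words):
    """Yield (partial transcript, tentative intent) after each word,
    the way a real-time SLU engine streams partial results."""
    seen = []
    for word in words:
        seen.append(word)
        partial = " ".join(seen)
        yield partial, tentative_intent(partial)

# After only two words the intent is already resolved, so the UI can
# start acting while the user is still speaking.
updates = list(stream_intents(["turn", "on", "the", "kitchen", "lights"]))
```

The design choice this illustrates is feedback latency: each partial result lets the interface show or do something immediately, rather than waiting for end-of-utterance.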
Another key feature of Speechly is multi-modality. Modern user interfaces must support multiple modalities as input/output channels between user and machine. Examples of modalities include vision (display, XR), touch, and voice (both for input and output).
Despite all our love for voice tech, not all tasks should be replaced with voice. For example, in cases where a user needs to work with large texts or navigate within a large image or a map, voice can be a great support but is not suited to be the main channel of interaction.
Hence Speechly is designed to be used alongside other modalities such as vision (display, XR) and touch, in cases where a voice supplement is suitable and beneficial. With the help of Speechly, you can augment existing ways of interaction by introducing voice to your service and customise the solution to your needs using our speech-to-intent API.
What can be done with Speechly?
Speechly enables developers to add voice functionalities to any app on any platform with our unique speech-to-intent API. Voice has unlimited possible applications in various industries and use cases. Below are three key ways voice can be used.
Voice search is the most common and widespread application of voice technology. There are a few main reasons why voice-enabled search is so widely used.
First, as mentioned before, voice is fast: searching with voice is on average 3.7x faster than typing.
Second, voice is particularly well-suited for use on mobile devices. The growing global usage of mobile, coupled with the always-online nature of mobile devices, makes voice search a function of growing popularity.
And third, for many users voice search is simply much more convenient: they do not have to focus on typing, it feels natural, and it is more engaging as well. Therefore, by allowing your users to search with voice, you can significantly improve customer experience and, what is increasingly important nowadays, better cater your service to mobile traffic. The Speechly API can easily help you add voice search functionality to your service.
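As a sketch of what a voice search integration can extract, a single spoken query can be mapped straight to structured search filters. The grammar and filter names below are invented for illustration; a production SLU engine would handle this far more robustly.

```python
import re

def parse_search(utterance: str) -> dict:
    """Toy grammar turning a spoken product search into filter values."""
    utterance = utterance.lower()
    filters = {}
    price = re.search(r"under (\d+) euros", utterance)
    if price:
        filters["max_price"] = int(price.group(1))
    for colour in ("red", "blue", "black"):
        if colour in utterance:
            filters["colour"] = colour
            break
    return filters

# One spoken sentence replaces several taps through filter menus.
filters = parse_search("Show me red running shoes under 100 euros")
```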
Speechly demo for grocery shopping
SOK, one of the largest grocery retailers in the Nordics, used Speechly in building their mobile application, where SOK customers can search for groceries and build their shopping list via voice.
Voice commands are another typical use of voice technology. Voice commands refer to any tasks that a user can ask a device to perform with spoken requests.
Most current use cases for voice commands are limited to basic assistance of a user, for example, with controlling smart home devices, built-in mobile assistants, and smart speakers.
We believe that voice commands will inevitably continue to expand to more complex cases, where voice commands are not only more convenient to use, but also improve efficiency and increase safety.
With the Speechly API you can add voice command capabilities to any use case, regardless of how comprehensive and diverse the requested tasks are. Speechly technology allows you to build a custom solution while guaranteeing top accuracy and real-time performance.
Speechly demo for virtual reality
The demo above shows how our real-time speech-to-intent engine works in a virtual reality environment.
Voice-enabled data input
Another common use case for voice interfaces is data input, where a device does not need to react to a user’s utterance, but instead understand the meaning of what has been said and codify it.
There are many potential applications for voice-enabled data input. One of them is described in our article Turn any web form into a voice form, where we suggest that many existing web forms should allow data input by voice. Such form-like interfaces are common in both consumer and professional use, for example in manufacturing and logistics, where workers have to fill in various reports frequently.
While most current solutions provide speech-to-text capability only, Speechly can transcribe a user’s utterance and extract its meaning in real time. This means Speechly allows raw data (text) entry and, at the same moment the user is speaking, extracts the meaning of that input.
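A sketch of how extracted entities could populate a form as results stream in (the field and entity names are invented and do not reflect the actual Speechly API):

```python
# Each partial SLU result carries entities; every result immediately
# updates the matching form fields, while unmatched fields stay empty.
FORM_FIELDS = ("departure", "destination", "date")

def fill_form(form: dict, entities: dict) -> dict:
    """Copy recognised entities into their matching form fields."""
    updated = dict(form)
    for field, value in entities.items():
        if field in FORM_FIELDS:
            updated[field] = value
    return updated

form = {field: None for field in FORM_FIELDS}
# Hypothetical partial results for "from Helsinki to London on June 1st":
for entities in ({"departure": "Helsinki"},
                 {"destination": "London"},
                 {"date": "June 1st"}):
    form = fill_form(form, entities)
```

Because the entities arrive while the user speaks, the form fills in field by field rather than all at once after the utterance ends.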
Speechly demo for airline booking
The Speechly solution can power data entry in various use cases. The demo above shows how Speechly allows filling in a structured web form in one go.
Try out Speechly now
Speechly is a simple entry point into a voice-enabled future. Our spoken language understanding API allows you to create a custom voice user interface, whether that’s voice search, voice commands, or data entry. Speechly performs with almost human-level accuracy.