Speechly has existed for about five years now. We are a team of 13 experienced software developers and machine learning experts, and for most of those five years we’ve been operating in stealth mode, focusing on building our core technologies. Now it’s time to tell the world what we’ve achieved so far. We are building a developer tool that improves the touch screen user experience with voice functionalities. We don’t believe that smart speakers and voice assistants are the best use case for voice; instead, voice should be thought of as an add-on to the user interfaces of current mobile applications and websites. Voice is a modality, not a complete user interface.
Touch screen user interfaces definitely need improvement: while selecting from a few options is easy, selecting, say, 30 items from an inventory of 20,000 is cumbersome.
Typing is notoriously hard, too. Most people speak about three times faster than they type, and with fewer errors. In short, voice is a great solution for information-heavy tasks. But while there are good solutions for speech recognition, there are really no tools that enable developers to build the kind of user interfaces we’ve envisioned for voice.
2020 was the first year we really published something out in the wild. We’ve built our technology for the past five years, and Speechly is finally at a stage where a developer can configure a model, integrate it into their application, and build an awesome voice user interface. In this post, I’ll summarize our achievements.
We run our own ASR and NLU technologies that provide both a transcript and meaning (intents and entities) in real time. During 2020 we achieved significant improvements in both ASR and NLU accuracy.
We evaluate the accuracy of our engine by transcribing the data we receive both with our own engine and with the Google Cloud Speech API. Based on our results, our Spoken Language Understanding is 15% more accurate than Google in a typical voice user interface task.
ASR is a hard problem, so this is not to claim that our technology is better than Google’s in all cases. It means that when building voice user interfaces, Speechly outperforms Google in most cases, even without training the model separately for a given use case.
In a real deployment, Speechly can be optimized further by retraining the model on actual user data. This typically improves accuracy by another 10-15%.
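Accuracy comparisons like the one above are conventionally based on word error rate (WER): the word-level edit distance between the reference transcript and the engine’s output, divided by the reference length. As a minimal sketch of the metric (not Speechly’s actual evaluation code):

```typescript
// Word error rate: Levenshtein distance between word sequences,
// normalized by the length of the reference transcript.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,     // deletion
        dp[i][j - 1] + 1,     // insertion
        dp[i - 1][j - 1] + sub // substitution or match
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}
```

For example, `wordErrorRate("show me blue shirts", "show me blue shirt")` is one substitution over four reference words, i.e. 0.25.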
During 2020, we published three client libraries that make integrating Speechly into an application simple and fast. Handling the gRPC API, real-time audio streaming and, of course, parsing the results is cumbersome work, and the client libraries take most of that workload off the developer.
We have created a simple tutorial application for each of the client libraries, for a gradual learning curve on all platforms.
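To give a feel for the result-parsing work the client libraries do, here is a sketch of turning a streamed SLU segment into a UI action. The `Segment` shape and the `toFilterAction` helper are illustrative assumptions for this post, not the client libraries’ actual types:

```typescript
// Simplified shape of a real-time SLU result segment (an assumption
// for illustration; the actual client library types differ).
interface Entity { type: string; value: string }
interface Segment { intent: string; entities: Entity[]; isFinal: boolean }

// Turn a streamed segment into an action the UI can apply immediately,
// even before the segment is final.
function toFilterAction(
  segment: Segment
): { action: string; filters: Record<string, string> } {
  const filters: Record<string, string> = {};
  for (const e of segment.entities) filters[e.type] = e.value;
  return { action: segment.intent, filters };
}

// "Show me blue shirts" might stream in as:
const segment: Segment = {
  intent: "filter",
  entities: [
    { type: "color", value: "blue" },
    { type: "product", value: "shirts" },
  ],
  isFinal: false,
};
// toFilterAction(segment)
// → { action: "filter", filters: { color: "blue", product: "shirts" } }
```

The point is that the application only deals with intents and entities; the libraries hide the audio streaming and protocol details.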
Speechly is a tool for building real-time voice functionalities that integrate seamlessly with existing touch or web user interfaces.
We don’t think smart speakers or “voice-only” solutions are the best way to use voice; we advocate multimodality and real-time visual feedback instead.
Our Speechly Annotation Language (SAL) is a syntax we use to annotate the example utterances that our models are trained on. In 2020 we added many new features to SAL.
With these features, developers and designers can create complex voice user interfaces with a minimal number of example utterances. And because the same model can be used on all platforms, the user experience stays consistent.
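To illustrate the idea of annotated example utterances, a SAL-style configuration might look roughly like the following. This is an illustrative sketch, not guaranteed to match the exact current SAL syntax; here `*intent` marks the intent and `[value](entity)` annotates an entity:

```
*filter show me [blue](color) [shirts](product)
*filter filter by [red](color)
*add_to_cart add [three](amount) of these to my cart
```

From a handful of examples like these, the trained model generalizes to other colors, products and amounts, so developers don’t have to enumerate every possible utterance.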
When the iPhone nailed the touch screen user experience, one of its key features was a highly responsive user interface that reacted immediately to user input. Responsiveness is just as critical for voice user interfaces.
We improved our latency significantly in 2020, and we can now proudly say that our API is real-time, with a tail latency under 200 milliseconds.
Low latency is key to an intuitive user experience in two ways: first, it enables users to correct themselves naturally by voice, and second, it encourages them to continue with the voice experience.
Compare this to the traditional smart speaker experience, which starts with a wake word that sometimes fails to trigger. Once the wake word is recognized and the user starts speaking, they learn whether they were understood only after they have stopped speaking and the system has processed the input. If the answer is wrong, the user has to start again from the beginning.
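The natural-correction point can be made concrete with a small sketch. If the UI applies each tentative entity update as it streams in, and later values for the same entity type overwrite earlier ones, a spoken correction simply converges on the intended value. The update shape below is a hypothetical simplification, not the client libraries’ actual API:

```typescript
// Apply a stream of tentative entity updates to UI state. Later values
// for the same entity type overwrite earlier ones, so a spoken
// correction ("blue... no, red") converges on the final value.
type Filters = Record<string, string>;

function applyUpdates(updates: { type: string; value: string }[]): Filters {
  const state: Filters = {};
  for (const u of updates) state[u.type] = u.value; // later values win
  return state;
}

// "Show me blue, no, red shirts" streaming in word by word:
const stream = [
  { type: "color", value: "blue" },
  { type: "color", value: "red" }, // spoken correction overwrites blue
  { type: "product", value: "shirts" },
];
// applyUpdates(stream) → { color: "red", product: "shirts" }
```

With sub-200 ms feedback, the user sees the blue filter appear, corrects themselves mid-utterance, and watches it flip to red, instead of waiting for a final answer and starting over.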
In March we published the first version of the Speechly Dashboard, a web application for building and configuring Spoken Language Understanding models with the Speechly Annotation Language.
The Dashboard supports nearly all Speechly features and is the fastest way to get up to speed with our technology. Hundreds of developers have already created models and tried them out in the Speechly Playground.
We redesigned our website to better position our product and hired several new developers and machine learning experts. Our founders have been interviewed on many industry-leading podcasts, and we were nominated as one of Europe’s Hottest Startups.
If you want to work with us and build awesome developer tools for next-generation voice user interfaces, please check our careers page.
Overall, we are pretty happy with our 2020. We’ve built a technology stack that enables efficient user interfaces and a significantly better user experience. In 2021, we’ll focus on showing the world some cool examples of our technology.
If you are interested in staying in the loop, please subscribe to our newsletter below.