Speechly API

This is the official Speechly Quick Start, which helps you get started with developing with Speechly.

Overview

The Speechly API consists of three services: SLU provides spoken language understanding, WLU provides written language understanding, and Identity provides an authentication and identity service.

The difference between SLU and WLU is that SLU is used for audio data, whereas WLU works with text. SLU provides speech recognition (ASR) and natural language understanding (NLU) by extracting intents and entities from the voice data. WLU provides only the NLU part but works with the same model.

Initializing Speechly API

A new stream is started by initializing the SLU engine with a config value. Configuration requires a token that can be obtained through the Identity service. Authentication requires a unique device ID and Speechly application ID.

Once the Identity service is provided with a valid application ID and a device ID, it returns a token that must be sent along with a SLUEvent.Event.START request.

The last part of initializing the Speechly SLU API is configuring the audio input. This configuration is sent as a SLUConfig message and contains the sample rate, the number of channels and the encoding of the audio data, as well as the language used.

User voice input can be of any length, but it must be divided into chunks of less than one megabyte. The minimum sample rate is 8000 Hz (16000 Hz recommended), the minimum number of channels is one, and the only supported encoding so far is signed PCM.
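
The constraints above can be checked on the client before any audio is sent. The following is a minimal sketch, not part of the Speechly API: `AudioConfig`, `split_audio`, the field names, the `"signed_pcm"` label, the `"en-US"` language code and the 32 000-byte default chunk size are all hypothetical, and "one megabyte" is assumed to mean 1 000 000 bytes.

```python
from dataclasses import dataclass
from typing import Iterator

# Assumption: "one megabyte" is read as 10**6 bytes; chunks stay strictly below it.
MAX_CHUNK_BYTES = 1_000_000
MIN_SAMPLE_RATE_HZ = 8000


@dataclass(frozen=True)
class AudioConfig:
    """Hypothetical stand-in for the fields carried by a SLUConfig message."""
    sample_rate_hz: int = 16000   # recommended rate from the text
    channels: int = 1
    encoding: str = "signed_pcm"  # hypothetical label for signed PCM
    language: str = "en-US"       # hypothetical language code

    def validate(self) -> None:
        """Raise ValueError if the config breaks a documented constraint."""
        if self.sample_rate_hz < MIN_SAMPLE_RATE_HZ:
            raise ValueError("sample rate must be at least 8000 Hz")
        if self.channels < 1:
            raise ValueError("at least one channel is required")
        if self.encoding != "signed_pcm":
            raise ValueError("only signed PCM is supported")


def split_audio(pcm: bytes, chunk_bytes: int = 32_000) -> Iterator[bytes]:
    """Yield consecutive slices of pcm, each smaller than one megabyte."""
    if not 0 < chunk_bytes < MAX_CHUNK_BYTES:
        raise ValueError("chunk size must be positive and under one megabyte")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Calling `AudioConfig().validate()` passes with the defaults, and `split_audio` can be applied to audio of any length because only the size of each chunk is limited.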

A sample of a valid audio file can be downloaded here.

After the configuration is done, the stream moves into the “SLU event loop” state.

SLU event loop

[Figure: Speechly audio event loop]

The basic SLU event loop is:

  • Client sends SLUEvent.Event.START event when the user wants to start speaking.
  • Server sends SLUResponse.started when it’s ready to receive audio for an utterance. It also sends the client the `audio_context` UUID.
  • Client sends configuration.
  • Client sends audio chunks. Chunks must be under one megabyte.
  • Server sends SLUResponse for events and utterance results. For each segment in the utterance it sends a rolling ID.
  • Client sends SLUEvent.Event.STOP event when no more audio is to be sent for this utterance.
  • Server processes all audio received until the stop and sends the segment results, finishing with a `finished` response.
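
The client side of the list above can be sketched as a generator that yields messages in the documented order. This is only an illustration: the `("event", ...)`, `("config", ...)` and `("audio", ...)` tuples are hypothetical stand-ins for the real SLUEvent, SLUConfig and audio messages, and waiting for the server's started response is left out.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical message shape: (kind, payload). Not the real wire format.
Message = Tuple[str, object]


def client_messages(config: dict, chunks: Iterable[bytes]) -> Iterator[Message]:
    """Yield the client-side messages for one utterance, in list order."""
    yield ("event", "START")   # the user wants to start speaking
    yield ("config", config)   # audio configuration
    for chunk in chunks:       # audio, already split into chunks under 1 MB
        yield ("audio", chunk)
    yield ("event", "STOP")    # no more audio for this utterance
```

A generator fits here because a streaming client typically consumes an iterator of requests, so audio chunks can be produced lazily while the user is still speaking.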

Only one segment/utterance can be active at a time, but old segments can still send results concurrently while the new segment is already being processed.

If SLUEvent.Event.START is sent before the ongoing utterance has been stopped with SLUEvent.Event.STOP, the whole stream is killed with an error. This ensures that clients are well behaved.
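
Because a second START before a STOP is fatal for the stream, a client may want to catch the mistake locally before it reaches the server. The following is a minimal sketch of such a guard; `UtteranceGuard` and `StreamKilledError` are hypothetical names, and how the server treats a STOP without an active utterance is not stated in this document, so the sketch simply ignores it.

```python
class StreamKilledError(Exception):
    """Hypothetical error standing in for the server killing the stream."""


class UtteranceGuard:
    """Tracks whether an utterance is active so START and STOP stay paired."""

    def __init__(self) -> None:
        self.active = False

    def start(self) -> None:
        if self.active:
            # Mirrors the documented rule: START before STOP is an error.
            raise StreamKilledError("START sent before STOP")
        self.active = True

    def stop(self) -> None:
        # Assumption: a STOP without an active utterance is ignored here.
        self.active = False
```

Calling `start()` and `stop()` around each utterance keeps the client well behaved without a round trip to the server.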

See SLUResponse for information on server sent messages.

Understanding server responses

Once the client starts sending the audio stream, the server starts sending SLUResponses of different kinds. The results can be either tentative or final. Tentative results are typically used for prefetching data and for UI purposes, but once the server sends final results, the tentative results should be discarded and only the final results should be used for application business logic.
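
One way to keep tentative and final results apart on the client is sketched below. The class, its method names and the `(segment_id, transcript, is_final)` shape are assumptions made for illustration; the actual SLUResponse fields are not described in this document.

```python
from typing import Dict, Optional


class SegmentResults:
    """Keeps the latest tentative result per segment until a final one arrives."""

    def __init__(self) -> None:
        self._tentative: Dict[int, str] = {}
        self._final: Dict[int, str] = {}

    def on_result(self, segment_id: int, transcript: str, is_final: bool) -> None:
        if is_final:
            self._final[segment_id] = transcript
            self._tentative.pop(segment_id, None)  # discard tentative data
        elif segment_id not in self._final:
            # Tentative results only matter while no final result exists.
            self._tentative[segment_id] = transcript

    def for_display(self, segment_id: int) -> Optional[str]:
        """Best available text for the UI: final if present, else tentative."""
        return self._final.get(segment_id, self._tentative.get(segment_id))

    def for_business_logic(self, segment_id: int) -> Optional[str]:
        """Only final results are returned; tentative ones never are."""
        return self._final.get(segment_id)
```

Keeping two separate accessors makes it harder to act on a tentative result by accident: the UI reads `for_display`, while application logic reads `for_business_logic`.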

The audio stream that is sent to the server can include one or more segments and can be of any length. A segment is typically one sentence, but it can consist of several sentences, and one sentence can include more than one segment. A segment can contain zero or one intents, any number of entities and values for these entities. The audio stream is divided into segments by user silence or when the server recognizes a new intent.

The audio stream stays open until it is stopped with SLUEvent.Event.STOP. If the user pauses and then continues speaking, the utterance is divided into two segments. The first segment gets final results and the new segment begins getting its first tentative results.

[Figure: How the Speechly API responds to user audio. The Speechly server starts sending tentative results as soon as it gets audio.]

Glossary

Utterance: In Speechly terminology, an utterance is something that the end user says. The utterance is sent to the Speechly API, which returns the utterance transcript, intent and entities and, if necessary, divides the utterance into segments.

Segment: A segment is a part of an utterance that contains one or zero intents. One utterance can contain one or more segments. For example, “Turn on living room and change the color to red” is an utterance that consists of two segments, because there are two intents: one about turning on the light and the other about changing the color of the light.

Intent: An intent is the classification of an utterance and the purpose of that utterance. It’s analogous to a method in programming: for example, the intent of the utterance “Add milk to my shopping list” is to add a product to a shopping list. A segment can have zero or one intent.

Entity: An entity is a modifier for an intent. For example, in the utterance “Add milk to my shopping list”, the intent of adding products (ADD_PRODUCT) to a shopping list has a modifier (PRODUCT) that has the value milk. A segment can have any number of entities and values for these entities.
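
The relationships between the glossary terms can be pictured as a small data model. The dataclasses below are a hypothetical illustration of those relationships, not the types returned by the Speechly API; only the ADD_PRODUCT and PRODUCT names come from the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    type: str   # e.g. "PRODUCT"
    value: str  # e.g. "milk"


@dataclass
class Segment:
    transcript: str
    intent: Optional[str] = None  # zero or one intent
    entities: List[Entity] = field(default_factory=list)  # any number


@dataclass
class Utterance:
    segments: List[Segment] = field(default_factory=list)  # one or more


# The shopping list example from the glossary, as a single segment.
shopping = Utterance(segments=[
    Segment(
        transcript="Add milk to my shopping list",
        intent="ADD_PRODUCT",
        entities=[Entity(type="PRODUCT", value="milk")],
    )
])
```

Modelling the intent as an optional field and the entities as a list reflects the glossary directly: a segment has at most one intent but any number of entities.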


Last updated by ottomatias on March 2, 2020 at 14:11 +0200
