By Ottomatias Peura
December 10, 2019

If you are just beginning your journey as a voice enthusiast, the number of weird terms can feel overwhelming. What is an utterance, and what is the difference between speech recognition and voice recognition? Here’s a list of important terms and their explanations.

A

Artificial Intelligence (AI): A computer program or model that is used to solve complex tasks. For example, a model that can transform audio input to text is artificial intelligence. True artificial intelligence doesn’t exist, at least not yet. Machine learning is a sub-field of artificial intelligence, and in recent years some have begun using the terms artificial intelligence and machine learning interchangeably.

Algorithms: A finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation. Basic algorithms in machine learning include clustering, classification, regression, and recommendation.

Accuracy: The share of correct predictions made by the model (see Model). The higher the accuracy, the better the model performs in a specific task.
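
As a rough illustration of the definition above, accuracy can be computed by counting how many predictions match the true labels. This is a minimal sketch; the function name `accuracy` and the spam example data are made up for illustration.

```python
def accuracy(predictions, labels):
    """Return the share of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical spam-filter output versus the true labels.
predicted = ["spam", "not spam", "spam", "not spam"]
actual = ["spam", "not spam", "not spam", "not spam"]
print(accuracy(predicted, actual))  # 3 of 4 predictions are correct
```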

Acoustic Model: A representation that maps “the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.”

Algorithm: A finite sequence of well-defined, computer-implementable instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Alexa: A smart speaker and voice platform built by Amazon. Alexa is available as a stand-alone product and licensable by third-party hardware and software providers using the Alexa Voice Service.

Alexa Voice Service (AVS): A set of APIs, tools, and hardware kits that can be used to create voice-powered applications, hardware, and services on top of Alexa. Launched in 2015.

Alexa Skill: Software that third-party developers build to add new functionality to Alexa. An Alexa Skill is to Alexa what a mobile app is to the iOS or Android mobile platforms. Developers use the Alexa Skills Kit to build Alexa skills, submit such skills to Amazon for certification, and upon certification and publication of the skills, enable end-users of Alexa (through Echo products and products Alexa enabled through AVS) to discover and enable the skills from the Alexa Skills Store.

Alexa Skills Store: The “App Store” of Alexa Skills. Alexa users can use the Skills Store to find new skills for their smart speaker.

Alexa Skills Kit (ASK): An SDK that is needed to build and launch an Alexa skill. If you want to build an Alexa Skill, this is what you’ll need.

Always Listening Device: A typical setup for smart speakers; the device is constantly listening to audio input, but it starts recording and handling the audio only when it hears the wake word.

Artificial neural networks (ANN): Computing systems that are vaguely inspired by biological neural networks that constitute animal or human brains. Such systems can learn to perform tasks by considering training data without being programmed specifically to do so. Use cases for artificial neural networks include speech recognition and computer vision.

ASR or Automatic Speech Recognition: Technology that transforms speech (audio signal) into text. It incorporates knowledge of linguistics, computer science, and electrical engineering.

ASR Tuning: The activity of iteratively configuring and training the ASR model for better accuracy and speed.

Attribute: See Feature.

B

Barge-in: The ability of the user to interrupt system prompts while those prompts are being played. If barge-in is enabled in an application, then as soon as the user begins to speak, the system stops playing its prompt and begins processing the user’s input.

Baseline: The reference point against which the performance of a model (see Model) is compared. Baseline helps developers of a model to assess whether a new model is more useful than the old one.

Batch: Batch, or minibatch, is the set of examples used in one iteration of model training using gradient descent.
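
To make the idea concrete, here is a small sketch of how a dataset might be split into minibatches, one per training iteration. The helper `minibatches` and the toy data are assumptions for illustration; the gradient descent step itself is left out.

```python
def minibatches(examples, batch_size):
    """Yield consecutive slices of `examples`, each with at most `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Ten toy training examples split into batches of four.
examples = list(range(10))
batches = list(minibatches(examples, batch_size=4))
for batch in batches:
    # In real training, one gradient descent update would use this batch.
    print(batch)
```

Note that the last batch can be smaller than the others when the dataset size is not a multiple of the batch size.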

Bayesian networks: Also known as causal networks, belief networks, and decision networks, Bayesian networks are graphical models for representing multivariate probability distributions. They aim to model conditional dependence, and in some uses causation, by representing conditional dependence as edges in a directed graph.

Bias: The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

Big data: Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with large amounts of structured or unstructured data that is too complex to be handled by usual data-processing software and tools.

Bixby: Samsung’s Alexa/Google Home competitor, launched in July 2017.

C

Confidence Score: A probability that is returned by the ASR and that reflects the confidence that the ASR has in the result provided. A 1.00 confidence means that the ASR is as certain as it can be that it has returned the correct result. For example, for a certain audio input, the ASR might have a confidence of 0.85 that the result is “flight” and a confidence of 0.55 that the result is “light”.

Confidence Threshold: A threshold below which ASR results are ignored. For example, if the Confidence Threshold were set to 0.6, the result “light” would not be considered in the previous example. This is important because, in theory, there is an endless number of very improbable results for any given audio input.
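
The “flight” / “light” example can be sketched in a few lines of code. The data structure (a list of text and score pairs) and the name `filter_by_confidence` are assumptions for illustration, not the output format of any particular ASR.

```python
CONFIDENCE_THRESHOLD = 0.6


def filter_by_confidence(hypotheses, threshold=CONFIDENCE_THRESHOLD):
    """Keep only the (text, score) results whose score reaches the threshold."""
    return [(text, score) for text, score in hypotheses if score >= threshold]


# Hypothetical ASR results for one audio input.
hypotheses = [("flight", 0.85), ("light", 0.55)]
accepted = filter_by_confidence(hypotheses)
print(accepted)  # only "flight" survives the 0.6 threshold
```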

The Cooperative Principle: A concept introduced by Paul Grice, a philosopher of language. It describes how listeners and speakers act cooperatively and mutually accept one another to be understood in a particular way. The cooperative principle is divided into four maxims of conversation, see Gricean maxims.

Classification: A supervised machine learning technique. In classification, data is categorized into predefined classes. For example, an email can either be ‘spam’ or ‘not spam’.

Clustering: Clustering is a machine learning technique that assigns examples to clusters. It’s a method of unsupervised learning. Common use cases include visualization, segmentation, and search.

Cortana: Microsoft’s voice assistant, launched in the United States in April 2014. Competitors include Alexa and Google Assistant.

D

Deep learning: Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised. Deep learning is loosely inspired by the workings of the human brain rather than being a faithful simulation of it.

Deep Neural Network (DNN): A deep neural network is a neural network with a certain level of complexity, a neural network with more than two layers. Deep neural networks use sophisticated mathematical modeling to process data in complex ways. The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship. (See ANN)

Dictation software: A computer application that transcribes each word the users say into text. See ASR or Speech-to-text.

Directed Dialog: A directed dialog system presents users with a range of options and prompts them to pick one. For example, the system asks “Where do you want to fly?” and waits for a user response that should include a city name and nothing else. See Mixed-initiative Dialog.

Discovery (skill discovery): The process of discovering and learning what a certain system can do. Discovery in voice is a non-trivial problem since unlike in traditional apps, the user can’t click through and browse different menus and dialogs.

Disfluency: Utterances such as “a-ha”, “hmm”, “oh”, etc. that are used when hesitating or when claiming retention of a speaking dialog turn.

E

Earcon: An earcon is a brief, distinctive sound that represents a specific event or conveys other information. It’s analogous to an icon in traditional user interfaces.

Echo (The Amazon Echo): An entry-level smart speaker released by Amazon in November 2014. The name has since come to represent the whole family of Amazon smart speakers (e.g. Echo Dot, Echo Tap, Echo Look, and Echo Show).

Echo Cancellation: A technique that filters out audio coming out of a device while processing incoming audio for speech recognition into that same device. By being “aware” of the audio signal that it is generating, a system processing an audio signal that includes that signal along with, say, spoken audio from a user, would then be able to process more accurately the signal coming from the user.

End-pointing: The recognition of the start and the end of a speaker’s utterance for the purpose of ASR processing.
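
One very simple way to approximate end-pointing is to look at the energy of short frames of audio and treat the first and last “loud” frames as the boundaries of the utterance. The sketch below assumes this energy-based approach; the function name, frame size, and threshold are made up, and production systems typically use more robust voice activity detection.

```python
def find_endpoints(samples, frame_size=4, energy_threshold=0.01):
    """Return (start, end) sample indices of the detected speech, or None."""
    voiced = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        # Mean squared amplitude of the frame.
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_threshold:
            voiced.append((start, start + len(frame)))
    if not voiced:
        return None
    # Start of the first voiced frame, end of the last voiced frame.
    return voiced[0][0], voiced[-1][1]


# Toy signal: silence, a burst of "speech", then silence again.
signal = [0.0] * 8 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 8
endpoints = find_endpoints(signal)
print(endpoints)
```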

Entity: An entity modifies an intent (see Intent). For example, if a user says “Book me a flight to Boston”, the entities are “flight” and “Boston”. Entities are sometimes referred to as slots.

F

False Accept (false positive): An instance where the model mistakenly accepted a sample as a valid response.

False Reject (false negative): An instance where the model mistakenly rejected a sample as an invalid response.

Far-Field Speech Recognition: Speech recognition technology that processes speech from a distance (usually 10 feet away or more). Smart speakers are typically powered by far-field speech recognition. If speech recognition is performed on a hand-held mobile device (e.g. Siri or Google Assistant), it is called near field speech recognition. The main difference between the two is the amount of ambient noise the recognizer has to cope with.

Feedforward neural network (FNN): A feedforward neural network is a simpler artificial neural network in which connections between the nodes do not form a cycle, unlike in recurrent neural networks (see RNN). Information flows in one direction only.

Feature: Features are individual measurable properties of the phenomenon. Features typically act as the input for training your model (see Model). For example, if you have a dataset that consists of height, weight, and sex, these would be the features of your data.

Few-shot learning: Few-shot learning is a machine learning approach, usually employed in classification, designed to learn effective classifiers from only a small number of training examples. See One-shot learning.

G

Google Action: The equivalent of an Alexa Skill for Google Assistant. It allows third-party developers to build apps for Google Assistant.

Google Assistant: A virtual assistant developed by Google. Primarily available on mobile and smart home devices (smart speakers). Competes directly with Apple’s Siri or Microsoft’s Cortana.

Google Home: Google’s smart speaker device family, launched in October 2016. Uses Google Assistant.

Grammar: Typically a file or a database that contains the list of words and phrases to be recognized by a speech application. Grammars may also contain bits of programming logic to aid the application.

The Gricean Maxims: A set of specific rational principles observed by people who obey the Cooperative Principle (see above). These principles enable effective verbal conversational communication between humans. The four maxims are: 1. Maxim of quality: try to make your contribution one that is true. 2. Maxim of quantity: make your contribution as informative as is required and not more informative than required. 3. Maxim of relation: make your contribution relevant to the topic of discussion. 4. Maxim of manner: be perspicuous.

H

Heuristics: A technique designed for solving a problem more quickly when classic methods are too slow, or for finding an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, or precision for speed. In a way, it can be considered a shortcut. A heuristic function, also called simply a heuristic, is a function that ranks alternatives in search algorithms at each branching step based on available information to decide which branch to follow. For example, it may approximate the exact solution.

Houndify: A voice platform launched in 2015 by music identifier service SoundHound. It enables developers to integrate speech recognition and Natural Language Processing systems into hardware and other software systems.

Hidden Markov Model (HMM): A statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable (i.e. hidden) states. Before deep learning algorithms took over speech recognition, HMMs and Gaussian mixture models (GMMs) were the two must-learn technologies in the field.

I

Intent: Intent is the intention or meaning of the user that is extracted from speech or text. For example, if the user says “I want to fly to Boston”, the user intent is to book a flight (to Boston, namely). See Entity.

J

Junction tree algorithm: A method used in machine learning to compute marginal distributions in general graphs. In essence, it entails performing belief propagation on a modified graph called a junction tree. The graph is called a tree because it branches into different sections of data; nodes of variables are the branches.

K

Keras: An open-source neural-network library for Python designed to enable fast experimentation with deep neural networks.

L

Language model: A language model captures the regularities in the spoken language and is used by the speech recognizer to estimate the probability of word sequences. One of the most popular methods is the so-called n-gram model, which attempts to capture the syntactic and semantic constraints of the language by estimating the frequencies of sequences of n words.
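
As a toy illustration of the n-gram idea with n = 2 (a bigram model), the probability of a word given the previous word can be estimated from counts in a text corpus. The three-sentence corpus and the function names below are made up, and real language models add smoothing for word pairs that were never seen.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count how often each word appears as a context and each word pair occurs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts


def bigram_probability(model, previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    context_counts, bigram_counts = model
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]


corpus = ["book a flight", "book a hotel", "book a flight to boston"]
model = train_bigram_model(corpus)
print(bigram_probability(model, "a", "flight"))  # "flight" follows "a" in 2 of 3 cases
print(bigram_probability(model, "a", "light"))   # never seen, so 0.0 without smoothing
```

A recognizer could use such estimates to prefer “book a flight” over the acoustically similar “book a light”.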

Label: Label is the output of your model (see Model). For example, if you have a model that predicts the sex based on height and weight, the sex would be the label. (see Feature)

Lexicon: A list of words with pronunciations. For a speech recognizer, it includes all words known by the system, where each word has one or more pronunciations with associated probabilities.

M

Machine learning: Machine learning is a set of algorithms and statistical models that computer systems use to perform a specific task without being given explicit instructions, relying on patterns and inference instead. Machine learning algorithms are designed to learn and improve over time when exposed to new data. It is seen as a subset of artificial intelligence (see Artificial Intelligence).

Mixed-initiative Dialog: A more complex dialog than directed dialog (see Directed Dialog): an interaction in which the user may unilaterally issue a request rather than simply provide exactly the information asked for by system prompts. For instance, when the system asks, “Where do you want to fly?”, instead of answering that particular question, the user may answer, “I’d like to travel by train.” A mixed-initiative system would recognize that the user has volunteered information that the system was going to request later on, either in addition to the answer asked for (additive) or instead of it (substitutive). Such a system would accept this information, remember it, and continue the conversation. In contrast, a directed dialog system would rigidly insist on the destination of the flight and would not proceed until it received that piece of information.

Model: A mathematical representation of a real-world process. To generate a machine learning model, you need to provide training data for a machine learning algorithm to learn from. Generally, the more good-quality data is used to train the model, the better it becomes. Typically these models are built with tools such as TensorFlow or PyTorch.

Multi-modality: Multimodal interaction provides the user with multiple modes of interacting with a system, for example with voice and touch.

N

Natural Language Processing (NLP): Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP extracts the meaning or intent from a user’s utterance or typed text. (See Intent)

N-Best: In speech recognition, given an audio input, an ASR (see ASR) returns a list of results, with each result ascribed a “confidence score” (see Confidence Score). N-Best refers to the “N” results that were returned by the ASR and that were above the confidence threshold (see Confidence Threshold). For instance, if the user were to say “flights”, the 2-best results returned by the ASR could be “flights” and “lights”.
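
A minimal sketch of picking the N-best results: drop everything below the confidence threshold, sort the rest by confidence, and keep the top N. The function name `n_best` and the scores are invented for illustration.

```python
def n_best(hypotheses, n, threshold=0.0):
    """Return the n highest-scoring (text, score) results at or above the threshold."""
    above = [h for h in hypotheses if h[1] >= threshold]
    above.sort(key=lambda h: h[1], reverse=True)
    return above[:n]


# Hypothetical ASR results for the spoken word "flights".
hypotheses = [("lights", 0.62), ("flights", 0.88), ("fights", 0.31)]
top_two = n_best(hypotheses, n=2, threshold=0.6)
print(top_two)
```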

Near Field Speech Recognition: In contrast to far-field speech recognition, which processes speech spoken from a distance (usually of 10 feet or more), near field speech recognition technology handles spoken input from a device that is held within a few inches to, at most, two feet of the speaker. The most common use case for near field speech recognition is mobile devices.

No-input Error: A situation where the ASR (see ASR) erroneously doesn’t detect any speech input from the user.

No-Match Error: A situation where the ASR is not able to match the user’s input to responses it expects the user to provide.

O

Out of Scope (OOS) Error: See No-match Error.

One-shot learning: A machine learning approach that tries to learn from a single example.

Overfitting: A model is overfitting if it fits the training data too well and generalizes poorly to new data (see Underfitting). The problem of overfitting is usually addressed with regularization or early stopping.

Outlier: A data point that differs significantly from other observations. Outliers can cause problems in statistical analysis and machine learning algorithms.

P

Part of speech: A part of speech (abbreviated PoS or POS) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. In English, the eight traditional parts of speech are noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. Words that are assigned to the same part of speech generally display similar syntactic behavior (they play similar roles within the grammatical structure of sentences) and sometimes similar morphology, in that they undergo inflection for similar properties.

Persona: The personality of a voice-enabled system (formal, chatty, friendly, etc.) that comes across in the way the system engages with the user. The persona is influenced by factors such as the perceived gender of the system, the type of language the system uses, tone of speech, and how the system handles errors.

Phoneme: An abstract representation of the smallest phonetic unit in a language that conveys a distinction in meaning. For example, the sounds /d/ and /t/ are separate phonemes in English because they distinguish words such as do and to. To illustrate phoneme differences across languages, the two /u/-like vowels in the French words tu and tout are not distinct phonemes in English, whereas the two /i/-like vowels in the English words seat and sit are not distinct phonemes in French.

Progressive Prompting: The technique of beginning an exchange by providing the user with minimal instructions and elaborating on those instructions only if encountering response errors (e.g., no-input, no-match, etc.).

Prompt: The instruction or response that a system gives to the user. For example “What’s your name?”

PyTorch: PyTorch is a developer tool and a machine learning library, used for a wide range of machine learning applications such as natural language understanding or computer vision. It is open-source and primarily developed by Facebook.

R

Regression: A statistical approach that estimates the relationships among variables and predicts future outcomes or items in a continuous data set by solving for the pattern of past inputs, such as linear regression in statistics. Regression is foundational to machine learning and artificial intelligence. It predicts a real-valued label given an unlabeled example.
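
As a small worked example, the sketch below fits a straight line to a handful of points using ordinary least squares and then predicts a real-valued label for a new input. The height and weight numbers are invented, and `fit_line` is just an illustrative name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: height in cm (feature) and weight in kg (label).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 74, 82]

slope, intercept = fit_line(heights_cm, weights_kg)
predicted_weight = slope * 175 + intercept  # prediction for an unseen height
print(round(slope, 2), round(intercept, 2), round(predicted_weight, 2))
```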

Regularization: Regularization is a technique to make the fitted function smoother. This helps to prevent overfitting. The most widely used regularization techniques are L1, L2, dropout, and weight decay.

Reinforcement learning: Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. It is about taking suitable action to maximize reward in a particular situation. In reinforcement learning, the model gets either rewards or penalties for the actions it performs, and the goal is to maximize the total reward. The model is given the rules for getting rewards but no hint on how to achieve those rewards.

Recognition Tuning: The activity of configuring the ASR’s settings to optimize and improve recognition accuracy and processing speed. (See Model)

Recurrent neural networks (RNN): A class of artificial neural networks where connections between units form a directed graph along a temporal sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs, which makes them applicable to tasks such as speech recognition. (See FNN)

S

Sampling: A process in statistical analysis in which a number of observations are taken from a larger data set. These observations are selected using a predefined process so that extrapolations can be made. For example, in political polls, usually about 2000 respondents (a sample size of 2000) represent the whole population.

Sample: The result of sampling; a set of data selected from a population using a predefined process.

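Sample rate: The sample rate is the number of samples (typically of audio or video) carried per second, measured in Hz or kHz (one kHz being 1000 Hz). The sample rate determines the maximum signal frequency that can be reproduced, which is half the sample rate (the Nyquist frequency). For example, audio sampled at 16 kHz can represent frequencies up to 8 kHz.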

Secondary Orality: Secondary orality is a concept introduced by Walter J. Ong. It is orality that is dependent on literate culture and the existence of writing, such as a television or radio anchor reading the news.

Siri: A virtual assistant that is part of Apple Inc.’s iOS, iPad OS, watchOS, macOS, and tvOS operating systems. It was released by Apple in 2011.

Speech recognition: The capability of a computer system to decipher spoken words and phrases and transcribe them into text. Also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). See ASR.

Speech To Text (STT): See ASR.

Speaker–dependent: Speech recognition software that can only recognize the speech of users it is trained to understand. Speaker–dependent software allows for very large vocabularies but is limited to understanding only select speakers.

Speaker–independent: Speech recognition software that can recognize a variety of speakers, without any training. Speaker–independent software generally limits the number of words in vocabulary but is the only realistic option for applications that must accept input from a large number of users.

Spoken language understanding (SLU): Natural language understanding for spoken language. Spoken language understanding systems extract meaning out of the speech. (See NLU, intent)

Supervised learning: Supervised learning is a machine learning process in which the model is trained using a labeled training dataset, that is, examples paired with the correct output (see Label).

Structured data: Clearly defined data with easily searchable patterns, such as a spreadsheet that contains the same set of columns for each row. For example, a list of employees that all have age, sex, salary, and title.

T

Tapered Prompting: The technique of eliding a prompt or a piece of a prompt in the context of a multistep interaction or a multi-part system response. For example, instead of the system asking repetitively, “What is your level of satisfaction with our service?” “What is your level of satisfaction with our pricing?” “What is your level of satisfaction with our cleanliness,” the system would ask: “What is your level of satisfaction with our service?” “How about our pricing?” “And our cleanliness?” The technique is used to provide a more natural and less robotic-sounding user experience.

TED-LIUM: A corpus that is often used for training speech recognition models. It contains transcriptions of TED talks. The corpus is available as a free download.

Text to Speech (TTS): A technology that converts text to synthesized speech that’s spoken by the system, in contrast to speech-to-text (see ASR or Speech To Text). TTS is usually used when the list of possible responses to be spoken by the system is very large, and therefore, recording all of the options is not practical.

TensorFlow: A free and open-source software library created by Google for dataflow and differentiable programming across a range of tasks. It is a symbolic math library and is often used for machine learning applications.

Training: The process of determining the ideal parameters comprising a model.

True negative: An instance in which the model correctly predicted the negative class. For example, if the model identified correctly that the email is not spam.

True positive: An instance in which the model correctly predicted the positive class. For example, if the model identified correctly that the email was spam.
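
The four outcomes (true positive, true negative, false accept and false reject) can be counted side by side. This sketch reuses the spam example; the function name `confusion_counts` and the data are assumptions for illustration.

```python
def confusion_counts(predictions, labels, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for predicted, actual in zip(predictions, labels):
        if predicted == positive and actual == positive:
            counts["TP"] += 1  # correctly flagged as spam
        elif predicted != positive and actual != positive:
            counts["TN"] += 1  # correctly left alone
        elif predicted == positive:
            counts["FP"] += 1  # false accept: flagged, but not spam
        else:
            counts["FN"] += 1  # false reject: spam that slipped through
    return counts


# Hypothetical predictions of a spam filter versus the true labels.
predicted = ["spam", "spam", "not spam", "not spam", "spam"]
actual = ["spam", "not spam", "not spam", "spam", "spam"]
counts = confusion_counts(predicted, actual)
print(counts)
```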

Turing test: Developed by Alan Turing in 1950, it is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.

U

Utterance: An uninterrupted chain of spoken language. An utterance is a natural, complete unit of speech bounded by the speaker’s silence, most commonly breaths or other pauses.

Underfitting: Underfitting refers to a model that can neither model the training data nor generalize to new data. Poor performance in machine learning is often caused by either overfitting or underfitting the data. Causes of underfitting include training on the wrong set of features, training with too low a learning rate, and training with too high a regularization rate. See Overfitting.

Unstructured data: Data that does not have a pre-defined data model or is not organized in a pre-defined manner. For example, a set of Wikipedia pages that all have different amounts of text, videos, photos, tables, and other information.

Unsupervised learning: The training of an artificial intelligence (AI) algorithm using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance. Unsupervised learning algorithms can perform more complex processing tasks than supervised learning systems.

V

Vocabulary: The total list of words the speech engine will be comparing an utterance against. The vocabulary is made up of all the words in all active grammars.

Voice First: Interfaces or applications are said to be “Voice first” when the primary interface between the user and the system is based on voice. It’s analogous to mobile-first. Voice first does not necessarily mean “Voice Only”. A Voice-First interface can have an additional, adjunct interface (usually a visual one, see Multimodality) that supplements the user experience. For instance, one can ask if the nearest post office is open, receive the answer verbally, and then be provided with additional details about the post office location on a visual interface (mobile app, desktop browser).

Voice Biometrics: Technology that identifies specific markers within a given piece of audio that was spoken by a human being and uses those markers to uniquely model the speaker’s voice. The technology is the voice equivalent of technology that takes a visual fingerprint of a person and associates that unique fingerprint with the person’s identity. Voice Biometrics technology is used for both Voice Identification and Voice Verification.

Voice Identification: The capability of discriminating a speaker’s identity among a list of possible speaker identities based on the characteristics of the speaker’s voice input. Voice ID systems are usually trained by being provided with samples of speaker voices.

Voice Verification: The capability of confirming an identity claim based on a speaker’s voice input. Unlike Voice Identification, which attempts to match a given speaker’s voice input against a universe of speaker voices, Voice Verification compares a voice input against a given speaker’s voice and provides a likelihood match score. Voice Verifications are usually done in an “Identity Claim” setting: the user claims to be someone and then is “challenged” to verify their identity by speaking.

Voice User Interface (VUI): The voice equivalent of a Graphical User Interface (GUI). VUI is a type of user interface that allows users to interact with electronic devices by speaking and listening to spoken text or “earcons”. Common voice user interfaces include Apple Siri on iPhone, Google Assistant on Android, and Alexa and other smart speakers.

W

Watson: A question-answering computer system capable of answering questions posed in natural language, developed in IBM’s DeepQA project.

Wake Word: The word or phrase that “wakes up” an always-listening device. Familiar examples include “OK Google” and “Alexa”. The wake word is needed to distinguish actual voice commands meant to be processed by the device from normal, human-to-human speech.

Weight: The importance of a feature in a certain model. If the weight of a feature is zero, it has no contribution to the model.

Weak AI: Also known as narrow AI, weak AI is artificial intelligence that is focused on one narrow task.

Word Accuracy: The word accuracy (WAcc) is a metric used to evaluate speech recognizers. The percent word accuracy is defined as %WAcc = 100 - %WER. It should be noted that the word accuracy can be negative. The Word Error Rate (see WER) is a more commonly used metric and should be preferred to the word accuracy.

Word error rate (WER): The word error rate (WER) is the commonly used metric to evaluate speech recognizers. It is a measure of the average number of word errors taking into account three error types: substitution (the reference word is replaced by another word), insertion (a word is hypothesized that was not in the reference) and deletion (a word in the reference transcription is missed). The word error rate is defined as the sum of these errors divided by the number of reference words. Given this definition, the percent word error can be more than 100%. The WER is somewhat proportional to the correction cost.
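
The definition above can be sketched in code by aligning the reference and the hypothesis word by word with edit distance, which yields the minimum number of substitutions, insertions and deletions. The function name `word_error_rate` and the example sentences are made up; evaluation toolkits usually also normalize the text (casing, punctuation) first.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + insertions + deletions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # distance[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    distance = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        distance[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        distance[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = distance[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = distance[i - 1][j] + 1
            insertion = distance[i][j - 1] + 1
            distance[i][j] = min(substitution, deletion, insertion)
    return distance[len(ref)][len(hyp)] / len(ref)


reference = "book me a flight to boston"
hypothesis = "book me a light to boston please"
wer = word_error_rate(reference, hypothesis)  # 1 substitution + 1 insertion over 6 words
print(f"WER: {100 * wer:.1f}%  WAcc: {100 - 100 * wer:.1f}%")
```

Because insertions are counted against the number of reference words, a hypothesis much longer than the reference can push the WER above 100%, which in turn makes the word accuracy negative.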

X

Y

Z


Found errors in our definitions? Don’t agree with something, or just want to thank us for the resource? Send your feedback to hello@speechly.com!