Interspeech 2021: Take-aways on Automatic Speech Recognition

Janne Pylkkönen

Sep 29, 2021

Our report from Interspeech, the largest scientific conference focusing on speech science and technology.

This year’s conference was held at the beginning of September in Brno, Czech Republic. Typically there would be some 2000 attendees, but due to Covid-19, most attendees joined virtually this year. I was there on site with 350+ other researchers, and here are my impressions of the scientific offerings in the field of automatic speech recognition (ASR).

New datasets

ASR is a data-heavy field. Industry leaders use tens of thousands of hours of transcribed speech to train their models, but most ASR research has relied on much smaller publicly available corpora. Only very recently have opportunities for using larger non-proprietary speech corpora emerged. This year Facebook published Multilingual LibriSpeech (MLS), but that is limited to read-speech data. Now at Interspeech, two large ASR corpora were published that extend the available domains:

SPGISpeech offers 5000 hours of financial calls with rich formatting: https://www.isca-speech.org/archive/pdfs/interspeech_2021/oneill21_interspeech.pdf
GigaSpeech is a 10,000h multi-domain corpus drawing data from audiobooks, podcasts, and YouTube videos: https://www.isca-speech.org/archive/pdfs/interspeech_2021/chen21o_interspeech.pdf

Also worth checking out is Facebook's paper, which showed that an ASR model trained on publicly available corpora, combined with fine-tuning on target data, works well for real-world tasks:

Rethinking Evaluation in ASR: Are Our Models Robust Enough?, Tatiana Likhomanenko (Facebook, USA), Qiantong Xu (Facebook, USA), Vineel Pratap (Facebook, USA), Paden Tomasello (Facebook, USA), Jacob Kahn (Facebook, USA), Gilad Avidov (Facebook, USA), Ronan Collobert (Facebook, USA) and Gabriel Synnaeve (Facebook, France): https://www.isca-speech.org/archive/pdfs/interspeech_2021/likhomanenko21_interspeech.pdf

wav2vec

wav2vec is an unsupervised (or "self-supervised", as they like to call it) method for learning speech representations. Its latest incarnation, wav2vec 2.0, is gaining popularity: at Interspeech there were 10 papers mentioning wav2vec in the title, and many more that used it in their experiments. The benefit of using such pre-trained representations is a drastic drop in training data requirements, which makes them attractive for limited-resource scenarios.

A nice analysis on the nature of wav2vec 2.0 was provided by Facebook in their paper:

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training, Wei-Ning Hsu (Facebook, USA), Anuroop Sriram (Facebook, USA), Alexei Baevski (Facebook, USA), Tatiana Likhomanenko (Facebook, USA), Qiantong Xu (Facebook, USA), Vineel Pratap (Facebook, USA), Jacob Kahn (Facebook, USA), Ann Lee (Facebook, USA), Ronan Collobert (Facebook, USA), Gabriel Synnaeve (Facebook, France) and Michael Auli (Facebook, USA)
https://www.isca-speech.org/archive/pdfs/interspeech_2021/hsu21_interspeech.pdf

Trends in ASR

Rather than introducing a multitude of complex new network architectures, this year's focus appeared to be more on the practical side of ASR: reducing streaming latency, fitting models on-device, and reducing the overall computation and memory footprint. Various improvements were presented towards these goals, especially for transformer-transducer models.

For examples, see the following papers:

Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning, Songjun Cao (Tencent, China), Yueteng Kang (Tencent, China), Yanzhe Fu (Tencent, China), Xiaoshuo Xu (Tencent, China), Sining Sun (Tencent, China), Yike Zhang (Tencent, China) and Long Ma (Tencent, China)
https://www.isca-speech.org/archive/pdfs/interspeech_2021/cao21b_interspeech.pdf
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling, Tara N. Sainath (Google, USA), Yanzhang He (Google, USA), Arun Narayanan (Google, USA), Rami Botros (Google, USA), Ruoming Pang (Google, USA), David Rybach (Google, USA), Cyril Allauzen (Google, USA), Ehsan Variani (Google, USA), James Qin (Google, USA), Quoc-Nam Le-The (Google, USA), Shuo-Yiin Chang (Google, USA), Bo Li (Google, USA), Anmol Gulati (Google, USA), Jiahui Yu (Google, USA), Chung-Cheng Chiu (Google, USA), Diamantino Caseiro (Google, USA), Wei Li (Google, USA), Qiao Liang (Google, USA) and Pat Rondon (Google, USA)
https://www.isca-speech.org/archive/pdfs/interspeech_2021/sainath21_interspeech.pdf
Reducing Streaming ASR Model Delay with Self Alignment, Jaeyoung Kim (Google, USA), Han Lu (Google, USA), Anshuman Tripathi (Google, USA), Qian Zhang (Google, USA) and Hasim Sak (Google, USA)
https://www.isca-speech.org/archive/pdfs/interspeech_2021/kim21j_interspeech.pdf

Looking for something more exotic? Check out the research on non-autoregressive ASR:

An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition, Ruchao Fan (University of California at Los Angeles, USA), Wei Chu (PAII, USA), Peng Chang (PAII, USA), Jing Xiao (PAII, USA) and Abeer Alwan (University of California at Los Angeles, USA)
https://www.isca-speech.org/archive/pdfs/interspeech_2021/fan21b_interspeech.pdf

RNN transducers are still going strong. Several publications had adopted the Hybrid Autoregressive Transducer (HAT) approach for combining external language models with the end-to-end model.
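
To give a rough feel for what "combining an external language model with the end-to-end model" can look like at decoding time, here is a minimal sketch of a log-linear score combination in the spirit of HAT-style fusion: the end-to-end score is adjusted by subtracting a weighted estimate of the model's own internal language model and adding a weighted external language model score. This is an illustration under stated assumptions, not the implementation from any of the papers. The function names, the weights, and the log-probability values are all hypothetical.

```python
from typing import Dict, List


def fused_score(
    log_p_e2e: float,
    log_p_internal_lm: float,
    log_p_external_lm: float,
    ilm_weight: float = 0.3,  # hypothetical weight, normally tuned on dev data
    ext_weight: float = 0.5,  # hypothetical weight, normally tuned on dev data
) -> float:
    """Log-linear combination of hypothesis scores (all inputs are log-probabilities).

    The internal LM estimate is subtracted so that the external LM does not
    simply double-count the language prior the end-to-end model already learned.
    """
    return log_p_e2e - ilm_weight * log_p_internal_lm + ext_weight * log_p_external_lm


def best_hypothesis(
    hypotheses: List[Dict],
    ilm_weight: float = 0.3,
    ext_weight: float = 0.5,
) -> Dict:
    """Pick the hypothesis with the highest fused score from an n-best list."""
    return max(
        hypotheses,
        key=lambda h: fused_score(h["e2e"], h["ilm"], h["ext"], ilm_weight, ext_weight),
    )


if __name__ == "__main__":
    # Made-up scores for two competing transcripts of the same audio.
    n_best = [
        {"text": "wreck a nice beach", "e2e": -3.8, "ilm": -5.0, "ext": -9.0},
        {"text": "recognize speech", "e2e": -4.0, "ilm": -6.0, "ext": -5.0},
    ]
    print(best_hypothesis(n_best)["text"])
```

In this toy example the end-to-end score alone slightly prefers the first transcript, while the fused score prefers the second because the external language model finds it far more plausible. In a real system the combination would usually be applied inside beam search rather than on a finished n-best list.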

Speechly's approach, “Fast Text-Only Domain Adaptation of an RNN-Transducer Prediction Network”, was published officially at Interspeech and is a lighter-weight solution, but more about that later!

And more...

Interspeech is a lot more than just an ASR conference, with too much to cover in a single blog post. One interesting and timely topic was the COVID-19 challenge: detecting infection based on cough and speech samples!

The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates, Björn W. Schuller et al.
https://www.isca-speech.org/archive/pdfs/interspeech_2021/schuller21_interspeech.pdf

You can browse the full list of publications at https://www.isca-speech.org/archive/interspeech_2021/index.html

About Speechly

Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), highly accurate custom models for any domain, and privacy and scalability for hundreds of thousands of hours of audio.
