Interspeech 2021: Take-aways on Automatic Speech Recognition
Sep 29, 2021
4 min read
Our report from Interspeech, the largest scientific conference focusing on speech science and technology.
This year’s conference was held at the beginning of September in Brno, Czech Republic. Typically there would be some 2000 attendees at the conference, but due to Covid-19, most attendees joined virtually this year. I was there on site with 350+ other researchers, and here are my impressions of the scientific offerings in the field of automatic speech recognition (ASR).
ASR is a data-heavy field. Industry leaders are using tens of thousands of hours of transcribed speech to train their models, but most ASR research has relied on much smaller publicly available corpora. Only very recently have opportunities for using larger non-proprietary speech corpora emerged. This year Facebook published Multilingual LibriSpeech (MLS), but that is limited to read-speech data. Now at Interspeech, two large ASR corpora were published, which extend the available domains:
wav2vec is an unsupervised (or "self-supervised" as they like to call it) method for learning speech representations. Its latest incarnation, wav2vec 2.0, is gaining popularity: at Interspeech there were 10 papers mentioning wav2vec in the title, and many more that used it in their experiments. The benefit of using such pre-trained representations is a drastic drop in training data requirements, which makes them attractive for limited-resource scenarios.
A nice analysis of the nature of wav2vec 2.0 was provided by Facebook in their paper:
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training, Wei-Ning Hsu (Facebook, USA), Anuroop Sriram (Facebook, USA), Alexei Baevski (Facebook, USA), Tatiana Likhomanenko (Facebook, USA), Qiantong Xu (Facebook, USA), Vineel Pratap (Facebook, USA), Jacob Kahn (Facebook, USA), Ann Lee (Facebook, USA), Ronan Collobert (Facebook, USA), Gabriel Synnaeve (Facebook, France) and Michael Auli (Facebook, USA) https://www.isca-speech.org/archive/pdfs/interspeech_2021/hsu21_interspeech.pdf
Trends in ASR
Rather than introducing a multitude of complex new network architectures, this year's focus appeared to be more on the practical side of ASR: reducing streaming latency, fitting models on-device, and reducing the overall computation and memory footprint. Various improvements were presented towards these goals, especially for transformer-transducer models.
Interspeech is much more than an ASR conference, and there is too much to cover in a single blog post. One interesting and timely topic was the COVID-19 challenge: detecting infection based on cough and speech samples!
Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), highly accurate custom models for any domain, and privacy and scalability for hundreds of thousands of hours of audio.