September 26, 2021

An Exploration Of Automated Speech Recognition

54 minutes

Summary

The overwhelming growth of smartphones, smart speakers, and spoken word content has corresponded with increasingly sophisticated machine learning models for recognizing speech content in audio data. Dylan Fox founded Assembly to provide access to the most advanced automated speech recognition models for developers to incorporate into their own products. In this episode he gives an overview of the current state of the art for automated speech recognition, the varying requirements for accuracy and speed of models depending on the context in which they are used, and what is required to build a special purpose model for your own ASR applications.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.

When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Your host as usual is Tobias Macey, and today I’m interviewing Dylan Fox about the challenges of training and deploying large models for automated speech recognition.

Interview

Introductions

How did you get introduced to Python?

What is involved in building an ASR model?

How does the complexity/difficulty compare to models for other data formats? (e.g. computer vision, NLP, NER, etc.)

How have ASR models changed over the last 5, 10, 15 years?

What are some other categories of ML applications that work with audio data?

How does the level of complexity compare to ASR applications?

What is the typical size of an ASR model that you are deploying at Assembly?

What are the factors that contribute to the overall size of a given model?

How does accuracy compare with model size?

How does the size of a model contribute to the overall challenge of deploying/monitoring/scaling it in a production environment?

How can startups effectively manage the time/cost that comes with training large models?

What are some techniques that you use/attributes that you focus on for feature definitions in the source audio data?

Can you describe the lifecycle stages of an ASR model at Assembly?

What are the aspects of ASR which are still intractable or impractical to productionize?

What are the most interesting, innovative, or unexpected ways that you have seen ASR technology used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on ASR?

What are the trends in research or industry that you are keeping an eye on?

Keep In Touch

@YouveGotFox on Twitter

Picks

Tobias

The Hitman’s Wife’s Bodyguard

Dylan

Inspiration 4 Documentary

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

Learn Python The Hard Way

DeepSpeech

Wav2Letter

BERT

GPT-3

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Mycroft

Podcast Episode

CMU Sphinx

Pocket Sphinx

Gaussian Mixture Model (GMM)

Hidden Markov Model (HMM)

DeepSpeech Paper

Transformer Architecture

Audio Analytic Sound Recognition Podcast Episode

Horovod distributed training library

Knowledge Distillation

LibriSpeech Data Set

Lambda Labs

Wav2Vec

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA


By Tobias Macey
