Automatic Speech Recognition
ASR systems convert speech into text.
Raw speech data are 1-D arrays of shape $(r \cdot t,)$
- $r$ is the sample rate (samples per second)
- $t$ is the time length (in seconds)
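As a small illustration of the shape rule above, the sketch below synthesizes a tone as a stand-in for recorded speech and checks its length. The function name `make_tone` and the 16 kHz / 2 s values are assumptions chosen for the example, not part of these notes.

```python
import math

def make_tone(sample_rate, duration_s, freq_hz=440.0):
    """Synthesize a sine tone as a flat list of samples (a stand-in for speech)."""
    n_samples = int(sample_rate * duration_s)  # sample rate x duration
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# Hypothetical values: 16 kHz sample rate, 2 seconds of audio.
waveform = make_tone(sample_rate=16000, duration_s=2.0)
print(len(waveform))  # 32000
```

In practice the samples would usually be held in a 1-D numeric array rather than a Python list, but the length rule is the same.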
Speech features are 2-D arrays of shape $(T, d)$
- $T$: number of frames
- $d$: number of features
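To make the 2-D layout concrete, here is a minimal sketch that cuts a waveform into overlapping frames and computes two simple values per frame. The two features (log energy and zero-crossing rate), the frame length of 400 samples, and the hop of 160 samples are assumptions for illustration; real systems typically use richer features such as log-mel filterbanks or MFCCs.

```python
import math

def frame_features(waveform, frame_len=400, hop=160):
    """Return one [log_energy, zero_crossing_rate] row per frame."""
    features = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len]
        # Mean squared amplitude of the frame, on a log scale.
        energy = sum(s * s for s in frame) / frame_len
        log_energy = math.log(energy + 1e-10)
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a < 0) != (b < 0))
        features.append([log_energy, crossings / (frame_len - 1)])
    return features

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
waveform = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
feats = frame_features(waveform)
print(len(feats), len(feats[0]))  # 98 2
```

Each row is one frame and each column is one feature, so the result has 98 frames with 2 features each.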
Acoustic Model
Each word $w$ might be realized by different sounds $x$.
The ASR task becomes $\hat{w} = \arg\max_{w} P(w \mid x)$: given a speech sample $x$, predict the word $w$.
Bayes Theorem
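For a speech sample $x$ and a candidate word $w$:
$$\hat{w} = \arg\max_{w} P(w \mid x) = \arg\max_{w} \frac{P(x \mid w)\,P(w)}{P(x)} = \arg\max_{w} P(x \mid w)\,P(w)$$
- $P(w)$: prior probability (language model)
- $P(x \mid w)$: likelihood (acoustic model)
The last equality holds because $P(x)$ is constant with respect to $w$.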
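The decision rule can be sketched with a toy example. All probabilities below are made-up numbers, and the names `decode` and `posterior` are hypothetical helpers; the point is only that dropping the normalizing constant does not change which word wins.

```python
# Hypothetical toy probabilities, not taken from any real model.
prior = {"hello": 0.5, "yellow": 0.3, "mellow": 0.2}         # language model
likelihood = {"hello": 0.10, "yellow": 0.30, "mellow": 0.20}  # acoustic model

def decode(prior, likelihood):
    """Pick the word maximizing likelihood * prior (no normalization)."""
    return max(prior, key=lambda w: likelihood[w] * prior[w])

def posterior(prior, likelihood):
    """Full Bayes posterior, including division by the evidence."""
    evidence = sum(likelihood[w] * prior[w] for w in prior)
    return {w: likelihood[w] * prior[w] / evidence for w in prior}

best = decode(prior, likelihood)
post = posterior(prior, likelihood)
print(best)                       # yellow
print(max(post, key=post.get))    # yellow
```

Both calls select the same word, because dividing every score by the same evidence term preserves their order.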
WER
Word Error Rate (WER) counts the different kinds of errors an ASR system can make at the word level:
- Substitution Error: wrong word
- Deletion Error: skipped word
- Insertion Error: extra word
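$\mathrm{WER} = \frac{S + D + I}{|R|} = \frac{N}{|R|}$, where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, and $|R|$ is the number of words in the reference. $N = S + D + I$ can be computed with dynamic programming ($N$ is the word-level edit distance, or Levenshtein distance). $N$ is the minimum number of errors needed to edit the hypothesis $H$ into the reference $R$.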
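The dynamic-programming computation can be sketched as follows. This is a minimal word-level Levenshtein implementation under the usual assumption that each substitution, deletion, and insertion costs 1; the function names and the example sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Minimum number of word edits to turn hyp into ref."""
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    # Initialization: against an empty sequence, every word is an error.
    for i in range(len(ref) + 1):
        d[i][0] = i  # i reference words were skipped (deletions)
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j extra hypothesis words (insertions)
    # Induction: extend the best of the three neighboring sub-solutions.
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # match or substitution
                          d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1)             # insertion
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

# One substitution (sat -> sit) and one deletion (the): 2 errors, 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Note that WER can exceed 1 when the hypothesis contains many extra words, since insertions are counted but the denominator is the reference length.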
Review Questions
- What is the Listen, Attend and Spell (LAS) ASR model?
- How do you evaluate an ASR system?
- WER
- What are the three error types (substitution, insertion, deletion)?
- What are the initialization and induction steps for computing WER with dynamic programming?
- Optional: Do LeetCode Q72 (Edit Distance)