Notes
Sparseness
Problems with N-gram models:
- New words keep appearing as the corpus grows
- New bigrams appear even more often than new words
- New trigrams appear more often still
Smoothing
Some N-grams are very rare or never occur in the training data, so the MLE assigns them zero probability.
How can we estimate the probability of unseen N-grams? This is the job of smoothing.
Add-1 Smoothing (Laplace discounting)
Add 1 to the count of every word, including unseen ones.
Given vocab size $V$ and corpus size $N$.
MLE: $P_{\text{MLE}}(w) = \frac{c(w)}{N}$
Laplace Estimate: $P_{\text{Lap}}(w) = \frac{c(w) + 1}{N + V}$
It gives a proper probability distribution: $\sum_{w} P_{\text{Lap}}(w) = 1$.
For bigrams, $P_{\text{Lap}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
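A minimal runnable sketch of the add-1 bigram estimate above, in Python; the toy corpus and function name are illustrative, not from the notes:

```python
from collections import Counter

def laplace_bigram_prob(corpus, prev, word):
    """P(word | prev) with add-1 smoothing: (c(prev, word) + 1) / (c(prev) + V)."""
    tokens = [t for sent in corpus for t in sent]
    V = len(set(tokens))                        # vocab size V
    unigrams = Counter(tokens)                  # c(w)
    bigrams = Counter(zip(tokens, tokens[1:]))  # c(w_{i-1}, w_i); joins sentences, fine for a sketch
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(laplace_bigram_prob(corpus, "the", "cat"))  # seen bigram: (1+1)/(2+4) = 0.333...
print(laplace_bigram_prob(corpus, "cat", "dog"))  # unseen bigram: (0+1)/(1+4) = 0.2, no longer zero
```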
Problem
Sometimes ~90% of the probability mass is spread across unseen events.
It only works if we know $V$ beforehand.
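A quick illustrative calculation (numbers assumed, not from the notes): take $V = 100{,}000$ and a context $w_{i-1}$ seen $c(w_{i-1}) = 1000$ times with 200 distinct continuations. Each of the $99{,}800$ unseen continuations gets probability $\frac{1}{1000 + 100{,}000}$, so the unseen events together receive $\frac{99{,}800}{101{,}000} \approx 0.99$ of the mass.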
Add-$\delta$ Smoothing
Add $\delta$ (typically $0 < \delta < 1$) to every count instead of 1: $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + \delta}{c(w_{i-1}) + \delta V}$
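The Laplace sketch above generalizes by replacing 1 with $\delta$; the default `delta=0.05` is an arbitrary illustrative value, since choosing $\delta$ well requires tuning on held-out data:

```python
from collections import Counter

def add_delta_bigram_prob(corpus, prev, word, delta=0.05):
    """P(word | prev) with add-delta smoothing: (c(prev, word) + d) / (c(prev) + d*V)."""
    tokens = [t for sent in corpus for t in sent]
    V = len(set(tokens))
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return (bigrams[(prev, word)] + delta) / (unigrams[prev] + delta * V)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(add_delta_bigram_prob(corpus, "cat", "dog"))  # unseen: 0.05/(1 + 0.2) ≈ 0.042, far less mass than add-1's 0.2
```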