Final Exam Review

No aids allowed.


  • Anything highlighted in yellow in the lecture slides is important to know.

  • Go through the lectures

    • Work out the examples
  • The exam covers everything except the bonus parts of assignments and the "Aside" sections in slides.


  • Corpus-based linguistics
  • N-gram, Linguistic features, classification
  • Entropy and Information Theory
  • NLM and word embedding
  • Machine Translation (statistical and neural)
  • Recent breakthroughs - (SOTA) transformer variants
  • HMMs
  • Automatic speech recognition (ASR)
  • Natural Language Understanding (NLU)
  • Information Retrieval (IR)
  • Interpretability and LLM

Text Classification

Bayes' Theorem


From Bayes rule, we can do MLE.

Goal: Maximize the likelihood of the training data using the model.


To solve the zero-probability problem (events unseen in training get probability 0 under MLE), use smoothing.

Zipf's law: there are many words that appear only once.

Smoothing: steal frequency from the rich and give to the poor.

Add-δ smoothing: distribute the frequencies more evenly.
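A minimal sketch of add-δ smoothing on unigram counts, to make the "steal from the rich" idea concrete. The toy corpus, the vocabulary, and the function name add_delta_probs are assumptions chosen for illustration; the sketch also assumes every token is in the vocabulary.

```python
from collections import Counter

def add_delta_probs(tokens, vocab, delta=1.0):
    """Return add-delta smoothed unigram probabilities over vocab."""
    counts = Counter(tokens)
    # Every vocabulary item receives delta extra (pseudo) counts.
    total = len(tokens) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}

tokens = ["the", "cat", "sat", "the", "cat", "the"]
vocab = ["the", "cat", "sat", "dog"]  # "dog" never appears in tokens

# Unsmoothed MLE: relative frequency, so unseen words get probability 0.
mle = {w: tokens.count(w) / len(tokens) for w in vocab}
smoothed = add_delta_probs(tokens, vocab, delta=0.5)

for w in vocab:
    print(w, mle[w], smoothed[w])
```

In this toy case the unseen word "dog" moves from probability 0 to a small positive value, while the frequent word "the" loses some probability mass.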


Extrinsic Evaluation

Which model gives better performance on a downstream task? Performance can be measured by accuracy, F1 score, recall, or precision.

Intrinsic Evaluation

Perplexity: a way of measuring a language model.

Get perplexity of each model, the lower the better.
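As a rough sketch of the comparison, perplexity can be computed from the probability a model assigns to each token of a held-out text. The base-2 definition used here (2 to the power of the average negative log2 probability) is one common convention and an assumption; the probability lists are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of a model given the probability it assigned to each token."""
    n = len(token_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** avg_neg_log2

# Hypothetical per-token probabilities from two models on the same 8 tokens.
uniform = perplexity([0.25] * 8)   # model that spreads mass over 4 choices
sharper = perplexity([0.5] * 8)    # model that is more confident and correct

print(uniform, sharper)
```

The model that assigns higher probability to the observed tokens gets the lower perplexity.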

BLEU score is for translation.


Entropy: a measure of randomness.

e.g. if a language model says each word is equally likely, the entropy is at its maximum (1 bit for two equally likely outcomes). Highly random.

Know how to calculate entropy, conditional entropy, joint entropy

Relations between entropies: H(X, Y) = H(X) + H(Y) - I(X;Y)
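A small sketch of how these quantities could be computed from a joint distribution, assuming entropy is measured in bits. The joint table below is hypothetical and chosen only so the numbers are easy to check by hand.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution p(x, y), for illustration only.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.0, ("b", 1): 0.25}

# Marginals p(x) and p(y) are obtained by summing out the other variable.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)
h_y_given_x = h_xy - h_x          # chain rule: H(X,Y) = H(X) + H(Y|X)
mutual_info = h_x + h_y - h_xy    # the relation between entropies, rearranged

print(h_x, h_y, h_xy, h_y_given_x, mutual_info)
```

Here H(X, Y) is 1.5 bits and H(Y) is 1 bit; the conditional entropy and mutual information follow from the two identities in the comments.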

Mutual Information


Information Theory: Learn how to do calculations, see tutorials.

Procedure of a Statistical Test

  1. State a hypothesis
    1. Null hypothesis and alternative hypothesis
  2. Compute a test statistic (and its p-value)
  3. Compare the statistic to a critical value and report the test result

Null Hypothesis: Nothing has changed

Types of t-tests

  • one-sample t-test
  • two-sample t-test
  • paired t-test
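As one worked instance of the procedure, the sketch below computes a paired t statistic using only the standard library. The accuracy numbers and the function name paired_t_statistic are hypothetical; the null hypothesis assumed here is that the mean difference between the two systems is zero.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b, with its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)        # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical accuracies of two systems on the same five test folds.
system_a = [0.80, 0.82, 0.79, 0.85, 0.81]
system_b = [0.78, 0.80, 0.78, 0.82, 0.80]
t_stat, dof = paired_t_statistic(system_a, system_b)

print(t_stat, dof)
```

With 4 degrees of freedom, the two-tailed critical value at the 0.05 level is about 2.776, so a statistic above that would lead to rejecting the null hypothesis for this toy data.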


There will be a calculation in the long-answer section.

Train HMM: modify the parameters of the model to maximize the likelihood of the training data.

Backtrack Dynamic Programming
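A minimal sketch of dynamic programming with backtracking, assuming the note above refers to Viterbi-style decoding of the most likely state sequence. The two-state HMM and all of its probabilities are a hypothetical toy example, not parameters from the course.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under a discrete HMM."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][obs[t]]
            back[t][s] = prev             # remember where the best path came from
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), best[-1][last]

# Hypothetical toy HMM.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```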

Forward-Backward Algorithm
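The sketch below shows one way the forward and backward passes could be written for a discrete HMM, and how they combine into per-time-step state posteriors. The toy HMM parameters are hypothetical, and no scaling or log-space arithmetic is used, so this is only suitable for short sequences.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """alpha[t][s] = P(obs[0..t], state at time t is s)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][r] * trans_p[r][s] for r in states) * emit_p[s][obs[t]]
            for s in states
        })
    return alpha

def backward(obs, states, trans_p, emit_p):
    """beta[t][s] = P(obs[t+1..] | state at time t is s)."""
    beta = [{s: 1.0 for s in states}]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {
            s: sum(trans_p[s][r] * emit_p[r][obs[t + 1]] * nxt[r] for r in states)
            for s in states
        })
    return beta

# Hypothetical toy HMM.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
obs = ["normal", "cold", "dizzy"]

alpha = forward(obs, states, start_p, trans_p, emit_p)
beta = backward(obs, states, trans_p, emit_p)
likelihood = sum(alpha[-1].values())      # P(obs) under the model

# gamma[t][s]: posterior probability of being in state s at time t.
gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
         for t in range(len(obs))]
print(likelihood, gamma)
```

Posteriors like gamma are the kind of expected counts that Baum-Welch re-estimation uses to update the parameters.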

Baum-Welch re-estimation


Know how to calculate BLEU scores.

BLEU = BP \times (p_1 p_2 \cdots p_n)^{1/n}
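A hedged sketch of the calculation for a single candidate and a single reference. It assumes clipped n-gram precisions for p_1 to p_n and a brevity penalty of exp(1 - r/c) when the candidate is shorter than the reference; the exact conventions used in lecture may differ, so check them against the slides. The sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.prod(precisions) ** (1 / max_n)

candidate = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()
score = bleu(candidate, reference, max_n=2)
print(score)
```

For this pair, p_1 is 5/5, p_2 is 3/4, and the brevity penalty is exp(-0.2) because the candidate has 5 tokens against 6 in the reference.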

Beam Search: top-K greedy

Idea: track the top K choices of partial translations (hypotheses) at each step of decoding
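The idea can be sketched with a toy next-token table standing in for a real decoder. TABLE, toy_model, and the "</s>" end-of-sequence symbol are hypothetical names invented for this example; the probabilities are chosen so that greedy decoding (K = 1) misses the best hypothesis.

```python
import math

def beam_search(next_probs, k, max_len, eos="</s>"):
    """Keep the k highest-scoring partial hypotheses at each decoding step."""
    beams = [((), 0.0)]                       # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# Hypothetical next-token distributions, keyed by the prefix decoded so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_model(prefix):
    # Any prefix not in the table ends the sequence with probability 1.
    return TABLE.get(prefix, {"</s>": 1.0})

greedy = beam_search(toy_model, k=1, max_len=3)
beam = beam_search(toy_model, k=2, max_len=3)
print(greedy[0], beam[0])
```

With K = 1 the search commits to "a" and ends with a hypothesis of probability 0.3, while K = 2 keeps "b" alive long enough to find the hypothesis with probability 0.36.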

Neural Language Models

Old material appears in short- and long-answer questions, new material in multiple choice.



What's the difference, what are the equations?


ELMo considers the entire sentence before embedding each token.


Long answer question on this.

Levenshtein Distance

How to calculate, dynamic programming.
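A minimal dynamic-programming sketch, assuming unit cost for each insertion, deletion, and substitution. If the course uses a different substitution cost, only the sub value below would change. The function works on strings and on lists of words alike.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions (unit cost each)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j]: edit distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                      # delete all i source symbols
    for j in range(cols):
        dist[0][j] = j                      # insert all j target symbols
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + sub,   # match or substitution
            )
    return dist[-1][-1]

char_distance = levenshtein("kitten", "sitting")
word_distance = levenshtein("how to recognize speech".split(),
                            "how to wreck a nice beach".split())
print(char_distance, word_distance)
```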



Similarity Score

Vectorization: tf.idf

tf.idf: traditional method to vectorize the documents
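A rough sketch of tf.idf vectorization followed by a cosine similarity score. It assumes raw term counts for tf and idf = log(N / df); other weighting variants exist, so treat this as one possible convention. The three documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One dict per document mapping term -> tf * idf, with idf = log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell on the news".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

Because "the" occurs in every document, its idf is log(1) = 0, so it contributes nothing to any similarity score.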

Evaluating Retrieval Systems

  • Precision
  • Recall
  • F-score
  • Precision @ K
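The metrics in this list can be sketched as follows, assuming a set of gold relevance judgments and a ranked list of retrieved documents. The document ids are hypothetical, and F-score is taken to be the balanced F1.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]     # hypothetical system ranking
relevant = {"d1", "d2", "d3", "d4"}         # hypothetical gold judgments

p, r, f1 = precision_recall_f1(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
print(p, r, f1, p_at_2)
```

Here 3 of the 5 retrieved documents are relevant (precision 0.6) and 3 of the 4 relevant documents are retrieved (recall 0.75).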

ROC curve

Term Frequency

Interpretable NLP

Shapley Value
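A small sketch of exact Shapley values, computed as the average marginal contribution of each player over all orderings. The players stand in for input tokens and toy_value is a hypothetical stand-in for a model's output on a subset of tokens; enumerating all orderings is only feasible for a handful of players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_value(coalition):
    """Hypothetical sentiment score given which tokens are present."""
    if {"not", "good"} <= coalition:
        return -1.0      # "not" flips the positive word
    if "good" in coalition:
        return 2.0
    return 0.0           # "movie" and "not" alone carry no sentiment

players = ["not", "good", "movie"]
phi = shapley_values(players, toy_value)
print(phi)
```

The values sum to the score of the full input (the efficiency property), and the token that never changes the score receives a Shapley value of zero.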

Multiple Choice and Short Answer.