Neural Language Models


Advantages:

  • Can generalize better than maximum-likelihood (MLE) n-gram LMs to unseen n-grams
  • Can use semantic information, as in word2vec

Disadvantages:

  • Can take relatively long to train
  • The number of parameters scales poorly with increasing context

Solutions to Expensive Training

  • Replace rare words with <out-of-vocabulary> token

  • Subsample frequent words

  • Hierarchical softmax

  • Noise-contrastive estimation

  • Negative sampling
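
The first two items are preprocessing steps and can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, `min_count`, and the threshold `t` are hypothetical choices, and the keep probability sqrt(t / f) is the subsampling heuristic from the word2vec paper, assumed here because the notes do not say which rule is meant.

```python
import math
import random
from collections import Counter

def replace_rare(tokens, min_count=2, oov="<out-of-vocabulary>"):
    """Map every word seen fewer than min_count times to one shared OOV token."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else oov for w in tokens]

def keep_probability(freq, t=1e-3):
    """Probability of keeping a word with relative frequency freq.

    Assumed heuristic: discard with probability 1 - sqrt(t / freq),
    so words rarer than t are always kept.
    """
    return min(1.0, math.sqrt(t / freq))

def subsample(tokens, t=1e-3, rng=None):
    """Randomly drop occurrences of frequent words."""
    rng = rng or random.Random(0)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() < keep_probability(counts[w] / total, t)]
```

Both steps shrink the work per epoch: the first reduces the size of the output vocabulary, the second reduces the number of training positions dominated by very common words.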

Hierarchical softmax

Group words into distinct classes, e.g. by frequency: $c_1$ is the top 5%, $c_2$ is the next 5%, and so on.
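
With such classes, the word probability factors as P(w | h) = P(c(w) | h) · P(w | c(w), h), so each prediction needs one softmax over the classes and one over the members of a single class instead of one over the whole vocabulary. A rough sketch of this two-level version follows; the weight matrices, the toy class assignment, and all names are made up for illustration. A full hierarchical softmax would use a deeper (typically binary) tree, which this sketch does not cover.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def class_factored_prob(h, word, word_class, class_members, W_class, W_word):
    """P(word | h) = P(class | h) * P(word | class, h)."""
    c = word_class[word]
    p_class = softmax(W_class @ h)[c]                 # softmax over classes only
    members = class_members[c]
    p_in_class = softmax(W_word[members] @ h)[members.index(word)]  # softmax within the class
    return p_class * p_in_class

# Toy setup (hypothetical): 6 words, class 0 holds the most frequent ones.
rng = np.random.default_rng(0)
d, vocab = 4, 6
class_members = [[0, 1], [2, 3, 4, 5]]
word_class = {w: c for c, ms in enumerate(class_members) for w in ms}
h = rng.normal(size=d)                                # context / hidden vector
W_class = rng.normal(size=(len(class_members), d))
W_word = rng.normal(size=(vocab, d))

# The factored probabilities still form a distribution over the vocabulary.
total = sum(class_factored_prob(h, w, word_class, class_members, W_class, W_word)
            for w in range(vocab))
```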


RNNs have feedback connections, so they remember previous states.

  • Challenge: gradients decay quickly as they are propagated back through many time steps (vanishing gradients)
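
The decay can be observed numerically. The sketch below assumes a plain tanh RNN, h_t = tanh(W h_{t-1} + U x_t), with random weights and inputs; the function name, the sizes, and the choice to rescale W to spectral norm 0.9 are all assumptions made for the demonstration. It tracks the norm of the Jacobian of h_t with respect to h_0, which is the factor a gradient is multiplied by when it flows back t steps.

```python
import numpy as np

def gradient_norms(steps=50, hidden=8, spectral_norm=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(hidden, hidden))
    W *= spectral_norm / np.linalg.norm(W, 2)   # largest singular value = spectral_norm
    U = rng.normal(size=(hidden, hidden))       # input size = hidden size, for simplicity
    h = np.zeros(hidden)
    J = np.eye(hidden)                          # accumulates d h_t / d h_0
    norms = []
    for _ in range(steps):
        x = rng.normal(size=hidden)
        h = np.tanh(W @ h + U @ x)
        J = np.diag(1.0 - h**2) @ W @ J         # chain rule through one time step
        norms.append(np.linalg.norm(J, 2))
    return norms

norms = gradient_norms()
```

Because the tanh derivative is at most 1, each step shrinks the norm by at least the spectral norm of W, so `norms[t-1]` is bounded by 0.9 ** t in this setup and the signal from early inputs fades geometrically.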


Long Short-term Memory

Idea: a separate "thread" (the cell state, a dedicated vector stream) runs through the entire chain and stores the long-term information.

An LSTM cell has several gates:

  • Forget gate layer: compare $h_{t-1}$ and the current input $x_t$ to decide which elements in cell state $C_{t-1}$ to keep and which to turn off.
  • Input gate layer: 2 steps
    • one sigmoid layer decides which cell units to update
    • one tanh layer creates new candidate values $\tilde{C}_t$
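  • Update cell state: scale $C_{t-1}$ by the forget gate and add the gated candidate values to obtain $C_t$
  • Output and feedback: a sigmoid output gate filters $\tanh(C_t)$ to produce $h_t$, which is fed back into the cell at the next step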
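
The four steps above can be written out as a single time step. This is a sketch of the standard LSTM formulation, assuming each gate has its own weight matrix acting on the concatenation of $h_{t-1}$ and $x_t$; the `params` layout and all variable names are illustrative choices, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step; params maps a gate name to (W, b), with W applied to [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["f"][0] @ z + params["f"][1])       # forget gate: what to keep of C_prev
    i_t = sigmoid(params["i"][0] @ z + params["i"][1])       # input gate: which units to update
    C_tilde = np.tanh(params["c"][0] @ z + params["c"][1])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                       # update cell state
    o_t = sigmoid(params["o"][0] @ z + params["o"][1])       # output gate
    h_t = o_t * np.tanh(C_t)                                 # output, fed back at the next step
    return h_t, C_t

# Toy run over a short random sequence (sizes are arbitrary).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
params = {k: (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)), np.zeros(n_hid))
          for k in "fico"}
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):
    h, C = lstm_step(x_t, h, C, params)
```

Note that the cell state is only rescaled elementwise and added to, never passed through a weight matrix, which is what lets information (and gradients) survive over many steps when the forget gate stays close to 1.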

Contextual Word Embedding


Embeddings from Language Models