NMT (Neural Machine Translation)
Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network.
- Drop noisy channel
- No (explicit) alignments - end-to-end training
- Outperforms SMT (Statistical Machine Translation) by a large margin
Teacher Forcing
- Instead of feeding the decoder's own previous prediction back in as input, feed the ground-truth previous target word at each step; this makes training faster and more stable (see the sketch below)
- Only used in training
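A minimal Python sketch of the idea, assuming a hypothetical `decoder_step` function that maps (previous token, hidden state) to a next-word distribution; the point is only that the ground-truth word, not the model's own prediction, is fed back in during training:

```python
import numpy as np

def decoder_step(prev_token_id, hidden, vocab_size=5):
    # Stand-in for a real RNN decoder step (hypothetical): returns next-word
    # probabilities and the new hidden state.
    logits = np.random.randn(vocab_size)
    return np.exp(logits) / np.exp(logits).sum(), hidden

def loss_with_teacher_forcing(target_ids, hidden=None):
    loss, prev = 0.0, 0                      # assume 0 is the <s> start token id
    for gold in target_ids:
        probs, hidden = decoder_step(prev, hidden)
        loss += -np.log(probs[gold])         # cross-entropy against the gold word
        prev = gold                          # teacher forcing: feed the GROUND-TRUTH word,
                                             # not argmax(probs), as the next input
    return loss / len(target_ids)

print(loss_with_teacher_forcing([1, 3, 2]))
```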
Attention
The decoder considers the input (the encoder hidden states) when making decisions; the attention scores determine which parts of the source to focus on (attend to).
- One decoder hidden state and all encoder hidden states are passed into the attention score function
Think of the decoder hidden state at time t as the query, and all encoder hidden states as the values.
Advantage
- Performance
- Solves the bottleneck problem
- Allows the decoder to look at the source sentence directly
- Helps with the long-horizon (vanishing gradient) problem by providing shortcut to distant states
Transformer
- Replace RNN with attention
- Allow parallel computation
Q,K,V
Queries come from the decoder; keys and values come from the encoder (the "memory").
The attention score is a similarity measure between a Query and a Key (e.g. their dot product).
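A minimal numpy sketch of (scaled) dot-product attention in matrix form, as used in the Transformer; the assumption here is that the queries Q come from the decoder and the keys K / values V come from the encoder memory:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> output (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # similarity between each query and each key
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over the keys
    return w @ V                                       # weighted sum of the values

Q = np.random.randn(2, 4)   # 2 decoder positions (queries)
K = np.random.randn(3, 4)   # 3 encoder positions (keys)
V = np.random.randn(3, 4)   # values attached to those keys
print(scaled_dot_product_attention(Q, K, V).shape)     # (2, 4)
```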
Positional Encoding
Add information about each token's position in the sequence to its input embedding vector.
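One standard choice is the fixed sinusoidal encoding from the original Transformer paper; a minimal numpy sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.randn(10, 16)                     # 10 tokens, d_model = 16
x = embeddings + sinusoidal_positional_encoding(10, 16)  # added to, not concatenated with, the embeddings
```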
Advantages compared to SMT
- Better performance
- More fluent
- Better use of context
- Better use of phrase similarities
- Single neural network to be optimized end-to-end
- Rather than optimizing multiple components individually
- Less human engineering effort
- No feature engineering
- Same method for all language pairs
Disadvantages compared to SMT
Mainly due to the black-box nature of neural networks:
- NMT is less interpretable
- hard to debug
- Difficult to control
- Can't easily specify rules or guidelines for translation
- Safety concerns
Sequence-to-Sequence Model (seq2seq)
Many NLP tasks can be phrased as seq2seq:
- Summarization (long -> short text)
- Dialogue (previous -> next utterance)
- Parsing (input text -> output parse as sequence)
- Code Generation (natural language -> code)
NMT directly calculates the translation model $P(y|x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$, where $P(y_t \mid y_1, \dots, y_{t-1}, x)$ is the probability of the next target word given the target words generated so far and the source sentence $x$. It is recursive in spirit: each next word depends on the source and on all previously generated words.
How to Train NMT?
Get a big parallel corpus (the same text in two different languages) and train the whole system end-to-end on it.
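A standard choice of training objective is the per-word negative log-likelihood (cross-entropy) of the gold target sentence, backpropagated through both decoder and encoder:

$$ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\!\left(y_t \mid y_1, \dots, y_{t-1}, x\right) $$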
Multi-layer RNNs (Stacked RNNs)
- RNNs are already "deep" in one dimension (time: they unroll over many timesteps and take history into account)
- We can make them "deep" in another dimension by stacking multiple RNNs on top of each other (multi-layer RNN); see the sketch after this list
- 2-4 layers is best for the encoder RNN, 4 layers is best for the decoder RNN
- Skip connections/dense connections are needed to train deeper RNNs (e.g. 8 layers)
- Transformer-based models are usually deeper (12-24 layers)
- More complex representations: lower RNN layers compute lower-level features, higher layers compute higher-level features
  - Higher-level features: e.g. the overall structure and semantics of the sentence
  - Lower-level features: e.g. basic properties of words (is it a name? an adjective? etc.)
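A toy numpy sketch of stacking, where layer l's hidden states are layer l+1's inputs; the tanh-RNN cell and dimensions are placeholders, not a tuned architecture:

```python
import numpy as np

def rnn_layer(inputs, W_xh, W_hh, b):
    """Simple tanh RNN: inputs (T, d) -> hidden states (T, d)."""
    h, states = np.zeros(W_hh.shape[0]), []
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states)

def stacked_rnn(x, layer_params):
    out = x
    for W_xh, W_hh, b in layer_params:        # one parameter set per layer
        out = rnn_layer(out, W_xh, W_hh, b)   # skip connections (out += layer input) help for e.g. 8 layers
    return out                                # top-layer hidden states

T, d = 7, 8
params = [(0.1 * np.random.randn(d, d), 0.1 * np.random.randn(d, d), np.zeros(d)) for _ in range(3)]
print(stacked_rnn(np.random.randn(T, d), params).shape)   # (7, 8)
```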
Greedy Decoding
Greedy Decoding: generate target sentence by taking argmax on each step of the decoder (find the most probable word on each step).
Greedy decoding has no way to undo decisions: if a later word reveals that an earlier choice was wrong, the decoder cannot go back and revise it.
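A sketch of the greedy loop, assuming a hypothetical `step_fn(prefix)` that returns log probabilities over the vocabulary for the next word:

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len=20):
    prefix = [start_id]
    while len(prefix) < max_len:
        next_id = int(np.argmax(step_fn(prefix)))   # take the single most probable next word
        prefix.append(next_id)
        if next_id == end_id:                       # stop at <END>; earlier picks are never revisited
            break
    return prefix

# Toy stand-in model, just to show the loop runs.
vocab = 5
def toy_step(prefix):
    probs = np.full(vocab, 1e-3) + np.eye(vocab)[len(prefix) % vocab]
    return np.log(probs / probs.sum())

print(greedy_decode(toy_step, start_id=0, end_id=4))   # [0, 1, 2, 3, 4]
```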
Exhaustive Search Decoding
Ideally we want to find a (length $T$) translation $y$ that maximizes $P(y|x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$. Enumerating every possible $y$ costs $O(V^T)$ for vocabulary size $V$, which is far too expensive.
Beam Search Decoding
On each step of the decoder, keep track of the $k$ most probable partial translations (hypotheses).
- $k$ is the beam size (in practice around 5-10)
- Not guaranteed to find the optimal solution
- Much more efficient than exhaustive search
A hypothesis $y_1, \dots, y_t$ has a score which is its log probability: $\text{score}(y_1, \dots, y_t) = \log P(y_1, \dots, y_t \mid x) = \sum_{i=1}^{t} \log P(y_i \mid y_1, \dots, y_{i-1}, x)$
- Scores are all negative, because the log of a probability in $(0,1)$ is negative; a higher (less negative) score means a higher probability and is better
- Search for high-scoring hypotheses, tracking the top $k$ on each step
Decoding Example
In the following example the beam size is $k = 2$. In each round, only the 2 hypotheses with the highest scores are kept and continued. Each of the 2 selected words is expanded by another 2 words, and so on; this keeps the tree from growing exponentially. Each expansion is a partial hypothesis. At the end, trace back through the chosen words to obtain the full hypothesis.
Problem
Longer hypotheses have lower scores, because the score is a sum of (negative) log probabilities. Solution: normalize by length, i.e. compare hypotheses by their per-word log probability $\frac{1}{t} \sum_{i=1}^{t} \log P(y_i \mid y_1, \dots, y_{i-1}, x)$.
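A compact sketch of beam search with length normalization, again assuming a hypothetical `step_fn(prefix)` that returns next-word log probabilities (real decoders also batch hypotheses and handle stopping criteria more carefully):

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, k=2, max_len=10):
    beams, finished = [([start_id], 0.0)], []        # (prefix, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for w in np.argsort(log_probs)[-k:]:     # top-k extensions of this hypothesis
                candidates.append((prefix + [int(w)], score + float(log_probs[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:         # keep only the k best partial hypotheses
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
        if not beams:
            break
    pool = finished or beams
    # length-normalize: compare per-word log probability, not the raw sum
    return max(pool, key=lambda c: c[1] / (len(c[0]) - 1))

vocab = 5
def toy_step(prefix):
    probs = np.full(vocab, 1e-3) + np.eye(vocab)[len(prefix) % vocab]
    return np.log(probs / probs.sum())

print(beam_search(toy_step, start_id=0, end_id=4, k=2))
```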
Evaluation (BLEU)
BLEU: Bilingual Evaluation Understudy. BLEU compares machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
- n-gram precision (usually for 1,2,3,4-grams)
- Overlap words
- Penalty for too-short system translations (see the sketch below)
BLEU is imperfect:
- There are many valid ways to translate a sentence
- A good translation can get a poor BLEU score because it has low n-gram overlap with the human translation(s)
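A simplified single-reference sketch of the computation (real implementations such as sacreBLEU add smoothing, multiple references, and corpus-level aggregation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())      # clipped n-gram matches
        precisions.append(max(clipped, 1e-9) / max(sum(cand.values()), 1))  # crude zero-smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)       # geometric mean of precisions
    # brevity penalty: punish translations shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
```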
Attention
Seq-to-Seq Bottleneck
All information about the source sentence has to be squeezed into a single vector (the final encoder hidden state).
On each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.
Attention always involves
- Computing the attention scores
- Taking the softmax to get the attention distribution
- Using the attention distribution to take a weighted sum of the values, obtaining the attention output (sketched below)
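A numpy sketch of those three steps for a single decoder timestep, using dot-product scores (other score functions are listed in the next subsection):

```python
import numpy as np

def attention_step(s_t, enc_states):
    """s_t: decoder hidden state (d,); enc_states: encoder hidden states (N, d)."""
    scores = enc_states @ s_t                 # 1) attention scores e_i = h_i . s_t
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                       # 2) softmax -> attention distribution
    context = alpha @ enc_states              # 3) weighted sum of values -> attention output
    return context, alpha

enc = np.random.randn(6, 8)                   # 6 source positions, hidden size 8
context, alpha = attention_step(np.random.randn(8), enc)
# The attention output (context) is typically concatenated with s_t before predicting the next word.
```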
Attention Variants
There are several ways to obtain attention scores $e \in \mathbb{R}^N$ from the values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and the query $s \in \mathbb{R}^{d_2}$ (see the numpy sketch after this list):
- Basic dot-product attention: $e_i = s^\top h_i$ (requires $d_1 = d_2$)
- Multiplicative attention: $e_i = s^\top W h_i$
  - $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix
  - Problem: too many parameters in $W$
- Reduced-rank multiplicative attention: $e_i = s^\top (U^\top V) h_i = (U s)^\top (V h_i)$
  - For low-rank matrices $U \in \mathbb{R}^{k \times d_2}$, $V \in \mathbb{R}^{k \times d_1}$ with $k \ll d_1, d_2$
- Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s)$, where $W_1 \in \mathbb{R}^{d_3 \times d_1}$, $W_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $v \in \mathbb{R}^{d_3}$ is a weight vector
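Numpy sketches of the four score functions for one query s and encoder states h_i stacked as rows of H (the dimensions below are arbitrary illustration values, with $d_1 = d_2$ for simplicity):

```python
import numpy as np

N, d, k, d3 = 4, 8, 3, 5                  # number of positions, hidden size, low rank, additive dim
H = np.random.randn(N, d)                 # encoder hidden states h_1..h_N (as rows)
s = np.random.randn(d)                    # decoder query

e_dot = H @ s                                                  # basic dot product: e_i = s . h_i

W = np.random.randn(d, d)
e_mult = H @ W.T @ s                                           # multiplicative: e_i = s^T W h_i

U, V = np.random.randn(k, d), np.random.randn(k, d)            # low-rank factors of W
e_lowrank = (H @ V.T) @ (U @ s)                                # reduced rank: e_i = (U s)^T (V h_i)

W1, W2, v = np.random.randn(d3, d), np.random.randn(d3, d), np.random.randn(d3)
e_add = np.tanh(H @ W1.T + W2 @ s) @ v                         # additive: e_i = v^T tanh(W1 h_i + W2 s)
```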