Skip to main content

Seq2Seq

Sequence To Sequence

Encoder-Decoder (Seq2Seq) made it possible to map input sequences to output sequences of different lengths, because "I want a coffee with milk" and "Yo quiero cafe con leche" have different total lengths of words (even if it's just "a"). This also allowed for taking in entire photos, and outputting a text representation (video captioning), outputting bounding box anchors of objects (object detection), and many other future architectures

Most Seq2Seq models are a combination of conditional language models with long range encoders that build up context, and auto-regressive decoders for generating text outputs. These decoders can be greedy and simply choose the most likely token, or then can perform general search over future text outputs as well

Encoder Decoder

Over time word embeddings and sentence embeddings were gaining traction, but things became awkward when there wasn't a fixed length 1:1 mapping from input to output. "My name is Luke" \rarr "Me llamo Luke" isn't a 1:1 mapping, and so machine translation and other similar tasks started hitting these sequence-to-sequence roadblocks

Ultimately, the thought process was to build an architecture with 2 parts:

  • Encoders dealt with variable length input sequences to transform it into a fixed state / fixed size vector
  • Decoders would take that fixed size hidden state, and output a variable length output

These could utilize Recurrent NN's for variable length input and output, or really any other architecture that allowed for repeating predictions, and then the mapping between variable length sequence to variable length sequence could be achieved

Encoder Decoder Freehand

The generic framework proposed in initial papers has:

  • Encoder RNN processes the input sequence of length nxn_{x} to compute a fixed length context vector CC
    • This context vector is usually just the final hidden state of the encoder, or some function of the hidden states if that's desired
    • This context vector is supposed to encode all relevant information, signals, etc about the input vector
  • Decoder RNN process uses the context vector to produce a target sequence of length nyn_{y}
    • The context is passed to every hidden state of the decoder, the decoder RNN uses this information to produce the target sequence of length nyn_{y}, which can of course vary from nxn_{x}

Encoder Decoder Freehand Specific

Encoder

The encoder portion is really just a straightforward RNN! It's main objective is to output a hidden state context vector cc, and to be able to handle varying length inputs and preserve their signal in cc

Each input (let's say one single sentence) will have multiple words / tokens, where each token is represented by a fixed size embedding (most likely something from Word2Vec) - each input feature vector xt\bold{x}_{t} can be represented as a static word embedding for the rest of this article

At any time step tt, the RNN transforms the input feature vector xt\bold{x}_{t} for token xtx_t, and also uses the hidden state ht1\bold{h}_{t - 1} to create the current hidden state ht\bold{h}_{t}

ht=f(xt,ht1)\bold{h}_{t} = f(\bold{x}_{t}, \bold{h}_{t-1})

This will happen over each of the input tokens, and ultimately we will use the set of hidden states {h0,h1,...,hT}\{\bold{h}_{0}, \bold{h}_{1}, ..., \bold{h}_{T}\} to create the final hidden state cc

c=q({h0,h1,...,hT})c = q(\{\bold{h}_{0}, \bold{h}_{1}, ..., \bold{h}_{T}\})

qq can be an identity function, which means cc is just the final hidden state, or it can aggregate all of the hidden states, use some weighted aggregation, etc

The example above utilizes a unidirectional RNN, but the same encoding process can be done with a bi-directional encoder as well. As long as it makes sense in the context of the overall architecture, we might as well utilize the extra information! In that case we would simply do the unidirectional pass in both directions, and then any hidden state would be a function of it's 2 unidirectional passes

ht=f(ht,ht1)\bold{h}_{t} = f(\overrightarrow{\bold{h}_{t}}, \overleftarrow{\bold{h}_{t-1}})

Below python snippet showcases how you can use an Embedding layer (word2vec or modern equivalent) along with some generic rnn calls to create a forward pass of an encoding layer

Show Python Script
class Seq2SeqEncoder(d2l.Encoder):  #@save
"""The RNN encoder for sequence-to-sequence learning."""
def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
dropout=0):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.rnn = d2l.GRU(embed_size, num_hiddens, num_layers, dropout)
self.apply(init_seq2seq)

def forward(self, X, *args):
# X shape: (batch_size, num_steps)
embs = self.embedding(X.t().type(torch.int64))
# embs shape: (num_steps, batch_size, embed_size)
outputs, state = self.rnn(embs)
# outputs shape: (num_steps, batch_size, num_hiddens)
# state shape: (num_layers, batch_size, num_hiddens)
return outputs, state

Decoder

The decoder is a much more interesting part to focus on, as it starts to incorporate many new architectures that will eventually underpin things like Self Attention and contextual embeddings

At a high level, the decoder will typically focus on the 3 things below:

  • Feed source and previously generated target words into a network
  • Get vector representation of context (both source and previous target) from the networks decoder
  • From this vector representation, predict a probability distribution for the next token

Decoder Example

At each time step tt' the decoder assigns a probability to each possible token in the vocabulary, and the probability is conditioned upon previous output tokens {y[0],y[1],...,y[t1]}\{y_{[0]}, y_{[1]}, ... , y_{[t' - 1]} \}, the previous hidden state h[t1]h_{[t' - 1]}, and the encoder hidden state context variable cc

h[t]=g(h[t1],y[t1],c)h_{[t']} = g(h_{[t' - 1]}, y_{[t' - 1]}, c)

After getting the hidden state for current time tt', we can use an output layer and softmax operation to compute the predictive distribution. This output layer is typically used to go from the underlying dimension of the decoder NN into the dd vocabulary size we need to predict

P(y_{[t']} | y_{[0]}, y_{[1]}, ... , y_{[t' - 1], c) = f(h_{[t - 1]}, y_{[t' - 1]}, c)

Below python script showcases some of this:

  • Use the last hidden state of the encoder to initialize the hidden state of the decoder
    • Therefore, RNN encoder and decoder have the same number of layers and hidden units
  • Hidden context cc is concatenated with decoder input at all time steps to ensure cc is able to spread it's information around everywhere
    • The thought here is this context variable is so vital, we shouldn't lose it by passing it through hidden states and instead should allow it to be used at each step
  • The output layer is a fully connected layer to predict the probability distribution of the output token over all vocabulary entries
Show Python Script
pytorch
mxnet
jax
tensorflow

class Seq2SeqDecoder(d2l.Decoder):
"""The RNN decoder for sequence to sequence learning."""
def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
dropout=0):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.rnn = d2l.GRU(embed_size+num_hiddens, num_hiddens,
num_layers, dropout)
self.dense = nn.LazyLinear(vocab_size)
self.apply(init_seq2seq)

def init_state(self, enc_all_outputs, *args):
return enc_all_outputs

def forward(self, X, state):
# X shape: (batch_size, num_steps)
# embs shape: (num_steps, batch_size, embed_size)
embs = self.embedding(X.t().type(torch.int32))
enc_output, hidden_state = state
# context shape: (batch_size, num_hiddens)
context = enc_output[-1]
# Broadcast context to (num_steps, batch_size, num_hiddens)
context = context.repeat(embs.shape[0], 1, 1)
# Concat at the feature dimension
embs_and_context = torch.cat((embs, context), -1)
outputs, hidden_state = self.rnn(embs_and_context, hidden_state)
outputs = self.dense(outputs).swapaxes(0, 1)
# outputs shape: (batch_size, num_steps, vocab_size)
# hidden_state shape: (num_layers, batch_size, num_hiddens)
return outputs, [enc_output, hidden_state]

Both parts are jointly trained to maximise the conditional log-likelihood, where θ\theta denotes the set of model parameters and (xn,yn)(x_n, y_n) (input, output) pairings from the training set with size NN

maxθ1Nn=1Nlogpθ(ynxn)\text{max}_{\theta} {1 \over N} \sum_{n = 1} ^ N \log{p_{\theta} (y_n | x_n})

This best probability is usually found via the beam search algorithm, whose core idea is that on each step of the decoder you track the kk most probable partial translations (usually called hypotheses)

Training

These models are trained to predict probability distributions of the next token given previous context (source and previous output target tokens). Understanding that they simply predict probability distributions is important! Their evaluation is based on distribution comparisons, and then the actual generation done via search algorithms are completely split out from prediction output

This idea took me a long time to internalize, so I'll bold italicize it a bunch

Formally for training, assume:

  • Training instance with source x=(x[0],x[1],...,xm)x = (x_{[0]}, x_{[1]}, ... , x_{m})
  • Target (y[0],y[1],...,y[n])(y_{[0]}, y_{[1]}, ... , y_{[n]})
  • At timestamp tt the model predicts a probability distribution \p(t)=p(y[0],y[1],...,y[t1],x[0],x[1],...,xm)\p^{(t)} = p(* | y_{[0]}, y_{[1]}, ... , y_{[t - 1]}, x_{[0]}, x_{[1]}, ... , x_{m})
  • The target at thsi step is p=one-hot(yt)p^* = \text{one-hot}(y_t)
    • i.e. we want model to assign probability 1 to the correct token yty_t and zero to the rest
  • The standard loss for probability comparisons is cross entropy

These target instances are from cleaned web scraped content, human reviewed content, etc

Inferece

Inference and generating actual output tokens is an expensive task. The output probability distributions the model outputs aren't the end result - they need to have a word chosen from them!

The formal definition of what needs to be maximized:

y=argmaxyp(yx)=argmaxyt=1np(yty<t,x)y^{`} = {\text{argmax}_ {y}} p(y | x) = {\text{argmax}_ {y}} \prod_{t=1}^{n} p(y_{t} | y_{\lt t}, x)

So how can you actually find argmaxy\text{argmax}_y? Generating one at a time might incur incorrect results, especially if the desired token 2 steps away depends on a current token being predicted that isn't the top probability. For example at time tt, the sentence may be "my name", and t+1t+1 next token may have the highest probability, given the encoded context vectors, of "Luke", whereas we actually need it to be "is Luke" over 2 steps

This can be formulated as hypotheses, where each potential set of future words chosen, all of the combinations, are a set of hypotheses we could output. So what are the main strategies searching over this result space and generating actual tokens to output? It all comes down to time complexities and the efficiency of models

For any time tt' the decoder outputs predictions representing the probability of each token in the vocabulary coming next in the sequence, and it's conditioned on the previous tokens, {y[0],y[1],...,y[t1]},alongwiththecontextvariable\{y_{[0]}, y_{[1]}, ... , y_{[t' - 1]} \}, along with the context variable c.Ifwedenote. If we denote Vastheoutputvocabularylist,wecanalsodefinethemaxnumberoftokensofanoutputsequenceasas the output vocabulary list, we can also define the max number of tokens of an output sequence asT'$

Greedy Search would simply choose the token with the highest conditional probability from VV. However, the greedy search finds the sequence of most likely tokens, whereas the objective of Seq2Seq is actually to find the most likely sequence. The sequence of most likley tokens is typically not equal to the most likely sequence altogether

Exhaustive Search would look to generate all possible combinations of output tokens, and then run probability distributions over all combinations in the resulting set. If there are 100 words in our vocabulary with max sentence size of 20, we'd need to run 10020100^20 simulations, which is well beyond the capabilities (and frankly wasteful)

So greedy search is extremely cheap at O(V×T)O(|V| \times T'), but is sub-optimal, and exhaustive search is extremely compute intensive, and there are many strategies inbetween, like beam search, which looks to bridge the gap inbetween

Greedy Decoding

This is the most straightfoward, cheapest, and more efficient decoding strategy. At each time step you simply pick the token with the highest probability based on encoding state and previous vectors without any searching

It's flawed in that the best token at the current step doesn't always necessarily lead to the best, or correct, output sequence. That's formalized by the max over products not being equal to the product over maxes

argmaxyt=1np(yty<t,x)t=1nargmaxyp(yty<t,x){\text{argmax}_ {y}} \prod_{t=1}^{n} p(y_{t} | y_{\lt t}, x) \neq \prod_{t=1}^{n} {\text{argmax}_ {y}} p(y_{t} | y_{\lt t}, x)

Beam search keeps track of the kk most probably hypotheses

Beam search is a method for decoding a sequence given an auto-regressive function that outputs a probability distribution over the next possible symbols - in this scenario it's specifically focusing on how to choose a word from an output probability distribution in an encoder-decoder seq2seq model architecture

Beam search itself is characterized, mostly, by a hyperparameter beam size k

  • At step 1 we choose kk tokens with the highest predicted possibilities
  • At step 2, based on the initial kk candidates, we choose kk candidates from each of those
  • And so on and so forth, ultimately selecting kk candidate output sequences with the highest predicted possibilities from the O(kV)O(k |V|) possible choices

Below shows beam size of 2 with a total output sequence of length 3. From this the set of output sequences would be A, C, AB, CE, ABD, and CED, and for our predicted output we would choose the highest probability sequence from this list Beam Search

Prediction and Evaluation

To predict the actual output sequence at each step, the predicted token from the previous time step is fed into the decoder as an input in the current step

When a new prediction is ready to be made, there are many ways to actually choose which token from the distribution we should utilize - the most intuitive and common approach would be to just take the word with highest probability. It's a pretty straightforward architecture, seems intuitive, etc, and it's fine but there are also more sophisticated techniques like beam search

The initial token fed into the sentence is a special token like <bos> meaning Beginning Of Sentence, and then when the decoder finally predicts <eos> the sentence generation will stop

Encoder Decoder BOS EOS

After this is done, how can we evaluate if that word is correct? What if we chose America instead of USA as the output token? Is that "wrong"? Most of the time the Bilingual Evaluation Understudy (BLEU), which was proposed for evaluating machine translation results, is the most commonly used dataset for evaluating Seq2Seq models like this. It works by looking over n-grams in the predicted sequence, and BLEU allows evaluation on whether any n-gram appears in the target sequence

If pnp_{n} is the precision of an n-gram, which is defined as the ratio of the number of matched n-grams in the predicted and target sequences to the number of n-grams in the predicted sequence, then we can define metrics as

pn=# of matched n-grams# of n-grams in the candidatep_n = \frac{\text{\# of matched } n\text{-grams}}{\text{\# of } n\text{-grams in the candidate}}

Which gives BLEU as

BLEU=BPexp(n=1Nwnlogpn)\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^N w_n \log p_n \right)

A better explanation would be, given a target sequence A,B,C,D,E,FA, B, C, D, E, F and a predicted sequence of A,B,B,C,DA, B, B, C, D, we have:

  • p1=45p_1 = 4 \over 5
  • p2=34p_2 = 3 \over 4
  • p3=13p_3 = 1 \over 3
  • p4=0p_4 = 0

If the predicted sequence is the same as the target sequence, the BLEU score is 1, and the length of the n-gram matters because matching a longer n-gram is harder than a shorter one