A Few Extra Details About the Transformer

Fangda Han
4 min read · Mar 4, 2020

Recently I read a really nice article about the Transformer in NLP. I want to write a summary of the post and add some details about it, as well as about the Transformer itself. A lot of the pictures are taken directly from that post, so please definitely read the original post!

The big picture of the Transformer: the input goes through 6 (a hyperparameter) encoders and 6 decoders to generate the output. (ref: http://jalammar.github.io/illustrated-transformer/)

Input

Encoder (Self-Attention + Feed Forward Network)

Big picture of one Encoder

Self-Attention

The weight matrices of the Transformer that transform each embedding into Queries, Keys and Values

In reality, we can compute all z_i in parallel.
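As a concrete illustration, here is a minimal PyTorch sketch of that matrix form (the function name and the toy shapes are mine, not from the post): stacking the queries, keys and values into matrices Q, K and V turns the per-word computation of z_i into a single matrix product.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Scaled dot-product attention: one matmul computes every z_i at once.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # row i is z_i

# Toy example: a sentence of 4 words with d_k = 8. In the real model,
# Q, K, V are the input times the learned weight matrices W^Q, W^K, W^V;
# here we feed x directly for brevity.
x = torch.randn(4, 8)
Z = attention(x, x, x)  # self-attention: Q, K, V come from the same input
print(Z.shape)          # torch.Size([4, 8])
```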

Multi-Head Self-Attention

Repeat the above self-attention module eight times
Combine the eight heads with a weight matrix
The big picture of self-attention module
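A minimal sketch of how the eight heads can be run in parallel and then combined, assuming PyTorch; the class name and weight names (W_q, W_o, etc.) are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch: h parallel attention heads, concatenated and mixed
    by an output weight matrix W_O."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # one big projection per role instead of h separate ones
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # combines the h heads

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # split each projection into h heads of size d_k
        q, k, v = (W(x).view(b, t, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        z = scores.softmax(dim=-1) @ v           # (b, h, t, d_k)
        z = z.transpose(1, 2).reshape(b, t, -1)  # concatenate the heads
        return self.W_o(z)                       # mix them with W_O

x = torch.randn(2, 5, 512)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 5, 512])
```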

Positional Encoding (the hard-to-understand part)

First of all, the intuition is that self-attention by itself does not encode any position information, so we need a way to tell the model the position of each word in a sentence. The paper adds a positional encoding to the input embeddings:
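For reference, the encoding proposed in the paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), so each pair of dimensions is a sinusoid at a different frequency. Here is a small PyTorch sketch (the helper name is mine) that builds the table:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe  # (max_len, d_model); added element-wise to the embeddings

pe = positional_encoding(100, 512)
print(pe.shape, pe.min().item(), pe.max().item())  # values stay in [-1, 1]
```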

Unfortunately, the otherwise excellent post explains this part incorrectly. I found another post that explains the positional encoding correctly and in more detail. The idea is that each position occupies a specific place in the frequency domain.

ref: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

The proposed method has two immediate advantages:

  1. Each position vector is different and follows a specific trend.
  2. All values stay within [-1, 1] and the norm of each position vector is similar. This is a very good property because the contribution of each word will not be overwhelmed by the positional encoding.

Residual Connection

After every sub-layer (self-attention as well as the feed-forward network), the input and the output are added together and layer-normalized.
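A minimal sketch of that add-and-norm step, assuming PyTorch (the class name AddAndNorm is mine; the paper applies LayerNorm after the residual sum):

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection: LayerNorm(x + SubLayer(x)), as in the paper."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))  # add, then layer-normalize

x = torch.randn(2, 5, 512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
out = AddAndNorm()(x, ffn)  # the same pattern wraps self-attention and the FFN
print(out.shape)            # torch.Size([2, 5, 512])
```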

Decoder

The output of the final encoder is transformed into a set of attention vectors K and V, which every decoder uses in its encoder-decoder attention layer.

This is another part where I want to add details. Let's look at the PyTorch code: the decoder's attention combines the input and output sides. The target words are embedded and go through masked self-attention, and the result then attends over the encoder's output. This is one key difference compared with previous RNN models.

DecoderLayer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
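Rather than reproducing the linked code, here is a self-contained sketch of one decoder layer built on torch.nn.MultiheadAttention (the class name and layer sizes are illustrative): masked self-attention over the embedded target words first, then attention whose K and V come from the encoder output.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Illustrative decoder layer: target self-attention, then
    encoder-decoder attention, then a feed-forward network."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, memory, tgt_mask=None):
        # 1) (masked) self-attention over the embedded target words
        y, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norms[0](x + y)
        # 2) encoder-decoder attention: queries from the decoder,
        #    keys/values from the final encoder's output ("memory")
        y, _ = self.src_attn(x, memory, memory)
        x = self.norms[1](x + y)
        return self.norms[2](x + self.ffn(x))

memory = torch.randn(2, 7, 512)  # encoder output
tgt = torch.randn(2, 5, 512)     # embedded target words
print(DecoderLayerSketch()(tgt, memory).shape)  # torch.Size([2, 5, 512])
```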

Loss

The output of the decoder is mapped back to the vocabulary space to compute the loss.

SimpleLossCompute: http://nlp.seas.harvard.edu/2018/04/03/attention.html
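The linked SimpleLossCompute pairs this projection with label smoothing; as a simpler sketch of the same idea, here is the projection followed by plain cross-entropy (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10000, 512

# "generator": project decoder output back to vocabulary logits
proj = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(2, 5, d_model)           # decoder output
target = torch.randint(0, vocab_size, (2, 5))  # gold next-word ids

logits = proj(dec_out)                         # (batch, seq, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), target.view(-1))
print(loss.item())
```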

Summary

The encoder, the decoder, and the loss cover the main ideas of the Transformer. To summarize everything in one figure:
