A Few Extra Details About the Transformer

Fangda Han
4 min read · Mar 4, 2020

Recently I read a really nice article about the Transformer in NLP. I want to write a summary of the post and add some details about it, as well as about the Transformer itself. A lot of the pictures are taken directly from that post, so please definitely read the original post!

The big picture of the Transformer: the input goes through 6 (a hyperparameter) encoders and 6 decoders to generate the output. (ref: http://jalammar.github.io/illustrated-transformer/)

Input

Encoder (Self-Attention + Feed Forward Network)

Big picture of one Encoder

Self-Attention

The weight matrices of the Transformer that transform each embedding into Queries, Keys and Values

In reality, we can compute all z_i in parallel.
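As a concrete illustration, here is a minimal PyTorch sketch of that matrix form (the function name and the toy shapes are mine, not from the post): stacking the queries, keys and values into matrices Q, K and V turns the per-word computation of z_i into a single matrix product.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Scaled dot-product attention: one matmul computes every z_i at once.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # row i is z_i

# Toy example: a sentence of 4 words with d_k = 8. In the real model,
# Q, K, V are the input times the learned weight matrices W^Q, W^K, W^V;
# here we feed x directly for brevity.
x = torch.randn(4, 8)
Z = attention(x, x, x)  # self-attention: Q, K, V come from the same input
print(Z.shape)          # torch.Size([4, 8])
```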

Multi-Head Self-Attention

Repeat the above self-attention module eight times
Combine the eight heads with a weight matrix
The big picture of self-attention module
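A minimal sketch of how the eight heads can be run in parallel and then combined, assuming PyTorch; the class name and weight names (W_q, W_o, etc.) are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch: h parallel attention heads, concatenated and mixed
    by an output weight matrix W_O."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # one big projection per role instead of h separate ones
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # combines the h heads

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # split each projection into h heads of size d_k
        q, k, v = (W(x).view(b, t, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        z = scores.softmax(dim=-1) @ v           # (b, h, t, d_k)
        z = z.transpose(1, 2).reshape(b, t, -1)  # concatenate the heads
        return self.W_o(z)                       # mix them with W_O

x = torch.randn(2, 5, 512)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 5, 512])
```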

Positional Encoding (the hard-to-understand part)

First of all, the intuition is that self-attention by itself does not encode any position information, so we need a way to tell the model the position of each word in a sentence. The paper adds a positional encoding to the input embeddings:
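For reference, the encoding proposed in the paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), so each pair of dimensions is a sinusoid at a different frequency. Here is a small PyTorch sketch (the helper name is mine) that builds the table:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe  # (max_len, d_model); added element-wise to the embeddings

pe = positional_encoding(100, 512)
print(pe.shape, pe.min().item(), pe.max().item())  # values stay in [-1, 1]
```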

Unfortunately, the otherwise excellent post explains this part incorrectly. I found another post that explains the positional encoding correctly and in more detail. The idea is that each position occupies a specific place in the frequency domain.

ref: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

The proposed method has two immediate advantages:

  1. Each position vector is different and follows a specific trend.
  2. All values stay within [-1, 1] and the norm of each position vector is similar. This is a very good property because the contribution of each word will not be overwhelmed by the positional encoding.

Residual Connection

After every sub-layer (self-attention as well as the feed-forward network), the input and the output are added together and layer-normalized.
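A minimal sketch of that add-and-norm step, assuming PyTorch (the class name AddAndNorm is mine; the paper applies LayerNorm after the residual sum):

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection: LayerNorm(x + SubLayer(x)), as in the paper."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))  # add, then layer-normalize

x = torch.randn(2, 5, 512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
out = AddAndNorm()(x, ffn)  # the same pattern wraps self-attention and the FFN
print(out.shape)            # torch.Size([2, 5, 512])
```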

Decoder

The output of the final encoder is transformed into a set of attention vectors K and V, which every decoder uses in its encoder-decoder attention layer.

This is another part where I want to add details. Let's look at the PyTorch code: the decoder's attention combines the input and output sides. The target words are embedded and go through masked self-attention, and the result then attends over the encoder's output. This is one key difference compared with previous RNN models.

DecoderLayer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
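Rather than reproducing the linked code, here is a self-contained sketch of one decoder layer built on torch.nn.MultiheadAttention (the class name and layer sizes are illustrative): masked self-attention over the embedded target words first, then attention whose K and V come from the encoder output.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Illustrative decoder layer: target self-attention, then
    encoder-decoder attention, then a feed-forward network."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, memory, tgt_mask=None):
        # 1) (masked) self-attention over the embedded target words
        y, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norms[0](x + y)
        # 2) encoder-decoder attention: queries from the decoder,
        #    keys/values from the final encoder's output ("memory")
        y, _ = self.src_attn(x, memory, memory)
        x = self.norms[1](x + y)
        return self.norms[2](x + self.ffn(x))

memory = torch.randn(2, 7, 512)  # encoder output
tgt = torch.randn(2, 5, 512)     # embedded target words
print(DecoderLayerSketch()(tgt, memory).shape)  # torch.Size([2, 5, 512])
```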

Loss

The output of the decoder is mapped back to the vocabulary space to compute the loss.

SimpleLossCompute: http://nlp.seas.harvard.edu/2018/04/03/attention.html
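The linked SimpleLossCompute pairs this projection with label smoothing; as a simpler sketch of the same idea, here is the projection followed by plain cross-entropy (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10000, 512

# "generator": project decoder output back to vocabulary logits
proj = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(2, 5, d_model)           # decoder output
target = torch.randint(0, vocab_size, (2, 5))  # gold next-word ids

logits = proj(dec_out)                         # (batch, seq, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), target.view(-1))
print(loss.item())
```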

Summary

The encoder, the decoder, and the loss cover the main ideas of the Transformer. To summarize everything in one figure:
