A few extra details about the Transformer
Recently I read a really nice article about the Transformer in NLP. I want to write a summary and add some details about the post, as well as about the Transformer itself. A lot of the pictures are taken directly from that post, so please do read the original post!
Input
Encoder (Self-Attention + Feed Forward Network)
Self-Attention
In practice, we can compute all z_i in parallel by packing the input embeddings into a matrix X and multiplying it with the weight matrices W_q, W_k, and W_v.
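As a minimal sketch of what that matrix form looks like (the function and variable names X, W_q, W_k, W_v below are my own, not from the post):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    """Compute all z_i at once via matrix multiplication.

    X: (seq_len, d_model) input embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices
    """
    Q = X @ W_q                                     # queries for every position
    K = X @ W_k                                     # keys for every position
    V = X @ W_v                                     # values for every position
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # Z: (seq_len, d_k)

# tiny usage example
X = torch.randn(5, 16)                              # 5 tokens, d_model = 16
W_q = torch.randn(16, 8); W_k = torch.randn(16, 8); W_v = torch.randn(16, 8)
Z = self_attention(X, W_q, W_k, W_v)                # Z[i] is z_i, all computed in parallel
```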
Multi-Head Self-Attention
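As a hedged sketch: multi-head self-attention runs several scaled dot-product attentions on separately projected queries, keys, and values, then concatenates the heads and projects the result back. PyTorch's torch.nn.MultiheadAttention packages exactly this; the dimensions below are illustrative, not taken from the post.

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)        # (batch, seq_len, d_model)
# self-attention: query, key, and value are all the same input
out, attn_weights = mha(x, x, x)
print(out.shape)                            # torch.Size([1, 10, 512])
print(attn_weights.shape)                   # torch.Size([1, 10, 10]), averaged over heads
```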
Positional Encoding (the hard-to-understand part)
First of all, the intuition is that the Transformer by itself does not encode any position information, so we need a way to tell the model the position of each word in a sentence. The paper adds a positional encoding to the input embedding:
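Concretely, for position pos and dimension index i of a d_model-dimensional embedding, the paper defines the encoding with sine and cosine functions of different frequencies:

$$
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$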
Unfortunately, the otherwise excellent post explains this part incorrectly. I found another post that explains the positional encoding correctly and in more detail. The idea is that each position occupies a specific place in the frequency domain.
The proposed method has two immediate advantages:
- Every position gets a distinct vector, and the vectors vary smoothly and predictably with position.
- All values stay within [-1, 1] and the overall magnitude of each position vector is similar, which is a very good property because the positional encoding will not drown out the contribution of the word embedding itself (see the sketch after this list).
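A minimal sketch of how this table of position vectors can be computed, assuming the sinusoidal formula above (the function name and arguments below are mine):

```python
import torch

def positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal positional encoding table."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimension indices
    div = torch.pow(10000.0, i / d_model)                         # one frequency per dimension pair
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)                            # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos / div)                            # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=64)
print(pe.min().item(), pe.max().item())                           # all values stay within [-1, 1]
```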
Residual Connection
After every self-attention (and feed-forward) sub-layer, the sub-layer's input and output are added together and then layer-normalized.
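A hedged sketch of this Add & Norm step, assuming the post-norm arrangement of the original paper; `sublayer` below stands in for either the self-attention or the feed-forward module:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization (post-norm)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # add the sub-layer's output to its input, then layer-normalize
        return self.norm(x + sublayer(x))

# usage: wrap a self-attention or feed-forward sub-layer
d_model = 512
add_norm = AddAndNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(2, 10, d_model)
y = add_norm(x, ffn)        # same shape as x
```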
Decoder
The output of the final encoder is transformed into a set of attention vectors K and V, which every decoder layer uses in its encoder-decoder attention.
This is another part where I want to add details. Looking at the PyTorch code, the decoder's encoder-decoder attention combines the encoder output (as keys and values) with the decoder's own embeddings (as queries); the target words are embedded as well to compute the attention. This is one key difference compared with previous RNN seq2seq models.
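A hedged sketch of that encoder-decoder attention step (not the actual PyTorch code the post refers to), using torch.nn.MultiheadAttention: the queries come from the embedded target words, while the keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

memory = torch.randn(1, 12, d_model)    # K, V: output of the final encoder layer
tgt = torch.randn(1, 7, d_model)        # Q: embedded (shifted) target words

# queries from the decoder side, keys/values from the encoder side
out, _ = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)                        # torch.Size([1, 7, 512])
```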
Loss
The output of the decoder is mapped back to the vocabulary space by a final linear layer and a softmax, and the loss is computed against the target sequence.
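A minimal sketch of that projection and loss, assuming a plain cross-entropy objective (the paper additionally uses label smoothing, which I skip here):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
generator = nn.Linear(d_model, vocab_size)          # maps decoder output to vocabulary logits
criterion = nn.CrossEntropyLoss(ignore_index=0)     # assume index 0 is the padding token

decoder_out = torch.randn(2, 7, d_model)            # (batch, tgt_len, d_model)
target = torch.randint(1, vocab_size, (2, 7))       # ground-truth token ids

# CrossEntropyLoss applies log-softmax internally, so we pass the raw logits
logits = generator(decoder_out)                     # (batch, tgt_len, vocab_size)
loss = criterion(logits.view(-1, vocab_size), target.view(-1))
print(loss.item())
```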
Summary
The encoder, the decoder, and the loss cover the main parts of the Transformer. To summarize everything in one figure: