
How Transformers Work: A Detailed Exploration of Transformer Architecture

Source: DataCamp (Jan 9, 2024)

In Brief

Transformers are attention-based neural network architectures for sequence-to-sequence tasks. This article walks through the architecture step by step: input embeddings, positional encoding, the encoder's multi-head self-attention and feed-forward sublayers, and the decoder's masked self-attention and encoder-decoder attention.

What Are Transformers?

Transformers are neural network architectures designed for sequence-to-sequence tasks. They rely almost entirely on attention mechanisms instead of recurrence (as in RNNs).

Transformer as a Black Box

The transformer architecture as a black box. Image by the author.
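
To make the black-box view concrete, here is a minimal sketch using PyTorch's built-in torch.nn.Transformer module (the framework choice and all sizes are assumptions for illustration, not part of the original article): a source sequence and a target sequence of embeddings go in, a sequence of output representations comes out.

```python
import torch
import torch.nn as nn

# Black-box view: the transformer maps a source sequence and a (shifted)
# target sequence of embeddings to a new sequence of representations.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.rand(1, 10, 512)  # (batch, source length, embedding size)
tgt = torch.rand(1, 7, 512)   # (batch, target length, embedding size)

out = model(src, tgt)
print(out.shape)  # torch.Size([1, 7, 512])
```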


Historical Context


The Shift from RNNs to Transformers

Limitations of RNNs

RNNs process tokens one at a time, which prevents parallelization across the sequence, and they struggle to carry information over long distances because gradients vanish as they are propagated through many time steps.

Transformer Advantages

By replacing recurrence with self-attention, transformers process every position of a sequence in parallel and relate any pair of positions directly, regardless of distance, which makes training on large corpora far more efficient.


The Transformer Architecture

High-Level Overview

The transformer as an encoder-decoder black box. Image by the author.

Components: the encoder, which converts the input sequence into a sequence of contextual representations, and the decoder, which uses those representations to generate the output sequence. Each is repeated N times.

Global structure of the Encoder-Decoder. Image by the author.

Encoder-Decoder repeated N times. Image by the author.
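
As a sketch of how the stacking looks in code (again assuming PyTorch's built-in layer classes; the sizes mirror the original "Attention Is All You Need" configuration), each stack is one layer type repeated N times:

```python
import torch.nn as nn

N = 6  # number of repeated layers, as in the original paper

# Encoder stack: N identical layers of self-attention + feed-forward.
enc_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=N)

# Decoder stack: N identical layers of masked self-attention,
# encoder-decoder attention, and feed-forward.
dec_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=N)
```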


The Encoder Workflow

Encoder structure. Image by the author.

Step 1: Input Embeddings

Encoder workflow: Input Embedding.
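
The first step maps each input token ID to a dense vector of size d_model. A minimal sketch using an nn.Embedding lookup table (the vocabulary size, token IDs, and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512
embedding = nn.Embedding(vocab_size, d_model)  # learnable lookup table

token_ids = torch.tensor([[5, 42, 7, 918]])    # (batch, sequence length)
x = embedding(token_ids)                       # (batch, sequence length, d_model)
print(x.shape)  # torch.Size([1, 4, 512])
```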

Step 2: Positional Encoding

Encoder workflow: Positional Encoding.
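
Because attention itself is order-agnostic, positional information is added to the embeddings. A sketch of the sinusoidal encoding from the original paper, where even dimensions use a sine and odd dimensions a cosine of position-dependent frequencies:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(seq_len).unsqueeze(1)                 # (seq_len, 1)
    div_term = 10_000 ** (torch.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions
    return pe

x = torch.rand(1, 4, 512)                       # embedded tokens
x = x + sinusoidal_positional_encoding(4, 512)  # inject position information
```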

Step 3: Stack of Encoder Layers

Each layer contains a multi-head self-attention sublayer and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization.

Encoder workflow: Stack of encoder layers.

Step 3.1: Multi-Head Self-Attention

Encoder workflow: Multi-Head Attention.

Attention Computation

1. Matrix multiplication of Query and Key: each query vector is compared against every key vector, producing a matrix of raw attention scores.

2. Scaling: the scores are divided by the square root of the key dimension to keep the softmax in a well-behaved range.

3. Softmax: the scaled scores are converted into attention weights that sum to 1 across each row.

4. Weighted Values: the weights are multiplied with the Value vectors and summed, giving each position a context-aware representation.
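
Putting the four steps together, here is a minimal sketch of scaled dot-product attention (single head, no masking; shapes and sizes are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # MatMul of Q and K, then scale
    weights = torch.softmax(scores, dim=-1)            # attention weights per row
    return weights @ v                                 # weighted sum of Value vectors

q = k = v = torch.rand(1, 4, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 64])
```

Multi-head attention runs several copies of this computation in parallel on linearly projected slices of the embeddings and concatenates the results; in PyTorch this is packaged as torch.nn.MultiheadAttention.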

Step 3.2: Residual Connections + Normalization

Residual connections and normalization.
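
Each sublayer's output is added back to its input (the residual connection) and the sum is layer-normalized. A sketch of the pattern, assuming the post-norm arrangement of the original paper and PyTorch modules:

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
attention_sublayer = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

x = torch.rand(1, 4, d_model)
attn_out, _ = attention_sublayer(x, x, x)  # self-attention: Q = K = V = x
x = norm(x + attn_out)                     # residual connection, then LayerNorm
```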

Step 3.3: Feed-Forward Neural Network

Encoder workflow: Feed-Forward Neural Network.
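
The feed-forward network applies the same two-layer MLP to every position independently, expanding to a larger inner dimension and projecting back. A sketch with the sizes used in the original paper (d_model = 512, d_ff = 2048):

```python
import torch.nn as nn

d_model, d_ff = 512, 2048
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),  # expand to the inner dimension
    nn.ReLU(),                 # non-linearity
    nn.Linear(d_ff, d_model),  # project back to the model dimension
)
```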

Step 4: Encoder Output

The output of the final encoder layer is a sequence of contextualized vectors, one per input token. This sequence is passed to every decoder layer, where it supplies the keys and values for encoder-decoder attention.


The Decoder Workflow

Decoder structure. Image by the author.

Step 1: Output Embeddings

The target sequence, shifted right by one position during training, is converted into embeddings in the same way as the encoder input.

Step 2: Positional Encoding

The same positional encoding is added to the output embeddings so the decoder also knows where each token sits in the sequence.

Step 3: Stack of Decoder Layers

Each layer contains a masked self-attention sublayer, an encoder-decoder (cross) attention sublayer, and a position-wise feed-forward network, each followed by a residual connection and layer normalization.

Step 3.1: Masked Self-Attention

Decoder workflow: Masked self-attention.
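
Masked self-attention works like the encoder's self-attention, except that a causal mask blocks each position from attending to later positions, so the decoder cannot peek at tokens it has not generated yet. A sketch of building and applying such a mask (shapes are illustrative assumptions):

```python
import math
import torch

seq_len, d_k = 4, 64
q = k = v = torch.rand(1, seq_len, d_k)

# True above the diagonal marks positions that must not be attended to (the future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask, float("-inf"))  # future scores -> -inf
weights = torch.softmax(scores, dim=-1)                  # future weights -> 0
out = weights @ v
```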

Step 3.2: Encoder-Decoder (Cross) Attention

Decoder workflow: Encoder-Decoder attention.
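
In encoder-decoder (cross) attention, the queries come from the decoder's previous sublayer while the keys and values come from the encoder output, letting every decoder position look at the whole input sequence. A minimal sketch using torch.nn.MultiheadAttention (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

d_model = 512
cross_attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

decoder_state = torch.rand(1, 7, d_model)    # queries from the decoder
encoder_output = torch.rand(1, 10, d_model)  # keys and values from the encoder

out, _ = cross_attention(query=decoder_state, key=encoder_output, value=encoder_output)
print(out.shape)  # torch.Size([1, 7, 512])
```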

Step 3.3: Feed-Forward Neural Network

The same position-wise feed-forward network used in the encoder is applied to each decoder position, again followed by a residual connection and layer normalization.


Summary

Transformers replace recurrence with attention, enabling parallel processing of entire sequences, direct modeling of long-range dependencies between any two positions, and efficient scaling to very large models.

This architecture underpins modern models such as GPT, BERT, and other foundation models.


