Transformer Residual Stream Simulator

A minimal "Attention Is All You Need"-style encoder. Each layer adds two writes — self-attention and a feed-forward network — to the residual stream. Watch a token's hidden state evolve layer by layer, and toggle residual connections or layer norm to see what breaks without them.
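As a rough illustration of what the two toggles control, here is a minimal sketch of one post-norm sublayer update, LayerNorm(x + Sublayer(x)), in plain Python. The helper names (`layer_norm`, `apply_sublayer`, `make_linear`) are hypothetical and not taken from the simulator. The random linear map is only a stand-in for attention or the feed-forward network, and the layer norm here has no learned gain or bias.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def apply_sublayer(stream, sublayer, use_residual=True, use_layernorm=True):
    """Post-norm update with both pieces optional, mirroring the two toggles."""
    writes = sublayer(stream)  # one write vector per token
    out = []
    for x, w in zip(stream, writes):
        # With the residual connection the write is added to the stream;
        # without it the write replaces the stream.
        y = [a + b for a, b in zip(x, w)] if use_residual else list(w)
        out.append(layer_norm(y) if use_layernorm else y)
    return out

def make_linear(d, rng, scale=0.5):
    """Hypothetical stand-in sublayer: a fixed random d x d linear map per token."""
    W = [[rng.gauss(0, scale / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
    def f(stream):
        return [[sum(x[i] * W[i][j] for i in range(d)) for j in range(d)]
                for x in stream]
    return f

rng = random.Random(0)
d_model = 8
stream = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]  # 3 tokens

# One "layer": an attention-like write followed by an FFN-like write.
for sub in (make_linear(d_model, rng), make_linear(d_model, rng)):
    stream = apply_sublayer(stream, sub)
```

Calling `apply_sublayer` with `use_residual=False` discards the incoming stream and keeps only the sublayer's output, which is one way to see why the original token information fades quickly when residual connections are off.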

Input

Tokens are split on whitespace. Click a token to follow its row across layers.

Architecture

Layers (L)

Heads (h)

d_model

d_ff

Seed

Residual connections

LayerNorm (post-norm)

Causal mask

Tokens: –

d_k per head: –

Params (approx): –

Stream drift: –
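Two of the readouts above follow from the architecture settings by simple arithmetic. The sketch below assumes the convention from "Attention Is All You Need" that d_k = d_model / h, and counts only the weight matrices of each layer. That counting rule is an assumption: the simulator's own approximation may also include biases, LayerNorm parameters, or embeddings. The function names are hypothetical.

```python
def d_k_per_head(d_model, heads):
    """Per-head key/query width, assuming d_model is split evenly across heads."""
    if d_model % heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // heads

def approx_params(layers, d_model, d_ff):
    """Weight matrices only (assumption: no biases, LayerNorm, or embeddings)."""
    attn = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff      # d_model -> d_ff -> d_model
    return layers * (attn + ffn)

print(d_k_per_head(512, 8))        # 64
print(approx_params(6, 512, 2048)) # 18874368
```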

Inspect

Layer

Head

Residual stream across the network

Rows = tokens, columns = dimensions of d_model. Each heatmap is the stream at one checkpoint: embed → per layer (+ attn → + ffn). Compare heatmaps across checkpoints to see how the stream evolves. Hover a cell for its value.
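The checkpoint sequence can be sketched as a loop that snapshots the stream after the embedding and after every sublayer write, giving 1 + 2L snapshots for L layers. This is an illustrative sketch, not the simulator's code: layer norm is omitted for brevity, and `toy_attn` and `toy_ffn` are hypothetical stand-ins for the real sublayers.

```python
def run_with_checkpoints(embed, layers):
    """layers: list of (attn_fn, ffn_fn); each fn maps the stream to a same-shape write."""
    stream = [row[:] for row in embed]
    checkpoints = [("embed", [row[:] for row in stream])]
    for i, (attn, ffn) in enumerate(layers, start=1):
        for name, sub in (("attn", attn), ("ffn", ffn)):
            write = sub(stream)
            # Residual update: add the sublayer's write to the stream.
            stream = [[a + b for a, b in zip(x, w)] for x, w in zip(stream, write)]
            checkpoints.append((f"L{i} +{name}", [row[:] for row in stream]))
    return checkpoints

def toy_attn(stream):
    # Hypothetical stand-in: every token receives the mean of all token vectors.
    n = len(stream)
    mean = [sum(col) / n for col in zip(*stream)]
    return [mean[:] for _ in stream]

def toy_ffn(stream):
    # Hypothetical stand-in: a position-wise ReLU scaled by 0.5.
    return [[0.5 * max(v, 0.0) for v in x] for x in stream]

embed = [[1.0, -1.0], [3.0, 1.0]]  # 2 tokens, d_model = 2
cps = run_with_checkpoints(embed, [(toy_attn, toy_ffn)] * 2)
labels = [name for name, _ in cps]
print(labels)  # ['embed', 'L1 +attn', 'L1 +ffn', 'L2 +attn', 'L2 +ffn']
```

Each entry in `cps` pairs a label with a tokens × d_model grid, which is the shape one heatmap would display.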

Attention — selected layer & head

Rows = queries, columns = keys. Darker = higher weight (post-softmax). Each row sums to 1.
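For reference, the weights in such a grid can be computed as softmax(QKᵀ / √d_k), applied row by row. The sketch below is a plain-Python version for a single head. The function name is hypothetical, and the causal convention (query i attends only to keys 0..i) is the usual one but is assumed here rather than read from the simulator.

```python
import math

def attention_weights(Q, K, causal=False):
    """Rows index queries, columns index keys; each row is a softmax over keys."""
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Assumed convention: mask out keys that come after the query.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        m = max(scores)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
full = attention_weights(Q, K)
masked = attention_weights(Q, K, causal=True)
```

With `causal=True` the first query can see only the first key, so its row is `[1.0, 0.0, 0.0]`, and every entry above the diagonal is zero.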

Per-layer sublayer contributions

Norm of each sublayer's write to the stream. If a token is selected, the bars show that token's norms (drawn in orange); otherwise they show the average across tokens. A bigger bar means that sublayer moved the stream more at that layer.
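The bar heights described above can be sketched as follows. The choice of the L2 (Euclidean) norm is an assumption, since the text does not say which norm is used, and the function names are hypothetical.

```python
import math

def write_norms(write):
    """L2 norm of each token's write vector (one row per token)."""
    return [math.sqrt(sum(v * v for v in row)) for row in write]

def bar_height(write, selected=None):
    """Norm for the selected token, or the mean norm across tokens if none is selected."""
    norms = write_norms(write)
    if selected is not None:
        return norms[selected]
    return sum(norms) / len(norms)

# A sublayer write for two tokens with d_model = 2.
write = [[3.0, 4.0], [0.0, 0.0]]
print(bar_height(write, selected=0))  # 5.0
print(bar_height(write))              # 2.5
```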