Transformer Residual Stream

A minimal "Attention Is All You Need"-style encoder. Each layer adds two writes — self-attention and a feed-forward network — to the residual stream. Watch a token's hidden state evolve layer by layer, and toggle residual connections or layer norm to see what breaks without them.

Input

Tokens split on whitespace. Click a token to follow its row across layers.

Architecture

Tokens
dk per head
Params (approx)
Stream drift

Inspect

Residual stream across the network

Rows = tokens, columns = dimensions of dmodel. Each heatmap is the stream at one checkpoint: embed → per layer (+ attn+ ffn). Compare columns to see how the stream evolves. Hover a cell for its value.

Attention — selected layer & head

Rows = queries, columns = keys. Darker = higher weight (post-softmax). Each row sums to 1.

Per-layer sublayer contributions

Norm of each sublayer's write to the stream, for the selected token (orange if token is chosen), or averaged across tokens otherwise. Bigger bars = that sublayer moved the stream more at that layer.