A minimal "Attention Is All You Need"-style encoder. Each layer adds two writes — self-attention
and a feed-forward network — to the residual stream. Watch a token's hidden state evolve layer by layer,
and toggle residual connections or layer norm to see what breaks without them.
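The two writes per layer can be sketched in NumPy as follows. This is a minimal illustration, not the page's actual implementation: it assumes a single attention head, pre-norm placement of layer norm, a ReLU feed-forward network, and random weights. The names `encoder_layer`, `use_residual`, and `use_layer_norm` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x, p, use_residual=True, use_layer_norm=True):
    """One layer: a self-attention write, then a feed-forward write (assumed pre-norm)."""
    norm = layer_norm if use_layer_norm else (lambda h: h)

    # Write 1: single-head self-attention.
    h = norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ p["wo"]
    x = (x + attn) if use_residual else attn

    # Write 2: position-wise feed-forward network with ReLU.
    h = norm(x)
    ffn = np.maximum(h @ p["w1"], 0.0) @ p["w2"]
    x = (x + ffn) if use_residual else ffn
    return x

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 32, 4
shapes = {
    "wq": (d_model, d_model), "wk": (d_model, d_model),
    "wv": (d_model, d_model), "wo": (d_model, d_model),
    "w1": (d_model, d_ff), "w2": (d_ff, d_model),
}
p = {name: rng.normal(0.0, shape[0] ** -0.5, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n_tokens, d_model))

out = encoder_layer(x, p)                             # stream after both writes
out_no_res = encoder_layer(x, p, use_residual=False)  # each sublayer overwrites the stream
```

With `use_residual=False` each sublayer replaces the stream instead of adding to it, so the token's original embedding is no longer carried forward directly.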
Input
Tokens split on whitespace. Click a token to follow its row across layers.
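As a rough sketch of this step, whitespace splitting and the initial stream might look like the following. The vocabulary construction, the random embedding table, and the names `vocab`, `embedding`, and `selected` are assumptions for illustration; positional encodings are left out.

```python
import numpy as np

text = "the cat sat on the mat"
tokens = text.split()  # split on any run of whitespace

# Hypothetical vocabulary: one id per distinct token, in order of first appearance.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]

d_model = 8
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), d_model))  # random stand-in for learned embeddings

stream = embedding[ids]  # one row per token: the stream at the "embed" checkpoint
selected = 1             # index of the clicked token ("cat")
row = stream[selected]   # the row followed across layers
```

Repeated tokens (here both occurrences of "the") start from the same embedding row; they only diverge once position information or attention is applied.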
Architecture
Readouts: tokens, d_k per head, params (approx), stream drift.
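Two of these readouts follow from the hyperparameters by simple arithmetic. The sketch below assumes the standard layout (four d_model × d_model attention projections and a two-matrix feed-forward network per layer) and ignores biases, layer-norm parameters, and embeddings. The hyperparameter values are illustrative, not taken from the page.

```python
# Illustrative hyperparameters (assumed, not read from the page).
d_model, n_heads, d_ff, n_layers = 64, 4, 256, 2

# Each head works in a d_model / n_heads dimensional subspace.
d_k = d_model // n_heads

# Attention: W_Q, W_K, W_V, W_O, each d_model x d_model.
attn_params = 4 * d_model * d_model
# Feed-forward: d_model x d_ff, then d_ff x d_model.
ffn_params = 2 * d_model * d_ff

params_per_layer = attn_params + ffn_params
total_params = n_layers * params_per_layer

print(d_k, params_per_layer, total_params)  # 16 49152 98304
```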
Inspect
Residual stream across the network
Rows = tokens, columns = dimensions of d_model. Each heatmap is the stream at one checkpoint:
embed → per layer (+ attn → + ffn). Compare columns to see how the stream evolves.
Hover a cell for its value.
Attention — selected layer & head
Rows = queries, columns = keys. Darker = higher weight (post-softmax). Each row sums to 1.
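A minimal sketch of how such a per-head weight matrix can be computed, assuming scaled dot-product attention with random projection weights; `split_heads` and `head` are hypothetical names.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads

x = rng.normal(size=(n_tokens, d_model))
wq = rng.normal(size=(d_model, d_model)) * d_model ** -0.5
wk = rng.normal(size=(d_model, d_model)) * d_model ** -0.5

def split_heads(m):
    # (n_tokens, d_model) -> (n_heads, n_tokens, d_k)
    return m.reshape(n_tokens, n_heads, d_k).transpose(1, 0, 2)

q = split_heads(x @ wq)
k = split_heads(x @ wk)

# weights[h, i, j] = how much query token i attends to key token j in head h.
weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))

head = 2                       # the selected head
row_sums = weights[head].sum(axis=-1)
print(row_sums)                # every entry is 1 up to floating-point error
```

The softmax is taken over the key axis, which is why each query row is a probability distribution over keys.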
Per-layer sublayer contributions
Norm of each sublayer's write to the stream. If a token is selected, the bars show that token's norms (drawn in orange);
otherwise they show the average across tokens. A bigger bar means that sublayer moved the stream more at that layer.
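The bar heights could be computed along these lines. This sketch assumes the L2 norm and uses random arrays in place of real sublayer outputs; `writes` and `bar_heights` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 3, 4, 8

# writes[l, s, t] is the vector that sublayer s (0 = attn, 1 = ffn) of layer l
# adds to token t's residual stream. Random stand-ins for real sublayer outputs.
writes = rng.normal(size=(n_layers, 2, n_tokens, d_model))

# L2 norm of every write: shape (n_layers, 2, n_tokens).
norms = np.linalg.norm(writes, axis=-1)

def bar_heights(norms, selected=None):
    """Per-layer (attn, ffn) bar heights: one token's norms, or the mean over tokens."""
    if selected is None:
        return norms.mean(axis=-1)
    return norms[:, :, selected]

avg_bars = bar_heights(norms)               # no token selected
token_bars = bar_heights(norms, selected=1) # token 1 selected
```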