Transformer Residual Stream

A minimal 'Attention Is All You Need' encoder you can poke at — watch the stream change layer by layer.

The Transformer is a stack of identical blocks. Each block reads from a shared 'residual stream' (one vector per token), writes an update from multi-head self-attention, then writes another update from a feed-forward network. The stream is what carries information between layers.
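
As a rough sketch of the block described above, the following NumPy code runs one encoder block on a small stream. It assumes the post-norm placement from the original paper, LayerNorm(x + Sublayer(x)), and uses random weights with no biases, dropout, positional encodings, or learned layer-norm gain. All function and parameter names here are illustrative and are not taken from the demo's source.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (assumption: no learned gain or bias, to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, p, n_heads):
    n_tokens, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n_tokens, n_tokens)
    heads = softmax(scores) @ v                          # (n_heads, n_tokens, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return merged @ p["Wo"]

def feed_forward(x, p):
    # Position-wise two-layer network with a ReLU in between.
    return np.maximum(0.0, x @ p["W1"]) @ p["W2"]

def encoder_block(x, p, n_heads):
    # Each sublayer reads the stream, and its output is added back to it.
    x = layer_norm(x + self_attention(x, p, n_heads))
    x = layer_norm(x + feed_forward(x, p))
    return x

def init_block(rng, d_model, d_ff):
    # Hypothetical random initialization; a trained model would load real weights.
    p = {name: rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
         for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"] = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
    p["W2"] = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
    return p

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_ff = 5, 16, 4, 64
stream = rng.normal(size=(n_tokens, d_model))  # one vector per token
out = encoder_block(stream, init_block(rng, d_model, d_ff), n_heads)
print(out.shape)  # (5, 16): same shape in, same shape out
```

Because the block returns a stream of the same shape it received, identical blocks can be stacked to any depth.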

The strip at the top shows the full stream at every checkpoint: the embedding, then the stream after each sublayer. Click any token on the left to follow its row across the whole network, and watch how each sublayer nudges it.
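
To make the checkpoint idea concrete, here is a small sketch of the bookkeeping: record the stream after the embedding and after every sublayer, then pull out one token's row across all checkpoints. The two sublayers are deliberately trivial stand-ins (a token average and a tanh), not real attention or feed-forward layers, and layer norm is left out for brevity. The names run_with_checkpoints and token_row are hypothetical.

```python
import numpy as np

def run_with_checkpoints(embedding, blocks):
    """blocks is a list of (attn_fn, ffn_fn) pairs; each function maps stream -> update."""
    stream = embedding
    checkpoints = [("embed", stream.copy())]
    for i, (attn_fn, ffn_fn) in enumerate(blocks):
        stream = stream + attn_fn(stream)          # first write into the stream
        checkpoints.append((f"layer{i}.attn", stream.copy()))
        stream = stream + ffn_fn(stream)           # second write into the stream
        checkpoints.append((f"layer{i}.ffn", stream.copy()))
    return checkpoints

def token_row(checkpoints, token_index):
    # Follow one token across every checkpoint: shape (n_checkpoints, d_model).
    return np.stack([stream[token_index] for _, stream in checkpoints])

# Stand-in sublayers (assumptions for illustration only).
mix_tokens = lambda s: 0.1 * s.mean(axis=0, keepdims=True)  # mixes information across tokens
per_token = lambda s: 0.1 * np.tanh(s)                      # acts on each token independently

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 4, 8, 3
embedding = rng.normal(size=(n_tokens, d_model))
checkpoints = run_with_checkpoints(embedding, [(mix_tokens, per_token)] * n_layers)

print(len(checkpoints))                   # 7: one embedding plus two per layer
print(token_row(checkpoints, 2).shape)    # (7, 8)
```

With n_layers layers there are 1 + 2 * n_layers checkpoints, which is the number of columns a strip like the one described above would need.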

Use the controls to scale the model (more layers, more heads, a larger d_model), or turn off the residual connections or layer normalization to see why the original architecture relies on both.
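
The two switches can be sketched as flags on a single sublayer step. This is a toy illustration under stated assumptions, not the demo's code: the sublayers are simple stand-ins, layer norm has no learned parameters, and apply_sublayer is a hypothetical name. It shows two effects that are easy to verify: without the residual connection a sublayer that writes nothing erases the stream, and without layer norm repeated writes can grow the stream without bound.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization to zero mean and unit variance (no learned gain or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def apply_sublayer(x, sublayer, use_residual=True, use_layer_norm=True):
    out = sublayer(x)
    if use_residual:
        out = x + out           # add the update to the stream instead of replacing it
    if use_layer_norm:
        out = layer_norm(out)   # post-norm placement, as in the original paper
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# Stand-in sublayers (assumptions for illustration only).
writes_nothing = lambda s: np.zeros_like(s)
doubles = lambda s: 2.0 * s

# Residual on: a silent sublayer leaves the stream intact. Residual off: the stream is lost.
kept = apply_sublayer(x, writes_nothing, use_residual=True, use_layer_norm=False)
lost = apply_sublayer(x, writes_nothing, use_residual=False, use_layer_norm=False)

# Layer norm off: each step triples the stream (x + 2x). Layer norm on: the scale stays fixed.
unnormed, normed = x, x
for _ in range(6):
    unnormed = apply_sublayer(unnormed, doubles, use_layer_norm=False)
    normed = apply_sublayer(normed, doubles, use_layer_norm=True)

print(np.linalg.norm(unnormed) / np.linalg.norm(x))  # grows as 3**6
print(np.linalg.norm(normed, axis=-1))               # each close to sqrt(d_model)
```

Real attention and feed-forward sublayers behave less cleanly than these stand-ins, so treat the numbers as a picture of the mechanism rather than a prediction of what the demo displays.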

Open the demo in a new tab ↗