before transformers

Classical approaches to language modelling, such as Markov chains, were not flexible enough to learn the nuances of language the way neural networks can.

Earlier neural network architectures for sequence modelling, such as RNNs and LSTMs, had scaling or gradient problems: they process tokens one at a time, which is hard to parallelise, and gradients tend to vanish or explode over long sequences.

Transformers are structured in a way that allows them to be trained readily and stably. They were originally introduced in the Attention Is All You Need paper, and are best explained by 3b1b.

high-level transformer architecture

At a high level, the GPT architecture has three sections:

  • Text and positional embeddings
  • A transformer decoder stack
  • A projection to vocab step

High-level architecture (from GPT in 60 Lines of NumPy):

def gpt2(inputs, wte, wpe, blocks, ln_f, n_head): # [n_seq] -> [n_seq, n_vocab]
	# token + positional embeddings
	x = wte[inputs] + wpe[range(len(inputs))] # [n_seq] -> [n_seq, n_embd]
	
	# decoder stack: forward pass through n_layer transformer blocks
	for block in blocks:
		x = transformer_block(x, **block, n_head=n_head) # [n_seq, n_embd] -> [n_seq, n_embd]
 
	# projection to vocab
	x = layer_norm(x, **ln_f) # [n_seq, n_embd] -> [n_seq, n_embd]
	return x @ wte.T # [n_seq, n_embd] -> [n_seq, n_vocab]

inputs

The inputs to the model are tokens: integer IDs created by using a tokeniser to break a string into pieces and map each piece to an integer.
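As a rough illustration with a toy vocabulary (the real model uses a BPE tokeniser, so this mapping is made up):

vocab = {"not": 0, "all": 1, "heroes": 2, "wear": 3, "capes": 4}

# encode: string -> list of integer token ids
ids = [vocab[word] for word in "not all heroes wear capes".split()] # [0, 1, 2, 3, 4]

# decode: list of integer token ids -> string
text = " ".join(list(vocab.keys())[i] for i in ids) # "not all heroes wear capes"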

Tokens are converted to vectors via an embedding matrix.

# wte converts tokens to vectors via the token embedding matrix
# wpe encodes positional information into our inputs
x = wte[inputs] + wpe[range(len(inputs))]
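To make the shapes concrete, a toy example with made-up sizes (random stand-in arrays, not real GPT-2 weights):

import numpy as np

n_vocab, n_ctx, n_embd = 5, 8, 4 # made-up sizes for illustration
wte = np.random.rand(n_vocab, n_embd) # [n_vocab, n_embd] token embedding matrix
wpe = np.random.rand(n_ctx, n_embd) # [n_ctx, n_embd] positional embedding matrix

inputs = [2, 0, 3] # token ids, [n_seq]
x = wte[inputs] + wpe[range(len(inputs))] # [n_seq, n_embd] = [3, 4]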

decoder architecture

# decoder stack: forward pass through n_layer transformer blocks
for block in blocks:
	x = transformer_block(x, **block, n_head=n_head) # [n_seq, n_embd] -> [n_seq, n_embd]

where

def transformer_block(x, mlp, attn, ln_1, ln_2, n_head): # [n_seq, n_embd] -> [n_seq, n_embd]
	# multi-head causal self attention
	x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head) # [n_seq, n_embd] -> [n_seq, n_embd]
 
	# position-wise feed forward network – standard dense network
	x = x + ffn(layer_norm(x, **ln_2), **mlp) # [n_seq, n_embd] -> [n_seq, n_embd]
 
	return x
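ffn is the position-wise feed-forward network: each position is independently projected up to a wider hidden dimension, passed through a non-linearity, and projected back down. A minimal sketch in the style of the cited NumPy implementation (the gelu and linear helpers are written out here for completeness):

def gelu(x):
	# GPT-2 uses the tanh approximation of GELU as its activation
	return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def linear(x, w, b): # [m, in], [in, out], [out] -> [m, out]
	return x @ w + b

def ffn(x, c_fc, c_proj): # [n_seq, n_embd] -> [n_seq, n_embd]
	# project up to a wider hidden dimension (4*n_embd in GPT-2)
	a = gelu(linear(x, **c_fc)) # [n_seq, n_embd] -> [n_seq, 4*n_embd]

	# project back down to the embedding dimension
	return linear(a, **c_proj) # [n_seq, 4*n_embd] -> [n_seq, n_embd]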

Layer normalisation ensures that the inputs for each layer are always within a consistent range, which helps speed up and stabilise the training process.
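A minimal sketch of layer_norm: normalise each position's vector to zero mean and unit variance over the embedding dimension, then apply a learned scale g and shift b.

def layer_norm(x, g, b, eps=1e-5): # [n_seq, n_embd] -> [n_seq, n_embd]
	mean = np.mean(x, axis=-1, keepdims=True)
	variance = np.var(x, axis=-1, keepdims=True)
	# normalise over the last axis, then scale and shift with the learned parameters
	return g * (x - mean) / np.sqrt(variance + eps) + b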

mha is multi-head causal self-attention, which facilitates communication between positions in the sequence. Nowhere else in the network does the model allow positions to see each other.
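A sketch of mha in the same style (assuming a softmax helper and the linear helper from the ffn sketch above). The causal mask is what stops a position from attending to tokens that come after it:

def attention(q, k, v, mask): # [n_q, d_k], [n_k, d_k], [n_k, d_v], [n_q, n_k] -> [n_q, d_v]
	# scaled dot-product attention: weight each value by how well its key matches the query
	return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def mha(x, c_attn, c_proj, n_head): # [n_seq, n_embd] -> [n_seq, n_embd]
	# project to queries, keys and values in one go
	x = linear(x, **c_attn) # [n_seq, n_embd] -> [n_seq, 3*n_embd]

	# split into q, k, v, then split each into heads
	qkv = np.split(x, 3, axis=-1) # 3 x [n_seq, n_embd]
	qkv_heads = [np.split(a, n_head, axis=-1) for a in qkv] # 3 x n_head x [n_seq, n_embd/n_head]

	# causal mask: large negative values above the diagonal hide future positions
	causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype)) * -1e10 # [n_seq, n_seq]

	# attend within each head, then merge the heads back together
	out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)] # n_head x [n_seq, n_embd/n_head]
	x = np.hstack(out_heads) # [n_seq, n_embd]

	# output projection
	return linear(x, **c_proj) # [n_seq, n_embd] -> [n_seq, n_embd]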

outputs

# projection to vocab
# reusing the embedding matrix `wte` for the projection
# output raw logits rather than softmax probabilities
x = layer_norm(x, **ln_f) # [n_seq, n_embd] -> [n_seq, n_embd]
return x @ wte.T # [n_seq, n_embd] -> [n_seq, n_vocab]

Normally we would apply a softmax transform (which converts a set of real numbers into probabilities) over the last axis. But softmax is monotonic (the largest logit always maps to the largest probability) and raw logits are more numerically stable, so we output the logits instead.
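A minimal sketch of a numerically stable softmax, showing that the ordering of the logits is preserved:

def softmax(x):
	# subtract the max for numerical stability; this does not change the result
	exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
	return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits) # roughly [0.79, 0.04, 0.18] - the largest logit gets the largest probability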

See language modelling for how predictions are made from these logits.
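As a rough preview, a greedy-decoding sketch in the style of the cited implementation (it assumes params bundles the weights wte, wpe, blocks and ln_f; take the argmax of the last position's logits and feed it back in autoregressively):

def generate(inputs, params, n_head, n_tokens_to_generate):
	for _ in range(n_tokens_to_generate):
		logits = gpt2(inputs, **params, n_head=n_head) # [n_seq, n_vocab]
		next_id = int(np.argmax(logits[-1])) # greedy: pick the most likely next token
		inputs.append(next_id) # autoregressive: the prediction becomes part of the input
	return inputs[-n_tokens_to_generate:] # return only the newly generated token ids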