before transformers
Classical approaches to language modelling, such as Markov chains, were not flexible enough to learn nuances the way neural networks can.
Earlier neural network architectures, such as recurrent networks, suffered from vanishing or exploding gradients and scaled poorly because they process a sequence one step at a time.
Transformers are structured in a manner that allows them to be trained readily and stably. Originally introduced in the Attention Is All You Need paper. Best explained by 3b1b.
high-level transformer architecture
At a high level, the GPT architecture has three sections:
- Text and positional embeddings
- A transformer decoder stack
- A projection to vocab step
High-level architecture (from GPT in 60 lines of NumPy):
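A minimal sketch of that shape in NumPy. The names are illustrative (wte for token embeddings, wpe for positional embeddings), transformer_block is assumed to be defined as sketched under decoder architecture below, and GPT-2's final layer norm before the projection is left out for brevity:

```python
import numpy as np

def gpt(token_ids, wte, wpe, blocks):
    # 1) text + positional embeddings: [n_seq] -> [n_seq, n_embd]
    x = wte[token_ids] + wpe[np.arange(len(token_ids))]

    # 2) transformer decoder stack: each block keeps the [n_seq, n_embd] shape
    for block_params in blocks:
        x = transformer_block(x, block_params)  # sketched under decoder architecture

    # 3) projection to vocab: [n_seq, n_embd] -> [n_seq, n_vocab]
    return x @ wte.T  # reuse the embedding matrix; the outputs are logits
```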
inputs
The inputs to the model are tokens. These are created by using a tokeniser to break a string into pieces and map each piece to an integer.
Tokens are converted to vectors via an embedding matrix.
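A toy illustration of both steps. The vocabulary and the whitespace tokeniser are made up for the example; a real model uses a learned tokeniser such as BPE and a trained embedding matrix:

```python
import numpy as np

# made-up vocabulary; a real tokeniser (e.g. BPE) is learned from data
vocab = {"not": 0, "all": 1, "heroes": 2, "wear": 3, "capes": 4}

def encode(text):
    # break the string into pieces and map each piece to an integer
    return [vocab[word] for word in text.split()]

token_ids = encode("not all heroes wear capes")  # [0, 1, 2, 3, 4]

# embedding lookup: each token id indexes a row of the embedding matrix
n_vocab, n_embd = len(vocab), 8
wte = np.random.randn(n_vocab, n_embd) * 0.02    # learned during training, random here
x = wte[token_ids]                               # shape [n_seq, n_embd]
```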
decoder architecture
Within each decoder block (see the sketch after this list):
- Layer normalisation ensures that the inputs to each layer are always within a consistent range, which helps speed up and stabilise the training process.
- mha blocks are multi-headed self-attention blocks, which facilitate communication between the inputs. Nowhere else in the network does the model allow inputs to see each other.
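A sketch of one decoder block in NumPy, assuming GPT-2's pre-norm layout (layer norm before each sub-layer, with a residual connection around it). The parameter names are illustrative, biases are dropped, and ReLU stands in for GELU to keep it short:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    # normalise each row to mean 0 and variance 1, then scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(var + eps) + b

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask):
    # scaled dot-product attention with a causal mask added to the scores
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def mha(x, w_qkv, w_out, n_head):
    n_seq, n_embd = x.shape
    # project to queries, keys, values and split them into heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    qs, ks, vs = (np.split(m, n_head, axis=-1) for m in (q, k, v))
    # causal mask: position i may only attend to positions <= i
    mask = (1 - np.tri(n_seq)) * -1e10
    heads = [attention(qh, kh, vh, mask) for qh, kh, vh in zip(qs, ks, vs)]
    return np.hstack(heads) @ w_out

def ffn(x, w1, w2):
    # position-wise feed-forward network (GPT-2 uses GELU; ReLU here for brevity)
    return np.maximum(0, x @ w1) @ w2

def transformer_block(x, p):
    # pre-norm residual connections around attention and the feed-forward network
    x = x + mha(layer_norm(x, p["g1"], p["b1"]), p["w_qkv"], p["w_out"], p["n_head"])
    x = x + ffn(layer_norm(x, p["g2"], p["b2"]), p["w1"], p["w2"])
    return x
```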
outputs
Normally you would apply a softmax transform (which converts a set of real numbers to probabilities) over the last axis of the final output. But softmax is monotonic and logits are more numerically stable, so the model outputs logits instead.
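A small check that greedy decoding over logits picks the same token as it would over probabilities; the logit values are made up:

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1, -3.0])  # made-up scores over a tiny vocab
probs = softmax(logits)                   # roughly [0.66, 0.24, 0.10, 0.00]

# softmax is monotonic, so the most likely token is the same either way
assert np.argmax(logits) == np.argmax(probs)
```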
See language modelling for how to predict.