Taken from Ravin Kumar’s GenAI Guidebook.

The task of predicting the next word in a sequence is called language modelling.

Formally,

A language model is a probability distribution over sequences of words.
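
Spelled out a bit more (this is the standard chain-rule factorization, not notation specific to the guidebook), the probability of a sequence of words $w_1, \dots, w_n$ is usually broken into a series of next-word predictions:

$$
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
$$

Each factor asks the same question: given the words so far, how likely is the next one?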

Generating text given a prompt is also referred to as conditional generation. In classical language modelling we are often only concerned with predicting the next word. For a text-to-text model the intuition is: given the words so far, which word is most likely to come next? For example, consider catching a mistyped phrase such as "name my is…".

A language model assigns this ordering a very low probability, which tells us it should be changed (most likely to "my name is…").
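
As a minimal sketch of this idea, the toy bigram model below (built on a tiny made-up corpus, not code from the guidebook) scores "my name is" far higher than "name my is":

```python
from collections import defaultdict

# Tiny hypothetical corpus; a real language model is trained on vastly more text.
corpus = [
    "my name is alice",
    "my name is bob",
    "hello my name is carol",
]

# Count how often word `b` follows word `a`.
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        bigram_counts[a][b] += 1
        context_counts[a] += 1

def sequence_probability(sentence, smoothing=1e-6):
    """Score a word sequence as a product of bigram conditionals P(b | a)."""
    words = sentence.split()
    prob = 1.0
    for a, b in zip(words, words[1:]):
        count = bigram_counts[a][b]
        # Back off to a tiny constant for unseen bigrams so the product stays non-zero.
        prob *= count / context_counts[a] if count else smoothing
    return prob

print(sequence_probability("my name is"))   # high: these bigrams appear in the corpus
print(sequence_probability("name my is"))   # tiny: "my" never follows "name"
```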

Taking the token with the highest probability as our prediction is known as greedy decoding (or greedy sampling). We can introduce stochasticity into our generations by sampling from the probability distribution instead of always picking the top token. Increasing the sampling temperature flattens that distribution, so lower-probability tokens are chosen more often and the model takes more risks, producing more varied and creative output.
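
Here is a rough sketch of greedy decoding versus temperature-based sampling, assuming we already have a vector of next-token logits (the vocabulary and values are illustrative, not from the guidebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example logits for a hypothetical 5-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
vocab = ["my", "the", "a", "his", "banana"]

def softmax(x):
    x = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Greedy: always pick the single most likely token.
greedy_token = vocab[int(np.argmax(logits))]

def sample_with_temperature(logits, temperature=1.0):
    """Sample one token; higher temperature flattens the distribution."""
    probs = softmax(logits / temperature)
    return vocab[rng.choice(len(vocab), p=probs)]

print("greedy:", greedy_token)
print("T=0.5:", [sample_with_temperature(logits, 0.5) for _ in range(5)])
print("T=1.5:", [sample_with_temperature(logits, 1.5) for _ in range(5)])
```

At low temperature the samples collapse toward the greedy choice; at high temperature the rarer tokens show up more often.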

In practice we rarely want to predict from only the previous word; we want to condition on a much longer context. This is where modern neural network architectures like the transformer come in.