Encode semantics by representing words in a vector space (see 3b1b). For example,
king - man + woman = queen
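This analogy can be reproduced with pretrained vectors. A minimal sketch, assuming gensim is installed and can download its "word2vec-google-news-300" model (a large download on first use):

```python
import gensim.downloader as api

# Load pretrained word2vec vectors (downloaded and cached on first call).
vectors = api.load("word2vec-google-news-300")

# most_similar adds the "positive" vectors, subtracts the "negative" ones,
# and returns the nearest words by cosine similarity.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```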
In a simplified setup, we can train a neural network so it correctly predicts the next word in the corpus (a minimal forward pass is sketched after the dimension list):
Dimensions:
- One-hot current word has shape (1, N_words)
- Embeddings matrix has shape (N_words, N_dims)
- Latent embeddings have shape (1, N_dims)
- Second trainable weight matrix has shape (N_dims, N_words)
- Predicted next word has shape (1, N_words)
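A minimal NumPy sketch of one forward pass with these shapes (the names E and W and the toy sizes are illustrative, not the original notation):

```python
import numpy as np

N_words, N_dims = 10_000, 300                 # toy vocabulary and embedding sizes (illustrative)

E = np.random.randn(N_words, N_dims) * 0.01   # embeddings matrix, shape (N_words, N_dims)
W = np.random.randn(N_dims, N_words) * 0.01   # second trainable matrix, shape (N_dims, N_words)

x = np.zeros((1, N_words))                    # one-hot current word, shape (1, N_words)
x[0, 42] = 1.0                                # pretend the current word has index 42

h = x @ E                                     # latent embedding, shape (1, N_dims)
                                              # (multiplying a one-hot by E just selects one row of E)
logits = h @ W                                # scores over the vocabulary, shape (1, N_words)

# softmax turns the scores into a predicted next-word distribution, shape (1, N_words)
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
```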
We use cross-entropy loss to compare the predicted next word to the true next word. We can vectorise this over the entire corpus by giving the current-word input shape (length_corpus, N_words).
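A sketch of the vectorised loss, again with illustrative names and toy sizes:

```python
import numpy as np

N_words, N_dims, length_corpus = 10_000, 300, 8     # toy sizes (illustrative)
E = np.random.randn(N_words, N_dims) * 0.01          # embeddings matrix
W = np.random.randn(N_dims, N_words) * 0.01          # second trainable matrix

current = np.random.randint(0, N_words, size=length_corpus)  # current-word indices
target = np.random.randint(0, N_words, size=length_corpus)   # true next-word indices

X = np.zeros((length_corpus, N_words))               # one-hot rows, shape (length_corpus, N_words)
X[np.arange(length_corpus), current] = 1.0

logits = X @ E @ W                                   # shape (length_corpus, N_words)
logits -= logits.max(axis=1, keepdims=True)          # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# cross-entropy: mean negative log-probability assigned to each true next word
loss = -np.log(probs[np.arange(length_corpus), target]).mean()
print(loss)
```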
word2vec (2013) uses an embedding dimension (N_dims) of around 300 for its word vectors; OpenAI's Ada embedding model uses 1536.
word2vec uses more context than just the next word:
- Continuous bag of words (CBOW) uses the surrounding context words to predict the centre word (maximise the probability of each centre word given its surrounding context words)
- Skip-gram uses the centre word to predict the surrounding context words (maximise the probability of each context word given its centre word); both pairings are sketched below
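A rough sketch of how the two schemes turn a sentence into training pairs (the sentence and window size are made up for illustration):

```python
# Illustrative training-pair generation for CBOW and skip-gram.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, centre in enumerate(sentence):
    # context = up to `window` words on each side of the centre word
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_pairs.append((context, centre))                   # CBOW: context words -> centre word
    skipgram_pairs.extend((centre, c) for c in context)    # skip-gram: centre word -> each context word

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```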