Word embeddings encode semantics by representing words as vectors in a shared vector space, so that arithmetic on the vectors reflects relationships between words (see 3b1b). For example,

king - man + woman = queen
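
This can be checked directly against pre-trained vectors. A minimal sketch using gensim (the model name and the roughly 1.6 GB download are gensim-data details, not part of these notes):

```python
import gensim.downloader as api

# Downloads the pre-trained Google News word2vec vectors (~1.6 GB) on first use
wv = api.load("word2vec-google-news-300")

# most_similar adds the "positive" vectors, subtracts the "negative" ones,
# then returns the nearest words by cosine similarity
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# [('queen', ...), ...]
```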

In a simplified setup, we can train a neural network to predict the next word at each position in the corpus:

Dimensions:

  • One-hot current word has shape (1, N_words)
  • Embeddings matrix has shape (N_words, N_dims)
  • Latent embeddings have shape (1, N_dims)
  • Second trainable weight matrix has shape (N_dims, N_words)
  • Predicted next word has shape (1, N_words)
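
A minimal numpy sketch of this forward pass, just to make the shapes concrete (the names E and W, and the vocabulary size, are placeholders rather than anything from word2vec itself):

```python
import numpy as np

N_words, N_dims = 10_000, 300

E = np.random.randn(N_words, N_dims) * 0.01   # embeddings matrix
W = np.random.randn(N_dims, N_words) * 0.01   # second trainable weight matrix

x = np.zeros((1, N_words))                    # one-hot current word
x[0, 42] = 1.0

h = x @ E                                     # latent embedding, shape (1, N_dims)
logits = h @ W                                # scores over the vocabulary, shape (1, N_words)
logits -= logits.max()                        # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over vocabulary
print(h.shape, probs.shape)                   # (1, 300) (1, 10000)
```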

We use cross-entropy loss to compare the predicted next word to the true next word. We can vectorise this over the entire corpus by stacking the one-hot current words into a matrix of shape (length_corpus, N_words).
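
As a sketch of that vectorised setup in PyTorch (assuming the corpus has already been tokenised into integer word ids; an id per word is equivalent to a row of the one-hot matrix, and the random ids below are just placeholders for a real corpus):

```python
import torch
import torch.nn as nn

N_words, N_dims, length_corpus = 10_000, 300, 50_000

# Integer word ids stand in for the (length_corpus, N_words) one-hot matrix
current = torch.randint(0, N_words, (length_corpus,))   # current word at each position
target = torch.randint(0, N_words, (length_corpus,))    # true next word at each position

model = nn.Sequential(
    nn.Embedding(N_words, N_dims),            # plays the role of the embeddings matrix
    nn.Linear(N_dims, N_words, bias=False),   # second trainable weight matrix
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()               # cross-entropy against the true next word

logits = model(current)                       # shape (length_corpus, N_words)
loss = loss_fn(logits, target)
loss.backward()
opt.step()
print(loss.item())                            # ~log(N_words) before any training
```

In practice word2vec avoids computing the full softmax over N_words (it uses negative sampling or a hierarchical softmax), but the simplified setup above is enough to see the shapes and the loss.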

word2vec (2013) uses an embedding dimension (N_dims) of around 300 for its word vectors; OpenAI's Ada embeddings use 1536.

word2vec uses more context than just the next word:

  • Continuous bag of words (CBOW) uses the surrounding context words to predict the centre word (maximise the probability of each centre word given its surrounding context words)
  • Skip-gram uses the centre word to predict the surrounding context words (maximise the probability of each context word given its centre word); the training pairs for both are sketched below
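
A small sketch of how the two objectives turn a token stream into (input, target) training pairs (the window size of 2 and the toy sentence are arbitrary choices here):

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Yield (input, target) pairs from a tokenised corpus.

    CBOW:      input = list of context words, target = centre word
    skip-gram: input = centre word,           target = one context word
    """
    for i, centre in enumerate(tokens):
        context = [
            tokens[j]
            for j in range(max(0, i - window), min(len(tokens), i + window + 1))
            if j != i
        ]
        if mode == "cbow":
            yield context, centre
        else:  # skip-gram
            for ctx in context:
                yield centre, ctx

tokens = "the quick brown fox jumps".split()
print(list(training_pairs(tokens, mode="cbow"))[:2])
# [(['quick', 'brown'], 'the'), (['the', 'brown', 'fox'], 'quick')]
print(list(training_pairs(tokens, mode="skipgram"))[:3])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```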