4. Neural Methods

Word Embeddings

Learn static embedding algorithms like Word2Vec and GloVe, their training objectives, vector algebra over embeddings, and evaluation methods.

Hey students! šŸš€ Today we're diving into one of the most fascinating areas of natural language processing: word embeddings. Think of this as teaching computers to understand words the way humans do - by giving them numerical representations that capture meaning and relationships. By the end of this lesson, you'll understand how algorithms like Word2Vec and GloVe transform words into mathematical vectors, how these systems learn language patterns, and why they're so powerful for AI applications. Get ready to discover how machines can learn that "king" - "man" + "woman" = "queen"!

What Are Word Embeddings and Why Do We Need Them?

Imagine trying to explain to an alien what the word "dog" means without using any other words - pretty tough, right? šŸ›ø This is exactly the challenge computers face when processing human language. Word embeddings solve this problem by converting words into dense numerical vectors (lists of numbers) that capture semantic meaning and relationships.

Traditional approaches used "one-hot encoding" where each word was represented as a vector with mostly zeros and a single 1. For a vocabulary of 50,000 words, each word would be a 50,000-dimensional vector with 49,999 zeros! This approach had major problems: it was incredibly sparse, required massive storage, and worst of all, it couldn't capture any relationships between words. Because every pair of distinct one-hot vectors is orthogonal, every pair of words is equally dissimilar - "cat" and "dog" end up no closer to each other than "cat" and "airplane."

Word embeddings revolutionized this by creating dense representations - typically 100-300 dimensions - where similar words have similar vectors. Research shows that well-trained embeddings can capture syntactic relationships (like verb tenses), semantic relationships (like synonyms), and even complex analogies. For example, in a good embedding space, the vector for "Paris" minus "France" plus "Italy" will be very close to the vector for "Rome"!
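To make this concrete, here is a minimal sketch (in Python with NumPy) of how dense vectors let us measure similarity. The four-dimensional vectors below are made up purely for illustration; real embeddings typically have 100-300 learned dimensions.

```python
# Minimal sketch: comparing dense word vectors with cosine similarity.
# The 4-dimensional vectors below are invented for illustration only.
import numpy as np

embeddings = {
    "cat":      np.array([0.8, 0.1, 0.6, 0.2]),
    "dog":      np.array([0.7, 0.2, 0.5, 0.3]),
    "airplane": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))       # high (ā‰ˆ 0.98)
print(cosine_similarity(embeddings["cat"], embeddings["airplane"]))  # much lower (ā‰ˆ 0.26)
```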

Word2Vec: Learning from Context

Word2Vec, introduced by Google researchers in 2013, was a game-changer in natural language processing. The core insight is brilliantly simple and echoes linguist J. R. Firth's maxim: "You shall know a word by the company it keeps." In other words, we can learn what a word means by looking at the words that appear around it in text.

Word2Vec actually includes two different architectures: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts a target word from its surrounding context words - imagine filling in the blank: "The cat sat on the ___" where the model learns that "mat," "chair," or "floor" are likely answers. Skip-gram works in reverse: given a word, it predicts what words are likely to appear nearby.

The training process is fascinating! The algorithm slides a window across millions of sentences, creating training examples. For the sentence "The quick brown fox jumps," with a window size of 2, it creates pairs like (quick, The), (quick, brown), (brown, quick), (brown, fox), and so on. The neural network learns to adjust word vectors so that words appearing in similar contexts end up with similar representations.
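Here is a minimal sketch of that sliding-window pair generation, using the same sentence and window size as the example above; real training runs this over millions of sentences.

```python
# Minimal sketch of skip-gram training-pair generation with a sliding window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # context positions within `window` words of the center, excluding the center itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("The quick brown fox jumps".split(), window=2))
# [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown'), ...]
```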

What makes Word2Vec particularly clever is its use of negative sampling. Instead of updating weights for all 50,000 words in the vocabulary for each training example (which would be computationally expensive), it updates only the vectors for the target word, the observed context word, and a few randomly selected "negative" words. This makes training much faster while maintaining quality.
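The sketch below shows what a single negative-sampling update might look like, assuming randomly initialized toy vectors and the standard logistic-loss gradients; it illustrates the idea rather than reproducing the reference Word2Vec implementation.

```python
# Minimal sketch of one negative-sampling update on toy vectors.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 8, 50                                  # tiny sizes for illustration
W_in = rng.normal(scale=0.1, size=(vocab, dim))     # input (target) vectors
W_out = rng.normal(scale=0.1, size=(vocab, dim))    # output (context) vectors

def sgd_step(target, context, negatives, lr=0.05):
    """Update only the target vector, the true context vector, and k negative vectors."""
    v = W_in[target]
    grad_v = np.zeros(dim)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[idx]
        score = 1.0 / (1.0 + np.exp(-v @ u))        # sigmoid(v . u)
        g = score - label                           # gradient of the logistic loss
        grad_v += g * u
        W_out[idx] -= lr * g * v                    # update context/negative vector
    W_in[target] -= lr * grad_v                     # update target vector

sgd_step(target=3, context=7, negatives=rng.integers(0, vocab, size=5).tolist())
```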

GloVe: Combining Global and Local Statistics

While Word2Vec was revolutionary, researchers at Stanford noticed it only used local context information. Enter GloVe (Global Vectors for Word Representation), which combines the best of both worlds: the speed and efficiency of Word2Vec with global statistical information about word co-occurrences.

GloVe starts by building a massive co-occurrence matrix that counts how often every pair of words appears together across the entire corpus. If you have 50,000 unique words, this creates a 50,000 Ɨ 50,000 matrix! The key insight is that the ratio of co-occurrence probabilities can encode meaning. For example, "ice" appears with "solid" much more often than "gas" does, while "steam" shows the opposite pattern.
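A minimal sketch of building such a co-occurrence matrix from a toy two-sentence corpus is shown below; the real GloVe preprocessing additionally down-weights co-occurrences by their distance from the center word.

```python
# Minimal sketch of a co-occurrence count table built from a toy corpus.
from collections import defaultdict

corpus = [
    "ice is solid".split(),
    "steam is gas".split(),
]
window = 2
cooc = defaultdict(float)
for tokens in corpus:
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[(w, tokens[j])] += 1.0

print(cooc[("ice", "solid")], cooc[("steam", "gas")])  # 1.0 1.0
```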

The GloVe algorithm then factorizes this co-occurrence matrix to produce word vectors. The training objective is elegantly designed: it tries to make the dot product of two word vectors (plus bias terms) equal to the logarithm of their co-occurrence count. This mathematical formulation ensures that words with similar co-occurrence patterns end up with similar vectors.

In the original GloVe paper's experiments, GloVe outperformed Word2Vec on word analogy tasks, and it is often reported to work well even on smaller corpora. The global statistics help it capture broader patterns that might be missed by Word2Vec's local window approach.

Vector Algebra: The Magic of Mathematical Relationships

Here's where things get really exciting! šŸŽÆ Well-trained word embeddings exhibit remarkable mathematical properties that mirror human understanding of language. The most famous example is the analogy: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$.

This isn't just a cute trick - it reveals that the embedding space has learned abstract concepts like gender and royalty as directions in the vector space. The difference between "king" and "man" captures the concept of royalty, while the difference between "man" and "woman" captures gender. When you add these transformations together, you get vectors close to "queen."
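Here is a minimal sketch of solving such an analogy by vector arithmetic. The tiny hand-crafted vectors below encode "royalty" and "gender" as two explicit dimensions purely for illustration; with real pretrained Word2Vec or GloVe vectors, the same function typically ranks "queen" at or near the top.

```python
# Minimal sketch of analogy solving by vector arithmetic on toy 2-D vectors.
import numpy as np

embeddings = {
    "man":   np.array([1.0, 0.0]),   # dimension 0 ~ "royalty", dimension 1 ~ "gender"
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
    "apple": np.array([0.0, 3.0]),   # unrelated distractor
}

def analogy(vectors, a, b, c, topn=1):
    """Return the word(s) closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [
        (word, cos(vec, target))
        for word, vec in vectors.items()
        if word not in (a, b, c)          # exclude the query words themselves
    ]
    return sorted(candidates, key=lambda kv: kv[1], reverse=True)[:topn]

print(analogy(embeddings, "man", "king", "woman"))  # [('queen', 1.0)]
```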

These relationships extend far beyond simple analogies. Embeddings can capture:

  • Syntactic relationships: walk/walked, go/went (verb tenses)
  • Semantic relationships: big/large, car/automobile (synonyms)
  • Geographical relationships: Paris/France, Tokyo/Japan (capitals and countries)
  • Comparative relationships: good/better/best (degrees of comparison)

The mathematical operations work because similar words cluster together in the high-dimensional space. Words like "dog," "cat," "rabbit" form one cluster, while "car," "truck," "bicycle" form another. The directions between clusters represent semantic relationships.

Training Objectives and Optimization

Both Word2Vec and GloVe use sophisticated optimization techniques to learn these representations. Word2Vec's objective function aims to maximize the probability of predicting context words given a target word (or vice versa for CBOW). Mathematically, for Skip-gram, this means maximizing:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t)$$

where $T$ is the total number of words, $c$ is the context window size, and $p(w_{t+j} | w_t)$ is the probability of word $w_{t+j}$ given word $w_t$.
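For illustration, the sketch below computes this full-softmax probability $p(w_{t+j} \mid w_t)$ for toy vectors; in practice Word2Vec avoids this expensive sum over the whole vocabulary by using negative sampling, as described earlier.

```python
# Minimal sketch of the full-softmax probability p(context | target) on toy vectors.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
W_in = rng.normal(size=(vocab_size, dim))    # target-word vectors
W_out = rng.normal(size=(vocab_size, dim))   # context-word vectors

def p_context_given_target(context_id, target_id):
    """Softmax over all vocabulary words of their dot product with the target vector."""
    scores = W_out @ W_in[target_id]          # one score per vocabulary word
    scores -= scores.max()                    # numerical stability
    exp = np.exp(scores)
    return exp[context_id] / exp.sum()

log_prob = np.log(p_context_given_target(context_id=3, target_id=5))
```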

GloVe's objective function is different but equally elegant:

$$J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$

where $X_{ij}$ is the co-occurrence count, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f(X_{ij})$ is a weighting function that prevents very common word pairs from dominating the training.
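The sketch below evaluates this loss for a single word pair, using the weighting function $f(x) = (x/x_{\max})^{\alpha}$ capped at 1, with $x_{\max} = 100$ and $\alpha = 0.75$ as in the original GloVe paper; the vectors and biases are toy values.

```python
# Minimal sketch of the GloVe loss for a single word pair (i, j) with toy values.
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weights rare pairs and caps the influence of very frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2"""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

w_i, w_j = np.array([0.2, -0.1, 0.4]), np.array([0.1, 0.3, -0.2])
print(glove_pair_loss(w_i, w_j, b_i=0.0, b_j=0.0, x_ij=25.0))
```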

Both algorithms optimize these objectives with stochastic gradient methods (the original GloVe implementation uses AdaGrad), processing millions of training examples to gradually adjust the word vectors until they capture meaningful relationships.

Evaluation Methods: Measuring Success

How do we know if our word embeddings are actually good? šŸ¤” Researchers have developed several evaluation methods:

Intrinsic evaluation tests the embeddings directly through tasks like word similarity and analogy completion. The word similarity task compares embedding similarities with human judgments - if humans rate "car" and "automobile" as very similar, good embeddings should have high cosine similarity between their vectors. The analogy task tests whether embeddings can solve problems like "man is to king as woman is to ___."
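As a minimal sketch of the word-similarity evaluation, the function below compares cosine similarities against human ratings using Spearman rank correlation. It assumes SciPy is available, and the embeddings dictionary and rating list are placeholders standing in for a real dataset such as WordSim-353.

```python
# Minimal sketch of intrinsic evaluation on a word-similarity dataset:
# correlate model cosine similarities with human ratings.
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, human_pairs):
    """human_pairs: list of (word1, word2, human_score) tuples."""
    model_scores, human_scores = [], []
    for w1, w2, score in human_pairs:
        if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
            u, v = embeddings[w1], embeddings[w2]
            model_scores.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            human_scores.append(score)
    correlation, _ = spearmanr(model_scores, human_scores)
    return correlation   # higher = closer to human judgments
```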

Extrinsic evaluation uses embeddings as input to downstream tasks like sentiment analysis, named entity recognition, or machine translation. Better embeddings typically lead to better performance on these real-world applications.

Modern evaluation also considers fairness and bias. Since embeddings learn from human-generated text, they can perpetuate societal biases. Researchers now test for gender, racial, and cultural biases in embedding spaces and develop techniques to mitigate these issues.

Conclusion

Word embeddings represent a fundamental breakthrough in how computers understand language. By converting words into dense numerical vectors, algorithms like Word2Vec and GloVe enable machines to capture semantic relationships, perform mathematical operations on meaning, and serve as the foundation for modern NLP systems. These techniques transformed natural language processing from rule-based systems to learning-based approaches that can generalize across different contexts and languages. Understanding word embeddings gives you insight into how AI systems like chatbots, translation services, and search engines actually work under the hood!

Study Notes

• Word embeddings convert words into dense numerical vectors that capture semantic meaning and relationships

• One-hot encoding problems: sparse representation, no semantic relationships, computationally expensive

• Word2Vec uses two architectures: CBOW (predicts target from context) and Skip-gram (predicts context from target)

• Negative sampling speeds up Word2Vec training by updating only target word and few random negatives

• GloVe combines local context with global co-occurrence statistics across entire corpus

• Vector algebra: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ demonstrates learned semantic relationships

• Training objectives: Word2Vec maximizes context prediction probability; GloVe minimizes co-occurrence reconstruction error

• Evaluation methods: Intrinsic (word similarity, analogies) and extrinsic (downstream task performance)

• Typical dimensionality: 100-300 dimensions for most applications

• Applications: Foundation for modern NLP systems including search, translation, and chatbots

