Vector Representations
Hey students! Welcome to one of the most exciting topics in natural language processing - vector representations! In this lesson, we'll explore how computers transform words and text into numbers they can actually understand and work with. You'll discover the journey from simple counting methods to sophisticated AI models that can capture the true meaning of language. By the end of this lesson, you'll understand how these mathematical representations power everything from search engines to chatbots, and why they're absolutely crucial for modern AI systems.
The Foundation: Why Computers Need Numbers
Imagine trying to teach a friend who only speaks math to understand poetry. That's essentially what we face when we want computers to process human language. Computers are brilliant at crunching numbers, but they're completely lost when it comes to words like "love," "pizza," or "awesome." This is where vector representations come to the rescue!
A vector representation is simply a way of converting words, sentences, or entire documents into lists of numbers (called vectors) that preserve their meaning and relationships. Think of it like creating a mathematical fingerprint for each word that captures its essence.
For example, the word "king" might be represented as [0.2, -0.1, 0.8, 0.3, -0.5], while "queen" could be [0.1, -0.2, 0.7, 0.4, -0.4]. Notice how similar they are? That's intentional - words with similar meanings should have similar vector representations.
The magic happens because these vectors allow computers to perform mathematical operations on words. Want to find words similar to "happy"? Just find vectors that are mathematically close to the "happy" vector. It's like giving computers a superpower to understand language!
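Here's a tiny Python sketch of that idea, using the made-up five-number vectors from above (illustrative values, not real embeddings) and cosine similarity as the measure of "mathematical closeness":

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors from the example above (made-up numbers, not learned embeddings)
king = np.array([0.2, -0.1, 0.8, 0.3, -0.5])
queen = np.array([0.1, -0.2, 0.7, 0.4, -0.4])
pizza = np.array([-0.6, 0.9, -0.2, 0.1, 0.3])

print(cosine_similarity(king, queen))  # high: the vectors point in similar directions
print(cosine_similarity(king, pizza))  # low/negative: unrelated meanings
```

Cosine similarity is one of the most common ways to compare embedding vectors, because it focuses on direction rather than length.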
Count-Based Vectors: The Simple Beginning
Let's start our journey with the simplest approach: count-based vectors. These methods literally count how often words appear in documents or near other words.
Bag of Words (BoW) is like creating a giant vocabulary list and then counting how many times each word appears in a document. Imagine you have two sentences:
- "I love pizza"
- "Pizza is love"
Your vocabulary might be [I, love, pizza, is], and the vectors would be:
- Sentence 1: [1, 1, 1, 0] (one "I", one "love", one "pizza", zero "is")
- Sentence 2: [0, 1, 1, 1] (zero "I", one "love", one "pizza", one "is")
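To make this concrete, here's a minimal Python sketch of Bag of Words for those two sentences. The "tokenizer" is just lowercasing and splitting on spaces, which is a deliberate simplification:

```python
from collections import Counter

sentences = ["I love pizza", "Pizza is love"]

# Fixed vocabulary, matching the order used in the example above
vocabulary = ["i", "love", "pizza", "is"]

def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word appears in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

for s in sentences:
    print(s, "->", bag_of_words(s, vocabulary))
# I love pizza -> [1, 1, 1, 0]
# Pizza is love -> [0, 1, 1, 1]
```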
Term Frequency-Inverse Document Frequency (TF-IDF) takes this further by being smart about which words are actually important. It gives higher scores to words that appear frequently in a specific document but rarely across all documents. This way, common words like "the" and "and" don't dominate the representation.
The TF-IDF formula is: $$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\left(\frac{N}{\text{DF}(t)}\right)$$
Where TF is term frequency, N is the total number of documents, and DF is document frequency.
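Here's how that formula plays out for our two pizza sentences, in a small Python sketch that implements it directly (note that libraries such as scikit-learn use smoothed variants, so their exact numbers will differ):

```python
import math
from collections import Counter

documents = ["I love pizza", "Pizza is love"]
tokenized = [doc.lower().split() for doc in documents]
vocabulary = ["i", "love", "pizza", "is"]

N = len(documents)
# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}

def tf_idf(term, doc_tokens):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t))."""
    tf = Counter(doc_tokens)[term]
    return tf * math.log(N / df[term])

for doc in tokenized:
    print([round(tf_idf(t, doc), 3) for t in vocabulary])
# "love" and "pizza" appear in every document, so log(2/2) = 0 zeroes them out;
# only the distinguishing words ("i", "is") carry weight.
```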
Count-based methods work surprisingly well for many tasks! Search engines used these techniques for decades. However, they have a major limitation: they treat words as completely independent entities. "King" and "queen" would have completely different vectors even though they're closely related concepts.
The Revolution: Word2Vec and Distributed Representations
In 2013, Google researchers introduced Word2Vec, which completely revolutionized how we think about word representations. Instead of counting, Word2Vec learns to predict words based on their context.
Word2Vec comes in two flavors:
- Skip-gram: Given a word, predict its surrounding words
- CBOW (Continuous Bag of Words): Given surrounding words, predict the center word
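If you want to experiment with this yourself, the gensim library provides a widely used Word2Vec implementation. The sketch below trains on a toy corpus purely to show the API; the corpus and hyperparameter values are only illustrative, and real models are trained on millions of sentences:

```python
# Minimal sketch using gensim (assumes: pip install gensim)
from gensim.models import Word2Vec

# A tiny toy corpus of pre-tokenized sentences
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["i", "love", "pizza"],
]

# sg=1 selects the skip-gram objective; sg=0 would select CBOW
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])           # first few dimensions of the learned dense vector
print(model.wv.most_similar("cat"))  # nearest neighbours in the learned vector space
```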
The breakthrough insight was that "words that appear in similar contexts have similar meanings." If you see "The cat sat on the..." and "The dog sat on the...", the model learns that "cat" and "dog" are similar because they appear in similar contexts.
Word2Vec creates dense vectors (typically 100-300 dimensions) where semantically similar words cluster together in the vector space. The famous example is: $$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
This mathematical relationship captures the concept that "king is to man as queen is to woman" - mind-blowing, right?
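You can try this analogy yourself with pretrained vectors. The sketch below assumes gensim is installed and you have internet access to download one of its bundled pretrained GloVe datasets ("glove-wiki-gigaword-50"); the exact neighbors returned depend on which vectors you load:

```python
# Sketch using pretrained word vectors via gensim's downloader
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads on first use

# king - man + woman: add "king" and "woman", subtract "man"
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] with a high similarity score
```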
GloVe (Global Vectors) improved upon Word2Vec by combining the best of count-based methods and prediction-based methods. It uses global word co-occurrence statistics while still learning dense representations. GloVe often performs better on word similarity tasks and has become incredibly popular in research.
These distributed representations typically use a few hundred dimensions, compared to the thousands or millions needed for count-based methods, making them much more compact while capturing richer semantic relationships.
The Modern Era: Contextual Embeddings
Traditional word embeddings like Word2Vec have one major limitation: each word gets exactly one vector representation. But think about the word "bank" - it means something completely different in "river bank" versus "savings bank."
This is where contextual embeddings come in, led by models like BERT (Bidirectional Encoder Representations from Transformers). These models create different vector representations for the same word depending on its context.
BERT revolutionized NLP by:
- Looking at words in both directions (left and right context)
- Creating dynamic representations that change based on surrounding words
- Pre-training on massive amounts of text to understand language patterns
When BERT processes "The bank of the river was muddy," it creates a completely different vector for "bank" than when it processes "I deposited money at the bank." This contextual awareness allows for much more nuanced understanding of language.
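Here's a rough sketch of how you might observe this effect with the Hugging Face transformers library (assuming transformers and torch are installed; the model weights download on first use). It extracts BERT's vector for "bank" in each sentence and compares them:

```python
# Sketch using Hugging Face transformers (assumes: pip install transformers torch)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return BERT's contextual vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("the bank of the river was muddy")
v_money = bank_vector("i deposited money at the bank")

# Same word, different contexts -> noticeably different vectors
similarity = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(similarity.item())  # well below 1.0, showing the representations differ
```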
Modern contextual models like GPT, BERT, and their variants use transformer architectures with attention mechanisms. These models can have hundreds of millions or even billions of parameters, creating incredibly sophisticated representations that capture subtle linguistic patterns.
The vectors from these models are typically 768 or 1024 dimensions and can represent not just individual words, but entire sentences or paragraphs. This enables applications like question answering, text summarization, and even creative writing assistance.
Properties and Trade-offs: Choosing the Right Approach
Each type of vector representation has its strengths and weaknesses, students. Let me break down the key trade-offs:
Count-based vectors are interpretable and fast to compute. You can easily see why a document got classified a certain way by looking at which words had high counts. They work great for tasks like document classification and information retrieval. However, they struggle with synonyms and related concepts, and they create very high-dimensional, sparse vectors.
Static embeddings like Word2Vec and GloVe capture semantic relationships beautifully and are much more compact. They're perfect for tasks where you need to find similar words or perform analogical reasoning. The downside? They can't handle polysemy (words with multiple meanings) and struggle with out-of-vocabulary words.
Contextual embeddings are the most powerful but also the most computationally expensive. They excel at complex tasks like reading comprehension, sentiment analysis, and language generation. However, they require significant computational resources and are harder to interpret.
For real-world applications, the choice depends on your specific needs:
- Building a simple document search system? TF-IDF might be perfect
- Creating a recommendation system based on product descriptions? Word2Vec could be ideal
- Building a chatbot that needs to understand context? You'll want BERT or similar models
The field continues evolving rapidly, with new architectures and techniques emerging regularly. Recent developments include multilingual embeddings that work across languages and specialized embeddings for specific domains like medicine or law.
Conclusion
Vector representations are the bridge between human language and machine understanding. We've journeyed from simple counting methods like TF-IDF through the revolutionary Word2Vec and GloVe, all the way to sophisticated contextual embeddings like BERT. Each approach offers unique advantages: count-based methods provide interpretability, static embeddings capture semantic relationships efficiently, and contextual embeddings deliver unprecedented understanding of language nuance. As you continue your NLP journey, remember that choosing the right representation depends on balancing accuracy, computational resources, and interpretability for your specific application. These mathematical representations of language continue to evolve, powering the AI systems that are transforming how we interact with technology every day.
Study Notes
• Vector representations convert words/text into numerical vectors that computers can process while preserving semantic meaning
• Bag of Words (BoW) creates vectors by counting word occurrences in documents - simple but treats words as independent
• TF-IDF formula: $\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\left(\frac{N}{\text{DF}(t)}\right)$ - weights important words higher
• Word2Vec learns dense vectors by predicting words from context - captures semantic relationships like $\text{king} - \text{man} + \text{woman} \approx \text{queen}$
• Skip-gram predicts surrounding words from the center word; CBOW predicts the center word from context
• GloVe combines global co-occurrence statistics with prediction methods for improved performance
• Static embeddings give each word one fixed vector regardless of context
• Contextual embeddings (like BERT) create different vectors for the same word based on surrounding context
• BERT uses bidirectional context and a transformer architecture with attention mechanisms
• Trade-offs: Count-based (interpretable, fast, sparse) vs Static embeddings (semantic relationships, compact) vs Contextual (powerful, expensive, complex)
• Dimensions: Count-based (thousands/millions), Word2Vec/GloVe (100-300), BERT (768-1024)
• Applications: Document search (TF-IDF), similarity tasks (Word2Vec), complex NLP (BERT)
