Table of Contents

Embeddings

Embeddings are dense vector representations of objects - typically we use them for Documents, Queries, Users, Context, or Items…but they can really be used to represent anything

History

Categorical Features

Numeric Features

Text Embeddings

Text Embeddings are one of the harder things to get right, but there are some standard approaches nowadays

Word2Vec

As we said above, Word2Vec is essentially a static lookup table from words to embeddings, and so Word2Vec is an embedding model
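
To make the "static lookup" idea concrete, here is a minimal sketch using the gensim Word2Vec API; the toy corpus and hyperparameters are illustrative only, and a corpus this small will not produce meaningful analogies.

```python
# Minimal sketch of Word2Vec as a static lookup table, using gensim.
# The tiny corpus below is a stand-in; any list of tokenized sentences works.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# vector_size is the embedding dimension; min_count=1 keeps every word in this toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# "Static lookup": every occurrence of "king" maps to the same dense vector
king_vec = model.wv["king"]
print(king_vec.shape)  # (50,)

# Analogy-style query: king - man + woman ~= ?  (only meaningful on a real corpus)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```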

Training and Output

Model Architecture

Continuous Bag of Words Architecture

Continuous Skip Gram Architecture

Evaluation of Model

TODO: Is below correct? Semantic vs Syntactic…Semantic is “underlying meaning of words” and syntactic is “placement of word”?

BERT

BERT architecture, training, and fine-tuning are described on another page; assuming all of that has been read through, below we discuss how to get useful embeddings out of BERT!

Since BERT is an Encoder-Only model, it takes an input, runs it through multiple encoder layers, and sends the result through an output layer at the end. That output layer typically isn't useful by itself for word embeddings, so we need to go back to the hidden-state values and aggregate them in some way to produce word, sentence, or document embeddings
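
As a minimal sketch of that aggregation, assuming the HuggingFace transformers API and bert-base-uncased: summing the last four hidden layers per token is one common heuristic (not the only one), and mean-pooling the last layer gives a rough sentence vector.

```python
# Sketch of pulling embeddings out of BERT's hidden states (HuggingFace transformers).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "embeddings are dense vector representations"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim]
hidden_states = torch.stack(outputs.hidden_states)           # [13, 1, seq_len, 768]

# Word embeddings: sum the last 4 layers for each token (one common heuristic)
word_embeddings = hidden_states[-4:].sum(dim=0).squeeze(0)    # [seq_len, 768]

# Sentence embedding: mean-pool the token vectors from the last layer
sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)  # [768]
```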

BERT Word Embeddings

BERT Embeddings Pseudo Code

Why does this work?

BERT Sentence Embeddings

User Embeddings

TODO: Outside of collaborative filtering, how do we get user embeddings? TL;DR: how do we get meaningful representations of users?

Embeddings vs Autoencoder vs Variational Autoencoder

This question has come up in my own thoughts, and others have asked me about it. All three arrive at a relatively similar output, representing things in a compressed numeric format, but they have different training objectives, use cases, and architectures. Autoencoders were created to reduce dimensionality, while embeddings were created to map possibly sparse items into dense numeric representations
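
To make the contrast concrete, here is a minimal PyTorch autoencoder sketch; the layer sizes and random data are illustrative assumptions. The encoder's bottleneck output is the compressed representation, and the training objective is reconstruction rather than a task-specific objective like the ones used to learn embeddings.

```python
# Minimal PyTorch autoencoder sketch: compress 784-dim inputs to a 32-dim bottleneck.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)  # stand-in for a real batch (e.g. flattened images)

for _ in range(100):
    optimizer.zero_grad()
    z = encoder(x)              # compressed representation at the bottleneck
    x_hat = decoder(z)          # reconstruction of the input
    loss = loss_fn(x_hat, x)    # objective: reconstruct the input, not predict a label
    loss.backward()
    optimizer.step()
```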

Embeddings

Description

Use Cases

When to Use

Autoencoder

Description

Use Cases

When to Use

Variational Autoencoder (VAE)

Description

Use Cases

When to Use

Comparison and When to Choose

| Technique | Static Embeddings | Dynamic Embeddings | Generative Tasks | Dimensionality Reduction |
| --- | --- | --- | --- | --- |
| Embeddings | Yes | Yes | No | No |
| Autoencoder | Yes | No | No | Yes |
| Variational Autoencoder (VAE) | No | Yes | Yes | Yes |
| Word2Vec | Yes | No | No | No |

Key Considerations

  1. Static vs. Dynamic Embeddings:
    • Use Autoencoders or Word2Vec for static representations
    • Use BERT or some sort of Transformer model with Attention for dynamic embeddings
  2. Dimensionality Reduction:
    • Use Autoencoders or VAEs when you need to reduce the dimensionality of high-dimensional data
  3. Generative Tasks:
    • Use VAEs when you need to generate new data points or capture variability in the data
  4. Lightweight Models:
    • Use Word2Vec for lightweight, static word embeddings

Vector Similarities + Lookup

Vector similarities are useful for comparing our final embeddings to others in the search space

None of the below are used directly for Top-K retrieval at scale, since brute-force scoring of every candidate is very inefficient; approximate Top-K approaches like branch-and-bound, Locality-Sensitive Hashing, and FAISS-style clustering are used instead

We discuss all of that here
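
For a flavor of what that looks like in practice, here is a hedged FAISS sketch contrasting exact brute-force search with an approximate IVF index; the dimensions, nlist, nprobe, and random data are illustrative assumptions, not recommendations.

```python
# Exact vs. approximate top-k search over random embeddings with FAISS.
import numpy as np
import faiss

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")   # corpus embeddings
xq = rng.standard_normal((5, d)).astype("float32")   # query embeddings

# Exact (brute-force) search: fine for small corpora, expensive at scale
exact = faiss.IndexFlatL2(d)
exact.add(xb)
D, I = exact.search(xq, 10)

# Approximate search: cluster the corpus (IVF) and only probe a few cells per query
nlist = 256
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFFlat(quantizer, d, nlist)
approx.train(xb)
approx.add(xb)
approx.nprobe = 8            # trade-off: more probes = better recall, slower search
D_a, I_a = approx.search(xq, 10)
```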

Quantization

Feature Multiplexing

TODO:

Cosine

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes: $\text{cosine}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$
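
A quick NumPy sketch of this (the vectors are arbitrary examples): scaling a vector does not change its cosine similarity, since only the angle matters.

```python
# Cosine similarity: depends only on the angle between vectors, not their magnitudes.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(cosine_similarity(a, b))       # 1.0 -- same direction
print(cosine_similarity(a, 10 * b))  # still 1.0 -- magnitude is ignored
```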

Dot

The Dot product is similar to cosine similarity, except it doesn't ignore the magnitude

$\text{dot}(a, b) = \sum_{i=1}^{v} a_i b_i = \|a\|\,\|b\|\,\text{cosine}(a, b)$

Which basically means we compare the two vectors dimension by dimension and sum the products. If $a$ and $b$ are unit-normalized, the dot product is equivalent to cosine similarity
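
A small NumPy check of both claims (example vectors are arbitrary): the raw dot product carries magnitude, and the dot product of unit-normalized vectors equals the cosine similarity.

```python
# dot(a, b) = ||a|| * ||b|| * cosine(a, b); normalizing both vectors reduces it to cosine.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(dot)               # magnitude matters here
print(a_unit @ b_unit)   # equals cos: dot of unit vectors == cosine similarity
print(cos)
```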

Euclidean

This is the typical straight-line distance in Euclidean space

$\text{euclidean}(a, b) = \left[\sum_{i=1}^{v} (a_i - b_i)^2\right]^{1/2}$

Here magnitude matters: the closer the vector end-points are in space, the smaller the distance
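
One more NumPy sketch (example vectors are arbitrary) showing how Euclidean distance differs from cosine: two vectors pointing the same way can still be far apart if their magnitudes differ.

```python
# Euclidean distance is sensitive to magnitude, unlike cosine similarity.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, larger magnitude

euclidean = np.linalg.norm(a - b)
print(euclidean)                 # > 0 even though the cosine similarity is 1.0
```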

Topological Interpretations

Most of this comes from Yuan Meng Embeddings Post

There we see discussions of how an embedding, topologically, can be considered an injective (one-to-one) mapping between metric spaces that preserves the structure of both

We can also see that, from an ML lens, embeddings represent items as dense numeric features in n-dimensional space

The main point of all of this is that embeddings preserve topological properties, and that's what allows the famous King - Man + Woman = Queen and "France is to Paris as Germany is to Berlin" analogies

A random list of numbers is a numeric representation, but it is not an embedding

One-hot encoding kind of preserves topological properties, but all of the vectors end up being orthogonal to each other, so we can't say category1 + category2 = 0.5 * category3…they're orthogonal! Typically we need to map these from the one-hot-encoded metric space into a lower-dimensional metric space to get those properties out of them, as the sketch below illustrates
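
A minimal PyTorch sketch of that point (category counts and dimensions are illustrative, and the embedding table here is randomly initialized rather than learned): one-hot vectors have zero dot product with each other, while a dense embedding table maps categories into a lower-dimensional space where non-trivial similarities can emerge.

```python
# One-hot vectors are mutually orthogonal; a (learned) embedding table maps
# categories into a dense low-dimensional space where relationships can exist.
import torch
import torch.nn as nn

num_categories, embed_dim = 5, 3

one_hot = torch.eye(num_categories)          # each row is one category's one-hot vector
print(one_hot[0] @ one_hot[1])               # 0.0 -- orthogonal, no shared structure

# Randomly initialized here; in practice this table is learned from data.
embedding = nn.Embedding(num_categories, embed_dim)
cat_ids = torch.tensor([0, 1])
dense = embedding(cat_ids)                   # shape [2, 3]
print(dense @ dense.T)                       # generally non-zero off-diagonal entries
```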

Yuan Meng Example