
BERT

BERT (Bidirectional Encoder Representations from Transformers) goes a step beyond Word2Vec: it is an all-around language representation model that can provide contextual word and sentence embeddings for specific supervised tasks.

BERT is technically an Encoder-Only model. The Attention Is All You Need paper describes a full Encoder-Decoder architecture (which BART follows), but BERT keeps only the encoder stack and has no decoder at all.

Use Case: the original Encoder-Decoder Transformers were great for machine translation, but that isn’t BERT’s use case! An encoder-only model gives us transfer learning for a variety of contextual embedding use cases.

GPT, BERT, and Others

Therefore, if we look back at the Attention markdown, BERT only uses the Self-Attention encoding over multiple stacked encoders, ultimately producing an attended-to set of hidden-state outputs.

BERT doesn’t generate text; instead it produces token embeddings that are great for Classification, Sentence Similarity, Sentiment Analysis, and NER / token-level tasks.
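
To make “token embeddings” concrete, here’s a minimal sketch (assuming the Hugging Face transformers and torch packages, and the bert-base-uncased checkpoint) that pulls the per-token hidden states out of a pretrained BERT:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT produces contextual token embeddings", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: one 768-dim vector per input token (including [CLS] / [SEP])
token_embeddings = outputs.last_hidden_state   # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```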

Contextual Word and Sentence Embeddings is a loaded phrase, but it basically means BERT can encode any structure of text, over any vocabulary: Attention handles the arbitrary structure and subword (WordPiece) tokenization handles the arbitrary vocabulary.
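
A quick illustration of the “any vocabulary” half of that claim (again assuming transformers and bert-base-uncased): WordPiece keeps common words whole and splits rare ones into subword pieces, so nothing is ever truly out-of-vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("language"))               # common word: kept as a single token
print(tokenizer.tokenize("hyperparameterization"))  # rare word: split into '##'-prefixed subword pieces
```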

Transfer Learning is the idea that the self-supervised pre-training of a BERT model exists just to create weights and parameters, and that the ideal output of this phase is simply the BERT model with those weights. Once that is done, the model itself can have extra layers tacked on / updated and be used for a wide range of downstream tasks like sentence classification, word embeddings, summarization, etc…

BERT training has a similar setup to Word2Vec in that we use surrounding context to model a specific word, but the embeddings can’t simply be saved off, because the output embedding depends on the hidden layers (and therefore on the context)…so we have to pass a word through the model with its context every time we want its embedding.
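
A small sketch of why the embeddings can’t just be cached per word (transformers + torch assumed; the embed_word helper is hypothetical): the same surface word gets a different vector depending on its context:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the hidden state of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

bank_river = embed_word("i sat on the bank of the river", "bank")
bank_money = embed_word("i deposited money at the bank", "bank")

# The two "bank" vectors differ, so the cosine similarity lands below 1.0
print(torch.cosine_similarity(bank_river, bank_money, dim=0).item())
```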

Bidirectionality is the big buzzword throughout the paper, and it mentions OpenAI’s GPT models multiple times, discussing how their Attention layers are only unidirectional (left-to-right), which ultimately restricts their ability on some downstream tasks like sentence classification and question answering.
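
A toy comparison (plain torch, not the actual GPT or BERT code) of what “unidirectional vs. bidirectional” means at the attention-mask level:

```python
import torch

seq_len = 5

# Bidirectional (BERT-style): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Causal / unidirectional (GPT-style): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)
```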

BERT itself is…useless? Meaning the model out of the box doesn’t have one exact, perfect use case (outside of word / sentence embeddings), and for most successful NLP projects it needs a final task-specific layer trained on top (though that’s an overstatement too, since a lot of decent sentence embedding can be done OOTB).
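
Here’s a hedged sketch of that OOTB sentence-embedding claim (transformers + torch assumed): mean-pool the token embeddings into a sentence vector and compare with cosine similarity. Dedicated models like Sentence-BERT do this far better, but vanilla BERT already gives a workable signal:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    return hidden.mean(dim=0)                           # mean pooling over tokens

a = sentence_embedding("The cat sat on the mat.")
b = sentence_embedding("A kitten is resting on the rug.")
print(torch.cosine_similarity(a, b, dim=0).item())
```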

Most companies don’t actually use BERT out of the box; most will fine-tune on top of it and then distill it to lower the memory and inference footprint.

| Use Case | Head on Top of BERT |
| --- | --- |
| Sentiment Analysis | [CLS] token → Dense → Softmax |
| Named Entity Recognition (NER) | Per token → Dense → CRF or Softmax |
| Question Answering | Two linear layers → start / end token logits |
| Sentence Similarity | Mean / [CLS] pooling → Dense → Cosine / classifier |
| Retrieval | Dual encoders (query / doc) → Vector DB |
| Re-Ranking | Cross-encoder ([CLS] output) → score |
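
As a sketch of the first row of that table (torch + transformers assumed; the class count and the BertSentimentHead name are just for illustration), a sentiment head is nothing more than the [CLS] hidden state pushed through a dense layer and a softmax:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertSentimentHead(nn.Module):
    """[CLS] token -> Dense -> Softmax, per the Sentiment Analysis row above."""

    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]    # [CLS] is always position 0
        return torch.softmax(self.classifier(cls_embedding), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSentimentHead()
batch = tokenizer(["what a great movie"], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]))  # untrained head, so probabilities are meaningless until fine-tuned
```

In practice most fine-tuning code just reaches for transformers’ AutoModelForSequenceClassification, which bundles this same pattern (it works on logits rather than an explicit softmax).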

Training

Base Models

Base Model Architecture

Masked Language Modeling Architecture

MLM Example

Next Sentence Prediction Architecture

Extending Base Models

Fine Tuning Architecture

Fine Tuning Examples