ML Exam - Transformers and LLMs
ML Associate Exam Prep
Basic Concepts - Tokens and Embeddings
- Tokens = words or parts of words, mapped to numerical IDs so the model can work with them
- A word can consist of 1+ tokens
- Punctuation marks (. “ ,) are also usually tokens
- Words/tokens can be loosely thought of as the same, although strictly speaking they're obviously different
- Embeddings = mathematical representations (vectors) that encode the “meaning” of a token
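As a toy sketch (hypothetical vocabulary and made-up vectors — real tokenizers and embedding tables are learned from data), the two concepts fit in a few lines:

```python
# Toy illustration: a tokenizer maps text pieces to integer token IDs,
# and an embedding table maps each ID to a vector encoding its "meaning".
vocab = {"trans": 0, "##former": 1, "s": 2, ".": 3}   # one word may be 1+ tokens
embedding_table = [
    [0.1, -0.3, 0.7],   # "trans"
    [0.4, 0.2, -0.1],   # "##former"
    [0.0, 0.5, 0.3],    # "s"
    [-0.2, 0.1, 0.0],   # "."
]

def embed(tokens):
    ids = [vocab[t] for t in tokens]          # tokenization → numerical IDs
    return [embedding_table[i] for i in ids]  # IDs → embedding vectors

vectors = embed(["trans", "##former", "s", "."])
print(len(vectors), len(vectors[0]))  # → 4 3  (4 tokens, 3-dim embeddings)
```

In real models the embedding table has tens of thousands of rows and hundreds to thousands of dimensions per vector.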
Evolution of the Transformer Architecture
1. RNNs and LSTMs
Recurrent Neural Networks (RNNs) are AI models designed for sequential data - like text or time series - by using internal memory to process inputs in order.
Long Short-Term Memory (LSTM) networks are a specialized, advanced type of RNN created to solve the "vanishing gradient" problem, allowing them to learn long-term dependencies that standard RNNs forget.
RNNs and LSTMs have largely been superseded by Transformers for NLP tasks, though they remain relevant for time-series forecasting.
Useful for modeling sequential stuff like time series or language (sequence of words)
RNNs propagate the “hidden state” i.e. the previous output
2. Encoder-Decoder Architecture (e.g. for Machine Translation)
Encoders and Decoders are RNNs
Last Hidden State = huge vector that contains the meaning of the sentence “X0 X1 X2”
Decoder understands the huge vector and can translate back (outputs “Y0 Y1 Y2 Y3”)
Problem: the one vector tying encoder↔decoder creates an information bottleneck → Information from the start of the sequence may be lost
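A minimal scalar sketch of the recurrence (toy weights, not a trained model) shows both the hidden-state propagation and why the final state becomes a bottleneck:

```python
import math

# Minimal scalar RNN sketch (weights chosen arbitrarily for illustration):
# the hidden state h is updated at every step and carries all past
# information forward.
W_x, W_h = 0.5, 0.8   # input and recurrent weights (hypothetical values)

def rnn_encode(inputs, h0=0.0):
    h = h0
    for x in inputs:                      # strictly sequential: no parallelism
        h = math.tanh(W_x * x + W_h * h)  # new state depends on previous state
    return h                              # one number must summarize everything

final_state = rnn_encode([1.0, -0.5, 0.3])
```

In the real encoder-decoder, `h` is a large vector rather than a scalar, but the structure is the same: the decoder only sees this one final state, which is exactly the bottleneck described above.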
3. “Attention is all you need”
“Attention is all you need” = Revolutionary NLP paper from 2017
A hidden state for each step (word/token) instead of the whole sentence
Each word has weights for the other words in the sentence → “attention weights”
Represent how important the other words are for this specific word → context
thicker/bigger arrow in diagram = bigger weight
Emerging concept for relationships between words
Deals better with differences in word order in the sentence
Lingering problem: RNNs sequential in nature → can’t parallelize it
Transformer-based LLM Diagram
4. (Modern) Transformer Architecture
Ditches RNNs for feed-forward neural networks (FFNNs)
Uses “self-attention”
Positional encoding → each word has its position in the sentence encoded in it
Means we can process words in parallel since we don't lose the info of their position in the sentence!
Each word has attention weights from all other words embedded within it
The words with their self-attention weights are fed into FFNNs
Parallelizable → can train on much more data (whole Wikipedia, whole internet…)
Self-Attention
Each encoder/decoder has a list of vector embeddings for each token (representations of the meaning of each token)
Self-attention produces a new vector for each token. This vector is the weighted average of all token embeddings with respect to this particular token.
The “magic” is in computing the attention weights
The final vector captures the “meaning” of the token, with context
Meaning tied to the embeddings of the other tokens, and their context weights
Example: the word “novel” can mean either “book” or “original”, depending on context
Self-attention will capture different words in the sentence to determine what meaning “novel” has in that sentence.
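The weighted-average step above can be sketched directly, with made-up attention weights (computing the weights is the “magic” covered in the next section):

```python
# Sketch of the weighted-average step, with made-up attention weights.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per token
weights_for_token0 = [0.7, 0.2, 0.1]                 # must sum to 1

# New contextual vector for token 0: weighted average of ALL token embeddings.
z0 = [
    sum(w * e[d] for w, e in zip(weights_for_token0, embeddings))
    for d in range(len(embeddings[0]))
]
print(z0)  # roughly [0.8, 0.3]
```

If "novel" attends heavily to "read" its new vector drifts toward the "book" meaning; attending to "idea" drifts it toward "original".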
Calculating Self-Attention
3 weight matrices learned through back-propagation: Query (Wq), Key (Wk), Value (Wv).
Every token gets a q, k and v vector by multiplying its embedding against these matrices.
Calculate a score for each token.
Scaled dot-product attention → score computed by taking the dot product of a token's query vector with each key vector, then dividing by the square root of the key dimension (the “scaling”).
Other similarity functions can be used instead of dot product.
Softmax then applied to normalize scores.
Mask prevents tokens from “peeking” into future tokens
Normally when we read, we also read in sequence: we know what we have read so far in the sentence, but not yet what comes later
GPT uses masked self-attention, but BERT does something else (masked language modeling)!!
Multiply value vectors with corresponding scores, then sum them up → Final z vector for token (final self-attention vector for a specific token)
entire process for each token → gets self-attention embeddings for each token
This can be done in parallel for each token!!
Self-attention embeddings can now feed the FFNN
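The whole pipeline above — project to q/k/v, score, scale, mask, softmax, weighted sum — can be sketched in plain Python with tiny made-up weight matrices (real Wq/Wk/Wv are learned through back-propagation):

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Tiny made-up "learned" matrices, for illustration only.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.5], [0.5, -0.5]]
Wv = [[1.0, 1.0], [0.0, 1.0]]

def masked_self_attention(embeddings):
    d_k = len(Wk)                                 # key dimension
    q = [matvec(Wq, e) for e in embeddings]       # query vector per token
    k = [matvec(Wk, e) for e in embeddings]       # key vector per token
    v = [matvec(Wv, e) for e in embeddings]       # value vector per token
    outputs = []
    for i in range(len(embeddings)):
        # Scaled dot-product scores against keys up to position i only —
        # the mask: token i cannot "peek" at future tokens.
        scores = [
            sum(qi * kj for qi, kj in zip(q[i], k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)                 # normalize scores
        # Final z vector: value vectors weighted by attention scores, summed.
        z = [sum(w * v[j][d] for j, w in enumerate(weights))
             for d in range(len(v[0]))]
        outputs.append(z)
    return outputs

zs = masked_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Note the outer loop over `i` is written sequentially for clarity, but every iteration is independent — exactly what makes the real computation parallelizable.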
Multi-Headed Self-Attention
q, k, v vectors reshaped into matrices → each row is a "head”
Heads can be processed in parallel
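A sketch of the split (toy sizes; real models use e.g. a 768-dimensional model width divided across 12 heads):

```python
# Sketch of the head split: a d_model-sized vector is divided into
# n_heads chunks, and attention runs on each chunk independently.
d_model, n_heads = 8, 2
head_dim = d_model // n_heads

q_vector = [float(i) for i in range(d_model)]   # made-up query vector
heads = [q_vector[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]
# Each head could now attend in parallel; results are concatenated afterwards.
concatenated = [x for head in heads for x in head]
```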
Chat
But transformers by themselves are NOT good chatbots! Just the underlying building block!
Must be trained further to hold a conversation
Also need extra moderation wrapping (prevents them from e.g. generating offensive content…)
Question answering
Text classification (e.g., sentiment analysis)
Named entity recognition (NER)
Summarization
Not that different from Machine Translation: vector embeddings representing meaning of text can have a shorter or longer representation
Translation
Code generation
Text generation (e.g. automated customer service)
But deploy tools with care! Chatbots might agree to do stuff that never has a corresponding action. And people hate chatbots not understanding them.
GPT Architecture
Generative Pre-Trained Transformer (GPT)
A type of Large Language Model (LLM), i.e. a model that has been trained on a huge amount of human language data
OpenAI's GPT-2 is FOSS, although later models are closed
Other LLMs are similar to GPT-2
GPT is decoder-only
In contrast, BERT is encoder-only. Also, T5 is an example of a model that uses both encoders and decoders.
GPT has stacks of decoder blocks = masked self-attention layer + FFNN
GPT has no concept of input!!
All GPT does is continuously generate the next token in a sequence
Using attention to maintain relationships to previous tokens
GPT can be triggered to start (or “prompted”) with a sequence of tokens
It then keeps on generating given the previous tokens
Remember that the sequence can be processed in parallel! No need to feed in one token at a time.
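The generation loop can be sketched with a lookup table standing in for the model (real GPT predicts a probability distribution over its whole vocabulary at each step; the table and tokens here are made up):

```python
# Toy autoregressive loop: a lookup table stands in for the model's
# next-token prediction.
next_token = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<end>"}

def generate(prompt, max_tokens=10):
    sequence = list(prompt)                          # the "prompt" triggers it
    while len(sequence) < max_tokens:
        token = next_token.get(sequence[-1], "<end>")
        if token == "<end>":
            break
        sequence.append(token)   # keep generating given the previous tokens
    return sequence

print(generate(["<start>"]))  # → ['<start>', 'the', 'cat', 'sat']
```

The key structural point survives even in this toy: there is no separate "input" — the prompt is just the start of the sequence being continued.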
Getting rid of the idea of inputs/outputs allows unsupervised training on unlabeled piles of text
GPT “learns a language” rather than being optimized for some specific task. Learns to interpret and speak it.
Hundreds of billions of parameters
LLMs
LLM Input Processing
Tokenization + token encoding (of prompt sequence)
Token embedding
Captures token similarities, i.e. semantic relationships between tokens, with vectors in a very high dimensional space.
Positional encoding
Captures token positioning in the input relative to other nearby tokens
Uses interleaved sinusoidal functions (both sine and cosine), which allows it to work on any sequence length.
For a given period of e.g. 100 tokens, you can infer where the token is relative to its 100 neighbors.
For a really long sequence of tokens, this would eventually repeat, but by then the token is probably not relevant to the context any more.
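A sketch of the interleaved encoding, following the sine/cosine formulation of the original Transformer paper (the dimension and positions here are toy values):

```python
import math

def positional_encoding(pos, d_model=8):
    # Interleaved sinusoids: even dimensions use sine, odd dimensions use
    # cosine, with wavelengths increasing geometrically so that each
    # position gets a unique pattern, for any sequence length.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0)  # position 0 → sin terms 0.0, cos terms 1.0
```

These vectors are added to the token embeddings, which is what lets the words be processed in parallel without losing their order.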
LLM Output Processing
Stack of decoders outputs a vector at the end
Output vector contains the “meaning” of what we want to say
Multiply output vector with token embeddings
Result is logits, one score per vocabulary token, converted (via softmax) into the probability of each token being the correct next token in the sequence
Final output can be randomized from logits (e.g. increase the “temperature”) instead of always picking the highest probability token → increases GPT’s “creativity”
Top P (nucleus sampling) = sample from the smallest set of tokens whose cumulative probability reaches P (higher = more random)
Top K = alternate mechanism that samples only from the K highest-probability tokens (higher = more random)
Temperature = Level of randomness in selecting the next word in the output from those tokens
High temperature → More random/creative
Low temperature → More consistent
Context window = Number of tokens an LLM can process at once
Max tokens = Limit for total number of tokens (on input or output)
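A sketch of how temperature and Top K interact when picking the next token (made-up logits; Top P would instead keep the smallest candidate set whose cumulative probability reaches P):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    # Temperature scales the logits before softmax: high temperature flattens
    # the distribution (more random), low temperature sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    candidates = list(range(len(logits)))
    if top_k is not None:
        # Top K: keep only the K highest-probability candidates.
        candidates = sorted(candidates, key=lambda i: -probs[i])[:top_k]
    weights = [probs[i] for i in candidates]
    return random.choices(candidates, weights=weights)[0]

# Made-up logits for a 3-token vocabulary; token 2 can never be sampled
# here because top_k=2 drops the lowest-probability candidate.
token = sample_next_token([2.0, 1.0, 0.1], temperature=0.7, top_k=2)
```

With `top_k=1` this degenerates to always picking the highest-probability token — the "zero creativity" setting.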
Transfer Learning (Fine Tuning) with Transformers
Use a pre-trained model as a base model, then adapt it (fine tune it) with your own data for your purposes
Allows starting off with giant models (e.g. those that understand English, Klingon or Python) instead of having to train one from scratch!
Opens up a whole new world of AI applications
Types of fine tuning
Add additional training data through the whole thing
Freeze specific layers, re-train others
E.g. train a new tokenizer to learn a new language → train just the tokenization steps
A popular technique here is LoRA (low-rank adaptation)
Add a layer on top of the pre-trained model
Just a few may be all that’s needed!
Can provide examples of prompts and desired completions
e.g. “How’s the weather?” → “What’s it to you, bucko?”
Can adapt it to classification or other tasks
e.g. “Wow, I love this course!” → Positive emotion
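A toy sketch of the “add a layer on top” idea: the base model's output features stay frozen, and only a small logistic head is trained (all numbers below are made up for illustration):

```python
import math

# Frozen base-model features for 4 example texts (made up), with sentiment
# labels as in the example above: 1 = positive emotion, 0 = negative.
frozen_features = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
labels = [1, 1, 0, 0]

w, b = [0.0, 0.0], 0.0   # the ONLY trainable parameters: the new head
lr = 0.5

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))   # logistic classification head

# Plain gradient descent updates the head; the base features never change.
for _ in range(200):
    for x, y in zip(frozen_features, labels):
        err = predict(x) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

accuracy = sum((predict(x) > 0.5) == bool(y)
               for x, y in zip(frozen_features, labels)) / len(labels)
```

This is why “just a few” examples can be enough: the head has only a handful of parameters, while the frozen base already encodes the language.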
Hugging Face
Giant repository of pre-trained models you can use → huggingface.co
Can mess around with models… for free!
Contains also a ton of learning resources
Many available models: GPT-2, LLaMa, Stable Diffusion…