ML Exam - Transformers and LLMs

 

ML Associate Exam Prep 

Transformers and LLMs


Basic Concepts - Tokens and Embeddings

  • Tokens = numerical representations of words or parts of words
    • A word can consist of 1+ tokens
    • Punctuation marks (. " ,) are also usually tokens
    • Words and tokens can loosely be thought of as the same thing, although strictly speaking they're different
  • Embeddings = mathematical representations (vectors) that encode the “meaning” of a token
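A minimal sketch of what tokenization looks like in practice, using the Hugging Face transformers library and the GPT-2 tokenizer (both come up later in these notes); the example sentence is made up:

```python
# Tokenize a sentence with the GPT-2 tokenizer (downloads the vocab on first use).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Unbelievably, tokenization splits words!"
print(tokenizer.tokenize(text))  # sub-word pieces: one word can become several tokens
print(tokenizer.encode(text))    # the numerical token IDs the model actually sees
```

The embedding layer then maps each of these token IDs to a vector that encodes its meaning.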

Evolution of the Transformer Architecture

1. RNNs and LSTMs
   Recurrent Neural Networks (RNNs) are AI models designed for sequential data - like text or time series - by using internal memory to process inputs in order. 
   Long Short-Term Memory (LSTM) networks are a specialized, advanced type of RNN created to solve the "vanishing gradient" problem, allowing them to learn long-term dependencies that standard RNNs forget.
   RNNs and LSTMs have been made largely obsolete by Transformers for many NLP tasks, though they remain relevant for time-series forecasting.
   Feedback loop already present here
   Useful for modeling sequential stuff like time series or language (sequence of words)
   RNNs propagate the “hidden state” i.e. the previous output

2. Encoder-Decoder Architecture (e.g. for Machine Translation)
   Encoders and Decoders are RNNs
   Last Hidden State = huge vector that contains the meaning of the sentence “X0 X1 X2”
   Decoder understands the huge vector and can translate back (outputs “Y0 Y1 Y2 Y3”)
   Problem: the one vector tying encoder↔decoder creates an information bottleneck → Information from the start of the sequence may be lost

3. “Attention is all you need”
   “Attention is all you need” = Revolutionary NLP paper from 2017
   A hidden state for each step (word/token) instead of the whole sentence
   Each word has weights for the other words in the sentence → “attention weights”
   Represent how important the other words are for this specific word → context
   thicker/bigger arrow in diagram = bigger weight
   Emerging concept for relationships between words
   Deals better with differences in word order in the sentence
   Lingering problem: RNNs are sequential in nature → they can't be parallelized
   (Diagram: transformer-based LLM)
4. (Modern) Transformer Architecture
   Ditches RNNs for feed-forward neural networks (FFNNs)
   Uses “self-attention” plus positional encoding → each word has its position in the sentence encoded into it
   Means we can process words in parallel since we don't lose the info of their position in the sentence!
   Each word has attention weights from all other words embedded within it
   The words with their self-attention weights are fed into FFNNs
   Parallelizable → can train on much more data (whole Wikipedia, whole internet…)

Self-Attention

Self-Attention and Attention-Based Neural Networks
   Each encoder/decoder has a list of vector embeddings for each token (representations of the meaning of each token)
   Self-attention produces a new vector for each token: a weighted average of all token embeddings with respect to that particular token.
   The “magic” is in computing the attention weights
   The final vector captures the “meaning” of the token, with context
   Meaning tied to the embeddings of the other tokens, and their context weights
   Example: the word “novel” can mean either “book” or “original”, depending on context
   Self-attention will capture different words in the sentence to determine what meaning “novel” has in that sentence. 

Calculating Self-Attention
   3 weight matrices learned through back-propagation: Query (Wq), Key (Wk), Value (Wv).
   Every token gets a q, k and v vector by multiplying its embedding against these matrices.
   Calculate a score for each token.
   Scaled dot-product attention → score computed by taking the dot product of a token's query vector with each key vector, then dividing by the square root of the key dimension (hence “scaled”).
   Other similarity functions can be used instead of the dot product.
   Softmax is then applied to normalize the scores into attention weights (see the sketch below).
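A minimal NumPy sketch of the computation above; the dimensions and random values are purely illustrative, not taken from any real model:

```python
# Scaled dot-product self-attention for a toy sequence of 4 tokens.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # token embeddings (one row per token)
Wq = rng.normal(size=(d_model, d_k))      # learned via back-propagation in a real model
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv          # q, k, v vectors for every token
scores = Q @ K.T / np.sqrt(d_k)           # dot-product similarity, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)        # attention weights per token
Z = weights @ V                           # weighted average of value vectors
print(Z.shape)                            # (4, 8): one z vector per token
```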

(Optional) Masked Self-Attention
   Mask prevents tokens from “peeking” into future tokens (see the mask sketch after this list)
   Normally when we read, we also read in sequence: we know what we have read so far in the sentence, but not yet what comes later
   GPT uses masked self-attention, but BERT does something else (masked language modeling)!!
   Multiply value vectors with corresponding scores, then sum them up → Final z vector for token (final self-attention vector for a specific token)
   Repeat the entire process for each token → self-attention embeddings for every token
   This can be done in parallel for each token!!
   Self-attention embeddings can now feed the FFNN
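A toy sketch of the causal mask used in masked self-attention: positions above the diagonal of the score matrix (the “future” tokens) are set to minus infinity before the softmax, so their attention weights come out as zero. The numbers here are random:

```python
# Causal (look-back only) masking of an attention score matrix.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
masked = np.where(mask, -np.inf, scores)                      # block "future" positions

weights = softmax(masked, axis=-1)
print(np.round(weights, 2))  # upper triangle is all zeros: no peeking ahead
```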
Multi-Headed Self-Attention
   q, k, v vectors reshaped into matrices → each row is a “head”
   Heads can be processed in parallel

Applications of Transformers
   Chat
      But transformers by themselves are NOT good chatbots! Just the underlying building block!
      Must be trained further to hold a conversation
      Also need extra moderation wrapping (prevents them from e.g. generating offensive content…)
   Question answering
   Text classification (e.g., sentiment analysis)
   Named entity recognition (NER)
   Summarization
      Not that different from machine translation: the vector embeddings representing the meaning of the text can simply be decoded into a shorter (or longer) output
   Translation
   Code generation
   Text generation (e.g. automated customer service)
      But deploy these tools with care! Chatbots might agree to do things that have no corresponding action behind them, and people hate chatbots that don't understand them.

GPT Architecture

Generative Pre-Trained Transformer (GPT)
   A type of Large Language Model (LLM), i.e. a model that has been trained on a huge amount of human language data
   OpenAI's GPT-2 is FOSS, although later models are closed
   Other LLMs are similar to GPT-2
  
   GPT is decoder-only
        In contrast, BERT is encoder-only. Also, T5 is an example of a model that uses both encoders and decoders.
       GPT has stacks of decoder blocks = masked self-attention layer + FFNN

    GPT has no concept of input!!
        All GPT does is continuously generate the next token in a sequence
        Using attention to maintain relationships to previous tokens
        GPT can be triggered to start (or “prompted”) with a sequence of tokens
         It then keeps on generating given the previous tokens (see the loop sketched below)
        Remember that the sequence can be processed in parallel! No need to feed in one token at a time.
         Getting rid of the idea of inputs/outputs allows unsupervised training on unlabeled piles of text
        GPT “learns a language” rather than being optimized for some specific task. Learns to interpret and speak it.
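A conceptual sketch of that generate-the-next-token loop. The `toy_model` here is a stand-in, not a real GPT, and a greedy argmax is used just to keep the example short (sampling strategies are covered under inference parameters below):

```python
# Autoregressive generation: keep appending whatever token the model predicts next.
import numpy as np

def generate(model, prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)                 # the prompt just seeds the sequence
    for _ in range(max_new_tokens):
        logits = model(ids)                # one score per token in the vocabulary
        next_id = int(np.argmax(logits))   # greedy pick of the most likely token
        ids.append(next_id)                # the new token becomes part of the context
    return ids

# Toy stand-in "model": always predicts (last token + 1) modulo the vocab size.
vocab_size = 10
toy_model = lambda ids: np.eye(vocab_size)[(ids[-1] + 1) % vocab_size]
print(generate(toy_model, [3]))            # [3, 4, 5, 6, 7, 8]
```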

   Hundreds of billions of parameters

LLMs

LLM Input Processing
    Tokenization + token encoding (of the prompt sequence)
    Token embedding
        Captures token similarities, i.e. semantic relationships between tokens, as vectors in a very high-dimensional space
    Positional encoding
        Captures each token's position in the input relative to other nearby tokens
        Uses an interleaved sinusoidal function (both sine and cosine), which allows it to work on any sequence length (sketched below)
        For a given period of e.g. 100 tokens, you can infer where a token sits relative to its 100 neighbors
        For a really long sequence the encoding would eventually repeat, but by then the token is probably no longer relevant to the context anyway
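A minimal NumPy sketch of that interleaved sinusoidal encoding; the tiny sequence length and model dimension are just for illustration:

```python
# Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones,
# with a different period per dimension pair.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # token positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]           # index of each dimension pair
    angle = pos / (10000 ** (2 * i / d_model))     # longer periods for higher dims
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

print(positional_encoding(seq_len=6, d_model=8).round(2))
```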

LLM Output Processing
    Stack of decoders outputs a vector at the end
    Output vector contains the “meaning” of what we want to say
    Multiply output vector with token embeddings
    Result is logits: scores for each token being the correct next token in the sequence (a softmax turns these into probabilities)
    Final output can be sampled from those probabilities (e.g. with a higher “temperature”) instead of always picking the highest-probability token → increases GPT's “creativity” (sketched below)
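A toy sketch of turning the decoder stack's output vector into next-token probabilities, with a temperature knob; all sizes and values here are made up:

```python
# Output vector -> logits (via the token embeddings) -> softmax with temperature.
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d_model = 12, 8

output_vec = rng.normal(size=d_model)               # "meaning" vector from the decoders
token_embeddings = rng.normal(size=(vocab_size, d_model))

logits = token_embeddings @ output_vec              # one score per vocabulary token

def sample(logits, temperature=1.0):
    probs = np.exp(logits / temperature)            # softmax with temperature scaling
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

print(sample(logits, temperature=0.2))   # low temperature: almost always the top token
print(sample(logits, temperature=1.5))   # high temperature: more random / "creative"
```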

LLM Key Terms & LLM Inference Parameters
    Top P = Cumulative probability threshold: sample only from the smallest set of tokens whose probabilities add up to P (higher = more random)
    Top K = Alternate mechanism: sample only from the K most likely candidate tokens (higher = more random); both are sketched below
    Temperature = Level of randomness in selecting the next word in the output from those tokens
    High temperature → More random/creative
    Low temperature → More consistent
    Context window = Number of tokens an LLM can process at once
    Max tokens = Limit for total number of tokens (on input or output)
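A rough sketch of how top-k and top-p narrow down the candidate tokens before sampling; the probability values below are made up:

```python
# Top-k keeps the k most likely tokens; top-p keeps the smallest set whose
# probabilities add up to at least p. Either way, the rest are zeroed out
# and the survivors are renormalized before sampling.
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def top_k(probs, k):
    keep = np.argsort(probs)[-k:]                 # indices of the k most likely tokens
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p(probs, p):
    order = np.argsort(probs)[::-1]               # most likely first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]   # smallest set reaching probability p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

print(top_k(probs, k=3).round(2))    # only the top 3 tokens remain possible
print(top_p(probs, p=0.8).round(2))  # only the tokens covering 80% of the mass remain
```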

Transfer Learning (Fine Tuning) with Transformers
    Use a pre-trained model as a base model, then adapt it (fine tune it) with your own data for your purposes
    Allows starting off with giant models (e.g. those that understand English, Klingon or Python) instead of having to train one from scratch!
        Opens up a whole new world of AI applications
    
Types of fine tuning
    Run additional training data through the whole model (full fine tuning)
    Freeze specific layers, re-train others (see the PyTorch sketch after this list)
    E.g. train a new tokenizer to learn a new language → train just the tokenization steps
        A popular technique here is LoRA (low-rank adaptation)
    Add a layer on top of the pre-trained model
        Just a few extra layers may be all that's needed!
    Can provide examples of prompts and desired completions
    e.g. “How’s the weather?” → “What’s it to you, bucko?”
    Can adapt it to classification or other tasks
    e.g. “Wow, I love this course!” → Positive emotion
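A rough PyTorch sketch of the "freeze the base, add a layer on top" pattern, using GPT-2 from Hugging Face as the base model; the classification head and the two labels are hypothetical:

```python
# Freeze a pre-trained GPT-2 body and bolt a small classification head on top.
import torch
from transformers import AutoModel

base = AutoModel.from_pretrained("gpt2")            # the pre-trained "giant model"
for param in base.parameters():
    param.requires_grad = False                     # freeze all original layers

num_labels = 2                                      # e.g. positive / negative sentiment
head = torch.nn.Linear(base.config.hidden_size, num_labels)  # the new layer on top

def classify(input_ids):
    hidden = base(input_ids).last_hidden_state      # (batch, seq_len, hidden_size)
    return head(hidden[:, -1, :])                   # score from the last token's vector

# During fine tuning, only head.parameters() would be handed to the optimizer.
```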

Hugging Face
    Giant repository of pre-trained models you can use → huggingface.co
    Can mess around with models… for free!
    Contains also a ton of learning resources
    Many available models: GPT-2, LLaMA, Stable Diffusion… (see the example below)
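A quick way to mess around with pre-trained models, assuming the transformers library is installed (models are downloaded on first use; the default sentiment model is whichever one the library currently picks):

```python
# Two one-liners from Hugging Face: sentiment analysis and text generation.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Wow, I love this course!"))        # e.g. [{'label': 'POSITIVE', ...}]

generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20))
```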

