ML Exam Prep: 8 - Model Fitting and Tuning


Problems:

1. Overfitting (Model is too complex)

The Problem: The model does great on training data because it memorizes the data (including its noise and random errors) instead of learning patterns, often after too many training epochs. It does terribly on new data. This is a "High Variance" model. If validation loss increases while training loss keeps decreasing, it's a strong indicator of overfitting.
Fix - Simplify the Model:
  • 1) Early Stopping: Halts training when validation loss begins to rise.
  • 2) Increase Regularization: Penalizes large weights; L1 (Lasso) zeroes out weak features, L2 (Ridge) shrinks them. If L1 drives all feature weights to 0, decrease the regularization parameter.
  • 3) Increase Dropout: Randomly shuts off neurons during training.
  • 4) Fewer Feature Combinations: Removes complex or noisy inputs.
  • 5) Data Augmentation: Tweaks existing training samples (e.g., rotating images) to create new variety.
  • 6) Use Cross-Validation: Rotates through training and validation splits to get a more reliable estimate of how the model generalizes. Use stratified K-fold for classification. Also listed under Underfitting.
  • 7) Decrease Max Depth/Layers: Simplifies the model.
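Early stopping (fix 1 above) can be sketched in a few lines. This is a minimal illustration, not a framework API: the function name, the `patience` parameter, and the loss values are all assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would halt.

    Stops once validation loss has failed to improve for `patience`
    consecutive epochs; returns the last index if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: improving at first, then rising (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

In practice you would also keep a copy of the weights from the best epoch (index 3 here) and restore them after stopping.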
2. Underfitting (Model is too simple)
The Problem: The model fails to capture the patterns in the data, performing poorly on both training and testing/validation sets.
Fix - Fatten the Model:
  • 1) Increase Model Complexity: Add layers/neurons 
  • 2) Decrease Regularization: Reduce dropout or L1/L2 regularization.
  • 3) Increase or Combine Features: Combine or expand to reveal complex relationships. Ex: combining width and length into area.
  • 4) Train for More Epochs: Give the model more training time to converge toward a lower-loss minimum.
  • 5) Use Cross-Validation: Rotates through training and validation splits to confirm the poor fit is consistent across folds. Use stratified K-fold for classification. Also listed under Overfitting.
  • 6) Add Training Data: Add minority-class examples, image inversions (upside down), image translations (slight shift or angle), or just more data.
  • 7) Better Algorithm: Switch to a model type better suited to the data.
  • 8) Increase the Learning Rate: Helps when the model learns too slowly or gets stuck, by making it converge faster. Usually not the first fix to try.
3. Catastrophic Forgetting (Model forgets old tasks)
The Problem: In sequential learning, training on a new task overwrites the weights learned from previous tasks. Ex: a model first trained to find cats and then trained to find cars overwrites its weights and forgets how to find cats.
Fix:
1) Regularization-Based (Elastic Weight Consolidation - EWC): Identifies weights critical to old tasks and penalizes changing them.
2) Rehearsal / Interleaving: Mixes a small amount of old-task data into the new task's training set.
3) Architecture-Based (Progressive Neural Networks): Freezes old network columns and adds new neurons for new tasks.
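The EWC idea in fix 1) can be sketched as a penalty added to the new task's loss. This is a simplified illustration: in EWC the per-weight importance is typically estimated from the Fisher information, while the importance values, weights, and `lam` below are hypothetical.

```python
def ewc_penalty(weights, old_weights, importance, lam=1.0):
    """(lam / 2) * sum_i importance[i] * (weights[i] - old_weights[i]) ** 2"""
    return 0.5 * lam * sum(
        f * (w - w_old) ** 2
        for w, w_old, f in zip(weights, old_weights, importance)
    )


old_weights = [1.0, -2.0]
importance = [10.0, 0.1]  # hypothetical: the first weight matters far more to the old task

# Moving each weight by the same amount costs very different penalties.
moved_important = ewc_penalty([2.0, -2.0], old_weights, importance)
moved_unimportant = ewc_penalty([1.0, -1.0], old_weights, importance)
```

The total training objective would be the new-task loss plus this penalty, so the optimizer prefers to change weights the old task does not rely on.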

4. Oscillating
    An oscillating pattern in training and validation loss means the model keeps fluctuating between underfitting and overfitting. The learning rate is likely too high, so reduce it.
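A toy example shows why a high learning rate causes bouncing. The sketch below runs plain gradient descent on f(w) = w**2, a stand-in loss chosen only for illustration; the learning-rate values are assumptions.

```python
def gradient_descent(lr, start=1.0, steps=5):
    """Minimize f(w) = w**2 (gradient 2*w) and return every visited weight."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w
        path.append(w)
    return path


stable = gradient_descent(lr=0.1)       # shrinks steadily toward 0
oscillating = gradient_descent(lr=0.9)  # overshoots: sign flips every step
diverging = gradient_descent(lr=1.1)    # overshoots and grows each step
```

With the small rate the weight approaches the minimum from one side. With the larger rates each step jumps past the minimum to the other side, and at 1.1 the jumps get bigger instead of smaller.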

5. Rare Class Is Being Overlooked
    If a rare class (e.g., a single instance of a rare disease) is being missed and recall is low, try adding a class weight for the target class to the loss function.
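One way to picture a class weight is as a multiplier on the rare class's term in binary cross-entropy. The function name, `pos_weight` parameter, and data below are illustrative assumptions, not a specific library's API.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=1.0):
    """Mean binary cross-entropy with the positive class scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical batch: one rare positive that the model scores at only 0.1.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.1]

plain = weighted_bce(y_true, p_pred)
weighted = weighted_bce(y_true, p_pred, pos_weight=10.0)
```

With the weight applied, missing the single positive dominates the loss, which pushes training toward higher recall on the rare class.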

6. Too Many Features or Highly Correlated Features (Multicollinearity):
a) Inspect the data: correlation plot with heat maps, univariate selection, or feature importance from a tree-based classifier.
b) Pearson Correlation Coefficient: 0 = no linear correlation, 1 (or -1) = perfect positive (or negative) linear correlation. A value of 0 does not by itself prove independence. Use a Naive Bayes model if features are independent; otherwise use a full Bayesian network.
c) Principal Component Analysis (PCA): Reduces dimensions while keeping most of the data variance. Sensitive to feature scale, so scale features first (e.g., with a Min Max Scaler transformation).
d) Recursive Feature Elimination (RFE): Iteratively trains the model, ranks features by importance (e.g., based on coefficients in logistic regression), removes the least important features, and repeats the process until the target number of features remains.
e) Singular Value Decomposition (SVD): Factorizes the data matrix into singular vectors and singular values; keeping only the largest singular values gives a lower-dimensional approximation of the data.
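The Pearson coefficient from item b) can be computed by hand, as in this minimal sketch with made-up data. The third case is included because the coefficient only measures linear association: two features can be fully dependent and still score 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_linear = pearson(xs, [2 * x + 1 for x in xs])  # perfectly linear
r_inverse = pearson(xs, [-3 * x for x in xs])    # perfectly inverse
r_square = pearson(xs, [x ** 2 for x in xs])     # dependent, but not linearly
```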

7. Computational efficiency:
Prune the data.

8. Predictive Power Score of 1:
"Target leakage" has likely happened: a feature contains information that would not be available at prediction time. Ex: a Cancellation Date column when predicting customer churn.

9. Imbalanced data:
1) Stratified Sampling: Each sample or split keeps the same % of minority and majority classes as the full dataset.
2) Add Training Data: Add examples of the minority class.
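Stratified sampling from item 1) can be sketched with the standard library alone. The function name, the 90/10 label mix, and the 20% test fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_positives = sum(labels[i] for i in test_idx)
```

A purely random 20% split could easily end up with zero or one positive; splitting per class keeps the 10% minority share in both parts.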

10. Vanishing Gradient on model:
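Use ReLU-type activations, careful weight initialization, batch normalization, or residual connections. Reducing the "Learning Rate" is a fix for exploding gradients, not vanishing ones.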

-----------------------------------------------------------------------------------------------------------

Hyperparameter Tuning


1. Loss Functions
The loss function measures how badly a model's predictions deviate from the actual targets. 
Regression Loss Metrics
  • "L1" Loss (Mean Absolute Error "MAE"): Measures the absolute differences between predictions and actual values. It is highly robust to outliers because it does not square the errors.
  • "L2" Loss (Mean Squared Error "MSE"): Squares the differences between predictions and actual values. It heavily penalizes large errors, forcing the model to avoid extreme mistakes.
  • Huber Loss ("delta"): Combines the best parts of "L1" and "L2" loss. It behaves like "L2" for small errors and "L1" for large errors. The "delta" parameter sets where this transition happens.
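The three regression losses above can be compared side by side on one outlier-sized miss. This is a minimal sketch with made-up predictions; `delta=1.0` is an arbitrary choice.

```python
def mae(y_true, y_pred):
    """L1 loss: mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """L2 loss: mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 14.0]  # the last prediction misses by 10

mae_value = mae(y_true, y_pred)      # 2.625
mse_value = mse(y_true, y_pred)      # 25.0625
huber_value = huber(y_true, y_pred)  # 2.40625
```

The single large miss dominates MSE, while MAE and Huber grow only linearly with it.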
Classification Loss Metrics
  • Hinge Loss: Used most notably in Support Vector Machines. Maximizes the margin between classes and is relatively robust to outliers, since it grows only linearly with the size of the error.
  • Logistic Loss (Cross-Entropy): Penalizes confident wrong probability estimates. Good for skewed class distributions when paired with threshold adjustments.
  • Class Weights / Focal Loss: Applied to imbalanced datasets (e.g., fraud detection). It assigns a higher penalty to misclassifying the minority class, forcing the model to learn rare events.
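The classification losses above can be written per example as follows. This is a simplified sketch: the focal loss omits the optional class-balancing factor, and `gamma=2.0` is just a common illustrative setting.

```python
import math


def hinge_loss(y, score):
    """y in {-1, +1}; score is the raw (unthresholded) model output."""
    return max(0.0, 1.0 - y * score)


def log_loss(y, p):
    """y in {0, 1}; p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(y, p, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t) ** gamma, so easy examples count less."""
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


beyond_margin = hinge_loss(1, 2.0)  # correct and outside the margin: no loss
inside_margin = hinge_loss(1, 0.5)  # correct but inside the margin: small loss
easy_ce = log_loss(1, 0.9)
easy_focal = focal_loss(1, 0.9)     # same example, heavily down-weighted
hard_focal = focal_loss(1, 0.1)     # badly missed example keeps most of its loss
```

Setting `gamma=0` reduces the focal loss to ordinary cross-entropy, which is a quick way to sanity-check an implementation.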

2. Regularization
Regularization parameters prevent overfitting by constraining the model's weights to keep them small and simple.
  • "L1" Regularization (Lasso) Penalty: Zeroes out weak features.
  • "L2" Regularization (Ridge) Penalty: Shrinks weights toward zero without eliminating them, which smooths out predictions.
  • Dropout Rate: (DL specific) Dictates the % of random neurons to disable during each training step, forcing the network to learn robust, redundant pathways.
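Why "L1" zeroes out weights while "L2" only shrinks them can be seen in a one-weight toy problem. The sketch below uses the closed-form minimizers of a squared-error term plus each penalty; the weights and `lam=0.5` are hypothetical.

```python
def l1_shrink(w, lam):
    """Minimizer of 0.5*(x - w)**2 + lam*abs(x): soft-thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Minimizer of 0.5*(x - w)**2 + 0.5*lam*x**2: proportional shrinkage."""
    return w / (1.0 + lam)


weights = [3.0, -0.2, 0.05]  # one strong feature, two weak ones
l1_result = [l1_shrink(w, lam=0.5) for w in weights]
l2_result = [l2_shrink(w, lam=0.5) for w in weights]
```

Under "L1" the two weak weights land exactly on 0 and the strong one drops by a fixed amount. Under "L2" every weight is scaled down by the same factor and none reaches 0.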

3. Learning Parameters
Learning parameters dictate the speed, step size, and numerical stability of the optimization algorithm (like Gradient Descent) as it searches for the best model weights.
  • Learning Rate ("alpha"): Controls how much the model updates its internal weights in response to the estimated error at each step.
  • Learning Rate Decay: Gradually decreases the learning rate over time to help the model settle precisely into the lowest error point.
    • Decaying too fast: The learning steps become too small too quickly, and the algorithm never reaches the optimum (stalls out).
    • Decaying too slowly: The steps remain too large, causing the algorithm to bounce around the optimum and never converge.
  • Batch Size: Count of training samples processed before the model updates its internal weights. Small batches add helpful noise; large batches optimize GPU utilization.
  • Momentum: Accelerates the optimization algorithm in the correct direction by averaging past gradients, helping the model pass through flat regions or escape local traps.
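Learning rate decay and momentum can each be sketched in a few lines. Both functions are illustrative assumptions rather than any library's optimizer: the decay is a simple exponential schedule, and the momentum update is tried on the toy loss f(w) = w**2.

```python
def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` updates: initial_lr * decay_rate ** step."""
    return initial_lr * decay_rate ** step


def sgd_momentum(grad_fn, w=5.0, lr=0.1, momentum=0.9, steps=200):
    """Gradient descent where each step blends in the previous step's velocity."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w


lr_at_start = exponential_decay(0.1, 0.5, 0)
lr_after_3 = exponential_decay(0.1, 0.5, 3)  # halved three times

# Toy loss f(w) = w**2 has gradient 2*w and its minimum at w = 0.
final_w = sgd_momentum(lambda w: 2 * w)
```

A `decay_rate` close to 1 corresponds to the "decaying too slowly" case above, and a small one to "decaying too fast".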

4. Hyperband

  A hyperparameter search method that allocates more resources to promising configurations while stopping underperforming configurations early.
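The core loop behind this idea is successive halving, which Hyperband runs several times with different starting budgets. The sketch below shows only one such run, and the objective function, candidate values, and parameter names are hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs each round while multiplying budget by eta.

    `evaluate(config, budget)` must return a loss (lower is better).
    """
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


# Hypothetical objective: loss is lowest at lr = 0.1 and improves with budget.
def toy_loss(lr, budget):
    return abs(lr - 0.1) + 1.0 / budget


candidates = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 0.3, 0.05]
best_lr = successive_halving(candidates, toy_loss)
```

Here eight candidates get a budget of 1, the best four get 2, and the best two get 4, so most of the compute goes to the configurations that looked promising early.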
