ML Exam: 6 - Feature Engineering

Feature Engineering - Basic Concepts

  • Applying domain knowledge (your knowledge of the data – and the model you’re using) to create better features to train your model
    • ART OF ML!!
      • Most critical part in a good ML implementation
      • Talented/expert ML specialists are good at feature engineering
  • Curse of dimensionality
    • More features are not better!
      • Every feature is a new dimension
      • Much of feature engineering is selecting most relevant features → domain knowledge comes into play
    • Unsupervised dimensionality reduction techniques can help (PCA, K-Means)


Feature Engineering - Techniques

Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution / Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1, for algorithms that assume normally distributed data. Puts features with very different ranges onto the same scale so large magnitudes don't dominate. Good for combining "Age" (ranging 0–100) with "Annual Income" (ranging $10,000–$1,000,000).
Robust Scaling: Scales features using the median and interquartile range (IQR) to minimize the distorting effects of outliers.
Logarithmic transformation: Compresses (squashes) right-skewed data so a few huge values don't dominate. Good for "Size (sq. feet or sq. meters)".
Missing Value Imputation: Replaces missing numeric values with the mean, median, or a custom placeholder to keep row entries viable for training.
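
A minimal sketch of two of these numeric techniques with pandas and scikit-learn (the "income"/"sqft" columns and their values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [42_000, 55_000, np.nan, 1_200_000],  # skewed, with a missing value
    "sqft":   [850, 1_200, np.nan, 15_000],
})

# Median imputation: more robust to the $1.2M outlier than the mean.
imputer = SimpleImputer(strategy="median")
df[["income", "sqft"]] = imputer.fit_transform(df[["income", "sqft"]])

# Log transform: log1p squashes the right-skewed tail (and handles zeros safely).
df["log_sqft"] = np.log1p(df["sqft"])
print(df)
```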

Categorical:
One-Hot Encoding: Converts categorical strings into a series of binary (0 or 1) columns, mapping discrete choices for algorithms requiring numerical input. Good for "City (Name)".
Ordinal / Label Encoding: Maps categories to sequential integers when a distinct, meaningful order exists between them.
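
A minimal sketch of both encoders in scikit-learn (the city/size values are made up; note `sparse_output` requires scikit-learn ≥ 1.2):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cities = pd.DataFrame({"city": ["Tokyo", "Lima", "Tokyo", "Oslo"]})
sizes  = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One-hot: one binary column per city, no implied order between cities.
ohe = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
print(ohe.fit_transform(cities))

# Ordinal: only when the order is meaningful (small < medium < large).
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ord_enc.fit_transform(sizes))  # [[0.], [2.], [1.], [0.]]
```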

Advanced:
Feature Splitting (such as Date/Time): Extracts granular components (such as hour, day of the week, or month) from a single timestamp string to expose cyclical patterns. Good for splitting a combined field like "Type and Year of Home" into separate type and year features.
Text Embedding Extraction: Converts raw text strings into dense vector representations using pre-trained Natural Language Processing (NLP) models.
Image Standardization: Resizes, crops, rotates, or converts image channels to grayscale to prepare uniform visual dimensions for deep learning pipelines.
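
A minimal sketch of date/time feature splitting with pandas (the timestamps are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime([
    "2024-01-05 08:30", "2024-06-21 17:45", "2024-12-31 23:10",
])})

# Extract granular components that expose cyclical patterns.
df["hour"]      = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek  # 0 = Monday
df["month"]     = df["ts"].dt.month
print(df)
```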

Math:
Principal Component Analysis (PCA): Reduces dataset dimensionality by projecting high-dimensional data onto a smaller set of uncorrelated components while preserving variance.
Synthetic Minority Over-sampling (SMOTE): Balances skewed datasets by generating synthetic examples of underrepresented target classes to reduce bias toward the majority class.
Custom Math Formulations: Combines existing columns mathematically (e.g. ratios or products) to generate new custom interaction metrics.
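
A minimal PCA sketch with scikit-learn, using synthetic data (3 underlying factors embedded in 10 dimensions) so the variance-preserving reduction is visible:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))     # 3 true underlying factors
X = latent @ rng.normal(size=(3, 10))  # embedded in 10 dimensions
X += 0.05 * rng.normal(size=X.shape)   # small noise

pca = PCA(n_components=0.95)           # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 3): back to ~3 dimensions
```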


Common problems are covered below:

Missing Data

  • Impute missing data = fill missing data with something

Impute: Mean Replacement

  • Replace missing values with mean value of column
    • A column represents a single feature
    • Median value of column can be more useful if outliers distort the mean
      • e.g. outlier billionaires distorting the income data of average citizens
  • Pros
    • Fast & easy
    • Doesn't affect mean or sample size of overall data set
  • Cons: pretty terrible
    • Not very accurate
    • Misses correlations between features (only works on column level)
      • If age & income are correlated, simply imputing the mean will muddy that relation a lot
    • Mean/median can only be calculated on numeric features, not on categorical features
      • Most frequent value in a categorical feature could work though
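
A minimal sketch of mean/median replacement in pandas, plus the most-frequent-value trick for a categorical column (the data is made up; the outlier mimics the billionaire example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, np.nan, 52_000, 3_000_000],  # outlier drags the mean up
    "city":   ["Lima", "Oslo", None, "Oslo"],
})

df["income_mean"]   = df["income"].fillna(df["income"].mean())    # distorted by outlier
df["income_median"] = df["income"].fillna(df["income"].median())  # more robust
df["city"]          = df["city"].fillna(df["city"].mode()[0])     # most frequent category
print(df)
```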

Dropping Missing Data

  • Reasonable if (all must apply!):
    • Not many rows with missing data
    • Dropping those rows doesn't bias data
    • Need a fast solution
  • Almost anything else is better though; dropping is rarely the “best” approach
    • e.g. impute similar field (impute “review summary” into “full text”)
    • Data is generally valuable, dropping it is generally a bad idea

Impute: using ML

  1. KNN → Find K “nearest” (most similar) neighbors, i.e. rows, & average values to fill missing data
    • Assumes numerical data
    • Categorical data can be handled with Hamming distance, but usually DL is better for this
  2. Deep Learning (DL) → a DL model is trained on all complete rows, then used to predict (impute) the missing values in incomplete rows
    • Good: Works very well for categorical data
    • Bad: Complicated compared to other solutions
  3. Regression → Find linear or non-linear relationships between missing feature and other features
    • Multiple Imputation by Chained Equations (MICE) - advanced technique
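
A minimal sketch of the ML-based imputers scikit-learn ships: KNNImputer for the KNN approach, and IterativeImputer as a MICE-style regression imputer (still marked experimental, hence the enable_* import):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])

# KNN: fill the gap with the average of the 2 most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# MICE-style: iteratively regress each feature on the others.
print(IterativeImputer(random_state=0).fit_transform(X))
```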

Impute: Just Get More Data ;)

  • DUH! But often the right solution… try harder to get MOAR data!

Unbalanced Data

  • Unbalanced data has a large discrepancy in the occurrence of positives and negatives
    • Positive = “Hit” = What you're testing/observing happens
      • Nothing to do with “good” or “ethical” outcomes
    • Mainly a problem with neural networks (NNs)
  • Common example: fraud detection
    • Fraud is generally very rare, e.g. 0.1% of the time
      • Only 0.1% of the data is positive (fraud), rest is negative → unbalanced
    • If we don't deal with unbalanced data, an ML model might learn to always ignore fraud… after all, it almost never happens, so accuracy is high… but the model is then useless!

Possible solutions

  • Oversampling: clone samples from minority class (can be random)
  • Undersampling: remove some majority-class (negative) samples
    • Throwing data away usually not the right answer! (unless wanting to avoid scaling issues)
  • Synthetic Minority Over-sampling TEchnique (SMOTE)
    • Uses KNN to augment minorities
      • artificially generates new samples by interpolating between a minority sample and its nearest minority neighbors
    • Generally better than plain oversampling
  • Adjusting thresholds (precision/recall trade-off)
    • Predictions in a classification usually return a probability → need to define a threshold by which you flag a datapoint as positive
    • Increasing the threshold can reduce false positives…
      • …but also increase false negatives
      • Precision/Recall Trade-off/balance → use a threshold that makes sense
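
A minimal sketch of SMOTE plus threshold adjustment. SMOTE lives in the separate imbalanced-learn package (`pip install imbalanced-learn`); the 99:1 dataset and the 0.3 threshold are made up for illustration:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Build a ~99:1 dataset, similar in spirit to fraud data.
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))

# SMOTE: synthesize new minority samples from nearest neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))

# Threshold adjustment: lowering the 0.5 default catches more positives
# (fewer false negatives) at the cost of more false positives.
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
flagged = clf.predict_proba(X)[:, 1] > 0.3
print("flagged:", flagged.sum())
```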

Outliers

  • Variance ($\sigma^2$) = average of the squared differences from the mean
  • Standard deviation ($\sigma$) = $\sqrt{\sigma^2}$ = how far values are from the mean on average
    • Can be used to identify outliers (those that are more than $\sigma$ from the mean)
      • Possible criterion: a data point is an outlier if it's multiple $\sigma$ away from the mean… what multiple exactly? Have to use common sense!
      • You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
  • Example: dataset = (1, 4, 5, 4, 8)
    • Mean = (1+4+5+4+8)/5 = 4.4
    • Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
    • Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
      • Squaring prevents positive and negative variances from cancelling each other
      • Squaring also amplifies weight for outliers
    • Variance ($\sigma^2$) = avg of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
    • Standard deviation ($\sigma$) = $\sqrt{5.04} \approx 2.24$
      • 1 and 8 are more than 2.24 away from the mean (4.4) → 1 and 8 are outliers (see the code sketch after this list)
  • Sometimes appropriate to remove outliers from training data…  SOMETIMES!!
    • e.g. 1: Collaborative filtering → a single user rates lots of stuff compared to others; this user then has a big effect compared to those who rate just a few items
    • e.g. 2: Bots/agents in web log data → they don't represent actual human behavior
    • e.g. 3: Obtaining a meaningful graph for the mean income of most of the population… a billionaire can skew this
  • BUT! Be responsible! Understand WHY you eliminate outliers!
    • e.g. If need an accurate measure of the mean income for ALL US citizens, that includes billionaires, even if few and skew the data… Need accuracy!
  • Random Cut Forest algorithm is an example of outlier detection
    • Present in many services (QuickSight, SageMaker…)
    • In Amazon Managed Service for Apache Flink → RANDOM_CUT_FOREST is a SQL function used for anomaly detection on numeric columns in a stream
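
A minimal sketch reproducing the worked example above with numpy, flagging anything more than a chosen multiple of $\sigma$ from the mean:

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])
mean, sigma = data.mean(), data.std()  # population std, as in the example
print(mean, sigma**2, sigma)           # 4.4, 5.04, ~2.24

multiple = 1                           # how many sigmas counts as "extreme": common sense!
outliers = data[np.abs(data - mean) > multiple * sigma]
print(outliers)                        # [1 8]
```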

Data Transformation

Binning

  • Put values into bins/buckets i.e. ranges of values
    • e.g. instead of actual age, use 10–19 y.o., 20–29 y.o., etc.
  • Transforms numeric data to ordinal data
    • Usually not good because you lose data…
    • …but really useful when there's uncertainty in measurements! → can minimize measurement errors
  • Quantile binning: categorizes data by their place in the data distribution
    • Ensures even sizes of bins/buckets
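
A minimal sketch of both flavors of binning with pandas (the ages are made up):

```python
import pandas as pd

ages = pd.Series([12, 19, 23, 35, 41, 58, 73, 88])

# Fixed-width bins: 10-19, 20-29, … (counts per bin can be very uneven).
print(pd.cut(ages, bins=range(10, 100, 10), right=False))

# Quantile bins: each bucket gets roughly the same number of rows.
print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))
```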

Transforming

  • Applying some function to a feature to suit it for training
    • e.g. apply a log to an exponential trend
  • Can replace the original feature… or just create a new feature to use side-by-side (careful with the curse of dimensionality though)
    • e.g. YouTube recommendations use $x$, $x^2$ and $\sqrt x$
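
A minimal sketch of that side-by-side idea: keep $x$ and add $x^2$ and $\sqrt x$ as extra columns so a model can pick up non-linear trends (toy values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 4.0, 9.0, 16.0]})
df["x_squared"] = df["x"] ** 2
df["sqrt_x"]    = np.sqrt(df["x"])
print(df)  # three views of the same trend, at the cost of two extra dimensions
```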


Scaling/Normalization

  • A particular type of transforming function
  • Transform data to a manageable scale or distribution
    • Remember to scale results back up when interpreting outputs!
  • Most models require data to be scaled to comparable values
    • Otherwise features with large magnitudes will have more weight than desired!
    • e.g. age↔income… incomes have much higher values than ages
  • Some models (e.g. neural networks) prefer data to be normally distributed around 0
  • Libraries can help (e.g. scikit-learn has a preprocessing module with scalers like MinMaxScaler)
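
A minimal sketch of the scikit-learn scalers mentioned above, on made-up age/income rows; `inverse_transform` is the "scale results back up" step:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30_000], [40, 80_000], [60, 1_000_000]], dtype=float)  # age, income

mm = MinMaxScaler()                    # squeeze both columns into [0, 1]
X_scaled = mm.fit_transform(X)
print(X_scaled)
print(mm.inverse_transform(X_scaled))  # recover the original scale

print(StandardScaler().fit_transform(X))  # mean 0, std 1 per column
```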

Encoding

  • Data transformed into a new representation
    • Some ML models require the new representation
  • One-hot encoding
    • Create “buckets” for every category, each sample has only one bucket with 1, the rest has 0.
    • Very common in DL → categories are represented by individual output “neurons”
      • Allows handling categorical data (e.g. city names) in neural networks
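
A minimal sketch of those one-hot "buckets" using pandas (made-up city values; each row has exactly one hot column):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Tokyo", "Lima", "Tokyo"]})
print(pd.get_dummies(df, columns=["city"], dtype=int))
#    city_Lima  city_Tokyo
# 0          0           1
# 1          1           0
# 2          0           1
```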

Shuffling

  • Randomize order of rows
  • Benefits many algorithms
    • Eliminates residual signals in data that correlate to collection order
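
A minimal sketch of shuffling with scikit-learn; the key point is that features and labels stay paired:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])
X_shuf, y_shuf = shuffle(X, y, random_state=0)  # rows stay paired with their labels
print(X_shuf, y_shuf)
```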
