ML Exam: 6

Feature Engineering

Feature Engineering - Basic Concepts

Applying domain knowledge (your knowledge of the data – and the model you’re using) to create better features to train your model
- ART OF ML!!
  - Most critical part in a good ML implementation
  - Talented/expert ML specialists are good at feature engineering
Curse of dimensionality
- More features is not better!
  - Every feature is a new dimension
  - Much of feature engineering is selecting most relevant features → domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means)

Feature Engineering - Techniques

Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution/Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1, which suits algorithms that assume roughly normally distributed inputs. Puts features with very different ranges on the same scale so that large numbers do not dominate. Good for combining "Age" (ranging from 0–100) and "Annual Income" (ranging from $10,000–$1,000,000).
Robust Scaling: Scales features using the median and Interquartile Range (IQR) to minimize the distorting effects of outliers.
Logarithmic transformation: Squashes skewed data. Good for "Size (sq. feet or sq. meters)".
Missing Value Imputation: Replaces missing numeric values with the mean, median, or a custom placeholder to keep row entries viable for training.
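
To see why robust scaling holds up better against outliers, here is a minimal sketch comparing it with min-max scaling. The function names and the sample values are made up for illustration, and only Python's standard library is used.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Hypothetical feature with one extreme outlier (100).
data = [1, 2, 3, 4, 100]
print(min_max_scale(data))  # the outlier squeezes the other values close to 0
print(robust_scale(data))   # the other values keep a usable spread
```

With min-max scaling the four ordinary values all land below 0.05, while robust scaling spreads them from -1 to 0.5 and leaves the outlier clearly visible.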

Categorical:
One-Hot Encoding: Converts categorical strings into a series of binary (0 or 1) columns, mapping discrete choices for algorithms requiring numerical input. Good for "City (Name)".
Ordinal / Label Encoding: Maps categories to sequential integers when a distinct, meaningful order exists between them.
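
A minimal sketch of ordinal encoding, assuming a hypothetical "size" feature whose categories have a natural order (the names below are illustrative, not from any library):

```python
# Hypothetical ordered category: the order carries meaning.
SIZE_ORDER = ["small", "medium", "large"]

def ordinal_encode(values, order):
    """Map each category to its position in the given order."""
    rank = {category: index for index, category in enumerate(order)}
    return [rank[v] for v in values]

encoded = ordinal_encode(["medium", "small", "large", "medium"], SIZE_ORDER)
print(encoded)  # [1, 0, 2, 1]
```

For unordered categories such as city names, integer codes like these would suggest an order that does not exist, which is why one-hot encoding is the usual choice there.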

Advanced:
Feature Splitting (such as Date/Time): Extracts granular components (such as hour, day of the week, or month) from a single timestamp string to expose cyclical patterns. Good for "Type and Year of Home".
Text Embedding Extraction: Converts raw text strings into dense vector representations using pre-trained Natural Language Processing (NLP) models.
Image Standardization: Resizes, crops, rotates, or converts image channels to grayscale to prepare uniform visual dimensions for deep learning pipelines.
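
The date/time splitting described above can be sketched with the standard library as follows. The function name and the sample timestamp are assumptions; the sine/cosine pair is one common way (not the only one) to expose the cyclical nature of the hour.

```python
import math
from datetime import datetime

def split_timestamp(timestamp):
    """Extract calendar components from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        "month": dt.month,
        # Sine/cosine pair so that hour 23 and hour 0 end up close together.
        "hour_sin": math.sin(2 * math.pi * dt.hour / 24),
        "hour_cos": math.cos(2 * math.pi * dt.hour / 24),
    }

features = split_timestamp("2024-03-15T18:30:00")
print(features["hour"], features["day_of_week"], features["month"])  # 18 4 3
```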

Math:
Principal Component Analysis (PCA): Reduces dataset dimensionality by projecting high-dimensional data onto a smaller set of uncorrelated components while preserving variance.
Synthetic Minority Over-sampling Technique (SMOTE): Balances skewed datasets by generating synthetic examples of underrepresented target classes to reduce bias toward the majority class.
Custom Math Formulations: Uses a built-in calculator to combine columns mathematically, generating new custom interaction metrics.

Common problems are below:

Missing Data

Impute missing data = fill missing data with something

Impute: Mean Replacement

Replace missing values with mean value of column
- A column represents a single feature
- Median value of column can be more useful if outliers distort the mean
  - e.g. outlier billionaires distorting the income data of average citizens
Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
  - If age & income are correlated, simply imputing the mean will muddy that relation a lot
- Mean/median can only be calculated on numeric features, not on categorical features
  - Most frequent value in a categorical feature could work though
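
A minimal sketch of column-level imputation, assuming missing values are represented as None. The function name and the sample data are hypothetical.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries using a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)  # also works for categorical values
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical incomes with one missing value and one billionaire.
incomes = [30_000, 40_000, None, 50_000, 1_000_000_000]
print(impute(incomes, "mean")[2])    # dragged up by the billionaire
print(impute(incomes, "median")[2])  # closer to a typical income
```

Here the mean fills the gap with roughly 250 million, while the median fills it with 45,000, which shows why the median is often the safer choice when outliers are present.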

Dropping Missing Data

Reasonable if (all must apply!):
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
Almost anything is better though, rarely “best” approach
- e.g. impute similar field (impute “review summary” into “full text”)
- Data is generally valuable, dropping it is generally a bad idea

Impute: using ML

KNN → Find K “nearest” (most similar) neighbors, i.e. rows, & average values to fill missing data
- Assumes numerical data
- Categorical data can be handled with Hamming distance, but usually DL is better for this
Deep Learning (DL) → DL model trained on all complete data, which can then impute the missing values in incomplete rows
- Good: Works very well for categorical data
- Bad: Complicated compared to other solutions
Regression → Find linear or non-linear relationships between missing feature and other features
- Multiple Imputation by Chained Equations (MICE) - advanced technique
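
As a rough illustration of the KNN idea, the sketch below fills one missing numeric value with the average from the K most similar complete rows. It assumes the other columns are numeric, fully populated and already on comparable scales; the function name and data are hypothetical.

```python
import math

def knn_impute(rows, target_index, k=2):
    """Fill None in column target_index with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target_index] is not None]

    def distance(a, b):
        # Euclidean distance over every column except the one being imputed.
        return math.sqrt(sum(
            (x - y) ** 2
            for i, (x, y) in enumerate(zip(a, b))
            if i != target_index
        ))

    filled = []
    for row in rows:
        if row[target_index] is None:
            neighbors = sorted(complete, key=lambda c: distance(row, c))[:k]
            estimate = sum(n[target_index] for n in neighbors) / len(neighbors)
            row = row[:target_index] + [estimate] + row[target_index + 1:]
        filled.append(row)
    return filled

# Hypothetical rows of [age, income in thousands]; one income is missing.
rows = [[25, 30], [27, 34], [52, 90], [55, 95], [26, None]]
result = knn_impute(rows, target_index=1, k=2)
print(result[4])  # [26, 32.0]
```

Unlike mean replacement, the estimate here comes from the two people closest in age, so the age/income relationship is preserved rather than muddied.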

Impute: Just Get More Data ;)

DUH! But often the right solution… try harder to get MOAR data!

Unbalanced Data

Unbalanced data has a large discrepancy in the occurrence of positives and negatives
- Positive = “Hit” = What you're testing/observing happens
  - Nothing to do with “good” or “ethical” outcomes
- Mainly a problem with neural networks (NNs)
Common example: fraud detection
- Fraud is generally very rare, e.g. 0.1% of the time
  - Only 0.1% of the data is positive (fraud), rest is negative → unbalanced
- If we don't deal with unbalanced data, an ML model might learn to always ignore fraud… after all, it almost never happens, so the accuracy is high… but the model is then useless!
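
The fraud example can be reproduced in a few lines. This is only a sketch with made-up data: a "model" that never predicts fraud on 1,000 hypothetical transactions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that were flagged as positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    return true_pos / actual_pos

# 1,000 hypothetical transactions, of which 1 (0.1%) is fraud.
y_true = [1] + [0] * 999
y_pred = [0] * 1000  # a "model" that never predicts fraud

print(accuracy(y_true, y_pred))  # 0.999
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks excellent, yet recall shows that not a single fraud case was caught.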

Possible solutions

Oversampling: clone samples from minority class (can be random)
Undersampling: remove a certain amount of samples from the majority class (here, the negatives)
- Throwing data away usually not the right answer! (unless wanting to avoid scaling issues)
Synthetic Minority Over-sampling TEchnique (SMOTE)
- Uses KNN to augment minorities
  - artificially generate new samples by interpolating between a minority sample and its nearest minority neighbors
- Generally better than plain oversampling
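
The difference between plain oversampling and SMOTE can be sketched as below. This is a simplified illustration of the interpolation idea, not the full SMOTE algorithm as implemented in libraries; the function names and points are hypothetical.

```python
import random

def random_oversample(minority, n_new, rng):
    """Plain oversampling: duplicate randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, k, rng):
    """Create points on the segment between a sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining minority samples by squared distance to the base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
clones = random_oversample(minority, n_new=5, rng=rng)
new_points = smote_like(minority, n_new=5, k=2, rng=rng)
```

The clones are exact copies of existing rows, whereas the SMOTE-style points are new values that stay inside the region already occupied by the minority class.
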
Adjusting thresholds (precision/recall trade-off)
- Predictions in a classification usually return a probability → need to define a threshold by which you flag a datapoint as positive
- Increasing the threshold can reduce false positives…
  - …but also increase false negatives
  - Precision/Recall Trade-off/balance → use a threshold that makes sense
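
The threshold trade-off can be made concrete with a small sketch. The labels and predicted probabilities below are invented for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when scores >= threshold are flagged positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.75):
    print(threshold, precision_recall(y_true, scores, threshold))
```

On this data, raising the threshold from 0.3 to 0.75 lifts precision from about 0.67 to 1.0 while recall drops from 1.0 to 0.5.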

Outliers

Variance ($𝜎^2$) = average of the squared differences from the mean
Standard deviation (𝜎) = $\sqrt{𝜎^2}$ = how far from mean are values on average
- Can be used to identify outliers (those that are >𝜎 from mean)
  - Possible criterion: object is outlier if it's multiple 𝜎 away from mean… what multiple exactly? Have to use common sense!
  - You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
Example: dataset = (1, 4, 5, 4, 8)
- Mean = (1+4+5+4+8)/5 = 4.4
- Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
- Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
  - Squaring prevents positive and negative differences from cancelling each other
  - Squaring also amplifies weight for outliers
- Variance ($𝜎^2$) = avg of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
- Standard deviation (𝜎) = $\sqrt{5.04}$ ≈ 2.24
  - 1 and 8 are more than one 𝜎 (2.24) away from mean (4.4) → with a 1𝜎 criterion, 1 and 8 are outliers
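
The worked example above can be checked with a short sketch. The function name is hypothetical, and the population variance (dividing by the number of values) is used, as in the example.

```python
import math

def find_outliers(values, n_sigmas=1.0):
    """Return (mean, variance, std_dev, outliers) using the population variance."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    std_dev = math.sqrt(variance)
    outliers = [x for x in values if abs(x - mean) > n_sigmas * std_dev]
    return mean, variance, std_dev, outliers

data = [1, 4, 5, 4, 8]
mean, variance, std_dev, outliers = find_outliers(data, n_sigmas=1.0)
print(mean, round(variance, 2), round(std_dev, 2), outliers)
```

With a stricter criterion of two sigmas, the same dataset has no outliers at all, which is why the choice of multiple matters.
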
Sometimes appropriate to remove outliers from training data… SOMETIMES!!
- e.g.1 Collaborative filtering → a single user rates lots of stuff compared to others, this user then has a big effect compared to those who rate just a few
- e.g.2 Bots/agents in web log data → they don't represent actual human behavior
- e.g.3 obtain a meaningful graph for the mean income of most of the population… a billionaire can skew this
BUT! Be responsible! Understand WHY you eliminate outliers!
- e.g. if you need an accurate mean income for ALL US citizens, that includes billionaires, even though they are few and skew the data… Need accuracy!
Random Cut Forest algorithm is an example of outlier detection
- Present in many services (QuickSight, SageMaker…)
- In Amazon Kinesis Data Analytics for SQL applications (the SQL-based sibling of what is now Amazon Managed Service for Apache Flink) → RANDOM_CUT_FOREST is a SQL function used for anomaly detection on numeric columns in a stream

Data Transformation

Binning

Put values into bins/buckets i.e. ranges of values
- e.g. instead of actual age, use 10-19y.o., 20-29y.o., etc
Transforms numeric data to ordinal data
- Usually not good because you lose data…
- …but really useful when there's uncertainty in measurements! → can minimize measurement errors
Quantile binning: categorizes data by their place in the data distribution
- Ensures even sizes of bins/buckets
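
Both kinds of binning can be sketched with the standard library. The function names and the sample ages are assumptions for illustration.

```python
import statistics

def fixed_width_bin(value, width=10):
    """Label a value with its fixed-width range, e.g. 23 -> '20-29'."""
    start = (value // width) * width
    return f"{start}-{start + width - 1}"

def quantile_bins(values, n_bins=4):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    # The bin index is the number of cut points the value exceeds.
    return [sum(v > c for c in cuts) for v in values]

ages = [18, 22, 25, 31, 38, 45, 52, 67]
print([fixed_width_bin(a) for a in ages])
print(quantile_bins(ages))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Fixed-width bins can end up very uneven (here only one person falls in "10-19"), while the quantile bins each hold two of the eight values.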

Transforming

Applying some function to a feature to suit it for training
- e.g. apply a log to an exponential trend
Can replace original feature… or just create a new feature to use side by side (careful with curse of dimensionality though)
- e.g. YouTube recommendations use $x$, $x^2$ and $\sqrt x$
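
A minimal sketch of both ideas: applying a log to an exponential trend, and keeping derived features side by side with the original. The function names and values are hypothetical.

```python
import math

def log_transform(values):
    """Apply a base-10 log so an exponential trend becomes roughly linear."""
    return [math.log10(v) for v in values]

def expand_feature(x):
    """Return x alongside its square and square root as separate features."""
    return {"x": x, "x_squared": x ** 2, "x_sqrt": math.sqrt(x)}

# Hypothetical exponentially growing series.
series = [1, 10, 100, 1000, 10000]
print(log_transform(series))  # evenly spaced after the transform
print(expand_feature(9))
```

Note that the log is only defined for positive values, so real data with zeros or negatives would need a shift or a different transform.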

Scaling/Normalization

A particular type of Transforming function
Transform data to a manageable scale or distribution
- Remember to scale results back up!
Many models need features scaled to comparable values
- Otherwise features with large magnitudes will have more weight than desired!
- e.g. age↔income… incomes have much higher values than ages
Some models (e.g. neural networks) prefer data to be normally distributed around 0
Libraries can help (e.g. scikit-learn has a preprocessing module with scalers like MinMaxScaler)
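
As a sketch of what such a scaler does under the hood, here is Z-score scaling in plain Python, including the step of scaling results back up. The function names and sample values are made up; in practice a library scaler would normally be used instead.

```python
import statistics

def fit_standard_scaler(values):
    """Return the (mean, standard deviation) needed to scale and unscale."""
    return statistics.fmean(values), statistics.pstdev(values)

def scale(values, mean, std):
    """Center on the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

def unscale(scaled, mean, std):
    """Map scaled values back to the original units."""
    return [s * std + mean for s in scaled]

# Hypothetical features with very different magnitudes.
ages = [20, 30, 40, 50, 60]
incomes = [10_000, 50_000, 100_000, 400_000, 1_000_000]

age_mean, age_std = fit_standard_scaler(ages)
scaled_ages = scale(ages, age_mean, age_std)
scaled_incomes = scale(incomes, *fit_standard_scaler(incomes))
restored_ages = unscale(scaled_ages, age_mean, age_std)
```

After scaling, both features have mean 0 and standard deviation 1, so neither dominates just because of its units. Keeping the fitted mean and standard deviation around is what makes the reverse step possible.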

Encoding

Data transformed into a new representation
- Some ML models require the new representation
One-hot encoding
- Create “buckets” for every category; each sample has a 1 in exactly one bucket and 0 in all the others.
- Very common in DL → categories are represented by individual output “neurons”
  - Allows handling categorical data (e.g. city names) in neural networks
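
A minimal sketch of one-hot encoding for a hypothetical city feature (the function name and city list are illustrative):

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row contains a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cities = ["Paris", "Tokyo", "Lima", "Tokyo"]
categories, encoded = one_hot_encode(cities)
print(categories)  # ['Lima', 'Paris', 'Tokyo']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

Each distinct category adds a column, so a feature with many distinct values can add a lot of dimensions.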

Shuffling

Randomize order of rows
Benefits many algorithms
- Eliminates residual signals in data that correlate to collection order
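
When shuffling, features and labels have to move together. A minimal sketch, assuming features and labels are stored in two parallel lists (the function name and the fixed seed are illustrative choices):

```python
import random

def shuffle_rows(features, labels, seed=42):
    """Shuffle features and labels with the same permutation so pairs stay aligned."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    return [features[i] for i in indices], [labels[i] for i in indices]

# Hypothetical data where each label is twice its feature value.
X = [[i] for i in range(10)]
y = [i * 2 for i in range(10)]
X_shuffled, y_shuffled = shuffle_rows(X, y)
```

Using a fixed seed keeps the shuffle reproducible between runs, which helps when comparing experiments.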
