ML Exam: 6 - Feature Engineering
Feature Engineering - Basic Concepts
- Applying domain knowledge (your knowledge of the data – and the model you’re using) to create better features to train your model
- ART OF ML!!
- Most critical part in a good ML implementation
- Talented/expert ML specialists are good at feature engineering
- Curse of dimensionality
- More features is not better!
- Every feature is a new dimension
- Much of feature engineering is selecting most relevant features → domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can help (PCA, K-Means)
Feature Engineering - Techniques
Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution/Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1 for algorithms that assume normally distributed data. Puts features with very different ranges onto the same scale so large magnitudes don't dominate. Good for combining "Age" (ranging from 0–100) and "Annual Income" (ranging from $10,000–$1,000,000).
Robust Scaling: Scales features using the median and Interquartile Range (IQR) to minimize the distorting effects of outliers.
Logarithmic transformation: Squashes skewed data. Good for "Size (sq. feet or sq. meters)".
Missing Value Imputation: Replaces missing numeric values with the mean, median, or a custom placeholder to keep row entries viable for training.
Categorical:
One-Hot Encoding: Converts categorical strings into a series of binary (0 or 1) columns, mapping discrete choices for algorithms requiring numerical input. Good for "City (Name)".
Ordinal / Label Encoding: Maps categories to sequential integers when a distinct, meaningful order exists between them.
Advanced:
Feature Splitting (such as Date/Time): Extracts granular components (such as hour, day of the week, or month) from a single timestamp string to expose cyclical patterns. Good for "Type and Year of Home".
Text Embedding Extraction: Converts raw text strings into dense vector representations using pre-trained Natural Language Processing (NLP) models.
Image Standardization: Resizes, crops, rotates, or converts image channels to grayscale to prepare uniform visual dimensions for deep learning pipelines.
Math:
Principal Component Analysis (PCA): Reduces dataset dimensionality by projecting high-dimensional data onto a smaller set of uncorrelated components while preserving variance.
Synthetic Minority Over-sampling (SMOTE): Balances skewed datasets by generating synthetic examples of underrepresented target classes, reducing the model's bias toward the majority class.
Custom Math Formulations: Uses a built-in calculator to combine columns mathematically, generating new custom interaction metrics.
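As a rough illustration of the numeric techniques above, here is a minimal scikit-learn sketch; the tiny `data` array and its values are invented purely for demonstration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy feature matrix: column 0 = age, column 1 = annual income (values invented for illustration)
data = np.array([[25, 40_000], [38, 60_000], [52, 95_000], [61, 1_000_000]], dtype=float)

min_max = MinMaxScaler().fit_transform(data)      # each column rescaled to the [0, 1] range
standard = StandardScaler().fit_transform(data)   # each column centered to mean 0, std 1
robust = RobustScaler().fit_transform(data)       # median/IQR based, less distorted by the 1M outlier
log_income = np.log1p(data[:, 1])                 # log transform squashes the skewed income column
```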
Common problems are below:
Missing Data
- Impute missing data = fill missing data with something
Impute: Mean Replacement
- Replace missing values with mean value of column
- A column represents a single feature
- Median value of column can be more useful if outliers distort the mean
- e.g. outlier billionaires distorting the income data of average citizens
- Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, simply imputing the mean will muddy that relation a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- Most frequent value in a categorical feature could work though
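A minimal sketch of mean/median/most-frequent imputation using scikit-learn's SimpleImputer; the small arrays are invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[22.0], [35.0], [np.nan], [41.0]])                           # numeric feature with a gap
cities = np.array([["Paris"], ["Paris"], [np.nan], ["Lyon"]], dtype=object)   # categorical feature with a gap

mean_imputer = SimpleImputer(strategy="mean")            # use strategy="median" if outliers distort the mean
mode_imputer = SimpleImputer(strategy="most_frequent")   # most frequent value works for categorical features

print(mean_imputer.fit_transform(ages))    # NaN replaced by the column mean (~32.67)
print(mode_imputer.fit_transform(cities))  # NaN replaced by the most frequent city ("Paris")
```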
Dropping Missing Data
- Reasonable if (all must apply!):
- Not many rows with missing data
- Dropping those rows doesn't bias data
- Need a fast solution
- Almost anything is better though, rarely “best” approach
- e.g. impute from a similar field instead (use the “review summary” to fill a missing “full text”)
- Data is generally valuable, dropping it is generally a bad idea
Impute: using ML
- KNN → Find K “nearest” (most similar) neighbors, i.e. rows, & average values to fill missing data
- Assumes numerical data
- Categorical data can be handled with Hamming distance, but usually DL is better for this
- Deep Learning (DL) → DL model trained on all complete data, then can input missing data on incomplete data
- Good: Works very well for categorical data
- Bad: Complicated compared to other solutions
- Regression → Find linear or non-linear relationships between missing feature and other features
- Multiple Imputation by Chained Equations (MICE) - advanced technique
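As a rough sketch, scikit-learn ships a KNNImputer, and its (experimental) IterativeImputer performs MICE-style regression-based imputation; the feature matrix below is made up:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed before importing IterativeImputer)
from sklearn.impute import IterativeImputer

# Tiny made-up feature matrix: [age, income], with some values missing
X = np.array([[25, 40_000], [38, np.nan], [52, 95_000], [np.nan, 70_000]], dtype=float)

knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)        # average the 2 most similar rows
mice_like = IterativeImputer(random_state=0).fit_transform(X)  # regression-based, MICE-style iterative imputation
```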
Impute: Just Get More Data ;)
- DUH! But often the right solution… try harder to get MOAR data!
Unbalanced Data
- Unbalanced data has a large discrepancy in the occurrence of positives and negatives
- Positive = “Hit” = What you're testing/observing happens
- Nothing to do with “good” or “ethical” outcomes
- Mainly a problem with neural networks (NNs)
- Common example: fraud detection
- Fraud is generally very rare, e.g. 0.1% of the time
- Only 0.1% of the data is positive (fraud), rest is negative → unbalanced
- If we don't deal with unbalanced data, a ML model might learn to always ignore fraud… after all, it almost never happens, so the accuracy is high… but the model is then useless!
Possible solutions
- Oversampling: clone samples from minority class (can be random)
- Undersampling: remove a certain amount of negative samples
- Throwing data away usually not the right answer! (unless wanting to avoid scaling issues)
- Synthetic Minority Over-sampling TEchnique (SMOTE)
- Uses KNN to augment minorities
- Artificially generates new samples by interpolating between existing minority samples and their nearest minority neighbors
- Generally better than plain oversampling
- Adjusting thresholds (precision/recall trade-off)
- Predictions in a classification usually return a probability → need to define a threshold by which you flag a datapoint as positive
- Increasing the threshold can reduce false positives…
- …but also increase false negatives
- Precision/Recall Trade-off/balance → use a threshold that makes sense
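A hedged sketch of both ideas, assuming the separate imbalanced-learn package is available for SMOTE; the dataset is synthetic and the 0.3 threshold is an arbitrary illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # separate "imbalanced-learn" package, not scikit-learn itself

# Synthetic, heavily unbalanced dataset (~1% positives) purely for illustration
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # KNN-based synthetic oversampling of the minority class

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = model.predict_proba(X)[:, 1]                      # predicted probability of the positive class

# Adjust the decision threshold instead of the default 0.5:
# lowering it flags more positives (higher recall, lower precision)
flagged = proba >= 0.3
```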
Outliers
- Variance ($\sigma^2$) = average of the squared differences from the mean
- Standard deviation ($\sigma$) = $\sqrt{\sigma^2}$ = how far values are from the mean, on average
- Can be used to identify outliers (those that are more than $\sigma$ from the mean)
- Possible criterion: object is outlier if it's some multiple of $\sigma$ away from the mean… what multiple exactly? Have to use common sense!
- You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
- Example: dataset = (1, 4, 5, 4, 8)
- Mean = (1+4+5+4+8)/5 = 4.4
- Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
- Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
- Squaring prevents positive and negative variances from cancelling each other
- Squaring also amplifies weight for outliers
- Variance ($\sigma^2$) = avg of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
- Standard deviation ($\sigma$) = $\sqrt{5.04} \approx 2.24$
- 1 and 8 are more than 2.24 away from mean (4.4) → 1 and 8 are outliers
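The same worked example in a few lines of NumPy, flagging points more than one standard deviation from the mean:

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])

mean = data.mean()                       # 4.4
variance = ((data - mean) ** 2).mean()   # 5.04 (population variance, matching the example)
sigma = np.sqrt(variance)                # ~2.24

outliers = data[np.abs(data - mean) > sigma]
print(outliers)                          # [1 8]
```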
- Sometimes appropriate to remove outliers from training data… SOMETIMES!!
- e.g.1 Collaborative filtering → a single user rates lots of stuff compared to others, this user then has a big effect compared to those who rate just a few
- e.g.2 Bots/agents in web log data → they don't represent actual human behavior
- e.g.3 obtain a meaningful graph for the mean income of most of the population… a billionaire can skew this
- BUT! Be responsible! Understand WHY you eliminate outliers!
- e.g. If need an accurate measure of the mean income for ALL US citizens, that includes billionaires, even if few and skew the data… Need accuracy!
- Random Cut Forest algorithm is an example of outlier detection
- Present in many services (QuickSight, SageMaker…)
- In Amazon Managed Service for Apache Flink → RANDOM_CUT_FOREST is a SQL function used for anomaly detection on numeric columns in a stream
Data Transformation
Binning
- Put values into bins/buckets i.e. ranges of values
- e.g. instead of actual age, use 10-19y.o., 20-29y.o., etc
- Transforms numeric data to ordinal data
- Usually not good because you lose data…
- …but really useful when there's uncertainty in measurements! → can minimize measurement errors
- Quantile binning: categorizes data by their place in the data distribution
- Ensures even sizes of bins/buckets
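A small pandas sketch contrasting fixed-width bins (pd.cut) with quantile bins (pd.qcut); the ages are invented:

```python
import pandas as pd

ages = pd.Series([3, 15, 22, 27, 34, 41, 58, 63, 71, 88])   # invented ages

# Fixed-width bins: 0-9, 10-19, 20-29, ... regardless of how many samples land in each
fixed = pd.cut(ages, bins=range(0, 100, 10), right=False)

# Quantile bins: boundaries chosen so each bucket holds roughly the same number of samples
quantile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
```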
Transforming
- Applying some function to a feature to suit it for training
- e.g. apply a log to an exponential trend
- Can replace original feature… or just create a new feature to use side-by side (careful with curse of dimensionality though)
Scaling/Normalization
- A particular type of transforming function
- Transform data to a manageable scale or distribution
  - Remember to scale results back up!
- Most models require data to be scaled to comparable values
  - Otherwise features with large magnitudes will have more weight than desired!
  - e.g. age↔income… incomes have much higher values than ages
- Some models (e.g. neural networks) prefer data to be normally distributed around 0
- Libraries can help (e.g. scikit-learn has a preprocessing module with scalers like MinMaxScaler)
Encoding
- Data transformed into a new representation
- Some ML models require the new representation
- One-hot encoding
- Create “buckets” for every category, each sample has only one bucket with 1, the rest has 0.
- Very common in DL → categories are represented by individual output “neurons”
- Allows handling categorical data (e.g. city names) in neural networks
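A quick sketch of one-hot encoding with pandas; the city column is invented:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris", "Lima"]})   # invented city column

# One binary "bucket" column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(df, columns=["city"], dtype=int)
print(one_hot)   # columns: city_Lima, city_Paris, city_Tokyo with 0/1 values
```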
Shuffling
- Randomize order of rows
- Benefits many algorithms
- Eliminates residual signals in data that correlate to collection order
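For example, shuffling a pandas DataFrame (the tiny frame is invented):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0, 1, 0, 1]})   # tiny example frame

# Shuffle row order; a fixed random_state keeps the shuffle reproducible
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```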