ML Exam: 6 - Feature Engineering
Feature Engineering - Techniques
Features:
Pearson Correlation Coefficient: Measures the linear correlation between two features. 0 = no linear correlation, 1 (or -1) = perfect positive (or negative) linear correlation. A value near 0 suggests, but does not prove, independence. Use a Naive Bayes model if features are independent; otherwise use a full Bayesian network if they are dependent.
Recursive Feature Elimination (RFE): Iteratively trains the model, ranks features by importance (e.g., based on coefficients in logistic regression), removes the least important features, and repeats the process until the target number of features remains.
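A minimal sketch of the Pearson coefficient in plain Python, assuming two equal-length numeric columns. The `age`, `income`, and `symmetric` values are made up for illustration; the third column depends on age, but not linearly, so its coefficient comes out at 0 even though the features are clearly not independent.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

age = [20, 30, 40, 50, 60]
income = [25, 35, 45, 55, 65]   # hypothetical: rises in lockstep with age
symmetric = [4, 1, 0, 1, 4]     # hypothetical: (age - 40)^2 / 100, a non-linear dependence

r_linear = pearson(age, income)        # perfectly linear relationship
r_nonlinear = pearson(age, symmetric)  # dependent, yet no linear correlation
print(r_linear, r_nonlinear)
```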
Outliers:
Random Cut Forest: Outlier (anomaly) detection.
Images:
Image Standardization: Resizes, crops, rotates, or converts images to grayscale so they are uniform for DL pipelines.
Recommendations:
Factorization Machines (FM): For recommendation systems and sparse data prediction (e.g., product recommendations).
Data Cleansing:
Missing Value Imputation: Replaces NULL with the mean, median, an interpolated value, or a custom placeholder.
Outlier Detection: Finding anomalies by standard deviation or Interquartile Range (IQR) formulas.
Duplicate Row Removal: Strip out exact or near-identical matching rows.
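The IQR rule and exact-duplicate removal can be sketched with the standard library. The 1.5 × IQR fence is a common convention rather than something these notes prescribe, and the `incomes` and `rows` values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 900]  # hypothetical, in thousands
outliers = iqr_outliers(incomes)

# Exact duplicate removal that keeps the first occurrence and the original order.
rows = [("alice", 30), ("bob", 41), ("alice", 30)]
unique_rows = list(dict.fromkeys(rows))

print(outliers, unique_rows)
```

Near-identical rows need a similarity rule (e.g. normalizing case or whitespace first), which this sketch does not attempt.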
Time-Series / Datetime:
Feature Splitting/Auto Datetime: Extract day, month, year, or specific hour from generic timestamps.
Time-Series Resampling: Adjusting the frequency of time-stamped transactional records to establish fixed, regular intervals for chronological analysis.
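Both datetime techniques can be illustrated with `datetime` alone. This is a sketch under assumptions: the `events` list and the timestamp format are invented, and empty hourly buckets are filled with 0.0, which is only one of several reasonable choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
events = [  # hypothetical (timestamp, amount) transactions
    ("2024-03-15 09:12:00", 20.0),
    ("2024-03-15 09:47:00", 5.0),
    ("2024-03-15 11:05:00", 12.5),
]

def split_datetime(ts):
    """Feature splitting: turn one timestamp into several numeric features."""
    dt = datetime.strptime(ts, FMT)
    return {"year": dt.year, "month": dt.month, "day": dt.day,
            "hour": dt.hour, "weekday": dt.weekday()}  # Monday = 0

features = [split_datetime(ts) for ts, _ in events]

# Resampling: sum irregular events into fixed one-hour buckets.
hourly = defaultdict(float)
for ts, amount in events:
    bucket = datetime.strptime(ts, FMT).replace(minute=0, second=0)
    hourly[bucket] += amount

# Walk the full range so hours with no events still appear in the series.
series = []
t, end = min(hourly), max(hourly)
while t <= end:
    series.append((t, hourly.get(t, 0.0)))
    t += timedelta(hours=1)

print(features[0], [value for _, value in series])
```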
Data Balancing:
Synthetic Minority Over-sampling Technique (SMOTE): Balances skewed datasets by using KNN to generate synthetic examples of underrepresented target classes.
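A simplified sketch of the SMOTE idea, not a reference implementation: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line between them. The function name `smote`, its parameters, and the `fraud` points are all hypothetical.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

fraud = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]  # hypothetical minority class
new_points = smote(fraud, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority samples, it stays inside the region the minority class already occupies.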
Dimensionality Reduction:
Principal Component Analysis (PCA): Condenses data with many columns into fewer, uncorrelated features while retaining as much information (variance) as possible. Stabilizes linear models. Good for data with many highly correlated features.
Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution/Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1, for algorithms that assume normally distributed data. Puts features with very different ranges on the same scale so that large-magnitude features do not dominate. Good for combining "Age" (ranging from 0–100) and "Annual Income" (ranging from $10,000–$1,000,000).
Robust Scaling: Scales features using the median and Interquartile Range (IQR) to minimize the distorting effects of outliers.
Log transformation: Squashes skewed data. Good for "Size (sq. feet or sq. meters)".
Dynamic Column Re-typing: Explicit manual schema alteration converting general fields into specialized numeric data variants.
Target Leakage / Prediction Power Weighting: Built-in reporting metrics that transform or filter highly correlated numeric anomalies before modeling.
Aggregated Features: Group rows by a category to calculate summaries like AvgPurchasePerCustomer or MaxLoginTime.
Add Training Set Frequency Column: for regressions.
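The four numeric rescalings above can be compared side by side on one skewed column. This is a sketch with a hypothetical `income` column; `method="inclusive"` is chosen for the quartiles so the small sample's extreme value does not leak into Q3.

```python
import math
import statistics

def min_max(values):
    """Rescale to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center at mean 0 with standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust(values):
    """Center at the median and divide by the IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

income = [10_000, 20_000, 30_000, 40_000, 1_000_000]  # hypothetical, one extreme value

scaled_minmax = min_max(income)
scaled_z = z_score(income)
scaled_robust = robust(income)
scaled_log = [math.log10(v) for v in income]

print(scaled_minmax, scaled_z, scaled_robust, scaled_log, sep="\n")
```

With min-max scaling the four ordinary incomes are squeezed close to 0 by the one extreme value, whereas robust scaling keeps them spread around 0 and the log transform compresses the whole range to roughly 4 through 6.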
Categorical:
One-Hot Encoding: "City Name" ("NY", "Paris", "Tokyo") to City_NY, City_Paris, City_Tokyo columns, with exactly one column set to 1 in each row. Good for multi-choice data, with each choice as its own column.
Binning: Ages (22, 45, 61) into life stages (Young Adult, Middle Aged, Senior).
Ordinal/Label Encoding: ("Low", "Medium", "High") into (0, 1, 2) for ordinal data. City name ("Dallas", "Paris", "London") to a single City column of (0, 1, 2); note that the numbers then imply an order the cities do not have.
Binary Encoding: Transform categories into integers and then split them into binary digits to save memory.
Tokenization: Splitting continuous text docs into smaller units like individual words or subwords.
Target Encoding: Replace each category value with the average value of the target variable for that category, to capture the relationship between category and target.
Frequency Encoding: Replace each category with its total count or % frequency to show common vs. rare data points.
Ordinal Encoding: Explicitly mapping ordered categories to specific numbers based on a strict, manually predefined hierarchical scale.
Base-N Encoding: Generalizes binary encoding to a base N > 2, trading off the number of new columns against memory usage.
Effect Encoding: Codes categories with -1, 0, and 1 so that coefficients in linear regression models represent deviations from a baseline.
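Several of these encodings fit in a few lines of plain Python. The sketch below assumes a tiny made-up dataset: the `bought` target, the bin edges (35 and 60), and all variable names are hypothetical.

```python
import bisect
from collections import Counter, defaultdict

cities = ["NY", "Paris", "Tokyo", "Paris", "NY", "Paris"]
bought = [1, 0, 1, 1, 0, 1]  # hypothetical binary target, one value per row

# One-hot: one 0/1 column per category, exactly one 1 per row.
categories = sorted(set(cities))  # ['NY', 'Paris', 'Tokyo']
one_hot = [[int(city == cat) for cat in categories] for city in cities]

# Ordinal: an explicit, manually defined order.
levels = {"Low": 0, "Medium": 1, "High": 2}
ordinal = [levels[v] for v in ["Low", "High", "Medium"]]

# Frequency: share of rows that carry each category.
counts = Counter(cities)
frequency = [counts[city] / len(cities) for city in cities]

# Target: mean of the target per category.
by_city = defaultdict(list)
for city, y in zip(cities, bought):
    by_city[city].append(y)
target_mean = {city: sum(ys) / len(ys) for city, ys in by_city.items()}
target_encoded = [target_mean[city] for city in cities]

# Binary: category index written in base 2, one column per bit.
width = max(1, (len(categories) - 1).bit_length())
binary = [[int(bit) for bit in format(categories.index(city), f"0{width}b")]
          for city in cities]

# Binning: map ages to life stages using assumed edges.
edges, stages = [35, 60], ["Young Adult", "Middle Aged", "Senior"]
binned = [stages[bisect.bisect_right(edges, age)] for age in [22, 45, 61]]

print(one_hot[0], ordinal, frequency[1], target_mean, binary[1], binned)
```

For target encoding, the per-category means should be computed on training data only; computing them on rows that are later used for evaluation leaks the target.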
Advanced:
Custom Math Formulas: Uses a built-in calculator to combine columns mathematically, generating new custom interaction metrics.
eXtreme Gradient Boosting (XGBoost): Creates weak decision trees sequentially, with each new tree correcting the errors of the prior ones. For classification and regression problems.
Interaction Columns: Multiplying or combining 2 distinct columns together (e.g., Price × Quantity = Total_Spent).
Shuffling: Randomizes the order of rows. Helps because it keeps the model from picking up artifacts of the order in which the data was collected.
Single Shot MultiBox Detector (SSD): Real-time object detection algorithm.
Text Embedding Extraction: Converts raw text strings into dense vectors using NLP models.
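Interaction columns and shuffling are simple enough to show directly. A sketch with hypothetical order rows; the fixed seed is only there to make the shuffle repeatable.

```python
import random

rows = [  # hypothetical order lines
    {"price": 2.5, "quantity": 4},
    {"price": 10.0, "quantity": 1},
    {"price": 3.0, "quantity": 3},
]

# Interaction column: combine two existing columns into a new feature.
for row in rows:
    row["total_spent"] = row["price"] * row["quantity"]

# Shuffling: randomize row order on a copy, leaving the original list untouched.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)

print([row["total_spent"] for row in rows])
```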
Common problems are below:
Missing Data
- Impute missing data = fill missing data with something
Impute: Mean Replacement
- Replace missing values with mean value of column
- A column represents a single feature
- Median value of column can be more useful if outliers distort the mean
- e.g. outlier billionaires distorting the income data of average citizens
- Pros
- Fast & easy
- Doesn't affect mean or sample size of overall data set
- Cons: pretty terrible
- Not very accurate
- Misses correlations between features (only works on column level)
- If age & income are correlated, simply imputing the mean will muddy that relation a lot
- Mean/median can only be calculated on numeric features, not on categorical features
- Most frequent value in a categorical feature could work though
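The mean, median, and most-frequent strategies above can be sketched with one small helper. Missing values are represented as `None` here, and the `income` and `color` columns are made up; note how the single extreme income drags the mean far above the median.

```python
import statistics

def impute(column, strategy):
    """Fill None entries with strategy(observed values) for that column."""
    observed = [v for v in column if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in column]

income = [30, 35, None, 40, 1000]     # hypothetical, with one extreme value
color = ["red", None, "blue", "red"]  # hypothetical categorical feature

income_mean = impute(income, statistics.mean)      # fills with 276.25
income_median = impute(income, statistics.median)  # fills with 37.5
color_mode = impute(color, statistics.mode)        # fills with 'red'

print(income_mean, income_median, color_mode)
```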
Dropping Missing Data
- Reasonable if (all must apply!): Not many rows with missing data AND dropping those rows doesn't bias data AND need a fast solution.
- Almost anything is better though, rarely “best” approach
- e.g. impute similar field (impute “review summary” into “full text”)
- Data is generally valuable, dropping it is generally a bad idea
Impute: using ML
- KNN → Find K “nearest” (most similar) neighbors, i.e. rows, & average values to fill missing data. Assumes numerical data. Categorical data can be handled with Hamming distance, but usually DL is better for this
- Deep Learning (DL) → DL model trained on all complete data, which can then impute the missing values in incomplete rows. Good: Works very well for categorical data. Bad: Complicated.
- Regression → Find linear or non-linear relationships between missing feature and other features using Multiple Imputation by Chained Equations (MICE).
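A minimal sketch of KNN imputation, assuming numeric rows where only one column (`target`) can be missing and all other features are present. The function name `knn_impute` and the `people` data are hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]

    def others(row):
        # All features except the one being imputed.
        return [v for i, v in enumerate(row) if i != target]

    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            nearest = sorted(complete,
                             key=lambda r: math.dist(others(row), others(r)))[:k]
            new_row[target] = sum(r[target] for r in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Columns: age, income in thousands (hypothetical).
people = [[25, 30.0], [27, 34.0], [52, 80.0], [55, 90.0], [26, None]]
result = knn_impute(people, target=1)
print(result[-1])
```

The 26-year-old gets the average income of the two nearest ages (25 and 27), i.e. 32.0, whereas plain mean replacement would have filled in 58.5 and blurred the age/income relationship.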
Impute: Just Get More Data
Often the right solution… try harder to get more data!
Unbalanced Data
- Unbalanced data has a large discrepancy in the occurrence of positives and negatives
- Positive = “Hit” = What you're testing/observing happens
- Nothing to do with “good” or “ethical” outcomes
- Mainly a problem with neural networks (NNs)
- Common example: fraud detection
- Fraud is generally very rare, e.g. 0.1% of the time
- Only 0.1% of the data is positive (fraud), rest is negative → unbalanced
- If we don't deal with unbalanced data, an ML model might learn to always predict "no fraud"… after all, fraud almost never happens, so the accuracy is high… but the model is then useless!
Possible solutions
- Oversampling: clone samples from minority class (can be random)
- Undersampling: remove a certain amount of negative samples
- Throwing data away usually not the right answer! (unless wanting to avoid scaling issues)
- Synthetic Minority Over-sampling TEchnique (SMOTE)
- Uses KNN to augment minorities
- Artificially generates new samples by interpolating between a minority sample and its nearest minority neighbors
- Generally better than plain oversampling
- Adjusting thresholds (precision/recall trade-off)
- Predictions in a classification usually return a probability → need to define a threshold by which you flag a datapoint as positive
- Increasing the threshold can reduce false positives…
- …but also increase false negatives
- Precision/Recall Trade-off/balance → use a threshold that makes sense
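The threshold trade-off can be made concrete with a handful of made-up predictions. In this sketch `labels` and `scores` are hypothetical; a point is flagged positive when its score is at or above the threshold.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when flagging score >= threshold as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 0, 0, 0]                    # 1 = fraud (hypothetical)
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]   # model probabilities (hypothetical)

low = precision_recall(labels, scores, 0.35)   # catches all fraud, one false alarm
mid = precision_recall(labels, scores, 0.5)
high = precision_recall(labels, scores, 0.8)   # no false alarms, misses most fraud
print(low, mid, high)
```

Raising the threshold from 0.35 to 0.8 moves precision from 0.75 to 1.0 while recall drops from 1.0 to one third, which is the trade-off described above.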
Outliers
- Variance ($\sigma^2$) = average of the squared differences from the mean
- Standard deviation (σ) = $\sqrt{\sigma^2}$ = roughly how far values are from the mean on average
- Can be used to identify outliers (those more than some multiple of σ from the mean)
- Possible criterion: a data point is an outlier if it's a certain multiple of σ away from the mean… what multiple exactly? Have to use common sense!
- You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
- Example: dataset = (1, 4, 5, 4, 8)
- Mean = (1+4+5+4+8)/5 = 4.4
- Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
- Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
- Squaring prevents positive and negative differences from cancelling each other
- Squaring also amplifies weight for outliers
- Variance ($\sigma^2$) = avg of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
- Standard deviation (σ) = $\sqrt{5.04}$ ≈ 2.24
- 1 and 8 are more than 2.24 (one σ) away from the mean (4.4) → with a 1σ criterion, 1 and 8 are outliers
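The worked example can be checked with the `statistics` module. The population variants (`pvariance`, `pstdev`) are used because the example divides by 5, not 4; the helper name `sigma_outliers` is made up.

```python
import statistics

data = [1, 4, 5, 4, 8]

mean = statistics.fmean(data)          # 4.4
variance = statistics.pvariance(data)  # population variance, divides by n
std = statistics.pstdev(data)          # square root of the variance

def sigma_outliers(values, k):
    """Values more than k standard deviations away from the mean."""
    m, s = statistics.fmean(values), statistics.pstdev(values)
    return [v for v in values if abs(v - m) > k * s]

one_sigma = sigma_outliers(data, 1)  # the criterion used in the example
two_sigma = sigma_outliers(data, 2)  # a stricter criterion flags nothing here
print(mean, variance, round(std, 2), one_sigma, two_sigma)
```

The choice of multiple matters: at 1σ both 1 and 8 are flagged, at 2σ none of the five points is.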
- Sometimes appropriate to remove outliers from training data… SOMETIMES!!
- e.g.1 Collaborative filtering → a single user rates lots of stuff compared to others, this user then has a big effect compared to those who rate just a few
- e.g.2 Bots/agents in web log data → they don't represent actual human behavior
- e.g.3 obtain a meaningful graph for the mean income of most of the population… a billionaire can skew this
- BUT! Be responsible! Understand WHY you eliminate outliers!
- e.g. If you need an accurate measure of the mean income for ALL US citizens, that includes billionaires, even if they are few and skew the data… Need accuracy!
- Random Cut Forest algorithm is an example of outlier detection
- Present in many services (QuickSight, SageMaker…)
- In Amazon Managed Service for Apache Flink → RANDOM_CUT_FOREST is a SQL function used for anomaly detection on numeric columns in a stream