ML Exam: 6 - Feature Engineering


Feature Engineering - Techniques

Features:
Pearson Correlation Coefficient: Measures the linear correlation between two features. 0 = no linear relationship (which does not by itself prove independence), 1 (or -1) = perfect positive (or negative) linear relationship. Use a Naive Bayes model if features are independent, otherwise use a full Bayesian network if they are dependent.
Recursive Feature Elimination (RFE): Iteratively trains the model, ranks features by importance (e.g., based on coefficients in logistic regression), removes the least important features, and repeats the process until the target number of features remains.
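
A minimal sketch of the Pearson coefficient in plain Python, to make the 0 vs. ±1 reading concrete. The `pearson` helper and the toy data are made up for illustration. The second series is a parabola: it depends on x completely, yet its linear correlation is 0.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]      # perfectly linear in hours
xs = [-2, -1, 0, 1, 2]
parabola = [4, 1, 0, 1, 4]    # y = x^2: dependent on x, but not linearly

r_linear = pearson(hours, score)      # close to 1
r_parabola = pearson(xs, parabola)    # close to 0
print(r_linear, r_parabola)
```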

Outliers:
Random Cut Forest = outlier detection.

Images:
Image Standardization: Resizes, crops, rotates, or converts images to grayscale so that inputs are uniform for DL pipelines.

Recommendations:
Factorization Machines (FM): For recommendation systems and sparse data prediction (e.g., product recommendations).

Data Cleansing:
Missing Value Imputation: Replaces NULL with the mean, the median, an interpolated value, or a custom placeholder.
Outlier Detection: Finds anomalies using standard deviation or Interquartile Range (IQR) rules.
Duplicate Row Removal: Strips out exact or near-identical rows.
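
A rough sketch of the IQR rule for outlier detection, using only the standard library. The 1.5 × IQR fences are the common convention, not something these notes prescribe, and the `iqr_outliers` helper and sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical readings with one obvious anomaly
readings = [10, 12, 11, 13, 12, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Different quantile conventions give slightly different fences on small samples, so borderline points may flip between tools.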

Time-Series / Datetime:
Feature Splitting/Auto Datetime: Extract day, month, year, or specific hour from generic timestamps.
Time-Series Resampling: Adjusting the frequency of time-stamped transactional records to establish fixed, regular intervals for chronological analysis.
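
A small sketch of both datetime techniques with the standard library only. The function names, the timestamp format (ISO 8601), and the sample events are assumptions for illustration. Resampling here means summing amounts into hourly buckets and filling empty hours with 0.

```python
from datetime import datetime, timedelta

def split_timestamp(ts):
    """Feature splitting: break one timestamp into separate numeric features."""
    dt = datetime.fromisoformat(ts)
    return {"year": dt.year, "month": dt.month, "day": dt.day,
            "hour": dt.hour, "weekday": dt.weekday()}  # Monday = 0

def resample_hourly(events):
    """Resampling: sum irregular (timestamp, amount) events into fixed hourly buckets."""
    totals = {}
    for ts, amount in events:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        totals[hour] = totals.get(hour, 0.0) + amount
    # Walk from the first to the last hour so gaps become explicit zeros.
    current, end = min(totals), max(totals)
    series = []
    while current <= end:
        series.append((current, totals.get(current, 0.0)))
        current += timedelta(hours=1)
    return series

features = split_timestamp("2024-03-15T14:30:00")

# Hypothetical transactions at irregular times
events = [("2024-03-15T14:05:00", 10.0),
          ("2024-03-15T14:40:00", 5.0),
          ("2024-03-15T16:10:00", 7.5)]
hourly = resample_hourly(events)
print(features)
print([amount for _, amount in hourly])
```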

Data Balancing:
Synthetic Minority Over-sampling (SMOTE): Balances skewed datasets by using KNN to generate synthetic examples of underrepresented target classes. 
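
A simplified sketch of the SMOTE idea in plain Python, not a reference implementation: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line between them. The `smote_sketch` name, the parameters, and the toy points are assumptions for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points (two numeric features)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.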

Dimensionality Reduction:
Principal Component Analysis (PCA): Condenses data with lots of columns into fewer, uncorrelated features while retaining as much information (variance) as possible. Stabilizes linear models. Good for data with many highly correlated columns.

Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution/Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1, for algorithms that assume normally distributed data. Puts wide number ranges onto the same scale so that huge numbers do not dominate. Good for combining "Age" (ranging from 0–100) and "Annual Income" (ranging from $10,000–$1,000,000).
Robust Scaling: Scales features using the median and Interquartile Range (IQR) to minimize the distorting effects of outliers.
Log Transformation: Compresses skewed data. Good for "Size (sq. feet or sq. meters)".
Dynamic Column Re-typing: Explicit manual schema alteration converting general fields into specialized numeric data variants.
Target Leakage / Prediction Power Weighting: Built-in reporting metrics that transform or filter highly correlated numeric anomalies before modeling.
Aggregated Features: Group rows by a category and compute aggregates such as AvgPurchasePerCustomer or MaxLoginTime.
Add Training Set Frequency Column: for regressions.
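
The scalers above can be sketched in a few lines of plain Python. The helper names and the income values are made up for illustration. The single huge income shows why robust scaling exists: min-max squeezes the ordinary values close to 0, while median/IQR scaling keeps them spread out.

```python
import math
import statistics

def min_max_scale(values):
    """Rescale to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Center on mean 0 with (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical incomes with one extreme value
incomes = [10_000, 20_000, 30_000, 40_000, 1_000_000]

scaled_min_max = min_max_scale(incomes)
scaled_z = z_score_scale(incomes)
scaled_robust = robust_scale(incomes)
scaled_log = [math.log10(v) for v in incomes]  # log transform compresses the long tail

print(scaled_min_max)
print(scaled_robust)
```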

Categorical:
One-Hot Encoding: "City Name" ("NY", "Paris", "Tokyo") becomes City_NY, City_Paris, City_Tokyo columns, with exactly one 1 per row. Good for multiple-choice fields where each choice becomes its own column.
Binning: Ages (22, 45, 61) into life stages (Young Adult, Middle Aged, Senior).
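Ordinal/Label Encoding: ("Low", "Medium", "High") into (0, 1, 2) for ordinal data. Label encoding can also map city names ("Dallas", "Paris", "London") to a single City column of (0, 1, 2), but the numbers then imply an order that does not exist, which can mislead models that treat the values as magnitudes.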
Binary Encoding: Transforms categories into integers, then splits each integer into its binary digits (one column per bit) to save memory compared with one-hot encoding.
Tokenization: Splitting continuous text docs into smaller units like individual words or subwords.
Target Encoding: Replaces each category value with the average value of the target variable for that category, capturing the relationship between category and target in a single numeric column.
Frequency Encoding: Replaces each category with its total count or % frequency to show common vs. rare data points.
Ordinal Encoding: Explicitly mapping ordered categories to specific numbers based on a strict, manually predefined hierarchical scale.
Base-N Encoding: Generalizes binary encoding by using a base greater than 2, trading off the number of new columns against memory usage.
Effect Encoding: Using a grid system of -1, 0, and 1 to represent baseline deviations in linear regression models.
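
A quick sketch of four of the encodings above in plain Python. The helper names, the `bought` target, and the sample rows are hypothetical. One caveat worth remembering: target encoding should be computed on training data only, otherwise the target leaks into the feature.

```python
from collections import Counter, defaultdict

def one_hot(values, prefix):
    """One column per category, exactly one 1 per row."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

def ordinal_encode(values, order):
    """Map ordered categories to integers using an explicit order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

def frequency_encode(values):
    """Replace each category with its relative frequency."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

def target_encode(values, targets):
    """Replace each category with the mean target value of that category."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]

# Hypothetical rows
cities = ["NY", "Paris", "Tokyo", "Paris", "NY", "Paris"]
bought = [1, 0, 1, 1, 0, 1]  # made-up binary target

city_one_hot = one_hot(cities, "City")
risk_levels = ordinal_encode(["Low", "High", "Medium"], order=["Low", "Medium", "High"])
city_freq = frequency_encode(cities)
city_target = target_encode(cities, bought)

print(city_one_hot[0])
print(risk_levels, city_freq, city_target)
```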

Advanced:
Custom Math Formulas: Uses a built-in calculator to combine columns mathematically, generating new custom interaction metrics.
eXtreme Gradient Boosting (XGBoost): Builds weak decision trees sequentially, with each new tree correcting the errors of the previous ones. For classification & regression problems.
Interaction Columns: Multiplying or combining 2 distinct columns together (e.g., Price × Quantity = Total_Spent).
Shuffling: Randomizes the order of rows. Helps by removing patterns that only reflect the order in which the data was collected.
Single Shot MultiBox Detector (SSD): real-time object detection algorithm.
Text Embedding Extraction: Converts raw text strings into dense vectors using NLP models.


Common problems are below: 

Missing Data

  • Impute missing data = fill missing data with something

Impute: Mean Replacement

  • Replace missing values with mean value of column
    • A column represents a single feature
    • Median value of column can be more useful if outliers distort the mean
      • e.g. outlier billionaires distorting the income data of average citizens
  • Pros
    • Fast & easy
    • Doesn't affect mean or sample size of overall data set
  • Cons: pretty terrible
    • Not very accurate
    • Misses correlations between features (only works on column level)
      • If age & income are correlated, simply imputing the mean will muddy that relation a lot
    • Mean/median can only be calculated on numeric features, not on categorical features
      • Most frequent value in a categorical feature could work though
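
A minimal sketch of column-level imputation, assuming missing values are represented as `None`. The `impute` helper and the sample columns are hypothetical. The billionaire row shows why the median is often the safer fill value.

```python
import statistics

def impute(column, strategy="mean"):
    """Fill None entries with the mean, median, or most frequent observed value."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)  # also works for categorical values
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

# Hypothetical income column with one billionaire and one missing value
incomes = [30_000, 40_000, None, 50_000, 1_000_000_000]
by_mean = impute(incomes, "mean")
by_median = impute(incomes, "median")

# Hypothetical categorical column
colors = ["red", None, "blue", "red"]
by_mode = impute(colors, "most_frequent")

print(by_mean[2], by_median[2], by_mode[1])
```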

Dropping Missing Data

  • Reasonable if (all must apply!): Not many rows with missing data AND dropping those rows doesn't bias data AND need a fast solution. 
  • Almost anything is better though, rarely “best” approach
    • e.g. impute similar field (impute “review summary” into “full text”)
    • Data is generally valuable, dropping it is generally a bad idea

Impute: using ML

  1. KNN → Find the K “nearest” (most similar) neighbors, i.e. rows, & average their values to fill in the missing data. Assumes numerical data. Categorical data can be handled with Hamming distance, but usually DL is better for this
  2. Deep Learning (DL) → A DL model is trained on all complete data and can then impute the missing values in incomplete rows. Good: Works very well for categorical data. Bad: Complicated.
  3. Regression → Find linear or non-linear relationships between the missing feature and the other features, e.g. using Multiple Imputation by Chained Equations (MICE).
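
A bare-bones sketch of option 1 (KNN imputation) in plain Python. The `knn_impute` helper and the age/income rows are made up, and the sketch assumes only one column has gaps and all other features are numeric. In practice the features should be scaled first so that no single column dominates the distance.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing (None) values in column target_idx with the mean of the k nearest complete rows."""
    def other_features(row):
        return [v for i, v in enumerate(row) if i != target_idx]

    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for row in rows:
        new_row = list(row)
        if row[target_idx] is None:
            # Rank complete rows by Euclidean distance on the remaining features.
            nearest = sorted(
                complete,
                key=lambda r: math.dist(other_features(row), other_features(r)),
            )[:k]
            new_row[target_idx] = sum(r[target_idx] for r in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Hypothetical (age, income) rows; the last income is missing
rows = [(25, 30_000), (27, 34_000), (52, 90_000), (55, 95_000), (26, None)]
filled = knn_impute(rows, target_idx=1, k=2)
print(filled[-1])
```

Unlike mean replacement, the filled value follows the age/income relationship: the 26-year-old gets an income close to that of the other people in their twenties.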

Impute: Just Get More Data

  • Often the right solution… try harder to get more data!

Unbalanced Data

  • Unbalanced data has a large discrepancy in the occurrence of positives and negatives
    • Positive = “Hit” = What you're testing/observing happens
      • Nothing to do with “good” or “ethical” outcomes
    • Mainly a problem with neural networks (NNs)
  • Common example: fraud detection
    • Fraud is generally very rare, e.g. 0.1% of the time
      • Only 0.1% of the data is positive (fraud), rest is negative → unbalanced
    • If we don't deal with unbalanced data, a ML model might learn to always ignore fraud… after all, it almost never happens, so the accuracy is high… but the model is then useless!

Possible solutions

  • Oversampling: clone samples from minority class (can be random)
  • Undersampling: remove a certain amount of negative samples
    • Throwing data away usually not the right answer! (unless wanting to avoid scaling issues)
  • Synthetic Minority Over-sampling TEchnique (SMOTE)
    • Uses KNN to augment minorities
      • artificially generates new samples by interpolating between a minority sample and its nearest minority-class neighbors
    • Generally better than plain oversampling
  • Adjusting thresholds (precision/recall trade-off)
    • Predictions in a classification usually return a probability → need to define a threshold by which you flag a datapoint as positive
    • Increasing the threshold can reduce false positives…
      • …but also increase false negatives
      • Precision/Recall Trade-off/balance → use a threshold that makes sense
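
The threshold trade-off above can be seen with a few made-up scores. This is only an illustrative sketch: `precision_recall`, the labels, and the predicted probabilities are hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Flag scores >= threshold as positive and return (precision, recall)."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = fraud) and model probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.45, 0.40, 0.60, 0.70, 0.90]

low_threshold = precision_recall(y_true, scores, threshold=0.30)
high_threshold = precision_recall(y_true, scores, threshold=0.65)
print(low_threshold, high_threshold)
```

With the low threshold every fraud case is caught but two legitimate rows are flagged. With the high threshold the false positives disappear, but one fraud case is missed.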

Outliers

  • Variance ($𝜎^2$) = average of the squared differences from the mean
  • Standard deviation (𝜎) = $\sqrt{𝜎^2}$ = how far from mean are values on average
    • Can be used to identify outliers (those that are >𝜎 from mean)
      • Possible criterion: object is outlier if it's multiple 𝜎 away from mean… what multiple exactly? Have to use common sense!
      • You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
  • Example: dataset = (1, 4, 5, 4, 8)
    • Mean = (1+4+5+4+8)/5 = 4.4
    • Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
    • Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
      • Squaring prevents positive and negative differences from cancelling each other
      • Squaring also amplifies weight for outliers
    • Variance ($𝜎^2$) = avg of the squared differences = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
    • Standard deviation (𝜎) = $\sqrt{5.04}$ ≈ 2.24
      • 1 and 8 are more than 1𝜎 (2.24) away from the mean (4.4) → with a 1𝜎 criterion, 1 and 8 are outliers
  • Sometimes appropriate to remove outliers from training data…  SOMETIMES!!
    • e.g.1 Collaborative filtering → a single user rates lots of stuff compared to others, this user then has a big effect compared to those who rate just a few
    • e.g.2 Bots/agents in web log data → they don't represent actual human behavior
    • e.g.3 obtain a meaningful graph for the mean income of most of the population… a billionaire can skew this
  • BUT! Be responsible! Understand WHY you eliminate outliers!
    • e.g. If you need an accurate measure of the mean income for ALL US citizens, that includes billionaires, even though they are few and skew the data… Need accuracy!
  • Random Cut Forest algorithm is an example of outlier detection
    • Present in many services (QuickSight, SageMaker…)
    • In Amazon Managed Service for Apache Flink → RANDOM_CUT_FOREST is a SQL function used for anomaly detection on numeric columns in a stream
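
The worked example with the dataset (1, 4, 5, 4, 8) can be reproduced in a few lines. This is just a sketch: it uses the population variance (divide by N), as the notes do, and the multiple of 𝜎 is left as a parameter because the right cut-off depends on the data.

```python
import math

def sigma_outliers(data, multiple=1.0):
    """Return (mean, variance, sigma, outliers) using population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    variance = sum(squared_diffs) / len(data)
    sigma = math.sqrt(variance)
    # A point is flagged when it lies more than `multiple` sigmas from the mean.
    outliers = [x for x in data if abs(x - mean) > multiple * sigma]
    return mean, variance, sigma, outliers

data = [1, 4, 5, 4, 8]
mean, variance, sigma, outliers = sigma_outliers(data, multiple=1.0)
print(round(mean, 2), round(variance, 2), round(sigma, 2), outliers)
```

With `multiple=1.0` this flags 1 and 8, matching the example. On a sample this small a stricter cut-off such as 2𝜎 flags nothing.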

