Posts

ML Exam: 6 - Feature Engineering

Feature Engineering - Basic Concepts. Feature engineering means applying domain knowledge (your knowledge of the data and of the model you're using) to create better features to train your model. This is the art of ML: it is the most critical part of a good ML implementation, and talented ML specialists are good at it.

Curse of dimensionality: more features is not better, because every feature is a new dimension. Much of feature engineering is selecting the most relevant features, which is where domain knowledge comes into play. Unsupervised dimensionality reduction techniques can help (PCA, K-Means). Common problems are below.

Missing Data: to impute missing data is to fill it in with something. Impute: Mean Replacement. Replace missing values with the mean value of the column (a column represents a single feature). The median value of the column can be more useful if outliers distort the mean, e.g. outlier billionaires distorting the income data of average citizens. Pros: Fast & ea...
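As a rough illustration of mean versus median replacement, here is a minimal sketch using only the Python standard library. The `impute` helper and the income figures are made up for this example and are not from the original post; in practice a library such as pandas or scikit-learn would usually do this.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Return a copy of `values` with None entries filled in.

    `values` is one column (one feature); None marks a missing entry.
    `strategy` is either "mean" or "median".
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical income column with one missing value and one billionaire outlier.
incomes = [40_000, 50_000, None, 60_000, 1_000_000_000]

mean_filled = impute(incomes, "mean")      # fill value is dragged up by the outlier
median_filled = impute(incomes, "median")  # fill value stays near typical incomes

print(mean_filled[2], median_filled[2])
```

With this made-up column, the mean-based fill lands in the hundreds of millions while the median-based fill stays between the typical incomes, which is the outlier effect described above.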

ML Exam: 5 - Products

Data Analytics. Data Analysis & Visualization: QuickSight = interactive dashboards and reports over data. Athena = serverless SQL on S3 for ad-hoc queries and data lake analysis; cost-effective, and runs SQL in parallel. Redshift = fully managed data warehouse for structured or semi-structured data (mnemonic: Oracle was "Big Red", and Redshift is about shifting away from Oracle data warehouses). It is scalable with a pay-as-you-go pricing model, runs SQL across data warehouses, data lakes, and operational DBs, and can run either provisioned or serverless.

Data Pipelines: Kinesis Data Streams is for real-time data from apps, streams + sensors, with auto provisioning and scaling in on-demand mode. Kinesis is for real-time streaming event data and instant analytics/metrics over those streams. Data Firehose is for near real-time data: a fully managed service with auto provisioning and scaling that delivers data to storage and services.

Data Processing: Glue is serverles...

ML Exam: 4 - Data

Types of Structure of Data

Properties of Data (4 Vs)
1. Volume (Size) - GBs? PBs?…
2. Velocity - High velocity → real-time or near-real-time processing
3. Variety - Structured? Mixed? Multiple sources? Multiple formats?
4. Veracity?

Data Warehouses, Data Lakes, Data Lakehouses
- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format
- Data Lake (e.g. Amazon S3 can be used as a data lake) - Storage repository that holds vast amounts of raw data in its native format (a predefined structure is not necessary). Structured, semi-structured, & unstructured data
- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis
- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data ...