Posts

ML Exam: 7 - End-to-End Process

1. Define Business Problem and Data Objectives
Pick the core metric to optimize (e.g., churn rate, fraud detection rate). Determine whether the problem requires supervised, unsupervised, or reinforcement learning. Map out data availability, regulatory compliance boundaries, and project success metrics.

2. Data Ingestion and Collection
Aggregate raw structured, semi-structured, or unstructured data into cloud storage. Use Amazon S3 as the centralized data lake landing zone. Ingest streaming data in real time using Amazon Kinesis. Extract relational database data using AWS Glue or AWS DMS.

3. Data Cleansing and Preparation
Clean raw datasets by handling missing values, filtering duplicates, and removing outliers. Use Amazon SageMaker Data Wrangler to visually profile data quality and transform features. Standardize, normalize, and tokenize text data, or resize images for computer vision. Store fully processed, reusable data features in the Amazon SageMaker Feature S...
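The cleansing step described in the excerpt above (missing values, duplicates, outliers) can be sketched in plain Python. This is a minimal illustration only, not the post's method: the function name `clean_records`, the `amount` field, median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example. In practice this work would typically happen in a tool such as SageMaker Data Wrangler.

```python
from statistics import median, quantiles


def clean_records(records, key="amount"):
    """Hypothetical helper: dedupe, impute missing values, drop IQR outliers."""
    # 1. Filter exact duplicate records while preserving order.
    seen, unique = set(), []
    for rec in records:
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(dict(rec))  # copy so the input is not mutated

    # 2. Handle missing values: fill None with the median of observed values.
    observed = [r[key] for r in unique if r[key] is not None]
    fill_value = median(observed)
    for r in unique:
        if r[key] is None:
            r[key] = fill_value

    # 3. Remove outliers that fall outside 1.5 * IQR of the quartiles.
    q1, _, q3 = quantiles([r[key] for r in unique], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in unique if low <= r[key] <= high]


# Illustrative data (made up for this sketch).
sample = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate
    {"id": 2, "amount": None},   # missing value
    {"id": 3, "amount": 12.0},
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 9.0},
    {"id": 6, "amount": 500.0},  # outlier
]
cleaned = clean_records(sample)
```

Median imputation and the IQR rule are only one reasonable choice; the right strategy depends on the data and the model.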

ML Exam: 6 - Feature Engineering

Feature Engineering - Basic Concepts
Applying domain knowledge (your knowledge of the data and of the model you're using) to create better features to train your model.
The art of ML!
Most critical part of a good ML implementation.
Talented/expert ML specialists are good at feature engineering.

Curse of dimensionality
More features is not better! Every feature is a new dimension.
Much of feature engineering is selecting the most relevant features → domain knowledge comes into play.
Unsupervised dimensionality reduction techniques can help (PCA, K-Means).

Feature Engineering - Techniques
Numeric:
Min-Max Scaling: Rescales features to a fixed range (typically 0 to 1) to prevent features with large magnitudes from dominating model training.
Standardized Distribution/Standard Scaling (Z-score): Centers data around a mean of 0 with a standard deviation of 1 for algorithms assuming normally distributed data. Puts wide number ranges into same mat...
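The two numeric techniques named in the excerpt above can be sketched in a few lines of standard-library Python. The function names `min_max_scale` and `standard_scale` are hypothetical, and the sketch assumes the population standard deviation for the z-score; libraries differ on that detail.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standard_scale(values):
    """Z-score: subtract the mean, then divide by the population std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled = min_max_scale([10, 20, 30])
z_scores = standard_scale([2, 4, 4, 4, 5, 5, 7, 9])
```

After min-max scaling every value lies in [0, 1]; after standard scaling the values have mean 0 and standard deviation 1, which is the property the excerpt describes.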

ML Exam: 5 - Products

Data Analytics

Data Analysis & Visualization
QuickSight = interactive dashboards and reports over data.
Athena = serverless SQL on S3 for ad-hoc queries and data lake analysis. Cost-effective. Runs SQL queries in parallel.
Redshift = fully managed data warehouse. Mnemonic: Oracle was "Big Red", and Redshift is about shifting away from Oracle data warehouses. Structured or semi-structured data. Scalable, with a pay-as-you-go pricing model. SQL across data warehouses, data lakes, and operational DBs. Can run either provisioned or serverless.

Data Pipelines
Kinesis Data Streams: real-time data from apps, streams, and sensors. Auto provisioning and scaling in on-demand mode. Kinesis is for real-time streaming event data and instant analytics/metrics over those streams.
Data Firehose: near real-time data. Fully managed service. Auto provisioning and scaling. Delivers data to storage and services.

Data Processing
Glue is serverles...