ML Exam: 4 - Data
Data File and Object Storage
Data Connectivity
1) JDBC (Java Database Connectivity) = Platform-independent, but Java only
2) ODBC (Open Database Connectivity) = Platform-dependent (because of native drivers), but language-independent
3) Other: Raw logs, APIs, streams, etc.
Data Ingestion
1) Glue = Fully managed serverless ETL service. Uses JDBC for DBs. Changes relational data into optimized formats (Parquet/ORC) and stores data in S3 for SageMaker.
2) Athena = Serverless service that runs standard SQL queries directly against data in S3. External tools (like Tableau, PowerBI, or a local SQL IDE) connect to Athena via ODBC/JDBC drivers.
3) SageMaker Canvas - Visual tool that imports, transforms, and analyzes data without code.
a) Diverse Data Ingestion: Connects to S3, Athena, Redshift, Lake Formation, Snowflake, and Databricks. Also SageMaker Feature Store.
b) Direct Destination Export: Exports to SageMaker Feature Store, S3, or triggers a SageMaker Pipeline.
c) Sampling Strategies: Ingests a sample using Top K, Random, or Stratified sampling.
d) Data Insights Report: Creates a report showing data quality, target leakage, duplicates, and outliers.
4) SageMaker Feature Store - Central repo to store, share, and manage ML features.
a) Online Store: Low-latency, real-time read/write store (backed by DynamoDB). Used for high-speed, real-time model inference. Auto syncs data to the Offline Store within minutes.
b) Offline Store: High-throughput, object-storage data lake (backed by S3). Retains historical feature data used for training models and batch processing.
c) Ingestion Sync: New records are written to the Online Store first, then replicated to the Offline Store.
d) Feature Group Batch: Uses SageMaker Canvas or Spark/Glue jobs to ingest bulk historical features to Online Store.
e) Feature Group Streaming: Uses Kinesis or Kafka to feed real-time streaming data to Online Store.
5) Kinesis Data Streams: Low-latency ingestion for custom real-time applications.
6) Kinesis Data Firehose: Zero-code delivery to S3, Redshift, or OpenSearch.
7) Apache Kafka / MSK: Enterprise-grade, open-source streaming platform. MSK is the managed version, which eases migrating existing Kafka workloads to AWS.
8) Apache Flink / Kinesis Analytics: Real-time stream processing using SQL or Java.
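The three Canvas sampling strategies named in item 3 (Top K, Random, Stratified) can be sketched in plain Python. This is only an illustration of the ideas, not how Canvas implements them; the function names, the `churn` column, and the toy dataset are all hypothetical.

```python
import random
from collections import defaultdict


def top_k(rows, k):
    """Top K: keep the first k rows in their original order."""
    return rows[:k]


def random_sample(rows, k, seed=0):
    """Random: draw k rows uniformly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def stratified_sample(rows, key, fraction, seed=0):
    """Stratified: sample the same fraction from every group of `key`,
    so a rare class keeps its share of the sample."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so no class disappears.
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical dataset: 90 "no" rows and 10 "yes" rows.
rows = [{"id": i, "churn": "yes" if i % 10 == 0 else "no"} for i in range(100)]
strat = stratified_sample(rows, key="churn", fraction=0.2)
```

With a 20% fraction the stratified sample keeps the original 9:1 class ratio, whereas Top K or a small random draw may contain few or no rows of the rare class.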
Data Formats
1) CSV (Comma-Separated Values) and TSV (Tab-Separated Values) = Logs. Row-based Storage, Characters, Non-Validated. Small to medium datasets. Main AWS services: supported by most AWS services. Good for: import/export of DBs & spreadsheets. Bad: slowest queries and no schema enforcement.
2a) JSON (JavaScript Object Notation) = Logs. Document or Object Based Storage, Characters, Non-Validated. Main AWS services: Lambda and DynamoDB. Good for: App APIs, highly nested semi-structured data. Represents data as hierarchical trees of key-value pairs or arrays. Bad: Expensive to query, single syntax error can break entire file. Use cases: configurations/settings, RESTful APIs.
2b) JSON Lines (JSONL) = Logs. Row Based Storage, Characters, Non-Validated. Each JSON object is separated by a newline character. Good: Easy appending of new records. Only corrupted line is lost. Use cases: logging, real-time streaming apps, and ML training.
3) Apache Avro = Streaming Schema with Data. Row Based Storage, Binary, native block-level compression, Validated. Stores the schema (as JSON) in the file header. Main AWS services: Glue Schema Registry. Good for: stream ingestion, heavy writes, and schema evolution. Bad: Binary so not human readable. Use cases: big data, real-time processing, schema evolution needed, efficient serialization. Used in Apache systems (Kafka, Spark, Flink) and Hadoop.
4a) RecordIO wrapped = Envelope/Storage. Takes individual chunks of binary data (records) and organizes them into a continuous file by prepending an explicit length-prefix to each record. Main AWS services: SageMaker, Glue Schema Registry, and Lambda.
4b) Protobuf (Protocol Buffers) = Ultra Fast for Microservices. Compiled Schema. Binary, Serialization format. Stores schema in external .proto file. Good for: High speed, microservices, and gRPC streaming. Bad: Binary so not human readable. Supports schema evolution.
5) Apache Parquet = Query. Column-based Storage, Binary, native column-level compression, Validated. Main AWS services: Athena, Glue, and Redshift. Good for: OLAP and filtering specific columns for analytics. Bad: slow writes and high CPU to compress. Excellent data compression (so less storage space) & encoding algorithms. Improved query performance. Use cases: analyze large datasets, read columns (instead of rows/records), optimize storage & IO operations. Used in Apache systems (Spark, Hive, Impala), Hadoop ecosystem, Redshift.
6) Apache ORC (Optimized Row Columnar) = Query. Column-based Storage, Binary, native column-level compression, Validated. Main AWS services: EMR and Athena. Good for: analytics and max compression. Bad: slow writes and high CPU to compress.
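The difference between JSON and JSON Lines noted in 2a and 2b (one syntax error breaks the whole JSON file, while JSONL loses only the corrupted line) can be shown with the standard library. A minimal sketch; the helper names and the sample records are hypothetical.

```python
import json


def parse_json(text):
    """Parse one JSON document; a single syntax error rejects everything."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_jsonl(text):
    """Parse JSON Lines; return (good_records, bad_line_numbers)."""
    records, bad = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(number)  # only this line is lost
    return records, bad


# The same three hypothetical records in both formats.
# The second record is missing the closing quote on "x.
as_json = '[{"id": 1}, {"id": 2, "name": "x}, {"id": 3}]'
as_jsonl = '{"id": 1}\n{"id": 2, "name": "x}\n{"id": 3}\n'

whole_file = parse_json(as_json)
records, bad_lines = parse_jsonl(as_jsonl)
```

Here `whole_file` is `None` because the array as a whole is invalid, while the JSONL parse still recovers records 1 and 3 and reports line 2 as bad.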
Data Extraction
1) S3 (Simple Storage Service) - Main data lake for ML pipelines.
a) S3 Transfer Acceleration: Uses CloudFront’s edge locations to speed uploads/downloads over long distances.
b) S3 Select: Extracts only needed rows/columns using SQL.
c) Multipart Uploads: Breaks large files into parts to upload concurrently.
d) S3 VPC Endpoints: Keeps traffic on the AWS network (off the public internet) when moving data to SageMaker, for security and speed.
2) EBS (Elastic Block Store) - High-performance block storage for SageMaker training and EC2.
a) EBS Provisioned IOPS (io1/io2): Delivers sustained, high-speed input/output performance for I/O-intensive ML training jobs.
b) EBS Optimized Instances: Dedicated throughput that minimizes contention between I/O and other traffic from EC2/SageMaker.
c) RAID 0 Configurations: Stripes multiple EBS volumes together on an EC2 instance to max read/write throughput for ultra-large datasets.
3) EFS (Elastic File System) - Serverless, fully managed network file system shared across multiple training instances.
a) Provisioned Throughput: Guarantees high throughput levels.
b) Max I/O Mode: Scales to higher throughput and operations/sec, optimized for multi-instance distributed training.
4) RDS & Aurora - Relational DB storage for structured ML features and metadata.
a) Read Replicas: Pulls from read-only copies of the DB (so as not to slow production apps).
b) Data Pipeline / Glue: Parallel, high-throughput extraction of relational tables into S3 for ML consumption.
5) DynamoDB - NoSQL database for high-throughput, low-latency key-value data storage.
a) DynamoDB Streams: Captures real-time, time-ordered sequences of item-level modifications for continuous streaming into ML models.
b) Parallel Scans: Divides a large dataset scan into multiple segments processed concurrently to speed up data extraction.
c) DynamoDB Export to S3: Exports full table data directly to S3 in JSON or Ion format without consuming your app's read capacity units (RCU).
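The multipart upload idea from item 1c (split a large object into parts, send the parts concurrently, then reassemble them in order) can be sketched without touching S3. This is an in-memory simulation under stated assumptions: `upload_part` here is a hypothetical stand-in for a network call, not the S3 API, and the part size is tiny for readability (real S3 parts must be at least 5 MiB, except the last one).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data: bytes, part_size: int):
    """Return (part_number, chunk) pairs; part numbers start at 1."""
    return [
        (index + 1, data[offset:offset + part_size])
        for index, offset in enumerate(range(0, len(data), part_size))
    ]


def upload_part(store: dict, part_number: int, chunk: bytes) -> int:
    """Hypothetical stand-in for uploading one part over the network."""
    store[part_number] = chunk
    return part_number


def multipart_upload(data: bytes, part_size: int, workers: int = 4) -> bytes:
    """Upload all parts concurrently, then assemble them by part number."""
    store = {}
    parts = split_into_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_part, store, n, c) for n, c in parts]
        for future in futures:
            future.result()  # re-raise any failure from a worker
    # "Complete" step: order by part number, not by finish order.
    return b"".join(store[n] for n in sorted(store))


payload = bytes(range(256)) * 40  # 10,240 bytes of sample data
parts = split_into_parts(payload, 1024)
assembled = multipart_upload(payload, 1024)
```

Because each part carries its own number, parts can finish in any order and a failed part can be retried alone instead of restarting the whole transfer.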
Data Transformation
1) Glue - A fully managed, serverless, event-driven ETL service.
a) Glue Data Catalog: Central metadata repo. Glue Crawlers auto scan S3/DB tables, infer schemas, and populate the Data Catalog.
b) Dynamic Frames: Extension of Apache Spark DataFrames used by Glue to handle messy, semi-structured data (like nested JSON) without requiring a pre-defined schema.
c) Job Types: Supports Apache Spark (for massive datasets) and Python Shell (for lightweight scripts).
d) Output Optimization: Converts relational data into compressed columnar formats (Parquet or ORC) to speed up SageMaker training reads.
2) Glue DataBrew - Visual prep tool to clean and normalize data without writing code.
a) No-Code Transformations: Has over 250 pre-built transformations (e.g., handling missing values, one-hot encoding, correcting invalid dates).
b) Data Lineage: Visualizes the entire pipeline flow of how the data was manipulated from its raw source to the final output destination.
c) Recipe-Driven: Saves the transformation steps as a reusable "recipe" (template) that can be run as a scheduled DataBrew job.
3) Apache Spark on EMR - Managed cluster platform for running big data frameworks (Apache Spark, Hadoop, and Hive) at massive scale.
a) Petabyte-Scale Ingestion: Best used when datasets are too massive for a single instance or for serverless Glue limits; uses distributed cluster compute.
b) EMR Step Execution: Runs data transformations programmatically by submitting "Steps" (e.g., executing a PySpark script stored in S3) to the cluster.
c) Spot Instance Savings: Can run Core or Task nodes on AWS Spot Instances to drastically reduce big data transformation costs.
4) SageMaker Canvas (was Data Wrangler) - Unified visual UI in SageMaker Studio to clean, transform, and evaluate data for ML pipelines.
a) ML-Specific Transforms: Includes built-in transformations like balancing imbalanced classes (SMOTE), string tokenization, and vectorization.
b) Data Insights Report: Checks for data quality issues, anomalies, multicollinearity, and target leakage.
c) Direct Pipeline Integration: Exports transformation workflows directly to an automated SageMaker Pipeline code artifact or pushes features directly into the SageMaker Feature Store.
5) Orchestration services (EventBridge, Lambda…)
Data Labeling & Feature Management
1) SageMaker Feature Store: Central repository to name, share, and reuse features.
2) SageMaker Ground Truth: Managed data labeling using human workflows and ML.
3) Mechanical Turk: Crowdsourced marketplace for manual data annotation tasks.
BACKGROUND
Types of Structure of Data
Properties of Data (4 Vs)
1. Volume (Size) - GBs? PBs?…
2. Velocity - High velocity → Real-Time or near-RT processing
3. Variety - Structured? Mixed? Multiple sources? Multiple formats?
4. Veracity - Accurate? Trustworthy?
Data Warehouses, Data Lakes, Data Lakehouses
- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format
- Data Lake (e.g. Amazon S3 can be used as data lake) - Storage repository that holds vast amounts of raw data in its native format (predefined structure is not necessary). Structured, semi-structured, & unstructured data
- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis
- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data architecture, tries to provide advantages of both. Performance, reliability & capabilities of DWHs. Flexibility, scale & low-cost storage of data lakes.
Data Mesh
Data Columns - Leakage
Leakage causes a model to look accurate during training and testing but fail in production. The model learns to look only at the leaked column for the answer, since that column perfectly matches the prediction target, and the column is not available when real predictions are needed.
2) "Customer Churn": If you are predicting Customer Churn and the data has an Account Cancellation Date column, that column is populated only for rows labeled "Yes". It is a leaked feature, since an active customer you are trying to retain has no cancellation date.
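One simple way to spot this kind of leak is to check whether a column being populated lines up exactly with the label. The sketch below is a rough heuristic, not the check that the Data Insights Report runs; the function name, column names, and sample rows are hypothetical.

```python
def leakage_suspects(rows, target, positive):
    """Flag columns whose 'is populated' pattern exactly matches the target."""
    labels = [row[target] == positive for row in rows]
    suspects = []
    for column in rows[0]:
        if column == target:
            continue
        populated = [row[column] is not None for row in rows]
        # A column filled in only for positive rows gives the answer away.
        if populated == labels:
            suspects.append(column)
    return suspects


# Hypothetical churn data: cancellation_date exists only for churned customers.
customers = [
    {"tenure_months": 3, "cancellation_date": "2024-01-15", "churned": "Yes"},
    {"tenure_months": 24, "cancellation_date": None, "churned": "No"},
    {"tenure_months": 7, "cancellation_date": "2024-03-02", "churned": "Yes"},
    {"tenure_months": 12, "cancellation_date": None, "churned": "No"},
]

suspects = leakage_suspects(customers, target="churned", positive="Yes")
```

On this data only `cancellation_date` is flagged, so it should be dropped before training. A real check would also look at near-perfect correlations, not just exact matches.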
