ML Exam: 4 - Data



Data File and Object Storage

1) S3 = Object Storage. Use Cases: Centralized data lake and model artifacts. Via: HTTP and HTTPS API. Performance: High throughput, but higher latency than file or block storage. Ingestion to SageMaker: File Mode (downloads the entire dataset before training starts), Pipe Mode (streams data sequentially), or Fast File Mode (exposes S3 objects through a POSIX file interface and streams them on demand).
2) EFS = Shared POSIX file system. Use Cases: Common notebook storage, shared home dirs, and team training scripts. Via: NFSv4. Performance: Elastic throughput and low latency.
3) FSx for Lustre = Ultra-fast storage; links directly to S3 for training. Use Case: Ultra-fast training on massive data. Via: POSIX. Performance: Sub-millisecond latency, millions of IOPS.
4) FSx for NetApp ONTAP = Enterprise File System. Use Case: Hybrid cloud data pipelines, migrating on-prem NetApp. Via: NFS, SMB, iSCSI. Performance: Sub-millisecond latency and enterprise caching.
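
The three S3 input modes in item 1 are chosen per training channel. A minimal sketch, assuming the channel layout used by the CreateTrainingJob InputDataConfig parameter; the bucket name and the helper build_channel are hypothetical, and nothing here calls AWS:

```python
# Sketch: build one training-input channel with an explicit S3 input mode.
# The dict shape follows CreateTrainingJob's InputDataConfig entries;
# the bucket and the helper name are made up for illustration.
VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def build_channel(name: str, s3_uri: str, input_mode: str = "File") -> dict:
    """Return a channel dict for the given S3 prefix and input mode."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channel = build_channel("train", "s3://example-bucket/train/", "FastFile")
print(channel["InputMode"])
```

Rule of thumb: File for small datasets, Fast File or Pipe when waiting for a full download would delay training.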

Data Connectivity

1) JDBC (Java Database Connectivity) = Platform-independent, but Java-only
2) ODBC (Open Database Connectivity) = Platform-dependent (relies on native drivers), but language-independent
3) Other: Raw logs, APIs, Streams, etc.

Data Ingestion

1) Glue = Fully managed serverless ETL service. Uses JDBC for DBs. Converts relational data into optimized formats (Parquet/ORC) and stores it in S3 for SageMaker.
2) Athena = Serverless, standard-SQL queries directly against data in S3. External tools (like Tableau, Power BI, or a local SQL IDE) connect to Athena via ODBC/JDBC drivers.

3) SageMaker Canvas - Visual tool that imports, transforms, and analyzes data without writing code.
    a) Diverse Data Ingestion: Connects to S3, Athena, Redshift, Lake Formation, Snowflake, and Databricks. Also SageMaker Feature Store.
    b) Direct Destination Export: Exports to SageMaker Feature Store, S3, or triggers a SageMaker Pipeline.
    c) Sampling Strategies: Ingests a sample using First K, Random, or Stratified sampling.
    d) Data Insights Report: Creates a report showing data quality, target leakage, duplicates, and outliers.
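
The three sampling strategies in item c) can be sketched in plain Python. This is an illustration of the ideas only, not how Canvas implements them; the function names and the toy rows are made up:

```python
import random
from collections import defaultdict


def first_k(rows, k):
    """Take the first k rows in file order (fast, but may be biased)."""
    return rows[:k]


def random_sample(rows, k, seed=0):
    """Take k rows uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so rare classes survive."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample


# Toy data: 10% of rows carry the rare "fraud" label.
rows = [{"id": i, "label": "fraud" if i % 10 == 0 else "ok"} for i in range(100)]
strat = stratified_sample(rows, "label", 0.2)
print(len(strat), sum(1 for r in strat if r["label"] == "fraud"))
```

Stratified sampling matters most for imbalanced targets, where a plain random sample can end up with very few rare-class rows.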

4) SageMaker Feature Store - Central repo to store, share, and manage ML features.
    a) Online Store: Low-latency, real-time read/write store (backed by DynamoDB). Used for high-speed, real-time model inference. Records written here are replicated automatically to the Offline Store within minutes.
    b) Offline Store: High-throughput, object-storage data lake (backed by S3). Retains historical feature data used for training models. Used for batch processing.
    c) Ingestion Sync: Records written with PutRecord fill the Online Store first; the Offline Store is then updated from it.
    d) Feature Group Batch: Uses SageMaker Canvas or Spark/Glue jobs to ingest bulk historical features into a feature group (Online Store, Offline Store, or both).
    e) Feature Group Streaming: Uses Kinesis or Kafka to feed real-time streaming data to the Online Store.
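
The Online/Offline split is easiest to remember as "latest value per record" versus "full history". A toy in-memory model, assuming that behavior; the class ToyFeatureGroup and its field names are hypothetical, and in the real service the offline copy lags by minutes rather than updating instantly:

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureGroup:
    """In-memory stand-in for a feature group with both stores enabled."""
    record_id_name: str
    event_time_name: str
    online: dict = field(default_factory=dict)   # record id -> latest record
    offline: list = field(default_factory=list)  # append-only history

    def put_record(self, record: dict) -> None:
        # Offline store keeps every version (used for training and batch jobs).
        self.offline.append(dict(record))
        # Online store keeps only the newest version per record id.
        key = record[self.record_id_name]
        current = self.online.get(key)
        if current is None or record[self.event_time_name] >= current[self.event_time_name]:
            self.online[key] = dict(record)

    def get_record(self, record_id):
        return self.online.get(record_id)


group = ToyFeatureGroup(record_id_name="customer_id", event_time_name="event_time")
group.put_record({"customer_id": "c1", "event_time": 1, "orders_30d": 2})
group.put_record({"customer_id": "c1", "event_time": 5, "orders_30d": 4})
group.put_record({"customer_id": "c1", "event_time": 3, "orders_30d": 3})  # late arrival
print(group.get_record("c1")["orders_30d"], len(group.offline))
```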

5) Kinesis Data Streams: Low-latency ingestion for custom real-time applications.
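6) Kinesis Data Firehose (now Amazon Data Firehose): Zero-code, near-real-time delivery to S3, Redshift, or OpenSearch.
7) Apache Kafka / MSK: Managed open-source Kafka; suits migrating existing enterprise Kafka streaming workloads to AWS.
8) Apache Flink / Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): Real-time stream processing using SQL or Java.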

Data Formats

1) CSV (Comma-Separated Values) and TSV (Tab-Separated Values) = Logs. Row-based Storage, Characters, Non-Validated. Small to medium datasets. Main AWS services: supported by most AWS data services. Good for: import/export of DBs & spreadsheets. Bad: slowest queries and no schema enforcement.
2a) JSON (JavaScript Object Notation) = Logs. Document or Object Based Storage, Characters, Non-Validated. Represents data as hierarchical trees of key-value pairs or arrays. Main AWS services: Lambda and DynamoDB. Good for: App APIs, highly nested semi-structured data. Bad: Expensive to query; a single syntax error can break the entire file. Use cases: configurations/settings, RESTful APIs.
2b) JSON Lines (JSONL) = Logs. Row Based Storage, Characters, Non-Validated. Each JSON object sits on its own line, separated by a newline character. Good: Easy appending of new records; a corrupted line loses only that record. Use cases: logging, real-time streaming apps, and ML training.
3) Apache Avro = Streaming, Schema with Data. Row Based Storage, Binary, native block-level compression, Validated. Stores the schema (as JSON) in the file header. Main AWS services: Glue Schema Registry. Good for: stream ingestion, heavy writes, and schema evolution. Bad: Binary so not human readable. Use cases: big data, real-time processing, schema evolution needed, efficient serialization. Used in Apache systems (Kafka, Spark, Flink) and Hadoop.
4a) RecordIO wrapped = Envelope/Storage. Takes individual chunks of binary data (records) and organizes them into one continuous file by prepending an explicit length prefix to each record. Main AWS service: SageMaker (built-in algorithms read RecordIO-wrapped protobuf).
4b) Protobuf (Protocol Buffers) = Ultra Fast for Microservices. Compiled Schema. Binary serialization format. Stores schema in an external .proto file. Main AWS services: Glue Schema Registry. Good for: High speed, microservices, and gRPC streaming. Supports schema evolution. Bad: Binary so not human readable.
5) Apache Parquet = Query. Column-based Storage, Binary, native column-level compression, Validated. Main AWS services: Athena, Glue, and Redshift. Good for: OLAP and filtering specific columns for analytics; excellent compression (saves storage space) & encoding algorithms; improved query performance. Bad: slow writes and high CPU to compress. Use cases: analyze large datasets, read columns (instead of rows/records), optimize storage & IO operations. Used in Apache systems (Spark, Hive, Impala), Hadoop ecosystem, Redshift.
6) Apache ORC (Optimized Row Columnar) = Query. Column-based Storage, Binary, native column-level compression, Validated. Main AWS services: EMR and Athena. Good for: analytics and max compression. Bad: slow writes and high CPU to compress.
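
The length-prefix idea behind item 4a can be sketched with the standard library. This is a simplified, generic framing (a 4-byte little-endian length before each record), not the exact RecordIO byte layout, which also carries a magic number and padding; the function names are made up:

```python
import io
import struct


def write_records(stream, records):
    """Write each payload preceded by its length as a 4-byte unsigned int."""
    for payload in records:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads back by reading a length, then exactly that many bytes."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


buffer = io.BytesIO()
write_records(buffer, [b"first", b"", b"third record"])
buffer.seek(0)
decoded = list(read_records(buffer))
print(decoded)
```

Because every record announces its own size, a reader can stream records one at a time without delimiters, which is what makes this kind of envelope suit Pipe Mode.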

Data Extraction

1) S3 (Simple Storage Service) - Main data lake for ML pipelines.
    a) S3 Transfer Acceleration: Uses CloudFront’s edge locations to speed uploads/downloads over long distances.
    b) S3 Select: Extracts only needed rows/columns using SQL.
    c) Multipart Uploads: Breaks large files into parts to upload concurrently.
    d) S3 VPC Endpoints: Keeps traffic on the AWS network (off the public internet) when moving data to SageMaker, for security and speed.
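
How a multipart upload (item c) splits a file can be sketched as a part plan. This assumes the documented S3 limits of 5 MiB minimum per part (except the last) and at most 10,000 parts; the helper plan_parts and the 8 MiB default are illustrative choices, and no upload is performed:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(total_size: int, part_size: int = 8 * 1024 * 1024):
    """Return (part_number, offset, length) tuples covering the whole object."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    part_count = -(-total_size // part_size)  # ceiling division
    if part_count > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    parts = []
    for index in range(part_count):
        offset = index * part_size
        length = min(part_size, total_size - offset)
        parts.append((index + 1, offset, length))  # part numbers start at 1
    return parts


parts = plan_parts(20 * 1024 * 1024)
print(parts)
```

Each tuple can be uploaded by a separate worker; a failed part is retried alone instead of restarting the whole file.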

2) EBS (Elastic Block Store) - High-performance block storage for SageMaker training and EC2.
    a) EBS Provisioned IOPS (io1/io2): Delivers sustained, high-speed input/output performance for I/O-intensive ML training jobs.
    b) EBS Optimized Instances: Dedicated throughput that minimizes contention between I/O and other traffic from EC2/SageMaker.
    c) RAID 0 Configurations: Stripes multiple EBS volumes together on an EC2 instance to max read/write throughput for ultra-large datasets.
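
Why RAID 0 (item c) raises throughput: consecutive stripes land on different volumes, so sequential reads hit all volumes in parallel. A minimal sketch of the round-robin placement; the function name is made up and real setups are configured at the OS level (e.g. with mdadm), not in Python:

```python
def locate_stripe(logical_stripe: int, volume_count: int):
    """Map a logical stripe number to (volume index, stripe index on that volume)."""
    if volume_count < 1 or logical_stripe < 0:
        raise ValueError("need at least one volume and a non-negative stripe")
    return logical_stripe % volume_count, logical_stripe // volume_count


# Eight consecutive stripes spread over four volumes.
layout = [locate_stripe(i, 4) for i in range(8)]
print(layout)
```

Trade-off to remember: RAID 0 has no redundancy, so losing one volume loses the whole array.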

3) EFS (Elastic File System) - Serverless, fully managed network file system shared across multiple training instances.
    a) Provisioned Throughput: Guarantees high throughput levels.
    b) Max I/O Mode: Scales to higher aggregate throughput and operations/sec at the cost of slightly higher per-operation latency; suited to highly parallel, multi-instance distributed training.

4) RDS & Aurora - Relational DB storage for structured ML features and metadata.
    a) Read Replicas: Pulls from read-only copies of the DB (so as not to slow production apps).
    b) Data Pipeline / Glue: Parallel, high-throughput extraction of relational tables into S3 for ML consumption.

5) DynamoDB - NoSQL database for high-throughput, low-latency key-value data storage.
    a) DynamoDB Streams: Captures real-time, time-ordered sequences of item-level modifications for continuous streaming into ML models.
    b) Parallel Scans: Divides a large dataset scan into multiple segments processed concurrently to speed up data extraction.
    c) DynamoDB Export to S3: Exports full table data directly to S3 in JSON or Ion format without consuming your app's read capacity units (RCU).
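
A parallel scan (item b) is driven by two Scan parameters, Segment and TotalSegments: each worker scans one segment. The sketch below builds the per-worker requests and runs them against a local stand-in; fake_scan and the table name are hypothetical and replace the real Scan call so the example runs without AWS:

```python
from concurrent.futures import ThreadPoolExecutor


def build_scan_requests(table_name: str, total_segments: int):
    """One Scan request per segment; together the segments cover the table."""
    if total_segments < 1:
        raise ValueError("total_segments must be at least 1")
    return [
        {"TableName": table_name, "Segment": segment, "TotalSegments": total_segments}
        for segment in range(total_segments)
    ]


def fake_scan(request, items):
    """Stand-in for a real Scan call: each segment sees a disjoint slice."""
    return [
        item
        for index, item in enumerate(items)
        if index % request["TotalSegments"] == request["Segment"]
    ]


items = list(range(10))
requests = build_scan_requests("example-table", 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(lambda req: fake_scan(req, items), requests))
merged = sorted(item for chunk in chunks for item in chunk)
print(merged)
```

Parallel scans still consume read capacity, so for a full one-off dump the Export to S3 option in item c) is usually the gentler choice.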


Data Transformation

1) Glue - A fully managed, serverless event-driven ETL service. 
    a) Glue Data Catalog: Central metadata repo. Glue Crawlers automatically scan S3/DB tables, infer schemas, and populate the Data Catalog.
    b) Dynamic Frames: Extension of Apache Spark DataFrames used by Glue to handle messy, semi-structured data (like nested JSON) without requiring a pre-defined schema.
    c) Job Types: Supports Apache Spark (for massive datasets) and Python Shell (for lightweight scripts).
    d) Output Optimization: Converts relational data into compressed, columnar formats (Parquet or ORC) to speed up SageMaker training reads.
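
Schema-on-read (items a and b) means the schema is discovered from the records instead of declared up front. A toy sketch, loosely modeled on how a DynamicFrame keeps several candidate types for an inconsistent field until you resolve the choice; this is not Glue's actual algorithm and the names are made up:

```python
def infer_schema(records):
    """Collect every Python type name observed for each field."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in schema.items()}


# Messy input: "id" and "price" change type, "tags" is missing in one row.
records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 5, "tags": ["sale"]},
]
schema = infer_schema(records)
ambiguous = [name for name, types in schema.items() if len(types) > 1]
print(schema)
print(ambiguous)
```

Fields in the ambiguous list are the ones that would need an explicit cast before writing Parquet or ORC.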

2) Glue DataBrew - Visual prep tool to clean and normalize data without writing code. 
    a) No-Code Transformations: Has over 250 pre-built transformations (e.g., handling missing values, one-hot encoding, correcting invalid dates).
    b) Data Lineage: Visualizes the entire pipeline flow of how the data was manipulated from its raw source to the final output destination.
    c) Recipe-Driven: Saves the transformation steps as a reusable "recipe" (template) that can be re-applied by a scheduled DataBrew job.

3) Apache Spark on EMR - Managed cluster platform for massive-scale distributed processing with big data frameworks such as Apache Spark, Hadoop, and Hive.
    a) Petabyte-Scale Ingestion: Best used when datasets are too massive for a single instance or serverless Glue limits, utilizing distributed cluster compute.
    b) EMR Step Execution: Runs data transformations programmatically by submitting "Steps" (e.g., executing a PySpark script stored in S3) to the cluster.
    c) Spot Instance Savings: Can run Core or Task nodes on AWS Spot Instances to drastically reduce big data transformation costs. 
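
An EMR "Step" (item b) is a small job description submitted to the cluster. A minimal sketch of a Spark step, assuming the step structure accepted by the EMR AddJobFlowSteps API with command-runner.jar; the script path, arguments, and helper name are hypothetical, and nothing is submitted:

```python
def build_spark_step(name: str, script_uri: str, script_args=()):
    """Return a step dict that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                script_uri,
                *script_args,
            ],
        },
    }


step = build_spark_step(
    "clean-events",
    "s3://example-bucket/scripts/clean_events.py",
    ["--date", "2024-01-01"],
)
print(step["HadoopJarStep"]["Args"])
```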

4) SageMaker Canvas (was Data Wrangler) - Unified visual UI in SageMaker Studio to clean, transform, and evaluate data for ML pipelines.
    a) ML-Specific Transforms: Includes built-in transformations like correcting class imbalance (SMOTE), string tokenization, and vectorization.
    b) Data Insights Report: Checks for data quality issues, anomalies, multicollinearity, and target leakage.
    c) Direct Pipeline Integration: Exports transformation workflows directly to an automated SageMaker Pipeline code artifact or pushes features directly into the SageMaker Feature Store.

5) Orchestration services (EventBridge, Lambda…)

Data Labeling & Feature Management

1) SageMaker Feature Store: Central repository to name, share, and reuse features.
2) SageMaker Ground Truth: Managed data labeling using human workflows and ML.
3) Mechanical Turk: Crowdsourced marketplace for manual data annotation tasks.



-------------------------------------------------------------------------------------------------------

  BACKGROUND

-------------------------------------------------------------------------------------------------------

Types of Structure of Data


Properties of Data (4 Vs)

1. Volume (Size) - GBs? PBs?…

2. Velocity - High velocity → Real-Time or near-RT processing

3. Variety - Structured? Mixed? Multiple sources? Multiple formats?

4. Veracity - Accurate? Trustworthy? Noisy or incomplete?


Data Warehouses, Data Lakes, Data Lakehouses

- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format

- Data Lake (e.g. Amazon S3 can be used as data lake)  - Storage repository that holds vast amounts of raw data in its native format (predefined structure is not necessary). Structured, semi-structured, & unstructured data

- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis

- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data architecture, tries to provide advantages of both. Performance, reliability & capabilities of DWHs. Flexibility, scale & low-cost storage of data lakes.

Data Mesh

Domain-based data management paradigm. Decentralized architectural framework that shifts data ownership from a central team to domain-specific teams (e.g., marketing, sales, shipping).


Data Columns - Leakage

   Leakage causes a model to look accurate during training and testing but fail in production. The model learns to rely almost entirely on the leaking column, because that column matches the target almost perfectly yet is not available (or not yet known) at prediction time.

Examples:
  1) "Airbag": If predicting safe drivers, and training data has Airbag Deployed column, Canvas/Data Wrangler will flag Airbag Deployed as target leakage. In production, you need to predict the crash before it happens, at which point the airbag has not yet deployed.
  2) "Customer Churn": If predicting Customer Churn and the data has an Account Cancellation Date column, only customers who have already churned ("Yes") will have a value. It is a leaked feature, since an active customer you are trying to retain has no cancellation date.
  3) "Transaction Fraud": If predicting Fraud and include Support Ticket ID Filed for Fraud Recovery as a feature, the data leaks the final outcome.

Result: Drop the leaking column from the training data.
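
A rough way to spot a leaking column is to ask how well that one column alone predicts the target. The sketch below is a simple heuristic, not the check Canvas/Data Wrangler runs; the rows are invented for illustration, and note that unique-per-row columns such as IDs would also score 1.0 and need separate handling:

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, target):
    """Accuracy of guessing the majority target label for each feature value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    correct = sum(counter.most_common(1)[0][1] for counter in by_value.values())
    return correct / len(rows)


def flag_leakage(rows, target, threshold=0.99):
    """Return features that predict the target (almost) perfectly on their own."""
    features = [name for name in rows[0] if name != target]
    return [
        name
        for name in features
        if single_feature_accuracy(rows, name, target) >= threshold
    ]


# Made-up churn data: the cancellation flag is only set after churn happened.
rows = [
    {"plan": "basic", "has_cancellation_date": True, "churned": "yes"},
    {"plan": "basic", "has_cancellation_date": False, "churned": "no"},
    {"plan": "basic", "has_cancellation_date": False, "churned": "no"},
    {"plan": "pro", "has_cancellation_date": False, "churned": "no"},
    {"plan": "pro", "has_cancellation_date": True, "churned": "yes"},
]
flagged = flag_leakage(rows, "churned")
print(flagged)
```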
 

ETL Pipelines

Processing steps

1. Extract = Retrieve raw data from sources (DBs, flat files…). Ensure data integrity. Real-Time or batches.
2. Transform = Convert raw data into suitable format. Data cleansing, enrichment, computations, encoding/decoding, format changes. 
3. Load = Store transformed data into target (DWH, repo…). Ensure data maintains integrity. Batches or streaming.
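
The three steps can be sketched end to end with in-memory data: CSV in, cleaned JSON Lines out. The sample rows, column names, and cleaning rules are invented for illustration; a real pipeline would read from a source system and write to S3 or a warehouse:

```python
import csv
import io
import json

RAW_CSV = "order_id,amount,country\n1,19.90,us\n2,,de\n3,5.00,US\n"


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, normalize country."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "country": row["country"].upper(),
            }
        )
    return cleaned


def load(rows, sink):
    """Load: write one JSON object per line and return the row count."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


sink = io.StringIO()
loaded = load(transform(extract(RAW_CSV)), sink)
print(loaded)
print(sink.getvalue(), end="")
```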

