ML Exam Prep - Data Ingestion and Storage

May 11, 2026

ML Exam Prep

Data Ingestion and Storage

Types of Structure of Data

Properties of Data (4 Vs)

1. Volume (Size) - GBs? PBs?…

2. Velocity - High velocity → Real-Time or near-RT processing

3. Variety - Structured? Mixed? Multiple sources? Multiple formats?

4. Veracity - How accurate and trustworthy is the data?

Data Warehouses, Data Lakes, Data Lakehouses

- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format

- Data Lake (e.g. Amazon S3 can be used as data lake) - Storage repository that holds vast amounts of raw data in its native format (predefined structure is not necessary). Can hold structured, semi-structured & unstructured data

- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis

- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data architecture that tries to provide the advantages of both: the performance, reliability & capabilities of DWHs, plus the flexibility, scale & low-cost storage of data lakes

Data Mesh

Domain-based data management paradigm. Decentralized architectural framework that shifts data ownership from a central team to domain-specific teams (e.g., marketing, sales, shipping).

ETL Pipelines

Processing steps

1. Extract = Retrieve raw data from sources (DBs, flat files…)

- Ensure data integrity

- Real-Time or batches

2. Transform = Convert raw data into suitable format

- Data cleansing, enrichment, computations, encoding/decoding, format changes…

3. Load = Store transformed data into target (DWH, repo…)

- Ensure data maintains integrity

- Batches or streaming
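The three steps above can be sketched as a tiny batch pipeline. This is only an illustration using Python's standard library: the sample data, the `orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real target such as a DWH.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a DB, flat file, API...
RAW_CSV = """order_id,amount,country
1,19.99,us
2,,de
3,5.00,US
"""

def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, cast types, normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # data cleansing: skip records with a missing amount
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned

def load(rows, conn):
    """Load: write the batch inside one transaction so it fully succeeds or fails."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Wrapping the insert in a single transaction is one simple way to keep the target consistent if a batch fails halfway through.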

Managing ETL pipelines

- Automation

- AWS Glue

- Orchestration services (EventBridge, Lambda…)

Data Sources

- JDBC (Java DB Connectivity) = Platform-independent, Language-dependent (only Java)

- ODBC (Open Database Connectivity) = Platform-dependent (drivers are platform-specific), Language-independent

- Other: Raw logs, APIs, Streams…

Data Formats

CSV (Comma-Separated Values)
- Human-readable & editable data storage
- Small-medium datasets → import/export for DBs & spreadsheets
- While CSV normally refers to commas (,), the underlying data format is the same no matter the separating character → the separator can also be a tab (TSV), a space…
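As a small illustration of the separator point, Python's standard `csv` module reads and writes the same rows with either a comma or a tab; only the `delimiter` argument changes. The sample rows and the helper names `dump` and `parse` are made up for this sketch.

```python
import csv
import io

rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]

def dump(rows, delimiter):
    """Serialize rows to delimited text using the given separator."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()

def parse(text, delimiter):
    """Parse delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

csv_text = dump(rows, ",")   # comma-separated
tsv_text = dump(rows, "\t")  # tab-separated (TSV)

print(parse(csv_text, ",") == parse(tsv_text, "\t"))
```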
JSON (JavaScript Object Notation)
- Human-readable, Flexible schema with nested structures
- Use cases: data exchange between web servers-clients or programming languages, configurations/settings for SW apps, RESTful APIs…
- JSON Lines (JSONL) = newline-delimited JSON → variant of JSON (each line is valid JSON on its own)
  - Each JSON object is separated by a newline character
  - Advantage: individual lines/objects can be read/processed independently (without requiring entire dataset to be loaded into memory)
    - Allows efficient streaming for real-time applications
    - Ideal for large-scale data processing
    - Easy appending of new records
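A minimal sketch of the JSONL idea, assuming made-up event records and hypothetical helper names: each record is written as one line, new records are appended at the end, and the reader yields one object at a time instead of loading the whole dataset.

```python
import io
import json

records = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]

def write_jsonl(records, fh):
    """Write each record as one JSON document per line."""
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

def read_jsonl(fh):
    """Yield records one line at a time, without loading the whole file."""
    for line in fh:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)

buf = io.StringIO()  # stands in for a file opened in append mode
write_jsonl(records, buf)
write_jsonl([{"id": 3, "event": "click"}], buf)  # appending is just more lines

buf.seek(0)
clicks = [rec["id"] for rec in read_jsonl(buf) if rec["event"] == "click"]
print(clicks)
```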
Apache Avro

- Row-based binary format, stores both data & its schema
- Use cases: big data, real-time processing systems, schema evolution needed, efficient serialization for data transportation…
- Used in Apache systems (Kafka, Spark, Flink), Hadoop ecosystem

Protobuf (Protocol Buffers)

- Originally designed by Google
- Similar to Avro: row-based, binary, ideal for serialization, supports schema evolution…
- Many SageMaker built-in algorithms expect this data format for their training data
  - Usually specified as "RecordIO-Protobuf"

Apache Parquet

- Columnar storage format, optimized for analytics
- Excellent data compression & encoding algorithms
  - Reduced storage space
  - Improved query performance
- Use cases: analyze large datasets, read columns instead of rows/records, optimize storage & IO operations
- Used in Apache systems (Spark, Hive, Impala), Hadoop ecosystem, Amazon Redshift Spectrum
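To see why a columnar layout helps analytics, here is a pure-Python sketch. It is not the Parquet file format itself, just an illustration of the idea with made-up data and hypothetical helper names: a query touches only the column it needs, and repeated values within a column compress well, shown here with a simple run-length encoding.

```python
from itertools import groupby

# Row-based layout: one record after another
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 5.0},
    {"id": 4, "country": "DE", "amount": 7.5},
]

def to_columns(rows):
    """Pivot row records into a column-oriented layout: one list per field."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)

# An aggregate over one field reads only that column, not every record
total = sum(columns["amount"])

# Values within a column are often repetitive, so they encode compactly
encoded_country = run_length_encode(columns["country"])

print(total, encoded_country)
```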

Apache ORC (Optimized Row Columnar)

- Columnar storage format (like Parquet)
