ML Exam Prep - Data Ingestion and Storage



Types of Data Structure


Properties of Data (4 Vs)

1. Volume (Size) - GBs? PBs?…

2. Velocity - High velocity → Real-Time or near-RT processing

3. Variety - Structured? Mixed? Multiple sources? Multiple formats?

4. Veracity - Accuracy, consistency & trustworthiness of the data


Data Warehouses, Data Lakes, Data Lakehouses

- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format

- Data Lake (e.g. Amazon S3 can be used as data lake) - Storage repository that holds vast amounts of raw data in its native format (predefined structure is not necessary). Holds structured, semi-structured, & unstructured data

- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis

- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data architecture that aims to combine the advantages of both: the performance, reliability & capabilities of DWHs with the flexibility, scale & low-cost storage of data lakes.

Data Mesh

Domain-based data management paradigm. Decentralized architectural framework that shifts data ownership from a central team to domain-specific teams (e.g., marketing, sales, shipping).

ETL Pipelines

Processing steps

1. Extract = Retrieve raw data from sources (DBs, flat files…)
    - Ensure data integrity
    - Real-Time or batches
2. Transform = Convert raw data into suitable format
    - Data cleansing, enrichment, computations, encoding/decoding, format changes…
3. Load = Store transformed data into target (DWH, repo…)
    - Ensure data maintains integrity
    - Batches or streaming
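The three steps above can be sketched with only the Python standard library — here `sqlite3` stands in for a source database and a CSV file for the load target (the `orders` table and its columns are made up for illustration):

```python
import csv
import os
import sqlite3
import tempfile

# --- Extract: retrieve raw rows from a source DB (hypothetical "orders" table) ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1250, "us"), (2, 999, "de"), (3, 480, "us")])
raw_rows = con.execute(
    "SELECT id, amount_cents, country FROM orders ORDER BY id").fetchall()

# --- Transform: cleanse & convert to a suitable format (cents -> dollars, uppercase country) ---
transformed = [
    {"id": oid, "amount_usd": cents / 100, "country": country.upper()}
    for oid, cents, country in raw_rows
]

# --- Load: store the transformed batch into a target (a CSV file standing in for a DWH) ---
out_path = os.path.join(tempfile.gettempdir(), "orders_clean.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount_usd", "country"])
    writer.writeheader()
    writer.writerows(transformed)
```

In a real pipeline the same three stages would be spread across services (e.g. extract from RDS, transform in Glue, load into Redshift), but the shape of the work is the same.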

Managing ETL pipelines

- Automation
    - AWS Glue
    - Orchestration services (EventBridge, Lambda…)

Data Sources

- JDBC (Java DB Connectivity) = Platform-independent, Language-dependent (only Java)
- ODBC (Open Database Connectivity) = Platform-dependent (requires platform-specific drivers), Language-independent
- Other: Raw logs, APIs, Streams…
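For comparison, Python has its own analogue of these connectivity standards: the DB-API (PEP 249) defines a common connect/cursor/execute interface that per-database driver modules implement, playing roughly the role JDBC plays for Java. A minimal sketch using the built-in `sqlite3` driver:

```python
import sqlite3

# The DB-API (PEP 249) is a common interface implemented by per-database
# driver modules (sqlite3, psycopg2, ...), much like JDBC drivers in Java.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE logs (ts TEXT, msg TEXT)")
cur.execute("INSERT INTO logs VALUES (?, ?)", ("2024-01-01", "app started"))
conn.commit()
rows = cur.execute("SELECT ts, msg FROM logs").fetchall()
```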


Data Formats

  • CSV (Comma-Separated Values)

    • Human-readable & editable data storage
    • Small-medium datasets → import/export for DBs & spreadsheets
    • While CSV normally refers to commas (,), the underlying data format is the same regardless of the separating character → the delimiter can also be a tab (TSV), a space...
  • JSON (JavaScript Object Notation)

    • Human-readable, Flexible schema with nested structures
    • Use cases: data exchange between web servers-clients or programming languages, configurations/settings for SW apps, RESTful APIs…
    • JSON Lines (JSONL) = newline-delimited JSON → subformat of JSON
      • Each JSON object is separated by a newline character

      • Advantage: individual lines/objects can be read/processed independently (without requiring entire dataset to be loaded into memory)

        • Allows efficient streaming for real-time applications
        • Ideal for large-scale data processing
        • Easy appending of new records

  • Apache Avro

    • Row-based binary format, stores both data & its schema
    • Use cases: big data, real-time processing systems, schema evolution needed, efficient serialization for data transportation…
    • Used in Apache systems (Kafka, Spark, Flink), Hadoop ecosystem.
  • Protobuf (Protocol Buffers)
      • Originally designed by Google
      • Similar to Avro: row-based, binary, ideal for serialization, supports schema evolution…
      • Many SageMaker built-in algorithms expect this data format for their training data
        • Usually specified as “RecordIO-Protobuf”
  • Apache Parquet
    • Columnar storage format, optimized for analytics
    • Excellent data compression & encoding algorithms
      • reduced storage space
      • improved query performance
    • Use cases: analyze large datasets, read columns instead of rows/records, optimize storage & IO operations
    • Used in Apache systems (Spark, Hive, Impala), Hadoop ecosystem, Amazon Redshift Spectrum
  • Apache ORC (Optimized Row Columnar)
    • Columnar storage format (like Parquet)
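A small stdlib-only sketch contrasting two of the text formats above — the same `csv` reader handles CSV or TSV just by changing the delimiter, and JSONL can be consumed line by line without loading the whole dataset (the sample data is made up):

```python
import csv
import io
import json

# CSV vs TSV: same underlying format, different delimiter.
tsv_data = "id\tname\n1\tada\n2\tgrace\n"
rows = list(csv.reader(io.StringIO(tsv_data), delimiter="\t"))

# JSONL: one JSON object per line; each line is parsed independently,
# which is what makes streaming & easy appending possible.
jsonl_data = '{"id": 1, "name": "ada"}\n{"id": 2, "name": "grace"}\n'
records = [json.loads(line) for line in io.StringIO(jsonl_data)]
```

(The binary formats — Avro, Protobuf, Parquet, ORC — need third-party libraries such as `fastavro` or `pyarrow`, so they are not shown here.)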
