ML Exam Prep - Data Ingestion and Storage
Types of Data Structures
Properties of Data (4 Vs)
1. Volume (Size) - GBs? PBs?…
2. Velocity - High velocity → Real-Time or near-RT processing
3. Variety - Structured? Mixed? Multiple sources? Multiple formats?
4. Veracity - Accuracy, quality & trustworthiness of the data
Data Warehouses, Data Lakes, Data Lakehouses
- Data Warehouse (DWH) (e.g. Amazon Redshift) - Centralized repository optimized for analysis (read-heavy operations) where data from different sources is stored in a structured format
- Data Lake (e.g. Amazon S3 can be used as data lake) - Storage repository that holds vast amounts of raw data in its native format (predefined structure is not necessary). Structured, semi-structured, & unstructured data
- Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis
- Data Lakehouse (e.g. AWS Lake Formation with S3 & Redshift Spectrum) - Hybrid data architecture, tries to provide advantages of both. Performance, reliability & capabilities of DWHs. Flexibility, scale & low-cost storage of data lakes.
Data Mesh
ETL Pipelines
Processing steps
Managing ETL pipelines
Data Sources
Data Formats
- CSV (Comma-Separated Values)
- Human-readable & editable data storage
- Small-medium datasets → import/export for DBs & spreadsheets
- While CSV normally refers to commas (,), the underlying data format is the same regardless of the separating character → the separator can also be a tab (TSV), a blank space...
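A minimal sketch of that separator point using only the Python standard library — the same parser handles CSV and TSV, only the delimiter changes (the data here is made up for illustration):

```python
import csv
import io

# Toy data: the same records, once comma-separated, once tab-separated.
csv_data = "id,name,score\n1,alice,0.9\n2,bob,0.7\n"
tsv_data = csv_data.replace(",", "\t")

def parse(text, delimiter=","):
    """Parse delimited text into a list of dicts, using the header row as keys."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

rows_csv = parse(csv_data)        # comma-separated (CSV)
rows_tsv = parse(tsv_data, "\t")  # tab-separated (TSV)

# Identical underlying data, regardless of separator.
assert rows_csv == rows_tsv
```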
- JSON (JavaScript Object Notation)
- Human-readable, Flexible schema with nested structures
- Use cases: data exchange between web servers-clients or programming languages, configurations/settings for SW apps, RESTful APIs…
- JSON Lines (JSONL) = newline-delimited JSON → subformat of JSON
- Each JSON object is separated by a newline character
- Advantage: individual lines/objects can be read/processed independently (without requiring the entire dataset to be loaded into memory)
- Allows efficient streaming for real-time applications
- Ideal for large-scale data processing
- Easy appending of new records
- [JSONL screenshot]
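The streaming advantage of JSONL can be sketched with the standard library alone — each line is parsed independently, so records can be processed one at a time (the records below are hypothetical):

```python
import json
import io

# JSONL: one independent JSON object per line.
jsonl_text = (
    '{"user": "alice", "clicks": 3}\n'
    '{"user": "bob", "clicks": 7}\n'
)

def stream_records(fp):
    """Yield records one at a time - no need to load the whole file into memory."""
    for line in fp:
        if line.strip():  # skip blank lines
            yield json.loads(line)

# Aggregate while streaming; also shows easy appending (just add a line).
total_clicks = sum(rec["clicks"] for rec in stream_records(io.StringIO(jsonl_text)))
```

By contrast, a plain JSON array would have to be parsed in full before the first record could be used.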
- Apache Avro
- Row-based binary format, stores both data & its schema
- Use cases: big data, real-time processing systems, schema evolution needed, efficient serialization for data transportation…
- Used in Apache systems (Kafka, Spark, Flink), Hadoop ecosystem.
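A minimal sketch of the "stores both data & its schema" idea: Avro schemas are themselves defined in JSON and travel with the data. The record and field names below are hypothetical; actually serializing records to Avro files would require a library such as fastavro.

```python
import json

# Hypothetical Avro schema for a "ClickEvent" record (schemas are plain JSON).
schema = {
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "page", "type": "string"},
        # Schema evolution: a newly added field declares a default, so data
        # written with the old schema can still be read with the new one.
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}

print(json.dumps(schema, indent=2))
```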
- Protobuf (Protocol Buffers)
- Originally designed by Google
- Similar to Avro: row-based, binary, ideal for serialization, supports schema evolution…
- Many SageMaker built-in algorithms expect this data format for their training data
- Usually specified as “RecordIO-Protobuf”
- Apache Parquet
- Columnar storage format, optimized for analytics
- Excellent data compression & encoding algorithms
- reduced storage space
- improved query performance
- Use cases: analyze large datasets, read columns instead of rows/records, optimize storage & IO operations
- Used in Apache systems (Spark, Hive, Impala), Hadoop ecosystem, Amazon Redshift Spectrum
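A rough pure-Python sketch of row-based vs. columnar layouts (toy data, no actual Parquet I/O): an analytic query over a single column only has to touch that one column in the columnar layout, while the row layout must scan every full record.

```python
# Row-oriented layout: each record kept together (like CSV, Avro).
rows = [
    {"id": 1, "city": "Lisbon", "temp": 21.5},
    {"id": 2, "city": "Madrid", "temp": 25.0},
    {"id": 3, "city": "Berlin", "temp": 18.2},
]

# Columnar layout: each column stored contiguously (like Parquet, ORC).
columns = {
    "id":   [1, 2, 3],
    "city": ["Lisbon", "Madrid", "Berlin"],
    "temp": [21.5, 25.0, 18.2],
}

# Analytic query "average temperature":
avg_row = sum(r["temp"] for r in rows) / len(rows)          # scans whole records
avg_col = sum(columns["temp"]) / len(columns["temp"])       # reads one column only

assert avg_row == avg_col  # same answer; the columnar read touches less data
```

Contiguous columns of one type are also what enables the strong compression and encoding mentioned above.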
- Apache ORC (Optimized Row Columnar)
- Columnar storage format (like Parquet)
