ML Exam: 3 - SageMaker AI
SageMaker JumpStart vs. Bedrock
1) Bedrock: Best for serverless, API-driven access to FMs (Anthropic, Cohere, Meta, Amazon). No infrastructure to manage.
2) SageMaker JumpStart: Best for complete control. Hub to deploy, fine-tune, and host open-source models (like Llama or Mistral) on dedicated SageMaker instances.
SageMaker Concepts:
SageMaker Developer Products:
SageMaker AI = Fully managed service that automates the ML lifecycle (from data prep to production) and handles the infrastructure to streamline building, tuning, and deploying models. No-code options are available. Capabilities: Predictive analytics, computer vision, NLP, and fraud detection. Steps to Start: 1) pick a labeled S3 dataset in CSV, Parquet, or another supported format, 2) set the algorithm, 3) set hyperparameters, 4) pick compute resources (e.g., instance type), 5) run the training job by picking a pre-built container.
Custom container deployment steps: 1) Package the model as a tar.gz file. Upload the file to S3. 2) Build Container: Create a Dockerfile containing Python, scikit-learn, and an HTTP server wrapper. 3) Push to ECR: Authenticate your local Docker client to AWS, tag the image, and push it to ECR. 4) Deploy Endpoint: Create a Model pointing to the S3 bucket path and the ECR image URI, then deploy it to a live endpoint.
SageMaker Input Modes: (for training data from S3)
- "File Mode" - Default. Copies the entire S3 dataset onto the training instance's local EBS volume (mounted into the Docker container), then starts training. Slow to start when the training dataset is huge!
- "Pipe Mode" - Streams data from S3 straight to the training algorithm. Data is not stored on the training instance's local storage. Largely superseded by Fast File mode.
- "Fast File Mode" - Streams data on demand. Training begins before all the data has been downloaded, which decreases startup time. Combines the benefits of File and Pipe modes: exposes S3 objects as whole files (like File mode) while streaming the data to the algorithm (like Pipe mode). Supports random access, but works best with sequential access.
- S3 Express One Zone = Fast storage class in one AZ. Combines with any S3 input mode (File, Pipe, or Fast File)
- FSx for Lustre = Scales to high performance (hundreds of GB/s of throughput and millions of IOPS) with low latency, Single AZ, Requires VPC
- EFS = Requires VPC.
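As a rough illustration of where the input mode is chosen, the sketch below builds a single training channel in the shape used by the `InputDataConfig` list of a CreateTrainingJob request. The helper name `make_channel`, the bucket, and the prefix are assumptions made up for this example; only the dictionary keys and the three mode names come from the SageMaker API.

```python
# Sketch of one training channel for a CreateTrainingJob request.
# The bucket, prefix, and channel name below are hypothetical placeholders.

VALID_INPUT_MODES = ("File", "Pipe", "FastFile")


def make_channel(name, s3_uri, input_mode="File"):
    """Build one InputDataConfig entry that reads everything under an S3 prefix."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,  # File (default), Pipe, or FastFile
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Fast File mode: training can start before the whole dataset is downloaded.
train_channel = make_channel(
    "train", "s3://example-bucket/train/", input_mode="FastFile"
)
print(train_channel["InputMode"])
```

Switching between modes is then a one-word change, which makes it easy to compare startup time for a large dataset.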
Main SageMaker Algorithms
Regression & Classification (Tabular Data):
Linear Learner = SL. For classification and regression. Trains several model variants in parallel and keeps the best one. Parameters: 1) Increasing the "Target Precision" parameter reduces false positives.
XGBoost = SL. Gradient-boosted decision trees. Parameters: 1) max_depth controls tree complexity. 2) Increasing reg_lambda (L2 regularization) counteracts overfitting.
Factorization Machines = SL. Good for click-through-rate (CTR) prediction and high-volume data. Models interactions between features and is effective for sparse datasets, such as those in recommendation systems.
K-Nearest Neighbors (KNN) = SL. Classification (common) or, more rarely, regression. Classifies a data point by how similar its features are to those of nearby points (neighbors). Classification returns the most common label among the k neighbors; regression returns the average of their values.
Object2Vec = Learns embeddings for pairs of objects (words, customer IDs, tokens) and scores how similar each pair is (e.g., 1 = similar, 0 = not similar).
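To make the K-Nearest Neighbors idea concrete, here is a minimal pure-Python sketch of classification by majority vote among the k closest points. It is not the SageMaker built-in algorithm (which adds sampling and dimensionality reduction); the function name `knn_classify` and the toy data are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    # Pair every training point's distance to the query with its label.
    by_distance = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_classify(points, labels, (2, 2)))    # neighbors are all "low"
print(knn_classify(points, labels, (7.5, 8)))  # neighbors are all "high"
```

For regression, the same neighbor search would be followed by averaging the neighbors' numeric targets instead of voting.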
UL and Clustering:
K-Means = UL. Think of an unplanned event: 1) K = number of leaders (centroids) for people to cluster around, 2) each data point finds its closest leader, and 3) each leader moves to the mean (center) of its group. Steps 2 and 3 repeat until the groups stop changing. Good for: finding hidden patterns in unlabeled data, customer segmentation, risk grouping, and pattern discovery.
Principal Component Analysis (PCA) = UL. Think: looks for Patterns, Compressing the data (reducing the dimensions), on Anonymous data (so UL). PC1 = main trend of the points, PC2 = perpendicular sub-trend. Ex: Does not care about labels (of "height" and "weight"), but rather creates a single dimension of size (the main trend), which is PC1. Then tracks variation that is not explained by size (say "body shape"), which is PC2. Only cares about where the data is most spread out (variance). Principal Components = new, independent axes (directions) that rank the data's most important trends (patterns) from highest to lowest spread (variance).
Random Cut Forest = UL. Anomaly detection. Good for identifying outliers or unusual behavior.
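The three K-Means steps above (pick K leaders, assign each point to its closest leader, move each leader to the mean of its group) can be sketched in a few lines of plain Python. This is an illustrative toy, not the SageMaker K-Means implementation; the function names, the fixed starting centroids, and the sample points are assumptions for the example.

```python
import math


def assign(points, centroids):
    """Step 2: put each point in the group of its closest centroid (leader)."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, centroids, max_iterations=100):
    """Repeat assign/move until the centroids stop changing."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        groups = assign(points, centroids)
        # Step 3: each leader moves to the mean of its group
        # (an empty group keeps its old centroid).
        moved = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if moved == centroids:
            break
        centroids = moved
    return centroids, assign(points, centroids)


# Step 1: K = 2 leaders, started at hand-picked positions.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_groups = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print([len(g) for g in final_groups])
```

Real implementations usually pick the starting centroids randomly (or with a smarter seeding scheme), so results can vary between runs.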
NLP & Topic:
BlazingText = Very fast word embedding (Word2Vec) and text classification engine, optimized for multi-core CPUs and GPUs. Can generate Word2Vec vectors or categorize text items (e.g., web queries, sentiment tags) at scale.
Neural Topic Model (NTM) = UL. Organizes large collections of text documents into distinct topics. It learns hidden word associations without requiring pre-existing labels.
Latent Dirichlet Allocation (LDA) = UL. NLP. Think: Dirichlet is a lazy (so UL) bible reader that looks through text (so NLP), finds different topics, and describes each document as a mixture of those topics.
Sequence-to-Sequence (Seq2Seq) = SL. Neural network that maps an input sequence of tokens directly to an output sequence. Good for translation, text summarization, and speech-to-text workflows.
Vision:
Image Classification = Assigns one or more labels to a whole image using a deep CNN (ResNet). Supports transfer learning from pre-trained weights or full training from scratch.
Object Detection = Identifies, bounds, and classifies multiple distinct objects inside a single image. It produces pixel-coordinate bounding boxes, each with a class label and a confidence score.
Semantic Segmentation = Pixel-level classification: tags every individual pixel in an image with a class. Good for autonomous driving or medical imaging.
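Time Series Forecasting: DeepAR = SL. RNN-based forecasting of one-dimensional time series. Needs historical data and works best when trained on many related series. Optimized for predicting future values.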
Other:
IP Insights = UL. Learns usage patterns of IPv4 addresses and associates them with entities like user IDs, so unusual address/entity pairs can be flagged.
SageMaker AI is the “heart” of the MLA-C01 certification
- The majority of exam questions will be with SageMaker.
- It is important to understand and discern between SageMaker Processing, SageMaker Training, and SageMaker Hosting, which all cover different aspects of the end-to-end ML process.
- These notes first cover generic ML knowledge and concepts, and then their implementation in AWS (usually involving SageMaker and other AWS services).
- Some open-source Apache services like Hadoop or Spark are also covered, since they are also popular in ML environments and are well supported in AWS
- It is a good idea to review the high-level overview of SageMaker that was done in the foundational AIF-C01 certification. MLA-C01 builds on top of that knowledge.
- AWS service that can handle the whole End-to-End process in ML
- Data processing, model training, model deployment, and model hosting
- Tons of features and sub-products (will go into depth in these notes)
- SageMaker Training and Deployment Architecture
- Input/output data usually in S3, but could be in other data stores
- Training and inference code must be inside container images registered in ECR
- Not all ML models will be deployed to endpoints.
- Data Preparation (data prep)
- Data usually comes from S3
- Data can also come from Athena, EMR, Redshift, Amazon Keyspaces DB…
- Integration with Apache Spark
- Data Processing
- Processing job: copy raw data from S3 → Spin up processing container → Output processed data to S3
- Container can be SageMaker built-in or user provided (code)
- Training
- Training job requires
- URL of S3 bucket with training data
- ML compute resources
- URL of S3 bucket for output → Model outputted to S3
- Container (ECR) path to training code
- Many training options available
- Built-in algorithms, Spark MLlib, TensorFlow, PyTorch, scikit-learn, XGBoost, Hugging Face, your own Docker image, AWS Marketplace-purchased algorithms…
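The four training job requirements listed above map fairly directly onto the low-level CreateTrainingJob request. The sketch below only builds the request dictionary so that it runs without AWS credentials; the helper name, account ID, image URI, role ARN, bucket names, instance type, and timeout are all placeholder assumptions. The IAM role is not mentioned in the notes above but is a required field of the API.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge"):
    """Assemble a CreateTrainingJob request from the four required pieces."""
    return {
        "TrainingJobName": job_name,
        # Container (ECR) path to the training code
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # URL of the S3 location with training data
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # URL of the S3 location where the model artifact is written
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # ML compute resources
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


request = build_training_job_request(
    job_name="example-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
print(sorted(request))

# With real resources in place, the request could be submitted with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

The SageMaker Python SDK wraps the same request in higher-level estimator objects, so this low-level form is mainly useful for understanding what a training job actually needs.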
- Deployment
- 2 ways:
- Persistent endpoint for individual predictions/inference on demand
- SageMaker Batch Transform for predictions of an entire dataset
- Many cool options: inference pipelines, SageMaker Neo (edge devices), Elastic Inference, automatic scaling, shadow testing…
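For the persistent endpoint option, on-demand inference goes through the `invoke_endpoint` call of the `sagemaker-runtime` client. The sketch below wraps that call in a small helper and exercises it with a fake client so it runs offline; the endpoint name, the CSV payload, and the fake response values are invented for illustration, and a real model defines its own input and output formats.

```python
import io


def predict_csv(runtime_client, endpoint_name, rows):
    """Send rows as CSV to a real-time endpoint and return the raw text response."""
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body.encode("utf-8"),
    )
    # The response body is a stream that must be read.
    return response["Body"].read().decode("utf-8")


class FakeRuntimeClient:
    """Offline stand-in for boto3.client("sagemaker-runtime"); returns made-up scores."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        return {"Body": io.BytesIO(b"0.91\n0.08")}


client = FakeRuntimeClient()
result = predict_csv(client, "example-endpoint", [[5.1, 3.5], [6.2, 2.9]])
print(result.splitlines())

# Against a deployed endpoint, pass a real client instead:
#   import boto3
#   predict_csv(boto3.client("sagemaker-runtime"), "my-endpoint", rows)
```

Batch Transform needs no such call per record: it reads a whole dataset from S3 and writes the predictions back to S3.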
- SageMaker Domain: organizational unit within SageMaker → organize users, apps, and resources
- A domain must be configured before you can use SageMaker Studio and its apps!
- Think of it as an isolation of an ML project
- Each domain has one EFS volume
- Each user has their private EFS directory within that volume
- There's a shared EFS directory available to all users
- User profile: represents an individual user/person in a domain
- Can create own personal apps
- Can spin up private SageMaker Studio instances
- Has access to a private EFS directory to store personal files
- Shared resources across all users:
- Shared spaces
- Shared EFS directory
- Communal IDE app (SageMaker Studio public to all users)
Network Configuration in SageMaker Domain
- By default, a domain has two VPCs
- One with public internet access → can expose public endpoints for your domain
- Managed by SageMaker
- Optional → can select “VPC Only” when creating the domain, which means this managed VPC is NOT created
- One for private traffic
- Encrypted traffic to domain's EFS volume
- YOU manage it: must specify the VPC, its subnets, and security groups (SGs)
Interfaces for Using SageMaker
SageMaker Notebooks
- Old/classic method for ML in SageMaker → ML code
- Spin up EC2 instances to host ML Notebooks, which direct ML E2E process:
- S3 data access
- ML code in Jupyter Notebook
- Libraries like scikit-learn, NumPy, pandas, Apache Spark, TensorFlow, etc. at your disposal
- Wide variety of built-in models
- Can spin up training instances
- Can deploy trained models for making predictions (inferring) at scale
SageMaker SDKs
- Training and deployment of ML models via Python scripts
- Python API libraries → import inside your code
- Boto3 (low-level API)
- SageMaker Python SDK (high-level API)
- Can automate ML workflows, manage training jobs, deployments, and pipelines
SageMaker Studio
- Web-based IDE for E2E ML development
- Features: team collaboration, tune and debug ML models, deploy ML models, automated workflows
SageMaker Console UI
- AWS Management Console interface for SageMaker
- GUI for managing SageMaker resources
- Mostly for administrative tasks
- Can access all other interfaces from the console UI
SageMaker JumpStart
- ML hub with many pre-trained ML models and pre-built ML solutions. Offers one-click deployment of models for inference. End-to-end solutions for common business problems.
- Computer Vision (CV) models, Natural Language Processing (NLP) models, GenAI Foundation Models (FMs)…
- Amazon-owned models or 3rd-party provider models
- Provider examples: Hugging Face, Databricks, Meta…
SageMaker Canvas (was Data Wrangler)
- Canvas became the unified no-code workspace for both data prep and model building. Data Wrangler was integrated directly into Canvas. No-code ML for business analysts.
- Features
- Build custom ML model (leverages AutoML powered by SageMaker Autopilot)
- e.g. Upload tabular data (such as a CSV file), select column to predict & build model
- Automatic data cleaning (leverages Data Wrangler)
- Access ready-to-use models from AWS AI services (Rekognition, Comprehend…)
- GenAI support via Bedrock or JumpStart FMs
- Import, preview, visualize, transform data… in a visual UI
- Even “Quick Model”
- Can also export data flow
- Many feature engineering capabilities (transform images, balance data, impute missing data, handle outliers, PCA…)
- Troubleshooting:
- SageMaker Studio should have correct IAM roles/permissions
- Data sources should allow access (e.g. AmazonSageMakerFullAccess policy)
- EC2 instance limit: “The following instance type is not available…” error → usually a service quota problem → request a quota increase for that instance type
SageMaker Ground Truth
Human labelers: Mechanical Turk workers, your employees, or third-party vendors. Human feedback can also be used for RLHF. Ground Truth trains its own model as humans label data, so only the items the model isn't sure about are sent to human labelers (can reduce manual labeling work by up to 70%).
Ground Truth Plus: Turnkey solution. AWS experts manage the whole workflow. Fill out a form; experts contact you, discuss pricing, and manage the labelers. Do NOT confuse with Amazon Augmented AI (A2I)!
Ground Truth is for human labeling, while A2I is for human oversight of a trained model's predictions. However, SageMaker Ground Truth and A2I can use the same human workforce! Benefits: consistency, efficiency, flexibility.
Other ways to generate labels: Rekognition, Comprehend, etc. Some pre-trained models or unsupervised techniques can also be helpful.
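The idea that only the items the model isn't sure about go to human labelers can be pictured as a simple confidence cutoff. The sketch below is a conceptual illustration only, not how Ground Truth is implemented internally; the function name, the 0.9 threshold, and the sample predictions are all assumptions.

```python
def route_for_labeling(predictions, confidence_threshold=0.9):
    """Split model predictions into auto-labeled items and items for human review.

    predictions: iterable of (item_id, predicted_label, confidence) tuples.
    """
    auto_labeled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_labeled.append((item_id, label))  # model is confident: keep its label
        else:
            needs_human.append(item_id)            # model is unsure: ask a person
    return auto_labeled, needs_human


# Made-up predictions from a hypothetical labeling model.
sample = [("img-1", "cat", 0.98), ("img-2", "dog", 0.55), ("img-3", "cat", 0.91)]
auto_labeled, needs_human = route_for_labeling(sample)
print(auto_labeled)
print(needs_human)
```

As human labels come back, they can be added to the training set, so the model improves and fewer items fall below the cutoff over time.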
SageMaker Feature Store
- Centralized portal for features. Offers fast, secure access to feature data for ML models.
- Data ingestion via streaming or batch
- Feature Store ingests data from streams with the PutRecord API
- Online store (model gets features with the GetRecord API)
- Offline store (Feature Store writes data to S3, AWS Glue creates a Data Catalog, models can then access features via batch access)
- Security
- encryption at rest (KMS…) and in transit
- IAM, PrivateLink…
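A minimal sketch of the PutRecord / GetRecord flow against the online store follows. The request shapes (`FeatureGroupName`, `Record`, `RecordIdentifierValueAsString`) follow the `sagemaker-featurestore-runtime` client, while the helper names, the feature group, the feature names, and the in-memory stand-in client are assumptions so the example runs without AWS.

```python
def to_feature_record(features):
    """Convert a plain dict into the FeatureName/ValueAsString list PutRecord expects."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]


def write_features(fs_runtime, feature_group_name, features):
    """Ingest one record (streaming-style write)."""
    fs_runtime.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_feature_record(features),
    )


def read_features(fs_runtime, feature_group_name, record_id):
    """Fetch the latest record for one identifier from the online store."""
    response = fs_runtime.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(record_id),
    )
    return {f["FeatureName"]: f["ValueAsString"] for f in response.get("Record", [])}


class InMemoryFeatureStore:
    """Offline stand-in for boto3.client("sagemaker-featurestore-runtime")."""

    def __init__(self, id_feature):
        self.id_feature = id_feature  # name of the record identifier feature
        self.rows = {}

    def put_record(self, FeatureGroupName, Record):
        values = {f["FeatureName"]: f["ValueAsString"] for f in Record}
        self.rows[(FeatureGroupName, values[self.id_feature])] = Record

    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString):
        record = self.rows.get((FeatureGroupName, RecordIdentifierValueAsString))
        return {"Record": record} if record else {}


store = InMemoryFeatureStore(id_feature="customer_id")
write_features(store, "example-customers", {
    "customer_id": 42,
    "avg_basket": 31.5,
    "event_time": "2024-01-01T00:00:00Z",
})
features = read_features(store, "example-customers", 42)
print(features)
```

Note that values travel as strings, so the caller converts them back to numbers before feeding them to a model.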
SageMaker Model Monitor
Detects data drift, model drift, and bias drift in deployed ML models. Continuously compares incoming data against a baseline dataset captured during training. When it detects drift beyond the configured thresholds, it publishes metrics to Amazon CloudWatch, where alarms can be set. An alarm can trigger an AWS Lambda function (for example through Amazon EventBridge), which is a common way to automate workflows such as model retraining. The Lambda function then starts a SageMaker Pipeline, firing a retraining job with updated data.
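The Lambda step in this retraining loop can be sketched as below, assuming a pipeline already exists. `start_pipeline_execution` is the boto3 SageMaker client call; the pipeline name, the `InputDataUri` pipeline parameter, the S3 path, and the fake client used for the offline demo are all hypothetical.

```python
PIPELINE_NAME = "example-retraining-pipeline"    # hypothetical pipeline name
NEW_DATA_URI = "s3://example-bucket/latest/"     # hypothetical location of updated data


def start_retraining(event, sagemaker_client,
                     pipeline_name=PIPELINE_NAME, data_uri=NEW_DATA_URI):
    """Start one pipeline execution in response to a drift alert event."""
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDescription="Triggered by a Model Monitor drift alert",
        # Assumes the pipeline defines a parameter named InputDataUri.
        PipelineParameters=[{"Name": "InputDataUri", "Value": data_uri}],
    )
    return {"statusCode": 200, "executionArn": response["PipelineExecutionArn"]}


def lambda_handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the sketch is testable offline."""
    import boto3
    return start_retraining(event, boto3.client("sagemaker"))


class FakeSageMakerClient:
    """Offline stand-in that records the call and returns a made-up ARN."""

    def start_pipeline_execution(self, **kwargs):
        self.kwargs = kwargs
        return {"PipelineExecutionArn": "arn:aws:sagemaker:us-east-1:123456789012:"
                                        "pipeline/example-retraining-pipeline/execution/abc123"}


fake = FakeSageMakerClient()
outcome = start_retraining({"source": "aws.cloudwatch"}, fake)
print(outcome["statusCode"], fake.kwargs["PipelineName"])
```

Keeping the AWS call in a function that takes the client as an argument makes the handler easy to unit test before wiring it to an alarm.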




