ML Exam: 3 - Sagemaker AI

 



 SageMaker JumpStart vs. Bedrock

Choose between JumpStart and Bedrock for GenAI and FMs:
  1) Bedrock: Best for serverless, API-driven access to FMs (Anthropic, Cohere, Meta, Amazon). No infrastructure to manage.
  2) SageMaker JumpStart: Best for complete control. Hub to deploy, fine-tune, and host open-source models (like Llama or Mistral) on dedicated SageMaker instances.

Sagemaker Concepts:

SHAP baseline - To explain how feature contributions change over time, SageMaker needs a point of comparison, which is the SHAP baseline. SHAP (SHapley Additive exPlanations) comes from cooperative game theory. It breaks down a model's prediction and assigns an "importance score" (SHAP value) to each input feature, quantifying each feature's contribution to that prediction.
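To make the role of the baseline concrete, here is a minimal sketch of exact Shapley values computed against a baseline. It is an illustration of the math only, not how SageMaker Clarify is implemented; `toy_model`, `x`, and `baseline` are hypothetical, and exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with an interaction term between two features.
def toy_model(v):
    rooms, area = v
    return 2 * rooms + 3 * area + rooms * area


x = [3.0, 2.0]         # the prediction being explained
baseline = [1.0, 1.0]  # the point of comparison (the "SHAP baseline")
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values always add up to the difference between the prediction and the baseline prediction, which is why changing the baseline changes every feature's score.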
Shadow Testing = evaluates a new model against the production model with minimal operational overhead. Traffic is routed to multiple models without managing more endpoints. With a shadow variant, the new model receives a copy of live traffic but does not affect production responses. Latency, accuracy, and error-rate metrics are compared directly against the current model using CloudWatch metrics. Natively supported.
    Serverless endpoints scale independently in a fully serverless manner. They fit a workload when its memory requirement is within the 6 GB memory limit and its concurrency is within the 200 maximum-concurrency limit of serverless endpoints.
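The serverless limits above can be checked before an endpoint config is created. The sketch below builds the `ServerlessConfig` dictionary in the shape used by the boto3 `create_endpoint_config` call, without calling AWS. The allowed memory sizes and the 200 concurrency cap are assumptions taken from AWS documentation at the time of writing and may change; `serverless_config` is a hypothetical helper name.

```python
# Assumed limits (check the current AWS docs before relying on them).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def serverless_config(memory_mb, max_concurrency):
    """Validate the limits and return a ServerlessConfig-shaped dict."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory must be one of {ALLOWED_MEMORY_MB} MB")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(f"max concurrency must be 1..{MAX_CONCURRENCY_LIMIT}")
    return {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}


config = serverless_config(6144, 200)
print(config)
```

A model that needs more than 6 GB of memory fails this check, which is the signal to pick a real-time or asynchronous endpoint instead.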

Sagemaker Developer Products: 

(in ML frameworks layer)

SageMaker AI = fully managed service that covers the ML lifecycle (from data prep to production) and handles the infrastructure for building, tuning, and deploying models; no-code options are available (e.g., Canvas). Capabilities: Predictive analytics, computer vision, NLP, and fraud detection. Steps to Start: 1) pick a labeled S3 dataset (CSV, Parquet, or other), 2) set the algorithm, 3) set hyperparameters, 4) pick compute resources (e.g., instance type), 5) run the training job by picking a pre-built container.

Features: auto-training and integrated logging. Secured via IAM. Integrates with S3 (storage; data formats include Apache Parquet and RecordIO-protobuf), Lambda (triggers), CloudWatch (monitoring), and API Gateway (endpoints). "Managed Warm Pools" keep training instances warm, so the next job can start training immediately.
    Has Scaling Policies: 1) Target Tracking Scaling (holds a benchmark such as CPU %), 2) Step Scaling (thresholds at different tiers), 3) Scheduled Scaling (at set times). Scale-out and scale-in cooldowns limit how quickly capacity changes again.
   Has Input Modes discussed below: 1) File Mode (default), 2) Pipe Mode (largely superseded by Fast File Mode), 3) Fast File Mode, 4) S3 Express One Zone, 5) FSx for Lustre, 6) EFS.
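   To train on Spot Instances (Managed Spot Training), set a max wait time and checkpoint to S3 so interrupted jobs can resume. A hyperparameter tuning job requires a) algorithm, b) hyperparameter ranges, and c) an objective (performance) metric.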
   Has "Network Isolation" mode for preventing data exfiltration.   
   "Bring Your Own Container" approach - However, requires Docker image creation, container configuration, and deployment. Use SDK to call the TensorFlow or other model in the custom container.
   If super large, then might just mount the volume that holds S3 data.
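The target tracking and step scaling policies listed above can be sketched as simple rules. This is a simplified model under stated assumptions: real Application Auto Scaling also applies cooldowns and alarm evaluation periods, which are not modeled here, and both function names are hypothetical.

```python
import math


def target_tracking_desired(current_instances, metric_per_instance,
                            target_value, min_capacity, max_capacity):
    """Scale so the per-instance metric returns to the target value."""
    total_load = current_instances * metric_per_instance
    desired = math.ceil(total_load / target_value)
    # Clamp to the registered min and max capacity.
    return max(min_capacity, min(max_capacity, desired))


def step_scaling_adjustment(metric_value, steps):
    """steps: (lower_bound, adjustment) pairs sorted ascending; highest matching tier wins."""
    adjustment = 0
    for lower_bound, step_adjustment in steps:
        if metric_value >= lower_bound:
            adjustment = step_adjustment
    return adjustment


# 2 instances at 150 invocations each, target of 100 per instance.
print(target_tracking_desired(2, 150, 100, min_capacity=1, max_capacity=10))
# CPU at 85% with tiers at 60%, 80%, and 95%.
print(step_scaling_adjustment(85, [(60, 1), (80, 2), (95, 4)]))
```

Target tracking needs only one number (the target), while step scaling needs a tier table, which is why target tracking is usually the simpler choice.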

    Autopilot = fully managed AutoML service. Automates building, training, and tuning ML models. Good for quickly exploring different models and algorithms to find the best one for your task. Uses Clarify (SHAP values) to explain how ML models make predictions. Automatically finds the best hyperparameters.
    Auto Model Tuning/Hyperparameter Tuning = UI location: Training tab - Hyperparameter tuning section - add tuning job. Steps: 1) Job (IAM role, VPC settings, and min/max metric), 2) algorithm or custom script, 3) tuning limits and search strategies with option to Autotune. Only works with ML and not data prep or data analysis. Adjusts learning rate and batch size.
    Canvas/Data Wrangler = No-code ML tool to create predictions. Has AutoML piece that is focused on models. Changes data using feature engineering, such as: 1) rebalance data by random undersampling, 2) random oversampling, 3) Synthetic Minority Oversampling Technique (SMOTE), 4) transform categorical data into numeric data, 5) replace NULL or missing values with mean, median, or interpolated values. The Canvas user must have permissions to access the S3 bucket of the model. Requires the model to be registered in Model Registry.
    Features: 1) has "Quick Model" to quickly get a Feature Importance Plot by building and evaluating an ML model. 2) Auto trains a model on the prepared dataset and creates a "Feature Importance Plot" that shows each feature's contribution to the prediction. 3) has "one-hot encoding" for categories. If a feature dominates because of its large numeric range, normalize it with Min-Max scaling. 4) has "Similarity Encode" to handle misspellings. 5) has "Corrupt Image" transform (impulse noise option). 6) "Featurize date/time" transform extracts day, month, weekday, etc. from a timestamp.
    UI: 1) Line plots show how numbers change over time or across ordered categories. 2) Histograms show numerical data distribution.
    forecasting algorithms:
       parametric: ARIMA, DeepAR+ (time-series), ETS, and LSTNet.
       non-parametric: CNN-QR (predicts quantiles), NPTS, and Prophet (time-series).
    Might have to use Pipelines instead if super large. 
    Clarify = evaluates FMs for accuracy, bias, robustness, and toxicity and creates reports with the results. Helps with transparency and explainability. Monitors for bias drift. Can check for class imbalance. Partial dependence plots (PDPs): show how the predicted outcome changes as a feature changes. Difference in proportions of labels (DPL): measures the imbalance of positive outcomes between different facet values.
    Code Editor = IDE in Studio based on Code-OSS (the open-source core of VS Code).
    Console = main screen. Under Network Options or Algorithm Options section when creating a Training, Processing, or Tuning Job, check "Enable Network Isolation" to block internet access and external network access.    
    Debugger = debugs training jobs. Built-in rules monitor in real time for Vanishing Gradients, underutilized GPUs, and Overfitting (watches the loss curve), triggering alerts or actions (such as reducing the learning rate) when thresholds are exceeded. Create a hook in the training script to capture and log tensors and get insights into the model's performance. If Debugger shows low GPU use, adjust the batch size and data loading pipeline so the GPU is used efficiently.
    DeepAR model = time-series forecasting model
    Deploy locally trained models = Steps: 1) Serialize Model: Save scikit-learn model and compress to .tar.gz file. Upload file to S3. 2) Build Container: Create  Dockerfile containing Python, scikit-learn, and an HTTP server wrapper. 3) Push to ECR: Authenticate your local Docker client to AWS, tag the image, and push it to ECR. 4) Deploy Endpoint: Create a Model pointing to the S3 bucket path and the ECR image URI, then deploy to a live endpoint.
    Endpoints = fully managed hosting for ML models, reached via HTTPS URLs. Deploy your model to an endpoint, then let your apps send data to the model and receive predictions. Can have auto scaling policies based on CloudWatch metrics to adjust instances dynamically. Features: 1) call "Update Endpoint" to auto blue/green deploy. 2) can set min and max capacity via auto scaling. 3) can set "Max Concurrency" to 1 to be a singleton. 4) for multiple production variants, create an endpoint config, set Production Variants, add multiple target models to it, optionally set Infrastructure => Traffic Split for load balancing, then select the new endpoint config in the endpoint. 5) use "Elastic Inference accelerators" to attach a small amount of GPU acceleration to CPU-based instances. 6) use the "Invocations Per Instance" metric to add more instances at peak times. 7) increase "Desired Weight" from 0 to send more traffic to a variant.
       Types:
          1) Real-Time Inference endpoint: for low-latency responses. Max payload is 6 MB. Max response timeout 60 sec. Supports Multi-Model and Multi-Container endpoints.
          2) Asynchronous endpoint: large payloads or long-running requests. Max payload is 1 GB. Max processing time is 1 hour. Supports Scale-to-Zero. Auto scales resources based on the workload. Ideal for large-scale processing tasks (like video processing). Inference Recommender suggests alternative instance types and configurations; it does not assess model quality.
          3) Serverless: intermittent traffic or long inactive periods. Max payload is 4 MB. Max response timeout 60 sec. Only pay for compute used, billed by the millisecond. "Provisioned Concurrency" keeps a minimum number of instances warm and avoids cold-start overhead.
          4) Batch: massive or long offline jobs. Runs inference over a whole dataset asynchronously; only pay for compute used.
          5) VPC endpoint. Security: when the VPC endpoint is created, the auto-created ENI is governed by a security group. You can add a VPC endpoint policy to allow access for IAM users.
    Experiments = Track the different configurations, hyperparameters, and algorithms used in each experiment. You can see the visual results in SageMaker AI Studio. "Trial" is a single training run with a specific configuration. "Trial components" are individual steps or artifacts in a trial (e.g., metrics, outputs) for different workflow stages such as data processing or model training. "Trackers" auto log hyperparameters, datasets, metrics, and code changes for each trial.
    Feature Store = stores a model's features/variables and shares them with the team. Steps: 1) create feature group, 2) ingest, 3) access the store to build datasets for training. Create an offline store for batch and/or an online store for real-time inference. The 'GetRecord' API always returns the latest feature version.
    Ground Truth = data labeling service that creates high-quality training datasets. It labels the easier items automatically and sends the most difficult ones (hardest 30%) to human labelers, including crowd-sourced workers. Has a special computer vision labeling section.
     Horovod distributed framework = Popular OSS distributed DL framework. Efficient scaling of TensorFlow training across multiple GPUs. User-friendly.
     HyperPod = managed infrastructure service for accelerated distributed training and fine-tuning of FMs. 
     Inference or Deployment options = deploys trained models as hosted services with 4 options: 1) Real-Time: Good for low-latency, but has low payload. 2) Serverless: Good for intermittent traffic with idle periods. 3) Asynchronous: Good for large payloads or long-running. 4) Batch Transform: Good for offline massive datasets. Works for bulk data, scheduled. Not for real-time.
    Inference Recommender = recommends instance types and endpoint configurations for inference.
    JumpStart = Gen AI. ML hub with 100s of FMs and pre-built ML solutions (vision, NLP, and tabular data) deployable with a few clicks. Not a low code option. Created code should feature threat detection and data protection. Supports evaluation types such as automatic model evaluation. Can restrict FMs. Comes with pre-built pipelines for the whole ML process.
   Lineage Tracking = auto records data lineage, model parameters, and artifacts across the workflow. Good for audit trails, governance, and compliance verifications, tracking the exact lifecycle of a model from raw data to deployment. Better than Experiments if already in Sagemaker since gives better workflow management via DAG.
   Managed Warm Pools = keeps provisioned ML compute instances active and "warm" for a Keep-Alive period after a model training job completes.
    MLOps  = DevOps for ML
    Model Dashboard = sharing team info on production model behavior in one place.
      Model Cards = your model property info for documentation purposes
      Model Registry = stores, manages, and tracks your model versions through the ML deployment lifecycle. Hierarchy: 1) collections and 2) model groups to catalog the models. You must give unique tags for each model version. Really a cataloging step, not a deployment step. Feature: "Model Approval".
    Model Invocation Logging = you can turn this on.
    Model Monitor = monitors production models for data drift (missing values and outliers in new data, bias drift from changing data patterns, "concept drift" in customer data, and deviations from the baseline training data), model drift, model quality (performance) loss, and feature attribution drift (model explainability). If feature drift occurs, alert the team and retrain the model with updated data so feature importance remains balanced. A new model needs a new baseline. For deployed models, not for training/evaluation metrics. To measure model quality (accuracy, precision, recall, F1 score), you must merge predictions with ground truth labels. Steps for other AWS accounts and models: 1) create a "model group", 2) create a new AWS account to act as the central catalog, 3) attach a cross-account resource policy to each model group in the other AWS accounts.
    Neo - for edge devices. Compiles the model into a platform-specific format, reducing its size and improving performance.
    Neural Topic Model (NTM): to automate blog post tagging.
    Notebook = WARNING - different from Studio Notebook!!! Older version, not for collaboration. Accessed by short-lived presigned URL via CLI or SDK. Instances: 1) General Purpose, 2) Memory Optimized, 3) Compute Optimized, and 4) Accelerated Computing. Rights for individual: 1) Create a single IAM role with permissions or a policy to access the notebook and read the S3 data. 2) Attach it to each notebook instance. 3) If the S3 data is encrypted with KMS, grant the IAM role rights in the KMS key policy. Rights for shared notebook: use IAM groups to manage permissions for multiple users. Rights for no internet: put it on a private subnet in a VPC with S3 and/or SageMaker VPC endpoints inside it. Rights On Creation: when you create a notebook, its EC2 instance and attached EBS volume live in an AWS-owned service account, not directly in your AWS customer account.
      Lifecycle Config = Create config with on-create or on-start scripts for a notebook.
    Pipelines = serverless, purpose-built MLOps and LLMOps workflow orchestration service. DAG defined in JSON or Python. Once a model is registered in Model Registry, the process waits until it has "Approved" status.
     Steps for other pipelines: 1) pipeline definition such as name, data pre-processing, model training, and model registration, 2) steps for each action, 3) parameters for step configuration. Steps to Code Pipeline: 1) define steps in this, 2) add explicit step to this in Code Pipeline, 3) trigger via Code Commit, 4) monitor and validate. CI/CD steps: 1) S3 event triggers pipeline, 2) CodeBuild processes/trains model using data in S3, 3) Model registered in SageMaker Model Registry.  
    Supports batch transforms to run inference on entire datasets in the cheapest manner when access is only periodic. Can automate the whole ML workflow. Can trigger retraining pipelines whenever data drift is detected or when new data becomes available. Can auto register a new model in Model Registry and then trigger deployment to a real-time endpoint based on predefined steps. It looks at Model Lineage for formal details.
       Pipeline Parameters = create variables that can be overridden when running pipeline (without modifying the pipeline).
       Pipeline Steps = define actions in the pipeline, but do not offer parameter flexibility. 
    Processing = fully managed feature designed to run data engineering, feature engineering, data validation, and model evaluation workloads at scale. Steps: 0) Triggers prep script, 1) spins up instance, 2) reads from S3, Athena, or Redshift. 3) spins up Docker or custom container. 4) Runs script, 5) writes output back to S3, 6) auto teardown of temp EC2 instances.
    Profiler = focus is on profiling and visualization, not changing hyperparameters. Import necessary modules and add start and stop profiling commands in PyTorch training script to track performance and resource utilization.  Monitor: 1) GPU utilization and adjust the data pipeline if GPU underutilization is detected, 2) disk IO to prevent bottlenecks in data loading. By adding start and stop commands to the training script, you can collect detailed performance metrics, including CPU and GPU use, memory usage, and I/O operations.
    Purchase Provisioned Throughput = dedicated capacity bought in Amazon Bedrock. Required for using custom models.
    Python SDK Estimator = Define the profiler config, specifying profiling duration in seconds for the CPU and GPU. can have endpoints.
    Role Manager = define min. permissions
    Script Mode = gives pre-configured framework containers (e.g., Python with PyTorch) that run your existing custom training and inference scripts, which helps reuse. Can train custom ML models.
    SDK
       Estimator = high-level Python interface
    Serverless Inference = built for infrequent, intermittent, or predictable workloads (such as a single nightly run). Automatically starts compute for requests and scales down to zero when not in use. Setting the Max Concurrency parameter to 1 limits the endpoint to a single concurrent container invocation.
    Studio = web-based IDE for ML with tools for data prep, model building, training, and deployment. Its "Feature Importance" charts and "Summary" plots use SHAP values. On "Autopilot" tab then "Training Mode section" then "Hyperparameter Tuning" section, you can set "Early Stopping" to Enabled or Auto to stop early.
       Studio notebooks = WARNING - different from Notebook!!! Collaborative Jupyter notebooks with ML libraries, persistent storage, and integrated tools. Use Studio notebooks to write, run, and share code for data exploration, model training, and deployment. Studio notebooks are integrated with SageMaker Studio, allowing persistent storage and multiple notebooks per project, while standalone notebook instances operate independently.
       Studio JupyterLab environment = No need for manual infrastructure setup. Has Matplotlib and Seaborn libraries for data visualizations. Has pre‑installed libraries such as TensorFlow and PyTorch. Simplifies infrastructure setup, auto save work to S3, and allow users to resume their work later.
    TensorBoard = visual tool to track, debug, and optimize DL models over "epochs" (full passes over the training data).
    Training = trains ML models on various compute instances, including GPU accelerated instances using distributed service.
    Users can have resource tagging tracking for threshold detection.
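The production-variant traffic split described in the Endpoints entry above can be sketched as weighted routing. The dictionaries follow the `ProductionVariants` shape used by the boto3 `create_endpoint_config` call, but nothing is sent to AWS; the variant names, model names, and instance type are hypothetical.

```python
import random


def production_variants(models):
    """models: list of (variant_name, model_name, weight) tuples."""
    return [
        {
            "VariantName": variant,
            "ModelName": model,
            "InstanceType": "ml.m5.large",  # assumption for illustration
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
        for variant, model, weight in models
    ]


def traffic_share(variants):
    """Each variant receives weight / sum(weights) of the requests."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def route(variants, rng):
    """Pick the variant for one request, proportionally to its weight."""
    names = [v["VariantName"] for v in variants]
    weights = [v["InitialVariantWeight"] for v in variants]
    return rng.choices(names, weights=weights, k=1)[0]


variants = production_variants([("blue", "model-v1", 9.0), ("green", "model-v2", 1.0)])
print(traffic_share(variants))
```

A weight of 0 keeps a variant deployed but sends it no traffic, so raising the weight step by step gives a gradual rollout.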

SageMaker Input Modes: (for training data from S3)

  1. "File Mode" - Default. Copies the entire S3 dataset onto the training instance's local EBS volume, which is mounted into the Docker container, then starts training. Terrible when the training dataset is huge!
  2. "Pipe Mode" - Streams data from S3 straight to the training algorithm. Data is not stored on the training instance's local storage. Largely superseded by Fast File Mode.
  3. "Fast File Mode" - Streams on demand. Training begins before all data is downloaded, which decreases startup time. Combines benefits of both File and Pipe modes: exposes S3 objects as files (like File mode) and streams data to the algorithm (like Pipe mode). Can do random access (but best with sequential access).
  4. S3 Express One Zone = Fast storage class in one AZ. Combines with an S3 mode (File, Pipe or Fast File)
  5. FSx for Lustre = Scales to high performance (100s of GB/s of throughput and millions of IOPS) with low latency. Single AZ. Requires VPC.
  6. EFS = Requires VPC.
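As a sketch of how an S3 input mode is selected, the helper below builds one training channel in the shape used by `InputDataConfig` in the boto3 `create_training_job` call. Nothing is sent to AWS. The bucket name is hypothetical, and `suggest_input_mode` is a rule of thumb drawn from the notes above, not an AWS rule.

```python
VALID_INPUT_MODES = ("File", "Pipe", "FastFile")


def s3_channel(name, s3_uri, input_mode="File"):
    """Build one InputDataConfig channel for S3 training data."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def suggest_input_mode(dataset_gb, volume_gb):
    """Heuristic: File mode only if the dataset fits on the local volume."""
    return "File" if dataset_gb <= volume_gb else "FastFile"


mode = suggest_input_mode(dataset_gb=800, volume_gb=100)
channel = s3_channel("train", "s3://example-bucket/train/", mode)  # hypothetical bucket
print(channel["InputMode"])
```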

Main SageMaker Algorithms

Regression & Classification (Tabular Data):
  Linear Learner = for classification and regression. Optimizes model variants in parallel. Parameters: 1) increasing the "Target Precision" parameter reduces false positives.
  XGBoost = SL. Gradient-boosted decision trees. Parameters: 1) max_depth controls tree complexity. 2) Increasing reg_lambda (L2 regularization) counteracts overfitting.
  Factorization Machines = Good for click-through-rate (CTR) prediction and high volume. Models interactions between features and is effective for sparse datasets, such as recommendation systems.
  K-Nearest Neighbors (KNN) = SL. Classification (common) or, rarely, Regression. Classifies a data point by how similar its features are to those of its nearest neighbors. Classification answer is 0 to 1.
  Object2Vec = Learns embeddings for pairs of objects (words, customer IDs, tokens) and scores how similar a pair is (e.g., 1 for similar, 0 for not).
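The KNN idea above (classify a point by its nearest neighbors) fits in a few lines. This is a plain majority-vote sketch on hypothetical toy data, not the SageMaker built-in algorithm, which adds sampling and index building for scale.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs. Returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train = [
    ((0, 0), "low"), ((1, 0), "low"), ((0, 1), "low"),
    ((9, 9), "high"), ((10, 9), "high"), ((9, 10), "high"),
]
print(knn_predict(train, (1, 1)))
print(knn_predict(train, (8, 8)))
```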

UL and Clustering:
  KMeans = UL. Works in 3 steps: 1) pick K = the number of "leaders" (centroids) for data points to cluster around, 2) each data point finds its closest leader, and 3) each leader moves to the mean (center) of its group. Steps 2 and 3 repeat until the leaders stop moving. Good for: finds hidden or unlabeled patterns, customer segmentation, risk grouping, and pattern discovery.
  Principal Component Analysis (PCA) = UL. think: looks for Patterns, Compressing it (reducing the dimensions), on the Anonymous data (so UL).  PC1 = trend of points, PC2 = perpendicular and sub-trend. Ex: Does not care about labels (of "height" and "weight"), but rather creates single dimension of size (so seeing the trend) which is PC1. Then tracks data that is not explainable by size (say "body shape") that is PC2.  Only cares about where the data is most spread out (variance).
  Principal Components = new, independent axes (directions) that rank the data's most important trends (patterns) from highest to lowest spread (variance).
  Random Cut Forest = anomaly detection. Good for id outliers or unusual behavior.
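The KMeans steps above (assign each point to its closest leader, then move each leader to the mean of its group) can be sketched as follows. This is the textbook Lloyd's algorithm with fixed starting centroids on hypothetical data; the SageMaker built-in uses a scalable variant.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm. Returns (centroids, clusters)."""
    for _ in range(iterations):
        # Step 2: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```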

NLP & Topic: 
   BlazingText: An ultra-fast word embedding and text classification engine optimized for GPUs. It can scale across multi-node clusters to generate Word2Vec vectors or categorize text items (e.g., web queries, sentiment tags) at scale.
   Neural Topic Model (NTM): organizes large text doc collections into distinct thematic topic categories. It maps hidden word associations without requiring pre-existing manual index labels.
   Latent Dirichlet Allocation (LDA): UL. NLP. Dirichlet is a lazy (so UL) bible reader that looks through text (so NLP), finding different topics, and finds the theme by associations between topics. 
   Sequence-to-Sequence (Seq2Seq): supervised neural framework mapping an input sequence of tokens directly to an output sequence. Good for translations, summarization models, and speech-to-text workflows.

Vision:
  Image Classification: Assigns one or more categorical labels to a whole image using a deep CNN (ResNet). Supports transfer learning from pre-trained weights or full training from a custom initialization.
  Object Detection: Ids, bounds, and classifies multiple distinct elements inside a single frame. It produces standard pixel-coordinate bounding boxes tagged with categorical confidence markers.
  Semantic Segmentation: Pixel-level structural tracking, tagging every individual pixel in an image with a class category. Good for autonomous driving maps or medical scan line tracking. 

Time Series Forecasting: 
  DeepAR: Needs historical data. Optimized for predicting future values.

Other:

   LightGBM: tree-based algorithm. Can be configured with oversampling the minority class or adjusting class weights on imbalanced classes.  Captures complex relationships and interactions between features.  Fast and efficient. Quickly train models.
   IP Insights: UL. Learns usage patterns of IPv4 addresses and associates them with entities like user IDs.
   Bias: Jensen-Shannon divergence, Kullback-Leibler divergence, and total variation distance for loan bias.
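The three bias metrics named above all compare two label distributions. Here is a minimal sketch of their standard definitions; the approve/reject distributions for two applicant groups are hypothetical numbers.

```python
import math


def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric, bounded version of KL that compares both to their average."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def total_variation_distance(p, q):
    """Half the L1 distance between the two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical loan outcomes [approved, rejected] for two facets.
group_a = [0.7, 0.3]
group_d = [0.4, 0.6]
print(kl_divergence(group_a, group_d))
print(js_divergence(group_a, group_d))
print(total_variation_distance(group_a, group_d))
```

All three are 0 when the two groups have identical outcome distributions and grow as the distributions diverge.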

----------------------------------------------------------------------------------------
                          Optional Sagemaker Understanding Stuff
----------------------------------------------------------------------------------------

SageMaker AI is the “heart” of the MLA-C01 certification

  • The majority of exam questions will be with SageMaker.
  • It is important to understand and discern between SageMaker Processing, SageMaker Training, and SageMaker Hosting, which all cover different aspects of the end-to-end ML process.
  • These notes first cover generic ML knowledge and concepts, and then their implementation in AWS (usually involving SageMaker and other AWS services).
  • Some open-source Apache services like Hadoop or Spark are also covered, since they are also popular in ML environments and are well supported in AWS
  • It is a good idea to review the high-level overview of SageMaker that was done in the foundational AIF-C01 certification. MLA-C01 builds on top of that knowledge.
Intro to SageMaker AI
  • AWS service that can handle the whole End-to-End process in ML
    • Data processing, model training, model deployment, and model hosting
    • Tons of features and sub-products (will go into depth in these notes)
  • SageMaker Training and Deployment Architecture
    • Input/output data usually in S3, but could be in other data stores
    • Training and inference code must be inside container images registered in ECR
    • Not all ML models will be deployed to endpoints.
End-to-End Process:
  1. Data Preparation (data prep)
    • Data usually comes from S3
      • Data can also come from Athena, EMR, Redshift, Amazon Keyspaces DB…
    • Integration with Apache Spark
  2. Data Processing
    • Processing job: copy raw data from S3 → Spin up processing container → Output processed data to S3
    • Container can be SageMaker built-in or user provided (code)
  3. Training
    • Training job requires
      • URL of S3 bucket with training data
      • ML compute resources
      • URL of S3 bucket for output → Model outputted to S3
      • Container (ECR) path to training code
    • Many training options available
      • Built-in algorithms, Spark MLlib, TensorFlow, PyTorch, Scikit-learn, XGBoost, Hugging Face, your own Docker image, AWS marketplace-purchased algorithms…
  4. Deployment
    • 2 ways:
      • Persistent endpoint for individual predictions/inference on demand
      • SageMaker Batch Transform for predictions of an entire dataset
    • Many cool options: inference pipelines, SageMaker Neo (edge devices), Elastic Inference, automatic scaling, shadow testing…
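The four training-job requirements listed under Training (input S3 URL, compute resources, output S3 URL, and container path) map onto a request like the sketch below. The keys follow the boto3 `create_training_job` parameters, but the request is only built, not submitted. Every ARN, URI, bucket, and job name is a hypothetical placeholder, and the volume size and runtime cap are arbitrary defaults.

```python
def training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                         output_s3_uri, instance_type="ml.m5.xlarge",
                         instance_count=1):
    """Assemble the arguments for a training job without submitting it."""
    return {
        "TrainingJobName": job_name,
        # Container (ECR) path to the training code.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # URL of the S3 location with the training data.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # URL of the S3 location where the model artifact is written.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # ML compute resources.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


request = training_job_request(
    job_name="demo-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
    role_arn="arn:aws:iam::123456789012:role/DemoSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
print(sorted(request))
```

With credentials configured, passing the dictionary as `boto3.client("sagemaker").create_training_job(**request)` would submit the job.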
SageMaker Domain
  • Organizational unit within SageMaker → organize users, apps, and resources
    • A domain must be configured before you can do anything in SageMaker!
      • Think of it as an isolation of an ML project
  • Each domain has one EFS volume
    • Each user has their private EFS directory within that volume
    • There's a shared EFS directory available to all users
  • User profile: represents an individual user/person in a domain
    • Can create own personal apps
    • Can spin private SageMaker Studio instances
    • Has access to a private EFS directory to store personal files
  • Shared resources across all users:
    • Shared spaces
    • Shared EFS directory
    • Communal IDE app (SageMaker Studio public to all users)

Network Configuration in SageMaker Domain

  • By default, a domain has two VPCs
    1. One with public internet access → can expose public endpoints for your domain
      • Managed by SageMaker
      • Optional → can select “VPC Only” when creating the domain, which means this managed VPC is NOT created
    2. One for private traffic
      • Encrypted traffic to domain's EFS volume
      • YOU manage it: must specify the VPC, its subnets, and security groups (SGs) 


Interfaces for Using SageMaker

SageMaker Notebooks

  • Old/classic method for ML in SageMaker → ML code
  • Spin up EC2 instances to host ML Notebooks, which direct ML E2E process:
    • S3 data access
    • ML code in Jupyter Notebook
      • Libraries like scikit-learn, NumPy, pandas, Apache Spark, TensorFlow, etc. at your disposal
      • Wide variety of built-in models
    • Can spin up training instances
    • Can deploy trained models for making predictions (inferring) at scale

SageMaker SDKs

  • Training and deployment of ML models via Python scripts
  • Python API libraries → import inside your code
    1. Boto3 (low-level API)
    2. SageMaker Python SDK (high-level API)
  • Can automate ML workflows, manage training jobs, deployments, and pipelines

SageMaker Studio

  • Web-based IDE for E2E ML development

  • Features: Team collaboration, Tune and debug ML models, Deploy ML models, Automated workflows.




SageMaker Console UI

  • AWS Management Console interface for SageMaker

    • GUI for managing SageMaker resources
  • Mostly for administrative tasks

  • Can access all other interfaces from the console UI




SageMaker Jumpstart 

  • ML Hub with many pre-trained ML models and pre-built ML solutions. Offers one-click deployment of models for inference. End-to-end solutions for common business problems.

  • Computer Vision (CV) models, Natural Language Processing (NLP) models, GenAI Foundation Models (FMs)…

    • Amazon-owned models or 3rd-party provider models
    • Provider examples: HuggingFace, Databricks, Meta…



SageMaker Canvas (was Data Wrangler)

  • Canvas became the unified no-code workspace for both data prep and model building. Data Wrangler was integrated directly into Canvas. No-code ML for business analysts.
  • Features
    • Build custom ML model (leverages AutoML powered by SageMaker Autopilot)
      • e.g. Upload CSV data (CSV-only for now), select column to predict & build model
    • Automatic data cleaning (leverages Data Wrangler)
    • Access ready-to-use models from AWS AI services (Rekognition, Comprehend…)
    • GenAI support via Bedrock or JumpStart FMs
    • Import, preview, visualize, transform data… in a visual UI
      • Even “Quick Model”
      • Can also export data flow
    • Many feature engineering capabilities (transform images, balance data, impute missing data, handle outliers, PCA…)
    • Troubleshooting:
      • SageMaker Studio should have correct IAM roles/permissions
      • Data sources should allow access (e.g. AmazonSageMakerFullAccess policy)
      • EC2 instance limit
        • “The following instance type is not available…” error → actually is usually a service quota problem → Ask for a bigger EC2 instance/quota increase
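A few of the feature engineering transforms that Canvas/Data Wrangler applies visually (impute missing data, Min-Max scaling, one-hot encoding) are simple enough to sketch by hand. This only illustrates what the transforms compute, not how Canvas implements them; the function names are hypothetical.

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale to [0, 1] so a large-valued feature does not dominate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """One column per category (sorted alphabetically), 1 where it matches."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


print(impute_mean([1.0, None, 3.0]))
print(min_max_scale([10, 20, 30]))
print(one_hot_encode(["red", "blue", "red"]))
```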

Summary Table of Interfaces for SageMaker



SageMaker Ground Truth

Humans label data → Prepare a training dataset with humans


   Human reviewers: 

      Mechanical Turk workers, your employees, or third-party vendors. Ground Truth creates own model as humans label data → RLHF.  Only images the model isn't sure about are sent to human labelers (reduces manual work by 70%).  

      Ground Truth Plus: Turnkey solution. AWS experts manage the whole workflow. Fill out a form; experts contact you, discuss pricing, and manage the labelers. Do NOT confuse with Amazon Augmented AI (A2I)!


   Labels
  Ground Truth is for human labeling, while A2I is for human oversight of trained model predictions. However, SageMaker Ground Truth and A2I can use the same human workforce! Benefits: consistency, efficiency, flexibility. Other ways to get labels: Rekognition, Comprehend, etc. Some pre-trained models or unsupervised techniques can be helpful.
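The idea that only the items the model is unsure about go to human labelers can be sketched as a confidence threshold. This is a toy illustration of the routing decision only; the 0.9 threshold, the item IDs, and `route_for_labeling` are hypothetical, and Ground Truth's own automated labeling is more involved.

```python
def route_for_labeling(predictions, confidence_threshold=0.9):
    """predictions: (item_id, predicted_label, confidence) tuples.

    Returns (auto_labeled, needs_human).
    """
    auto_labeled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_labeled.append((item_id, label))  # model label is accepted
        else:
            needs_human.append(item_id)            # sent to a human labeler
    return auto_labeled, needs_human


predictions = [
    ("img-1", "cat", 0.98),
    ("img-2", "dog", 0.55),
    ("img-3", "cat", 0.91),
]
auto_labeled, needs_human = route_for_labeling(predictions)
print(auto_labeled, needs_human)
```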

SageMaker Feature Store

  • Centralized portal for features. Offers fast, secure access to feature data for ML models.
  • Data ingestion via streaming or batch
    • Feature Store inputs data from streams with PutRecord API. 
    • Online store (Model gets features with GetRecord API)
    • Offline store (Feature Store inputs data into S3, AWS Glue creates Data Catalog, models can then access features via BATCH access)
  • Security
    • encryption at rest (KMS…) and in transit
    • IAM, PrivateLink…
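The difference between the online and offline stores can be shown with a toy in-memory model: the online store keeps only the latest record per ID, while the offline store keeps the full history. This is not the Feature Store API (the real `PutRecord` takes a list of feature name/value pairs); `ToyFeatureGroup` and its field names are hypothetical.

```python
class ToyFeatureGroup:
    """In-memory stand-in for a feature group with online and offline stores."""

    def __init__(self, record_id_name, event_time_name):
        self.record_id_name = record_id_name
        self.event_time_name = event_time_name
        self.online = {}   # latest record per ID, for real-time inference
        self.offline = []  # append-only history, for batch training

    def put_record(self, record):
        self.offline.append(dict(record))
        key = record[self.record_id_name]
        current = self.online.get(key)
        # Only a record with a newer (or equal) event time replaces the online copy.
        if current is None or record[self.event_time_name] >= current[self.event_time_name]:
            self.online[key] = dict(record)

    def get_record(self, record_id):
        return self.online.get(record_id)


group = ToyFeatureGroup("customer_id", "event_time")
group.put_record({"customer_id": "c1", "event_time": 1, "spend": 10})
group.put_record({"customer_id": "c1", "event_time": 3, "spend": 30})
group.put_record({"customer_id": "c1", "event_time": 2, "spend": 20})  # late arrival
print(group.get_record("c1"))
```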

SageMaker Model Monitor

  Detects data drift, model drift, and bias drift in deployed ML models. Continuously compares incoming data against a baseline dataset captured during training.
  When it detects drift beyond thresholds, it emits Amazon CloudWatch events. These events can trigger an AWS Lambda function, a common way to automate workflows such as model retraining. The Lambda function then starts a SageMaker Pipeline, which runs a retraining job with updated data.
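The drift-to-retraining flow above can be sketched in two parts. `drift_detected` is a deliberately crude stand-in (a z-score on the mean) and is not how Model Monitor computes its constraints. `lambda_handler` uses the boto3 `start_pipeline_execution` call; the pipeline name is hypothetical, and the optional `sagemaker_client` argument is an assumption added so the handler can be exercised without AWS.

```python
def drift_detected(baseline_mean, baseline_std, current_values, threshold=3.0):
    """Flag drift when the current mean is far from the baseline mean."""
    current_mean = sum(current_values) / len(current_values)
    z_score = abs(current_mean - baseline_mean) / baseline_std
    return z_score > threshold


def lambda_handler(event, context, sagemaker_client=None):
    """Start the retraining pipeline when a drift event arrives."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the sketch runs without AWS installed
        sagemaker_client = boto3.client("sagemaker")
    response = sagemaker_client.start_pipeline_execution(
        PipelineName="retraining-pipeline",  # hypothetical pipeline name
    )
    return response["PipelineExecutionArn"]


print(drift_detected(10.0, 2.0, [20.0, 22.0]))
print(drift_detected(10.0, 2.0, [10.5, 9.5]))
```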
