ML Exam: 5 - ML Products


Data

Data Analysis & Visualization
Forecast = Forecasts time-series data. Features: 1) "PerformAutoML" selects the best ML algorithm (including DeepAR+ and CNN-QR (very complex time series), Prophet (time series plus holidays and seasons), ARIMA (time series only), ETS (trends and seasons), and others) based on the chosen objective (e.g., min avg loss across quantiles). 2) Hyperparameter Optimization (HPO) selects the best hyperparameter combo for the chosen algorithm. 3) "FeaturizationMethodName" handles imputation (filling in missing values). 4) "Holidays" feature.
Quick Sight = Interactive dashboards and reports over data. Limit of 1 TB per dataset. Has anomaly detection, forecasting, and auto-narratives.
     Athena = serverless SQL on S3 for ad-hoc queries and data lake analysis. Cost-effective. Runs queries in parallel.
     Redshift - Think Oracle was "Big Red" and shifting away from Oracle data warehouses. Faster querying. Fully managed. Structured or semi-structured data. Scalable, with a pay-as-you-go pricing model. SQL across data warehouses, data lakes, and operational DBs. Can run either provisioned (clusters) OR serverless (nothing to provision). Feature: "dynamic data masking" for controlling access to sensitive data.
         Redshift Spectrum = query external data lakes
         Redshift ML = run ML models on your data.

Data Pipelines
Kinesis Data Analytics = for real-time streaming event data from apps, streams + sensors. Instant analytics, metrics, and data insights. "Random Cut Forest" for outliers. Allows pre-processor Lambda functions to handle custom transforms. Allows real-time transformations using SQL.
Kinesis Data Streams = for real-time streaming event data from apps, streams + sensors. Requires an app to process stream data and write to S3. Auto provisioning and scaling in on-demand mode, with auto shard splitting to handle volume. Instant analytics, metrics, and data insights. Features: 1) "Enhanced Fan-Out" - helps "consumer scaling" by reducing latency and ensuring dedicated throughput per consumer. 2) "Random Cut Forest" for outliers (run in Kinesis Data Analytics on the stream). Limits per shard: 1) Write throughput: 1 MB/sec of data. 2) Record rate: 1,000 records/sec.
Kinesis Data Firehose = for data delivery (including near real-time, but usually for batch) to S3 or Redshift over a public endpoint for ingestion (good for sites with no direct connection or VPN). Fully managed service. Auto provisioning and scaling. Gives data to storage and services. Cannot be used for sub-second latency apps or for ETL by itself. Features: 1) Can send failed records to a separate S3 bucket. 2) Compresses using Gzip. 3) Built-in transformation to Parquet (good for Athena SQL queries). 4) Can set to zero buffering for real-time rather than the default 60 seconds. 5) Can call a custom Lambda function for ETL if set as invocation target. 6) "Source record backup" feature stores the raw records.
Kinesis Video Streams = Proxy server + RTSP to KVS is the usual approach to ingest live video streams from existing IP cameras (which use RTSP) into AWS. 
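
The per-shard write limits listed for Kinesis Data Streams (1 MB/sec and 1,000 records/sec) can be turned into a quick sizing estimate. This is a minimal sketch for provisioned mode; the function name and the sample workload are made up for illustration, and it ignores read-side limits.

```python
import math

# Per-shard write limits quoted in the Kinesis Data Streams notes above.
MAX_MB_PER_SEC_PER_SHARD = 1.0
MAX_RECORDS_PER_SEC_PER_SHARD = 1000


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_throughput = math.ceil(mb_per_sec / MAX_MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_throughput, by_records)


# Hypothetical workload: 5,000 records/sec at 2 KB each.
print(shards_needed(5000, 2))
```

Whichever limit is hit first decides the answer: large records are bound by MB/sec, tiny records by records/sec.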

Data Processing
Glue is serverless data integration or ETL for text-based semi-structured data (like JSON, CSV, XML, and Apache Parquet) or unstructured text logs. Not for images or deploying models for inference. Can read from Redshift. data dialog.  Requires writing, testing, and managing custom Python/Scala Spark code or maintaining complex visual workflow. No ready-to-use Docker images for popular ML frameworks such as TensorFlow and PyTorch.  Enable "Job Bookmarking" to track data processed between job runs and reduce data redundancy by ensuring only new or changed data is run in later ETL runs. Features: 1) auto transform into Apache Parquet format. 2) "Use minimal number of Data Processing Units" since charges based on number of DPUs used and job duration. 3) Partitions data (for performance and reducing query costs) to process these partitions independently. Many partition by incremental amounts or time frames (such as months and/or years). 4) "Find Matches" option detects duplicates for merging of 2 tables. 5) built-in transforms Filter, Map, and RenameField.
     Crawlers = infers columns in U. Can infer the data schema, even if it changes over time. Stores the schema in the Glue Data Catalog. Optimizes costs by avoiding unnecessary ETL runs.
     Glue Data Brew = visual tool. Cleans, normalizes, and transforms data with no code. 250 transformations, including a way to mask sensitive data or PII. Supports workflow creation and scheduling of periodic runs. Nothing for image resizing.
     Glue Data Catalog = good for metadata extraction. Auto rule suggestions. Has anomaly detection via ML. Auto recommends and runs data quality rules directly on Data Catalog tables.
     Glue Data Quality = Serverless feature that auto measures, monitors, and manages data quality in Glue. Not for monitoring model bias.

Elastic MapReduce (EMR) = Auto provisioning, cluster mgmt, and scaling. Managed. Apache Spark, Apache Hadoop, and Apache Hive. Reads mass data and maps to key value pairs and reduces dups. Apache Spark in EMR is for ETL tasks at scale connecting natively to relational databases (via JDBC) and pulls data from S3. Aggregates, cleans, and pre-processes massive datasets before model training. Can 1) run the primary node and core nodes (crucial for the stability and performance) on "On-Demand Instances", 2) run the task nodes on "Spot Instances".
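
The On-Demand vs. Spot split described for EMR can be sketched as an instance-group list. The shape below follows the `InstanceGroups` list accepted by boto3's EMR `run_job_flow` call; the helper name, instance types, and counts are assumptions for illustration only, and nothing here calls AWS.

```python
def emr_instance_groups(task_nodes: int = 4) -> list:
    """Build instance groups: stable nodes On-Demand, task nodes on Spot."""
    return [
        # The primary node coordinates the cluster, so keep it On-Demand.
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS data, so losing them hurts stability.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 2},
        # Task nodes only add compute, so Spot interruptions are tolerable.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": "m5.xlarge", "InstanceCount": task_nodes},
    ]


for group in emr_instance_groups():
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```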

Search
OpenSearch Service = Search via keyword and NL matching.

Data Storage
Lake Formation = Simplifies data lakes on S3. Centralizes governance, enforce row/column-level control on data, and auto data ingestion and cataloging, replacing complex IAM policies. Good for data aggregation from different data sources. 
Can configure tags to map users to their campaigns. 


MISC
Managed Service for Apache Flink = managed, serverless.
Apache's Flink = open-source, distributed engine for stateful processing over unbounded (streams) and bounded (batches) data sets. Stream processing apps are designed to run continuously, with minimal downtime, and process data as it is ingested. Built for low-latency processing (calcs done in-memory), high availability (no single point of failure), and horizontal scaling. Exactly-once consistency guarantees.


Application Integration

EC2 ROUTING

Event Bridge = serverless. event routing + store events. Could use to decouple apps. Can fire off a Sagemaker Pipeline if has a valid role OR has S3 Upload event pattern that matches.

    Event Bridge Scheduler = can schedule things to happen, say weekly or monthly, via a rule, but this is usually done after the interaction elements of the task are created.
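
The "S3 Upload event pattern" mentioned for Event Bridge might look like the dictionary below. The bucket name is hypothetical, and `matches` is a deliberately simplified stand-in for the service's matching logic (exact values only; no prefix, numeric, or other operators), written here just to show how a pattern selects events.

```python
# Pattern for "an object was created in this bucket".
# The bucket name is a made-up example.
S3_UPLOAD_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-training-data"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: lists mean 'any of these values', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "my-training-data"}, "object": {"key": "train.csv"}},
}
print(matches(S3_UPLOAD_PATTERN, sample_event))
```

A rule with this pattern would then name its target (for example a Sagemaker Pipeline) along with a role that is allowed to start it.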

Simple Queue Service (SQS) = message queuing. Could use to decouple apps.

Simple Notification Service (SNS) = pub/sub service that stores until 2nd service is up. Real time. Multi-target simultaneous. Email and texting (via SMS).

Lambda = serverless compute that responds to SQS or SNS events. We manage rights via IAM roles (by role, we grant it the S3 and DynamoDB resources access without exposing sensitive credentials),  our code, triggering event, and run times. AWS is responsible for capacity and OS mgmt.
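
As a rough illustration of "our code" plus the "triggering event", here is a minimal handler for an SQS-triggered Lambda. SQS delivers messages in a `Records` list with the payload in `body`; the `order_id` field and the JSON message format are assumptions for this sketch.

```python
import json


def handler(event, context):
    """Process each SQS message in the batch and report how many were handled."""
    processed = 0
    for record in event.get("Records", []):
        payload = json.loads(record["body"])  # assumes JSON message bodies
        print(f"order {payload['order_id']} received")
        processed += 1
    return {"processed": processed}


# Local dry run with a hand-built event; no AWS access needed.
fake_event = {"Records": [{"body": json.dumps({"order_id": 1})},
                          {"body": json.dumps({"order_id": 2})}]}
print(handler(fake_event, None))
```

Access to S3 or DynamoDB would come from the function's IAM role rather than from credentials in the code.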

MISC APP INTEGRATION

Managed Workflows for Apache Airflow (MWAA) = fully managed service that allows you to use the open-source Apache Airflow platform to orchestrate data pipelines.

Step Functions = conductor. Visual no-code workflow builder. Serverless orchestration service that coordinates multiple services (such as Pipelines and Glue jobs) into a workflow. Prevents Lambda timeouts by allowing the API call to process in the background.
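
A Step Functions workflow is defined in Amazon States Language (JSON). The sketch below builds a two-step definition as a Python dict: a Glue job run with the `.sync` integration (the state waits for the job, so no Lambda has to sit and poll), followed by a Lambda call. The job and function names are hypothetical, and `dangling_targets` is just a small sanity check written for this example.

```python
import json

definition = {
    "Comment": "Run a Glue job, then call a Lambda function",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # ".sync" makes the state wait until the Glue job run finishes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "clean-raw-data"},  # hypothetical job
            "Next": "NotifyDone",
        },
        "NotifyDone": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "notify-done"},  # hypothetical function
            "End": True,
        },
    },
}


def dangling_targets(defn: dict) -> list:
    """Return StartAt/Next targets that do not name a defined state."""
    states = defn["States"]
    targets = [s["Next"] for s in states.values() if "Next" in s]
    return [t for t in [defn["StartAt"], *targets] if t not in states]


print(json.dumps(definition, indent=2))
print(dangling_targets(definition))
```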


Cloud Financial Management

Billing and Cost Mgmt. dashboard = Inside it is:
     Bills = shows invoices and payments 
     Budgets = set budgets and alerts when costs, usage, or Savings Plans and RIs exceed limits. 
     Cost Explorer = dashboard of costs and usage with interactive graphs, reports, and forecasts. Shows spending patterns, trends, and RI recommendations.


Compute

Batch = batch workloads not for deploying models for inference. auto schedule, manages + scales. Parallel work. Compute.

Elastic Compute Cloud (EC2) 
Think HAS/HAS:  
      Hosting (Multi-tenancy (VMs isolated but share host resources)); Auto Scaling; Setup ( 1) AMI for the OS/Software, 2) Instance Type: Pick your "T-shirt size" (General, Memory, etc.), 3) Storage (Instance Store for temp data, EBS for DBs))
      Hosts (Dedicated gives you the whole physical server); Availability (Capacity Reservations guarantee you have space in a specific AZ when needed); Security ( 1) Security Groups act as firewalls for instances and 2) IAM Roles secure EC2 from API and let EC2 talk to S3 securely)

 EC2 Types:
       General Purpose = flexible + cost effective 
       Memory Optimized = good for real time, large data, or data analytics
       Storage Optimized = has high-disk throughput & low latency. Good for data analysis.
       Dedicated Host = full machine + physical server. Supports BYOL.
       Spot Instance = Stop(able - think rearrange letters) batch operations. Unused EC2 capacity w/ up to 90% savings.
       On Demand = w/o commitment for unpredictable and mission critical or for "short" (6- mths)
       "Reserved Instance" (RI) = for predictable work. 1 or 3 year commitment for a discounted rate on compute usage (like EC2 or RDS) in a specific AZ, with up to 70% cost savings when you agree to use a specific instance config. Good for 90% to 100% utilized. If on-demand usage matches the RI, the billing discount applies automatically.
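
To make the discount figures above concrete, here is a small worked comparison. The hourly rate is a made-up number, not a real AWS price, and the 70% and 90% discounts are taken at face value from the notes (actual savings vary and are "up to" figures).

```python
HOURS_PER_MONTH = 730  # common approximation for one month


def monthly_cost(hourly_rate: float, discount: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost after applying a fractional discount (0.70 means 70% off)."""
    return round(hourly_rate * (1 - discount) * hours, 2)


on_demand_rate = 0.10  # hypothetical USD per hour

print("On Demand:", monthly_cost(on_demand_rate))
print("Reserved :", monthly_cost(on_demand_rate, discount=0.70))
print("Spot     :", monthly_cost(on_demand_rate, discount=0.90))
```

The RI number only holds if the instance really runs most of the month, which is why the notes tie RIs to 90% to 100% utilization.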

Can install Cloud Watch Agent on EC2 instance to monitor CPU and other custom metrics. 

Containers

EC2 CONTAINERS
1) EC2 Self Managed = full infrastructure control (provisioning, OS updates, patches)
2) Elastic Container Service (ECS) = partially managed. we kick off patching. we config scaling. 
         Fargate ECS = think "far-away servers".  Serverless. does not support GPU instances.
3) Elastic Kubernetes Service (EKS) = partially managed. we kick off patching. we config scaling.
         Fargate EKS = think "far-away servers".  Serverless. does not support GPU instances.
Elastic Container Registry (ECR) = managed. Hands Docker and OCI (Open container initiative) containers to ECS and EKS. Uses IAM security.
Deep Learning Containers (DLCs): Pre-built Docker images optimized for DL frameworks like TensorFlow, reducing the effort of setting up the environment. 

Database

DB 
Relational DB Service (RDS) = Managed Service. Connects to different Relational DBs, including Aurora, MySQL, PostgreSQL, MS SQL Server, MariaDB, and Oracle. Auto backups and read replicas. Multi-AZ. Cost effective. VPC isolation as well as encryption at rest and in transit. Auto patches. We build schema and do database settings. Start with Mgmt. Console or Cloud Formation.
    RDS Read Replicas - Of RDS instance for redundancy and scalability for cross region or cross AZ, or same AZ.
    Aurora = Mnemonic: "Aurora Makes Perfect Database Relationships" (MySQL, PostgreSQL, DB, Relational) (so DB replacement for MySQL or PostgreSQL). Auto grows storage. Auto detect hardware failures and redirects traffic. Auto backups. Multi-AZ. Low cost.  
        Aurora serverless = Serverless DB version.
Dynamo DB = Think Keys and KMaps (Key/Value (so No-SQL), Managed, Availability of Data and Auto-Scaling, Provisioned, Dynamic Schema). Serverless. Encrypted prior to storage. For unpredictable traffic
Elasti Cache is fully managed in-memory caching. auto detect and failover on nodes. Good for Redis, Valkey, or Memcached tools, for 2-tier web apps, and for read heavy apps.
Document DB = For JSON data or MongoDB. Good for semi-structured data like product catalogs.

DB SERVICES
Neptune.  Think Neptune, Roman god of sea. Sea of interconnected currents and fish so highly connected graph DB service.
   Neptune serverless : Serverless.


Developer Tools

Chat Bot = allows DevOps and developers to monitor and manage resources directly in Slack, Microsoft Teams, and Amazon Chime.

Code Artifact = secure, managed artifact repo for storing and sharing software packages. Ex: NuGet.

Code Build = fully managed CI.
Cloud Development Kit (CDK) = helps developers define cloud infrastructure using programming languages
Code Deploy auto deploys software for various compute services
Code Pipeline = fully managed CI/CD service to build, test, and deploy.
X-Ray = visual dashboard of tracing, debugging, and performance analysis tool. Xray is eXamine Requests, Analyze path, Yield trace.
Code Guru = recommends code quality fixes and id an app’s most expensive lines of code. Has Reviewer and Profiler.
DevOps Guru = analyzes operational data and metrics and events to id behaviors that deviate from normal patterns. Users are notified when detects an operational issue or risk.
Serverless App Repo = library of pre-built serverless patterns.


Machine Learning

AI/ML Products (in services layer)
Augmented AI (A2I) =  builds workflows for human review allowing adjust confidence levels
Comprehend = NLP (text). Pre-trained models (no labeled data needed) for toxicity detection, redaction, sentiment, entities, key phrases, keywords, and topics. Can detect and redact PII and offensive language from user interactions. Feature: Can "Import Model API" to copy model into an account. Modes: 1) Built-in models, 2) Entity recognition, 3) Classify multi-label ("Sample yogurt" to "yogurt," "snack," and "dairy product" classify), 4) Classify multi-class ("Sample yogurt" to single classify).
    Comprehend Medical = HIPAA. NLP. Uses ML to auto extract structured medical info such as meds, diagnoses, and test results from unstructured text like doctor's notes or clinical reports.
Kendra = NLP (text). enterprise search service to find context or semantic searches across S3, Sharepoint, etc.  Searches relational databases via JDBC connectors. 
Lex = Think Lexicon. Conversational brain for speech and text in apps. Interaction focused. Such as chatbots, voice-controlled menus, the tech that powers Alexa, etc. Outputs intent.
OpenSearch Service = Store embeddings in a vector database via the scalable index mgmt and nearest neighbor search, then later search via keyword and NL matching. Good for semantic searches and similarity-based recommendations since embeddings. 
Personalize = recommends. Ex: movie recommendations. Works with batch training. Personalize doesn't understand users from their static profiles, but rather from their current behavior. Features: 1) "Event Tracker" to include real-time user interactions. CANNOT read from Kinesis and Dynamo DB. 2) "PerformAutoML" selects the best ML algorithm based on the chosen objective (e.g., min avg loss across quantiles). 3) "Hyperparameter Optimization" (HPO) selects the best hyperparameter combo for the chosen algorithm. 4) "Personalized Ranking" ranks the chosen item list for recommendations. 5) "User Personalization" ranks based on historical data.
Polly = converts text into speech/audio for conversational apps. Think Polly Anna Parrot. The generative voice engine is the GenAI part of it. Polly itself is not used to build the dialogue flow or understand user input.
Rekognition = auto image and video analysis for your apps without ML experience. Think eyes. Object and scene detection and has OCR for text. Outputs labels, text, and data. has built-in eyes gaze detection. Can do content moderation such as detecting violent content. Does NOT show live video feeds. 
   Rekognition Custom Labels is for custom object detection or image classification with no-code.
   Rekognition Medical extracts medical info (such as medication, condition, test results) from unstructured text. 
Textract = extracts text from docs, including handwriting (even cursive) and scanned typed document images. No search. Can recognize entities like names, dates, and addresses.
Transcribe converts speech into text
Translate is a text translation to different language.
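
The "store embeddings, then nearest neighbor search" idea from the OpenSearch Service entry can be shown with a tiny brute-force version. This is not the OpenSearch API; it is a plain-Python sketch of cosine-similarity k-NN over made-up three-dimensional embeddings, where a real index would hold model-generated vectors with hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy "vector index": document id -> embedding (values invented for the demo).
index = {
    "yogurt": [0.9, 0.1, 0.0],
    "cheese": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

print(nearest([1.0, 0.0, 0.0], index, k=2))
```

Items whose embeddings point in a similar direction come back first, which is what makes semantic search and similarity-based recommendations work.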

Gen AI Products (in services layer)

Q Business - BI. Answers questions using your company data. Supports 3rd party app plugins, including ServiceNow and Zendesk. Features: 1) "Blocked Phrase".
Q Developer - Helps developers with test case creation, documentation creation, code recommendations, open source license tracking, reference tracking, and snippets. Works with Glue.

Fraud Detector = Fully managed. Steps: 1) ingest historical data such as past account signups and user details to train the model. 2) real-time evaluation using fraud score (using ML) to id payment fraud or fake account creation. Do not try to force all the model features.
Health Lake = HIPAA. Fully managed. Standardizes data. Central repo.
Lookout for Equipment = Obsolete. Monitors real-time sensor data (e.g., pressure, temperature, RPMs) to trigger predictive maintenance.
Lookout for Metrics = Obsolete. Fully managed. Monitors real-time anomalies and ids root causes in business metrics.
Lookout for Vision = Obsolete. Fully managed. Monitors inspection cameras for QC.
Mechanical Turk = outsourced crowdsourcing marketplace


Management and Governance

Cloud Formation = Create JSON/YAML templates, which are Infrastructure as Code (IaC), so you can start anew quickly in a disaster recovery situation or set up your preferred AWS environment.
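
A minimal sketch of what such a template contains, built here as a Python dict and printed as JSON. The logical name `TrainingDataBucket` and the choice of a single versioned S3 bucket are assumptions for illustration; a real environment template would list many more resources.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal sketch: one versioned S3 bucket",
    "Resources": {
        # Logical name is made up; Cloud Formation generates the bucket name.
        "TrainingDataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

print(json.dumps(template, indent=2))
```

Because the whole environment lives in a file like this, rebuilding it after a disaster is a matter of creating a stack from the same template.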

USER AND SECURITY MONITORING

Cloud Watch = visualizes resource use, app speed, and operational health in real time and system wide. Alerts. Create custom dashboards. CW collects metrics (such as CPU, memory, and invoke counts) on EC2 and Sagemaker and sets threshold alarms (including budget alarms). Like neighborhood watch. Views logs from Cloud Trail, VPC Flow Logs, and Guard Duty. Can create a custom metric (such as model overfitting) and alert on the newly created metric. Install the CloudWatch Agent to push custom metrics such as active users. Alarms can call Lambda functions. Watch the "Invocations" metric, which Sagemaker endpoints push automatically.

Cloud Trail logs tracking user activity, model deployment frequency, and API usage even on stopped EC2 instances. Logs to S3 bucket. No UI. Helps with governance, compliance, and operational and risk auditing. 

GOVERNANCE
Organizations manages entire org (including solo members and org units). Links accounts. Consolidates billing (cheaper from volume discounts) and creates hierarchy. Making orgs needs root.
Service Catalog creates, shares, and organizes a curated catalog of resources.
Trusted Advisor monitors real-time cost, performance, resilience, security, and service quotas.
 
AUDIT
Config = audits resource configurations, marks them compliant/non-compliant, and can emit events to CloudWatch. Managed. Monitors config changes for compliance, security, and change mgmt. Works in the Control Tower's landing zone. Use config snapshots and store them in S3. Set compliance rules and evaluate resources against them. Can schedule periodic evaluations. Custom rules are created with Lambda.

SECURITY CONFIG
Systems Manager = operational insights via a centralized view of ID, OS details, auto registry edits, user mgmt, and patching.

Resource Access Manager (RAM) = shares resources across accounts or in org.

Compute Optimizer = analyzes historical use metrics (CPU, memory, storage) uses ML to recommend optimal resource configs. Reduces costs and improves app performance by rightsizing over-provisioned or under-provisioned resources like EC2, EBS, Lambda, and ECS on Fargate.

MISC

Connect = AI powered call center. Connect connects customers to your contact center.


Migration and Transfer

DATA MIGRATION

Data Sync = auto fast internet data transfer to S3, EFS, and FSx for Windows File Server. Progress checking and task reporting.

TRANSFER

IoT Greengrass = Deploys ML models to edge devices. Core can catch MQTT traffic that you redirect to Kinesis Data Firehose.


Networking and Content Delivery

CONNECTIONS
Virtual Private Cloud (VPC) = run public or private resources in your virtual network. Most control over infrastructure.
   Components of VPC:
       Virtual private gateway = lets protected internet traffic (VPN) enter the VPC. For hybrid. However, has narrow bandwidth. Can isolate parts of the VPC in a given account.
      Subnet = sub of VPC, group resources based on security or operational needs. Subnets can be public or private. A private subnet has no direct route to the internet.
      EC2 instances = VPC hosts any of the EC2 instances you want.

Direct Connect = Dedicated fiber optic cables connect you and AWS that is not over the internet. Large bandwidth and good security. 
Cloud Front = cached CDN on edge with fast loading times, cost savings, and reliability. Helps with lower latency. Create CloudFront distribution centers in multiple regions. Good for videos or uploads.
API Gateway = fully managed service. Acts as the "front door" for apps.


Security, Identity, and Compliance

SECURITY CONFIG
IAM Identity Center = federated identity mgmt. Single sign on. Managed by Amazon.
IAM trust policy = policy that specifies which identities (users, services, or accounts) are allowed to assume a role, commonly used for cross-account access.
IAM role = id to gain temp access to permissions for a single session.
IAM functionality composed of roles and users and is set using CLI or APIs.
IAM identity-based policies = apply to IAM users, groups, or roles
IAM resource-based policies = apply to AWS resources like S3 buckets.
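
A trust policy for cross-account access, as described in the IAM entries above, could look like the sketch below. The helper name and the account ID are placeholders; the document structure (`Version`, `Statement`, `Effect`, `Principal`, `Action`) is the standard IAM policy JSON.

```python
import json


def cross_account_trust_policy(trusted_account_id: str) -> dict:
    """Trust policy letting principals in another account assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Who may assume the role: here, the whole trusted account.
            "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
            "Action": "sts:AssumeRole",
        }],
    }


# 111122223333 is a placeholder account ID.
print(json.dumps(cross_account_trust_policy("111122223333"), indent=2))
```

The trust policy only says who can assume the role; what the role can do once assumed comes from the identity-based policies attached to it.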

Secrets Manager = Manage passwords and API keys or tokens. SM is a butler that fetches the secret/token for app. Has lifecycle of credentials including auto rotation. Lambda function can trigger rotation of API keys if API key has external third-party text embeddings.
Key Mgmt Service (KMS) = Create and manage crypto keys. AWS (same account) and Customer managed keys (for cross-account). Think physical key.  Keys: AWS Managed Key and AWS Owned Key: manages the key lifecycle. Symmetric Key: is single key for encryption and decryption. Customer Managed Key: can rotate keys, set usage policies, and audit key usage.


USER AND SECURITY MONITORING
Macie = monitors secure data at rest. Uses ML. Monitors content (data in S3). Detects anomalous access patterns to find unauthorized access to sensitive data. Auto classifies sensitive data such as PII, financial data, and health records stored in S3. Does no encryption. Can pair with Lambda for auto removal of sensitive info.


Storage

EC2 INSTANCES CONTENT
Data:
    1) Instance Store = Ephemeral block-level and memory-based data with no snapshots. Cost-effective, super high performance. For buffers, caches, and scratch data.
    2) Elastic Block Store (EBS) = Features are PASSE: Persists, AZ, Same as EC2, Snapshots, Elastic.
Overall low-latency, low cost. Manual set of volume size. High availability by auto replicating in same AZ. We do data encryption at rest and snapshots. Best for storing temp intermediate databases in ML process.
          EBS Snapshots - Incremental point-in-time backups. Good for data protection, cross-region data migration, disaster recovery, volume resizing or cloning, sharing data across accounts and low cost.
          EBS Types: SSD, either general purpose (such as gp2, gp3) or provisioned IOPS (such as io1, io2 for mission critical, lowest latency, heavy read-write), for speed; good for self-hosted DBs and boot volumes. OR HDD (st1, sc1) for large, sequential streaming data (such as server logs, sensor data, stock prices, etc.).
          EBS How Encrypt steps: Only manual by 1) create snapshot, 2) copy snapshot turning on encryption, 3) create volume from new encrypted, 4) attach to EC2.
          EBS How Backup steps: 1) Pick volume, 2) Create snapshot, 3) Store in S3.
Data Lifecycle Manager = create, delete, retain EBS snapshots.
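
The four "EBS How Encrypt" steps can be sketched as one function. The method and waiter names (`create_snapshot`, `copy_snapshot`, `create_volume`, `attach_volume`, `snapshot_completed`, `volume_available`) are those of the boto3 EC2 client; everything else, including the `DryRunEC2` stand-in, the IDs, and the device name, is invented so the sketch runs without AWS access. With a real client you would pass `boto3.client("ec2")` instead.

```python
def encrypt_volume(ec2, volume_id, instance_id, region, az, device="/dev/sdf"):
    """Create an encrypted copy of an unencrypted volume and attach it."""
    # 1) Snapshot the unencrypted volume.
    snap = ec2.create_snapshot(VolumeId=volume_id, Description="pre-encryption")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    # 2) Copy the snapshot with encryption turned on.
    copy = ec2.copy_snapshot(SourceSnapshotId=snap["SnapshotId"],
                             SourceRegion=region, Encrypted=True)
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
    # 3) Create a new volume from the encrypted snapshot.
    vol = ec2.create_volume(SnapshotId=copy["SnapshotId"], AvailabilityZone=az)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    # 4) Attach the encrypted volume to the instance.
    ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=instance_id, Device=device)
    return vol["VolumeId"]


class _NoWait:
    def wait(self, **kwargs):
        pass  # a real waiter polls until the resource is ready


class DryRunEC2:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def get_waiter(self, name):
        return _NoWait()

    def __getattr__(self, name):
        def call(**kwargs):
            self.calls.append(name)
            n = len(self.calls)
            return {"SnapshotId": f"snap-{n}", "VolumeId": f"vol-{n}"}
        return call


client = DryRunEC2()
new_volume = encrypt_volume(client, "vol-old", "i-123", "us-east-1", "us-east-1a")
print(client.calls, new_volume)
```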

DATA FILES
Simple Storage Service (S3) = Think BOUTS, BOUTS.  Bucket storage & Block public access, Objects (structured) & Objects (unstructured), Unlimited (nearly) storage & URLs (pre-signed), Tiered (like Glacier) & Transitional (lifecycle rules to auto-move objects for cost savings), Secure (via IAM policies (IAM policies cannot be attached directly to specific S3 buckets.) and APIs) & Serverless (highly available). Used for CDN, hosting static websites, media files (even for CloudFront), app data storage, archiving, data lakes, and compliance-driven data retention. 11 9's of data durability.
       S3 Transfer Acceleration - Fast file transfers to S3 buckets using distributed edge locations, sending traffic over the AWS network rather than the public internet.
       Lifecycle in S3 - Create a lifecycle rule to predictably transition to S3 Standard-Infrequent Access (S3 Standard-IA) after 30 days, then to S3 Glacier after 90 days. Used to define rules that auto move objects between different storage classes, or delete them based on age or usage. Each transition incurs a fee.
       S3 Intelligent Tiering - Auto tiering. Anything less than 128 KB is never auto tiered. Charges a per-object monitoring fee. Good for unpredictable access patterns.
       S3 Storage classes offer different performance, availability, and cost.
       S3 bucket policies decide who can use. Rules are written in JSON.  1) Version. 2) Effect, 3) Principal, 4) Action, 5) Resource, 6) Condition.
       S3 Event Notification on arrival of new data.
       S3 Cross-Region Replication (CRR) auto and asynch replicates (and their metadata and versions) from S3 bucket to S3 bucket in different Region. 
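
The 30-day / 90-day lifecycle rule described above can be written as a configuration dict. The shape follows the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`; the rule ID is made up, and `storage_class_at` is a small helper written only for this sketch to show what the rule implies for an object of a given age.

```python
lifecycle_configuration = {
    "Rules": [{
        "ID": "tier-down-by-age",        # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},        # empty prefix = every object
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }]
}


def storage_class_at(age_days: int, rule: dict = lifecycle_configuration["Rules"][0]) -> str:
    """Return the storage class an object of this age would be in under the rule."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


for age in (10, 45, 120):
    print(age, storage_class_at(age))
```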

Elastic File System (EFS) = Linux, Elastic size, POSIX permissions, Multi-connections. Shared network file system for many EC2 instances simultaneously on many servers in even different AZs, fully managed, auto scalable file storage, scales as number of files changes. For containers. Great for images and legacy apps.
        Storage classes in EFS: Auto moves. Standard, Infrequent (30 days later), 1 Zone, 1 Zone Infrequent, and Archive (90 days later).         

File System X (FSx) = Windows and Lustre (a file system on Linux), Static size. A fully managed service that provides cost-effective, scalable file storage built on widely used file systems. Supports multiple file systems (such as Windows File Server, Lustre, OpenZFS, and NetApp ONTAP). Windows version: SMB support, Active Directory integration, and Windows features like data deduplication.

Storage Gateway = Extends AWS storage to your on-premise location.
     Benefits: 1) Integration, 2) Better mgmt. 3) Local caching, 4) Low Cost.
     Types: 1) S3 file gateway - low-latency local access via local caching; fits existing file-based workflows, 2) cached volume gateway - cache local, store backup on cloud, 3) stored volume gateway, 4) tape gateway. Good for hybrid.

MISC
Cloud Search = old search app.
Open Search = new search app.


