AWS Certified Machine Learning

Kamil
16 min read · Jan 14, 2021

Exam passed on March 6th 2021!

I wrote this post while preparing for the AWS Certified Machine Learning — Specialty exam. It is a summary of what I know about the technology. You may find some of the notes very trivial — my goal was to make sure I don't make any mistakes and remember every fact.

Courses I’ve taken

  1. Udemy AWS Certified Machine Learning Specialty 2021 — Hands On! course. It's better to take the Udemy course first: Whizlabs goes into more detail and you may even find it hard, whereas Stephane and Frank make it much easier to understand the concepts behind the terminology.
  2. Whizlabs AWS Certified Machine Learning Specialty course
  3. Machine Learning by Andrew Ng
  4. My bachelor thesis was about OCR using CNN and also compared the results with Tesseract
  5. Deep Learning Specialization (also by Andrew Ng)
  6. AWS YouTube videos & some of the re:Invent 2021 conference talks
  7. AWS Exam readiness course

Quizzes & Question dumps & Misc

  1. Best question dump I’ve found (on exam topics, in case the URL dies)
  2. Whizlabs practice exams
  3. Testprep exam questions
  4. AWS Certified Machine Learning Specialty — Sample Questions
  5. Andrew Ng and his programming assignments in his courses
  6. Reddit topics (check how others are preparing for it)

Data related

Data scaling & normalization methods

  • Mean/variance standardization
  • MinMax scaling
  • Maxabs scaling
  • Robust scaling
  • Normalizer (scales row-wise)
  • Standard scaler (performs scaling and shifting/centering column-wise)
  • One-hot-encoding (do that for categorical data)
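For reference, here is a minimal scikit-learn sketch of how these scalers behave (the toy data is made up):

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   OneHotEncoder, RobustScaler, StandardScaler)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))  # column-wise: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # column-wise: rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X))    # column-wise: divided by the max absolute value
print(RobustScaler().fit_transform(X))    # column-wise: median/IQR, robust to outliers
print(Normalizer().fit_transform(X))      # row-wise: each sample scaled to unit norm

# One-hot encoding for categorical data
colors = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colors).toarray())
```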

EC2 (Elastic Compute Cloud)

It's the environment in which SageMaker Jupyter notebooks run.

S3 (Simple Storage Service)

For SageMaker training input, the S3 data distribution type can be:

  • ShardedByS3Key (each training instance receives a subset of the dataset)
  • FullyReplicated (the entire dataset is replicated to every instance)
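A minimal sketch of setting this in the SageMaker Python SDK (the bucket path and content type are placeholders):

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",                # placeholder S3 location
    distribution="ShardedByS3Key",                  # each instance receives a subset of the objects
    content_type="application/x-recordio-protobuf",
)
# distribution="FullyReplicated" would copy the entire dataset to every training instance.
```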

Lifecycle configuration

  • Transition actions — Define when objects transition to another storage class. For example, you might choose to transition objects to the S3 Standard-IA storage class 30 days after you created them, or archive objects to the S3 Glacier storage class one year after creating them.
  • Expiration actions — Define when objects expire. Amazon S3 deletes expired objects on your behalf.
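A hedged boto3 sketch of such a lifecycle configuration (the bucket name, prefix and day counts are illustrative):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                                    # placeholder bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "transition-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            # Transition action: move objects to Standard-IA after 30 days
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Expiration action: delete objects after one year
            "Expiration": {"Days": 365},
        }]
    },
)
```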

Data storage options

  • S3 Standard
    general-purpose storage of frequently accessed data
  • S3 Intelligent-Tiering
    data with unknown or changing access patterns
  • S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA)
    long-lived, but less frequently accessed data
  • S3 Glacier and S3 Glacier Deep Archive
    long-term archive and digital preservation.
  • S3 Outposts
    If you have data residency requirements that can't be met by an existing AWS Region, you can use S3 on Outposts to store data on premises.

Kinesis Data Streams

PutRecord API

Input

a single shard can ingest up to 1 MB/s (or 1,000 records per second)

Output

Extras

  • real time
  • PutRecord (puts a single record into Kinesis Data Stream)
  • a consumer (e.g. one built with the Kinesis Client Library, KCL) has to receive the data and then write it to S3
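A minimal boto3 sketch of the PutRecord call (the stream name and payload are made up):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",                                        # placeholder stream
    Data=json.dumps({"user": "42", "action": "click"}).encode("utf-8"),
    PartitionKey="42",   # hashed to decide which shard receives the record
)
```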

Kinesis Video Streams

  • real time video processing

Kinesis Data Firehose

Serverless solution

PutRecord API

Input

JSON

a record has to be < 1,000 KB

the buffer size can be 1–128 MB

the buffer interval can be 60–900 seconds

the transformation Lambda's default timeout is 3 seconds

Output

can convert records to Parquet or ORC on the fly

Extras

  • near real time (not real time)
  • can output the Parquet file format
  • can write directly to S3

status of transformed data:

  • OK
  • Dropped (intentionally rejected by transformation)
  • ProcessingFailed (could not transform the data)
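A transformation Lambda has to return every record with a recordId, one of the result statuses above and base64-encoded data. A minimal sketch (the actual transformation here is a placeholder):

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()          # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                    # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```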

Kinesis Producer Library

Provides built-in performance benefits and is very easy to use.

e.g. ingesting clickstream data should be very easy

Kinesis Data Analytics

Transform and analyze streaming data in real time using Apache Flink (data processing for streams). Uses SQL queries.

Can detect dense regions in data using Hotspots.

Can detect anomalies using Random Cut Forest.

AWS Data Pipeline

Managed ETL (Extract, Transform, Load) service

AWS DataSync

ingests data from NFS shares

AWS DMS (Database Migration Service)

  • for batch processing
  • reads from relational and non-relational databases

AWS Glue

uses crawlers to catalog your data and runs ETL jobs:

  • structured data
  • unstructured data
  • has the function FindMatches Transform (labeling file should be encoded in UTF-8 with Byte Order Mark)
  • has a module called Built-In Transforms
  • has Spark ML jobs (jobs operating on parquet data)
  • don’t use if ETL is not mentioned
  • batch processing

What tools can be used on AWS Glue when using Spark

  • parquet data
  • Spark MLeap containers
  • Spark MLlib for building ML components for data transformation (tokenizing, encoding, normalizing etc.)
  • the SparkML Serving Container allows you to deploy an Apache Spark ML pipeline in SageMaker

Amazon Athena

  • serverless interactive query service
  • built on Presto
  • runs standard SQL

Amazon Aurora

MySQL and PostgreSQL-compatible relational database built for the cloud. Performance and availability of commercial-grade databases at 1/10th the cost.

Requires provisioning!

Amazon Fraud Detector

  • ONLINE_FRAUD_INSIGHTS
  • ingests only CSV

Training fails if:

  • rows_count < 10k
  • fraud_rows_count < 400

Amazon MSK (Managed Streaming for Apache Kafka)

Kafka is a publish/subscribe messaging system

Amazon Redshift

  • data warehouse
  • if the company has the Redshift data warehouse and wants to move part of its data to S3, it can use Redshift Spectrum to query that data using Redshift

Amazon DynamoDB

NoSQL Key-Value database

AWS Lake Formation

  • data lake
  • is built on top of AWS Glue (e.g. has crawlers)
  • uses S3 as data storage

AWS Step Functions

  • can do a lot of ETL on batch data

A Step Functions workflow (state machine) can be described with ASL (Amazon States Language).

Amazon FSx for Lustre

a file system service that speeds up training jobs: it accelerates data flow between S3 and SageMaker and prevents downloading the same dataset multiple times

Amazon EFS (Elastic File System)

faster training times because training jobs can read data directly from the file system, so there is no need to pull the data from S3 separately for the training job and the notebook

Amazon EBS (Elastic Block Store) volumes

Easy-to-use, high-performance block storage at any scale. Designed to be used with EC2.

Amazon QuickSight

Only for structured data in S3 or a database

Used to visualize data. Some of the visualizations:

  • hotspots
  • Net Promoter Score
  • KPI (Key Performance Indicator)
  • Customer Profitability Score
  • Bar charts
  • Pie charts

Amazon Sagemaker

SageMaker is a managed service from AWS which makes it easier to build, train and deploy ML models.

  • Cannot read from ElastiCache; the data has to be in S3.
  • Offline testing model -> alpha endpoints
  • Its notebooks are run on EC2 instances
  • Use SageMaker Management Console to specify the metrics you want to track
  • You can also use the sagemaker.analytics module and its TrainingJobAnalytics class (see the sketch after this list)
  • validation:cross_entropy
  • You can change the inference pipeline while it is deployed by using the UpdateEndpoint API, although you will lose auto scaling
  • no limit on input data size
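A sketch of pulling training metrics via the sagemaker.analytics module (the job name is a placeholder):

```python
from sagemaker.analytics import TrainingJobAnalytics

metrics_df = TrainingJobAnalytics(
    training_job_name="my-training-job",            # placeholder job name
    metric_names=["validation:cross_entropy"],
).dataframe()
print(metrics_df.head())
```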

SageMaker Linear Learner

  • classification
  • regression (linear regression assumes normally distributed residuals)

Training

  • recordIO-protobuf float32

Testing

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.

In order to interact with SageMaker hyperparameter tuning jobs from the Python SDK, use the HyperparameterTuner() class.

REGRESSION
predictor_type='regressor'
metrics: mean square error, cross entropy loss, absolute error

CLASSIFICATION
predictor_type='binary_classifier' or predictor_type='multiclass_classifier'
metrics: F1 measure, precision, recall, or accuracy
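A hedged sketch of training Linear Learner as a regressor with the SageMaker Python SDK (the role ARN and S3 paths are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",                 # placeholder bucket
    sagemaker_session=session,
)
linear.set_hyperparameters(predictor_type="regressor", mini_batch_size=100)
linear.fit({"train": TrainingInput("s3://my-bucket/train/",
                                   content_type="application/x-recordio-protobuf")})
```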

SageMaker kMeans

  • unsupervised
  • clustering

Training

Testing

testing metrics:

  • test:msd (mean squared distances)
  • test:ssd (sum of the squared distances)

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.

k-Nearest Neighbors

  • classification

XGBoost

  • classification
  • regression

for multiclass classification, the objective is set to multi:softprob

memory bound, so benefits more from M instances, rather than C ones

Inference endpoints accept only (no application/* content types):

  • text/csv
  • text/libsvm

XGBoost parameters

SEAGuL acronym ;)

  • subsample [default=1]
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, which prevents overfitting. Subsampling occurs once in every boosting iteration.
    range: (0,1]
  • eta [default=0.3, alias: learning_rate]
    Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
    range: [0,1]
  • alpha [default=0, alias: reg_alpha]
    L1 regularization term on weights. Increasing this value will make model more conservative.
  • gamma [default=0, alias: min_split_loss]
    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
    range: [0,∞]
  • lambda [default=1, alias: reg_lambda]
    L2 regularization term on weights. Increasing this value will make model more conservative.

Tuning hyperparameters of XGBoost

  • Random search
    will work but it may run very long
  • Bayesian optimization
    also optimizes but runs shorter

What about grid search?

Grid search is similar to random search in that it chooses hyperparameter configurations blindly. But it’s usually less effective because it leads to almost duplicate training jobs if some of the hyperparameters don’t influence the results much.
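A hedged sketch of a Bayesian tuning job over the SEAGuL parameters using the HyperparameterTuner class (the XGBoost estimator xgb and the channel inputs are assumed to be defined elsewhere):

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=xgb,                          # assumed: a SageMaker XGBoost Estimator
    objective_metric_name="validation:rmse",
    objective_type="Minimize",              # RMSE should be minimized
    hyperparameter_ranges={
        "subsample": ContinuousParameter(0.5, 1.0),
        "eta": ContinuousParameter(0.01, 0.3),
        "alpha": ContinuousParameter(0, 10),
        "gamma": ContinuousParameter(0, 5),
        "lambda": ContinuousParameter(0, 10),
    },
    strategy="Bayesian",                    # or "Random"
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})  # assumed inputs
```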

Metrics:

  • MSE (Mean Squared Error)-> good for measuring regression problems but handles outliers poorly
  • MAE (Mean Absolute Error) -> good regression metric that is less influenced by outliers

SageMaker Production Variants

Something like shadow testing in Tesla cars (two concurrent Autopilots running). The variant weights decide how traffic is split between the models. If you want to slowly introduce a new model:

  1. Create an endpoint configuration with production variants for the two models, initially weighted 0:1 in favour of the existing model
  2. Update the weights periodically to shift traffic to the new model
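A boto3 sketch of that flow (the endpoint, config and model names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint config with two production variants; InitialVariantWeight controls the traffic split.
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {"VariantName": "current-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 1.0},
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.0},
    ],
)

# Later, shift traffic to the new variant without redeploying the endpoint.
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "current-model", "DesiredWeight": 0.9},
        {"VariantName": "new-model", "DesiredWeight": 0.1},
    ],
)
```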

SageMaker Estimators

High-level interface for SageMaker training

SageMaker Processing

Makes it easier to manage infrastructure on SageMaker. If you need a fast ML solution, it's better to use SageMaker Processing than to manually write code in SageMaker Studio.

Amazon Neptune ML

New as of 2021. Works with GNNs (graph neural networks): machine learning optimized for graphs (XGBoost, by contrast, has to operate on tabular data). Uses the Deep Graph Library.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

a technique for reducing dimensionality

PCA (Principal Component Analysis)

dimensionality reduction

operates in 2 modes:

  • regular -> sparse datasets, moderate number of observations & features
  • randomized -> large number of observations & features

train Input:

  • application/recordIO-wrapped-protobuf
  • text/csv

test Input:

  • text/csv
  • application/json
  • application/x-recordio-protobuf

return format:

  • application/json
  • application/x-recordio-protobuf (vector of projections)

Modes:

  • File mode
  • Pipe mode

SageMaker Studio — IDE

allows data scientists to synchronize and share their work

SageMaker Random Cut Forest

Does not support GPU!

Training input:

  • text/csv
  • application/x-recordio-protobuf

SageMaker Batch Transform

Can do pre- and post-processing (like removing an ID feature before inference and then joining the ID back to the output). Used to handle very large datasets. It's a SageMaker-only feature.

  • not for real time applications
  • for a lot of data

SageMaker Experiments

Compares ML models.

SageMaker Debugger

Tool that makes it easier for a data scientist to track a model's performance and spot possible problems. It saves the model parameters during training so they can be visualized.

SageMaker Ground Truth

Annotation consolidation sends an image to a couple of workers, so that if one of them mislabels the image the others likely won't, which makes the consolidated label much more reliable.

SageMaker Autopilot

  • AutoML
  • accepts CSV only.

SageMaker Neo + AWS IoT Greengrass

Neo is used to compile a model for an IoT device, Greengrass gathers the data from those devices, and an IoT device can use AWS IoT Core to draw inferences from the model (SageMaker models are deployed on the edge).

Algorithms divided by the type of acceleration available

Machine type instances

TODO:

EC2 P3 & P3DN

EC2 G4 & EC2 CS

FPGAs

AMIs

Elastic inference

Inferentia

High-performance chip built for deep learning inference

AWS snowball

  • local storage and large-scale data transfer
  • up to 40 vCPUs available (Snowball Edge)

AWS snowmobile

A shipping-container-sized service for physically transferring very large amounts of data by semi truck.

Text processing algorithms

  • TF-IDF (Term Frequency-Inverse Document Frequency) -> determines how important a word is in a document by assigning weights

Please do not sit here
Please do not smoke here

tf-idf matrix (unigrams and bigrams) size = (2, 6 + 6) = (2,12)
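A quick scikit-learn check of that size: 6 unique unigrams plus 6 unique bigrams across the two sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please do not sit here", "Please do not smoke here"]
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
print(tfidf.shape)  # (2, 12): 2 documents x (6 unigrams + 6 bigrams)
```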

More examples

  • Sequence-to-sequence -> machine translation, text summarization (needs tokenization and input data in RecordIO-protobuf with integer tokens)
  • bag of words -> creates tokens out of the words on the input
  • OSB (Orthogonal Sparse Bigram) -> creates groups of words
  • n-gram -> used to find multi word phrases in text
  • LDA (Latent Dirichlet Allocation) -> topic modeling, unsupervised
  • Neural Topic model -> topic modeling, unsupervised

DeepAR Forecasting

Forecasting scalar (1D) time series using RNN.

Factorization Machines

supervised (classification and regression), works well on sparse data, used for recommendation systems

Input data type

  • recordIO-protobuf float32

Inference data type

  • application/json
  • application/x-recordio-protobuf

IP Insights

unsupervised, uses neural network underneath, detects strange network traffic anomalies, “random fishy things”

GuardDuty

Threat detection service that detects anomalous or malicious behavior in your AWS accounts and workloads (e.g. unusual user behavior)

Reinforcement learning

You can run it on multiple cores/multiple machines.

Automatic Model Tuning

  • learns as it goes

Elastic MapReduce (EMR)

  • tool for big data processing and analysis
  • connected with Spark (spark can output a parquet file)
  • Real-time streaming
  • HPC (High Performance Computing)
  • requires management (provisioning)

Amazon Comprehend

  • NLP
  • text analytics
  • Amazon Comprehend Medical
  • sentiments
  • document classification
  • can understand many languages
  • Personally Identifiable Information (PII) -> be careful with that (Amazon Comprehend can detect that)

Amazon Comprehend Medical

Amazon Translate

  • takes JSON input and translates the text (even when the source language is set to auto) into the desired language

Amazon Transcribe

  • speech-to-text
  • channel identification
  • custom vocabularies
  • streaming client is an HTTP/2 streaming client

Amazon Polly

  • text-to-speech
  • polly is a stereotypical name for a parrot
  • when you happen to have acronyms in the text (W3C -> World Wide Web Consortium), you can use SSML for that (e.g. <sub alias="World Wide Web Consortium">W3C</sub>), but it is DOCUMENT specific only. That's why there is a better option: create a custom lexicon

Amazon Forecast

  • Amazon Forecast Prophet
    good for time series with strong seasonal effects
  • Amazon Forecast DeepAR+
    large datasets
    can work with related time series (many time series datasets that are correlated)
  • Amazon Forecast ARIMA
    simple datasets (less than 100 time series)
  • Amazon Forecast CNN-QR
    1D time series, Seq2Seq model
  • Amazon Forecast ETS (Exponential Smoothing)
    good for seasonality and other prior assumptions about the data
  • Amazon Forecast NPTS (Non-Parametric Time Series)
    works well for sparse time series

Amazon Kendra

  • text extracted from an individual document cannot exceed 5 MB
  • supports HTML, PowerPoint, Word, PDF, plain text

Amazon Lex

  • chat bot engine
  • utterance -> intent -> slots (extra information) -> Lambda (fulfillment)

Amazon Rekognition

  • computer vision
  • face detection
  • can be paired with Augmented AI (Rekognition predictions will be reviewed by humans)
  • Rekognition Image
  • Rekognition Video

Amazon Cognito

  • for authorization and user authentication

Amazon Connect

Easy to use omnichannel cloud contact center

Amazon Personalize

Recommender system, offered as a managed (PaaS-style) service.

Real-time personalization and recommendations.

Amazon Textract (OCR)

send an image/PDF to Amazon and receive the extracted text with confidence scores

Amazon Sumerian

used with augmented reality

IoT Core

used to gather data from devices (and to let the devices communicate with each other) and pass it on to services such as SageMaker

IoT Greengrass

moves AWS to the Edge for IoT devices, allowing them to connect to inference endpoints

IoT Analytics

used to gather data from IoT devices and is able to enrich that data with external sources

NLP

Methods in NLP (in the order the models were introduced)

BlazingText

Highly-optimized Word2Vec

  • sentiment analysis
  • entity recognition

Word2Vec (a special case of Object2Vec)

  • word vectors = word embeddings
  • similar meaning = similar vectors (when using Word2Vec)
  • Object2Vec is capable of creating embeddings for arbitrary objects, such as tweets

GloVe

Global Vectors for Word Representation

Transformer

ELMo

Embeddings from Language Models; uses LSTMs

PyTorch BERT

BiDirectional Encoder Representations from Transformers

Uses word masking (a bit like dropout, but for NLP) during pre-training, masking 15% of the tokens.

Enables transfer learning: first learn from Wikipedia or a books corpus, then train for domain-specific problems.

Amazon's NLP tips

  • in NLP spelling has a relatively lower bearing on the importance of the word
  • in NLP remove stop words (e.g. not, neither, nor)
  • tokenization of words for NLP

Extra info about AWS Services

  • AWS DeepRacer (reinforcement learning powered 1/18-scale race car)
  • DeepLens (deep learning-enabled video camera)
  • CloudTrail is for auditing (e.g. how often a model is deployed)
  • CloudWatch is monitoring and issues alarms (e.g. monitor CPU/GPU)
  • when a model fails you can also call DescribeJob API to check the FailureReason option
  • AWS KMS (Key Management Service)
    SSE — Server Side Encryption
    CSE — Client Side Encryption
  • model training happens inside VPC
  • SimpleImputer default strategy = mean
  • lambda function max deployment package size is 50MB
  • lambda max memory setting = about 3 GB (3,008 MB)
  • lambda blueprint can be taken from AWS Serverless Application Repository or AWS Lambda Repository
  • lambda transformed record must contain recordId, result and data
  • Semantic Segmentation is used for computer vision, not NLP
  • online learning -> learning on the go
  • incremental learning -> if you have a model trained for the specific job and you will train it again using new data
  • transfer learning -> use the pretrained model (like ResNet, YoloV3) and retrain for your specific data
  • out-of-core learning -> used to train huge datasets that cannot be loaded into a single server, it trains using subsets of data
  • ReLU -> Rectified Linear Unit
  • Collaborative Filtering -> Amazon used that to create a recommendation system “Users who bought this also bought this”
  • RMSE is a good evaluation metric for regression when solving for a continuous problem
  • ROC is a good evaluation metric when solving for a binary variable (classification)
  • NAT gateways are instantiated in public subnets. Whenever you hear about encrypting and hiding SageMaker traffic, always keep in mind the phrase "VPC interface endpoint"
  • SQS (Simple Queue Service) message queue service. You need boto for that
  • SNS (Simple Notification Service) notification system — mail, SMS or push notification
  • automatic load balancing also costs money on AWS
  • SGD (Stochastic Gradient Descent) fails -> RMSProp, Adam or Adagrad, Adadelta, NAG or Momentum
  • Gradient Descent converges faster after normalization

Correlation coefficients

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. A coefficient near zero means no correlation, not a negative correlation.

Covariance correlation coefficient

Covariance is used when you have a Gaussian relationship between your variables.

Pearson’s correlation coefficient

Also used when you have a Gaussian relationship.
neg. correlation < -0.5 < indeterminate correlation < 0.5 < pos. correlation

Rank correlations

Spearman’s correlation coefficient
Used when you have a non-Gaussian (but monotonic) relationship.
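A small SciPy sketch contrasting the two on a monotonic but non-linear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3                  # monotonic, but far from linear/Gaussian

print(pearsonr(x, y)[0])    # < 1: the linear (Pearson) coefficient misses some structure
print(spearmanr(x, y)[0])   # 1.0: the rank (Spearman) coefficient captures the monotonic relation
```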

Polychoric correlation coefficient (or tetrachoric)

This coefficient is used to understand the relationship of variables gathered via surveys such as personality tests and surveys that use rating scales.

Other ML techniques and notes

Naive Bayes

Multinomial Naive Bayes counts the frequency of a given word/observation, which suits document/word-count tasks.
Bernoulli Naive Bayes suits document classification tasks where you only wish to know whether a word appears or not.
Gaussian Naive Bayes works with continuous values in your observations, not discrete values.
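A minimal scikit-learn sketch of the three flavours (the toy data is made up):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

X_counts = np.array([[3, 0, 1], [0, 2, 4]])     # word counts per document
X_binary = (X_counts > 0).astype(int)           # word present / absent
X_cont = np.array([[1.2, 3.4], [2.1, 0.5]])     # continuous observations
y = np.array([0, 1])

MultinomialNB().fit(X_counts, y)  # frequencies of words/observations
BernoulliNB().fit(X_binary, y)    # whether the word appears or not
GaussianNB().fit(X_cont, y)       # continuous values
```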

Techniques in ML

Ridge regression
Will reduce the coefficients in the model but not all the way to 0.
Lasso regression
Can reduce some of the coefficients to 0.
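A toy scikit-learn sketch of the difference (the synthetic data has only two informative features):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # uninformative coefficients driven to exactly 0
```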

Imputation techniques:

  • deep learning
  • yeo-johnson transformation (used to give a more gaussian distribution for your data)
  • mean imputation (fills missing values with the column mean, a rather naive approach)
  • multivariate imputation (used for predicting missing values in the data, better than mean imputation)
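A scikit-learn sketch contrasting mean and multivariate imputation (IterativeImputer is still experimental, hence the extra import):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # naive: fill with the column mean
print(IterativeImputer().fit_transform(X))              # multivariate: model each feature from the others
```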

Techniques for using multiple GPUs

  • Horovod (simple, use only for training, remove when deploying an inference model)
  • PySpark (more work than Horovod)
  • using DeepAR (more work than Horovod)

Visualizing data

S3 -> Lake Formation -> QuickSight
Elasticsearch -> Kibana

Plot types

  • scatter plot
  • bubble chart (can be used to compare 3 features)
  • pairs plot
  • swarm plot
  • cat plot
  • covariance matrix
  • correlation matrix
  • confusion matrix (often used to describe classification performance)
  • entropy matrix
  • histogram (1D)
  • line chart (for trends, time series data)
  • radar chart (good for drawing multiple variables simply)
  • bar chart
  • heat map

For more examples, see the seaborn gallery.

Network protocols

  • HTTP
  • HTTP/2 (e.g. the one used by Amazon Transcribe) used for streaming data
  • HTTP/3 (is currently being developed)

ARIMA -> Autoregressive integrated moving average

Oversampling (how to handle imbalanced datasets)

Creating additional minority-class data (e.g. for fraud detection or anomaly detection) where there is very little positive data but we still need to detect it. Undersampling, by contrast, is a technique where we remove majority-class data (e.g. kNN undersampling).

SMOTE oversampling
Synthetic Minority Oversampling Technique, kinda good but not that awesome

Random oversampling
Naive way to achieve that

GANs oversampling (Generative Adversarial Networks)
Creates convincing new data, so there are more unique observations.

While SMOTE approaches are based on local information, GAN methods learn from the overall class distribution.
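A minimal imbalanced-learn sketch of SMOTE (class sizes are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                          # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                      # minority class synthetically upsampled to match
```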

Measuring the goodness of AI

ROC (Receiver Operating Characteristic)

PR curve

Precision

Recall
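As a reminder: precision = TP / (TP + FP), recall = TP / (TP + FN), the ROC curve plots the true-positive rate against the false-positive rate, and the PR curve plots precision against recall. A scikit-learn sketch (labels and scores are made up):

```python
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

print(precision_score(y_true, y_pred))           # TP / (TP + FP)
print(recall_score(y_true, y_pred))              # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))            # area under the ROC curve
print(average_precision_score(y_true, y_score))  # summary of the precision-recall curve
```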

Databases

Relational

  • SQL
  • MySQL

Non-relational

  • Hadoop
  • Spark
  • MongoDB
  • NoSQL

Q&A

Example exam from AWS notes:

  1. A
  2. C
    should be A
  3. B
  4. D
    should be B
  5. D
  6. D
    should be B
  7. C
  8. B (maybe D, but why do that for 5% when there are multiple columns to be imputed)
    but it is D; lol, it was really worth it. Maybe 5% really is a lot.
  9. A
    should be B
  10. D

Test score = 60%

If you happen to have different local minima (training function fluctuating around different values during different batch runs), then it would be best to:

  • decrease batch size (will not hit local minima)
  • decrease learning rate (will not overshoot global minima)

small mini-batch -> helps avoid getting stuck in local minima

large mini-batch -> more computationally efficient

ensemble of models -> a combination of ML models working together to produce one inference (e.g. XGBoost for structured data and a CNN for images)

Neural Networks are being widely used in ML thanks to:

  1. A lot of data generated through social media, captcha etc.
  2. efficient algorithms arose (softmax etc.)
  3. cheaper GPUs
