AWS Certified Machine Learning

Kamil
16 min read · Jan 14, 2021

Exam passed on March 6th 2021!

I wrote this post while preparing for the AWS Certified Machine Learning — Specialty exam. It is a summary of what I know about the technology. You may find some of the notes very trivial — my goal was to make sure I don't make any mistakes and remember every fact.

Courses I’ve taken

  1. Udemy AWS Certified Machine Learning Specialty 2021 — Hands On! course. It's better to take the Udemy course first: Whizlabs goes into more detail and you may even find it hard, whereas Stephane and Frank make it much easier to understand the concepts behind the terminology.
  2. Whizlabs AWS Certified Machine Learning Specialty course
  3. Machine Learning by Andrew Ng
  4. My bachelor thesis was about OCR using CNN and also compared the results with Tesseract
  5. Deep Learning Specialization (also by Andrew Ng)
  6. AWS YouTube videos & some of the re:Invent 2021 conference talks
  7. AWS Exam readiness course

Quizzes & Question dumps & Misc

  1. Best question dump I’ve found (on exam topics, in case the URL dies)
  2. Whizlabs practice exams
  3. Testprep exam questions
  4. AWS Certified Machine Learning Specialty — Sample Questions
  5. Andrew Ng and his programming assignments in his courses
  6. Reddit topics (check how others are preparing for it)

Data related

Data scaling & normalization methods

  • Mean/variance standardization
  • MinMax scaling
  • Maxabs scaling
  • Robust scaling
  • Normalizer (scales row-wise)
  • Standard scaler (performs scaling and shifting/centering column-wise)
  • One-hot-encoding (do that for categorical data)
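For reference, here is a minimal scikit-learn sketch of how these scalers behave (the toy data is made up):

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   OneHotEncoder, RobustScaler, StandardScaler)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))  # column-wise: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # column-wise: rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X))    # column-wise: divided by the max absolute value
print(RobustScaler().fit_transform(X))    # column-wise: median/IQR, robust to outliers
print(Normalizer().fit_transform(X))      # row-wise: each sample scaled to unit norm

# One-hot encoding for categorical data
colors = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colors).toarray())
```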

EC2 (Elastic Compute Cloud)

It's the environment in which SageMaker Jupyter notebooks run.

S3 (Simple Storage Service)

For SageMaker training input, the S3 data distribution type can be:

  • ShardedByS3Key (each training instance receives a subset of the dataset)
  • FullyReplicated (the entire dataset is replicated to every instance)
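A minimal sketch of setting this in the SageMaker Python SDK (the bucket path and content type are placeholders):

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",                # placeholder S3 location
    distribution="ShardedByS3Key",                  # each instance receives a subset of the objects
    content_type="application/x-recordio-protobuf",
)
# distribution="FullyReplicated" would copy the entire dataset to every training instance.
```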

Lifecycle configuration

  • Transition actions — Define when objects transition to another storage class. For example, you might choose to transition objects to the S3 Standard-IA storage class 30 days after you created them, or archive objects to the S3 Glacier storage class one year after creating them.
  • Expiration actions — Define when objects expire. Amazon S3 deletes expired objects on your behalf.
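A hedged boto3 sketch of such a lifecycle configuration (the bucket name, prefix and day counts are illustrative):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                                    # placeholder bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "transition-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            # Transition action: move objects to Standard-IA after 30 days
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Expiration action: delete objects after one year
            "Expiration": {"Days": 365},
        }]
    },
)
```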

Data storage options

  • S3 Standard
    general-purpose storage of frequently accessed data
  • S3 Intelligent-Tiering
    data with unknown or changing access patterns
  • S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA)
    long-lived, but less frequently accessed data
  • S3 Glacier and S3 Glacier Deep Archive
    long-term archive and digital preservation.
  • S3 Outposts
    If you have data residency requirements that can't be met by an existing AWS Region, you can use S3 on Outposts to store data on premises.

Kinesis Data Streams

PutRecord API

Input

a single shard can ingest up to 1 MB/s (or 1,000 records per second)

Output

Extras

  • real time
  • PutRecord (puts a single record into Kinesis Data Stream)
  • a consumer (e.g. one built with the Kinesis Client Library, KCL) has to receive the data and then write it to S3
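A minimal boto3 sketch of the PutRecord call (the stream name and payload are made up):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",                                        # placeholder stream
    Data=json.dumps({"user": "42", "action": "click"}).encode("utf-8"),
    PartitionKey="42",   # hashed to decide which shard receives the record
)
```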

Kinesis Video Streams

  • real time video processing

Kinesis Data Firehose

Serverless solution

PutRecord API

Input

JSON

a record has to be < 1,000 KB

the buffer size can be 1–128 MB

the buffer interval can be 60–900 seconds

the transformation Lambda's default timeout is 3 seconds

Output

can convert records to Parquet or ORC on the fly

Extras

  • near real time (not real time)
  • can output the Parquet file format
  • can write directly to S3

status of transformed data:

  • OK
  • Dropped (intentionally rejected by transformation)
  • ProcessingFailed (could not transform the data)
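A transformation Lambda has to return every record with a recordId, one of the result statuses above and base64-encoded data. A minimal sketch (the actual transformation here is a placeholder):

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()          # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                    # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```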

Kinesis Producer Library

Provides built-in performance benefits and is very easy to use.

e.g. ingesting clickstream data should be very easy

Kinesis Data Analytics

Transform and analyze streaming data in real time using Apache Flink (data processing for streams). Uses SQL queries.

Can detect dense regions in data using Hotspots.

Can detect anomalies using Random Cut Forest.

AWS Data Pipeline

Managed ETL (Extract, Transform, Load) service

AWS DataSync

ingests data from NFS shares

AWS DMS (Database Migration Service)

  • for batch processing
  • reads from relational and non-relational databases

AWS Glue

uses crawlers to catalog your data and runs ETL jobs:

  • structured data
  • unstructured data
  • has the function FindMatches Transform (labeling file should be encoded in UTF-8 with Byte Order Mark)
  • has a module called Built-In Transforms
  • has Spark ML jobs (jobs operating on parquet data)
  • don’t use if ETL is not mentioned
  • batch processing

What tools can be used on AWS Glue when using Spark

  • parquet data
  • Spark MLeap containers
  • Spark MLlib for building ML components for data transformation (tokenizing, encoding, normalizing etc.)
  • the SparkML Serving Container allows you to deploy an Apache Spark ML pipeline in SageMaker

Amazon Athena

  • serverless interactive query service
  • built on Presto
  • runs standard SQL

Amazon Aurora

MySQL and PostgreSQL-compatible relational database built for the cloud. Performance and availability of commercial-grade databases at 1/10th the cost.

Requires provisioning!

Amazon Fraud Detector

  • ONLINE_FRAUD_INSIGHTS
  • ingests only CSV

Training fails if:

  • rows_count < 10k
  • fraud_rows_count < 400

Amazon MSK (Managed Streaming for Apache Kafka)

Kafka is a publish/subscribe messaging system

Amazon Redshift

  • data warehouse
  • if the company has the Redshift data warehouse and wants to move part of its data to S3, it can use Redshift Spectrum to query that data using Redshift

Amazon DynamoDB

NoSQL Key-Value database

AWS Lake Formation

  • data lake
  • is built on top of AWS Glue (e.g. has crawlers)
  • uses S3 as data storage

AWS Step Functions

  • can do a lot of ETL on batch data

A Step Functions workflow (state machine) can be described with ASL (Amazon States Language).

Amazon FSx for Lustre

a file system service that speeds up training jobs: it accelerates data flow between S3 and SageMaker and prevents downloading the same dataset multiple times

Amazon EFS (Elastic File System)

faster training times because training jobs can read data directly from the file system, so there is no need to pull the data from S3 separately for the training job and the notebook

Amazon EBS (Elastic Block Store) volumes

Easy-to-use, high-performance block storage at any scale. Designed to be used with EC2.

Amazon QuickSight

Only for structured data in S3 or a database

Used to visualize data. Some of the visualizations:

  • hotspots
  • Net Promoter Score
  • KPI (Key Performance Indicator)
  • Customer Profitability Score
  • Bar charts
  • Pie charts

Amazon Sagemaker

SageMaker is a managed service from AWS which makes it easier to build, train and deploy ML models.

  • Cannot read from ElastiCache; the data has to be in S3.
  • Offline testing model -> alpha endpoints
  • Its notebooks are run on EC2 instances
  • Use SageMaker Management Console to specify the metrics you want to track
  • You can also use the sagemaker.analytics module and its TrainingJobAnalytics class (see the sketch after this list)
  • validation:cross_entropy
  • You can change the inference pipeline while it is deployed by using the UpdateEndpoint API, although you will lose auto scaling
  • no limit on input data size
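A sketch of pulling training metrics via the sagemaker.analytics module (the job name is a placeholder):

```python
from sagemaker.analytics import TrainingJobAnalytics

metrics_df = TrainingJobAnalytics(
    training_job_name="my-training-job",            # placeholder job name
    metric_names=["validation:cross_entropy"],
).dataframe()
print(metrics_df.head())
```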

SageMaker Linear Learner

  • classification
  • regression (linear regression assumes normally distributed residuals)

Training

  • recordIO-protobuf float32

Testing

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.

In order to interact with SageMaker hyperparameter tuning jobs from the Python SDK, use the HyperparameterTuner() class.

REGRESSION
predictor_type='regressor'
metrics: mean square error, cross entropy loss, absolute error

CLASSIFICATION
predictor_type='binary_classifier' or predictor_type='multiclass_classifier'
metrics: F1 measure, precision, recall, or accuracy
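A hedged sketch of training Linear Learner as a regressor with the SageMaker Python SDK (the role ARN and S3 paths are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",                 # placeholder bucket
    sagemaker_session=session,
)
linear.set_hyperparameters(predictor_type="regressor", mini_batch_size=100)
linear.fit({"train": TrainingInput("s3://my-bucket/train/",
                                   content_type="application/x-recordio-protobuf")})
```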

SageMaker kMeans

  • unsupervised
  • clustering

Training

Testing

testing metrics:

  • test:msd (mean squared distances)
  • test:ssd (sum of the squared distances)

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.

k-Nearest Neighbors

  • classification

XGBoost

  • classification
  • regression

for multiclass classification, the objective is set to multi:softprob

memory bound, so benefits more from M instances, rather than C ones

Inference endpoints accept only (no application/* content types):

  • text/csv
  • text/libsvm

XGBoost parameters

SEAGuL acronym ;)

  • subsample [default=1]
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, which prevents overfitting. Subsampling occurs once in every boosting iteration.
    range: (0,1]
  • eta [default=0.3, alias: learning_rate]
    Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
    range: [0,1]
  • alpha [default=0, alias: reg_alpha]
    L1 regularization term on weights. Increasing this value will make model more conservative.
  • gamma [default=0, alias: min_split_loss]
    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
    range: [0,∞]
  • lambda [default=1, alias: reg_lambda]
    L2 regularization term on weights. Increasing this value will make model more conservative.

Tuning hyperparameters of XGBoost

  • Random search
    will work but it may run very long
  • Bayesian optimization
    also optimizes but runs shorter

What about grid search?

Grid search is similar to random search in that it chooses hyperparameter configurations blindly. But it’s usually less effective because it leads to almost duplicate training jobs if some of the hyperparameters don’t influence the results much.
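A hedged sketch of a Bayesian tuning job over the SEAGuL parameters using the HyperparameterTuner class (the XGBoost estimator xgb and the channel inputs are assumed to be defined elsewhere):

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=xgb,                          # assumed: a SageMaker XGBoost Estimator
    objective_metric_name="validation:rmse",
    objective_type="Minimize",              # RMSE should be minimized
    hyperparameter_ranges={
        "subsample": ContinuousParameter(0.5, 1.0),
        "eta": ContinuousParameter(0.01, 0.3),
        "alpha": ContinuousParameter(0, 10),
        "gamma": ContinuousParameter(0, 5),
        "lambda": ContinuousParameter(0, 10),
    },
    strategy="Bayesian",                    # or "Random"
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})  # assumed inputs
```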

Metrics:

  • MSE (Mean Squared Error)-> good for measuring regression problems but handles outliers poorly
  • MAE (Mean Absolute Error) -> good regression metric that is less influenced by outliers

SageMaker Production Variants

Something like shadow testing in Tesla cars (two concurrent Autopilots running). The variant weights decide how traffic is split between the models. If you want to slowly introduce a new model:

  1. Create an endpoint configuration with production variants for the two models, initially weighted 0:1 in favour of the existing model
  2. Update the weights periodically to shift traffic to the new model
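A boto3 sketch of that flow (the endpoint, config and model names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint config with two production variants; InitialVariantWeight controls the traffic split.
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {"VariantName": "current-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 1.0},
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.0},
    ],
)

# Later, shift traffic to the new variant without redeploying the endpoint.
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "current-model", "DesiredWeight": 0.9},
        {"VariantName": "new-model", "DesiredWeight": 0.1},
    ],
)
```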

SageMaker Estimators

High-level interface for SageMaker training

SageMaker Processing

Makes it easier to manage infrastructure on SageMaker. If you need a fast ML solution, it's better to use SageMaker Processing than to manually write code in SageMaker Studio.

Amazon Neptune ML

New as of 2021. Works with GNNs (graph neural networks): machine learning optimized for graphs (XGBoost, by contrast, has to operate on tabular data). Uses the Deep Graph Library.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

a technique for reducing dimensionality

PCA (Principal Component Analysis)

dimensionality reduction

operates in 2 modes:

  • regular -> sparse datasets, moderate number of observations & features
  • randomized -> large number of observations & features

train Input:

  • application/recordIO-wrapped-protobuf
  • text/csv

test Input:

  • text/csv
  • application/json
  • application/x-recordio-protobuf

return format:

  • application/json
  • application/x-recordio-protobuf (vector of projections)

Modes:

  • File mode
  • Pipe mode

SageMaker Studio — IDE

allows data scientists to synchronize and share their work

SageMaker Random Cut Forest

Does not support GPU!

Training input:

  • text/csv
  • application/x-recordio-protobuf

SageMaker Batch Transform

Can do pre- and post-processing (like removing an ID feature before inference and then joining the ID back to the output). Used to handle very large datasets. It's a SageMaker-only feature.

  • not for real time applications
  • for a lot of data

SageMaker Experiments

Compares ML models.

SageMaker Debugger

Tool that makes it easier for a data scientist to track a model's performance and spot possible problems. It saves the model parameters during training so they can be visualized.

SageMaker Ground Truth

Annotation consolidation sends an image to a couple of workers, so that if one of them mislabels the image the others likely won't, which makes the consolidated label much more reliable.

SageMaker Autopilot

  • AutoML
  • accepts CSV only.

SageMaker Neo + AWS IoT Greengrass

Neo is used to compile a model for an IoT device, Greengrass gathers the data from those devices, and an IoT device can use AWS IoT Core to draw inferences from the model (SageMaker models are deployed on the edge).

Algorithms divided by the type of acceleration available

Machine type instances

TODO:

EC2 P3 & P3DN

EC2 G4 & EC2 CS

FPGAs

AMIs

Elastic inference

Inferentia

High-performance chip built for deep learning inference

AWS snowball

  • local storage and large-scale data transfer
  • up to 40 vCPUs available (Snowball Edge)

AWS snowmobile

A shipping-container-sized service for physically transferring very large amounts of data by semi truck.

Text processing algorithms

  • TF-IDF (Term Frequency-Inverse Document Frequency) -> determines how important a word is in a document by assigning weights

Please do not sit here
Please do not smoke here

tf-idf matrix (unigrams and bigrams) size = (2, 6 + 6) = (2,12)
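A quick scikit-learn check of that size: 6 unique unigrams plus 6 unique bigrams across the two sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please do not sit here", "Please do not smoke here"]
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
print(tfidf.shape)  # (2, 12): 2 documents x (6 unigrams + 6 bigrams)
```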

More examples

  • Sequence-to-sequence -> machine translation, text summarization (needs tokenization and input data in RecordIO-protobuf with integer tokens)
  • bag of words -> creates tokens out of the words on the input
  • OSB (Orthogonal Sparse Bigram) -> creates groups of words
  • n-gram -> used to find multi word phrases in text
  • LDA (Latent Dirichlet Allocation) -> topic modeling, unsupervised
  • Neural Topic model -> topic modeling, unsupervised

DeepAR Forecasting

Forecasting scalar (1D) time series using RNN.

Factorization Machines

supervised (classification and regression), works well on sparse data, used for recommendation systems

Input data type

  • recordIO-protobuf float32

Inference data type

  • application/json
  • application/x-recordio-protobuf

IP Insights

unsupervised, uses neural network underneath, detects strange network traffic anomalies, “random fishy things”

GuardDuty

Threat detection service that detects anomalous or malicious behavior in your AWS accounts and workloads (e.g. unusual user behavior)

Reinforcement learning

You can run it on multiple cores/multiple machines.

Automatic Model Tuning

  • learns as it goes

Elastic MapReduce (EMR)

  • tool for big data processing and analysis
  • connected with Spark (spark can output a parquet file)
  • Real-time streaming
  • HPC (High Performance Computing)
  • requires management (provisioning)

Amazon Comprehend

  • NLP
  • text analytics
  • Amazon Comprehend Medical
  • sentiments
  • document classification
  • can understand many languages
  • Personally Identifiable Information (PII) -> be careful with that (Amazon Comprehend can detect that)

Amazon Comprehend Medical

Amazon Translate

  • takes JSON input and translates the text (even when the source language is set to auto) into the desired language

Amazon Transcribe

  • speech-to-text
  • channel identification
  • custom vocabularies
  • streaming client is an HTTP/2 streaming client

Amazon Polly

  • text-to-speech
  • polly is a stereotypical name for a parrot
  • when you happen to have acronyms in the text (W3C -> World Wide Web Consortium), you can use SSML for that (e.g. <sub alias="World Wide Web Consortium">W3C</sub>), but it is DOCUMENT specific only. That's why there is a better option: create a custom lexicon

Amazon Forecast

  • Amazon Forecast Prophet
    good for time series with strong seasonal effects
  • Amazon Forecast DeepAR+
    large datasets
    can work with related time series (many time series datasets that are correlated)
  • Amazon Forecast ARIMA
    simple datasets (less than 100 time series)
  • Amazon Forecast CNN-QR
    1D time series, Seq2Seq model
  • Amazon Forecast ETS (Exponential Smoothing)
    good for seasonality and other prior assumptions about the data
  • Amazon Forecast NPTS (Non-Parametric Time Series)
    works well for sparse time series

Amazon Kendra

  • text extracted from an individual document cannot exceed 5 MB
  • supports HTML, PowerPoint, Word, PDF, plain text

Amazon Lex

  • chat bot engine
  • utterance -> intent -> slots (extra information) -> Lambda (fulfillment)

Amazon Rekognition

  • computer vision
  • face detection
  • can be paired with Augmented AI (Rekognition predictions will be reviewed by humans)
  • Rekognition Image
  • Rekognition Video

Amazon Cognito

  • for authorization and user authentication

Amazon Connect

Easy to use omnichannel cloud contact center

Amazon Personalize

Recommender system, offered as a managed (PaaS-style) service.

Real-time personalization and recommendations.

Amazon Textract (OCR)

send an image/PDF to Amazon and receive the extracted text with confidence scores

Amazon Sumerian

used with augmented reality

IoT Core

used to gather data from devices (and to let the devices communicate with each other) and pass it on to services such as SageMaker

IoT Greengrass

moves AWS to the Edge for IoT devices, allowing them to connect to inference endpoints

IoT Analytics

used to gather data from IoT devices and is able to enrich that data with external sources

NLP

Methods in NLP (in the order the models were introduced)

BlazingText

Highly-optimized Word2Vec

  • sentiment analysis
  • entity recognition

Word2Vec (a special case of Object2Vec)

  • word vectors = word embeddings
  • similar meaning = similar vectors (when using Word2Vec)
  • Object2Vec is capable of creating embeddings for arbitrary objects, such as tweets

GloVe

Global Vectors for Word Representation

Transformer

ELMo

Embeddings from Language Models; uses LSTMs

PyTorch BERT

BiDirectional Encoder Representations from Transformers

Uses word masking (a bit like dropout, but for NLP) during pre-training, masking 15% of the tokens.

Enables transfer learning: first learn from Wikipedia or a books corpus, then train for domain-specific problems.

Amazon's NLP tips

  • in NLP spelling has a relatively lower bearing on the importance of the word
  • in NLP remove stop words (e.g. not, neither, nor)
  • tokenization of words for NLP

Extra info about AWS Services

  • AWS DeepRacer (reinforcement learning powered 1/18-scale race car)
  • DeepLens (deep learning-enabled video camera)
  • CloudTrail is for auditing (e.g. how often a model is deployed)
  • CloudWatch is monitoring and issues alarms (e.g. monitor CPU/GPU)
  • when a model fails you can also call DescribeJob API to check the FailureReason option
  • AWS KMS (Key Management Service)
    SSE — Server Side Encryption
    CSE — Client Side Encryption
  • model training happens inside VPC
  • SimpleImputer default strategy = mean
  • lambda function max deployment package size is 50MB
  • lambda max memory setting = about 3 GB (3,008 MB)
  • lambda blueprint can be taken from AWS Serverless Application Repository or AWS Lambda Repository
  • lambda transformed record must contain recordId, result and data
  • Semantic Segmentation is used for computer vision, not NLP
  • online learning -> learning on the go
  • incremental learning -> if you have a model trained for the specific job and you will train it again using new data
  • transfer learning -> use the pretrained model (like ResNet, YoloV3) and retrain for your specific data
  • out-of-core learning -> used to train huge datasets that cannot be loaded into a single server, it trains using subsets of data
  • ReLU -> Rectified Linear Unit
  • Collaborative Filtering -> Amazon used that to create a recommendation system “Users who bought this also bought this”
  • RMSE is a good evaluation metric for regression when solving for a continuous problem
  • ROC is a good evaluation metric when solving for a binary variable (classification)
  • NAT gateways are instantiated in public subnets. Whenever you hear about encrypting and hiding SageMaker traffic, always keep in mind the phrase "VPC interface endpoint"
  • SQS (Simple Queue Service) message queue service. You need boto for that
  • SNS (Simple Notification Service) notification system — mail, SMS or push notification
  • automatic load balancing also costs money on AWS
  • SGD (Stochastic Gradient Descent) fails -> RMSProp, Adam or Adagrad, Adadelta, NAG or Momentum
  • Gradient Descent converges faster after normalization

Correlation coefficients

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. A coefficient near zero means no correlation, not a negative correlation.

Covariance correlation coefficient

Covariance is used when you have a Gaussian relationship between your variables.

Pearson’s correlation coefficient

Also used when you have a Gaussian relationship.
neg. correlation < -0.5 < indeterminate correlation < 0.5 < pos. correlation

Rank correlations

Spearman’s correlation coefficient
Used when you have a non-Gaussian (but monotonic) relationship.
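A small SciPy sketch contrasting the two on a monotonic but non-linear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3                  # monotonic, but far from linear/Gaussian

print(pearsonr(x, y)[0])    # < 1: the linear (Pearson) coefficient misses some structure
print(spearmanr(x, y)[0])   # 1.0: the rank (Spearman) coefficient captures the monotonic relation
```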

Polychoric correlation coefficient (or tetrachoric)

This coefficient is used to understand the relationship of variables gathered via surveys such as personality tests and surveys that use rating scales.

Other ML techniques and notes

Naive Bayes

Multinomial Naive Bayes counts the frequency of a given word/observation, which suits document/word-count tasks.
Bernoulli Naive Bayes suits document classification tasks where you only wish to know whether a word appears or not.
Gaussian Naive Bayes works with continuous values in your observations, not discrete values.
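A minimal scikit-learn sketch of the three flavours (the toy data is made up):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

X_counts = np.array([[3, 0, 1], [0, 2, 4]])     # word counts per document
X_binary = (X_counts > 0).astype(int)           # word present / absent
X_cont = np.array([[1.2, 3.4], [2.1, 0.5]])     # continuous observations
y = np.array([0, 1])

MultinomialNB().fit(X_counts, y)  # frequencies of words/observations
BernoulliNB().fit(X_binary, y)    # whether the word appears or not
GaussianNB().fit(X_cont, y)       # continuous values
```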

Techniques in ML

Ridge regression
Will reduce the coefficients in the model but not all the way to 0.
Lasso regression
Can reduce some of the coefficients to 0.
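A toy scikit-learn sketch of the difference (the synthetic data has only two informative features):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # uninformative coefficients driven to exactly 0
```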

Imputation techniques:

  • deep learning
  • yeo-johnson transformation (used to give a more gaussian distribution for your data)
  • mean imputation (fills missing values with the column mean, a rather naive approach)
  • multivariate imputation (used for predicting missing values in the data, better than mean imputation)
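A scikit-learn sketch contrasting mean and multivariate imputation (IterativeImputer is still experimental, hence the extra import):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # naive: fill with the column mean
print(IterativeImputer().fit_transform(X))              # multivariate: model each feature from the others
```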

Techniques for using multiple GPUs

  • Horovod (simple, use only for training, remove when deploying an inference model)
  • PySpark (more work than Horovod)
  • using DeepAR (more work than Horovod)

Visualizing data

S3 -> Lake Formation -> QuickSight
Elasticsearch -> Kibana

Plot types

  • scatter plot
  • bubble chart (can be used to compare 3 features)
  • pairs plot
  • swarm plot
  • cat plot
  • covariance matrix
  • correlation matrix
  • confusion matrix (often used to describe classification performance)
  • entropy matrix
  • histogram (1D)
  • line chart (for trends, time series data)
  • radar chart (good for drawing multiple variables simply)
  • bar chart
  • heat map

For more examples, see the seaborn gallery.

Network protocols

  • HTTP
  • HTTP/2 (e.g. the one used by Amazon Transcribe) used for streaming data
  • HTTP/3 (is currently being developed)

ARIMA -> Autoregressive integrated moving average

Oversampling (how to handle imbalanced datasets)

Creating additional minority-class data (e.g. for fraud detection or anomaly detection) where there is very little positive data but we still need to detect it. Undersampling, by contrast, is a technique where we remove majority-class data (e.g. kNN undersampling).

SMOTE oversampling
Synthetic Minority Oversampling Technique, kinda good but not that awesome

Random oversampling
Naive way to achieve that

GANs oversampling (Generative Adversarial Networks)
Creates convincing new data, so there are more unique observations.

While SMOTE approaches are based on local information, GAN methods learn from the overall class distribution.
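A minimal imbalanced-learn sketch of SMOTE (class sizes are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                          # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                      # minority class synthetically upsampled to match
```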

Measuring the goodness of AI

ROC (Receiver Operating Characteristic)

PR curve

Precision

Recall
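As a reminder: precision = TP / (TP + FP), recall = TP / (TP + FN), the ROC curve plots the true-positive rate against the false-positive rate, and the PR curve plots precision against recall. A scikit-learn sketch (labels and scores are made up):

```python
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

print(precision_score(y_true, y_pred))           # TP / (TP + FP)
print(recall_score(y_true, y_pred))              # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))            # area under the ROC curve
print(average_precision_score(y_true, y_score))  # summary of the precision-recall curve
```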

Databases

Relational

  • SQL
  • MySQL

Non-relational

  • Hadoop
  • Spark
  • MongoDB
  • NoSQL

Q&A

Example exam from AWS notes:

  1. A
  2. C
    should be A
  3. B
  4. D
    should be B
  5. D
  6. D
    should be B
  7. C
  8. B (maybe D, but why do that for 5% when there are multiple columns to be imputed)
    but it is D; lol, it was really worth it. Maybe 5% really is a lot.
  9. A
    should be B
  10. D

Test score = 60%

If you happen to have different local minima (training function fluctuating around different values during different batch runs), then it would be best to:

  • decrease batch size (will not hit local minima)
  • decrease learning rate (will not overshoot global minima)

small mini-batch -> helps avoid getting stuck in local minima

large mini-batch -> more computationally efficient

ensemble of models -> a combination of ML models working together to produce one inference (e.g. XGBoost for structured data and a CNN for images)

Neural Networks are being widely used in ML thanks to:

  1. A lot of data generated through social media, captcha etc.
  2. efficient algorithms arose (softmax etc.)
  3. cheaper GPUs
