AWS Certified Machine Learning

Exam passed on March 6th 2021!

I have written this post while preparing for the AWS Certified Machine Learning — Specialty exam. It is a summary of what I know about that technology. You may find some of the notes very trivial — my goal was to make sure that I don't make any mistakes and remember every fact.

Courses I’ve taken

  1. Udemy AWS Certified Machine Learning Specialty 2021 — Hands On! course. It's better to take the Udemy course first: Whizlabs goes into more detail, and you may even find it hard, whereas Stephane and Frank make the concepts behind the terminology much easier to understand.

Quizzes & Question dumps & Misc

  1. Best question dump I've found (on ExamTopics, in case the URL dies)

Data related

Data scaling & normalization methods

  • Mean/variance standardization (see the sketch below)
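A minimal sketch of mean/variance standardization with scikit-learn (the library choice and toy data are mine, not from the notes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_std = StandardScaler().fit_transform(X)  # subtract the mean, divide by the std

print(X_std.mean(axis=0))  # ~0 for every feature
print(X_std.std(axis=0))   # ~1 for every feature
```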

EC2 (Elastic Compute Cloud)

It's the environment in which SageMaker Jupyter notebooks run.

S3 (Simple Storage Service)

For SageMaker, the S3 data distribution type parameter can be set to (see the sketch below):

  • ShardedByS3Key (sends a distinct subset of the dataset to each training instance)
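A hedged sketch of setting that parameter through the SageMaker Python SDK (the bucket path is a placeholder):

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",  # hypothetical path
    distribution="ShardedByS3Key",    # each training instance gets its own shard
)
# distribution="FullyReplicated" would instead copy the full dataset to every instance
```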

Lifecycle configuration

  • Transition actions — Define when objects transition to another storage class. For example, you might choose to transition objects to the S3 Standard-IA storage class 30 days after you created them, or archive objects to the S3 Glacier storage class one year after creating them (see the boto3 sketch below).
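A minimal boto3 sketch of exactly that rule (bucket name and rule ID are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-rule",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to every object in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # after 30 days
                {"Days": 365, "StorageClass": "GLACIER"},     # after one year
            ],
        }]
    },
)
```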

Data storage options

  • S3 Standard
    general-purpose storage of frequently accessed data

Kinesis Data Stream

PutRecord API

Input

a single shard can ingest up to 1 MB/s (and a single record must be under 1 MB)
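A minimal PutRecord call with boto3 (stream name and payload are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="my-stream",
    Data=b'{"user": 42, "event": "click"}',  # payload must stay under 1 MB
    PartitionKey="42",  # determines which shard receives the record
)
```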

Output

Extras

  • real time

Kinesis Video Streams

  • real time video processing

Kinesis Data Firehose

Serverless solution

PutRecord API

Input

JSON

a record has to be < 1,000 KB (1 MB)

the input buffer can be 1–128 MB

the buffer interval is 60–900 seconds

the Lambda default timeout is 3 seconds
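A minimal PutRecord call against Firehose with boto3 (delivery stream name is a placeholder):

```python
import boto3

firehose = boto3.client("firehose")
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b'{"user": 42, "event": "click"}\n'},  # record < 1,000 KB
)
```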

Output

can convert records to Parquet or ORC on the fly

Extras

  • near real time (not real time)

status of transformed data (returned by the transformation Lambda):

  • Ok
  • Dropped
  • ProcessingFailed
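A hedged sketch of a transformation Lambda reporting those statuses; the transform itself (upper-casing) is a stand-in:

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],  # must be echoed back unchanged
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```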

Kinesis Producer Library

Provides built-in performance benefits and is very easy to use.

e.g. ingesting clickstream data should be veeeery easy

Kinesis Data Analytics

Transform and analyze streaming data in real time using SQL queries or Apache Flink applications (data processing for streams).

Can detect dense regions in data using Hotspots.

Can detect anomalies using Random Cut Forest.

AWS Data Pipeline

Managed ETL (Extract, Transform, Load) service

AWS DataSync

ingests data from NFS drives

AWS DMS (Database Migration Service)

  • for batch processing

AWS Glue

uses crawlers to catalog data and runs ETL jobs on:

  • structured data

Tools and formats that can be used on AWS Glue when using Spark

  • parquet data

Amazon Athena

  • Serverless interactive query service (SQL run directly against S3 data)

Amazon Aurora

MySQL and PostgreSQL-compatible relational database built for the cloud. Performance and availability of commercial-grade databases at 1/10th the cost.

Requires provisioning!

Amazon Fraud Detector

  • ONLINE_FRAUD_INSIGHTS

Fails if:

  • rows_count < 10k (the training dataset needs at least 10,000 records)

Amazon MSK (Managed Streaming for Apache Kafka)

Kafka is a publish/subscribe messaging system

Amazon Redshift

  • data warehouse

Amazon DynamoDB

NoSQL Key-Value database

AWS Lake Formation

  • data lake

Amazon Step Functions

  • can do a lot of ETL on batch data

Step Functions state machines can be described with ASL (Amazon States Language).

Amazon FSx for Lustre

a file system service that speeds up training jobs by accelerating data flow between S3 and SageMaker; prevents downloading the same dataset multiple times

Amazon EFS (Elastic File System)

faster training times: training jobs can read data directly from the file system, so there is no need to pull data from S3 for the training job and the notebook separately.

Amazon EBS (Elastic Block Store) volumes

Easy-to-use, high-performance block storage at any scale. Designed for use with EC2.

Amazon Quicksight

Only for structured data on S3 or database

Used to visualize data. Some of the visualizations:

  • hotspots
  • KPI (Key Performance Indicator)

Amazon SageMaker

SageMaker is a managed service from Amazon that makes it easier to implement and deploy AI algorithms.

  • Cannot read from ElastiCache; the data has to be in S3.

SageMaker Linear Learner

  • classification

Training

  • recordIO-protobuf float32

Testing

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperParameterTuningJob.

In order to interact with SageMaker hyperparameter tuning jobs from the Python SDK, use the HyperparameterTuner() class, as sketched below.
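A hedged sketch, assuming an already-configured `estimator` and input channels; the metric name and ranges are illustrative:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # any SageMaker Estimator
    objective_metric_name="validation:rmse",  # metric the tuner optimizes
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)
# fit() is what issues the CreateHyperParameterTuningJob call under the hood
tuner.fit({"train": train_input, "validation": val_input})
```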

REGRESSION
predictor_type='regressor'
mean square error, cross entropy loss, absolute error.

CLASSIFICATION
predictor_type='binary_classifier'
predictor_type='multiclass_classifier'
F1 measure, precision, recall, or accuracy (see the sketch below).
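Those predictor types map directly onto the SDK's LinearLearner estimator; a minimal sketch (role and instance settings are placeholders):

```python
import sagemaker

linear = sagemaker.LinearLearner(
    role="my-sagemaker-role",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    predictor_type="binary_classifier",  # or "regressor" / "multiclass_classifier"
)
```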

SageMaker k-Means

  • unsupervised

Training

Testing

testing metrics:

  • test:msd (mean squared distances)

Hyperparameters

The API call that creates a hyperparameter tuning job is CreateHyperParameterTuningJob.

k-Nearest Neighbors

  • classification

XGBoost

  • classification

objective set to multi:softprob

memory bound, so benefits more from M instances, rather than C ones

Inference endpoints accept only the following content type (no application/* types):

  • text/csv

XGBoost parameters

SEAGuL acronym ;)

  • subsample [default=1]
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, which prevents overfitting. Subsampling will occur once in every boosting iteration (see the sketch below).
    range: (0,1]
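A minimal sketch with the open-source xgboost package (not the SageMaker container) showing the subsample knob:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    subsample=0.5,    # each boosting iteration trains on a random 50% of the rows
    n_estimators=100,
)
# model.fit(X_train, y_train)
```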

Tuning hyperparameters of XGBoost

  • Random search
    will work, but it may run for a very long time

What about grid search?

Grid search is similar to random search in that it chooses hyperparameter configurations blindly. But it’s usually less effective because it leads to almost duplicate training jobs if some of the hyperparameters don’t influence the results much.
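A quick scikit-learn sketch contrasting the two (estimator and parameter grid are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {"learning_rate": [0.01, 0.05, 0.1, 0.3], "max_depth": [3, 5, 7]}

grid = GridSearchCV(GradientBoostingClassifier(), params)  # tries all 12 combinations
rand = RandomizedSearchCV(GradientBoostingClassifier(), params, n_iter=5)  # samples 5
# grid.fit(X, y); rand.fit(X, y)
```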

Metrics:

  • MSE (Mean Squared Error) -> good for measuring regression problems but handles outliers poorly (quick demo below)
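A tiny numpy demo of that outlier sensitivity (data is made up):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 3.1, 14.0])  # one badly-off prediction

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # ~25, driven almost entirely by the single outlier
```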

SageMaker Production Variants

Something like shadow testing in Tesla cars (you have two concurrent Autopilots running). Weights decide which algorithm is more important. If you want to slowly introduce a new model:

  1. Create an endpoint configuration with production variants for the two models, with a weight ratio of 0 (new model) to 1 (old model), then gradually shift the weights (see the sketch below)
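A hedged boto3 sketch of that rollout (model, variant, and endpoint names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-config",
    ProductionVariants=[
        {"VariantName": "old-model", "ModelName": "model-v1",
         "InitialInstanceCount": 1, "InstanceType": "ml.m5.large",
         "InitialVariantWeight": 1.0},  # all traffic starts here
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InitialInstanceCount": 1, "InstanceType": "ml.m5.large",
         "InitialVariantWeight": 0.0},  # no traffic yet
    ],
)
# Later, shift 10% of the traffic to the new model:
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "old-model", "DesiredWeight": 0.9},
        {"VariantName": "new-model", "DesiredWeight": 0.1},
    ],
)
```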

SageMaker Estimators

High-level interface for SageMaker training

SageMaker Processing

Makes it easier to manage infrastructure on SageMaker. If you need a fast ML solution, it is better to use SageMaker Processing than to manually write code using SageMaker Studio.

Amazon Neptune ML

New in 2021. Works on GNNs (graph neural networks): machine learning optimized for graphs (XGBoost, by contrast, has to operate on tabular data). Uses the Deep Graph Library (DGL).

t-SNE (t-Distributed Stochastic Neighbor Embedding)

a dimensionality reduction technique

PCA (Principal Component Analysis)

dimensionality reduction

operates in 2 modes:

  • regular -> sparse datasets, moderate number of observations & features

train Input:

  • application/x-recordio-protobuf

test Input:

  • text/csv

return format:

  • application/json

Modes:

  • File mode
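A hedged sketch of running the built-in PCA through the Python SDK (role, instance settings, and `train_ndarray` are placeholders):

```python
import sagemaker

pca = sagemaker.PCA(
    role="my-sagemaker-role",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_components=10,  # how many principal components to keep
)
# record_set() converts a float32 ndarray into the recordIO-protobuf format above
# pca.fit(pca.record_set(train_ndarray))
```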

SageMaker Studio — IDE

allows data scientists to synchronize their work

SageMaker Random Cut Forest

Does not support GPU!

Training input:

  • text/csv

SageMaker Batch Transform

Can do pre- and post-processing (like removing an id feature and then joining the id back to the data). Used to handle very large datasets. It is a SageMaker-only feature.

  • not for real time applications

SageMaker Experiments

Compares ML models.

SageMaker Debugger

A tool that makes it easier for a data scientist to track a model's performance and possible problems. Saves the model parameters during training so they can be visualized.

SageMaker Ground Truth

Annotation consolidation sends an image to a couple of workers, so if one of them mislabels the image, the others likely will not, and we can be more confident that the data is properly labeled.

SageMaker Autopilot

  • AutoML

SageMaker Neo + AWS IoT Greengrass

Neo is used to compile a model for an IoT device, Greengrass gathers the data from those devices, and an IoT device can use AWS IoT Core to draw inferences from the models (SageMaker is deployed on the edge).

Algorithms divided by the type of acceleration available

Machine type instances

TODO:

EC2 P3 & P3DN

EC2 G4 & EC2 CS

FPGAs

AMIs

Elastic inference

Inferentia

A high-performance chip for deep learning inference

AWS snowball

  • local storage and large scale-data transfer

AWS snowmobile

A shipping container for physically transferring data using a semi truck.

Text processing algorithms

  • TF-IDF (Term Frequency-Inverse Document Frequency) -> determines how important a word is in a document by giving weights

Please do not sit here
Please do not smoke here

tf-idf matrix (unigrams and bigrams) size = (2, 6 + 6) = (2,12)
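That size can be verified with scikit-learn (my sketch, not from the notes): 6 unique unigrams plus 6 unique bigrams across 2 documents gives shape (2, 12).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please do not sit here", "Please do not smoke here"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams

X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, 12)
```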

More examples

  • Sequence-to-sequence -> machine translation, text summarization (needs tokenization and input data in RecordIO-protobuf with integer tokens)

DeepAR Forecasting

Forecasting scalar (1D) time series using RNN.

Factorization Machines

supervised, works well on sparse data, recommendation systems

Input data type

  • recordIO-protobuf float32

Inference data type

  • application/json

IP Insights

unsupervised, uses neural network underneath, detects strange network traffic anomalies, “random fishy things”

GuardDuty

Detects anomalies in AWS accounts and workloads (users' behavior anomalies)

Reinforcement learning

You can run it on multiple cores/multiple machines.

Automatic Model Tuning

  • learns as it goes

Elastic MapReduce (EMR)

  • tool for big data processing and analysis

Amazon Comprehend

  • NLP

Amazon Comprehend Medical

Amazon Translate

  • takes JSON and translates the text (even when the source language is set to auto) into a desired language

Amazon Transcribe

  • speech-to-text

Amazon Polly

  • text-to-speech

Amazon Forecast

  • Amazon Forecast Prophet
    good for time series with strong seasonal effects

Amazon Kendra

  • text extracted from an individual document cannot exceed 5 MB

Amazon Lex

  • chat bot engine

Amazon Rekognition

  • computer vision

Amazon Cognito

  • for authorization and user authentication

Amazon Connect

Easy to use omnichannel cloud contact center

Amazon Personalize

(recommender system) -> PaaS type

a real-time personalization system

Amazon Textract (OCR)

send an image/PDF to Amazon and receive text with confidence scores

Amazon Sumerian

used with augmented reality

IoT Core

used to gather data from devices into SageMaker and to exchange data among the devices themselves (intercommunication)

IoT Greengrass

moves AWS to the Edge for IoT devices, allowing them to connect to inference endpoints

IoT Analytics

used to gather data from IoT devices; able to enrich that data with external data

NLP

Methods in NLP (in order of the models' introduction)

BlazingText

A highly optimized Word2Vec implementation

  • sentiment analysis

Word2Vec (a special case of Object2Vec)

  • word vectors = word embeddings

GloVe

Global Vectors for Word Representation

Transformer

ELMo

Embeddings from Language Models; uses LSTMs

PyTorch BERT

Bidirectional Encoder Representations from Transformers

Uses word masking (a dropout-like feature for NLP) during pre-training; it does so on 15% of the tokens.

Enables transfer learning: first learn from Wikipedia or BooksCorpus, then train for domain-specific problems.

Amazon's NLP tips

  • in NLP spelling has a relatively lower bearing on the importance of the word

Extra info about AWS Services

  • AWS DeepRacer (reinforcement learning powered 1/18-scale race car)
  • AWS KMS (Key Management Service)
    SSE — Server Side Encryption
    CSE — Client Side Encryption

Correlation coefficients

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. A coefficient of 0 means no correlation, which is not the same as negative correlation.

Covariance correlation coefficient

Covariance is used when you have a Gaussian relationship between your variables.

Pearson’s correlation coefficient

Also used when you have a Gaussian relationship.
neg. correlation < -0.5 < indeterminate correlation < 0.5 < pos. correlation

Rank correlations

Spearman's correlation coefficient
Used when you have a non-Gaussian relationship.
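A quick SciPy sketch of the difference (toy data): Pearson wants a linear relationship, Spearman only a monotonic one.

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but non-linear

print(pearsonr(x, y)[0])   # high (~0.98) but below 1
print(spearmanr(x, y)[0])  # exactly 1.0 (the ranks match perfectly)
```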

Polychoric correlation coefficient (or tetrachoric)

This coefficient is used to understand the relationship of variables gathered via surveys such as personality tests and surveys that use rating scales.

Other ML techniques and notes

Naive Bayes

Multinomial Naive Bayes, for document word search, will count the frequency of a given word/observation.
Bernoulli Naive Bayes, for document classification tasks, is used when you wish to know whether a word appears or not.
Gaussian Naive Bayes works with continuous values in your observations, not discrete values.
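A minimal scikit-learn mapping of each variant to its data type (my sketch):

```python
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

MultinomialNB()  # word counts / frequencies
BernoulliNB()    # binary word presence/absence
GaussianNB()     # continuous-valued features
# e.g. GaussianNB().fit(X_continuous, y)
```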

Techniques in ML

Ridge regression
Will reduce the coefficients in the model but not all the way to 0.
Lasso regression
Can reduce some of the coefficients to 0.
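A toy demo of that difference (data and alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(size=100)  # only the first feature matters

print(Ridge(alpha=1.0).fit(X, y).coef_)  # shrunken but non-zero coefficients
print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients driven to 0
```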

Imputation techniques:

  • deep learning

Techniques for using multiple GPUs

  • Horovod (simple, use only for training, remove when deploying an inference model)
  • PySpark (more work than Horovod)
  • using DeepAR (more work than Horovod)

Visualizing data

S3 -> Lake Formation -> QuickSight
Elasticsearch -> Kibana

!!!!!!! Plot types !!!!!!!!!

  • scatter plot
  • bubble chart (can be used to compare 3 features)
  • pairs plot
  • swarm plot
  • cat plot
  • covariance matrix
  • correlation matrix
  • confusion matrix (often used to describe classification performance)
  • entropy matrix
  • histogram (1D)
  • line chart (for trends, time series data)
  • radar chart (good for drawing multiple variables simply)
  • bar chart
  • heat map

go there for more: seaborn examples

Network protocols

  • HTTP

ARIMA -> Autoregressive Integrated Moving Average

Oversampling (how to handle imbalanced datasets)

Creating synthetic positive data (e.g. for fraud detection, anomalies) where there is very little positive data, but we need to detect it. Undersampling would be a technique where we remove majority data instead (e.g. kNN undersampling).

SMOTE oversampling
Synthetic Minority Oversampling Technique, kinda good but not that awesome
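A minimal sketch with the third-party imbalanced-learn package, assuming `X`, `y` are an imbalanced feature matrix and labels:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(Counter(y_resampled))  # the minority class is now synthetically balanced
```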

Random oversampling
Naive way to achieve that

GANs oversampling (Generative Adversarial Networks)
Creates new, high-quality synthetic data. Thanks to that, there are more unique observations.

While SMOTE approaches are based on local information, GAN methods learn from the overall class distribution.

Measuring the goodness of AI

ROC (Receiver Operating Characteristic)

PR curve

Precision

Recall
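Both metrics from confusion-matrix counts, as a quick sketch (the counts are made up):

```python
def precision(tp, fp):
    return tp / (tp + fp)  # of everything flagged positive, how much was right

def recall(tp, fn):
    return tp / (tp + fn)  # of all actual positives, how much was found

print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # ~0.67
```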

Databases

Relational

  • SQL

Non-relational

  • Hadoop

Q&A

Example exam from AWS notes:

  1. A

Test score = 60%

If you happen to have different local minima (training function fluctuating around different values during different batch runs), then it would be best to:

  • decrease the batch size (less likely to get stuck in local minima)

small mini-batch -> prevents stopping at local minima

large mini-batch -> good for computationally expensive jobs

ensemble of models -> a combination of ML models working to get one inference (e.g. XGBoost for structured data and a CNN for images)

Neural networks are widely used in ML thanks to:

  1. A lot of data generated through social media, CAPTCHAs, etc.
