Exam passed on March 6th 2021!
I have written this post while preparing for the AWS Certified Machine Learning — Specialty exam. This is a summary of what I know about that technology. You may find some of the notes very trivial — my goal was to make sure that I don't make any mistake and remember every fact.
Courses I’ve taken
- Udemy AWS Certified Machine Learning Specialty 2021 — Hands On! course. It's better to take the Udemy course first. Whizlabs goes into more detail. You may even find it hard, whereas Stephane and Frank make it much easier to understand the concepts behind the terminology.
- Whizlabs AWS Certified Machine Learning Specialty course
- Machine Learning by Andrew Ng
- My bachelor's thesis was about OCR using CNNs and also compared the results with Tesseract
- Deep Learning Specialization (also by Andrew Ng)
- AWS YouTube videos & some of the re:Invent 2021 conference talks
- AWS Exam readiness course
Quizzes & Question dumps & Misc
- Best question dump I’ve found (on exam topics, in case the URL dies)
- Whizlabs practice exams
- Testprep exam questions
- AWS Certified Machine Learning Specialty — Sample Questions
- Andrew Ng and his programming assignments in his courses
- Reddit topics (check how others are preparing for it)
Data scaling & normalization methods
- Mean/variance standardization
- MinMax scaling
- Maxabs scaling
- Robust scaling
- Normalizer (scales row-wise)
- Standard scaler (performs scaling and shifting/centering column-wise; see the sketch after this list)
- One-hot-encoding (do that for categorical data)
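A minimal sketch (assuming scikit-learn) contrasting the row-wise Normalizer with the column-wise scalers listed above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Column-wise: each feature gets zero mean and unit variance
print(StandardScaler().fit_transform(X))
# Column-wise: each feature is rescaled to the [0, 1] range
print(MinMaxScaler().fit_transform(X))
# Row-wise: each sample (row) is scaled to unit L2 norm
print(Normalizer().fit_transform(X))
```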
EC2 (Elastic Compute Cloud)
It's the environment in which SageMaker Jupyter notebooks run.
S3 (Simple Storage Service)
For SageMaker, the S3 data distribution type parameter can be:
- ShardedByS3Key (each training instance receives only a subset of the dataset)
- FullyReplicated (the entire dataset is replicated on each training instance)
- Transition actions — Define when objects transition to another storage class. For example, you might choose to transition objects to the S3 Standard-IA storage class 30 days after you created them, or archive objects to the S3 Glacier storage class one year after creating them.
- Expiration actions — Define when objects expire. Amazon S3 deletes expired objects on your behalf. (A minimal lifecycle configuration is sketched below.)
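A minimal sketch (assuming boto3 and a hypothetical bucket name) of a lifecycle rule combining a transition action and an expiration action:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-datasets",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Transition actions: move to cheaper storage classes over time
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                # Expiration action: S3 deletes the objects on your behalf
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```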
Data storage options
- S3 Standard
general-purpose storage of frequently accessed data
- S3 Intelligent-Tiering
data with unknown or changing access patterns
- S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA)
long-lived, but less frequently accessed data
- S3 Glacier and S3 Glacier Deep Archive
long-term archive and digital preservation.
- S3 Outposts
If you have data residency requirements that can't be met by an existing AWS Region, you can use S3 on Outposts to keep the data on premises.
Kinesis Data Stream
a single shard can ingest up to 1 MB/s or 1,000 records/s (a single record must be at most 1 MB)
- real time
- PutRecord (puts a single record into a Kinesis Data Stream; see the sketch below)
- there is no built-in S3 delivery: you have to use the Kinesis Client Library (KCL) to consume the data and then write it to S3 yourself
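A minimal sketch (assuming boto3 and a hypothetical, pre-created stream name) of PutRecord:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream-events",                              # hypothetical stream name
    Data=json.dumps({"user": "u42", "page": "/home"}).encode(),   # one record, max 1 MB
    PartitionKey="u42",                                           # hashed to pick the target shard
)
```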
Kinesis Video Streams
- real time video processing
Kinesis Data Firehose
a record has to be at most 1,000 KB (1 MB)
input buffer can be 1–128 MB
buffer interval 60–900 seconds
lambda default timeout is 3 seconds
outputs Parquet or ORC on the fly
- non real time or close to real time
- can output parquet file format
- can write directly to S3
status of transformed data (see the Lambda sketch after this list):
- Dropped (intentionally rejected by transformation)
- ProcessingFailed (could not transform the data)
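A minimal sketch (a hypothetical Firehose transformation Lambda) showing the statuses above; every returned record must contain recordId, result and base64-encoded data:

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if payload.get("skip"):               # intentionally rejected record
                result, data = "Dropped", record["data"]
            else:
                payload["processed"] = True       # example transformation
                result = "Ok"
                data = base64.b64encode(json.dumps(payload).encode()).decode()
        except Exception:
            result, data = "ProcessingFailed", record["data"]   # could not transform
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}
```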
Kinesis Producer Library
Provides built-in performance benefits and is very easy to use.
e.g. ingesting clickstream data should be veeeery easy
Kinesis Data Analytics
Transform and analyze streaming data in real time using Apache Flink (data processing for streams). Uses SQL queries.
Can detect dense regions in data using Hotspots.
Can detect anomalies using Random Cut Forest.
AWS Data Pipeline
Managed ETL (Extract, Transform, Load) service
ingest data from NFS drive
AWS DMS (Database Migration Service)
- for batch processing
- reads from relational and non relational databases
AWS Glue
uses crawlers to catalog data and runs ETL jobs on:
- structured data
- unstructured data
- has the function FindMatches Transform (labeling file should be encoded in UTF-8 with Byte Order Mark)
- has a module called Built-In Transforms
- has Spark ML jobs (jobs operating on parquet data)
- don’t use if ETL is not mentioned
- batch processing
What tools can be used on AWS Glue when using Spark
- parquet data
- Spark MLeap containers
- Spark MLlib for building ML components for data transformation (tokenizing, encoding, normalizing etc.)
- the SparkML Serving Container allows you to deploy an Apache Spark ML pipeline in SageMaker
- Serverless ETL
Amazon Athena
- built on Presto
- runs standard SQL
Amazon Aurora
MySQL and PostgreSQL-compatible relational database built for the cloud. Performance and availability of commercial-grade databases at 1/10th the cost.
Amazon Fraud Detector
- ingests only CSV
- requires at least 10,000 rows of training data
- requires at least 400 examples of fraud
Amazon MSK (Managed Streaming for Apache Kafka)
Kafka is a publish/subscribe messaging system
Amazon Redshift
- data warehouse
- if the company has the Redshift data warehouse and wants to move part of its data to S3, it can use Redshift Spectrum to query that data using Redshift
Amazon DynamoDB
NoSQL key-value database
AWS Lake Formation
- data lake
- is built on top of AWS Glue (e.g. has crawlers)
- uses S3 as data storage
Amazon Step Functions
- can do a lot of ETL on batch data
The diagram below can also be described with ASL (Amazon States Language):
Amazon FSx for Lustre
a file system service that speeds up training jobs: it accelerates data flow between S3 and SageMaker and prevents downloading the same set of data three times (as in the example below)
Amazon EFS (Elastic File System)
faster training times because training jobs can read data directly from EFS; there is no need to pull the data from S3 separately for the notebook and for each training job
Amazon EBS (Elastic Block Store) volumes
Easy-to-use, high-performance block storage at any scale. Designed to be used with EC2.
Amazon QuickSight
Only for structured data in S3 or a database.
Used to visualize data. Some of the visualizations:
- Net Promoter Score
- KPI (Key Performance Indicator)
- Customer Profitability Score
- Bar charts
- Pie charts
Amazon SageMaker
SageMaker is a managed service from Amazon which makes it easier to implement and deploy ML algorithms.
- Cannot read from Elasticache, has to be S3.
- Offline testing model -> alpha endpoints
- Its notebooks are run on EC2 instances
- Use SageMaker Management Console to specify the metrics you want to track
- You can also use the sagemaker.analytics module and TrainingJobAnalytics
- You can change the inference pipeline when it is already deployed by using the UpdateEndpoint API, although you will lose autoscaling
- no limit on input size data
SageMaker Linear Learner
- regression (linear regression assumes normally distributed residuals)
- recordIO-protobuf float32
The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.
To interact with SageMaker hyperparameter tuning jobs from the Python SDK, use the HyperparameterTuner() class.
mean square error, cross entropy loss, absolute error.
F1 measure, precision, recall, or accuracy.
- test:msd (mean squared distances)
- test:ssd (sum of the squared distances)
The API call that creates a hyperparameter tuning job is CreateHyperparameterTuningJob.
objective set to multi:softprob
memory-bound, so it benefits more from M-type instances than from C-type (compute-optimized) ones
Inference endpoints can use only (no application/sth):
SEAGuL acronym ;) (subsample, eta, alpha, gamma, lambda):
- subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
- eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
- alpha: L1 regularization term on weights. Increasing this value makes the model more conservative.
- gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
- lambda: L2 regularization term on weights. Increasing this value makes the model more conservative.
Tuning hyperparameters of XGBoost
- Random search
will work but it may run very long
- Bayesian optimization
also optimizes but runs shorter
Grid search is similar to random search in that it chooses hyperparameter configurations blindly. But it’s usually less effective because it leads to almost duplicate training jobs if some of the hyperparameters don’t influence the results much.
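A minimal sketch (assuming the SageMaker Python SDK, a hypothetical S3 bucket and a hypothetical execution role) of Bayesian tuning over the XGBoost hyperparameters above:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical execution role

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",              # M instance, since XGBoost is memory-bound
    output_path="s3://my-bucket/xgb-output/",  # hypothetical bucket
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "subsample": ContinuousParameter(0.5, 1.0),
        "eta": ContinuousParameter(0.01, 0.3),
        "alpha": ContinuousParameter(0, 10),
        "gamma": ContinuousParameter(0, 5),
        "lambda": ContinuousParameter(0, 10),
    },
    strategy="Bayesian",       # usually needs fewer training jobs than random search
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```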
- MSE (Mean Squared Error) -> good for measuring regression problems, but it is heavily influenced by outliers
- MAE (Mean Absolute Error) -> good regression metric that is much less influenced by outliers
SageMaker Production Variants
Something like shadow testing in Tesla cars (you have two concurrent Autopilots running). Weights decide which algorithm is more important. If you want to slowly introduce a new model:
- Create an endpoint configuration with production variants for the two models, starting with a weight ratio of 0:1 (all traffic still goes to the old model)
- Update the weights periodically to shift more traffic to the new model (see the sketch below)
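A minimal sketch (assuming boto3 and two already-created SageMaker models with hypothetical names) of weighted production variants and a gradual weight update:

```python
import boto3

sm = boto3.client("sagemaker")

# Two variants behind one endpoint; initially all traffic goes to the old model
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",        # hypothetical names throughout
    ProductionVariants=[
        {"VariantName": "old-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 1.0},
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.0},
    ],
)
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config")

# Later: shift 10% of the traffic to the new model, then keep updating periodically
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "old-model", "DesiredWeight": 0.9},
        {"VariantName": "new-model", "DesiredWeight": 0.1},
    ],
)
```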
High-level interface for SageMaker training
Makes it easier to manage infrastructure on SageMaker. If you need a fast ML solution, it is better to use SageMaker Processing than to manually write the code in SageMaker Studio.
Amazon Neptune ML
New stuff, 2021. Works on GNNs (Graph Neural Networks): machine learning optimized for graphs (whereas e.g. XGBoost has to operate on tabular data). Uses the Deep Graph Library (DGL).
t-SNE (t-Distributed Stochastic Neighbor Embedding)
a technique for reducing dimensionality, often used for visualization
PCA (Principal Component Analysis)
operates in 2 modes:
- regular -> sparse datasets, moderate number of observations & features
- randomized -> large number of observations & features
- application/x-recordio-protobuf (vector of projections)
- File mode
- Pipe mode
SageMaker Studio — IDE
allows data scientists to collaborate and synchronize their work
SageMaker Random Cut Forest
Does not support GPU!
SageMaker Batch Transform
Can do pre- and post-processing (like removing an ID feature before inference and then joining the ID back to the results). Used to handle very large datasets. It is a SageMaker-only feature.
- not for real time applications
- for a lot of data
Compares ML models.
Tool that makes it easier for a data scientist to track a model's performance and possible problems. Saves the model parameters during training so that they can be visualized.
SageMaker Ground Truth
Annotation consolidation sends an image to a couple of workers, so that if one of them mislabels the image the others likely will not, and thanks to that we can be more confident that the data is labeled properly.
- accepts CSV only.
SageMaker Neo + AWS IoT Greengrass
Neo is used to compile a model for an IoT device, Greengrass gathers the data from those devices, and an IoT device can use AWS IoT Core to draw inferences from the models (SageMaker is deployed at the edge).
Algorithms divided by the type of acceleration available
Machine type instances
EC2 P3 & P3DN
EC2 G4 & EC2 CS
High performance deep learning powered chip
- local storage and large scale-data transfer
- a large number of vCPUs
A shipping-container-sized device for physically transferring data using a semi-truck (AWS Snowmobile).
Text processing algorithms
- TF-IDF (Term Frequency-Inverse Document Frequency) -> determines how important a word is in a document by giving weights
Please do not sit here
Please do not smoke here
tf-idf matrix (unigrams and bigrams) size = (2, 6 + 6) = (2, 12) (see the sketch after this list)
- Sequence-to-sequence -> machine translation, text summarization (needs tokenization and input data in RecordIO-protobuf with integer tokens)
- bag of words -> creates tokens out of the words on the input
- OSB (Orthogonal Sparse Bigram) -> creates groups of words
- n-gram -> used to find multi word phrases in text
- LDA (Latent Dirichlet Allocation) -> topic modeling, unsupervised
- Neural Topic model -> topic modeling, unsupervised
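A minimal sketch (assuming scikit-learn) verifying the (2, 12) unigram + bigram tf-idf matrix size for the two sentences above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please do not sit here", "Please do not smoke here"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
tfidf = vectorizer.fit_transform(docs)

print(tfidf.shape)                                 # (2, 12): 6 unigrams + 6 bigrams
print(vectorizer.get_feature_names_out())
```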
Forecasting scalar (1D) time series using RNN.
unsupervised, works well on sparse data, recommendation system
Input data type
- recordIO-protobuf float32
Inference data type
unsupervised, uses neural network underneath, detects strange network traffic anomalies, “random fishy things”
Detects anomalies on website (users’ behavior anomalies)
You can run it on multiple cores/multiple machines.
Automatic Model Tuning
- learns as it goes
Elastic MapReduce (EMR)
- tool for big data processing and analysis
- connected with Spark (spark can output a parquet file)
- Real-time streaming
- HPC (High Performance Computing)
- requires management (provisioning)
- text analytics
- Amazon Comprehend Medical
- document classification
- can understand many languages
- Personally Identifiable Information (PII) -> be careful with that (Amazon Comprehend can detect that)
Amazon Comprehend Medical
Amazon Translate
- returns JSON with the text translated (even when the source language is set to auto) into the wanted language
Amazon Transcribe
- channel identification
- custom vocabularies
- streaming client is an HTTP/2 streaming client
Amazon Polly
- Polly is a stereotypical name for a parrot
- when you happen to have acronyms in text (W3C -> World Wide Web Consortium), you can create SSML for that (e.g. <sub alias="World Wide Web Consortium">W3C</sub>), but it is DOCUMENT specific only. That's why there is a better option - create a custom lexicon
- Amazon Forecast Prophet
good for time series with strong seasonal effects
- Amazon Forecast DeepAR+
can work with related time series (many time series datasets that are correlated)
- Amazon Forecast ARIMA
simple datasets (less than 100 time series)
- Amazon Forecast CNN-QR
1D time series, Seq2Seq model
- Amazon Forecast ETS (Exponential Smoothing)
good for seasonality and other prior assumptions about the data
- Amazon Forecast NPTS (Non-Parametric Time Series)
works well for sparse time series
- text extracted from an individual document cannot exceed 5 MB
- supports HTML, PowerPoint, Word, PDF, plain text
Amazon Lex
- chat bot engine
- utterance -> intent -> lambda -> slot (extra information)
Amazon Rekognition
- computer vision
- face detection
- can be paired with Augmented AI (Rekognition predictions will be reviewed by humans)
- Rekognition Image
- Rekognition Video
- for authorization and user authentication
Amazon Connect
Easy-to-use omnichannel cloud contact center
Amazon Personalize
- recommender system -> PaaS type
- real-time personalization system
Amazon Textract (OCR)
send an image/PDF to Amazon and receive the text with confidence scores
used with augmented reality
used to gather data from devices into SageMaker and between the devices themselves (intercommunication)
moves AWS to the Edge for IoT devices, allowing them to connect to inference endpoints
used to gather the data from IoT devices and is able to enrich that data with external one
Methods in NLP (in the order the models were introduced)
- sentiment analysis
- entity recognition
Word2Vec (a special case of Object2Vec)
- word vectors = word embeddings
- similar meaning = similar vectors (when using Word2Vec)
- Object2Vec is capable of creating embeddings for arbitrary objects, such as tweets
GloVe (Global Vectors for Word Representation)
ELMo (Embeddings from Language Models) uses LSTMs
BERT (Bidirectional Encoder Representations from Transformers)
Uses word masking (a bit like a dropout feature for NLP) during pre-training; it masks 15% of the tokens.
Enables transfer learning: first learn from Wikipedia or a books corpus, then train for domain-specific problems.
Amazons NLP tips
- in NLP spelling has a relatively lower bearing on the importance of the word
- in NLP remove stop words (e.g. not, neither, nor)
- tokenization of words for NLP
Extra info about AWS Services
- AWS DeepRacer (reinforcement learning powered 1/18-scale race car)
- DeepLens (deep learning-enabled video camera)
- CloudTrail is for auditing (e.g. how often a model is deployed)
- CloudWatch is monitoring and issues alarms (e.g. monitor CPU/GPU)
- when a job fails you can also call the relevant Describe*Job API (e.g. DescribeTrainingJob) and check the FailureReason field
- AWS KMS (Key Management Service)
SSE — Server Side Encryption
CSE — Client Side Encryption
- model training happens inside VPC
- SimpleImputer default strategy = mean
- lambda function max deployment package size is 50MB
- lambda max memory setting = 3,008 MB (roughly 3 GB)
- lambda blueprint can be taken from AWS Serverless Application Repository or AWS Lambda Repository
- lambda transformed record must contain recordId, result and data
- Semantic Segmentation is used for computer vision, not NLP
- online learning -> learning on the go
- incremental learning -> if you have a model trained for the specific job and you will train it again using new data
- transfer learning -> use the pretrained model (like ResNet, YoloV3) and retrain for your specific data
- out-of-core learning -> used to train huge datasets that cannot be loaded into a single server, it trains using subsets of data
- ReLU -> Rectified Linear Unit
- Collaborative Filtering -> Amazon used that to create a recommendation system “Users who bought this also bought this”
- RMSE is a good evaluation metric for regression when solving for a continuous problem
- ROC/AUC is a good evaluation metric when solving for a binary variable (classification, e.g. logistic regression)
- NAT gateways are instantiated in public subnets. Whenever you hear about encrypting and hiding SageMaker traffic, always keep in mind the phrase "VPC interface endpoint"
- SQS (Simple Queue Service) message queue service. You need boto for that
- SNS (Simple Notification Service) notification system — mail, SMS or push notification
- automatic load balancing also costs money on AWS
- if SGD (Stochastic Gradient Descent) fails -> try RMSProp, Adam, Adagrad, Adadelta, NAG or Momentum
- Gradient Descent converges faster after normalization
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. A value near zero means no (linear) correlation; a negative value means an inverse relationship.
Covariance correlation coefficient
Covariance is used when you have a Gaussian relationship between your variables.
Pearson’s correlation coefficient
Also used when you have a Gaussian relationship.
coefficient < -0.5 -> negative correlation; between -0.5 and 0.5 -> indeterminate; > 0.5 -> positive correlation
Spearman’s correlation coefficient
Also used when you have a non-Gaussian relationship.
Polychoric correlation coefficient (or tetrachoric)
This coefficient is used to understand the relationship of variables gathered via surveys such as personality tests and surveys that use rating scales.
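A minimal sketch (assuming SciPy) of Pearson vs Spearman on toy data with a monotonic but non-linear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = x ** 3 + rng.normal(0, 500, size=20)   # monotonic, but clearly non-linear

print(pearsonr(x, y))    # measures linear (Gaussian-style) association
print(spearmanr(x, y))   # rank-based, better suited to non-Gaussian relationships
```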
Other ML techniques and notes
Multinomial Naive Bayes for document/word search will count the frequency of a given word/observation.
Bernoulli Naive Bayes for document classification tasks is used when you only wish to know whether a word appears or not.
Gaussian Naive Bayes works with continuous values in your observations, not discrete values.
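A minimal sketch (assuming scikit-learn) of the three Naive Bayes flavours on toy data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

word_counts = np.array([[2, 0, 1],     # word frequencies per document
                        [0, 3, 0]])
labels = [0, 1]

MultinomialNB().fit(word_counts, labels)                      # uses how often each word occurs
BernoulliNB().fit((word_counts > 0).astype(int), labels)      # uses only whether a word appears
GaussianNB().fit(np.array([[1.2, 0.3], [0.1, 2.4]]), labels)  # continuous-valued features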
Techniques in ML
- L2 (Ridge) regularization: will reduce the coefficients in the model but not all the way to 0
- L1 (Lasso) regularization: can reduce some of the coefficients to 0 (see the sketch after this list)
- deep learning
- Yeo-Johnson transformation (used to give your data a more Gaussian distribution)
- mean imputation (fills in missing values with the mean, a rather naive approach)
- multivariate imputation (used for predicting missing values in the data; better than mean imputation)
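A minimal sketch (assuming scikit-learn) showing the Ridge vs Lasso behaviour described above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # coefficients shrunk, but none exactly 0
print(Lasso(alpha=1.0).fit(X, y).coef_)   # several coefficients driven to exactly 0
```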
Techniques for using multiple GPUs
- Horovod (simple, use only for training, remove when deploying an inference model)
- PySpark (more work than Horovod)
- using DeepAR (more work than Horovod)
S3 -> Lake Formation -> QuickSight
Elasticsearch -> Kibana
!!!!!!! Plot types !!!!!!!!!
- scatter plot
- bubble chart (can be used to compare 3 features)
- pairs plot
- swarm plot
- cat plot
- covariance matrix
- correlation matrix
- confusion matrix
often used to describe classification performance
- entropy matrix
- histogram (1D)
- line chart (for trends, time series data)
- residual plot (good for deciding whether regression fits well to data)
- radar chart (good for drawing multiple variables simply)
- bar chart
- heat map
Network protocols
- HTTP/2 (e.g. the one used by Amazon Transcribe) used for streaming data
- HTTP/3 (is currently being developed)
ARIMA -> Autoregressive integrated moving average
Oversampling (how to handle imbalanced datasets)
Creating synthetic minority-class data (e.g. for fraud detection or anomaly detection) where there is very little positive data but we still need to detect it. Undersampling is the opposite technique, where we remove examples from the majority class (e.g. kNN undersampling).
SMOTE (Synthetic Minority Oversampling Technique): kinda good but not that awesome (see the sketch after this section)
Naive way to achieve that
GANs oversampling (Generative Adversarial Networks)
Creates new data in a very convincing way; thanks to that there are more unique observations.
While SMOTE approaches are based on local information, GAN methods learn from the overall class distribution.
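A minimal sketch (assuming the imbalanced-learn package) of the SMOTE oversampling mentioned above, on a toy imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                        # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # minority class oversampled with synthetic points
```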
Measuring the goodness of AI
ROC (Receiver Operating Characteristic)
Example exam from AWS notes:
should be A
should be B
should be B
- B (maybe D, but why do that for 5% when there are multiple columns to be imputed)
but it is D, lol it was really worth it. Maybe 5% is really a lot.
should be B
Test score = 60%
If you happen to have different local minima (training function fluctuating around different values during different batch runs), then it would be best to:
- decrease the batch size (noisier updates help avoid getting stuck in local minima)
- decrease the learning rate (so it will not overshoot the global minimum)
small mini-batch -> helps avoid getting stuck in local minima
large mini-batch -> computationally more efficient
ensemble of models -> a combination of ML models working together to get one inference (e.g. XGBoost for structured data and a CNN for images)
Neural Networks are being widely used in ML thanks to:
- A lot of data generated through social media, captcha etc.
- efficient algorithms arose (softmax etc.)
- cheaper GPUs