Practical Data Science on AWS 1 - AutoML | Analyze Datasets and Train ML Models using AutoML

Alex Tech Bolg

Last modified: 2022-08-23 07:26:50

First published: 2022-08-19 01:24:58

Article link: https://blog.csdn.net/qq_41103204/article/details/126382537

Week 1

1 Specialization overview

2 Working with Data

2.1 Data ingestion and exploration

Amazon S3

Data lakes are often built on top of object storage such as Amazon S3.

With object storage, data is stored and managed as objects. Each object consists of the data itself, any relevant metadata (such as when the object was last modified), and a unique identifier.
Object storage is particularly well suited to storing and retrieving growing amounts of data of any type, which makes it a good foundation for data lakes.
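
The three parts of an object can be pictured with a small in-memory model. This is only a sketch: `StoredObject` and `ToyObjectStore` are hypothetical names made up for illustration, and they are not the S3 API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StoredObject:
    """One object: the data itself, its metadata, and a unique identifier (the key)."""
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)
    last_modified: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ToyObjectStore:
    """Hypothetical in-memory stand-in for a bucket; not the S3 API."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Writing to an existing key replaces the whole object.
        self._objects[key] = StoredObject(key=key, data=data, metadata=metadata)

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # Keys are flat strings; "folders" are only a naming convention.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("reviews/2022/part-0.csv", b"review_id,sentiment\n1,1\n", content_type="text/csv")
store.put("images/logo.png", b"\x89PNG", content_type="image/png")
print(store.list_keys("reviews/"))
```

Because every object is just bytes plus metadata under a key, CSV files and images can sit side by side in the same store, which is what makes this layout convenient for a data lake.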

AWS Data Wrangler

AWS Glue

Amazon Athena

Athena is serverless

Athena is based on Presto, an open-source distributed SQL engine developed for exactly this use case: running interactive queries against data sources of all sizes.
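
As a rough sketch of what such an interactive query might look like from Python, the snippet below builds a standard SQL aggregation and wraps the AWS Data Wrangler call `wr.athena.read_sql_query`. The table name `reviews`, the column `product_category`, and the helper names are placeholders assumed for illustration; running the query needs AWS credentials and an existing table, so the Athena call is defined but not executed here.

```python
def build_count_query(table, group_column):
    """Build a standard SQL aggregation that Athena can run over files in S3."""
    return (
        f"SELECT {group_column}, COUNT(*) AS n "
        f"FROM {table} "
        f"GROUP BY {group_column} "
        f"ORDER BY n DESC"
    )


def run_in_athena(sql, database):
    """Run the query with AWS Data Wrangler and return a pandas DataFrame.

    Needs the awswrangler package, AWS credentials, and an existing table,
    so it is not called in this sketch.
    """
    import awswrangler as wr  # imported lazily so the sketch loads without AWS set up

    return wr.athena.read_sql_query(sql=sql, database=database)


# "reviews" and "product_category" are placeholder names.
query = build_count_query("reviews", "product_category")
print(query)
```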

2.2 Data visualization

Week 2

1 Statistical bias

1.1 Statistical bias

1.2 Statistical bias causes

Societal bias: Bias introduced by preconceived notions that exist in society. Data generated by humans can be biased because all of us have unconscious biases.
Data drift (data shift): The distribution of the data the model sees varies significantly from the distribution of the data that was used to train the model initially.
- Covariate drift: The distribution of the independent variables (the features) changes.
- Prior probability drift: The distribution of the labels (the target variable) changes.
- Concept drift: The relationship between the features and the labels changes.
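
One simple way to put a number on prior probability drift is to compare the label distribution of the training data with that of newer data. The sketch below uses total variation distance for the comparison; that choice of metric and the label counts are assumptions made for illustration, not something prescribed by the course. The same function applied to a categorical feature column would give a rough signal of covariate drift.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(baseline, current):
    """Half the sum of absolute differences between two distributions.

    0 means the distributions are identical, 1 means they do not overlap.
    """
    p, q = distribution(baseline), distribution(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Made-up sentiment labels: training data vs. data seen later in production.
train_labels = ["positive"] * 50 + ["negative"] * 50
live_labels = ["positive"] * 80 + ["negative"] * 20

shift = total_variation_distance(train_labels, live_labels)
print(f"label distribution shift: {shift:.2f}")
```

Concept drift does not show up in a comparison like this, because the feature and label distributions can both stay the same while the relationship between them changes.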

1.3 Measuring statistical bias

1.4 Detecting statistical bias

1.5 Detect statistical bias with Amazon SageMaker Clarify

1.6 Approaches to statistical bias detection

SageMaker Data Wrangler:
- Connects to multiple data sources and lets you explore data in a more visual format.
- Uses only a subset of your data to detect bias.
SageMaker Clarify:
- Can analyze large volumes of data.
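
The trade-off between the two approaches can be illustrated with class imbalance (CI), one of the pre-training bias metrics that SageMaker Clarify reports, computed once on a full column and once on a random subset. This is a plain-Python sketch on made-up data, not a call into either service; the column values, sample size, and function name are assumptions.

```python
import random


def class_imbalance(facet_values, facet_a):
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1; 0 means balanced."""
    n_a = sum(1 for v in facet_values if v == facet_a)
    n_d = len(facet_values) - n_a
    return (n_a - n_d) / (n_a + n_d)


# Made-up facet column: 8,000 rows for category "A" and 2,000 for category "B".
full_column = ["A"] * 8000 + ["B"] * 2000

rng = random.Random(0)  # fixed seed so the sample is repeatable
sample_column = rng.sample(full_column, 1000)

full_ci = class_imbalance(full_column, facet_a="A")
sample_ci = class_imbalance(sample_column, facet_a="A")
print(f"full data CI = {full_ci:.3f}, 10% sample CI = {sample_ci:.3f}")
```

A sample gives a quick estimate that is usually close to the full-data value, while a run over the whole dataset removes the sampling error at the cost of more compute.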

1.7 Feature importance: SHAP

Week 3

1 Automated Machine Learning

1.1 Automated Machine Learning (AutoML)

1.2 AutoML Workflow

Ingest & Analyze

1.3 Amazon SageMaker Autopilot

1.4 Running experiments with Amazon SageMaker Autopilot

1.5 Amazon SageMaker Autopilot: evaluating output

1.6 Model Hosting

Week 4

1 Built-in algorithms

1.1 Built-in algorithms

1.2 Use cases and algorithms

1.3 Text analysis

One challenge with Word2Vec: out-of-vocabulary (OOV) words
- Its vocabulary contains only three million words. The vocabulary is the set of words that the model learned in the training phase. Out-of-vocabulary words are words that were not present in the text dataset the model was initially trained on. If a word is not found in the vocabulary, the model assigns it a zero, which effectively discards the word.
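
The effect of the out-of-vocabulary issue can be seen with a toy lookup table. The sketch below is not Word2Vec itself: the two-word vocabulary, the 3-dimensional vectors, and the averaging of word vectors into a sentence vector are all assumptions chosen to keep the example small.

```python
# Toy 3-dimensional embeddings; the words and numbers are made up.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.3],
    "product": [0.2, 0.8, 0.5],
}
DIM = 3


def embed(word):
    """Return the word's vector, or a zero vector when the word is out of vocabulary."""
    return EMBEDDINGS.get(word.lower(), [0.0] * DIM)


def embed_sentence(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    vectors = [embed(w) for w in sentence.split()]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


known = embed_sentence("great product")
with_oov = embed_sentence("great product grrreat")
print(embed("grrreat"))  # the misspelled word carries no information
```

The misspelled word `grrreat` contributes nothing to the sentence vector except pulling the average toward zero, even though a human reader would recognize it as a variant of a known word.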
