Is AWS SageMaker Studio Autopilot Ready for Prime Time?

Latest recommended article published on 2024-07-18 18:03:37

weixin_26704853

Original link: https://towardsdatascience.com/is-aws-sagemaker-studio-autopilot-ready-for-prime-time-dcbca718bae7

Over the past couple of years, I have been keeping tabs on the latest offerings in the enterprise autoML space, and I’ve seen live and remote demos of a dozen applications during this time. Keeping up to date is a challenge, because competitors frequently add features and polish their interfaces. AWS SageMaker Studio Autopilot recently became available, and I have a set of business cases I want to run through it. For this article, I am running tabular Kaggle datasets through the Autopilot feature and sharing the user experience.

First, why use autoML?

There are a variety of reasons to use autoML, including making machine learning accessible to a broader audience, convenience, ease of use, and productivity. The focus of my investigation is on how easy it is for a data scientist to set up a repeatable business process that a data analyst, data engineer, or machine learning engineer could take over. Ease of use and informative user interface features are a must.

Why AWS Autopilot?

AWS is a beast. They release products and updates at an amazing pace. If you are already in the AWS ecosystem, it makes sense to try them out.

What is also attractive about AWS is that you can use SageMaker and Autopilot on a pay-per-use basis, with no need for costly licenses. You should keep an eye on your billing, though. As I mentioned in my Quantum Computing article, this setup allows individuals to access and use these advanced tools. Some other products have cost structures that make this impossible.

The setup

All of the information below assumes you have an AWS account set up, including billing. If you are setting up your account for the first time, there are some free tier options, though for SageMaker they apply only during the first two months.

Once you have an account, you’ll need to create an S3 bucket (and folders, if you’d like) for your experiments. You will upload your data files to that S3 location.
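
For readers who prefer to script the upload rather than use the console, here is a minimal sketch. It assumes a client that behaves like `boto3.client("s3")`, whose `upload_file(Filename, Bucket, Key)` method does the transfer. The bucket name, folder prefix, and helper names are hypothetical placeholders, not anything from my actual setup.

```python
from pathlib import Path


def s3_key(prefix: str, local_path: str) -> str:
    """Build an object key such as 'disaster-tweets/train.csv' from a folder prefix."""
    name = Path(local_path).name
    folder = prefix.strip("/")
    return f"{folder}/{name}" if folder else name


def upload_training_file(s3_client, local_path: str, bucket: str, prefix: str) -> str:
    """Upload one file and return its s3:// URI.

    s3_client is assumed to behave like boto3.client("s3").
    """
    key = s3_key(prefix, local_path)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (needs boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   uri = upload_training_file(boto3.client("s3"), "train.csv",
#                              "my-autopilot-bucket", "disaster-tweets")
```

Keeping each dataset under its own prefix makes it easier to point an experiment at exactly one training file.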

The Data

To use public data that is available to everyone, I grabbed a couple of datasets from Kaggle competitions. Using competition data also gives some idea of the range of possible metric values against which you can compare your results. Note: read the details of the Kaggle competition rules. Some allow the use of autoML, and some do not! You can use the dataset but not necessarily submit the results as an entry.

I pulled the training datasets from these two Kaggle competitions and loaded them to separate folders in my S3 bucket.

Contradictory, My Dear Watson. Detecting contradiction and entailment in multilingual text using TPUs. In this Getting Started Competition, we’re classifying pairs of sentences (consisting of a premise and a hypothesis) into three categories — entailment, contradiction, or neutral.

6 Columns x 13k+ rows — Stanford NLP documentation

id
premise
hypothesis
lang_abv
language
label

Real or Not? NLP with Disaster Tweets. Predict which Tweets are about real disasters and which ones are not.

5 Columns x 7503 unique tweets

id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (may be blank)
keyword - a particular keyword from the tweet (may be blank)
target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

Kicking off the AutoPilot training and model selection

Autopilot is part of the AWS SageMaker service. AWS recently launched an interface called Studio, and it is from there that you can launch an AutoPilot experiment.

From the AWS console, navigate to the SageMaker service. There you will find the link for the Studio console.

Image for post — screenshot by the author

There you can add a user (yourself) and then open Studio when it is ready.

This link will take you to the SageMaker Studio within JupyterLab. You will see an option to build models automatically. That’s where we want to be.

You enter the information about your input file and the initial settings, and then you create an experiment. Getting to this point in the process was quite simple, and the interface was intuitive.
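
I used the Studio form, but the same settings can in principle be sent through the SageMaker `CreateAutoMLJob` API. The sketch below only assembles the request. The job name, S3 URIs, and role ARN are hypothetical placeholders, and the problem type is deliberately left unset on the assumption that Autopilot infers it from the target column.

```python
def build_autopilot_request(job_name, input_s3_uri, target_column,
                            output_s3_uri, role_arn, max_candidates=250):
    """Assemble keyword arguments for SageMaker's CreateAutoMLJob call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            # Column Autopilot should learn to predict.
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Cap on the number of candidate models (Trials) to try.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
    }


request = build_autopilot_request(
    job_name="disaster-tweets-exp",                      # placeholder
    input_s3_uri="s3://my-autopilot-bucket/disaster-tweets/",
    target_column="target",
    output_s3_uri="s3://my-autopilot-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# To submit (needs boto3, credentials, and a real role):
#   import boto3
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```

Building the request as a plain dictionary keeps the experiment settings reviewable and repeatable, which is the point of handing the process over to someone else.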

The User Experience drops off a cliff

OK, that wasn’t bad. We are on our way...

...and that’s when the UX team apparently came to the end of their sprint before the release. There is no status bar to show how far along the processing is, and no indication of what is really going on behind the scenes. You sit and wait, hoping something is happening.
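
One possible workaround for the missing status bar is to poll the job yourself. This is a sketch, not something I ran as part of this evaluation: it assumes a client that behaves like `boto3.client("sagemaker")`, whose `describe_auto_ml_job` response includes `AutoMLJobStatus` and `AutoMLJobSecondaryStatus`. The function name and polling interval are my own choices.

```python
import time

# Job statuses after which no further progress is expected.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_autopilot_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Print the job's status on each poll; return the terminal status."""
    while True:
        info = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = info["AutoMLJobStatus"]
        # The secondary status names the current stage, e.g. data analysis or tuning.
        stage = info.get("AutoMLJobSecondaryStatus", "unknown")
        print(f"{job_name}: {status} ({stage})")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage (needs boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   wait_for_autopilot_job(boto3.client("sagemaker"), "disaster-tweets-exp")
```

The `sleep` parameter is injectable only so the loop can be exercised without actually waiting.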

After the Analyze Data step, you have access to a couple of notebooks.

I opened the data exploration notebook, ready for some decent information and visualizations. Nothing in that notebook provided any real insight beyond what a simple data profile would tell me. You can review it yourself below.
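
To make "a simple data profile" concrete, here is a minimal sketch using only the Python standard library. It counts rows, blank cells per column, and the class balance of the target. The three sample rows are made up for illustration and only mimic the shape of the disaster tweets `train.csv`; the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def profile_csv(handle, target_column):
    """Return (row count, blank cells per column, target class counts)."""
    reader = csv.DictReader(handle)
    blanks = {name: 0 for name in reader.fieldnames}
    classes = Counter()
    rows = 0
    for row in reader:
        rows += 1
        for name in blanks:
            value = row.get(name)
            if value is None or value.strip() == "":
                blanks[name] += 1
        classes[row[target_column]] += 1
    return rows, blanks, dict(classes)


# Made-up sample rows, not real competition data.
sample = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Forest fire near the lake,1\n"
    "2,storm,Texas,What a lovely day,0\n"
    "3,,London,Traffic was a disaster today,0\n"
)
rows, blanks, classes = profile_csv(sample, "target")
print(rows, blanks, classes)
```

For a real file, pass `open("train.csv", newline="", encoding="utf-8")` instead of the in-memory sample.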

Eventually, the experiment progresses, and you start to see Trials on the list. I noticed the metrics improving as Trials completed and new ones ran.

While the Trials are running, you can peruse the “Amazon SageMaker Autopilot Candidate Definition Notebook.” This notebook contains the details of the model tuning process. You will need to consult this notebook to make any sense of the Trials or the output.

By reviewing this notebook, I can see that the models seem to be XGBoost and Linear Learner. The notebook also states the objective metric that will be used: “This notebook will build a BinaryClassification model that maximizes the ‘F1’ quality metric of the trained models. The ‘F1’ metric applies for binary classification with a positive and negative class. It mixes between precision and recall, and is recommended in cases where there are more negative examples compared to positive examples.”
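
For reference, F1 is the harmonic mean of precision and recall. The short sketch below computes it from raw counts; the counts in the example are invented, not results from my experiment.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute F1 = 2 * precision * recall / (precision + recall)."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 disaster tweets caught, 2 false alarms, 4 missed.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(round(example_f1, 4))
```

Because true negatives never enter the formula, a model cannot score well on F1 simply by predicting the majority negative class.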

The results

The output files can be found in your assigned S3 bucket. However, all of the files are broken down by Trial ID, so you have to know what you are looking for.

When the Trials finish (or you hit the 250 job limit, if you set that), one of the models has a small star indicating that it was the most accurate of the Trials.
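
If the star is hard to spot, the ranking can also be fetched programmatically. This sketch assumes a client that behaves like `boto3.client("sagemaker")` and its `list_candidates_for_auto_ml_job` call; the helper name and job name are hypothetical, and I did not use this during the evaluation.

```python
def top_candidates(sm_client, job_name, limit=5):
    """Return (candidate name, metric name, metric value) tuples, best first."""
    response = sm_client.list_candidates_for_auto_ml_job(
        AutoMLJobName=job_name,
        SortBy="FinalObjectiveMetricValue",
        SortOrder="Descending",
        MaxResults=limit,
    )
    results = []
    for candidate in response["Candidates"]:
        # Candidates that did not finish may lack a final metric.
        metric = candidate.get("FinalAutoMLJobObjectiveMetric", {})
        results.append(
            (candidate["CandidateName"], metric.get("MetricName"), metric.get("Value"))
        )
    return results


# Usage (needs boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   for name, metric, value in top_candidates(boto3.client("sagemaker"),
#                                             "disaster-tweets-exp"):
#       print(name, metric, value)
```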

Honestly, at this point, there wasn’t anything intuitive about the interface at all. I did see a Deploy Model button, presumably for deploying the model as an API via SageMaker.

Conclusion

The lack of an intuitive UI that allows easy access to the model selection and accuracy metrics puts this tool outside the scope of my business use case. AutoPilot takes too much control. You can fine-tune on your own, but the battle to find what you want doesn’t seem worth it.

I will most likely wait about a year and monitor the features that AWS adds to AutoPilot. Other tools such as DataRobot and H2O.ai DriverlessAI are far ahead in the areas of usability and visualizations.

Sneak Peek at Next Steps

I’ve evaluated H2O.ai Driverless AI in the past and will do a new test to check out their latest features. I want to check out GoogleML as well. I have substantial experience with DataRobot, so I don’t need a separate evaluation.

Here are some screenshots of those other UIs:

Translated from: https://towardsdatascience.com/is-aws-sagemaker-studio-autopilot-ready-for-prime-time-dcbca718bae7
