DataRobot Makes Life Easy

Latest recommended article published on 2024-09-10 12:36:53

weixin_26756255

Tags: artificial intelligence, python, java, big data, machine learning

Original article: https://towardsdatascience.com/datarobot-makes-life-easy-8505637241e5

I’m diverging from the previous articles in the series. I’m going to review two tools that are head and shoulders above the others. The design and beautiful visualizations do not come cheap. That doesn’t mean we can’t admire them and use them as a bar to strive toward. I will start with DataRobot. It’s an enterprise tool that you may have access to through work or school.

Why DataRobot?

I have experience using this tool and love it for the business cases for which I use it. My business case is to have a straightforward interface for a non-data scientist to run and deploy models in an automated way. DataRobot adds new features on a regular cadence, each built nicely within the existing user experience. I could go on about the benefits, but I will control my inner fan-girl.

To keep things even with the other tools, I will focus on the most basic tasks to run a simple .csv file with autoML without any manual interventions or hyper-parameter tuning.

The setup and cost

Straight up, DataRobot is outside the budget range of the individual data scientist. The implementation and cost are definitely in the realm of businesses. AWS Marketplace offers a one-year subscription for $98,000. Pocket change, I’m sure. But if you use AWS GovCloud, it is $9.33/hr (the rate varies). Interesting.

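A quick back-of-the-envelope check shows why the hourly price is interesting. The sketch below uses only the two prices quoted above and assumes, perhaps wrongly, that the two offerings are comparable and that the hourly rate stays fixed; it ignores any underlying infrastructure charges. Under those assumptions the break-even point is roughly 10,500 hours, which is more than the 8,760 hours in a year.

```python
# Prices quoted in the article; the hourly rate "varies", so treat it as a snapshot.
ANNUAL_SUBSCRIPTION_USD = 98_000.00
HOURLY_RATE_USD = 9.33
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def hourly_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost of running at the hourly rate for a given number of hours."""
    return hours * rate


# Number of hours at which the hourly price would equal the annual subscription.
break_even_hours = ANNUAL_SUBSCRIPTION_USD / HOURLY_RATE_USD

# Cost of leaving the hourly offering running around the clock for a year.
always_on_cost = hourly_cost(HOURS_PER_YEAR)

print(f"Break-even: {break_even_hours:,.0f} hours")
print(f"Hours in a year: {HOURS_PER_YEAR:,}")
print(f"Always-on for a year at the hourly rate: ${always_on_cost:,.2f}")
```

If the assumptions hold, even an always-on year at the hourly rate comes in below the subscription price, so the comparison is worth checking against a current quote.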

The Data

To keep parity across the tools in this series, I will stick to the Kaggle training file from "Contradictory, My Dear Watson," a competition about detecting contradiction and entailment in multilingual text using TPUs. In this Getting Started Competition, we’re classifying pairs of sentences (consisting of a premise and a hypothesis) into three categories: entailment, contradiction, or neutral.

6 columns x 13k+ rows (Stanford NLP documentation)

id
premise
hypothesis
lang_abv
language
label
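
Before uploading a file like this to any autoML tool, it can help to confirm the header and look at the class balance. The following is a minimal sketch using only the Python standard library. The file name `train.csv` in the comment is an assumption, and the two sample rows (including their label values) are made up so the snippet runs without the real data.

```python
import csv
import io
from collections import Counter

# The six columns listed above.
EXPECTED_COLUMNS = ["id", "premise", "hypothesis", "lang_abv", "language", "label"]


def summarize(csv_file):
    """Check the header, then count rows per label and per language."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    labels, languages = Counter(), Counter()
    for row in reader:
        labels[row["label"]] += 1
        languages[row["language"]] += 1
    return labels, languages


# Two invented rows standing in for the real file. To inspect the Kaggle data,
# pass open("train.csv", newline="", encoding="utf-8") instead (assumed file name).
sample = io.StringIO(
    "id,premise,hypothesis,lang_abv,language,label\n"
    "a1,The cat sat on the mat.,An animal is sitting.,en,English,0\n"
    "a2,It rained all day.,The weather was dry.,en,English,2\n"
)
labels, languages = summarize(sample)
print(labels)
print(languages)
```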

Loading the data

You create a project by uploading a dataset. This interface is where you begin.

Image for post — screenshot by the author

After the data is loaded, you can change data types or remove features. Some data distribution statistics are also shown. A bonus is that DataRobot warns you about possible data leakage. If data leakage is detected, DataRobot removes that feature from the final training dataset.

Training your model

Once you choose your target, you hit the big Start button with Modeling Mode set to AutoPilot. You will then see progress on the right side. Models appear on the leaderboard as they finish training.

One good thing about having access to the early model results is that you can review them for significant issues. Many times a data issue becomes glaringly apparent in the Insights, and I can halt the process and try again. This quick and easy review helps with rapid iteration.

Evaluate Training Results

The leaderboard begins to fill with the completed models. You can choose several valid metrics in the dropdown. There are also some helpful tags to let you know WHY the leaders are up at the top.

You can compare the models against each other.

One tab I use often is speed versus accuracy. When you are scoring millions of records, speed can trump accuracy if the accuracy drop is minor.

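The speed-versus-accuracy trade-off can be written down as a simple rule: take the fastest model whose accuracy is within a tolerance of the best one. The sketch below is only an illustration of that rule, not a DataRobot feature or API. The `Candidate` class, the `pick_model` helper, the 0.01 tolerance, and every number on the leaderboard are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float        # validation accuracy, between 0 and 1
    ms_per_1k_rows: float  # scoring time per 1,000 rows, in milliseconds


def pick_model(candidates, max_accuracy_drop=0.01):
    """Return the fastest model whose accuracy is within the allowed drop of the best."""
    best_accuracy = max(c.accuracy for c in candidates)
    close_enough = [
        c for c in candidates if best_accuracy - c.accuracy <= max_accuracy_drop
    ]
    return min(close_enough, key=lambda c: c.ms_per_1k_rows)


# Hypothetical leaderboard; these numbers are invented for illustration.
leaderboard = [
    Candidate("blender", 0.652, 900.0),
    Candidate("gradient boosting", 0.648, 120.0),
    Candidate("logistic regression", 0.601, 15.0),
]

chosen = pick_model(leaderboard)
rows = 5_000_000
seconds = chosen.ms_per_1k_rows * (rows / 1000) / 1000
print(f"{chosen.name}: about {seconds:.0f} s to score {rows:,} rows")
```

How much accuracy you are willing to give up is a business decision, so the tolerance is the one parameter worth arguing about.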

The Insights tab is handy. You can quickly see if one of your features is standing out. It’s up to your business expertise to know whether that’s appropriate. This tab is where I find data issues early in the autoML model training. If I see something that doesn’t seem correct, I can iterate right away instead of waiting for the entire process to finish.

DataRobot model explainability is the best of the tools I have reviewed so far. For each prediction, it shows which features influenced the final score, indicating not only strength but also direction.

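To make "strength and direction" concrete, here is a minimal sketch for the simplest possible case, a linear model. The article does not describe how DataRobot computes its explanations, so this is not its method. The `explain_prediction` helper and every feature name, weight, and value are invented for illustration.

```python
def explain_prediction(weights, baseline, row, top_n=3):
    """Rank features by signed contribution for one row of a linear model.

    contribution = weight * (value - baseline value). The sign is the
    direction of the push and the absolute value is its strength.
    """
    contributions = {
        name: weights[name] * (row[name] - baseline[name]) for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical model and record; every name and number here is invented.
weights = {"feature_a": -0.04, "feature_b": 0.30, "feature_c": 0.01}
baseline = {"feature_a": 24.0, "feature_b": 1.0, "feature_c": 50.0}
row = {"feature_a": 6.0, "feature_b": 4.0, "feature_c": 55.0}

explanation = explain_prediction(weights, baseline, row)
for name, value in explanation:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {direction} the score by {abs(value):.2f}")
```

Note that a negative weight can still raise the score when the value sits below its baseline, which is why direction has to be reported per prediction rather than per feature.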

Not to be underestimated, documentation can be a real drain on your time. For this simple dataset, DataRobot generates a 7000+ word document with all of the charts, model parameters, and challenger model details. This documentation is a unique feature that I haven’t found in any other tool, though I have requested it whenever I was asked for feedback. All done with a single click.

Conclusions

To loosely compare results between tools, I reran the dataset in classification mode. The metrics are just slightly higher than Azure’s. For the most part, the model results are similar.

For my business case, this is the top of the pile so far. A head-to-head comparison in image processing or time series may produce different results. That would be a challenge for another series.

The ease of use, visualizations, access to challenger model details, model explainability, and the automated documentation stand out from the others. Of course, you are paying dearly for this.

Next, I will show you H2O.ai Driverless AI. In my opinion, it is the closest comparison to DataRobot at this time. H2O.ai has gone to great lengths to get top data visualization designers on the project, so I’m expecting great things.

If you missed one of the articles in the series, I have posted them below.

Translated from: https://towardsdatascience.com/datarobot-makes-life-easy-8505637241e5
