GitHub是您需要的最好的AutoML

最新推荐文章于 2022-09-21 09:22:57 发布

weixin_26752765

最新推荐文章于 2022-09-21 09:22:57 发布

阅读量384

点赞数

文章标签： python java

原文链接：https://towardsdatascience.com/github-is-the-best-automl-you-will-ever-need-5331f671f105

版权

You may be wondering since when did GitHub get into the business of Automated Machine Learning. Well, it didn’t but you can use it for testing your personalized AutoML software. In this tutorial, we will show you how to build and containerize your own Automated Machine Learning software and test it on GitHub using Docker container.

您可能想知道GitHub从何时开始涉足自动机器学习业务。好的，不是，但是您可以使用它来测试您的个性化AutoML软件。在本教程中，我们将向您展示如何构建和容器化自己的自动机器学习软件，以及如何使用Docker容器在GitHub上对其进行测试。

We will use PyCaret 2.0, an open source, low-code machine learning library in Python to develop a simple AutoML solution and deploy it as a Docker container using GitHub actions. If you haven’t heard about PyCaret before, you can read official announcement for PyCaret 2.0 here or check the detailed release notes here.

我们将使用PyCaret 2.0(一种Python中的开源，低代码机器学习库)来开发简单的AutoML解决方案，并使用GitHub动作将其部署为Docker容器。如果你还没有听说过PyCaret之前，你可以阅读官方公布的PyCaret 2.0 在这里或检查的详细版本说明这里。

Tu本教程的学习目标 (👉 Learning Goals of this Tutorial)

Understanding what Automated Machine Learning is and how to build a simple AutoML software using PyCaret 2.0.
了解什么是自动机器学习以及如何使用PyCaret 2.0构建简单的AutoML软件。
Understand what is a container and how to deploy your AutoML solution as a Docker container.
了解什么是容器以及如何将AutoML解决方案部署为Docker容器。
What are GitHub actions and how can you use them to test your AutoML.
什么是GitHub动作，以及如何使用它们来测试AutoML。

什么是自动化机器学习？ (What is Automated Machine Learning?)

Automated machine learning (AutoML) is a process of automating the time consuming, iterative tasks of machine learning. It allows data scientists and analysts to build machine learning models with efficiency while sustaining the model quality. The final goal of any AutoML software is to finalize the best model based on some performance criteria.

自动化机器学习(AutoML)是使机器学习的耗时迭代任务自动化的过程。它使数据科学家和分析人员可以在保持模型质量的同时高效地构建机器学习模型。任何AutoML软件的最终目标都是根据某些性能标准最终确定最佳模型。

Traditional machine learning model development process is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models. With automated machine learning, you’ll accelerate the time it takes to develop production-ready ML models with great ease and efficiency.

传统的机器学习模型开发过程需要大量资源，需要大量的领域知识和时间来生成和比较数十种模型。借助自动化的机器学习，您可以轻松，高效地加快开发可投入生产的ML模型所需的时间。

There are many AutoML software out there, paid as well as open source. Almost all of them use the same collection of transformations and base algorithms. Hence the quality and performances of the models trained under such software are approximately the same.

有许多AutoML软件，既有付费软件也有开源软件。几乎所有人都使用相同的转换和基本算法集合。因此，在这种软件下训练的模型的质量和性能大致相同。

Paid AutoML software as a service are very expensive and financially infeasible if you does not have dozens of use-cases in your back pocket. Managed machine learning as a service platforms are relatively less expensive, but they are often hard to use and require knowledge of the specific platform.

如果您的后兜里没有几十个用例，则付费的AutoML软件即服务非常昂贵且在财务上不可行。托管的机器学习即服务平台相对便宜一些，但是它们通常难以使用并且需要特定平台的知识。

Among many other open source AutoML libraries, PyCaret is relatively a new library and has a unique low-code approach to machine learning. The design and functionality of PyCaret is simple, human friendly, and intuitive. In short amount of time, PyCaret was adopted by over 100,000 data scientists globally and we are a growing community of developers.

在许多其他开源AutoML库中，PyCaret相对来说是一个新库，并且具有独特的低代码机器学习方法。 PyCaret的设计和功能简单，友好且直观。在短时间内，PyCaret被全球超过100,000名数据科学家采用，我们是一个不断发展的开发人员社区。

PyCaret如何运作？ (How Does PyCaret works?)

PyCaret is a workflow automation tool for supervised and unsupervised machine learning. It is organized into six modules and each module has a set of functions available to perform some specific action. Each function takes an input and returns an output, which in most cases is a trained machine learning model. Modules available as of the second release are:

PyCaret是用于有监督和无监督机器学习的工作流自动化工具。它分为六个模块，每个模块都有一组可用于执行某些特定操作的功能。每个函数接受一个输入并返回一个输出，在大多数情况下，这是一个训练有素的机器学习模型。从第二版开始可用的模块是：

All modules in PyCaret supports data preparation (over 25+ essential preprocessing techniques, comes with a huge collection of untrained models & support for custom models, automatic hyperparameter tuning, model analysis and interpretability, automatic model selection, experiment logging and easy cloud deployment options.

PyCaret中的所有模块均支持数据准备(超过25种以上的基本预处理技术，大量未经训练的模型以及对自定义模型的支持，自动超参数调整，模型分析和可解释性，自动模型选择，实验记录以及易于进行的云部署选项。

Image for post — https://www.pycaret.org/guide https://www.pycaret.org/guide

To learn more about PyCaret, click here to read our official release announcement.

要了解有关PyCaret的更多信息，请单击此处阅读我们的正式发布公告。

If you want to get started in Python, click here to see a gallery of example notebooks to get started.

如果要使用Python入门，请单击此处查看示例笔记本的画廊。

start开始之前 (👉 Before we start)

Let’s understand the following terminologies before starting to build an AutoML software. At this point all you need is some basic theoretical knowledge of these tools / terms that we are using in this tutorial. If you would like to go in more details, there are links at the end of this tutorial for you to explore later.

在开始构建AutoML软件之前，让我们了解以下术语。至此，您所需要的只是我们在本教程中使用的这些工具/术语的一些基本理论知识。如果您想了解更多详细信息，请在本教程末尾找到一些链接，以供日后浏览。

容器 (Container)

Containers provide a portable and consistent environment that can be deployed rapidly in different environments to maximize the accuracy, performance, and efficiency of machine learning applications. Environment contains run-time language (for e.g. Python), all the libraries, and the dependencies of your application.

容器提供了一个可移植且一致的环境，可以在不同的环境中快速部署该环境，以最大程度地提高机器学习应用程序的准确性，性能和效率。环境包含运行时语言(例如Python)，所有库以及应用程序的依赖项。

码头工人 (Docker)

Docker is a company that provides software (also called Docker) that allows users to build, run, and manage containers. While Docker’s container are the most common, there are other less famous alternatives such as LXD and LXC that also provide container solution.

Docker是一家提供软件(也称为Docker)的公司，该软件允许用户构建，运行和管理容器。尽管Docker的容器是最常见的，但还有其他不太知名的替代方案，例如LXD和LXC也提供了容器解决方案。

的GitHub (GitHub)

GitHub is a cloud-based service that is used to host, manage and control code. Imagine you are working in a large team where multiple people (sometime in hundreds) are making changes on the same code base. PyCaret is itself an example of an open-source project where hundreds of community developers are continuously contributing to source code. If you haven’t used GitHub before, you can sign up for a free account.

GitHub是一项基于云的服务，用于托管，管理和控制代码。假设您在一个大型团队中工作，其中多个人(有时成百上千)正在同一代码库上进行更改。 PyCaret本身就是一个开源项目的示例，数百个社区开发人员正在不断为源代码做出贡献。如果您以前从未使用过GitHub，则可以注册一个免费帐户。

GitHub动作 (GitHub Actions)

GitHub Actions help you automate your software development workflows in the same place you store code and collaborate on pull requests and issues. You can write individual tasks, called actions, and combine them to create a custom workflow. Workflows are custom automated processes that you can set up in your repository to build, test, package, release, or deploy any code project on GitHub.

GitHub Actions可帮助您在存储代码的同一位置自动执行软件开发工作流程，并就拉取请求和问题进行协作。您可以编写单独的任务(称为操作)，并将其组合以创建自定义工作流程。工作流是自定义的自动化流程，您可以在存储库中对其进行设置，以在GitHub上构建，测试，打包，发布或部署任何代码项目。

👉开始吧 (👉 Let’s get started)

目的 (Objective)

To train and select the best performing regression model that predicts patient charges based on the other variables in the dataset i.e. age, sex, bmi, children, smoker, and region.

训练并选择性能最佳的回归模型，该模型基于数据集中的其他变量(例如年龄，性别，bmi，儿童，吸烟者和区域)来预测患者费用。

👉 步骤1 —开发app.py (👉 Step 1 — Develop app.py)

This is the main file for AutoML, which is also an entry point for Dockerfile (see below in step 2). If you have used PyCaret before then this code must be self-explanatory to you.

这是AutoML的主文件，也是Dockerfile的入口点(请参见下面的第2步)。如果您以前使用过PyCaret，则此代码对您必须是不言自明的。

https://github.com/pycaret/pycaret-git-actions/blob/master/app.py https://github.com/pycaret/pycaret-git-actions/blob/master/app.py

First five lines are about importing libraries and variables from the environment. Next three lines are for reading data as pandas dataframe. Line 12 to Line 15 is to import the relevant module based on environment variable and Line 17 onwards is about PyCaret’s functions to initialize the environment, compare base models and to save the best performing model on your device. The last line downloads the experiment logs as a csv file.

前五行是关于从环境中导入库和变量的。接下来的三行用于读取作为pandas数据框的数据。第12行到第15行是根据环境变量导入相关模块，而第17行起是关于PyCaret的功能，用于初始化环境，比较基本模型并将最佳性能模型保存在设备上。最后一行将实验日志下载为csv文件。

👉步骤2 —创建Dockerfile (👉 Step 2— Create Dockerfile)

Dockerfile is just a file with a few lines of instructions that are saved in your project folder with name “Dockerfile” (case-sensitive and no extension).

Dockerfile只是带有几行指令的文件，该文件以“ Dockerfile”的名称(区分大小写，没有扩展名)保存在您的项目文件夹中。

Another way to think about a Docker file is that it is like a recipe you have invented in your own kitchen. When you share such recipe with others and if they follow the exact same instructions in the recipe, they will able to reproduce the same dish with same quality. Similarly, you can share your docker file with others, who can then create images and run containers based on that docker file.

考虑Docker文件的另一种方法是，它就像您在自己的厨房中发明的食谱一样。当您与他人共享这种食谱时，如果他们遵循食谱中的完全相同的说明，则他们将能够以相同的质量复制相同的菜肴。同样，您可以与其他人共享docker文件，然后其他人可以创建映像并基于该docker文件运行容器。

This Docker file for this project is simple and consist of 6 lines only. See below:

此项目的Docker文件很简单，仅包含6行。见下文：

https://github.com/pycaret/pycaret-git-actions/blob/master/Dockerfile https://github.com/pycaret/pycaret-git-actions/blob/master/Dockerfile

The first line in the Dockerfile imports the python:3.7-slim image. Next four lines create an app folder, update libgomp1 library, and install all the requirements from the requirements.txt file which in this case only requires pycaret. Finally, the last two lines define the entry point of your application; this means that when the container starts, it will execute the app.py file that we earlier saw above in step 1.

Dockerfile中的第一行将导入python：3.7-slim映像。接下来的四行创建一个应用程序文件夹，更新libgomp1库，并从requirements.txt文件安装所有需求，在这种情况下，该文件仅需要pycaret。最后，最后两行定义了应用程序的入口点；这意味着当容器启动时，它将执行我们前面在步骤1中看到的app.py文件。

👉步骤3 —创建action.yml (👉 Step 3 — Create action.yml)

Docker actions require a metadata file. The metadata filename must be either action.yml or action.yaml. The data in the metadata file defines the inputs, outputs and main entrypoint for your action. Action file uses YAML syntax.

Docker操作需要元数据文件。元数据文件名必须是action.yml或action.yaml 。元数据文件中的数据定义了操作的输入，输出和主要入口点。操作文件使用YAML语法。

https://github.com/pycaret/pycaret-git-actions/blob/master/action.yml https://github.com/pycaret/pycaret-git-actions/blob/master/action.yml

Environment variable dataset, target, and usecase are all declared in line 6, 9, and 14 respectively. See line 4–6 of app.py to understand how we have used these environment variables in our app.py file.

环境变量数据集，目标和用例分别在第6、9和14行中声明。请参阅app.py的第4-6行以了解我们如何在app.py文件中使用这些环境变量。

👉步骤4 —在GitHub上发布操作 (👉 Step 4 — Publish action on GitHub)

At this point your project folder should look like this:

此时，您的项目文件夹应如下所示：

Click on ‘Releases’:

点击“发布” ：

Draft a new release:

起草新版本：

Fill in the details (tag, release title and description) and click on ‘Publish release’:

填写详细信息(标签，发行标题和描述)，然后单击“发布发行” ：

Once published click on release and then click on ‘Marketplace’:

发布后，单击发布，然后单击“市场” ：

Click on ‘Use latest version’:

点击“使用最新版本” ：

Save this information, this is the installation details of your software. This is what you will need to install this software on any public GitHub repository:

保存此信息，这是软件的安装详细信息。这是您需要在任何公共GitHub存储库上安装此软件的内容：

👉步骤5 —在GitHub存储库上安装软件 (👉 Step 5— Install software on GitHub repository)

To install and test the software we just created, we have created a new repository pycaret-automl-test and uploaded a few sample datasets for classification and regression.

为了安装和测试我们刚刚创建的软件，我们创建了一个新的存储库pycaret-automl-test 并上传了一些样本数据集以进行分类和回归。

To install the software that we published in the previous step, click on ‘Actions’:

要安装上一步中发布的软件，请单击“ 操作 ”：

Click on ‘set up a workflow yourself’ and copy this script into the editor and click on ‘Start commit’.

单击“ 自行设置工作流程 ”，然后将此脚本复制到编辑器中，然后单击“开始提交” 。

This is an instruction file for GitHub to execute. First action starts from line 9. Line 9 to 15 is an action to install and execute the software we previously developed. Line 11 is where we have referenced the name of the software (see the last part of step 4 above). Line 13 to 15 is action to define environment variables such as the name of the dataset (csv file must be uploaded on the repository), name of the target variable, and usecase type. Line 16 onwards is another action from this repository to upload three files model.pkl, experiment logs as csv file, and system logs as a .log file.

这是GitHub执行的指令文件。第一个动作从第9行开始。第9到15行是安装和执行我们先前开发的软件的动作。第11行是我们引用软件名称的位置(请参见上面第4步的最后一部分)。第13至15行是定义环境变量的操作，例如数据集的名称(必须将csv文件上传到存储库中)，目标变量的名称以及用例类型。从第16行开始，这是该存储库中的另一项操作，用于上传三个文件model.pkl，实验日志作为csv文件以及系统日志作为.log文件。

Once you start commit, click on ‘actions’:

开始提交后，点击“操作” ：

This is where you can monitor the logs of your build as its building and once the workflow is completed, you can collect your files from this location as well.

在这里您可以监视构建日志以及构建日志，一旦工作流完成，您也可以从该位置收集文件。

You can download the files and unzip it on your device.

您可以下载文件并将其解压缩到设备上。

文件：型号 (File: model)

This is a .pkl file for the final model along with the entire transformation pipeline. You can use this file to generate predictions on new dataset using predict_model function. To learn more about it, click here.

这是最终模型以及整个转换管道的.pkl文件。您可以使用此文件通过predict_model函数对新数据集生成预测。要了解更多信息，请单击此处。

文件：实验日志 (File: experiment-logs)

This is a .csv file that has all the details you will ever need for your model. It contains all the models that were trained in app.py script, their performance metrics, hyperparameters and other important meta data.

这是一个.csv文件，其中包含模型所需的所有详细信息。它包含在app.py脚本中经过训练的所有模型，它们的性能指标，超参数和其他重要的元数据。

文件：系统日志 (File: system-logs)

This is a system logs file that PyCaret has generated. This can be used for auditing the process. It contains important meta deta information and is very useful for troubleshooting errors in your software.

这是PyCaret生成的系统日志文件。这可用于审核过程。它包含重要的元数据信息，对于解决软件中的错误非常有用。

揭露 (Disclosure)

GitHub Actions enables you to create custom software development lifecycle workflows directly in your GitHub repository. Each Account comes with included compute and storage quantities for use with Actions, depending on your Account plan, which can be found in the Actions documentation.

GitHub Actions使您可以直接在GitHub存储库中创建自定义软件开发生命周期工作流。每个帐户都随附有用于Actions的计算和存储数量，具体取决于您的帐户计划，可在Actions文档中找到。

Actions and any elements of the Action service may not be used in violation of the Agreement, the Acceptable Use Policy, or the GitHub Actions service limitations. Additionally, Actions should not be used for:

不得违反协议，可接受使用政策或GitHub Actions 服务限制使用 Actions和Action服务的任何元素。此外，操作不应用于：

cryptomining;
加密矿;
serverless computing;
无服务器计算；
using our servers to disrupt, or to gain or to attempt to gain unauthorized access to, any service, device, data, account or network (other than those authorized by the GitHub Bug Bounty program)
使用我们的服务器来破坏，获取或试图获得对任何服务，设备，数据，帐户或网络的未经授权的访问(而不是由GitHub Bug Bounty程序授权的服务)
the provision of a stand-alone or integrated application or service offering Actions or any elements of Actions for commercial purposes; or,
提供用于商业目的的动作或动作的任何元素的独立或集成的应用程序或服务；要么，
any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.
与使用GitHub操作的存储库相关联的软件项目的生产，测试，部署或发布无关的任何其他活动。

In order to prevent violations of these limitations and abuse of GitHub Actions, GitHub may monitor your use of GitHub Actions. Misuse of GitHub Actions may result in termination of jobs, or restrictions in your ability to use GitHub Actions.

为了防止违反这些限制和滥用GitHub Action，GitHub可能会监视您对GitHub Actions的使用。滥用GitHub动作可能会导致工作终止或使用GitHub动作的能力受到限制。

本教程中使用的存储库： (Repositories used in this tutorial:)

There is no limit to what you can achieve using this light-weight workflow automation library in Python. If you find this useful, please do not forget to give us ⭐️ on our github repo.

使用Python中的这种轻量级工作流自动化库可以实现的功能没有限制。如果您觉得这有用，请不要忘记在我们的github存储库中给我们⭐️。

To hear more about PyCaret follow us on LinkedIn and Youtube.

要了解有关PyCaret的更多信息，请在LinkedIn和Youtube上关注我们。

If you would like to learn more about PyCaret 2.0, read this announcement. If you have used PyCaret before, you might be interested in release notes for current release.

如果您想了解有关PyCaret 2.0的更多信息，请阅读此声明。如果您以前使用过PyCaret，则可能对当前发行版的发行说明感兴趣。