Creating notebook pipelines using Elyra and Kubeflow Pipelines

As a data scientist you are likely using Jupyter notebooks extensively to perform machine learning workflow tasks, such as data exploration, data processing, and model training, evaluation and tuning. Many of those tasks are performed continuously, requiring you to run the notebooks again and again.

The Elyra (https://github.com/elyra-ai/elyra) JupyterLab extension recently introduced the notebook pipelines visual editor, which you can use to create and run pipelines without any coding.

In version 1.0 Elyra utilizes Kubeflow Pipelines, a popular platform for building and deploying machine learning workflows on Kubernetes, to run the pipelines.

Elyra’s notebook pipelines feature is extensible; support for other workflow engines, such as Apache Airflow, might be added in the future given enough demand from the community.

Running notebooks as a pipeline

Using Elyra’s visual editor you assemble a pipeline from multiple notebooks and run the pipeline on Kubeflow Pipelines in any Kubernetes environment.

To assemble a pipeline, you drag notebooks from the JupyterLab file browser onto the pipeline editor canvas and connect them as desired. Comment nodes can be added to provide lightweight documentation.

Creating a notebook pipeline using Elyra’s pipeline editor. The pipeline downloads a data set, cleanses the data, analyzes the data, and runs predictions.

Notebooks can be arranged to execute sequentially or in parallel. The pipeline depicted above from https://github.com/elyra-ai/examples comprises four notebook nodes. The “load_data” node is executed first and has three downstream nodes. The “Part 1 - Data Cleaning” node is executed after processing of its upstream node “load_data” has successfully completed, and the “Part 2 - …” and “Part 3 - …” nodes are executed in parallel after processing of their upstream node “Part 1 - …” has finished.

The following video by Romeo Kienzler illustrates how to analyze COVID-19 time-series data using a notebook pipeline.

Analyzing COVID-19 time-series data using notebook pipelines

Configuring notebook nodes

A notebook node is implemented in Elyra as a Kubeflow Pipelines component (source repository) that uses papermill to run the notebook in a Docker container. The Docker containers do not share resources aside from an S3-compatible cloud storage bucket, which is used to store input and output artifacts.

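Conceptually, the core of that component boils down to a papermill call like the following. This is a minimal sketch, assuming papermill is installed; the file names are hypothetical, and the real component adds scaffolding to download the input archive from, and upload results to, the cloud storage bucket.

import papermill as pm

# Execute the notebook and write the completed copy next to it;
# the actual component fetches the input notebook from cloud storage first
pm.execute_notebook(
    "load_data.ipynb",         # input notebook (hypothetical name)
    "load_data-output.ipynb",  # executed notebook, later uploaded to the bucket
    kernel_name="python3",
)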

You configure a notebook node in the Elyra pipeline editor by providing the following information:

  • The name and location of the Jupyter notebook.

  • The name of the Docker image that will be used to run the notebook. You can bring your own image or choose from predefined public images, such as a Pandas image, a TensorFlow image (with or without GPU support), or a Pytorch image with the CUDA libraries pre-installed. Any Docker image can be used to run a notebook, as long as curl and Python 3 are preinstalled to allow for automatic scaffolding when the notebook component executes.

  • If your notebook requires access to files that are co-located with the notebook when you assemble a pipeline, such as a set of custom Python scripts, you can declare file dependencies that will be packaged together with the notebook in a compressed archive that is uploaded to the pre-configured cloud storage bucket.

  • You can define environment variables to parametrize the notebook runtime environment, as shown in the sketch after this list.

  • If your notebook produces output files that downstream notebooks consume or that you want to access after the notebook has been processed, specify their names. The output files are uploaded to cloud storage after notebook processing has completed.

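Inside the notebook itself no special APIs are needed: environment variables are read the usual way, and output files are simply written to the declared locations. A minimal sketch for a node like “load_data”, assuming the DATASET_URL variable from the sample configuration below and a hypothetical output path:

import os
import pandas as pd

# DATASET_URL is injected into the container by the pipeline runtime
dataset_url = os.environ["DATASET_URL"]
df = pd.read_csv(dataset_url)

# Write the file declared as an output in the node configuration;
# Elyra uploads it to cloud storage so downstream notebooks can consume it
df.to_csv("data/jfk_weather_cleaned.csv", index=False)  # hypothetical path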

Sample configuration for the “load_data” notebook. The notebook runs in a Pandas Docker image, requires the DATASET_URL environment variable and produces a CSV output file in the specified location.

Before you can run a notebook pipeline you have to configure a Kubeflow Pipelines runtime.

Configuring a Kubeflow Pipelines runtime

A runtime configuration defines connectivity information for the Kubeflow Pipelines service instance and S3-compatible cloud storage that Elyra uses to process notebook pipelines:

  • The Kubeflow Pipelines API endpoint, e.g. http://kubernetes-service.ibm.com/pipeline. Note that at the time of writing, Elyra (version 1.0.0) does not yet support dex authentication.

  • The cloud storage endpoint, e.g. http://minio-service.kubeflow:9000.

  • The cloud storage user id, password, and bucket name.

You can manage runtime configurations using the JupyterLab GUI or the elyra-metadata CLI.

Managing runtime configurations using the GUI

Elyra adds the Runtimes panel to the sidebar in JupyterLab, which you use to add, edit, and remove a configuration.

Access pipeline runtime configurations from the pipeline editor or the sidebar

Managing runtime configurations using the CLI

The elyra-metadata CLI can also be used to manage runtime configurations, for example to automate administrative tasks. The examples below illustrate how to add a runtime configuration named kfp_dev_instance, how to list runtime configurations, and how to delete them.

$ elyra-metadata install runtimes --schema_name=kfp \
  --name=kfp_dev_instance \
  --display_name="KFP dev instance" \
  --api_endpoint=http://.../pipeline \
  --cos_endpoint=http://... \
  --cos_username=... \
  --cos_password=... \
  --cos_bucket=...

$ elyra-metadata list runtimes
Available metadata instances for runtimes (includes invalid):

Schema   Instance           Resource
------   --------           --------
kfp      kfp_dev_instance   /Users/.../kfp_dev_instance.json

$ elyra-metadata list runtimes --json
[
  {
    "name": "kfp_dev_instance",
    "display_name": "KFP dev instance",
    "metadata": {
      "api_endpoint": "http://.../pipeline",
      "cos_endpoint": "http://...",
      "description": "...",
      "cos_username": "...",
      "cos_password": "...",
      "cos_bucket": "..."
    },
    "schema_name": "kfp",
    "resource": "/Users/.../kfp_dev_instance.json"
  }
]

$ elyra-metadata remove runtimes --name=kfp_dev_instance
Metadata instance '...' removed from namespace 'runtimes'.

The list command optionally produces raw JSON output, enabling you to process the results using a command-line JSON processor like jq.

Running a notebook pipeline

To run a pipeline from the pipeline editor, click the ▷ (run) button and select the desired runtime configuration.

Elyra generates, gathers, and packages the required artifacts, uploads them to cloud storage and triggers pipeline execution in the selected Kubeflow Pipelines environment.

Monitoring a notebook pipeline run

Elyra currently does not provide pipeline run monitoring capabilities, but you can use the Kubeflow Pipelines UI to check the run status and view the log files. You can access the UI and the pipeline output artifacts from the Runtimes panel by expanding the appropriate runtime configuration entry.

The Runtimes panel in Elyra provides access to the pipeline run history and the cloud storage instance where the run results are stored.

Pipeline runs are listed in the Kubeflow Pipelines UI in the Experiments panel.

Notebook pipeline runs can be accessed in the Experiments panel of the Kubeflow Pipelines UI

The Graph panel in the experiment details page displays the execution status of each (notebook) node.

The graph only displays nodes that have already been processed or are currently running.

Accessing the log file for the load-data node in the Kubeflow Pipelines UI

You can access a node’s execution log by selecting the node and opening the Logs panel.

Generally speaking, each node in your notebook pipeline is processed by the Notebook component in Kubeflow Pipelines as follows:

  • Install the prerequisite Python packages.

  • Download the input artifacts — the notebook and its dependencies — from cloud storage. (The input artifacts are stored in a compressed archive.)

  • Run the notebook using papermill and store the completed notebook as a Jupyter notebook file and an HTML file

  • Upload completed notebook files to the configured cloud storage bucket

  • Upload configured output files to the configured cloud storage bucket

Note that if processing of a node fails, its downstream nodes are not executed.

Accessing the notebook pipeline output artifacts

You can access the pipeline’s output artifacts using any supported S3 client. As shown earlier, the runtime configuration includes a link that you can use to browse the associated cloud storage.

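For example, with the boto3 Python client you could list a run’s artifacts like this. This is a sketch only; the endpoint, credentials, and bucket name are placeholders that must match whatever your runtime configuration specifies.

import boto3

# Connect to the S3-compatible storage from the runtime configuration
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubeflow:9000",  # placeholder endpoint
    aws_access_key_id="minio",                          # placeholder user id
    aws_secret_access_key="minio123",                   # placeholder password
)

# List the input and output artifacts stored for pipeline runs
response = s3.list_objects_v2(Bucket="elyra-pipelines")  # placeholder bucket
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])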

The configured object storage bucket contains the input artifacts and output artifacts; output artifacts are highlighted in green in this screen capture.

Exporting a notebook pipeline

You can export a notebook pipeline as Kubeflow Pipelines SDK domain-specific language (DSL) Python code or as a YAML-formatted Kubeflow Pipelines configuration file. During export, the input artifacts for each notebook are uploaded to S3-compatible cloud storage.

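The exported Python code follows the standard Kubeflow Pipelines DSL structure. The sketch below shows the rough shape only, not Elyra’s exact output; the pipeline name, image, and bootstrap arguments are illustrative.

import kfp.dsl as dsl

@dsl.pipeline(
    name="notebook-pipeline",
    description="Pipeline assembled with Elyra",
)
def notebook_pipeline():
    # One ContainerOp per notebook node; the generated bootstrap command
    # pulls the notebook archive from cloud storage and runs it with papermill
    load_data = dsl.ContainerOp(
        name="load_data",
        image="amancevice/pandas:1.0.3",  # illustrative image
        command=["sh", "-c"],
        arguments=["..."],  # generated bootstrap command (elided)
    )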

Export generates all artifacts required to run the pipeline in Kubeflow Pipelines

Caution: The exported Python code or configuration file contains connectivity information (including credentials) for the cloud storage location where input artifacts are stored.

To run an exported pipeline from the Kubeflow Pipelines GUI, upload it and create a new run.

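Alternatively, a run can be started programmatically with the Kubeflow Pipelines SDK. A sketch, assuming the kfp package is installed; the endpoint and file names are placeholders:

import kfp

# Connect to the Kubeflow Pipelines API endpoint from the runtime configuration
client = kfp.Client(host="http://.../pipeline")  # placeholder endpoint

# Upload the exported pipeline definition and start a run in a new experiment
experiment = client.create_experiment("elyra-pipelines")
run = client.run_pipeline(
    experiment.id,
    job_name="notebook-pipeline-run",
    pipeline_package_path="notebook-pipeline.yaml",  # exported YAML file
)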

Upload and run an exported notebook pipeline in Kubeflow Pipelines

Running your own pipelines

If you have access to a Kubeflow Pipelines service running locally or in the cloud you can start to assemble notebook pipelines in minutes. To get you started we’ve already published a few pipelines. The weather data pipeline we’ve referenced in this blog post is part of the official samples repository, which you can find at https://github.com/elyra-ai/examples. If you are interested in learning how to process COVID-19 time-series data (for the USA and the European Union), take a look at https://github.com/CODAIT/covid-notebooks.

Try JupyterLab and Elyra using the pre-built Docker image

The Elyra community publishes ready-to-use Docker images on Docker Hub, which have JupyterLab v2 and the Elyra extension pre-installed. To use such an image, run

$ docker run -it -p 8888:8888 elyra/elyra:1.0.0 jupyter lab --debug

and open the displayed URL (e.g. http://127.0.0.1:8888/?token=...) in your web browser.

If you already have notebooks stored on your local file system you should mount the desired directory (e.g. /local/dir/) to make them available.

$ docker run -it -p 8888:8888 -v /local/dir/:/home/jovyan/work -w /home/jovyan/work elyra/elyra:1.0.0 jupyter lab --debug

Try JupyterLab and Elyra on Binder

If you don’t have Docker installed or don’t want to download the image (e.g. because of bandwidth constraints) you can try JupyterLab and Elyra in your web browser without having to install anything, thanks to mybinder.org. Open https://mybinder.org/v2/gh/elyra-ai/elyra/v1.0.0?urlpath=lab/tree/binder-demo and you are good to go.

Install JupyterLab and Elyra

If your local environment meets the prerequisites, you can run JupyterLab and Elyra natively on your own machine by following the installation instructions.

How to get involved

If you find the visual notebook editor useful, or would like to open an enhancement request or report an issue, head on over to https://github.com/elyra-ai/elyra and get the conversation started.

To learn more about how you can contribute to Elyra, take a look at https://github.com/elyra-ai/elyra#contributing-to-elyra.

Thanks for reading!

Originally published at https://medium.com/codait/creating-notebook-pipelines-using-elyra-and-kubeflow-pipelines-f9606449cc53
