Machine Learning with Jupyter: Solving the Workflow Management Problem Using Open Platforms

Latest recommended article published on 2025-02-20 09:12:49

weixin_26704853

Tags: python, machine learning, java, artificial intelligence, algorithms

Original article: https://towardsdatascience.com/machine-learning-with-jupyter-solving-the-workflow-management-problem-using-open-platforms-e1cb70ba85ef

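This article discusses the use of Jupyter in data science and machine learning work, and how Allegro Trains addresses the management problems of complex projects. Jupyter suits initial exploration, but large projects need better version control, resource management, and collaboration. Allegro Trains offers a solution here, with features including MLOps, AI infrastructure, automatic logging, and resource tracking.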

The infamous data science workflow diagram, with its interconnected circles of data acquisition, wrangling, analysis, and reporting, understates the multi-connectivity and non-linearity of these components. The same is true for machine learning and deep learning workflows. I understand that oversimplification is expedient in presentations and executive summaries. However, it may paint an unrealistic picture, hide the intricacies of ML development, and conceal the realities of the mess. This brings me to the tools of the trade, more commonly referred to as the infrastructure of artificial intelligence, which is the vehicle through which all libraries, experiments, designs, and creative minds meet. These humble infrastructures tend to be overlooked and underappreciated, but their importance can’t be overstated. I will explain a couple of these tools that can be used in tandem to improve your workflow and the accountability of your data exploration, and to lower the time and resources needed to go from proof of concept (POC) to deployment.

Open-Source Software

Before you are on stage in a turtleneck and casual blue jeans to reveal your data-intensive breakthrough product, the first step is to choose software to tackle the complex terrain of machine learning. Luckily, there are excellent open-source platforms such as Apache Spark, Jupyter, TensorFlow, PostgreSQL, and many more to get you started. The most recent addition to my list is Allegro Trains, of which I am a huge proponent. As a working data scientist, I apply five criteria before choosing open-source software. These are:

Availability and ease of use for POC
Reproducibility and collaboration
Cross-platform compatibility
Resource optimization
Productivity boost relative to the learning curve

In the spirit of these criteria, let’s explore Jupyter and Trains.

Jupyter and Allegro Trains

Jupyter Notebook is the de facto gateway to interactive computing for aspiring data scientists and engineers. It offers quick feedback in the form of errors or visual displays and lets users test out their rudimentary ideas. It is undoubtedly a perfect tool for POC work, in addition to being language agnostic and supporting text through Markdown cells. JupyterLab adds modularity to notebooks, so you are not opening a separate browser tab for each one. There is also JupyterHub, which allows multiple users to share notebooks. These are all great platforms for getting some hands-on experience.

The problem comes when you tackle complex projects that need a large number of iterations and experiments, crucial code and resource optimizations, versioning and collaboration, and data integration and deployment. These are the bottlenecks in bringing data-driven products to market. As a result, AI architectures that incorporate these features, in addition to meeting the five criteria above, become a necessity for moving from R&D to market faster and with a quicker ROI.

Figure 1: Machine learning project management by Trains

Why Allegro Trains?

Take a look at the four dummy projects in figure 1 and imagine running them on Jupyter: simply keeping track of the purpose of each notebook quickly becomes a task of its own. You could address this by setting up a git repository to handle your versioning needs and manage your progress. You could also install a memory profiler to manage the resources allocated to each task. What about automating your ML models? You could explore the growing list of AutoML tools. If you are starting to realize that your ML project’s complexity rises with each step toward completion, you are not the only one. This is a common problem, and it is best addressed with a single project hub. This is precisely why Allegro Trains was developed.
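
To make the memory-profiling step concrete, here is a minimal sketch that uses Python’s standard-library `tracemalloc` module rather than any particular third-party profiler (the article does not name one). The helper `peak_memory_bytes` and the task `build_features` are hypothetical names introduced only for illustration.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes of Python allocations during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_features(n):
    """Hypothetical notebook task: materialise a list of squared values."""
    return [i * i for i in range(n)]
```

Calling `peak_memory_bytes(build_features, 10_000)` returns the feature list together with the peak number of bytes allocated while it was built. Repeating this by hand for every notebook cell is exactly the kind of bookkeeping that becomes a chore as a project grows.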

Allegro Trains puts a stop to these all-too-common infrastructure problems. Instead of patching your problems with countless modules, it’s time to bring maturity to ML with an encompassing solution. Trains provides a simple MLOps and AI infrastructure solution to speed up machine learning and deep learning projects from R&D to deployment. It lets data professionals focus on implementing creative insights without worrying about messy notebooks, interruptions to resolve versioning issues, model comparisons, automatic logging, performance tracking, or even CPU/GPU/IO resource allocation. In short, it meets my five criteria for using open-source software. It also has AutoML to expedite experimentation, and it can terminate resource-intensive or underperforming tasks from the Web UI.
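
As a rough illustration of what this tracking looks like in a notebook cell, the sketch below wires a toy training loop to the `trains` Python package through `Task.init` and the task logger’s `report_scalar`. The project name, task name, and the simulated loss are placeholders that I made up, and the `track=True` path assumes the package is installed and pointed at a Trains Server. With `track=False` the loop runs on the standard library alone.

```python
import math


def simulated_loss(epoch):
    """Stand-in for a real training step: a loss that decays with the epoch."""
    return math.exp(-0.5 * epoch)


def run_experiment(epochs=5, track=False):
    """Run the toy loop; when track=True, report each loss value to Trains."""
    logger = None
    if track:
        # Imported lazily so the sketch also runs without the package installed.
        from trains import Task

        # Project and task names are hypothetical placeholders.
        task = Task.init(project_name="jupyter-poc", task_name="toy-decay")
        logger = task.get_logger()

    losses = []
    for epoch in range(epochs):
        loss = simulated_loss(epoch)
        losses.append(loss)
        if logger is not None:
            # Each reported scalar becomes a point on a plot in the Web UI.
            logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)
    return losses
```

The design choice worth noting is that the tracking calls sit alongside the existing loop rather than replacing it, so the same cell can serve as a quick Jupyter experiment and as a tracked run.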

I would like to leave you with one last remark. Using open platforms in tandem to bring your exciting ideas and R&D to market has become a matter of simple installations. Jupyter is there for your exploration needs, and Allegro Trains tracks your progress from POC through optimization and cross-team collaboration to deployment with Trains Server.
