Build tools in a machine learning project

It all started with a post about using make for keeping a machine learning workflow reproducible. Although make is very stable and widely used, I personally like cross-platform solutions. It is 2019 after all, not 1977. One can argue that make itself is cross-platform, but in reality you will have trouble and will spend time fixing your tool rather than doing the actual work. So I decided to look around and check out what other tools are available. Yes, I decided to spend some time on tools.

This post is more an invitation for a dialogue rather than a tutorial. Perhaps your solution is perfect. If it is then it will be interesting to hear about it.

In this post I will use a small Python project and will do the same automation tasks with different systems:

There will be a comparison table at the end of the post.

Most of the tools I will look at are known as build automation software or build systems. There are myriads of them in all different flavours, sizes and complexities. The idea is the same: a developer defines rules for producing some results in an automated and consistent way. For example, a result might be an image with a graph. In order to make this image one would need to download the data, clean it and do some data manipulation (a classical example, really). You may start with a couple of shell scripts that will do the job. Once you return to the project a year later, it will be difficult to remember all the steps, and their order, needed to make that image. The obvious solution is to document all the steps. Good news! Build systems let you document the steps in the form of a computer program. Some build systems are like your shell scripts, but with additional bells and whistles.

The foundation of this post is a series of posts by Mateusz Bednarski on automated workflow for a machine learning project. Mateusz explains his views and provides recipes for using make. I encourage you to go and check his posts first. I will mostly use his code, but with different build systems.

If you would like to know more about make, here are references to a couple of posts. Brooke Kennedy gives a high-level overview in 5 Easy Steps to Make Your Data Science Project Reproducible. Zachary Jones gives more details about the syntax and capabilities, along with links to other posts. David Stevens writes a very hyped post on why you absolutely have to start using make right away. He provides nice examples comparing the old way and the new way. Samuel Lampa, on the other hand, writes about why using make is a bad idea.

My selection of build systems is neither comprehensive nor unbiased. If you want to make your own list, Wikipedia might be a good starting point. As stated above, I will cover CMake, PyBuilder, pynt, Paver, doit and Luigi. Most of the tools in this list are Python-based, which makes sense since the project is in Python. This post will not cover how to install the tools. I assume that you are fairly proficient in Python.

I am mostly interested in testing this functionality:

  1. Specifying a couple of targets with dependencies. I want to see how to do it and how easy it is.

  2. Checking out whether incremental builds are possible. This means that the build system won't rebuild what has not changed since the last run, i.e. you do not need to re-download your raw data. Another thing I will look for is incremental builds when a dependency changes. Imagine we have a dependency graph A -> B -> C. Will target C be rebuilt if B changes? What if A changes?

  3. Checking whether a rebuild is triggered if the source code changes, i.e. if we change a parameter of the generated graph, the image must be rebuilt the next time we build.

  4. Checking out the ways to clean build artifacts, i.e. remove files that have been created during the build and roll back to the clean source code.

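Points 2 and 3 both come down to timestamp comparison: a make-style tool rebuilds a target when the target file is missing or older than any of its dependencies. A minimal sketch of that check (my own illustration, not code from any of the tools below):

```python
import os

def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# With a dependency graph A -> B -> C, a build system applies this check
# transitively: touching A makes B stale, and rebuilding B then makes C stale.
```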

I will not use all build targets from Mateusz's post, just three of them to illustrate the principles.

All the code is available on GitHub.

CMake

CMake is a build script generator, which generates input files for various build systems. Its name stands for cross-platform make. CMake is a software engineering tool; its primary concern is building executables and libraries. So CMake knows how to build targets from source code in supported languages. CMake is executed in two steps: configuration and generation. During configuration it is possible to configure the future build according to one's needs; for example, user-provided variables are given during this step. Generation is normally straightforward and produces file(s) that build systems can work with. With CMake, you can still use make, but instead of writing a makefile directly you write a CMake file, which will generate the makefile for you.

Another important concept is that CMake encourages out-of-source builds. Out-of-source builds keep the source code away from any artifacts it produces. This makes a lot of sense for executables, where a single source codebase may be compiled for different CPU architectures and operating systems. This approach, however, may contradict the way a lot of data scientists work. It seems to me that the data science community tends toward high coupling of data, code and results.

Let’s see what we need to achieve our goals with CMake. There are two possibilities for defining custom things in CMake: custom targets and custom commands. Unfortunately we will need to use both, which results in more typing compared to a vanilla makefile. A custom target is considered to be always out of date, i.e. if there is a target for downloading raw data, CMake will always re-download it. A combination of a custom command with a custom target allows targets to be kept up to date.

For our project we will create a file named CMakeLists.txt and put it in the project’s root. Let’s check out the content:

cmake_minimum_required(VERSION 3.14.0 FATAL_ERROR)
project(Cmake_in_ml VERSION 0.1.0 LANGUAGES NONE)

This part is basic. The second line defines the name of your project and its version, and specifies that we won’t use any built-in language support (since we will call Python scripts).

Our first target will download the IRIS dataset:

SET(IRIS_URL "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" CACHE STRING "URL to the IRIS data")
set(IRIS_DIR ${CMAKE_CURRENT_SOURCE_DIR}/data/raw)
set(IRIS_FILE ${IRIS_DIR}/iris.csv)
ADD_CUSTOM_COMMAND(OUTPUT ${IRIS_FILE}
    COMMAND ${CMAKE_COMMAND} -E echo "Downloading IRIS."
    COMMAND python src/data/download.py ${IRIS_URL} ${IRIS_FILE}
    COMMAND ${CMAKE_COMMAND} -E echo "Done. Checkout ${IRIS_FILE}."
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    )
ADD_CUSTOM_TARGET(rawdata ALL DEPENDS ${IRIS_FILE})

The first line defines the parameter IRIS_URL, which is exposed to the user during the configuration step. If you use the CMake GUI you can set this variable through the GUI:

Next, we define variables with the download location of the IRIS dataset. Then we add a custom command, which will produce IRIS_FILE as its output. In the end, we define a custom target rawdata that depends on IRIS_FILE, meaning that in order to build rawdata, IRIS_FILE must be built. The ALL option of the custom target says that rawdata will be one of the default targets to build. Note that I use CMAKE_CURRENT_SOURCE_DIR in order to keep the downloaded data in the source folder and not in the build folder. This is just to keep it the same as in Mateusz's posts.

Alright, let’s see how we can use it. I am currently running it on Windows with the MinGW compiler installed. You may need to adjust the generator setting for your needs (run cmake --help to see the list of available generators). Fire up the terminal, go to the parent folder of the source code, then:

mkdir overcome-the-chaos-build
cd overcome-the-chaos-build
cmake -G "MinGW Makefiles" ../overcome-the-chaos

Outcome:

-- Generating done

-- Build files have been written to: C:/home/workspace/overcome-the-chaos-build

With modern CMake we can build the project directly from CMake. This command will invoke the build all target:

cmake --build .

Outcome:

[100%] Built target rawdata

We can also view the list of available targets:

cmake --build . --target help

And we can remove the downloaded file with:

cmake --build . --target clean

Note that we didn’t need to create the clean target manually.

Now let’s move to the next target: preprocessed IRIS data. Mateusz creates two files from a single function: processed.pickle and processed.xlsx. You can see how he gets away with cleaning this Excel file by using rm with a wildcard. I think this is not a very good approach. In CMake, we have two options for how to deal with it. The first option is to use the ADDITIONAL_MAKE_CLEAN_FILES directory property. The code will be:

SET(PROCESSED_FILE ${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.pickle)
ADD_CUSTOM_COMMAND(OUTPUT ${PROCESSED_FILE}
    COMMAND python src/data/preprocess.py ${IRIS_FILE} ${PROCESSED_FILE} --excel data/processed/processed.xlsx
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS rawdata ${IRIS_FILE}
    )
ADD_CUSTOM_TARGET(preprocess DEPENDS ${PROCESSED_FILE})
# Additional files to clean
set_property(DIRECTORY PROPERTY ADDITIONAL_MAKE_CLEAN_FILES
    ${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.xlsx
    )

The second option is to specify a list of files as a custom command output:

LIST(APPEND PROCESSED_FILE "${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.pickle"
    "${CMAKE_CURRENT_SOURCE_DIR}/data/processed/processed.xlsx"
    )
ADD_CUSTOM_COMMAND(OUTPUT ${PROCESSED_FILE}
    COMMAND python src/data/preprocess.py ${IRIS_FILE} data/processed/processed.pickle --excel data/processed/processed.xlsx
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS rawdata ${IRIS_FILE} src/data/preprocess.py
    )
ADD_CUSTOM_TARGET(preprocess DEPENDS ${PROCESSED_FILE})

Note that in this case I created the list but didn’t use it inside the custom command. I do not know of a way to reference the output arguments of a custom command from within it.

Another interesting thing to note is the usage of DEPENDS in this custom command. We set a dependency not only on the custom target, but also on its output and on the Python script. If we do not add a dependency on IRIS_FILE, then modifying iris.csv manually will not result in a rebuild of the preprocess target. Well, you should not modify files in your build directory manually in the first place; just letting you know. More details are in Sam Thursfield's post. The dependency on the Python script is needed to rebuild the target if the script changes.

And finally the third target:

SET(EXPLORATORY_IMG ${CMAKE_CURRENT_SOURCE_DIR}/reports/figures/exploratory.png)
ADD_CUSTOM_COMMAND(OUTPUT ${EXPLORATORY_IMG}
    COMMAND python src/visualization/exploratory.py ${PROCESSED_FILE} ${EXPLORATORY_IMG}
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS ${PROCESSED_FILE} src/visualization/exploratory.py
    )
ADD_CUSTOM_TARGET(exploratory DEPENDS ${EXPLORATORY_IMG})

This target is basically the same as the second one.

To wrap up: CMake looks messier and more difficult than Make. Indeed, a lot of people criticize CMake for its syntax. In my experience, the understanding comes with time, and it is absolutely possible to make sense of even very complicated CMake files.

You will still do a lot of the gluing yourself, as you will need to pass the correct variables around. I do not see an easy way of referencing the output of one custom command in another one. It seems like it is possible to do it via custom targets.

PyBuilder

The PyBuilder part is very short. I used Python 3.7 in my project, and PyBuilder's current version 0.11.17 does not support it. The proposed solution is to use the development version. However, that version is bound to pip v9, while pip is at v19.3 as of the time of writing. Bummer. After fiddling around with it a bit, it didn't work for me at all. The PyBuilder evaluation was a short-lived one.

pynt

Pynt is Python-based, which means we can use Python functions directly. It is not necessary to wrap them with click and provide a command-line interface. However, pynt is also capable of executing shell commands. I will use Python functions.

Build commands are given in a file named build.py. Targets/tasks are created with function decorators. Task dependencies are provided through the same decorator.

Since I would like to use Python functions, I need to import them in the build script. Pynt does not add the current directory to the Python path, so writing something like this:

from src.data.download import pydownload_file

will not work. We have to do:

import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '.'))

from src.data.download import pydownload_file

My initial build.py file was like this:

#!/usr/bin/python

import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '.'))
from pynt import task

from path import Path
import glob

from src.data.download import pydownload_file
from src.data.preprocess import pypreprocess

iris_file = 'data/raw/iris.csv'
processed_file = 'data/processed/processed.pickle'

@task()
def rawdata():
  '''Download IRIS dataset'''
  pydownload_file('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', iris_file)

@task()
def clean():
    '''Clean all build artifacts'''
    patterns = ['data/raw/*.csv', 'data/processed/*.pickle',
            'data/processed/*.xlsx', 'reports/figures/*.png']
    for pat in patterns:
        for fl in glob.glob(pat):
            Path(fl).remove()

@task(rawdata)
def preprocess():
    '''Preprocess IRIS dataset'''
    pypreprocess(iris_file, processed_file, 'data/processed/processed.xlsx')

And the preprocess target didn't work. It was constantly complaining about the input arguments of the pypreprocess function. It seems like Pynt does not handle optional function arguments very well. I had to remove the argument for making the Excel file. Keep this in mind if your project has functions with optional arguments.

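One way to keep the Excel output while avoiding the problem is to hide the optional argument behind a wrapper whose signature has only what pynt needs to see, and decorate the wrapper with @task instead. A sketch of the pattern, where process is a hypothetical stand-in for pypreprocess:

```python
# Hypothetical stand-in for a processing function with an optional argument.
def process(input_file, output_file, excel_file=None):
    summary = f"processed {input_file} -> {output_file}"
    if excel_file is not None:
        summary += f" (+ {excel_file})"
    return summary

# A wrapper with no optional arguments; this is what would get the @task
# decorator, so pynt never sees the optional parameter.
def preprocess_task():
    return process("data/raw/iris.csv",
                   "data/processed/processed.pickle",
                   excel_file="data/processed/processed.xlsx")
```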

We can run pynt from the project's folder and list all the available targets:

pynt -l

Outcome:

Tasks in build file build.py:
  clean                      Clean all build artifacts
  exploratory                Make an image with pairwise distribution
  preprocess                 Preprocess IRIS dataset
  rawdata                    Download IRIS dataset

Powered by pynt 0.8.2 - A Lightweight Python Build Tool.

Let's make the pairwise distribution:

pynt exploratory

Outcome:

[ build.py - Starting task "rawdata" ]
Downloading from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data to data/raw/iris.csv
[ build.py - Completed task "rawdata" ]
[ build.py - Starting task "preprocess" ]
Preprocessing data
[ build.py - Completed task "preprocess" ]
[ build.py - Starting task "exploratory" ]
Plotting pairwise distribution...
[ build.py - Completed task "exploratory" ]

If we now run the same command again (i.e. pynt exploratory), there will be a full rebuild. Pynt doesn't track that nothing has changed.

Paver

Paver looks almost exactly like Pynt. It is slightly different in the way one defines dependencies between targets (another decorator, @needs). Paver does a full rebuild each time and doesn't play nicely with functions that have optional arguments. Build instructions are found in a pavement.py file.

doit

Doit seems like an attempt to create a true build-automation tool in Python. It can execute Python code and shell commands. It looks quite promising. What it seems to miss (in the context of our specific goals) is the ability to handle dependencies between targets. Let's say we want to make a small pipeline where the output of target A is used as input of target B. And let's say we are using files as outputs, so target A creates a file named outA.

In order to make such a pipeline we will need to specify the file outA twice in target A (as a result of the target, but also return its name as part of the target execution). Then we will need to specify it as input to target B. So there are three places in total where we need to provide information about the file outA. And even after we do so, modification of the file outA won't lead to an automatic rebuild of target B. This means that if we ask doit to build target B, doit will only check whether target B is up to date, without checking any of its dependencies. To overcome this, we will need to specify outA four times, also as a file dependency of target B. I see this as a drawback. Both Make and CMake are able to handle such situations correctly.

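As a sketch of that bookkeeping (hypothetical task and file names, written in the dodo.py style doit expects), here is where outA has to appear four times:

```python
def make_out_a():
    with open("outA", "w") as fh:
        fh.write("data")
    return {"filename": "outA"}          # (1) returned as part of task execution

def task_a():
    return {
        "actions": [make_out_a],
        "targets": ["outA"],             # (2) declared as the task's result
    }

def transform(input_file):
    with open(input_file) as fh:
        data = fh.read()
    with open("outB", "w") as fh:
        fh.write(data.upper())

def task_b():
    return {
        "actions": [transform],
        "getargs": {"input_file": ("a", "filename")},  # (3) wired in as input
        "file_dep": ["outA"],            # (4) needed so editing outA reruns B
        "targets": ["outB"],
    }
```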

Dependencies in doit are file-based and expressed as strings. This means that the dependencies ./myfile.txt and myfile.txt are viewed as being different. As I wrote above, I find the way of passing information from target to target (when using Python targets) a bit strange. A target has a list of artifacts it is going to produce, but another target can't use it. Instead, the Python function which constitutes the target must return a dictionary, which can then be accessed in another target. Let's see it in an example:

def task_preprocess():
    """Preprocess IRIS dataset"""
    pickle_file = 'data/processed/processed.pickle'
    excel_file = 'data/processed/processed.xlsx'
    return {
        'file_dep': ['src/data/preprocess.py'],
        'targets': [pickle_file, excel_file],
        'actions': [doit_pypreprocess],
        'getargs': {'input_file': ('rawdata', 'filename')},
        'clean': True,
    }

Here the target preprocess depends on rawdata. The dependency is provided via the getargs property. It says that the argument input_file of the function doit_pypreprocess is taken from the filename output of the target rawdata. Have a look at the complete example in the file dodo.py.

It may be worth reading the success stories of using doit. It definitely has nice features, like the ability to provide a custom up-to-date check for a target.

Luigi

Luigi stands apart from the other tools, as it is a system for building complex pipelines. It appeared on my radar after a colleague told me that he had tried Make, was never able to use it across Windows/Linux, and moved on to Luigi.

Luigi aims at production-ready systems. It comes with a server, which can be used to visualize your tasks or to get a history of task executions. The server is called a central scheduler. A local scheduler is available for debugging purposes.

Luigi also differs from the other systems in how tasks are created. Luigi doesn't act on some predefined file (like dodo.py, pavement.py or a makefile). Rather, one has to pass a Python module name. So if we try to use it in a way similar to the other tools (placing a file with tasks in the project's root), it won't work. We have to either install our project or modify the PYTHONPATH environment variable by adding the project's path to it.

What is great about Luigi is the way dependencies between tasks are specified. Each task is a class. The output method tells Luigi where the results of the task will end up; the results can be a single element or a list. The requires method specifies task dependencies (other tasks, although it is possible to depend on the task itself). And that's it. Whatever is specified as output in task A will be passed as input to task B if task B requires task A.


Luigi doesn't care about file modifications. It cares about file existance. So it is not possible to trigger rebuilds when the source code changes. Luigi doesn't have a built-in clean functionality.

Luigi tasks for this project are available in the file luigitasks.py. I run them from the terminal:

luigi --local-scheduler --module luigitasks Exploratory

Comparison

The table below summarizes how different systems work in respect to our specific goals.

        | Define target with | Incremental | Incremental builds if | Can figure out which artifacts
        | dependency         | builds      | source code changes   | to remove during clean
CMake   | yes                | yes         | yes                   | yes
Pynt    | yes                | no          | no                    | no
Paver   | yes                | no          | no                    | no
doit    | somewhat yes       | yes         | yes                   | yes
Luigi   | yes                | no          | no                    | no

Translated from: https://habr.com/en/post/451962/
