Getting Started with Docker for Data Scientists

Latest recommended article published on 2023-12-07 10:15:53

weixin_26704853

Original article: https://towardsdatascience.com/getting-started-with-docker-for-data-scientists-a2ed505e2a09

We have all been there:

“It worked on my machine!”

Who hasn’t been on one end of that statement or the other?

As developers, data scientists, and software engineers, we work on complex code bases that depend on many things in the background. When we want to share our code with colleagues or publish it on GitHub as an open-source project, we need to make sure the code will work in all of those different environments.

Sometimes, more often than we would like to admit, we try to run a friend’s code or code we got from the internet, and the computer yells “ImportError” at us. That error means the code needs a module that it can’t find on your computer.

The solution for this is Docker. Docker is a container management system that aims to make it easy to share projects and run them across different environments. Basically, Docker lets code run smoothly on other machines, even ones with different operating systems, by encapsulating the code and all its dependencies in a container.

The container makes the code self-contained and largely independent of the host machine’s setup.

Why do we use Docker?

When we write code for data science or machine learning applications, we often have several goals that make Docker a strong option. We want to:

Ensure that the application works the same way in every environment.
Save those who will use or run the application the trouble of handling dependencies and installation problems.
Avoid working with virtual machines.
Focus on building the application instead of worrying about managing dependencies.

Docker Basics

So, how does Docker work?

To understand that, we first need to cover some Docker terminology. Let’s get into it.

Images

Images are archives that hold all the data needed to run an app. If you’re familiar with programming languages, you can think of an image the same way you think of a class. Classes are blueprints that contain the data needed to create instances; images are blueprints that contain the data needed to create containers.

Images don’t change. Whatever changes you make while working with a particular image will not be saved unless you save the result as a new image.

Containers

A container is an enclosed environment where your app runs. A container only has access to the resources it has been granted (storage, CPU, memory) and knows nothing else about the machine it is running on. Inside, it sees only a Linux distribution with what is needed to run the application.

Containers leave no data behind by default. Any changes made inside a container are lost as soon as the container is removed, unless you save it as a new image first.

Dockerfiles

Dockerfiles are text files that hold the instructions needed to build an image and run an application. Every image should be built from at least, and preferably exactly, one Dockerfile. A Dockerfile can be divided into three main sections:

The base image: The core of the application. For example, if the application needs Python 3 to run correctly, then Python 3 will be the base image, and additional libraries will be added in the instruction set.
Instruction set: The instruction set includes RUN commands; each one represents an additional library or binary that needs to be installed or run.
Entry command: The command that runs once all needed libraries are installed. For example, an entry command can open a Jupyter Notebook or start a command line.

You can think of an image as an onion: the base image is the heart of the onion, and each instruction adds a new layer to the image. That’s why you need to pay attention to how you layer your instructions.
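To make the three sections concrete, here is a minimal sketch that writes a Dockerfile from the shell. The base image tag, the /app working directory, the label text, and the Jupyter entry command are illustrative assumptions rather than details from a specific project. The sketch also assumes a requirements.txt that includes the notebook package sits next to the Dockerfile.

```shell
# Write a small Dockerfile into the current directory.
# Every name below (image tag, /app, label text) is an illustrative assumption.
cat > Dockerfile <<'EOF'
# 1. Base image: the heart of the onion.
FROM python:3.8-slim-buster

# Optional label with usage information.
LABEL description="Hypothetical data science demo image"

# 2. Instruction set: each instruction adds a layer.
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# 3. Entry command: runs when a container starts.
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
EOF
```

Copying requirements.txt and installing the libraries before copying the rest of the code is a layering choice: when only the code changes, Docker can reuse the cached library layer instead of reinstalling everything.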

Volumes

Since images are fixed and containers have a short memory (similar to RAM), what happens if we have data that is needed to run the application?

Here’s where volumes solve the problem. When the application needs data, we can go one of two ways: access the data locally or access it from a volume. Accessing the data locally (through a local mount point, also called a bind mount) requires you to select a specific directory on the local machine where the data is stored.

Volumes are used for shared data when you don’t know anything about the host machine, that is, the machine where the application will run.
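The two options can be sketched as shell functions. This is only an illustration: the image name ds-demo, the volume name ds-demo-data, and the /app/data path inside the container are hypothetical, and the functions assume Docker is installed and that such an image has already been built.

```shell
IMAGE="ds-demo"   # hypothetical image name

# Option 1, bind mount: expose the host directory ./data
# inside the container at /app/data.
run_with_bind_mount() {
    docker run --rm -v "$(pwd)/data:/app/data" "$IMAGE"
}

# Option 2, named volume: Docker manages where the data lives,
# so nothing about the host's directory layout is assumed.
run_with_named_volume() {
    docker volume create ds-demo-data
    docker run --rm -v ds-demo-data:/app/data "$IMAGE"
}
```

Calling run_with_bind_mount or run_with_named_volume starts a container with the data attached; in both cases the data outlives the container, which --rm removes on exit.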

Registries

Registries are the repository equivalent for Docker images. A registry lets you push and pull container images. You can distribute your images directly from your Docker host or through a hosted registry such as Docker Hub, and you can deploy them with an orchestrator such as Kubernetes or Docker Swarm. Using such services gives you useful features, such as automated deployment and scaling.

How to start using Docker

Step №1: Installing Docker

To install Docker on your device, head to the official Docker website and install the correct version for your machine. To make sure that your installation went correctly, try running the following command:

docker run hello-world

If you see a message that starts with “Hello from Docker!”, then everything is up and running, and you are ready to get to work!

Step №2: Get to know the basic commands

Docker is quite a broad topic. However, you can get very far by knowing six basic commands: run, ps, rename, stop, start, and logs.
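As a rough walkthrough, the six commands can be chained on a throwaway container. The container names demo and demo-renamed and the choice of the python:3.8-slim-buster image are assumptions made for illustration; the function needs a running Docker daemon when it is called.

```shell
# A tour of the six basic commands on a disposable container.
docker_basics_demo() {
    # run: create and start a container in the background.
    docker run -d --name demo python:3.8-slim-buster sleep 300
    # ps: list the running containers.
    docker ps
    # rename: give the container a new name.
    docker rename demo demo-renamed
    # stop: stop the running container.
    docker stop demo-renamed
    # start: start the stopped container again.
    docker start demo-renamed
    # logs: print what the container has written to its output.
    docker logs demo-renamed
    # Cleanup (not one of the six): remove the demo container.
    docker rm -f demo-renamed
}
```

Run docker_basics_demo in a terminal to see each step in order.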

Step №3: Prepare the requirements file

You can add the binaries and needed libraries to the Dockerfile directly, but it’s better to keep them in a separate file. This file is usually called requirements.txt and lists one library per line.
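For illustration, a requirements.txt for a small data science project might be created like this. The package list is a hypothetical example, not the list from the original project.

```shell
# Write an example requirements.txt; the package choice is illustrative.
cat > requirements.txt <<'EOF'
# Pin versions (package==version) once the project is stable.
numpy
pandas
scikit-learn
matplotlib
notebook
EOF
```

Pinning exact versions makes the image reproducible, since an unpinned name installs whatever version is current at build time.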

Step №4: Prepare the Dockerfile

Write a simple, efficient Dockerfile. It’s time to assemble the onion from the inside out: set a base image first, then add the instructions, and finish with the entry command.

Step №5: Create a Docker image

Often our code is hosted on GitHub, which makes it very easy to create an image of the code. If you have a local clone of the repository, you can use the command line and run the repo2docker command to create an image from it.
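A minimal sketch of building an image from a local clone might look like this. The image name ds-demo is hypothetical, and the plain docker build variant is an alternative added here for comparison, not something the original text describes. Both functions assume they are called from the root of the repository with Docker running.

```shell
IMAGE_NAME="ds-demo"   # hypothetical image name

# Option A: let repo2docker inspect the repository and build the image.
build_with_repo2docker() {
    python3 -m pip install jupyter-repo2docker
    # --no-run builds the image without starting a container afterwards.
    jupyter-repo2docker --no-run --image-name "$IMAGE_NAME" .
}

# Option B: build from a Dockerfile in the current directory.
build_with_dockerfile() {
    docker build -t "$IMAGE_NAME" .
}
```

Option A needs no Dockerfile, because repo2docker works from files such as requirements.txt or environment.yml; Option B gives you full control over the layers.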

However, if the code you’re trying to create an image for is on GitHub, then you can use myBinder to generate and host your repo’s image. You will be able to access the image using a link provided by myBinder. With myBinder, the dependencies can be listed in a Conda environment.yml file, which plays the same role as requirements.txt and contains the same information.

To use myBinder, you need the link to your repo’s master branch. For this article, I am using a repo I created for an event.

If you’re using Anaconda Navigator, you can use the Conda command line to generate the environment.yml file of a specific environment with this command, where ENVNAME is the name of the environment:

conda env export --name ENVNAME > environment.yml

Step №6: Run a container

To run a container, you can use the run command we mentioned previously if the image is already downloaded on your machine. However, if you use myBinder and the image is hosted in the cloud, you can access it through the link generated with the image. The link can be added to your README.md file, where it shows up as a badge that starts a container when clicked.
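Both routes can be sketched briefly. The GitHub user, repository, branch, and image name below are placeholders to replace with your own, and port 8888 assumes the container serves a Jupyter Notebook on that port.

```shell
IMAGE_NAME="ds-demo"          # hypothetical local image name
GH_USER="your-github-user"    # placeholder
GH_REPO="your-repo"           # placeholder
GH_REF="master"               # branch to launch

# Local route: start a container and publish the notebook port.
run_locally() {
    docker run --rm -p 8888:8888 "$IMAGE_NAME"
}

# myBinder route: append a launch badge to README.md.
printf '[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/%s/%s/%s)\n' \
    "$GH_USER" "$GH_REPO" "$GH_REF" >> README.md
```

After pushing the updated README.md, the badge on the repository page links to a myBinder session built from that branch.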

Tips and tricks for best practice

Always make sure you’re using the most efficient base image. For example, in the case of Python 3, choose a slim variant such as slim-buster or slim-stretch. They are fully supported and work well with most DS and ML libraries.
Use labels to provide important information, such as usage tips and extra details about the application, the needed libraries, and how they are used.
Split long RUN commands across lines to make them more readable. Put all the needed libraries in a requirements.txt file to keep things organized.
Only install the necessary packages. It makes building and running the images more efficient.
Ignore files explicitly to avoid security risks (add them to the .dockerignore file).
Avoid adding data to the image. Pull data from a database or the cloud, or use bind mounts, but don’t hard-code it in the image.
If you are starting with Docker and want a standard project template, use the Cookiecutter Data Science or Cookiecutter Docker Science project templates.
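As one possible starting point for the ignore-file tip, the following sketch writes a .dockerignore next to the Dockerfile. The entries are typical guesses for a Python data science project, not a list from the original article, so adjust them to your own repository.

```shell
# Write an example .dockerignore; the entries are illustrative.
cat > .dockerignore <<'EOF'
# Version control history and local caches
.git
__pycache__/
*.pyc
.ipynb_checkpoints/

# Secrets and local configuration
.env

# Raw data stays outside the image (mount it instead)
data/
EOF
```

Files matched here are never sent to Docker during a build, which keeps secrets and bulky data out of the image and makes builds faster.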

Docker can get quite complicated and can be challenging to get a handle on, but the best thing to do is to keep practicing and to make use of the powerful features Docker provides.

Even if you are not entirely familiar with Docker, basic use of it already gives you a great deal of control over your applications. Using only the basic commands we covered in this article, you can harness Docker’s power and use it to share, deploy, and develop your applications.