Building a Python Data Science Container Using Docker

Original article: https://www.freecodecamp.org/news/building-python-data-science-container-using-docker/

TL;DR

Artificial Intelligence (AI) and Machine Learning (ML) are all the rage these days, powering a wide spectrum of use cases ranging from self-driving cars to drug discovery and beyond. AI and ML have a bright and thriving future ahead of them.

On the other hand, Docker revolutionized the computing world by introducing ephemeral, lightweight containers. A container packages all the software it needs to run inside an image (a stack of read-only layers) and adds a thin copy-on-write (COW) layer on top to hold any data written while the container runs.
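The layering idea can be pictured with a toy model. The sketch below is not how Docker's storage drivers are implemented; it only uses Python's ChainMap to mimic read-only layers with one writable layer on top. The file paths and contents are made up for illustration.

```python
from collections import ChainMap

# Read-only image layers, lowest first (file path -> content). Invented data.
base_layer = {"/etc/os-release": "ID=alpine", "/bin/sh": "busybox"}
python_layer = {"/usr/bin/python": "python binary"}

# A container adds one empty writable layer on top of the image layers.
writable_layer = {}

# ChainMap searches mappings left to right, so the top layer shadows the
# lower ones, and every write lands in the first mapping only.
container_fs = ChainMap(writable_layer, python_layer, base_layer)

# Reads fall through to the image layers.
original = container_fs["/etc/os-release"]

# "Modifying" a file stores a changed copy in the writable layer.
container_fs["/etc/os-release"] = original + "\nEDITED=1"

print(container_fs["/etc/os-release"])  # the container sees its edited copy
print(base_layer["/etc/os-release"])    # the image layer is unchanged
print(sorted(writable_layer))           # only the changed file lives here
```

Throwing away writable_layer returns the "container" to the pristine image, which is roughly why containers are described as ephemeral.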

Enough talk; let's get started with building a Python data science container.

Python Data Science Packages

Our Python data science container makes use of the following super cool Python packages:

NumPy: NumPy or Numeric Python supports large, multi-dimensional arrays and matrices. It provides fast precompiled functions for mathematical and numerical routines. In addition, NumPy optimizes Python programming with powerful data structures for efficient computation of multi-dimensional arrays and matrices.
SciPy: SciPy provides useful functions for regression, minimization, Fourier transformation, and many more. Based on NumPy, SciPy extends its capabilities. SciPy's main data structure is again a multidimensional array, implemented by NumPy. The package contains tools that help with solving linear algebra, probability theory, integral calculus, and many more tasks.
Pandas: Pandas offers versatile and powerful tools for manipulating data structures and performing extensive data analysis. It works well with incomplete, unstructured, and unordered real-world data, and comes with tools for shaping, aggregating, analyzing, and visualizing datasets.
SciKit-Learn: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. It is one of the best-known machine learning libraries for Python. The Scikit-learn package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. The primary emphasis is upon ease of use, performance, documentation, and API consistency. With minimal dependencies and easy distribution under the simplified BSD license, SciKit-Learn is widely used in academic and commercial settings. Scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems.
Matplotlib: Matplotlib is a Python 2D plotting library, capable of producing publication-quality figures in a wide variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Building the Data Science Container

Python is fast becoming the go-to language for data scientists, and for this reason we are going to use Python as the language of choice for building our data science container.

The Base Alpine Linux Image

Alpine Linux is a tiny Linux distribution designed for power users who appreciate security, simplicity and resource efficiency.

As claimed by Alpine:

Small. Simple. Secure. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

The Alpine image is surprisingly tiny, at no more than 8 MB, and ships with minimal packages installed, which reduces the attack surface of the container. This makes Alpine the image of choice for our data science container.

Downloading and running an Alpine Linux container is as simple as:

$ docker container run --rm alpine:latest cat /etc/os-release

In our Dockerfile, we can simply use the Alpine base image as:

FROM alpine:latest

Talk is cheap, let's build the Dockerfile

Now let’s work our way through the Dockerfile.

The FROM directive sets alpine:latest as the base image. The WORKDIR directive sets /var/www as the working directory for our container. ENV PACKAGES lists the system packages the container requires, such as git, blas and libgfortran. The Python packages for our data science container are defined in ENV PYTHON_PACKAGES.

We have combined all the commands under a single Dockerfile RUN directive to reduce the number of layers, which in turn helps reduce the size of the resulting image.
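To make the structure described above concrete, here is a small Python sketch that renders a Dockerfile of that shape. It is not the author's actual Dockerfile: the helper render_dockerfile, the variable name PYTHON_PACKAGES, and the apk/pip command line are assumptions made for illustration, the package lists are shortened to names mentioned in the article, and the rendered file has not been checked to build (a real one also needs a Python interpreter, pip, and build tools).

```python
def render_dockerfile(base, workdir, packages, python_packages):
    """Return Dockerfile text with a single RUN directive (illustrative only)."""

    def env_block(name, values):
        # One package per line, joined with Dockerfile line continuations.
        body = " \\\n    ".join(values)
        return f'ENV {name}="\\\n    {body}"'

    lines = [
        f"FROM {base}",
        f"WORKDIR {workdir}",
        env_block("PACKAGES", packages),
        env_block("PYTHON_PACKAGES", python_packages),  # assumed variable name
        # Chaining the commands with && keeps them in one RUN, hence one layer.
        "RUN apk add --no-cache $PACKAGES \\\n"
        "    && pip install --no-cache-dir $PYTHON_PACKAGES",
    ]
    return "\n".join(lines) + "\n"


print(render_dockerfile(
    base="alpine:latest",
    workdir="/var/www",
    packages=["git", "blas", "libgfortran"],
    python_packages=["numpy", "scipy", "pandas", "scikit-learn", "matplotlib", "nltk"],
))
```

Splitting the RUN line into two separate RUN directives would produce the same files but one more image layer, which is the trade-off the paragraph above refers to.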

Building and tagging the image

Now that we have our Dockerfile defined, navigate to the folder with the Dockerfile using the terminal and build the image using the following command:

$ docker build -t faizanbashir/python-datascience:2.7 -f Dockerfile .

The -t flag names and tags the image in the 'name:tag' format. The -f flag gives the path of the Dockerfile (the default is 'PATH/Dockerfile').
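As a rough illustration of the 'name:tag' format, the following sketch splits an image reference into its two parts. The helper split_image_ref is hypothetical, not part of Docker or any library, and it ignores digest references (name@sha256:...); it only relies on the fact that Docker falls back to the tag latest when none is given.

```python
def split_image_ref(ref, default_tag="latest"):
    """Split 'name:tag' into (name, tag); the tag defaults to 'latest'."""
    # Only a colon after the last '/' separates the tag. An earlier colon
    # belongs to a registry host:port such as localhost:5000.
    head, slash, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    if not colon:
        tag = default_tag
    return head + slash + name, tag


print(split_image_ref("faizanbashir/python-datascience:2.7"))
print(split_image_ref("alpine"))
```

Here faizanbashir/python-datascience is the repository name and 2.7 is the tag, which is how the same repository can later carry a separate 3.6 image.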

Running the container

Having built and tagged the Docker image, we can now run the container using the following command:

$ docker container run --rm -it faizanbashir/python-datascience:2.7 python

Voila, we are greeted by a Python shell, ready for all kinds of cool data science stuff.

Python 2.7.15 (default, Aug 16 2018, 14:17:09)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

Our container comes with Python 2.7, but don't be sad if you wanna work with Python 3.6. Lo and behold, the Dockerfile for Python 3.6:

https://gist.github.com/faizanbashir/9443a7149cc53f81d84d0d356f871ec7#file-datascience-python3-6-dockerfile

Build and tag the image like so:

$ docker build -t faizanbashir/python-datascience:3.6 -f Dockerfile .

Run the container like so:

$ docker container run --rm -it faizanbashir/python-datascience:3.6 python

With this, you have a ready-to-use container for doing all kinds of cool data science stuff.
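A quick way to confirm that the packages are actually available is to ask Python whether it can find them. The sketch below assumes the Python 3.6 image (importlib.util.find_spec does not exist in Python 2.7); the helper missing_modules is a hypothetical name, and the module list uses import names, so scikit-learn appears as sklearn.

```python
import importlib.util

# Import names for the packages the article lists (scikit-learn imports as sklearn).
MODULES = ["numpy", "scipy", "pandas", "sklearn", "matplotlib", "nltk"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULES)
print("all packages found" if not missing else "missing: " + ", ".join(missing))
```

Pasting this into the container's Python shell, or saving it as a script and running it there, should report no missing packages if the image was built as described.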

Serving Puddin'

All of this assumes you have the time and resources to set everything up yourself. In case you don't, you can pull the existing images that I have already built and pushed to Docker Hub, Docker's registry, using:

# For Python 2.7 pull
$ docker pull faizanbashir/python-datascience:2.7

# For Python 3.6 pull
$ docker pull faizanbashir/python-datascience:3.6

After pulling an image, you can run it directly, extend it in your own Dockerfile, or reference it as an image in your docker-compose or stack file.

Aftermath

The world of AI and ML is getting pretty exciting these days and will only become more so. Big players are investing heavily in these domains. It's about time you started harnessing the power of data; who knows, it might lead to something wonderful.

You can check out the code here.

faizanbashir/python-datascience: Docker image for python datascience container with NumPy, SciPy, Scikit-learn, Matplotlib, nltk, pandas packages… (github.com)

I hope this article helped you build containers for your data science projects. Clap if it increased your knowledge, and help it reach more people.