# Building a Python Data Science Container Using Docker

#### TL;DR

Artificial Intelligence (AI) and Machine Learning (ML) are on fire these days, powering a wide spectrum of use cases ranging from self-driving cars to drug discovery and God knows what else. AI and ML have a bright and thriving future ahead of them.

On the other hand, Docker revolutionized the computing world through the introduction of ephemeral, lightweight containers. Containers basically package all the software required to run inside an image (a bunch of read-only layers) with a COW (Copy On Write) layer to persist the data.

Enough talk, let's get started with building a Python data science container.

#### Python Data Science Packages

Our Python data science container makes use of the following super cool Python packages:

1. NumPy: NumPy, short for Numerical Python, supports large, multi-dimensional arrays and matrices. It provides fast precompiled functions for mathematical and numerical routines. In addition, NumPy offers powerful data structures for efficient computation on multi-dimensional arrays and matrices.


2. SciPy: SciPy provides useful functions for regression, minimization, Fourier transforms, and many more. Based on NumPy, SciPy extends its capabilities. SciPy's main data structure is again the multidimensional array, implemented by NumPy. The package contains tools that help with linear algebra, probability theory, integral calculus, and many more tasks.


3. Pandas: Pandas offers versatile and powerful tools for manipulating data structures and performing extensive data analysis. It works well with incomplete, unstructured, and unordered real-world data, and comes with tools for shaping, aggregating, analyzing, and visualizing datasets.


4. Scikit-learn: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. It is one of the best-known machine learning libraries for Python. The package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. The primary emphasis is on ease of use, performance, documentation, and API consistency. With minimal dependencies and easy distribution under the simplified BSD license, Scikit-learn is widely used in academic and commercial settings. It exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems.


5. Matplotlib: Matplotlib is a Python 2D plotting library capable of producing publication-quality figures in a wide variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.


6. NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.


#### Building the Data Science Container

Python is fast becoming the go-to language for data scientists, and for this reason we are going to use Python as the language of choice for building our data science container.


##### The Base Alpine Linux Image

Alpine Linux is a tiny Linux distribution designed for power users who appreciate security, simplicity and resource efficiency.


As claimed by Alpine:


> Small. Simple. Secure. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

The Alpine image is surprisingly tiny, with a size of no more than 8MB for containers, and it ships with minimal packages installed to reduce the attack surface of the underlying container. This makes Alpine an image of choice for our data science container.


```
$ docker container run --rm alpine:latest cat /etc/os-release
```

In our Dockerfile, we can simply use the Alpine base image as:

```
FROM alpine:latest
```

##### Talk is cheap, let's build the Dockerfile

Now let's work our way through the Dockerfile.

The `FROM` directive is used to set `alpine:latest` as the base image. Using the `WORKDIR` directive, we set `/var/www` as the working directory for our container. `ENV PACKAGES` lists the software packages required for our container, like `git`, `blas` and `libgfortran`. The Python packages for our data science container are likewise defined through an `ENV` directive.

We have combined all the commands under a single Dockerfile `RUN` directive to reduce the number of layers, which in turn helps in reducing the resultant image size.

##### Building and tagging the image

Now that we have our Dockerfile defined, navigate to the folder with the Dockerfile using the terminal and build the image using the following command:

```
$ docker build -t faizanbashir/python-datascience:2.7 -f Dockerfile .
```

The `-t` flag is used to name and tag the image in the 'name:tag' format. The `-f` flag is used to define the name of the Dockerfile (default is 'PATH/Dockerfile').
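If you would rather script the build, the `-t` and `-f` flags described above map directly onto an argument list. The following is only a minimal sketch: the helper names `docker_build_command` and `build_image` are hypothetical and not part of the original article, and the sketch assumes the `docker` CLI is installed and on your `PATH`.

```python
import subprocess


def docker_build_command(name, tag, dockerfile="Dockerfile", context="."):
    """Return the argument list for: docker build -t name:tag -f dockerfile context."""
    return [
        "docker", "build",
        "-t", "{}:{}".format(name, tag),  # image name and tag in 'name:tag' format
        "-f", dockerfile,                 # which Dockerfile to use
        context,                          # build context directory
    ]


def build_image(name, tag, dockerfile="Dockerfile", context="."):
    """Run the build; raises CalledProcessError if docker exits with a non-zero status."""
    subprocess.check_call(docker_build_command(name, tag, dockerfile, context))


if __name__ == "__main__":
    # Only print the command here; call build_image(...) to actually run it.
    print(" ".join(docker_build_command("faizanbashir/python-datascience", "2.7")))
```

Passing the command as a list rather than a single string avoids shell quoting issues if the image name or path ever contains unusual characters.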


##### Running the container

We have successfully built and tagged the Docker image; now we can run the container using the following command:

```
$ docker container run --rm -it faizanbashir/python-datascience:2.7 python
```

Voila, we are greeted by the sight of a Python shell ready to perform all kinds of cool data science stuff.

```
Python 2.7.15 (default, Aug 16 2018, 14:17:09)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Our container comes with Python 2.7, but don't be sad if you wanna work with Python 3.6. Lo, behold the Dockerfile for Python 3.6:

https://gist.github.com/faizanbashir/9443a7149cc53f81d84d0d356f871ec7#file-datascience-python3-6-dockerfile

Build and tag the image as before, this time with the tag `faizanbashir/python-datascience:3.6`, and then run the container like so:

```
$ docker container run --rm -it faizanbashir/python-datascience:3.6 python
```

With this, you have a ready-to-use container for doing all kinds of cool data science stuff.
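A quick way to confirm that the container really provides the six packages listed earlier is to try importing each of them. The script below is a small sketch rather than part of the original article: the file name `check_packages.py` and the function `check_packages` are hypothetical, and it sticks to syntax that should work on both Python 2.7 and Python 3.

```python
import importlib

# Import names of the six packages described in this article.
# Note that scikit-learn is imported under the name "sklearn".
PACKAGES = ["numpy", "scipy", "pandas", "sklearn", "matplotlib", "nltk"]


def check_packages(names):
    """Try to import each module name; map name -> True if the import succeeded."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # Print one line per package so a missing one is easy to spot.
    for name, ok in sorted(check_packages(PACKAGES).items()):
        print("{:<12} {}".format(name, "OK" if ok else "MISSING"))
```

One way to run it, assuming the working directory inside the image is `/var/www` as described above, is to mount the folder containing the script there with `-v "$PWD":/var/www` and pass `python check_packages.py` as the container command.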

#### Serving Puddin'

All of this assumes you have the time and resources to set up this stuff yourself. In case you don't, you can pull the existing images that I have already built and pushed to Docker Hub, Docker's registry, using:

```
# For Python 2.7 pull
$ docker pull faizanbashir/python-datascience:2.7

# For Python 3.6 pull
$ docker pull faizanbashir/python-datascience:3.6
```

After pulling the images, you can use them directly, extend them in your own Dockerfile, or reference them as the image in your docker-compose or stack file.

#### Aftermath

The world of AI and ML is getting pretty exciting these days and will continue to become even more exciting. Big players are investing heavily in these domains. It's about time you started to harness the power of data; who knows, it might lead to something wonderful.

You can check out the code here.

I hope this article helped you build containers for your data science projects. Clap if it increased your knowledge, and help it reach more people.
