Dockerless Notebook: The Long-Awaited Future of Data Science

weixin_26746401

Original article: https://towardsdatascience.com/dockerless-notebook-the-long-awaited-future-of-data-science-7cde7707f7ff

Data science is hard. Data scientists spend hours figuring out how to install a Python package on their laptops. They read through pages of Google search results to connect to a database. They write detailed documents so that engineers can deploy machine learning models into production. They prepare polished slides to convince business teammates how to improve retention rates. They worry that their data pipelines will break and cause data quality issues.

The challenge of data science is real. There are steep learning curves for unfamiliar languages. There are business-impact requirements that no one knows how to meet in the limited time available. There are engineering best practices to follow to ensure the quality of deliverables. And engineering support for the data science team is limited.

What problems do docker containers solve?

For individual data scientists and other data team members, setting up a development environment and keeping the operating environment consistent is a frustrating experience. Installation instructions often do not cover every required dependency. Some GPU-based AI libraries require data scientists to be familiar with low-level details of the hardware. Error messages are often not informative enough to explain their cause. Dependency conflicts between libraries make it hard to maintain a working development environment for multiple projects. Collaboration between data scientists and engineers requires extra, unnecessary work from both sides.

How about Python virtual environments?

Admittedly, Python virtual environments work nicely for some data scientists. However, they do not meet the diverse requirements of data science tasks:

It has become more common for data scientists to use Spark, R, and SQL daily. How can a Python virtual environment work for languages and frameworks other than Python?
Some data scientists work mainly with their engineering teammates to deploy machine learning models to production. How does a Python virtual environment help if the dependency is on the operating system rather than on a Python library?

The birth of conda alleviates these two issues, and conda is indeed quite popular in the data science community. Installing conda itself is not difficult, and it ships environments with many common data science packages.

However, not all packages available through pip are available on conda. If a package cannot be found on conda, data scientists may have to use pip alongside conda, which is a major source of confusion and unexpected issues. For example, in this unresolved GitHub issue, there are many arguments over how pip works with conda.
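One way to see how mixed an environment has become is to check which tool each package was installed by. The snippet below is only a minimal sketch: it relies on the optional `INSTALLER` file that installers may write into a package's metadata directory, which pip writes and conda typically writes as well, but which is not guaranteed to exist. The function name `installer_of` and the package names are illustrative.

```python
from importlib import metadata
from typing import Optional


def installer_of(package: str) -> Optional[str]:
    """Return the tool recorded as the installer of `package`, or None if unknown."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return None
    # INSTALLER is an optional metadata file, so it may be missing.
    recorded = dist.read_text("INSTALLER")
    return recorded.strip() if recorded else None


if __name__ == "__main__":
    # Example package names only; any installed distribution name works.
    for name in ("numpy", "pandas", "requests"):
        print(name, "->", installer_of(name))
```

A `None` result means either that the package is not installed or that its installer did not record itself, so treat the output as a hint rather than a complete audit.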

Ironically, the VP of Anaconda once gave a talk titled “Conda, Docker, and Kubernetes: The cloud-native future of data science”. Solving 99% of environment-related issues is not enough: it is the remaining 1% that makes the developer experience unacceptable.

How does a docker container help?

Loosely speaking, a docker container is a “lightweight virtual machine” that packages everything needed to run an application into one docker image. A docker image is designed to move between servers while guaranteeing that the environments stay consistent.

As a result, data scientists no longer need to worry about broken dependencies when deploying machine learning models into production. The new graduate who was onboarded last week can start contributing to the team as soon as the docker container is running, rather than secretly searching for new positions at companies that have better infrastructure set up for data science teams.

Why are docker containers not popular among the data science community?

Docker is not a new technology at all, so why have the majority of data scientists not adopted it? There are two main reasons:

The learning curve is steep.
The developer experience is bad.

To get started with docker containers, one has to learn at least how to:

start/stop a container
attach a shell to a running container
mount a local volume to a container
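For reference, the three tasks above map onto standard docker CLI commands. The sketch below only builds the command lines instead of running them, so it works without docker installed. The container name, image name, and paths are placeholders chosen for illustration.

```python
import shlex
from typing import List


def run_cmd(name: str, image: str, host_dir: str, container_dir: str) -> List[str]:
    """Build `docker run`: start a detached container with a local directory mounted."""
    # host_dir should be an absolute path; otherwise docker treats it as a named volume.
    return ["docker", "run", "-d", "--name", name,
            "-v", f"{host_dir}:{container_dir}", image]


def attach_shell_cmd(name: str, shell: str = "/bin/bash") -> List[str]:
    """Build `docker exec`: open an interactive shell inside a running container."""
    return ["docker", "exec", "-it", name, shell]


def stop_cmd(name: str) -> List[str]:
    """Build `docker stop`: stop a running container."""
    return ["docker", "stop", name]


if __name__ == "__main__":
    # Placeholder values; substitute your own container name, image, and paths.
    commands = [
        run_cmd("ds-notebook", "jupyter/base-notebook",
                "/home/me/project", "/home/jovyan/work"),
        attach_shell_cmd("ds-notebook"),
        stop_cmd("ds-notebook"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

To actually execute a command, pass the list to `subprocess.run(cmd, check=True)` on a machine where docker is installed.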

In reality, these are not enough. How do I use sudo inside a container when I do not know the password? Why did my docker container lose all its data after it was stopped? How do I set up a private docker registry so that I can pull the docker image from my remote clusters? How can I kill the processes that are using port 8808?

When it comes to writing a Dockerfile, one has to be familiar with Linux shell commands and Dockerfile syntax. And if each project uses its own docker image, a data scientist may end up with more docker images to manage than a software engineer would.

So data scientists either have a hard time fixing environment-related issues, giving up reproducibility and suffering from bad engineering practice, or spend too much time learning and operating docker.

It is NOT data scientists’ job to take care of the environment

Data scientists should NOT spend time on environments, so that they can focus on what they are good at: building dashboards, developing machine learning models, and giving business teammates actionable insights.

Dockerless Notebook is the future

Imagine there is a smart and capable docker helper that does everything for you. When you start the notebook, it automatically starts the container and attaches it to the notebook. When you want to move your notebook to run on a remote cluster, it commits your local docker container, sends it to the remote cluster, and manages it automatically.
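To make the idea concrete, here is a purely hypothetical sketch of what such a helper might plan behind the scenes. The article does not describe an implementation, so the class `DockerHelper`, its method names, and the registry address are all assumptions. Only the underlying `docker start`, `docker commit`, and `docker push` commands are real, and the sketch builds them without executing anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DockerHelper:
    """Hypothetical helper that plans docker commands on the user's behalf."""
    container: str  # name of the local container backing the notebook
    image: str      # repository name to publish snapshots under
    registry: str   # registry reachable from the remote cluster (assumed)

    def on_notebook_start(self) -> List[List[str]]:
        # Restart the existing (stopped) container when the notebook opens.
        return [["docker", "start", self.container]]

    def move_to_remote(self, tag: str) -> List[List[str]]:
        # Snapshot the local container as an image, then publish it so the
        # remote cluster can pull the exact same environment.
        remote_image = f"{self.registry}/{self.image}:{tag}"
        return [
            ["docker", "commit", self.container, remote_image],
            ["docker", "push", remote_image],
        ]


if __name__ == "__main__":
    helper = DockerHelper("ds-notebook", "team/notebook", "registry.example.com")
    for cmd in helper.on_notebook_start() + helper.move_to_remote("v1"):
        print(" ".join(cmd))
```

The point of the sketch is the division of labor: the notebook triggers events, and the helper translates them into container operations that the data scientist never has to type.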

The idea of the “Dockerless notebook” is that it lets you develop and operate notebooks without thinking about docker containers. It is tightly integrated with the notebooks data scientists use every day. It eliminates the need to learn docker and to perform operating tasks such as starting and stopping containers, attaching a shell to containers, and mounting volumes to containers. You won’t even notice that docker is running on your laptop, just as you don’t notice how Jupyter Notebook exchanges data between the browser and memory.

The “Dockerless notebook” will help the data science community move closer to “reproducible data science” and “frictionless data science” without unacceptable costs.