How I Use Vagrant and Docker in Consultancy Projects

By Doug Ashton – Data Scientist, UK

Just like you, I like to try out all the latest tech. If there’s a new feature in Shiny then I’ll download the latest version without thinking. I’ve currently got 4 versions of R on my laptop, 270 packages, 2 versions of Java, and a number of other open source tools. While being on the cutting edge is part of my job, it conflicts with the strict audit and reproducibility requirements that we have for project work.

One problem with R is that, due to the fast-changing nature of CRAN, it can be difficult to maintain a consistent combination of packages across your team and production servers. The R community has responded to this problem with a number of noteworthy packages for managing package libraries, such as packrat, checkpoint, switchr and our own pkgsnap. Another approach is to use the MRAN mirror to freeze CRAN to a particular date.

A bigger problem is how R interacts with the various system dependencies you have installed. At Mango this is why we use continuous integration and unit testing to make sure our results are reproducible on dedicated build servers. Even this can leave you scratching your head when tests don’t match.

All this led us to look for a better way of working. We needed an environment that was easily reproducible, and more in line with the production environment we are deploying to. We’ve already been using Docker for some time so this was the natural choice.

Docker

As described in a previous post, Docker is designed to provide an isolated, portable and repeatable wrapper around your applications. We use this in a number of ways:

1. Reproducible environments

Each project can run inside its own container, completely sandboxed from the rest of your system. We have a number of base images, each built on a specific R version and provisioned with a standard set of packages (using our pkgsnap package) and RStudio Server. Each project can build on one of these images and add its own package dependencies. The recipe to build this image is stored in a Dockerfile that can be saved in the project directory. An example project Dockerfile is shown in this demonstration.
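As a rough illustration of what such a project Dockerfile might look like, here is a minimal sketch. It is not one of our internal images: the public `rocker/rstudio` base image, its tag, the package names and the target directory are all assumptions chosen for the example, and it installs packages with plain `install.packages` rather than pkgsnap.

```dockerfile
# Hypothetical project image. The base image and tag are assumptions,
# standing in for a base image built on a fixed R version with RStudio Server.
FROM rocker/rstudio:4.3.2

# Project-specific R packages on top of what the base image provides.
# The package names are placeholders for the project's real dependencies.
RUN Rscript -e "install.packages(c('jsonlite', 'R6'), repos = 'https://cloud.r-project.org')"

# Copy the project into the home directory of the image's default user.
COPY . /home/rstudio/project
```

Running `docker build -t my-project .` in the project directory builds the image, where `my-project` is an arbitrary name. Note that this sketch fixes the R version through the base image tag but does not by itself pin package versions.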

2. System dependencies

If there are system dependencies such as database connections or external libraries, then building an image with these installed makes it much easier to distribute the project to others. This also makes Docker a great way of trying a new technology without the pain of installing it on your system. For example, the excellent jupyter/all-spark-notebook image has everything you need to get started with Spark from R, Python or Scala.
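To make the idea concrete, the sketch below bakes a system library into an image alongside the R package that needs it. The choice of PostgreSQL, the `libpq-dev` system package, the `RPostgres` R package and the `rocker/rstudio` base image are assumptions for illustration only, not a description of any particular project.

```dockerfile
# Hypothetical example: an image with a database client library preinstalled.
# The base image and tag are assumptions.
FROM rocker/rstudio:4.3.2

# System-level dependency: PostgreSQL client headers and library.
# Cleaning the apt lists afterwards keeps the image smaller.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# R package that links against the system library installed above.
RUN Rscript -e "install.packages('RPostgres', repos = 'https://cloud.r-project.org')"
```

Anyone who pulls or builds this image gets the library and the package together, so nobody has to install the system dependency by hand.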

3. Scalability

Once you’re used to working in containers, it can significantly lower the barrier to scaling up the compute power when needed. Your container will work just the same on your laptop and on a 32-core EC2 instance. You just spin up a node, pull the image and deploy your application. Multiple containers from the same image can be spawned across a grid in seconds, and a small-scale Spark cluster can be swapped out for a much larger one.

Vagrant

For larger software development projects we also use Vagrant as a tool for reproducible development environments. As described in an earlier post, Vagrant is a set of command line tools for managing virtual machines (VMs). It creates a dedicated VM for each project that is consistent across the development team, and adds only a small file to version control.

More resources

Original post: https://www.pybloggers.com/2015/12/how-i-use-vagrant-and-docker-in-consultancy-projects/
