Model Evolution: From Standalone Models to Model Factory

In Part 1 of this series we examined the key differences between software and models, and in Part 2 we explored the twelve traps of conflating models with software. Both articles focused on highlighting the issues but did not provide any solutions. In the next couple of articles we will focus on concrete practices for addressing these gaps.

The potential contribution of AI to the global economy and the importance of investing in AI are well recognized within the business and technical communities. In a recent CEO survey, more than 85% of the CEOs believed that AI will significantly change the way they do business. Although only 6% of those surveyed admitted to having enterprise-wide AI initiatives, nearly 20% of them plan to deploy AI enterprise-wide in the near term. One of the biggest challenges in enterprise-wide deployment is the time it takes to deploy models. In a recent survey, nearly 58% of the companies surveyed reported that it took 31 days or more to deploy models.

Model Evolution

As we have tracked and helped enterprises evolve from building simple analytical models to more sophisticated, continuously learning models embedded within larger transactional applications, we have seen three distinct phases.

Standalone Model Phase: In this phase, companies have typically deployed models on a standalone basis using command line interfaces and Jupyter notebooks. The models were also typically simpler, not interacting real-time with data sources or software applications. Companies were using these models within a single group in a larger functional or business unit. Also, the data scientists were typically the ones to scope, design, build, deploy and maintain the models. This phase was essential for companies to prove the value of AI/ML models and to learn the challenges of more widespread usage.

Prediction as a Service Phase: Once companies were able to demonstrate the value of these models, there was a demand to make the predictions from these models available to other groups within the enterprise. In addition, as multiple groups within large companies started building their own models, there was significant duplication of effort and skills, as well as heterogeneity of architectures, design patterns, and tools. This led to more of an engineering approach to model delivery, deployment, and monitoring. Advances in software engineering, including microservices architecture, Docker, and Kubernetes, were used to provide model predictions as a service to other software systems. For example, an NER (Named Entity Recognition) model could be developed once and provided as a REST API service to other web applications.
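
The NER-as-a-service idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production design: the `GAZETTEER` lookup is a hypothetical stand-in for a trained NER model, and the `/ner` path, the JSON shape, and the function names are assumptions made for this example.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained NER model: a tiny gazetteer lookup.
GAZETTEER = {"google": "ORG", "pwc": "ORG", "london": "LOC"}


def extract_entities(text):
    """Return a list of {"text", "label"} dicts for tokens found in the gazetteer."""
    entities = []
    for token in re.findall(r"[A-Za-z]+", text):
        label = GAZETTEER.get(token.lower())
        if label:
            entities.append({"text": token, "label": label})
    return entities


def predict(request_body: bytes) -> bytes:
    """Decode a JSON request, run the model, and encode a JSON response."""
    payload = json.loads(request_body)
    result = {"entities": extract_entities(payload["text"])}
    return json.dumps(result).encode("utf-8")


class NerHandler(BaseHTTPRequestHandler):
    """Expose the model over HTTP so other applications can call it."""

    def do_POST(self):
        if self.path != "/ner":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = predict(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Build the HTTP server; the caller decides when to start serving."""
    return HTTPServer(("127.0.0.1", port), NerHandler)
```

Calling `make_server().serve_forever()` would start the service locally. Keeping `predict` separate from the HTTP handler means the same model code can be tested without a network and later packaged in a container.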

Model Factory Phase: More advanced companies are moving to a factory model where hundreds (if not thousands) of models are deployed. Automated CI/CD (Continuous Integration/Continuous Deployment) pipelines are being combined with CL (Continuous Learning) pipelines. This lets companies automate data ingestion as well as periodic or continuous retraining of models, and gives them flexible deployment strategies with automated or semi-automated retraining. It also addresses some of the significant challenges of deployment time and cost that we discussed earlier. Automated deployment further opens up the opportunity to automate the continuous monitoring of models, which is essential in addressing the return realization trap that we discussed in Part 2 of this series. Finally, integrated, end-to-end model lineage tracking, together with experiment and inference logging, enables factory-like operation of models.
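
One way to picture a continuous learning step is as a loop that retrains on fresh data, evaluates the candidate, and promotes it only if it beats the model currently in production. The sketch below assumes a deliberately trivial "model" (a constant prediction equal to the training mean) and hypothetical names such as `ModelRegistry` and `continuous_learning_step`; a real pipeline would plug in actual training, evaluation, and deployment tooling.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class ModelVersion:
    version: int
    parameter: float    # the "trained" model: a single constant prediction
    holdout_mae: float  # mean absolute error measured at training time


@dataclass
class ModelRegistry:
    versions: List[ModelVersion] = field(default_factory=list)
    production: Optional[ModelVersion] = None


def train(values):
    """Toy training routine: the model predicts the mean of its training data."""
    return mean(values)


def mae(parameter, values):
    """Mean absolute error of a constant prediction on the given values."""
    return mean(abs(v - parameter) for v in values)


def continuous_learning_step(registry, train_values, holdout_values):
    """Retrain on fresh data; promote the candidate only if it beats production."""
    parameter = train(train_values)
    candidate = ModelVersion(
        version=len(registry.versions) + 1,
        parameter=parameter,
        holdout_mae=mae(parameter, holdout_values),
    )
    registry.versions.append(candidate)  # every experiment is logged

    current = registry.production
    # Re-score the production model on the same fresh holdout data.
    current_mae = mae(current.parameter, holdout_values) if current else float("inf")
    if candidate.holdout_mae < current_mae:
        registry.production = candidate
        return True
    return False
```

Every candidate is recorded in the registry whether or not it is promoted, which is the property that makes lineage tracking and later audits possible.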

As an organization progresses through these three phases, three things change. First, the number of models it deploys and the sophistication of those models increase; essentially, the scale and scope of model building, deployment, and monitoring grow. Second, data ingestion moves from batch mode to streaming and real time, which accommodates the increase in volume, variety, and velocity of data. Data versioning gets integrated with model versioning to enable rapid experimentation and retraining of models. Third, software applications encapsulate traditional software (Software 1.0 code) as well as machine learning models (Software 2.0), enabling truly ‘smart software’ for business problems.
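
Integrating data versioning with model versioning can be as simple as storing a content hash of the training data next to each model's metadata. The following sketch assumes small in-memory datasets and hypothetical names (`dataset_fingerprint`, `lineage_record`); dedicated tools handle large files and remote storage, but the underlying idea is the same.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Content hash of a dataset: identical rows in the same order give the same ID."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def lineage_record(model_name, code_version, rows, hyperparameters):
    """Tie a trained model to the exact code, data, and settings that produced it."""
    return {
        "model": model_name,
        "code_version": code_version,
        "data_version": dataset_fingerprint(rows),
        "hyperparameters": hyperparameters,
    }


rows_v1 = [{"amount": 120, "label": 0}, {"amount": 9800, "label": 1}]
rows_v2 = rows_v1 + [{"amount": 45, "label": 0}]  # new data arrived

record_v1 = lineage_record("fraud-scorer", "abc123", rows_v1, {"depth": 3})
record_v2 = lineage_record("fraud-scorer", "abc123", rows_v2, {"depth": 3})
```

Here the two records share the same code version and hyperparameters but carry different `data_version` values, so an experiment can be traced back to, and rerun on, exactly the data it was trained on.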

Figure: Model Evolution: From Standalone Models to Model Factory (Source: PwC Analysis)

Emerging Roles

The increasing scale and sophistication across the combination of software, models, and data has led to the emergence of new roles. Software development has had roles such as business analyst, systems analyst, architect, developer, tester, and development-operations (DevOps). On closer examination, these roles reflect the scoping, design, development, operations, and maintenance phases of the software life cycle. With the emergence of machine learning models and the paradigm of Software 2.0, we see a number of new skills and roles.

The role of the data scientist emerged during the standalone phase of model evolution, when a combination of three skills (subject matter or domain expertise, mathematics and statistics, and computer science and big data) was required. A 2012 article that called data scientist the sexiest job of the 21st century quoted Hal Varian, the Chief Economist at Google:

The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?

As we started moving from the first, foundational phase of models to the second phase of prediction as a service, there was a need for someone to scale and optimize the proof-of-concept models developed by data scientists. This led to the emergence of Machine Learning (ML) Engineers. Jeff Hale describes the role of ML engineers as follows:

Machine learning engineers take the models data scientists have created that show predictive promise and turn them into code that performs well in production.

With more ML models being developed, scaled, and deployed, maintaining these models manually was becoming infeasible. As organizations started moving towards the third phase, the Model Factory, we saw the emergence of yet another role: Machine Learning Operations (MLOps for short). MLOps specialists seek to deploy and maintain ML systems in production reliably and efficiently.
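
A small part of what maintaining ML systems in production can involve is an automated check that live inputs still resemble the training data. The sketch below is one simple, assumed approach (a z-score test on the mean of a single numeric feature); the function name `drift_alert` and the threshold of 3.0 are illustrative choices, not an established standard.

```python
from statistics import mean, stdev


def drift_alert(baseline, live, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` standard errors
    away from the baseline (training-time) mean."""
    spread = stdev(baseline)
    if spread == 0:
        # A constant baseline: any change in the live mean counts as drift.
        return mean(live) != mean(baseline)
    standard_error = spread / len(live) ** 0.5
    z_score = abs(mean(live) - mean(baseline)) / standard_error
    return z_score > threshold


baseline_feature = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]  # seen during training
stable_batch = [10, 11, 9, 10]    # live data that looks like training data
shifted_batch = [15, 16, 14, 15]  # live data after an upstream change
```

With these values, `drift_alert(baseline_feature, stable_batch)` is false and `drift_alert(baseline_feature, shifted_batch)` is true. In a factory setting a check like this would run on a schedule for every deployed model and could trigger retraining or a human review.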

In the Model Factory phase we are seeing the integration of data, software, and models. Machine learning pipelines integrate data and code. Cristiano Breuel, in his paper MLOps: Machine Learning as an Engineering Discipline, elaborates on this point:

The root cause is that there’s a fundamental difference between ML and traditional software: ML is not just code, it’s code plus data. An ML model, the artifact that you end up putting in production, is created by applying an algorithm to a mass of training data, which will affect the behavior of the model in production. Crucially, the model’s behavior also depends on the input data that it will receive at prediction time, which you can’t know in advance.
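
The "code plus data" point in the quote above can be made concrete with a toy sketch: the training code below is identical in both runs, yet the resulting models disagree on the same input because they were trained on different data. The threshold classifier and the two datasets are invented purely for illustration.

```python
from statistics import mean


def train_threshold(examples):
    """Learn a decision threshold halfway between the two class means."""
    positives = [x for x, label in examples if label == 1]
    negatives = [x for x, label in examples if label == 0]
    return (mean(positives) + mean(negatives)) / 2


def predict(threshold, x):
    """Classify x as 1 if it is at or above the learned threshold, else 0."""
    return 1 if x >= threshold else 0


# Same code, two different training sets of (value, label) pairs.
data_a = [(1, 0), (2, 0), (8, 1), (9, 1)]
data_b = [(1, 0), (2, 0), (4, 1), (5, 1)]

model_a = train_threshold(data_a)  # threshold 5.0
model_b = train_threshold(data_b)  # threshold 3.0

same_input = 4
prediction_a = predict(model_a, same_input)  # 0
prediction_b = predict(model_b, same_input)  # 1
```

Nothing in the source code changed between the two runs, so a code review or unit test of `train_threshold` alone would not reveal the difference in behavior. This is why data needs to be versioned and monitored alongside code.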

In summary, four key dimensions are coming together: software engineering, software operations, statistics and machine learning, and data management. Data Engineers are at the intersection of software engineering and data management. ML Engineers are at the intersection of software engineering and statistics and machine learning. Data Scientists are at the intersection of statistics and machine learning, data management, and domain expertise (not shown in the diagram below). Finally, MLOps specialists are at the intersection of software operations, statistics and ML, and data management.

Figure: Four Emerging Roles at the Intersection of Software, Statistics & ML, and Data Management (Source: PwC Analysis)

Each of these four roles has a very specific skill set. The detailed role descriptions and skill sets are given below.

Figure: Primary Role and Skill Set for Four Emerging Roles (Source: PwC Analysis)

Summary

In this blog we traced the evolution of how software, models, and data are coming together to create a powerful approach to solving business problems. Being conscious of the differences between software and models (Part 1) allows us to recognize the unique skills required to address some of the traps (Part 2) in building intelligent software. While necessary, they are not sufficient to address all the challenges that we have outlined so far in bringing software, models, and data together. In the next few blogs we will address issues related to the development methodology and model lifecycle.

Authors: Anand S. Rao, Joseph Voyles and Shinan Zhang

Original article: https://medium.com/@AnandSRao/model-evolution-from-standalone-models-to-model-factory-5a8e01fa03cb
