Generalists vs. Specialists: Why Data Science Teams Need Generalists, Not Specialists

Editor’s note: Eric Colson brings a unique perspective to organizing data teams as Chief Algorithms Officer at Stitch Fix, enlisting full-stack data science generalists — an approach that tech professionals love debating. In this article, Eric describes the challenges and payoffs of employing generalists, not specialists.

In The Wealth of Nations, Adam Smith demonstrates how the division of labor is the chief source of productivity gains, using the vivid example of a pin factory assembly line: “One person draws out the wire, another straights it, a third cuts it, a fourth points it, a fifth grinds it.” With specialization oriented around function, each worker becomes highly skilled in a narrow task, leading to process efficiencies.

The allure of such efficiencies has led us to organize even our data science teams by specialty functions such as data engineers, machine learning engineers, research scientists, causal inference scientists, and so on. Specialists’ work is coordinated by a product manager, with hand-offs between the functions in a manner resembling the pin factory: “one person sources the data, another models it, a third implements it, a fourth measures it” and on and on.

The challenge with this approach is that data science products and services can rarely be designed up-front. They need to be learned and developed via iteration. Yet, when development is distributed among multiple specialists, several forces can hamper iteration cycles. Coordination costs (the time spent communicating, discussing, and justifying each change) scale proportionally with the number of people involved.

Even with just a few specialists, the cost of coordinating their work can quickly exceed any benefit from their division of labor. Even more nefarious are the wait times that elapse between the units of work performed by the specialists. Specialists’ schedules are difficult to align, so projects often sit idle waiting for specialist resources to become available. These two forces can impair iteration, which is critical to the development of data science products. Status updates like “waiting on ETL changes” or “waiting on ML Eng for implementation” are common symptoms that you have over-specialized.

Instead of organizing data scientists by specialty function, give each one end-to-end ownership of a different business capability. For example, one data scientist can build a product recommendation capability, a second can build a customer prospecting capability, and so on. Each data scientist would then perform all the functions required to develop that capability, from model training to ETL to implementation to measurement. Of course, these data scientist generalists have to perform their work sequentially rather than in parallel. However, doing the work typically takes just a fraction of the time a project would otherwise spend waiting for separate specialist resources to become available. So iteration and development time goes down, and learning and development are faster.
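
The trade-off described above can be put into rough numbers. The following is a minimal sketch, not a model from the article: the stage names come from the text, but every duration, the per-hand-off wait, the per-person coordination cost, and the generalist slowdown factor are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    work_days: float  # hands-on time a specialist needs for this stage


# Hypothetical stages and durations (illustrative assumptions only).
STAGES = [
    Stage("ETL", 1.0),
    Stage("model training", 2.0),
    Stage("implementation", 1.5),
    Stage("measurement", 0.5),
]


def specialist_cycle_days(stages, wait_days_per_handoff, coordination_days_per_person):
    """Length of one iteration when each stage belongs to a different specialist."""
    work = sum(s.work_days for s in stages)
    # Assumed queueing delay before each specialist becomes available.
    waiting = wait_days_per_handoff * len(stages)
    # Coordination cost grows in proportion to the number of people involved.
    coordination = coordination_days_per_person * len(stages)
    return work + waiting + coordination


def generalist_cycle_days(stages, slowdown=1.0):
    """Length of one iteration when one person does every stage in sequence.

    `slowdown` is an assumed penalty for being less practiced at each task.
    """
    return slowdown * sum(s.work_days for s in stages)


specialist = specialist_cycle_days(
    STAGES, wait_days_per_handoff=3.0, coordination_days_per_person=0.5
)
generalist = generalist_cycle_days(STAGES, slowdown=1.5)

print(f"specialist cycle: {specialist:.1f} days")
print(f"generalist cycle: {generalist:.1f} days")
```

Under these assumed inputs the generalist finishes an iteration sooner even while working 50% slower on each task, because no time is spent waiting or coordinating. With short waits and a large slowdown the comparison can flip, so the conclusion depends on the numbers for a given team.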

Many find this notion of full-stack data science generalists to be daunting. In particular, it’s the technical skills that most find so challenging to acquire, as many data scientists have not been trained as software engineers. However, much of the technical complexity can be abstracted away through a robust data platform. Data scientists can be shielded from the inner workings of containerization, distributed processing, automatic failover, etc. This allows the data scientists to focus more on the science side of things, learning and developing solutions through iteration.
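
As a toy illustration of that kind of shielding, the sketch below lets a data scientist write only the scoring logic while a platform-provided wrapper deals with failures. Everything here is an assumption made for the example: `platform_task`, `score_customers`, and the simulated failure are hypothetical names, not part of any real platform, and a simple in-process retry stands in for the failover a real platform would perform.

```python
import functools


def platform_task(max_attempts=3):
    """Hypothetical platform decorator: reruns a task when it raises RuntimeError.

    A real platform would also handle containers, scheduling, and distributed
    execution; this sketch covers only the retry behavior.
    """
    def decorate(fn):
        @functools.wraps(fn)
        def run(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as err:
                    last_error = err  # remember the failure and try again
            raise last_error  # every attempt failed
        return run
    return decorate


calls = {"n": 0}  # counts attempts so the example can simulate one failure


@platform_task(max_attempts=3)
def score_customers(scores):
    # The data scientist's code: only the science, no retry logic.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient failure")
    return sorted(scores, reverse=True)[:2]


top = score_customers([0.2, 0.9, 0.5])
print(top, "after", calls["n"], "attempts")
```

The design point is the separation of concerns: the function body contains no infrastructure code, so the person iterating on the model can change it without touching the retry logic.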

Eric Colson is Chief Algorithms Officer at Stitch Fix. For more than 18 years, he has led data-oriented teams that span algorithms & machine learning, Big Data & data warehousing, and analytics & business intelligence. Prior to Stitch Fix, Eric was Vice President of Data Science & Engineering at Netflix. He holds degrees in Information Systems and Economics.

Translated from: https://medium.com/oreillymedia/why-data-science-teams-need-generalists-not-specialists-955383d0bb1b
