



  • I have encountered a lot of resistance in the data science community to the agile methodology, and to the Scrum framework specifically;

  • I don’t see it this way and claim that most disciplines would improve by adopting an agile mindset;

  • We will go through a typical Scrum sprint to highlight the compatibility of the data science process and the agile development process;

  • Finally, we discuss when Scrum is not an appropriate process to follow: for example, if you are a consultant working on many projects at a time, or if your work requires deep concentration on a single narrow issue (narrow enough that you alone can solve it).

I recently found a Medium post which claims that Scrum is awful for data science. I’m afraid I have to disagree and would like to make a case for Agile Data Science.

The ideas in this post are significantly influenced by the Agile Data Science 2.0 book (which I highly recommend) and by personal experience. I am eager to hear about other experiences, so please share them in the comments.

First, we need to agree on what data science is and how it solves business problems so we can investigate the process of data science and how agile (and specifically Scrum) can improve it.


What is Data Science?

There are countless definitions online. For example, Wikipedia gives this description:

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.


In my opinion, it is quite an accurate definition of what data science tries to accomplish. But I would simplify this definition further.

Data Science solves business problems by combining business understanding, data and algorithms.


Compared to the definition in Wikipedia, I would like to stress that data scientists should aim to solve business problems rather than “extract knowledge and insights.”

How Does Data Science Solve Business Problems?

So data science is here to solve business problems. We need to accomplish a few things along the way:

  1. Understand the business problem;

  2. Identify and acquire available data;

  3. Clean / transform / prepare data;

  4. Select and fit an appropriate “model” for the given data;

  5. Deploy the model to “production”, which is our attempt at solving the given problem;

  6. Monitor performance.
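
To make these steps concrete, here is a deliberately tiny sketch in Python. Everything in it is an assumption made for illustration: the records are invented, the field names (`amount`, `is_fraud`) are hypothetical, and the "model" is just a cutoff on the transaction amount. A real project would use real data and a real learning algorithm, but the shape of the work is the same.

```python
from statistics import mean

# Step 2: "acquire" data. These toy records are invented for illustration.
raw = [
    {"amount": "12.5", "is_fraud": "0"},
    {"amount": "980.0", "is_fraud": "1"},
    {"amount": "33.0", "is_fraud": "0"},
    {"amount": "", "is_fraud": "0"},  # missing amount, dropped in step 3
    {"amount": "1500.0", "is_fraud": "1"},
    {"amount": "47.9", "is_fraud": "0"},
]


def prepare(rows):
    """Step 3: drop incomplete rows and cast strings to numbers."""
    return [
        {"amount": float(r["amount"]), "is_fraud": int(r["is_fraud"])}
        for r in rows
        if r["amount"] != ""
    ]


def fit(rows):
    """Step 4: a naive 'model', the midpoint between the mean
    fraudulent amount and the mean legitimate amount."""
    fraud = mean(r["amount"] for r in rows if r["is_fraud"] == 1)
    legit = mean(r["amount"] for r in rows if r["is_fraud"] == 0)
    return (fraud + legit) / 2


def predict(cutoff, amount):
    """Step 5: the function that would be deployed to score transactions."""
    return 1 if amount > cutoff else 0


def accuracy(cutoff, rows):
    """Step 6: monitoring, here just the share of correct predictions."""
    hits = sum(predict(cutoff, r["amount"]) == r["is_fraud"] for r in rows)
    return hits / len(rows)


clean = prepare(raw)
cutoff = fit(clean)
print(f"cutoff={cutoff:.2f} accuracy={accuracy(cutoff, clean):.2f}")
```

Step 1, understanding the business problem, has no code on purpose: it happens in conversations with stakeholders before any of the functions above are written.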


As with everything, there are countless ways to go about implementing those steps, but I will try to persuade you that the agile (incremental and iterative) approach brings the most value to the company and the most joy to data scientists.


Agile Data Science Manifesto

I took this from page 6 in the Agile Data Science 2.0 book, so you are encouraged to read the original, but here it is:


  • Iterate, iterate, iterate — tables, charts, reports, predictions.

  • Ship intermediate output. Even failed experiments have output.

  • Prototype experiments over implementing tasks.

  • Integrate the tyrannical opinion of data in product management.

  • Climb up and down the data-value pyramid as you work.

  • Discover and pursue the critical path to a killer product.

  • Get meta. Describe the process, not just the end state.

Not all the steps are self-explanatory, and I encourage you to go and read what Russell Jurney had to say, but I hope that the main idea is clear: we share intermediate output, and we iterate to achieve value.

Given the above preliminaries, let us go over a standard week for a Scrum team. We will assume a one-week sprint.

Scrum Team Sprint

Day 1

There are many sprint structure variations, but I will assume that planning is done on Monday morning. The team decides which user stories from the product backlog will be transferred to the Sprint backlog. The most pressing issue for our business, as evident from the backlog ranking, is customer fraud: fraudulent transactions are driving our valuable customers off our platform. During the previous backlog refinement session, the team already discussed this task, and the product owner got additional information from the Fraud Investigation team. So during the meeting, the team decides to start with a simple experiment (and is already thinking of interesting iterations further down the road): an initial model based on simple features of the transaction and the participating users. Work is split so that the data scientist can go and have a look at the data the team identified for this problem. The data engineer will set up the pipeline for integrating model output into the DWH systems, and the full-stack engineer starts to set up a transaction review page and an alert system for the Fraud Investigation team.

Day 2

At the start of Tuesday, the whole team gathers to share progress. The data scientist shows a few graphs which indicate that even with limited features, we will have a decent model. At the same time, the data engineer is already halfway through setting up the system to score incoming transactions with the new model. The full-stack engineer is also progressing nicely, and after just a few minutes, everyone is back at their desk working on the agreed tasks.

Day 3

As on Tuesday, the team starts Wednesday with a standup meeting to share their progress. There is already a simple model built, along with some accuracy and error rate numbers. The data engineer shows the infrastructure for the transaction scoring, and the team discusses how the features arrive at the system and what needs to be done for them to be ready for the algorithm. The full-stack engineer shows the admin panel where transaction metadata is displayed, along with the triggering mechanism. Another discussion follows on the threshold value at which the model output triggers a message to a fraud analyst. The team agrees that we need to be able to adjust this value, since different models might have different score distributions, and, depending on other variables, we might want to increase or decrease the number of approved transactions.
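
The adjustable threshold the team agrees on could look something like the sketch below. The names `AlertConfig` and `needs_review` are hypothetical, the scores are invented, and I assume the model outputs a fraud score between 0 and 1. The point is only that the cutoff lives in configuration rather than being hard-coded next to the model.

```python
from dataclasses import dataclass


@dataclass
class AlertConfig:
    # Hypothetical setting; in practice it might live in a config store
    # so that it can be changed without redeploying the model.
    threshold: float = 0.8


def needs_review(score: float, config: AlertConfig) -> bool:
    """Return True when a fraud score should be sent to an analyst."""
    return score >= config.threshold


scores = [0.10, 0.55, 0.81, 0.97]  # invented model outputs in [0, 1]

strict = AlertConfig(threshold=0.9)   # fewer alerts, more approved transactions
lenient = AlertConfig(threshold=0.5)  # more alerts, fewer approved transactions

strict_alerts = [s for s in scores if needs_review(s, strict)]
lenient_alerts = [s for s in scores if needs_review(s, lenient)]
print(strict_alerts)
print(lenient_alerts)
```

Swapping in a new model with a different score distribution then only requires choosing a new `threshold`, not changing the alerting code.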

Day 4

On Thursday, the team already has all the pieces, and during the standup they discuss how to integrate them. The team also outlines how best to monitor models in production, so that model performance can be evaluated and degradation can be detected before it causes any real damage. They agree that a simple dashboard for monitoring accuracy and error rates will suffice for now.
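
As a rough idea of what could sit behind such a dashboard, here is a minimal sketch that computes accuracy and error rates from labelled outcomes and flags degradation against a baseline. The function names, the example labels, and the 0.05 tolerance are all assumptions for illustration; the team in this story has not decided on any of them.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def daily_metrics(y_true, y_pred):
    """Metrics a simple monitoring dashboard might plot once per day."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


def degraded(metrics, baseline_accuracy, tolerance=0.05):
    """Flag the model when accuracy falls more than `tolerance` below baseline."""
    return metrics["accuracy"] < baseline_accuracy - tolerance


# Invented outcomes: analyst-confirmed labels vs. model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

metrics = daily_metrics(y_true, y_pred)
print(metrics)
print("degraded:", degraded(metrics, baseline_accuracy=0.9))
```

For fraud, where positives are rare, the false negative rate usually deserves more attention than raw accuracy, which is one reason to track the error rates separately.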

Day 5

Friday is a demo day. During standup, the team discusses the last issues remaining with the first iteration of the transaction fraud detection. Team members prepare for the meeting with the fraud analysts that will be using this solution.

During the demo, the team shows what they have built for the fraud analysts. The team presents performance metrics and their implications for the fraud analysts. All feedback is converted to tasks for future sprints.

Another vital part of the Sprint is the retrospective, a meeting where the team discusses three things:

  1. What went well in the Sprint;

  2. What could be improved;

  3. What we will commit to improving in the next Sprint.


Further down the road

During the next Sprint, the team works on the next most important item from the product backlog. It might be feedback from the fraud analysts, or it might be something else that the product owner thinks will improve the overall business the most. Meanwhile, the team closely monitors the performance of the initial version of the solution. It will continue to do so because ML solutions are sensitive to changes in the underlying assumptions that the model made about the data distribution.
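
One way to watch for such changes is to compare the distribution of a feature at training time with what the model sees in production. The sketch below uses the Population Stability Index (PSI), which is my choice for illustration and not something taken from the book or prescribed by Scrum. The bin edges and sample values are invented, and the 0.25 alarm level is only a commonly cited rule of thumb.

```python
import math


def bin_fractions(values, edges):
    """Share of values per bin; `edges` are the upper bounds of all bins but the last."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v > e for e in edges)  # number of edges strictly below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Empty bins are floored at `eps` so the logarithm stays defined.
    """
    e = [max(f, eps) for f in bin_fractions(expected, edges)]
    a = [max(f, eps) for f in bin_fractions(actual, edges)]
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [25, 50, 75]  # invented bin boundaries for transaction amounts
training = [10, 20, 30, 40, 50, 60, 70, 80]
live_same = list(training)
live_shifted = [60, 70, 80, 90, 100, 110, 120, 130]

print(psi(training, live_same, edges))     # identical samples give 0.0
print(psi(training, live_shifted, edges))  # a large value signals drift
```

A check like this can run on a schedule and feed the same dashboard as the accuracy metrics, so that a shift in the inputs is noticed even before labelled outcomes arrive.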

Discussion

Above is a relatively “clean” exposition of the Scrum process for data science solutions. The real world is rarely that way, but I wanted to convey a few points:

  1. Data Science cannot stand on its own. If we’re going to impact the real world, we have to collaborate in a cross-functional team, so data science should be part of a wider team;

  2. Iteration is critical in data science, and we should expose artifacts of those iterations to our stakeholders to receive feedback as fast as possible;

  3. Scrum is a framework that is designed for iterative progress. Therefore it is a perfect fit for data science work;

However, Scrum is not a framework for every endeavor. If your job requires you to think deeply for days, then Scrum and agile would probably be very disruptive and counterproductive. Also, if your work requires you to handle many small, unrelated data science tasks, following Scrum would be inappropriate, and Kanban may be worth considering instead. However, typical product data science work is not like that. Iteration is king, and getting feedback fast is key to providing the right solutions to business problems.

In summary

Data Science is a perfect fit for Scrum with a single modification: we do not expect to ship finished models. Instead, we ship artifacts of our work and solicit feedback from our stakeholders so we can make progress faster. Project managers might dislike data science for the unpredictability of its progress, but iteration is not at fault; it is the only way forward.

I would like to know what you think about agile data science. What has worked for you and your team? What didn’t work? I hope you will leave a comment!

