The Mayans' Lost Guide to Doing Data Science in Fast-Paced Startups

A machine learning model is useful only if it's put into production at the right time.

No kidding!

The question is: which key practices go unsaid when working in a fast and highly chaotic environment? Are AI foundations and a computer science degree enough to build scalable models in real life? It's not as easy as eating cotton candy.

In the interest of learning, I decided to review my past experience of working with Machine Learning and Deep Learning models. In this article, I'll draw on one of my recent projects, where the aim was to predict a user's probability of purchasing our various offerings. The purpose of this project was to help the Performance Marketing and Growth teams at HealthifyMe evaluate, respectively, the efficacy and efficiency of our sales engines. Below are the insights I took from it on what one needs to strive towards while solving data problems:

  1. Writing code that's modular and reusable. Data cleaning and preprocessing take up more than half the time of a modeling journey. Writing generic, parameterized functions greatly reduces redundant work and makes debugging easier.

    At HealthifyMe, we created many generic public transformers for this, which made further iterations of the model quicker and simpler. And the best outcome: they can be reused for any standard data modeling in the future!

  2. Writing both unit and integration tests (so what if it's an AI model?). This is a vital practice, especially when the code is scattered across various modules and has to go through constant iterations. A single module might work on its own but fail when combined with data pipelines.

  3. Gradually iterating the model. It is best to add one set of features after another instead of throwing in all the features at once and expecting the model to do the magic. Gradual iterations give a greater understanding of feature importances. Releasing small models iteratively also keeps the stakeholders' boat moving.

    At HealthifyMe, we decided to launch the initial version of our current model two months before New Year 2020, the peak period for business, rather than waiting for it to be perfect, which would have come with a great opportunity cost.

  4. Documenting the tiniest bit of progress. The more attention given to details, the better. It sounds like a boring chore, but it helps tremendously when working on complex projects that involve many moving pieces and uncertainties.

    For us, this made all interactions with the data engineering team super smooth and efficient. In retrospect, I feel that getting them more involved in the initial stages of the journey would have helped us, as it would have given us an understanding of their concerns.

  5. Behaving like an owner. You, and only you, know where things could get really messed up. Take responsibility and make sure that the business stakeholders, data engineers, and all other associated teams are in the loop throughout the process so that the work cycle remains unaffected. Business stakeholders may get frustrated really quickly in the face of ambiguity.

    Our team at HealthifyMe scheduled regular brainstorming sessions with the stakeholders. This was really valuable, as it helped maintain morale and the quick exchange of feedback. It also helped them plan their sprints and timelines better. This was a great lesson in managing the unknowns. Planning is a great life skill to have when your games are not so finite!

  6. Making sure that the retraining mechanism and model evaluation processes are automated (even if it means extending the development phase in your sprints). The complete cycle of pulling data, retraining the model, evaluating the iteration, and pushing to production should not require another data scientist to rework it from scratch.

    We were quick to learn from our mistakes when we realized how much of our engineering bandwidth was spent just trying to crack the processes involved in the previous model.

  7. A/B testing the previous and new models. This is again a vital step in understanding the delta improvement. It's good practice to compare each feature to judge the efficacy of the latest model. Building data algorithms is a serious mental investment with a lot of uncertainty in terms of results. Hence it is psychologically rewarding for data scientists to witness the potential impact.

  8. Keeping the model updated. Models degrade over time in an environment where business needs are rapidly changing. Product and strategic changes are frequent and inconsistent as we continuously strive to keep the user experience the top priority.

    To deal with this, we created monitors and alerts to keep an eye on when the relevancy of the model falls and it needs tweaking or a complete revisit of the approach.

  9. Being super conscious of the timeline. This is extremely important for projects that have the potential to paralyze you through over-analysis. Even if all the steps are crystal clear in your head, it's wise to give a timeline that is 2X what you think the actual timeline should be.

    We made the mistake of giving 0.5X and ended up taking 2X the time it should have taken in an ideal scenario: an exemplary bad way of managing the expectations of the teams involved. It's always the smartest decision to give timelines for each iteration to avoid surprises.

  10. Keeping up the partnership with stakeholders. Proactive expectation management with stakeholders plays a pivotal role in an organization's data projects. This point cannot be stressed enough and needs to be at the forefront at every stage of the work. One of the KPIs for a data scientist should be the ability to manage stakeholders efficiently while educating them on how to use insights from the model when making decisions in different strategic scenarios.
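To make points 1 and 2 above a little more concrete, here is a minimal sketch of a generic, parameterized cleaning function together with a unit test for it. It is not one of the HealthifyMe transformers; the name `clean_numeric`, its parameters, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional


def clean_numeric(
    values: Iterable[Optional[float]],
    lower: float,
    upper: float,
    fill: float,
) -> List[float]:
    """Replace missing values with `fill`, then clip everything to [lower, upper].

    Hypothetical helper: the bounds and fill value are parameters, so the same
    function can be reused for any numeric column instead of being rewritten.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    lo, hi = float(lower), float(upper)
    cleaned = []
    for v in values:
        x = float(fill if v is None else v)
        cleaned.append(min(max(x, lo), hi))
    return cleaned


def test_clean_numeric() -> None:
    # Unit test: the missing value is filled and outliers are clipped to the bounds.
    result = clean_numeric([None, -5, 40, 250], lower=0, upper=120, fill=30)
    assert result == [30.0, 0.0, 40.0, 120.0]


if __name__ == "__main__":
    test_clean_numeric()
```

Because the bounds are arguments rather than hard-coded constants, the same function and the same test pattern can be reused for other columns; an integration test would then run it inside the full data pipeline rather than in isolation.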

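For point 7, one common way to check whether a new model's delta over the previous one is more than noise is a two-proportion z-test on conversion counts from the two arms. The article does not say which test was used, so treat the sketch below as an assumption; the function name and the counts in the usage note are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the conversion-rate difference B - A.

    Arm A is assumed to be served by the previous model and arm B by the new one.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one user")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All users converted or none did: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% conversion rate under the old model with 15% under the new one; a small p-value suggests the improvement is unlikely to be chance alone.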

Saurav Agarwal, Senior Data Scientist at HealthifyMe, published this blog in the hope of helping any organization that strives alongside its product, engineering, and data teams understand the underlying intricacies, and effective ways to solve them, while pursuing its data strategy.

Translated from: https://medium.com/healthify-tech/the-mayans-lost-guide-to-doing-data-science-in-fast-paced-startups-2531128aecd0
