Computational Causal Inference at Netflix

Jeffrey Wong, Colin McFarland

Every Netflix data scientist, whether their background is in biology, psychology, physics, economics, math, statistics, or biostatistics, has made meaningful contributions to the way Netflix analyzes causal effects. Scientists from these fields have made many advances in causal effects research in the past few decades, spanning instrumental variables, forest methods, heterogeneous effects, time-dynamic effects, quantile effects, and much more. These methods can provide rich information for decision making, such as in experimentation platforms (“XP”) or in algorithmic policy engines.

We want to amplify the effectiveness of our researchers by providing them with software that can estimate causal effects models efficiently and integrate causal effects into large engineering systems. This can be challenging, because algorithms for causal effects need to fit a model, condition on context and possible actions to take, score the response variable, and compute differences between counterfactuals. Computation can explode and become overwhelming when this is done with large datasets, with high-dimensional features, with many possible actions to choose from, and with many responses. Gaining broad software integration of causal effects models requires a significant investment in software engineering, especially in computation. To address these challenges, Netflix has been building an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we now want to share with the rest of the industry as computational causal inference (CompCI). A whitepaper detailing the field is available.
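
The four steps named above (fit a model, condition on context and candidate actions, score the response, and difference the counterfactuals) can be sketched as follows. This is a minimal illustration using ordinary least squares on simulated data, not Netflix's implementation. The function names `design`, `fit_ols`, and `counterfactual_effects`, the linear model, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def design(context, action, n_actions):
    """Hypothetical design matrix: intercept, context, action dummies, and
    action-by-context interactions (action 0 is the baseline)."""
    n = context.shape[0]
    dummies = np.eye(n_actions)[action][:, 1:]  # one-hot, baseline column dropped
    interactions = (dummies[:, :, None] * context[:, None, :]).reshape(n, -1)
    return np.hstack([np.ones((n, 1)), context, dummies, interactions])

def fit_ols(X, y):
    """Step 1: fit the model by least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def counterfactual_effects(beta, context, n_actions):
    """Steps 2-4: for each action, score every unit as if it had received that
    action, then average the difference against the baseline action."""
    n = context.shape[0]
    baseline = design(context, np.zeros(n, dtype=int), n_actions) @ beta
    effects = {}
    for a in range(1, n_actions):
        scored = design(context, np.full(n, a), n_actions) @ beta
        effects[a] = float(np.mean(scored - baseline))
    return effects

# Simulated data (assumed): action 1 adds 1.0; action 2 adds 2.0 plus a
# context-dependent term, so its average effect is about 2.0.
rng = np.random.default_rng(0)
n, n_actions = 20_000, 3
context = rng.normal(size=(n, 2))
action = rng.integers(0, n_actions, size=n)
true_effect = (np.where(action == 1, 1.0, 0.0)
               + np.where(action == 2, 2.0 + 0.5 * context[:, 0], 0.0))
y = 0.3 * context[:, 0] - 0.2 * context[:, 1] + true_effect + rng.normal(size=n)

beta = fit_ols(design(context, action, n_actions), y)
effects = counterfactual_effects(beta, context, n_actions)
print(effects)
```

Note that the scoring loop touches every observation once per action. With many actions and many response variables the work multiplies accordingly, which is the growth in computation the paragraph describes.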

Computational causal inference brings a software implementation focus to causal inference, especially with regard to high-performance numerical computing. We are implementing several algorithms to be highly performant, with a low memory footprint. As an example, our XP is pivoting away from two-sample t-tests to models that estimate average effects, heterogeneous effects, and time-dynamic treatment effects. These effects help the business understand the user base, different segments in the user base, and whether there are trends in segments over time. We also take advantage of user covariates throughout these models in order to increase statistical power. While this rich analysis helps to inform business strategy and increase member joy, the volume of the data demands large amounts of memory, and estimating causal effects on data of that volume is computationally heavy.
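
To illustrate why user covariates can increase statistical power, here is a small simulated comparison between a plain difference in means (the estimate behind a pooled two-sample t-test) and a regression-adjusted estimate. The data-generating process, the variable names, and the helper `ols_effect_and_se` are assumptions for this sketch; it is not the XP's actual model.

```python
import numpy as np

def ols_effect_and_se(X, y, col):
    """OLS coefficient and classical (homoskedastic) standard error for column `col`."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta[col], np.sqrt(sigma2 * XtX_inv[col, col])

# Simulated experiment (assumed): a small true effect of 0.1 and a
# pre-experiment covariate that explains most of the outcome variance.
rng = np.random.default_rng(1)
n = 10_000
treat = rng.integers(0, 2, size=n)
pre = rng.normal(size=n)
y = 0.1 * treat + 2.0 * pre + rng.normal(size=n)

ones = np.ones(n)
# Intercept + treatment only: the coefficient equals the difference in means.
unadjusted = ols_effect_and_se(np.column_stack([ones, treat]), y, col=1)
# Adding the covariate removes variance it explains, shrinking the standard error.
adjusted = ols_effect_and_se(np.column_stack([ones, treat, pre]), y, col=1)

print("unadjusted effect, se:", unadjusted)
print("adjusted effect, se:  ", adjusted)
```

In this setup the adjusted standard error is noticeably smaller than the unadjusted one, so the same experiment can detect smaller effects.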

In the past, the computations for covariate-adjusted heterogeneous effects and time-dynamic effects were slow, memory heavy, hard to debug, a large source of engineering risk, and ultimately could not scale to many large experiments. Using optimizations from CompCI, we can estimate hundreds of conditional average effects and their variances on a dataset with 10 million observations in 10 seconds, on a single machine. In the extreme, we can also analyze conditional time-dynamic treatment effects for hundreds of millions of observations on a single machine in less than one hour. To achieve this, we leverage a software stack that is completely optimized for sparse linear algebra, a lossless data compression strategy that can reduce data volume, and mathematical formulas that are optimized specifically for estimating causal effects. We also optimize for memory and data alignment.
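
The post does not spell out the compression strategy. One way lossless compression can work for linear models, offered here as an assumption rather than a description of Netflix's stack, is to collapse rows that share identical features into sufficient statistics (count, sum of y, sum of y squared) and solve weighted normal equations on the compressed rows. The sketch below checks that the coefficients and the residual sum of squares from the compressed data match those from the full data.

```python
import numpy as np

# Simulated data (assumed): a binary treatment and one discrete covariate.
rng = np.random.default_rng(2)
n = 200_000
treat = rng.integers(0, 2, size=n)
segment = rng.integers(0, 5, size=n)
y = 0.5 * treat + 0.2 * segment + rng.normal(size=n)
X = np.column_stack([np.ones(n), treat, segment]).astype(float)

# Full-data OLS for reference.
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
rss_full = np.sum((y - X @ beta_full) ** 2)

# Compress: one row per unique feature vector plus sufficient statistics of y.
X_unique, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.ravel()  # shape of `inverse` differs across NumPy versions
m = X_unique.shape[0]
counts = np.bincount(inverse, minlength=m)
sum_y = np.bincount(inverse, weights=y, minlength=m)
sum_y2 = np.bincount(inverse, weights=y ** 2, minlength=m)

# Weighted normal equations on the compressed rows give the same coefficients.
XtX = X_unique.T @ (X_unique * counts[:, None])
Xty = X_unique.T @ sum_y
beta_comp = np.linalg.solve(XtX, Xty)

# Residual sum of squares recovered from the sufficient statistics:
# sum over a group of (y - f)^2 = sum(y^2) - 2 f sum(y) + count * f^2.
fitted = X_unique @ beta_comp
rss_comp = np.sum(sum_y2 - 2 * fitted * sum_y + counts * fitted ** 2)

print(f"rows: {n} -> {m}")
print("coefficients match:", np.allclose(beta_full, beta_comp))
print("residual sum of squares match:", np.isclose(rss_full, rss_comp))
```

The compression only pays off when features are discrete or coarse enough that many rows coincide; with continuous covariates the number of unique rows approaches the number of observations.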

This level of computing affords us a lot of luxury. First, the ability to scale complex models means we can deliver rich insights for the business. Second, being able to analyze large datasets for causal effects in seconds increases research agility. Third, analyzing data on a single machine makes debugging easy. Finally, the scalability makes computation for large engineering systems tractable, reducing engineering risk.

Computational causal inference is a new, interdisciplinary field we are announcing because we want to build it collectively with the broader community of experimenters, researchers, and software engineers. The integration of causal inference into engineering systems can lead to a great deal of innovation. Being an interdisciplinary field, it truly requires the community of local domain experts to unite. We have released a whitepaper to begin the discussion. There, we describe the rising demand for scalable causal inference in research and in software engineering systems. Then, we describe the state of common causal effects models. Afterwards, we describe what we believe can be a good software framework for estimating and optimizing for causal effects.

Finally, we close the CompCI whitepaper with a series of open challenges that we believe require interdisciplinary collaboration and that the community can unite around. For example:

  1. Time-dynamic treatment effects are notoriously hard to scale. They require a panel of repeated observations, which generates large datasets. The observations are also autocorrelated, which complicates estimating the variance of the causal effect. How can we make the computation for the time-dynamic treatment effect, and its distribution, more scalable?

  2. In machine learning, specifying a loss function and optimizing it using numerical methods allows a developer to interact with a single, umbrella framework that can span several models. Can such an umbrella framework exist to specify different causal effects models in a unified way? For example, could it be done through the generalized method of moments? Can it be computationally tractable?

  3. How should we develop software that understands if a causal parameter is identified? A solution to this helps to create software that is safe to use, and can provide safe, programmatic access to the analysis of causal effects. We believe there are many edge cases in identification that require an interdisciplinary group to solve.
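
As a very small illustration of the third challenge, a program can at least detect one failure of identification in a linear model: a treatment column that is perfectly collinear with the other columns. The sketch below is an assumption about what a basic safety check might look like (the function name `is_linearly_identified` is hypothetical). It covers only rank deficiency and none of the harder identification questions, such as confounding or invalid instruments.

```python
import numpy as np

def is_linearly_identified(X, col, tol=1e-8):
    """Return True if column `col` of X is not a linear combination of the
    other columns, so its coefficient in a linear model is uniquely determined."""
    target = X[:, col]
    others = np.delete(X, col, axis=1)
    # Project the target column onto the other columns and inspect the residual.
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    return bool(np.linalg.norm(resid) > tol * max(np.linalg.norm(target), 1.0))

rng = np.random.default_rng(3)
n = 100
segment = rng.integers(0, 2, size=n).astype(float)
ones = np.ones(n)

# Randomized treatment: varies independently of the segment indicator.
treat_random = rng.integers(0, 2, size=n).astype(float)
X_ok = np.column_stack([ones, segment, treat_random])

# Treatment given to exactly one segment: indistinguishable from the segment effect.
treat_by_segment = segment.copy()
X_bad = np.column_stack([ones, segment, treat_by_segment])

print(is_linearly_identified(X_ok, col=2))
print(is_linearly_identified(X_bad, col=2))
```

A check like this could run before estimation and refuse to report an effect when it fails, which is one modest form of the "safe, programmatic access" the item asks for.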


We hope this begins the discussion, and over the coming months we will be sharing more on the research we have done to make estimation of causal effects performant. There are still many more challenges in the field that are not listed here. We want to form a community spanning experimenters, researchers, and software engineers to learn about problems and solutions together. If you are interested in being part of this community, please reach us at compci-public@netflix.com.

Translated from: https://netflixtechblog.com/computational-causal-inference-at-netflix-293591691c62
