Key Challenges with Quasi Experiments at Netflix

Kamer Toker-Yildiz, Colin McFarland, Julia Glick

At Netflix, when we can’t run A/B experiments, we run quasi experiments! We run quasi experiments with various objectives, such as non-member experiments focusing on acquisition, member experiments focusing on member engagement, or video streaming experiments focusing on content delivery. Consolidating on one methodology can be a challenge, as we may face different design constraints, data constraints, or optimization goals. In this post we discuss some key challenges, and the approaches Netflix has been using to handle small sample sizes and limited pre-intervention data in quasi experiments.

Design and Randomization

We face various business problems where we cannot run individual-level A/B tests but can benefit from quasi experiments. For instance, consider the case where we want to measure the impact of TV or billboard advertising on member engagement. It is impossible for us to have identical treatment and control groups at the member level, as we cannot hold back individuals from such forms of advertising. Our solution is to randomize our member base at the smallest possible level. For instance, in most countries TV advertising can only be bought at the TV media market level, which usually involves groups of cities in close geographic proximity.

One of the major problems we face in quasi experiments is a small sample size, where asymptotic properties may not practically hold. We typically have a small number of geographic units due to test limitations, and we also use broader or more distant groups of units to minimize geographic spillovers. We are also more likely to face high variation and uneven distributions in treatment and control groups due to heterogeneity across units. For example, let’s say we are interested in measuring the impact of marketing the Lost in Space series on sci-fi viewing in the UK. London, with its large population, is randomly assigned to the treatment cell, and people in London love sci-fi much more than people in other cities. If we ignore the latter fact, we will overestimate the true impact of marketing, which is now confounded. In summary, the simple randomization and mean comparison we typically utilize in A/B testing with millions of members may not work well for quasi experiments.

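To make this risk concrete, here is a minimal simulated sketch (with entirely made-up numbers, not Netflix data) showing how a naive post-intervention comparison of means overestimates the effect when a high-baseline unit like London lands in the treatment cell, and how differencing each unit against its own pre-period baseline largely removes that bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical baseline sci-fi viewing rates (hours per member per week) by city;
# "london" has a much higher inherent rate than the other cities.
baseline = {"london": 5.0, "manchester": 2.0, "birmingham": 2.1,
            "leeds": 1.9, "glasgow": 2.0, "bristol": 2.2}
true_lift = 0.3  # assumed additive effect of the marketing campaign

treatment = ["london", "manchester", "birmingham"]  # London landed in treatment
control = ["leeds", "glasgow", "bristol"]

# Simulated post-intervention rates: baseline + lift (treatment cells only) + noise.
post = {city: rate + (true_lift if city in treatment else 0.0) + rng.normal(0, 0.1)
        for city, rate in baseline.items()}

# A naive comparison of post-intervention means is inflated by London's high baseline.
naive = np.mean([post[c] for c in treatment]) - np.mean([post[c] for c in control])

# Differencing each unit against its own pre-period baseline removes most of the
# unit-level heterogeneity and recovers something close to the true lift.
adjusted = (np.mean([post[c] - baseline[c] for c in treatment])
            - np.mean([post[c] - baseline[c] for c in control]))

print(f"naive estimate:    {naive:.2f}")     # well above the true lift of 0.3
print(f"adjusted estimate: {adjusted:.2f}")  # close to 0.3
```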

Completely tackling these problems during the design phase may not be possible. We use some statistical approaches during design and analysis to minimize bias and maximize the precision of our estimates. During design, one approach we utilize is running repeated randomizations, i.e. ‘re-randomization’. In particular, we keep randomizing until we find a randomization that gives us the maximum desired level of balance on key variables across test cells. This approach generally enables us to define more similar test groups (i.e. getting closer to an apples-to-apples comparison). However, we may still face two issues: 1) we can only simultaneously balance a limited number of observed variables, and it is very difficult to find identical geographic units on all dimensions, and 2) we can still face noisy results with large confidence intervals due to small sample size. We next discuss some of our analysis approaches to further tackle these problems.

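Below is a simplified sketch of what re-randomization can look like in practice. The balance metric (standardized mean difference), the 0.1 threshold, and the market-level covariates are illustrative assumptions, not the exact internal procedure.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, assign, covariates):
    """Absolute standardized mean difference of each covariate across the two cells."""
    treated, control = df[assign == 1], df[assign == 0]
    return {cov: abs(treated[cov].mean() - control[cov].mean()) / df[cov].std()
            for cov in covariates}

def rerandomize(df, covariates, threshold=0.1, max_draws=10_000, seed=42):
    """Keep drawing random 50/50 geo assignments until all covariates are balanced."""
    rng = np.random.default_rng(seed)
    n = len(df)
    labels = np.repeat([0, 1], [n - n // 2, n // 2])
    for _ in range(max_draws):
        assign = pd.Series(rng.permutation(labels), index=df.index)
        smd = standardized_mean_diff(df, assign, covariates)
        if max(smd.values()) <= threshold:  # every covariate is balanced enough
            return assign, smd
    raise RuntimeError("No balanced assignment found; consider relaxing the threshold.")

# Hypothetical geo-level data: one row per TV media market.
geos = pd.DataFrame({
    "market": list("ABCDEFGH"),
    "signups_per_week": [120, 300, 90, 210, 150, 260, 110, 180],
    "streaming_hours": [1.1, 2.5, 0.9, 1.8, 1.3, 2.2, 1.0, 1.6],
})
assignment, balance = rerandomize(geos, ["signups_per_week", "streaming_hours"])
print(assignment.values, balance)
```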

Analysis

Going Beyond Simple Comparisons

Difference-in-differences (diff-in-diff or DID) comparison is a very common approach used in quasi experiments. In diff-in-diff, we usually consider two time periods: pre- and post-intervention. We utilize the pre-intervention period to generate baselines for our metrics, and normalize post-intervention values by the baseline. This normalization is a simple but very powerful way of controlling for inherent differences between treatment and control groups. For example, let’s say our success metric is signups and we are running a quasi experiment in France, with Paris and Lyon in two test cells. We cannot directly compare signups in the two cities as their populations are very different. Normalizing with respect to pre-intervention signups reduces variation and helps us make comparisons on the same scale. Although the diff-in-diff approach generally works reasonably well, we have observed some cases where it may not be as applicable, as we discuss next.

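To illustrate the normalization, here is a minimal sketch with made-up signup counts: each city’s post-intervention signups are compared to its own pre-intervention baseline, and the diff-in-diff estimate is the difference between those relative changes.

```python
# Hypothetical average weekly signups before and after the intervention.
pre = {"paris": 10_000, "lyon": 2_000}
post = {"paris": 11_500, "lyon": 2_100}

# Relative change per city, then the difference of those changes.
relative_change = {city: (post[city] - pre[city]) / pre[city] for city in pre}
did_estimate = relative_change["paris"] - relative_change["lyon"]

print(relative_change)                            # {'paris': 0.15, 'lyon': 0.05}
print(f"diff-in-diff lift: {did_estimate:.1%}")   # ~10% incremental lift
```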

Success Metrics With Historical Observations But Small Sample Size

In our non-member focused tests, we can observe historical acquisition metrics, e.g. signup counts; however, we typically don’t observe any other information about non-members. High variation in outcome metrics combined with a small sample size can make it difficult to design a well-powered experiment using traditional diff-in-diff-like approaches. To tackle this problem, we try to implement designs involving multiple interventions in each unit over an extended period of time whenever possible (i.e. instead of a typical experiment with a single intervention period). This can help us gather enough evidence to run a well-powered experiment even with a very small sample size (i.e. few geographic units).

In particular, we turn the intervention (e.g. advertising) “on” and “off” repeatedly over time, in different patterns and geographic units, to capture short-term effects. Every time we “toggle” the intervention, it gives us another chance to read the effect of the test. So even if we only have a few geographic units, we can eventually read a reasonably precise estimate of the effect size (although, of course, results may not generalize to other units if we have very few of them). As our analysis approach, we use observations from steady-state units to estimate what would otherwise have happened in the units that are changing. To estimate the treatment effect, we fit a dynamic linear model (DLM), a type of state space model in which the observations are conditionally Gaussian. DLMs are a very flexible category of models, but we only use a narrow subset of possible DLM structures to keep things simple. We currently have a robust package embedded in our internal tool, Quasimodo, to cover experiments that have a similar structure. Our model is comparable to Google’s CausalImpact package, but uses a multivariate structure that lets us analyze more than a single point-in-time intervention in a single region.

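The sketch below is a heavily simplified, univariate analogue of this idea rather than the internal Quasimodo model: a local-level Gaussian state space model fit with statsmodels, where an on/off campaign indicator enters as a regressor and its fitted coefficient estimates the per-day lift while the intervention is on. The toggle schedule, effect size, and data are all simulated for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_days = 120

# Simulated ground truth: a slowly drifting baseline (random walk) plus a lift of
# 8 signups per day whenever the campaign is "on" (two-week on/off toggles).
baseline = 100 + np.cumsum(rng.normal(0, 1.0, n_days))
campaign_on = (np.arange(n_days) // 14) % 2
signups = baseline + 8.0 * campaign_on + rng.normal(0, 3.0, n_days)

data = pd.DataFrame({"signups": signups, "campaign_on": campaign_on})

# Local level model with the campaign indicator as an exogenous regressor; the
# fitted coefficient on campaign_on estimates the lift (close to the simulated 8).
model = sm.tsa.UnobservedComponents(
    data["signups"], level="local level", exog=data[["campaign_on"]]
)
results = model.fit(disp=False)
print(results.summary())
```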

Success Metrics Without Historical Observations

In our member-focused tests, we sometimes face cases where our success metrics have no historical observations. For example, Netflix promotes new shows before they launch on the service in order to increase member engagement once the show is available. For a new show, we start observing metrics only when the show launches. As a result, our success metrics inherently have no historical observations, making it impossible to utilize the benefits of similar time-series-based approaches.

In these cases, we utilize the benefits of richer member data to measure and control for members’ inherent engagement with, or interest in, the show. We do this by using relevant pre-treatment proxies, e.g. viewing of similar shows, or interest in Netflix originals or similar genres. We have observed that controlling for geographic as well as individual-level differences works best in minimizing confounding effects and improving precision. For example, if members in Toronto watch more Netflix originals than members in other Canadian cities, we should control for pre-treatment Netflix originals viewing at both the individual and city level to capture within-unit and between-unit variation separately.

This is very similar in nature to covariate adjustment. However, we do more than just run a simple regression with a large set of control variables. At Netflix, we have worked on developing approaches at the intersection of regression covariate adjustment and machine-learning-based propensity score matching, using a wide set of relevant member features. Such combined approaches help us explicitly control for members’ inherent interest in the new show using hundreds of features, while minimizing the linearity assumptions and degrees-of-freedom challenges we may face. We thus gain significant wins both in reducing potential confounding effects and in maximizing precision, so that we more accurately capture the treatment effect we are interested in.

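As one hedged illustration of such a combination (not the exact internal method), the sketch below estimates each member’s propensity of being in a treated geo from pre-treatment proxy features using a gradient-boosted classifier, then fits an inverse-propensity-weighted outcome regression that also adjusts for the same features. The feature names and data are hypothetical and simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical pre-treatment proxies for inherent interest in the new show.
features = pd.DataFrame({
    "similar_show_hours": rng.gamma(2.0, 2.0, n),
    "originals_hours": rng.gamma(1.5, 3.0, n),
    "genre_affinity": rng.normal(0.0, 1.0, n),
})
# Treated geos happen to contain more high-affinity members (confounding).
p_treat = 1 / (1 + np.exp(-(0.4 * features["genre_affinity"] - 0.2)))
treated = rng.binomial(1, p_treat)
# Outcome: hours viewed of the new show, with a true treatment lift of 1.0.
outcome = (0.5 * features["similar_show_hours"] + 1.2 * features["genre_affinity"]
           + 1.0 * treated + rng.normal(0, 1.0, n))

# Step 1: machine-learning propensity model on pre-treatment features.
propensity = GradientBoostingClassifier().fit(features, treated).predict_proba(features)[:, 1]
propensity = np.clip(propensity, 0.05, 0.95)  # clip to avoid extreme weights
weights = np.where(treated == 1, 1 / propensity, 1 / (1 - propensity))

# Step 2: weighted outcome regression that also adjusts for the same features.
X = sm.add_constant(pd.concat([pd.Series(treated, name="treated"), features], axis=1))
fit = sm.WLS(outcome, X, weights=weights).fit()
print(fit.params["treated"])  # estimated lift, close to the true value of 1.0
```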

Next Steps

We have excelled in the quasi experimentation space, with many measurement strategies now in play across Netflix for various use cases. However, we are not done yet! We can expand these methodologies to more use cases and continue to improve the measurement. As an example, another exciting area we have yet to explore is combining these approaches for metrics where we can use both time-series approaches and a rich set of internal features (e.g. general member engagement metrics). If you’re interested in working on these and other causal inference problems, join our dream team!

Originally published at: https://netflixtechblog.com/key-challenges-with-quasi-experiments-at-netflix-89b4f234b852
