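36 Strategies of Data Analysis (21): Quantifying the Impact of Marketing Campaigns and Product Redesigns with the Difference-in-Differences Model Used at Uber and Netflix...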

Latest recommended article published on 2023-11-19 15:05:10

糖甜甜甜74

Article tags: advertising, machine learning, data analysis, python, artificial intelligence

Article link: https://blog.csdn.net/Pylady/article/details/112167138

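This article introduces the application of the difference-in-differences (DID) method in data analysis at Uber and Netflix, where it is used to quantify the impact of marketing campaigns and product redesigns. Using a DID model, it analyzes the effect of Kentucky's 1980 policy change on high-income workers' decisions to return to work, and demonstrates the DID regression calculation in R. After controlling for other variables, the policy is found to lengthen time out of work by 16.9%.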

1 Case Background

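The causal inference models most commonly used in business analytics at Uber and Netflix are difference-in-differences (DID) and matching, and both companies have built self-service analysis tools for these methods into their platforms. This article introduces DID and shows how much its effect-size estimate can differ from that of an ordinary regression.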

Figure 1. Uber’s Experimentation Platform

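We use a simple economics case to explain how DID works. In the United States, each state's workers' compensation program pays injured workers a fixed "replacement rate" (typically two-thirds of pre-injury wages), up to a specified maximum. For a rational decision maker, the higher the compensation for remaining on disability, the weaker the incentive to return to work.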

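In 1980, Kentucky raised its cap on weekly workers' compensation benefits. Here we test whether that change had a significant effect on the decision to return to work. The main outcome variable is log_duration (ldurat in the data), the log of the duration of workers' compensation benefits, measured in weeks. We take the log because the variable is heavily skewed: most people are out of work for a few weeks, while some are out for a very long time. The policy was designed so that the higher cap would not affect low-income workers but would affect high-income workers, so we use low-income workers as the control group and high-income workers as the treatment group.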
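
The treatment/control and before/after comparison described above can be written as the standard two-group, two-period DID estimator. The notation below is an assumption made for illustration (T = high earners, C = low earners, y = log duration); it is a sketch of the textbook form, not necessarily the exact specification used later in the article.

```latex
% DID as a difference of two before/after differences in mean log duration
\hat{\delta}_{DID}
  = \left(\bar{y}_{T,\text{after}} - \bar{y}_{T,\text{before}}\right)
  - \left(\bar{y}_{C,\text{after}} - \bar{y}_{C,\text{before}}\right)

% Equivalent regression form: the interaction coefficient \beta_3 is the DID estimate
y_i = \beta_0
    + \beta_1\,\text{highearn}_i
    + \beta_2\,\text{after}_i
    + \beta_3\,(\text{highearn}_i \times \text{after}_i)
    + \varepsilon_i
```

Subtracting the control group's change removes whatever shifted both groups over time, so the remaining difference is attributed to the policy only under the assumption that the two groups would otherwise have followed parallel trends.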

The dataset is injury, which is included in the wooldridge R package:

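durat (duration): duration of unemployment benefits, in weeks
ldurat (log_duration): log(durat)
afchnge (after_1980): indicator variable for whether the observation was made before (0) or after (1) the 1980 policy change; this is the time variable (before/after)
highearn: indicator variable marking whether the observation is a low (0) or high (1) earner; this is the group variable (treatment/control)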
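
To make the roles of the time and group variables concrete, here is a minimal base-R sketch that computes a DID estimate by hand and checks it against a regression with an interaction term. The data frame `toy` and all of its values are made up for illustration; they are not the Kentucky data, and `get_mean` is a hypothetical helper.

```r
# Toy data: invented log durations, NOT the Kentucky injury data
toy <- data.frame(
  highearn     = c(0, 0, 0, 0, 1, 1, 1, 1),   # 0 = control, 1 = treatment
  after_1980   = c(0, 0, 1, 1, 0, 0, 1, 1),   # 0 = before, 1 = after
  log_duration = c(1.0, 1.2, 1.1, 1.3, 1.3, 1.5, 1.6, 1.8)
)

# Mean outcome in each of the four group-by-period cells
cell_means <- aggregate(log_duration ~ highearn + after_1980,
                        data = toy, FUN = mean)

# Hypothetical helper: look up one cell mean
get_mean <- function(high, after) {
  with(cell_means, log_duration[highearn == high & after_1980 == after])
}

# Before/after change within each group, then the difference of those changes
diff_treated <- get_mean(1, 1) - get_mean(1, 0)
diff_control <- get_mean(0, 1) - get_mean(0, 0)
did_by_hand  <- diff_treated - diff_control

# Same estimate from a regression: the interaction coefficient
model     <- lm(log_duration ~ highearn * after_1980, data = toy)
did_by_lm <- unname(coef(model)["highearn:after_1980"])

print(c(by_hand = did_by_hand, by_lm = did_by_lm))
```

With only the two indicators and their interaction, the regression reproduces the four cell means exactly, which is why the two numbers agree.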

2 Loading and Cleaning the Data

First download the dataset and load the required libraries:

library(tidyverse)  # ggplot(), %>%, mutate(), and friends
library(broom)  # Convert models to data frames
library(scales)  # Format numbers with functions like comma(), percent(), and dollar()
library(modelsummary)  # Create side-by-side regression tables

# Load the data. 
# It'd be a good idea to click on the "injury_raw" object in the Environment
# panel in RStudio to see what the data looks like after you load it
injury_raw <- read_csv("data/injury.csv")

injury <- injury_raw %>% 
  filter(ky == 1) %>% 
  # The syntax for rename is `new_name = original_name`
  rename(duration = durat, log_duration = ldurat,
         after_1980 = afchnge)
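
If the CSV file is not at hand, the same table can probably be taken directly from the wooldridge package mentioned earlier. This is a sketch under assumptions: it presumes the package is installed and that its `injury` dataset has the columns used above (`ky`, `durat`, `ldurat`, `afchnge`, `highearn`). It uses base R only and does nothing if the package is missing.

```r
# Assumption: install.packages("wooldridge") has been run beforehand
if (requireNamespace("wooldridge", quietly = TRUE)) {
  # Load the packaged dataset into its own environment to avoid name clashes
  pkg_env <- new.env()
  data("injury", package = "wooldridge", envir = pkg_env)
  injury_raw <- pkg_env$injury

  # Keep Kentucky observations only, as in the filter() step above
  injury <- subset(injury_raw, ky == 1)

  # Rename columns to match the names used in this article
  names(injury)[names(injury) == "durat"]   <- "duration"
  names(injury)[names(injury) == "ldurat"]  <- "log_duration"
  names(injury)[names(injury) == "afchnge"] <- "after_1980"

  # Quick look at the four variables used in the analysis
  str(injury[, c("duration", "log_duration", "after_1980", "highearn")])
}
```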

3 Exploratory Data Analysis

First, look at the distribution of benefit duration among low and high earners (the control and treatment groups):

ggplot(data = injury, aes(x = duration)) +
  # binwidth = 8 makes each column represent 2 months (8 weeks) 
  # boundary = 0 make it so the 0-8 bar starts at 0 and isn't -4 to 4
  geom_histogram(binwidth = 8, color = "white", boundary = 0) +
  facet_wrap(vars(highearn))

The distribution is indeed skewed: most people in both groups receive 0-8 weeks of benefits, and a few receive more than 180 weeks, which is about 3.5 years!

Using the log of duration gives a less skewed distribution, which is better suited to a regression model:

ggplot(data = injury, mapping = aes(x = log_duration)) +
  geom_histogram(binwidth = 0.5, color = "white", boundary = 0) + 
  # Uncomment this line if you want to exponentiate the logged values on the
  # x-axis. Instead of showing 1, 2, 3, etc., it'll show e^1, e^2, e^3, etc. and
  # make the labels more human readable
  # scale_x_continuous(labels = trans_format("exp", format = round)) +
  facet_wrap(vars(highearn))
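
To see numerically why the log helps with a right-skewed variable, the sketch below compares the mean and median of a small vector of durations before and after taking logs. The values in `durations` are invented for illustration and are not drawn from the injury data.

```r
# Made-up durations in weeks: mostly short spells plus one very long one
durations <- c(1, 2, 2, 3, 4, 6, 8, 180)

# On the raw scale the single long spell drags the mean far above the median
raw_mean   <- mean(durations)
raw_median <- median(durations)

# On the log scale the long spell has much less leverage
log_durations <- log(durations)
log_mean      <- mean(log_durations)
log_median    <- median(log_durations)

print(c(raw_mean = raw_mean, raw_median = raw_median))
print(c(log_mean = log_mean, log_median = log_median))
```

A mean that sits close to the median is one rough sign that a few extreme observations are no longer dominating the variable.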

We should also check how the distribution looks before and after the policy change:

ggplot(data = injury, mapping = aes(x = log_duration)) +
  geom_histogram(binwidth = 0.5, color = "white", boundary = 0) + 
  facet_wrap(vars(after_1980))