Table of Contents
- Task-distribution-aware Meta-learning for Cold-start CTR Prediction
- Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction
- A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback
- Detecting and Correcting for Label Shift with Black Box Predictors
- Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction
Task-distribution-aware Meta-learning for Cold-start CTR Prediction
The cold-start problem has two main aspects: first, how to make predictions for an ad that has almost never been seen; second, how to predict more accurately for an ad with only a handful of samples.
Solution: replace the id embedding with side information.
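A minimal sketch of the idea: instead of looking up an embedding by ad id (which is meaningless for an unseen ad), build the ad's vector from side-information fields. The field names (`category`, `advertiser`), vocabulary sizes, and the mean-pooling choice below are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8
# Hypothetical side-information vocabularies (names and sizes are made up).
side_vocab_sizes = {"category": 20, "advertiser": 50}
side_tables = {name: rng.normal(0, 0.1, size=(n, EMB_DIM))
               for name, n in side_vocab_sizes.items()}

def ad_embedding(side_features):
    """Build an ad's embedding purely from side information,
    so a brand-new ad id still gets a meaningful vector."""
    vecs = [side_tables[name][idx] for name, idx in side_features.items()]
    return np.mean(vecs, axis=0)  # mean-pool the side-feature embeddings

# A cold-start ad: never seen before, but its category/advertiser are known.
new_ad = {"category": 3, "advertiser": 41}
emb = ad_embedding(new_ad)
print(emb.shape)  # (8,)
```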
Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction
The focus of this paper seems to be on the labels.
A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback
An importance weight is added in front of the objective to correct for delayed feedback; it is called the feedback shift importance weight (FSIW), and the objective becomes:
$$\min \hat G^{(n)}_{IW} \equiv \frac{1}{n} \sum_{i=1}^n \frac{P(C=y_i \mid X=x_i)}{P(Y=y_i \mid X=x_i)} L(x_i, y_i; \hat f(x_i, \theta))$$
Variable definitions

| Variable | Meaning |
|---|---|
| $X$ | feature |
| $Y \in \{0, 1\}$ | 1 means a conversion was observed within the training term |
| $C \in \{0, 1\}$ | 1 means a conversion eventually occurs |
| $S \in \{0, 1\}$ | 1 means the sample's label within the training term is correct |
| $D \in \mathbb{R}$ | time delay between the click and the subsequent conversion; undefined if $C = 0$ |
| $E \in \mathbb{R}$ | elapsed time between the click and the training time |
Positive samples are generally unproblematic: $Y = 1 \iff S = 1, C = 1$. Negative samples, however, mix correctly and incorrectly labeled examples: $Y = 0 \iff C = 0 \text{ or } S = 0$.
We have:
$$\begin{aligned} P(Y=1 \mid X=x) &= P(C=1 \mid X=x)\,P(S=1 \mid C=1, X=x) \\ P(Y=0 \mid X=x) &= P(C=0 \mid X=x) + P(C=1 \mid X=x)\,P(S=0 \mid C=1, X=x) \end{aligned}$$
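The decomposition above can be sanity-checked with a small Monte-Carlo simulation; the probabilities below are arbitrary illustrative values, not numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

p_c = 0.3          # P(C=1|X=x): the click eventually converts
p_s_given_c = 0.6  # P(S=1|C=1,X=x): conversion arrives within the training term

c = rng.random(n) < p_c
s = c & (rng.random(n) < p_s_given_c)
y = c & s  # observed label: positive only if converted AND observed in time

# Empirical vs analytic P(Y=1|X=x) = P(C=1)P(S=1|C=1)
print(y.mean(), p_c * p_s_given_c)
# Empirical vs analytic P(Y=0|X=x) = P(C=0) + P(C=1)P(S=0|C=1)
print((~y).mean(), (1 - p_c) + p_c * (1 - p_s_given_c))
```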
Training fits $P(Y \mid X)$, while at test time we care about $P(C \mid X)$; this mismatch between the two conditional distributions is the feedback shift.
The optimizer minimizes the generalization error:
$$G \equiv \mathbb{E}_{(x,c) \sim (X,C)}[L(x, c; \hat f(x, \theta))]$$
where $\theta$ is the model's parameters and $\theta^* = \arg\min_{\theta \in \Theta} G$.
Since $c$ is not observable in the training samples, the usual empirical estimate of $G$ substitutes $y$ for $c$:
$$\hat G^{(n)} \equiv \frac{1}{n} \sum_{i=1}^n L(x_i, y_i; \hat f(x_i, \theta))$$
When $c_i$ and $y_i$ are identically distributed, minimizing $G$ and minimizing $\hat G$ are equivalent; but under delayed feedback the two distributions differ, since $P(Y=1 \mid X=x) \leq P(C=1 \mid X=x)$.
This motivates the feedback shift importance weight (FSIW), defined by:
$$\hat G^{(n)}_{IW} \equiv \frac{1}{n} \sum_{i=1}^n \frac{P(C=y_i \mid X=x_i)}{P(Y=y_i \mid X=x_i)} L(x_i, y_i; \hat f(x_i, \theta))$$
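The weighted objective can be sketched as a reweighted log loss. In the paper the weights $P(C=y \mid x)$ and $P(Y=y \mid x)$ are estimated by auxiliary models; the sketch below assumes they are given as oracle arrays, and all numeric values are illustrative only.

```python
import numpy as np

def fsiw_logloss(y, p_pred, p_c1, p_y1, eps=1e-12):
    """FSIW-weighted log loss (a sketch, assuming oracle probabilities).

    y      : observed (possibly delayed) labels Y in {0,1}
    p_pred : model's predicted P(C=1|x)
    p_c1   : P(C=1|x)  (in practice estimated by an auxiliary model)
    p_y1   : P(Y=1|x)  (likewise estimated)
    """
    # Per-sample importance weight P(C=y|x) / P(Y=y|x).
    w = np.where(y == 1, p_c1 / p_y1, (1 - p_c1) / (1 - p_y1))
    nll = -(y * np.log(p_pred + eps) + (1 - y) * np.log(1 - p_pred + eps))
    return np.mean(w * nll)

# Toy example: positives are upweighted (w > 1) because some true
# conversions have not arrived yet; negatives are downweighted (w < 1).
y = np.array([1, 0, 0, 1])
p_pred = np.array([0.7, 0.2, 0.4, 0.6])
p_c1 = np.array([0.5, 0.3, 0.4, 0.6])
p_y1 = p_c1 * 0.6  # e.g. only 60% of conversions observed within the term
loss = fsiw_logloss(y, p_pred, p_c1, p_y1)
print(loss)
```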