Why Use Propensity Score
Setup
Set up three matrices:
- signal matrix $S \in \mathbb{R}^{m \times n}$
- noise matrix $W \in \mathbb{R}^{m \times n}$
- probability matrix $P \in [0, 1]^{m \times n}$, independent of everything else
- Suppose we observe a matrix $X^* \in \mathbb{R}^{m \times n}$ without missing entries:

$$X^* := S + W$$

- In order to estimate $S$ with low MSE, an estimate $\hat{S}$ can be obtained by minimizing

$$L_{\text{full MSE}}(\hat{S}) := \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} (\hat{S}_{u,i} - X^*_{u,i})^2 \tag{1}$$
- However, we only observe a matrix $X \in (\mathbb{R} \cup \{\star\})^{m \times n}$, in which each entry is either a revealed rating $r$ or an unrevealed rating $\star$:

$$X_{u,i} = S_{u,i} + W_{u,i} \text{ with probability } P_{u,i}$$

- Let $\Omega$ be the set of revealed entries $(u, i)$. In order to estimate $S$ with low MSE, an estimate $\hat{S}$ can be obtained from

$$L_{\text{MSE}}(\hat{S}) := \frac{1}{|\Omega|} \sum_{(u,i)\in \Omega} (\hat{S}_{u,i} - X_{u,i})^2 \tag{2}$$
If the probability of every entry in $X$ being revealed is the same, $P_{u,i} = C$ for all $u$ and $i$, then (2) is an unbiased estimate of (1).
However, this assumption won't hold in reality, because the missing/revealed probability differs across entries $(u, i)$. Therefore, we need to use the inverse propensity score (inverse probability) to address this issue.
Conditional On Propensity Scores
- Inverse Propensity Score technique:

$$L_{\text{IPS-MSE}}(\hat{S} \mid P) := \frac{1}{mn} \sum_{(u,i)\in \Omega} \frac{(\hat{S}_{u,i}-X_{u,i})^2}{P_{u,i}} \tag{3}$$

  which is an unbiased estimate of (1).
- Appendix: suppose $P$ is known. Then $L_{\text{IPS-MSE}}$ in (3) is an unbiased estimator of $L_{\text{full MSE}}(\hat{S})$ in (1). Rewrite (3) with an indicator over all entries:

$$L_{\text{IPS-MSE}}(\hat{S} \mid P) = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} \mathbb{1}\{(u, i) \in \Omega\} \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}}$$

  Then take the expectation with respect to which entries are revealed:

$$E_{\Omega}[L_{\text{IPS-MSE}}(\hat{S} \mid P)] = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} P_{u,i} \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}} = L_{\text{full MSE}}(\hat{S})$$
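The unbiasedness claim can be checked with a quick Monte Carlo simulation. This is a sketch with synthetic data: the entry-dependent reveal probabilities below are made up for illustration, and they are deliberately correlated with the error so that the naive estimator (2) is visibly biased while the IPS estimator (3) is not.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 50
S = rng.normal(size=(m, n))                  # signal matrix
W = rng.normal(size=(m, n))                  # noise matrix
X_star = S + W                               # fully observed matrix of eq. (1)
S_hat = S                                    # pretend our estimate equals the signal
err2 = (S_hat - X_star) ** 2                 # per-entry squared errors

# Made-up reveal probabilities that depend on the error: entries with
# large error are revealed less often, which biases the naive estimator.
P = np.where(err2 < np.median(err2), 0.8, 0.2)

full_mse = err2.mean()                       # L_full-MSE in (1)

trials = 4000
naive, ips = [], []
for _ in range(trials):
    omega = rng.random((m, n)) < P           # random reveal mask Omega
    naive.append(err2[omega].mean())                      # eq. (2)
    ips.append((err2[omega] / P[omega]).sum() / (m * n))  # eq. (3)

print(f"full MSE: {full_mse:.3f}  naive: {np.mean(naive):.3f}  IPS: {np.mean(ips):.3f}")
```

Averaged over many random reveal patterns, the IPS estimate matches the full MSE, while the naive estimate is systematically too low here because high-error entries are under-sampled.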
Nuclear Norm Regularization
- Overall, we add nuclear norm regularization to (3), and the final objective function is

$$\hat{S} = \operatorname*{argmin}_{\Gamma \in \mathbb{R}^{m \times n}} L_{\text{IPS-MSE}}(\Gamma \mid P) + \lambda \|\Gamma\|_{*} \tag{4}$$

  where $\lambda > 0$ is a user-specified parameter and $\|\cdot\|_*$ denotes the nuclear norm, $\|A\|_* = \operatorname{trace}(\sqrt{A^{\top}A}) = \sum^{\min(m, n)}_{i=1} \sigma_i(A)$.
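The two expressions for the nuclear norm agree, and can be checked numerically on a small random matrix (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))

# Nuclear norm as the sum of singular values.
singular_values = np.linalg.svd(A, compute_uv=False)
nuc_svd = singular_values.sum()

# Equivalent form trace(sqrt(A^T A)): the eigenvalues of A^T A are the
# squared singular values, so summing their square roots matches.
nuc_trace = np.sqrt(np.clip(np.linalg.eigvalsh(A.T @ A), 0, None)).sum()

print(nuc_svd, nuc_trace, np.linalg.norm(A, "nuc"))
```

All three values coincide; `np.linalg.norm(A, "nuc")` is NumPy's built-in nuclear norm.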
Estimate Propensity Score
We only discuss the one-bit matrix completion approach to propensity score estimation in this article.
1bit Matrix Completion
Setup:
- $M$: missingness mask matrix used to estimate $P$
  * $M \in \{0, 1\}^{m \times n}$
  * $M_{u,i} = \mathbb{1}\{X_{u,i} \ne \star\}$
- $A \in \mathbb{R}^{m \times n}$: parameter matrix
  * $A \in \mathcal{F}_{\tau, \gamma} := \{\Gamma \in \mathbb{R}^{m \times n} : \|\Gamma\|_* \le \tau \sqrt{mn},\ \|\Gamma\|_{\max} \le \gamma\}$, where $\tau > 0$ and $\gamma > 0$ are user-specified parameters
- $P_{u,i} = \sigma(A_{u,i})$, where $\sigma$ can, for instance, be the logistic function $\sigma(x) = \frac{1}{1+e^{-x}}$
1BitMC is given as follows:

1. Solve the constrained Bernoulli maximum likelihood problem:

$$\hat{A} = \operatorname*{argmax}_{\Gamma \in \mathcal{F}_{\tau, \gamma}} \sum_{u=1}^m \sum_{i=1}^n \left[ M_{u,i} \log \sigma(\Gamma_{u,i}) + (1 - M_{u,i}) \log(1 - \sigma(\Gamma_{u,i})) \right]$$

2. Construct the matrix $\hat{P} \in [0,1]^{m \times n}$ by setting $\hat{P}_{u,i} = \sigma(\hat{A}_{u,i})$.
Takeaway message for 1BitMC
The algorithm applies one-bit matrix completion to the mask matrix: it finds a reconstructed matrix, constrained in nuclear norm, that maximizes the Bernoulli likelihood.
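The two steps above can be sketched in code. Note this is a simplified stand-in, not the paper's exact algorithm: it replaces the constraint set $\mathcal{F}_{\tau,\gamma}$ with a nuclear norm penalty, solved by proximal gradient ascent (singular value thresholding); `lam`, `lr`, and the rank-1 ground truth are made-up illustration choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def svt(A, tau):
    # Singular value thresholding: prox operator of tau * nuclear norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def one_bit_mc(M, lam=0.5, lr=1.0, steps=300):
    # Proximal gradient ascent on the Bernoulli log-likelihood of the
    # mask M, with a nuclear norm penalty standing in for F_{tau,gamma}.
    A = np.zeros_like(M, dtype=float)
    for _ in range(steps):
        grad = M - sigmoid(A)          # gradient of the log-likelihood in A
        A = svt(A + lr * grad, lr * lam)
    return sigmoid(A)                  # step 2: P_hat = sigma(A_hat)

rng = np.random.default_rng(2)
m, n = 40, 40
A_true = np.outer(rng.normal(size=m), rng.normal(size=n))  # rank-1 parameters
P_true = sigmoid(A_true)
M = (rng.random((m, n)) < P_true).astype(float)            # observed mask

P_hat = one_bit_mc(M)
```

The low-rank shrinkage is what makes this meaningful: unconstrained per-entry maximum likelihood would just drive $\sigma(A_{u,i})$ to the binary mask values.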
Naive Bayes (Tobias Schnabel)
Let’s look at the formula of Naive Bayes:

$$\begin{aligned} \hat{y} & = \operatorname*{argmax}_{y \in \mathfrak{Y}} p(y \mid \mathbf{x}; \boldsymbol{\theta}) \\ & = \operatorname*{argmax}_{y \in \mathfrak{Y}} p(\mathbf{x} \mid y; \boldsymbol{\theta})\, p(y) \end{aligned}$$
- $p(\mathbf{x} \mid y; \boldsymbol{\theta})$ can be compared to a model of tossing a die with $V$ faces $n$ times, which is proportional to a multinomial distribution:

$$\text{likelihood} = p(\mathbf{x} \mid y; \boldsymbol{\theta}) \propto \prod_{i=1}^{V} \theta_{i, y}^{x_i}$$

- Prior: $p(y) = \theta_y$
- Number of parameters in $\boldsymbol{\theta} = \{\theta_y, \theta_{i, y}\}, \forall i, y$: $K + KV = (V + 1)K$
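A minimal worked example of the argmax rule above, with made-up values for $\boldsymbol{\theta}$ (here $V = 3$ faces and $K = 2$ classes, so $(V+1)K = 8$ parameters):

```python
import numpy as np

# theta[i, y] = p(face i | class y); columns sum to 1. Made-up values.
theta = np.array([[0.7, 0.2],
                  [0.2, 0.3],
                  [0.1, 0.5]])
prior = np.array([0.6, 0.4])          # p(y) = theta_y
x = np.array([3, 1, 0])               # observed face counts for one sample

# log p(x|y) + log p(y), dropping the multinomial coefficient (constant in y)
log_post = x @ np.log(theta) + np.log(prior)
y_hat = int(np.argmax(log_post))
print(y_hat)
```

Working in log space avoids underflow when the counts $x_i$ get large; the multinomial coefficient can be dropped because it does not depend on $y$.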
We first need to know some assumptions used in the propensity score wiki:
- Strong ignorability: the potential outcomes are independent of treatment conditional on background variables $X$: $r_0, r_1 \perp Z \mid X$
- Balancing score: $b(x)$ is a function of the observed covariates $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated $(Z=1)$ and control $(Z=0)$: $Z \perp X \mid b(X)$. The most trivial such function is $b(X) = X$; another is $e(X)$, the propensity score. (My thoughts: we can think of this as a confounder relation in a three-node graph, where the confounder $X$ affects both the treatment $Z$ and the response $Y$. Once we condition on $b(X)$, the confounding effect of $X$ on $Z$ is removed.)
- Propensity score: the probability of a unit being assigned to a particular treatment, given a set of observed covariates: $e(x) \stackrel{\text{def}}{=} \Pr(Z=1 \mid X=x)$
- Main theorems:
  - The propensity score $e(x)$ is a balancing score.
  - The propensity score is the coarsest balancing score, because it transforms a (possibly) multidimensional object $X_i$ into one dimension, while $b(X) = X$ is the finest one.
  - If treatment assignment is strongly ignorable given $X$, then it is also strongly ignorable given any balancing score. In particular, given the propensity score: $r_0, r_1 \perp Z \mid e(X)$
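The balancing property can be illustrated numerically: after weighting each unit by the inverse of its propensity $e(X)$, the covariate distribution no longer differs between treated and control. A sketch with a single synthetic covariate and a known (made-up) logistic propensity:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50000
X = rng.normal(size=N)                        # observed covariate
e_true = 1.0 / (1.0 + np.exp(-0.8 * X))       # propensity e(X) = Pr(Z=1|X)
Z = (rng.random(N) < e_true).astype(float)    # treatment assignment

# Unweighted covariate means differ between groups (X predicts Z)...
raw_gap = X[Z == 1].mean() - X[Z == 0].mean()

# ...but inverse-propensity weighting balances X across groups,
# illustrating that e(X) is a balancing score.
w = Z / e_true + (1 - Z) / (1 - e_true)
mean_treated = np.sum(w * Z * X) / np.sum(w * Z)
mean_control = np.sum(w * (1 - Z) * X) / np.sum(w * (1 - Z))
weighted_gap = mean_treated - mean_control
print(f"raw gap: {raw_gap:.3f}  IPW gap: {weighted_gap:.3f}")
```

In practice $e(X)$ is unknown and must itself be estimated (e.g. by logistic regression); here it is known by construction so that only the balancing effect is on display.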
Now, we will look at the Naive Bayes approach proposed by Schnabel:

$$P(O_{u,i}=1 \mid Y_{u,i}=r) = \frac{P(Y=r \mid O=1)\, P(O=1)}{P(Y=r)}$$
- The subscripts $u, i$ are dropped, meaning that the parameters are tied across all $u$ and $i$.
- $P(Y=r \mid O=1)$ and $P(O=1)$ can be obtained by counting observed ratings in the Missing Not At Random (MNAR) data.
- $P(Y=r)$: we need a sample of Missing Completely At Random (MCAR) data to be combined with the existing MNAR data.
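A small simulation of this counting scheme, with made-up reveal probabilities, showing that combining MNAR counts with a small MCAR sample recovers $P(O=1 \mid Y=r)$ via the Bayes formula above:

```python
import numpy as np

rng = np.random.default_rng(4)
n_pairs = 200000
ratings = rng.integers(1, 6, size=n_pairs)        # true ratings Y in {1..5}

# Made-up ground truth: higher ratings are revealed more often (MNAR).
p_reveal = np.array([0.0, 0.1, 0.2, 0.3, 0.5, 0.7])  # indexed by rating
observed = rng.random(n_pairs) < p_reveal[ratings]

mnar = ratings[observed]                          # revealed (MNAR) ratings
mcar = ratings[rng.random(n_pairs) < 0.05]        # small MCAR sample

p_o = observed.mean()                             # P(O=1)
est = {}
for r in range(1, 6):
    p_y_given_o = (mnar == r).mean()              # P(Y=r | O=1), MNAR counts
    p_y = (mcar == r).mean()                      # P(Y=r), MCAR counts
    est[r] = p_y_given_o * p_o / p_y              # Bayes: P(O=1 | Y=r)
    print(r, round(est[r], 3), p_reveal[r])
```

Note the estimate is one number per rating value $r$, not per $(u, i)$ pair, exactly because the parameters are tied across users and items.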
Takeaway message for Naive Bayes
1. The algorithm first uses a Missing Completely At Random sample to estimate the normalizing term in the denominator. Second, we should think of the likelihood term $P(Y=r \mid O=1)$ in Naive Bayes as a die-tossing model where each side is a rating value $r$ from 1 to 5. So the estimated probability is the same for all entries with the same rating, rather than different for different user-item pairs.
2. Under the propensity score assumptions, the Naive Bayes approach seems questionable. Strong ignorability says $(Y_0, Y_1 \perp O \mid e(X))$. However, the Naive Bayes approach uses $e(X) = P(O \mid Y)$, conditioning the missingness on the outcome itself, which is a bit confusing and arguably does not satisfy the assumption.
Reference
MNAR_1BitMC(Ma)
Naive_Bayes(Tobias Schnabel)