Overview of MNAR Matrix Completion Under the Nuclear Norm Assumption

Why Use Propensity Score

Setup

Set up three matrices:

  1. Signal matrix $S \in \mathbb{R}^{m \times n}$.
  2. Noise matrix $W \in \mathbb{R}^{m \times n}$.
  3. Probability matrix $P \in [0, 1]^{m \times n}$, independent of everything else.
  • Suppose we observe a matrix $X^* \in \mathbb{R}^{m \times n}$ without missing entries:
    $$X^* := S + W$$
    • To estimate $S$ with low MSE, $\hat{S}$ can be obtained from
      $$L_{\text{full MSE}}(\hat{S}) := \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} (\hat{S}_{u,i} - X^*_{u,i})^2 \tag{1}$$
  • However, we only observe a matrix $X \in (\mathbb{R} \cup \{\star\})^{m \times n}$, in which each entry is either a revealed rating $r$ or an unrevealed rating $\star$:
    $$X_{u,i} = S_{u,i} + W_{u,i} \text{ with probability } P_{u,i}$$
    • To estimate $S$ with low MSE from the observed entries $\Omega$, we would use
      $$L_{\text{MSE}}(\hat{S}) := \frac{1}{|\Omega|} \sum_{(u,i)\in \Omega} (\hat{S}_{u,i} - X_{u,i})^2 \tag{2}$$
      If the probability of every entry of $X$ being revealed were the same, $P_{u,i} = C$ for all $u$ and $i$, then (2) would be an unbiased estimate of (1).
      However, this assumption does not hold in reality, because the missing/revealed probability differs across $u$ and $i$. Therefore, we use the inverse propensity score (inverse probability) to fix this, as the simulation sketch below illustrates.
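
To make the bias concrete, here is a minimal simulation sketch (my own illustration with synthetic $S$, $W$, and $P$, not taken from the referenced papers) in which entries with larger errors are revealed less often; the naive observed-entry MSE (2) then underestimates the full MSE (1).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 100

S = rng.normal(size=(m, n))             # signal matrix
W = 0.1 * rng.normal(size=(m, n))       # noise matrix
X_star = S + W                          # fully observed ratings X*

S_hat = np.zeros((m, n))                # a deliberately crude estimate of S

# MNAR reveal probabilities: entries where S_hat is accurate are revealed more often.
sq_err = (S_hat - X_star) ** 2
P = np.clip(1.0 / (1.0 + sq_err), 0.05, 0.95)
mask = rng.random((m, n)) < P           # Omega = {(u, i) : mask[u, i]}

full_mse = np.mean((S_hat - X_star) ** 2)          # objective (1)
naive_mse = np.mean((S_hat - X_star)[mask] ** 2)   # objective (2)

print(f"full MSE (1):  {full_mse:.3f}")
print(f"naive MSE (2): {naive_mse:.3f}   # biased downward under this MNAR pattern")
```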

Conditional On Propensity Scores

  • Inverse Propensity Score technique:
    $$L_{\text{IPS-MSE}}(\hat{S}|P) := \frac{1}{mn} \sum_{(u,i)\in \Omega} \frac{(\hat{S}_{u,i}-X_{u,i})^2}{P_{u,i}} \tag{3}$$
    which is an unbiased estimate of (1).
  • Appendix: suppose $P$ is known. To see that $L_{\text{IPS-MSE}}$ in (3) is an unbiased estimator of $L_{\text{full MSE}}(\hat{S})$ in (1), rewrite it as
    $$L_{\text{IPS-MSE}}(\hat{S}|P) = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} 1\{(u, i) \in \Omega\}\, \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}},$$
    then take the expectation with respect to which entries are revealed:
    $$E_{\Omega}\big[L_{\text{IPS-MSE}}(\hat{S}|P)\big] = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} P_{u,i}\, \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}} = L_{\text{full MSE}}(\hat{S}).$$
    A quick simulation check is sketched below.
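
The following minimal sketch (my own, using the same kind of synthetic MNAR setup as above, with $P$ assumed known) checks the unbiasedness empirically: averaging $L_{\text{IPS-MSE}}$ over many independent reveal patterns approaches $L_{\text{full MSE}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 50

S = rng.normal(size=(m, n))
W = 0.1 * rng.normal(size=(m, n))
X_star = S + W
S_hat = np.zeros((m, n))                         # any fixed estimate works here

sq_err = (S_hat - X_star) ** 2
P = np.clip(1.0 / (1.0 + sq_err), 0.05, 0.95)    # MNAR: large errors revealed less often
full_mse = sq_err.mean()                         # objective (1)

ips_draws = []
for _ in range(2000):
    mask = rng.random((m, n)) < P                # a fresh reveal pattern Omega
    ips_draws.append(np.sum(mask * sq_err / P) / (m * n))   # objective (3)

print(f"full MSE (1):     {full_mse:.4f}")
print(f"mean IPS-MSE (3): {np.mean(ips_draws):.4f}")        # matches (1) up to simulation noise
```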

Nuclear Norm Regularization

  • Overall, we add nuclear norm regularization to (3), and the final objective function is the following:
    $$\hat{S} = \argmin_{\Gamma \in \mathbb{R}^{m \times n}} L_{\text{IPS-MSE}}(\Gamma | P) + \lambda ||\Gamma||_{*} \tag{4}$$
    where $\lambda > 0$ is a user-specified parameter and $||\cdot||_*$ denotes the nuclear norm,
    $$||A||_* = \mathrm{trace}\big(\sqrt{A^TA}\big) = \sum^{\min(m, n)}_{i=1} \sigma_i(A).$$
    A proximal-gradient sketch for solving (4) follows.
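
As an illustration of how (4) might be solved, here is a minimal proximal-gradient (ISTA-style) sketch of my own; it is not the solver used in the referenced paper. The gradient step acts on the IPS-weighted squared loss, and the proximal step is singular-value soft-thresholding, the proximal operator of the nuclear norm. All data and parameter values below are synthetic.

```python
import numpy as np

def svd_soft_threshold(Z, thresh):
    """Proximal operator of thresh * ||.||_*: shrink singular values toward zero."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt

def solve_ips_nuclear(X, mask, P, lam=0.01, n_iters=300):
    """Minimize (1/(mn)) * sum_Omega (Gamma - X)^2 / P + lam * ||Gamma||_*  (objective 4)."""
    m, n = X.shape
    # Step size 1/L, where L = 2 * max(1/P) / (mn) bounds the Lipschitz constant
    # of the gradient of the smooth (IPS-weighted) part.
    step = (m * n) / (2.0 * np.max(1.0 / P))
    Gamma = np.zeros((m, n))
    X_filled = np.where(mask, X, 0.0)        # unrevealed entries never enter the loss
    for _ in range(n_iters):
        grad = 2.0 / (m * n) * mask * (Gamma - X_filled) / P
        Gamma = svd_soft_threshold(Gamma - step * grad, step * lam)
    return Gamma

# Tiny usage example on synthetic low-rank data (all values hypothetical).
rng = np.random.default_rng(0)
m, n, r = 60, 40, 3
S = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # low-rank signal
X = S + 0.1 * rng.normal(size=(m, n))
P = rng.uniform(0.2, 0.9, size=(m, n))                  # propensities, assumed known here
mask = rng.random((m, n)) < P
S_hat = solve_ips_nuclear(X, mask, P, lam=0.01)
print("relative reconstruction error:", np.linalg.norm(S_hat - S) / np.linalg.norm(S))
```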

Estimate Propensity Score

For propensity score estimation, this article only discusses two approaches: one-bit matrix completion (1BitMC), covered next, and a Naive Bayes approach, covered afterwards.

1-Bit Matrix Completion

Setup:

  1. $M$: missingness mask matrix used to estimate $P$
     * $M \in \{0,1\}^{m \times n}$
     * $M_{u,i} = 1\{X_{u,i} \ne \star\}$
  2. $A \in \mathbb{R}^{m \times n}$: parameter matrix
     * $A \in \mathcal{F}_{\tau, \gamma} := \{\Gamma \in \mathbb{R}^{m \times n} : ||\Gamma||_* \le \tau \sqrt{mn},\ ||\Gamma||_{\max} \le \gamma\}$, where $\tau > 0$ and $\gamma > 0$ are user-specified parameters.
  3. $P_{u,i} = \sigma(A_{u,i})$, where $\sigma$ can, for instance, be the logistic function $\sigma(x) = \frac{1}{1+e^{-x}}$.

1BitMC is given as follows (a projected-gradient sketch appears after the two steps):

  1. Solve the constrained Bernoulli maximum likelihood problem:
     $$\hat{A} = \argmax_{\Gamma \in \mathcal{F}_{\tau, \gamma}} \sum_{u=1}^m \sum_{i=1}^n \big[ M_{u,i} \log \sigma(\Gamma_{u,i}) + (1 - M_{u,i}) \log(1 - \sigma(\Gamma_{u,i})) \big]$$
  2. Construct the matrix $\hat{P} \in [0,1]^{m \times n}$ with entries $\hat{P}_{u,i} = \sigma(\hat{A}_{u,i})$.
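
Below is a rough projected-gradient-ascent sketch of the two steps above, written by me rather than taken from the referenced paper. It interprets $||\Gamma||_{\max}$ as the entrywise maximum absolute value, and it only approximates the projection onto $\mathcal{F}_{\tau,\gamma}$ by applying the nuclear-norm-ball projection and the entrywise clip one after the other in each iteration (an exact projection onto the intersection is harder). All data are synthetic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def project_l1_ball(v, radius):
    """Euclidean projection of a nonnegative vector v onto {x >= 0 : sum(x) <= radius}."""
    if v.sum() <= radius:
        return v
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_feasible(Gamma, tau, gamma_max):
    """Approximately project onto F_{tau,gamma}: nuclear-norm ball, then entrywise clip."""
    m, n = Gamma.shape
    U, s, Vt = np.linalg.svd(Gamma, full_matrices=False)
    s = project_l1_ball(s, tau * np.sqrt(m * n))     # enforce ||Gamma||_* <= tau*sqrt(mn)
    return np.clip(U @ np.diag(s) @ Vt, -gamma_max, gamma_max)

def one_bit_mc(M, tau, gamma_max, step=1.0, n_iters=300):
    """Return P_hat = sigmoid(A_hat) via projected gradient ascent on the Bernoulli likelihood."""
    A = np.zeros(M.shape)
    for _ in range(n_iters):
        grad = M - sigmoid(A)                        # gradient of the Bernoulli log-likelihood
        A = project_feasible(A + step * grad, tau, gamma_max)
    return sigmoid(A)

# Tiny usage example with a synthetic low-rank parameter matrix (hypothetical values).
rng = np.random.default_rng(0)
m, n, r = 60, 40, 2
A_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) / np.sqrt(r)
M = (rng.random((m, n)) < sigmoid(A_true)).astype(float)
P_hat = one_bit_mc(M, tau=1.5, gamma_max=4.0)
print("mean abs error vs true propensities:", np.abs(P_hat - sigmoid(A_true)).mean())
```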

Takeaway message for 1BitMC

The algorithm applies one-bit matrix completion to the mask matrix: it finds a reconstructed parameter matrix, constrained in nuclear norm (and entrywise magnitude), that maximizes the Bernoulli likelihood, and then maps it through $\sigma$ to obtain the estimated propensities.

Naive Bayes (Tobias Schnabel)

Let’s look at the formula of Naive Bayes:
$$\begin{aligned} \hat{y} & = \argmax_{y \in \mathfrak{Y}} p(y|\mathbf{x};\boldsymbol\theta) \\ & = \argmax_{y \in \mathfrak{Y}} p(\mathbf{x}|y;\boldsymbol\theta)\, p(y) \end{aligned}$$

  • $p(\mathbf{x}|y;\boldsymbol\theta)$ can be compared to a model of tossing a die with $V$ faces $n$ times, which is proportional to a multinomial distribution:
    $$\text{likelihood} = p(\mathbf{x}|y;\boldsymbol\theta) \propto \prod_{i=1}^{V} \theta_{i,y}^{x_i}$$

  • Prior:
    $$p(y) = \theta_y$$

  • Number of parameters in $\boldsymbol\theta = \{\theta_y, \theta_{i,y}\}, \forall i, y$:
    $$K + KV = (V+1)K$$
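
A tiny counting-based illustration of this multinomial model (my own toy data, not from the source): $\theta_y$ and $\theta_{i,y}$ are estimated by counting with Laplace smoothing, and prediction is the argmax of $\log p(y) + \sum_i x_i \log \theta_{i,y}$.

```python
import numpy as np

# Toy corpus: rows are count vectors x over a vocabulary of V = 4 "faces",
# labels y in {0, 1} (K = 2 classes).
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])
K, V = 2, X.shape[1]

theta_y = np.array([(y == k).mean() for k in range(K)])                # prior p(y)
counts = np.array([X[y == k].sum(axis=0) for k in range(K)]) + 1.0     # Laplace smoothing
theta_iy = counts / counts.sum(axis=1, keepdims=True)                  # p(face i | y)

x_new = np.array([1, 0, 2, 2])
log_post = np.log(theta_y) + x_new @ np.log(theta_iy).T                # up to a constant
print("predicted class:", int(np.argmax(log_post)))                    # -> 1
```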

We first need to know some assumptions used in the propensity score Wikipedia article:

  • Strongly ignorable: the potential outcomes are independent of treatment conditional on background variables $X$: $r_0, r_1 \perp Z \mid X$
  • Balancing score: $b(X)$ is a function of the observed covariates $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated $(Z=1)$ and control $(Z=0)$: $Z \perp X \mid b(X)$. The most trivial such function is $b(X)=X$; another is $e(X)$, the propensity score. (My thoughts: we can think of this as a confounder relationship in a three-node graph where $X$ is the confounder affecting both the treatment $Z$ and the response $Y$. Once we condition on $b(X)$, the confounding effect of $X$ on $Z$ is removed.)
  • Propensity score: the probability of a unit being assigned to a particular treatment given a set of observed covariates: $e(x) \stackrel{def}{=} \Pr(Z=1 \mid X=x)$
  • Main theorems:
    • The propensity score $e(X)$ is a balancing score (a derivation sketch follows this list).
    • The propensity score is the coarsest balancing score, because it transforms a (possibly) multidimensional object $X_i$ into a single dimension, while $b(X)=X$ is the finest one.
    • If treatment assignment is strongly ignorable given $X$, then:
      • it is also strongly ignorable given any balancing score. In particular, given the propensity score: $(r_0, r_1 \perp Z \mid e(X))$
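
Here is a short derivation sketch of the first theorem (the standard argument; it is not spelled out in this article) that the propensity score is a balancing score, i.e. $Z \perp X \mid e(X)$:

$$\Pr(Z=1 \mid X, e(X)) = \Pr(Z=1 \mid X) = e(X), \qquad \Pr(Z=1 \mid e(X)) = E\big[\Pr(Z=1 \mid X) \,\big|\, e(X)\big] = E\big[e(X) \mid e(X)\big] = e(X).$$

The two conditional probabilities coincide, so once $e(X)$ is known, $X$ carries no additional information about $Z$, which is exactly $Z \perp X \mid e(X)$.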

Now, we will look at the Naive Bayes approach proposed by Schnabel.
$$P(O_{u,i}=1 \mid Y_{u,i}=r) = \frac{P(Y=r \mid O=1)\, P(O=1)}{P(Y=r)}$$

  • The subscripts $u, i$ are dropped on the right-hand side. This means the parameters are tied across all $u$ and $i$.
  • $P(Y=r \mid O=1)$ and $P(O=1)$: can be obtained by counting observed ratings in the Missing Not At Random (MNAR) data.
  • $P(Y=r)$: we need a sample of Missing Completely At Random (MCAR) data to be combined with the existing MNAR data. A counting-based sketch follows this list.
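
The following counting-based sketch (my own, with a synthetic rating distribution and MNAR mechanism) illustrates the estimator: $P(Y=r \mid O=1)$ and $P(O=1)$ are counted from the MNAR data, $P(Y=r)$ from a small MCAR sample, and Bayes' rule gives one propensity per rating value.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = np.arange(1, 6)                                    # rating scale r = 1..5

# Synthetic MNAR observations: higher ratings are revealed more often.
y_true = rng.choice(ratings, size=20000, p=[0.3, 0.25, 0.2, 0.15, 0.1])
p_reveal = 0.1 + 0.15 * (y_true - 1)                         # hypothetical MNAR mechanism
observed = rng.random(y_true.shape) < p_reveal
y_mnar = y_true[observed]                                    # revealed ratings only

# A small MCAR sample (e.g., ratings gathered for randomly selected items).
y_mcar = rng.choice(ratings, size=2000, p=[0.3, 0.25, 0.2, 0.15, 0.1])

p_o1 = observed.mean()                                               # P(O = 1)
p_y_given_o1 = np.array([(y_mnar == r).mean() for r in ratings])     # P(Y = r | O = 1)
p_y = np.array([(y_mcar == r).mean() for r in ratings])              # P(Y = r), from MCAR

p_o1_given_y = p_y_given_o1 * p_o1 / p_y                     # one propensity per rating value
print("estimated P(O=1 | Y=r):", np.round(p_o1_given_y, 3))  # roughly recovers the next line
print("true reveal probs:     ", 0.1 + 0.15 * (ratings - 1))
```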

Takeaway message for Naive Bayes

1. This algorithm first utilizes a missing-completely-at-random sample to estimate the normalizing term in the denominator, $P(Y=r)$. Second, we should think of the likelihood part $P(Y=r|O=1)$ in Naive Bayes as a die-tossing model where each face is a rating value $r$ from 1 to 5. Consequently, the estimated propensity is the same for all entries with the same rating, rather than a different probability for each user-item pair.

2. Under the propensity score assumptions above, the Naive Bayes approach seems questionable. Strong ignorability says $(Y_0, Y_1 \perp O \mid e(X))$. However, the Naive Bayes approach uses $e(X) = P(O \mid Y)$, which is a bit confusing and, in fact, does not seem to make sense.
References

  • MNAR_1BitMC (Ma)
  • Naive_Bayes (Tobias Schnabel)
