Why Use Propensity Score
Setup
Set up three matrices:
- signal matrix $S \in \mathbb{R}^{m \times n}$
- noise matrix $W \in \mathbb{R}^{m \times n}$
- probability matrix $P \in [0, 1]^{m \times n}$, independent of everything else
- Suppose we observe a matrix $X^* \in \mathbb{R}^{m \times n}$ without missing entries:

$$X^* := S + W$$

- In order to estimate $S$ with low MSE, an estimate $\hat{S}$ can be obtained by minimizing

$$L_{\text{full MSE}}(\hat{S}) := \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} (\hat{S}_{u,i} - X^*_{u,i})^2 \tag{1}$$
- However, we only observe a matrix $X \in (\mathbb{R} \cup \{\star\})^{m \times n}$, in which each entry is either a revealed rating $r$ or an unrevealed rating $\star$:

$$X_{u,i} = S_{u,i} + W_{u,i} \text{ with probability } P_{u,i}$$

- Let $\Omega$ be the set of revealed entries $(u, i)$. In order to estimate $S$ with low MSE, an estimate $\hat{S}$ can be obtained from

$$L_{\text{MSE}}(\hat{S}) := \frac{1}{|\Omega|} \sum_{(u,i)\in \Omega} (\hat{S}_{u,i} - X_{u,i})^2 \tag{2}$$
If the probability of every entry in $X$ being revealed is the same, $P_{u,i} = C$ for all $u$ and $i$, then (2) is an unbiased estimate of (1).
However, this assumption won't hold in reality, because the missing/revealed probability differs across entries $(u, i)$. Therefore, we need to use the inverse propensity score (inverse probability) to address this issue.
Conditional On Propensity Scores
- Inverse Propensity Score technique:

$$L_{\text{IPS-MSE}}(\hat{S} \mid P) := \frac{1}{mn} \sum_{(u,i)\in \Omega} \frac{(\hat{S}_{u,i}-X_{u,i})^2}{P_{u,i}} \tag{3}$$

  which is an unbiased estimate of (1).
- Appendix: suppose $P$ is known. Then $L_{\text{IPS-MSE}}$ in (3) is an unbiased estimator of $L_{\text{full MSE}}(\hat{S})$ in (1). Rewrite (3) with an indicator over all entries:

$$L_{\text{IPS-MSE}}(\hat{S} \mid P) = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} \mathbb{1}\{(u, i) \in \Omega\} \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}}$$

  Then take the expectation with respect to which entries are revealed:

$$E_{\Omega}[L_{\text{IPS-MSE}}(\hat{S} \mid P)] = \frac{1}{mn} \sum^m_{u=1} \sum^n_{i=1} P_{u,i} \frac{(\hat{S}_{u,i}-X^*_{u,i})^2}{P_{u,i}} = L_{\text{full MSE}}(\hat{S})$$
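The unbiasedness claim can be checked with a quick Monte Carlo simulation. This is a sketch with synthetic data: the entry-dependent reveal probabilities below are made up for illustration, and they are deliberately correlated with the error so that the naive estimator (2) is visibly biased while the IPS estimator (3) is not.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 50
S = rng.normal(size=(m, n))                  # signal matrix
W = rng.normal(size=(m, n))                  # noise matrix
X_star = S + W                               # fully observed matrix of eq. (1)
S_hat = S                                    # pretend our estimate equals the signal
err2 = (S_hat - X_star) ** 2                 # per-entry squared errors

# Made-up reveal probabilities that depend on the error: entries with
# large error are revealed less often, which biases the naive estimator.
P = np.where(err2 < np.median(err2), 0.8, 0.2)

full_mse = err2.mean()                       # L_full-MSE in (1)

trials = 4000
naive, ips = [], []
for _ in range(trials):
    omega = rng.random((m, n)) < P           # random reveal mask Omega
    naive.append(err2[omega].mean())                      # eq. (2)
    ips.append((err2[omega] / P[omega]).sum() / (m * n))  # eq. (3)

print(f"full MSE: {full_mse:.3f}  naive: {np.mean(naive):.3f}  IPS: {np.mean(ips):.3f}")
```

Averaged over many random reveal patterns, the IPS estimate matches the full MSE, while the naive estimate is systematically too low here because high-error entries are under-sampled.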
Nuclear Norm Regularization
- Overall, we add nuclear norm regularization to (3), and the final objective function is

$$\hat{S} = \operatorname*{argmin}_{\Gamma \in \mathbb{R}^{m \times n}} L_{\text{IPS-MSE}}(\Gamma \mid P) + \lambda \|\Gamma\|_{*} \tag{4}$$

  where $\lambda > 0$ is a user-specified parameter and $\|\cdot\|_*$ denotes the nuclear norm, $\|A\|_* = \operatorname{trace}(\sqrt{A^{\top}A}) = \sum^{\min(m, n)}_{i=1} \sigma_i(A)$.
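The two expressions for the nuclear norm agree, and can be checked numerically on a small random matrix (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))

# Nuclear norm as the sum of singular values.
singular_values = np.linalg.svd(A, compute_uv=False)
nuc_svd = singular_values.sum()

# Equivalent form trace(sqrt(A^T A)): the eigenvalues of A^T A are the
# squared singular values, so summing their square roots matches.
nuc_trace = np.sqrt(np.clip(np.linalg.eigvalsh(A.T @ A), 0, None)).sum()

print(nuc_svd, nuc_trace, np.linalg.norm(A, "nuc"))
```

All three values coincide; `np.linalg.norm(A, "nuc")` is NumPy's built-in nuclear norm.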
Estimate Propensity Score
We only discuss the one-bit matrix completion approach to propensity score estimation in this article.
1bit Matrix Completion
Setup:
- $M$: missingness mask matrix used to estimate $P$
  * $M \in \{0, 1\}^{m \times n}$
  * $M_{u,i} = \mathbb{1}\{X_{u,i} \ne \star\}$
- $A \in \mathbb{R}^{m \times n}$: parameter matrix
  * $A \in \mathcal{F}_{\tau, \gamma} := \{\Gamma \in \mathbb{R}^{m \times n} : \|\Gamma\|_* \le \tau \sqrt{mn},\ \|\Gamma\|_{\max} \le \gamma\}$, where $\tau > 0$ and $\gamma > 0$ are user-specified parameters
- $P_{u,i} = \sigma(A_{u,i})$, where $\sigma$ can, for instance, be the logistic function $\sigma(x) = \frac{1}{1+e^{-x}}$
1BitMC is given as follows:

1. Solve the constrained Bernoulli maximum likelihood problem:

$$\hat{A} = \operatorname*{argmax}_{\Gamma \in \mathcal{F}_{\tau, \gamma}} \sum_{u=1}^m \sum_{i=1}^n \left[ M_{u,i} \log \sigma(\Gamma_{u,i}) + (1 - M_{u,i}) \log(1 - \sigma(\Gamma_{u,i})) \right]$$

2. Construct the matrix $\hat{P} \in [0,1]^{m \times n}$ by setting $\hat{P}_{u,i} = \sigma(\hat{A}_{u,i})$.
Takeaway message for 1BitMC
The algorithm applies one-bit matrix completion to the mask matrix: it finds a reconstructed matrix, constrained in nuclear norm, that maximizes the Bernoulli likelihood.
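The two steps above can be sketched in code. Note this is a simplified stand-in, not the paper's exact algorithm: it replaces the constraint set $\mathcal{F}_{\tau,\gamma}$ with a nuclear norm penalty, solved by proximal gradient ascent (singular value thresholding); `lam`, `lr`, and the rank-1 ground truth are made-up illustration choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def svt(A, tau):
    # Singular value thresholding: prox operator of tau * nuclear norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def one_bit_mc(M, lam=0.5, lr=1.0, steps=300):
    # Proximal gradient ascent on the Bernoulli log-likelihood of the
    # mask M, with a nuclear norm penalty standing in for F_{tau,gamma}.
    A = np.zeros_like(M, dtype=float)
    for _ in range(steps):
        grad = M - sigmoid(A)          # gradient of the log-likelihood in A
        A = svt(A + lr * grad, lr * lam)
    return sigmoid(A)                  # step 2: P_hat = sigma(A_hat)

rng = np.random.default_rng(2)
m, n = 40, 40
A_true = np.outer(rng.normal(size=m), rng.normal(size=n))  # rank-1 parameters
P_true = sigmoid(A_true)
M = (rng.random((m, n)) < P_true).astype(float)            # observed mask

P_hat = one_bit_mc(M)
```

The low-rank shrinkage is what makes this meaningful: unconstrained per-entry maximum likelihood would just drive $\sigma(A_{u,i})$ to the binary mask values.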
Naive Bayes (Tobias Schnabel)
Let’s look at the formula of Naive Bayes:

$$\begin{aligned} \hat{y} & = \operatorname*{argmax}_{y \in \mathfrak{Y}} p(y \mid \mathbf{x}; \boldsymbol{\theta}) \\ & = \operatorname*{argmax}_{y \in \mathfrak{Y}} p(\mathbf{x} \mid y; \boldsymbol{\theta})\, p(y) \end{aligned}$$
- $p(\mathbf{x} \mid y; \boldsymbol{\theta})$ can be compared to a model of tossing a die with $V$ faces $n$ times, which is proportional to a multinomial distribution:

$$\text{likelihood} = p(\mathbf{x} \mid y; \boldsymbol{\theta}) \propto \prod_{i=1}^{V} \theta_{i, y}^{x_i}$$

- Prior: $p(y) = \theta_y$
- Number of parameters in $\boldsymbol{\theta} = \{\theta_y, \theta_{i, y}\}, \forall i, y$: $K + KV = (V + 1)K$
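A minimal worked example of the argmax rule above, with made-up values for $\boldsymbol{\theta}$ (here $V = 3$ faces and $K = 2$ classes, so $(V+1)K = 8$ parameters):

```python
import numpy as np

# theta[i, y] = p(face i | class y); columns sum to 1. Made-up values.
theta = np.array([[0.7, 0.2],
                  [0.2, 0.3],
                  [0.1, 0.5]])
prior = np.array([0.6, 0.4])          # p(y) = theta_y
x = np.array([3, 1, 0])               # observed face counts for one sample

# log p(x|y) + log p(y), dropping the multinomial coefficient (constant in y)
log_post = x @ np.log(theta) + np.log(prior)
y_hat = int(np.argmax(log_post))
print(y_hat)
```

Working in log space avoids underflow when the counts $x_i$ get large; the multinomial coefficient can be dropped because it does not depend on $y$.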
We first need to know some assumptions used in the propensity score wiki:
- Strong ignorability: the potential outcomes are independent of treatment conditional on background variables $X$: $r_0, r_1 \perp Z \mid X$
- Balancing score: $b(x)$ is a function of the observed covariates $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated $(Z=1)$ and control $(Z=0)$: $Z \perp X \mid b(X)$. The most trivial such function is $b(X) = X$; another is $e(X)$, the propensity score. (My thoughts: we can think of this as a confounder relation in a three-node graph, where the confounder $X$ affects both the treatment $Z$ and the response $Y$. Once we condition on $b(X)$, the confounding effect of $X$ on $Z$ is removed.)
- Propensity score: the probability of a unit being assigned to a particular treatment, given a set of observed covariates: $e(x) \stackrel{\text{def}}{=} \Pr(Z=1 \mid X=x)$
- Main theorems:
  - The propensity score $e(x)$ is a balancing score.
  - The propensity score is the coarsest balancing score, because it transforms a (possibly) multidimensional object $X_i$ into one dimension, while $b(X) = X$ is the finest one.
  - If treatment assignment is strongly ignorable given $X$, then it is also strongly ignorable given any balancing score. In particular, given the propensity score: $r_0, r_1 \perp Z \mid e(X)$
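The balancing property can be illustrated numerically: after weighting each unit by the inverse of its propensity $e(X)$, the covariate distribution no longer differs between treated and control. A sketch with a single synthetic covariate and a known (made-up) logistic propensity:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50000
X = rng.normal(size=N)                        # observed covariate
e_true = 1.0 / (1.0 + np.exp(-0.8 * X))       # propensity e(X) = Pr(Z=1|X)
Z = (rng.random(N) < e_true).astype(float)    # treatment assignment

# Unweighted covariate means differ between groups (X predicts Z)...
raw_gap = X[Z == 1].mean() - X[Z == 0].mean()

# ...but inverse-propensity weighting balances X across groups,
# illustrating that e(X) is a balancing score.
w = Z / e_true + (1 - Z) / (1 - e_true)
mean_treated = np.sum(w * Z * X) / np.sum(w * Z)
mean_control = np.sum(w * (1 - Z) * X) / np.sum(w * (1 - Z))
weighted_gap = mean_treated - mean_control
print(f"raw gap: {raw_gap:.3f}  IPW gap: {weighted_gap:.3f}")
```

In practice $e(X)$ is unknown and must itself be estimated (e.g. by logistic regression); here it is known by construction so that only the balancing effect is on display.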
Now, we will look at the Naive Bayes approach proposed by Schnabel:

$$P(O_{u,i}=1 \mid Y_{u,i}=r) = \frac{P(Y=r \mid O=1)\, P(O=1)}{P(Y=r)}$$
- The subscripts $u, i$ are dropped, meaning that the parameters are tied across all $u$ and $i$.
- $P(Y=r \mid O=1)$ and $P(O=1)$ can be obtained by counting observed ratings in the Missing Not At Random (MNAR) data.
- $P(Y=r)$: we need a sample of Missing Completely At Random (MCAR) data to be combined with the existing MNAR data.
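A small simulation of this counting scheme, with made-up reveal probabilities, showing that combining MNAR counts with a small MCAR sample recovers $P(O=1 \mid Y=r)$ via the Bayes formula above:

```python
import numpy as np

rng = np.random.default_rng(4)
n_pairs = 200000
ratings = rng.integers(1, 6, size=n_pairs)        # true ratings Y in {1..5}

# Made-up ground truth: higher ratings are revealed more often (MNAR).
p_reveal = np.array([0.0, 0.1, 0.2, 0.3, 0.5, 0.7])  # indexed by rating
observed = rng.random(n_pairs) < p_reveal[ratings]

mnar = ratings[observed]                          # revealed (MNAR) ratings
mcar = ratings[rng.random(n_pairs) < 0.05]        # small MCAR sample

p_o = observed.mean()                             # P(O=1)
est = {}
for r in range(1, 6):
    p_y_given_o = (mnar == r).mean()              # P(Y=r | O=1), MNAR counts
    p_y = (mcar == r).mean()                      # P(Y=r), MCAR counts
    est[r] = p_y_given_o * p_o / p_y              # Bayes: P(O=1 | Y=r)
    print(r, round(est[r], 3), p_reveal[r])
```

Note the estimate is one number per rating value $r$, not per $(u, i)$ pair, exactly because the parameters are tied across users and items.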
Takeaway message for Naive Bayes
1. The algorithm first uses a Missing Completely At Random sample to estimate the normalizing term in the denominator. Second, we should think of the likelihood term $P(Y=r \mid O=1)$ in Naive Bayes as a die-tossing model where each side is a rating value $r$ from 1 to 5. So the estimated probability is the same for all entries with the same rating, rather than different for different user-item pairs.
2. Under the propensity score assumptions, the Naive Bayes approach seems questionable. Strong ignorability says $(Y_0, Y_1 \perp O \mid e(X))$. However, the Naive Bayes approach uses $e(X) = P(O \mid Y)$, conditioning the missingness on the outcome itself, which is a bit confusing and arguably does not satisfy the assumption.
Reference
MNAR_1BitMC(Ma)
Naive_Bayes(Tobias Schnabel)