1 Introduction
$$\min_{\beta} \mathbb{E}^{\mathbb{P}^{*}}\left[h_{\beta}(\mathbf{x}, y)\right] \tag{1.1}$$
- we will restrict our attention to the intersection of statistical learning and Distributionally Robust Optimization (DRO) under the Wasserstein metric
- extending from the static single-period setting to a dynamic setting
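As a concrete instance of (1.1), one can approximate the expectation under $\mathbb{P}^{*}$ by a sample average. The sketch below (an illustrative assumption, not from the monograph) uses squared loss $h_{\beta}(\mathbf{x}, y) = (y - \beta^{\top}\mathbf{x})^{2}$, for which the empirical minimizer is ordinary least squares:

```python
import numpy as np

# Illustrative sketch: empirical (sample average) approximation of (1.1)
# assuming squared loss h_beta(x, y) = (y - beta @ x)^2.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Minimizer of the empirical average of the squared loss (least squares).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough samples, `beta_hat` approaches the minimizer under the true data-generating distribution; the chapters below ask what happens when the empirical distribution is not trusted.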
1.1 Robust Optimization
$$\min_{\boldsymbol{\beta}} \max_{\mathbf{z} \in \mathcal{Z}} h_{\boldsymbol{\beta}}(\mathbf{z})$$
- feature uncertainties & label uncertainties
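For some losses the inner maximization has a closed form. A standard example (assumed here for illustration): with absolute loss and feature perturbations bounded in the $\ell_{\infty}$ ball of radius $\rho$, the worst-case loss equals the nominal loss plus $\rho\|\boldsymbol{\beta}\|_{1}$, since the adversary aligns each perturbation coordinate against $\beta_j$. The sketch verifies this against brute force over the corners of the box:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=4)
x = rng.normal(size=4)
y = 0.7
rho = 0.3

# Closed form: max_{||delta||_inf <= rho} |y - beta @ (x + delta)|
#            = |y - beta @ x| + rho * ||beta||_1
worst_closed = abs(y - beta @ x) + rho * np.abs(beta).sum()

# Brute force: the maximum of a convex function over a box is at a corner.
corners = (rho * np.array(s) for s in itertools.product([-1, 1], repeat=4))
worst_brute = max(abs(y - beta @ (x + d)) for d in corners)
```

The outer minimization over `beta` then trades nominal fit against the $\ell_1$ penalty, a first hint of the robustness–regularization connection developed later.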
1.2 Distributionally Robust Optimization
$$\inf_{\boldsymbol{\beta}} \sup_{\mathbb{Q} \in \Omega} \mathbb{E}^{\mathbb{Q}}\left[h_{\boldsymbol{\beta}}(\mathbf{z})\right]$$
- The existing literature on DRO can be split into two main branches:
- a moment ambiguity set
- a ball of distributions
$$\Omega \triangleq \left\{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}) : D\left(\mathbb{Q}, \mathbb{P}_{0}\right) \leq \epsilon\right\}$$
where $\mathbb{P}_{0}$ is a nominal distribution
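On a finite support the inner supremum over such a ball is a tractable optimization. A hedged sketch (illustrative numbers; total variation chosen as one possible $D$, not the monograph's choice): maximize the expected loss over distributions within TV distance $\epsilon$ of $\mathbb{P}_{0}$, written as a linear program with auxiliary variables $t_i \geq |q_i - p_i|$:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.3, 0.2])   # nominal distribution P0 (illustrative)
h = np.array([1.0, 2.0, 5.0])   # per-outcome losses h_beta(z_i)
eps = 0.1                        # ambiguity radius
n = len(p)

# Variables [q_1..q_n, t_1..t_n]; maximize h @ q  ->  minimize -h @ q.
c = np.concatenate([-h, np.zeros(n)])
A_eq = np.concatenate([np.ones(n), np.zeros(n)])[None, :]   # sum(q) = 1
b_eq = [1.0]
I = np.eye(n)
# q_i - t_i <= p_i,  -q_i - t_i <= -p_i  (so t_i >= |q_i - p_i|),
# sum(t) <= 2*eps   (TV distance is half the L1 norm).
A_ub = np.block([[I, -I], [-I, -I], [np.zeros((1, n)), np.ones((1, n))]])
b_ub = np.concatenate([p, -p, [2 * eps]])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (2 * n))
worst_case = -res.fun  # worst-case expectation; nominal is h @ p = 2.1
```

Here the adversary moves `eps = 0.1` of mass from the cheapest outcome to the most expensive one, raising the expectation from 2.1 to 2.5.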
- we adopt the Wasserstein metric to define a data-driven DRO problem
$$\Omega \triangleq \left\{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}) : W_{s, t}\left(\mathbb{Q}, \hat{\mathbb{P}}_{N}\right) \leq \epsilon\right\}$$
$$W_{s, t}\left(\mathbb{Q}, \hat{\mathbb{P}}_{N}\right) \triangleq \left(\min_{\pi \in \mathcal{P}(\mathcal{Z} \times \mathcal{Z})} \int_{\mathcal{Z} \times \mathcal{Z}} \left(s\left(\mathbf{z}_{1}, \mathbf{z}_{2}\right)\right)^{t} \, \mathrm{d}\pi\left(\mathbf{z}_{1}, \mathbf{z}_{2}\right)\right)^{1/t}$$
where $s$ is a cost metric on $\mathcal{Z}$ and the minimum is taken over joint distributions $\pi$ whose marginals are $\mathbb{Q}$ and $\hat{\mathbb{P}}_{N}$
- We choose the Wasserstein metric for two main reasons:
- On one hand, the Wasserstein ambiguity set is rich enough to contain both continuous and discrete relevant distributions
- On the other hand, measure concentration results guarantee that the Wasserstein set contains the true data-generating distribution with high confidence for a sufficiently large sample size
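The measure concentration point can be seen empirically. In one dimension with $s(z_1, z_2) = |z_1 - z_2|$ and $t = 1$, `scipy.stats.wasserstein_distance` computes $W_{s,1}$ between empirical samples; a sketch (a large sample stands in for the true distribution):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D order-1 Wasserstein distance between an empirical distribution and a
# large-sample proxy for the true data-generating distribution.
rng = np.random.default_rng(0)
reference = rng.normal(size=100_000)  # proxy for the true distribution
d_small = wasserstein_distance(rng.normal(size=50), reference)
d_large = wasserstein_distance(rng.normal(size=5_000), reference)
```

The distance shrinks as the sample size grows, so for a given confidence level the radius $\epsilon$ can be taken smaller with more data.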
1.3 Outline
- The learning problems that are studied in this monograph include:
- Distributionally Robust Linear Regression (DRLR), which estimates a robustified linear regression plane by minimizing the worst-case expected absolute loss over a probabilistic ambiguity set characterized by the Wasserstein metric;
- Groupwise Wasserstein Grouped LASSO (GWGL), which aims at inducing sparsity at a group level when there exists a predefined grouping structure for the predictors, through defining a specially structured Wasserstein metric for DRO;
- Distributionally Robust Multi-Output Learning, which solves a DRO problem with a multi-dimensional response/label vector, generalizing the single-output model addressed in DRLR;
- Optimal decision making using DRLR informed K-Nearest Neighbors (K-NN) estimation, which selects the optimal action from a set of candidates by predicting the outcome under each action using K-NN with a distance metric weighted by the DRLR solution;
- Distributionally Robust Semi-Supervised Learning, which estimates a robust classifier from partially labeled data, either by (i) restricting the marginal distribution to be consistent with the unlabeled data, or by (ii) modifying the structure of DRO to allow the center of the ambiguity set to vary, reflecting the uncertainty in the labels of the unsupervised data;
- Distributionally Robust Reinforcement Learning, which considers Markov Decision Processes (MDPs) and seeks to inject robustness into the probabilistic transition model, deriving a lower bound for the distributionally robust value function in a regularized form.
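To preview the DRLR item: Wasserstein DRO formulations of this kind typically reduce to a regularized empirical problem, with the ball radius appearing as a penalty weight. The sketch below assumes that form (empirical absolute loss plus a norm penalty on the coefficients; the exact penalty norm depends on the chosen metric and is not specified here) and solves it with a generic derivative-free optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch of a regularized equivalent of DRLR: empirical absolute
# loss plus eps times a norm of beta, with eps playing the role of the
# Wasserstein ball radius.  Illustrative data, not from the monograph.
rng = np.random.default_rng(0)
N = 300
X = rng.normal(size=(N, 2))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + 0.05 * rng.normal(size=N)
eps = 0.01  # ambiguity radius / penalty weight (illustrative)

def robust_objective(beta):
    return np.mean(np.abs(y - X @ beta)) + eps * np.linalg.norm(beta)

# Powell handles the non-smooth absolute loss without gradients.
res = minimize(robust_objective, x0=np.zeros(2), method="Powell")
```

The small penalty shrinks the estimate slightly toward zero, which is exactly the robustness-induced regularization effect the DRLR chapter formalizes.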