Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization



Feb. 23, 2021


$\underline{\text{Aim}}$

In this paper, an efficient Bandit Online Linear Optimization algorithm is proposed, which achieves an optimal $O^*(T^{\frac{1}{2}})$ regret. The existence of such an efficient algorithm had already been posed as an open question in several papers. This paper exploits a self-concordant barrier as the potential function to overcome the difficulties encountered in previous studies.

$\underline{\text{Background}}$

A sequential decision making problem, termed "the multi-armed bandit problem", comes from the following model: on each round in a sequence, a gambler must pull the arm on one of several slot machines ("one-armed bandits"), each of which returns a reward chosen stochastically from a fixed distribution. The gambler does not know the best arm a priori; his goal is to maximize the reward of his strategy relative to the reward he would have received had he known the optimal arm.

Several authors have proposed a very natural generalization of the multi-armed bandit problem to the field of convex optimization, called "bandit linear optimization". In this setting we imagine that, on each round $t$, an adversary chooses some linear function $f_t(\cdot)$ which is not revealed to the player. The player then chooses a point $\mathbf{x}_t$ within some given convex set $\mathcal{K} \subset \mathbb{R}^n$. The player then suffers $f_t(\mathbf{x}_t)$, and only this quantity is revealed to him. This process continues for $T$ rounds, and at the end the learner's payoff is his regret:
$$R_{T}=\sum_{t=1}^{T} f_{t}\left(\mathbf{x}_{t}\right)-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} f_{t}\left(\mathbf{x}^{*}\right)$$

In the full-information model, it has been known for some time that the optimal regret bound is $O(T^{\frac{1}{2}})$. It had been conjectured that this $O(T^{\frac{1}{2}})$ bound also holds for the bandit version. However, several previously proposed algorithms achieve only $O(T^{\frac{3}{4}})$ or $O(T^{\frac{2}{3}})$. The one algorithm that achieves $O(\mathrm{poly}(n)\,T^{\frac{1}{2}})$ is, unfortunately, not efficient.

This paper proposes an algorithm which is efficient and achieves an $O(\mathrm{poly}(n)\,T^{\frac{1}{2}})$ regret bound. Moreover, the paper discovers a link between Bregman divergences and self-concordant barriers: divergence functions provide the right perspective for the problem of managing uncertainty given limited feedback.

$\underline{\text{Brief Project Description}}$

The terms "full-information version" and "bandit version" were mentioned above. They will be explained here after the definition of an online linear optimization problem, which is defined as the following repeated game between the learner (Player) and the environment (Adversary).

At each time step $t=1$ to $T$:

$\bullet$ Player chooses $\mathbf{x}_t\in\mathcal{K}$.
$\bullet$ Adversary independently chooses $\mathbf{f}_t\in\mathbb{R}^n$.
$\bullet$ Player suffers loss $\mathbf{f}_t^\top\mathbf{x}_t$ and observes feedback $\Im$.

In this game, the Player's goal is to minimize his regret $R_T$, defined as

$$R_{T}:=\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}_{t}-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}^{*}$$
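As a concrete illustration (not taken from the paper), this regret is easy to evaluate numerically once the loss vectors, the player's plays, and a finite set of candidate comparators in $\mathcal{K}$ are available. The function name `regret` and the grid approximation of the best fixed point are assumptions of this sketch.

```python
import numpy as np

def regret(loss_vectors, plays, candidates):
    """Regret over T rounds for linear losses.

    loss_vectors: (T, n) array of f_t
    plays:        (T, n) array of x_t
    candidates:   (m, n) grid of points in K approximating the comparator set
    """
    player_loss = np.sum(np.einsum("tn,tn->t", loss_vectors, plays))
    cumulative = loss_vectors.sum(axis=0)           # sum_t f_t
    best_fixed = np.min(candidates @ cumulative)    # min_u sum_t f_t^T u
    return player_loss - best_fixed
```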

In the full-information version, the Player may observe the entire function $\mathbf{f}_t$ as his feedback $\Im$ and can exploit this in making his decisions. In the bandit version, by comparison, the Player only observes the scalar value $\mathbf{f}_t^\top\mathbf{x}_t$ after he has made the decision $\mathbf{x}_t$ at that round.

Though the algorithm proposed in this paper deals with the bandit version of the problem, it is still reasonable to make use of a reduction to the full-information setting, as any algorithm aiming for low regret in the bandit setting would necessarily have to achieve low regret given full information. For example, the well-known Follow The Leader (FTL) strategy uses the rule "select the best choice so far":
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}} \sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}. \qquad (1)$$
And the Follow The Regularized Leader (FTRL) strategy adds a regularization term $\mathcal{R}$:
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}}\left[\sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}+\lambda \mathcal{R}(\mathbf{x})\right]. \qquad (2)$$
Given that $\mathcal{R}$ is convex and differentiable, the general form of the FTRL update is as follows:
$$\overline{\mathbf{x}}_{t+1}=\nabla \mathcal{R}^{*}\left(\nabla \mathcal{R}\left(\overline{\mathbf{x}}_{t}\right)-\eta \mathbf{f}_{t}\right), \qquad (3)$$
followed by a projection onto $\mathcal{K}$ with respect to the divergence $D_\mathcal{R}$:
$$\mathbf{x}_{t+1}=\arg \min _{\mathbf{u} \in \mathcal{K}} D_{\mathcal{R}}\left(\mathbf{u}, \overline{\mathbf{x}}_{t+1}\right).$$
Here $\mathcal{R}^*$ is the Fenchel conjugate of $\mathcal{R}$ and $\eta$ is a step-size parameter. This procedure is known as mirror descent.
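A minimal sketch of update (3) plus the projection, assuming the quadratic regularizer $\mathcal{R}(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{x}\|^2$ (for which $\nabla\mathcal{R}$ and $\nabla\mathcal{R}^*$ are both the identity and $D_\mathcal{R}$ is the squared Euclidean distance) and taking $\mathcal{K}$ to be the unit ball. The function names are illustrative choices, not the paper's.

```python
import numpy as np

def project_unit_ball(u):
    """Bregman projection onto K; Euclidean projection for this choice of R."""
    norm = np.linalg.norm(u)
    return u if norm <= 1.0 else u / norm

def mirror_descent_step(x_t, f_t, eta):
    """x_bar = grad R*(grad R(x_t) - eta f_t), then project onto K.
    With R(x) = 0.5*||x||^2, both gradient maps are the identity."""
    x_bar = x_t - eta * f_t
    return project_unit_ball(x_bar)

# Example: play against random loss vectors for a few rounds.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.zeros(3)
    for t in range(5):
        f = rng.normal(size=3)      # adversary's (here random) loss vector
        loss = f @ x                # loss suffered this round
        x = mirror_descent_step(x, f, eta=0.1)
```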

For an online learning algorithm $\mathcal{A}$, "explore or exploit" is a serious problem. The Player first chooses some full-information online learning algorithm $\mathcal{A}$. $\mathcal{A}$ receives input vectors $\mathbf{f}_1,\cdots, \mathbf{f}_t$ corresponding to previously observed functions, and returns some point $\mathbf{x}_{t+1}\in\mathcal{K}$ to predict. In the bandit setting, however, $\mathcal{A}$ can only be fed estimates $\tilde{\mathbf{f}}_1,\cdots,\tilde{\mathbf{f}}_t$ of the true loss vectors, realizations of random vectors, and these estimates become more accurate the more widely the Player explores. Here comes the dilemma of "explore or exploit": whether to follow the advice of $\mathcal{A}$ and predict $\mathbf{x}_t$, or to try to estimate $\mathbf{f}_t$ by sampling in a wide region of $\mathcal{K}$, possibly hurting the performance on the given round. This exploration-exploitation trade-off is the primary source of difficulty in obtaining $O(T^{\frac{1}{2}})$ guarantees on the regret.

Roughly two categories of approaches, namely Alternating Explore/Exploit and Simultaneous Explore/Exploit, perform both exploration and exploitation. The first category fails to obtain the desired $O(\mathrm{poly}(n)\,T^{\frac{1}{2}})$ bound, so the second one is the focus here. Two Simultaneous-Explore/Exploit-type algorithms, proposed by Auer et al. [1] and Flaxman et al. [2] respectively, are reviewed. Both follow the same schedule: query $\mathcal{A}$ for $\mathbf{x}_t$ and construct a random vector $\bm{X}_t$ such that $\mathbb{E}(\bm{X}_t) = \mathbf{x}_t$; then construct $\tilde{\mathbf{f}}_t$ randomly based on the outcome of $\bm{X}_t$ and the observed value $\mathbf{f}_t^\top\bm{X}_t$, as in the sketch below.
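A hedged sketch of this schedule in the spirit of Flaxman et al. [2]: perturb the point $\mathbf{x}_t$ returned by $\mathcal{A}$ and build a one-point unbiased estimate of the linear loss $\mathbf{f}_t$. The sampling radius `delta` and the helper names are illustrative choices, not the papers' exact constructions; note how the $1/\delta$ factor in the estimate already hints at the variance problem discussed next.

```python
import numpy as np

def sample_unit_sphere(n, rng):
    """Uniform direction on the unit sphere in R^n."""
    u = rng.normal(size=n)
    return u / np.linalg.norm(u)

def explore_and_estimate(x_t, f_t, delta, rng):
    """Play X_t = x_t + delta*u with E[X_t] = x_t, observe the scalar f_t^T X_t,
    and return an estimate f_tilde with E[f_tilde] = f_t for linear losses
    (E[u] = 0 and E[u u^T] = I/n give unbiasedness)."""
    n = x_t.shape[0]
    u = sample_unit_sphere(n, rng)
    X_t = x_t + delta * u                  # randomized play, unbiased for x_t
    observed = f_t @ X_t                   # the only feedback the Player sees
    f_tilde = (n / delta) * observed * u   # one-point estimate of f_t
    return X_t, f_tilde
```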

It is pointed out in the paper that the estimates $\tilde{\mathbf{f}}_t$ in both methods scale inversely with the distance of $\mathbf{x}_t$ to the boundary, which implies a high variance of the estimated functions. Indeed, the regret of most full-information algorithms scales linearly with the magnitude of the functions played by the environment. If we restrict our search to a regularization algorithm of type (2), the expected regret can be shown to equal an expression involving $\mathbb{E}\, D_{\mathcal{R}}\left(\mathbf{x}_{t}, \mathbf{x}_{t+1}\right)$ terms. For $\mathcal{R}(\mathbf{x}) \propto\|\mathbf{x}\|^{2}$, the paper recovers the method of Flaxman et al. with its insurmountable hurdle of $\mathbb{E}\left\|\tilde{\mathbf{f}}_{t}\right\|^{2}$.

The main result of this paper is an algorithm for online linear optimization in the bandit setting for an arbitrary compact convex set $\mathcal{K}$, which is as follows:

[Figure: Algorithm 1 pseudocode from the paper.]
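Since the pseudocode figure is not reproduced here, the following is a hedged reconstruction of Algorithm 1 as the paper describes it: at each round, sample along an eigendirection of the Hessian of the barrier $\mathcal{R}$ at the current point (i.e. on its Dikin ellipsoid), build an unbiased estimate $\tilde{\mathbf{f}}_t$ of $\mathbf{f}_t$, and run FTRL with $\mathcal{R}$ itself as regularizer. To keep the sketch self-contained it is specialized to the box $[-1,1]^n$ with its standard logarithmic barrier; the helper names and the damped-Newton inner solver are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def barrier_grad(x):
    """Gradient of R(x) = -sum_i [log(1 - x_i) + log(1 + x_i)] on (-1, 1)^n."""
    return 1.0 / (1.0 - x) - 1.0 / (1.0 + x)

def barrier_hess_diag(x):
    """Diagonal of the Hessian of the box barrier (its eigenvalues)."""
    return 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2

def ftrl_argmin(f_sum, eta, x0, steps=50):
    """Approximate argmin of  eta * f_sum^T x + R(x)  by damped Newton steps;
    the barrier keeps the iterates strictly inside the box."""
    x = x0.copy()
    for _ in range(steps):
        g = eta * f_sum + barrier_grad(x)
        h = barrier_hess_diag(x)
        x = x - 0.5 * g / h               # damped Newton step (diagonal Hessian)
        x = np.clip(x, -0.999, 0.999)     # stay in the domain of the barrier
    return x

def bandit_linear_opt(loss_vectors, eta, rng):
    """Sketch of Algorithm 1 on K = [-1,1]^n; loss_vectors has shape (T, n)."""
    n = loss_vectors.shape[1]
    x = np.zeros(n)                       # argmin of the box barrier
    f_sum = np.zeros(n)
    total_loss = 0.0
    for f_t in loss_vectors:
        lam = barrier_hess_diag(x)        # eigenvalues of the (diagonal) Hessian
        i = rng.integers(n)               # random eigendirection
        eps = rng.choice([-1.0, 1.0])
        e_i = np.zeros(n)
        e_i[i] = 1.0
        y = x + eps * lam[i] ** -0.5 * e_i          # play on the Dikin ellipsoid
        observed = f_t @ y                          # bandit feedback: one scalar
        total_loss += observed
        f_tilde = n * observed * eps * lam[i] ** 0.5 * e_i  # unbiased estimate of f_t
        f_sum += f_tilde
        x = ftrl_argmin(f_sum, eta, x)              # FTRL step with barrier regularizer
    return total_loss
```

The key point of the design is that the sampling ellipsoid adapts to the local geometry of $\mathcal{K}$ through $\nabla^2\mathcal{R}(\mathbf{x}_t)$, which is what keeps the variance of $\tilde{\mathbf{f}}_t$ under control near the boundary.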
In Section 4 the regularization framework is discussed in detail and it is shown how the regret can be expressed in terms of Bregman divergences. The theory and main properties of self-concordant functions are presented in Section 5. In Section 6, several key elements of the proof of the regret bound of the proposed algorithm are given. In Section 7 the paper shows how this algorithm can be used for one interesting case, namely the bandit version of the Online Shortest Path problem. The precise analysis of the algorithm is given in Section 8. Finally, Section 9 discusses the implementation of the algorithm.

The main result of the paper is as follows:

Theorem 1 Let $\mathcal{K}$ be a convex set and $\mathcal{R}$ be a $\vartheta$-self-concordant barrier on $\mathcal{K}$. Let $\mathbf{u}$ be any vector in $\mathcal{K}' = \mathcal{K}_{T^{-1/2}}$. Suppose we have the property that $\left|\mathbf{f}_{t}^{\top} \mathbf{x}\right| \leq 1$ for any $\mathbf{x}\in\mathcal{K}$. Setting $\eta=\frac{\sqrt{\vartheta \log T}}{4 n \sqrt{T}}$, the regret of Algorithm 1 is bounded as
$$\mathbb{E} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{y}_{t} \leq \min _{\mathbf{u} \in \mathcal{K}^{\prime}} \mathbb{E}\left(\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{u}\right)+16 n \sqrt{\vartheta T \log T}$$
whenever $T>8 \vartheta \log T$.

Here the definitions of the scaled version of $\mathcal{K}$ and of a $\vartheta$-self-concordant barrier are used. The scaled version of $\mathcal{K}$ is defined as
$$\mathcal{K}_{\delta}=\left\{\mathbf{u}: \pi_{\mathbf{x}_{1}}(\mathbf{u}) \leq(1+\delta)^{-1}\right\}$$
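Here $\pi_{\mathbf{x}_1}$ denotes the Minkowski function (gauge) of $\mathcal{K}$ with pole at the algorithm's first iterate $\mathbf{x}_1$; in the standard convention it is defined as
$$\pi_{\mathbf{x}_{1}}(\mathbf{u})=\inf \left\{t \geq 0: \mathbf{x}_{1}+\frac{\mathbf{u}-\mathbf{x}_{1}}{t} \in \mathcal{K}\right\},$$
so $\mathcal{K}_{\delta}$ is simply a copy of $\mathcal{K}$ shrunk towards $\mathbf{x}_1$ by a factor $(1+\delta)^{-1}$. For bounded linear losses, competing against $\mathcal{K}_{T^{-1/2}}$ instead of $\mathcal{K}$ costs only an additional $O(\sqrt{T})$ in the regret.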
To define a $\vartheta$-self-concordant barrier, first we give the definition of a self-concordant function:

Definition (self-concordant function) A self-concordant function $\mathcal{R}: \operatorname{int} \mathcal{K} \rightarrow \mathbb{R}$ is a $C^3$ convex function such that
$$\left|D^{3} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}, \mathbf{h}]\right| \leq 2\left(D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right)^{3 / 2}$$
Here, the third-order differential is defined as

$$D^{3} \mathcal{R}(\mathbf{x})\left[\mathbf{h}_{1}, \mathbf{h}_{2}, \mathbf{h}_{3}\right] := \left.\frac{\partial^{3}}{\partial t_{1} \partial t_{2} \partial t_{3}}\right|_{t_{1}=t_{2}=t_{3}=0} \mathcal{R}\left(\mathbf{x}+t_{1} \mathbf{h}_{1}+t_{2} \mathbf{h}_{2}+t_{3} \mathbf{h}_{3}\right)$$
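As a quick sanity check (a standard example, not taken from the paper), the one-dimensional function $\mathcal{R}(x)=-\log x$ on $(0,\infty)$ is self-concordant: here $D^{2}\mathcal{R}(x)[h,h]=h^{2}/x^{2}$ and $D^{3}\mathcal{R}(x)[h,h,h]=-2h^{3}/x^{3}$, so
$$\left|D^{3} \mathcal{R}(x)[h,h,h]\right| = \frac{2|h|^{3}}{x^{3}} = 2\left(\frac{h^{2}}{x^{2}}\right)^{3/2} = 2\left(D^{2} \mathcal{R}(x)[h,h]\right)^{3/2},$$
and the defining inequality holds with equality.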

Now we can define a $\vartheta$-self-concordant barrier as follows:

Definition ($\vartheta$-self-concordant barrier) A $\vartheta$-self-concordant barrier $\mathcal{R}$ is a self-concordant function with
$$|D \mathcal{R}(\mathbf{x})[\mathbf{h}]| \leq \vartheta^{1 / 2}\left[D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right]^{1 / 2}.$$
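A standard example from interior-point theory (again not specific to this paper): for a polytope $\mathcal{K}=\{\mathbf{x}: \mathbf{a}_i^\top\mathbf{x} \le b_i,\ i=1,\dots,m\}$, the logarithmic barrier
$$\mathcal{R}(\mathbf{x})=-\sum_{i=1}^{m}\log\left(b_i-\mathbf{a}_i^\top\mathbf{x}\right)$$
is an $m$-self-concordant barrier, i.e. $\vartheta=m$. The parameter $\vartheta$ thus measures, roughly, the complexity of the set, and it enters the regret bound of Theorem 1 through the $\sqrt{\vartheta}$ factor.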

$\underline{\text{Significance of Paper}}$

This is the first paper to achieve both efficiency and an $O(\mathrm{poly}(n)\sqrt{T})$ regret bound. The $O(\sqrt{T})$ bound was previously known as a regret bound for the full-information model, and now it holds for the bandit setting as well. This is surely a breakthrough, since what a player can observe at the end of each round in the bandit setting is far less than in the full-information setting. Also, as the paper reviews, only bounds such as $O(T^{3/4})$ and $O(T^{2/3})$ were obtained in quite a few previous papers. Now this "goal" bound is finally achieved efficiently.

$\text{\Large Reference}$

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2003.

[2] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.

[3] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. 2009.
