MDP (Markov decision process)
Goal: find a policy -- maximize the expected cumulative reward
Difficulty: the reward function is unknown, and so is the transition probability P;
Key metric: sample complexity / sample efficiency
Design sample-efficient RL algorithms with powerful function approximators (e.g., neural networks)
Offline RL: learning from a fixed dataset; no interaction; (pessimistic value iteration)
Online RL: learning from interaction; aims to collect a good dataset; exploration vs. exploitation (optimistic value iteration)
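
In standard finite-horizon notation (the symbols H, r, s_t, a_t below are the usual ones, added here for reference, not from the notes), the goal line above reads:

    \max_{\pi} \; J(\pi), \qquad J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=1}^{H} r(s_t, a_t)\Big]

where both the reward r and the transition kernel P are unknown and must be estimated from samples.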
===================================================================
(1) Offline RL
Problem: insufficient data coverage & uncertainty
Insufficient coverage: we do not know what consequences executing a given policy would have, in particular a policy that deviates freely from the one that collected the data;
Reason: we cannot collect any more data
-- Remedy: quantify the uncertainty of the estimates computed from the data
Cause: epistemic uncertainty (uncertainty due to limited knowledge) -- produces -- spurious correlations (a rarely observed action can look good purely by chance)
Pessimism -- penalize epistemic uncertainty to eliminate spurious correlations
For example: where the uncertainty is large, subtract it to obtain a lower confidence bound, so that better-covered, more certain actions are chosen;
Algorithm: Pessimistic LSVI (least-squares value iteration)
Added steps: quantify uncertainty; construct a pessimistic value function (see the sketch after this block)
[Figure]
Not required: any data coverage assumption;
Function approximators: linear, kernel, neural network
Algorithm: uncertainty quantification + least-squares value iteration
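
A minimal sketch of one backup step of pessimistic LSVI under a linear function approximator, assuming a d-dimensional feature map phi(s, a); the names phi, beta, lam are illustrative, and the penalty is the standard elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi), not necessarily the authors' exact construction:

import numpy as np

def pessimistic_lsvi_step(phi, rewards, next_values, beta, lam=1.0):
    """One backup step of pessimistic LSVI with linear features.

    phi         : (n, d) features phi(s_i, a_i) from the offline dataset
    rewards     : (n,) observed rewards r_i
    next_values : (n,) estimated next-state values V_{h+1}(s'_i)
    beta        : scale of the uncertainty penalty (illustrative)
    lam         : ridge-regression regularizer
    """
    n, d = phi.shape
    # Least-squares fit: phi(s, a)^T w ~ r + V_{h+1}(s')
    Lambda = phi.T @ phi + lam * np.eye(d)
    w = np.linalg.solve(Lambda, phi.T @ (rewards + next_values))
    Lambda_inv = np.linalg.inv(Lambda)

    def q(features):
        # Elliptical uncertainty: large wherever the dataset gives poor coverage
        bonus = beta * np.sqrt(
            np.einsum('...i,ij,...j->...', features, Lambda_inv, features))
        # Pessimism: subtract the bonus, yielding a lower confidence bound on Q
        # (the full algorithm also truncates Q into a valid range; omitted here)
        return features @ w - bonus

    return q

Acting greedily with respect to this lower-confidence-bound Q steers the learned policy toward well-covered actions, which is exactly how pessimism removes spurious correlations.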
===================================================================
(2) Online RL:
The problem faced is the exact opposite of offline learning: since we can keep exploring, the task is to balance exploration and exploitation;
Deep exploration:
use uncertainty as a bonus reward, making the agent more eager to explore novel states;
Problem: how to carry out deep exploration efficiently
Algorithm: Optimistic LSVI
Added steps: quantify uncertainty; construct an optimistic value function (upper confidence bound); see the sketch below
Result: achieves polynomial sample complexity
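
The online counterpart mirrors the offline sketch above, differing only in the sign of the bonus; this is the UCB-style construction (as in LSVI-UCB), again with illustrative names:

import numpy as np

def optimistic_lsvi_step(phi, rewards, next_values, beta, lam=1.0):
    """One backup step of optimistic LSVI with linear features (UCB-style)."""
    n, d = phi.shape
    Lambda = phi.T @ phi + lam * np.eye(d)
    w = np.linalg.solve(Lambda, phi.T @ (rewards + next_values))
    Lambda_inv = np.linalg.inv(Lambda)

    def q(features):
        bonus = beta * np.sqrt(
            np.einsum('...i,ij,...j->...', features, Lambda_inv, features))
        # Optimism: ADD the bonus -- rarely visited state-action pairs look
        # attractive, which drives deep exploration of novel states
        return features @ w + bonus

    return q

Collecting data with the greedy policy of this upper-confidence-bound Q and refitting after each episode is what gives the polynomial sample-complexity guarantee mentioned above (under linear-MDP assumptions).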
===================================================================
Summary: Pessimism & Optimism are mirror images -- both quantify epistemic uncertainty; offline RL subtracts it (lower confidence bound), online RL adds it (upper confidence bound)