Reinforcement Learning Notes I: Intro. to RL

a summary of the lecture slides of Prof. Lin Yang (UCLA, ECE 239AS, Spring 2020), with content from "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto (referred to below as "the textbook").

all graphs and pictures are from the slides and the textbook; please ignore any marks added by the hosting platform.

Introduction to RL

Faces of Machine Learning

 

Characteristics of RL

 

Reward

Examples

Alternatives: Regret

after T rounds, assume the optimal value (obtained by taking the globally optimal action at each step) is:

then we can define regret as:

where r_t is the actual reward received at each step.
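as a rough sketch of the standard definitions (the exact formulas are shown on the slides; r^*_t below is just shorthand, introduced here, for the reward of the globally optimal action at step t):

V^*_T = \sum_{t=1}^{T} r^*_t, \qquad \text{Regret}(T) = V^*_T - \sum_{t=1}^{T} r_t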

obviously we can define the learning goal as minimizing (ave.) regret; and we can evaluate policies by ave. regret, e.g. for:

\tilde{O} denotes the order of magnitude, ignoring logarithmic factors;

then we know that the policy will converge to the optimal solution in the long run (T->Inf).
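for instance (a standard example, not necessarily the exact bound shown on the slides): if a policy's total regret satisfies \text{Regret}(T) = \tilde{O}(\sqrt{T}), then its average regret is \text{Regret}(T)/T = \tilde{O}(1/\sqrt{T}) \to 0 as T \to \infty, which is exactly the long-run convergence to the optimal solution claimed above.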

 

Essence of RL

Sequential Decision Problem (why we cannot rely on myopic greedy options):
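a minimal toy illustration of this point (my own numbers, in addition to the slide's example below): a greedy choice that maximizes the immediate reward can lock the agent out of a much larger delayed reward.

# two-step toy problem: action "a" pays 1 now but leads to a dead end,
# action "b" pays 0 now but unlocks a reward of 10 at the next step.
rewards = {
    "a": [1, 0],   # immediate reward, then nothing
    "b": [0, 10],  # delayed reward
}

greedy = max(rewards, key=lambda act: rewards[act][0])        # looks one step ahead -> "a"
farsighted = max(rewards, key=lambda act: sum(rewards[act]))  # total return -> "b"

print(greedy, sum(rewards[greedy]))          # a 1
print(farsighted, sum(rewards[farsighted]))  # b 10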

 

Agent and Environment

// or O->A->R, if that ordering of the notation makes more logical sense.
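a minimal runnable sketch of one episode of this agent-environment loop (ToyEnv and agent_policy are made-up placeholders, not anything from the slides): the agent sees an observation, picks an action, and the environment replies with a reward and the next observation.

import random

class ToyEnv:
    """stand-in environment: 5 steps, reward 1 whenever the agent picks action 1"""
    def reset(self):
        self.t = 0
        return self.t                         # first observation O_1
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # R_{t+1}
        done = self.t >= 5                    # episode ends after 5 steps
        return self.t, reward, done           # O_{t+1}, R_{t+1}, termination flag

def agent_policy(observation):
    return random.choice([0, 1])              # placeholder (random) policy

env = ToyEnv()
obs, done, episode_return = env.reset(), False, 0.0
while not done:                               # the O -> A -> R loop
    action = agent_policy(obs)
    obs, reward, done = env.step(action)
    episode_return += reward
print("episode return:", episode_return)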

States

==> [Environment states] may contain irrelevant information, i.e. the actor may not need every bit of info from the env. to make decisions.

==> the simple (fully observable) case: ideally, the world is fully revealed to the actor and the actor observes all of the information, based on which it makes decisions.

Information State (Markov State)

i.e. the future state of the world can be derived from the present alone, regardless of the past.

==> the past is relevant but not necessary for the derivation, because the current state already summarizes it ==> there is a reliable induction rule based only on the current state.
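written formally, the Markov property says that the current state carries all the information the history does:

P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \ldots, S_t]

i.e. "the future is independent of the past given the present".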

Markov Decision Process

in contrast to the (fully observable) Markov Decision Process, we have:

Partially Observable Environment

 

Elements of RL

==> note: "model" here represents the environment, rather than a trained model/agent as in supervised learning (SL) and unsupervised learning (UL)

 

Policy, \pi(state_of_agent)

"The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action." --textbook

 

Value Function

==> expectedValue = curStepReward + (discounted) expectedCumulativeFutureVal

"Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state." 

"Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary." ==> usually estimate the value of a state can be hard and needs repeated evaluation.

"We seek actions that bring about states of highest value, not highest reward" --the textbook

 

Model

the dot is actually:

==> a dummy (placeholder) variable ranging over possible next states, e.g. used to pick out the most probable next state.
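assuming the dot on the slide stands for the next state in P(\cdot \mid s, a), the model is usually written as a transition probability plus an expected reward (standard notation, not copied from the slide):

\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = E[R_{t+1} \mid S_t = s, A_t = a]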

 

Example

Model: Deterministic Transition and deterministic reward
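a minimal sketch of what "deterministic transition and deterministic reward" means, using a made-up 3-state chain (not the example from the slides): each (state, action) pair maps to exactly one next state and exactly one reward.

# deterministic model: plain lookup tables instead of probability distributions
transition = {               # (state, action) -> next state
    ("s0", "right"): "s1",
    ("s1", "right"): "s2",
    ("s1", "left"):  "s0",
}
reward = {                   # (state, action) -> reward
    ("s0", "right"): 0.0,
    ("s1", "right"): 1.0,
    ("s1", "left"):  0.0,
}

state = "s0"
for action in ["right", "right"]:   # follow a fixed action sequence
    print(state, action, "->", transition[(state, action)], "reward", reward[(state, action)])
    state = transition[(state, action)]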

Types of RL Agents

• Value Based: only based on the value function
• Policy Based: only has a policy
• Actor Critic: has both a policy and a value function
• Model Based: learns the model and then computes the policy/value
• Model Free: only learns the policy/value, without the model

 

Aspects of (Reinforcement) Learning

Learning and Planning

they are both fundamental problems in sequential decision making

"Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning." --textbook

while planning usually requires a perfect simulator (i.e. a known model of the environment).

Exploration and Exploitation

Imitation Learning

e.g.

 
