a summary of the lecture slides of Prof. Lin Yang (UCLA, ECE 239AS, Spring 2020), with contents from "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto (the textbook).
all graphs and pictures are from the slides and the textbook; ignore any marks added by the platform.
Introduction to RL
Faces of Machine Learning
Characteristics of RL
Reward
Examples
Alternatives: Regret
after T rounds, assume the optimal value (obtained by taking the globally optimal action at each step) is V*;
then we can define the regret as: Regret_T = T · V* − Σ_{t=1}^{T} r_t,
where r_t is the actual reward received at step t.
obviously we can define the learning goal as minimizing the (average) regret, and we can evaluate policies by their average regret, e.g. for a policy with Regret_T = Õ(√T):
Õ (tilde-O) denotes the asymptotic magnitude, ignoring logarithmic factors;
then Regret_T / T → 0, i.e. the policy converges to the optimal solution in the long run (T → ∞).
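The regret computation above can be sketched in a few lines. This is a minimal illustration, assuming the optimal per-step value V* is known and a hypothetical sequence of received rewards:

```python
def regret(optimal_value, rewards):
    """Total regret after T rounds: T * V* minus the sum of rewards actually received."""
    T = len(rewards)
    return T * optimal_value - sum(rewards)

# Example: the optimal action pays 1.0 per round; the agent actually received these rewards.
received = [0.0, 1.0, 1.0, 0.0, 1.0]
total_regret = regret(1.0, received)            # 5 * 1.0 - 3.0 = 2.0
average_regret = total_regret / len(received)   # 0.4
```

If the average regret goes to zero as T grows, the policy's per-step performance approaches the optimum.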
Essence of RL
Sequential Decision Problem (why we cannot rely on myopic greedy options):
Agent and Environment
// or O->A->R, if that notation makes more logical sense.
States
==> [Environment states] may contain irrelevant information, i.e. the actor may not need every bit of information from the environment to make decisions.
==> the simple case, fully observable states: ideally, the world is fully revealed to the actor, and the actor observes all information, based on which it makes decisions.
Information State (Markov State)
i.e. the future state of the world can be derived from the present alone, regardless of the past.
==> the past is relevant but not necessary for the derivation. ==> there is a reliable induction rule based only on the current state.
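In symbols, the Markov property says that the next state depends only on the current state (the standard definition, as in the textbook):

```latex
\mathbb{P}\left[S_{t+1} \mid S_t\right] \;=\; \mathbb{P}\left[S_{t+1} \mid S_1, S_2, \dots, S_t\right]
```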
Markov Decision Process
in contrast to the Markov Decision Process, we have:
Partially Observable Environment
Elements of RL
==> attention: here "model" represents the environment, not a trained agent as in SL and UL
Policy
"The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action." --textbook
Value Function
==> expectedValue = curStepReward + expectedFutureCumulativeVal
"Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state."
"Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary." ==> estimating the value of a state is usually hard and needs repeated evaluation.
"We seek actions that bring about states of highest value, not highest reward" --the textbook
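The "repeated evaluation" of values can be sketched as iterative policy evaluation on a toy MDP. This is a minimal sketch, not from the slides; the 2-state chain and its rewards are made up for illustration:

```python
# Tiny 2-state MDP: from state 0 we always move to state 1 with reward 1,
# from state 1 we always move back to state 0 with reward 0.
gamma = 0.9
# P[s] = list of (probability, next_state, reward) for the policy's action in s
P = {
    0: [(1.0, 1, 1.0)],
    1: [(1.0, 0, 0.0)],
}

V = {0: 0.0, 1: 0.0}
for _ in range(200):
    # Bellman expectation backup: V(s) = sum_s' p * (r + gamma * V(s'))
    V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s]) for s in P}

# Fixed point (closed form): V[0] = 1 + 0.9 * V[1], V[1] = 0.9 * V[0]
# => V[0] = 1 / (1 - 0.81)
```

Each sweep applies the Bellman backup once; because the backup is a contraction (factor gamma), repeated sweeps converge to the true value function.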
Model
the dot is actually:
==> a dummy variable used to compute the most probable next state.
Example
Model: Deterministic Transition and deterministic reward
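A deterministic model like the one in the example can be written as two lookup tables, one for transitions and one for rewards. This is a hypothetical 3-state chain invented for illustration (the state and action names are not from the slides):

```python
# Deterministic model: next_state[s][a] and reward[s][a] fully specify the dynamics.
next_state = {
    "A": {"right": "B"},
    "B": {"right": "C"},
    "C": {"right": "C"},   # absorbing state
}
reward = {
    "A": {"right": 0.0},
    "B": {"right": 1.0},
    "C": {"right": 0.0},
}

def step(s, a):
    """The model's prediction of (next state, reward) for taking action a in state s."""
    return next_state[s][a], reward[s][a]
```

With such a model the agent can plan (e.g. simulate trajectories) without interacting with the real environment.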
Types of RL Agents
•Value Based: Only based on the value function
•Policy Based: Only have a policy
•Actor Critic: Policy and Value
•Model Based: Learns the model and then computes policy/value
•Model Free: Only learns the policy/value without the model
Aspects of (R)Learning
Learning and Planning
they are both fundamental problems in sequential decision making
"Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning." --textbook
planning, however, usually requires a perfect simulator (model) of the environment.
Exploration and Exploitation
Imitation Learning
e.g.