a summary of the lecture slides of Prof. Lin Yang (UCLA, ECE 239AS, Spring 2020), with contents from "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto (the textbook).
all graphs and pictures are from the slides and the textbook; ignore any marks added by the platform.
Introduction to RL
Faces of Machine Learning
Characteristics of RL
Reward
Examples
Alternatives: Regret
after T rounds, assume the optimal value (obtained by taking the globally optimal action at each step) is V*;
then we can define the regret as: Regret_T = T · V* − Σ_{t=1}^{T} r_t,
where r_t is the actual reward received at step t.
obviously we can define the learning goal as minimizing the (average) regret, and we can evaluate policies by their average regret, e.g. for a policy with Regret_T = Õ(√T):
Õ (tilde-O) denotes the asymptotic magnitude, ignoring logarithmic factors;
then Regret_T / T → 0, i.e. the policy converges to the optimal solution in the long run (T → ∞).
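The regret computation above can be sketched in a few lines. This is a minimal illustration, assuming the optimal per-step value V* is known and a hypothetical sequence of received rewards:

```python
def regret(optimal_value, rewards):
    """Total regret after T rounds: T * V* minus the sum of rewards actually received."""
    T = len(rewards)
    return T * optimal_value - sum(rewards)

# Example: the optimal action pays 1.0 per round; the agent actually received these rewards.
received = [0.0, 1.0, 1.0, 0.0, 1.0]
total_regret = regret(1.0, received)            # 5 * 1.0 - 3.0 = 2.0
average_regret = total_regret / len(received)   # 0.4
```

If the average regret goes to zero as T grows, the policy's per-step performance approaches the optimum.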
Essence of RL
Sequential Decision Problem (why we cannot rely on myopic greedy options):
Agent and Environment
// or O->A->R, if that notation makes more logical sense.
States
==> [Environment states] may contain irrelevant information, i.e. the actor may not need every bit of information from the environment to make decisions.
==> the simple case, fully observable states: ideally, the world is fully revealed to the actor, and the actor observes all information, based on which it makes decisions.
Information State (Markov State)
i.e. the future state of the world can be derived from the present alone, regardless of the past.
==> the past is relevant but not necessary for the derivation. ==> there is a reliable induction rule based only on the current state.
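In symbols, the Markov property says that the next state depends only on the current state (the standard definition, as in the textbook):

```latex
\mathbb{P}\left[S_{t+1} \mid S_t\right] \;=\; \mathbb{P}\left[S_{t+1} \mid S_1, S_2, \dots, S_t\right]
```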
Markov Decision Process
in contrast to the Markov Decision Process, we have:
Partially Observable Environment
Elements of RL
==> attention: here "model" represents the environment, not a trained agent as in SL and UL
Policy
"The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action." --textbook
Value Function
==> expectedValue = curStepReward + expectedFutureCumulativeVal
"Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state."
"Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary." ==> estimating the value of a state is usually hard and needs repeated evaluation.
"We seek actions that bring about states of highest value, not highest reward" --the textbook
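The "repeated evaluation" of values can be sketched as iterative policy evaluation on a toy MDP. This is a minimal sketch, not from the slides; the 2-state chain and its rewards are made up for illustration:

```python
# Tiny 2-state MDP: from state 0 we always move to state 1 with reward 1,
# from state 1 we always move back to state 0 with reward 0.
gamma = 0.9
# P[s] = list of (probability, next_state, reward) for the policy's action in s
P = {
    0: [(1.0, 1, 1.0)],
    1: [(1.0, 0, 0.0)],
}

V = {0: 0.0, 1: 0.0}
for _ in range(200):
    # Bellman expectation backup: V(s) = sum_s' p * (r + gamma * V(s'))
    V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s]) for s in P}

# Fixed point (closed form): V[0] = 1 + 0.9 * V[1], V[1] = 0.9 * V[0]
# => V[0] = 1 / (1 - 0.81)
```

Each sweep applies the Bellman backup once; because the backup is a contraction (factor gamma), repeated sweeps converge to the true value function.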
Model
the dot is actually:
==> a dummy variable used to compute the most probable next state.
Example
Model: Deterministic Transition and deterministic reward
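A deterministic model like the one in the example can be written as two lookup tables, one for transitions and one for rewards. This is a hypothetical 3-state chain invented for illustration (the state and action names are not from the slides):

```python
# Deterministic model: next_state[s][a] and reward[s][a] fully specify the dynamics.
next_state = {
    "A": {"right": "B"},
    "B": {"right": "C"},
    "C": {"right": "C"},   # absorbing state
}
reward = {
    "A": {"right": 0.0},
    "B": {"right": 1.0},
    "C": {"right": 0.0},
}

def step(s, a):
    """The model's prediction of (next state, reward) for taking action a in state s."""
    return next_state[s][a], reward[s][a]
```

With such a model the agent can plan (e.g. simulate trajectories) without interacting with the real environment.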
Types of RL Agents
•Value Based: Only based on the value function
•Policy Based: Only have a policy
•Actor Critic: Policy and Value
•Model Based: Learns the model and then computes policy/value
•Model Free: Only learns the policy/value without the model
Aspects of (R)Learning
Learning and Planning
they are both fundamental problems in sequential decision making
"Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning." --textbook
planning, however, usually requires a perfect simulator (model) of the environment.
Exploration and Exploitation
Imitation Learning
e.g.