Reinforcement Learning
1 Introduction:
In machine learning, there are three types of learning: supervised learning, reinforcement learning, and unsupervised learning. The differences between them are as follows:
1) Supervised Learning: In supervised learning, a teacher provides the desired response (output) for a given situation (input), and the learner is supposed to learn a mapping from situations to the best responses.
2) Reinforcement Learning (RL): In RL, the learner is given a reward for taking a particular action in a particular situation (state). Based on these rewards, the learner is supposed to learn the best action for each situation.
3) Unsupervised Learning: In unsupervised learning, the learner tries to discover patterns in data representing different situations (inputs).
2 The model of the environment
The environment is modeled as a Markov decision process (MDP), which contains the following parts:
1) States: S = {s1, s2, ..., sn}
2) Actions: A = {a1, a2, ..., am}
3) Immediate reward: R(s)
4) State transition probability: Pr(s'|s, a)
And our goal is to find an optimal policy:
A policy is a mapping from states to actions: π(s)
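To make these parts concrete, here is a minimal sketch in Python of a toy two-state MDP; the states, actions, rewards, and transition probabilities below are invented purely for illustration.

# A minimal sketch of an MDP: a toy two-state world whose states,
# actions, rewards, and transition probabilities are assumptions
# chosen only for illustration.

states = ["s1", "s2"]
actions = ["stay", "move"]

# Immediate reward R(s) for occupying each state.
R = {"s1": 0.0, "s2": 1.0}

# Transition probabilities Pr(s' | s, a).
P = {
    ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}

# A policy maps each state to an action: pi(s).
policy = {"s1": "move", "s2": "stay"}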
3 How to find a policy in an environment
Our goal is to find an optimal policy given a start point and a goal in an environment.
RL methods come in two flavors: in one, the agent fully knows the environment, which is represented as an MDP; in the other, the agent knows only part of the environment.
3.1 The environment (model) is fully known to the agent
1) Value iteration
Strength: It can find the optimal policy.
Weakness: It can be slow to converge, and in some situations it may not converge; it can take many iterations. A sketch follows.
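Here is a minimal value-iteration sketch in Python on the same toy two-state MDP as above; the discount factor and stopping tolerance are assumptions chosen for illustration. Each iteration applies the Bellman backup V(s) = R(s) + γ max_a Σ_s' Pr(s'|s,a) V(s').

# A minimal value-iteration sketch on the assumed toy two-state MDP.
states = ["s1", "s2"]
actions = ["stay", "move"]
R = {"s1": 0.0, "s2": 1.0}                       # immediate reward R(s)
P = {                                            # Pr(s' | s, a)
    ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}
gamma = 0.9                                      # discount factor (assumed)

V = {s: 0.0 for s in states}
for _ in range(1000):
    new_V = {}
    for s in states:
        # Bellman update: V(s) = R(s) + gamma * max_a sum_s' Pr(s'|s,a) V(s')
        new_V[s] = R[s] + gamma * max(
            sum(P[(s, a)][s2] * V[s2] for s2 in states) for a in actions
        )
    # Stop when values change by less than an assumed tolerance.
    done = max(abs(new_V[s] - V[s]) for s in states) < 1e-6
    V = new_V
    if done:
        break

# Extract a greedy policy from the converged values.
policy = {
    s: max(actions, key=lambda a: sum(P[(s, a)][s2] * V[s2] for s2 in states))
    for s in states
}
print(V, policy)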
2) Policy iteration
Modified policy iteration:
Because value iteration can take a long time to converge, we can instead repeatedly evaluate the current policy (in modified policy iteration, only approximately, with a few backup sweeps) and then improve the policy greedily with respect to those values; a sketch follows.
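Here is a minimal policy-iteration sketch on the same assumed toy MDP: it alternates policy evaluation with greedy improvement. The evaluation step iterates to near-convergence; modified policy iteration would simply cap that loop at a few sweeps.

# A minimal policy-iteration sketch on the assumed toy two-state MDP.
states = ["s1", "s2"]
actions = ["stay", "move"]
R = {"s1": 0.0, "s2": 1.0}
P = {
    ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}
gamma = 0.9
policy = {s: "stay" for s in states}             # arbitrary initial policy

while True:
    # Policy evaluation: iterate V(s) = R(s) + gamma * sum_s' Pr(s'|s,pi(s)) V(s')
    # (modified policy iteration would cap this loop at a few sweeps).
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {
            s: R[s] + gamma * sum(P[(s, policy[s])][s2] * V[s2] for s2 in states)
            for s in states
        }
    # Policy improvement: act greedily with respect to V.
    new_policy = {
        s: max(actions, key=lambda a: sum(P[(s, a)][s2] * V[s2] for s2 in states))
        for s in states
    }
    if new_policy == policy:                     # stable policy: done
        break
    policy = new_policy

print(policy, V)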
3.2 The environment (model) is only partially known to the agent
1) Q-Learning
Model-free: we do not need a model of the environment; the learner estimates action values directly from the rewards it receives.
Strength: It does not need to know the transition probabilities Pr or the reward function R.
Weaknesses:
1) it may fail to find the optimal policy;
2) it takes a long time to converge;
3) it does not scale well to large numbers of states.
Note: we can use an ε-greedy strategy, which usually takes the greedy action for state s but occasionally takes a random action so that the agent keeps exploring; see the sketch below.
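Here is a minimal tabular Q-learning sketch with ε-greedy exploration; the step function standing in for the environment, the learning rate α, the discount γ, the exploration rate ε, and the step count are all assumptions chosen for illustration.

import random

# A minimal tabular Q-learning sketch with epsilon-greedy exploration.
states = ["s1", "s2"]
actions = ["stay", "move"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration (assumed)

def step(s, a):
    # Hypothetical environment returning (next_state, reward);
    # Q-learning never looks inside it, it only samples transitions.
    s_next = ("s2" if s == "s1" else "s1") if a == "move" else s
    return s_next, (1.0 if s_next == "s2" else 0.0)

Q = {(s, a): 0.0 for s in states for a in actions}

s = "s1"
for _ in range(10000):
    # Epsilon-greedy: mostly the greedy action, sometimes a random one.
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a_: Q[(s, a_)])
    s_next, r = step(s, a)
    # Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a_)] for a_ in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next

# Act greedily with respect to the learned action values.
policy = {st: max(actions, key=lambda a_: Q[(st, a_)]) for st in states}
print(policy)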
2) Prioritized Sweeping
Model-based: it learns a model of the environment (the transition probabilities Pr and the rewards R) from experience, and focuses its value updates on the states whose values are changing the most, as in the sketch below.
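Here is a minimal prioritized-sweeping sketch under the same assumed toy environment: the agent learns a model (transition counts and average rewards) from experience and uses a priority queue to back up first the states with the largest value changes. The exploration policy, priority rule, and thresholds are simplified assumptions for illustration.

import heapq
import random
from collections import defaultdict

states = ["s1", "s2"]
actions = ["stay", "move"]
gamma = 0.9

def step(s, a):
    # Hypothetical environment returning (next_state, reward);
    # the agent only samples it and never reads these dynamics directly.
    s_next = ("s2" if s == "s1" else "s1") if a == "move" else s
    return s_next, (1.0 if s_next == "s2" else 0.0)

counts = defaultdict(int)        # learned transition counts N(s, a, s')
totals = defaultdict(int)        # visit counts N(s, a)
avg_reward = defaultdict(float)  # running-average reward for (s, a)
V = {s: 0.0 for s in states}

def backup(s):
    # One Bellman backup of V(s) under the learned model; returns |change|.
    q_values = []
    for a in actions:
        if totals[(s, a)] == 0:
            continue                      # no data for this action yet
        expected = sum(counts[(s, a, s2)] / totals[(s, a)] * V[s2]
                       for s2 in states)
        q_values.append(avg_reward[(s, a)] + gamma * expected)
    if not q_values:
        return 0.0
    new_v = max(q_values)
    change = abs(new_v - V[s])
    V[s] = new_v
    return change

queue = []                       # max-priority queue via negated priorities
s = "s1"
for _ in range(2000):
    a = random.choice(actions)            # random exploration (an assumption)
    s_next, r = step(s, a)
    totals[(s, a)] += 1                   # update the learned model
    counts[(s, a, s_next)] += 1
    avg_reward[(s, a)] += (r - avg_reward[(s, a)]) / totals[(s, a)]
    heapq.heappush(queue, (-1.0, s))      # seed the sweep at the visited state
    for _ in range(5):                    # a few prioritized backups per step
        if not queue:
            break
        _, u = heapq.heappop(queue)
        change = backup(u)
        if change > 1e-3:                 # propagate to predecessors of u
            for sp in states:
                for ap in actions:
                    if counts[(sp, ap, u)] > 0:
                        heapq.heappush(queue, (-change, sp))
    s = s_next

print(V)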