Notes on Chapter 1: Introduction

1 Summary:

1.1 Definition:

Reinforcement learning trains a learning agent that interacts with its environment by observing, acting, and receiving rewards, so as to maximize the long-run accumulated reward, whose expectation is also known as the value.

Reinforcement learning uses the formal framework of Markov decision processes to define the interaction between a learning agent and its environment in terms of states, actions, and rewards.
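
In the book's standard notation (a sketch using the discounted formulation introduced later in the text), the accumulated reward starting at time t is the return G_t, and a state's value under policy π is its expected return:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots, \qquad v_\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right]
$$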

1.2 Features:

  • Trial-and-error search
  • Closed loop (observe, act, reward)
  • Delayed (sequential) reward
  • Explicitly considers the whole problem
  • Trade-off between exploitation and exploration
  • Evaluative rather than instructive feedback.

1.3 Elements:

Learning agent, environment, policy, reward, value, model of the environment (optional)

  1. The policy defines the way an agent acts: it is a mapping from perceived states of the environment to actions, and it may be stochastic.
  2. The reward defines the goal of the problem. It is a number given to the agent as a (possibly stochastic) function of the state of the environment and the action taken.
  3. The value function specifies what is good in the long run: the value of a state is the total reward the agent can expect to accumulate starting from that state. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades.
  4. A model mimics the environment to facilitate planning. Not all reinforcement learning algorithms have a model; those without one cannot plan, must rely on trial and error, and are called model-free (see the sketch after this list).
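
A minimal sketch of how these elements fit together in one closed interaction loop (the environment interface env.reset() / env.step() and all names below are my own assumptions, not the book's code; this particular agent is model-free):

```python
def run_episode(env, policy, values, alpha=0.1):
    """One closed loop of observe -> act -> reward for a model-free agent.

    Assumed interface: env.reset() returns an initial state, and
    env.step(action) returns (next_state, reward, done). policy maps a
    state to an action, and values is a dict of value-function estimates.
    """
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # policy: state -> action
        next_state, reward, done = env.step(action)  # reward defines the goal
        # value function: move V(state) toward reward + V(next_state)
        v, v_next = values.get(state, 0.0), values.get(next_state, 0.0)
        values[state] = v + alpha * (reward + v_next - v)
        state = next_state
    return values
```

A model-based agent would additionally learn a model that predicts next_state and reward, and use it to plan before acting.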

1.4 Comparison with other methods:

  • Reinforcement learning (learn from interaction; maximize the long-run accumulated reward over time)
  • Supervised learning (learn from labeled data; generalize the relation between the data and their labels)
  • Unsupervised learning (learn from unlabeled data; find hidden structure, e.g. by clustering the unlabeled data)

2 Questions:

2.1 Q1: Do exploratory moves result in learning in Figure 1.1?

This question was posted on Piazza and was also a point of confusion for me. The endorsed answers are summarized as follows:

  • (1) An exploratory move does not result in learning anything about the “source” state (in this case board state “d”). Mainly, if “e” ends up being worse than “e*”, we do not want to discount the current value of state “d”; “d” is still just as good as it was before. What the exploratory move does do is explore and learn more about state “e” (instead of “e*”), which may well influence what is done the next time “d” is encountered.
  • (2) Remember that states are updated based on the current evaluation of the next state, so it is actually impossible for an exploratory action to give a better value than the greedy action (the update rule is written out after this list).
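
For reference, the value update used in the book's tic-tac-toe example moves the estimated value of the earlier state toward that of the later state (a temporal-difference update with step size α):

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]
$$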

I actually hold somewhat different opinions on these two points:

  • (1) It is said that the exploratory move does not update the source node, but I think the source node should be c (or c*) in Figure 1.1, because we should only assign values to the states reached after our own move. For example, in the sequence a–b–c, a–b is the observation that the opponent moves to b from a; we then take the greedy action (move to c) and receive the reward (here it may mean the increased probability of winning, or the increasingly accurate estimate of the value function), and we update the values of all the earlier states. I think this is how to obey the closed loop of “observation–action–reward”. I am not sure whether this is correct.
  • (2) As to why the exploratory move does not update the preceding source node, and why the updated value of e must be smaller than the value of the greedy choice e* before e is updated, I am still confused. Suppose the random initial values are such that e* and e start out very close, with e only slightly smaller than e*. If they are arbitrarily close, then for any update that increases e we can choose an initial gap between e and e* smaller than that increase. Thinking in this extreme, limiting way, it seems possible for the updated value of e to end up larger than the original initial value of e*.

3 Exercises:

Exercise 1.1: Self-Play

It could lead to a stronger learning agent, but some bad cases should be carefully avoided, in which the two players settle into fixed or cyclic behaviors.

Exercise 1.2: Symmetries

Using symmetry can greatly reduce the number of states and the memory required. However, if the opponent does not take advantage of the symmetry, we should not take it either, because symmetric configurations then actually represent different situations; alternatively, one could explore using an extra flag to record which situation each symmetric configuration was mapped from.
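
A minimal sketch of exploiting the symmetry, assuming the board is stored as a 3×3 tuple of tuples: each position is reduced to a canonical representative of its 8 rotations and reflections, so that all of them share one entry in the value table (the code and names are my own illustration, not the book's):

```python
from typing import Tuple

Board = Tuple[Tuple[str, ...], ...]  # 3x3 grid of 'X', 'O', or ' '

def rotate(board: Board) -> Board:
    """Rotate the board 90 degrees clockwise."""
    return tuple(zip(*board[::-1]))

def reflect(board: Board) -> Board:
    """Mirror the board left to right."""
    return tuple(row[::-1] for row in board)

def canonical(board: Board) -> Board:
    """Return a fixed representative of the board's 8 symmetric variants,
    so that all of them map to a single key in the value table."""
    variants = []
    b = board
    for _ in range(4):
        b = rotate(b)
        variants.append(b)
        variants.append(reflect(b))
    return min(variants)
```

If the opponent's play is not symmetric, the canonical key could be paired with an extra flag recording which of the variants the actual position was, as mentioned above.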

Exercise 1.3: Greedy Play

As shown later in the book, there is a trade-off between exploitation and exploration. If we only apply the greedy strategy, some potentially better actions may never be taken.
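
One common way to make the trade-off explicit is ε-greedy selection: act greedily most of the time, but occasionally pick a random move (a sketch with hypothetical names; values maps candidate afterstates to estimates, with unseen states defaulting to 0.5 as in the tic-tac-toe example):

```python
import random

def epsilon_greedy(values, candidate_states, epsilon=0.1):
    """With probability epsilon pick a random candidate afterstate (explore),
    otherwise pick the one with the highest current value estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(candidate_states)
    return max(candidate_states, key=lambda s: values.get(s, 0.5))
```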

Exercise 1.4: Learning from Exploration

Learning from exploration should lead to more wins in the long run. By exploring more, the agent becomes more capable of performing well on nonstationary, long-horizon problems.

Exercise 1.5: Other Improvements

Use adaptive learning rates, strike a more reasonable balance between exploitation and exploration, and use an artificial neural network (ANN) to generalize from experience.

This article is for self-learners. If you are taking a course, please do not copy this note.
