Chapter 1: Introduction
1 Summary:
1.1 Definition:
Reinforcement learning trains a learning agent that interacts with its environment by observing, acting, and receiving rewards, in order to maximize the long-run accumulated reward, also known as the expected return or value.
Reinforcement learning uses the formal framework of Markov decision processes to define the interaction between a learning agent and its environment in terms of states, actions, and rewards.
1.2 Features:
- Trial-and-error search
- Closed loop (observe, act, reward)
- Delayed (sequential) reward
- Explicitly considers the whole problem
- Trade-off between exploitation and exploration
- Evaluative rather than instructive feedback.
1.3 Elements:
Learning agent, environment, policy, reward, value, model of the environment (optional)
- A policy defines how the agent acts: it is a mapping from perceived states of the world to actions, and it may be stochastic.
- Reward defines the goal of the problem. A number given to the agent as a (possibly stochastic) function of the state of the environment and the action taken.
- A value function specifies what is good in the long run; the agent seeks to maximize the expected long-run reward. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades.
- Model mimics the environment to facilitate planning. Not all reinforcement learning algorithms have a model (if they don’t then they can’t plan, i.e. must use trial and error, and are called model free).
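The interplay of these elements (policy, reward, value, environment) can be sketched on a toy problem. The 5-state corridor below and all of its numbers are illustrative assumptions of mine, not code from the book:

```python
# Toy sketch: a random policy walks a corridor of states 0..4; reaching
# state 4 yields reward 1. A tabular value function is learned by moving
# V(s) toward r + V(s_next) after each step.
import random

N_STATES = 5          # states 0..4; state 4 is terminal with reward 1
ALPHA = 0.1           # step-size for the value update
V = [0.0] * N_STATES  # tabular value function, one entry per state

random.seed(0)
for episode in range(500):
    s = 0
    while s < N_STATES - 1:
        a = random.choice([-1, +1])                # policy: random left/right move
        s_next = min(max(s + a, 0), N_STATES - 1)  # environment transition
        r = 1.0 if s_next == N_STATES - 1 else 0.0 # reward defines the goal
        # value update: shift V(s) toward the backed-up target r + V(s_next)
        V[s] += ALPHA * (r + V[s_next] - V[s])
        s = s_next
```

After training, states closer to the goal have learned values near the achievable return, which is what the value function is for: summarizing what is good in the long run.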
1.4 Comparison with other methods:
- Reinforcement learning (learn from interaction, maximize long-run accumulated rewards over a certain time period)
- Supervised learning (learn from labeled data, generalize the relation between data and its labels)
- Unsupervised learning (learn from unlabeled data, find hidden structure, namely cluster unlabeled data)
2 Questions:
2.1 Q1: Does an exploratory move result in learning in Figure 1.1?
This question was posted on Piazza and was also a point of confusion for me. The endorsed answers are summarized as follows:
- (1) An exploratory move doesn't result in learning anything about the "source" state (in this case board state "d"). Mainly, if "e" ends up being worse than "e*", we don't want to discount the current value of state "d" – "d" is still just as good as it was before. What the move does do is explore and learn more about state "e" (instead of "e*"), which may well influence what is done the next time "d" is encountered.
- (2) Remember that states are updated based on the current evaluation of the next state, so it’s actually impossible for an exploratory action to give a better value than the greedy action.
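Point (2) can be made concrete with the book's value update, V(s) ← V(s) + α[V(s′) − V(s)]. The state names mirror Figure 1.1, but the numbers below are made up purely for illustration:

```python
# Illustrative values: state d and its two candidate successors e* and e.
alpha = 0.5
V = {"d": 0.60, "e_star": 0.70, "e": 0.65}

# The greedy move from d goes to the successor with the highest value.
greedy_next = max(["e_star", "e"], key=lambda s: V[s])  # -> "e_star"

# Backing up from the greedy successor: move V(d) toward V(e*).
V["d"] += alpha * (V[greedy_next] - V["d"])

# An exploratory move would back up from e instead. Since the greedy
# choice maximizes V over successors, V["e"] <= V[greedy_next] at the
# moment of the update, so the exploratory backup target can never
# exceed the greedy one.
```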
I actually hold somewhat different opinions on these two points:
- (1) It is said that an exploratory move does not update the source node, but I think the source node should be c/c* in Figure 1.1, because we should only assign values to states reached after our own move. For example, in the sequence a–b–c, a–b is the observation that the opponent will move to b from a; we then take the greedy action (move to c), and then receive the reward (here it may denote the increased chance of winning, or the increasingly accurate estimate of the value function), and we update the values of all the earlier states. I think this obeys the closed loop of "observation–action–reward", but I am not sure if it is correct.
- (2) As for why an exploratory move does not update the former source node, and why the updated value of e is necessarily smaller than the greedy selection e* before e is updated, I am still confused. Suppose we randomly initialize the state values so that e and e* start very close, with e slightly smaller than e*. If they are arbitrarily close, then for any update that increases e we can always choose an initial gap between e and e* that is smaller than that increase. Taking this limiting argument, it seems possible for the updated value of e to exceed the initial value of e*.
3 Exercises:
Exercise 1.1: Self-Play
It could lead to a stronger learning agent, but some failure cases must be carefully avoided, in which the two copies settle into fixed or cyclic behavior.
Exercise 1.2: Symmetries
Using symmetry can greatly reduce the state space and memory required. However, if the opponent does not take advantage of the symmetry, we should not take advantage of it either, because symmetric configurations then represent genuinely different situations. Alternatively, one could keep the symmetric mapping but add an extra flag recording which situation each symmetric configuration was mapped from.
Exercise 1.3: Greedy Play
As shown later in the book, there is a trade-off between exploitation and exploration. If we apply only the greedy strategy, some potentially better actions may never be taken.
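One common way to implement this trade-off, covered later in the book, is an ε-greedy rule. This is a generic sketch of the idea, not code from the book:

```python
import random

def epsilon_greedy(values, actions, eps=0.1, rng=random):
    """With probability eps pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: values[a])
```

With eps=0 this reduces to the pure greedy play discussed in the exercise; any eps > 0 guarantees every action is tried occasionally.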
Exercise 1.4: Learning from Exploration
Learning from exploration should lead to more wins in the long run. By exploring more, the agent becomes better able to handle nonstationary problems and to optimize long-run performance.
Exercise 1.5: Other Improvements
Possible improvements include adaptive learning rates, a more principled trade-off between exploitation and exploration, and using an artificial neural network to generalize from empirical experience.
This article is for self-learners. If you are taking a course, please do not copy this note.