Note to self: make a point of carefully reading "Reinforcement Learning: An Introduction".
Reinforcement learning problems involve learning what to do - how to map situations to actions - so as to maximize a numerical reward signal.
RL is different from supervised learning/unsupervised learning.
There is no supervisor (to tell the agent what is best!), only a reward signal; the agent must discover which actions yield the most reward by trying them out
Actions influence the environment and the subsequent data, so the data distribution is not i.i.d.
Feedback is (sometimes) delayed, not instantaneous
trade-off between exploration and exploitation
For a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
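To make the last two points concrete for myself, here is a minimal sketch of epsilon-greedy action selection with sample-average estimates. The function name `run_bandit`, the reward means, the Gaussian noise, and epsilon = 0.1 are all my own assumptions for illustration, not something taken from the book.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy action selection with sample-average reward estimates."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # running estimate of each action's expected reward
    counts = [0] * n_actions       # how many times each action has been tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an action uniformly at random.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the highest current estimate.
            action = max(range(n_actions), key=lambda a: estimates[a])
        # Stochastic reward (assumed): the true mean plus unit-variance Gaussian noise.
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental sample average: new = old + (reward - old) / n.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.0, 1.0, 2.0])
```

A single noisy reward says little about an action. Only after many tries does its estimate settle near the true mean, which is why some exploration has to continue alongside exploitation.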
elements of RL
reward signal: Reward Hypothesis, All goals can be described by the maximisation of expected cumulative reward
value function: Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
(sometimes) model of environment: P(s'|s,a) and R(s'|s,a). Models are used for planning (without actually interacting with the environment)
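A small sketch of how reward, value, and model fit together: given a model, state values can be computed by planning alone, with no interaction with the environment. The two-state episodic model below is a made-up toy, and the sweep is an undiscounted value-iteration-style backup that I am using for illustration. The names `MODEL`, `expected_backup`, and `plan_values` are hypothetical.

```python
# Hypothetical toy model (not from the book or the lecture):
# (state, action) -> list of (probability, next_state, reward).
MODEL = {
    ("A", "left"): [(1.0, "end", 1.0)],   # immediate reward, then the episode ends
    ("A", "right"): [(1.0, "B", 0.0)],    # no immediate reward, but leads to B
    ("B", "left"): [(1.0, "end", 0.0)],
    ("B", "right"): [(0.5, "end", 4.0), (0.5, "end", 0.0)],
}
TERMINAL = "end"


def expected_backup(outcomes, values):
    """Expected immediate reward plus value of the next state, under the model."""
    return sum(p * (r + values[s2]) for p, s2, r in outcomes)


def plan_values(model, sweeps=10):
    """Best achievable long-run reward from each state, using only the model."""
    states = sorted({s for (s, _a) in model})
    values = {s: 0.0 for s in states}
    values[TERMINAL] = 0.0  # nothing more to collect once the episode is over
    for _ in range(sweeps):
        for s in states:
            # Back up the best action's expected outcome into the state's value.
            values[s] = max(
                expected_backup(outcomes, values)
                for (s_from, _a), outcomes in model.items()
                if s_from == s
            )
    return values


values = plan_values(MODEL)
```

In this toy model, going left from A pays 1 immediately, but the value of A comes out as 2.0 because going right is worth more in the long run. That is the difference between the reward signal and the value function.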
1.4 Limitations and Scope
The book spends a lot of space on how genetic algorithms/evolutionary methods (EM) and optimization methods (OM) differ from RL, pointing out:
Evolutionary methods suit a small space of policies, or cases where the agent can't accurately sense the state of the environment. But EM only look at the final outcome of a policy and ignore the intermediate process (the details of individual behavioral interactions), so they are less efficient than RL: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search.
1.5 An Extended Example: Tic-Tac-Toe
Uses an example to show that classical AI methods such as minimax, dynamic programming, and evolutionary methods are not well suited even to an RL problem this simple.
the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent.
dynamic programming can compute an optimal solution for any opponent, but requires as input a complete specification of that opponent
evolutionary method: To evaluate a policy an evolutionary method holds the policy fixed and plays many games against the opponent, or simulates many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored. For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred!
RL: Value function methods, in contrast, allow individual states to be evaluated. In the end, evolutionary and value function methods both search the space of policies, but learning a value function takes advantage of information available during the course of play.
The reinforcement learning solution can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions. (Seen from the TD-learning angle, RL really does get a degree of lookahead without needing a model.)
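A minimal sketch of the temporal-difference update the tic-tac-toe example relies on, V(s) <- V(s) + alpha * [V(s') - V(s)]. The state names, the initial values of 0.5, the fixed winning trajectory, and alpha = 0.1 are my own assumptions for illustration; a real player would also need move selection and exploration.

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move the estimate for `state` a fraction alpha toward the estimate for `next_state`."""
    values[state] += alpha * (values[next_state] - values[state])


# Hypothetical game trajectory: three positions leading to a win (value 1.0).
values = {"s0": 0.5, "s1": 0.5, "s2": 0.5, "win": 1.0}
episode = ["s0", "s1", "s2", "win"]

for _ in range(200):  # replay the same game many times
    for state, next_state in zip(episode, episode[1:]):
        td_update(values, state, next_state)
```

Each position is updated from the position that followed it, so credit flows backward one move at a time and every visited state gets its own estimate. This is the sense in which value function methods use information from the course of play, without a model of the opponent.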
1.7 History of Reinforcement Learning
Three main research threads in RL: learning by trial and error; optimal control and its solution using value functions and dynamic programming (planning); TD methods. The section then goes through the work of various researchers...
Below is what I think is worth knowing from Silver's course, "Lecture 1: Introduction to Reinforcement Learning":
8: characteristics that make RL different from other ML paradigms
There is no supervisor (to tell what is best!), only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non i.i.d data)
Agent’s actions affect the subsequent data it receives
13: Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward
18: History is the sequence of observations, actions, rewards, H_t = o_0, r_0, a_0, o_1, r_1, a_1, ..., o_t, r_t, a_t
State is the information used to determine what happens next, S_t = fun(H_t)
Markov state contains all useful information from the history: S_t is Markov iff P[S_{t+1} | S_t] = P[S_{t+1} | S_t, ..., S_0],
i.e., the future is independent of the past given the present
23: Full observability: agent directly observes Markov state, i.e., O_t = S_t = fun(H_t), MDP
Partial observability: agent indirectly observes environment, O_t != S_t, POMDP
Agent must construct its own state representation:
Complete history: S_t = H_t
Beliefs of environment state: S_t = (P[S = s1]; ...; P[S = sn])
RNN: S_t = σ(W*S_{t-1} + V*O_t)
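A sketch of the recurrent state update above in plain Python, to show how the agent state can summarize the observation history. The weights `W` and `V`, the dimensions, and the helper names are hypothetical; in practice the weights would be learned rather than fixed by hand.

```python
import math


def sigmoid(x):
    """Logistic function, squashing a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def update_state(prev_state, observation, W, V):
    """One step of S_t = sigmoid(W * S_{t-1} + V * O_t), applied elementwise."""
    recurrent = matvec(W, prev_state)
    inputs = matvec(V, observation)
    return [sigmoid(r + i) for r, i in zip(recurrent, inputs)]


# Hypothetical fixed weights: 2-dimensional state, 1-dimensional observation.
W = [[0.5, -0.3], [0.1, 0.8]]
V = [[1.0], [-1.0]]

state = [0.0, 0.0]
for obs in ([1.0], [0.0], [1.0]):
    state = update_state(state, obs, W, V)
```

The new state depends only on the previous state and the current observation, so the agent never has to store the full history H_t.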
25: four main subelements of a reinforcement learning system
a policy: agent’s behaviour function
a reward signal: indicates what is good in an immediate sense, the primary basis for altering the policy
a value function: specifies what is good in the long run, the expected total accumulated future reward
optionally, a model of the environment: something that mimics the behavior of the environment, P[S'|S,A] and R[S'|S,A]
37: planning and (reinforcement) learning; most board games are planning problems (known environment/model/rules, tree search)
40: exploration and exploitation
43: prediction and control