Chap 2
S 2.1
- Action value = expected or mean reward given that the action is selected: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$.
S 2.2
- Action-value methods = estimating the values of actions + using the estimates to make action selection decisions.
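- Sample-average method -> estimating the action value: $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$, where $\mathbb{1}_{\text{predicate}}$ denotes the random variable that is 1 if predicate is true and 0 if it is not. If the denominator is 0, $Q_t(a)$ is set to some default value, such as 0.
- Greedy action selection -> action selection rule: $A_t = \operatorname{argmax}_a Q_t(a)$.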
- ε-greedy method: with probability ε, select randomly from among all actions with equal probability; otherwise select the greedy action. For k-armed bandits with ε, once the estimates are accurate enough that the greedy action is the optimal one, the probability of selecting the optimal action converges to: $1 - \varepsilon + \frac{\varepsilon}{k}$
(Simple derivation: the greedy action is selected directly with probability 1-ε. With the remaining probability ε, one of the k actions is selected uniformly at random, so the greedy action is still selected with probability (1/k)ε. The total probability is the sum of the two, which gives the equation above.)
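
The notes on S 2.2 can be checked numerically with a minimal Python sketch. It is only an illustration, not the book's implementation: the function names, the example estimates in `q`, the default of 0.0 for an action that was never taken, and the first-index tie-breaking are all assumptions made here. It computes a sample-average estimate, selects actions ε-greedily, and compares how often the greedy action is picked against (1-ε) + ε/k.

```python
import random


def sample_average(rewards, actions, a):
    """Mean of the rewards received on the steps where action `a` was taken."""
    picked = [r for r, act in zip(rewards, actions) if act == a]
    # Assumption of this sketch: default to 0.0 if `a` was never taken.
    return sum(picked) / len(picked) if picked else 0.0


def greedy_action(q_estimates):
    """Index of the largest estimate (ties go to the lowest index, an assumption)."""
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])


def epsilon_greedy(q_estimates, epsilon, rng):
    """Explore uniformly over all k actions with probability epsilon, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    return greedy_action(q_estimates)


def greedy_selection_frequency(q_estimates, epsilon, trials=100_000, seed=0):
    """Fraction of trials in which epsilon-greedy picks the greedy action."""
    rng = random.Random(seed)
    target = greedy_action(q_estimates)
    hits = sum(epsilon_greedy(q_estimates, epsilon, rng) == target for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    q = [0.2, 1.5, -0.3, 0.9]  # hypothetical fixed estimates, k = 4
    eps = 0.1
    print("empirical:", greedy_selection_frequency(q, eps))
    print("predicted:", 1 - eps + eps / len(q))
```

The estimates are held fixed so that the greedy action does not change, which matches the limiting case the derivation describes. The empirical frequency should land close to the predicted value, with small differences due to sampling noise.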