1 value iteration
for i in max-iteration:
    for j in states:
        v[j] = max_a( r[j,a] + gamma * sum_j'( p(j'|j,a) * v[j'] ) )
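The update above can be sketched in Python. The 2-state, 2-action MDP here (the `P`, `R` tables and all numbers) is made up purely for illustration; `P[s][a]` lists `(prob, next_state)` pairs and `R[s][a]` is the immediate reward.

```python
GAMMA, TOL = 0.9, 1e-8

# Toy MDP (numbers invented for illustration).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(0.9, 0), (0.1, 1)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.2, 1: 0.0}}

def q(s, a, v):
    # One-step lookahead: immediate reward plus discounted expected value.
    return R[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])

v = {s: 0.0 for s in P}
for _ in range(1000):                        # max-iteration
    delta = 0.0
    for s in P:                              # sweep over states
        new = max(q(s, a, v) for a in P[s])  # Bellman optimality update
        delta = max(delta, abs(new - v[s]))
        v[s] = new
    if delta < TOL:                          # values have stabilized
        break

# Read off the greedy policy from the converged values.
policy = {s: max(P[s], key=lambda a: q(s, a, v)) for s in P}
```

The stopping test on `delta` (the largest change in any `v[s]` during a sweep) is what "iterate until stable" means in practice.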
2 policy iteration
for i in max-iteration:
    policy-evaluation
        (iterate v[state] until it stabilizes; the action taken in each state is fixed by the current policy)
    policy-improvement
        (update the action for each state in turn, picking the value-maximizing action each time)
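The evaluate/improve loop can be sketched as follows, reusing the same toy 2-state MDP shape as above (`P[s][a]` is a list of `(prob, next_state)` pairs, `R[s][a]` the reward; all numbers invented for illustration):

```python
GAMMA, TOL = 0.9, 1e-8

P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(0.9, 0), (0.1, 1)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.2, 1: 0.0}}

def q(s, a, v):
    return R[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])

policy = {s: 0 for s in P}       # start from an arbitrary policy
v = {s: 0.0 for s in P}

for _ in range(100):             # max-iteration over evaluate/improve rounds
    # policy evaluation: iterate v[state] under the fixed policy until stable
    while True:
        delta = 0.0
        for s in P:
            new = q(s, policy[s], v)   # action is known: policy[s]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < TOL:
            break
    # policy improvement: make each state's action greedy w.r.t. v
    improved = {s: max(P[s], key=lambda a: q(s, a, v)) for s in P}
    if improved == policy:       # no action changed -> policy is optimal
        break
    policy = improved
```

The key difference from value iteration: evaluation has no `max` over actions (the action is given by the current policy); the `max` appears only in the improvement step.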
3 model-based learning
for i in max-iteration:
    1) follow policy pi, collect the transition list as history
    2) estimate the rewards and transition probabilities from history, giving tuples (state, action, prob, next_state)
    3) update the policy by running value iteration on the learned model
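The three steps above can be sketched end to end. The environment here (`step`, its rewards and transition probabilities) is a made-up stand-in for whatever unknown system the agent interacts with, and the behavior policy is simply uniform random; only the estimated model `P_hat`, `R_hat` is used for planning.

```python
import random
from collections import defaultdict

random.seed(0)
STATES, ACTIONS = [0, 1], [0, 1]
GAMMA = 0.9

def step(s, a):
    # True (hidden) dynamics, used only to generate experience (numbers invented).
    if a == 0:
        return s, 0.1                        # stay put, small reward
    s2 = 1 - s if random.random() < 0.8 else s
    return s2, (1.0 if s == 0 else 0.0)      # moving out of state 0 pays off

# 1) follow a policy (here: uniformly random) and record the transition history
history, s = [], 0
for _ in range(5000):
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    history.append((s, a, r, s2))
    s = s2

# 2) estimate P(next_state | state, action) and R(state, action) from counts
counts = defaultdict(lambda: defaultdict(int))
rewards = defaultdict(list)
for (s, a, r, s2) in history:
    counts[(s, a)][s2] += 1
    rewards[(s, a)].append(r)
P_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
R_hat = {sa: sum(rs) / len(rs) for sa, rs in rewards.items()}

# 3) update the policy by running value iteration on the learned model
def q(s, a, v):
    return R_hat[(s, a)] + GAMMA * sum(p * v[s2]
                                       for s2, p in P_hat[(s, a)].items())

v = {s: 0.0 for s in STATES}
for _ in range(500):
    v = {s: max(q(s, a, v) for a in ACTIONS) for s in STATES}
policy = {s: max(ACTIONS, key=lambda a: q(s, a, v)) for s in STATES}
```

With enough transitions the empirical `P_hat`/`R_hat` approach the true dynamics, so planning on the learned model recovers (approximately) the optimal policy.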