CS234 Assignment#1 value iteration/policy iteration
This part implements value iteration and policy iteration, using the FrozenLake environment as a test case.
FrozenLake:
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you’ll fall into the freezing water.
At this time, there’s an international frisbee shortage, so it’s absolutely imperative that
you navigate across the lake and retrieve the disc.
However, the ice is slippery, so you won’t always move in the direction you intend.
The surface is described using a grid like the following
SFFF
FHFH
FFFH
HFFG
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal, where the frisbee is located
The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.
CS234 vi_and_pi.py
Notes:
In this example the actions are encoded as
0: ← // 1: ↓ // 2: → // 3: ↑
In the tuples stored in P, terminal is either True or False.
In this example every probability in P is 1.0 (i.e., the transitions are deterministic), but in general, when the probabilities are not 1, P[state][action] looks like this:
[(probability1, nextstate1, reward1, terminal1),
 (probability2, nextstate2, reward2, terminal2),
 ...]
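To make the P layout concrete, here is a minimal runnable sketch. The two-state MDP below is made up purely for illustration (it is not the FrozenLake map); it shows how a one-step Bellman backup reads the (probability, nextstate, reward, terminal) tuples:

```python
# Hypothetical two-state, two-action MDP in the gym-style P format
# (invented for illustration; not the FrozenLake map):
# P[state][action] = [(probability, nextstate, reward, terminal), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
gamma = 0.9
V = [0.0, 0.0]  # current value estimates for states 0 and 1

def q_value(P, V, state, action, gamma):
    """One-step lookahead: expected return of taking `action` in `state`.
    Terminal transitions contribute no future value."""
    return sum(prob * (reward + gamma * V[nextstate] * (not terminal))
               for prob, nextstate, reward, terminal in P[state][action])

# Bellman optimality backup for state 0: take the best of the two actions.
best = max(q_value(P, V, 0, a, gamma) for a in range(2))
print(best)  # 1.0 — action 1 reaches the goal state with reward 1
```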
value iteration
def value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):
"""
Learn value function and policy by using value iteration method for a given
gamma and environment.
Parameters:
----------
P: dictionary
It is from gym.core.Environment
P[state][action] is tuples with (probability, nextstate, reward, terminal)
nS: int
number of states
nA: int
number of actions
gamma: float
Discount factor. Number in range [0, 1)
max_iteration: int
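For reference, here is one way the body of value_iteration could be filled in. This is a sketch consistent with the docstring above, not the official starter-code solution; the exact return values and variable names in the assignment may differ. The small P_demo MDP at the bottom is invented for a smoke test.

```python
import numpy as np

def value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):
    """Sketch: repeatedly apply the Bellman optimality backup,
    then extract the greedy policy from the resulting value function."""
    V = np.zeros(nS)
    for _ in range(max_iteration):
        # Bellman optimality backup over all states.
        V_new = np.array([
            max(sum(prob * (reward + gamma * V[ns] * (not terminal))
                    for prob, ns, reward, terminal in P[s][a])
                for a in range(nA))
            for s in range(nS)])
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the (approximately) optimal V.
    policy = np.array([
        max(range(nA), key=lambda a: sum(
            prob * (reward + gamma * V[ns] * (not terminal))
            for prob, ns, reward, terminal in P[s][a]))
        for s in range(nS)])
    return V, policy

# Smoke test on a tiny made-up 2-state MDP (not FrozenLake):
# in state 0, action 1 reaches the terminal goal state 1 with reward 1.
P_demo = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
V, policy = value_iteration(P_demo, nS=2, nA=2)
print(V, policy)
```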