Reinforcement Learning: Q-Learning

This article looks at the Q-Learning and SARSA algorithms in reinforcement learning. Q-Learning learns the optimal policy by updating Q values, but during exploration this can lead to problems such as the mouse jumping off a cliff. SARSA, in contrast, takes the policy actually being followed into account and so learns a safe path more stably. Using the example of a mouse searching for cheese, it shows how SARSA avoids the mistaken decisions that Q-Learning can make.

A First Look at Reinforcement Learning (1)

This post is a record of my notes on two blog articles by Travis DeWolf (a postdoctoral researcher in the Systems Design Engineering department at the University of Waterloo, Canada): REINFORCEMENT LEARNING PART 1: Q-LEARNING AND EXPLORATION and REINFORCEMENT LEARNING PART 2: SARSA VS Q-LEARNING.

1. Q-Learning and exploration

1.1 Q-learning review

    Representation:

       s    -    environment states

       a    -    possible actions in those states

       Q   -    state-action value, value/quality of each of those actions in each of those states

    In Q-learning, you start by setting all the state-action values (the Q values) to 0 (in a simple implementation), and then you go around and explore the state-action space.

    After you try an action in a state, you evaluate the state that it has led to. If it has led to an undesirable outcome, you reduce the Q value (or weight) of that action from that state, so that other actions will have a greater value and be chosen instead the next time you're in that state. Similarly, if you're rewarded for taking a particular action, the weight of that action for that state is increased, so you're more likely to choose it again the next time you're in that state. Importantly, when you update Q, you're updating it for the previous state-action combination; you can only update Q after you've seen the result.

Example: cat, mouse, cheese

    state (the cat is in front of you), action (go forward) -> reduce Q (the weight of that action for that state).

    So the next time the cat is in front of the mouse, the mouse won't choose to go forward; it might choose to go to the side or away from the cat instead (you are a mouse with respawning powers). Note that this doesn't reduce the value of moving forward when there is no cat in front of you; the value of 'move forward' is only reduced in the situation where there's a cat in front of you.

    In the opposite case: s (the cheese is in front of you), a (move forward) -> rewarded, so increase Q.

   So now the next time you’re in the situation (state) that there’s cheese in front of you, the action ‘move forward’ is more likely to be chosen, because last time you chose it you were rewarded.
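
    A minimal sketch of this bookkeeping, assuming a dictionary-keyed table; the state names, actions, and adjustment sizes below are made up for illustration and do not appear in the original posts:

# Hypothetical state names, actions, and adjustment sizes, purely for illustration.
actions = ['forward', 'left', 'right', 'back']
states = ['cat ahead', 'cheese ahead', 'nothing ahead']

# start with every state-action value at 0
Q = {(s, a): 0.0 for s in states for a in actions}

def adjust(state, action, delta):
    # raise or lower the weight of taking `action` in `state`
    Q[(state, action)] += delta

adjust('cat ahead', 'forward', -1.0)     # punished: 'forward' is now less attractive here
adjust('cheese ahead', 'forward', +1.0)  # rewarded: 'forward' is now more attractive here

# 'forward' with nothing in front of you is untouched, exactly as described above
assert Q[('nothing ahead', 'forward')] == 0.0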

    On its own, this gives the system no foresight beyond a single time step; this is addressed by incorporating a look-ahead value into the update.

    When we're updating a given Q value for the state-action combination we just experienced, we do a search over all the Q values for the state that we ended up in. We find the maximum state-action value in this state, and incorporate that into our update of the Q value representing the state-action combination we just experienced.

(Figure panels a-c: images not reproduced here.)

d. Now when we are updating the Q value for the previous state-action combination we look at all the Q values for the state 'cheese is one step ahead'. We see that one of these values is high (this being for the action 'move forward') and this value is incorporated in the update of the previous state-action combination.


    Specifically we’re updating the previous state-action value using the equation:

    Q(s, a) += alpha * (reward(s, a) + max(Q(s')) - Q(s, a))

    where s is the previous state, a is the previous action, s' is the current state, and alpha is the learning rate (set to .5 here).

    Intuitively, the change in the Q-value for performing action a in state s is the difference between the actual reward (reward(s,a) + max(Q(s'))) and the expected reward (Q(s,a)) multiplied by a learning rate, alpha. You can think of this as a kind of PD control, driving your system to the target, which is in this case the correct Q-value.

    Here, we evaluate the update for moving ahead when the cheese is two steps ahead as: the reward for moving into that state (0), plus the value of the best action from that state (moving into the cheese, +50), minus the expected value we already had for the current state-action pair (0), all multiplied by our learning rate (.5), which gives .5 * (0 + 50 - 0) = +25.
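
    To make the arithmetic explicit, here is a small sketch of that update rule applied to the worked example. The dictionary-based Q table and the state names are illustrative assumptions; the reward of 0, the best next value of +50, and alpha = .5 come from the text above.

alpha = 0.5  # learning rate, as in the text

def q_update(Q, s, a, reward, s_prime, actions):
    # Q(s, a) += alpha * (reward(s, a) + max(Q(s')) - Q(s, a))
    max_next = max(Q[(s_prime, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + max_next - Q[(s, a)])

actions = ['forward', 'left', 'right', 'back']
Q = {(s, a): 0.0
     for s in ['cheese two steps ahead', 'cheese one step ahead']
     for a in actions}
Q[('cheese one step ahead', 'forward')] = 50.0  # moving into the cheese is worth +50

q_update(Q, 'cheese two steps ahead', 'forward',
         reward=0.0, s_prime='cheese one step ahead', actions=actions)
print(Q[('cheese two steps ahead', 'forward')])  # 0.5 * (0 + 50 - 0) = 25.0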

1.2 Exploration

    In the most straightforward implementation of Q-learning, state-action values are stored in a look-up table. So we have a giant table of size N x M, where N is the number of different possible states and M is the number of different possible actions. At decision time we simply go to that table, look up the action values for the current state, and choose the action with the maximum value. In code:

def choose_action(self, state):
    # look up the value of every action in this state
    q = [self.getQ(state, a) for a in self.actions]
    maxQ = max(q)
    # take the action whose value is largest
    action = self.actions[q.index(maxQ)]
    return action

    Almost. There are a couple of additional things we need to add. First, we need to cover the case where there are several actions that all have the same value; in that case, we randomly choose one of them:

import random

def choose_action(self, state):
    q = [self.getQ(state, a) for a in self.actions]
    maxQ = max(q)
    # if several actions share the maximum value, pick one of them at random
    if q.count(maxQ) > 1:
        best = [i for i in range(len(self.actions)) if q[i] == maxQ]
        i = random.choice(best)
    else:
        i = q.index(maxQ)
    action = self.actions[i]
    return action
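
    As a quick usage check (a hypothetical toy setup, not from the original posts), wiring this method up against an all-zero table shows the tie-breaking at work; it reuses the choose_action defined above:

class ToyAgent:
    # Minimal stand-in exposing only what choose_action relies on:
    # a list of actions and a getQ look-up (defaulting to 0).
    def __init__(self, actions):
        self.actions = actions
        self.q = {}

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

agent = ToyAgent(['forward', 'left', 'right', 'back'])
# every action ties at 0, so choose_action picks one of them at random
print(choose_action(agent, 'nothing ahead'))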