Reinforcement Learning
Exercises from Richard S. Sutton and Andrew G. Barto's book 'Reinforcement Learning: An Introduction', 2nd edition.
YeXiang\^-^/
Reinforcement Learning Exercise 7.4
Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as $G_{t:t+n}=Q_{t-1}(S_t,A_t)+\sum_{k=t}^{\min(t+n,T)-1}\gamma^{k-t}\bigl[R_{k+1}+\gamma Q_k(S_{k+1},A_{k+1})-Q_{k-1}(S_k,A_k)\bigr]$ … (original post 2019-12-08 22:22:58 · 415 views · 0 comments)
Reinforcement Learning Exercise 7.1
Exercise 7.1 In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in (7.… (original post 2019-11-14 22:42:03 · 530 views · 0 comments)
Reinforcement Learning Exercise 6.6
Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$ and $\frac{5}{6}$, for states A thr… (original post 2019-10-29 23:08:48 · 592 views · 0 comments)
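As a sanity check on those values, here is a small sketch (mine, not from the post) that recovers them by solving the random walk's Bellman equations as a linear system:

```python
import numpy as np

# Random walk of Example 6.2: five states A..E in a row. From each state,
# move left or right with probability 0.5; an episode ends off the left end
# (reward 0) or off the right end (reward +1). The true values solve
# V(s) = 0.5*V(left) + 0.5*V(right), with terminal "values" fixed at 0 and 1
# and no discounting.

n = 5
A = np.zeros((n, n))
b = np.zeros(n)
for i in range(n):
    A[i, i] = 1.0
    if i - 1 >= 0:
        A[i, i - 1] = -0.5      # left neighbour (state A's left neighbour is terminal, value 0)
    if i + 1 < n:
        A[i, i + 1] = -0.5      # right neighbour
    else:
        b[i] = 0.5              # stepping right from E terminates with reward +1

true_values = np.linalg.solve(A, b)
print(true_values)  # [1/6, 2/6, 3/6, 4/6, 5/6]
```

Solving the system directly, rather than sampling, confirms the closed-form answer $V(k\text{-th state})=k/6$.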
Reinforcement Learning Exercise 5.6
Exercise 5.6 What is the equation analogous to (5.6) for action values $Q(s,a)$ instead of state values $V(s)$, again given returns generated using $b$? Given a starting state $S_t$, … (original post 2019-08-03 22:35:54 · 361 views · 0 comments)
Reinforcement Learning Exercise 5.5
Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probabil… (original post 2019-08-03 16:28:15 · 1130 views · 2 comments)
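The contrast the exercise is driving at can be illustrated with a toy calculation (my own sketch; the ten-step episode is an assumed example, not taken from the post):

```python
# One nonterminal state, every transition yields reward +1, no discounting.
# Suppose a single observed episode visits the nonterminal state 10 times
# before terminating, so the return following the k-th visit is 10 - k.

rewards = [1] * 10                 # ten transitions, reward +1 each
returns_after_visit = [sum(rewards[k:]) for k in range(len(rewards))]  # 10, 9, ..., 1

first_visit_estimate = returns_after_visit[0]     # uses only the return after the first visit
every_visit_estimate = sum(returns_after_visit) / len(returns_after_visit)

print(first_visit_estimate)   # 10
print(every_visit_estimate)   # 5.5
```

The two Monte Carlo variants give very different estimates from the same single episode, which is exactly the effect the exercise asks you to quantify.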
Reinforcement Learning Exercise 5.4
Exercise 5.4 The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient t… (original post 2019-08-03 15:38:48 · 547 views · 0 comments)
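The more efficient alternative the exercise points at is the incremental sample-average update of Section 2.4; a minimal sketch (variable names are my own):

```python
from collections import defaultdict

# Instead of storing Returns(s,a) as a list and re-averaging, keep a visit
# count N(s,a) and fold each new return G into the running mean:
#   Q(s,a) <- Q(s,a) + (G - Q(s,a)) / N(s,a)

Q = defaultdict(float)   # running mean of returns per (state, action)
N = defaultdict(int)     # number of returns observed per (state, action)

def update(sa, G):
    """Incorporate one observed return G for the pair sa in O(1) time and memory."""
    N[sa] += 1
    Q[sa] += (G - Q[sa]) / N[sa]

for G in [4.0, 6.0, 8.0]:
    update(("s0", "a0"), G)

print(Q[("s0", "a0")])   # 6.0, the mean of the three returns
```

The update is algebraically identical to averaging the stored list, but uses constant memory per pair.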
Reinforcement Learning Exercise 5.1 & 5.2
Exercise 5.1 Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left?… (original post 2019-08-03 14:37:24 · 571 views · 0 comments)
Reinforcement Learning Exercise 4.10
Exercise 4.10 What is the analog of the value iteration update (4.10) for action values, $q_{k+1}(s,a)$? Use the result of Exercise 3.17: $Q_\pi(s,a)=\sum_{s'}R^a_{s,s'}P^a_{ss'}+\gamma\sum_{s'}[\sum_{a'}Q_\pi(s',a')\pi(s',\dots]$ … (original post 2019-07-06 20:21:33 · 268 views · 0 comments)
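For reference, the expected answer is usually stated with the four-argument function $p$; the following is the standard q-value-iteration update from the general theory, not a quotation from the post:

$$q_{k+1}(s,a)=\sum_{s',r}p(s',r\mid s,a)\Bigl[r+\gamma\max_{a'}q_k(s',a')\Bigr]$$

This mirrors (4.10) with the maximization taken over the next action $a'$ inside the expectation over the next state and reward.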
Reinforcement Learning Exercise 4.9
Exercise 4.9 (programming) Implement value iteration for the gambler’s problem and solve it for $p_h$ = 0.25 and $p_h$ = 0.55. In programming, you may find it convenient to introduce two dummy s… (original post 2019-07-04 23:56:28 · 793 views · 2 comments)
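A minimal sketch of the asked-for value iteration, under the usual setup (states are capitals 1–99, dummy terminal states 0 and 100 as the exercise hints; function name and the tolerance `theta` are my own choices):

```python
import numpy as np

# Dummy state 100 carries value 1, which plays the role of the +1 reward
# for reaching the goal; dummy state 0 keeps value 0. All other rewards
# are zero and there is no discounting.

def gamblers_value_iteration(p_h, theta=1e-9):
    V = np.zeros(101)
    V[100] = 1.0
    while True:
        delta = 0.0
        for s in range(1, 100):
            # legal stakes: bet at least 1, never more than held or needed
            stakes = range(1, min(s, 100 - s) + 1)
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best            # in-place (Gauss-Seidel style) sweep
        if delta < theta:
            return V

V = gamblers_value_iteration(0.25)
print(V[50])   # probability of reaching 100 from capital 50 under the optimal policy
```

For $p_h=0.25$ the optimal value from capital 50 is 0.25: betting everything wins on a single flip, and no policy can do better with a subfair coin.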
Reinforcement Learning Chapter 5, Example of Blackjack
This article presents source code and experiment results for the blackjack example in the book. Both on-policy and off-policy methods are implemented in this source code. The off-policy part includes ordinary imp… (original post 2019-08-04 11:39:13 · 436 views · 0 comments)
Reinforcement Learning Chapter 5 Example 5.5
In the book, Example 5.5 shows the infinite variance of ordinary importance sampling in a specific case. I tried this experiment on my computer and got a similar result. Unfortunately, my computer… (original post 2019-08-05 22:53:47 · 174 views · 0 comments)
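The experiment can be reproduced with a few lines; the sketch below is my own reading of Example 5.5's setup (one nonterminal state; the target policy always goes left; left loops back with probability 0.9 and otherwise terminates with reward +1; right terminates with reward 0; the behavior policy picks each action with probability 0.5):

```python
import random

def episode_weighted_return(rng):
    """Run one episode under the behavior policy and return rho * G,
    the ordinary importance-sampling sample for the start state."""
    rho = 1.0
    while True:
        if rng.random() < 0.5:          # behavior chooses "right"
            return 0.0                  # target never goes right, so rho = 0
        rho *= 1.0 / 0.5                # target prob 1 over behavior prob 0.5
        if rng.random() < 0.1:          # "left" terminates this time
            return rho * 1.0            # undiscounted return G = 1, weighted by rho
        # otherwise: back to the nonterminal state with reward 0

rng = random.Random(0)
n = 100_000
estimate = sum(episode_weighted_return(rng) for _ in range(n)) / n
print(estimate)
```

Each nonzero sample is $2^k$ for $k$ left-moves, so the running average keeps taking large jumps and never settles, which is the infinite-variance behavior the book's figure shows.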
Reinforcement Learning Exercise 5.9
Exercise 5.9 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4. The modified algorithm should b… (original post 2019-08-06 23:05:51 · 233 views · 0 comments)
Reinforcement Learning Exercise 5.10
Exercise 5.10 Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3). (original post 2019-08-10 16:07:23 · 233 views · 0 comments)
Reinforcement Learning Exercise 5.13
Exercise 5.13 Show the steps to derive (5.14) from (5.12). $\rho_{t:T-1}R_{t+1}=\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+1}$ (5.12) … (original post 2019-09-10 23:40:51 · 282 views · 0 comments)
Reinforcement Learning Exercises 6.1
Exercise 6.1 If $V$ changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ … (original post 2019-10-03 12:31:59 · 334 views · 1 comment)
Reinforcement Learning Exercise 6.9 & 6.10
Exercise 6.9: Windy Gridworld with King’s Moves (programming) Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better ca… (original post 2019-10-03 18:37:08 · 858 views · 0 comments)
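A compact sketch of one way to attack this exercise: Sarsa with $\epsilon$-greedy exploration over the eight King's moves. The grid, wind strengths, start, and goal follow Example 6.5 of the book; the hyperparameters (`alpha`, `eps`, episode count) are my own choices, not the post's:

```python
import random
from collections import defaultdict

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]    # upward push per column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def step(state, action):
    """Apply a move plus the wind of the departure column, clipped to the grid."""
    (r, c), (dr, dc) = state, action
    r2 = min(max(r + dr - WIND[c], 0), ROWS - 1)
    c2 = min(max(c + dc, 0), COLS - 1)
    return (r2, c2)

def sarsa(episodes=500, alpha=0.5, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)               # Q[(state, action)]; terminal pairs stay 0

    def policy(s):
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, a = START, policy(START)
        while s != GOAL:
            s2 = step(s, a)
            a2 = policy(s2)
            # Sarsa update with reward -1 per time step
            Q[(s, a)] += alpha * (-1 + Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

def greedy_steps(Q, limit=400):
    """Walk greedily from START; returns the step count (limit means it never arrived)."""
    s, n = START, 0
    while s != GOAL and n < limit:
        s = step(s, max(ACTIONS, key=lambda a: Q[(s, a)]))
        n += 1
    return n

Q = sarsa()
print(greedy_steps(Q))   # the book reports a 7-step optimum with King's moves
```

Since each step can advance at most one column and the goal is seven columns away, seven steps is a hard lower bound, which the diagonal moves make achievable.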
Reinforcement Learning Exercise 6.2
Exercise 6.3 From the results shown in the left graph of the random walk example it appears that the first episode results in a change in only $V(A)$. What does this tell you about what happene… (original post 2019-10-04 14:21:04 · 523 views · 0 comments)
Reinforcement Learning Exercise 4.7
Exercise 4.7 (programming) Write a program for policy iteration and re-solve Jack’s car rental problem with the following changes. One of Jack’s employees at the first location rides a bus home each n… (original post 2019-07-04 22:36:59 · 1637 views · 0 comments)
Reinforcement Learning -- Explanation of Formula (5.2)
The formula (5.2) in Chapter 5 of the book is a little difficult for me to understand, so I spent a while deriving it in detail. First, $q_\pi(s,\pi'(s))=\sum_a\pi'(a|s)q_\pi(s,a)$, because for… (original post 2019-07-13 21:58:50 · 151 views · 0 comments)
Reinforcement Learning Exercise 4.4
Exercise 4.4 The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is OK… (original post 2019-06-21 23:04:15 · 1362 views · 2 comments)
Reinforcement Learning Exercise 3.24
Exercise 3.24 Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically… (original post 2019-05-02 20:40:45 · 654 views · 0 comments)
Reinforcement Learning Exercise 3.29
Exercise 3.29 Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument functi… (original post 2019-05-03 18:29:19 · 430 views · 2 comments)
Reinforcement Learning Exercise 3.22
Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards th… (original post 2019-05-01 15:27:09 · 901 views · 0 comments)
Reinforcement Learning Exercise 3.11
Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument … (original post 2019-05-25 22:38:19 · 551 views · 0 comments)
Reinforcement Learning Exercise 3.12
Exercise 3.12 Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$. $v_\pi(s)=\mathbb E_\pi(G_t|S_t=s)=\sum_{g_t}[g_t\cdot p(g_t|s)]=\sum_{g_t}\left[g_t\cdot\frac{p(g_t,s)}{p(s)}\right]=\sum_{g_t}\left[g_t\cdot\frac{\sum_{a\in\mathcal A}p(g_t,s,a)}{p(s)}\right]=\sum_{g_t}\left\{g_t\cdot\frac{\sum_{a\in\mathcal A}[p(g_t|s,a)\cdot p(s,a)]}{p(s)}\right\}$ … (original post 2019-05-26 17:41:07 · 923 views · 3 comments)
Reinforcement Learning exercise 3.13
Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$. First, we need to derive a formula from the multiplication formula of probability theory: $p(x|y)=\frac{p(x,y)}{p(y)}=\dots$ … (original post 2019-05-26 21:00:45 · 603 views · 2 comments)
Reinforcement Learning Exercise 3.19
Exercise 3.19 The value of an action, $q_\pi(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small back… (original post 2019-04-09 23:15:14 · 672 views · 0 comments)
Reinforcement Learning Exercise 3.18
Exercise 3.18 $\upsilon_\pi(s)=\mathbb E_\pi(G_t|S_t=s)=\sum_{a\in\mathcal A}\mathbb E_\pi(G_t|S_t=s,A_t=a)P(A_t=a|S_t=s)$. Since $P(A_t=a|S_t=s)=\pi(a|s)$, it follows that $\upsilon_\pi(s)=\sum_{a\in\mathcal A}\mathbb E_\pi(G_t|S_t=s,A_t=a)\pi(a|s)$ … (original post 2019-04-07 22:45:25 · 393 views · 0 comments)
Reinforcement Learning Exercise 3.17
Derive the Bellman equation for the action value. By definition: $Q_\pi(s,a)=\mathbb E_\pi(G_t|S_t=s,A_t=a)=\mathbb E_\pi\left(\sum_{k=0}^\infty\gamma^k R_{t+k+1}\Big|S_t=s,A_t=a\right)=\sum_{s'}\left[\mathbb E_\pi\left(\sum_{k=0}^\infty\gamma^k R_{t+k+1}\Big|S_t=s,A_t=a,S_{t+1}=s'\right)P(S_{t+1}=s'|A_t=a,S_t=s)\right]$ … (original post 2019-04-07 21:29:48 · 607 views · 2 comments)
Reinforcement Learning Exercise 3.15
Exercise 3.15 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or onl… (original post 2019-05-27 21:48:28 · 492 views · 0 comments)
The derivation of Bellman equation for value of a policy
In the book ‘Reinforcement Learning - An Introduction’, Chapter 3, the author gives the Bellman equation for $v_\pi$ as equation (3.14), but without a detailed derivation. That left me confused… (original post 2019-05-28 22:59:10 · 244 views · 0 comments)
Reinforcement Learning Exercise 4.1
Example 4.1 Consider the $4\times 4$ gridworld shown below. The nonterminal states are $\mathcal S=\{1,2,\dots,14\}$. There are four actions possible in each stat… (original post 2019-06-02 15:06:43 · 829 views · 2 comments)
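The example can be reproduced with a short iterative-policy-evaluation sketch (mine, not the post's code): states 1–14 are nonterminal, the two corners are terminal, every transition earns reward −1, the policy is equiprobable random, and $\gamma=1$:

```python
import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {0, 15}

def next_state(s, action):
    """Move on the 4x4 grid; bumping a wall leaves the state unchanged."""
    r, c = divmod(s, 4)
    dr, dc = action
    r2, c2 = r + dr, c + dc
    if 0 <= r2 < 4 and 0 <= c2 < 4:
        return 4 * r2 + c2
    return s

V = np.zeros(16)
while True:
    delta = 0.0
    for s in range(16):
        if s in TERMINAL:
            continue
        # expected one-step backup under the equiprobable random policy
        v = sum(0.25 * (-1 + V[next_state(s, a)]) for a in ACTIONS)
        delta = max(delta, abs(v - V[s]))
        V[s] = v            # in-place sweep
    if delta < 1e-10:
        break

print(np.round(V.reshape(4, 4)))   # converges to the values in Figure 4.1
```

The converged values match the $k=\infty$ panel of Figure 4.1, e.g. $-14$ for the states adjacent to a terminal corner and $-22$ for the states diagonally opposite one.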
Reinforcement Learning Exercise 4.2
Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively… (original post 2019-06-02 18:23:10 · 670 views · 2 comments)
Reinforcement Learning Exercise 4.3
Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_\pi$ and its successive approximation by a sequence of functions $q_0,q_1,q_2,\dots$… (original post 2019-06-02 18:41:25 · 352 views · 0 comments)
Reinforcement Learning Exercise 4.6
Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal A(s)|$… (original post 2019-06-20 22:12:02 · 384 views · 0 comments)
Reinforcement Learning Exercise 4.5
Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attentio… (original post 2019-06-20 23:34:13 · 454 views · 0 comments)
Reinforcement Learning Exercise 3.23
Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot. This picture shows the mechanism of the recycling robot. To give the Bellman equation for $q_*$ for the recycling robo… (original post 2019-05-02 19:19:15 · 367 views · 0 comments)