
Original: Reinforcement Learning Exercise 7.4

Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as $G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)\bigr]$ ...

2019-12-08 22:22:58 405
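
As a quick check of that identity (my own sketch, using the book's convention that $Q$ of a terminal state–action pair is zero, and writing $\tau = \min(t+n, T)$): the bracketed terms telescope, since the $\gamma\,Q_k(S_{k+1},A_{k+1})$ part of term $k$ cancels the $-Q_{k-1}(S_k,A_k)$ part of term $k+1$ after discounting, leaving

$$
Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\tau-1}\gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1},A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
= \sum_{k=t}^{\tau-1}\gamma^{k-t}R_{k+1} + \gamma^{\tau-t}\,Q_{\tau-1}(S_\tau,A_\tau),
$$

which is exactly the n-step Sarsa return $G_{t:t+n}$ of (7.4).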

Original: Reinforcement Learning Exercise 7.1

Exercise 7.1 In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in (7....

2019-11-14 22:42:03 524

Original: Reinforcement Learning Exercise 6.6

Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$ and $\frac{5}{6}$, for states A thr...

2019-10-29 23:08:48 563
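
A minimal sketch (my own check, not taken from the post) that recovers these values by solving the Bellman equations of the five-state random walk of Example 6.2 as a linear system; the 0/+1 terminal rewards and equiprobable moves follow the book.

```python
import numpy as np

# Five-state random walk (Example 6.2): states A..E, equiprobable left/right moves,
# reward 0 everywhere except +1 when stepping right from E into the terminal state.
# Bellman equations: v(s) = 0.5*(r_left + v(left)) + 0.5*(r_right + v(right)),
# with v(terminal) = 0, which gives the linear system (I - P) v = b, where P holds
# the 0.5 transition probabilities to nonterminal neighbours.
P = np.zeros((5, 5))
b = np.zeros(5)
for s in range(5):
    if s > 0:
        P[s, s - 1] = 0.5      # move left to a nonterminal neighbour
    if s < 4:
        P[s, s + 1] = 0.5      # move right to a nonterminal neighbour
b[4] = 0.5                     # expected reward: 0.5 chance of the +1 exit from E

v = np.linalg.solve(np.eye(5) - P, b)
print(v)                       # -> [1/6, 2/6, 3/6, 4/6, 5/6]
```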

Original: Reinforcement Learning Exercise 6.2

Exercise 6.3 From the results shown in the left graph of the random walk example it appears that the first episode results in a change in only $V(A)$. What does this tell you about what happene...

2019-10-04 14:21:04 517

Original: Reinforcement Learning Exercise 6.9 & 6.10

Exercise 6.9: Windy Gridworld with King’s Moves (programming) Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than the usual four. How much better ca...

2019-10-03 18:37:08 845
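
Not the post's code, but a compact sketch of the kind of experiment the exercise asks for: ε-greedy Sarsa on the windy gridworld with the eight King's-move actions. The grid size, wind strengths, start and goal follow Example 6.5; the step size, exploration rate and episode count are arbitrary choices for illustration.

```python
import numpy as np

# Windy gridworld (Example 6.5) with the eight King's-move actions (Exercise 6.9).
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]          # upward push per column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
           (-1, -1), (-1, 1), (1, -1), (1, 1)]  # up, down, left, right, diagonals
ALPHA, EPS, GAMMA = 0.5, 0.1, 1.0
rng = np.random.default_rng(0)

def step(state, action):
    """Apply the wind of the current column, then the move, then clip to the grid."""
    r, c = state
    dr, dc = action
    nr = min(max(r + dr - WIND[c], 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    return (nr, nc), -1.0                        # constant reward of -1 per step

def eps_greedy(Q, s):
    if rng.random() < EPS:
        return int(rng.integers(len(ACTIONS)))
    q = Q[s[0], s[1]]
    return int(rng.choice(np.flatnonzero(q == q.max())))

Q = np.zeros((ROWS, COLS, len(ACTIONS)))
for episode in range(500):
    s, a = START, eps_greedy(Q, START)
    while s != GOAL:
        s2, r = step(s, ACTIONS[a])
        a2 = eps_greedy(Q, s2)
        # Sarsa update: bootstrap on the action actually chosen in the next state.
        Q[s[0], s[1], a] += ALPHA * (r + GAMMA * Q[s2[0], s2[1], a2] - Q[s[0], s[1], a])
        s, a = s2, a2

# Length of a greedy episode after training.
s, steps = START, 0
while s != GOAL and steps < 100:
    s, _ = step(s, ACTIONS[int(Q[s[0], s[1]].argmax())])
    steps += 1
print("greedy episode length:", steps)
```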

Original: Reinforcement Learning Exercise 6.1

Exercise 6.1 If $V$ changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ ...

2019-10-03 12:31:59 326 1

Original: Reinforcement Learning Exercise 5.13

Exercise 5.13 Show the steps to derive (5.14) from (5.12). $\rho_{t:T-1}R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} \frac{\pi(A_{t+2} \mid S_{t+2})}{b(A_{t+2} \mid S_{t+2})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})} R_{t+1} \quad (5.12)$ ...

2019-09-10 23:40:51 269
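
The crucial observation in that derivation (my own summary of the book's argument, not quoted from the post): $R_{t+1}$ depends only on $S_t$ and $A_t$, and each later ratio has conditional expectation one under the behaviour policy,

$$
\mathbb{E}\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \,\middle|\, S_k\right]
= \sum_{a} b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = 1,
\qquad\text{so}\qquad
\mathbb{E}\bigl[\rho_{t:T-1} R_{t+1}\bigr] = \mathbb{E}\bigl[\rho_{t:t}\, R_{t+1}\bigr],
$$

which is (5.14).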

Original: Reinforcement Learning Exercise 5.10

Exercise 5.10 Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3).

2019-08-10 16:07:23 223
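
For orientation, a sketch of the derivation the exercise asks for (my own, in the book's notation, with cumulative weights $C_n = \sum_{k=1}^{n} W_k$ so that $C_{n-1} = C_n - W_n$, and using (5.7) to replace $\sum_{k=1}^{n-1} W_k G_k$ by $C_{n-1} V_n$):

$$
V_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k}
= \frac{\sum_{k=1}^{n-1} W_k G_k + W_n G_n}{C_n}
= \frac{(C_n - W_n)\,V_n + W_n G_n}{C_n}
= V_n + \frac{W_n}{C_n}\bigl[G_n - V_n\bigr],
$$

which is (5.8).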

Original: Reinforcement Learning Exercise 5.9

Exercise 5.9 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4. The modified algorithm should b...

2019-08-06 23:05:51 222
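
A minimal sketch of the change the exercise asks for (my own illustration; the episode format and names are assumptions, not the post's code): keep a visit counter per state and move the value estimate incrementally, instead of storing and re-averaging a Returns(s) list.

```python
from collections import defaultdict

def first_visit_mc_update(episode, V, N, gamma=1.0):
    """Incremental first-visit MC update for one episode.

    episode is a list of (S_t, R_{t+1}) pairs in time order; V and N map states
    to the current value estimate and visit count. Each first visit moves V(s)
    toward the observed return with step size 1/N(s), as in the incremental
    sample-average rule (2.3).
    """
    first_visit = {}
    for t, (s, _) in enumerate(episode):
        first_visit.setdefault(s, t)            # earliest time step of each state
    G = 0.0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = gamma * G + r                        # return following time t
        if first_visit[s] == t:                  # update only on the first visit
            N[s] += 1
            V[s] += (G - V[s]) / N[s]

V, N = defaultdict(float), defaultdict(int)
first_visit_mc_update([("A", 0.0), ("B", 0.0), ("C", 1.0)], V, N)
```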

Original: Reinforcement Learning Chapter 5 Example 5.5

In the book, Example 5.5 shows the infinite variance of ordinary importance sampling in a specific case. I tried this experiment on my computer and got a similar result. Unfortunately, my computer...

2019-08-05 22:53:47 168
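
A minimal sketch of that experiment (my own reconstruction of Example 5.5's setup, not the post's code): a single nonterminal state, a behaviour policy choosing left or right with probability 0.5, and a target policy that always chooses left; left loops back with probability 0.9 and otherwise terminates with reward +1. The ordinary importance-sampling estimate keeps spiking because the ratio is unbounded.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_episode():
    """Generate one episode under the behaviour policy; return (G, rho).

    'right' terminates with reward 0; 'left' loops back with probability 0.9
    (reward 0) or terminates with probability 0.1 (reward +1). The target
    policy always takes 'left', so rho gains a factor 1/0.5 per left action
    and becomes 0 as soon as 'right' is taken.
    """
    rho = 1.0
    while True:
        if rng.random() < 0.5:          # behaviour policy chooses 'right'
            return 0.0, 0.0             # pi('right') = 0 -> ratio is 0
        rho *= 1.0 / 0.5                # pi('left') / b('left') = 1 / 0.5
        if rng.random() < 0.1:          # 'left' terminates with reward +1
            return 1.0, rho

# Ordinary importance sampling: plain average of rho * G over episodes.
n_episodes = 100_000
weighted_returns = np.empty(n_episodes)
for i in range(n_episodes):
    G, rho = one_episode()
    weighted_returns[i] = rho * G

estimates = np.cumsum(weighted_returns) / np.arange(1, n_episodes + 1)
# True value under the target policy is 1, but the running estimate keeps spiking.
print(estimates[[99, 999, 9999, 99999]])
```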

Original: Reinforcement Learning Chapter 5, Example of Blackjack

This article presents source code and experiment results for the blackjack example in the book. Both on-policy and off-policy methods are implemented in this source code. The off-policy part includes ordinary imp...

2019-08-04 11:39:13 418
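
Not the post's code, but a small sketch of the two off-policy estimators it mentions, given per-episode returns and importance-sampling ratios collected for a single state of interest:

```python
import numpy as np

def ordinary_is(returns, ratios):
    """Ordinary importance sampling: simple average of rho * G (book eq. 5.5)."""
    returns, ratios = np.asarray(returns, float), np.asarray(ratios, float)
    return np.mean(ratios * returns)

def weighted_is(returns, ratios):
    """Weighted importance sampling: ratio-weighted average of G (book eq. 5.6)."""
    returns, ratios = np.asarray(returns, float), np.asarray(ratios, float)
    total_weight = ratios.sum()
    return (ratios * returns).sum() / total_weight if total_weight > 0 else 0.0
```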

Original: Reinforcement Learning Exercise 5.6

Exercise 5.6 What is the equation analogous to (5.6) for action values $Q(s, a)$ instead of state values $V(s)$, again given returns generated using $b$? Given a starting state $S_t$, ...

2019-08-03 22:35:54 348

Original: Reinforcement Learning Exercise 5.5

Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probabil...

2019-08-03 16:28:15 1121 2

Original: Reinforcement Learning Exercise 5.4

Exercise 5.4 The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient t...

2019-08-03 15:38:48 534

Original: Reinforcement Learning Exercise 5.1 & 5.2

Exercise 5.1 Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left?...

2019-08-03 14:37:24 561

Original: Reinforcement Learning -- Explanation of Formula (5.2)

The formula (5.2) in Chapter 5 of the book was a little difficult for me to understand, so I spent a while deriving it in detail. First, $q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)$, because for...

2019-07-13 21:58:50 139

Original: Reinforcement Learning Exercise 4.10

Exercise 4.10 What is the analog of the value iteration update (4.10) for action values, $q_{k+1}(s, a)$? Use the result of exercise 3.17: $Q_\pi(s, a) = \sum_{s'} R^a_{s, s'} P^a_{s s'} + \gamma \sum_{s'} [\, \sum_{a'} Q_\pi(s', a')\, \pi(s', \ldots$ ...

2019-07-06 20:21:33 258
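
For reference, the update the exercise asks for, written in the book's four-argument $p(s', r \mid s, a)$ notation (the standard answer, stated here rather than quoted from the truncated post):

$$
q_{k+1}(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_k(s', a')\Bigr].
$$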

Original: Reinforcement Learning Exercise 4.9

Exercise 4.9 (programming) Implement value iteration for the gambler's problem and solve it for $p_h$ = 0.25 and $p_h$ = 0.55. In programming, you may find it convenient to introduce two dummy s...

2019-07-04 23:56:28 784 2
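
Not the post's code; a compact sketch of one way to set this up, following the exercise's hint of fixing the terminal capitals 0 and 100 at values 0 and 1:

```python
import numpy as np

def gamblers_value_iteration(p_h=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (Example 4.3 / Exercise 4.9)."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                              # reward +1 only on reaching the goal
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            # Expected value of each stake: win with prob p_h, lose otherwise.
            values = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
            best = max(values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy (near-optimal) policy; ties broken toward the smallest stake.
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        values = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
        policy[s] = stakes[int(np.argmax(np.round(values, 6)))]
    return V, policy

V, policy = gamblers_value_iteration(p_h=0.25)
print(V[50], policy[50])
```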

Original: Reinforcement Learning Exercise 4.7

Exercise 4.7 (programming) Write a program for policy iteration and re-solve Jack’s car rental problem with the following changes. One of Jack’s employees at the first location rides a bus home each n...

2019-07-04 22:36:59 1616

Original: Reinforcement Learning Exercise 4.4

Exercise 4.4 The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is OK...

2019-06-21 23:04:15 1353 2

Original: Reinforcement Learning Exercise 4.5

Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attentio...

2019-06-20 23:34:13 440

Original: Reinforcement Learning Exercise 4.6

Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon / |\mathcal{A}(s)|$...

2019-06-20 22:12:02 376

Original: Reinforcement Learning Exercise 4.3

Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_\pi$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \ldots$...

2019-06-02 18:41:25 339

Original: Reinforcement Learning Exercise 4.2

Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively...

2019-06-02 18:23:10 658 2

Original: Reinforcement Learning Exercise 4.1

Example 4.1 Consider the $4 \times 4$ gridworld shown below. The nonterminal states are $\mathcal{S} = \{1, 2, \ldots, 14\}$. There are four actions possible in each stat...

2019-06-02 15:06:43 810 2
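
Not the post's code; a small sketch of iterative policy evaluation for this gridworld under the equiprobable random policy, with the -1 reward per step and the two terminal corner states of Example 4.1:

```python
import numpy as np

# 4x4 gridworld of Example 4.1: states 0 and 15 are terminal, reward -1 per step,
# equiprobable random policy, undiscounted.
def policy_evaluation(theta=1e-6):
    V = np.zeros(16)
    terminals = {0, 15}
    moves = [-4, 4, -1, 1]                       # up, down, left, right
    while True:
        delta = 0.0
        for s in range(16):
            if s in terminals:
                continue
            v_new = 0.0
            for m in moves:
                s2 = s + m
                # Moves that would leave the grid keep the agent in place.
                if s2 < 0 or s2 > 15 or (m == -1 and s % 4 == 0) or (m == 1 and s % 4 == 3):
                    s2 = s
                v_new += 0.25 * (-1.0 + V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V.reshape(4, 4)

print(np.round(policy_evaluation()))   # converges to the v_pi values of Figure 4.1
```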

Original: The derivation of the Bellman equation for the value of a policy

In the book 'Reinforcement Learning: An Introduction', Chapter 3, the author gives the Bellman equation for $v_\pi$ as equation (3.14), but without a detailed derivation. That left me confused...

2019-05-28 22:59:10 233

Original: Reinforcement Learning Exercise 3.15

Exercise 3.15 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or onl...

2019-05-27 21:48:28 483

Original: Reinforcement Learning Exercise 3.13

Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$. First, we need to derive a formula from the multiplication rule of probability theory: $p(x \mid y) = \frac{p(x, y)}{p(y)} = \ldots$

2019-05-26 21:00:45 588 2
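
For reference, the relation the exercise asks for, in the four-argument $p$ notation (the standard answer, stated here for convenience rather than quoted from the truncated post):

$$
q_\pi(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].
$$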

Original: Reinforcement Learning Exercise 3.12

Exercise 3.12 Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$. $v_\pi(s) = \mathbb{E}_\pi(G_t \mid S_t = s) = \sum_{g_t}\bigl[g_t \cdot p(g_t \mid s)\bigr] = \sum_{g_t}\bigl[g_t \cdot \frac{p(g_t, s)}{p(s)}\bigr] = \sum_{g_t}\bigl[g_t \cdot \frac{\sum_{a \in \mathcal{A}} p(g_t, s, a)}{p(s)}\bigr] = \sum_{g_t}\bigl\{g_t \cdot \frac{\sum_{a \in \mathcal{A}}[p(g_t \mid s, a) \cdot p(s, a)]}{p(s)}\bigr\} = \sum_{g_t}\{g\ldots$

2019-05-26 17:41:07 911 3
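
The derivation in the excerpt is heading toward the standard identity (stated here for completeness):

$$
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a).
$$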

Original: Reinforcement Learning Exercise 3.11

Exercise 3.11 If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument ...

2019-05-25 22:38:19 534
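
For reference, the standard answer (my own statement, in the book's four-argument $p$ notation):

$$
\mathbb{E}\bigl[R_{t+1} \mid S_t = s\bigr] = \sum_{a} \pi(a \mid s) \sum_{s',\, r} r\, p(s', r \mid s, a).
$$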

Original: Reinforcement Learning Exercise 3.29

Exercise 3.29 Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument functi...

2019-05-03 18:29:19 423 2

Original: Reinforcement Learning Exercise 3.24

Exercise 3.24 Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically...

2019-05-02 20:40:45 641

Original: Reinforcement Learning Exercise 3.23

Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot. This picture shows the mechanism of the recycling robot. To give the Bellman equation for $q_*$ for the recycling robo...

2019-05-02 19:19:15 356

Original: Reinforcement Learning Exercise 3.22

Exercise 3.22 Consider the continuing MDP shown on to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards th...

2019-05-01 15:27:09 892

Original: Reinforcement Learning Exercise 3.19

Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small back...

2019-04-09 23:15:14 657

Original: Reinforcement Learning Exercise 3.18

Exercise 3.18 $\upsilon_\pi(s) = \mathbb{E}_\pi(G_t \mid S_t = s) = \sum_{a \in \mathcal{A}} \mathbb{E}_\pi(G_t \mid S_t = s, A_t = a)\, P(A_t = a \mid S_t = s)$. $\because P(A_t = a \mid S_t = s) = \pi(a \mid s)$, $\therefore \upsilon_\pi(s) = \sum_{a \in \mathcal{A}} \mathbb{E}_\pi(G_t \mid S_t = s, A_t = a)\, \pi(a \mid s)$ ...

2019-04-07 22:45:25 383

Original: Reinforcement Learning - Exercise 3.17

Reinforcement Learning - Exercise 3.17. Derive the Bellman equation for the action value. By definition: $Q_\pi(s, a) = \mathbb{E}_\pi(G_t \mid S_t = s, A_t = a) = \mathbb{E}_\pi\bigl(\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\bigr) = \sum_{s'}\bigl[\mathbb{E}_\pi\bigl(\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a, S_{t+1} = s'\bigr) P(S_{t+1} = s' \mid A_t = a, S_t = s)\bigr] = \sum_{s'}\bigl\{\bigl[\mathbb{E}_\pi(R_{t+1} \mid S\ldots$

2019-04-07 21:29:48 598 2
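
The derivation ends at the Bellman equation for the action-value function; stating the target here for reference (in the book's four-argument $p$ notation):

$$
q_\pi(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr].
$$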

工程控制论 (Engineering Cybernetics) - 钱学森 (Qian Xuesen)

This book is a valuable reference for theoretical researchers and engineering designers working in automation, computer science, information processing, communication theory, aerospace technology, systems engineering, and related fields, and it can also serve as a teaching reference for the corresponding university programs.

2019-05-04
