RL-Zhao-(Part 1): Basic Concepts [state value (v), action value (q), policy (π), reward, return, trajectories, episode]


1.1 A grid world example

Consider the example shown in Figure 1.2, where a robot moves in a grid world. The robot, called the agent, can move between adjacent cells in the grid. At each time step, it can occupy only a single cell. The white cells are accessible for entry, and the orange cells are forbidden. There is a target cell that the robot would like to reach. We will use such grid world examples throughout this book since they are intuitive for illustrating new concepts and algorithms.
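To make the setup concrete, here is a minimal Python sketch of a 3×3 grid world. The coordinates of the forbidden cells and of the target are hypothetical placeholders (Figure 1.2 defines the actual layout), and the rule that a move off the grid leaves the agent where it is is an assumed modeling choice, not something stated above.

```python
# Minimal 3x3 grid world sketch. Cells are (row, col) pairs, (0, 0) = top-left.
# HYPOTHETICAL layout: Figure 1.2 defines the real forbidden cells and target.
N_ROWS, N_COLS = 3, 3
FORBIDDEN = {(1, 2), (2, 0)}  # hypothetical orange (forbidden) cells
TARGET = (2, 2)               # hypothetical target cell

# Moves to an adjacent cell, as (row offset, col offset).
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}


def in_grid(cell):
    """Return True if the (row, col) cell lies inside the grid."""
    r, c = cell
    return 0 <= r < N_ROWS and 0 <= c < N_COLS


def step(cell, move):
    """Move to an adjacent cell. ASSUMPTION: a move that would leave the
    grid (a collision with the boundary) keeps the agent in place."""
    dr, dc = MOVES[move]
    nxt = (cell[0] + dr, cell[1] + dc)
    return nxt if in_grid(nxt) else cell
```

For example, `step((0, 0), "right")` moves the agent to `(0, 1)`, while `step((0, 0), "left")` bumps into the boundary and leaves it at `(0, 0)`.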

The ultimate goal of the agent is to find a “good” policy that enables it to reach the target cell when starting from any initial cell. How can the “goodness” of a policy be defined? The idea is that the agent should reach the target without entering any forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
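The three criteria can be checked on one sequence of moves from a given start cell. The sketch below reuses a hypothetical 3×3 layout (the forbidden cells and target are placeholders, not those of Figure 1.2) and approximates "no unnecessary detours" as "no longer than the shortest forbidden-free path", found with breadth-first search; that reading of "detour" is an assumption. A good policy would have to pass such a check starting from every initial cell.

```python
from collections import deque

# HYPOTHETICAL 3x3 layout; Figure 1.2 defines the real one.
N_ROWS, N_COLS = 3, 3
FORBIDDEN = {(1, 2), (2, 0)}
TARGET = (2, 2)
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}


def in_grid(cell):
    r, c = cell
    return 0 <= r < N_ROWS and 0 <= c < N_COLS


def shortest_len(start):
    """BFS: number of moves on the shortest path from start to TARGET that
    avoids forbidden cells, or None if the target is unreachable."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        cell, dist = frontier.popleft()
        if cell == TARGET:
            return dist
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if in_grid(nxt) and nxt not in FORBIDDEN and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None


def is_good(start, moves):
    """Check one move sequence against the three criteria in the text."""
    cell = start
    for m in moves:
        dr, dc = MOVES[m]
        nxt = (cell[0] + dr, cell[1] + dc)
        if not in_grid(nxt):   # collided with the boundary of the grid
            return False
        if nxt in FORBIDDEN:   # entered a forbidden cell
            return False
        cell = nxt
    # Must reach the target, with no more moves than the shortest route.
    return cell == TARGET and len(moves) == shortest_len(start)
```

With this layout, `is_good((0, 0), ["right", "down", "down", "right"])` holds, whereas a route through `(1, 2)` fails the forbidden-cell test and a route that steps back and forth fails the detour test.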

1.2 State and action

The first concept to be introduced is the state, which describes the agent’s status with respect to the environment.

In the grid world example, the state corresponds to the agent’s location. Since there are nine cells, there are nine states as well. They are indexed as s1, s2, ..., s9, as shown in Figure 1.3(a). The set of all the states is called the state space, denoted as S = {s1, ..., s9}.
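A small Python sketch of this indexing, mapping each cell location to its state label and back. It assumes row-major numbering (s1 at the top-left, s9 at the bottom-right); Figure 1.3(a) gives the actual numbering, so treat that ordering as an assumption.

```python
# ASSUMPTION: row-major numbering, s1 = top-left cell, s9 = bottom-right cell.
N_ROWS, N_COLS = 3, 3


def state_name(row, col):
    """Map a (row, col) location to its state label, e.g. (0, 0) -> 's1'."""
    return f"s{row * N_COLS + col + 1}"


def state_location(name):
    """Inverse mapping from a state label to its location, e.g. 's5' -> (1, 1)."""
    idx = int(name[1:]) - 1
    return divmod(idx, N_COLS)


# The state space S = {s1, ..., s9}: exactly one state per cell.
S = [state_name(r, c) for r in range(N_ROWS) for c in range(N_COLS)]
```

Keeping the two mappings as inverses of each other means code can work with compact labels like `"s5"` while still recovering the agent's location when it needs to compute moves.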

1.3 State transition

1.4 Policy

1.5 Reward

1.6 Trajectories, returns, and episodes

1.7 Markov decision processes
