RL (Chapter 3): Finite Markov Decision Processes

These are reinforcement learning notes, based mainly on Sutton & Barto, Reinforcement Learning: An Introduction (2nd edition), Chapter 3.

Finite MDPs

  • MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward.
  • Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.

The Agent–Environment Interface


  • MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
    (Figure 3.1: the agent–environment interaction in a Markov decision process.)
  • The MDP and agent together give rise to a sequence or trajectory that begins like this:
    $$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \tag{3.1}$$

At some time $t$ the agent receives the environment's state $S_t\in\mathcal S$ and, based on $S_t$, selects an action $A_t\in\mathcal A(s)$. At the next time step $t+1$ it receives a reward $R_{t+1}\in\mathcal R$ together with the new state $S_{t+1}$. Repeating this process produces the trajectory above.

  • In a finite MDP, the sets of states, actions, and rewards ($\mathcal S$, $\mathcal A$, and $\mathcal R$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action. The function $p$ defines the dynamics of the MDP: the probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$.
    $$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \tag{3.2}$$
    • $p$ specifies a probability distribution for each choice of $s$ and $a$, that is,
      $$\sum_{s'\in\mathcal S}\sum_{r\in\mathcal R} p(s', r \mid s, a) = 1, \quad \text{for all } s\in\mathcal S,\ a\in\mathcal A(s) \tag{3.3}$$
  • This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
    • For example, when you play a game against an opponent whose moves you cannot observe, the state is influenced by both you and the opponent; such a process is not an MDP.

An MDP reduces the problem of goal-directed learning from interaction to three signals passing back and forth between agent and environment:
rewards, actions, and states.
Note, of course, that every state must satisfy the Markov property.


From the four-argument dynamics function $p$ one can compute anything else one might want to know about the environment, such as the quantities below (a small Python sketch after the transition-graph example shows how each can be computed from a stored $p$):

  • the state-transition probabilities:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r\in\mathcal R} p(s', r \mid s, a) \tag{3.4}$$

  • the expected rewards for state–action pairs:

$$r(s, a) \doteq \mathbb E[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r\in\mathcal R} r \sum_{s'\in\mathcal S} p(s', r \mid s, a) \tag{3.5}$$

  • the expected rewards for state–action–next-state triples:

$$r(s, a, s') \doteq \mathbb E[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r\in\mathcal R} r\,\frac{p(s', r \mid s, a)}{p(s' \mid s, a)} \tag{3.6}$$


  • e.g., two ways to summarize the dynamics of a finite MDP: a transition table and a transition graph:
    (Figure: a transition table and a transition graph summarizing the dynamics of an example finite MDP.)
    • In the transition graph, there is a state node for each possible state (a large open circle labeled by the name of the state), and an action node for each state–action pair (a small solid circle labeled by the name of the action and connected by a line to the state node). Each arrow corresponds to a triple $(s, s', a)$, and we label the arrow with the transition probability, $p(s' \mid s, a)$, and the expected reward for that transition, $r(s, a, s')$.
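Here is a minimal Python sketch, not from the book, that stores the four-argument dynamics $p(s', r \mid s, a)$ of a small hypothetical MDP as a dictionary and derives $p(s' \mid s, a)$, $r(s, a)$, and $r(s, a, s')$ from it exactly as in (3.4)–(3.6); all state and action names are made up for illustration.

```python
# Hypothetical dynamics p(s', r | s, a) for a tiny two-state MDP,
# stored as {(s, a): {(s_next, r): prob}}.  For each (s, a) pair the
# probabilities must sum to 1, as required by (3.3).
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
}

def state_transition_prob(s, a, s_next):
    """p(s'|s,a) = sum_r p(s', r | s, a)  -- equation (3.4)."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s,a) = sum_{s', r} r * p(s', r | s, a)  -- equation (3.5)."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

def expected_reward_triple(s, a, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s' | s, a)  -- equation (3.6)."""
    denom = state_transition_prob(s, a, s_next)
    num = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return num / denom if denom > 0 else 0.0

print(state_transition_prob("s0", "a0", "s1"))   # 0.5
print(expected_reward("s0", "a0"))               # 0.5
print(expected_reward_triple("s0", "a0", "s1"))  # 1.0
```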

Goals and Rewards

(The reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward.)


The reward signal is your way of communicating to the agent what you want achieved, not how you want it achieved.

  • It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. (Better places for imparting this kind of prior knowledge are the initial policy or the initial value function.)
    • For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game.

Returns and Episodes


  • In general, we seek to maximize the expected return $G_t$, which is defined as some specific function of the reward sequence.

Episodic tasks

  • In the simplest case the return is the sum of the rewards:
    $$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \tag{3.7}$$
    where $T$ is a final time step.
  • This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.

In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal S$, from the set of all states plus the terminal state, denoted $\mathcal S^+$.


Continuing tasks

  • On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
  • The return formulation (3.7) is problematic for continuing tasks because the final time step would be $T = \infty$, and the return could easily be infinite.
  • Thus, in this book we usually use a definition of return that is slightly more complex conceptually but much simpler mathematically.

Discounted return:

  • The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \tag{3.8}$$
    where $0\leq\gamma\leq1$ is called the discount rate.
    • If $\gamma < 1$, the infinite sum in (3.8) has a finite value as long as the reward sequence $\{R_k\}$ is bounded.
    • If $\gamma = 0$, the agent is "myopic," concerned only with maximizing immediate rewards.
    • As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.
  • Returns at successive time steps are related to each other as follows:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = R_{t+1} + \gamma\bigl(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots\bigr) = R_{t+1} + \gamma G_{t+1} \tag{3.9}$$
    • Note that this works for all time steps $t < T$, even if termination occurs at $t+1$, provided we define $G_T = 0$. This often makes it easy to compute returns from reward sequences.

Note that although the return (3.8) is a sum of an infinite number of terms, it is still finite if the reward is a nonzero constant and $\gamma < 1$. For example, if the reward is a constant $+1$, then the return is
$$G_t = \sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma} \tag{3.10}$$
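The recursion (3.9) also gives a convenient way to compute returns in code: sweep backward over a reward sequence. The following is a minimal sketch (the reward values are made up) that does this and checks the constant-reward case against (3.10):

```python
# Compute discounted returns for a finite reward sequence by sweeping backward
# with G_t = R_{t+1} + gamma * G_{t+1}, i.e. equation (3.9), using G_T = 0.
def discounted_returns(rewards, gamma):
    """rewards[t] is R_{t+1}; returns [G_0, G_1, ..., G_{T-1}]."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

gamma = 0.9
print(discounted_returns([1.0, 0.0, 2.0], gamma))  # G_0 = 1 + 0.9*0 + 0.81*2 = 2.62

# With a long run of constant +1 rewards, G_0 approaches 1 / (1 - gamma) = 10, as in (3.10).
print(discounted_returns([1.0] * 200, gamma)[0])   # ~10.0
print(1 / (1 - gamma))                             # 10.0
```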


Example 3.4: Pole-Balancing

  • The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over: A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure.

(Figure: the pole-balancing (cart-pole) task.)

  • This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be $+1$ for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. In this case, successful balancing forever would mean a return of infinity.
  • Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be $-1$ on each failure and zero at all other times. The return at each time would then be related to $-\gamma^{K-1}$, where $K$ is the number of time steps before failure (see the quick check after this list).
  • In either case, the return is maximized by keeping the pole balanced for as long as possible.
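A quick numerical check of the continuing-task formulation (assuming, as above, a reward of $-1$ at the failure step and zero otherwise): the discounted return $-\gamma^{K-1}$ moves toward $0$ as the number of steps before failure $K$ grows, so longer balancing always yields a higher return.

```python
# Return under the continuing formulation: -gamma**(K-1), where K is the number
# of time steps before failure.  Larger K gives a less negative (higher) return.
gamma = 0.9
for K in (1, 10, 100):
    print(K, -gamma ** (K - 1))   # -1.0, about -0.387, about -3e-05
```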

Exercise 3.7

Imagine that you are designing a robot to run a maze. You decide to give it a reward of $+1$ for escaping from the maze and a reward of zero at all other times. You decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

ANSWER

Because the task is undiscounted and the only nonzero reward is the $+1$ for escaping, every episode that ends in escape has return exactly $+1$ no matter how many steps it takes. The return therefore carries no information about how quickly the robot escapes, so there is nothing for it to improve. To communicate the goal of escaping quickly, one could instead give a reward of $-1$ on every time step before escape (or use discounting).

Unified Notation for Episodic and Continuing Tasks


  • $S_{t,i}$: the state representation at time $t$ of episode $i$
    (and similarly for $A_{t,i}$, $R_{t,i}$, $\pi_{t,i}$, $T_i$, etc.).

  • In fact, when we discuss episodic tasks we almost never have to distinguish between different episodes. We are almost always considering a particular episode, or stating something that is true for all episodes.
  • Accordingly, in practice we write $S_t$ to refer to $S_{t,i}$, and so on.

  • We have defined the return as a sum over a finite number of terms in one case (3.7) and as a sum over an infinite number of terms in the other (3.8). These two can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero.
    • For example, consider the state transition diagram:
      (Figure: a state-transition diagram in which episode termination enters an absorbing state that loops back to itself with zero reward.) Summing the reward sequence $+1, +1, +1, 0, 0, 0, \dots$, we get the same return whether we sum over the first $T$ rewards (here $T = 3$) or over the full infinite sequence. This remains true even if we introduce discounting (see the quick check after this list).
  • Thus, we can define the return, in general, according to (3.8):
    $$G_t \doteq \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
    Alternatively, we can write
    $$G_t \doteq \sum_{k=t+1}^{T}\gamma^{k-t-1}R_k \tag{3.11}$$
    including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
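A quick check of this unification (the three $+1$ rewards are taken from the example above; padding with zeros stands in for the absorbing state):

```python
# The episodic sum over the first T = 3 rewards equals the "continuing" discounted
# sum over the sequence padded with the absorbing state's zero rewards.
gamma = 0.5
episode = [1.0, 1.0, 1.0]
padded = episode + [0.0] * 100            # absorbing state: zero reward forever
g_episodic = sum(gamma**k * r for k, r in enumerate(episode))
g_continuing = sum(gamma**k * r for k, r in enumerate(padded))
print(g_episodic, g_continuing)           # both 1.75
```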

Policies and Value Functions


  • Value functions: functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state.
    • Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
  • Policy: a mapping from states to probabilities of selecting each possible action.
    • If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$ (a small sketch of such a tabular policy follows below).
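As a concrete picture of a stochastic policy, here is a minimal Python sketch (the states and actions are made up) that stores $\pi(a \mid s)$ as nested dictionaries and samples $A_t \sim \pi(\cdot \mid S_t)$:

```python
import random

# A tabular stochastic policy pi(a|s): for each state, a dict of action -> probability.
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 1.0},                 # deterministic in s1
}

def sample_action(pi, state, rng=random):
    """Draw an action a with probability pi[state][a]."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))       # "right" about 70% of the time
```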

Exercise 3.11

If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

ANSWER

$$\mathbb E_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\sum_{s'}\sum_{r} r\, p(s', r \mid s, a) = \sum_a \pi(a \mid s)\, r(s, a)$$


  • The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. We call the function $v_\pi$ the state-value function for policy $\pi$.
  • For MDPs, we can define $v_\pi$ formally by
    $$v_\pi(s) \doteq \mathbb E_\pi[G_t \mid S_t = s] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s\in\mathcal S \tag{3.12}$$
    where $\mathbb E_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$.
  • Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following $\pi$. We call $q_\pi$ the action-value function for policy $\pi$:
    $$q_\pi(s, a) \doteq \mathbb E_\pi[G_t \mid S_t = s, A_t = a] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] \tag{3.13}$$
    (A small Monte Carlo sketch of estimating $v_\pi$ follows below.)
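Definition (3.12) can be estimated directly by sampling: average the discounted return over many episodes that start in $s$ and follow $\pi$. The following Monte Carlo sketch uses a made-up two-state episodic MDP and a trivial policy, purely for illustration:

```python
import random

# Hypothetical dynamics: (s, a) -> list of ((s_next, reward), probability); "T" is terminal.
p = {
    ("s0", "go"): [(("s1", 1.0), 0.8), (("T", 0.0), 0.2)],
    ("s1", "go"): [(("T", 5.0), 1.0)],
}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}   # only one action here, for brevity
gamma = 0.9

def run_episode(s, rng):
    """Sample one episode starting from s under pi; return its discounted return G."""
    G, discount = 0.0, 1.0
    while s != "T":
        actions, probs = zip(*pi[s].items())
        a = rng.choices(actions, weights=probs, k=1)[0]
        outcomes, out_probs = zip(*p[(s, a)])
        s, r = rng.choices(outcomes, weights=out_probs, k=1)[0]
        G += discount * r
        discount *= gamma
    return G

rng = random.Random(0)
n = 100_000
print(sum(run_episode("s0", rng) for _ in range(n)) / n)  # approximately v_pi(s0)
# Exact value for comparison: 0.8 * (1 + 0.9 * 5) = 4.4
```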

Exercise 3.12

Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

ANSWER
$$\begin{aligned} v_\pi(s) &= \mathbb E_\pi[G_t \mid S_t = s] \\ &= \sum_a \pi(a \mid s)\,\mathbb E_\pi[G_t \mid S_t = s, A_t = a] \\ &= \sum_a \pi(a \mid s)\, q_\pi(s, a) \end{aligned}$$


Exercise 3.13

Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

ANSWER
$$\begin{aligned} q_\pi(s,a) &= \mathbb E_\pi[G_t \mid S_t = s, A_t = a] \\ &= \mathbb E[R_{t+1} \mid S_t = s, A_t = a] + \gamma\,\mathbb E_\pi[G_{t+1} \mid S_t = s, A_t = a] \\ &= \sum_{s',r} p(s', r \mid s, a)\, r + \gamma \sum_{s',r} p(s', r \mid s, a)\,\mathbb E_\pi[G_{t+1} \mid S_{t+1} = s'] \\ &= \sum_{s',r} p(s', r \mid s, a)\, r + \gamma \sum_{s',r} p(s', r \mid s, a)\, v_\pi(s') \\ &= \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr] \end{aligned}$$


Bellman equation

  • A fundamental property of value functions is that they satisfy recursive relationships:
    $$v_\pi(s) \doteq \mathbb E_\pi[G_t \mid S_t = s] = \mathbb E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\sum_{s'}\sum_{r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \text{for all } s\in\mathcal S \tag{3.14}$$
    It is really a sum over all values of the three variables, $a$, $s'$, and $r$. For each triple, we compute its probability, $\pi(a \mid s)\,p(s', r \mid s, a)$, weight the quantity in brackets by that probability, then sum over all possibilities to get an expected value.
  • Equation (3.14) is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states.

The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
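In the finite case, the Bellman equation (3.14) can be solved numerically by treating it as a fixed-point equation and iterating (iterative policy evaluation). The MDP below is a made-up example, not one from the book:

```python
# Iterative policy evaluation: repeatedly apply the right-hand side of (3.14)
# until the values stop changing.  p maps (s, a) to ((s_next, reward), prob) pairs;
# "T" denotes the terminal state, whose value is fixed at 0.
p = {
    ("s0", "stay"):  [(("s0", 1.0), 0.9), (("T", 0.0), 0.1)],
    ("s0", "leave"): [(("s1", 0.0), 1.0)],
    ("s1", "stay"):  [(("s1", 2.0), 0.5), (("T", 0.0), 0.5)],
}
pi = {"s0": {"stay": 0.5, "leave": 0.5}, "s1": {"stay": 1.0}}
gamma = 0.9

def policy_evaluation(p, pi, gamma, tol=1e-10):
    v = {s: 0.0 for s in pi}
    v["T"] = 0.0
    while True:
        delta = 0.0
        for s in pi:
            new_v = sum(
                pi[s][a] * prob * (r + gamma * v[s_next])   # (3.14)
                for a in pi[s]
                for (s_next, r), prob in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation(p, pi, gamma))   # {'s0': ..., 's1': ..., 'T': 0.0}
```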


Backup diagram:
(Figure: the backup diagram for $v_\pi$.)

  • Starting from state s s s, the agent could take any of some set of actions—three are shown in the diagram—based on its policy π \pi π. From each of these, the environment could respond with one of several next states, s ′ s' s (two are shown in the figure), along with a reward, r r r, depending on its dynamics given by the function p p p.
  • The Bellman equation (3.14) averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
  • We call diagrams like the one above backup diagrams because they diagram relationships that form the basis of the update or backup operations. These operations transfer value information back to a state (or a state–action pair) from its successor states (or state–action pairs).

Note that, unlike transition graphs, the state nodes of backup diagrams do not necessarily represent distinct states; for example, a state might be its own successor.


Exercise 3.17

What is the Bellman equation for action values, that is, for $q_\pi$?

ANSWER

$$q_\pi(s, a) = \sum_{s'}\sum_{r} p(s', r \mid s, a)\Bigl[r + \gamma\sum_{a'}\pi(a' \mid s')\,q_\pi(s', a')\Bigr]$$


Exercise 3.18

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy.

(Backup diagram: the state $s$ at the root, with the actions available under $\pi$ as leaf nodes.)
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$.

ANSWER

$$v_\pi(s) = \mathbb E_\pi[q_\pi(S_t, A_t) \mid S_t = s] = \sum_a \pi(a \mid s)\,q_\pi(s, a)$$


Exercise 3.19

The value of an action, q π ( s , a ) q_\pi(s, a) qπ(s,a), depends on the expected next reward and the expected sum of the remaining rewards.

(Backup diagram: the state–action pair $(s, a)$ at the root, with the possible next states $s'$ as leaf nodes.)
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$.

ANSWER

$$q_\pi(s, a) = \mathbb E[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s'}\sum_{r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

Optimal Policies and Optimal Value Functions

  • A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s\in\mathcal S$. (This is in fact a partial order on policies.)
  • There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. We denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as
    $$v_*(s) = \max_\pi v_\pi(s), \quad \text{for all } s\in\mathcal S \tag{3.15}$$

  • The following is the Bellman equation for $v_*$, or the Bellman optimality equation. Intuitively, it expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
    $$\begin{aligned} v_*(s) &= \max_{a\in\mathcal A(s)} q_{\pi_*}(s, a) \\ &= \max_a \mathbb E_{\pi_*}[G_t \mid S_t = s, A_t = a] \\ &= \max_a \mathbb E_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \max_a \mathbb E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \qquad (3.18) \\ &= \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_*(s')\bigr] \qquad (3.19) \end{aligned}$$
    The last two equations are two forms of the Bellman optimality equation for $v_*$.
  • The Bellman optimality equation for $q_*$ is
    $$q_*(s, a) = \mathbb E\Bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Bigr] = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr] \tag{3.20}$$
    (Figure: backup diagrams for $v_*$ and $q_*$.)
    • These are the same as the backup diagrams for $v_\pi$ and $q_\pi$ presented earlier, except that arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value given some policy.

  • For finite MDPs, the Bellman optimality equation for $v_*$ (3.19) has a unique solution. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $n$ states, then there are $n$ equations in $n$ unknowns. If the dynamics $p$ of the environment are known, then in principle one can solve this system of equations for $v_*$ using any one of a variety of methods for solving systems of nonlinear equations. One can solve a related set of equations for $q_*$. (A small value-iteration sketch below shows one way to do this numerically.)
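One standard way to solve (3.19) numerically is to iterate it as a fixed-point update (value iteration) and then read off a greedy policy. The sketch below uses a made-up two-state MDP; it is an illustration of the idea, not the book's pseudocode:

```python
# Value iteration: repeatedly apply the right-hand side of (3.19) until convergence,
# then extract a greedy (hence optimal) policy from the resulting v_*.
p = {  # (s, a) -> list of ((s_next, reward), probability)
    ("s0", "a0"): [(("s0", 0.0), 0.5), (("s1", 2.0), 0.5)],
    ("s0", "a1"): [(("s1", 1.0), 1.0)],
    ("s1", "a0"): [(("s0", 0.0), 1.0)],
    ("s1", "a1"): [(("s1", 3.0), 1.0)],
}
actions = {"s0": ["a0", "a1"], "s1": ["a0", "a1"]}
gamma = 0.9

def q_from_v(s, a, v):
    """One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p[(s, a)])

def value_iteration(tol=1e-10):
    v = {s: 0.0 for s in actions}
    while True:
        delta = 0.0
        for s in actions:
            new_v = max(q_from_v(s, a, v) for a in actions[s])   # (3.19)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v_star = value_iteration()
pi_star = {s: max(actions[s], key=lambda a: q_from_v(s, a, v_star)) for s in actions}
print(v_star)    # approximate v_*
print(pi_star)   # a policy that is greedy with respect to v_*
```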

Example 3.9: Bellman Optimality Equations for the Recycling Robot

  • The robot has two states, high and low battery level. In the high state it can actively search for cans or wait for someone to bring it a can; in the low state it can also choose to go back and recharge. If it runs its battery down while searching, it receives a negative reward (the robot is then rescued and recharged, so the next state is high); otherwise it receives the reward corresponding to its action.
    (Figure: the transition graph for the recycling robot example.)
  • Using (3.19), we can explicitly give the Bellman optimality equation for the recycling robot example. To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, respectively, by h, l, s, w, and re. Because there are only two states, the Bellman optimality equation consists of two equations.
    $$v_*(\text h) = \max\Bigl\{\, r_{\text s} + \gamma\bigl[\alpha\, v_*(\text h) + (1-\alpha)\, v_*(\text l)\bigr],\;\; r_{\text w} + \gamma v_*(\text h) \,\Bigr\}$$
    $$v_*(\text l) = \max\Bigl\{\, \beta r_{\text s} - 3(1-\beta) + \gamma\bigl[(1-\beta)\, v_*(\text h) + \beta\, v_*(\text l)\bigr],\;\; r_{\text w} + \gamma v_*(\text l),\;\; \gamma v_*(\text h) \,\Bigr\}$$
    For any choice of $r_{\text s}$, $r_{\text w}$, $\alpha$, $\beta$, and $\gamma$, with $0 \leq \gamma < 1$ and $0 \leq \alpha, \beta \leq 1$, there is exactly one pair of numbers, $v_*(\text h)$ and $v_*(\text l)$, that simultaneously satisfy these two nonlinear equations (a numerical illustration follows below).
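To see the "exactly one pair" claim concretely, one can pick some parameter values (the numbers below are assumptions chosen only for illustration) and iterate the two right-hand sides until they stop changing:

```python
# Fixed-point iteration on the two Bellman optimality equations above,
# with hypothetical parameter values (not from the book).
r_s, r_w = 2.0, 0.5          # assumed rewards for searching and waiting
alpha, beta = 0.8, 0.6       # assumed probabilities of keeping enough charge
gamma = 0.9

v_h = v_l = 0.0
for _ in range(10_000):
    v_h, v_l = (
        max(r_s + gamma * (alpha * v_h + (1 - alpha) * v_l),
            r_w + gamma * v_h),
        max(beta * r_s - 3 * (1 - beta) + gamma * ((1 - beta) * v_h + beta * v_l),
            r_w + gamma * v_l,
            gamma * v_h),
    )

print(v_h, v_l)              # the unique solution for these parameter values
```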

  • Once one has $v_*$, it is relatively easy to determine an optimal policy.
    • For each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. You can think of this as a one-step search: if you have the optimal value function $v_*$, then the actions that appear best after a one-step search will be optimal actions.
    • Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.
    • The beauty of $v_*$ is that if one uses it to evaluate the short-term consequences of actions (specifically, the one-step consequences), then a greedy policy is actually optimal in the long-term sense in which we are interested, because $v_*$ already takes into account the reward consequences of all possible future behavior.
  • Having $q_*$ makes choosing optimal actions even easier. With $q_*$, the agent does not even have to do a one-step-ahead search: for any state $s$, it can simply find any action that maximizes $q_*(s, a)$.

  • Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, and thus to solving the reinforcement learning problem. However, this solution is rarely directly useful. It is akin to an exhaustive search, looking ahead at all possibilities, computing their probabilities of occurrence and their desirabilities in terms of expected rewards. This solution relies on at least three assumptions that are rarely true in practice:
    • the dynamics of the environment are accurately known;
    • computational resources are sufficient to complete the calculation (in particular, extensive memory may be required to build up accurate approximations of value functions, policies, and models; in most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made);
    • the states have the Markov property.

  • In reinforcement learning one typically has to settle for approximate solutions. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.
    • For example, heuristic search methods can be viewed as expanding the right-hand side of (3.19) several times, up to some depth, forming a "tree" of possibilities, and then using a heuristic evaluation function to approximate $v_*$ at the "leaf" nodes.

Exercise 3.22

Consider the continuing MDP shown below. The only decision to be made is in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and $\pi_{\text{right}}$. Which policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?

(Figure: the continuing MDP of Exercise 3.22, with a top state and the two actions left and right.)
ANSWER

From the top state, $\pi_{\text{left}}$ produces the reward sequence $+1, 0, +1, 0, \dots$ and $\pi_{\text{right}}$ produces $0, +2, 0, +2, \dots$, so
$$v_{\pi_{\text{left}}}(\text{top}) = \frac{1}{1-\gamma^2}, \qquad v_{\pi_{\text{right}}}(\text{top}) = \frac{2\gamma}{1-\gamma^2}.$$
For $\gamma = 0$: $1 > 0$, so $\pi_{\text{left}}$ is optimal. For $\gamma = 0.9$: $1/(1-\gamma^2) \approx 5.26 < 2\gamma/(1-\gamma^2) \approx 9.47$, so $\pi_{\text{right}}$ is optimal. For $\gamma = 0.5$: both returns equal $4/3$, so both policies are optimal.

  • Based on the return formulas above (compare $1$ with $2\gamma$), $\gamma = 0.5$ is the borderline: if $\gamma > 0.5$, right is optimal; if $\gamma < 0.5$, left is optimal; if $\gamma = 0.5$, both are optimal (the quick check below confirms the three cases).
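A quick numerical check of the answer (assuming the reward sequences $1, 0, 1, 0, \dots$ for $\pi_{\text{left}}$ and $0, 2, 0, 2, \dots$ for $\pi_{\text{right}}$ from the top state, as above):

```python
# Closed-form returns from the top state for the two deterministic policies.
def v_left(gamma):
    return 1 / (1 - gamma**2)

def v_right(gamma):
    return 2 * gamma / (1 - gamma**2)

for gamma in (0.0, 0.5, 0.9):
    print(gamma, v_left(gamma), v_right(gamma))
# 0.0 -> 1.0   vs 0.0    (left optimal)
# 0.5 -> 1.33  vs 1.33   (tie)
# 0.9 -> 5.26  vs 9.47   (right optimal)
```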