Reinforcement Learning
Bungehurst
Limits can yet be broken; even perfection is not the end.
[Solved] "The Unity environment took too long to respond. Make sure that :\n"
When running multi-process parallel training, Unity raises the error above and the program hangs; the suspected cause is that graphics output was not disabled. The fix is to modify a file, located at: … (Posted 2022-06-20)
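The post's exact file edit is elided above; as a rough illustration of the same idea, here is a minimal sketch using the `mlagents_envs` Python API, assuming a built environment binary (the path below is hypothetical). Launching headless workers with `no_graphics=True` and distinct `worker_id`s is a common way to avoid this timeout.

```python
from mlagents_envs.environment import UnityEnvironment

def make_env(worker_id: int) -> UnityEnvironment:
    # no_graphics=True runs the Unity player without rendering, which
    # often prevents the "took too long to respond" hang on headless
    # machines; a distinct worker_id per process keeps the communication
    # ports of parallel instances from colliding.
    return UnityEnvironment(
        file_name="./build/env.x86_64",  # hypothetical build path
        worker_id=worker_id,
        no_graphics=True,
    )

envs = [make_env(i) for i in range(4)]  # e.g. four parallel workers
```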
What is Bootstrapping?
Bootstrapping: both policy iteration and value iteration update estimates of the value of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates, which we call bootstrapping. (Posted 2021-01-11)
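To make the idea concrete, here is a minimal sketch (not code from the post) of one sweep of iterative policy evaluation, assuming a small tabular MDP in a hypothetical transition-table format; each new `V[s]` is computed from the current estimates of its successors.

```python
def evaluation_sweep(V, P, pi, gamma=0.9):
    """One sweep of iterative policy evaluation.

    V:  dict state -> current value estimate
    P:  dict state -> dict action -> [(prob, next_state, reward, done), ...]
    pi: dict state -> dict action -> probability  (hypothetical format)
    """
    new_V = {}
    for s in P:
        # Each new estimate of V[s] is built from the *current estimates*
        # V[s'] of successor states: an estimate updated from other
        # estimates, i.e. bootstrapping.
        new_V[s] = sum(
            pi[s][a] * sum(p * (r + gamma * (0.0 if done else V[s2]))
                           for p, s2, r, done in P[s][a])
            for a in P[s]
        )
    return new_V
```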
How to learn out an optimal policy?(Dynamic Programming)
Topics: Policy Evaluation, Policy Improvement, Policy Iteration, Value Iteration. (Posted 2021-01-11)
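As a concrete companion to these topics, here is a minimal value-iteration sketch (not code from the post) under the same hypothetical transition-table format as above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """P[s][a] = [(prob, next_state, reward, done), ...] (hypothetical format)."""
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # Backup: expected one-step reward plus discounted value
                # of the successor state.
                Q[s, a] = sum(p * (r + gamma * V[s2] * (not done))
                              for p, s2, r, done in P[s][a])
        new_V = Q.max(axis=1)  # greedy improvement folded into each sweep
        if np.max(np.abs(new_V - V)) < tol:
            return new_V, Q.argmax(axis=1)  # values and a greedy policy
        V = new_V
```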
Basic terminologies in RL
Goals & Rewards. Immediate reward: at each time step, the reward is a simple number, $R_t \in \mathbb{R}$. Cumulative reward (return): the total amount of immediate reward the agent receives. Reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward). (Posted 2021-01-02)
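A small illustration (my own, assuming rewards collected in a Python list) of one common form of the return, the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$:

```python
def discounted_return(rewards, gamma=0.99):
    # G_t = R_{t+1} + gamma * R_{t+2} + ..., computed backwards in one pass.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]  # returns[t] == G_t

print(discounted_return([1, 1, 1], gamma=0.5))  # [1.75, 1.5, 1.0]
```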
From Bandit Problems to MDPs
Background: in bandit problems, we estimated the value $q_*(a)$ of each action $a$. In MDPs, we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections. (Posted 2020-12-31)
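For reference (standard definitions from Sutton & Barto, not from the truncated snippet itself), the two quantities are tied together by the Bellman optimality equations:

```latex
% Bellman optimality equations relating v_* and q_*:
v_*(s) = \max_a q_*(s, a)
       = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
```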
Gradient Bandit Algorithms
Background: we consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. Denote $\Pr\{A_t = a\} \doteq \dfrac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$. (Posted 2020-12-31)
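A minimal sketch of one step of a gradient bandit algorithm built on this soft-max distribution (function and parameter names are illustrative; `H` is assumed to be a float NumPy array, and the baseline is typically the running average of rewards so far):

```python
import numpy as np

def gradient_bandit_step(H, baseline, reward_fn, alpha=0.1,
                         rng=np.random.default_rng()):
    # Soft-max over preferences H gives the action probabilities pi_t(a).
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    a = rng.choice(len(H), p=pi)
    R = reward_fn(a)  # hypothetical callable returning the reward for arm a
    # H(A_t) moves up if R beats the baseline and down otherwise; all other
    # arms move the opposite way:
    #   H <- H + alpha * (R - baseline) * (1{a} - pi)
    one_hot = np.zeros_like(H)
    one_hot[a] = 1.0
    H += alpha * (R - baseline) * (one_hot - pi)
    return H, R
```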
Upper-Confidence-Bound(UCB) Action Selection
Background: in the ε-greedy method, we choose non-greedy actions randomly for exploration, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. Upper-Confidence-Bound: in order to take into account both how close their estimates are to being maximal and the uncertainties in those estimates… (Posted 2020-12-30)
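A minimal sketch of UCB action selection, $A_t \doteq \arg\max_a \bigl[\, Q_t(a) + c\sqrt{\ln t / N_t(a)} \,\bigr]$ (names are illustrative; `Q` and `N` are assumed to be NumPy arrays of value estimates and action counts):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # By convention, an action that has never been tried (N[a] == 0) is
    # considered maximizing and is selected first.
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    # The exploration bonus shrinks as N[a] grows and grows slowly with t.
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```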
Incremental Implementation
Stationary problem: in stationary problems, the reward probabilities do not change over time. As the number of steps grows, we would like to compute the value of an action efficiently, in particular with constant memory and constant per-time-step computation. … (Posted 2020-12-30)
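A minimal sketch of the incremental update $Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$, which needs only the current estimate and a count (the class name is illustrative):

```python
class IncrementalEstimate:
    """Sample-average value estimate with constant memory (illustrative)."""

    def __init__(self):
        self.q, self.n = 0.0, 0

    def update(self, reward):
        self.n += 1
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q += (reward - self.q) / self.n
        return self.q
```

For nonstationary problems, replacing the step size $1/n$ with a constant $\alpha$ weights recent rewards more heavily than old ones.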
Greedy method and ε-greedy method
Background: the multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and to optimize its decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks… (Posted 2020-12-30)
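A minimal ε-greedy selection sketch (illustrative names; `Q` is assumed to be a NumPy array of action-value estimates):

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1, rng=np.random.default_rng()):
    # With probability epsilon, explore: pick a uniformly random action;
    # otherwise, exploit: pick the current greedy action.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```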