Reinforcement Learning Notes: The Multi-Armed Bandit Problem (4) -- Tracking a Nonstationary Environment

笨牛慢耕

Category: Reinforcement Learning. Tags: machine learning, reinforcement learning, k-armed bandit, nonstationary environment tracking, python

Article link: https://blog.csdn.net/chenxy_bwave/article/details/121630186


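0. Preface

The previous parts of this series already discussed several aspects of the multi-armed bandit problem. For details see:

Reinforcement Learning Notes: The Multi-Armed Bandit Problem (1)

Reinforcement Learning Notes: The Multi-Armed Bandit Problem (2) -- Python Simulation
Reinforcement Learning Notes: The Multi-Armed Bandit Problem (3) -- Incremental Implementation of Action-Value Estimation

This part stays with the multi-armed bandit problem and looks at how to track a nonstationary environment.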

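1. Problem description

The discussion so far has dealt with the stationary bandit problem, i.e. it assumed that the value q*(a) of each action does not change over time. Under this assumption every reward receives the same weight when the action value is estimated, which leads to a step-size parameter 1/n that shrinks as time goes on. Real reinforcement learning problems are usually less friendly: the action values q*(a) do change over time. In that case recent rewards should be given more weight than older ones, and one of the most popular ways of doing so is to use a constant step-size parameter. For example, the incremental update rule for Q from the previous part is changed to the following form, where the step size α ∈ (0, 1] is constant:

$$Q_{n+1} = Q_n + \alpha\,[R_n - Q_n]$$

Unrolling this recursion down to the initial estimate gives:

$$Q_{n+1} = (1-\alpha)^{n} Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \tag{10}$$

Note that in the earlier update rule for the stationary bandit problem, Q1 has no influence on later estimates, because the first update uses step size 1 and overwrites it. In Eq. (10), by contrast, the initial value Q1 does appear, with a weight (1-α)^n that decays over time but never becomes exactly zero for α < 1. This is an undesirable side effect.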
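As a quick numerical check, the sketch below applies the constant-step update Q ← Q + α(R − Q) one reward at a time and compares the result with the unrolled weighted sum (1−α)^n·Q1 + Σ α(1−α)^(n−i)·R_i. The function names and the test data are illustrative assumptions, not part of the original code.

```python
import numpy as np

def constant_step_recursive(rewards, alpha, q1):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from q1."""
    q = q1
    for r in rewards:
        q = q + alpha * (r - q)
    return q

def constant_step_closed_form(rewards, alpha, q1):
    """Evaluate (1-alpha)^n * q1 + sum_i alpha * (1-alpha)^(n-i) * R_i."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    i = np.arange(1, n + 1)
    weights = alpha * (1.0 - alpha) ** (n - i)   # weight of R_i, largest for i = n
    return (1.0 - alpha) ** n * q1 + np.dot(weights, rewards)

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
rewards = rng.normal(size=50)           # made-up rewards
q_rec = constant_step_recursive(rewards, alpha=0.1, q1=5.0)
q_cf = constant_step_closed_form(rewards, alpha=0.1, q1=5.0)
print(q_rec, q_cf)                      # the two values should agree up to rounding
```

The deliberately large initial value q1=5.0 makes the residual influence of Q1 visible: it enters both results with the factor (1−α)^n.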

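The coefficients in Eq. (10) sum to 1 (this is just the sum of a geometric series, junior-high math ^-^ hopefully nobody has forgotten it ^-^):

$$(1-\alpha)^{n} + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} = 1 \tag{11}$$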
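For completeness, the geometric-series step can be written out as follows (a short derivation, substituting k = n − i and assuming 0 < α ≤ 1):

```latex
\begin{aligned}
\sum_{i=1}^{n} \alpha (1-\alpha)^{n-i}
  &= \alpha \sum_{k=0}^{n-1} (1-\alpha)^{k}
   = \alpha \cdot \frac{1-(1-\alpha)^{n}}{1-(1-\alpha)}
   = 1-(1-\alpha)^{n}, \\
(1-\alpha)^{n} + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i}
  &= (1-\alpha)^{n} + 1 - (1-\alpha)^{n} = 1 .
\end{aligned}
```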

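Eq. (10) can therefore be regarded as a weighted average. The weights shrink exponentially with the distance from the current time step, which is why it is often called an exponential recency-weighted average (referred to below as the "exponentially weighted average"). Readers familiar with digital signal processing will recognize it as what that field usually calls forgetting filtering or exponential filtering; put more bluntly, it is a first-order IIR filter.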

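As noted earlier, giving every reward the weight 1/n (step size α_n(a) = 1/n) yields the sample-average method, and this method guarantees that the estimate Q eventually converges to the true action value q*.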
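The equivalence between the step size 1/n and the plain arithmetic mean is easy to verify. The following minimal sketch uses made-up rewards and a deliberately odd initial value; it is an illustration, not the code used in this series.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
rewards = rng.normal(loc=2.0, size=200)        # made-up rewards with true mean 2.0

q = 10.0   # Q1; overwritten by the first update, because the step size is 1 when n = 1
for n, r in enumerate(rewards, start=1):
    q = q + (r - q) / n                        # incremental update with step size 1/n

print(q, rewards.mean())                       # both are the sample average
```

Because the first update replaces Q1 entirely, the initial value 10.0 leaves no trace in the final estimate.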

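However, not every choice of step-size sequence guarantees this kind of convergence. A well-known result in stochastic approximation theory states that convergence with probability 1 is assured when the following conditions hold:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^{2}(a) < \infty \tag{12}$$

The first condition guarantees that the steps are large enough to eventually overcome the initial conditions (i.e. Q1) and random fluctuations. The second guarantees that the steps eventually become small enough for the estimate to converge. In other words, the steps must be neither too large nor too small.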
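The two conditions (the step sizes sum to infinity while their squares sum to a finite value) can be probed numerically. The sketch below compares partial sums for α_n = 1/n and for a constant α = 0.1; the cutoff N is an arbitrary choice, and finite partial sums can of course only suggest, not prove, convergence or divergence.

```python
import numpy as np

N = 100_000                           # arbitrary cutoff for the partial sums
n = np.arange(1, N + 1)

harmonic = 1.0 / n                    # alpha_n = 1/n (sample-average method)
constant = np.full(N, 0.1)            # alpha_n = 0.1 (constant step size)

sum_harmonic = harmonic.sum()             # keeps growing, roughly like ln(N)
sumsq_harmonic = (harmonic ** 2).sum()    # bounded, approaches pi^2 / 6
sum_constant = constant.sum()             # grows linearly with N
sumsq_constant = (constant ** 2).sum()    # also grows linearly, so it is unbounded

print(sum_harmonic, sumsq_harmonic)
print(sum_constant, sumsq_constant)
```

For 1/n the sum of squares stays below π²/6 however large N gets, whereas for the constant step size it equals 0.01·N and grows without bound.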

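The step sizes used by the sample-average method, α_n = 1/n, satisfy both conditions. The constant step size used by the exponentially weighted average does not satisfy the second one. This means the exponentially weighted average never converges completely and instead keeps varying in response to the most recent rewards. Fortunately, this is exactly the property we want when tracking a nonstationary problem (and the vast majority of practical reinforcement learning problems fall into this class!).

Moreover, step-size sequences that satisfy the conditions in Eq. (12) often converge very slowly or need careful tuning to reach a satisfactory convergence rate. Such sequences are common in theoretical analysis but are seldom used in practical applications.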
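The difference in tracking behavior shows up even on a noise-free toy signal. In the sketch below the true value jumps from 0 to 1 halfway through; the helper name `track` and the signal are assumptions made for illustration only.

```python
import numpy as np

def track(rewards, step_size):
    """Return all successive estimates; step_size(n) is the step used in the n-th update."""
    q = 0.0
    estimates = np.empty(len(rewards))
    for n, r in enumerate(rewards, start=1):
        q = q + step_size(n) * (r - q)
        estimates[n - 1] = q
    return estimates

# Toy signal: 500 rewards equal to 0 followed by 500 rewards equal to 1.
rewards = np.concatenate([np.zeros(500), np.ones(500)])

q_sample_avg = track(rewards, lambda n: 1.0 / n)   # sample-average method
q_const = track(rewards, lambda n: 0.1)            # constant step size alpha = 0.1

print(q_sample_avg[-1], q_const[-1])
```

The sample average ends at 0.5, the mean over the whole history, while the constant-step estimate has moved almost all the way to the new value 1.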

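2. Exercise 1

This is an exercise from the book (Sutton-RL-book, Section 2.5). [Note: Eq. (2.6) in the book corresponds to Eq. (10) in this article.] In short, it asks what weight each earlier reward receives when the step-size parameters α_n are not constant.

Solution:

In the general case the step size of the n-th update is α_n, and Eq. (10) generalizes to:

$$Q_{n+1} = \prod_{i=1}^{n} (1-\alpha_i)\, Q_1 + \sum_{i=1}^{n} \alpha_i \prod_{j=i+1}^{n} (1-\alpha_j)\, R_i$$

Here an empty product (the case i = n) equals 1. With a constant step size α_i = α this reduces to Eq. (10).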
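The general weighting (the weight of R_i is α_i times the product of (1 − α_j) over all later steps j > i, and the weight of Q1 is the product of all (1 − α_i)) can be checked with a few lines of NumPy. The function name `reward_weights` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def reward_weights(alphas):
    """Weights of Q1 and of R_1..R_n in Q_{n+1}, given step sizes alpha_1..alpha_n."""
    alphas = np.asarray(alphas, dtype=float)
    n = len(alphas)
    w_q1 = np.prod(1.0 - alphas)
    # np.prod of an empty slice is 1.0, which covers the newest reward (i = n).
    w_r = np.array([alphas[i] * np.prod(1.0 - alphas[i + 1:]) for i in range(n)])
    return w_q1, w_r

# Case 1: alpha_n = 1/n reproduces the sample average (equal weights, Q1 ignored).
w_q1_avg, w_r_avg = reward_weights(1.0 / np.arange(1, 6))

# Case 2: arbitrary step sizes, compared against the recursive update.
rng = np.random.default_rng(2)                 # arbitrary seed
alphas = rng.uniform(0.05, 0.9, size=20)       # made-up step sizes
rewards = rng.normal(size=20)                  # made-up rewards
q1 = 3.0
q = q1
for a_n, r in zip(alphas, rewards):
    q = q + a_n * (r - q)
w_q1, w_r = reward_weights(alphas)
q_from_weights = w_q1 * q1 + np.dot(w_r, rewards)
print(q, q_from_weights)
```

In both cases the weights, including the one on Q1, add up to 1, so the estimate remains a weighted average for any step-size sequence in [0, 1].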

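3. Exercise 2

This is an exercise from the book (Sutton-RL-book, Section 2.5).

3.1 Extending the k_armed_bandit_one_run() interface

The parameters QUpdtAlgo and alpha are added so that the function supports both Q-value estimation methods, the sample-average method and the exponential recency-weighted average. A further parameter, stationary, selects between evaluation in a stationary and in a non-stationary environment.

The code is as follows: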

import numpy as np

def k_armed_bandit_one_run(qstar,epsilon,nStep,QUpdtAlgo='sample_average',alpha=0, stationary=True):
    """
    One run of K-armed bandit simulation.
    Input:
        qstar:      Mean reward of each candidate action (overwritten if stationary is False)
        epsilon:    Epsilon value for epsilon-greedy algorithm
        nStep:      The number of steps for simulation
        QUpdtAlgo:  The algorithm for updating Q value--'sample_average','exp_decaying'
        alpha:      Step size in case of 'exp_decaying'
        stationary: True keeps qstar fixed; False lets qstar take a random walk
    Output (in return order):
        a[t]:        action series for each step in one run
        aNum[k]:     the number of times action[k] was selected
        r[t]:        reward series for each step in one run
        Q[k]:        final value estimate for action[k]
        optRatio[t]: cumulative ratio of optimal-action selections up to step t
    """
    
    K     = len(qstar)
    Q     = np.zeros(K)
    a     = np.zeros(nStep+1,dtype='int') # Item#0 for initialization
    aNum  = np.zeros(K,dtype='int')       # Record the number of action#k being selected
    
    r     = np.zeros(nStep+1)             # Item#0 for initialization

    if not stationary:
        qstar = np.ones(K)/K                 # qstar initialized to 1/K for all K actions
    
    optCnt   = 0
    optRatio = np.zeros(nStep+1,dtype='float') # Item#0 for initialization

    for t in range(1,nStep+1):

        #0. In a non-stationary environment optAct changes over time, hence it is evaluated inside the loop.
        optAct   = np.argmax(qstar)
        #1. action selection
        tmp = np.random.uniform(0,1)
        #print(tmp)
        if tmp < epsilon: # random selection
            a[t] = np.random.choice(np.arange(K))
            #print('random selection: a[{0}] = {1}'.format(t,a[t]))
        else:             # greedy selection
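            # Pick the action with the largest Q; when several actions tie for first place, pick one of them at random.
            # Applying a random permutation to Q before taking the argmax achieves this without detecting ties explicitly.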
            p = np.random.permutation(K)
            a[t] = p[np.argmax(Q[p])]
            #print('greedy selection: a[{0}] = {1}'.format(t,a[t]))

        aNum[a[t]] = aNum[a[t]] + 1

        #2. reward: draw from the pre-defined probability distribution    
        r[t] = np.random.randn() + qstar[a[t]]        

        #3.Update Q of the selected action - #2.4 Incremental Implementation
        # Q[a[t]] = (Q[a[t]]*(aNum[a[t]]-1) + r[t])/aNum[a[t]]    
        if QUpdtAlgo == 'sample_average':
            Q[a[t]] = Q[a[t]] + (r[t]-Q[a[t]])/aNum[a[t]]    
        elif QUpdtAlgo == 'exp_decaying':
            Q[a[t]] = Q[a[t]] + (r[t]-Q[a[t]])*alpha
        
        #4. Optimal Action Ratio tracking
        #print(a[t], optAct)
        if a[t] == optAct:
            optCnt = optCnt + 1
        optRatio[t] = optCnt/t

        #5. Random walk of qstar simulating non-stationary environment
        # Take independent random walks (say by adding a normally distributed increment with mean 0
        # and standard deviation 0.01 to all the q*(a) on each step).
        if not stationary:
            qstar = qstar + np.random.randn(K)*0.01 # Standard Deviation = 0.01
            #print('t={0}, qstar={1}, sum={2}'.format(t,qstar,np.sum(qstar)))
        
    return a,aNum,r,Q,optRatio

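The function has been updated in the following ways:

(1) Support for the 'exponential recency-weighted average' algorithm for action-value estimation has been added.

(2) In the non-stationary environment:

(2-1) qstar is first initialized internally to 1/K for each action;

(2-2) after every step a random increment is added to the qstar of each action, drawn from a normal distribution with zero mean and standard deviation 0.01;

(2-3) because qstar changes randomly at every step, optAct also has to be re-evaluated at every step.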
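To see why the optimal action must be re-evaluated at every step, the random walk of qstar can be simulated on its own. This is a stand-alone sketch under assumed settings (seed, number of steps), separate from the simulation function above.

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed, for reproducibility only
K, n_steps, sigma = 10, 10_000, 0.01

qstar = np.ones(K) / K               # all actions start with the same value
opt_history = np.empty(n_steps, dtype=int)
for t in range(n_steps):
    opt_history[t] = np.argmax(qstar)                 # optimal action at this step
    qstar = qstar + rng.normal(0.0, sigma, size=K)    # independent random walks

n_switches = np.count_nonzero(np.diff(opt_history))
print('the optimal action changed', n_switches, 'times in', n_steps, 'steps')
```

Note that at the very first step all values are equal, so np.argmax simply returns index 0; only after the first increment does a distinct optimal action exist.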

3.2 Comparison in stationary environment

First, the two Q-value estimation methods are compared in the stationary environment as before. The code is as follows:

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

nStep  = 1000
nRun   = 1000
K      = 10
alpha  = 0.1
r_smpaver = np.zeros((nRun,nStep+1))
optRatio_smpaver  = np.zeros((nRun,nStep+1))

r_exp = np.zeros((nRun,nStep+1))
optRatio_exp  = np.zeros((nRun,nStep+1))

for run in range(nRun):
    print('.',end='')
    if run%100==99:        
        print('run = ',run+1)
    
    qstar   = np.random.randn(K) 
    a_smpaver,aNum_smpaver,r_smpaver[run,:],Q,optRatio_smpaver[run,:] = k_armed_bandit_one_run(qstar,0.1,nStep)
    a_exp,aNum_exp,r_exp[run,:],Q,optRatio_exp[run,:] = k_armed_bandit_one_run(qstar,0.1,nStep,'exp_decaying',alpha)

rEnsembleMean_smpaver = np.mean(r_smpaver,axis=0)
optRatioEnsembleMean_smpaver = np.mean(optRatio_smpaver,axis=0)

rEnsembleMean_exp = np.mean(r_exp,axis=0)
optRatioEnsembleMean_exp = np.mean(optRatio_exp,axis=0)

fig,ax = plt.subplots(1,2,figsize=(15,4))
ax[0].plot(smooth(rEnsembleMean_smpaver,5))
ax[0].plot(smooth(rEnsembleMean_exp,5))
ax[1].plot(smooth(optRatioEnsembleMean_smpaver,5))
ax[1].plot(smooth(optRatioEnsembleMean_exp,5))
ax[0].legend(['sample average method','exponential decaying'])
ax[1].legend(['sample average method','exponential decaying'])
ax[0].set_title('ensemble mean reward')
ax[1].set_title('ensemble mean optimal ratio')

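The smooth() function is described in an earlier part of this series: Reinforcement Learning Notes: The Multi-Armed Bandit Problem (2) -- Python Simulation

The simulation results are as follows (figure: ensemble mean reward on the left, ensemble mean optimal ratio on the right):

The results show that the two algorithms differ little in overall average reward, but in terms of selecting the optimal action the sample-average method has a clear advantage.

So why is the difference in overall average reward so small?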
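For readers who do not have the earlier part at hand, a centered moving average is one plausible stand-in. The sketch below is an assumption: the name `moving_average_smooth` is hypothetical and the implementation is not necessarily the one used by the author's smooth().

```python
import numpy as np

def moving_average_smooth(x, window):
    """Hypothetical stand-in for smooth(): centered moving average with the same length as x."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    # mode='same' keeps the output length equal to len(x); values near the two ends
    # are attenuated because the signal is implicitly zero-padded there.
    return np.convolve(x, kernel, mode='same')

y = moving_average_smooth(np.ones(20), 5)
print(y[:4], y[8:12])
```

The interior of the smoothed constant signal stays at 1, while the first and last two samples are pulled toward zero by the implicit padding, which should be kept in mind when reading the edges of a smoothed curve.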

3.3 Comparison in non-stationary environment

Next, the two Q-value estimation methods are compared in the non-stationary environment. The code is as follows:

nStep  = 20000
nRun   = 1000
K      = 10
alpha  = 0.1
r_smpaver = np.zeros((nRun,nStep+1))
optRatio_smpaver  = np.zeros((nRun,nStep+1))

r_exp = np.zeros((nRun,nStep+1))
optRatio_exp  = np.zeros((nRun,nStep+1))

for run in range(nRun):
    print('.',end='')
    if run%100==99:        
        print('run = ',run+1)
    
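    # Note: with stationary=False the function re-initializes qstar to 1/K internally,
    # so the values drawn here are not actually used.
    qstar   = np.random.randn(K) 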
    a_smpaver,aNum_smpaver,r_smpaver[run,:],Q,optRatio_smpaver[run,:] = k_armed_bandit_one_run(qstar,0.1,nStep,'sample_average',alpha,False)
    a_exp,aNum_exp,r_exp[run,:],Q,optRatio_exp[run,:] = k_armed_bandit_one_run(qstar,0.1,nStep,'exp_decaying',alpha,False)

rEnsembleMean_smpaver = np.mean(r_smpaver,axis=0)
optRatioEnsembleMean_smpaver = np.mean(optRatio_smpaver,axis=0)

rEnsembleMean_exp = np.mean(r_exp,axis=0)
optRatioEnsembleMean_exp = np.mean(optRatio_exp,axis=0)

fig,ax = plt.subplots(1,2,figsize=(15,4))
ax[0].plot(smooth(rEnsembleMean_smpaver,5))
ax[0].plot(smooth(rEnsembleMean_exp,5))
ax[1].plot(smooth(optRatioEnsembleMean_smpaver,5))
ax[1].plot(smooth(optRatioEnsembleMean_exp,5))
ax[0].legend(['sample average method','exponential decaying'])
ax[1].legend(['sample average method','exponential decaying'])
ax[0].set_title('ensemble mean reward')
ax[1].set_title('ensemble mean optimal ratio')

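The simulation results are as follows (figure: ensemble mean reward on the left, ensemble mean optimal ratio on the right):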