[Reinforcement Learning] A summary of multi-armed bandit algorithms in the gym environment

danyow-4

Tags: algorithms, reinforcement learning, python

Article link: https://blog.csdn.net/dannnnnnnnnnnn/article/details/122772611

Problem description:

Implementation steps:

1. Setting up the environment

2. The epsilon-greedy algorithm

3. The softmax (Boltzmann) exploration algorithm

4. The upper confidence bound (UCB) algorithm

5. The Thompson sampling algorithm

References:

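Problem description:

The multi-armed bandit (MAB) problem is a classic problem in reinforcement learning. A bandit is a slot machine, a gambling game played in casinos: you pull an arm (lever) and receive a payout (reward) drawn from a probability distribution. A multi-armed bandit is a set of such machines, each with its own reward distribution that the player does not know in advance.

Our goal is to find out, over a sequence of pulls, which machine yields the highest reward, so that the cumulative reward is maximized.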
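To make the setting concrete without depending on any gym package, here is a minimal sketch of a Gaussian bandit. The class name `GaussianBandit`, its `pull` method, and the unit-variance reward noise are assumptions made for illustration; they are not part of gym or gym_bandits.

```python
import numpy as np

class GaussianBandit:
    """Hypothetical k-armed bandit: arm a pays a reward drawn from N(means[a], 1)."""

    def __init__(self, num_arms=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # true mean reward of every arm, hidden from the player
        self.means = self.rng.normal(0.0, 1.0, size=num_arms)

    def pull(self, arm):
        # noisy reward centered on the arm's true mean
        return self.rng.normal(self.means[arm], 1.0)

bandit = GaussianBandit()
best_arm = int(np.argmax(bandit.means))

# pulling the best arm many times gives an average close to its true mean
rewards = [bandit.pull(best_arm) for _ in range(5000)]
avg_reward = float(np.mean(rewards))
print(best_arm, avg_reward, bandit.means[best_arm])
```

A player who knew `best_arm` would simply pull it every round. The algorithms in this post have to discover it from noisy rewards while still collecting as much reward as possible.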

Implementation steps:

1. Setting up the environment

pip3 install gym_bandits

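import gym
import gym_bandits  # importing it registers the bandit environments with gym
import numpy as np

# ten-armed bandit whose rewards are drawn from Gaussian distributions
env = gym.make("BanditTenArmedGaussian-v0")
env.reset()

# number of arms
print(env.action_space.n)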

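2. The epsilon-greedy algorithm

In the epsilon-greedy policy, we either pick a random arm (with probability epsilon) or pick the arm that has performed best so far (with probability 1 - epsilon).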


'''initialize all variables'''
# number of rounds
num_rounds = 20000
# number of times each arm was pulled
count = np.zeros(10)
# sum of rewards of each arm
sum_rewards = np.zeros(10)

# Q value is the average reward of each arm
Q = np.zeros(10)

# define the epsilon_greedy function
def epsilon_greedy(epsilon):
    # explore: with probability epsilon pick a random arm
    if np.random.random() < epsilon:
        return env.action_space.sample()
    # exploit: otherwise pick the arm with the highest average reward
    return np.argmax(Q)

# start pulling arms
for i in range(num_rounds):
    # select the arm using epsilon-greedy
    arm = epsilon_greedy(0.5)
    # get the reward (classic gym API: step returns a 4-tuple)
    observation, reward, done, info = env.step(arm)

    # update the count of that arm
    count[arm] += 1
    # sum the reward
    sum_rewards[arm] += reward

    # calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]
print('the optimal arm is {}'.format(np.argmax(Q)))
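The loop above can also be tried without gym. The following is a minimal sketch under stated assumptions: the arm means in `true_means` are made up, rewards have unit-variance Gaussian noise, and `run_epsilon_greedy` is a hypothetical helper name. It also replaces the `sum_rewards / count` update with the equivalent incremental mean, which avoids storing the reward sums.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=10)   # hypothetical hidden arm means

def run_epsilon_greedy(epsilon, num_rounds=5000):
    counts = np.zeros(10)
    q = np.zeros(10)
    for _ in range(num_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(10))      # explore: random arm
        else:
            arm = int(np.argmax(q))          # exploit: best arm so far
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        # incremental mean: same value as sum_rewards[arm] / counts[arm]
        q[arm] += (reward - q[arm]) / counts[arm]
    return q, counts

q_est, pulls = run_epsilon_greedy(epsilon=0.1)
print(np.argmax(q_est), np.argmax(true_means))
```

A smaller epsilon spends fewer rounds on random arms, at the cost of slower estimates for the arms that are rarely tried.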

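3. The softmax (Boltzmann) exploration algorithm

In softmax exploration, we select an arm at random according to its Boltzmann probability, so arms with a higher average reward are chosen more often. The temperature tau controls how strongly the best arms are preferred.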
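Written out, the Boltzmann probability of choosing arm a among k arms takes the standard softmax form, with Q(a) the current average reward of arm a and tau the temperature:

```latex
P(a) = \frac{\exp\left(Q(a)/\tau\right)}{\sum_{b=1}^{k} \exp\left(Q(b)/\tau\right)}
```

As tau approaches 0 the choice becomes nearly greedy, and as tau grows the probabilities approach a uniform distribution.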

import math
import random

''' in softmax exploration, we select an arm based on a probability from
the Boltzmann distribution'''

# reset the statistics so the estimates start fresh for this algorithm
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# define the softmax function; tau is the temperature
def softmax(tau):
    # Boltzmann probability of each arm
    total = sum(math.exp(val / tau) for val in Q)
    probs = [math.exp(val / tau) / total for val in Q]
    # sample an arm by walking the cumulative distribution
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if cumulative_prob > threshold:
            return i
    # fallback in case rounding keeps the cumulative sum below the threshold
    return np.argmax(probs)

# start pulling arms
for i in range(num_rounds):
    # select the arm using softmax exploration
    arm = softmax(0.5)

    # get the reward
    observation, reward, done, info = env.step(arm)

    # update the count of that arm
    count[arm] += 1

    # sum the rewards
    sum_rewards[arm] += reward

    # calculate the Q value
    Q[arm] = sum_rewards[arm] / count[arm]

print("the optimal arm is {}".format(np.argmax(Q)))
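One practical detail: `math.exp(val / tau)` can overflow when the Q values are large or tau is small. A common workaround, sketched here with a hypothetical helper `boltzmann_probs`, is to subtract the maximum Q value before exponentiating, which leaves the probabilities unchanged. The sketch also draws the arm with NumPy's `Generator.choice` instead of walking the cumulative sum by hand.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    q = np.asarray(q_values, dtype=float)
    # subtracting the max does not change the result but avoids overflow
    weights = np.exp((q - q.max()) / tau)
    return weights / weights.sum()

rng = np.random.default_rng(0)
q_values = [1.0, 2.0, 3.0]

probs_cold = boltzmann_probs(q_values, tau=0.1)    # low temperature: nearly greedy
probs_hot = boltzmann_probs(q_values, tau=100.0)   # high temperature: nearly uniform

# sample one arm according to the Boltzmann probabilities
arm = int(rng.choice(len(q_values), p=probs_cold))
print(probs_cold, probs_hot, arm)
```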

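4. The upper confidence bound (UCB) algorithm

This algorithm pays attention to arms that perform poorly in the early rounds but might turn out to be good later: an arm that has been pulled only a few times has an uncertain estimate, so it receives a larger exploration bonus. This principle is known as optimism in the face of uncertainty.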
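The selection rule used in this section corresponds to what is usually called UCB1. With Q(a) the average reward of arm a, N(a) the number of times it was pulled, and t the total number of pulls so far, the chosen arm is:

```latex
a_t = \arg\max_{a} \left[ Q(a) + \sqrt{\frac{2 \ln t}{N(a)}} \right]
```

The bonus term shrinks as N(a) grows, so frequently pulled arms are judged mostly by their average reward, while rarely pulled arms keep getting another chance.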

'''
1. Select the action (arm) with the highest sum of average reward and
exploration bonus (its upper confidence bound)
2. Pull the arm and receive a reward
3. Update the arm's average reward and confidence bound

'''
# reset the statistics so the estimates start fresh for this algorithm
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# define the upper confidence bound function
def UCB(iters):
    ucb = np.zeros(10)
    # pull every arm once first, so that count[arm] is never zero below
    if iters < 10:
        return iters
    else:
        for arm in range(10):
            # calculate the exploration bonus
            upper_bound = math.sqrt((2 * math.log(np.sum(count))) / count[arm])

            # add the bonus to the Q value
            ucb[arm] = Q[arm] + upper_bound
        # return the arm which has the maximum value
        return np.argmax(ucb)

# start pulling arms
for i in range(num_rounds):
    # select the arm using UCB
    arm = UCB(i)

    # get the reward
    observation, reward, done, info = env.step(arm)

    # update the count
    count[arm] += 1

    # sum the rewards
    sum_rewards[arm] += reward

    # calculate the Q value
    Q[arm] = sum_rewards[arm] / count[arm]

print("the optimal arm is {}".format(np.argmax(Q)))
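The same idea can be written in vectorized form and run without gym. This is a sketch under assumptions: `true_means` holds made-up arm means, rewards have unit-variance Gaussian noise, and `ucb_select` is a hypothetical helper name.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=10)   # hypothetical hidden arm means

def ucb_select(q, counts, t):
    # pull every arm once before trusting the bonus term
    if t < len(q):
        return t
    # exploration bonus for all arms at once; t equals counts.sum() here
    bonus = np.sqrt(2.0 * math.log(t) / counts)
    return int(np.argmax(q + bonus))

counts = np.zeros(10)
q = np.zeros(10)
for t in range(5000):
    arm = ucb_select(q, counts, t)
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]   # incremental average

most_pulled = int(np.argmax(counts))
print(most_pulled, counts)
```

Unlike epsilon-greedy, there is no random exploration step here: the only randomness comes from the rewards, and exploration is driven entirely by the bonus term.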

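5. The Thompson sampling algorithm

Thompson sampling is a probabilistic algorithm based on a prior distribution: it keeps a distribution over how good each arm is, samples from these distributions to choose an arm, and updates them with the observed rewards.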

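'''
1. Sample a value from each of the k Beta distributions and use it as the
estimated mean reward of that arm.
2. Select the arm with the highest sampled value and observe the reward.
3. Use the observed reward to update the distribution of that arm.

'''
# reset the statistics so the estimates start fresh for this algorithm
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# initialize alpha and beta; Beta(1, 1) is a uniform prior
alpha = np.ones(10)
beta = np.ones(10)

# define the thompson_sampling function
def thompson_sampling(alpha, beta):
    # alpha and beta already include the prior, so no extra +1 is needed
    samples = [np.random.beta(alpha[i], beta[i]) for i in range(10)]

    return np.argmax(samples)


# start pulling arms
for i in range(num_rounds):
    arm = thompson_sampling(alpha, beta)

    observation, reward, done, info = env.step(arm)

    count[arm] += 1

    sum_rewards[arm] += reward

    Q[arm] = sum_rewards[arm] / count[arm]

    # the Beta distribution models win/lose outcomes, so the Gaussian
    # reward is treated as a win when it is positive
    if reward > 0:
        alpha[arm] += 1
    else:
        beta[arm] += 1
print('the optimal arm is {}'.format(np.argmax(Q)))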
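The Beta update fits most naturally when rewards are 0 or 1, whereas the code in this section thresholds a Gaussian reward at zero. As an illustration of that more natural case, here is a minimal sketch on a Bernoulli bandit. The win probabilities in `win_probs` are invented for the example, and the sketch does not use gym.

```python
import numpy as np

rng = np.random.default_rng(0)
win_probs = np.array([0.2, 0.5, 0.8])   # hypothetical hidden win probabilities

alpha = np.ones(3)   # 1 + number of wins per arm (uniform Beta(1, 1) prior)
beta = np.ones(3)    # 1 + number of losses per arm

for _ in range(3000):
    # one draw from each arm's current Beta distribution
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    win = rng.random() < win_probs[arm]
    # update the distribution of the arm that was pulled
    if win:
        alpha[arm] += 1
    else:
        beta[arm] += 1

posterior_mean = alpha / (alpha + beta)
pulls = alpha + beta - 2
print(posterior_mean, pulls)
```

Arms whose distributions are still wide occasionally produce a high sample and get pulled, which is how exploration happens here; once an arm's distribution concentrates on a low value, it is rarely selected again.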