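This algorithm falls under the "stationary K-armed bandit" problem introduced in Chapter 1 of the book 《强化学习》 (Reinforcement Learning). The program is based on the code in Chapter 1 of 《强化学习与深度强化学习》 (Reinforcement Learning and Deep Reinforcement Learning) by the Japanese author 小高知宏 (Tomohiro Odaka). The problem is as follows: given an opponent who plays scissors, stone, and cloth in a 2:2:1 ratio, write an algorithm that learns the best strategy against that opponent within a finite number of iterations.
The program itself is simple, but it offers a concrete code example and a training workflow for writing reinforcement learning programs.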
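1. Algorithm idea
At initialization the three hands get equal weights, 1.0 : 1.0 : 1.0, and a learning rate LEARNINGRATE is defined. After every round, the weight of the hand just played is increased by LEARNINGRATE * gain * (current weight of that hand), where gain is +1 for a win, -1 for a loss, and 0 for a draw. The gain is looked up in the payoff matrix using the two hands (0, 1, 2) as indices. Weights are clamped to the range [0, 3]. After enough iterations the weights converge, the best strategy has been learned, and the algorithm ends.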
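The update rule described above can be sketched as a single function. This is a minimal sketch, not the code from the book: the name `update_rate` and the constant `MAXRATE` are introduced here for illustration, and the upper bound of 3.0 is taken from the clamp in battle.c.

```c
#include <assert.h>

#define LEARNINGRATE 0.001
#define MAXRATE 3.0 /* assumed upper bound, matching the clamp in battle.c */

/* One learning step: scale the weight of the hand just played by
   (1 + gain * LEARNINGRATE), then clamp it to [0, MAXRATE].
   gain is +1 (win), -1 (loss) or 0 (draw). */
void update_rate(double rate[3], int myhand, int gain)
{
    assert(myhand >= 0 && myhand < 3);
    rate[myhand] += gain * LEARNINGRATE * rate[myhand];
    if (rate[myhand] < 0.0) rate[myhand] = 0.0;
    if (rate[myhand] > MAXRATE) rate[myhand] = MAXRATE;
}
```

Because each change is proportional to the current weight, a weight that only ever wins grows as 1.001^n, so climbing from 1.0 to the cap of 3.0 takes roughly 1,100 winning updates of that hand.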
2. Implementation
2.1 battle.c
// battle.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SCISSOR 0 // 0; 1; 2 here is set for the array rate's index
#define STONE 1
#define CLOTH 2
#define WIN 1
#define LOSE -1
#define STANDOFF 0 /* the round ends in a draw */
#define LEARNINGRATE 0.001
// decide among scissor / stone / cloth
int hand(double* rate);
// generate decimal fraction in [0, 1]
double frand(void);
// Clamp the rate of each gesture to [0, 3]
void correct(double* rate);
int payOffMatrix[3][3] = {
{ 0, -1, 1},
{ 1, 0, -1},
{-1, 1, 0}
};
/*
payOffMatrix[i][j] :
In one battle, my value = i
rival's value = j
i, j range in {0, 1, 2}
0: scissor
1: stone
2: cloth
*/
int main(int argc, char ** argv) {
srand(time(NULL));
int count = 0; // store the battle times
int myhand = -1, ohand = -1;
double rate[] = {1.0, 1.0, 1.0}; // Initialized the rate
int gain = 0; // In each battle, gain should be set to WIN / LOSE / STANDOFF
while (scanf("%d", &ohand) == 1) // Stop at EOF or on input that is not an integer
{
if ((ohand < 0) || (ohand > 2)) // Illegal data
{
continue;
}
myhand = hand(rate); // Pick my hand at random, weighted by the current rates
gain = payOffMatrix[myhand][ohand];
printf("myhand = %d ohand = %d gain = %d \n", myhand, ohand, gain); // Output the result of this round
rate[myhand] += gain * LEARNINGRATE * rate[myhand]; // Reinforce or weaken the hand just played
correct(rate); // Clamp the rates to [0, 3]
printf("scissor rate = %lf, stone rate = %lf, cloth rate = %lf \n", rate[SCISSOR], rate[STONE], rate[CLOTH]); // Print the latest rates
}
return 0;
}
int hand(double rate[])
{
double scissor = rate[SCISSOR] * frand();
double stone = rate[STONE] * frand();
double cloth = rate[CLOTH] * frand();
return scissor > stone ? (scissor > cloth ? 0 : 2) : (stone > cloth ? 1 : 2); // Return the index of the largest weighted draw
}
void correct(double* rate)
{
int i = 0;
for ( ; i < 3; rate++, ++i) // visit all three rates
{
(*rate) = (*rate < 0.0) ? 0.0 : (*rate);
(*rate) = (*rate > 3.0) ? 3.0 : (*rate);
}
}
double frand(void)
{
return (double)rand() / RAND_MAX;
}
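battle.c simulates the matches between the learner and the opponent. It reads the opponent's hands from standard input, one per round.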
2.2 randHandGenrator.c
// randHandGenrator.c
// This file generates the rival's hands
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SCISSOR 0 // 0; 1; 2 here is set for the array rate's index
#define STONE 1
#define CLOTH 2
#define LEARNINGRATE 0.001
// decide among scissor / stone / cloth
int hand(double* rate);
// generate decimal fraction in [0, 1]
double frand(void);
double max(double a, double b)
{
return a > b ? a : b;
}
int main(int argc, char ** argv) {
srand(time(NULL));
int battleTimes = 0; // store the battle times
int ohand = -1, legalHandCounter = 0;
int gestureCounter[3] = { 0,0,0 };
if (argc < 5)
{
printf("Not enough parameters\n");
printf("Usage: %s BattleTimes initialScissorRate initialStoneRate initialClothRate \n", argv[0]);
printf("Usage example: %s 1000 2 2 1 \n", argv[0]);
exit(1);
}
battleTimes = atoi(argv[1]);
double initialScissorRate = atof(argv[2]);
double initialStoneRate = atof(argv[3]);
double initialClothRate = atof(argv[4]);
double rate[] = { initialScissorRate, initialStoneRate, initialClothRate }; // Initialized the rate
int gain = 0; // In each battle, gain should be set to WIN / LOSE / STANDOFF
while(1)
{
ohand = hand(rate);
if ((ohand < 0) || (ohand > 2)) // Illegal data
{
continue;
}
legalHandCounter++;
if (legalHandCounter > battleTimes)
{
break;
}
printf("%d\n", ohand);
gestureCounter[ohand]++;
}
// printf("OHand Times : scissor = %d, stone = %d, cloth = %d \n", gestureCounter[0], gestureCounter[1], gestureCounter[2]);
return 0;
}
int hand(double rate[])
{
double scissor = rate[SCISSOR] * frand();
double stone = rate[STONE] * frand();
double cloth = rate[CLOTH] * frand();
double maxValue = max(scissor, max(stone, cloth));
if (scissor == maxValue) return 0;
if (stone == maxValue) return 1;
return 2;
// return scissor > stone ? (scissor > cloth ? 0 : 2) : (stone > cloth ? 1 : 2); // Return the index of the max rate gesture
}
double frand(void)
{
return (double)rand() / RAND_MAX;
}
3. Training process and notes
3.1 Some hints
// README.txt
Hint 1:
Type the following command in bash to generate hand data for 1000 rounds with rate scissor : stone : cloth == 2 : 2 : 1,
and then store the data in handData.txt.
bash> ./handGen 1000 2 2 1 > handData.txt
Hint 2:
Type the following command in the bash to load data from handData.txt to the executable program battle and show the details
to the stdout.
bash>./battle < handData.txt
Hint 3:
Type the following command in the bash to load data from handData.txt to the executable program battle and store the details
in the file battleResult.txt
bash>./battle < handData.txt > battleResult.txt
Hint 4:
Type the following command in the bash to show the result from the battleResult.txt to the stdout.
bash>cat battleResult.txt
Hint 5:
The generated proportions do not match the rates passed in.
For example, set 2 : 2 : 1 and you get roughly 5 : 5 : 1. This comes from hand() picking the largest rate[i] * frand(), not from rand() being pseudo-random.
The program trains stably on Ubuntu 16.04.
I have provided three files. Copy them to your machine and open them under Linux.
3.2 Steps
3.2.1 Compile battle.c
3.2.2 Compile randHandGenerator.c
3.2.3 Generate the opponent's data
3.2.4 Feed the generated data to battle
3.2.5 View the results
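The learning rate is set so low that learning is slow: after 2000 rounds the weights have still not converged. One remedy is to train for more rounds.
After 20,000 rounds of training, the resulting weights are the learned strategy.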
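As a cross-check on what the learner should end up preferring, the expected gain of each hand can be computed directly from the payoff matrix. The sketch below assumes the rival really plays scissor, stone, and cloth with probabilities 0.4, 0.4, and 0.2 (the intended 2:2:1). The function names `expected_gain` and `best_response` are illustrative and do not appear in the original program.

```c
#include <assert.h>

/* Same payoff table as battle.c: payoff[mine][rival],
   0 = scissor, 1 = stone, 2 = cloth. */
static const int payoff[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};

/* Expected gain of playing `mine` against a rival whose hands
   follow the probabilities in opp[] (which should sum to 1). */
double expected_gain(int mine, const double opp[3])
{
    assert(mine >= 0 && mine < 3);
    double sum = 0.0;
    for (int j = 0; j < 3; ++j)
        sum += payoff[mine][j] * opp[j];
    return sum;
}

/* Index of the hand with the highest expected gain. */
int best_response(const double opp[3])
{
    int best = 0;
    for (int i = 1; i < 3; ++i)
        if (expected_gain(i, opp) > expected_gain(best, opp))
            best = i;
    return best;
}
```

Under that assumption the expected gains are -0.2 for scissor, +0.2 for stone, and 0 for cloth, so one would expect training to push the stone weight up and the scissor weight down.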
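4. Problems encountered during training
One problem came up during training, also noted in README.txt above: although the rates are set to 2:2:1, my own count of the generated hands comes out at roughly 5:5:1. The cause is not that C's rand() is only pseudo-random but the way hand() chooses: it returns the hand with the largest rate[i] * frand(), and that does not pick hands in proportion to rate. With rates 2:2:1, cloth is chosen only when its draw u exceeds both of the other two products, which happens with probability ∫₀¹ (u/2)² du = 1/12. Scissor and stone split the rest equally, giving 11:11:2, or about 5.5:5.5:1. This still needs to be fixed.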
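If the intent is for the rates to act as true proportions, one common alternative to taking the largest of rate[i] * frand() is a cumulative ("roulette wheel") draw. The sketch below is one possible replacement, not the author's code: `hand_proportional` is a hypothetical name, and the uniform number `u` is passed in as a parameter so the function can be checked with fixed values.

```c
#include <assert.h>

/* Pick an index with probability rate[i] / (rate[0] + rate[1] + rate[2]).
   u must be a uniform random number in [0, 1), for example
   rand() / (RAND_MAX + 1.0). */
int hand_proportional(const double rate[3], double u)
{
    assert(u >= 0.0 && u < 1.0);
    double total = rate[0] + rate[1] + rate[2];
    double threshold = u * total; /* a point on the "wheel" [0, total) */
    double cumulative = 0.0;
    for (int i = 0; i < 3; ++i) {
        cumulative += rate[i];
        if (threshold < cumulative)
            return i;
    }
    return 2; /* guard against rounding when u is very close to 1 */
}
```

With rates 2, 2, 1 the wheel is split at 0.4 and 0.8, so values of `u` below 0.4 give scissor, values from 0.4 up to 0.8 give stone, and the remaining 20% give cloth.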