[Reinforcement Learning] Learning a Strategy from Rock-Paper-Scissors: a C Implementation

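This algorithm belongs to the class of "K-armed bandits in an unchanging environment" introduced in Chapter 1 of the book Reinforcement Learning (《强化学习》). The program is based on the code in Chapter 1 of 《强化学习与深度强化学习》 (Reinforcement Learning and Deep Reinforcement Learning) by [Japan] 小高知宏 (Tomohiro Odaka). The problem is as follows: given an opponent who plays scissor, stone (rock), and cloth (paper) in the ratio 2:2:1, write an algorithm that learns the best strategy against it within a finite number of iterations.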

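The program itself is simple, but it offers a concrete code example and a training workflow for writing reinforcement learning programs.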

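Contents

1. Algorithm idea

2. Implementation

2.1 battle.c

2.2 randHandGenerator.c

3. Training process and notes

3.1 Some hints

3.2 Usage steps

3.2.1 Compile battle.c

3.2.2 Compile randHandGenerator.c

3.2.3 Generate the opponent's data

3.2.4 Feed the generated data to battle

3.2.5 View the results

4. Problems encountered during training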

1. Algorithm idea

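At initialization the three gestures have equal weights, 1.0 : 1.0 : 1.0, and a learning rate LEARNINGRATE is chosen. After every round, the weight of the gesture just played is increased by LEARNINGRATE * gain * (its current weight), where gain is +1, -1, or 0 (win, loss, or draw) and is looked up in the payoff matrix using the two hands (0, 1, 2) as indices. A win therefore raises the weight, a loss lowers it, and a draw leaves it unchanged. After enough iterations the weights converge, the best strategy has been learned, and the algorithm ends.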
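The update described above can be sketched as a single function. This is a minimal illustration rather than the author's exact code: the helper name `update_rate` is hypothetical, while the payoff matrix and the LEARNINGRATE value are taken from battle.c.

```c
#define LEARNINGRATE 0.001

/* Payoff from my point of view: payoff[my hand][rival's hand].
   0 = scissor, 1 = stone, 2 = cloth; +1 win, -1 loss, 0 draw. */
static const int payoff[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};

/* Hypothetical helper: apply the result of one round to the weight array
   and return the gain that was used. Only rate[myhand] changes. */
int update_rate(double rate[3], int myhand, int ohand)
{
    int gain = payoff[myhand][ohand];
    rate[myhand] += gain * LEARNINGRATE * rate[myhand];
    return gain;
}
```

For example, starting from {1.0, 1.0, 1.0}, playing stone (1) against scissor (0) is a win, so the stone weight becomes 1.0 + 0.001 * 1.0 = 1.001 while the other two weights stay at 1.0.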

2. Implementation

2.1 battle.c

// battle.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SCISSOR 0   // gesture codes; also used as indices into the rate array
#define STONE 1
#define CLOTH 2

#define WIN 1
#define LOSE -1
#define STANDOFF 0  /* the round ends in a draw */

#define LEARNINGRATE 0.001

// decide among scissor / stone / cloth
int hand(double* rate);

// generate a random fraction in [0, 1]
double frand(void);

// clamp the rate of each gesture to [0, 3]
void correct(double* rate);


int payOffMatrix[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};


/*

payOffMatrix[i][j] : 
In one battle, my value = i
rival's  value = j
i, j range in {0, 1, 2}

0: scissor
1: stone
2: cloth

*/

int main(void) {
    srand((unsigned)time(NULL));
    int myhand = -1, ohand = -1;
    double rate[] = {1.0, 1.0, 1.0};  // initial weight of each gesture
    int gain = 0;  // result of one round: WIN / LOSE / STANDOFF

    // Read one rival hand per round; stop at EOF or on non-numeric input
    while (scanf("%d", &ohand) == 1)
    {
        if ((ohand < 0) || (ohand > 2))   // skip illegal data
        {
            continue;
        }
        myhand = hand(rate);  // choose a gesture from the current weights
        gain = payOffMatrix[myhand][ohand];
        printf("myhand = %d ohand = %d gain = %d \n", myhand, ohand, gain);  // result of this round
        rate[myhand] += gain * LEARNINGRATE * rate[myhand];  // reinforce or weaken the gesture just played
        correct(rate);  // clamp every weight to [0, 3]
        printf("scissor rate = %f, stone rate = %f, cloth rate = %f \n", rate[SCISSOR], rate[STONE], rate[CLOTH]);  // latest weights
    }

    return 0;
}

int hand(double rate[])
{
    double scissor = rate[SCISSOR] * frand();
    double stone = rate[STONE] * frand();
    double cloth = rate[CLOTH] * frand();
    // Return the index of the largest scaled draw; all three values are compared
    return scissor > stone ? (scissor > cloth ? 0 : 2) : (stone > cloth ? 1 : 2);
}

void correct(double* rate)
{
    int i = 0;
    for ( ; i < 3; rate++, ++i)  // visit all three gestures, cloth included
    {
        (*rate) = (*rate < 0.0) ? 0.0 : (*rate);
        (*rate) = (*rate > 3.0) ? 3.0 : (*rate);
    }
}

double frand(void) 
{
    return (double)rand() / RAND_MAX;

}

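battle.c is the program that simulates the matches between us and the opponent. It reads the opponent's hands from standard input, one integer per round.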

2.2 randHandGenerator.c

// randHandGenerator.c
// This file generates the rival's hands
#include <stdio.h>
#include <stdlib.h>
#include <time.h>


#define SCISSOR 0   // gesture codes; also used as indices into the rate array
#define STONE 1
#define CLOTH 2

// decide among scissor / stone / cloth
int hand(double* rate);

// generate decimal fraction in [0, 1] 
double frand(void);

double max(double a, double b) 
{
	return a > b ? a : b;
}

int main(int argc, char ** argv) {
    srand((unsigned)time(NULL));
    int battleTimes = 0;  // number of hands to generate
    int ohand = -1, i = 0;
    int gestureCounter[3] = { 0, 0, 0 };  // tally of the generated gestures
    if (argc < 5)
    {
        fprintf(stderr, "Not enough parameters\n");
        fprintf(stderr, "Usage: %s BattleTimes initialScissorRate initialStoneRate initialClothRate \n", argv[0]);
        fprintf(stderr, "Usage example: %s 1000 2 2 1 \n", argv[0]);
        exit(1);
    }
    battleTimes = atoi(argv[1]);
    double rate[] = { atof(argv[2]), atof(argv[3]), atof(argv[4]) };  // the rival's gesture weights

    for (i = 0; i < battleTimes; ++i)
    {
        ohand = hand(rate);  // always one of 0, 1, 2
        printf("%d\n", ohand);
        gestureCounter[ohand]++;
    }
    // printf("OHand Times : scissor = %d, stone = %d, cloth = %d \n", gestureCounter[0], gestureCounter[1], gestureCounter[2]);
    return 0;
}

int hand(double rate[]) 
{
    double scissor = rate[SCISSOR] * frand();
    double stone = rate[STONE] * frand();
    double cloth = rate[CLOTH] * frand();
	double maxValue = max(scissor, max(stone, cloth));
	if (scissor == maxValue) return 0;
	if (stone == maxValue) return 1;
	return 2;
	//  return scissor > stone ? (scissor > cloth  ? 0 : 2) : (stone > cloth ? 1 : 2);  // Return the index of the max rate gesture
}

double frand(void) 
{
    return (double)rand() / RAND_MAX;

}

3. Training process and notes

3.1 Some hints

// README.txt
Hint 1:
Type the following command in bash to generate the hand data for 1000 rounds with the ratio scissor : stone : cloth == 2 : 2 : 1,
and store the data in handData.txt.

bash> ./handGen 1000 2 2 1 > handData.txt

Hint 2:
Type the following command in the bash to load data from handData.txt to the executable program battle and show the details 
to the stdout.

bash>./battle < handData.txt

Hint 3:
Type the following command in the bash to load data from handData.txt to the executable program battle and store the details
in the file battleResult.txt

bash>./battle < handData.txt > battleResult.txt 

Hint 4:
Type the following command in the bash to show the result from the battleResult.txt to the stdout.

bash>cat battleResult.txt

Hint 5:
The generated hands do not follow the requested ratio exactly.
hand() picks the largest of rate[i] * frand(), which is not proportional sampling: with 2 : 2 : 1 set, the expected proportions are 11 : 11 : 2, so a result near 5 : 5 : 1 is normal.
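To check which proportions the generator actually produced, the hands in handData.txt can be tallied. The following is a minimal sketch; the function name `count_hands` is an assumption of this example and not part of the original programs. It reads integers from a stream in the same way battle does and skips values outside 0..2.

```c
#include <stdio.h>

/* Hypothetical helper: tally the gestures (0 = scissor, 1 = stone, 2 = cloth)
   read from `in` into counts[3]. Values outside 0..2 are skipped.
   Returns the number of legal hands read. */
int count_hands(FILE *in, int counts[3])
{
    int hand = -1, total = 0;
    counts[0] = counts[1] = counts[2] = 0;
    while (fscanf(in, "%d", &hand) == 1)
    {
        if (hand < 0 || hand > 2)
        {
            continue;
        }
        counts[hand]++;
        total++;
    }
    return total;
}
```

Called with `stdin` from a small `main`, it can be fed handData.txt through input redirection, just like battle.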

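The program trains stably on Ubuntu 16.04.

I have provided three files (battle.c, randHandGenerator.c, and README.txt). After copying them to your machine, they can be opened on Linux.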

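3.2 Usage steps

3.2.1 Compile battle.c

For example, with gcc (the output name battle matches the README hints):

bash> gcc battle.c -o battle

3.2.2 Compile randHandGenerator.c

For example, with gcc (the output name handGen matches the README hints):

bash> gcc randHandGenerator.c -o handGen

3.2.3 Generate the opponent's data

bash> ./handGen 1000 2 2 1 > handData.txt

3.2.4 Feed the generated data to battle

bash> ./battle < handData.txt > battleResult.txt

3.2.5 View the results

bash> cat battleResult.txt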

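The learning rate is set too low, so learning is slow: after 2000 rounds the weights have still not converged. A simple remedy is to increase the number of training rounds.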
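One way to explore the trade-off between learning rate and number of rounds without regenerating data files is to run both sides in a single process. The sketch below is an illustration under stated assumptions, not the original programs: `train` and `pick` are hypothetical names, the learning rate is passed as a parameter instead of using the LEARNINGRATE macro, and the rival's hand is drawn with the same largest-scaled-random-number rule that the generator uses.

```c
#include <stdlib.h>

/* payoff[my hand][rival's hand]: +1 win, -1 loss, 0 draw */
static const int payoff[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};

static double frand(void)
{
    return (double)rand() / RAND_MAX;
}

/* Same selection rule as the generator: scale one random number per
   gesture by its weight and return the index of the largest. */
static int pick(const double rate[3])
{
    double best = -1.0;
    int i, besti = 0;
    for (i = 0; i < 3; ++i)
    {
        double v = rate[i] * frand();
        if (v > best)
        {
            best = v;
            besti = i;
        }
    }
    return besti;
}

/* Hypothetical helper: play `rounds` rounds against a rival with fixed
   weights, updating my weights in place and clamping them to [0, 3]. */
void train(double mine[3], const double rival[3], int rounds, double lr)
{
    int r, i;
    for (r = 0; r < rounds; ++r)
    {
        int myhand = pick(mine);
        int ohand = pick(rival);
        mine[myhand] += payoff[myhand][ohand] * lr * mine[myhand];
        for (i = 0; i < 3; ++i)
        {
            if (mine[i] < 0.0) mine[i] = 0.0;
            if (mine[i] > 3.0) mine[i] = 3.0;
        }
    }
}
```

Calling `train` with 2000 and then 20000 rounds on weights starting at {1.0, 1.0, 1.0}, with rival weights {2.0, 2.0, 1.0} and a learning rate of 0.001, lets the two settings discussed here be compared directly. The exact numbers depend on the seed and on the platform's rand().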

After 20,000 rounds of training:

This is the learned strategy.
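As a sanity check on which gesture the learner should end up favoring, the expected gain of each gesture against a fixed opponent can be computed directly from the payoff matrix. This is a sketch with a hypothetical helper, `expected_gain`; the opponent's gesture probabilities are supplied by the caller.

```c
/* payoff[my hand][rival's hand]: +1 win, -1 loss, 0 draw.
   0 = scissor, 1 = stone, 2 = cloth. */
static const int payoff[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};

/* Expected gain of playing `myhand` every round against a rival whose
   gesture probabilities are p[0..2] (they should sum to 1). */
double expected_gain(int myhand, const double p[3])
{
    double g = 0.0;
    int j;
    for (j = 0; j < 3; ++j)
    {
        g += payoff[myhand][j] * p[j];
    }
    return g;
}
```

Against an exact 2:2:1 rival, p = {0.4, 0.4, 0.2}, this gives -0.2 for scissor, +0.2 for stone, and 0 for cloth, so the weights would be expected to drift toward playing stone.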

4. Problems encountered during training

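One problem came up during training, also noted in README.txt above: although the generator was given the ratio 2:2:1, counting its output gives a ratio of roughly 5:5:1. The cause is the selection rule in hand() rather than the quality of C's pseudo-random rand(): choosing the largest of rate[i] * frand() does not select gesture i with probability proportional to rate[i]. With weights 2, 2, 1, cloth is chosen only when its draw z exceeds both of the other two draws, which happens with probability equal to the integral of (z/2)^2 over z in [0, 1], that is 1/12. Scissor and stone share the rest equally, giving 11 : 11 : 2, or about 5.5 : 5.5 : 1. Getting an exact 2:2:1 therefore requires changing the generator's sampling method, which is left for later work.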
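Because hand() takes the largest of rate[i] * frand(), gesture i is not drawn with probability rate[i] / (sum of rates). If hands in the exact requested ratio are wanted, one common alternative (not what the original code does) is to sample from the cumulative weights. A sketch follows; `weighted_hand` is a hypothetical name, and the uniform number u in [0, 1) is passed in by the caller so that the function is easy to test.

```c
/* Hypothetical replacement for hand(): choose gesture i with probability
   rate[i] / (rate[0] + rate[1] + rate[2]).
   `u` must be a uniform random number in [0, 1). */
int weighted_hand(const double rate[3], double u)
{
    double total = rate[0] + rate[1] + rate[2];
    double threshold = u * total;
    if (threshold < rate[0])
    {
        return 0;  /* scissor */
    }
    if (threshold < rate[0] + rate[1])
    {
        return 1;  /* stone */
    }
    return 2;      /* cloth */
}
```

In the generator it could be called as `weighted_hand(rate, rand() / (RAND_MAX + 1.0))`, which keeps u strictly below 1.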

 
