[Reinforcement Learning] Learning a Strategy from Rock-Paper-Scissors: a C Implementation

Latest recommended article published on 2023-12-25 19:59:32

kikook

Category: Reinforcement Learning

Article link: https://blog.csdn.net/chenhanxuan1999/article/details/103789675

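This algorithm can be classified as the "K-armed bandit in a stationary environment" introduced in chapter 1 of the book Reinforcement Learning. The program is based on the code in chapter 1 of Reinforcement Learning and Deep Reinforcement Learning by the Japanese author 小高知宏 (Tomohiro Odaka). The problem is as follows: given an opponent who plays scissors, stone, and cloth in the ratio 2 : 2 : 1, write an algorithm that learns the best strategy against this opponent within a finite number of iterations.

The program itself is very simple, but it serves as a concrete code example and training workflow for writing reinforcement learning programs.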

1. Algorithm idea

2. Implementation

2.1 battle.c

2.2 randHandGenerator.c

3.2.2 Compiling randHandGenerator.c

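1. Algorithm idea

At initialization the three gestures get equal weights, 1.0 : 1.0 : 1.0, and a learning rate LEARNINGRATE is fixed. After every round, the weight of the gesture just played is increased by LEARNINGRATE * gain * (current weight of that gesture), where gain is +1 for a win, -1 for a loss, and 0 for a draw; it is looked up in the payoff matrix using the two hands (0, 1, 2) as indices. So a win raises the weight slightly, a loss lowers it, and a draw leaves it unchanged. After enough iterations the weights settle on the best strategy and the algorithm ends.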

2. Implementation

2.1 battle.c

// battle.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SCISSOR 0   // 0, 1, 2 also serve as indices into the rate array
#define STONE 1
#define CLOTH 2

#define WIN 1
#define LOSE -1
#define STANDOFF 0  /* the match is a draw */

#define LEARNINGRATE 0.001

// decide among scissor / stone / cloth
int hand(double* rate);

// generate a pseudo-random fraction in [0, 1]
double frand(void);

// Clamp the weight of each gesture to [0, 3]
void correct(double* rate);


int payOffMatrix[3][3] = {
    { 0, -1,  1},
    { 1,  0, -1},
    {-1,  1,  0}
};


/*

payOffMatrix[i][j] : 
In one battle, my value = i
rival's  value = j
i, j range in {0, 1, 2}

0: scissor
1: stone
2: cloth

*/

int main(void) {
    srand((unsigned)time(NULL));
    int myhand = -1, ohand = -1;
    double rate[] = {1.0, 1.0, 1.0};  // Initial weights of scissor, stone, cloth
    int gain = 0;  // In each battle, gain is set to WIN / LOSE / STANDOFF

    while (scanf("%d", &ohand) == 1)  // Stop at EOF or on non-numeric input
    {
        if ((ohand < 0) || (ohand > 2))   // Skip illegal data
        {
            continue;
        }
        myhand = hand(rate);  // Pick my gesture according to the current weights
        gain = payOffMatrix[myhand][ohand];
        printf("myhand = %d ohand = %d gain = %d \n", myhand, ohand, gain);  // Output the result of this round
        rate[myhand] += gain * LEARNINGRATE * rate[myhand];  // Update the weight of the gesture just played
        correct(rate);  // Keep the weights within their allowed bounds
        printf("scissor rate = %lf, stone rate = %lf, cloth rate = %lf \n", rate[SCISSOR], rate[STONE], rate[CLOTH]);  // Print the latest weights
    }

    return 0;
}

int hand(double rate[])
{
    double scissor = rate[SCISSOR] * frand();
    double stone = rate[STONE] * frand();
    double cloth = rate[CLOTH] * frand();
    // Return the index of the largest weighted draw, comparing all three values
    if (scissor > stone) return scissor > cloth ? SCISSOR : CLOTH;
    return stone > cloth ? STONE : CLOTH;
}

void correct(double* rate)
{
    for (int i = 0; i < 3; ++i)  // Cover all three gestures
    {
        if (rate[i] < 0.0) rate[i] = 0.0;
        if (rate[i] > 3.0) rate[i] = 3.0;
    }
}

double frand(void) 
{
    return (double)rand() / RAND_MAX;

}

battle.c is the program that simulates the matches between the learner and the opponent.

2.2 randHandGenerator.c

// randHandGenerator.c
// This file generates the rival's hands
#include <stdio.h>
#include <stdlib.h>
#include <time.h>


#define SCISSOR 0   // 0, 1, 2 also serve as indices into the rate array
#define STONE 1
#define CLOTH 2

// decide among scissor / stone / cloth
int hand(double* rate);

// generate decimal fraction in [0, 1] 
double frand(void);

double max(double a, double b) 
{
	return a > b ? a : b;
}

int main(int argc, char ** argv) {
    if (argc < 5)
    {
        // Report usage on stderr so it never ends up in the redirected hand data
        fprintf(stderr, "Not enough parameters\n");
        fprintf(stderr, "Usage: %s BattleTimes initialScissorRate initialStoneRate initialClothRate \n", argv[0]);
        fprintf(stderr, "Usage example: %s 1000 2 2 1 \n", argv[0]);
        exit(1);
    }
    srand((unsigned)time(NULL));
    int battleTimes = atoi(argv[1]);  // Number of hands to generate
    double rate[] = { atof(argv[2]), atof(argv[3]), atof(argv[4]) };  // The rival's gesture weights
    int gestureCounter[3] = { 0, 0, 0 };  // How often each gesture was generated

    for (int i = 0; i < battleTimes; ++i)
    {
        int ohand = hand(rate);  // Always 0, 1 or 2
        printf("%d\n", ohand);
        gestureCounter[ohand]++;
    }
    // printf("OHand Times : scissor = %d, stone = %d, cloth = %d \n", gestureCounter[0], gestureCounter[1], gestureCounter[2]);
    return 0;
}

int hand(double rate[])
{
    // Scale an independent random draw by each weight and return the index of the largest.
    // Note: this does not pick gesture i with probability proportional to rate[i];
    // with weights 2 : 2 : 1 the long-run frequencies are 11 : 11 : 2.
    double scissor = rate[SCISSOR] * frand();
    double stone = rate[STONE] * frand();
    double cloth = rate[CLOTH] * frand();
    double maxValue = max(scissor, max(stone, cloth));
    if (scissor == maxValue) return 0;
    if (stone == maxValue) return 1;
    return 2;
}

double frand(void) 
{
    return (double)rand() / RAND_MAX;

}

3. Training process and notes

3.1 Some hints

// README.txt
Hint 1:
Type the following command in bash to generate the hand data for 1000 epochs with the rate scissor : stone : cloth == 2 : 2 : 1,
and store the data in handData.txt.

bash> ./handGen 1000 2 2 1 > handData.txt

Hint 2:
Type the following command in bash to feed the data from handData.txt to the executable program battle and show the details
on stdout.

bash> ./battle < handData.txt

Hint 3:
Type the following command in bash to feed the data from handData.txt to the executable program battle and store the details
in the file battleResult.txt.

bash> ./battle < handData.txt > battleResult.txt

Hint 4:
Type the following command in bash to print the contents of battleResult.txt to stdout.

bash> cat battleResult.txt

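Hint 5:
The generated hands do not follow the given rates exactly. rand() is only pseudo-random, but the main reason is that hand()
picks the largest of rate[i] * frand(), which is not the same as drawing gesture i with probability proportional to rate[i].
With the rates set to 2 : 2 : 1 the long-run frequencies are 11 : 11 : 2, so getting roughly 5 : 5 : 1 is expected.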

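The program trains stably under Ubuntu 16.04.

I have provided three files. After copying them to your computer and opening them on Linux, it should look like this: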