Adaptive Learning for Multi-Agent Coordination and Control

 


Meng Xiangping to Robert Babuška

December 20, 2004

Overview

1 Introduction

In this part we give a brief introduction to the learning problem in multi-agent systems, and we investigate, compare, and analyse several multi-agent reinforcement learning algorithms. We also discuss the kinds and applications of multi-agent learning problems, as well as related problems.

 

 

1.1 Multi-Agent Systems

 

 

1.1.1 Agent

 

 

According to the definition by Russell and Norvig [2003], an agent is anything that can perceive its environment through sensors and act upon that environment through actuators. Here the agent is autonomous, i.e., an agent with its goal embedded in an environment learns how to transform one environmental state into another that contains its goal. An agent that can perform this task with minimal human supervision is called autonomous [Kaya and Alhajj, 2004]. The agent is also rational or intelligent, i.e., it always tries to optimize an appropriate performance measure. An agent may therefore be a human agent, a robotic agent, or a software agent, because they all have sensors and actuators, and they can make optimal decisions relying to a larger extent on their own perception than on prior knowledge given to them at design time [Vlassis, 2003].

 

1.1.2 Multi-agent systems

 

 

A multi-agent system (MAS) is a system that consists of a group of agents that are autonomous, possibly homogeneous or heterogeneous, that can make their decisions independently, and that interact with each other to cope with complex problems involving distributed data, knowledge, or control, eventually reaching an overall goal.

A MAS can be constituted by designing different kinds of agents, such as software agents (on the Internet) and hardware agents (robotic soccer), heterogeneous and homogeneous agents, and so on. Homogeneous agents are designed in an identical way and a priori have the same capabilities. Heterogeneous agents, in contrast, may be designed in different ways and have different capabilities, or may be based on the same hardware/software but implement different behaviors.

In multi-agent systems (MASs), autonomous agents interact with each other to complete complex tasks. Each agent has its own goal, and these goals must be coordinated to achieve a better and more efficient overall solution.

Multi-agent system research can be broadly classified into cooperative agent systems and non-cooperative agent systems, based on the relationships between the agents [Sen, 1997].

 

1.1.2.1 Cooperative agent systems

 

 

In cooperative agent systems, a group of cooperative agents work jointly on achieving a common goal, or agents can form binding agreements to achieve mutual benefits. Distributed decision making in such team MASs results in asynchronous computation and, through communication, in certain speedups. Although the cooperative agents have a common goal, each agent has its own sub-goal; these sub-goals may be consistent (hunters in the pursuit problem [Ishiwaka, et al, 2003]) or may sometimes conflict (traffic signal control [Choy, et al, 2003], robots in the same team in robot soccer [Park, et al, 2001]). This calls for an appropriate coordination mechanism. Coordination can ensure that the individual decisions of the agents result in good joint decisions for the group.

The core research problems in cooperative agent systems are how agents decompose their goal into sub-goals and assign these sub-goals to individual agents based on their capabilities and access to resources; how to develop agent organizations, problem-solving protocols, and communication languages that allow agents to share results and knowledge effectively and quickly; and how agents maintain coherence.

In cooperative agent systems, learning problems are usually solved with reinforcement learning approaches.

 

1.1.2.2 Non-cooperative agent systems

 

 

In non-cooperative agent systems, a group of self-interested agents interact in a shared environment and these agents cannot form binding agreements. Such systems cover two situations: (1) agents are adversarial to each other, as in the automated and dynamic pricing problem in electronic marketplaces [Kutschinskia, et al, 2003] and the adaptive agent tracking (air-to-air combat) problem [Tambe, 1995]; (2) agents are indifferent to each other; for example, drivers sharing a highway are neither cooperative nor adversarial, and they interact only because they share a common resource [Sen, 1997].

In adversarial agent systems, research mainly concentrates on modeling the knowledge and behavioral strategies of opponents, learning to exploit opponent weaknesses, and developing interaction rules by which agents can arrive at equilibrium configurations.

In the indifferent agent systems case, the key research problems are to design social laws, conventions, and protocols by which each agent can achieve its own goal without significantly affecting the chances that others have of achieving their goals.

Learning problems in non-cooperative agent systems are usually solved with zero-sum or general-sum game-theoretic approaches combined with reinforcement learning.

 

1.2 Characteristics of Multi-Agent Systems

 

 

Multi-agent systems possess the following characteristics:

(1) In multi-agent systems, there is no global control and no globally consistent knowledge. In other words, there is only decentralized control and partial, possibly inconsistent knowledge.

(2) Multi-agent systems are in general dynamic systems. The dynamics arise not only from changes in the environment state, but also from the evolution of agent behaviors. Agents that adapt their behavior, or learn, based on their experience with the environment and other agents, are thus themselves a source of dynamics, providing further impetus for others to adapt and learn [Hu and Wellman, 2001].

(3) Agents in a multi-agent system are often designed based on functionality; such systems are modular by design and hence can be easier to develop and maintain than centralized systems.

(4) Multi-agent learning gives multi-agent systems adaptive abilities and robustness. Since data and control are decentralized, there is no central controlling agent that decides what each agent must do at each time step; instead, each agent can, through learning, discover and exploit the dynamics of the environment and adapt its behavior to unforeseen difficulties in the task. Moreover, because learning from the environment is distributed over the agents and does not rely on a single agent, the system is robust.

(5) Multi-agent systems provide a very useful framework within which social aspects of intelligent behavior can be modeled, analyzed and evaluated under a variety of domain, behavior and knowledge assumptions.

 

1.3 Multi-Agent Learning Problems

 

 

1.3.1 Multi-agent learning

 

 

Multi-agent systems require determining a course of action for each agent just as in single-agent systems. Machine learning can be a powerful tool for finding a successful course of action and can greatly simplify the task of specifying appropriate behaviors for an agent. In particular, through learning an agent can discover and exploit the dynamics of the environment and also adapt to unforeseen difficulties in the task. These benefits have caused learning to be studied extensively for single-agent problems with a stationary environment. In multi-agent environments, learning is both more important and more difficult, since the selection of actions must take place in the presence of other agents who are also learning [Bowling and Veloso, 2002].

In multi-agent systems, each agent has its own goal, possibly the same as or different from those of the others, and it must learn from the environment or from the behavior of other agents through interaction, in order to adapt to the changing demands of dynamic environments and to achieve its goal.

 

1.3.2 Kinds of multi-agent learning problems

 

 

Multi-agent systems require agents to work with other agents in their environment and to solve complicated problems. The environment of a multi-agent system is dynamic, and the behavior of the agents changes constantly. In addition, an agent may be cooperative, adversarial, or indifferent towards other agents. These factors make multi-agent learning problems more complex.

Learning problems may be classified into two kinds: (1) learning about the environment; (2) learning about other agents.

 

1.3.2.1 Learning about the environment

 

 

In multi-agent systems, the agents interact in open, dynamic environments, so coordination of agent activities is very important. However, pre-produced off-line coordination strategies can quickly become inadequate as environments change over time. Therefore, learning and adaptation are invaluable mechanisms by which agents can evolve coordination strategies that meet the demands of the environments and their individual requirements [Sen, 1998; Wei and Sen, 1996; Wei, 1997].

Adaptive and learning agents require knowledge of their environments and communication abilities, and they use different strategies depending on whether the environment contains cooperative agents, selfish agents, or partially cooperative agents.

Learning about the environment mainly means learning whether the chosen learning approaches and strategies remain suitable as the state of the environment changes.

 

1.3.2.2 Learning about other agents

 

 

In multi-agent systems, agents are forced to interact with other agents, and these agents may have independent goals, assumptions, algorithms, and conventions. The agents can learn and adapt to the other agents' behavior. Since the other agents also have the ability to adapt their behavior, the optimal course of action of an agent keeps changing as all the agents adapt.

The learning agent uses collected information to predict the individual actions of other agents. The agent can make its prediction either by learning a time-series model of another agent's actions, ignoring the underlying decision process of that agent, or by learning an underlying decision model that determines another agent's actions. Based on rational decision making, the learning agent assumes that every agent is trying to maximize its payoff; therefore there exists a functional relation between that agent's actions and its local states. This leads either to a model-free approach based on time-series analysis, or to a model-based approach derived from assumptions about the other agents' recursive level [Hu and Wellman, 2001].

Learning about other agents requires analysing and constructing algorithms that guarantee convergence and stability of the group behavior.

Learning from other agents mainly means learning the actions and strategies that the other agents choose.

 

1.4 The effectiveness of learning

 

 

The effectiveness of learning depends not only on the learning method, but also on how much information is available to the agent. When an agent cannot observe other agents' actions, it has to rely on indirect evidence of the effect of those actions. Such partially observable or incomplete information makes learning more difficult, sometimes imposing strict limitations [Hu and Wellman, 2001].

If learning agents can obtain more information, multi-agent systems might avoid suboptimal results and produce better overall performance.

In real multi-agent systems, the environment is usually only partially observable to the agents. Consequently, the information about the environment and about the behavior of other agents obtained by each agent is incomplete and sometimes inaccurate. A rational agent must therefore carefully consider which actions to choose, through adaptive learning and reasoning combined with the information it has obtained.

1.5 Communication

 

 

In a multi-agent environment, cooperative agents need to communicate with each other to achieve their goals. Communication can help to reach a locally optimal individual behavior and a globally optimal behavior for the group as a whole.

Communication can be used to share information among agents; such communication is called informative. It can enable an agent to obtain knowledge better through shared information than through direct observation alone.

Communication can also be used to propose to another agent the execution of a cooperative behavior, which is called a cooperation proposal. When an agent receives a cooperation proposal, it changes its internal state, thus predisposing itself to the execution of the proposed cooperative behavior. The behaviors are complementary and can result in a successful cooperative action, but cooperation among agents does not always mean that the same behavior is executed [Bonarini and Trianni, 2001].

A consistent communication protocol and comprehensive communication languages are necessary. So far there is no universal communication language.

 

2 Learning Control of Multi-Agent Systems

 

 

2.1 Learning Control Systems

 

 

Learning control systems, or intelligent control systems with problem-solving or high-level decision capabilities, actually form an intersection of artificial intelligence and automatic control. This area includes: 1) control systems with a human controller, 2) control systems with a man-machine controller, and 3) autonomous robot systems [Fu, 1971; Fu, 1970].

 

2.1.1 Control systems with human controller

 

 

The adaptive and learning behavior of human controllers is expected to provide some clues to the synthesis of learning and intelligent control systems [Young, 1969].

A human controller in general consists of a data acquisition device, pattern recognition and force program calculations, as well as an arm and hand dynamic executive mechanism. Determining the performance index used by human controllers is usually a very important problem; it is anticipated that a human controller uses different performance indices for different control tasks. In addition to changes of the plant parameters, the type of the plant may also change; the pattern recognizer will recognize such a change, and consequently a change of control strategy will be initiated. The search for appropriate control parameters will then be performed according to the new strategy [Gilstad and Fu, 1970; Thomas and Tou, 1968].

Specifically, real-time parameter adjustment or parameter tracking was performed to follow the dynamics of the human operator. The decision making was determined through a period of learning.

 

2.1.2 Control systems with man-machine controller

 

 

Control systems with a man-machine combination raise the problem of man-machine interaction and the problem of the adaptive and learning behavior of such control systems. The latter is closely related to our interest. In some very complicated control tasks, human supervision, acting as a teacher and a controller, is necessary. Through a period of learning, the machine controller gradually participates in the control of the manipulator, and, consequently, the human operator mainly acts as an action initiator and inhibitor [Freedy, Hull and Lyman, 1970].

 

2.1.3 Autonomous robot systems

 

 

The robot system is viewed as a computer-controlled system in a complex environment; the controller should perform at least the following three major functions: 1) problem-solving, 2) modeling, and 3) perception [Rosen and Nilsson, 1967; Nilsson, 1969].

As the environment changes, either by the robot’s own actions or for other reasons, the model must be updated to record these changes. New information about the environment should be added to the model. The addition and the updating of information about the model is a learning process and is similar to the process introduced in many learning control systems. In order to give a robot system the information about its environment, sensors are necessary. A visual sensory system is used since it allows direct perception of a significant portion of the environment. Scene analysis and object recognition are required to provide the robot with a description of its environment [Fu, 1971].

To design a controller for autonomous robot systems, there are usually two different approaches. The first is based only on the information available: the unknown information is either ignored or assumed to take some known values from the designer's best guess. The second is to design a controller that is capable of estimating the unknown information during its operation, with an optimal control action determined on the basis of the estimated information [Fu, 1970].

The controller learns the unknown information during operation and the learned information is, in turn, used as an experience for future decisions or controls.

Therefore, as the controller accumulates more information about the unknown function or parameters, the control law will be altered according to the updated information in order to improve the system’s performance.

The process of learning can be classified into 1) learning with external supervision, or off-line learning, and 2) learning without external supervision, or on-line learning. In learning processes with external supervision, the desired output of the system or the desired optimal control action is usually considered exactly known; therefore the controller can modify its control strategy or control parameters to improve the system's performance. On the other hand, in learning processes without external supervision, the desired answer is not exactly known. Mixtures of the two approaches are also commonly used in designing learning controllers.

2.2 Game Theory

 

 

Learning control approaches applied to multi-agent systems are at present mainly based on game theory and reinforcement learning.

 

2.2.1 Basic concepts

 

Mathematical games can be represented in different forms. The most important forms are the extensive game form and the strategic game form. The two types of games can be distinguished by the way the agents choose their actions. Although the extensive game form is the most richly structured way to describe game situations, the strategic game form is conceptually simpler and it can be derived from the extensive game form [Kononen, 2003].

In a strategic game, each agent chooses a single action and then receives a payoff that depends on the selected joint action. This joint action is called the outcome of the game. The important point to note is that, although the payoff functions of the agents are common knowledge, an agent does not know in advance the action choices of the other agents. The best it can do is to try to predict the actions of the other agents. A solution to a game is a prediction of the outcome of the game under the assumption that all agents are rational and strategic [Vlassis, 2003]. Games in strategic form are usually referred to as matrix games, particularly in the two-player case. In a multi-player matrix game, each player simultaneously chooses either a strategy from its own strategy set or a random (mixed) strategy.
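To make the matrix-game setting concrete, the following is a minimal sketch of a two-player strategic game in Python. The payoff numbers, the names R1/R2, and the helper functions are illustrative assumptions, not taken from the text; they merely show how payoffs depend on the joint action and how a best response to a mixed strategy can be computed.

```python
import numpy as np

# Illustrative payoff matrices (prisoner's-dilemma-like values, assumed for the example).
# The row player chooses action i, the column player chooses action j;
# R1[i, j] and R2[i, j] are their respective payoffs for the joint action (i, j).
R1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
R2 = np.array([[3.0, 5.0],
               [0.0, 1.0]])

def expected_payoffs(p_row, p_col):
    """Expected payoff of each player when both use mixed (random) strategies."""
    return p_row @ R1 @ p_col, p_row @ R2 @ p_col

def best_response_row(p_col):
    """Pure-strategy best response of the row player to the column player's mixed strategy."""
    return int(np.argmax(R1 @ p_col))

# Example: the column player mixes uniformly over its two actions.
p_col = np.array([0.5, 0.5])
print(expected_payoffs(np.array([0.0, 1.0]), p_col))  # payoffs when the row player plays action 1
print(best_response_row(p_col))                       # row player's best reply to the uniform mix
```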

In decision problems with only one decision maker, it is adequate to maximize the expected utility of that decision maker. However, in a multi-agent system, where many agents take decisions at the same time, an agent will also be uncertain about the decisions of the other participating agents. Clearly, what an agent should do depends on what the other agents will do. Two assumptions are made about the participating agents: they are rational, and they reason strategically, that is, they take the other agents' decisions into account in their own decision making.

 

2.2.2 Markov Games

 

 

A multi-player Markov game is defined over a finite set of states of the environment: in each state, every agent selects an available action, the environment then changes to a new state, and each agent receives an expected immediate reward; all agents attempt to maximize their expected sum of rewards.

With two or more agents, the fundamental problem of using single-agent Markov decision processes (MDPs) [Bellman, 1957; Howard, 1960] is that this approach treats the other agents as a part of the environment and thus ignores the fact that their decisions may influence the state of the environment.

One possible solution to this problem is to use competitive Markov decision processes, i.e. Markov games. In a Markov game, the process changes its state according to action choices of all agents and can thus be seen as a multi-controller Markov decision process.

 

2.2.3 Stochastic Games

 

 

2.2.3.1 Definitions of stochastic games

 

 

A stochastic game is a tuple (n, S, A1, …, An, R1, …, Rn, T), where n is the number of agents, S is a set of states, Ai is the set of actions available to agent i, Ri : S × A1 × … × An → R is the reward function of agent i, and T : S × A1 × … × An → PD(S) is the transition probability map, where PD(S) is the set of probability distributions over the state space S.

Stochastic games (SGs) have been used as a framework to study the multi-agent learning problem; in this setting, an agent tries to learn a policy in the presence of other agents. The goal of each agent in the game is to maximize its discounted future reward sum.
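As a sketch of this objective, reusing the notation of the tuple above and an assumed standard discount factor, agent i seeks a policy maximizing its expected discounted return under the joint policy of all agents:

```latex
% Objective of agent i in a stochastic game (notation as in the tuple above;
% the discount factor gamma is an assumed standard addition):
\[
\max_{\pi_i}\;
\mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{i,t}
\;\middle|\; \pi_1,\dots,\pi_n \right],
\qquad 0 \le \gamma < 1 .
\]
```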

SGs are a very natural extension of MDPs to multiple agents. They are also an extension of matrix games to multiple states. Each state in a stochastic game can be viewed as a matrix game in which the reward to player i is determined by the joint action a = (a1, …, an) of all agents in state s. After playing the matrix game and receiving the payoffs, the players are transitioned to another state (or matrix game) determined by their joint action. Therefore SGs contain both MDPs (n = 1) and matrix games (|S| = 1) as subsets of the framework [Bowling and Veloso, 2002].

 

2.2.3.2 Types of stochastic games

 

 

Stochastic games may be classified as cooperative games and noncooperative games.

Fully cooperative games, or team games, are ones where all the agents have the same reward function [Bowling and Veloso, 2002]; for example, all the agents in the same team of a robot soccer game share one reward function.

Noncooperative games may be classified into competitive games and general-sum games. Strictly competitive games, or zero-sum games, are two-player games in which one player's reward is always the negative of the other's, as in the matching pennies game.
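For concreteness, the matching pennies game mentioned above can be written with the two payoff matrices below (a standard formulation, not quoted from the text), which makes the zero-sum relation explicit:

```latex
% Matching pennies: the row player wins (+1) when the two choices match,
% otherwise the column player wins; the game is zero-sum because R_2 = -R_1.
\[
R_1 =
\begin{pmatrix}
 +1 & -1\\
 -1 & +1
\end{pmatrix},
\qquad
R_2 = -R_1 .
\]
```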

General-sum games are ones where the reward sum is not restricted to 0 or any constant, allowing the agents' rewards to be arbitrarily related. Zero-sum games are special cases of general-sum games, where the agents' rewards are always negatively related. In fully cooperative games, or team games, the rewards are always positively related. In other cases, agents may have both compatible and conflicting interests. For example, in a market system, the buyer and seller have compatible interests in reaching a deal, but conflicting interests in the direction of the price [Hu and Wellman, 2003].

 

2.2.4 Nash equilibrium

 

 

A Nash equilibrium is a solution of the game. In this solution, each player's strategy choice is a best response to the opponents' play, so no player has an incentive to deviate unilaterally from the equilibrium point. Thus, the Nash equilibrium provides a reasonable solution concept for a stochastic game when the roles of the players are symmetric.
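In symbols, using the value functions of the stochastic-game setting above (the notation is assumed, not quoted from the text), this no-unilateral-deviation condition can be sketched as:

```latex
% A joint policy (pi_1*, ..., pi_n*) is a Nash equilibrium if no agent can improve
% its value by deviating unilaterally:
\[
V_i\bigl(s;\,\pi_i^{*},\pi_{-i}^{*}\bigr)\;\ge\;
V_i\bigl(s;\,\pi_i,\pi_{-i}^{*}\bigr)
\quad \text{for all } s \in S,\ \text{all policies } \pi_i,\ \text{and all agents } i .
\]
```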

2.3 Control Problems of Multi-Agent Systems

 

 

Multi-agent systems are proposed as an effective approach for the design and implementation of decentralized control systems.

The control approaches of multi-agent systems are mainly based on reinforcement learning and game theory.

Reinforcement learning is an adaptive and flexible control method for autonomous systems. It does not need a priori knowledge; behaviors that accomplish given tasks are obtained automatically by repeated trial and error. However, as the complexity of the system increases, the learning costs increase exponentially [Ito and Gofuku, 2003].

Game-theoretic approaches are based on the Nash equilibrium of a stochastic game, obtained by maximizing rewards or payoffs (see the sections above).

These two theoretical frameworks have seen many developments and applications; see part 4.

 

3 Reinforcement Learning

 

 

3.1 Brief Introduction of Reinforcement Learning

 

 

Reinforcement learning addresses the question of how an autonomous agent that senses and acts in a given environment can learn to choose optimal actions to achieve its goal(s) [Kaelbling, et al, 1996]. Reinforcement learning techniques [Sutton and Barto, 1998] have attracted many researchers in investigating multiagent systems’ learning, perhaps because they do not require environment models and they allow agents to learn while they take actions. In typical multiagent systems, agents lack full information about their environment and other agents, and thus the multiagent environment constantly changes as agents learn about each other and adapt their behaviors accordingly.

The progress made on reinforcement learning has opened the way for designing autonomous agents capable of acting in unknown environments by exploring different possible actions and their consequences. In single-agent systems, Q-learning [Watkins and Dayan, 1992] is a well-established reinforcement learning technique with a firm foundation in the theory of Markov decision processes. It has been especially well studied and has wide applications. However, applying single-agent Q-learning to a multi-agent system in a straightforward fashion is very difficult. Firstly, the environment containing multiple agents is no longer stationary, since the agents are simultaneously learning by interacting with the environment and with each other; secondly, there are also computational issues, because the dimensions of the space over which each agent must learn grow exponentially with the number of its partners [Kaya and Alhajj, 2004]. Hence many theoretical results for single-agent RL do not directly apply in the case of multiple agents. So far, multi-agent reinforcement learning is less mature in many areas [Vlassis, 2003].
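As a reference point for the multi-agent algorithms discussed next, here is a minimal sketch of single-agent tabular Q-learning in Python. The environment interface (reset/step), the parameter values, and the function name are assumptions made for illustration, not an API from the text.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch; `env` is a hypothetical environment whose
    step(a) returns (next_state, reward, done) and reset() returns a state."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```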

In research on multi-agent Q-learning, most researchers adopt the framework of general-sum stochastic games, because in a stochastic game each agent's reward depends on the joint action of all agents and the current state, and the state transitions obey the Markov property. In single-agent systems, the concept of an optimal Q-value can be defined naturally in terms of an agent maximizing its own expected payoff with respect to a stochastic environment. In multi-agent systems, however, Q-values are correlated with the other agents' strategies. In this situation, it is not adequate for every agent simply to maximize its own reward with respect to the others' actions, because deterministic joint actions that make all agents' Q-values simultaneously optimal generally do not exist. Several researchers have proposed algorithms to resolve these problems.

In the framework of general-sum stochastic games, Hu and Wellman define optimal Q-values as Nash Q-values [Hu and Wellman, 1998]. The goal of learning is to find Nash Q-values, and hence a Nash equilibrium (NE), through repeated play. However, this algorithm requires the Nash equilibria encountered to be globally optimal points or saddle points, and it may fail to converge when multiple equilibria of other kinds exist.
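A sketch of the Nash-Q update, with notation assumed to follow the stochastic-game tuple above: after observing (s, a_1, ..., a_n, r_i, s'), agent i moves its joint-action Q-value toward its payoff in a selected Nash equilibrium of the stage game formed by all agents' current Q-values at s'.

```latex
% Sketch of the Nash-Q update: Nash_i(.) denotes agent i's expected payoff in a
% selected Nash equilibrium of the stage game given by all agents' Q-values at s'.
\[
Q_i(s,a_1,\dots,a_n)\;\leftarrow\;
(1-\alpha)\,Q_i(s,a_1,\dots,a_n)
+\alpha\Bigl[r_i+\gamma\,\mathrm{Nash}_i\bigl(Q_1(s',\cdot),\dots,Q_n(s',\cdot)\bigr)\Bigr]
\]
```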

Littman introduced a friend-or-foe Q-learning (FF-Q) algorithm [Littman, 2001]. This algorithm always converges, but it only learns equilibrium policies in restricted classes of Markov games, namely zero-sum games and team games. In a zero-sum game, the two players have completely opposite payoff functions; this case is called foe-Q. In a team game, or fully cooperative game, all agents in the team share the same payoff function, and the agents' Q-values converge to a globally optimal action profile (meaning that the payoff to any agent under that joint action is no less than its payoff under any other joint action); this case is called friend-Q [Shoham et al., 2003].
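For the two-player case, the stage-game values used by friend-Q and foe-Q can be sketched as follows (the notation is assumed, following the stochastic-game definition above):

```latex
% Stage-game values for the two-player case (sketch): friend-Q assumes the other
% agent helps to maximize, foe-Q assumes it plays adversarially (minimax).
\[
\text{friend-Q:}\quad V_1(s)=\max_{a_1\in A_1,\;a_2\in A_2} Q_1(s,a_1,a_2),
\]
\[
\text{foe-Q:}\quad V_1(s)=\max_{\pi\in\Delta(A_1)}\;\min_{a_2\in A_2}\;
\sum_{a_1\in A_1}\pi(a_1)\,Q_1(s,a_1,a_2).
\]
```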

Greenwald and Hall proposed a correlated equilibrium (CE) learning algorithm, CE-Q [Greenwald and Hall, 2003]. A correlated equilibrium is more general than a NE, and the algorithm addresses the equilibrium selection problem by introducing four variants of CE-Q based on four equilibrium selection functions: utilitarian, egalitarian, republican, and libertarian CE-Q learning. Unlike a NE, a CE can be computed easily via linear programming.
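A sketch of the correlated-equilibrium constraints that CE-Q solves by linear programming at each state is given below; the notation is assumed from the stochastic-game setting above, with p a probability distribution over joint actions:

```latex
% Correlated-equilibrium constraints for a distribution p over joint actions at
% state s (sketch): no agent gains by deviating from its recommended action a_i.
\[
\sum_{a_{-i}} p(a_i,a_{-i})\,
\bigl[Q_i(s,a_i,a_{-i})-Q_i(s,a_i',a_{-i})\bigr]\;\ge\;0
\quad\text{for all } i,\ a_i,\ a_i' ;
\]
% the utilitarian variant, for instance, selects the CE that maximizes
% \sum_i \sum_a p(a) Q_i(s,a), which is a linear program.
```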

However, when Nash-Q and CE-Q learn equilibrium policies in general-sum stochastic games, multiple equilibria exist, and these two algorithms therefore have difficulty converging under such conditions. Bowling and Veloso considered the problem of rational and convergent learning in stochastic games and proposed the WoLF policy hill-climbing (WoLF-PHC) learning algorithm [Bowling and Veloso, 2001]. The algorithm has good convergence properties; however, its definition of Q-values is of the single-agent style rather than the joint-action style.
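A sketch of the key WoLF ("Win or Learn Fast") ingredient, with symbols assumed rather than quoted from the text: the policy is moved toward the greedy action of the learned Q-values by a small step whose size switches between a "winning" rate and a larger "losing" rate, judged by comparing the current policy with the average policy maintained over time.

```latex
% WoLF-PHC step-size rule (sketch): the policy pi(s,.) is moved toward the greedy
% action of Q(s,.) by a step delta, which is small when "winning" and larger when
% "losing", judged against the average policy \bar{\pi}:
\[
\delta=
\begin{cases}
\delta_w & \text{if }\ \sum_{a}\pi(s,a)\,Q(s,a)\;>\;\sum_{a}\bar{\pi}(s,a)\,Q(s,a)\\[3pt]
\delta_l & \text{otherwise,}
\end{cases}
\qquad \delta_w<\delta_l .
\]
```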

