credit assignment problem
Since all agents are exploring and learning at the same time,
it is difficult for any given agent to estimate the impact of their action on the
overall return
For example, an agent might have chosen the optimal action in a
given state, but the returns are lower than average since the teammate took an
exploratory action. The agent(not the teammate) will thus (falsely) learn to reduce the probability of selecting this (optimal) action