Reinforcement Learning: What It Is and How It Works

Table of Contents

What is reinforcement learning (RL)?

Reinforcement vs. supervised and unsupervised learning

How reinforcement learning works

Key concepts in RL

The RL process

Types of reinforcement learning

Model-free reinforcement learning

Model-based reinforcement learning

Hybrid reinforcement learning

Applications of reinforcement learning

Advantages of reinforcement learning

Disadvantages of reinforcement learning


Reinforcement Learning: What It Is and How It Works

In the fascinating world of AI, reinforcement learning stands out as a powerful technique that enables machines to learn optimal behaviors through trial and error, much like how humans and animals acquire skills in the real world.

What is reinforcement learning (RL)?

Reinforcement learning (RL) is a type of machine learning (ML) in which an agent learns to make decisions by interacting with its environment. In this context, the agent is a program that makes decisions about actions to take, receives feedback in the form of rewards or penalties, and adjusts its behavior to maximize cumulative rewards.
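
The phrase "maximize cumulative rewards" is usually made precise with the notion of a discounted return (a standard formulation, though this article doesn't state it explicitly): future rewards are added up, each weighted by a discount factor γ.

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma \le 1
```

A γ near 0 makes the agent short-sighted, favoring immediate rewards; a γ near 1 makes it value long-term rewards almost as much as immediate ones.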

Machine learning is a subset of artificial intelligence (AI) that uses data and statistical methods to build programs that mimic human reasoning rather than relying on hard-coded instructions. RL is directly inspired by how people use trial and error to optimize their decisions.

Reinforcement vs. supervised and unsupervised learning

In supervised learning, models are trained using labeled data, where the correct output is provided for each input. This guidance helps the model make accurate predictions when it’s faced with new, unseen data. Supervised learning is useful for tasks like spam detection, image classification, and weather forecasting.

On the other hand, unsupervised learning works with unlabeled data to find patterns and groupings. It can cluster similar data points, find associations between items, and reduce data complexity for easier processing. Examples include customer segmentation, recommendation systems, and anomaly detection.

Reinforcement learning is distinct from both. In RL, an agent learns by interacting with its environment and receiving positive or negative feedback. This feedback loop enables the agent to adjust its actions to achieve the best possible outcomes. RL is particularly useful for tasks where the agent needs to learn a sequence of decisions, as in game playing, robotics, and autonomous driving.

How reinforcement learning works

Understanding the principles of RL is crucial for grasping how intelligent agents learn and make decisions. Below, we’ll explore the key concepts and the RL process in detail.

Key concepts in RL

RL has a distinct vocabulary that doesn’t apply to other types of ML. The primary notions to understand are listed below, with a short code sketch after the list showing where each one appears in practice:

1 Agent and environment: The agent is the decision-making computer program, while the environment encompasses everything the agent interacts with. This includes all possible states and actions, including prior decisions made by the agent. The interaction between the agent and the environment is the core of the learning process.

2 State and action: The state represents the agent’s current situation at any given moment, and an action is a decision the agent can make in response to its state. The agent aims to choose actions that will lead to the most favorable states.

3 Reward and punishment: After taking an action, the agent receives feedback from the environment: if the feedback is positive, it’s called a reward; if negative, a punishment. This feedback helps the agent learn which actions are beneficial and which should be avoided, guiding its future decisions.

4 Policy: A policy is the agent’s strategy for deciding which action to take in each state. It maps states to actions, serving as the agent’s guide to achieve the best outcomes based on past experiences.

5 Value function: The value function estimates the long-term benefit of being in a certain state or taking a certain action. It helps the agent understand the potential future rewards, even if it means enduring a short-term negative reward to maximize long-term gain. The value function is essential for making decisions that optimize cumulative rewards over time.
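
To make these terms concrete, here is a minimal Python sketch of the agent-environment loop. The LineWorld environment and the random policy are hypothetical toy stand-ins, not anything from this article; the point is only to show where state, action, reward, and policy appear in code.

```python
import random

class LineWorld:
    """Toy environment: the agent starts at position 0 and tries to reach position 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else -0.1  # reward for reaching the goal, small penalty otherwise
        done = self.state == 4
        return self.state, reward, done

def policy(state):
    """A deliberately poor starting policy: act at random. Learning means improving this mapping."""
    return random.choice([-1, +1])

env = LineWorld()                           # the environment
state = env.reset()                         # the initial state
total_reward = 0.0
for t in range(20):                         # one episode of interaction
    action = policy(state)                  # the agent picks an action for the current state
    state, reward, done = env.step(action)  # the environment returns the next state and a reward
    total_reward += reward                  # cumulative reward the agent tries to maximize
    if done:
        break
print("return for this episode:", total_reward)
```

In a real RL system, the random policy above would be replaced by one that is updated from experience, guided by value estimates of the kind described in the last two items.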

The RL process

While the purpose and learning method are quite different from other types of ML, the process is similar in terms of preparing data, choosing parameters, evaluating, and iterating.

Here’s a brief overview of the RL process:

1 Problem definition and goal setting. Clearly define the problem and determine the agent’s goals and objectives, including the reward structure. This will help you decide what data you need and what algorithm to select.

2 Data collection and initialization. Gather initial data, define the environment, and set up the necessary parameters for the RL experiment.

3 Preprocessing and feature engineering. Clean the data: spot-check, remove duplicates, ensure you have the proper feature labels, and decide how to handle missing values. In many cases, you’ll want to create new features to clarify important aspects of the environment, such as creating a single positioning data point from several sensor inputs.

4 Algorithm selection. Based on the problem and environment, choose the appropriate RL algorithm and configure core settings, known as hyperparameters. For instance, you’ll need to establish the balance of exploration (trying new paths) versus exploitation (following known pathways); a common way to encode that balance is sketched after this list.

5 Training. Train the agent by allowing it to interact with the environment, take actions, receive rewards, and update its policy. Adjust the hyperparameters and repeat the process. Continue to monitor and adjust the exploration-exploitation trade-off to ensure the agent learns effectively.

6 Evaluation. Assess the agent’s performance using metrics, and observe its performance in applicable scenarios to ensure it meets the defined goals and objectives.

7 Model tuning and optimization. Adjust hyperparameters, refine the algorithm, and retrain the agent to improve performance further.

8 Deployment and monitoring. Once you’re satisfied with the agent’s performance, deploy the trained agent in a real-world environment. Continuously monitor its performance and implement a feedback loop for ongoing learning and improvement.

9 Maintenance and updating. While continual learning is very useful, occasionally you may need to retrain from initial conditions to make the most of new data and techniques. Periodically update the agent’s knowledge base, retrain it with new data, and ensure it adapts to changes in the environment or objectives.
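
The exploration-versus-exploitation balance mentioned in steps 4 and 5 is often implemented with an ε-greedy rule: with probability ε the agent explores a random action, and otherwise it exploits the best-known one. Here is a minimal sketch of that idea; the Q-value dictionary and action names are hypothetical placeholders, and ε itself is exactly the kind of hyperparameter you would tune in steps 4, 5, and 7.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit the best-known action.

    q_values: dict mapping action -> estimated value for the current state.
    actions:  list of actions available in the current state.
    """
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: try something new
    return max(actions, key=lambda a: q_values.get(a, 0.0))  # exploit: follow the known best option

# Example: value estimates learned so far for one state
q_for_state = {"left": 0.2, "right": 0.7}
print(epsilon_greedy(q_for_state, ["left", "right"], epsilon=0.2))
```

A larger ε means more exploration early in training; many setups gradually decay ε so the agent exploits more as its estimates improve.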

Types of reinforcement learning

Reinforcement learning can be broadly categorized into three types: model-free, model-based, and hybrid. Each type has its specific use cases and methods.

Model-free reinforcement learning

With model-free RL, the agent learns directly from interactions with the environment. It doesn’t try to understand or predict the environment but simply tries to maximize its performance within the situation presented. An example of model-free RL is a Roomba robotic vacuum: as it goes along, it learns where the obstacles are and incrementally bumps into them less while cleaning more.

Examples:

  • Value-based methods: The most common is Q-learning, where a Q-value represents the expected future rewards for taking a given action in a given state; a minimal sketch follows this list. This method is optimal for situations with discrete choices, which is to say limited and defined options, such as which way to turn at an intersection. You can manually assign Q-values, use a zero or low value to avoid bias, randomize values to encourage exploration, or use uniformly high values to ensure thorough initial exploration. With each iteration, the agent updates these Q-values to reflect better strategies. Value-based learning is popular because it is simple to implement and works well in discrete action spaces, though it can struggle with too many variables.
  • Policy gradient methods: Unlike Q-learning, which tries to estimate the value of actions in each state, policy gradient methods focus directly on improving the strategy (or policy) the agent uses to choose actions. Instead of estimating values, these methods adjust the policy to maximize the expected reward. Policy gradient methods are useful in situations where actions can take any value (following the analogy above, walking in any direction across a field) or where it’s hard to determine the value of different actions. They can handle more complex decision-making and a continuum of choices but usually need more computing power to work effectively.
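
As a concrete illustration of the value-based bullet above, here is a minimal tabular Q-learning sketch in Python. The environment is assumed to expose reset() and step() like the toy LineWorld shown earlier, and the learning rate, discount factor, and ε values are arbitrary illustrative choices rather than recommendations.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch: learn Q(s, a), the expected future reward of action a in state s."""
    Q = defaultdict(float)  # (state, action) -> value estimate, initialized to 0 to avoid bias
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection: explore occasionally, otherwise exploit the best-known action.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning update: nudge Q(s, a) toward reward + gamma * (best value of the next state).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

# Example usage with the toy LineWorld environment shown earlier:
# Q = q_learning(LineWorld(), actions=[-1, +1])
```

Note that the agent never builds a model of how the environment changes; it only improves its Q-values from observed transitions, which is what makes this approach model-free.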

Model-based reinforcement learning

Model-based RL involves creating a model of the environment to plan actions and predict future states. These models capture the interplay between actions and state changes by predicting how likely an action is to affect the state of the environment and the resulting rewards or penalties. This approach can be more efficient, as the agent can simulate different strategies internally before acting. A self-driving car uses this approach to understand how to respond to traffic features and various objects. A Roomba’s model-free technique would be inadequate for such complex tasks.

Examples:

  • Dyna-Q: Dyna-Q is a reinforcement learning algorithm that combines model-free Q-learning with model-based planning. The agent updates its Q-values based on real interactions with the environment and on simulated experiences generated by a model; a short sketch follows this list. Dyna-Q is particularly useful when real-world interactions are expensive or time-consuming.
  • Monte Carlo Tree Search (MCTS): MCTS simulates many possible future actions and states to build a search tree to represent the decisions that follow each choice. The agent uses this tree to decide on the best action by estimating the potential rewards of different paths. MCTS excels in decision-making scenarios with a clear structure, such as board games like chess, and can handle complex strategic planning.
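
Here is a minimal sketch of the Dyna-Q idea from the first bullet: the agent performs an ordinary Q-learning update from each real transition, records that transition in a learned model, and then replays a few simulated transitions from the model as planning steps. The environment interface and all hyperparameter values are hypothetical placeholders, not from this article.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, planning_steps=10):
    """Dyna-Q sketch: Q-learning from real experience plus planning from a learned model."""
    Q = defaultdict(float)   # (state, action) -> value estimate
    model = {}               # (state, action) -> (reward, next_state, terminal) observed in the real environment

    def choose(state):
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit

    def update(s, a, r, s_next, terminal):
        best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose(state)
            next_state, reward, done = env.step(action)          # real interaction
            update(state, action, reward, next_state, done)      # learn from the real transition
            model[(state, action)] = (reward, next_state, done)  # record it in the learned model
            for _ in range(planning_steps):                      # planning: replay simulated transitions
                s, a = random.choice(list(model))
                r, s_next, terminal = model[(s, a)]
                update(s, a, r, s_next, terminal)
            state = next_state
    return Q
```

The planning loop is what reduces the number of real interactions needed: each real transition is reused many times through the model, which is the sample-efficiency benefit described below.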

Model-based methods are appropriate when the environment can be accurately modeled and when simulations can provide valuable insights. They require fewer samples compared to model-free methods, but those samples must be accurate, meaning they may require more computational effort to develop.

Hybrid reinforcement learning

Hybrid reinforcement learning combines approaches to leverage their respective strengths. This technique can help balance the trade-offs between sample efficiency and computational complexity.

Examples:

  • Guided policy search (GPS): GPS is a hybrid technique that alternates between supervised learning and reinforcement learning. It uses supervised learning to train a policy based on data generated from a model-based controller. The policy is then refined using reinforcement learning to handle parts of the state space where the model is less accurate. This approach helps in transferring knowledge from model-based planning to direct policy learning.
  • Integrated architectures: Some architectures integrate various model-based and model-free components in a single framework, adapting to different aspects of a complex environment rather than forcing one approach upon everything. For instance, an agent might use a model-based approach for long-term planning and a model-free approach for short-term decision-making.
  • World models: World models are an approach where the agent builds a compact and abstract representation of the environment, which it uses to simulate future states. The agent uses a model-free approach to learn policies within this internal simulated environment. This technique reduces the need for real-world interactions.

Applications of reinforcement learning

RL has a wide range of applications across various domains:

  • Game playing: RL algorithms have achieved superhuman performance in cases like chess and video games. A notable example is AlphaGo, which plays the board game Go by using a hybrid of deep neural networks and Monte Carlo Tree Search. These successes demonstrate RL’s ability to develop complex strategies and adapt to dynamic environments.
  • Robotics: In robotics, RL helps in training robots to perform tasks like grasping objects and navigating obstacles. The trial-and-error learning process allows robots to adapt to real-world uncertainties and improve their performance over time, surpassing inflexible rule-based approaches.
  • Healthcare: By responding to patient-specific data, RL can optimize treatment plans, manage clinical trials, and personalize medicine. RL can also suggest interventions that maximize patient outcomes by continuously learning from patient data.
  • Finance: Model-based RL is well suited to the clear parameters and complex dynamics of various parts of the finance field, especially those interacting with highly dynamic markets. Its uses here include portfolio management, risk assessment, and trading strategies that adapt to new market conditions.
  • Autonomous vehicles: Self-driving cars use RL-trained models to respond to obstacles, road conditions, and dynamic traffic patterns. They immediately apply these models to adapt to current driving conditions while also feeding data back into a centralized continual training process. The continuous feedback from the environment helps these vehicles improve their safety and efficiency over time.

Advantages of reinforcement learning

  • Adaptive learning: RL agents continuously learn from and adapt to their interactions with the environment. Learning on the fly makes RL particularly suited for dynamic and unpredictable settings.
  • Versatility: RL works for a wide range of problems involving a sequence of decisions where one influences the environment of the next, from game playing to robotics to healthcare.
  • Optimal decision-making: RL is focused on maximizing long-term rewards, ensuring that RL agents develop strategies optimized for the best possible outcomes over time rather than simply the next decision.
  • Automation of complex tasks: RL can automate tasks that are difficult to hard-code, such as dynamic resource allocation, complex control systems like electricity grid management, and precisely personalized recommendations.

Disadvantages of reinforcement learning

  • Data and computational requirements: RL often requires extensive amounts of data and processing power, both of which can get quite expensive.
  • Long training time: Training RL agents can take weeks or even months when the process involves interacting with the real world and not simply a model.
  • Complexity: Designing and tuning RL systems involves careful consideration of the reward structure, policy representation, and exploration-exploitation balance. These decisions must be made thoughtfully to avoid taking too much time or resources.
  • Safety and reliability: For critical applications like healthcare and autonomous driving, unexpected behavior and suboptimal decisions can have significant consequences.
  • Low interpretability: In some RL processes, especially in complex environments, it’s difficult or impossible to know exactly how the agent came to its decisions.
  • Sample inefficiency: Many RL algorithms require a large number of interactions with the environment to learn effective policies. This can limit their usefulness in scenarios where real-world interactions are costly or limited.
