Supervised learning: What it is and how it works 监督学习：它是什么以及它是如何工作的

From image recognition to spam filtering, discover how supervised learning powers many of the AI applications we encounter daily in this informative guide.
从图像识别到垃圾邮件过滤，在这本内容丰富的指南中，了解监督学习如何为我们每天遇到的许多 AI 应用程序提供支持。

What is supervised learning?

什么是监督学习？

Supervised learning is a type of machine learning (ML) that trains models using data labeled with the correct answer. The term supervised means these labels provide clear guidance on the relationship between inputs and outputs. This process helps the model make accurate predictions on new, unseen data.
监督学习是一种机器学习（ML），它使用标有正确答案的数据来训练模型。术语“监督”意味着这些标签为输入和输出之间的关系提供了明确的指导。此过程有助于模型对新的、看不见的数据做出准确的预测。

Machine learning is a subset of artificial intelligence (AI) that uses data and statistical methods to build models that mimic human reasoning rather than relying on hard-coded instructions. Supervised learning takes a guided, data-driven approach to identifying patterns and relationships in labeled datasets. It learns by comparing its predictions against the known labels and adjusting its model to minimize errors, then generalizes from those examples to predict outcomes for new, unseen data.
机器学习是人工智能（AI）的一个子集，它使用数据和统计方法来构建模仿人类推理的模型，而不是依赖于硬编码指令。监督学习采用有指导的、数据驱动的方法来识别标记数据集中的模式和关系。它通过将其预测与已知标签进行比较并调整模型以尽量减少误差来学习，然后从这些示例中进行泛化，以预测新的、未见过的数据的结果。
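
The idea of learning by comparing predictions against known labels and adjusting to reduce the error can be sketched in a few lines. This is a minimal illustration, not a production method: the data, the one-parameter model `y_hat = w * x`, the learning rate, and the function name `train_weight` are all assumptions made for the example.

```python
# Labeled training data: each input x is paired with its correct answer y.
# The values are made up for illustration and follow y = 2 * x exactly.
inputs = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]


def train_weight(xs, ys, learning_rate=0.01, epochs=500):
    """Fit the one-parameter model y_hat = w * x by gradient descent."""
    w = 0.0  # start from an uninformed guess
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w:
        # compare each prediction (w * x) against its label (y).
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Adjust the model in the direction that reduces the error.
        w -= learning_rate * gradient
    return w


weight = train_weight(inputs, labels)
prediction_for_5 = weight * 5.0  # prediction for an input not in the training data
```

After training, `weight` is very close to 2, so the model predicts roughly 10 for the unseen input 5.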

Supervised vs. unsupervised learning

监督学习与无监督学习

In contrast to supervised learning, which uses labeled data, unsupervised learning finds patterns in unlabeled data.
与使用标记数据的监督学习相比，无监督学习可以在未标记的数据中找到模式。

Without the “supervision” provided by explicit right answers in the training data, unsupervised learning treats everything it sees as data to analyze for patterns and groupings. The three main types are:
如果没有训练数据中明确的正确答案所提供的“监督”，无监督学习会把它看到的一切都当作数据，从中分析模式和分组。三种主要类型是：

Clustering: This technique groups data points that are closest or most similar to one another. It is useful for customer segmentation or document sorting.
聚类：此技术将彼此最接近或最相似的数据点归为一组。它对于客户细分或文档整理很有用。
Association: This technique determines which things tend to co-occur, most notably to co-locate items frequently bought together or to suggest what to stream next.
关联：此技术确定哪些事物往往同时出现，最常用于把经常一起购买的商品摆放在一起，或推荐接下来要播放的内容。
Dimensionality reduction: Shrinking datasets to be easier to process while preserving all or most of the details.
降维：缩小数据集以更易于处理，同时保留全部或大部分细节。
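
To make the clustering idea above concrete, here is a minimal sketch of k-means on one-dimensional data. Note that no labels are involved: the algorithm only sees the values. The data (hypothetical monthly spend per customer), the starting centers, and the function name `k_means_1d` are assumptions chosen for illustration.

```python
def k_means_1d(points, centers, iterations=10):
    """Group 1-D points around the nearest of the given starting centers.

    Returns the final centers and the clusters from the last assignment step.
    """
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend for six customers; there are no labels.
monthly_spend = [12, 15, 14, 90, 95, 88]
centers, clusters = k_means_1d(monthly_spend, centers=[0, 100])
```

With these values the customers fall into a low-spend group and a high-spend group, which is the kind of segmentation the text describes.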

On the other hand, supervised learning makes sense when you want the model to make decisions. Major applications include:
另一方面，当您希望模型做出决策时，监督学习是有意义的。主要应用包括：

Yes or no decisions: Marking data as either one class or another. Often used for filtering like spam or fraud detection.
是或否决定： 将数据标记为一个类或另一个类。通常用于垃圾邮件或欺诈检测等过滤。
Classification: Figuring out which of several classes something belongs to, such as identifying objects within an image or recognizing speech.
分类：弄清楚某物属于几个类别中的哪一类，例如识别图像中的对象或识别语音。
Regression: Predicting continuous values based on historical data, such as forecasting house prices or weather conditions.
回归：根据历史数据预测连续值，例如预测房价或天气状况。

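Other types of ML do not fit neatly into either category. Semi-supervised and self-supervised learning sit between the two, while reinforcement learning learns from rewards and penalties rather than from labeled examples.
其他类型的机器学习并不完全属于这两类中的任何一类。半监督学习和自监督学习介于两者之间，而强化学习则从奖励和惩罚中学习，而不是从带标签的示例中学习。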

How supervised learning works

监督学习的工作原理

Supervised learning involves a structured process of choosing and formatting data, running the model, and testing its performance.
监督学习涉及选择和格式化数据、运行模型以及测试其性能的结构化过程。

Here’s a brief overview of the supervised learning process:
以下是监督学习过程的简要概述：

1 Labeling: Labeled data is essential for learning the correct association between inputs and outputs. For instance, if you’re creating a model to analyze sentiment in product reviews, start by having human evaluators read the reviews and mark them as positive, negative, or neutral.
1 标记：标记数据对于学习输入和输出之间的正确关联至关重要。例如，如果您正在创建一个模型来分析产品评论中的情绪，首先让人工评估员阅读评论并将它们标记为正面、负面或中立。

2 Data collection and cleaning: Ensure your training data is comprehensive and representative. Clean the data by removing duplicates, correcting errors, and handling any missing values to prepare it for analysis.
2 数据收集和清理：确保您的训练数据全面且具有代表性。通过删除重复项、更正错误和处理任何缺失值来清理数据，以便为分析做好准备。

3 Feature selection and extraction: Identify and select the most influential attributes, making the model more efficient and effective. This step may also involve creating new features from existing ones to better capture the underlying patterns in the data, such as converting date of birth to age.
3 特征选择和提取：识别和选择最有影响力的属性，使模型更加高效和有效。此步骤还可能涉及从现有要素创建新要素，以更好地捕获数据中的底层模式，例如将出生日期转换为年龄。

4 Data splitting: Divide the dataset into training and testing sets. Use the training set to train the model, and the testing set to see how well it generalizes to new, unseen data.
4 数据拆分：将数据集分为训练集和测试集。使用训练集来训练模型，使用测试集来查看它对新的、看不见的数据的泛化程度。

5 Algorithm selection: Choose a supervised learning algorithm based on the task and data characteristics. You can also run and compare multiple algorithms to find the best one.
5 算法选择：根据任务和数据特性选择监督学习算法。您还可以运行和比较多种算法以找到最佳算法。

6 Model training: Train the model using the data to improve its predictive accuracy. During this phase, the model learns the relationship between inputs and outputs by iteratively minimizing the error between its predictions and the actual labels provided in the training data. Depending on the algorithm’s complexity and the dataset’s size, this could take seconds to days.
6 模型训练：使用数据训练模型，以提高其预测准确性。在此阶段，模型通过迭代地最小化其预测值与训练数据中提供的实际标签之间的误差来学习输入和输出之间的关系。根据算法的复杂性和数据集的大小，这可能需要几秒钟到几天的时间。

7 Model evaluation: Evaluate the model on the testing set to check whether it produces reliable and accurate predictions on new data. This is a key difference from unsupervised learning: Since you know the expected output, you can measure how well the model performed.
7 模型评估：在测试集上评估模型，以检查它能否对新数据做出可靠而准确的预测。这是与无监督学习的一个关键区别：由于您知道预期的输出，因此可以衡量模型的表现。

8 Model tuning: Adjust the model’s hyperparameters (settings chosen before training, such as the learning rate or the depth of a decision tree) and retrain to fine-tune performance. This iterative process, called hyperparameter tuning, aims to optimize the model and prevent issues like overfitting. Re-evaluate the model after each adjustment.
8 模型调优：调整模型的超参数（在训练之前选定的设置，例如学习率或决策树的深度）并重新训练，以微调性能。这种迭代过程称为超参数调优，旨在优化模型并防止过拟合等问题。每次调整后都应重新评估模型。

9 Deployment and monitoring: Deploy the trained model to make predictions on new data in a real-world setting. For example, deploy the trained spam detection model to filter emails, monitor its performance, and adjust as needed.
9 部署和监视：部署经过训练的模型，以在真实环境中对新数据进行预测。例如，部署经过训练的垃圾邮件检测模型来过滤电子邮件、监控其性能并根据需要进行调整。

10 Fine-tuning over time: As you gather more real-world data, continue to train the model to become more accurate and relevant.
10 随着时间的推移进行微调：随着您收集更多真实世界的数据，请继续训练模型以使其更加准确和相关。
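
Several of the steps above (cleaning, feature extraction, splitting, training, and evaluation) can be sketched end to end. This is a toy illustration under stated assumptions: the records, the labels, the fixed `REFERENCE_YEAR`, the nearest-centroid classifier, and every function name are hypothetical choices made to keep the example short, not a recommended pipeline.

```python
import random

REFERENCE_YEAR = 2024  # assumption: fixed so the example is reproducible

# Hypothetical labeled records: (birth_year, label). One duplicate and one
# missing value are included on purpose to exercise the cleaning step.
raw_records = [
    (2004, "student"), (2003, "student"), (2002, "student"),
    (2005, "student"), (2001, "student"), (2004, "student"),  # duplicate
    (1955, "retiree"), (1950, "retiree"), (1958, "retiree"),
    (1952, "retiree"), (1949, "retiree"), (None, "retiree"),  # missing value
]


def clean(records):
    """Step 2: drop rows with missing values and remove exact duplicates."""
    seen, cleaned = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


def extract_features(records):
    """Step 3: convert birth year to age."""
    return [(REFERENCE_YEAR - year, label) for year, label in records]


def split(records, test_fraction=0.3, seed=0):
    """Step 4: shuffle, then hold out a fraction of the rows for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(records):
    """Step 6: nearest-centroid training, storing the mean age of each class."""
    by_label = {}
    for age, label in records:
        by_label.setdefault(label, []).append(age)
    return {label: sum(ages) / len(ages) for label, ages in by_label.items()}


def predict(model, age):
    """Predict the class whose mean age is closest to the input."""
    return min(model, key=lambda label: abs(age - model[label]))


def accuracy(model, records):
    """Step 7: fraction of held-out rows whose prediction matches the label."""
    correct = sum(predict(model, age) == label for age, label in records)
    return correct / len(records)


cleaned = extract_features(clean(raw_records))
train_set, test_set = split(cleaned)
model = train(train_set)
test_accuracy = accuracy(model, test_set)
```

Because the two hypothetical classes are far apart, the held-out accuracy here is 1.0. Real data is rarely this clean, which is why the tuning and monitoring steps exist.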

Types of supervised learning

监督学习的类型

There are two main types of supervised learning: classification and regression. Each type has its own sub-types and specific use cases. Let’s explore them in more detail:
监督学习有两种主要类型：分类和回归。每种类型都有自己的子类型和特定用例。让我们更详细地探讨它们：

Classification 分类

Classification involves predicting which category or class an input belongs to. Various sub-types and concepts are used to handle different classification problems. Here are some popular types:
分类涉及预测输入属于哪个类别或类。使用各种子类型和概念来处理不同的分类问题。以下是一些流行的类型：

Binary classification: The model predicts one of two possible classes. This is useful when the outcome is binary, meaning there are only two possible states or categories. This approach is used in decisions where a clear distinction is needed.
二元分类： 该模型预测两个可能的类别之一。当结果是二进制的时，这很有用，这意味着只有两种可能的状态或类别。这种方法用于需要明确区分的决策。
Multi-class classification: Like binary classification, but with more than two possible classes, exactly one of which is the right answer for each input. This approach is used when there are multiple categories that an input can belong to.
多类分类：与二元分类类似，但有两个以上的可能类别，每个输入只有一个正确答案。当输入可能属于多个类别之一时，使用此方法。
Multi-label classification: Each input can belong to multiple classes simultaneously. Unlike binary or multi-class classification, where each input is assigned to a single class, multi-label classification allows for assigning multiple labels to a single input. This is a more complex analysis because rather than just choosing whichever class the input is most likely to belong to, you need to decide a probability threshold for inclusion.
多标签分类：每个输入可以同时属于多个类。在二元分类或多类分类中，每个输入只分配给单个类，而多标签分类允许将多个标签分配给单个输入。这种分析更为复杂，因为您不能只选择输入最有可能属于的那个类别，而是需要确定一个概率阈值来决定是否纳入某个标签。
Logistic regression: An application of regression (see below) to binary classification. Rather than a simple this-or-that answer, it outputs the probability that an input belongs to the positive class, which tells you how confident the prediction is.
Logistic 回归：回归（见下文）在二元分类中的应用。它输出的不是简单的非此即彼的答案，而是输入属于正类的概率，从而告诉您预测的置信度。
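
The difference between multi-class and multi-label prediction described above comes down to how per-class scores are turned into an answer. The sketch below assumes a trained model has already produced independent per-class scores for one input; the scores, the 0.5 threshold, and the function names are hypothetical.

```python
# Hypothetical per-class scores from some trained model for one input image.
scores = {"cat": 0.72, "dog": 0.65, "car": 0.05}


def predict_multi_class(class_scores):
    """Multi-class: exactly one answer, the highest-scoring class."""
    return max(class_scores, key=class_scores.get)


def predict_multi_label(class_scores, threshold=0.5):
    """Multi-label: every class whose score clears the threshold."""
    return sorted(c for c, s in class_scores.items() if s >= threshold)


single = predict_multi_class(scores)   # one class only
several = predict_multi_label(scores)  # possibly several classes
```

Here `single` is `"cat"`, while `several` is `["cat", "dog"]`. Raising the threshold to 0.7 would leave only `"cat"`, which is why choosing the threshold is part of the modeling work.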

There are several ways to measure the quality of a classification model, including:
有几种方法可以衡量分类模型的质量，包括：

Accuracy: How many of the total predictions were correct?
准确性： 总预测中有多少是正确的？
Precision: How many of the predicted positives are actually positive?
精确率：在预测为正例的样本中，有多少确实是正例？
Recall: How many of the actual positives did it mark as positive?
召回率：在实际的正例中，有多少被它标记为正例？
F1 score: On a scale of 0 to 1, how well does the model balance precision and recall? It is the harmonic mean of the two.
F1 分数：在 0 到 1 的范围内，模型在精确率和召回率之间的平衡程度如何？它是两者的调和平均数。
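
These four measures can be computed directly from the actual and predicted labels. The following is a small sketch for the binary case; the label lists and the function name `classification_metrics` are made up for illustration.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives

    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
```

In this example the model makes one false positive and one false negative, giving an accuracy of 0.8 and a precision, recall, and F1 score of 0.75 each.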

Regression 回归

Regression involves predicting a continuous value based on input features, outputting a number that can also be called a prediction. Various types of regression models are used to capture the relationships between these input features and the continuous output. Here are some popular types:
回归涉及根据输入特征预测连续值，输出一个数字，该数字也可以称为预测。各种类型的回归模型用于捕获这些输入特征与连续输出之间的关系。以下是一些流行的类型：

Linear regression: Models the relationship between the input features and the output as a straight line. The model assumes a linear relationship between the dependent variable (the output) and the independent variables (the inputs). The goal is to find the best-fitting line through the data points that minimizes the difference between the predicted and actual values.
线性回归： 将输入要素和输出之间的关系建模为一条直线。该模型假设因变量（输出）和自变量（输入）之间存在线性关系。目标是通过数据点找到最佳拟合线，以最小化预测值和实际值之间的差异。
Polynomial regression: More complex than linear regression because it adds polynomial terms, such as the square and cube of an input, to capture more complex relationships between the input and output variables. The model can fit nonlinear data by using these higher-order terms.
多项式回归：比线性回归更复杂，因为它加入了多项式项（如输入的平方和立方）来捕获输入变量和输出变量之间更复杂的关系。该模型可以通过使用这些高阶项来拟合非线性数据。
Ridge and lasso regression: Both address the problem of overfitting, which is the tendency of a model to read too much into the data it’s trained on at the expense of generalizing. They do so by adding a penalty on large coefficients. Ridge regression shrinks all coefficients toward zero, reducing the model’s sensitivity to small details, while lasso regression can shrink some coefficients to exactly zero, eliminating less important features from consideration.
岭回归和套索回归：两者都用于解决过拟合问题，即模型以牺牲泛化能力为代价，对训练数据过度解读的倾向。它们的做法是对较大的系数施加惩罚。岭回归将所有系数向零收缩，从而降低模型对小细节的敏感性，而套索回归可以将某些系数收缩到恰好为零，从而把不太重要的特征排除在外。
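
The contrast between ridge and lasso can be seen in the simplest possible setting: a single input feature and no intercept, where both have closed-form solutions. The sketch below rests on those simplifying assumptions; the data, the penalty strength `lam`, and the function names are hypothetical, and each feature is fitted on its own purely for illustration.

```python
import math


def ols_weight(xs, ys):
    """Plain least squares for y_hat = w * x (no penalty)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ridge_weight(xs, ys, lam):
    """Minimizes 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)  # the penalty shrinks w but never reaches zero


def lasso_weight(xs, ys, lam):
    """Minimizes 0.5 * sum((y - w*x)^2) + lam * |w| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam, 0.0)  # weak relationships are cut to zero
    return math.copysign(shrunk, sxy) / sxx


xs = [1.0, 2.0, 3.0, 4.0]
strong_ys = [2.0, 4.0, 6.0, 8.0]   # strongly related to xs
weak_ys = [0.1, -0.1, 0.2, -0.1]   # almost unrelated to xs

strong = (ols_weight(xs, strong_ys), ridge_weight(xs, strong_ys, 10.0),
          lasso_weight(xs, strong_ys, 10.0))
weak = (ols_weight(xs, weak_ys), ridge_weight(xs, weak_ys, 10.0),
        lasso_weight(xs, weak_ys, 10.0))
```

For the strong relationship, both penalties shrink the unpenalized weight of 2.0 (to 1.5 for ridge and about 1.67 for lasso). For the weak relationship, ridge leaves a small nonzero weight while lasso sets it to exactly zero, which is the feature-elimination behavior described above.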

Most measurements of regression quality have to do with how far off the predictions are from the actual values. The questions they answer are:
回归质量的大多数度量都与预测值与实际值的差距有关。他们回答的问题是：

Mean absolute error: On average, how far off are the predictions from the actual values?
平均绝对误差：平均而言，预测值与实际值相差多远？
Mean squared error: On average, how large are the squared errors? Squaring makes large errors count for more than small ones.
均方误差：平均而言，误差的平方有多大？平方运算使大误差的影响超过小误差。
Root mean squared error: What is the typical size of an error, expressed in the same units as the output? It is the square root of the mean squared error.
均方根误差：以与输出相同的单位表示，误差的典型大小是多少？它是均方误差的平方根。
R-squared: How much of the variation in the actual values does the regression explain?
R 平方：回归解释了实际值中多大比例的变化？
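
The four regression measures above can be computed in a few lines. This is a sketch with made-up numbers; the function name `regression_metrics` is hypothetical, and the R-squared calculation assumes the actual values are not all identical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, and R-squared for paired values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R-squared compares the model's squared error with that of always
    # predicting the mean of the actual values.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r_squared": r_squared}


# Hypothetical values; the last prediction is off by 4, the others by at most 1.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]
metrics = regression_metrics(actual, predicted)
```

Here the mean absolute error is 1.5, but the mean squared error is 4.5 because the single large miss dominates once errors are squared. The root mean squared error is about 2.12, and R-squared is 0.1.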

Applications of supervised learning

监督学习的应用

Supervised learning has a wide range of applications across various industries. Here are some common examples:
监督学习在各个行业都有广泛的应用。以下是一些常见示例：

Spam detection: Email services use binary classification to decide whether an email should hit your inbox or be routed to spam. They continually improve in response to people marking emails in the spam folder as not spam, and vice versa.
垃圾邮件检测： 电子邮件服务使用二元分类来决定电子邮件是应发送到您的收件箱还是应被路由到垃圾邮件。当人们将垃圾邮件文件夹中的电子邮件标记为非垃圾邮件时，它们会不断改进，反之亦然。
Image recognition: Models are trained on labeled images to recognize and categorize objects. Examples include Apple’s Face ID feature, which unlocks your tablet or mobile device, optical character recognition (OCR) for turning printed words into digital text, and object detection for self-driving cars.
图像识别： 在标记图像上训练模型，以识别和分类对象。例如，Apple 的 Face ID 功能可以解锁您的平板电脑或移动设备，光学字符识别（OCR）可以将印刷文字转换为数字文本，以及用于自动驾驶汽车的物体检测。
Medical diagnosis: Supervised models can predict diseases and suggest potential diagnoses using patient data and medical records. For instance, models can be trained to recognize cancerous tumors in MRIs or develop diabetes management plans.
医学诊断：监督模型可以使用患者数据和医疗记录预测疾病并提出可能的诊断建议。例如，可以训练模型以识别MRI中的癌性肿瘤或制定糖尿病管理计划。
Fraud detection: Financial institutions use supervised learning to identify fraudulent transactions by analyzing patterns in labeled transaction data.
欺诈检测： 金融机构使用监督学习，通过分析标记交易数据中的模式来识别欺诈交易。
Sentiment analysis: Whether measuring positive or negative reactions or emotions such as happiness or disgust, manually tagged datasets teach models to interpret input such as social media posts, product reviews, or survey results.
情感分析：无论是衡量正面或负面的反应，还是快乐或厌恶等情绪，手动标记的数据集都能教会模型解读社交媒体帖子、产品评论或调查结果等输入。
Predictive maintenance: Based on historical performance data and environmental factors, models can predict when machines are likely to fail so they can be repaired or replaced before they do.
预测性维护： 根据历史性能数据和环境因素，模型可以预测机器何时可能发生故障，以便在故障发生之前进行维修或更换。

Advantages of supervised learning

监督学习的优势

Accurate and predictable. Assuming they’ve been given good data, supervised learning models tend to be more accurate on the task they were trained for than methods that learn without labels. Simpler models are typically deterministic, meaning a given input will always produce the same output.
准确且可预测。假设获得了良好的数据，监督学习模型在其训练所针对的任务上往往比不使用标签的学习方法更准确。较简单的模型通常是确定性的，这意味着给定的输入将始终产生相同的输出。
Clear objective. Thanks to supervision, you know what your model is trying to accomplish. This is a clear contrast to unsupervised and self-supervised learning.
明确的目标。多亏了监督，您知道您的模型要完成什么。这与无监督学习和自我监督学习形成鲜明对比。
Easy to evaluate. There are several quality measures at your disposal for judging the accuracy of both classification and regression models.
易于评估。您可以使用多种质量度量来判断分类模型和回归模型的准确性。
Interpretable. Many supervised models use techniques, such as regressions and decision trees, that are relatively straightforward for data scientists to understand. Interpretability improves decision-makers’ confidence, especially in high-impact settings and regulated industries.
可解释。许多监督模型使用的技术（如回归和决策树）对于数据科学家来说相对容易理解。可解释性可以提高决策者的信心，尤其是在高影响力环境和受监管的行业中。

Disadvantages of supervised learning

监督学习的缺点

Requires labeled data. Your data has to have clear inputs and labels. This is often a challenge for classification training, with many thousands (if not millions) of people employed to annotate data manually.
需要标记的数据。 您的数据必须有清晰的输入和标签。对于分类训练来说，这通常是一个挑战，因为需要雇用数千（如果不是数百万）人来手动注释数据。
Errors and inconsistent judgment in training data. With human labeling comes human fallibility, such as errors, typos, and differences of opinion. The last of these is a particularly challenging aspect of sentiment analysis; high-quality sentiment training data typically requires multiple people to evaluate a given data point with a result recorded only if there’s agreement.
训练数据中的错误和不一致的判断。人工标注难免带来人为失误，例如错误、错别字和意见分歧。其中最后一点是情感分析中特别具有挑战性的一个方面；高质量的情感训练数据通常需要多人评估给定的数据点，只有在达成一致的情况下才会记录结果。
Overfitting. Often a model will learn patterns that fit the training data very well, noise included, but work poorly on data it hasn’t yet seen. A careful trainer will always look for overfitting and use techniques to reduce the impact.
过拟合。通常，模型学到的模式对训练数据（包括其中的噪声）拟合得非常好，但对尚未见过的数据效果不佳。细心的训练者总是会留意过拟合，并使用相应技术来减少其影响。
Restricted to known patterns. If your stock price prediction model is based only on data from a bull market, it won’t be very accurate once a bear market hits. Accordingly, be sensitive to the limitations of the data you’ve shown your model, and consider whether to find training data that will expose it to more circumstances or to disregard its output in situations it was never trained on.
仅限于已知模式。如果您的股票价格预测模型只基于牛市的数据，那么一旦熊市来袭，它就不会非常准确。因此，请留意您提供给模型的数据的局限性，并考虑是寻找能让它接触更多情况的训练数据，还是在它从未训练过的情形下不采用其输出。
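
The agreement rule mentioned under "Errors and inconsistent judgment in training data" (record a label only when the annotators agree) can be sketched as follows. The review texts, the votes, the `min_agreement` parameter, and the function name `agreed_label` are hypothetical; real labeling workflows may use other agreement measures.

```python
from collections import Counter


def agreed_label(votes, min_agreement=1.0):
    """Return the most common label if enough annotators chose it, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical sentiment annotations from three people per review.
annotations = {
    "Great battery life": ["positive", "positive", "positive"],
    "It arrived on Tuesday": ["neutral", "positive", "neutral"],
}

# Keep only the reviews on which all annotators agreed.
training_rows = {}
for text, votes in annotations.items():
    label = agreed_label(votes)
    if label is not None:
        training_rows[text] = label
```

With the default of unanimous agreement, only the first review is kept. Lowering `min_agreement` to two thirds would also keep the second review with the label `"neutral"`, trading label quality for more training data.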