Machine Learning Study Notes: 1.1.1.2.3 Supervised learning part 2

Latest recommended article published 2024-09-14 13:56:55

预见未来to50


Category: Machine Learning, Deep Learning (ML/DL). Tags: machine learning study notes

Article link: https://blog.csdn.net/hpdlzu80100/article/details/142255602


So supervised learning algorithms learn to predict input-to-output, or X to Y, mappings. In the last video you saw that regression algorithms, which are a type of supervised learning algorithm, learn to predict numbers out of infinitely many possible numbers. There's a second major type of supervised learning algorithm called a classification algorithm.

Let's take a look at what this means. Take breast cancer detection as an example of a classification problem. Say you're building a machine learning system so that doctors can have a diagnostic tool to detect breast cancer. This is important because early detection could potentially save a patient's life. Using a patient's medical records, your machine learning system tries to figure out if a tumor, that is, a lump, is malignant, meaning cancerous or dangerous, or if that tumor is benign, meaning that it's just a lump that isn't cancerous and isn't that dangerous. Some of my friends have actually been working on this specific problem. So maybe your dataset has tumors of various sizes. These tumors are labeled as either benign, which I will designate in this example with a 0, or malignant, which I will designate in this example with a 1. You can then plot your data on a graph where the horizontal axis represents the size of the tumor and the vertical axis takes on only two values, 0 or 1, depending on whether the tumor is benign (0) or malignant (1).
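To make the 0/1 labeling concrete, here is a minimal sketch in Python. The tumor sizes and labels are invented for illustration, and the threshold rule is only one simple way a classifier could use tumor size; the lecture does not specify an algorithm. The names `fit_threshold` and `predict` are hypothetical.

```python
# Invented toy data: tumor sizes (arbitrary units) and labels.
# 0 = benign, 1 = malignant, matching the labeling used in the lecture.
sizes = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_threshold(xs, ys):
    """Return the midpoint between neighboring sizes that misclassifies the fewest examples."""
    points = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(t):
        # An example is an error when "size >= t" disagrees with its label.
        return sum((x >= t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def predict(size, threshold):
    """Classify a tumor as malignant (1) if it is at least as large as the threshold."""
    return 1 if size >= threshold else 0


threshold = fit_threshold(sizes, labels)
print(threshold, predict(1.8, threshold), predict(4.2, threshold))
```

Note that the output of `predict` is always 0 or 1, never a value in between, which is the property that makes this a classification problem.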

One reason that this is different from regression is that we're trying to predict only a small number of possible outputs or categories. In this case there are two possible outputs, 0 or 1, benign or malignant. This is different from regression, which tries to predict any number out of infinitely many possible numbers. And so the fact that there are only two possible outputs is what makes this classification. Because there are only two possible outputs or two possible categories in this example, you can also plot this dataset on a line.

Right now, I'm going to use two different symbols to denote the category, using a circle, an O, to denote the benign examples and a cross to denote the malignant examples. And if a new patient walks in for a diagnosis and they have a lump of a given size, then the question is, will your system classify this tumor as benign or malignant? It turns out that in classification problems you can also have more than two possible output categories. Maybe your learning algorithm can output multiple types of cancer diagnosis if the tumor turns out to be malignant. So let's call two different types of cancer type 1 and type 2. In this case the algorithm would have three possible output categories it could predict. And by the way, in classification, the terms output classes and output categories are often used interchangeably. So when I say class or category when referring to the output, it means the same thing.
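The three-category case can be sketched the same way. The example below uses a nearest-centroid rule, which is an assumption chosen for brevity and not something the lecture prescribes. The data is invented, and using tumor size alone to tell cancer types apart is purely illustrative.

```python
from collections import defaultdict

# Invented data: (tumor size, class).
# 0 = benign, 1 = malignant type 1, 2 = malignant type 2.
data = [
    (1.0, 0), (1.4, 0), (1.8, 0),
    (3.0, 1), (3.4, 1), (3.8, 1),
    (5.0, 2), (5.4, 2), (5.8, 2),
]


def fit_centroids(examples):
    """Compute the mean tumor size of each class."""
    groups = defaultdict(list)
    for size, cls in examples:
        groups[cls].append(size)
    return {cls: sum(values) / len(values) for cls, values in groups.items()}


def predict(size, centroids):
    """Return the class whose mean size is closest to the given size."""
    return min(centroids, key=lambda cls: abs(size - centroids[cls]))


centroids = fit_centroids(data)
predictions = [predict(s, centroids) for s in (1.2, 3.5, 6.0)]
print(predictions)
```

Whatever the input size, the prediction is always one of the three classes 0, 1, or 2.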

So to summarize, classification algorithms predict categories. Categories don't have to be numbers. They can be non-numeric; for example, a classifier can predict whether a picture is that of a cat or a dog, or whether a tumor is benign or malignant. Categories can also be numbers like 0, 1 or 0, 1, 2. But what makes classification different from regression when you're interpreting the numbers is that classification predicts a small, finite set of possible output categories such as 0, 1 and 2, but not all possible numbers in between like 0.5 or 1.7. In the example of supervised learning that we've been looking at, we had only one input value, the size of the tumor. But you can also use more than one input value to predict an output.
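As a small illustration of non-numeric categories, the sketch below maps class names to integer codes and back. The helper names `encode` and `decode` are hypothetical; the point is only that the integers are labels for a finite set, so a value such as 0.5 does not correspond to any category.

```python
# The set of categories is small and finite; the names come from the lecture's example.
classes = ["cat", "dog"]

# Integer codes are just identifiers for the categories, not quantities.
class_to_index = {name: i for i, name in enumerate(classes)}
index_to_class = {i: name for name, i in class_to_index.items()}


def encode(names):
    """Convert category names to their integer codes."""
    return [class_to_index[name] for name in names]


def decode(indices):
    """Convert integer codes back to category names."""
    return [index_to_class[i] for i in indices]


codes = encode(["dog", "cat", "dog"])
print(codes, decode(codes))

# A number between the codes is not a valid category.
print(0.5 in index_to_class)
```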

Here's an example. Instead of just knowing the tumor size, say you also have each patient's age in years. Your new dataset now has two inputs, age and tumor size. In this new dataset we're going to use circles to show patients whose tumors are benign and crosses to show the patients with a tumor that was malignant. So when a new patient comes in, the doctor can measure the patient's tumor size and also record the patient's age. And so given this, how can we predict if this patient's tumor is benign or malignant? Well, given a dataset like this, what the learning algorithm might do is find some boundary that separates out the malignant tumors from the benign ones. So the learning algorithm has to decide how to fit a boundary line through this data. The boundary line found by the learning algorithm would help the doctor with the diagnosis.
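One way a learning algorithm could fit such a boundary line is the perceptron rule, sketched below. This is an assumption made for illustration; the lecture does not say which algorithm finds the boundary. The patient data is invented, and dividing age by 10 is an arbitrary scaling choice so that both inputs have a similar range.

```python
# Invented toy data: (age in years, tumor size in cm, label); 0 = benign, 1 = malignant.
patients = [
    (30, 1.0, 0), (40, 1.5, 0), (50, 1.0, 0), (60, 2.0, 0),
    (40, 4.0, 1), (50, 4.5, 1), (60, 5.0, 1), (70, 4.0, 1),
]


def features(age, size):
    # Assumption: express age in decades so both inputs have a similar scale.
    return (age / 10.0, size)


def train_perceptron(rows, max_epochs=1000):
    """Fit a line w[0]*x0 + w[1]*x1 + b = 0 that separates the two classes."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        updates = 0
        for age, size, label in rows:
            x = features(age, size)
            y = 1 if label == 1 else -1  # the perceptron rule uses -1/+1 targets
            score = w[0] * x[0] + w[1] * x[1] + b
            if y * score <= 0:  # misclassified or on the boundary: nudge the line
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                updates += 1
        if updates == 0:  # a full pass with no mistakes: the line separates the data
            break
    return w, b


def predict(age, size, w, b):
    """Return 1 (malignant) on the positive side of the line, else 0 (benign)."""
    x = features(age, size)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


w, b = train_perceptron(patients)
train_predictions = [predict(age, size, w, b) for age, size, _ in patients]
new_patient = predict(50, 1.2, w, b)  # hypothetical new patient
print(train_predictions, new_patient)
```

The learned weights `w` and bias `b` define the boundary line; a new patient is classified by which side of that line their (age, tumor size) point falls on. The perceptron only converges when the classes can be separated by a straight line, which holds for this invented dataset but not for real data in general.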

In this case the tumor is more likely to be benign. From this example we have seen how two inputs, the patient's age and tumor size, can be used. In other machine learning problems, often many more input values are required. My friends who worked on breast cancer detection use many additional inputs, like the thickness of the tumor clump, uniformity of the cell size, uniformity of the cell shape and so on. So to recap, supervised learning maps input x to output y, where the learning algorithm learns from the "right answers". The two major types of supervised learning are regression and classification. In a regression application like predicting prices of houses, the learning algorithm has to predict numbers from infinitely many possible output numbers. Whereas in classification the learning algorithm has to make a prediction of a category, out of a small set of possible outputs.

So you now know what supervised learning is, including both regression and classification. I hope you're having fun. Next, there's a second major type of machine learning called unsupervised learning. Let's go on to the next video to see what that is.
