Machine Learning Study Notes: 1.1.1.2.4 Unsupervised learning part 1

预见未来to50

Published 2024-09-14 13:56:55

Article link: https://blog.csdn.net/hpdlzu80100/article/details/142256691

Unsupervised learning part 1

After supervised learning, the most widely used form of machine learning is unsupervised learning. Let's take a look at what that means. We've talked about supervised learning, and this video is about unsupervised learning. But don't let the name "unsupervised" fool you: unsupervised learning is, I think, just as super as supervised learning.

Recall that when we looked at supervised learning in the last video, it looked something like this in the case of a classification problem. Each example was associated with an output label y, such as benign or malignant, designated by the circles and crosses. In unsupervised learning, we're given data that isn't associated with any output labels y. Say you're given data on patients, with their tumor size and the patient's age, but not whether the tumor was benign or malignant, so the dataset looks like this on the right. We're not asked to diagnose whether the tumor is benign or malignant, because we're not given any labels y in the dataset. Instead, our job is to find some structure or some pattern, or just find something interesting in the data. This is unsupervised learning. We call it unsupervised because we're not trying to supervise the algorithm to give some, quote, right answer for every input. Instead, we ask the algorithm to figure out all by itself what's interesting, or what patterns or structures might be in this particular dataset.

An unsupervised learning algorithm might decide that the data can be assigned to two different groups, or two different clusters. And so it might decide that there's one cluster or group over here, and there's another cluster or group over here. This is a particular type of unsupervised learning called a clustering algorithm, because it places the unlabeled data into different clusters, and this turns out to be used in many applications.
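To make the idea of placing unlabeled data into clusters concrete, here is a minimal sketch. The lecture does not name a specific algorithm at this point, so the use of k-means is an assumption, and the patient values are made up for illustration. Note that the code never sees a label y; it only sees age and tumor size.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group unlabeled points into k clusters with plain k-means (an assumed choice)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical (age in years, tumor size in cm) pairs; there are no labels y.
patients = [(25, 1.0), (30, 1.4), (28, 1.2), (33, 1.1),
            (61, 4.0), (66, 4.6), (70, 4.2), (64, 3.8)]

assignments, centroids = kmeans(patients, k=2)
print(assignments)  # one cluster index per patient
```

The cluster indices themselves carry no meaning such as benign or malignant; they only say which patients the algorithm placed in the same group. In this sketch the number of clusters, k=2, is supplied by hand, and age dominates the distance because the features are not rescaled.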

For example, clustering is used in Google News. What Google News does is, every day, it goes and looks at hundreds of thousands of news articles on the internet and groups related stories together. For example, here is a sample from Google News, where the headline of the top article is "Giant panda gives birth to rare twin cubs at Japan's oldest zoo." This article actually caught my eye because my daughter loves pandas, and so there are a lot of stuffed panda toys, and a lot of watching panda videos, in my house. Looking at this, you might notice that below this are other related articles. Maybe from the headlines alone, you can start to guess what clustering might be doing. Notice that the word panda appears here, here, here, here, and here, and notice that the word twin also appears in all five articles. And the word zoo also appears in all of these articles. So the clustering algorithm is finding, out of all the hundreds of thousands of news articles on the internet that day, the articles that mention similar words, and grouping them into clusters.

Now, what's cool is that this clustering algorithm figures out on its own which words suggest that certain articles are in the same group. What I mean is, there isn't an employee at Google News who's telling the algorithm to find articles with the words panda, twins, and zoo and put them into the same cluster. The news topics change every day, and there are so many news stories that it just isn't feasible to have people doing this every single day for all the topics the news covers. Instead, the algorithm has to figure out on its own, without supervision, what the clusters of news articles are today. So that's why this clustering algorithm is a type of unsupervised learning algorithm.
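As a rough illustration of grouping articles that mention similar words, the sketch below compares headlines by word overlap. This is not how Google News works internally, which the lecture does not describe; the Jaccard similarity measure, the greedy grouping rule, the threshold, the stopword list, and all headlines other than the panda one are assumptions made up for this example. The point is that nobody tells the code which words matter.

```python
import re

# A tiny hand-picked stopword list (an assumption, for illustration only).
STOPWORDS = {"a", "an", "the", "to", "at", "in", "of", "on", "for", "and", "s"}

def words(headline):
    """Return the lowercased content words of a headline as a set."""
    return set(re.findall(r"[a-z]+", headline.lower())) - STOPWORDS

def jaccard(a, b):
    """Shared words divided by total distinct words in the two sets."""
    return len(a & b) / len(a | b)

def group_headlines(headlines, threshold=0.2):
    """Greedy grouping: join the first cluster whose first headline is similar enough."""
    clusters = []
    for headline in headlines:
        for cluster in clusters:
            if jaccard(words(headline), words(cluster[0])) >= threshold:
                cluster.append(headline)
                break
        else:
            clusters.append([headline])  # nothing similar yet: start a new cluster
    return clusters

headlines = [
    "Giant panda gives birth to rare twin cubs at Japan's oldest zoo",
    "Twin panda cubs born at Tokyo zoo",                     # hypothetical
    "Zoo in Japan welcomes giant panda twin cubs",           # hypothetical
    "Central bank raises interest rates again",              # hypothetical
    "Interest rates rise as central bank fights inflation",  # hypothetical
]

clusters = group_headlines(headlines)
for cluster in clusters:
    print(cluster)
```

No rule in the code mentions panda, twin, or zoo. If tomorrow's headlines were about something else entirely, the same function would group them by whatever words they happen to share.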

Let's look at the second example of unsupervised learning, applied to clustering genetic or DNA data. This image shows a picture of DNA microarray data. These look like tiny grids of a spreadsheet, and each tiny column represents the genetic or DNA activity of one person. So for example, this entire column here is from one person's DNA, and this other column is from another person. Each row represents a particular gene. So just as an example, perhaps this row here might represent a gene that affects eye color, or this row here is a gene that affects how tall someone is. Researchers have even found a genetic link to whether someone dislikes certain vegetables, such as broccoli, or brussels sprouts, or asparagus. So next time someone asks you why you didn't finish your salad, you can tell them, maybe it's genetic. For DNA microarrays, the idea is to measure how much certain genes are expressed for each individual person. So these colors, red, green, gray, and so on, show the degree to which different individuals do or do not have a specific gene active. And what you can do is then run a clustering algorithm to group individuals into different categories, or different types of people. Like maybe these individuals are grouped together, and let's just call this type one. And these people are grouped into type two, and these people are grouped as type three. This is unsupervised learning because we're not telling the algorithm in advance that there is a type one person with certain characteristics, or a type two person with certain characteristics. Instead, what we're saying is: here's a bunch of data. I don't know what the different types of people are, but can you automatically find structure in the data, and automatically figure out what the major types of individuals are? We're not giving the algorithm the right answer for the examples in advance.
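The layout described here, with one row per gene and one column per person, can be sketched as a small matrix. The code below is only an illustration under stated assumptions: the expression values are invented, the choice of single-linkage agglomerative clustering is mine and not named in the lecture, and the target of three clusters is supplied by hand to mirror the "type one, type two, type three" picture.

```python
from itertools import combinations
from math import dist

# Hypothetical expression matrix: each row is one gene, each column is one person.
expression = [
    # p0   p1   p2   p3   p4   p5
    [0.9, 0.8, 0.1, 0.2, 0.5, 0.4],  # gene A
    [0.1, 0.2, 0.9, 0.8, 0.5, 0.6],  # gene B
    [0.8, 0.9, 0.2, 0.1, 0.9, 0.8],  # gene C
]

# Transpose so that each person becomes one vector of gene activity levels.
people = list(zip(*expression))

def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose closest members are nearest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(vectors[i], vectors[j])
                for i in clusters[pair[0]]
                for j in clusters[pair[1]]
            ),
        )
        clusters[a] += clusters.pop(b)  # a < b, so index a stays valid after the pop
    return sorted(sorted(c) for c in clusters)

types = agglomerative(people, n_clusters=3)
print(types)  # lists of person indices that ended up in the same group
```

The algorithm is never told what characterizes each type. It only receives the per-person vectors and returns which people it grouped together.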

This is unsupervised learning. Here's the third example. Many companies have huge databases of customer information. Given this data, can you automatically group your customers into different market segments so that you can more efficiently serve your customers? Concretely, the DeepLearning.AI team did some research to better understand the DeepLearning.AI community, and why different individuals take these classes, subscribe to The Batch weekly newsletter, or attend our AI events. Let's visualize the DeepLearning.AI community as this collection of people. Running clustering, that is, market segmentation, found a few distinct groups of individuals. One group's primary motivation is seeking knowledge to grow their skills. Perhaps this is you, and so that's great. A second group's primary motivation is looking for a way to develop their career. Maybe you want to get a promotion or a new job, or make some career progression. If this describes you, that's great too. And yet another group wants to stay updated on how AI impacts their field of work. Perhaps this is you, and that's great too. This is a clustering that our team used to try to better serve our community, as we're trying to figure out what the major categories of learners in the DeepLearning.AI community are. So if any of these is your top motivation for learning, that's great, and I hope I'll be able to help you on your journey. Or in case this is you, and you want something totally different from the other three categories, that's fine too, and I want you to know, I love you all the same.

So to summarize, a clustering algorithm, which is a type of unsupervised learning algorithm, takes data without labels and tries to automatically group them into clusters. And so maybe the next time you see or think of a panda, you'll think of clustering as well. Besides clustering, there are other types of unsupervised learning too. Let's go on to the next video to take a look at some other types of unsupervised learning algorithms.
