01. Introduction to Classification
Classification is a core problem of machine learning. Machine learning is a field that grew out of artificial intelligence within computer science, and the goal is to teach computers by example. If you want to teach the computer to recognize images of chairs,
then we give it a whole bunch of images and tell it which ones are chairs and which ones are not, and then it’s supposed to learn to recognize chairs, even ones it hasn’t seen before. It’s not that we tell the computer how to recognize a chair.
We don’t tell it that a chair has four legs, a back, and a flat surface to sit on, and so on; we just give it a lot of examples. Machine learning has close ties to statistics;
in fact, it’s hard to say how predictive statistics differs from machine learning.
These fields are very closely linked right now. Now the problem I just told you about
is a classification problem where we’re trying to identify chairs, and the way we set the problem up is that we have a training set of observations – in other words, like labeled images here – and we have a test set that we use only for evaluation.
We use the training set to learn a model of what a chair is, and the test set consists of images that are not in the training set; we want to be able to make predictions on those as to whether or not each image is a chair. And it could be that some of the labels in the training set are noisy. In fact, one of these labels is noisy, right here. And that’s okay, because as long as there isn’t too much noise,
we should still be able to learn a model for a chair; it just won’t be able to classify perfectly and that happens. Some prediction problems are just much harder than others, but that’s okay because we just do the best we can from the training data,
and in terms of the size of the training data: the more the merrier.
We want as much data as we can to train these models. Now how do we represent an image of a chair or a flower or whatever in the training set? Now I just zoomed in on a little piece of a flower here,
and you can see the pixels in the image, and we can represent each pixel in the image according to its RGB values, its red-green-blue values. So we get three numbers representing each pixel in the image.
So you can represent the whole image as a collection of RGB values.
So the image becomes this very large vector of numbers, and in general when doing machine learning, we need to represent each observation in the training and test sets as a vector of numbers, and the label is also represented by a number.
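As a concrete sketch of that representation (the image size here is a made-up example, using NumPy):

```python
import numpy as np

# A tiny hypothetical 4x4 RGB image: height x width x 3 color channels,
# each entry an intensity value in [0, 255].
image = np.random.randint(0, 256, size=(4, 4, 3))

# Flatten it into one long feature vector: 4 * 4 * 3 = 48 numbers.
x = image.reshape(-1)
print(x.shape)  # (48,)
```

A real photo works the same way; it just produces a much longer vector.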
Here the label is minus 1, because the image is not a chair; the images of chairs would all get the label plus 1.
Here’s another example: this is a problem that comes from New York City’s power company, where they want to predict which manholes are going to have a fire.
So we would represent each manhole as a vector, and here are the components of the vector.
The first component might be the number of serious events that the manhole had last year, like a fire or a smoking manhole (that’s very serious), or an explosion, or something like that.
And then maybe we would actually have a category for the number of serious events last year, so only three of these five events were very serious.
The number of electrical cables in the manhole, the number of electrical cables that were installed before 1930, and so on and so forth. In general, the first step is to figure out how to represent your data as a vector like this, and you can make this vector very large.
You could include lots of factors if you like, that’s totally fine. Computationally, things are easier if you use fewer features, but then you risk leaving out information.
So there’s a trade-off right there that you’re going to have to worry about, and we’ll talk more about that later. But in any case, you can’t do machine learning if you don’t have your data represented in the right way, so that’s the first step.
Now you’d think that manholes with more cables, more recent serious events, and so on would be more prone to explosions and fires in the near future, but what combination of them would give you the best predictor?
How do you combine them together? You could add them all up, but that might not be the best thing. You could give them all weights and add them all up, but how do you know the weights?
And that’s what machine learning does for you. It tells you what combinations to use to get the best predictors. But for the manhole problem, we want to use the data from the past to predict the future.
So for instance, the feature data might be from 2014 and before, and the label would be 1 if the manhole had an event in 2015.
So that’s our training set, and then for the test set, the feature data would be from 2015 and before and then we would try to predict what would happen in 2016.
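That train/test split over time might be sketched like this (the record format and event histories here are hypothetical):

```python
# Hypothetical per-manhole event histories: manhole id -> years with events
events = {
    "MH-1": [2010, 2013, 2015],
    "MH-2": [2012],
    "MH-3": [2014, 2016],
}

def make_examples(cutoff_year, label_year):
    """Features use only data up to cutoff_year; the label is +1 if the
    manhole had an event in label_year, and -1 otherwise."""
    examples = []
    for manhole, years in events.items():
        x = [sum(1 for y in years if y <= cutoff_year)]  # events so far
        label = 1 if label_year in years else -1
        examples.append((x, label))
    return examples

train = make_examples(2014, 2015)  # features from 2014 and before, 2015 labels
test = make_examples(2015, 2016)   # features from 2015 and before, 2016 labels
```

The key point is that the same feature-building code is reused, just shifted forward a year, so the model is always evaluated on a year it has never seen.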
So just to be formal about it, we have each observation being represented by our set of features,
and the features are also called predictors, or covariates, or explanatory variables, or independent variables; you can choose whatever terminology you like.
And then we have the labels, which are called y. Even more formally, we’re given a training set of feature–label pairs (xi, yi), and there are n of them, and we want to create a classification model f that can predict a label y for a new x.
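In code, that training set is often stored as an n × d matrix X, one row per observation, plus a length-n label vector y (the numbers below are made up):

```python
import numpy as np

# Hypothetical training set: n = 4 observations, d = 3 features each.
X = np.array([[3.0, 12.0, 5.0],
              [0.0,  4.0, 1.0],
              [5.0, 20.0, 9.0],
              [1.0,  6.0, 0.0]])
y = np.array([1, -1, 1, -1])  # one label y_i in {-1, +1} per row of X

n, d = X.shape  # n = 4 pairs (x_i, y_i), each x_i of dimension d = 3
```

The learning algorithm’s job is to turn (X, y) into a function f whose sign predicts the label for a new x.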
Let’s take a simple version of the manhole example, where we have only two features: the year the oldest cable was installed and the number of events that happened last year.
So each observation can be represented as a point on a two-dimensional graph, which means I can plot the whole dataset.
So something like this, where each point here is a manhole and I’ve labelled it with whether or not it had a serious event in the training set.
So these are the manholes that didn’t have events, and these are the ones that did.
And then I’m going to try to create a function here that’s going to divide the space into two pieces, where on one side of the decision boundary I’m going to predict that there’s going to be an event,
and on the other side of the decision boundary, I predict there will be no event. So this decision boundary is actually just the equation where the function is 0,
and then where the function is positive we’ll predict positive and where the function is negative we’ll predict negative. And so this is going to be a function of these two variables, the oldest cable and then the number of events last year.
And the same idea holds for the computer vision problem that we discussed earlier.
We’re trying to create this decision boundary that’s going to chop the space into two pieces, where on one side of the decision boundary we would predict positive, and then on the other side we’d predict negative.
And the trick is, how do we create this decision boundary? How do we create this function f? Okay, so given our training data, we want to create our classification model f that can make predictions.
The machine learning algorithm is going to create the function f for you, and no matter how complicated that function f is, the way to use it is not very complicated at all.
The way to use it is just this: the predicted value of y for a new x that you haven’t seen before is just the sign of that function f. Classification is for yes or no questions.
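As a sketch, here is what predicting the sign of f looks like for a simple linear f on the two manhole features; the weights below are made up purely for illustration, not learned from real data:

```python
import numpy as np

# Hypothetical linear scoring function f(x) = w . x + b on the features
# [year the oldest cable was installed, number of events last year].
w = np.array([-0.02, 0.5])  # made-up weights
b = 39.0                    # made-up intercept

def f(x):
    return w @ np.asarray(x, dtype=float) + b

def predict(x):
    # The predicted label is just the sign of f(x): the decision
    # boundary is the set of points where f(x) = 0.
    return 1 if f(x) > 0 else -1

print(predict([1920, 3]))  # old cable, recent events: f = 2.1 > 0, predict 1
print(predict([1980, 0]))  # newer cable, no events: f = -0.6 < 0, predict -1
```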
You can do a lot if you can answer yes or no questions. So for instance, think about handwriting recognition: for each letter on a page, we’re going to evaluate whether it’s the letter A, yes or no.
And if you’re doing like spam detection, right, the spam detector on your computer has a machine learning algorithm in it. Each email that comes in has to be evaluated as to whether or not it’s spam.
Credit defaults: whether or not you get a loan depends on whether the bank predicts that you’re going to default on your loan, yes or no. And in my lab, we do a lot of work on predicting medical outcomes. We want to know whether something will happen to a patient within a particular period of time. Here’s a list of common classification algorithms. Most likely, unless you’re interested in developing your own algorithms,
you never need to program these yourself; they’ve already been implemented by someone else. If you’re just going to be a consumer of these methods, you can use the code that’s already written.
We’re going to cover a good chunk of these methods, and in order to use them effectively you’ve really got to know what you’re doing; otherwise you could really run into some issues.
But if you can figure out how to use these, you’ve got a really powerful tool on your hands.
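For example, here is a minimal sketch of being a consumer of one of these methods, using scikit-learn’s off-the-shelf logistic regression (the toy manhole data is made up, and another algorithm from the list could be swapped in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: [year oldest cable installed, events last year]
X_train = np.array([[1910, 4], [1925, 3], [1990, 0],
                    [1985, 1], [1930, 5], [1995, 0]], dtype=float)
y_train = np.array([1, 1, -1, -1, 1, -1])

# The library code learns the combination of features for us.
clf = LogisticRegression().fit(X_train, y_train)

# Predict labels for manholes the model has never seen.
X_test = np.array([[1920, 2], [1992, 0]], dtype=float)
print(clf.predict(X_test))
```

Most such libraries expose roughly this same fit/predict interface, which is what makes them usable even when the underlying algorithm is complicated.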