Chapter 1. Classification -- 01. Introduction to Classification (Translation)

01.Introduction to Classification

Classification is a core problem of machine learning. Now machine learning is a field that grew out of artificial intelligence within computer science, and the goal is to teach computers by example. Now if you want to teach the computer to recognize images of chairs,

then we give the computer a whole bunch of chairs and tell it which ones are chairs and which ones are not, and then it’s supposed to learn to recognize chairs, even ones it hasn’t seen before. It’s not like we tell the computer how to recognize a chair.

We don’t tell it a chair has four legs and a back and a flat surface to sit on and so on, we just give it a lot of examples. Now machine learning has close ties to statistics;

in fact, it’s hard to say how predictive statistics differs from machine learning.

These fields are very closely linked right now. Now the problem I just told you about

is a classification problem where we’re trying to identify chairs, and the way we set the problem up is that we have a training set of observations – in other words, like labeled images here – and we have a test set that we use only for evaluation.

We use the training set to learn a model of what a chair is, and the test set consists of images that are not in the training set; we want to be able to make predictions on those as to whether or not each image is a chair. And it could be that some of the labels on the training set are noisy. In fact, one of these labels is noisy, right here. And that’s okay because as long as there isn’t too much noise,

we should still be able to learn a model for a chair; it just won’t be able to classify perfectly and that happens. Some prediction problems are just much harder than others, but that’s okay because we just do the best we can from the training data,

and in terms of the size of the training data: the more the merrier.

We want as much data as we can to train these models. Now how do we represent an image of a chair or a flower or whatever in the training set? Now I just zoomed in on a little piece of a flower here,

and you can see the pixels in the image, and we can represent each pixel in the image according to its RGB values, its red-green-blue values. So we get three numbers representing each pixel in the image.

So you can represent the whole image as a collection of RGB values.

So the image becomes this very large vector of numbers, and in general when doing machine learning, we need to represent each observation in the training and test sets as a vector of numbers, and the label is also represented by a number.

Here the label is minus 1, because the image is not a chair, and the images of chairs would all get the label plus 1.
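As a minimal sketch of the representation just described, the snippet below flattens a tiny image into one vector of RGB numbers and attaches a numeric label. The 2x2 image, its pixel values, and the helper name `image_to_vector` are made up for illustration; they are not from the lecture.

```python
# Hypothetical 2x2 image: each pixel is an (R, G, B) triple with values 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def image_to_vector(pixels):
    """Flatten rows of (R, G, B) pixels into one flat list of numbers."""
    return [channel for row in pixels for pixel in row for channel in pixel]


x = image_to_vector(image)  # feature vector with 2 * 2 * 3 = 12 entries
y = -1                      # label: -1 means "not a chair", +1 means "chair"

print(len(x), x[:3], y)
```

A real photo works the same way, only with far more pixels, which is why the vector gets very large.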

Here’s another example: this is a problem that comes from New York City’s power company, where they want to predict which manholes are going to have a fire.

So we would represent each manhole as a vector, and here are the components of the vector.

Right, the first component might be the number of serious events that the manhole had last year, like a fire or smoking manhole that’s very serious, or an explosion or something like that.

And then maybe we would actually have a category for the number of serious events last year, so only three of these five events were very serious.

The number of electrical cables in the manhole, the number of electrical cables that were installed before 1930, and so on and so forth. In general, the first step is to figure out how to represent your data like this as a vector, and you can make this vector very large.

You could include lots of factors if you like, that’s totally fine. Computationally, things are easier if you use fewer features, but then you risk leaving out information.

So there’s a trade-off right there that you’re going to have to worry about, and we’ll talk more about that later. But in any case, you can’t do machine learning if you don’t have your data represented in the right way, so that’s the first step.
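One way to picture the manhole vector described above is the sketch below. The class name, field names, and the two cable counts are assumptions for illustration; only the "three of five events were serious" figures come from the lecture.

```python
from dataclasses import dataclass, astuple


@dataclass
class Manhole:
    # Field names are illustrative; the lecture names the quantities, not identifiers.
    events_last_year: int
    serious_events_last_year: int
    num_cables: int
    num_cables_pre_1930: int

    def to_vector(self):
        """Return the features as a plain list of floats, in field order."""
        return [float(value) for value in astuple(self)]


# Five events last year, three of them serious; cable counts are made-up numbers.
manhole = Manhole(events_last_year=5, serious_events_last_year=3,
                  num_cables=120, num_cables_pre_1930=40)
x = manhole.to_vector()
print(x)
```

Adding another feature means adding another field, which makes the vector longer: more information, but more computation.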

Now you would think that manholes with more cables, more recent serious events, and so on would be more prone to explosions and fires in the near future, but what combination of them would give you the best predictor?

How do you combine them together? You could add them all up, but that might not be the best thing. You could give them all weights and add them all up, but how do you know the weights?

And that’s what machine learning does for you. It tells you what combinations to use to get the best predictors. But for the manhole problem, we want to use the data from the past to predict the future.

So for instance, the feature data might be from 2014 and before, and the label would be 1 if the manhole had an event in 2015.

So that’s our training set, and then for the test set, the feature data would be from 2015 and before and then we would try to predict what would happen in 2016.
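The past-to-future setup can be sketched roughly as follows. The event log, the manhole ids, the two toy features, and the function `build_examples` are all hypothetical; the only thing taken from the lecture is that features stop before the year being predicted.

```python
# Hypothetical event log: manhole id -> years in which it had a serious event.
events = {"MH-1": {2013, 2015}, "MH-2": {2014}, "MH-3": {2016}}


def build_examples(event_log, label_year):
    """Build (id, features, label) triples for one prediction year.

    Features only use years strictly before label_year; the label is +1
    if the manhole had an event in label_year, and -1 otherwise.
    """
    examples = []
    for manhole_id, years in sorted(event_log.items()):
        past_events = [year for year in years if year < label_year]
        features = [len(past_events), int((label_year - 1) in years)]
        label = 1 if label_year in years else -1
        examples.append((manhole_id, features, label))
    return examples


train = build_examples(events, label_year=2015)  # features from 2014 and before
test = build_examples(events, label_year=2016)   # features from 2015 and before

print(train)
print(test)
```

The 2016 labels in `test` would only be used for evaluation, since they are not known at the time the prediction is made.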

So just to be formal about it, we have each observation being represented by our set of features,

and the features are also called predictors or covariates or explanatory variables or independent variables; you can choose whatever terminology you like.

And then we have the labels, which are called y. Even more formally, we’re given a training set of feature-label pairs (xi, yi), and there are n of them, and we want to create a classification model f that can predict a label y for a new x.
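Written out, the setup looks roughly like this. The symbol p for the number of features is an assumption (the lecture does not name it), and treating f as real-valued follows the way the function is used later in the lecture, where its sign gives the prediction.

```latex
\[
  \text{training set: } \{(x_i, y_i)\}_{i=1}^{n},
  \qquad x_i \in \mathbb{R}^{p},
  \qquad y_i \in \{-1, +1\}
\]
\[
  \text{model: } f : \mathbb{R}^{p} \to \mathbb{R},
  \text{ learned from the training set and used to predict the label } y \text{ of a new } x .
\]
```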

Let’s take a simple version of the manhole example, where we have only two features: the year the oldest cable was installed and the number of events that happened last year.

So each observation can be represented as a point on a two-dimensional graph, which means I can plot the whole dataset.

So something like this, where each point here is a manhole and I’ve labelled it with whether or not it had a serious event in the training set.

So these are the manholes that didn’t have events, and these are the ones that did.

And then I’m going to try to create a function here that’s going to divide the space into two pieces, where on one side of the decision boundary, over here, I’m going to predict that there’s going to be an event,

and on the other side of the decision boundary, I predict there will be no event. So this decision boundary is actually just the equation where the function is 0,

and then where the function is positive we’ll predict positive and where the function is negative we’ll predict negative. And so this is going to be a function of these two variables, the oldest cable and then the number of events last year.
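To make the decision boundary concrete, here is a small sketch that assumes f is a weighted sum of the two features plus a constant. The lecture does not say f is linear, and the weights and bias below are invented for illustration rather than learned from data.

```python
def f(oldest_cable_year, events_last_year, w_year=-0.125, w_events=1.0, bias=240.0):
    """Hypothetical linear decision function of the two features.

    The weights and bias are made-up values, not learned ones.
    """
    return w_year * oldest_cable_year + w_events * events_last_year + bias


def predict(oldest_cable_year, events_last_year):
    """Predict +1 (event) where f is positive and -1 (no event) otherwise.

    Points exactly on the boundary f = 0 are assigned -1 here, which is an
    arbitrary choice made for this sketch.
    """
    return 1 if f(oldest_cable_year, events_last_year) > 0 else -1


print(predict(1920, 3))  # old cable, several events last year
print(predict(2000, 0))  # newer cable, no events last year
print(f(1920, 0))        # a point lying on the decision boundary
```

With these particular numbers, older cables and more events both push f upward, so the boundary f = 0 is a straight line across the two-dimensional plot.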

And the same idea holds for the computer vision problem that we discussed earlier.

We’re trying to create this decision boundary that’s going to chop the space into two pieces, where on one side of the decision boundary we would predict positive, and then on the other side we’d predict negative.

And the trick is, how do we create this decision boundary? How do we create this function f? Okay, so given our training data, we want to create our classification model f that can make predictions.

The machine learning algorithm is going to create the function f for you, and no matter how complicated that function f is, the way to use it is not very complicated at all.

The way to use it is just this: the predicted value of y for a new x that you haven’t seen before is just the sign of that function f. Classification is for yes or no questions.

You can do a lot if you can answer yes or no questions. So for instance, think about handwriting recognition. For each letter on a page, we’re going to evaluate whether it’s the letter A, yes or no.

And if you’re doing like spam detection, right, the spam detector on your computer has a machine learning algorithm in it. Each email that comes in has to be evaluated as to whether or not it’s spam.

Credit defaults: right, whether or not you get a loan depends on whether the bank predicts that you’re going to default on your loan, yes or no. And in my lab, we do a lot of work on predicting medical outcomes. We want to know whether something will happen to a patient within a particular period of time. Here’s a list of common classification algorithms. Most likely, unless you’re interested in developing your own algorithms,

you never need to program these yourself; they’re already programmed in by someone else. If you’re just going to be a consumer of these, you can use the code that’s already written.

We’re going to cover a good chunk of these methods, and in order to use them effectively you’ve really got to know what you’re doing; otherwise you could really run into some issues.

But if you can figure out how to use these, you’ve got a really powerful tool on your hands.















































