Chapter 1. Classification -- 10. Creating a Classifier with Python (Translation)

So Cynthia has been showing us some of the theory of classifiers, and specifically talking about logistic regression as a kind of baseline classifier. In this video we're going to use a logistic regression classifier and look at some of its specific properties. But keep in mind that we're talking about principles here that work for almost any classification model. So let's have a look at some Python code and at a little simulation I have prepared for you.

The first thing we have to do is create a data set. We're just going to have two features in our data set, which we'll call x and y, and two possible states for our label, which we're calling z. It can be 1 or 0, so essentially true or false. We're going to generate locations for those two labels using bivariate normal distributions with no correlation between x and y, so we just have x and y values that are randomly generated from these normal distributions. We put all of that into data frames, as you see, and concatenate those two data frames, so we come up with one data set that has two different labels and various values for the x and y features.

We'll just look at the head of that data frame to get a feel for what it looks like. As advertised, we have our x variable and our y variable, which are the two features, and then z, our label, which can be 1 or 0.

So let's plot that data set. We're going to create a fairly standard Python plot using the pandas plot method, and we're going to make scatter plots. Where the label z is 1 we make the points red, and where the label is 0 we make them dark blue. Otherwise it's all pretty standard. So let's have a look at our little plot. You can see the red X's are our positive label, or true, and the little blue circles are our negative label, or false. You can also see there's some overlap: some red X's get into the area where the population of blue dots is, and likewise some blue dots get up into the area of the red X's. So whatever classifier we use on this, it's unlikely we can get a hundred percent accuracy given just these two features and this overlap between the two labels. This is very typical of machine learning problems.

OK, let's talk about the logistic function. Cynthia has shown you this already, but I'm just going to create a plot of it. It's actually quite simple: I create an x and a y using these list comprehensions, where y is the logistic function, which is just the exponential of x over 1 plus the exponential of x. Then I plot those, and you see you get the expected sigmoidal behavior.

The normal way to start with logistic regression is to say: if the value from the regression is greater than 0.5 we call that a one, or a plus, and if the value is less than 0.5 we call that a minus, or zero, or false. But keep in mind that we can move this critical point, this decision point, up, so that we favor negative values over false positives, or we can move it down. Just keep that in mind as we go through this demo.

So now we'll start from the data set I just showed you, and we'll use the default value of one half for deciding between a positive and a negative label, which is a log probability, by the way, of 1.0. We're going to use scikit-learn to create a logistic regression. I have to do a little bit of reshaping, and I have to use the as_matrix method to make sure that I get a NumPy array, which is what scikit-learn needs. That's all I'm doing with this, and ravel flattens the label so that it is a one-dimensional array. So I've got my X, which holds my two features, and my Y, which is my label. You see I imported linear_model from scikit-learn, so I can call linear_model.LogisticRegression to get the regression object I need. Then I fit using that X and Y, and now I've got a predict method, so I'm just going to append the predictions as a new column to my data frame.

Then I'm going to evaluate. We've talked about how to evaluate any classification model, so basically I'm filling in my confusion matrix here: my true positives, my false positives, my true negatives and my false negatives. I'm using a series of conditionals. If the predicted value equals 1 and z is equal to the predicted value, then it's a true positive.
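The data generation and model fit described above can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the helper name `sim_data`, the seed, the unit variance, the 50 points per label and the choice of (1, 1) as the center of the positive cluster are all assumptions made for illustration. The transcript mentions `as_matrix`; that method has been removed from current pandas, so the sketch uses `to_numpy()` instead.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model


def sim_data(center=1.0, n=50, seed=0):
    """Simulate two labelled clusters with uncorrelated x and y (assumed unit variance)."""
    rng = np.random.default_rng(seed)
    # Positive cases (z = 1), assumed to be centered at (center, center).
    pos = pd.DataFrame(rng.normal(loc=center, scale=1.0, size=(n, 2)), columns=["x", "y"])
    pos["z"] = 1
    # Negative cases (z = 0), assumed to be centered at (-center, -center).
    neg = pd.DataFrame(rng.normal(loc=-center, scale=1.0, size=(n, 2)), columns=["x", "y"])
    neg["z"] = 0
    # Concatenate the two data frames into one data set.
    return pd.concat([pos, neg], ignore_index=True)


df = sim_data()

# scikit-learn works on NumPy arrays: a 2-D feature array and a 1-D label array.
X = df[["x", "y"]].to_numpy()
y = df["z"].to_numpy().ravel()

model = linear_model.LogisticRegression()
model.fit(X, y)

# Append the predictions as a new column, as described in the transcript.
df["predicted"] = model.predict(X)

print(df.head())
print("training accuracy:", (df["predicted"] == df["z"]).mean())
```

Because the data are random, the exact accuracy depends on the seed and will not match the numbers quoted in the video.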

Likewise, if the predicted value equals 0 and z is equal to the predicted value, it's a true negative. Otherwise they're false, either positive or negative. So I have these four cases, and I can plot them as scatter plots, using different colors and marker sizes so we can tell positive from negative cases and whether or not they're scored correctly by our model.

Then I compute the counts of true positives, true negatives and so on, which are the entries of my final confusion matrix. I print that confusion matrix and compute some figures like accuracy, precision and recall, which Cynthia has also discussed. So let me just run that for you.

Let's start with our confusion matrix. It looks like we're actually pretty accurate. For our true negatives we got 49, and we only got 1 case that is actually negative where we said it's positive. Likewise, we only got three cases that were truly positive where we scored them as negative, and otherwise we got true positives.
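The series of conditionals that sorts each row into one of the four confusion-matrix cells might look something like this. It is only a sketch: the function name `outcome` and the handful of (z, predicted) pairs are made up for illustration and are not the data from the demo.

```python
from collections import Counter


def outcome(z, predicted):
    """Name the confusion-matrix cell for one actual label z and one prediction."""
    if predicted == 1 and z == predicted:
        return "TP"  # predicted positive, and it really is positive
    if predicted == 0 and z == predicted:
        return "TN"  # predicted negative, and it really is negative
    # Otherwise the prediction is wrong: a false positive or a false negative.
    return "FP" if predicted == 1 else "FN"


# Hypothetical (z, predicted) pairs, invented for this example.
pairs = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

counts = Counter(outcome(z, p) for z, p in pairs)
print(counts["TP"], counts["FP"], counts["TN"], counts["FN"])  # prints: 2 1 2 1
```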

So we only have four errors here, which gives us an accuracy of ninety-six percent and really high precision and recall. I think the plot is a little more instructive, since the numbers are very abstract and very general, and in the plot you can see what's happening. Here's our one negative that was misclassified: it's a dot, and it's turned red because it's misclassified. We've also got these three positives that are misclassified. You can imagine in your mind that the decision boundary has got to be running something like this to get that result. OK, but we can move the decision boundary. Remember that on the logistic function I showed you, we could move that decision point up or down.
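As a worked check of those figures, the three statistics can be computed from the four counts. The transcript states 49 true negatives and 1 false positive, and mentions three misclassified positives; the value of 47 true positives is not stated and is inferred here from four errors at 96 percent accuracy, which implies 100 cases in total. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all cases scored correctly
    precision = tp / (tp + fp)                  # share of predicted positives that are right
    recall = tp / (tp + fn)                     # share of actual positives that were found
    return accuracy, precision, recall


# tn=49 and fp=1 are stated in the transcript; fn=3 comes from the
# "three positives" remark; tp=47 is an inference, not a quoted number.
accuracy, precision, recall = metrics(tp=47, fp=1, tn=49, fn=3)
print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
# prints: accuracy=0.96 precision=0.979 recall=0.94
```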

So if we move it so that we're favoring positive outcomes over negative outcomes, the decision boundary will move this way. Let me do that for you. This code will simulate a data set, and we're going to use decision values of 1.0, which is what we just did and is the balanced case, and then 2 and 4. These are log probabilities, by the way. So we're just moving the boundary this way. All right, let me run that. Recalling that logistic function, we can move that threshold, or decision point, up or down.
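A small sketch of the logistic function and of what moving the decision point does to the labels. The grid of x values and the cutoffs 0.3, 0.5 and 0.7 are illustrative assumptions; they are plain probability cutoffs and are not the values 1, 2 and 4 used in the demo, whose exact mapping onto a threshold the transcript does not spell out.

```python
import math


def logistic(x):
    """Logistic function: exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1.0 + math.exp(x))


# Build the curve with list comprehensions, as the transcript describes.
xs = [i / 10.0 for i in range(-60, 61)]
ys = [logistic(x) for x in xs]


def label(p, cutoff=0.5):
    """Score a probability as 1 (positive) if it exceeds the cutoff, else 0."""
    return 1 if p > cutoff else 0


# Count how many points on the curve are labelled positive at each cutoff.
positives = {cutoff: sum(label(p, cutoff) for p in ys) for cutoff in (0.3, 0.5, 0.7)}
for cutoff, n_pos in positives.items():
    print(f"cutoff={cutoff}: {n_pos} of {len(ys)} points labelled 1")
```

Raising the cutoff labels fewer points as positive, and lowering it labels more of them as positive.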

If we move it up, we're favoring negative values over false positives. If we move it down, we're favoring positive values over false negatives. In this demo I'm first going to make the problem harder. I've moved the centroids from (1, 1) and (-1, -1) to half those values, so I've basically moved the data for the positive and negative cases closer together and we get more overlap. We're going to look at log probabilities of 1, 2 and 4.

So as we do that, we're going to favor getting the positives right at the cost of getting more of the negatives wrong.
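The trade-off can be sketched on simulated overlapping data. This is a sketch under several assumptions: the score `logistic(x + y)` is a hand-picked stand-in for a fitted model, the cutoffs 0.5, 0.3 and 0.1 are illustrative probability cutoffs rather than the demo's values of 1, 2 and 4, and the names `simulate` and `evaluate`, the seed, the unit variance and the 50 points per label are hypothetical.

```python
import math
import random


def logistic(t):
    """Logistic function: exp(t) / (1 + exp(t))."""
    return math.exp(t) / (1.0 + math.exp(t))


def simulate(center=0.5, n=50, seed=1):
    """Return (x, y, z) rows: positives around (center, center), negatives mirrored."""
    rng = random.Random(seed)
    rows = []
    for z, c in ((1, center), (0, -center)):
        rows += [(rng.gauss(c, 1.0), rng.gauss(c, 1.0), z) for _ in range(n)]
    return rows


def evaluate(rows, cutoff):
    """Score every row against the cutoff and summarize the confusion matrix."""
    tp = fp = tn = fn = 0
    for x, y, z in rows:
        # Hand-picked score standing in for a fitted model's probability.
        predicted = 1 if logistic(x + y) > cutoff else 0
        if predicted == 1 and z == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif z == 0:
            tn += 1
        else:
            fn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(rows),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn),
    }


rows = simulate()
cutoffs = (0.5, 0.3, 0.1)  # moving the decision point down favors positives
results = [evaluate(rows, c) for c in cutoffs]
for c, r in zip(cutoffs, results):
    print(f"cutoff={c}: FP={r['fp']} FN={r['fn']} "
          f"accuracy={r['accuracy']:.2f} precision={r['precision']:.2f} recall={r['recall']:.2f}")
```

Whatever the random draw, lowering the cutoff can only turn predicted negatives into predicted positives, so the false negatives cannot increase and the false positives cannot decrease.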

So let me run that for you. You see that in our first case true negatives and true positives are in the majority, but for the truly positive cases we get some false negatives, and for the truly negative cases we get some false positives. Our overall accuracy is about 75 percent, our precision is 0.75, and recall is in that range too.

But as we move that decision boundary, notice that the number of false negatives, which are truly positive but scored as negative, has dropped quite a bit, while the number of negatives that were incorrectly scored as positive has gone up. So our accuracy has dropped to 0.7 and our precision has dropped to 0.66, but our recall has gone up to 0.84.

Likewise, when we move that boundary again, we now have very few false negatives and a lot more false positives. Our accuracy is not too affected, but our precision has gone down again and our recall is now way up at 0.92.

We can see that graphically. Here's our first case, and you can imagine the decision boundary has to go something like this.

You can see the false negatives are the red pluses and the false positives are the red circles. True positives are the blue pluses and true negatives are the blue circles. Now I've moved that boundary a little bit, and notice we have a lot more red circles, so a lot more false positives, but we don't have as many red pluses and we have a lot more blue pluses. I move that boundary again, which is the last case I showed you, and now we're down to just four red pluses. So we only have four misclassified positive values, but we have a lot more misclassified negative values.

So I hope this little demo has given you a feel not only for logistic regression, but also for how a model like this behaves with respect to the overlap of features, since we've seen two different cases there, and for how changing that decision boundary can affect the performance statistics you see from your machine learning model.


