Chapter 1. Classification -- 10. Creating a Classifier with Python (Transcript)

So, Cynthia has been showing us some theory of classifiers, and specifically talking about logistic regression as a kind of baseline classifier. In this video we're going to use a logistic regression classifier and look at some of the properties of logistic regression in particular, but keep in mind that we're talking about principles here that work for almost any classification model. So let's have a look at some Python code and a little simulation I have prepared for you.

The first thing we have to do is create a data set. We're going to have two features, which we'll just call x and y, and two possible states for our label, which we're calling z: it can be 1 or 0, so essentially true or false. We generate locations for those two label classes using bivariate normal distributions with no correlation between x and y, so each point just has an x and a y value drawn at random from these normal distributions. We put all of that into data frames and concatenate the two frames, so we come up with one data set that has the two different labels and various values for the x and y features. Then we look at the head of that data frame just to get a feel for what it looks like, and as advertised we have our x variable and our y variable, those are our two features, and then z, our label, which can be 1 or 0.

So let's plot that data set. We'll create a fairly standard Python plot here using the pandas plot method, making scatter plots: in the case where the label z is 1 we make the points red, and in the case where the label is 0 we make them dark blue; otherwise it's all pretty standard.
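The video doesn't reproduce the code on screen here, so the following is a minimal sketch of the data-generation and plotting steps as narrated; the column names x, y, and z and the centroid locations come from the transcript, while the helper name sim_data, the sample sizes, and the seed are assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sim_data(n=50, pos_center=(1.0, 1.0), neg_center=(-1.0, -1.0), sd=1.0, seed=1234):
    # Hypothetical helper: draw x and y independently (no correlation) from
    # normal distributions around each class centroid, label them z = 1 / z = 0,
    # and concatenate the two frames into one data set.
    rng = np.random.default_rng(seed)
    pos = pd.DataFrame({'x': rng.normal(pos_center[0], sd, n),
                        'y': rng.normal(pos_center[1], sd, n),
                        'z': 1})
    neg = pd.DataFrame({'x': rng.normal(neg_center[0], sd, n),
                        'y': rng.normal(neg_center[1], sd, n),
                        'z': 0})
    return pd.concat([pos, neg], ignore_index=True)

df = sim_data()
print(df.head())  # columns: x, y, z

# Scatter plot with the pandas plot method: red X's where z = 1,
# dark blue circles where z = 0.
ax = df[df.z == 1].plot.scatter(x='x', y='y', color='red', marker='x', label='z = 1')
df[df.z == 0].plot.scatter(x='x', y='y', color='darkblue', marker='o', label='z = 0', ax=ax)
plt.show()
```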

Looking at the plot, the red X's are our positive label, or true, and the little blue circles are our negative label, or false. You can see there's some overlap: some red X's get into the area where the population of blue dots is, and likewise some blue dots get up into the area of the red X's. So whatever classifier we use on this, it's unlikely we can get one hundred percent accuracy, just given these two features and this overlap between the two labels, and that is very typical of machine learning problems.

Now let's talk about the logistic function. Cynthia has shown you this already, but I'm just going to create a plot of it. It's actually quite simple: I create an x and a y using list comprehensions, where y is the logistic function, which is just the exponential over one plus the exponential, exp(z) / (1 + exp(z)).
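As a sketch, that plot really is just a couple of list comprehensions, exactly as described:

```python
import math
import matplotlib.pyplot as plt

# The logistic (sigmoid) function: f(z) = exp(z) / (1 + exp(z)).
xs = [z / 10.0 for z in range(-70, 71)]
ys = [math.exp(z) / (1.0 + math.exp(z)) for z in xs]

plt.plot(xs, ys)
plt.axhline(0.5, color='gray', linestyle='--')  # the default decision point
plt.xlabel('z')
plt.ylabel('f(z)')
plt.show()
```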

So then I just plot those, and you see you get the expected sigmoidal behavior. The default, the normal way you start with logistic regression, is to say: if the value from the regression is greater than 0.5, we'll call that a one, a positive; if it is less than 0.5, we call that a zero, a negative, false. But keep in mind that we can move this critical point, this decision point: up, so that we favor negative calls and avoid false positives, or down, so that we favor positive calls and avoid false negatives. Just keep that in mind as we go through this demo.

Now we'll start from the data set I just showed you, keeping in mind that we'll just use the default value of one half for the positive-versus-negative decision, which is, by the way, a probability ratio of 1.0. We're going to use scikit-learn here to create a logistic regression. I have to do a little bit of reshaping: I use the as_matrix method just to make sure that I get the numpy arrays that scikit-learn needs, and ravel helps me flatten the label so that it is a one-dimensional array. So I've got my X, which holds my two features, and my Y, which is my label. You see I imported linear_model from scikit-learn, so I can call linear_model.LogisticRegression to get the regression object I need, and then I fit using that X and Y. The fitted model has a predict method, and I append its output as a new column on my data frame.
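A sketch of that fit-and-predict step, continuing with the df frame from above (note that the as_matrix method mentioned in the video has since been removed from pandas; to_numpy does the same job):

```python
from sklearn import linear_model

# Reshape into the numpy arrays scikit-learn expects.
X = df[['x', 'y']].to_numpy()        # the two feature columns
Y = df[['z']].to_numpy().ravel()     # ravel flattens the label to a 1-D array

logit = linear_model.LogisticRegression()
logit.fit(X, Y)

# Append the predictions as a new column on the data frame.
df['predicted'] = logit.predict(X)
```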

Then I evaluate. We've talked about how to evaluate any classification model, and basically I'm filling in my confusion matrix here: my true positives, my false positives, my true negatives, and my false negatives, using a series of conditionals. If the predicted value equals 1 and z equals the prediction, it's a true positive; likewise, if the predicted value equals 0 and z equals the prediction, it's a true negative; otherwise it's false, either a false positive or a false negative. So I have these four cases, and I can plot them as scatter plots.

I use different colors and marker sizes so we can tell the positive from the negative cases, and whether they were scored correctly or not by our model. Then I compute the counts of each of the four cases, which fill in the final confusion matrix, print that matrix, and compute some figures like accuracy, precision, and recall, which Cynthia has also discussed. A sketch of that bookkeeping is below.
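This is a minimal sketch, assuming the predicted column added above; the conditional logic follows the narration:

```python
def score(row):
    # Assign each row to one of the four confusion-matrix cells.
    if row['predicted'] == 1:
        return 'TP' if row['z'] == row['predicted'] else 'FP'
    return 'TN' if row['z'] == row['predicted'] else 'FN'

df['case'] = df.apply(score, axis=1)
counts = df['case'].value_counts()
tp, fp = counts.get('TP', 0), counts.get('FP', 0)
tn, fn = counts.get('TN', 0), counts.get('FN', 0)

print('             predicted 1   predicted 0')
print(f'actual 1   {tp:13d} {fn:13d}')
print(f'actual 0   {fp:13d} {tn:13d}')

# The summary statistics Cynthia discussed.
print('accuracy  =', (tp + tn) / (tp + tn + fp + fn))
print('precision =', tp / (tp + fp))
print('recall    =', tp / (tp + fn))
```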

So let me run that for you, starting with the confusion matrix. It looks like we're actually pretty accurate: we got 49 true negatives, we only got one point that is actually negative where we said it's positive, and likewise just a few points that are truly positive where we scored them as negative; otherwise we got true positives and true negatives. So we only have four errors here, which gives us an accuracy of ninety-six percent and really high precision and recall. If we look at the plot, I think it's a little more instructive; the numbers are very abstract and very general, but in the plot you can see what's happening. Here's our one negative that was misclassified: it's a dot, and it's turned red for misclassified. And we've got these three positives which are also misclassified. You can imagine in your mind that the decision boundary has to be running something like this to produce that result. But we can move the decision boundary: remember, on that logistic function I showed you, we can move that decision point up or down.
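For logistic regression with two features, the p = 0.5 boundary you're imagining is in fact a straight line: the set of points where the model's linear score is zero. A sketch of drawing it from the fitted model above, overlaid on the earlier scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# The p = 0.5 boundary satisfies w1*x + w2*y + b = 0,
# i.e. y = -(w1*x + b) / w2.
w1, w2 = logit.coef_[0]
b = logit.intercept_[0]

xs = np.linspace(df.x.min(), df.x.max(), 100)
plt.plot(xs, -(w1 * xs + b) / w2, 'k--', label='decision boundary')
plt.legend()
plt.show()
```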

If we move it in the direction that favors positive calls over negative calls, the decision boundary will shift accordingly. So let me do that for you. This code simulates a data set and uses those decision probability ratios: 1.0, which is what we just did, the balanced case, and then 2 and 4, which push the boundary over in one direction. Let me run that.

Recalling that logistic function, we can move that threshold, that decision point, up or down: if we move it up we're favoring negative calls and reducing false positives, and if we move it down we're favoring positive calls and reducing false negatives. In this demo I'm also going to make the problem harder first: I've moved the centroids from (1, 1) and (-1, -1) in to (0.5, 0.5) and (-0.5, -0.5), so I've basically moved the data for the positive and negative cases closer together, there's more overlap, and we're going to look at probability ratios of 1, 2, and 4.

As we do that, we're going to favor getting the positives right at the cost of getting some negatives wrong; a sketch of the idea follows below.
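The transcript calls these values probability ratios (1, 2, 4) but doesn't show how the course code applies them; one simple way to simulate the same effect is to threshold the predicted probabilities directly instead of calling predict. A sketch under that assumption, reusing the sim_data helper and imports from the earlier blocks; the mapping of ratio to threshold here is my assumption, not the course's:

```python
# Harder problem: centroids at (0.5, 0.5) and (-0.5, -0.5) overlap more.
df2 = sim_data(pos_center=(0.5, 0.5), neg_center=(-0.5, -0.5))
X2 = df2[['x', 'y']].to_numpy()
Y2 = df2[['z']].to_numpy().ravel()

logit2 = linear_model.LogisticRegression().fit(X2, Y2)
probs = logit2.predict_proba(X2)[:, 1]   # P(z = 1) for each point

for ratio in (1.0, 2.0, 4.0):
    # Assumed mapping: ratio 1 -> threshold 0.5 (balanced), 2 -> 0.33, 4 -> 0.2.
    # Lower thresholds call more points positive, trading extra false
    # positives for fewer false negatives.
    threshold = 1.0 / (1.0 + ratio)
    pred = (probs >= threshold).astype(int)
    tp = int(((pred == 1) & (Y2 == 1)).sum())
    fp = int(((pred == 1) & (Y2 == 0)).sum())
    tn = int(((pred == 0) & (Y2 == 0)).sum())
    fn = int(((pred == 0) & (Y2 == 1)).sum())
    print(f'ratio {ratio}: accuracy {(tp + tn) / len(Y2):.2f}, '
          f'precision {tp / (tp + fp):.2f}, recall {tp / (tp + fn):.2f}')
```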

Let me run that for you. In the first case we mostly get true negatives and true positives; they're in the majority, but within the truly positive cases we get some false negatives, and within the truly negative cases we get some false positives. Our overall accuracy is about 75 percent, precision is 0.75, and recall is in that range too. But as we move that decision boundary, notice that the number of false negatives, the points that are truly positive but scored as negative, drops quite a bit, while the number of negatives incorrectly scored as positive goes up: accuracy drops to 0.70 and precision drops to 0.66, but recall goes up to 0.84. And when we move that boundary again, we now have very few false negatives and a lot more false positives; accuracy is not too affected, precision has gone down again, and recall is now way up at 0.92.

We can see that graphically. Here's the first case, and you can imagine the decision boundary has to go something like this. False negatives are the red pluses, false positives are the red circles, true positives are the blue pluses, and true negatives are the blue circles. Now I've moved that boundary a bit, and notice we have a lot more red circles, so a lot more false positives, but we don't have as many red pluses and we have a lot more blue pluses. And when I move that boundary once more, the last case I showed you, we're down to just four red pluses, so we only have four misclassified positive values, but we have a lot more misclassified negative values.

I hope this little demo has given you a feel not only for logistic regression but for how classification models behave with respect to, say, the overlap between the features' class populations (we've seen two different cases of that), and also for how changing the decision boundary affects the performance statistics you see from your machine learning model.

