My Machine Learning Model Works Well, but I Wish It Would Fail

I am a recent graduate of Metis’ data science bootcamp, and one of our projects in that program focused on building a classification model. We could pick any dataset we wanted to work with. I chose the Centers for Disease Control and Prevention’s National Health and Nutrition Examination Survey (NHANES) results for the years 2007–2016 and trained a model to predict whether or not an individual has high blood pressure (binary classification: yes or no). I tested 28 variables in my machine learning model, covering demographics, eating habits, alcohol intake, activity level, occupation, and others. After testing various models, I selected a logistic regression model using 5 variables and oversampling, with the decision threshold set to 0.34 to bring my target metric, recall, to 0.899. I built a nice Tableau graphic to represent my findings, and with that I had technically achieved my objective: classification model complete. Check.
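For readers unfamiliar with the threshold step, here is a minimal sketch of why lowering a classifier's decision threshold raises recall. It is not the code from my project: the labels, the predicted probabilities, and the helper name `recall_at_threshold` are all made up for illustration, and only the 0.34 cutoff comes from the model described above.

```python
def recall_at_threshold(y_true, probs, threshold):
    """Recall = true positives / all actual positives, at a given cutoff."""
    # Predict "high blood pressure" (1) when the probability meets the cutoff.
    predicted = [1 if p >= threshold else 0 for p in probs]
    true_pos = sum(1 for y, y_hat in zip(y_true, predicted) if y == 1 and y_hat == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels and predicted probabilities (not NHANES data).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.90, 0.62, 0.45, 0.38, 0.20, 0.55, 0.36, 0.30, 0.15, 0.05]

recall_default = recall_at_threshold(y_true, probs, 0.50)  # common default cutoff
recall_lowered = recall_at_threshold(y_true, probs, 0.34)  # cutoff used in the article

print(recall_default, recall_lowered)
```

On this toy data the lower cutoff catches more of the true positives, but it also flags more people who do not have high blood pressure. That trade-off is why the threshold is chosen around a target metric, which in my case was recall.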

To those in the data science community, this likely isn’t a ground-breaking exercise. To those outside the data science community, you are likely thinking (if I haven’t already lost you), “I don’t know what a logistic regression model or any of the things you said after that are.” Well, regardless of which audience you’re in, I get it. But the technicalities of my model aren’t what I want to talk about.

I want to talk about why I want my model to fail.

Why?

Well, four of the five variables in my model were somewhat expected. First was age, which is unfortunate but inevitable. The higher your age, the higher your chances of having high blood pressure. Then increases in weight, alcohol intake, and cigarette smoking each raised the predicted chance of high blood pressure. Those can be hard to change, but they can be controlled or changed by an individual.

But it’s the fifth variable of the model that threw me. The variable was a particular minority race: if an individual was of this race, the model predicted a higher chance of high blood pressure. I don’t feel the need to name the race here, as I don’t want this piece to be about me (a white woman) talking about another race. That’s not my place and not a platform I deserve. If you really want to know, it’s not hard to find out with a quick Google search. That is how I found out: after I saw my model’s results, I searched for what puts people at risk of high blood pressure so I could check my variables against it (and hoped it would show me why my race variable was wrong).

What I do want to talk about is how to feel about that race variable in my model and what responsibility a data scientist has in a situation like this. It is possible that there is a genetic connection between the race and high blood pressure, which is sad in itself. But it is also possible that the variable is standing in for many other societal things this race faces that could lead to high blood pressure. And that’s... really sad. Even if that is not the case for this particular model, the same line of thinking could apply to any model where race appears as a variable because of the historical oppression of that group.

What does a data scientist do when they see results from their model that represent greater societal issues?

Some might believe it’s not the responsibility of the data scientist: that it’s for others to address, and the data scientist just builds the thing. I don’t quite agree with that and feel an obligation to do more, but I also don’t know what that looks like. So I wanted to raise the question here, because I do believe this warrants a larger conversation. For me, it’s hard to just hand this model off as “done” and move on. I was proud of my technical work on the modeling, but I can’t say I was proud to present the findings. I was sad, mad, and uncomfortable.

Maybe my machine learning model was simple, but the moral question it raised is complex.

I encourage and hope for comments on this article. As mentioned before, I’m not here with answers; I’m here to start a conversation and hope you join in so we can all work towards solutions.

Full code for this model can be found here on GitHub.
