Reproducibility in Data Science

Latest recommended article published on 2024-08-30 09:23:29

李_涛

Original article: https://towardsdatascience.com/reproducibility-in-data-science-c2ac9e689339

Science as a pursuit has always had Reproducibility at its core. After all, if a claim is made about the physical world and the evidence does not support it, then no amount of ideology or vested interest pushing the idea gives you a reason to believe it. In the seemingly post-truth world we live in, where politicians, the media, and voices on social media propagate information that is often some shade of dishonest, making reproducible claims pays dividends for your integrity. It's part and parcel of your job as a data scientist.

I think Reproducibility in data science is less well understood than Reproducibility in more established fields of science. In those fields, a study often tests one or two simple claims, such as the mean difference between two or more groups. Examples include:

Does treatment A make a statistically significant difference over placebo treatment B?
Do groups exposed to differing lengths of stimuli exhibit varying outcomes?
What is the effect size of a treatment?

Since publication generally favors statistically significant results, research whose goal is to repeat what other studies have done often goes unpublished. However, when such replication studies are performed and they do not come to the same conclusion under similar inputs, they cast doubt on the original claims. The research has not been reproduced.

In the field of structural engineering (my first career), we used a form of Reproducibility to validate designs produced by other people. Often an engineer would be tasked with designing a bridge, which is an awfully complex hunk of concrete and steel. In case you've never been outside, here is a picture of one.

[Image: a bridge. Photo by Christopher Burns on Unsplash]

Looks complicated, huh? That engineer's design was reviewed with a fine-tooth comb many times before it was released to the contractors for construction. Often during the checking process, another engineer would produce a design in parallel from the same initial inputs, and then the two would compare notes: the same underlying phenomena, arrived at by two independent engineers. Any discrepancies usually highlighted an inefficiency in the original design or a point of disagreement on how the bridge should be modeled. The goal was consensus through Reproducibility.

What is Reproducibility?

There are more dimensions to Reproducibility than simply obtaining the same result, which is all we have discussed so far. Mastering all of these dimensions makes it more likely that your work will be useful to people and will influence decision making at a higher level. Let's explore them.

Same Code

Your code should be well documented and should actually run. Go figure. There are two main factors for success here:

Dependency Management: how do you manage third-party packages? Are they actively maintained? Are the versions pinned? Do you have robust control over system-level dependencies?
Environment Management: what language version did you build your product in? Will the application environment use the same one?

In a data science consulting role, these two pieces are often neglected and then tacked on later, when client delivery becomes more important. Both are crucial, because you should expect the analysis to run on a different machine from the one where the code was written, or in someone else's well-manicured environment. How can you guarantee that they have the same packages, system dependencies, and language versions as you?
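
To make those two factors concrete, here is a minimal sketch of a check you could run on the target machine before the analysis starts. The function name `check_environment` and the shape of the `pinned` dictionary are my own illustrative assumptions, not part of any particular tool; the only library call it relies on is `importlib.metadata.version` from the Python standard library.

```python
import sys
from importlib import metadata


def check_environment(expected_python, pinned):
    """Return a list of mismatches between this environment and the expected one.

    expected_python: (major, minor) tuple, e.g. (3, 11)  (hypothetical input)
    pinned: dict of package name -> exact version string  (hypothetical input)
    """
    problems = []

    # Environment management: is this the language version we built with?
    actual_python = tuple(sys.version_info[:2])
    if actual_python != tuple(expected_python):
        problems.append(
            f"python: expected {tuple(expected_python)}, found {actual_python}"
        )

    # Dependency management: is each pinned package installed at the pinned version?
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")

    return problems
```

An empty list means the machine matches what you developed on. In practice you would probably lean on a lock file or a container image instead, but even a small check like this turns "it works on my machine" into something you can verify.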

Same Data

Data versioning is becoming more and more popular. The Cookiecutter Data Science project template has a loose version of this built in: data is divided into raw, interim, and processed stages, plus external data from third-party sources. This intuitive way of splitting data can help you tell the story of data transformation, from its raw format into something that can be analyzed. Building a narrative around any data transformation using data versioning lets you validate with stakeholders that your logic is sound and your data can be trusted. The analysis can be extended, or even reverted as necessary, which gives you the same agility that Git offers for code, but now for the data.
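
As a rough illustration of the idea (not a replacement for a dedicated data versioning tool), the sketch below fingerprints every file under the four stage folders with SHA-256 and compares two snapshots. The folder names follow the layout described above; the function names `build_manifest` and `changed_files` are hypothetical and exist only for this example.

```python
import hashlib
from pathlib import Path

# Stage folder names assumed from the raw / interim / processed / external split.
STAGES = ("raw", "interim", "processed", "external")


def build_manifest(data_dir):
    """Map each file under the stage folders to the SHA-256 hash of its contents."""
    data_dir = Path(data_dir)
    manifest = {}
    for stage in STAGES:
        stage_dir = data_dir / stage
        if not stage_dir.is_dir():
            continue  # a missing stage folder is simply skipped
        for path in sorted(stage_dir.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                manifest[path.relative_to(data_dir).as_posix()] = digest
    return manifest


def changed_files(old, new):
    """Files that were added, removed, or modified between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

If you save the manifest as a small text file and commit it next to the code, each commit records which version of the data the analysis was run against, and `changed_files` tells you exactly what moved between two runs.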

Same Random Numbers

Do you use random seeds in your machine learning pipeline? They allow for quick troubleshooting as the pipeline is built out, because they introduce Reproducibility into your model outputs. This is especially important when you use a learning algorithm that has randomness built into it, like neural nets or random forests. Random numbers will always be a part of machine learning workflows: train/test splits, cross validation, and optimization all draw on them, to name a few. You can control them with seed numbers. Think of a seed as controlling for a confounding variable, the random error. If you don't use seeds, you don't know whether a change in model outputs, standard errors, importances, and so on is due to randomness or due to a change in the hyper-parameters. Setting a random seed keeps this randomness consistent from run to run, at least temporarily while you build out your product, so it stops being a source of unexplained deviation in your ML pipeline.
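
Here is a small sketch of what that control looks like for a train/test split, using only the Python standard library. The helper `train_test_split_indices` is a hypothetical function written for this example; it is not the scikit-learn function with a similar name.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=None):
    """Shuffle row indices and split them into (train, test) index lists."""
    # A local generator keeps the seed's effect contained to this function
    # instead of changing the global random state.
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


# Same seed, same split: any change in results now comes from your changes.
first = train_test_split_indices(100, seed=42)
second = train_test_split_indices(100, seed=42)
print(first == second)  # True
```

Libraries usually expose the same knob under their own name; scikit-learn's `train_test_split`, for instance, takes a `random_state` argument. Whichever library you use, pass the seed explicitly at every step that draws random numbers.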

Same Story

Now that we have all of the above steps in place, we want to make sure that our work has a lasting impact. We want the conclusions we've drawn to replicate and persist in the minds of stakeholders. We don't want our audience to just nod their heads and then take no action on what has been presented. What makes these ideas stick in an effective way?

Stories. Whether your audience is your supervisor, a client, or C-level executives, a compelling story built around the data is the most effective way to achieve this goal. Our ancestors passed on knowledge this way because it was effective. Nothing has changed; it still works.

This is also where Reproducibility and Interpretability meet. Telling a story around your data and model, and explaining why the model made a prediction (using, for example, feature importances or SHAP values), leads to the Reproducibility of your conclusions in people's minds. The idea takes hold because you've communicated a compelling narrative that distills complex mathematics into something rich and actionable, and people know why they should care about it. This is the art of the science. It's truly a beautiful combination when it all comes together.
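
If you have not computed a feature importance before, the sketch below shows one simple flavor, permutation importance: shuffle one feature column at a time and measure how much accuracy drops. It is a standard-library illustration under my own assumptions (rows are plain lists, the model is any `predict` callable, and the function name `permutation_importance` is chosen for this example); it is not how SHAP values are computed.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled (larger = more important)."""
    rng = random.Random(seed)  # seeded so the importances themselves are reproducible

    def accuracy(data):
        hits = sum(predict(row) == target for row, target in zip(data, targets))
        return hits / len(targets)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        # Shuffle column j only, leaving every other feature untouched.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(shuffled))
    return importances
```

A feature the model ignores scores exactly zero, and a feature the model leans on scores higher. That ranking is the raw material for the story: "the prediction is driven mostly by X" is something a stakeholder can remember and act on.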

In Summary

What is the point of Reproducibility? It is not only for people to be able to run the same code and get similar results, but also for them to come to the same conclusions, and for those conclusions to persist over time, on disk and in human memory. Don't limit Reproducibility to virtual environments, or even to analytic conclusions; it's a much richer and more crucial concept than that.

Something to think about: How can you introduce more Reproducibility into your own projects?

Thank you for reading this article! I hope it has been eye-opening and informative. Feel free to connect with me on LinkedIn if you have any questions or are just looking to expand your network.
