Back to the Future of Data Analysis: A Tidy Implementation of Tukey’s Vacuum


One of John Tukey’s landmark papers, “The Future of Data Analysis”, contains a set of analytical techniques that have gone largely unnoticed, as if they’re hiding in plain sight.

Multiple sources identify Tukey’s paper as a seminal moment in the history of data science. Both Forbes (“A Very Short History of Data Science”) and Stanford (“50 years of Data Science”) have published histories that use the paper as their starting point. I’ve quoted Tukey myself in articles about data science at Microsoft (“Using Azure to understand Azure”).

Independent of the paper, Tukey’s impact on data science has been immense: He was the author of Exploratory Data Analysis. He co-developed the Fast Fourier Transform (FFT) algorithm with James Cooley, introduced the box plot, and devised multiple statistical techniques that bear his name. He even coined the term “bit.”

But it wasn’t until I actually read “The Future of Data Analysis” that I discovered Tukey’s forgotten techniques. Of course, I already knew the paper was important. But I also knew that if I wanted to understand why — to understand the breakthrough in Tukey’s thinking — I had to read it myself.

Tukey does not disappoint. He opens with a powerful declaration: “For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt” (p 2). Like the opening of Beethoven’s Fifth, the statement is immediate and bold. “All in all,” he says, “I have come to feel that my central interest is in data analysis…” (p 2).

Despite Tukey’s use of first person, his opening statement is not about himself. He’s putting his personal and professional interests aside to make the much bolder assertion that statistics and data analysis are separate disciplines. He acknowledges that the two are related: “Statistics has contributed much to data analysis. In the future it can, and in my view should, contribute much more” (p 2).

Moreover, Tukey states that statistics is “pure mathematics.” And, in his words, “…mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability” (p 6). Data analysis, however, is a science, distinguished by its “reliance upon the test of experience as the ultimate standard of validity” (p 5).

Nothing on CRAN

Not far into the paper, however, I stumbled. About a third of the way in (p 22), Tukey introduces FUNOP, a technique for automating the interpretation of plots. I paged ahead and spotted a number of equations. I worried that, before I could understand the equations, I might need an intuitive understanding of FUNOP. I paged further ahead and spotted a second technique, FUNOR-FUNOM. I soon realized that this pair of techniques, combined with a third that I didn’t yet realize was waiting for me, make up nearly half the paper.

To understand “The Future of Data Analysis,” I would definitely need to learn more about FUNOP and FUNOR-FUNOM. I took that realization in stride, though, because I learned long ago that data science is — and will always be — full of terms and techniques that I don’t yet know. I’d do my research and come back to Tukey’s paper.

But when I searched online for FUNOP, I found almost nothing. More surprising, there was nothing in CRAN. Given the thousands of packages in CRAN and the widespread adoption of Tukey’s techniques, I expected there to be multiple implementations of the techniques from such an important paper. Instead, nothing.

FUNOP

Fortunately, Tukey describes in detail how FUNOP and FUNOR-FUNOM work. And, fortunately, he provides examples of how they work. Unfortunately, he provides only written descriptions of these procedures and their effect on example data. So, to understand the procedures, I implemented each of them in R. (See my repository on GitHub.) And to further clarify what they do, I generated a series of charts that make it easier to visualize what’s going on.

Here’s Tukey’s definition of FUNOP (FUll NOrmal Plot):

  • (b1) Let aᵢ₍ₙ₎ be a typical value for the ith ordered observation in a sample of n from a unit normal distribution.

  • (b2) Let y₁ ≤ y₂ ≤ … ≤ yₙ be the ordered values to be examined. Let ẏ be their median (or let ӯ, read “y trimmed”, be the mean of the yᵢ with ⅓n < i ≤ ⅓(2n)).

  • (b3) For i ≤ ⅓n or > ⅓(2n) only, let zᵢ = (yᵢ - ẏ)/aᵢ₍ₙ₎ (or let zᵢ = (yᵢ - ӯ)/aᵢ₍ₙ₎).

  • (b4) Let ż be the median of the z’s thus obtained (about ⅓(2n) in number).

  • (b5) Give special attention to z’s for which both |yᵢ - ẏ| ≥ A · ż and zᵢ ≥ B · ż, where A and B are prechosen.

  • (b5*) Particularly for small n, zⱼ’s with j more extreme than an i for which (b5) selects zᵢ also deserve special attention… (p 23).
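To make steps (b1) through (b5*) concrete, here is a minimal Python sketch. It is not the R implementation mentioned above, only an illustration under stated assumptions: the function name `funop`, the row layout, and the default values of A and B are hypothetical, and the “typical value” in (b1) is taken to be the standard-normal quantile at the plotting position (3i - 1)/(3n + 1), which is one possible choice among several.

```python
from statistics import NormalDist, median


def funop(values, A=0.0, B=1.5):
    """Sketch of FUNOP steps (b1)-(b5*). A and B defaults are illustrative."""
    n = len(values)
    if n < 3:
        raise ValueError("this sketch needs at least 3 observations")

    y = sorted(values)            # (b2) ordered values
    y_split = median(y)           # (b2) their median
    std_normal = NormalDist()     # unit normal distribution

    rows = []
    for i, y_i in enumerate(y, start=1):
        # (b1) typical value of the i-th of n ordered unit-normal observations.
        # The plotting position (3i - 1) / (3n + 1) is an assumption here.
        a_i = std_normal.inv_cdf((3 * i - 1) / (3 * n + 1))
        # (b3) z is defined only outside the middle third
        # (i <= n/3 or i > 2n/3), written with integers to avoid rounding.
        in_middle = n < 3 * i <= 2 * n
        z_i = None if in_middle else (y_i - y_split) / a_i
        rows.append({"i": i, "y": y_i, "a": a_i, "z": z_i, "special": False})

    # (b4) median of the z's that were actually computed.
    z_split = median([r["z"] for r in rows if r["z"] is not None])

    # (b5) flag z's that pass both prechosen thresholds.
    for r in rows:
        if r["z"] is not None:
            r["special"] = (abs(r["y"] - y_split) >= A * z_split
                            and r["z"] >= B * z_split)

    # (b5*) also flag values more extreme than a flagged value in the same tail.
    low = [r["i"] for r in rows if r["special"] and 3 * r["i"] <= n]
    high = [r["i"] for r in rows if r["special"] and 3 * r["i"] > 2 * n]
    for r in rows:
        if (low and r["i"] < max(low)) or (high and r["i"] > min(high)):
            r["special"] = True

    return rows, y_split, z_split


# Made-up data for illustration only (not Tukey's example).
rows, y_split, z_split = funop([1, 2, 3, 4, 5, 6, 7, 8, 50])
print([r["y"] for r in rows if r["special"]])
```

Each zᵢ is the slope of the line joining a point to the median on a normal plot, so a value is flagged when its slope is large relative to the typical slope ż. The sketch uses the median variant of (b2) and (b3); the trimmed-mean variant would swap `y_split` for the mean of the middle third.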

The basic idea is very similar to a Q-Q plot.

Tukey gives us an example of 14 data points. On a normal Q-Q plot, if data are normally distributed, they form a straight line. But in the chart below, based upon the example data, we can clearly see that a couple of the points are relatively distant from the straight line. They’re outliers.
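The straight-line property can be checked numerically. The Python sketch below builds the coordinates of a normal Q-Q plot and measures the slope between neighboring points. The helper names `qq_points` and `segment_slopes` are hypothetical, the plotting position (3i - 1)/(3n + 1) is an assumption, and the data are synthetic rather than Tukey’s 14 values.

```python
from statistics import NormalDist


def qq_points(values):
    """Pair each sorted observation with a standard-normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    ordered = sorted(values)
    # The plotting position (3i - 1) / (3n + 1) is an assumption of this sketch.
    return [
        (std_normal.inv_cdf((3 * i - 1) / (3 * n + 1)), y)
        for i, y in enumerate(ordered, start=1)
    ]


def segment_slopes(points):
    """Slope of the segment between each pair of neighboring Q-Q points."""
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


# Synthetic data that sits exactly on a line with intercept 10 and slope 2.
n = 14
quantiles = [
    NormalDist().inv_cdf((3 * i - 1) / (3 * n + 1)) for i in range(1, n + 1)
]
straight = [10 + 2 * q for q in quantiles]

# The same data with the largest value inflated, to mimic an outlier.
bent = straight[:-1] + [straight[-1] + 25]

print(segment_slopes(qq_points(straight)))
print(segment_slopes(qq_points(bent)))
```

For the first data set every segment has the same slope (the standard deviation, here 2), while the inflated point in the second produces a final segment that is far steeper than the rest. That jump is what shows up visually as a point sitting away from the line.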
