3 Statistics Concepts You Should Know for Data Science Interviews

For more content like this, check out my free resource here!

Data scientists are basically modern statisticians. Below are 3 general types of statistics questions that you’ll most likely come across in a data science interview. The reason that these come up so frequently is that they serve as the fundamental building blocks for many data science applications, like Bayesian Machine Learning or Hypothesis Testing.

Keep in mind that many other statistical concepts are important too. For example, I didn't include the Central Limit Theorem, but it is still worth knowing when talking about probability distributions, so take what you'd like out of this.

With that said, here we go!

1. Bayes Theorem / Conditional Probability

Plain and simple, you need to understand Bayes Theorem and conditional probability (see below for equations). One of the most popular machine learning algorithms, Naive Bayes, is built on these two concepts. Additionally, if you enter the realm of online machine learning, you’ll most likely be using Bayesian methods.

Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)

Conditional Probability: P(A|B) = P(A and B) / P(B)

Example Question: You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?

Answer: You can tell that this question is related to Bayesian theory because of the last statement which essentially follows the structure, “What is the probability A is true given B is true?” Therefore we need to know the probability of it raining in Seattle on a given day. Let’s assume it’s 25%.

  • P(A) = probability of it raining = 25%
  • P(B) = probability that all 3 friends say it's raining
  • P(A|B) = probability that it's raining given that all 3 friends say it is
  • P(B|A) = probability that all 3 friends say it's raining given that it is raining = (2/3)³ = 8/27

Step 1: Solve for P(B)
Bayes Theorem gives P(A|B) = P(B|A) * P(A) / P(B). The denominator expands by the law of total probability:
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27

Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / (0.25*8/27 + 0.75*1/27)
P(A|B) = 8 / (8 + 3) = 8/11

Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
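
The two steps above can be checked numerically. Below is a minimal Python sketch, assuming the same 25% prior that the answer assumes; the variable names are illustrative and do not come from the original post.

```python
from fractions import Fraction

# Prior assumed in the answer above: P(A) = P(rain) = 25%.
p_rain = Fraction(1, 4)
p_truth = Fraction(2, 3)   # chance that a friend tells the truth
p_lie = 1 - p_truth        # chance that a friend lies
n_friends = 3

# P(B|A): all friends say "yes" when it is raining (all tell the truth).
p_yes_given_rain = p_truth ** n_friends
# P(B|not A): all friends say "yes" when it is dry (all lie).
p_yes_given_dry = p_lie ** n_friends

# Step 1: law of total probability for P(B).
p_yes = p_yes_given_rain * p_rain + p_yes_given_dry * (1 - p_rain)

# Step 2: Bayes Theorem for P(A|B).
posterior = p_yes_given_rain * p_rain / p_yes
print(posterior)          # 8/11
print(float(posterior))
```

Using exact fractions instead of floats keeps the result in the same 8/11 form as the hand calculation. Changing `p_rain` shows how strongly the answer depends on the assumed prior.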

For more questions like this, check out my free resource here!

2. Counting Applications

Combinations and permutations are extremely important if you're working on network security, pattern analysis, operations research, and more. Let's review each of the two:

Permutations

Definition: A permutation of n elements is any arrangement of those n elements in a definite order. There are n factorial (n!) ways to arrange n elements. Note that order matters!

The number of permutations of n things taken r-at-a-time is defined as the number of r-tuples that can be taken from n different elements and is equal to the following equation:

P(n, r) = n! / (n - r)!

Example Question: How many permutations does a license plate have with 6 digits?

Answer: [shown as an image in the original post]
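
The count depends on something the question leaves open: whether a digit may repeat on the plate. The following Python sketch computes both readings. Applying the r-at-a-time permutation formula with n = 10 digits and r = 6 positions assumes that no digit repeats, which is an assumption here and not a detail confirmed by the original post.

```python
import math

n_digits = 10     # digits 0-9
r_positions = 6   # characters on the plate

# No repeated digits: P(n, r) = n! / (n - r)!
without_repeats = math.perm(n_digits, r_positions)

# Repeated digits allowed: each position independently takes any digit.
with_repeats = n_digits ** r_positions

print(without_repeats)  # 151200
print(with_repeats)     # 1000000
```

In an interview it is worth stating which of the two readings you are using before giving a number.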

Combinations

Definition: The number of ways to choose r out of n objects where order doesn’t matter.

The number of combinations of n things taken r-at-a-time is defined as the number of subsets with r elements of a set with n elements and is equal to the following equation:

C(n, r) = n! / (r! * (n - r)!)

Example Question: How many ways can you draw 6 cards from a deck of 52 cards?

Answer: C(52, 6) = 52! / (6! * 46!) = 20,358,520
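
As a quick check, the card-drawing count can be computed with Python's standard library. This is a small sketch with illustrative variable names; it also shows how the combination count relates to the permutation count when order is ignored.

```python
import math

deck_size = 52
hand_size = 6

# Unordered draws: C(n, r) = n! / (r! * (n - r)!)
hands = math.comb(deck_size, hand_size)
print(hands)  # 20358520

# Each unordered hand corresponds to hand_size! ordered draws,
# so P(n, r) = C(n, r) * r!
ordered_draws = math.perm(deck_size, hand_size)
print(ordered_draws == hands * math.factorial(hand_size))  # True
```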

Note that these are very simple questions and that counting problems can get much more complicated than this, but the examples above should give you a good idea of how they work!

3. Probability Distributions / Confidence Interval

It's easy to get lost in the weeds with probability distributions because there are so many of them. That being said, if I had to choose five main distributions to focus on, they would be the following:

  1. Normal distribution
  2. Poisson distribution
  3. Binomial distribution
  4. Exponential distribution
  5. Uniform distribution

Example question: The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

Answer: Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean.

  • a 95% confidence interval implies a z score of 1.96
  • one standard deviation = sqrt(115) = 10.724

Therefore the confidence interval = 115 +/- 1.96 * 10.724 = 115 +/- 21.02 = [93.98, 136.02]. Since 99 is within this confidence interval, we can conclude that this change is not very noteworthy.
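
The interval check can be reproduced in a few lines of Python. This sketch assumes, as the answer does, that the earlier count of 115 is the Poisson mean and that a normal approximation is adequate; the variable names are illustrative. Using z = 1.96 exactly gives a margin of about 21.02, while rounding z up to 2 would give about 21.45. The conclusion is the same either way.

```python
import math

previous_count = 115   # treated as the Poisson mean (lambda)
current_count = 99
z_95 = 1.96            # z score for a 95% confidence interval

# For a Poisson distribution, variance = mean, so sd = sqrt(mean).
std_dev = math.sqrt(previous_count)
margin = z_95 * std_dev

lower = previous_count - margin
upper = previous_count + margin
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

# The change is "noteworthy" under this check only if 99 falls outside.
inside = lower <= current_count <= upper
print("99 inside the interval:", inside)  # True
```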

Thanks for Reading!

And that's all! I hope that this helps you in your interview prep and I wish you the best of luck in your future endeavors. After reading this, hopefully, you'll have a fundamental understanding of these three concepts. If you feel that you need to study these concepts more, check out my free data science resource that covers probability fundamentals and probability distributions.

Terence Shin

Translated from: https://towardsdatascience.com/3-statistics-concepts-you-should-know-for-data-science-interviews-54d827ec242c
