Probability Theory in Data Mining: Why Probability Theory Is Important in Data Science - CSDN Blog

Data Science

Many Data Science concepts build on the fundamentals of Probability. Popular examples include “Decision Making”, “Recommender Systems”, and “Deep Learning”. Well-known deep learning frameworks such as TensorFlow and PyTorch rely heavily on probabilistic concepts. So understanding what Probability is and how it works will take us a long way on the path to learning Data Science.

Dependence and Independence

Roughly speaking, we say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise they are independent.

For instance, if we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it’s not the case that both flips are Tails.) These two events are dependent. Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

P(E, F) = P(E) * P(F)

In the example above, the probability of “first flip Heads” is 1/2, and the probability of “both flips Tails” is 1/4, but the probability of “first flip Heads and both flips Tails” is 0. Since 0 is not equal to 1/2 * 1/4 = 1/8, these two events are dependent.
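The product rule can be checked by enumerating the four equally likely outcomes of two fair flips. The following is only a small sketch; the helper names `prob` and `independent` are made up for illustration and do not come from any library.

```python
from fractions import Fraction
from itertools import product

# All four equally likely outcomes of flipping a fair coin twice.
OUTCOMES = list(product("HT", repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate on an outcome."""
    favorable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favorable, len(OUTCOMES))


def independent(e, f):
    """True when P(E, F) equals P(E) * P(F)."""
    joint = prob(lambda outcome: e(outcome) and f(outcome))
    return joint == prob(e) * prob(f)


def first_heads(outcome):
    return outcome[0] == "H"


def second_heads(outcome):
    return outcome[1] == "H"


def both_tails(outcome):
    return outcome == ("T", "T")


print(independent(first_heads, second_heads))  # True
print(independent(first_heads, both_tails))    # False
```

Using `Fraction` keeps the arithmetic exact, so the equality test is not affected by floating-point rounding.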

Bayes’ Theorem

To understand how Bayes’ Theorem works, try to answer the question below:

Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail. How likely is Steve to be each of the following?
1. A librarian
2. A farmer

Very often, we will (irrationally) think that Steve is “most likely” a librarian. We would not think so if we took into account the ratio of farmers to librarians. For the sake of argument, let’s say it is about 20 to 1.

Among librarians, let’s say 50% fit the character traits in the question, whereas among farmers, let’s say only 10% do.

Alright, so let’s say we have 10 librarians and 200 farmers. Then 5 librarians (50% of 10) and 20 farmers (10% of 200) fit the description. The probability that Steve is a librarian, given the description, is:

5 / (5 + 20) = 1/5 = 20%

So if we guess that Steve is a librarian, we are probably wrong.

Below is the formula of Bayes’ theorem.

P(H|E) = P(H)*P(E|H) / P(E)

where:

P(H) = probability that the hypothesis is true, before seeing any evidence
P(E|H) = probability of seeing the evidence if the hypothesis is true
P(E) = probability of seeing the evidence
P(H|E) = probability that the hypothesis is true, given the evidence
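As a rough illustration, the formula can be applied to the Steve question. The sketch below assumes the population consists only of the 10 librarians and 200 farmers from the example, and it expands P(E) with the law of total probability; the function name `posterior` and its parameters are hypothetical.

```python
def posterior(prior_h, likelihood_h, prior_not_h, likelihood_not_h):
    """P(H|E) from Bayes' theorem when H and not-H are the only hypotheses."""
    # P(E) = P(H) * P(E|H) + P(not H) * P(E|not H)
    evidence = prior_h * likelihood_h + prior_not_h * likelihood_not_h
    return prior_h * likelihood_h / evidence


# H = "Steve is a librarian", E = "Steve fits the description".
# Assumed population: 10 librarians and 200 farmers, nobody else.
p_librarian = posterior(
    prior_h=10 / 210,       # P(H): share of librarians
    likelihood_h=0.5,       # P(E|H): librarians who fit the description
    prior_not_h=200 / 210,  # P(not H): share of farmers
    likelihood_not_h=0.1,   # P(E|not H): farmers who fit the description
)
print(round(p_librarian, 3))  # 0.2
```

The result matches the 5 / (5 + 20) count from the example: the strong prior in favor of farmers outweighs the fact that the description fits librarians better.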

Random Variable

A random variable is a variable whose possible values have an associated probability distribution. A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails. A more complicated one might measure the number of heads observed when flipping a coin 10 times, or a value picked from range(10) where each number is equally likely. The associated distribution gives the probabilities that the variable realizes each of its possible values. The coin flip variable equals 0 with probability 0.5 and 1 with probability 0.5. The range(10) variable has a distribution that assigns probability 0.1 to each of the numbers from 0 to 9.
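The three random variables mentioned above can be written down as explicit distributions, here as dictionaries mapping each value to its probability. This is only a sketch: the function names are invented for illustration, and the count of heads in 10 fair flips uses the standard binomial formula, which the text does not spell out.

```python
from fractions import Fraction
from math import comb


def coin_flip_distribution():
    """1 for heads, 0 for tails, each with probability 1/2."""
    return {0: Fraction(1, 2), 1: Fraction(1, 2)}


def range10_distribution():
    """A value picked from range(10), each with probability 1/10."""
    return {value: Fraction(1, 10) for value in range(10)}


def heads_count_distribution(n=10):
    """Number of heads in n fair flips: P(k heads) = C(n, k) / 2**n."""
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}


# Every distribution must assign a total probability of exactly 1.
for dist in (coin_flip_distribution(), range10_distribution(),
             heads_count_distribution()):
    assert sum(dist.values()) == 1

print(heads_count_distribution()[5])  # 63/256
```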

Continuous Distributions

Often we’ll want to model distributions across a continuum of outcomes. (For our purposes, these outcomes will always be real numbers, although that’s not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1. Because there are infinitely many numbers between 0 and 1, this means that the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval.

The density function for the uniform distribution could be implemented in Python like this:

def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1)."""
    return 1 if 0 <= x < 1 else 0
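To see how a density turns into a probability, the integral over an interval can be approximated numerically. The sketch below uses a midpoint Riemann sum, one simple approach among many; `interval_probability` and its `steps` parameter are hypothetical names, and the density is repeated so the snippet runs on its own.

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1)."""
    return 1 if 0 <= x < 1 else 0


def interval_probability(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf over [a, b] with a midpoint sum."""
    width = (b - a) / steps
    # Evaluate the density at the midpoint of each small sub-interval.
    total = sum(pdf(a + (i + 0.5) * width) for i in range(steps))
    return total * width


# Probability that a uniform value lands between 0.2 and 0.5.
print(round(interval_probability(uniform_pdf, 0.2, 0.5), 6))  # 0.3
```

Shrinking the interval toward a single point drives the result toward zero, which is why individual points carry no probability.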

The cumulative distribution function (cdf), which gives the probability that the variable is less than or equal to a given value, can be written in a similar way:

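def uniform_cdf(x):
    """Probability that a uniform random variable is <= x."""
    if x < 0:
        return 0  # the variable is never less than 0
    elif x < 1:
        return x  # e.g. P(X <= 0.4) = 0.4
    else:
        return 1  # the variable is always less than 1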
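One common use of a cdf is to get the probability of an interval by subtraction, with no integration needed. A minimal sketch, assuming the same uniform distribution; `interval_probability_from_cdf` is a made-up helper name, and the cdf is repeated so the snippet is self-contained.

```python
def uniform_cdf(x):
    """Probability that a uniform random variable is <= x."""
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1


def interval_probability_from_cdf(cdf, a, b):
    """P(a < X <= b) as the difference of two cdf values."""
    return cdf(b) - cdf(a)


# Probability that a uniform value lands between 0.2 and 0.5 (about 0.3).
print(interval_probability_from_cdf(uniform_cdf, 0.2, 0.5))
```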

Conclusion

Probability is interesting, but it takes a lot of learning. There is much about Probability that I did not cover in this post, such as the Normal Distribution, the Central Limit Theorem, Markov Chains, and the Poisson process. So take your time to find out more about it.
