Welcome to the World of Data

weixin_26728245

Original article: https://towardsdatascience.com/welcome-to-the-world-of-data-416d03175df0

For most of us, the most dreaded part of Data Science and Machine learning is the math and statistics involved in it.

If you’re a scientist, and you have to have an answer, even in the absence of data, you’re not going to be a good scientist.

- Neil deGrasse Tyson

Everyone has their own way of developing a love for data and data science. For me, understanding the basics worked like magic. Once I had mastered basic concepts such as types of data, distributions, and the shapes of distributions, it was reasonably easy to take a deeper dive into advanced concepts.

Let’s break it down.

The main input to a data science project is observations, in other words, “feature values”. These feature values (also called variables) can be quantitative or qualitative.

If your anxiety level went up just by reading these two terms, and you won’t move forward until you have seen all the tentacles of quantitative and qualitative data, take a look at the figure below.

Don’t be too hard on yourself. Let’s understand these two types.

1. Quantitative/Numerical Data

If you can add, subtract, multiply, and divide the data, it is quantitative. Numerical data is further divided into:

Continuous Data: Measurable data that can take any value within a range. Ex: time in a race, income of a person, age of a person, etc. Time in a race can be any value and can be expressed in hours, minutes, or even days; it is not restricted to a fixed set of values.

Discrete Data: Countable data that can take only certain values, typically integers. Ex: result of rolling a die, number of students in a class, petals of a flower. If you roll a die, you can get 1, 2, 3, and so on up to a maximum of 6. There are finitely many possibilities.

1.1 Continuous Data

If you are going to work for enterprises such as financial institutions or retailers, chances are that you will spend most of your data science life with continuous data. As the name suggests, it is like water: just as water can flow anywhere, continuous data can take any value within its range.

To understand continuous data, you will have to answer the questions below.

What is the mean of the data?
How scattered are the data values? i.e. the variance.
How is the data distributed around the mean?
Are there any outliers? The standard deviation helps you spot values that lie unusually far from the mean.

Although I don’t want to scare you with formulas, it doesn’t hurt to scratch the surface.

How is the Mean of Continuous Data Distribution Calculated?

The mean is the sum of all values divided by the number of values: mean = (x₁ + x₂ + … + xₙ) / n.

How is the Variance of Continuous Data Distribution Calculated?

Variance is calculated as the average of the squared differences between the individual values and the mean. For a sample rather than a full population, the sum of squared differences is divided by n − 1 instead of n.

How is the Standard Deviation of Continuous Data Distribution Calculated?

The standard deviation is the square root of the variance.
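
To make these three measures concrete, here is a minimal sketch using only Python's standard library. The race times are made-up illustrative values, and the population versions of the formulas (dividing by n) are an assumption; for a sample you would divide by n − 1.

```python
import math
import statistics

# Hypothetical race times in minutes (made-up data for illustration)
times = [42.0, 45.5, 39.8, 50.2, 44.1]

n = len(times)

# Mean: sum of the values divided by how many there are
mean = sum(times) / n

# Population variance: average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in times) / n

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

# Cross-check the hand-written formulas against the standard library
assert math.isclose(mean, statistics.fmean(times))
assert math.isclose(variance, statistics.pvariance(times))
assert math.isclose(std_dev, statistics.pstdev(times))

print(f"mean={mean:.2f} variance={variance:.2f} std_dev={std_dev:.2f}")
```

The standard deviation is in the same unit as the data (minutes here), which is usually why it is easier to interpret than the variance.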

Continuous Data Distribution:

Now that you understand how to measure specific details like Mean, Variance, and Standard Deviation of continuous data, let’s understand the nature of its distribution.

Continuous data often follows one of the distributions below.

Normal Distribution
t-Distribution

1.1.1 Normal Distribution

Many of the things around us follow a Normal Distribution.

Strange!!

How about this: if you take the heights of people in your country, build a table of height ranges with the count of people in each range, and plot it, you will get a normal distribution, and the plot will look like a bell curve.

You might be thinking, this is not possible.

It looks strange, but it’s true. A lot of other things in nature, e.g. blood pressure, IQ, shoe size, birth weight, and to an extent stock market technicals, follow this bell-curve shape, where data centers around the mean and spreads roughly symmetrically on either side of it.

While we are talking about symmetric spread, you should also know about skewness, which measures how asymmetric a data distribution is.

A normal distribution has 0 skewness, so a sample of normally distributed data will have skewness close to 0.
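
As a rough illustration, one common definition of skewness is the Fisher-Pearson moment coefficient: the average of the cubed standardized deviations. Whether this is the exact variant the article had in mind is an assumption, and the `skewness` helper and the simulated data below are hypothetical.

```python
import random

def skewness(values):
    """Fisher-Pearson moment coefficient: mean of cubed standardized deviations."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mean) / std) ** 3 for x in values) / n

random.seed(0)  # fixed seed so the run is repeatable

# Simulated heights in cm drawn from a normal distribution (symmetric)
normal_sample = [random.gauss(170, 10) for _ in range(10_000)]

# Simulated waiting times drawn from an exponential distribution (right-skewed)
skewed_sample = [random.expovariate(1.0) for _ in range(10_000)]

print("normal sample skewness:", skewness(normal_sample))
print("skewed sample skewness:", skewness(skewed_sample))
```

The first value should come out close to 0, while the second should be clearly positive, which signals a long tail on the right.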

You will probably never need it, but in case you do, the equation for plotting this graph is f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation.
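
If you would like to see the bell curve equation in action, the sketch below evaluates the normal probability density function by hand and compares it with `statistics.NormalDist` from the standard library. The mean of 170 and standard deviation of 10 are made-up values for a height example.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

mu, sigma = 170.0, 10.0  # hypothetical mean and standard deviation (cm)

for x in (150, 160, 170, 180, 190):
    by_hand = normal_pdf(x, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(x)
    print(x, round(by_hand, 5), round(by_library, 5))
```

The density peaks at the mean and takes the same value at equal distances on either side of it, which is the symmetry described above.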

The following are key characteristics of normal distribution.

The mean, median, and mode of the data population are the same.
Most of the data points are centered around the mean.
Data points are scattered around the mean in a symmetrical manner.

If you are still reading this article (I hope you are!), by now you must be wondering why you need to understand the data distribution at all.

The answer is one word: generalization.

As a data scientist, you can expect a lot of junk data, outliers, etc. coming your way, and you will be hard pressed to make sense of this data and predict the next course of action based on it.

If you understand the overall nature of data distribution you could get rid of outliers and unwanted data and make sense of information.

Remember this: “There is no chaos in the Universe!”

Data distributions follow patterns. Apart from decision trees, many machine learning models work best when features with continuous data are roughly normally distributed. You may come across features whose raw values do not follow a Normal Distribution, but applying a function such as log to the values can bring them close to one.
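
The effect of a log transform can be sketched as follows. The income values are simulated from a log-normal distribution, which is an assumption chosen because it is strongly right-skewed, and `skewness` is a hypothetical helper.

```python
import math
import random

def skewness(values):
    """Mean of cubed standardized deviations (0 for perfectly symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mean) / std) ** 3 for x in values) / n

random.seed(1)  # fixed seed so the run is repeatable

# Simulated incomes: log-normal, so a few very large values stretch the right tail
incomes = [random.lognormvariate(10, 0.8) for _ in range(5_000)]

# Taking the log pulls in the long right tail
log_incomes = [math.log(x) for x in incomes]

print("skewness before log:", skewness(incomes))
print("skewness after log: ", skewness(log_incomes))
```

The transform works this well here only because the simulated data is log-normal by construction; on real data, check the shape after transforming rather than assuming it.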

Statisticians are fond of the normal distribution. Some statisticians will try to fit any set of continuous observations to a normal distribution. Some believe that if a data population doesn’t follow a normal distribution, it means we don’t have enough observations.

No discussion of the normal distribution is complete without mentioning the z score. The z score indicates how far a specific data value is from the mean of the data population, measured in standard deviations: z = (x − μ) / σ, where μ is the population mean and σ is its standard deviation.
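
Here is a minimal sketch of computing z scores and using them to flag a possible outlier. The heights are made-up values, and the cutoff of 2 standard deviations is an assumption chosen for this tiny data set (3 is a common choice for larger ones).

```python
import statistics

# Hypothetical heights in cm; 210 is deliberately far from the rest
heights = [150, 160, 165, 170, 175, 180, 210]

mean = statistics.fmean(heights)
std = statistics.pstdev(heights)  # population standard deviation

# z score: distance from the mean in units of standard deviation
z_scores = [(x - mean) / std for x in heights]

# Flag values more than 2 standard deviations away from the mean
outliers = [x for x, z in zip(heights, z_scores) if abs(z) > 2]

for x, z in zip(heights, z_scores):
    print(f"{x}: z = {z:+.2f}")
print("possible outliers:", outliers)
```

By construction, the z scores themselves have mean 0 and standard deviation 1.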

If you calculate the z score of each data point in a normally distributed population and plot the distribution of those z scores, you get a bell curve centered at 0, with the horizontal axis measured in standard deviations.

This is called the Standard Normal Distribution. Its key characteristics are:

It follows a Normal distribution.
The mean, median, and mode are all 0, and the standard deviation is 1.
68.27% of the data lies within 1 standard deviation of the mean, 95.45% within 2 standard deviations, and 99.73% within 3 standard deviations.
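
These percentages can be reproduced from the cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the standard library; the `shares` dictionary is just an illustrative name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the data within k standard deviations of the mean:
# P(-k < Z < k) = cdf(k) - cdf(-k)
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```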

The z score helps you find outliers, test a null hypothesis (via the p value), and perform backward elimination during feature engineering.

Example: at the 5% significance level (two-tailed), if the z score is less than −1.96 or greater than 1.96, reject the null hypothesis.
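
The 1.96 cutoff corresponds to a two-tailed p value of about 0.05. A minimal sketch of that relationship, where the function names `two_tailed_p_value` and `reject_null` are hypothetical and a two-tailed test at the 5% level is assumed:

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Probability of a z score at least this extreme in either tail."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def reject_null(z, alpha=0.05):
    """Reject the null hypothesis when the p value falls below alpha."""
    return two_tailed_p_value(z) < alpha

for z in (-2.5, -1.0, 1.0, 1.96, 2.5):
    print(f"z = {z:+.2f}  p = {two_tailed_p_value(z):.4f}  reject: {reject_null(z)}")
```

A different significance level simply moves the cutoff; for example, `alpha=0.01` corresponds to roughly ±2.58.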

Before I conclude my favorite topic, the Normal Distribution, let me tell you about the Central Limit Theorem (CLT).

As per the central limit theorem, if you take many samples from a data population, calculate the mean of each sample, and plot the frequency of those means, the plot will look like a normal distribution. The larger each sample is, the better the distribution of means aligns with a normal distribution. This holds true even if the overall data population from which the samples are drawn does not follow a normal distribution.
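
A small simulation can make the theorem tangible. This is a sketch under assumptions: the population is exponential (clearly not normal, with mean 1 and standard deviation 1), and the sample size of 40 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 40     # observations per sample
NUM_SAMPLES = 2_000  # how many samples to draw

# Draw many samples from a skewed (exponential) population and keep each mean
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.pstdev(sample_means)

# For a normal distribution, about 68% of values lie within 1 standard deviation
within_one_sd = sum(
    abs(m - mean_of_means) <= spread_of_means for m in sample_means
) / NUM_SAMPLES

print("mean of sample means:", round(mean_of_means, 3))
print("spread of sample means:", round(spread_of_means, 3))
print("share within 1 standard deviation:", round(within_one_sd, 3))
```

The means cluster around the population mean of 1, and their spread is close to the population standard deviation divided by the square root of the sample size (1/√40 ≈ 0.158).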

Isn’t this strange?!

This article is getting long. Let’s conclude the Normal Distribution and move on to the t-distribution.

1.1.2 t-distribution

Now that you understand Normal Distribution and CLT, it’s time to go over t-distribution.

As per the CLT, the sample mean follows an approximately normal distribution as long as the sample size is sufficiently large (a common rule of thumb is at least 30 observations). So, if you know the standard deviation of the data population, you can compute a z score and use the normal distribution to evaluate probabilities for the sample mean.

What if sample sizes are small and you do not know the standard deviation of the population? When data scientists encounter such constraints, they rely on the t-distribution. The t statistic is calculated as t = (x̄ − μ) / (s / √n), where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
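
As a sketch of how this plays out with a small sample, the snippet below computes a one-sample t statistic with the standard library. The measurements and the hypothesized mean are made up, and the critical value 2.262 is taken from a standard t table for 9 degrees of freedom at the two-tailed 5% level.

```python
import math
import statistics

# Hypothetical small sample (n = 10) and a hypothesized population mean
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 4.8, 5.5, 5.9]
hypothesized_mean = 5.0

n = len(sample)
sample_mean = statistics.fmean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# t statistic: distance between sample mean and hypothesized mean,
# measured in estimated standard errors
t_stat = (sample_mean - hypothesized_mean) / (sample_std / math.sqrt(n))
degrees_of_freedom = n - 1

# Two-tailed 5% critical value for 9 degrees of freedom, from a t table
T_CRITICAL_DF9 = 2.262
reject = abs(t_stat) > T_CRITICAL_DF9

print(f"t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom, reject: {reject}")
```

Note that the cutoff is wider than the 1.96 used with the z score: with few observations, the t-distribution has heavier tails to account for the extra uncertainty in the estimated standard deviation.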

Data scientists use the t-distribution to analyze data sets where they cannot use the normal distribution. The underlying data population should still be approximately normal.

As a data scientist, you will use t-distribution in one of the following situations.

You have more than 10 but fewer than 30 observations. With fewer than 30 observations, the sample is too small to rely on the normal approximation.
Quite often you will come across situations where you have many millions of records to work on and you do not know the spread (standard deviation) of the data. In such a case, you first draw a few samples of the data (all with the same sample size) and calculate their mean, median, mode, variance, and standard deviation. From these sample values, you then estimate the corresponding values for the complete population.
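
The second situation can be sketched as follows. The population here is simulated (200,000 values with a true mean of 50 and standard deviation of 12), and the sample size of 25 and the count of 20 samples are arbitrary assumptions; in practice the true values would be unknown.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Stand-in for a very large data set (simulated, hypothetical values)
population = [random.gauss(50, 12) for _ in range(200_000)]

SAMPLE_SIZE = 25
NUM_SAMPLES = 20

# Draw several samples, all with the same sample size
samples = [random.sample(population, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# Summarize each sample
sample_means = [statistics.fmean(s) for s in samples]
sample_stds = [statistics.stdev(s) for s in samples]

# Rough estimates for the full population, based only on the samples
estimated_mean = statistics.fmean(sample_means)
estimated_std = statistics.fmean(sample_stds)

print("estimated mean:", round(estimated_mean, 2))
print("estimated standard deviation:", round(estimated_std, 2))
```

Averaging the sample standard deviations gives only a rough estimate that tends to run slightly low; it is used here to keep the sketch short.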

By the way, the t-distribution is also called Student’s t-distribution. However, the name has nothing to do with students using these statistics. Read the history behind this at the link below.

If you want to play around with some of these distributions in Excel, the following link contains interactive Excel templates you can use.

2. Qualitative/Categorical

Categorical data doesn’t hold mathematical significance, because mathematical operations such as addition, subtraction, multiplication, and division cannot be performed on it. For example, the province of Canada a person lives in is a categorical variable. You cannot compare provinces the way you compare numbers. Categorical data can be further divided into:

Binomial Data
Nominal Data
Ordinal Data
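
The article does not define these three types, so the sketch below relies on their usual meanings as an assumption: binomial data has exactly two categories, nominal data has categories with no natural order, and ordinal data has categories with a meaningful order. The helper names and example values are hypothetical.

```python
# Binomial: exactly two categories, so a single 0/1 flag is enough
def encode_binomial(value, positive="yes"):
    return 1 if value == positive else 0

# Nominal: no natural order, so use one indicator per category (one-hot)
def encode_one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

# Ordinal: categories have an order, so their position carries meaning
SATISFACTION_ORDER = ["low", "medium", "high"]

def encode_ordinal(value, order=SATISFACTION_ORDER):
    return order.index(value)

provinces = ["Alberta", "Ontario", "Quebec"]  # a subset, for illustration

print(encode_binomial("yes"))                # 1
print(encode_one_hot("Ontario", provinces))  # [0, 1, 0]
print(encode_ordinal("medium"))              # 1
```

One-hot encoding avoids implying that one province is "greater" than another, which a plain numeric code would do.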

Unfortunately, I need to conclude this article now. I am in love with understanding data, and I could go on and on about it. But too long an article means rejection by publishers :(

If you are an aspiring data scientist, make sure you develop your love for data. Love blooms through understanding, so spend the time required to understand the data and its nature.

Reference:
