
Data Distribution and Its Parameters

As an ML engineer, it is important to know statistics well enough to handle any data set and make it talk the way you want it to 😎.

The basics start with data, its distributions, and their parameters. You probably already know about data and why it matters, so today let's look at the distribution parameters, which matter just as much.

Data Distribution

A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically. (Page 6, Statistics in Plain English, Third Edition, 2010.)

From a practical perspective, we can think of a distribution as a function that describes the relationship between observations in a sample space.

Density functions: Distributions are often described in terms of their density functions. A density function describes how the proportion of data, or the likelihood of observations, changes over the range of the distribution. Distributions are broadly divided into two types: continuous and discrete.

Continuous distributions are described by two functions, the PDF and the CDF.

PDF: Probability Density Function

Gives the density, or relative likelihood, of observing a given value. For a continuous variable the probability of any single exact value is zero, so probabilities come from the area under the PDF over an interval.

Can be used to calculate the likelihood of a given observation in a distribution.

It can also be used to summarise the likelihood of observations across the distribution's sample space.

Plots of the PDF show the familiar shape of a distribution, such as the bell curve for the Gaussian distribution.
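To make the PDF points concrete, here is a minimal sketch using only Python's standard library. The standard normal distribution (mean 0, standard deviation 1) and the helper name `interval_probability` are illustrative assumptions, not part of the original article. It shows that the PDF returns a density at a point, and that a probability only appears once the density is integrated over an interval.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (illustrative choice).
dist = NormalDist(mu=0.0, sigma=1.0)

# The PDF returns a density at a point, not a probability.
density_at_zero = dist.pdf(0.0)


def interval_probability(d, a, b, steps=10_000):
    """Approximate P(a <= X <= b) by integrating the PDF with the midpoint rule."""
    width = (b - a) / steps
    return sum(d.pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# Area under the bell curve within one standard deviation of the mean.
prob_within_one_sd = interval_probability(dist, -1.0, 1.0)

print(f"density at 0: {density_at_zero:.4f}")
print(f"P(-1 <= X <= 1) is about {prob_within_one_sd:.4f}")
```

The second number is the familiar "about 68% of values lie within one standard deviation" rule for the Gaussian bell curve.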

CDF: Cumulative Distribution Function

Calculates the probability of an observation being less than or equal to a given value.

Rather than calculating the likelihood of a given observation as the PDF does, the CDF calculates the cumulative likelihood for the observation and all smaller values in the sample space.

It lets you quickly see how much of the distribution lies before and after a given value.

A CDF is often plotted as a curve that rises from 0 to 1 across the distribution.
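As a small illustration of reading "how much of the distribution lies before and after a value" from the CDF, the sketch below again assumes a standard normal distribution; the variable names are hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution (illustrative choice).
dist = NormalDist(mu=0.0, sigma=1.0)

# Share of the distribution at or below the mean.
below_mean = dist.cdf(0.0)

# Share of the distribution above 1: everything not at or below 1.
above_one = 1.0 - dist.cdf(1.0)

# Probability of landing between -1 and 1: a difference of two CDF values.
between = dist.cdf(1.0) - dist.cdf(-1.0)

print(f"P(X <= 0)       = {below_mean:.4f}")
print(f"P(X > 1)        = {above_one:.4f}")
print(f"P(-1 < X <= 1)  = {between:.4f}")
```

Interval probabilities fall out as differences of CDF values, so no integration is needed once the CDF is available.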

Discrete distributions are described by two functions, the PMF and the CDF.

PMF: Probability Mass Function

Characterises the distribution of a discrete random variable.

Plays the same role as the PDF, but for a discrete variable: it gives the actual probability of each possible value, and those probabilities sum to 1.
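A PMF can be sketched with a binomial example. The choice of a binomial distribution with 10 fair coin flips is an assumption made for illustration, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative setup: number of heads in 10 fair coin flips.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(X = {k:2d}) = {prob:.4f}")

# Unlike a density, PMF values are probabilities and add up to 1.
total = sum(pmf)
print(f"sum of all probabilities: {total:.4f}")
```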

CDF: Cumulative Distribution Function

This works the same way as the continuous CDF, except that the variable is discrete, so the CDF is a running sum of the PMF and rises in steps.
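For a discrete variable the CDF can be built by accumulating the PMF. The sketch below assumes a fair six-sided die as the example and uses exact fractions so the steps are easy to read.

```python
from fractions import Fraction
from itertools import accumulate

# Illustrative discrete variable: one roll of a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * len(faces)

# The discrete CDF is the running total of the PMF.
cdf = list(accumulate(pmf))
cdf_by_face = dict(zip(faces, cdf))

for face in faces:
    print(f"P(X <= {face}) = {cdf_by_face[face]}")
```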

Applications of Probability Distribution Functions and Cumulative Distribution Functions

  1. To calculate confidence intervals for parameters (parameters are covered below) and to calculate critical regions for hypothesis tests.

  2. For univariate data, it is often useful to determine a reasonable distributional model for the data.

  3. Identifying the distribution type helps us decide which further statistical or machine learning algorithms can be applied. Statistical intervals and hypothesis tests are often based on specific distributional assumptions. Before computing an interval or test based on a distributional assumption, we need to verify that the assumption is justified for the given data set. In this case, the distribution does not need to be the best-fitting distribution for the data, only an adequate enough model that the statistical technique yields valid conclusions.

  4. By assuming a random variable follows an established probability distribution, we can use its derived PMF/PDF and established principles to answer questions we have about the data.
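The first application, a confidence interval for a parameter, can be sketched as follows. This is a normal-approximation (z) interval for a mean; the sample values are made up for illustration, and for a sample this small a t-based interval would usually be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Made-up measurements, used only to illustrate the calculation.
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.1, 5.0]
confidence = 0.95

# Critical value from the inverse CDF of the standard normal distribution.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Point estimate of the mean and the margin of error around it.
center = mean(sample)
margin = z * stdev(sample) / sqrt(len(sample))
lower, upper = center - margin, center + margin

print(f"z critical value: {z:.3f}")
print(f"{confidence:.0%} interval for the mean: ({lower:.3f}, {upper:.3f})")
```

The interval is only as trustworthy as the distributional assumption behind it, which is the point made in item 3 of the list.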

Distribution Parameters

How do you determine the best distribution for a data set or a variable?

The distribution's parameters define the distribution.

Statistical techniques are used to estimate the parameters of the various distributions.

Distribution fitting involves estimating the parameters that define a candidate distribution. Four kinds of parameters are primarily used in distribution fitting:

  • Location (mean, median, mode): The location parameter of a distribution indicates where the distribution lies along the x-axis (the horizontal axis).

  • Scale (standard deviation): The scale parameter of a distribution determines how much spread there is in the distribution.

  • Shape: The shape parameter of a distribution allows the distribution to take different shapes.

  • Threshold: The threshold parameter of a distribution defines the minimum value of the distribution along the x-axis.

Not every distribution has all four parameters. For example, the normal distribution has only two: location (the mean) and scale (the standard deviation). These two parameters completely define the normal distribution.
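Because the normal distribution is fully defined by its location and scale, fitting it amounts to estimating those two numbers from data. The sketch below generates synthetic data with a known mean of 5 and standard deviation of 2 (both arbitrary choices for the demonstration) and checks that the fit roughly recovers them.

```python
import random
from statistics import NormalDist

# Synthetic data with known parameters, so the estimates can be compared to the truth.
random.seed(0)
true_location, true_scale = 5.0, 2.0
data = [random.gauss(true_location, true_scale) for _ in range(5000)]

# Fitting a normal distribution: estimate the mean and standard deviation from the data.
fitted = NormalDist.from_samples(data)

print(f"estimated location (mean):  {fitted.mean:.3f}")
print(f"estimated scale (std dev):  {fitted.stdev:.3f}")
```

With real data the true parameters are unknown, so the fitted values should be followed by a check that the normal model is adequate for the data set.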

Figure 1: Normal Distribution with Different Locations

The location parameter of a distribution indicates where the distribution lies along the x-axis (the horizontal axis). Figure 1 shows two normal distributions with different location values. The blue distribution has a location of 5 and the orange distribution has a location of 10. Both have the same standard deviation (the scale, in parameter terms).
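The effect of the location parameter can be checked numerically. The locations 5 and 10 come from Figure 1; the shared standard deviation of 1 is an assumption, since the text does not state its value.

```python
from math import isclose
from statistics import NormalDist

# Locations from Figure 1; the common scale of 1.0 is assumed for illustration.
blue = NormalDist(mu=5.0, sigma=1.0)
orange = NormalDist(mu=10.0, sigma=1.0)

# Moving the location slides the curve along the x-axis without changing its shape:
# the orange density at x + 5 matches the blue density at x.
shift = orange.mean - blue.mean
for x in (3.0, 5.0, 6.5):
    same = isclose(blue.pdf(x), orange.pdf(x + shift))
    print(f"blue.pdf({x}) = {blue.pdf(x):.4f}, orange.pdf({x + shift}) = {orange.pdf(x + shift):.4f}, equal: {same}")
```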

Figure 2: Logistic Distribution with Different Scale Parameters

The scale parameter of a distribution determines how much spread there is in the distribution. The larger the scale parameter, the more spread out the distribution is, and vice versa. Figure 2 shows the logistic distribution with three different scale parameters: 2, 5, and 8. The location for all three curves is 0.
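The sketch below evaluates the logistic density for the three scale values from Figure 2, with location 0. The function `logistic_pdf` is a hand-written helper based on the standard logistic density formula, not code from the original article.

```python
from math import exp, pi, sqrt


def logistic_pdf(x, location=0.0, scale=1.0):
    """Density of the logistic distribution at x."""
    z = (x - location) / scale
    return exp(-z) / (scale * (1.0 + exp(-z)) ** 2)


# Scale values from Figure 2; all curves share location 0.
summary = {}
for scale in (2, 5, 8):
    peak = logistic_pdf(0.0, location=0.0, scale=scale)  # height at the centre
    std_dev = scale * pi / sqrt(3)  # standard deviation of a logistic distribution
    summary[scale] = (peak, std_dev)
    print(f"scale={scale}: peak height={peak:.4f}, standard deviation={std_dev:.2f}")
```

A larger scale gives a lower peak and a larger standard deviation: the same total probability is spread over a wider range.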

Figure 3: Gamma Distribution with Different Shape Parameters

The shape parameter of a distribution allows the distribution to take different shapes. The two distributions above, the normal and the logistic, do not have a shape parameter. For the gamma distribution, a small shape parameter gives a strongly right-skewed curve; as the shape parameter grows, the peak moves to the right and the curve becomes more symmetric.

Figure 3 shows how changing the shape parameter impacts the gamma distribution. The scale parameter for the gamma distribution in Figure 3 is 2. The two-parameter gamma distribution shown here does not have a location parameter.
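To see the shape parameter at work, the sketch below uses the gamma density with the scale of 2 from Figure 3. The shape values 1, 2, and 4 are illustrative picks (the figure's actual values are not stated in the text), and `gamma_pdf` is a hand-written helper based on the standard formula.

```python
from math import exp, gamma, sqrt


def gamma_pdf(x, shape, scale):
    """Density of the two-parameter gamma distribution (taken as 0 at and below zero for simplicity)."""
    if x <= 0:
        return 0.0
    return x ** (shape - 1) * exp(-x / scale) / (gamma(shape) * scale**shape)


scale = 2  # scale used in Figure 3
skewness_by_shape = {}
for shape in (1, 2, 4):  # illustrative shape values, not taken from the figure
    skewness = 2 / sqrt(shape)  # skewness of a gamma distribution depends only on shape
    mode = (shape - 1) * scale  # location of the peak, valid for shape >= 1
    skewness_by_shape[shape] = skewness
    print(f"shape={shape}: peak at x={mode}, skewness={skewness:.3f}, pdf(2.0)={gamma_pdf(2.0, shape, scale):.4f}")
```

The skewness stays positive for every shape value but shrinks as the shape grows, so the curve is always skewed to the right, only less and less so.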

Figure 4: Gamma Distribution with Different Threshold Values

The threshold parameter of a distribution defines the minimum value of the distribution along the x-axis.

The distribution cannot have any values below this threshold.

Figure 4 shows the gamma distribution with three different threshold values: 3, 6, and 9. The scale and shape parameters are both 2.
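A threshold can be sketched as a shift of the gamma density: the curve starts at the threshold and is zero below it. The parameter values match Figure 4 (shape 2, scale 2, thresholds 3, 6, and 9); the helper `shifted_gamma_pdf` is a hypothetical name written for this illustration.

```python
from math import exp, gamma


def shifted_gamma_pdf(x, shape, scale, threshold):
    """Gamma density moved right by `threshold`; zero at and below the threshold."""
    y = x - threshold
    if y <= 0:
        return 0.0
    return y ** (shape - 1) * exp(-y / scale) / (gamma(shape) * scale**shape)


shape, scale = 2, 2  # values from Figure 4
for threshold in (3, 6, 9):
    below = shifted_gamma_pdf(threshold - 1, shape, scale, threshold)  # no density below the threshold
    at_peak = shifted_gamma_pdf(threshold + 2, shape, scale, threshold)  # peak sits 2 units above the threshold here
    print(f"threshold={threshold}: pdf below threshold={below}, pdf at peak={at_peak:.4f}")
```

All three curves have the same height and form; only their starting point on the x-axis changes.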

Reference:

https://www.spcforexcel.com/knowledge/basic-statistics/distribution-fitting

www.itl.nist.gov

https://machinelearningmastery.com/statistical-data-distributions/

Translated from: https://medium.com/@rahulkaushik_34252/data-distribution-and-its-parameters-29521ee73026
