SciPy Tutorial: The scipy.stats Statistical Functions Library

http://blog.csdn.net/pipisorry/article/details/49515215

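Statistical functions (scipy.stats)

Python has a good package for statistical inference: the stats module in SciPy.

The stats module contains random variables for many probability distributions, which are either continuous or discrete.
Every continuous random variable is an instance of a subclass of rv_continuous, and every discrete random variable is an instance of a subclass of rv_discrete.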

This module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution is an instance of a subclass of rv_continuous (rv_discrete for discrete distributions):

rv_continuous([momtype, a, b, xtol, ...]): A generic continuous random variable class meant for subclassing.
rv_discrete([a, b, name, badvalue, ...]): A generic discrete random variable class meant for subclassing.

皮皮blog



Continuous distributions and their related functions

Continuous distributions

alpha: An alpha continuous random variable.
anglit: An anglit continuous random variable.
arcsine: An arcsine continuous random variable.
beta: A beta continuous random variable.
betaprime: A beta prime continuous random variable.
bradford: A Bradford continuous random variable.
burr: A Burr (Type III) continuous random variable.
burr12: A Burr (Type XII) continuous random variable.
cauchy: A Cauchy continuous random variable.
chi: A chi continuous random variable.
chi2: A chi-squared continuous random variable.
cosine: A cosine continuous random variable.
dgamma: A double gamma continuous random variable.
dweibull: A double Weibull continuous random variable.
erlang: An Erlang continuous random variable.
expon: An exponential continuous random variable.
exponnorm: An exponentially modified Normal continuous random variable.
exponweib: An exponentiated Weibull continuous random variable.
exponpow: An exponential power continuous random variable.
f: An F continuous random variable.
fatiguelife: A fatigue-life (Birnbaum-Saunders) continuous random variable.
fisk: A Fisk continuous random variable.
foldcauchy: A folded Cauchy continuous random variable.
foldnorm: A folded normal continuous random variable.
frechet_r: A Frechet right (or Weibull minimum) continuous random variable.
frechet_l: A Frechet left (or Weibull maximum) continuous random variable.
genlogistic: A generalized logistic continuous random variable.
gennorm: A generalized normal continuous random variable.
genpareto: A generalized Pareto continuous random variable.
genexpon: A generalized exponential continuous random variable.
genextreme: A generalized extreme value continuous random variable.
gausshyper: A Gauss hypergeometric continuous random variable.
gamma: A gamma continuous random variable.
gengamma: A generalized gamma continuous random variable.
genhalflogistic: A generalized half-logistic continuous random variable.
gilbrat: A Gilbrat continuous random variable.
gompertz: A Gompertz (or truncated Gumbel) continuous random variable.
gumbel_r: A right-skewed Gumbel continuous random variable.
gumbel_l: A left-skewed Gumbel continuous random variable.
halfcauchy: A Half-Cauchy continuous random variable.
halflogistic: A half-logistic continuous random variable.
halfnorm: A half-normal continuous random variable.
halfgennorm: The upper half of a generalized normal continuous random variable.
hypsecant: A hyperbolic secant continuous random variable.
invgamma: An inverted gamma continuous random variable.
invgauss: An inverse Gaussian continuous random variable.
invweibull: An inverted Weibull continuous random variable.
johnsonsb: A Johnson SB continuous random variable.
johnsonsu: A Johnson SU continuous random variable.
kappa4: Kappa 4 parameter distribution.
kappa3: Kappa 3 parameter distribution.
ksone: General Kolmogorov-Smirnov one-sided test.
kstwobign: Kolmogorov-Smirnov two-sided test for large N.
laplace: A Laplace continuous random variable.
levy: A Levy continuous random variable.
levy_l: A left-skewed Levy continuous random variable.
levy_stable: A Levy-stable continuous random variable.
logistic: A logistic (or Sech-squared) continuous random variable.
loggamma: A log gamma continuous random variable.
loglaplace: A log-Laplace continuous random variable.
lognorm: A lognormal continuous random variable.
lomax: A Lomax (Pareto of the second kind) continuous random variable.
maxwell: A Maxwell continuous random variable.
mielke: A Mielke’s Beta-Kappa continuous random variable.
nakagami: A Nakagami continuous random variable.
ncx2: A non-central chi-squared continuous random variable.
ncf: A non-central F distribution continuous random variable.
nct: A non-central Student’s T continuous random variable.
norm: A normal continuous random variable.
pareto: A Pareto continuous random variable.
pearson3: A Pearson type III continuous random variable.
powerlaw: A power-function continuous random variable.
powerlognorm: A power log-normal continuous random variable.
powernorm: A power normal continuous random variable.
rdist: An R-distributed continuous random variable.
reciprocal: A reciprocal continuous random variable.
rayleigh: A Rayleigh continuous random variable.
rice: A Rice continuous random variable.
recipinvgauss: A reciprocal inverse Gaussian continuous random variable.
semicircular: A semicircular continuous random variable.
skewnorm: A skew-normal random variable.
t: A Student’s T continuous random variable.
trapz: A trapezoidal continuous random variable.
triang: A triangular continuous random variable.
truncexpon: A truncated exponential continuous random variable.
truncnorm: A truncated normal continuous random variable.
tukeylambda: A Tukey-Lambda continuous random variable.
uniform: A uniform continuous random variable.
vonmises: A Von Mises continuous random variable.
vonmises_line: A Von Mises continuous random variable.
wald: A Wald continuous random variable.
weibull_min: A Frechet right (or Weibull minimum) continuous random variable.
weibull_max: A Frechet left (or Weibull maximum) continuous random variable.
wrapcauchy: A wrapped Cauchy continuous random variable.

Methods of continuous random variable objects

rvs(*args, **kwds): Random variates of given type. Draws random samples from the distribution; the size argument sets the shape of the output array.
pdf(x, *args, **kwds): Probability density function at x of the given RV. Returns the density of the distribution at each x.
logpdf(x, *args, **kwds): Log of the probability density function at x of the given RV.
cdf(x, *args, **kwds): Cumulative distribution function of the given RV. It is the integral of the probability density function, that is, the probability P(X <= x).
logcdf(x, *args, **kwds): Log of the cumulative distribution function at x of the given RV.
sf(x, *args, **kwds): Survival function (1 - cdf) at x of the given RV.
logsf(x, *args, **kwds): Log of the survival function of the given RV.
ppf(q, *args, **kwds): Percent point function (inverse of cdf) at q of the given RV. For q = 0.01, ppf returns the x for which P(X <= x) = 0.01.
isf(q, *args, **kwds): Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds): n-th order non-central moment of distribution.
stats(*args, **kwds): Some statistics of the given RV. By default it returns the mean and the variance.
entropy(*args, **kwds): Differential entropy of the RV.
expect([func, args, loc, scale, lb, ub, ...]): Calculate expected value of a function with respect to the distribution.
median(*args, **kwds): Median of the distribution.
mean(*args, **kwds): Mean of the distribution.
std(*args, **kwds): Standard deviation of the distribution.
var(*args, **kwds): Variance of the distribution.
interval(alpha, *args, **kwds): Confidence interval with equal areas around the median.
__call__(*args, **kwds): Freeze the distribution for the given arguments.
fit(data, *args, **kwds): Return MLEs for shape, location, and scale parameters from data. Fits the distribution to a set of samples and finds the parameters that best match the data. For example, stats.norm.fit(x) treats x as a sample from some normal distribution and returns the best-fitting parameters (mean, std).
fit_loc_scale(data, *args): Estimate loc and scale parameters from data using 1st and 2nd moments.
nnlf(theta, x): Return negative loglikelihood function.
[Continuous distributions]

[scipy.stats.rv_continuous]
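As a rough illustration of how these methods fit together, the sketch below freezes a standard normal distribution and calls several of them. The variable names (rv, x_back, and so on) are chosen here for illustration only.

```python
import numpy as np
from scipy import stats

# Freeze a standard normal: loc and scale are fixed once and then reused.
rv = stats.norm(loc=0.0, scale=1.0)

x = 1.0
density = rv.pdf(x)           # height of the density curve at x
prob_below = rv.cdf(x)        # P(X <= x)
prob_above = rv.sf(x)         # P(X > x) = 1 - cdf(x)
x_back = rv.ppf(prob_below)   # inverse of cdf, recovers x

mean, var = rv.stats(moments="mv")   # mean and variance
low, high = rv.interval(0.95)        # central interval holding 95% of the mass

sample = rv.rvs(size=5, random_state=0)  # five seeded random draws

print(density, prob_below, prob_above, x_back)
print(mean, var, low, high)
print(sample)
```

Freezing the distribution avoids repeating loc and scale in every call; the same methods can also be called on stats.norm directly with those arguments.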

Multivariate distributions

multivariate_normal: A multivariate normal random variable.
matrix_normal: A matrix normal random variable.
dirichlet: A Dirichlet random variable.
wishart: A Wishart random variable.
invwishart: An inverse Wishart random variable.
special_ortho_group: A matrix-valued SO(N) random variable.
ortho_group: A matrix-valued O(N) random variable.
random_correlation: A random correlation matrix.

multivariate_normal

>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> x, y = np.mgrid[-1:1:.01, -1:1:.01]
>>> pos = np.dstack((x, y))   # stack the two coordinate grids into an array of shape (200, 200, 2)
>>> rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
>>> rv.pdf(pos)  # takes a 3-D array: the last axis holds the coordinates of one point, the first two axes index the grid position

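Discrete distributions and their related functions

When a random variable can take only discrete values, its distribution is called a discrete probability distribution. For example, rolling a six-sided die can only yield the integers 1 to 6, so the resulting probability distribution is discrete.

A discrete distribution is usually described by its probability mass function (PMF). In the stats library, every random variable that describes a discrete distribution inherits from the rv_discrete class.

Defining a custom discrete distribution directly with the rv_discrete class

In stats.rv_discrete(values=(x, p)), the arguments are the values x of the random variable and their corresponding probabilities p.

Suppose we have a loaded die whose faces do not come up with equal probability. The array x below holds all possible values of the die, and the array p holds the probability of each value:
>>> from scipy import stats
>>> x = range(1, 7)
>>> p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
The following statements define the random variable for this die and call its rvs() method to roll it 20 times, producing random numbers distributed according to p:
>>> dice = stats.rv_discrete(values=(x, p))
>>> dice.rvs(size=20)
array([2, 5, 1, 2, 1, 1, 2, 4, 1, 3, 1, 1, 4, 3, 1, 1, 1, 2, 6, 4])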
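One way to sanity-check a custom distribution like this die is to compare its theoretical mean and PMF with the frequencies in a large sample. The sketch below does that; the sample size of 10000 and the seed are arbitrary choices made here.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 7)                      # faces of the die
p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)       # probability of each face
dice = stats.rv_discrete(values=(x, p))

expected_mean = dice.mean()              # sum of x * p
pmf_values = dice.pmf(x)                 # recovers the probabilities p

rolls = dice.rvs(size=10000, random_state=0)
# Empirical frequency of each face in the simulated rolls.
freq = np.array([(rolls == face).mean() for face in x])

print(expected_mean)
print(pmf_values)
print(freq)
```

With this many rolls the empirical frequencies should land close to p, with face 1 appearing about four times as often as faces 3 to 6.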

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
fs_meetsig = np.random.random(30)
fs_xk = np.sort(fs_meetsig)
fs_pk = np.ones_like(fs_xk) / len(fs_xk)
fs_rv_dist = stats.rv_discrete(name='fs_rv_dist', values=(fs_xk, fs_pk))

plt.plot(fs_xk, fs_rv_dist.cdf(fs_xk), 'b-', ms=12, mec='r', label='friend')
plt.show()

[rv_discrete Examples]

Discrete distributions

bernoulli: A Bernoulli discrete random variable.
binom: A binomial discrete random variable.
boltzmann: A Boltzmann (Truncated Discrete Exponential) random variable.
dlaplace: A Laplacian discrete random variable.
geom: A geometric discrete random variable.
hypergeom: A hypergeometric discrete random variable.
logser: A Logarithmic (Log-Series, Series) discrete random variable.
nbinom: A negative binomial discrete random variable.
planck: A Planck discrete exponential random variable.
poisson: A Poisson discrete random variable.
randint: A uniform discrete random variable.
skellam: A Skellam discrete random variable.
zipf: A Zipf discrete random variable.

Methods of discrete random variable objects

rvs(*args, **kwargs): Random variates of given type.
pmf(k, *args, **kwds): Probability mass function at k of the given RV.
logpmf(k, *args, **kwds): Log of the probability mass function at k of the given RV.
cdf(k, *args, **kwds): Cumulative distribution function of the given RV.
logcdf(k, *args, **kwds): Log of the cumulative distribution function at k of the given RV.
sf(k, *args, **kwds): Survival function (1 - cdf) at k of the given RV.
logsf(k, *args, **kwds): Log of the survival function of the given RV.
ppf(q, *args, **kwds): Percent point function (inverse of cdf) at q of the given RV.
isf(q, *args, **kwds): Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds): n-th order non-central moment of distribution.
stats(*args, **kwds): Some statistics of the given RV.
entropy(*args, **kwds): Differential entropy of the RV.
expect([func, args, loc, lb, ub, ...]): Calculate expected value of a function with respect to the distribution for discrete distribution.
median(*args, **kwds): Median of the distribution.
mean(*args, **kwds): Mean of the distribution.
std(*args, **kwds): Standard deviation of the distribution.
var(*args, **kwds): Variance of the distribution.
interval(alpha, *args, **kwds): Confidence interval with equal areas around the median.
__call__(*args, **kwds): Freeze the distribution for the given arguments.

Statistical functions

{Top-level functions of scipy.stats that can be applied to data from many distributions}

Several of these functions have a similar version in scipy.stats.mstats which work for masked arrays.

describe(a[, axis, ddof, bias, nan_policy]): Computes several descriptive statistics of the passed array.
gmean(a[, axis, dtype]): Compute the geometric mean along the specified axis.
hmean(a[, axis, dtype]): Calculates the harmonic mean along the specified axis.
kurtosis(a[, axis, fisher, bias, nan_policy]): Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis, nan_policy]): Tests whether a dataset has normal kurtosis
mode(a[, axis, nan_policy]): Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis, nan_policy]): Calculates the nth moment about the mean for a sample.
normaltest(a[, axis, nan_policy]): Tests whether a sample differs from a normal distribution.
skew(a[, axis, bias, nan_policy]): Computes the skewness of a data set.
skewtest(a[, axis, nan_policy]): Tests whether the skew is different from the normal distribution.
kstat(data[, n]): Return the nth k-statistic (1<=n<=4 so far).
kstatvar(data[, n]): Returns an unbiased estimator of the variance of the k-statistic.
tmean(a[, limits, inclusive, axis]): Compute the trimmed mean.
tvar(a[, limits, inclusive, axis, ddof]): Compute the trimmed variance
tmin(a[, lowerlimit, axis, inclusive, ...]): Compute the trimmed minimum
tmax(a[, upperlimit, axis, inclusive, ...]): Compute the trimmed maximum
tstd(a[, limits, inclusive, axis, ddof]): Compute the trimmed sample standard deviation
tsem(a[, limits, inclusive, axis, ddof]): Compute the trimmed standard error of the mean.
variation(a[, axis, nan_policy]): Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
find_repeats(arr): Find repeats and repeat counts.
trim_mean(a, proportiontocut[, axis]): Return mean of array after trimming distribution from both tails.
cumfreq(a[, numbins, defaultreallimits, weights]): Returns a cumulative frequency histogram, using the histogram function.
histogram2(*args, **kwds): histogram2 is deprecated!
histogram(*args, **kwds): histogram is deprecated!
itemfreq(a): Returns a 2-D array of item frequencies.
percentileofscore(a, score[, kind]): The percentile rank of a score relative to a list of scores.
scoreatpercentile(a, per[, limit, ...]): Calculate the score at a given percentile of the input sequence.
relfreq(a[, numbins, defaultreallimits, weights]): Returns a relative frequency histogram, using the histogram function.
binned_statistic(x, values[, statistic, ...]): Compute a binned statistic for one or more sets of data.
binned_statistic_2d(x, y, values[, ...]): Compute a bidimensional binned statistic for one or more sets of data.
binned_statistic_dd(sample, values[, ...]): Compute a multidimensional binned statistic for a set of data.
obrientransform(*args): Computes the O’Brien transform on input data (any number of arrays).
signaltonoise(*args, **kwds): signaltonoise is deprecated!
bayes_mvs(data[, alpha]): Bayesian confidence intervals for the mean, var, and std.
mvsdist(data): ‘Frozen’ distributions for mean, variance, and standard deviation of data.
sem(a[, axis, ddof, nan_policy]): Calculates the standard error of the mean (or standard error of measurement) of the values in the input array.
zmap(scores, compare[, axis, ddof]): Calculates the relative z-scores.
zscore(a[, axis, ddof]): Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
iqr(x[, axis, rng, scale, nan_policy, ...]): Compute the interquartile range of the data along the specified axis.
sigmaclip(a[, low, high]): Iterative sigma-clipping of array elements.
threshold(*args, **kwds): threshold is deprecated!
trimboth(a, proportiontocut[, axis]): Slices off a proportion of items from both ends of an array.
trim1(a, proportiontocut[, tail, axis]): Slices off a proportion from ONE end of the passed array distribution.
f_oneway(*args): Performs a 1-way ANOVA.
pearsonr(x, y): Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
spearmanr(a[, b, axis, nan_policy]): Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
pointbiserialr(x, y): Calculates a point biserial correlation coefficient and its p-value.
kendalltau(x, y[, initial_lexsort, nan_policy]): Calculates Kendall’s tau, a correlation measure for ordinal data.
linregress(x[, y]): Calculate a linear least-squares regression for two sets of measurements.
theilslopes(y[, x, alpha]): Computes the Theil-Sen estimator for a set of points (x, y).
f_value(*args, **kwds): f_value is deprecated!
ttest_1samp(a, popmean[, axis, nan_policy]): Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var, nan_policy]): Calculates the T-test for the means of two independent samples of scores.
ttest_ind_from_stats(mean1, std1, nobs1, ...): T-test for means of two independent samples from descriptive statistics.
ttest_rel(a, b[, axis, nan_policy]): Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]): Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]): Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]): Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2): Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity, alternative]): Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals): Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]): Assign ranks to data, dealing with ties appropriately.
ranksums(x, y): Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]): Calculate the Wilcoxon signed-rank test.
kruskal(*args, **kwargs): Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args): Computes the Friedman test for repeated measurements
combine_pvalues(pvalues[, method, weights]): Methods for combining the p-values of independent tests bearing upon the same hypothesis.
ss(*args, **kwds): ss is deprecated!
square_of_sums(*args, **kwds): square_of_sums is deprecated!
jarque_bera(x): Perform the Jarque-Bera goodness of fit test on sample data.
ansari(x, y): Perform the Ansari-Bradley test for equal scale parameters
bartlett(*args): Perform Bartlett’s test for equal variances
levene(*args, **kwds): Perform Levene test for equal variances.
shapiro(x[, a, reta]): Perform the Shapiro-Wilk test for normality.
anderson(x[, dist]): Anderson-Darling test for data coming from a particular distribution
anderson_ksamp(samples[, midrank]): The Anderson-Darling test for k-samples.
binom_test(x[, n, p, alternative]): Perform a test that the probability of success is p.
fligner(*args, **kwds): Perform Fligner-Killeen test for equality of variance.
median_test(*args, **kwds): Mood’s median test.
mood(x, y[, axis]): Perform Mood’s test for equal scale parameters.
boxcox(x[, lmbda, alpha]): Return a positive dataset transformed by a Box-Cox power transformation.
boxcox_normmax(x[, brack, method]): Compute optimal Box-Cox transform parameter for input data.
boxcox_llf(lmb, data): The boxcox log-likelihood function.
entropy(pk[, qk, base]): Calculate the entropy of a distribution for given probability values.
chisqprob(*args, **kwds): chisqprob is deprecated!
betai(*args, **kwds): betai is deprecated!

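The describe function

The raw output of this function is hard to read. Note also that np.vstack([age, fat_percent]).reshape([-1, 2]) in the code below does not put age and fat_percent into separate columns: the 36 values are refilled row by row, so each column mixes both variables, which is why minmax reports 7.8 as the minimum of the first column. np.column_stack([age, fat_percent]) gives one column per variable.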

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat_percent = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age = np.array(age)
fat_percent = np.array(fat_percent)
data = np.vstack([age, fat_percent]).reshape([-1, 2])

print(stats.describe(data))
DescribeResult(nobs=18, minmax=(array([  7.8,  17.8]), array([ 60.,  61.])), mean=array([ 37.36111111,  37.86666667]), variance=array([ 236.58604575,  188.78588235]), skewness=array([-0.30733374,  0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))

A more readable way to print the result:

for key, value in stats.describe(data)._asdict().items():
    print(key, ':', value)
nobs : 18
minmax : (array([  7.8,  17.8]), array([ 60.,  61.]))
mean : [ 37.36111111  37.86666667]
variance : [ 236.58604575  188.78588235]
skewness : [-0.30733374  0.40999364]
kurtosis : [-0.65245849 -1.26315357]

The functions in pandas can be used instead; their output is easier to read [the Python data processing library pandas].
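For comparison, here is a sketch of the same describe call with the two variables kept in separate columns. It uses np.column_stack, which is a substitution made here rather than the original post's code, so its numbers differ from the output printed above.

```python
import numpy as np
from scipy import stats

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52,
                54, 54, 56, 57, 58, 58, 60, 61])
fat_percent = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                        34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# column_stack keeps each variable in its own column: shape (18, 2).
data = np.column_stack([age, fat_percent])

# describe works along axis 0 by default, so each statistic has one entry per column.
result = stats.describe(data)
for key, value in result._asdict().items():
    print(key, ':', value)
```

Here the first entry of every statistic refers to age and the second to fat_percent.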

Computing the entropy and KL divergence of probability distributions: scipy.stats.entropy

 scipy.stats.entropy(pk, qk=None, base=None)
    Calculate the entropy of a distribution for given probability values.
    If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
    If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
    This routine will normalize pk and qk if they don’t sum to 1.

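Computing Shannon entropy with entropy

# ij is assumed to be a NumPy array of non-negative counts; it is not defined in this post
shannon_entropy = stats.entropy(ij/sum(ij), base=None)
print(shannon_entropy)

A direct Python implementation of entropy

shannon_entropy_func = lambda pij: -np.sum(pij * np.log(pij))
pij = ij[np.nonzero(ij)] / ij.sum()  # drop zeros and normalize the counts to probabilities
shannon_entropy = shannon_entropy_func(pij)
print(shannon_entropy)

def entropy(counts):
    '''Compute the entropy (in bits) of an array of counts.'''
    counts = np.asarray(counts, dtype=float)
    ps = counts / counts.sum()      # coerce to float and normalize
    ps = ps[np.nonzero(ps)]         # toss out zeros
    H = -np.sum(ps * np.log2(ps))   # compute entropy, base 2
    return H

Computing the KL divergence of two distributions

# entropy expects arrays of probabilities, not distribution objects such as fs_rv_dist.
# fs_pk and nonfs_pk are the probability arrays of the two distributions, defined over the same support.
kl = stats.entropy(fs_pk, nonfs_pk)

Other implementations of KL divergence: [Distance and similarity measures]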

[scipy.stats.entropy]
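A small worked case makes the two formulas concrete. The sketch below uses two made-up two-point distributions, pk and qk, and checks the KL divergence against the sum written out by hand.

```python
import numpy as np
from scipy import stats

pk = np.array([0.5, 0.5])   # illustrative distribution
qk = np.array([0.9, 0.1])   # illustrative reference distribution

h_nats = stats.entropy(pk)           # natural logarithm: ln 2
h_bits = stats.entropy(pk, base=2)   # base 2: exactly 1 bit
kl = stats.entropy(pk, qk)           # Kullback-Leibler divergence D(pk || qk)

# The same divergence from the formula sum(pk * log(pk / qk)).
kl_manual = np.sum(pk * np.log(pk / qk))

# Unnormalized counts are normalized before the entropy is computed.
h_counts = stats.entropy([5, 5], base=2)

print(h_nats, h_bits, kl, kl_manual, h_counts)
```

Note that the KL divergence is not symmetric: swapping pk and qk generally gives a different value.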

Hypothesis testing

ttest_1samp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements

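ttest_1samp implements the one-sample t-test. Suppose we want to test the mean rice yield in the Abra column of the data, with the null hypothesis that the population mean rice yield is 15000:

from scipy import stats as ss
# df is assumed to be a pandas DataFrame of rice yields with an 'Abra' column; it is not defined in this post
# Perform a one-sample t-test using 15000 as the true mean
print(ss.ttest_1samp(a=df.loc[:, 'Abra'], popmean=15000))

# OUTPUT
(-1.1281738488299586, 0.26270472069109496)

It returns a tuple of the following values:

  • t : float or array
    the t statistic
  • prob : float or array
    the two-tailed p-value

In the output above, the p-value of 0.263 is much larger than α = 0.05, so there is not sufficient evidence to conclude that the mean rice yield differs from 15000. Applying the test to all variables, again assuming a mean of 15000:

print(ss.ttest_1samp(a=df, popmean=15000))

# OUTPUT
(array([ -1.12817385,   1.07053437, -65.81425599,  -4.564575  ,   6.17156198]),
 array([  2.62704721e-01,   2.87680340e-01,   4.15643528e-70,
          1.83764399e-05,   2.82461897e-08]))

The first array holds the t statistics and the second array holds the corresponding p-values.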
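Since the rice-yield DataFrame is not available here, the sketch below runs the same kind of test on synthetic data and recomputes the statistic by hand. The sample, its parameters, and the seed are all made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for one column of yields (not the original data).
sample = rng.normal(loc=15200.0, scale=1000.0, size=40)
popmean = 15000.0

result = stats.ttest_1samp(sample, popmean)
t_stat, p_value = result.statistic, result.pvalue

# The same statistic by hand: (sample mean - popmean) / standard error.
n = sample.size
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))
# Two-tailed p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_stat, p_value)
print(t_manual, p_manual)
```

The hand computation shows what the two returned numbers mean: the first is the standardized distance of the sample mean from popmean, the second is the probability of a distance at least that large under the null hypothesis.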

Contingency table functions

chi2_contingency(observed[, correction, lambda_]) Chi-square test of independence of variables in a contingency table.
contingency.expected_freq(observed) Compute the expected frequencies from a contingency table.
contingency.margins(a) Return a list of the marginal sums of the array a.
fisher_exact(table[, alternative]) Performs a Fisher exact test on a 2x2 contingency table.
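As a brief illustration of these functions, the sketch below computes expected frequencies for a small made-up 2x2 table and checks the chi-square statistic against its definition. The table values are arbitrary, and the continuity correction is switched off so that the manual sum matches.

```python
import numpy as np
from scipy import stats
from scipy.stats import contingency

# Illustrative 2x2 table of observed counts.
observed = np.array([[10, 20],
                     [30, 40]])

# Expected counts under independence: row total * column total / grand total.
expected = contingency.expected_freq(observed)

# correction=False disables the Yates continuity correction applied to 2x2 tables by default.
chi2, p, dof, expected_from_test = stats.chi2_contingency(observed, correction=False)

# The statistic from its definition: sum of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2, p, dof)
```

For this table the row totals are 30 and 70 and the column totals are 40 and 60, so the expected count in the top-left cell is 30 * 40 / 100 = 12.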

Plot-tests

ppcc_max(x[, brack, dist]) Returns the shape parameter that maximizes the probability plot correlation coefficient for the given data to a one-parameter family of distributions.
ppcc_plot(x, a, b[, dist, plot, N]) Returns (shape, ppcc), and optionally plots shape vs. ppcc.
probplot(x[, sparams, dist, fit, plot]) Calculate quantiles for a probability plot, and optionally show the plot.
boxcox_normplot(x, la, lb[, plot, N]) Compute parameters for a Box-Cox normality plot, optionally show it.

Statistical functions for masked arrays (scipy.stats.mstats)

Masked statistics functions

argstoarray(*args) Constructs a 2D array from a group of sequences.
betai(a, b, x) Returns the incomplete beta function.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
count_tied_groups(x[, use_missing]) Counts the number of tied values.
describe(a[, axis]) Computes several descriptive statistics of the passed array.
f_oneway(*args) Performs a 1-way ANOVA, returning an F-value and probability given any number of groups.
f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell
find_repeats(arr) Find repeats in arr and return a tuple (repeats, repeat_count).
friedmanchisquare(*args) Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
kendalltau(x, y[, use_ties, use_missing]) Computes Kendall’s rank correlation tau on two variables x and y.
kendalltau_seasonal(x) Computes a multivariate Kendall’s rank correlation tau, for seasonal data.
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
linregress(*args) Calculate a regression line
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney statistic
plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
mquantiles(a[, prob, alphap, betap, axis, limit]) Computes empirical quantiles for a data array.

msign(x) Returns the sign of x, or 0 if x is masked.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
obrientransform(*args) Computes a transform on input data (any number of columns).
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and the associated p-value.
rankdata(data[, axis, use_missing]) Returns the rank (also known as order statistics) of each data point along the given axis.
scoreatpercentile(data, per[, limit, ...]) Calculate the score at the given ‘per’ percentile of the sequence a.
sem(a[, axis, ddof]) Calculates the standard error of the mean (or standard error of measurement) of the values in the input array.
signaltonoise(data[, axis]) Calculates the signal-to-noise ratio, as the ratio of the mean over standard deviation.
skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
spearmanr(x, y[, use_ties]) Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
theilslopes(y[, x, alpha]) Computes the Theil slope as the median of all slopes between paired values.
threshold(a[, threshmin, threshmax, newval]) Clip array to a given value.
tmax(a, upperlimit[, axis, inclusive]) Compute the trimmed maximum
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
trim(a[, limits, inclusive, relative, axis]) Trims an array by masking the data outside some given limits.
trima(a[, limits, inclusive]) Trims an array by masking the data outside some given limits.
trimboth(data[, proportiontocut, inclusive, ...]) Trims the smallest and largest data values.
trimmed_stde(a[, limits, inclusive, axis]) Returns the standard error of the trimmed mean along the given axis.
trimr(a[, limits, inclusive, axis]) Trims an array by masking some proportion of the data on each end.
trimtail(data[, proportiontocut, tail, ...]) Trims the data by masking values from one tail.
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
tvar(a[, limits, inclusive]) Compute the trimmed variance
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
winsorize(a[, limits, inclusive, inplace, axis]) Returns a Winsorized version of the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.

Univariate and multivariate kernel density estimation (scipy.stats.kde)

gaussian_kde(dataset[, bw_method]) Representation of a kernel-density estimate using Gaussian kernels.
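A minimal sketch of gaussian_kde on one-dimensional data follows. The data are synthetic draws from a standard normal, chosen here so that the estimate can be compared with a known density.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)   # synthetic sample

# Build the estimator; the bandwidth defaults to Scott's rule.
kde = stats.gaussian_kde(data)

grid = np.linspace(-4, 4, 81)
density = kde(grid)                  # estimated density at each grid point

# The estimated density integrates to 1 over the whole real line.
total_mass = kde.integrate_box_1d(-np.inf, np.inf)

true_density = stats.norm.pdf(grid)  # density the sample was drawn from
print(total_mass)
print(np.max(np.abs(density - true_density)))
```

The bw_method argument can be used to override the default bandwidth when the estimate looks too smooth or too noisy.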

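Examples of using the statistical functions

Continuous distributions: the normal (Gaussian) distribution, norm

{A normal (Gaussian) continuous random variable.}

Generating a random vector that follows a Gaussian distribution (sampling from a normal distribution): stats.norm.rvs(loc, scale, size)

Parameters:

The location (loc) keyword specifies the mean.

The scale (scale) keyword specifies the standard deviation.

For norm, the loc and scale arguments set the shift and scaling of the random variable. For a normally distributed random variable, they correspond to its expected value and standard deviation.

Random deviates from the Gaussian distribution N(0, 0.01), that is, variance 0.01 and standard deviation 0.1:
y = stats.norm.rvs(loc=0, scale=0.1, size=10)
Output: array([ 0.05419826,  0.04151471, -0.10784729,  0.18283546,  0.02348312, -0.04611974,  0.0069336 ,  0.03840133, -0.05015316,  0.23315205])

stats.norm.stats(loc=0, scale=0.1)  # mean and variance of the distribution; y itself is a plain array and has no stats() method
(array(0.0), array(0.01))

Note: Gaussian random numbers can also be generated with numpy.random.normal [the numpy library - the random number module numpy.random].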

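Finding the best-fit parameters of a normal distribution: stats.norm.fit(x)

>>> x = stats.norm.rvs(loc=1.0, scale=2.0, size=100)
The fit() method fits the random sample x and returns the parameters of the random variable that best match the sample:
>>> stats.norm.fit(x)  # estimated mean and standard deviation of the sample
(1.01810091, 2.00046946)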
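For the normal distribution the maximum-likelihood fit has a closed form: the sample mean and the standard deviation computed with ddof=0. The sketch below checks that on a seeded sample; the sample size and seed are arbitrary choices made here.

```python
import numpy as np
from scipy import stats

# Seeded sample from N(1, 2^2) so the run is repeatable.
x = stats.norm.rvs(loc=1.0, scale=2.0, size=1000, random_state=0)

loc_hat, scale_hat = stats.norm.fit(x)   # maximum-likelihood estimates

# Closed-form MLE for a normal sample.
mean_direct = x.mean()
std_direct = x.std()                     # ddof=0, the MLE of the standard deviation

print(loc_hat, scale_hat)
print(mean_direct, std_direct)
```

For distributions with shape parameters, fit() has no closed form and runs a numerical optimization instead, so results can depend on the starting values.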


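Evaluating the probability density function of the normal distribution N(1, 1) at a given x

lambda x: norm.pdf(x, 1, 1)
Note: from the normal density formula, norm.pdf(x, loc, scale) equals norm.pdf((x - loc) / scale) / scale. It is therefore the same as norm.pdf(x - loc) only when the standard deviation is 1, as it is here.

Evaluating the cumulative distribution function of the normal distribution N(1, 1) at a given x

lambda x: norm.cdf(x, 1, 1)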
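The role of loc and scale in pdf() and cdf() can be checked numerically. The sketch below uses loc = 1 and scale = 2 (values picked here so that the scale factor is visible) and compares the direct calls with the standardized form.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 5, 9)
loc, scale = 1.0, 2.0

# Direct evaluation with loc and scale.
pdf_direct = stats.norm.pdf(x, loc, scale)
cdf_direct = stats.norm.cdf(x, loc, scale)

# Standardize first: the density picks up a 1/scale factor, the cdf does not.
z = (x - loc) / scale
pdf_standardized = stats.norm.pdf(z) / scale
cdf_standardized = stats.norm.cdf(z)

# Shifting alone is not enough when scale differs from 1.
pdf_shift_only = stats.norm.pdf(x - loc)

print(np.allclose(pdf_direct, pdf_standardized))
print(np.allclose(cdf_direct, cdf_standardized))
print(np.allclose(pdf_direct, pdf_shift_only))
```

The same loc and scale convention applies to every continuous distribution in scipy.stats, not only norm.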

Plotting 1-D and 2-D normal probability density functions

[Probability theory: the Gaussian distribution]

[scipy.stats.norm]

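Uniform distribution

mu = uniform.rvs(size=N)  # sample from the uniform distribution

Gamma distribution

The gamma distribution takes an additional shape parameter. It can describe the time spent waiting for k independent random events to occur, where k is the shape parameter.
The scale parameter theta is related to the rate at which the random events occur (theta is the reciprocal of the rate) and is given by the scale argument.
>>> stats.gamma.stats(2.0, scale=2)
(array(4.0), array(8.0))
By the mathematical definition of the gamma distribution, its expected value is k*theta and its variance is k*theta^2; the call above confirms both formulas. When a distribution has additional shape parameters, its rvs(), pdf(), and other methods take extra arguments to receive them.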
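The two moment formulas can be checked for more than one parameter pair. The sketch below loops over a few (k, theta) combinations picked arbitrarily here and compares gamma.stats() with k*theta and k*theta^2.

```python
import numpy as np
from scipy import stats

# Illustrative (shape k, scale theta) pairs.
params = [(2.0, 2.0), (3.0, 0.5), (1.0, 0.1)]

results = []
for k, theta in params:
    # The shape parameter is passed positionally, the scale by keyword.
    mean, var = stats.gamma.stats(k, scale=theta)
    results.append((float(mean), float(var)))
    print(k, theta, float(mean), float(var), k * theta, k * theta ** 2)
```

The first pair reproduces the mean of 4 and variance of 8 from the call above; with k = 1 the gamma distribution reduces to the exponential distribution.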

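Discrete distributions: the binomial distribution

Suppose a trial has only two outcomes and succeeds with probability p. The binomial distribution gives the probability of k successes in n such independent trials.
The probability mass function of the binomial distribution is:

f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}

With the binomial probability mass function pmf(), it is easy to compute the probability of rolling a 6 exactly k times.

pmf()

The first argument of pmf() is the value of the random variable; the remaining arguments are the parameters of the distribution. For the binomial distribution these are n and p, and the value ranges over the integers from 0 to n.

The following computes, from the binomial probability mass function, the probabilities of getting 0 to 5 sixes in 5 rolls of a die:

>>> stats.binom.pmf(range(6), 5, 1/6.0)
array([0.401878, 0.401878, 0.160751, 0.032150, 0.003215, 0.000129])

The result shows that the probability of rolling no six is 40.2%, the probability of exactly one six is also 40.2%, and the probability of exactly three sixes is 3.215%.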
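The die probabilities can be reproduced from the binomial formula C(n, k) p^k (1-p)^(n-k) with only the standard library. The sketch below does that and compares the result with stats.binom.pmf; math.comb assumes Python 3.8 or later.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 5, 1 / 6            # five rolls, probability of a six on each roll
k = np.arange(n + 1)       # possible numbers of sixes: 0..5

# Binomial pmf written out from the formula C(n, k) * p^k * (1 - p)^(n - k).
pmf_formula = np.array([comb(n, int(i)) * p ** int(i) * (1 - p) ** (n - int(i))
                        for i in k])

pmf_scipy = stats.binom.pmf(k, n, p)

print(pmf_formula)
print(pmf_scipy)
```

For example, exactly two sixes has probability 10 * (1/6)^2 * (5/6)^3 = 1250/7776, about 0.1608.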

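Poisson distribution

In the binomial distribution, if the number of trials n is large, the success probability p of each trial is small, and their product np is moderate, then the distribution of the number of successes can be approximated by the Poisson distribution.
In the Poisson distribution, lambda is the average rate at which random events occur per unit of time (or per unit of area). If n in the binomial distribution is taken as the number of trials performed per unit of time, then its product with the event probability p is the average event rate, that is, lambda = np.
The probability mass function of the Poisson distribution is:

f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}

Approximating the binomial distribution
The program below computes the probability mass functions of the binomial and Poisson distributions; when n is large enough, the two are very close.
The average event rate lambda is fixed at 10, and the probability of the event in each trial is computed from the number of binomial trials as p = lambda/n.
>>> _lambda = 10.0
>>> k = np.arange(20)
>>> poisson = stats.poisson.pmf(k, _lambda)  # Poisson distribution
>>> binom100 = stats.binom.pmf(k, 100, _lambda/100)  # binomial distribution, n = 100
>>> binom1000 = stats.binom.pmf(k, 1000, _lambda/1000)  # binomial distribution, n = 1000
>>> np.max(np.abs(binom100-poisson))  # maximum error
0.006755311103353312
>>> np.max(np.abs(binom1000-poisson))  # the error is smaller for n = 1000
0.00063017540509099912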

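Simulating a Poisson process

The Poisson distribution is suited to describing the number of random events that occur per unit of time, for example the number of times a facility is used in a given period, the number of machine failures, or the number of natural disasters.

The simulation below uses random numbers to generate such a process and compares the result with the Poisson probability mass function, with an average of lambda = 10 events per second. The original comparison used observation times of 1000 and 50000 seconds (the code below uses 10000 seconds): the longer the observation time, the more closely the number of events per second follows the Poisson distribution.

>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> count, time_edges = np.histogram(t, bins=time, range=(0, time))
>>> count
array([10, 9, 8, …, 11, 10, 18])
>>> dist, count_edges = np.histogram(count, bins=20, range=(0, 20), density=True)
>>> x = count_edges[:-1]
>>> poisson = stats.poisson.pmf(x, _lambda)
>>> np.max(np.abs(dist-poisson))  # the maximum error is small, so the counts follow the Poisson distribution
0.0088356241037075706


Note: rand() generates the times of _lambda*time events, uniformly distributed between 0 and time.
histogram() counts the number of events in the array t that fall within each second, giving count.
By the definition of the Poisson distribution, the values in count should follow the Poisson distribution. The second histogram() call estimates the probability distribution of the event counts over the range 0 to 20. When the density argument of histogram() is True (older NumPy versions called it normed) and every bin has width 1, the result equals the probability mass function.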

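Time intervals of a Poisson process: the gamma distribution

The distribution of random events can also be viewed from another angle: the distribution of the time interval between two adjacent events, or between events that are k events apart. According to probability theory, these intervals follow the gamma distribution. Because a time interval can take any value, the gamma distribution is a continuous probability distribution. Its probability density function, given below, describes the distribution of the waiting time until the k-th event occurs:

f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}

Here \theta = 1/\lambda is the scale parameter and \Gamma(k) is the gamma function; when k is a positive integer, \Gamma(k) = (k-1)!.

The program below simulates the gamma distribution of the intervals between events, with an observation time of 10000 seconds and an average of 10 events per second.
"k=1" denotes the distribution of the interval between two adjacent events, and "k=2" denotes the distribution of the interval between two events that are one event apart. Both follow the gamma distribution.


>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> t.sort()  # the random times must be sorted before computing intervals between consecutive events
>>> s1 = t[1:] - t[:-1]  # interval between two adjacent events
>>> s2 = t[2:] - t[:-2]  # interval between two events that are one event apart
>>> dist1, x1 = np.histogram(s1, bins=100, density=True)
>>> dist2, x2 = np.histogram(s2, bins=100, density=True)
>>> gamma1 = stats.gamma.pdf((x1[:-1]+x1[1:])/2, 1, scale=1.0/_lambda)
>>> gamma2 = stats.gamma.pdf((x2[:-1]+x2[1:])/2, 2, scale=1.0/_lambda)
>>> np.max(np.abs(gamma1 - dist1))
0.13557317865888141
>>> np.max(np.abs(gamma2 - dist2))
0.087375030861794656
>>> np.max(gamma1), np.max(gamma2)
(9.3483221580498537, 3.6767953241013656)  # the density values themselves are fairly large, so the errors above are small
Note: simulating the gamma distribution:
First generate the times of 100000 random events within 10000 seconds, so that events occur on average 10 times per second;
to compute the intervals between consecutive events, the random times must first be sorted;
the second value returned by histogram() holds the bin edges, and gamma.pdf() is evaluated at the midpoint of each bin. The second argument of pdf() is k, and the scale argument is 1/λ.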
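A quicker check than comparing histograms is to compare the sample moments of the intervals with the gamma mean k/λ and variance k/λ^2. The sketch below repeats the simulation with a seeded generator; the names rate and duration and the seed are choices made here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rate = 10            # average number of events per second (lambda)
duration = 10000     # observation time in seconds

# Event times uniformly distributed over the observation window, then sorted.
t = np.sort(rng.random(rate * duration) * duration)

s1 = t[1:] - t[:-1]  # interval between adjacent events (k = 1)
s2 = t[2:] - t[:-2]  # interval between events that are one event apart (k = 2)

# Theoretical mean and variance of gamma(k, scale=1/rate).
mean1, var1 = stats.gamma.stats(1, scale=1.0 / rate)
mean2, var2 = stats.gamma.stats(2, scale=1.0 / rate)

print(s1.mean(), float(mean1), s1.var(), float(var1))
print(s2.mean(), float(mean2), s2.var(), float(var2))
```

With 100000 events the sample moments should agree with the theoretical values to within a few percent.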

from:http://blog.csdn.net/pipisorry/article/details/49515215

ref:Statistical functions (scipy.stats)

Random distribution functions in the Python standard library

