scip

小刘要努力。

liurunsen

Article link: https://blog.csdn.net/weixin_44510615/article/details/99344335

Commonly used statistics tools in Python include NumPy, Pandas, PyMC, and StatsModels.
SciPy's subpackage scipy.stats provides a large collection of statistical methods.

The main SciPy modules are listed below; stats is the one used throughout this post.

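cluster clustering algorithms
constants physical and mathematical constants
fftpack fast Fourier transforms (legacy; newer code uses scipy.fft)
integrate integration and ordinary differential equation solvers
interpolate interpolation
io input and output
linalg linear algebra
odr orthogonal distance regression
optimize optimization and root finding
signal signal processing
sparse sparse matrices
spatial spatial data structures and algorithms
special special functions
stats statistical distributions and functions
weave embedding C/C++ code (removed from SciPy; now a separate, Python 2 only package)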

Plotting a Gaussian function (warm-up)

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-3, 3, 100)
# Unnormalized Gaussian function exp(-x^2)
plt.plot(x, np.exp(-1 * x ** 2))
t = plt.title("Gaussian")
plt.savefig('Gaussian.png')
plt.show()
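
The curve plotted above is an unnormalized Gaussian, so its area is not 1. As a small sketch of how it relates to the normal density used later in this post, the check below shows that exp(-x^2) equals sqrt(pi) times the normal PDF with standard deviation 1/sqrt(2). The variable names are illustrative and not taken from the original post.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 100)

# The function plotted in the warm-up
unnormalized = np.exp(-x ** 2)

# A normal PDF with sigma = 1/sqrt(2) has exponent -x^2 and prefactor 1/sqrt(pi)
sigma = 1 / np.sqrt(2)
scaled_pdf = np.sqrt(np.pi) * norm.pdf(x, loc=0, scale=sigma)

print(np.allclose(unnormalized, scaled_pdf))
```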

Common statistical methods

from numpy import array, linspace, arange, hstack
from matplotlib import pyplot

# NumPy arrays come with simple statistical methods:
heights = array([1.46, 1.79, 2.01, 1.75, 1.56, 1.69, 1.88, 1.76, 1.88, 1.78])
print('mean,', heights.mean())
print('min,', heights.min())
print('max', heights.max())
print('standard deviation,', heights.std())  # population standard deviation (ddof=0)

# Import SciPy's statistics module
# (scipy.stats.stats is a deprecated private path; use scipy.stats):
import scipy.stats as st


print('mode, ', st.mode(heights))  # most frequent value and its count
print('skewness, ', st.skew(heights))  # skewness
print('kurtosis, ', st.kurtosis(heights))  # excess kurtosis (0 for a normal distribution)


mean, 1.7559999999999998
min, 1.46
max 2.01
standard deviation, 0.15081114017207078
mode,  ModeResult(mode=array([1.88]), count=array([2]))
skewness,  -0.3935244564726347
kurtosis,  -0.33067209772439865
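
Most of the numbers above can also be obtained in a single call. The following is a minimal sketch using scipy.stats.describe on the same heights data; note that describe reports the sample variance (ddof=1), whereas heights.std() above uses ddof=0, so the two are not directly comparable.

```python
import numpy as np
from scipy import stats

heights = np.array([1.46, 1.79, 2.01, 1.75, 1.56, 1.69, 1.88, 1.76, 1.88, 1.78])

# describe returns nobs, minmax, mean, variance, skewness and kurtosis at once
summary = stats.describe(heights)

print('nobs,', summary.nobs)
print('min and max,', summary.minmax)
print('mean,', summary.mean)
print('sample variance (ddof=1),', summary.variance)
print('skewness,', summary.skewness)
print('kurtosis,', summary.kurtosis)
```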

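Probability distributions

Common continuous probability distributions:

Uniform distribution
Normal distribution
Student's t distribution
F distribution
Gamma distribution

Discrete probability distributions:

Bernoulli distribution
Geometric distribution

All of these are available in scipy.stats.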
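
As a rough illustration of where these distributions live, the sketch below creates one "frozen" instance of each and prints its mean and variance. The parameter values (degrees of freedom, shape, p) are arbitrary choices for the example and do not come from the original post.

```python
from scipy import stats

# One frozen instance per distribution listed above; parameters are illustrative
distributions = {
    "uniform on [0, 1]": stats.uniform(loc=0, scale=1),
    "standard normal": stats.norm(loc=0, scale=1),
    "Student t, 10 dof": stats.t(df=10),
    "F, (5, 20) dof": stats.f(dfn=5, dfd=20),
    "Gamma, shape 2": stats.gamma(a=2),
    "Bernoulli, p=0.3": stats.bernoulli(p=0.3),
    "geometric, p=0.25": stats.geom(p=0.25),
}

for name, dist in distributions.items():
    print(f"{name}: mean={dist.mean():.4f}, variance={dist.var():.4f}")
```

Continuous distributions expose pdf and cdf, while discrete ones expose pmf and cdf; both kinds provide rvs for sampling.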

Normal distribution

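# Normal distribution
from scipy.stats import norm

# It provides four commonly used kinds of functions:
#
# norm.cdf returns the value of the cumulative distribution function
# norm.pdf returns the value of the probability density function
# norm.rvs draws random samples for the given parameters
# norm.fit returns the maximum likelihood estimates (MLE) of the parameters for the given data

# Draw 500 random points from the standard normal distribution:
x_norm = norm.rvs(size=500)
type(x_norm)  # numpy.ndarray
# pyplot.ion()  # turn on interactive mode
# Histogram:
h = pyplot.hist(x_norm)
print('counts, ', h[0])
print('bin edges', h[1])  # 11 edges delimit the 10 default bins

counts,  [  1.  10.  31.  53. 116. 132.  86.  47.  19.   5.]
bin edges [-3.1464072  -2.55476393 -1.96312066 -1.37147739 -0.77983412 -0.18819085
  0.40345242  0.99509569  1.58673896  2.17838223  2.7700255 ]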
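
Because norm.rvs draws fresh random numbers on every run, the counts above will differ each time. A minimal sketch of the pdf, cdf, and rvs calls with a fixed seed is shown below; the seed value 42 is an arbitrary assumption used only to make the draw repeatable.

```python
import numpy as np
from scipy.stats import norm

# Density and cumulative probability of the standard normal at x = 0
pdf_at_zero = norm.pdf(0)   # equals 1 / sqrt(2 * pi)
cdf_at_zero = norm.cdf(0)   # half of the probability mass lies below the mean

# random_state fixes the seed so that repeated runs give the same samples
samples = norm.rvs(size=500, random_state=42)
samples_again = norm.rvs(size=500, random_state=42)

print('pdf(0) =', pdf_at_zero)
print('cdf(0) =', cdf_at_zero)
print('same draw twice:', np.array_equal(samples, samples_again))
```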




figure = pyplot.figure(1)  # get (or create) figure 1
pyplot.show()

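# Normalized histogram (bar areas sum to 1 instead of showing counts), using 20 bins (the default is 10).
# density=True replaces the normed=True argument, which Matplotlib has removed:
h = pyplot.hist(x_norm, density=True, bins=20)
pyplot.show()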

# Maximum likelihood estimates of the normal distribution parameters for this data:
x_mean, x_std = norm.fit(x_norm)

print('mean, ', x_mean)
print('x_std, ', x_std)


mean,  -0.030878122231297822
x_std,  0.9586075383182006
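
For the normal distribution, the maximum likelihood estimates have a closed form: the sample mean and the standard deviation computed with ddof=0. The sketch below checks this on synthetic data; the loc, scale, and seed values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data with a known mean and standard deviation (illustrative values)
data = norm.rvs(loc=1.0, scale=2.0, size=1000, random_state=0)

mu_hat, sigma_hat = norm.fit(data)

# The MLE of the mean is the sample mean; the MLE of sigma divides by n, not n - 1
print('fit mean     ', mu_hat, ' sample mean    ', data.mean())
print('fit std      ', sigma_hat, ' std with ddof=0', data.std(ddof=0))
```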

# Compare the true probability density function with the histogram:
h = pyplot.hist(x_norm, density=True, bins=20)

x = linspace(-3, 3, 50)
p = pyplot.plot(x, norm.pdf(x), 'r-')
pyplot.show()

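# Import the trapezoidal integration function
# (trapezoid replaces the older name trapz, which recent SciPy releases no longer provide):
from scipy.integrate import trapezoid

x1 = linspace(-2, 2, 108)
p = trapezoid(norm.pdf(x1), x1)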
print('{:.2%} of the values lie between -2 and 2'.format(p))

95.45% of the values lie between -2 and 2
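
The same probability can be obtained without numerical integration on a grid, since it is simply a difference of CDF values. The sketch below compares that with scipy.integrate.quad as a cross-check; the variable names are illustrative.

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(-2 < X < 2) as a difference of cumulative probabilities
p_cdf = norm.cdf(2) - norm.cdf(-2)

# Adaptive quadrature of the density over the same interval
p_quad, abs_err = quad(norm.pdf, -2, 2)

print('{:.2%} from the CDF'.format(p_cdf))
print('{:.2%} from quad'.format(p_quad))
```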

# The mean and standard deviation are controlled by loc and scale; one way to set them is to pass them in the function call:
x = linspace(-3, 3, 50)
p = pyplot.plot(x, norm.pdf(x, loc=0, scale=1))
p = pyplot.plot(x, norm.pdf(x, loc=0.5, scale=2))
p = pyplot.plot(x, norm.pdf(x, loc=-0.5, scale=.5))
pyplot.show()
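
Another way to set loc and scale is to "freeze" the distribution once and reuse the resulting object, which is what the t-test example later in this post does. A minimal sketch, with an illustrative variable name:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 50)

# Freeze the parameters once; later calls no longer need loc and scale
shifted = norm(loc=0.5, scale=2)

same_pdf = np.allclose(shifted.pdf(x), norm.pdf(x, loc=0.5, scale=2))
print('frozen and unfrozen PDFs agree:', same_pdf)
print('mean =', shifted.mean(), ' std =', shifted.std())
```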

# Log-normal distributions with different shape parameters s:
from scipy.stats import lognorm

x = linspace(0.01, 3, 100)

pyplot.plot(x, lognorm.pdf(x, 1), label='s=1')
pyplot.plot(x, lognorm.pdf(x, 2), label='s=2')
pyplot.plot(x, lognorm.pdf(x, .1), label='s=0.1')

pyplot.legend()
pyplot.show()
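
The shape parameter s is the standard deviation of log(X): if X follows lognorm with shape s (and the default scale of 1), then log(X) is normal with mean 0 and standard deviation s. The sketch below checks this relationship numerically; s = 0.5 is an arbitrary value chosen for the example.

```python
import numpy as np
from scipy.stats import lognorm, norm

s = 0.5  # illustrative shape parameter
x = np.linspace(0.01, 3, 100)

# CDF of the log-normal versus the normal CDF evaluated at log(x) / s
cdf_lognorm = lognorm.cdf(x, s)
cdf_via_norm = norm.cdf(np.log(x) / s)

# The densities differ by the Jacobian factor 1 / (s * x)
pdf_lognorm = lognorm.pdf(x, s)
pdf_via_norm = norm.pdf(np.log(x) / s) / (s * x)

print(np.allclose(cdf_lognorm, cdf_via_norm), np.allclose(pdf_lognorm, pdf_via_norm))
```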

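# Discrete distributions
from scipy.stats import randint

# Probability mass function (PMF) of the discrete uniform distribution:
high = 10
low = -10

# randint(low, high) covers the integers low, ..., high - 1;
# with a step of 0.5 the PMF is evaluated at non-integer points too, where it is 0
x = arange(low, high + 1, 0.5)
p = pyplot.stem(x, randint(low, high).pmf(x))  # stem plot
pyplot.show()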
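
To make the support of randint explicit, the sketch below evaluates the PMF on the integers only and checks that each value has probability 1 / (high - low), that the upper bound is excluded, and that non-integer points carry no mass. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import randint

low, high = -10, 10
dist = randint(low, high)

# The support is low, low + 1, ..., high - 1 (high itself is excluded)
support = np.arange(low, high)
pmf = dist.pmf(support)

print('probability of each integer:', pmf[0])
print('total probability:', pmf.sum())
print('pmf at high:', dist.pmf(high))
print('pmf at 0.5:', dist.pmf(0.5))
```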

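# Hypothesis testing
# Import the functions used below: the normal distribution and the
# independent two-sample t-test.
# scipy.stats also provides ttest_rel (paired samples), ttest_1samp (one sample),
# and the Student's t distribution as scipy.stats.t.

from scipy.stats import norm
from scipy.stats import ttest_ind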

# Independent-samples t-test
# Two normal distributions with different parameters:
n1 = norm(loc=0.3, scale=1.0)
n2 = norm(loc=0, scale=1.0)
# Draw a random sample from each distribution:
n1_samples = n1.rvs(size=100)
n2_samples = n2.rvs(size=100)
# Pool the two samples together:
samples = hstack((n1_samples, n2_samples))
# Maximum likelihood parameter estimates for the pooled sample:
loc, scale = norm.fit(samples)
n = norm(loc=loc, scale=scale)
# Compare:
x = linspace(-3, 3, 100)

pyplot.hist([samples, n1_samples, n2_samples], density=True)
pyplot.plot(x, n.pdf(x), 'b-')
pyplot.plot(x, n1.pdf(x), 'g-')
pyplot.plot(x, n2.pdf(x), 'r-')
pyplot.show()

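# The independent two-sample t-test checks whether the means of the two samples differ significantly:
t_val, p = ttest_ind(n1_samples, n2_samples)

print('t = {}'.format(t_val))
print('p-value = {}'.format(p))

# A small p-value (for example, below 0.05) indicates a significant difference between the two sample means.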

t = 2.6516886911174073
p-value = 0.00865772567380083
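
To show what ttest_ind computes with its default settings (equal variances assumed, two-sided alternative), the sketch below reproduces the statistic by hand from the pooled variance and compares it with SciPy's result. The seeds and variable names are assumptions made so the example is repeatable; they are not part of the original post.

```python
import numpy as np
from scipy.stats import norm, ttest_ind
from scipy.stats import t as t_dist

# Two samples drawn as in the example above, but with fixed (arbitrary) seeds
a = norm.rvs(loc=0.3, scale=1.0, size=100, random_state=1)
b = norm.rvs(loc=0.0, scale=1.0, size=100, random_state=2)

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2

# Pooled sample variance, weighting each group by its degrees of freedom
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t statistic: difference of means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the Student's t distribution
p_manual = 2 * t_dist.sf(abs(t_manual), dof)

t_scipy, p_scipy = ttest_ind(a, b)
print('manual: t = {}, p = {}'.format(t_manual, p_manual))
print('scipy:  t = {}, p = {}'.format(t_scipy, p_scipy))
```

When the two groups cannot be assumed to share a variance, ttest_ind(a, b, equal_var=False) performs Welch's t-test instead.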