NumPy Statistics Functions: A Tutorial with Examples

梦想画家

Article link: https://blog.csdn.net/neweastsun/article/details/125953599

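This article covers common NumPy array operations and descriptive-statistics functions such as the mean, variance, standard deviation, and mode.

Checking data types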

import numpy as np

arr1 = np.random.randn(5)
arr2 = np.random.randint(low=10, high=20, size=5)
arr3 = np.array(['a','bb','ccc','dddd','eeeee'])
arr4 = np.array(['abcdefghijkl'])

print('Array 1 :', arr1, type(arr1), '\nData type of Array 1', arr1.dtype)
print('Array 2 :', arr2, type(arr2), '\nData type of Array 2', arr2.dtype)
print('Array 3 :', arr3, type(arr3), '\nData type of Array 3', arr3.dtype)
print('Array 4 :', arr4, type(arr4), '\nData type of Array 4', arr4.dtype)

Sample output (arr1 and arr2 are random, so their values differ on every run; the default integer dtype is int32 on some platforms and int64 on others):

Array 1 : [1.06415609 0.31723947 0.90564194 0.34573655 1.05886877] <class 'numpy.ndarray'> 
Data type of Array 1 float64
Array 2 : [13 17 10 19 12] <class 'numpy.ndarray'> 
Data type of Array 2 int32
Array 3 : ['a' 'bb' 'ccc' 'dddd' 'eeeee'] <class 'numpy.ndarray'> 
Data type of Array 3 <U5
Array 4 : ['abcdefghijkl'] <class 'numpy.ndarray'> 
Data type of Array 4 <U12

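The output shows that the first array has dtype float64, the second int32, and the third <U5 …

The third dtype deserves a closer look: U stands for a Unicode string, 5 is the maximum length in characters, and the less-than sign means little-endian byte order (a greater-than sign would mean big-endian). So <U5 describes an array of Unicode strings of at most 5 characters.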
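To make the dtype string concrete, here is a minimal sketch that inspects the pieces of a fixed-width Unicode dtype. The variable name is reused from the example above for illustration; the point is that the width is fixed when the array is created, so longer strings assigned later are silently truncated.

```python
import numpy as np

arr3 = np.array(['a', 'bb', 'ccc', 'dddd', 'eeeee'])

# kind 'U' means Unicode string; NumPy stores 4 bytes per character
print(arr3.dtype.kind)       # U
print(arr3.dtype.itemsize)   # 20 bytes = 5 characters * 4 bytes

# The width is fixed at creation: a longer string is cut to 5 characters
arr3[0] = 'abcdefgh'
print(arr3[0])               # abcde
```

If longer strings are expected, one option is to create the array with a wider dtype up front, for example dtype='<U20'.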

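Converting a NumPy array to a string

Any NumPy array can be converted to a string with np.array2string(). The function is flexible, and its parameters make it easy to control the output. For example:

separator: the string placed between array elements
precision: the number of digits shown for floating-point values
suppress_small: when True, numbers very close to zero are printed as zero; the default is False.

See the example:

# create an array
arr = np.random.randn(5)
# convert it to a string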
arr_str_1 = np.array2string(arr)
print(arr_str_1, type(arr_str_1))

# with the separator parameter
arr_str_2 = np.array2string(arr, separator=' | ')
print(arr_str_2, type(arr_str_2))

# with precision and suppress_small
arr_str_3 = np.array2string(arr, precision=2, suppress_small=True)
print(arr_str_3, type(arr_str_3))

Output:

[ 0.58366893  0.08850228  0.64382527 -1.35269866  1.42138295] <class 'str'>
[ 0.58366893 |  0.08850228 |  0.64382527 | -1.35269866 |  1.42138295] <class 'str'>
[ 0.58  0.09  0.64 -1.35  1.42] <class 'str'>
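The random values above never get close enough to zero to show what suppress_small does. The following sketch uses a hand-picked array (an assumption for illustration) that contains one tiny value, and compares the default formatting with the suppressed one.

```python
import numpy as np

# Hand-picked values: one of them is very close to zero
tiny = np.array([1e-10, 1.0, 2.5])

default_str = np.array2string(tiny)
suppressed_str = np.array2string(tiny, precision=2, suppress_small=True)

print(default_str)      # scientific notation, triggered by the tiny value
print(suppressed_str)   # fixed-point notation, the tiny value printed as zero
```

Only the printed text changes; the values stored in the array are untouched.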

A datetime array can be converted to a string in the same way. The example below first creates an array of dates with np.arange():

import numpy as np

# create an array of dates
arr_dt = np.arange('2022-02-01', '2022-02-05', dtype='datetime64')
print(arr_dt, type(arr_dt), '\n')

# convert it to a string
arr_str_4 = np.array2string(arr_dt)
print(arr_str_4, type(arr_str_4))

Output:

['2022-02-01' '2022-02-02' '2022-02-03' '2022-02-04'] <class 'numpy.ndarray'> 
['2022-02-01' '2022-02-02' '2022-02-03' '2022-02-04'] <class 'str'>

In most cases we do not want the square brackets in the output. For that we can use join() and map().

join example:

arr_str_4 = np.array(["this", "is", "demo"])
print(' | '.join(arr_str_4))
# this | is | demo

map example:

arr = np.array([10, 20, 30])
print(' '.join(map(str, arr)))
# 10 20 30
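For floating-point arrays, str() prints every digit, so it can help to format each element while joining. A small sketch, assuming two decimal places are wanted:

```python
import numpy as np

arr = np.array([0.5, 1.25, 3.0])

# Format each element with two decimals, then join with a comma
line = ', '.join(f'{x:.2f}' for x in arr)
print(line)   # 0.50, 1.25, 3.00
```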

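Converting a string array to numbers

There are several ways to convert an array of strings to numbers. This section covers two: astype() and np.asarray(). They do similar jobs; the difference is that astype() is a method called on an existing NumPy array, while np.asarray() also accepts a plain Python list.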

import numpy as np

# array of strings
str_arr = np.array(['10', '20', '30', '40', '50', '60'])

# convert to int using astype()
int_arr = str_arr.astype(int)
print(int_arr, type(int_arr))

# convert to float using astype()
flt_arr = str_arr.astype(float)
print(flt_arr, type(flt_arr))

Output:

[10 20 30 40 50 60] <class 'numpy.ndarray'>
[10. 20. 30. 40. 50. 60.] <class 'numpy.ndarray'>

With np.asarray() the output is the same:

import numpy as np

# List of strings
str_list = ['10', '20', '30', '40', '50', '60']

# convert to int using asarray()
int_arr = np.asarray(str_list, dtype=int)
print(int_arr, type(int_arr))

# convert to float using asarray()
flt_arr = np.asarray(str_list, dtype=float)
print(flt_arr, type(flt_arr))
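One more practical difference is copying. As a minimal sketch: np.asarray() hands back the very same object when the input is already an array of the requested type, whereas astype() makes a copy by default.

```python
import numpy as np

int_arr = np.array([10, 20, 30])

same = np.asarray(int_arr)             # already an ndarray: no copy is made
copy = int_arr.astype(int_arr.dtype)   # astype() copies by default

print(same is int_arr)   # True
print(copy is int_arr)   # False

# Changing the original is visible through `same` but not through `copy`
int_arr[0] = 99
print(same[0], copy[0])  # 99 10
```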

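Loading an array from a text file with NumPy

This section shows how to load a text file into an array. np.loadtxt() does the job; its main parameters are:

- fname: path to the text file

- dtype: data type of the resulting array (float by default)

- delimiter: the string that separates values

- skiprows: number of rows to skip at the start

- usecols: which columns to load (counting from 0)

The example below loads the file test.txt, which contains: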

id,score,level
1,88,4
2,89,4
3,90,5
4,78,3

Loading code:

import numpy as np

# Load the array from a text file
array_from_file_1 = np.loadtxt('data/test.txt', dtype=int, skiprows=1, usecols=(0, 1, 2), delimiter=',')
print(array_from_file_1, type(array_from_file_1))

Output:

[[ 1 88  4]
 [ 2 89  4]
 [ 3 90  5]
 [ 4 78  3]] <class 'numpy.ndarray'>
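To try this without creating data/test.txt, np.loadtxt() also accepts a file-like object. The sketch below feeds the same content through io.StringIO (an assumption made so the example is self-contained) and also loads a single column with usecols.

```python
import io
import numpy as np

text = """id,score,level
1,88,4
2,89,4
3,90,5
4,78,3
"""

# skiprows=1 skips the header line; delimiter=',' splits on commas
data = np.loadtxt(io.StringIO(text), dtype=int, skiprows=1, delimiter=',')

# usecols=1 loads only the second column (columns are counted from 0)
scores = np.loadtxt(io.StringIO(text), dtype=int, skiprows=1,
                    delimiter=',', usecols=1)

print(data.shape)   # (4, 3)
print(scores)       # [88 89 90 78]
```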

A common error when loading files is trying to convert a non-numeric string to float, which raises:

str_arr = np.array(['a', 'b', 'c', 'd'])
flt_arr = str_arr.astype(float)

# ValueError: could not convert string to float: 'a'
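One way to cope with such values is to catch the error per element and store NaN instead. The helper below, to_float_or_nan, is a hypothetical name introduced for this sketch, not a NumPy function.

```python
import numpy as np

def to_float_or_nan(values):
    """Hypothetical helper: convert strings to float, using NaN on failure."""
    out = np.empty(len(values), dtype=float)
    for i, v in enumerate(values):
        try:
            out[i] = float(v)
        except ValueError:
            out[i] = np.nan
    return out

mixed = np.array(['1.5', 'b', '3'])
clean = to_float_or_nan(mixed)

print(clean)               # [1.5 nan 3. ]
print(np.nanmean(clean))   # 2.25, the mean ignoring NaN
```

The loop is slower than astype(), so this approach suits small inputs or one-off cleaning.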

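Learning NumPy statistics functions by example

NumPy is the core Python package for numerical array computing, including linear algebra. Nearly every library in the PyData ecosystem depends on it, which makes it central to data science in Python. It also makes statistical computation simple and direct. Let's explore!

NumPy arrays are the main way to store data. They can have any number of dimensions, but the two most common shapes are vectors and matrices: a vector is strictly one-dimensional, and a matrix is two-dimensional (note that a matrix can still have only one row or one column).

Finding the maximum and minimum of a NumPy array

We start with the basic statistics: finding the maximum and minimum. NumPy offers two pairs of methods, min/max and argmin/argmax. The first pair returns the values, the second returns their indices: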

import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])

print('The maximum element in the array is:', arr.max())
print('The minimum element in the array is:', arr.min())

print('The index value of the maximum element in the array is: ', arr.argmax())
print('The index value of the minimum element in the array is: ', arr.argmin())

Output:

The maximum element in the array is: 49
The minimum element in the array is: 2
The index value of the maximum element in the array is:  4
The index value of the minimum element in the array is:  5

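Two other useful functions are np.amin() and np.amax() (the same functions as np.min() and np.max()), which find the extreme values along a given dimension of a multidimensional array: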

arr2d= np.array([[1,23,78],[98,60,75],[79,25,48]])
print('Column wise minimum values are \n',np.amin(arr2d,axis=0))
print('Row wise minimum values are \n',np.amin(arr2d,axis=1))

The axis parameter selects the dimension to reduce: axis=0 gives the minimum of each column, and axis=1 gives the minimum of each row.
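The following sketch spells out what each axis returns for the same matrix, and adds one related detail: argmax() on a 2-D array returns an index into the flattened array, which np.unravel_index() turns back into a (row, column) pair.

```python
import numpy as np

arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])

col_min = np.amin(arr2d, axis=0)   # one value per column
row_min = np.amin(arr2d, axis=1)   # one value per row
print(col_min)   # [ 1 23 48]
print(row_min)   # [ 1 60 25]

# Position of the largest element (98) as a flat index, then as (row, column)
flat_idx = arr2d.argmax()
row, col = np.unravel_index(flat_idx, arr2d.shape)
print(flat_idx, row, col)   # 3 1 0
```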

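Computing the mean

np.mean() computes the mean of an array; you can also call the array's own mean() method. By default it averages over all elements; pass axis to average along one dimension: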

import numpy as np

arr = np.array([10, 12, 41, 17, 49,  2, 46,  3, 19, 39])
print(arr)
print(arr.mean())

# Mean of a 2-D array
arr2d = np.array( [[ 1, 23, 78],
                   [98, 60, 75],
                   [79, 25, 48]])
# Mean of all elements
print(arr2d.mean())
# Mean of each column
print(arr2d.mean(axis=0))
# Mean of each row
print(arr2d.mean(axis=1))
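As a quick check of what mean() computes, this sketch reproduces it by hand as the sum divided by the number of elements, both for the whole 1-D array and column by column.

```python
import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])

# Mean = sum of the elements / number of elements
manual_mean = arr.sum() / arr.size
print(manual_mean, arr.mean())

# Column means: sum down the rows, divide by the number of rows
manual_col_means = arr2d.sum(axis=0) / arr2d.shape[0]
print(manual_col_means)
print(arr2d.mean(axis=0))
```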

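Computing the median

The median is the middle element of the sorted values; if the number of elements is even, it is the mean of the two middle elements. Use np.median() to compute it. As with the mean, it works over all elements or along a dimension given by axis.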

import numpy as np

arr = np.array([10, 12, 41, 17, 49,  2, 46,  3, 19, 39])
print(arr)
print(np.median(arr))

arr2d = np.array( [[ 1, 23, 78],
                   [98, 60, 75],
                   [79, 25, 48]])

print(arr2d)
print(np.median(arr2d))
print(np.median(arr2d, axis=0))
print(np.median(arr2d, axis=1))
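The definition above can be sketched directly: sort the values, then take the middle one, or the mean of the two middle ones when the count is even. The array has 10 elements, so the even branch applies here.

```python
import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])

sorted_arr = np.sort(arr)   # [ 2  3 10 12 17 19 39 41 46 49]
n = sorted_arr.size

if n % 2 == 0:
    # Even count: average the two middle elements (17 and 19)
    manual_median = (sorted_arr[n // 2 - 1] + sorted_arr[n // 2]) / 2
else:
    # Odd count: take the single middle element
    manual_median = sorted_arr[n // 2]

print(manual_median, np.median(arr))   # 18.0 18.0
```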

Variance and standard deviation work the same way:

print('The overall variance of the array is : ',np.var(arr2d))
print('The column wise variance of the array is : ',np.var(arr2d, axis=0))
print('The row wise variance of the array is : ',np.var(arr2d, axis=1))

print('The overall standard deviation of the array is : ',np.std(arr2d))
print('The column wise standard deviation of the array is : ',np.std(arr2d, axis=0))
print('The row wise standard deviation of the array is : ',np.std(arr2d, axis=1))
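One detail worth knowing: np.var() and np.std() divide by N by default (the population formulas). Passing ddof=1 divides by N - 1 instead, which gives the sample variance. A small sketch with hand-picked data (an assumption chosen so the numbers come out round):

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5, squared deviations sum to 32

pop_var = np.var(x)             # 32 / 8
pop_std = np.std(x)             # square root of the population variance
sample_var = np.var(x, ddof=1)  # 32 / 7

print(pop_var, pop_std)   # 4.0 2.0
print(sample_var)
```

Which one is appropriate depends on whether the data is the whole population or a sample drawn from it.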

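Computing the weighted average

The previous section computed the mean with every element weighted equally. In some scenarios each value carries a different weight. Take student grading, for example, based on the following weights:

Exam marks: w1 = 0.8
Weekly Test marks: w2 = 0.6
Project work: w3 = 0.4
Attendance: w4 = 0.2
Behaviour in class: w5 = 0.1

np.average() computes the weighted average; the weights are passed through the weights parameter: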

import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
wt1d = np.array([0.2, 0.3, 0.4, 0.5, 0.1, 0.64, 0.15, 0.7, 0.9, 0.22])
print('The weights are : \n', wt1d)
print('The weighted average of the array is :', np.average(arr, weights=wt1d))

arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])
wt2d = np.array([8, 2, 3])
print('The column wise weighted average of the array is : ', np.average(arr2d, axis=0, weights=wt2d))
print('The row wise weighted average of the array is : ', np.average(arr2d, axis=1, weights=wt2d))
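To connect this with the grading weights listed earlier, the sketch below applies them to one student's marks. The marks themselves are hypothetical values made up for illustration. It also checks np.average() against the formula sum(w * x) / sum(w).

```python
import numpy as np

# Weights from the grading scheme: exam, weekly test, project, attendance, behaviour
weights = np.array([0.8, 0.6, 0.4, 0.2, 0.1])

# Hypothetical marks for one student, in the same order
marks = np.array([90, 80, 70, 100, 60])

weighted = np.average(marks, weights=weights)
manual = (weights * marks).sum() / weights.sum()

print(weighted)       # about 82.86
print(manual)         # same value from the formula
print(marks.mean())   # 80.0, the unweighted mean for comparison
```

Note that the weights do not need to sum to 1; np.average() normalizes by their sum.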

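Computing percentiles and range

A percentile is a score that gives an intuitive comparison between one data point and the rest of the group. For example, if a student scores 90 on a test and that score is at the 96th percentile, then 90 is higher than 96% of the other scores in the class. NumPy computes these values with np.percentile(). Besides the input array it takes two more parameters, q and axis: q is the percentile, or sequence of percentiles, to compute, between 0 and 100; axis is the dimension along which the percentiles are computed: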

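import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])

# 20th percentile of the 1-D array
print('The percentile of the array is :', np.percentile(arr, q=20))

# q=100 returns the maximum along the chosen axis
print('The column wise percentile of the array is : ', np.percentile(arr2d, q=100, axis=0))
print('The row wise percentile of the array is : ', np.percentile(arr2d, q=100, axis=1))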
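A few reference points make percentiles easier to read: the 0th percentile is the minimum, the 50th is the median, and the 100th is the maximum. The sketch below checks this and also passes a sequence for q; with the default method, values between data points are linearly interpolated.

```python
import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])

# q may be a sequence: minimum, median, and maximum in one call
p0, p50, p100 = np.percentile(arr, q=[0, 50, 100])
print(p0, p50, p100)   # 2.0 18.0 49.0

# 20th percentile: position 0.2 * (10 - 1) = 1.8 in the sorted array,
# i.e. 80% of the way from 3 to 10
p20 = np.percentile(arr, q=20)
print(p20)             # 8.6 up to floating-point rounding
```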

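The range is the difference between the largest and smallest values in a data set. np.ptp() computes it for a 1-D array or along a given dimension of a multidimensional array; ptp is short for Peak to Peak: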

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
print('The range of the array is :', np.ptp(arr))

arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])


print('The column wise range of the array is : ',np.ptp(arr2d, axis=0))
print('The row wise range of the array is : ',np.ptp(arr2d, axis=1))
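Since the range is just the maximum minus the minimum, np.ptp() can be checked against that definition, as in this short sketch:

```python
import numpy as np

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])

# 1-D: 49 - 2
print(np.ptp(arr), arr.max() - arr.min())   # 47 47

# Per column: max of each column minus min of each column
col_range = arr2d.max(axis=0) - arr2d.min(axis=0)
print(col_range)               # [97 37 30]
print(np.ptp(arr2d, axis=0))   # [97 37 30]
```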

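Computing the mode

The mode is an important statistic: the value that occurs most often, typically used with categorical data. NumPy has no function for it, but SciPy provides scipy.stats.mode(), which takes the input array and an axis argument. Its form is similar to NumPy's functions, and it returns not only the mode but also its count: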

import numpy as np
from scipy import stats

arr = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])
print('The mode of the array is :', stats.mode(arr))

arr2d = np.array([[1, 23, 78],
                  [98, 60, 75],
                  [79, 25, 48]])

print('The column wise mode of the array is : \n',stats.mode(arr2d, axis=0))
print('The row wise mode of the array is : \n',stats.mode(arr2d, axis=1))
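In the arrays above every value occurs exactly once, so the reported count is 1 and the mode is not very informative. If SciPy is not available, the mode of a 1-D array can also be sketched with NumPy alone using np.unique(). The data below is hypothetical and contains repeats on purpose.

```python
import numpy as np

# Hypothetical data with repeated values
votes = np.array([3, 1, 3, 2, 1, 3])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(votes, return_counts=True)

mode_value = values[counts.argmax()]   # value with the highest count
mode_count = counts.max()

print(mode_value, mode_count)   # 3 3
```

Because argmax() returns the first maximum and the values are sorted, ties in this sketch resolve to the smallest value.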