Several efficient NumPy functions for data analysis

Latest recommended article published 2022-03-19 11:59:31

Neo的作战室

Category: python. Tags: numpy

Article link: https://blog.csdn.net/qq_34741466/article/details/105081201


1.argpartition()

This is the description from the official NumPy documentation:

numpy.argpartition(a, kth, axis=-1, kind='introselect', order=None)

Perform an indirect partition along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in partitioned order.

Parameters:
a : array_like
Array to sort.
kth : int or sequence of ints
Element index to partition by. The k-th element will be in its final sorted position and all smaller elements will be moved before it and all larger elements behind it. The order of all elements in the partitions is undefined. If provided with a sequence of k-th it will partition all of them into their sorted position at once.

axis : int or None, optional
Axis along which to sort. The default is -1 (the last axis). If None, the flattened array is used.

kind : {'introselect'}, optional
Selection algorithm. Default is 'introselect'

order : str or list of str, optional
When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.

Returns:
index_array : ndarray, int
Array of indices that partition a along the specified axis. If a is one-dimensional, a[index_array] yields a partitioned a. More generally, np.take_along_axis(a, index_array, axis=axis) always yields the partitioned a, irrespective of dimensionality.
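As a minimal sketch of the take_along_axis relationship described above (the array values and the names `idx` and `partitioned` are made up for illustration), the following partitions each row of a 2-D array around sorted position 2:

```python
import numpy as np

# Illustrative data, not taken from the article
a = np.array([[9, 4, 7, 1, 5],
              [2, 8, 6, 3, 0]])

# Indices that partition each row around sorted position 2
idx = np.argpartition(a, 2, axis=1)

# Apply the indices row by row to obtain the partitioned array
partitioned = np.take_along_axis(a, idx, axis=1)

# Column 2 now holds each row's 3rd smallest value,
# exactly where a fully sorted copy would have it
print(partitioned[:, 2])          # [5 3]
print(np.sort(a, axis=1)[:, 2])   # [5 3]
```

Only column 2 is guaranteed to match the sorted copy; the values to its left are no larger and the values to its right are no smaller, but their order is unspecified.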

It can be used to partially sort data, much like the partition step in quicksort.

import numpy as np
x = np.random.randint(1, 100, (10,))
x
# array([57, 46, 65, 42, 52,  1, 91, 33, 16, 53])
# First sort x for reference
sorted_x = np.sort(x)
sorted_x
# array([ 1, 16, 33, 42, 46, 52, 53, 57, 65, 91])

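# Find the value at sorted position 3 (the 4th smallest value)
index = np.argpartition(x, 3)
index
# The result is an array of indices. index[3] points at the value that belongs at sorted position 3;
# the indices before it point at values that are no larger, and the indices after it at values that are no smaller.
# The order within each side is not guaranteed and can differ between NumPy versions.
x[index[3]]
# 42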

# Find the 5th largest value
index = np.argpartition(x, -5)
index
# array([5, 8, 7, 3, 1, 4, 9, 6, 2, 0], dtype=int64)
x[index[-5]]
# 52

# Find the 4th smallest, the 4th largest, and the largest values in one call
index = np.argpartition(x, [3, -4, -1])
index
# array([5, 8, 7, 3, 1, 4, 9, 0, 2, 6], dtype=int64)
x[index[[3, -4, -1]]]
# array([42, 53, 91])

x[index[3]]
# 42
x[index[-4]]
# 53
x[index[-1]]
# 91


# Find the 4 largest values
x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
index_val = np.argpartition(x, -4)[-4:]
index_val
# array([1, 8, 2, 0], dtype=int64)
np.sort(x[index_val])
# array([10, 12, 12, 16])


# If the input is a 2-D array (a matrix)
np.random.seed(666)
x = np.random.randint(1,100, (5,5))
x
# array([[ 3, 46, 31, 63, 71],
#        [74, 31, 37, 62, 92],
#        [95, 52, 61, 96, 29],
#        [15, 98, 64, 17, 47],
#        [40, 70, 83, 77, 80]])
# First look at the sorted result for reference
np.sort(x)
# array([[ 3, 31, 46, 63, 71],
#        [31, 37, 62, 74, 92],
#        [29, 52, 61, 95, 96],
#        [15, 17, 47, 64, 98],
#        [40, 70, 77, 80, 83]])
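# Find the 3rd smallest value in each row (kth=2)
index = np.argpartition(x, 2)
index
# Column 2 of the result holds the index of each row's 3rd smallest value, e.g. 1 for the first row and 3 for the second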
# array([[0, 2, 1, 3, 4],
#        [1, 2, 3, 0, 4],
#        [4, 1, 2, 3, 0],
#        [0, 3, 4, 1, 2],
#        [0, 1, 3, 2, 4]], dtype=int64)
# The 3rd smallest value in row 0
x[0][index[0][2]]
# 46

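This looks like a roundabout approach, but when the array is very large, say millions of elements, argpartition can quickly find the value at one or several sorted positions without paying for a full sort.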
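One common use of this idea is a top-k selection. The sketch below is only an illustration: the helper name `top_k_indices` and the random test data are assumptions, not part of the original article. It partitions first and then sorts just the k candidates.

```python
import numpy as np

def top_k_indices(a, k):
    """Return the indices of the k largest values of a 1-D array, largest first.

    Hypothetical helper for illustration; assumes 1 <= k <= len(a).
    """
    a = np.asarray(a)
    # argpartition places the k largest values in the last k slots (unordered)
    candidates = np.argpartition(a, -k)[-k:]
    # sort only those k candidates, then reverse for descending order
    order = np.argsort(a[candidates])[::-1]
    return candidates[order]

rng = np.random.default_rng(0)
data = rng.random(1_000_000)

idx = top_k_indices(data, 5)
full_sort_top5 = np.sort(data)[-5:][::-1]   # reference result via a full sort
print(np.array_equal(data[idx], full_sort_top5))  # True
```

The partition step runs in roughly linear time on average, so only the final k-element sort costs O(k log k); how much faster this is than a full sort in practice depends on the array size and k.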

A quick aside on np.sort():
by default it sorts along the last axis (axis=-1), which for a 2-D array is the same as axis=1.

x = np.random.randint(1,10, (5,5))
x
# array([[3, 7, 5, 4, 2],
#        [1, 9, 8, 6, 3],
#        [6, 6, 5, 9, 5],
#        [5, 1, 1, 5, 1],
#        [5, 6, 8, 2, 1]])
# axis=1 sorts the values within each row; axis=0 sorts the values within each column
np.sort(x,axis = 1)
# array([[2, 3, 4, 5, 7],
#        [1, 3, 6, 8, 9],
#        [5, 5, 6, 6, 9],
#        [1, 1, 1, 5, 5],
#        [1, 2, 5, 6, 8]])

np.sort(x, axis = 0)
# array([[1, 1, 1, 2, 1],
#        [3, 6, 5, 4, 1],
#        [5, 6, 5, 5, 2],
#        [5, 7, 8, 6, 3],
#        [6, 9, 8, 9, 5]])
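To make the default explicit, here is a small sketch (the array is made up for illustration) comparing np.sort with no axis argument, with axis=1, and with axis=None, which flattens the array before sorting:

```python
import numpy as np

# Illustrative data, not taken from the article
x = np.array([[3, 7, 5],
              [9, 1, 8]])

default_sorted = np.sort(x)            # default is axis=-1, the last axis
row_sorted = np.sort(x, axis=1)        # identical for a 2-D array
flat_sorted = np.sort(x, axis=None)    # flatten first, then sort

print(np.array_equal(default_sorted, row_sorted))  # True
print(flat_sorted)                                 # [1 3 5 7 8 9]
```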

2.allclose()

This function checks whether two arrays are element-wise equal within a given tolerance.

numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)

Returns True if two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.
If either array contains one or more NaNs, False is returned unless equal_nan=True. Infs are treated as equal if they are in the same place and of the same sign in both arrays.

Parameters:
a, b : array_like
Input arrays to compare.

rtol : float
The relative tolerance parameter (see Notes).

atol : float
The absolute tolerance parameter (see Notes).

equal_nan : bool
Whether to compare NaN’s as equal. If True, NaN’s in a will be considered equal to NaN’s in b in the output array.

New in version 1.10.0.

Returns:
allclose : bool
Returns True if the two arrays are equal within the given tolerance; False otherwise.

If the following equation is element-wise True, then allclose returns
True.
absolute(a - b) <= (atol + rtol * absolute(b))
The above equation is not symmetric in a and b, so that
allclose(a, b) might be different from allclose(b, a) in
some rare cases.
The comparison of a and b uses standard broadcasting, which
means that a and b need not have the same shape in order for
allclose(a, b) to evaluate to True. The same is true for
equal but not array_equal.
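The asymmetry mentioned in the notes can be reproduced with a small sketch. The numbers below are chosen only because they are exactly representable in binary floating point; they are not from the article.

```python
import numpy as np

a, b = 8.0, 9.0625          # |a - b| = 1.0625
rtol, atol = 0.125, 0.0     # illustrative tolerances

def close_by_formula(x, y):
    """Evaluate the documented test |x - y| <= atol + rtol * |y| directly."""
    return bool(np.abs(x - y) <= atol + rtol * np.abs(y))

# The tolerance is scaled by the SECOND argument
print(np.allclose(a, b, rtol=rtol, atol=atol))  # True:  1.0625 <= 0.125 * 9.0625
print(np.allclose(b, a, rtol=rtol, atol=atol))  # False: 1.0625 >  0.125 * 8.0
print(close_by_formula(a, b), close_by_formula(b, a))  # True False
```

When the order of the arguments should not matter, one option is to treat the reference value as b, since that is the operand the relative tolerance is applied to.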

>>> np.allclose([1e10,1e-7], [1.00001e10,1e-8])
False
>>> np.allclose([1e10,1e-8], [1.00001e10,1e-9])
True
>>> np.allclose([1e10,1e-8], [1.0001e10,1e-9])
False
>>> np.allclose([1.0, np.nan], [1.0, np.nan])
False
>>> np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
True

>>> array1 = np.array([0.12,0.17,0.24,0.29])
>>> array2 = np.array([0.13,0.19,0.26,0.31])
>>> np.allclose(array1, array2, rtol=0.1)    # the third positional argument is rtol
False
>>> np.allclose(array1, array2, rtol=0.2)
True
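To see which element makes the rtol=0.1 comparison fail, np.isclose returns the per-element result that allclose reduces with all(). A short sketch using the same two arrays:

```python
import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

# Element-wise version of the same tolerance test
per_element = np.isclose(array1, array2, rtol=0.1)
print(per_element)        # [ True False  True  True]

# allclose is True only if every element passes
print(per_element.all())  # False
```

The second pair fails because the difference of about 0.02 exceeds 0.1 * 0.19 = 0.019 plus the default atol of 1e-08.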

3.clip()

Clips the elements of an array to a specified range.

numpy.clip(a, a_min, a_max, out=None, **kwargs)

Clip (limit) the values in an array.

Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.

Equivalent to but faster than np.maximum(a_min, np.minimum(a, a_max)). No check is performed to ensure a_min < a_max.

Parameters:
a : array_like
Array containing elements to clip.

a_min : scalar or array_like or None
Minimum value. If None, clipping is not performed on lower interval edge. Not more than one of a_min and a_max may be None.

a_max : scalar or array_like or None
Maximum value. If None, clipping is not performed on upper interval edge. Not more than one of a_min and a_max may be None. If a_min or a_max are array_like, then the three arrays will be broadcasted to match their shapes.

out : ndarray, optional
The results will be placed in this array. It may be the input array for in-place clipping. out must be of the right shape to hold the output. Its type is preserved.

**kwargs
For other keyword-only arguments, see the ufunc docs.

New in version 1.17.0.

Returns:
clipped_array : ndarray
An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.
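Given an interval, values outside it are clipped to the interval edges:
values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound.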
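As a minimal sketch (the bounds 2 and 7 are arbitrary), clipping to a valid interval gives the same result as combining np.maximum and np.minimum by hand, and passing None for one bound clips on one side only:

```python
import numpy as np

a = np.arange(10)

# Two-sided clipping versus the hand-written equivalent
clipped = np.clip(a, 2, 7)
by_hand = np.minimum(np.maximum(a, 2), 7)
print(np.array_equal(clipped, by_hand))  # True

# One-sided clipping: None disables that bound
print(np.clip(a, None, 5))  # [0 1 2 3 4 5 5 5 5 5]
print(np.clip(a, 3, None))  # [3 3 3 3 4 5 6 7 8 9]
```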

>>> a = np.arange(10)
>>> np.clip(a, 1, 8)
array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8])
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, 3, 6, out=a)    # out=a writes the result back into the original array (in place)
array([3, 3, 3, 3, 4, 5, 6, 6, 6, 6])
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.clip(a, [3, 4, 1, 1, 1, 4, 4, 4, 4, 4], 8)
array([3, 4, 2, 3, 4, 5, 6, 7, 8, 8])

>>> x = np.random.randint(1,100, (10,))
>>> x
array([78, 46,  3, 89, 64, 24, 12, 72, 55, 85])
>>> np.clip(x, 1,10)
array([10, 10,  3, 10, 10, 10, 10, 10, 10, 10])

4.extract()

Extracts the elements of an array that satisfy a given condition.
numpy.extract(condition, arr)
Return the elements of an array that satisfy some condition.

This is equivalent to np.compress(ravel(condition), ravel(arr)). If condition is boolean np.extract is equivalent to arr[condition].

Note that place does the exact opposite of extract.

Parameters:
condition : array_like
An array whose nonzero or True entries indicate the elements of arr to extract.

arr : array_like
Input array of the same size as condition.

Returns:
extract : ndarray
Rank 1 array of values from arr where condition is True.
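The equivalences stated above can be checked with a short sketch. The array and the condition are illustrative, and the np.place call at the end only demonstrates the "opposite" operation that the documentation mentions:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
condition = arr % 3 == 0          # boolean mask with the same shape as arr

via_extract = np.extract(condition, arr)
via_mask = arr[condition]         # boolean indexing
via_compress = np.compress(np.ravel(condition), np.ravel(arr))

print(via_extract)                                # [0 3 6 9]
print(np.array_equal(via_extract, via_mask))      # True
print(np.array_equal(via_extract, via_compress))  # True

# np.place goes the other way: it writes values INTO the masked positions
target = arr.copy()
np.place(target, condition, [-1])
print(target[0])                                  # [-1  1  2 -1]
```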

>>> arr = np.arange(12).reshape((3, 4))
>>> arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> condition = np.mod(arr, 3)==0
>>> condition
array([[ True, False, False,  True],
       [False, False,  True, False],
       [False,  True, False, False]])
>>> np.extract(condition, arr)
array([0, 3, 6, 9])

>>> array = np.random.randint(20, size=12)
>>> array
array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])
>>> # Divide by 2 and check whether the remainder is 1
>>> cond = np.mod(array, 2)==1
>>> cond
array([False,  True, False,  True, False, False, False,  True, False, True, False,  True])
>>> np.extract(cond, array)
array([ 1, 19, 11, 13,  3])
>>> np.extract(((array < 3) | (array > 15)), array)
array([ 0,  1, 19, 16, 18,  2])

5.where()

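Selects elements of an array according to a condition. When called with only a condition, it returns the index positions of the values that satisfy it.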

numpy.where(condition[, x, y])
Return elements chosen from x or y depending on condition.

Note
When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.

Parameters:
condition : array_like, bool
Where True, yield x, otherwise yield y.

x, y : array_like
Values from which to choose. x, y and condition need to be broadcastable to some shape.

Returns:
out : ndarray
An array with elements from x where condition is True, and elements from y elsewhere.

If all the arrays are 1-D, where is equivalent to:

[xv if c else yv
for c, xv, yv in zip(condition, x, y)]
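A quick sketch of that equivalence for 1-D inputs (the three arrays are made up for illustration):

```python
import numpy as np

condition = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# Vectorized selection
vectorized = np.where(condition, x, y)

# The pure-Python equivalent quoted from the documentation
looped = [xv if c else yv for c, xv, yv in zip(condition, x, y)]

print(vectorized)                          # [ 1 20  3 40]
print(np.array_equal(vectorized, looped))  # True
```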

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])
>>> np.where([[True, False], [True, True]],
...          [[1, 2], [3, 4]],
...          [[9, 8], [7, 6]])
array([[1, 8],
       [3, 4]])
>>> x, y = np.ogrid[:3, :4]
>>> np.where(x < y, x, 10 + y)  
array([[10,  0,  0,  0],
       [10, 11,  1,  1],
       [10, 11, 12,  2]])
>>> a = np.array([[0, 1, 2],
...               [0, 2, 4],
...               [0, 3, 6]])
>>> np.where(a < 4, a, -1) 
array([[ 0,  1,  2],
       [ 0,  2, -1],
       [ 0,  3, -1]])


>>> y = np.array([1,5,6,8,1,7,3,6,9])
>>> index = np.where(y>5)
>>> index
(array([2, 3, 5, 7, 8], dtype=int64),)  
# The indices of the elements that satisfy the condition
>>> y[index]
array([6, 8, 7, 6, 9])
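As the note in the documentation says, the condition-only form is shorthand for nonzero. A small sketch with the same array shows that, and also that a boolean mask yields the same values without an intermediate index array:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

idx_where = np.where(y > 5)        # tuple with one index array per dimension
idx_nonzero = np.nonzero(y > 5)    # the form the documentation recommends

print(idx_where[0])                                   # [2 3 5 7 8]
print(np.array_equal(idx_where[0], idx_nonzero[0]))   # True

# Boolean indexing selects the same values directly
print(y[y > 5])                                       # [6 8 7 6 9]
```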

6.percentile()

Computes the n-th percentile of the array elements along a specified axis.

numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)

Compute the q-th percentile of the data along the specified axis.

Returns the q-th percentile(s) of the array elements.

Parameters:
a : array_like
Input array or object that can be converted to an array.

q : array_like of float
Percentile or sequence of percentiles to compute, which must be between 0 and 100 inclusive.

axis : {int, tuple of int, None}, optional
Axis or axes along which the percentiles are computed. The default is to compute the percentile(s) along a flattened version of the array.

Changed in version 1.9.0: A tuple of axes is supported

out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output, but the type (of the output) will be cast if necessary.

overwrite_input : bool, optional
If True, then allow the input array a to be modified by intermediate calculations, to save memory. In this case, the contents of the input a after this function completes is undefined.

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j (in NumPy 1.22 and later this parameter was renamed to method, and interpolation is deprecated):

'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
'lower': i.
'higher': j.
'nearest': i or j, whichever is nearest.
'midpoint': (i + j) / 2.
New in version 1.9.0.

keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array a.

New in version 1.9.0.

Returns:
percentile : scalar or ndarray
If q is a single percentile and axis=None, then the result is a scalar. If multiple percentiles are given, first axis of the result corresponds to the percentiles. The other axes are the axes that remain after the reduction of a. If the input contains integers or floats smaller than float64, the output data-type is float64. Otherwise, the output data-type is the same as that of the input. If out is specified, that array is returned instead.

Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V. The values and distances of the two nearest neighbors as well as the interpolation parameter will determine the percentile if the normalized ranking does not match the location of q exactly. This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.
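The description above can be turned into a few lines of code. The sketch below is a hand-rolled illustration of the default 'linear' rule, not NumPy's actual implementation, and the helper name `percentile_linear` is an assumption:

```python
import numpy as np

def percentile_linear(values, q):
    """q-th percentile of a 1-D sequence via linear interpolation (illustrative only)."""
    v = np.sort(np.asarray(values, dtype=float))
    pos = q / 100 * (len(v) - 1)    # fractional position in the sorted copy
    i = int(np.floor(pos))          # lower neighbour
    j = min(i + 1, len(v) - 1)      # upper neighbour, clamped at the end
    fraction = pos - i
    return v[i] + (v[j] - v[i]) * fraction

data = [1, 5, 6, 8, 1, 7, 3, 6, 9]
for q in (0, 25, 50, 90, 100):
    # Print the hand-rolled value next to NumPy's result for comparison
    print(q, percentile_linear(data, q), np.percentile(data, q))
```

With q=0, 50 and 100 the helper returns the minimum, the median and the maximum of the data, matching the special cases listed above.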

>>> a = np.array([[10, 7, 4], [3, 2, 1]])
>>> a
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> np.percentile(a, 50)
3.5
>>> np.percentile(a, 50, axis=0)
array([6.5, 4.5, 2.5])
>>> np.percentile(a, 50, axis=1)
array([7.,  2.])
>>> np.percentile(a, 50, axis=1, keepdims=True)
array([[7.],
       [2.]])

>>> m = np.percentile(a, 50, axis=0)
>>> out = np.zeros_like(m)
>>> np.percentile(a, 50, axis=0, out=out)
array([6.5, 4.5, 2.5])
>>> m
array([6.5, 4.5, 2.5])

>>> b = a.copy()
>>> np.percentile(b, 50, axis=1, overwrite_input=True)
array([7.,  2.])
>>> assert not np.all(a == b)


>>> a = np.array([1,5,6,8,1,7,3,6,9])
>>> print("50th Percentile of a, axis = 0 : ",np.percentile(a, 50, axis =0))
50th Percentile of a, axis = 0 :  6.0

>>> a[np.where(a > 5)]
array([6, 8, 7, 6, 9])