Learning Pandas: Prerequisites

sosososoon

Category: Data Analysis and Mining. Tags: python, numpy, pandas

Article link: https://blog.csdn.net/sosososoon/article/details/111273557

Importing the library

import numpy as np

Python basics

List comprehensions and conditional assignment

L = []

def my_f(x):
    return 2*x

for i in range(5):
    L.append(my_f(i))
    
L

[0, 2, 4, 6, 8]

The loop above can be simplified with a list comprehension:

[my_f(i) for i in range(5)]

[0, 2, 4, 6, 8]

List comprehensions also support nesting. In the example below, the first for is the outer loop and the second for is the inner loop:

[m+'_'+n for m in ['a','b'] for n in ['c','d']]

['a_c', 'a_d', 'b_c', 'b_d']
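As a small illustration of the loop order, the nested comprehension can be compared against two explicit loops. This is only a sketch; the names `pairs_loop` and `pairs_comp` are introduced here for the example.

```python
# Explicit nested loops: the first `for` of the comprehension is the outer loop.
pairs_loop = []
for m in ['a', 'b']:
    for n in ['c', 'd']:
        pairs_loop.append(m + '_' + n)

# The same iteration written as a nested list comprehension.
pairs_comp = [m + '_' + n for m in ['a', 'b'] for n in ['c', 'd']]

print(pairs_loop == pairs_comp)  # True
```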

A conditional assignment with if takes the form value = a if condition else b:

value = 'cat' if 2>1 else 'dog'
value

'cat'

For example, to cap the elements of a list that exceed 5:

L = [1,2,3,4,5,6,7]
[i if i <= 5 else 5 for i in L]

[1, 2, 3, 4, 5, 5, 5]
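It may help to contrast the two places an if can appear in a comprehension. The sketch below uses hypothetical names (`values`, `capped`, `kept`): a conditional expression before for maps every element, while a trailing if filters elements out.

```python
values = [1, 2, 3, 4, 5, 6, 7]

# `a if cond else b` before `for` maps every element, so the length is preserved.
capped = [v if v <= 5 else 5 for v in values]

# A trailing `if` after `for` filters elements, so the length may shrink.
kept = [v for v in values if v <= 5]

print(capped)  # [1, 2, 3, 4, 5, 5, 5]
print(kept)    # [1, 2, 3, 4, 5]
```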

Anonymous functions and map

Some functions express a clear, simple mapping. These can be written concisely as anonymous (lambda) functions:

my_f = lambda x: 2*x
my_f(3)

my_add = lambda a,b: a+b 
my_add(1,2)

When an anonymous function is used inside a list comprehension, we do not care about its name, only about the mapping it defines:

[(lambda x:2*x)(i) for i in range(5)]

[0, 2, 4, 6, 8]

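The lambda mapping in the list comprehension above can also be done with the map function. It returns a map object, which has to be converted to a list with list: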

list(map(lambda x:2*x,range(5)))

[0, 2, 4, 6, 8]

For a mapping that takes several inputs, pass additional iterables to map:

list(map(lambda x,y :str(x)+'_'+y,range(5),list('abcde')))

['0_a', '1_b', '2_c', '3_d', '4_e']
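As a rough sketch of what map does with several iterables, the call can be compared with a comprehension over zip. The names `via_map`, `via_zip`, and `short` are made up for this example.

```python
# map with two iterables pairs elements by position, like zip.
via_map = list(map(lambda x, y: str(x) + '_' + y, range(5), list('abcde')))
via_zip = [str(x) + '_' + y for x, y in zip(range(5), 'abcde')]
print(via_map == via_zip)  # True

# With iterables of unequal length, map stops at the shortest one.
short = list(map(lambda x, y: str(x) + '_' + y, range(5), 'ab'))
print(short)  # ['0_a', '1_b']
```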

Reshaping and concatenating np arrays

Transpose: T

np.zeros((2,3)).T

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

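Concatenation: r_, c_

For two-dimensional arrays, r_ and c_ stack vertically and horizontally, respectively.

Vertical stacking requires the second dimension (the number of columns) to match.
Horizontal stacking requires the first dimension (the number of rows) to match.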

np.r_[np.zeros((1,3)),np.zeros((3,3))]

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

np.c_[np.zeros((3,2)),np.zeros((3,4))]

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

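When a 1-D array is concatenated with a 2-D array, the 1-D array is treated as a column vector, so even when the lengths match only horizontal concatenation (c_) works; r_ raises an error: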

try:
    np.r_[np.array([0,0]),np.zeros((2,1))]
except Exception as e:
    Err_Msg = e

Err_Msg

ValueError('all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)')

np.r_[np.array([1,1]),np.zeros(2)]

array([1., 1., 0., 0.])

np.c_[np.array([1,1]),np.zeros((2,3))]

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.]])
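For 2-D inputs, the behaviour described above can be checked against np.concatenate. The following is a minimal sketch, with array names (`top`, `bottom`, `left`, `right`, `vec`) chosen only for illustration.

```python
import numpy as np

top = np.zeros((1, 3))
bottom = np.ones((3, 3))
# For 2-D inputs, np.r_ stacks along axis 0 (rows on top of rows).
stacked_rows = np.r_[top, bottom]
print(np.array_equal(stacked_rows, np.concatenate([top, bottom], axis=0)))  # True

left = np.zeros((3, 2))
right = np.ones((3, 4))
# For 2-D inputs, np.c_ stacks along axis 1 (columns next to columns).
stacked_cols = np.c_[left, right]
print(np.array_equal(stacked_cols, np.concatenate([left, right], axis=1)))  # True

# A 1-D array passed to np.c_ is treated as a column vector.
vec = np.array([1, 1])
with_column = np.c_[vec, np.zeros((2, 3))]
print(with_column.shape)  # (2, 4)
```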

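Changing dimensions

reshape rearranges the data into a new shape. It has two modes, C order and F order, which read and fill the elements row by row and column by column, respectively.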

target = np.arange(8).reshape(2,4)
target

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

target.reshape((4,2),order = 'C') # read and fill row by row

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

target.reshape((4,2),order = 'F') # read and fill column by column

array([[0, 2],
       [4, 6],
       [1, 3],
       [5, 7]])

Because the size of the array being reshaped is fixed, reshape allows one dimension to be left unspecified; pass -1 for that dimension:

target.reshape((4,-1))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
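One way to see what the two orders mean is to look at the order in which elements are read, using ravel. This is a sketch with a hypothetical array name `grid`, not part of the original notebook.

```python
import numpy as np

grid = np.arange(8).reshape(2, 4)

# C order reads the elements row by row, F order column by column.
print(grid.ravel(order='C'))  # [0 1 2 3 4 5 6 7]
print(grid.ravel(order='F'))  # [0 4 1 5 2 6 3 7]

# reshape(order='F') reads in F order and also fills in F order,
# so the first column of the result holds the first four elements read.
reshaped_f = grid.reshape((4, 2), order='F')
print(reshaped_f[:, 0])  # [0 4 1 5]
print(reshaped_f[:, 1])  # [2 6 3 7]
```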

Flattening an array to 1-D:

target = np.ones((3,3))
target

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

target.reshape(-1)

array([1., 1., 1., 1., 1., 1., 1., 1., 1.])

Slicing and indexing np arrays

np supports slices of the form start:end:step, and a list of indices can also be passed directly for a given dimension:

target = np.arange(9).reshape(3,3)
target

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

target[0:3,[0,2]]

array([[0, 2],
       [3, 5],
       [6, 8]])

In addition, np.ix_ allows boolean indexing along each dimension, but slices cannot be used together with it:

target[np.ix_([True, False, True], [True, False, True])]

array([[0, 2],
       [6, 8]])
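To make the role of np.ix_ concrete, the sketch below compares it with integer positions and with plain paired index lists. The names `arr`, `rows`, `cols`, and `block` are assumptions for this example.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)
rows = [True, False, True]
cols = [True, False, True]

# np.ix_ turns the boolean masks into integer positions and builds an open mesh,
# selecting every combination of the chosen rows and columns.
block = arr[np.ix_(rows, cols)]
print(block.tolist())  # [[0, 2], [6, 8]]

# The same selection written with integer positions.
print(np.array_equal(block, arr[np.ix_([0, 2], [0, 2])]))  # True

# Without np.ix_, two index lists are paired element by element instead.
print(arr[[0, 2], [0, 2]].tolist())  # [0, 8]
```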

new = target.reshape(-1)
new

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

new[new%2==0]

array([0, 2, 4, 6, 8])

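Common functions

where

where is a conditional function: it lets you specify the fill values for positions where the condition holds and for positions where it does not.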

a = np.array([-1,1,-1,0])
np.where(a>0, a, 5) # take the element of a where the condition is True, otherwise 5

array([5, 1, 5, 5])
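A few more variations may clarify how the arguments of np.where interact. This is a minimal sketch; `vals` and `clipped` are hypothetical names.

```python
import numpy as np

vals = np.array([-1, 1, -1, 0])

# Three-argument form: choose element-wise between two arrays or scalars.
clipped = np.where(vals > 0, vals, 0)
print(clipped.tolist())  # [0, 1, 0, 0]

# Both branches may be arrays; here negatives are replaced by their absolute value.
print(np.where(vals < 0, -vals, vals).tolist())  # [1, 1, 1, 0]

# One-argument form returns a tuple of index arrays where the condition holds.
positive_idx = np.where(vals > 0)[0]
print(positive_idx.tolist())  # [1]
```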

nonzero, argmax, argmin

nonzero returns the indices of the nonzero elements; argmax and argmin return the indices of the maximum and minimum values, respectively:

a = np.array([3,-5,0,1,3,-1])
a

array([ 3, -5,  0,  1,  3, -1])

np.nonzero(a)

(array([0, 1, 3, 4, 5], dtype=int64),)

a[np.nonzero(a)] # 取出非零值

array([ 3, -5,  1,  3, -1])

print('Index of the maximum:') # with tied maxima, only the first one is reported
print(a.argmax())
print('Maximum value:')
print(a[a.argmax()])

Index of the maximum:
0
Maximum value:
3

any, all

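any: returns True when the sequence contains at least one True or nonzero element, otherwise False
all: returns True when every element of the sequence is True or nonzero, otherwise False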

a = np.array([0,1])

a.any()

True

a.all()

False

cumprod, cumsum, diff

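cumprod and cumsum are the cumulative product and cumulative sum; both return an array of the same length as the input. diff takes the difference between each element and the previous one. Because the first element has no predecessor, the result with default arguments has length one less than the original array.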

a = np.array([1,2,3])

a.cumprod()

array([1, 2, 6], dtype=int32)

a.cumsum()

array([1, 3, 6], dtype=int32)

np.diff(a)

array([1, 1])
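The length behaviour of diff, and its relationship to cumsum, can be sketched as follows. The name `seq` is made up for the example, and the `prepend` argument assumes a reasonably recent NumPy (it was added in version 1.16).

```python
import numpy as np

seq = np.array([1, 2, 3])

# diff drops one element because the first element has no predecessor.
print(np.diff(seq).tolist())  # [1, 1]

# Supplying a value to prepend keeps the original length.
print(np.diff(seq, prepend=0).tolist())  # [1, 1, 1]

# cumsum undoes diff once the first element is restored.
restored = np.r_[seq[0], np.diff(seq)].cumsum()
print(np.array_equal(restored, seq))  # True
```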

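Statistical functions

Common statistical functions include max, min, mean, median, std, var, sum, and quantile. Quantiles are computed by a module-level function only, so they cannot be obtained by calling array.quantile (the same holds for median):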

target = np.arange(5)
target

array([0, 1, 2, 3, 4])

target.mean()

2.0

np.quantile(target, 0.75)

3.0

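For arrays that contain missing values, these functions also return a missing value (nan). To skip the missing values you must use the nan-prefixed functions; each of the statistical functions above has a corresponding nan version.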

target = np.array([1,2,3,np.nan,4,5])
target

array([ 1.,  2.,  3., nan,  4.,  5.])

target.max()

nan

np.nanmax(target)

5.0

np.nanquantile(target, 0.5)

3.0
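As a quick check of what the nan-prefixed functions do, np.nanmean can be compared with masking out the missing values by hand. This is only a sketch, and `data` is a hypothetical name.

```python
import numpy as np

data = np.array([1, 2, 3, np.nan, 4, 5])

# The plain mean propagates NaN.
print(np.isnan(data.mean()))  # True

# nanmean skips NaN, which matches removing the NaN entries by hand.
print(np.nanmean(data))              # 3.0
print(data[~np.isnan(data)].mean())  # 3.0
```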

Covariance and correlation coefficients can be computed with cov and corrcoef:

target1 = np.array([1,3,5,9])
target2 = np.array([1,5,3,-9])
print('Covariance:')
print(np.cov(target1, target2))
print('Correlation coefficient:')
print(np.corrcoef(target1, target2))

Covariance:
[[ 11.66666667 -16.66666667]
 [-16.66666667  38.66666667]]
Correlation coefficient:
[[ 1.         -0.78470603]
 [-0.78470603  1.        ]]

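For two-dimensional Numpy arrays, the axis parameter of the statistical functions computes the statistic along one dimension: axis=0 gives one result per column, and axis=1 gives one result per row.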

target = np.arange(1,10).reshape(3,-1)
target

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

target.sum(0)

array([12, 15, 18])

target.sum(1)

array([ 6, 15, 24])
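The same axis convention applies to the other statistical functions. The sketch below uses mean, plus the keepdims option, on a hypothetical array named `table`.

```python
import numpy as np

table = np.arange(1, 10).reshape(3, -1)

col_means = table.mean(axis=0)  # one value per column
row_means = table.mean(axis=1)  # one value per row
print(col_means.tolist())  # [4.0, 5.0, 6.0]
print(row_means.tolist())  # [2.0, 5.0, 8.0]

# keepdims=True keeps the reduced axis with size 1, so the row means
# have shape (3, 1) and can be subtracted from each row directly.
centered = table - table.mean(axis=1, keepdims=True)
print(centered.sum(axis=1).tolist())  # [0.0, 0.0, 0.0]
```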

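Broadcasting

Broadcasting handles operations between two arrays of different shapes.

Operations between a scalar and an array

When a scalar is combined with an array, the scalar is automatically expanded to the shape of the array, and the operation is then applied element-wise.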

res = 3 * np.ones((2,2)) + 1
res

array([[4., 4.],
       [4., 4.]])

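Operations between two-dimensional arrays

When two arrays have exactly the same shape, the operation is applied element-wise. Otherwise an error is raised, unless one array has shape m×n and the other has shape m×1 or 1×n; in that case the dimension of size 1 is expanded to the size of the corresponding dimension of the other array.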

res = np.ones((3,2))
res

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

res * np.array([[2,3]]) # expanded to 3 rows

array([[2., 3.],
       [2., 3.],
       [2., 3.]])

res * np.array([[2],[3],[4]]) # expanded to 2 columns

array([[2., 2.],
       [3., 3.],
       [4., 4.]])

res * np.array([[2]]) # equivalent to expanding twice

array([[2., 2.],
       [2., 2.],
       [2., 2.]])

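Operations between a one-dimensional and a two-dimensional array

When a one-dimensional array $A_k$ operates with a two-dimensional array $B_{m,n}$, the one-dimensional array is treated as a two-dimensional array $A_{1,k}$. An error is raised when $k \neq n$ and neither $k$ nor $n$ is 1.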

np.ones(3) + np.ones((2,3))

array([[2., 2., 2.],
       [2., 2., 2.]])

np.ones(3) + np.ones((2,1)) # 1×3 combined with 2×1 gives 2×3

array([[2., 2., 2.],
       [2., 2., 2.]])

np.ones(1) + np.ones((2,3))

array([[2., 2., 2.],
       [2., 2., 2.]])
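The failing case mentioned above is not shown in the examples, so here is a small sketch of it. The flag name `failed` is introduced only for this illustration.

```python
import numpy as np

# Shapes (3,) and (2, 3): the 1-D array is treated as (1, 3) and expanded to (2, 3).
print((np.ones(3) + np.ones((2, 3))).shape)  # (2, 3)

# Shapes (3,) and (2, 2): k = 3, n = 2, and neither is 1, so broadcasting fails.
try:
    np.ones(3) + np.ones((2, 2))
    failed = False
except ValueError as err:
    failed = True
    print(type(err).__name__)  # ValueError

print(failed)  # True
```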

Vector and matrix computations

Vector inner product: dot

a = np.array([1,2,3])
b = np.array([1,3,5])
a.dot(b)

Vector and matrix norms: np.linalg.norm

The most important parameter when computing a norm is ord. Its possible values are:

ord	norm for matrices	norm for vectors
None	Frobenius norm	2-norm
'fro'	Frobenius norm	–
'nuc'	nuclear norm	–
inf	max(sum(abs(x), axis=1))	max(abs(x))
-inf	min(sum(abs(x), axis=1))	min(abs(x))
0	–	sum(x != 0)
1	max(sum(abs(x), axis=0))	as below
-1	min(sum(abs(x), axis=0))	as below
2	2-norm (largest sing. value)	as below
-2	smallest singular value	as below
other	–	sum(abs(x)**ord)**(1./ord)

martix_target =  np.arange(4).reshape(-1,2)
martix_target

array([[0, 1],
       [2, 3]])

np.linalg.norm(martix_target, 'fro')

3.7416573867739413

np.linalg.norm(martix_target, np.inf)

5.0

np.linalg.norm(martix_target, 2)

3.702459173643833

vector_target =  np.arange(4)
vector_target

array([0, 1, 2, 3])

np.linalg.norm(vector_target, np.inf)

3.0

np.linalg.norm(vector_target, 2)

3.7416573867739413

np.linalg.norm(vector_target, 3)

3.3019272488946263
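Some of the table entries can be checked by computing the norms by hand. The following is a sketch under the definitions listed in the table; `vec`, `mat`, and the `manual_*` names are hypothetical.

```python
import numpy as np

vec = np.arange(4)

# Vector p-norm for ord=3: sum(abs(x)**ord)**(1/ord)
manual_3 = (np.abs(vec) ** 3).sum() ** (1 / 3)
print(np.isclose(manual_3, np.linalg.norm(vec, 3)))  # True

mat = np.arange(4).reshape(-1, 2)

# Frobenius norm: square root of the sum of squared entries.
manual_fro = np.sqrt((mat ** 2).sum())
print(np.isclose(manual_fro, np.linalg.norm(mat, 'fro')))  # True

# Infinity norm of a matrix: largest row sum of absolute values.
manual_inf = np.abs(mat).sum(axis=1).max()
print(manual_inf == np.linalg.norm(mat, np.inf))  # True
```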

Matrix multiplication: @

$[A_{m \times p} B_{p \times n}]_{ij}=\sum_{k=1}^{p}A_{ik}B_{kj}$

a = np.arange(4).reshape(-1,2)
a

array([[0, 1],
       [2, 3]])

b = np.arange(-4,0).reshape(-1,2)
b

array([[-4, -3],
       [-2, -1]])

a@b

array([[ -2,  -1],
       [-14,  -9]])
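A single entry of the product can be recomputed from the summation formula as a sanity check. This is a sketch using the same two matrices under the hypothetical names `left` and `right`.

```python
import numpy as np

left = np.arange(4).reshape(-1, 2)       # [[0, 1], [2, 3]]
right = np.arange(-4, 0).reshape(-1, 2)  # [[-4, -3], [-2, -1]]

# Entry (i, j) of the product is the sum over k of left[i, k] * right[k, j].
i, j = 1, 0
entry = sum(left[i, k] * right[k, j] for k in range(left.shape[1]))

print(int(entry))                # -14
print(int((left @ right)[i, j])) # -14
```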

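Exercises

Matrix multiplication with a list comprehension

Following the formula, ordinary matrix multiplication can be written with the triple loop below. Rewrite it as a list comprehension.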

M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = np.empty((M1.shape[0],M2.shape[1]))
for i in range(M1.shape[0]):
    for j in range(M2.shape[1]):
        item = 0
        for k in range(M1.shape[1]):
            item += M1[i][k] * M2[k][j]
        res[i][j] = item

np.allclose(M1@M2, res) # compare within floating-point tolerance; a signed difference would also pass for large negative errors

True

res = [[sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]
np.allclose(M1@M2, res) # compare within floating-point tolerance

True

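Updating a matrix

[Image: the problem statement was not preserved. Judging from the two solutions below, each element is updated as $B_{ij} = A_{ij}\sum_{k} \frac{1}{A_{ik}}$.]

# Method 1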
A = np.arange(1,10).reshape(3,-1)
B = A*[ [sum(1/A[i][k] for k in range(A.shape[1])) ] for i in range(A.shape[0]) ]
B

array([[1.83333333, 3.66666667, 5.5       ],
       [2.46666667, 3.08333333, 3.7       ],
       [2.65277778, 3.03174603, 3.41071429]])

# Method 2
B = A*(1/A).sum(1).reshape(-1,1)
B

array([[1.83333333, 3.66666667, 5.5       ],
       [2.46666667, 3.08333333, 3.7       ],
       [2.65277778, 3.03174603, 3.41071429]])

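Chi-square statistic

[Image: the problem statement was not preserved. Judging from the solution below, $B_{ij} = \frac{\left(\sum_{q} A_{iq}\right)\left(\sum_{p} A_{pj}\right)}{\sum_{p,q} A_{pq}}$ and the statistic is $\chi^2 = \sum_{i,j} \frac{(A_{ij}-B_{ij})^2}{B_{ij}}$.]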

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))
A

array([[15, 10, 13, 13, 17],
       [19, 13, 15, 12, 14],
       [17, 16, 18, 18, 11],
       [16, 17, 17, 18, 11],
       [15, 19, 18, 19, 14],
       [13, 10, 13, 15, 10],
       [12, 13, 18, 11, 13],
       [13, 13, 17, 10, 11]])

B = A.sum(0)*A.sum(1).reshape(-1, 1)/A.sum()
X = ((A-B)*(A-B)/B).sum()
X

11.842696601945802

Improving the performance of a matrix computation

np.random.seed(0)
m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))
def solution(B=B, U=U, Z=Z):
    L_res = []
    for i in range(m):
        for j in range(n):
            norm_value = ((B[i]-U[:,j])**2).sum()
            L_res.append(norm_value*Z[i][j])
    return sum(L_res)

solution(B, U, Z)

Improved approach:

(((B**2).sum(1).reshape(-1,1) + (U**2).sum(0) - 2*B@U)*Z).sum()
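The vectorized expression relies on expanding the squared distance, $\sum_k (B_{ik}-U_{kj})^2 = \sum_k B_{ik}^2 + \sum_k U_{kj}^2 - 2\sum_k B_{ik}U_{kj}$, where the last term is entry $(i, j)$ of B@U. The sketch below checks that both versions agree on smaller, made-up sizes; the names `Bs`, `Us`, `Zs` and the sizes are assumptions for the example, not the original data.

```python
import numpy as np

rng = np.random.RandomState(0)
m_s, n_s, p_s = 6, 5, 4  # small hypothetical sizes for a quick check
Bs = rng.randint(0, 2, (m_s, p_s))
Us = rng.randint(0, 2, (p_s, n_s))
Zs = rng.randint(0, 2, (m_s, n_s))

# Loop version: squared distance between row i of Bs and column j of Us, weighted by Zs.
loop_total = sum(((Bs[i] - Us[:, j]) ** 2).sum() * Zs[i, j]
                 for i in range(m_s) for j in range(n_s))

# Vectorized version: row norms (m, 1) + column norms (n,) - 2 * cross terms (m, n).
sq_dist = (Bs ** 2).sum(1).reshape(-1, 1) + (Us ** 2).sum(0) - 2 * Bs @ Us
vec_total = (sq_dist * Zs).sum()

print(loop_total == vec_total)  # True
```

Because all inputs are integers, the two totals can be compared exactly rather than with a tolerance.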

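Maximum length of consecutive integers

Given a Numpy array of integers, return the maximum length of a subarray of consecutive increasing integers. For example, for the input [1,2,5,6,7], [5,6,7] is the longest such subarray, so the output is 3; for the input [3,2,1,2,3,4,6], [1,2,3,4] is the longest such subarray, so the output is 4. Make full use of Numpy's built-in functions. (Hint: consider the nonzero and diff functions.)

# Method 1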
def max_len(s):
    maxlen = 0
    l = 0
    for i in np.diff(s):
        if i == 1:
            l += 1
            if l > maxlen:
                maxlen = l
        else:
            l = 0
    return maxlen+1

# Method 2
f = lambda x:np.diff(np.nonzero(np.r_[1,np.diff(x)!=1,1])).max()

s = np.array([1,2,3,4,5,3,5,6,7,8,9,10,11])
max_len(s)

f(s)
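The one-line second method is dense, so it may help to unpack it step by step on the first example from the problem statement. This is a sketch; the intermediate names `breaks`, `edges`, and `run_lengths` are introduced here for illustration.

```python
import numpy as np

x = np.array([1, 2, 5, 6, 7])

# Mark positions where a new run starts (the step from the previous element is not 1),
# with a sentinel 1 at both ends so the first and last runs are delimited.
breaks = np.r_[1, np.diff(x) != 1, 1]
print(breaks.tolist())  # [1, 0, 1, 0, 0, 1]

# Indices of the run boundaries.
edges = np.nonzero(breaks)[0]
print(edges.tolist())  # [0, 2, 5]

# Gaps between consecutive boundaries are the run lengths.
run_lengths = np.diff(edges)
print(run_lengths.tolist())    # [2, 3]
print(int(run_lengths.max()))  # 3
```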
