Chapter 1: Preliminaries

怡颜悦色

Category: Pandas. Tags: machine learning

Article link: https://blog.csdn.net/qq_42967904/article/details/111242025

I. Python Basics

1. List comprehensions and conditional assignment

Building a list of numbers with a function and a loop:

L = []
def my_func(x):
    return 2*x
for i in range(5):
    L.append(my_func(i))
L

[0, 2, 4, 6, 8]

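A list comprehension simplifies this to the form [* for i in *]. The first * is the mapping expression, which takes the i that follows as its input; the second * is the iterable being traversed.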

[my_func(i) for i in range(5)]

[0, 2, 4, 6, 8]

List comprehensions also support nested loops:

[i+j for i in range(3) for j in range(2)]

[0, 1, 1, 2, 2, 3]
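
To make the loop order explicit, here is a small sketch (the names `nested` and `comp` are illustrative and not part of the original notes). It shows that the comprehension above behaves like two nested for loops, with the first for acting as the outer loop:

```python
# Explicit nested loops: the first `for` in the comprehension is the outer loop.
nested = []
for i in range(3):
    for j in range(2):
        nested.append(i + j)

comp = [i + j for i in range(3) for j in range(2)]
print(nested == comp)  # True
```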

Conditional assignment is syntactic sugar of the form value = A if condition else B. If condition holds, value is A; otherwise value is B.

value = 'dawang' if 19991126>20020501 else 'yiyi'
value

'yiyi'

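Combining a list comprehension with a condition to deal with the elements greater than 6. method_1 uses a conditional expression to replace them with 0, while method_2 uses a trailing if to drop them: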

L = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print("method_1:", [i if i <= 6 else 0 for i in L])
print("method_2:", [j for j in L if j <= 6])

method_1: [1, 2, 3, 4, 5, 6, 0, 0, 0, 0]
method_2: [1, 2, 3, 4, 5, 6]

2. Anonymous functions and map

An anonymous function (lambda) can take any number of arguments, but its body must be a single expression.

my_func = lambda x, y: x+y+2
my_func(6, 6)

14

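Anonymous functions are typically used where the function does not need to be called repeatedly. In the list comprehension below, for example, the name of the function is irrelevant; only the mapping matters: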

[(lambda x: 2*x)(i) for i in range(5)]

[0, 2, 4, 6, 8]

Python's map function expresses the same mapping as the list comprehension above. It returns a map object, which has to be converted with list:

list(map(lambda x: 2*x, range(5)))

[0, 2, 4, 6, 8]

For a function of several arguments, pass one additional iterable per argument:

list(map(lambda x, y: x+'_'+y, ['dawang', 'yiyi'], ['21', '19']))

['dawang_21', 'yiyi_19']
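
Two properties of map are easy to overlook: it stops at the shortest iterable, and the map object can only be consumed once. A minimal sketch, assuming a hypothetical third name `'extra'` that has no matching age:

```python
names = ['dawang', 'yiyi', 'extra']   # 'extra' is a made-up entry with no matching age
ages = ['21', '19']

pairs = map(lambda x, y: x + '_' + y, names, ages)
first = list(pairs)    # consumes the iterator; stops after the shorter input ends
second = list(pairs)   # the map object is already exhausted

print(first)   # ['dawang_21', 'yiyi_19']
print(second)  # []
```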

3. zip objects and enumerate

zip packs several iterables into a single iterable of tuples. It returns a zip object, which has to be converted with tuple or list to see the result:

L1, L2, L3 = list('abc'), list('def'), list('hij')
print('list:',list(zip(L1, L2, L3)))
print('tuple:',tuple(zip(L1, L2, L3)))

list: [('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j')]
tuple: (('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j'))

Packing with zip:

zipped = list(zip(L1, L2, L3))
zipped

[('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j')]

Combining the * operator with zip unpacks the result and recovers the original sequences (as tuples rather than lists):

list(zip(*zipped))

[('a', 'b', 'c'), ('d', 'e', 'f'), ('h', 'i', 'j')]
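
If actual lists are needed after unpacking, one option is to map list over the unpacked tuples. This is only a sketch; the names `R1`, `R2`, `R3` are illustrative:

```python
L1, L2, L3 = list('abc'), list('def'), list('hij')
zipped = list(zip(L1, L2, L3))

# zip(*zipped) yields tuples; map(list, ...) converts each one back to a list.
R1, R2, R3 = map(list, zip(*zipped))
print(R1 == L1 and R2 == L2 and R3 == L3)  # True
```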

zip is most often used to loop over several iterables in parallel:

for i, j, k in zip(L1, L2, L3):
     print(i, j, k)

a d h
b e i
c f j

zip is also convenient for building a dictionary that maps one list onto another:

L = ['dawang', 'yiyi']
l = ['21', '19']
dict(zip(L, l))

{'dawang': '21', 'yiyi': '19'}

enumerate is a special kind of packing that attaches an index to each element during iteration:

for index, value in enumerate(L):
     print(index, value)

0 dawang
1 yiyi

The same index binding can be achieved with zip:

for index, value in zip(range(len(L)), L):
     print(index, value)

0 dawang
1 yiyi

II. NumPy Basics

1. Constructing arrays

【a】Evenly spaced sequences: np.linspace, np.arange

import numpy as np
print(np.arange(1,8,2)) # start, stop (exclusive), step
print(np.linspace(1,8,7)) # start, stop (inclusive), number of samples

[1 3 5 7]
[1.         2.16666667 3.33333333 4.5        5.66666667 6.83333333
 8.        ]
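
As a small illustration of how linspace chooses its spacing, the sketch below uses the `retstep` and `endpoint` keyword arguments of np.linspace (the variable names are illustrative):

```python
import numpy as np

# retstep=True also returns the spacing between consecutive samples.
samples, step = np.linspace(1, 8, 7, retstep=True)
print(step)  # (8 - 1) / (7 - 1)

# endpoint=False leaves out the stop value, similar to arange.
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)
```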

【b】Special matrices: zeros, eye, full

np.zeros((2,3)) # the tuple gives the size of each dimension

array([[0., 0., 0.],
       [0., 0., 0.]])

np.eye(3) # 3×3 identity matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

np.eye(3, k=-1) # ones shifted one position below the main diagonal

array([[0., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

np.full((2,3), 6) # the tuple gives the shape; 6 is the fill value

array([[6, 6, 6],
       [6, 6, 6]])

np.full((3,3), [1,2,3]) # the tuple gives the shape; the list supplies the fill value for each column

array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

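【c】Random arrays: np.random.rand, np.random.randn, np.random.randint, np.random.choice

# Random seed, used to make the random output reproducible
np.random.seed(20201215)

1. np.random.rand: generates an array of random numbers uniformly distributed on [0, 1).

np.random.rand(3) # three random numbers from the uniform distribution on [0, 1)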

array([0.41027199, 0.51899996, 0.80972058])

np.random.rand(3, 3) # the size of each dimension is passed as a separate argument

array([[0.90696063, 0.46431113, 0.33601533],
       [0.92317337, 0.51085762, 0.42522104],
       [0.43964531, 0.31018317, 0.92337506]])

Sampling from the uniform distribution on the interval from a to b:

a, b = 6, 66
(b - a) * np.random.rand(3) + a

array([44.07294368, 37.46458496, 23.49200936])

2. np.random.randn: generates samples from the standard normal distribution N(0,1).

np.random.randn(10)

array([ 1.43014915,  0.85515114, -0.58301996, -0.34585252, -1.82385768,
        0.08660496, -0.07069498, -2.04867273, -1.06028752, -0.06134195])

np.random.randn(2, 2)

array([[-0.44623967,  0.65615742],
       [-0.54517497, -0.11563773]])

Sampling from a univariate normal distribution with variance $\sigma^2$ and mean $\mu$:

sigma, mu = 2.5, 3
mu + np.random.randn(3) * sigma

array([5.82286413, 1.72830122, 5.93496221])

3. np.random.randint: generates random integers given the minimum value, the maximum value (exclusive) and the shape.

low, high, size = 5, 15, (2,2)
np.random.randint(low, high, size)

array([[ 7,  7],
       [12,  6]])

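4. np.random.choice: draws samples from a given list, with the given probabilities and with or without replacement. Sampling is uniform when no probabilities are given.

data = [3,4,5,6]
print('without replacement:',np.random.choice(data, 4, replace=False, p=[0.1, 0.7, 0.1 ,0.1]))  # at most len(data) items can be drawn
print('with replacement:',np.random.choice(data, 8, replace=True, p=[0.1, 0.7, 0.1 ,0.1]))  # any number of items can be drawn

without replacement: [4 3 6 5]
with replacement: [4 3 3 3 6 4 4 4]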

np.random.choice(data, (3,3))

array([[6, 6, 4],
       [3, 4, 4],
       [6, 6, 5]])
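
The examples above use the legacy np.random functions. NumPy also provides the Generator interface through np.random.default_rng. The sketch below repeats the four kinds of sampling with it; the seed and variable names are arbitrary choices, and the numbers it produces will not match the legacy outputs shown above because the two interfaces use different random streams.

```python
import numpy as np

rng = np.random.default_rng(20201215)    # seeded Generator; the seed value is arbitrary

a, b = 6, 66
sigma, mu = 2.5, 3
data = [3, 4, 5, 6]

uniform = rng.uniform(a, b, 3)                    # uniform on [a, b)
normal = rng.normal(mu, sigma, 3)                 # mean mu, standard deviation sigma
integers = rng.integers(5, 15, size=(2, 2))       # integers in [5, 15)
picked = rng.choice(data, 4, replace=False, p=[0.1, 0.7, 0.1, 0.1])

print(uniform, normal, integers, picked, sep='\n')
```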

2. Reshaping and combining arrays

【a】Transpose: T

np.ones((2,3)).T

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

【b】Combining: r_, c_

np.r_[np.ones((2,3)),np.ones((2,3))]  # `r_` stacks vertically (top to bottom)

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

np.c_[np.zeros((2,3)),np.zeros((2,3))]  # `c_` stacks horizontally (left to right)

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

np.c_[np.array([0,0]),np.zeros((2,3))]  # a 1-D array is treated as a column vector here, so it can only be joined to a 2-D array left to right with `c_`

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])
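
For readers more familiar with the function-style API, the same results can be obtained with np.vstack, np.hstack and np.column_stack. The following sketch (variable names are illustrative) checks the equivalence for the three examples above:

```python
import numpy as np

ones, zeros = np.ones((2, 3)), np.zeros((2, 3))

same_rows = np.array_equal(np.r_[ones, ones], np.vstack([ones, ones]))
same_cols = np.array_equal(np.c_[zeros, zeros], np.hstack([zeros, zeros]))
# column_stack treats the 1-D array as a column, matching the c_ example.
same_mixed = np.array_equal(np.c_[np.array([0, 0]), zeros],
                            np.column_stack([np.array([0, 0]), zeros]))

print(same_rows, same_cols, same_mixed)  # True True True
```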

【c】Changing dimensions: reshape

target = np.arange(12).reshape(3,4)
target

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

target.reshape((4,3), order='C') # read row by row and fill row by row; this is the default

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

target.reshape((4,3), order='F') # read column by column and fill column by column

array([[ 0,  5, 10],
       [ 4,  9,  3],
       [ 8,  2,  7],
       [ 1,  6, 11]])

# Convert an n×1 array to a 1-D array
target = np.ones((3,1))
target = target.reshape(-1)
target

array([1., 1., 1.])

3. Slicing and indexing arrays

target = np.arange(16).reshape((4,4))
target

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Arrays are sliced with start:end:step slices, or by passing a list of indices for a given dimension.

# Every second row (step 2); columns 0 and 2
target[::2, [0,2]]

array([[ 0,  2],
       [ 8, 10]])
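
One practical difference between the two indexing styles is worth a short sketch: a start:end:step slice returns a view that shares memory with the original array, while indexing with a list returns a copy. The names `view` and `copy` are illustrative:

```python
import numpy as np

target = np.arange(16).reshape((4, 4))

view = target[::2, :]        # basic slicing: a view on the same data
copy = target[:, [0, 2]]     # list indexing: an independent copy

print(np.shares_memory(target, view))  # True
print(np.shares_memory(target, copy))  # False

view[0, 0] = -1
print(target[0, 0])  # -1, the change is visible through the original array
```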

# For a 1-D array, a boolean condition can be used directly as the index
new = target.reshape(-1)
new[new%2==0]

array([ 0,  2,  4,  6,  8, 10, 12, 14])

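4. Common functions

【a】where

where is a conditional function: it specifies the fill values for the positions where the condition holds and for those where it does not:

a = np.array([3,-1,1,-1,4,0])
np.where(a>0, a, 6) # take the element of a where the condition is True, otherwise 6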

array([3, 6, 1, 6, 4, 6])

【b】nonzero, argmax, argmin

nonzero returns the indices of the nonzero elements; argmax and argmin return the index of the largest and the smallest element respectively:

a = np.array([3,-1,1,-1,4,0])
np.nonzero(a)

(array([0, 1, 2, 3, 4], dtype=int64),)

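a.argmax()  # if the maximum occurs more than once, the index of its first occurrence is returned

4

a.argmin()  # if the minimum occurs more than once, the index of its first occurrence is returned

1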

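【c】any, all

any returns True if at least one element of the sequence is True or nonzero, and False otherwise.

all returns True if every element of the sequence is True or nonzero, and False otherwise.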

a = np.array([0,1])
a.any()

True

a.all()

False

【d】cumprod, cumsum, diff

cumprod and cumsum compute the cumulative product and the cumulative sum, and return an array of the same length:

a = np.arange(1,6)
a.cumprod()

array([  1,   2,   6,  24, 120], dtype=int32)

a.cumsum()

array([ 1,  3,  6, 10, 15], dtype=int32)

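diff subtracts the previous element from each element. The first element has no predecessor, so with default arguments the result is one element shorter than the input: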

a = np.arange(1,6)
np.diff(a)

array([1, 1, 1, 1])
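
cumsum and diff are close to being inverses of each other. The sketch below illustrates this, using the `prepend` keyword of np.diff to keep the original length:

```python
import numpy as np

a = np.arange(1, 6)

# Differencing a cumulative sum gives back everything except the first element.
without_first = np.diff(a.cumsum())
print(without_first)  # [2 3 4 5]

# prepend=0 supplies a value before the first element, so the length is preserved.
recovered = np.diff(a.cumsum(), prepend=0)
print(recovered)  # [1 2 3 4 5]
```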

【e】Statistical functions: max, min, mean, median, std, var, sum, quantile

target = np.arange(5)
target

array([0, 1, 2, 3, 4])

target.sum()  # max, min, mean, std and var are called the same way; median is not an array method and is called as np.median(target)

10

np.quantile(target, 0.75) # 0.75 quantile

3.0

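Note: for an array that contains missing values, the ordinary functions return nan; use the nan* versions to ignore the missing values: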

target = np.array([1, 2, np.nan])
target

array([ 1.,  2., nan])

target.max()

nan

np.nanmax(target)  # np.nanmin, np.nanmean, np.nanmedian, np.nanstd, np.nanvar and np.nansum work the same way

2.0

np.nanquantile(target, 0.5)

1.5

Note: for a 2-D array, axis=0 gives the statistic of each column and axis=1 gives the statistic of each row. Taking max as an example:

target = np.arange(1,10).reshape(3,-1)
target

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

target.max(0)  # maximum of each column

array([7, 8, 9])

target.max(1)  # maximum of each row

array([3, 6, 9])
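
When a per-row statistic has to be combined with the original array again, the `keepdims=True` argument is handy because it keeps the reduced axis with size 1. A minimal sketch that centres each row on its mean (variable names are illustrative):

```python
import numpy as np

target = np.arange(1, 10).reshape(3, -1)

row_mean = target.mean(axis=1, keepdims=True)   # shape (3, 1) instead of (3,)
centered = target - row_mean                    # every row now has mean 0
print(row_mean.shape)  # (3, 1)
print(centered)
```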

【f】Covariance and correlation coefficient: cov, corrcoef

target1 = np.array([1,3,5,7])
target2 = np.array([1,4,7,11])
np.cov(target1, target2)

array([[ 6.66666667, 11.        ],
       [11.        , 18.25      ]])

np.corrcoef(target1, target2)

array([[1.        , 0.99725651],
       [0.99725651, 1.        ]])
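
The two matrices are related: each correlation coefficient is the covariance divided by the product of the two standard deviations. The sketch below rebuilds the correlation matrix from the covariance matrix (the names `cov`, `std` and `corr` are illustrative):

```python
import numpy as np

target1 = np.array([1, 3, 5, 7])
target2 = np.array([1, 4, 7, 11])

cov = np.cov(target1, target2)          # sample covariance (divides by n - 1)
std = np.sqrt(np.diag(cov))             # standard deviations from the diagonal
corr = cov / np.outer(std, std)         # divide each entry by std_i * std_j

print(np.allclose(corr, np.corrcoef(target1, target2)))  # True
```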

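5. Broadcasting

Broadcasting handles operations between two arrays of different shapes.

【a】Operations between a scalar and an array

When a scalar is combined with an array, the scalar is automatically expanded to the shape of the array: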

res = 5 * np.ones((2,2)) + 1
res

array([[6., 6.],
       [6., 6.]])

res = 1 / res
res

array([[0.16666667, 0.16666667],
       [0.16666667, 0.16666667]])

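【b】Operations between 2-D arrays

When two arrays have exactly the same shape, the operation is applied element by element.
When the shapes differ, an error is raised unless one of the arrays has shape $m \times 1$ or $1 \times n$, in which case its dimension of size 1 is expanded to the corresponding size of the other array.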

res = np.ones((3,2))
res

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

Two arrays with exactly the same shape:

asd = 6*np.ones((3,2))
asd * res

array([[6., 6.],
       [6., 6.],
       [6., 6.]])

Two arrays whose shapes differ:

res * np.array([[2,3]]) # the first dimension is expanded to 3

array([[2., 3.],
       [2., 3.],
       [2., 3.]])

res * np.array([[2],[3],[4]]) # the second dimension is expanded to 2

array([[2., 2.],
       [3., 3.],
       [4., 4.]])

res * np.array([[2]]) # equivalent to expanding twice

array([[2., 2.],
       [2., 2.],
       [2., 2.]])
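
To round off the examples, here is a sketch of the failing case: when two sizes differ and neither is 1, NumPy raises a ValueError. The shape (2, 3) is just an illustrative incompatible choice:

```python
import numpy as np

res = np.ones((3, 2))

# Sizes are compared dimension by dimension: they must be equal, or one must be 1.
ok_row = (res * np.ones((1, 2))).shape
ok_col = (res * np.ones((3, 1))).shape
print(ok_row, ok_col)  # (3, 2) (3, 2)

try:
    res * np.ones((2, 3))   # 3 vs 2 and 2 vs 3: incompatible in both dimensions
    failed = False
except ValueError as err:
    failed = True
    print('broadcast error:', err)
```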

6. Vector and matrix computations

【a】Inner product of vectors: dot

$\rm \mathbf{a}\cdot\mathbf{b} = \sum_ia_ib_i$

a = np.array([1,2,3])
b = np.array([1,3,5])
a.dot(b)

22

【b】Vector and matrix norms: np.linalg.norm

Matrix norms:

martix_target =  np.arange(9).reshape(3,3)
martix_target

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

np.linalg.norm(martix_target, np.inf)  # matrix infinity norm: the maximum absolute row sum

21.0

np.linalg.norm(martix_target, 1)  # matrix 1-norm: the maximum absolute column sum

15.0

Vector norms:

vector_target =  np.arange(5)
vector_target

array([0, 1, 2, 3, 4])

np.linalg.norm(vector_target, 1)  # vector 1-norm: the sum of the absolute values of the components

10.0

np.linalg.norm(vector_target, 2)  # vector 2-norm: the square root of the sum of the squared components

5.477225575051661

np.linalg.norm(vector_target, np.inf)  # vector infinity norm: the limit of the p-norm as p goes to infinity, which equals max(abs(vector))

4.0
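
The definitions in the comments above can be checked directly. The following sketch recomputes each norm by hand and compares it with np.linalg.norm (the names `v`, `M` and `checks` are illustrative):

```python
import numpy as np

v = np.arange(5)
M = np.arange(9).reshape(3, 3)

checks = [
    np.isclose(np.linalg.norm(v, 1), np.abs(v).sum()),                   # vector 1-norm
    np.isclose(np.linalg.norm(v, 2), np.sqrt((v ** 2).sum())),           # vector 2-norm
    np.isclose(np.linalg.norm(v, np.inf), np.abs(v).max()),              # vector infinity norm
    np.isclose(np.linalg.norm(M, np.inf), np.abs(M).sum(axis=1).max()),  # max absolute row sum
    np.isclose(np.linalg.norm(M, 1), np.abs(M).sum(axis=0).max()),       # max absolute column sum
]
print(all(checks))  # True
```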

【c】Matrix multiplication: the operator is @. The number of columns of the left matrix must equal the number of rows of the right matrix.

$\rm [\mathbf{A}_{m\times p}\mathbf{B}_{p\times n}]_{ij} = \sum_{k=1}^p\mathbf{A}_{ik}\mathbf{B}_{kj}$

a = np.arange(6).reshape(-1,3)
a

array([[0, 1, 2],
       [3, 4, 5]])

b = np.arange(-6,0).reshape(-1,2)
b

array([[-6, -5],
       [-4, -3],
       [-2, -1]])

a@b

array([[ -8,  -5],
       [-44, -32]])

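III. Exercises

Ex1: Matrix multiplication with a list comprehension

Following its formula, ordinary matrix multiplication can be written as a triple loop. Rewrite it as a list comprehension.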

import numpy as np
np.random.seed(20201215)
M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = np.empty((M1.shape[0],M2.shape[1]))

# Method 1: triple loop
for i in range(M1.shape[0]):
    for j in range(M2.shape[1]):
        item = 0
        for k in range(M1.shape[1]):
            item += M1[i][k] * M2[k][j]
        res[i][j] = item

# Method 2: list comprehension
res = np.array([sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for i in range(M1.shape[0]) for j in range(M2.shape[1])]).reshape((M1.shape[0],M2.shape[1]))
res

# (np.abs(M1@M2 - res) < 1e-15).all() # check agreement up to floating-point error

array([[0.75884205, 0.77406164, 0.40174178, 1.1312682 ],
       [1.18893908, 1.25005346, 0.61039081, 1.83094764]])

Ex2: Updating a matrix

Let $A_{m\times n}$ be a matrix. Update every element of $A$ to produce a matrix $B$ according to $B_{ij}=A_{ij}\sum_{k=1}^n\frac{1}{A_{ik}}$. For example, if $A$ is the matrix below, then $B_{2,2}=5\times(\frac{1}{4}+\frac{1}{5}+\frac{1}{6})=\frac{37}{12}$. Implement this efficiently with NumPy.

import numpy as np
A = np.arange(1, 10).reshape((3, 3)) # build the matrix A
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
B = np.array([A[i][j]*sum([1/A[i][k] for k in range(A.shape[1])]) for i in range(A.shape[0]) for j in range(A.shape[1])]).reshape((A.shape[0], A.shape[1]))
B

array([[1.83333333, 3.66666667, 5.5       ],
       [2.46666667, 3.08333333, 3.7       ],
       [2.65277778, 3.03174603, 3.41071429]])
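
The list comprehension above gives the right answer but still loops in Python. Since the exercise asks for an efficient NumPy implementation, one possible vectorised sketch is shown below; it relies on broadcasting a column of row-wise reciprocal sums, and the name `row_recip_sum` is illustrative:

```python
import numpy as np

A = np.arange(1, 10).reshape((3, 3))

# Sum of reciprocals along each row, kept as a column so it broadcasts across columns.
row_recip_sum = (1 / A).sum(axis=1, keepdims=True)   # shape (3, 1)
B = A * row_recip_sum
print(B)
```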

Ex3: Chi-square statistic

Let $A_{m\times n}$ be a matrix and define $B_{ij} = \frac{(\sum_{i=1}^mA_{ij})\times (\sum_{j=1}^nA_{ij})}{\sum_{i=1}^m\sum_{j=1}^nA_{ij}}$. The chi-square value is defined as
$\chi^2 = \sum_{i=1}^m\sum_{j=1}^n\frac{(A_{ij}-B_{ij})^2}{B_{ij}}$
Use NumPy to compute $\chi^2$ for the given matrix $A$.

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))

# Denominator of B: the sum of all entries of A
B_fenmu = sum([sum([A[i][j] for j in range(A.shape[1])]) for i in range(A.shape[0])])
# Numerator of B: (sum of column j) * (sum of row i).
# i must be the outer loop so that the flat list is in row-major order, which is what reshape expects.
B_fenzi = [sum([A[k][j] for k in range(A.shape[0])]) * sum([A[i][k] for k in range(A.shape[1])]) for i in range(A.shape[0]) for j in range(A.shape[1])]
# Assemble the matrix B
B = (np.array(B_fenzi) / B_fenmu).reshape((A.shape[0], A.shape[1]))

# Compute the chi-square value
chi_square = sum([(A[i][j]-B[i][j])**2 / B[i][j] for j in range(A.shape[1]) for i in range(A.shape[0])])
chi_square

11.8427 (rounded)
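
The same statistic can be written without explicit loops. In this sketch the row sums are kept as a column and the column sums as a row, so that their product broadcasts to the full $m \times n$ matrix $B$; the names `row_sums` and `col_sums` are illustrative:

```python
import numpy as np

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))

col_sums = A.sum(axis=0)                  # shape (5,): one sum per column
row_sums = A.sum(axis=1, keepdims=True)   # shape (8, 1): one sum per row
B = row_sums * col_sums / A.sum()         # B[i, j] = row_sum_i * col_sum_j / total
chi_square = ((A - B) ** 2 / B).sum()
print(chi_square)
```

A quick sanity check on any implementation is that the rows of B must sum to the row sums of A, because the column sums add up to the total.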

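Ex4: Improving the performance of a matrix computation

Let $Z$ be an $m \times n$ matrix, and let $B$ and $U$ be $m \times p$ and $p \times n$ matrices respectively. Let $B_i$ be the $i$-th row of $B$ and $U_j$ the $j$-th column of $U$, and define $\displaystyle R=\sum_{i=1}^m\sum_{j=1}^n\|B_i-U_j\|_2^2Z_{ij}$, where $\|\mathbf{a}\|_2^2$ denotes the sum of the squared components $\sum_i a_i^2$ of the vector $a$.

Someone computes $R$ for the sample data below. Make full use of NumPy functions to improve the performance of this code.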

%%timeit
import numpy as np
np.random.seed(0)

m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))

def solution(B=B, U=U, Z=Z):
    L_res = []
    for i in range(m):
        for j in range(n):
            norm_value = ((B[i]-U[:,j])**2).sum()
            L_res.append(norm_value*Z[i][j])
    return np.array(L_res).sum()
solution(B, U, Z)

114 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
import numpy as np
np.random.seed(0)

m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))

R = np.array([((B[i]-U[:,j])**2).sum() * Z[i][j] for i in range(m) for j in range(n)]).sum()
R

103 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
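
The list comprehension still performs m×n Python-level iterations, which explains why the timing barely changes. A fully vectorised alternative, sketched here as one possible approach, expands the squared distance as $\|B_i-U_j\|_2^2=\|B_i\|_2^2+\|U_j\|_2^2-2B_i\cdot U_j$, so all pairwise distances come from one matrix product. The function name `solution_vectorized` and the intermediate `Y` are illustrative:

```python
import numpy as np

np.random.seed(0)
m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))

def solution_vectorized(B, U, Z):
    # Y[i, j] = ||B_i||^2 + ||U_j||^2 - 2 * (B_i . U_j), built by broadcasting
    Y = (B ** 2).sum(axis=1, keepdims=True) + (U ** 2).sum(axis=0) - 2 * (B @ U)
    return (Y * Z).sum()

R = solution_vectorized(B, U, Z)
print(R)
```

No timing is quoted here, since it depends on the machine; running both versions under %timeit on the same data is the way to compare them.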

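Ex5: Maximum length of consecutive integers

Given a NumPy array of integers, return the maximum length of a subarray of consecutive increasing integers, that is, a run in which each element is exactly 1 greater than the one before it. For example, for the input [1,2,5,6,7], the longest such subarray is [5,6,7], so the output is 3; for the input [3,2,1,2,3,4,6], the longest is [1,2,3,4], so the output is 4. Make full use of NumPy's built-in functions. (Hint: consider the nonzero and diff functions.)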

test = np.array([1,2,3,4,5,6,3,1,2,3,4,5,2,1,2,3,4])

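def longest_integer(s):

    print('input array s:', s, ' length:', len(s))

    # Step 1: subtract the previous element from each element, so consecutive integers become 1.
    # The first element always starts a run, so a 1 is put in front.
    a = np.append(np.array([1]), np.diff(s))
    print('consecutive integers mapped to 1:', a, ' length:', len(a))

    # Shift so that consecutive integers become 0
    b = a - 1
    print('consecutive integers mapped to 0:', b, ' length:', len(b))

    # Append a nonzero value, in case the longest run is at the end
    c = np.append(b, np.array([1]))
    print('nonzero value appended:', c, ' length:', len(c))

    # Indices of the nonzero entries, i.e. the positions where a run is broken
    d = np.nonzero(c)

    # Put a zero in front of the index array, in case the longest run is at the start
    e = np.append(np.array([0]), d)
    print('nonzero indices with leading zero:', e)

    # Differences between neighbouring break positions are the run lengths
    f = np.diff(e)
    print('run lengths:', f)

    print('maximum run length:', f.max())

longest_integer(test)

input array s: [1 2 3 4 5 6 3 1 2 3 4 5 2 1 2 3 4]  length: 17
consecutive integers mapped to 1: [ 1  1  1  1  1  1 -3 -2  1  1  1  1 -3 -1  1  1  1]  length: 17
consecutive integers mapped to 0: [ 0  0  0  0  0  0 -4 -3  0  0  0  0 -4 -2  0  0  0]  length: 17
nonzero value appended: [ 0  0  0  0  0  0 -4 -3  0  0  0  0 -4 -2  0  0  0  1]  length: 18
nonzero indices with leading zero: [ 0  6  7 12 13 17]
run lengths: [6 1 5 1 4]
maximum run length: 6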
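
The step-by-step version prints its intermediate arrays but does not return anything. A more compact sketch of the same idea, which returns the result, is given below. The function name `longest_run` and the helper names `breaks` and `edges` are illustrative, not part of the original solution:

```python
import numpy as np

def longest_run(s):
    # Positions where the next element is not exactly 1 greater start a new run.
    breaks = np.nonzero(np.diff(s) != 1)[0] + 1
    # Add the start and the end of the array as run boundaries.
    edges = np.r_[0, breaks, len(s)]
    # The gaps between neighbouring boundaries are the run lengths.
    return int(np.diff(edges).max())

print(longest_run(np.array([1, 2, 5, 6, 7])))        # 3
print(longest_run(np.array([3, 2, 1, 2, 3, 4, 6])))  # 4
```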