Programming Practice (Pandas) Task01

Daisy Lee

Category: datawhale. Tag: python

Article link: https://blog.csdn.net/weixin_42871941/article/details/111189845

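1. Installing third-party libraries with pip

pip install <package>  # install a package
pip uninstall <package>  # uninstall a package
pip freeze  # list installed packages and their versions
pip list --outdated  # list packages that can be upgraded
pip install -U <package>  # upgrade a given package (pip itself included)
python -m pip install --upgrade pip  # upgrade pip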

2. Python basics

2.1 List comprehensions

The initial version given in the example:

L = []
def my_func(x):
    return 2 * x
for i in range(5):
    L.append(my_func(i))

[Thought] A loop with a condition works too:

L1 = []
for i in range(10):
    if i % 2 == 0:
        L1.append(i)

To simplify, the same thing can be written as a list comprehension:

out_list = [out_expression for item in input_list if condition]

The if clause is optional.

[my_func(i) for i in range(5)]  # form 1
[i * 2 for i in range(5)]  # form 2
[i for i in range(10) if i % 2 == 0]  # form 3

List comprehensions also support multiple levels of nesting:

# initial version: nested for loops
L2 = []
for m in ['a', 'b']:
    for n in ['c', 'd']:
        L2.append(m + '_' + n)
print(L2)

# list-comprehension version
[m+'_'+n for m in ['a','b'] for n in ['c','d']]  # the first for is the outer loop, the second is the inner loop
>>> ['a_c', 'a_d', 'b_c', 'b_d']

[m+'_'+n for n in ['c','d'] for m in ['a','b']]
>>> ['a_c', 'b_c', 'a_d', 'b_d']
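The loop order can be checked directly. The sketch below (variable names are illustrative, not from the original notes) builds the same list with explicit loops and with a comprehension, then swaps the two for clauses.

```python
# Nested loops written out explicitly
loop_result = []
for m in ['a', 'b']:          # outer loop
    for n in ['c', 'd']:      # inner loop
        loop_result.append(m + '_' + n)

# Same for clauses, read left to right: first is outer, second is inner
comp_result = [m + '_' + n for m in ['a', 'b'] for n in ['c', 'd']]

# Swapping the for clauses swaps which loop is the outer one
swapped = [m + '_' + n for n in ['c', 'd'] for m in ['a', 'b']]

print(loop_result == comp_result)  # True
print(swapped)                     # ['a_c', 'b_c', 'a_d', 'b_d']
```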

2.2 Conditional assignment with if

The form is value = a if condition else b

# conventional version
a = 'cat'
b = 'dog'
if 2 > 1:
    value = a
else:
    value = b
print(value)

# conditional-assignment version
value = 'cat' if 2 > 1 else 'dog'
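A conditional expression can also appear as the output expression of a list comprehension. As a small sketch (the list and the cap of 5 are made-up values for illustration), the following truncates every element above 5 down to 5:

```python
L = [1, 2, 3, 4, 5, 6, 7]

# "i if i <= 5 else 5" is evaluated once per element
capped = [i if i <= 5 else 5 for i in L]

print(capped)  # [1, 2, 3, 4, 5, 5, 5]
```

Note the difference from a trailing if clause: the trailing form filters elements out, whereas the conditional expression keeps every element and only changes its value.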

2.3 Anonymous functions and map

The form is name = lambda parameters: return_value

func1 = lambda x: x ** 2
func1(3)

func2 = lambda x, y: x + y
func2(2, 3)

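The point of an anonymous function is that it has no name, so binding a lambda to a name (as above) only serves as a demonstration; use def when a name is needed.
Parameter rules and scoping for an anonymous function are the same as for a named function.
The body of an anonymous function must be a single expression, and the value of that expression is the return value.

# rewrite the code from section 2.1 with an anonymous function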
[(lambda x: x * 2)(i) for i in range(5)]

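map() applies the given function to every element of the iterable(s) passed to it. In Python 3 it returns a lazy map iterator rather than a new list, so wrap it in list() to see the values.
Syntax: map(function, iterable, ...)

# the example above, written with map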
list(map(lambda x: x * 2, range(5)))

# Question: what is returned without list()?
a = map(lambda x: x * 2, range(5))
a
>>> <map at 0x4e7ec70>
type(a)
>>> map
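One practical consequence of map returning an iterator is that it can be consumed only once. A minimal sketch, reusing the lambda from above (the names first and second are illustrative):

```python
a = map(lambda x: x * 2, range(5))

first = list(a)    # consumes the iterator
second = list(a)   # the iterator is already exhausted

print(first)   # [0, 2, 4, 6, 8]
print(second)  # []
```

If the values are needed more than once, it is usually simplest to store the list rather than the map object.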

For a function of several inputs, pass additional iterables:

list(map(lambda x, y: str(x)+'_'+y, range(5), list('abcde')))

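2.4 zip objects and enumerate

zip() takes iterables as arguments, packs their corresponding elements into tuples, and returns a lazy zip object that yields those tuples. Because the tuples are produced on demand, this saves memory. Wrap it in list() to get a list.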

L1, L2, L3 = list('abc'), list('def'), list('hij')
list(zip(L1, L2, L3))

# Question: what is returned without list()?
b = zip(L1, L2, L3)
b
>>> <zip at 0x4ec9940>
type(b)
>>> zip

Building a dictionary mapping from two lists with zip:

dict(zip(L1, L2))
>>> {'a': 'd', 'b': 'e', 'c': 'f'}
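One detail worth keeping in mind when building mappings this way: zip stops at the shortest input. A small sketch with made-up lists of unequal length:

```python
keys = list('abc')
values = [1, 2]  # one element short on purpose

pairs = list(zip(keys, values))

print(pairs)        # [('a', 1), ('b', 2)]
print(dict(pairs))  # {'a': 1, 'b': 2}
```

The key 'c' is silently dropped, so it can be worth checking that the lengths match before zipping.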

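enumerate() pairs each element of an iterable (a list, tuple, or string, for example) with its index, yielding (index, value) tuples. It is mostly used in for loops.
Syntax: enumerate(iterable, start=0)

iterable: a sequence, an iterator, or any other object that supports iteration.
start: the index to start counting from.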

L = list('abcd')

# with enumerate()
for index, value in enumerate(L):
    print(index, value)

# with zip()
for index, value in zip(range(len(L)), L):
    print(index, value)

0 a
1 b
2 c
3 d
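The start parameter described above is not exercised in the loops. A short sketch of counting from 1 instead of 0 (the name numbered is illustrative):

```python
L = list('abcd')

# start only changes the counter, not which elements are visited
numbered = list(enumerate(L, start=1))

print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```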

Conversely, zip(*zipped) can be read as unzipping:

# the three tuples correspond to the original lists
zipped = list(zip(L1, L2, L3))
list(zip(*zipped))
>>> [('a', 'b', 'c'), ('d', 'e', 'f'), ('h', 'i', 'j')]

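3. NumPy basics

Question: what is the difference between array and ndarray?
My take: in NumPy, np.array() is a function, and the object it returns is an ndarray. So ndarray is the class, while array is a function that constructs instances of it.

import numpy as np
c = np.array([1, 2, 3])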
type(c)
>>> numpy.ndarray

3.1 Some special arrays

Arithmetic sequences

np.linspace syntax: numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
np.arange syntax: numpy.arange([start, ]stop, [step, ]dtype=None)

np.linspace(1, 3, 6) 
>>> array([1. , 1.4, 1.8, 2.2, 2.6, 3. ])

np.arange(1, 3.1, 0.4)
>>> array([1. , 1.4, 1.8, 2.2, 2.6, 3. ])
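The endpoint and retstep parameters in the linspace signature are easiest to understand by trying them. A minimal sketch, with the same start and stop as above (variable names are illustrative):

```python
import numpy as np

# retstep=True also returns the spacing between samples
values, step = np.linspace(1, 3, 6, retstep=True)
print(values)
print(step)

# endpoint=False leaves out the stop value, like arange does
no_end = np.linspace(1, 3, 5, endpoint=False)
print(no_end)
```

With a floating-point step, linspace is generally the safer choice, because the number of elements is stated explicitly instead of being derived from (stop - start) / step.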

Zero matrix
Syntax: numpy.zeros(shape, dtype=float, order='C')

np.zeros(3)
np.zeros((2,3))

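Identity matrix
Syntax: numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')

 - N: int, the number of rows in the output
 - M: int, optional, the number of columns; defaults to N
 - k: int, optional, index of the diagonal; the default 0 is the main diagonal, a negative value a lower diagonal, a positive value an upper diagonal.

np.eye(3)  # 3*3 identity matrix
np.eye(3, k=1)  # ones on the diagonal shifted up by 1

np.full()
Syntax: numpy.full(shape, fill_value, dtype=None, order='C')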
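The notes give only the signature of np.full. A minimal sketch of how it might be used, with fill values chosen purely for illustration:

```python
import numpy as np

# every entry is the scalar fill value
filled = np.full((2, 3), 10)
print(filled)

# a list fill value is broadcast across the rows
rows = np.full((2, 3), [1, 2, 3])
print(rows)
```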

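3.2 Random matrices: np.random

Sampling from the uniform distribution on the interval from a to b

If u is a value sampled from the standard uniform distribution on [0, 1), then a+(b-a)u follows the uniform distribution parameterized by a and b.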

a, b = 5, 15
(b - a) * np.random.rand(3) + a
>>> array([11.08102622,  6.65788516,  9.20414077])

np.random.rand(3)  # uniform on [0, 1)

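The standard normal distribution N(0,1):
If X ~ N(μ,σ^2), then Y=(X-μ)/σ ~ N(0,1). Conversely, if Z ~ N(0,1), then μ+σZ ~ N(μ,σ^2), which is what the code below uses.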

sigma, mu = 2.5, 3
mu + np.random.randn(3) * sigma
>>> array([2.43513888, 0.55386946, 5.05918916])

np.random.randn(3)  # standard normal N(0,1)
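Both rescaling tricks can be sanity-checked on a large sample. The sketch below uses np.random.default_rng with an arbitrary seed, which is my choice rather than what the notes use (they call the legacy np.random.rand and np.random.randn functions); the sample size is also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 100_000

# uniform on [a, b): rescale standard uniform samples
a, b = 5, 15
u = (b - a) * rng.random(n) + a
print(u.min() >= a, u.max() < b)

# N(mu, sigma^2): shift and scale standard normal samples
mu, sigma = 3, 2.5
x = mu + sigma * rng.standard_normal(n)
print(x.mean(), x.std())  # should land close to mu and sigma
```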

Random integer arrays: randint
Syntax: numpy.random.randint(low, high=None, size=None, dtype=int)

low, high, size = 5, 15, (2,2)
np.random.randint(low, high, size)  # high is excluded

array([[ 9,  9],
       [ 9, 11]])

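Random sampling from a list: choice
Syntax: numpy.random.choice(a, size=None, replace=True, p=None)
Note: p gives the probability of drawing each element and must sum to 1; replace selects the sampling scheme and defaults to sampling with replacement.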

my_list = ['a', 'b', 'c', 'd']
np.random.choice(my_list, 2, replace=False, p=[0.1, 0.7, 0.1 ,0.1])
>>> array(['b', 'd'], dtype='<U1')

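When sampling without replacement and the number of elements returned equals the length of the original list, this is equivalent to the permutation function, i.e. it shuffles the list.

# the two forms are equivalent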
np.random.choice(my_list, 4, replace=False)
np.random.permutation(my_list)
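"Equivalent" here means that both calls return a random reordering of the same elements, not that they return the same order on a given run. A small check (variable names are illustrative):

```python
import numpy as np

my_list = ['a', 'b', 'c', 'd']

shuffled_a = np.random.choice(my_list, 4, replace=False)
shuffled_b = np.random.permutation(my_list)

# each result contains every element exactly once
print(sorted(shuffled_a.tolist()) == my_list)  # True
print(sorted(shuffled_b.tolist()) == my_list)  # True
```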

3.3 Reshaping and merging np arrays

Transpose

np.ones((2,3)).T

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

Vertical merge: r_
Example: a 2*3 matrix and a 2*3 matrix merge into a 4*3 matrix

np.r_[np.zeros((2,3)),np.zeros((2,3))]

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

Horizontal merge: c_
Example: a 2*3 matrix and a 2*3 matrix merge into a 2*6 matrix

np.c_[np.zeros((2,3)),np.zeros((2,3))]

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

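When merging a 1-D array with a 2-D array, treat the 1-D array as a column vector; provided the lengths match, only the horizontal c_ merge can be used.

Two 1-D arrays of length 2 merge with r_ into a 1-D array of length 4
A 1-D array of length 2 (treated as a 2 * 1 column) and a 2 * 3 matrix merge horizontally into a 2 * 4 matrix
A 1-D array of length 2 and a 2 * 1 matrix cannot be merged vertically

Homework for myself: np.r_[np.array([0,0]),np.zeros(2)]
What I did not understand: isn't this a 1 * 2 matrix merged vertically with a 1 * 2 matrix? Why is the result a 1 * 4 matrix rather than a 2 * 2 matrix?

With help from my teammate rain
An answer from a Zhihu column: https://zhuanlan.zhihu.com/p/135295108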
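As far as I can tell, the puzzle comes down to shapes: np.array([0,0]) and np.zeros(2) are 1-D arrays of shape (2,), not 1 * 2 matrices, and r_ joins along the first axis, which is the only axis a 1-D array has. A short sketch comparing the 1-D and 2-D cases (variable names are mine):

```python
import numpy as np

# two 1-D arrays: joined end to end along their only axis
one_d = np.r_[np.array([0, 0]), np.zeros(2)]
print(one_d.shape)    # (4,)

# two genuine 1*2 matrices: stacked as rows
row = np.zeros((1, 2))
stacked = np.r_[row, row]
print(stacked.shape)  # (2, 2)

# c_ treats the 1-D array as a 2*1 column before joining
col = np.c_[np.array([0, 0]), np.zeros((2, 3))]
print(col.shape)      # (2, 4)
```

So the result is a length-4 1-D array rather than a 1 * 4 matrix; to get the 2 * 2 result, the inputs have to be 2-D to begin with.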

And Huang Yuanshuai's answer:

x = np.random.randint(0,10,(2,2))
y = np.random.randint(0,10,(2,2))
# concatenate along the first dimension
print(np.concatenate([x,y]).shape)
print(np.vstack([x,y]).shape)
print(np.r_[x,y].shape)
# concatenate along the second dimension
print(np.concatenate([x,y],axis=1).shape)
print(np.c_[x,y].shape)
print(np.hstack([x,y]).shape)

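3.4 Changing dimensions

The outputs below are consistent with target = np.arange(8).reshape(2, 4); the definition itself is not shown in these notes.

C order: read row by row, fill row by row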

target.reshape((4,2), order='C')

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

F order: read column by column, fill column by column

target.reshape((4,2), order='F')

array([[0, 2],
       [4, 6],
       [1, 3],
       [5, 7]])
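The two orders can be compared side by side by flattening first. The sketch assumes target = np.arange(8).reshape(2, 4), which is not defined in these notes but reproduces both outputs shown above.

```python
import numpy as np

target = np.arange(8).reshape(2, 4)  # assumed definition, see lead-in

# the read order: row by row (C) versus column by column (F)
print(target.ravel(order='C').tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(target.ravel(order='F').tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]

# the same sequence is then written back in the same order
c_order = target.reshape((4, 2), order='C')
f_order = target.reshape((4, 2), order='F')
print(c_order)
print(f_order)
```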

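3.5 Slicing and indexing np arrays

Array slicing supports slice-style start:end:step slices (the end is excluded). Here target is a 3*3 array; the outputs are consistent with np.arange(9).reshape(3, 3).

# every row except the last (i.e. the first two rows), taking the first and third elements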
target[:-1, [0,2]]

array([[0, 2],
       [3, 5]])

np.ix_ also allows boolean indexing along each dimension, but slices cannot be used with it

# rows 1 and 3, taking the first and third elements of each
target[np.ix_([True, False, True], [True, False, True])]

array([[0, 2],
       [6, 8]])
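It helps to see what np.ix_ changes compared with passing two index lists directly. The sketch assumes target = np.arange(9).reshape(3, 3), which matches the outputs above but is not defined in these notes.

```python
import numpy as np

target = np.arange(9).reshape(3, 3)  # assumed definition, see lead-in

# np.ix_ takes the cross product: rows {0, 2} x columns {0, 2}
block = target[np.ix_([0, 2], [0, 2])]
print(block.tolist())   # [[0, 2], [6, 8]]

# two plain lists are paired up instead: elements (0, 0) and (2, 2)
paired = target[[0, 2], [0, 2]]
print(paired.tolist())  # [0, 8]

# boolean masks passed to np.ix_ select the same block
mask_block = target[np.ix_([True, False, True], [True, False, True])]
print((mask_block == block).all())  # True
```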

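3.6 Common functions

where
Syntax: numpy.where(condition[, x, y])

Association: this reminds me of the IF function in Excel; where picks from x where condition is true and from y otherwise.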
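The notes give no example for where, so here is a minimal sketch with a made-up array and replacement value:

```python
import numpy as np

a = np.array([-1, 1, -1, 0])

# three-argument form: keep a where a > 0, otherwise use 5
replaced = np.where(a > 0, a, 5)
print(replaced.tolist())  # [5, 1, 5, 5]

# one-argument form: a tuple of index arrays where the condition holds
idx = np.where(a > 0)
print(idx[0].tolist())    # [1]
```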

nonzero: returns the indices of the nonzero elements

a = np.array([-2,-5,0,1,3,-1])
np.nonzero(a)
>>> (array([0, 1, 3, 4, 5], dtype=int64),)
# the third element is 0, so index 2 is not returned

argmax and argmin return the index of the largest and smallest element respectively

a.argmax()
a.argmin()

# the largest number is the 5th element, so argmax returns 4; the smallest is the 2nd, so argmin returns 1

cumprod: cumulative product

a = np.array([1,2,3])
a.cumprod()
>>> array([1, 2, 6], dtype=int32)
# [1, 1*2, 1*2*3]

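cumsum: cumulative sum (works the same way as cumprod)
diff: the difference between each element and the previous one; the result is one element shorter than the original array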

np.diff(a)
>>> array([1, 1])
# [2-1, 3-2]
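cumsum has no example above. A short sketch with the same array a, which also shows that diff undoes cumsum except for the first element:

```python
import numpy as np

a = np.array([1, 2, 3])

running = a.cumsum()
print(running.tolist())           # [1, 3, 6], i.e. [1, 1+2, 1+2+3]

# differencing the running sum recovers a[1:]
print(np.diff(running).tolist())  # [2, 3]
```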

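3.7 Statistical functions

Common statistical functions include max, min, mean, median, std, var, sum, quantile, cov, corrcoef
Note: quantile is a module-level function rather than an array method, so array.quantile does not exist; write np.quantile instead. The same holds for median, cov, and corrcoef.

np.quantile(target, 0.5)  # the 0.5 quantile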
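Most of these functions also accept an axis argument, which the notes do not show. A minimal sketch on a small made-up array (this target is defined here for illustration and is not the one used earlier):

```python
import numpy as np

target = np.arange(1, 7).reshape(2, 3)  # [[1, 2, 3], [4, 5, 6]]

print(target.sum())                 # 21, over all elements
print(target.sum(axis=0).tolist())  # [5, 7, 9], one sum per column
print(target.sum(axis=1).tolist())  # [6, 15], one sum per row

# module-level functions: the 0.5 quantile equals the median
print(np.quantile(target, 0.5))     # 3.5
print(np.median(target))            # 3.5
```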

4. Exercises

4.1 Matrix multiplication with a list comprehension

M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = np.empty((M1.shape[0],M2.shape[1]))
for i in range(M1.shape[0]):
    for j in range(M2.shape[1]):
        item = 0
        for k in range(M1.shape[1]):
            item += M1[i][k] * M2[k][j]
        res[i][j] = item
(np.abs(M1@M2 - res) < 1e-15).all()  # abs() so that large negative differences do not pass
>>> True
res
>>> array([[0.16193277, 0.17998346, 0.16427481, 0.29024378],
       [0.86714955, 1.05769702, 0.60680172, 0.58339011]])

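Rewrite it as a list comprehension.

# Thought: going back to the earlier example [m+'_'+n for m in ['a','b'] for n in ['c','d']], three nested for clauses should be needed; from innermost to outermost they are i, j, k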
M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = [sum(M1[i][k] * M2[k][j]) for k in range(M1.shape[1]) for j in range(M2.shape[1]) for i in range(M1.shape[0])]
Error:
TypeError                                 Traceback (most recent call last)
<ipython-input-30-c61413ce00bb> in <module>
----> 1 res = [sum(M1[i][k] * M2[k][j]) for k in range(M1.shape[1]) for j in range(M2.shape[1]) for i in range(M1.shape[0])]

<ipython-input-30-c61413ce00bb> in <listcomp>(.0)
----> 1 res = [sum(M1[i][k] * M2[k][j]) for k in range(M1.shape[1]) for j in range(M2.shape[1]) for i in range(M1.shape[0])]

TypeError: 'numpy.float64' object is not iterable

Revised attempt:
res = [(sum(M1[i][k] * M2[k][j]) for k in range(M1.shape[1])) for j in range(M2.shape[1]) for i in range(M1.shape[0])]
res
>>> [<generator object <listcomp>.<genexpr> at 0x00000000054C7E40>,
 <generator object <listcomp>.<genexpr> at 0x00000000054C7DD0>,
 <generator object <listcomp>.<genexpr> at 0x00000000054C70B0>,
 <generator object <listcomp>.<genexpr> at 0x00000000054C7EB0>,
 <generator object <listcomp>.<genexpr> at 0x00000000054C7F20>,
 <generator object <listcomp>.<genexpr> at 0x00000000054C7F90>,
 <generator object <listcomp>.<genexpr> at 0x00000000054D4040>,
 <generator object <listcomp>.<genexpr> at 0x00000000054D40B0>]

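The result is still not right: sum() has to wrap the whole comprehension over k, and the loops over j and i have to be nested comprehensions to produce a 2-D result. I looked at the reference answer: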
res = [[sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]
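The reference answer can be checked against the @ operator. The sketch below uses np.random.default_rng with an arbitrary seed so the run is repeatable (the notes use np.random.rand), and np.allclose rather than a fixed threshold.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
M1 = rng.random((2, 3))
M2 = rng.random((3, 4))

# outer list: rows i; inner list: columns j; sum over k is the dot product
res = [[sum(M1[i][k] * M2[k][j] for k in range(M1.shape[1]))
        for j in range(M2.shape[1])]
       for i in range(M1.shape[0])]

print(np.allclose(res, M1 @ M2))  # True
```

Reading from the outside in, the comprehension mirrors the three nested loops of the original version: i and j pick the output cell, and k runs over the shared dimension.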

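4.2 Updating a matrix

[Image: problem statement, not recovered]

# first, write matrix A
A = np.arange(1, 10).reshape(3, 3)
# matrix B: I could not work it out; will try again later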

4.3 Chi-squared statistic

[Image: problem statement, not recovered]
