[Teddy Cup - Data Analysis - 2] numpy

_Morain

Last modified 2022-10-29 23:51:50

Category: Teddy Cup. Tags: python, data analysis

First published 2022-10-29 23:50:58

Article link: https://blog.csdn.net/m0_55415167/article/details/127594185


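numpy helps us process numerical data quickly and conveniently.

Install it with pip from the command line: pip install numpy

When importing, the package is usually aliased as np for convenience: import numpy as np (all the examples below use this form).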

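Contents

[Teddy Cup - Data Analysis - 2] numpy
1. numpy's data structure: the array
1.1 Creating arrays (matrices)
1.2 Other methods and attributes
1.3 Array shape
1.4 Arithmetic between arrays and numbers

2. Common numpy methods
2.1 Reading data with numpy
2.2 Transposing arrays
2.3 Indexing and slicing
2.4 Modifying values in an array
2.5 Concatenating arrays and swapping rows and columns
2.6 nan and inf in numpy
2.7 Common statistical functions in numpy
2.8 Practice: replacing nan with the mean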

1. numpy's data structure: the array

The array is numpy's core data type and the foundation for everything that follows

1.1 Creating arrays (matrices)

import numpy as np

# Create directly from a list
t1 = np.array([1,2,3])

print(t1)		# [1 2 3]
print(type(t1))	# <class 'numpy.ndarray'>

# Create from a range object
t2 = np.array(range(10))
print(t2)		# [0 1 2 3 4 5 6 7 8 9]

# Use the built-in np.arange, which does the same as the range version above but is more convenient.
# It also starts at 0, is half-open (start included, stop excluded), and takes the same start, stop and step arguments as range.
t3 = np.arange(10)
print(t3)		# [0 1 2 3 4 5 6 7 8 9]

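numpy can also generate random arrays through np.random: rand (uniform distribution on [0, 1)); randn (standard normal distribution); randint(low, high[, size]) (random integers in [low, high)), and so on. Call np.random.seed to fix the random seed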

print(np.random.randint(10,20,(4,5)))

The other generators are covered in the np.random documentation
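As a small sketch of the three functions named above, the snippet below fixes the seed and checks the shapes and value ranges. The seed value 0 and the variable names are only illustrative choices.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same numbers

uniform = np.random.rand(2, 3)             # uniform samples in [0, 1), shape (2, 3)
normal = np.random.randn(2, 3)             # standard normal samples, shape (2, 3)
ints = np.random.randint(10, 20, (4, 5))   # integers in [10, 20), shape (4, 5)

print(uniform.shape, normal.shape, ints.shape)  # (2, 3) (2, 3) (4, 5)
print(ints.min() >= 10, ints.max() < 20)        # True True

# Re-seeding with the same value reproduces the same first draw
np.random.seed(0)
again = np.random.rand(2, 3)
print(np.array_equal(uniform, again))           # True
```

Fixing the seed is mostly useful when a result has to be reproduced, for example when comparing two runs of the same analysis.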

1.2 Other methods and attributes

The dtype attribute of an array gives the type of the data it stores:

t3 = np.arange(10)
print(t3.dtype)		# int32 on the author's machine; the default integer type is platform-dependent (often int64)

Pass the dtype argument to np.array to set the stored data type explicitly

t4 = np.array(range(10),dtype=float)	# a string also works, e.g. dtype="complex128"
print(t4)		# [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
print(t4.dtype)	# float64

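Common types: int8 to int64 (8/16/32/64 bits); float16 to float128; complex64 to complex256; bool. Note that float128 and complex256 are only available on some platforms.

A bool array can be created by passing 0/1 values together with dtype=bool

Use the astype method to convert the stored data type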
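A minimal sketch of the 0/1 to bool conversion described above; the array contents are made up for illustration.

```python
import numpy as np

flags = np.array([1, 0, 1, 1, 0], dtype=bool)  # nonzero becomes True, zero becomes False
print(flags)        # [ True False  True  True False]
print(flags.dtype)  # bool
print(flags.sum())  # 3, because True counts as 1 when summed
```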

# Convert the data type of an array
t5 = t4.astype("int64")
print(t4.dtype)     # float64
print(t5.dtype)     # int64

Use np.round() to round the decimals in an array to a given number of places

# Decimals in numpy
import random

t6 = np.array([random.random() for i in range(10)])
print(t6)
print(t6.dtype)
t7 = np.round(t6,2)     # keep two decimal places
print(t7)

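1.3 Array shape

The shape of an array is its number of rows and columns; the array's shape attribute is a tuple holding these sizes. All the arrays shown so far are one-dimensional (a single row). Nest lists one level deep to create an array with several rows and columns, and nest further to create arrays with more dimensions.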

import numpy as np

t0 = np.array([1,2,3])
print(t0)
print(t0.shape)     # (3,)

t1 = np.array([[11,12,13],[21,22,23]])
print(t1)
print(t1.shape)     # (2, 3)

t2 = np.array([[[111,112,113],[121,122,123]],[[211,212,213],[221,222,223]]])    # note that everything must be wrapped in one outer []
print(t2)
print(t2.shape)     # (2, 2, 3)

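The number of dimensions equals the number of entries in the shape tuple (also available as the ndim attribute)

Use reshape to change the shape of an array; reshape does not modify the original array

# Change the shape of an array
t3 = np.arange(12)
t4 = t3.reshape(3,4)
print(t3)               # [ 0  1  2  3  4  5  6  7  8  9 10 11]
print(t4)               # [[ 0  1  2  3]
                        #  [ 4  5  6  7]
                        #  [ 8  9 10 11]]
t5 = t4.reshape(12)  	# note: different from reshape(12,1) and reshape(1,12)
print(t5)               # [ 0  1  2  3  4  5  6  7  8  9 10 11]
# reshape takes one argument per axis (for example blocks, rows, columns). The total number of elements must stay the same, otherwise a ValueError is raised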

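The tuple returned by the shape attribute can be used to compute the number of elements in an array

# t4 is the 3-row, 4-column array from the previous example
print(t4.shape[0]*t4.shape[1])	# 12

flatten() turns a multi-dimensional array into a one-dimensional one without having to pass the length

# t4 is the 3-row, 4-column array from the previous example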
print(t4.flatten())
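The shape-related points above can be sketched in one place. This example additionally uses the size attribute and the -1 placeholder of reshape, which the text does not mention; both are standard numpy features and are shown here only as a convenience.

```python
import numpy as np

t = np.arange(12).reshape(3, 4)

print(t.size)                  # 12, same as t.shape[0] * t.shape[1]
print(t.reshape(2, -1).shape)  # (2, 6): -1 lets numpy infer the remaining axis

flat = t.flatten()
flat[0] = 100                  # flatten returns a copy, so t is unchanged
print(t[0, 0])                 # 0

try:
    t.reshape(5, 3)            # 15 elements requested but t has 12
except ValueError as err:
    print("reshape failed:", err)
```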

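Use zeros and ones to quickly build arrays filled with 0 or with 1

Use eye to quickly build a square array with 1 on the diagonal; its argument is the side length

Use argmin and argmax to quickly find the positions of the minimum and maximum values in an array

zero_data = np.zeros((3,3))	# note that the shape is passed as a tuple
# The result is float by default; convert it with astype if needed

t = np.arange(12).reshape(3,4)  # sample array for argmin
print(np.argmin(t,axis=1))      # the first argument is the array to search
                                # axis=0: minimum of each column, returns its row index
                                # axis=1: minimum of each row, returns its column index
                                # argmax works the same way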
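A short sketch of the constructors and of argmin/argmax along both axes. The small array t is made up so that the positions are easy to verify by eye.

```python
import numpy as np

print(np.zeros((2, 3)))             # 2x3 array of 0.0
print(np.ones((2, 3), dtype=int))   # 2x3 array of integer 1
print(np.eye(3))                    # 3x3 array with 1.0 on the diagonal

t = np.array([[3, 1, 2],
              [0, 5, 4]])
print(np.argmin(t, axis=0))  # [1 0 0]: row index of the minimum in each column
print(np.argmax(t, axis=0))  # [0 1 1]: row index of the maximum in each column
print(np.argmin(t, axis=1))  # [1 0]: column index of the minimum in each row
print(np.argmax(t, axis=1))  # [0 1]: column index of the maximum in each row
```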

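1.4 Arithmetic between arrays and numbers

When an array is combined with a number, the operation is applied to every element of the array (the broadcasting rule)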

import numpy as np

t = np.array([[ 0 , 1 , 2 , 3],[ 4 , 5 , 6 , 7],[ 8 , 9 , 10 , 11]])

print(t)
print(t+2)      # [[ 2  3  4  5]
                #  [ 6  7  8  9]
                #  [10 11 12 13]]

print(t*2)      # [[ 0  2  4  6]
                # [ 8 10 12 14]
                # [16 18 20 22]]  

print(t/0)      # [[nan inf inf inf]
                #   [inf inf inf inf]
                # [inf inf inf inf]] 
        
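# In numpy 0/0 is not a number and is shown as nan; a nonzero number divided by 0 is infinite and shown as inf (-inf for negative numbers). numpy also emits a RuntimeWarning

Arithmetic between two arrays works differently from linear algebra.

If the two arrays have the same shape, the operation is applied element by element (multiplication included)

If the shapes differ, one of the arrays must be a single row or a single column whose length matches the row or column length of the other array

For a single row, the operation is applied to every row of the other array
For a single column, the operation is applied to every column of the other array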

import numpy as np

t = np.arange(12).reshape(3,4)
print(t)		# [[ 0  1  2  3]
				# [ 4  5  6  7]
				# [ 8  9 10 11]]
        
t1 = np.array([1 for i in range(12)]).reshape(3,4)      
t2 = np.array([1,2,3,4])
t3 = np.array([[1],[2],[3]])        

print(t+t1)		# [[ 1  2  3  4]
				# [ 5  6  7  8]
				# [ 9 10 11 12]]
        
print(t+t2)		# [[ 1  3  5  7]
				# [ 5  7  9 11]
				# [ 9 11 13 15]]
        
print(t+t3)		# [[ 1  2  3  4]
				# [ 6  7  8  9]
				# [11 12 13 14]]

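Broadcasting rule: compare the two shapes starting from the last axis and moving backwards (the trailing dimensions). Two arrays are broadcast-compatible if, at every position, the axis lengths are equal or one of them is 1; axes that are missing at the front of the shorter shape are treated as length 1. Broadcasting then happens along the missing and/or length-1 axes. (For a 3-D array this means the other array has to match its trailing axes, e.g. shape (3,2) or (2,) against (3,3,2); matching only the leading axes is not enough.)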

# Below are my own tests; the results seem to differ a little from what the video says
t4 = np.arange(18).reshape(3,3,2)
################ Test 1 ######################
# t5 = np.arange(9).reshape(3,3)
t6 = np.arange(6).reshape(3,2)
t7 = np.arange(6).reshape(2,3)
#t8 = np.arange(3)
t9 = np.arange(2)

# t4 + t5       fails
t4 + t6         # works
# t4 + t7       fails
#print(t4 + t8) fails
print(t4 + t9)  # works
# So the shapes are only matched from the trailing end?

################ Test 2 ######################
t4 = np.arange(24).reshape(2,3,4)

t5 = np.arange(12).reshape(3,4)
t7 = np.arange(6).reshape(2,3)
t8 = np.arange(3)
t9 = np.arange(2)
t10 = np.arange(4)

t4 + t5       # works
#t4 + t6         fails
#t4 + t7       fails ???
#print(t4 + t8) fails    ???
#print(t4 + t9)  fails
t4 + t10       # works ???

# Hmm, my 3-D tests do not quite match what was taught -_-
# It looks like shapes are only matched from the trailing end, which is in fact how numpy broadcasting works
# Data analysis probably only needs 2-D arrays; I'll look into this again when I need it...
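The test results above are consistent with numpy comparing shapes from the last axis backwards. The sketch below checks the same shape pairs without doing any arithmetic; can_broadcast is a hypothetical helper written for this post, not a numpy function. It also shows one way to combine a (2, 3) array with a (2, 3, 4) array by adding a trailing length-1 axis.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Hypothetical helper: return the broadcast shape, or None if incompatible."""
    try:
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        return None

print(can_broadcast((2, 3, 4), (3, 4)))  # (2, 3, 4): trailing axes 3 and 4 match
print(can_broadcast((2, 3, 4), (4,)))    # (2, 3, 4): trailing axis 4 matches
print(can_broadcast((2, 3, 4), (2, 3)))  # None: last axes are 4 and 3
print(can_broadcast((2, 3, 4), (3,)))    # None: last axes are 4 and 3

# A (2, 3) array lines up with the leading axes only after a length-1 axis is appended
t4 = np.arange(24).reshape(2, 3, 4)
t7 = np.arange(6).reshape(2, 3)
result = t4 + t7[:, :, np.newaxis]       # (2, 3, 1) broadcasts against (2, 3, 4)
print(result.shape)                      # (2, 3, 4)
```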

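2. Common numpy methods

The concept of axes:

An element of an array is located by coordinate-like indices along its axes (axis 0, axis 1, axis 2). An array with only axis 0 is like a number line. With two axes it is like a plane coordinate system: axis 0 is the row and axis 1 is the column. With three axes it is like a 3-D coordinate system: axis 0 is the block (the outermost dimension), axis 1 is the row, and axis 2 is the column.

import numpy as np


#################### Three axes ####################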
t = np.arange(27).reshape(3,3,3)
print(t)
'''
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
'''
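print(t[1,0,0])     # 9: the first index (axis 0) selects the block
print(t[0,1,0])     # 3: the second index (axis 1) selects the row
print(t[0,0,1])     # 1: the third index (axis 2) selects the column

#################### Two axes ####################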
t1 = np.arange(9).reshape(3,3)
print(t1)
'''
[[0 1 2]
 [3 4 5]
 [6 7 8]]
'''
print(t1[1,0])      # 3: the first index selects the row
print(t1[0,1])      # 1: the second index selects the column

2.1 Reading data with numpy

Use np.loadtxt(fname, dtype=, delimiter=, skiprows=, usecols=, unpack=) to read data from a text file

import numpy as np

file_path = './data/6_读取csv.csv'

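# delimiter: the field separator
# dtype: the data type
# skiprows: how many lines to skip at the start of the file, usually the header line; skiprows=1 skips the first line
# usecols: which columns to read, counted from 0
# unpack: whether to transpose the result, False by default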
t = np.loadtxt(file_path,delimiter=',',dtype="float",skiprows=1,usecols=[1,2,3],unpack=True)
print(t)
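To try the parameters without the original CSV file, the sketch below reads from an in-memory string instead. The column names and numbers are invented stand-ins, not the contents of the author's file.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for the real file
csv_text = """id,views,likes
1,100,10
2,200,20
3,300,30
"""

rows = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=float,
                  skiprows=1, usecols=[1, 2])
cols = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=float,
                  skiprows=1, usecols=[1, 2], unpack=True)

print(rows.shape)  # (3, 2): one row per CSV line
print(cols.shape)  # (2, 3): with unpack=True each row holds one CSV column
print(cols[0])     # [100. 200. 300.]
```

With unpack=True the result can also be unpacked directly into one variable per column, e.g. views, likes = np.loadtxt(...).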

2.2 Transposing arrays

For an existing or newly created array, the transpose() method or the T attribute returns the transposed array

print(t)
print(t.transpose())
print(t.T)

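# Neither form changes the original array t; both only return the transposed array

# swapaxes() can also transpose by exchanging two axes
t.swapaxes(0,1)	# does not change the original array either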
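A quick sketch comparing the three ways of transposing. swapaxes needs the two axes to exchange as arguments, and for arrays with more than two dimensions transpose can take a full axis order; the sample arrays are made up.

```python
import numpy as np

t = np.arange(6).reshape(2, 3)
print(t.T.shape)                               # (3, 2)
print(np.array_equal(t.T, t.transpose()))      # True
print(np.array_equal(t.T, t.swapaxes(0, 1)))   # True

t3d = np.arange(24).reshape(2, 3, 4)
print(t3d.swapaxes(0, 2).shape)       # (4, 3, 2): axes 0 and 2 exchanged
print(t3d.transpose(1, 0, 2).shape)   # (3, 2, 4): explicit axis order
```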

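2.3 Indexing and slicing

import numpy as np
t0 = np.arange(9).reshape(3,3)

############ Rows ############
print('One row:')
print(t0[1])

print('Several consecutive rows:')
print(t0[1:3])      # half-open: rows 1 and 2
# print(t0[1:]) from row 1 to the end

print('Several non-consecutive rows:')
print(t0[[0,2]])    # note the double square brackets

############ Columns ############
print('One column:')
print(t0[:,1])

print('Several consecutive columns:')
print(t0[:,1:3])      # half-open: columns 1 and 2
# print(t0[:,1:]) from column 1 to the end

print('Several non-consecutive columns:')
print(t0[:,[0,2]])    # note the double square brackets

############ Rows and columns ############
print('One element by row and column:')
print(t0[1,2])

print('Several rows and columns:')
print(t0[1:2,0:2])  # [[3 4]]  half-open slices; returns the values where the selected rows and columns cross

print('Several non-adjacent points:')
print(t0[[0,2],[0,2]])  # [0 8]  note: this does not take the crossing points; the two lists
                        # hold the axis-0 and axis-1 coordinates (other dimensions work the same way)
                        # if one axis has a single value it behaves like a cross; otherwise both lists must have the same length,
                        # and each result is the element at the paired axis-0 and axis-1 coordinates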
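The difference between picking paired points and picking a row/column cross can be sketched as follows. np.ix_ is a standard numpy helper that the text above does not mention; it is shown here as one way to get the crossing points.

```python
import numpy as np

t0 = np.arange(9).reshape(3, 3)

points = t0[[0, 2], [0, 2]]           # elements at (0, 0) and (2, 2)
print(points)                         # [0 8]

cross = t0[np.ix_([0, 2], [0, 2])]    # rows 0 and 2 crossed with columns 0 and 2
print(cross)                          # [[0 2]
                                      #  [6 8]]

two_step = t0[[0, 2]][:, [0, 2]]      # same cross, selecting rows first and then columns
print(np.array_equal(cross, two_step))  # True
```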

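2.4 Modifying values in an array

To change the value at a given row and column, just assign to it

t = np.arange(9).reshape(3,3)   # sample array for the edits below
# Change the value at one row and column
t[1,1] = 0
print(t)
# Set several values to the same value (here all of row 0)
t[0,0:] = 9
print(t)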

Use boolean indexing to modify values according to a condition

# Set every number greater than 5 to 0
t[t>5] = 0
print(t)

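Use np.where to assign one value where a condition holds and another where it does not

# Set values less than or equal to 6 to 0 and values greater than 6 to 1
t1 = np.where(t<=6,0,1)     # does not modify the original array: 0 where the condition holds, 1 elsewhere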
print(t1)

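Use clip(a,b) to replace every value smaller than a with a and every value larger than b with b

t = np.arange(9).reshape(3,3)
# Replace every number below 3 with 3 and every number above 5 with 5
t2 = t.clip(3,5)        # does not modify the original array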
print(t2)

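Forcing a value to nan

# nan in numpy is a float, so an integer array must first be converted to float before a value can be set to nan
# t[0,0] = np.nan       # raises ValueError: cannot convert float NaN to integer
t = t.astype(float)     # astype does not modify the original array either, it returns a new one; many numpy methods behave this way
t[0,0] = np.nan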
print(t)

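2.5 Concatenating arrays and swapping rows and columns

Use vstack((t1,t2)) to stack t1 and t2 vertically, with t1 on top and t2 below; use hstack((t1,t2)) to join them horizontally, with t1 on the left and t2 on the right.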

import numpy as np

t1 = np.arange(9).reshape(3,3)
t2 = np.arange(9,18).reshape(3,3)

print(np.vstack((t1,t2)))       # note that the arrays are passed together as one tuple
print(np.hstack((t1,t2)))

Use slicing and assignment to quickly swap columns and rows of an array

# Swap columns 0 and 2
t[:,[0,2]] = t[:,[2,0]]
print(t)
# Swap rows 1 and 2
t[[1,2],:] = t[[2,1],:]
print(t)

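Deep and shallow copies in numpy

As in Java, assigning an array with = does not copy anything, and slicing does not copy the data either
a = b
a = b[:]
In both cases, modifying the elements of a also changes b: with a = b the two names refer to the same array, and with a = b[:] a is a view that shares b's data

Use the copy method to make a deep copy
a = b.copy()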
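The three cases can be told apart with a small experiment. The names alias, view and deep are only illustrative; np.shares_memory is a standard numpy function that reports whether two arrays use the same underlying data.

```python
import numpy as np

b = np.arange(5)
alias = b          # the same object under a second name
view = b[:]        # a new array object that shares b's data
deep = b.copy()    # an independent copy of the data

view[0] = 100      # visible through b
deep[1] = -1       # not visible through b

print(b)                           # [100   1   2   3   4]
print(alias is b, view is b)       # True False
print(np.shares_memory(view, b))   # True
print(np.shares_memory(deep, b))   # False
```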

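2.6 nan and inf in numpy

Things to note about nan

nan is not equal to nan
In [29]: np.nan == np.nan
Out[29]: False
This property can be used to count the nan values in an array:
t = np.array([ 1.,  2., np.nan,  3., np.nan,  5.])
print(np.count_nonzero(t != t))     # 2
isnan tests whether a value is nan
t = np.array([ 1.,  2., np.nan,  3., np.nan,  5.])
print(np.count_nonzero(np.isnan(t)))    # 2
Any arithmetic involving nan gives nan
np.sum(t)   # nan

np.sum(t[, axis=]) quickly computes the sum of an array. The first argument is the array to sum and the optional second argument is the axis, so the sum can be taken by row or by column: axis=0 gives the sum of each column and axis=1 gives the sum of each row

In everyday data analysis, nan is usually replaced with the mean (or the median), or the rows or columns that contain nan are dropped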
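A sketch of how nan propagates through np.sum along each axis, and of two common ways around it. np.nansum and the row-dropping mask are standard numpy idioms that the text above does not cover; the sample array is made up.

```python
import numpy as np

t = np.array([[1., 2., np.nan],
              [4., 5., 6.]])

print(np.sum(t))              # nan
print(np.sum(t, axis=0))      # [ 5.  7. nan]: one sum per column
print(np.sum(t, axis=1))      # [nan 15.]: one sum per row
print(np.nansum(t, axis=0))   # [5. 7. 6.]: nan treated as 0

# Drop every row that contains at least one nan
clean_rows = t[~np.isnan(t).any(axis=1)]
print(clean_rows)             # [[4. 5. 6.]]
```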

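2.7 Common statistical functions in numpy

Statistic	numpy function
Sum	t.sum()
Mean	t.mean()
Median	np.median(t)
Maximum / minimum	t.max() / t.min()
Range (max - min)	np.ptp(t)
Standard deviation	t.std()

Here t. marks a method of the array object and np. marks a function of the numpy module.

All of these accept an axis argument to compute by row or by column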
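The functions in the table can be tried on a small made-up array; axis=0 gives one result per column and axis=1 one result per row.

```python
import numpy as np

t = np.array([[1., 2., 3.],
              [4., 5., 6.]])

print(t.sum(axis=0))           # [5. 7. 9.]
print(t.mean(axis=1))          # [2. 5.]
print(np.median(t, axis=0))    # [2.5 3.5 4.5]
print(t.max(axis=1))           # [3. 6.]
print(t.min(axis=1))           # [1. 4.]
print(np.ptp(t, axis=0))       # [3. 3. 3.]: max - min of each column
print(t.std(axis=1))           # population standard deviation of each row
```

Note that std uses the population formula by default (ddof=0); pass ddof=1 for the sample standard deviation.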

2.8 Practice: replacing nan with the mean

import numpy as np

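def fill_ndarray(t):
    t1 = t.copy()
    for i in range(t1.shape[1]):  # loop over every column
        temp_col = t1[:,i]   # view of the current column
        nan_num = np.count_nonzero(temp_col != temp_col)    # number of nan values in this column
        if nan_num != 0:
            temp_not_nan_col = temp_col[temp_col==temp_col]     # values in this column that are not nan
            temp_col[np.isnan(temp_col)] = temp_not_nan_col.mean()  # writes through the view into t1
    return t1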

if __name__ == '__main__':
    t = np.arange(100).reshape(10,10).astype(float)
    t[1,2] = np.nan
    t[2,5] = np.nan
    t1 = fill_ndarray(t)
    print(t1)
    print('*'*100)
    print(t)
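The same result can be obtained without an explicit loop. This is an alternative sketch, not the approach used above: it relies on np.nanmean and np.where, and fill_nan_with_col_mean is a name made up for this example.

```python
import numpy as np

def fill_nan_with_col_mean(t):
    """Return a copy of t with each nan replaced by the mean of its column."""
    col_means = np.nanmean(t, axis=0)            # per-column mean, ignoring nan
    return np.where(np.isnan(t), col_means, t)   # col_means broadcasts across the rows

t = np.arange(12).reshape(3, 4).astype(float)
t[1, 2] = np.nan
filled = fill_nan_with_col_mean(t)

print(filled[1, 2])        # 6.0, the mean of 2 and 10
print(np.isnan(t[1, 2]))   # True: the input array is left unchanged
```

The column means have shape (4,), so by the broadcasting rule they line up with the last axis of the (3, 4) array.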