numpy的深复制与浅复制的区别_Numpy 的 View 和 Copy

最新推荐文章于 2022-06-07 19:46:00 发布

weixin_39517357

最新推荐文章于 2022-06-07 19:46:00 发布

阅读量221

点赞数

文章标签： numpy的深复制与浅复制的区别

最近在做一些视觉的问题，接触到了ADE20K这个数据集。其中里面有一个索引文件 index_ade20k.mat 储存了图像数据和标签的对应关系。于是去参考github上关于这个数据集的读取方式，对于MATLAB文件python可以使用 scipy.io.matload 直接读取得到 np.ndarray 对象，随后就看到了一个之前没有见过的操作：

index = Ade20kIndex(**{name: index[name][()] for name in index.dtype.names})

即循环中的[()]，所以就关于这个操作进行了一些学习。首先我以最基本的一个array对象为例，直接输出了[()]后的结果：

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print('a:n', a)
print('a[()]:n', a[()])

# output：
a:
 [[1 2 3]
 [4 5 6]]
a[()]:
 [[1 2 3]
 [4 5 6]]

可以看到这个[()]其实起到了类似复制的作用。那这个复制是属于浅拷贝还是深拷贝呢？与ndarray.copy()和[:]有什么区别吗？通过调用python的 id() 函数可以查看不同变量引用的对象的内存地址:

b = a
c = a[()]
d = a[:]
e = a.copy()
print('a:', hex(id(a)), 'n', a)
print('b:', hex(id(b)), 'n', b)
print('c:', hex(id(c)), 'n', c)
print('d:', hex(id(d)), 'n', d)
print('e:', hex(id(e)), 'n', e)

# output：
a: 0x7f17bc63d440 
 [[1 2 3]
 [4 5 6]]
b: 0x7f17bc63d440 
 [[1 2 3]
 [4 5 6]]
c: 0x7f1783512990 
 [[1 2 3]
 [4 5 6]]
d: 0x7f17835129e0 
 [[1 2 3]
 [4 5 6]]
e: 0x7f1783512a80 
 [[1 2 3]
 [4 5 6]]

除了直接赋值以外，其他的复制方式产生的结果的内存地址都与原始内存地址不同。那是否已经坐实是深拷贝了，可以通过简单计算来验证一下：

a -= 5
print('a:', hex(id(a)), 'n', a)
print('b:', hex(id(b)), 'n', b)
print('c:', hex(id(c)), 'n', c)
print('d:', hex(id(d)), 'n', d)
print('e:', hex(id(e)), 'n', e)

# output：
a: 0x7f17bc63d440 
 [[-4 -3 -2]
 [-1  0  1]]
b: 0x7f17bc63d440 
 [[-4 -3 -2]
 [-1  0  1]]
c: 0x7f1783512990 
 [[-4 -3 -2]
 [-1  0  1]]
d: 0x7f17835129e0 
 [[-4 -3 -2]
 [-1  0  1]]
e: 0x7f1783512a80 
 [[1 2 3]
 [4 5 6]]

发现除了ndarray.copy()产生的结果以外，其他的复制方式都随着 a 的变化而产生了相同的变化。于是去查阅了相关文档。

Numpy View

因为[()]看形式和基本的作用应该也是索引的操作，所以首先去numpy官网找到了indexing的文档页面。numpy的indices分两种 Basic Slicing and Indexing 和 Advanced Indexing，其中在某个位置提到:

An empty (tuple) index is a full scalar index into a zero dimensional array. x[()] returns a scalar if x is zero dimensional and a view otherwise. On the other hand x[...] always returns a view.

现在基本可以认定这样的操作就是复制作用，并且还有一种操作[...]也有这样的作用（关于 ... 的其他用法可以查看本页文档）。那么view是什么？这样的操作并不是直接返回的数据，而是数据的一种封装形式 —— 以view object的形式。所以背后的内存分配可能是这样的:

可以调用 np.may_share_memory() 来查看是否指向同一个数据存储的内存地址：

print('a:', hex(id(a)))
print('b:', hex(id(b)))
print('c:', hex(id(c)))
print('d:', hex(id(d)))
print('e:', hex(id(e)))
print('a-b share mem:', np.may_share_memory(a,b))
print('a-c share mem:', np.may_share_memory(a,c))
print('a-d share mem:', np.may_share_memory(a,d))
print('a-e share mem:', np.may_share_memory(a,e))

# output：
a: 0x7f17bc63d440
b: 0x7f17bc63d440
c: 0x7f1783512990
d: 0x7f17835129e0
e: 0x7f1783512a80
a-b share mem: True
a-c share mem: True
a-d share mem: True
a-e share mem: False

或者ndarray.base来查看当前数值是否是属于另一个数组的view：

print(a.base)
print(b.base)
print(c.base is a)
print(d.base is a)
print(e.base is a)
print(e.base)

# output：
None
None
True
True
False
None

关于这部分的内容具体可以查看numpy quickstart tutorial。

关于 contiguous

前面提到的很多索引方式都会产生view，但对于 Advanced Indexing 来说通常是直接产生copy。首先参考NumPy C Code Explanations中提到的数据访问方式:

One fundamental aspect of the ndarray is that an array is seen as a “chunk” of memory starting at some location. The interpretation of this memory depends on the stride information. For each dimension in an N -dimensional array, an integer (stride) dictates how many bytes must be skipped to get to the next element in that dimension. Unless you have a single-segment array, this stride information must be consulted when traversing through an array.

每次访问元素只需要通过控制指针的步长来实现，以此达到逻辑顺序和内存顺序的一致性，保证了一定的效率。但高级索引方式，例如：

x = np.array([[ 0,  1,  2],
              [ 3,  4,  5],
              [ 6,  7,  8],
              [ 9, 10, 11]])

basic_idx = x[1:2,0:2]

rows = (x.sum(-1) % 2) == 0
columns = [0, 2]
adv_idx = x[np.ix_(rows, columns)]

print(basic_idx.base is x)
print(adv_idx.base is x)

# output：
True
False

高级索引产生的结果由于逻辑顺序发生了变化，为了保证效率，高级索引会将数据存储到新的内存地址，保持元素在内存上也是相邻的。这也是为什么neural network中，如果对于矩阵进行过转置操作，都需要经过contiguous()才能进行后续的计算。

weixin_39517357

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
numpy的深复制与浅复制的区别_Numpy 的 View 和 Copy

最近在做一些视觉的问题，接触到了ADE20K这个数据集。其中里面有一个索引文件 index_ade20k.mat 储存了图像数据和标签的对应关系。于是去参考github上关于这个数据集的读取方式，对于MATLAB文件python可以使用 scipy.io.matload 直接读取得到 np.ndarray 对象，随后就看到了一个之前没有见过的操作：index 即循环中的[()]，所以就关于这个操作...
复制链接

扫一扫