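[doc] joblib: On-demand recomputation: the Memory class

Use case

The Memory class defines a context for lazy evaluation of functions. It caches results to disk so that the same computation is never run twice. It is designed to work with non-hashable and potentially large input and output data such as numpy arrays.

A simple example:

First, we create a temporary directory for the cache: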

>>> from tempfile import mkdtemp
>>> cachedir = mkdtemp()

Instantiate a memory context that uses this cache directory:

>>> from joblib import Memory
>>> memory = Memory(cachedir=cachedir, verbose=0)

Then decorate a function so that it is cached in this context:

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x

When we call this function twice with the same argument, it is not executed the second time; its output is loaded from the pickle file:

>>> print(f(1))
Running f(1)
1
>>> print(f(1))
1

However, if we call it with a different argument, the output is recomputed:

>>> print(f(2))
Running f(2)
2

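Comparison with `memoize`

The memoize decorator (http://code.activestate.com/recipes/52201/) caches the inputs and outputs of function calls in memory, so it avoids running the same function twice at the cost of a very small overhead. However, that overhead becomes large for big objects. It also does not support numpy arrays. Finally, memoize does not persist outputs to disk, so with large objects it consumes a lot of memory. Memory, in contrast, saves outputs to disk with a well-optimized persistence method (joblib.dump()).

In short, memoize suits small inputs and outputs, whereas Memory suits complex inputs and outputs and aggressively persists outputs to disk.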
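
To make the comparison concrete, here is a minimal sketch of a dictionary-based memoize decorator. It is not the ActiveState recipe verbatim, and the names memoize, square and total are illustrative only. It shows that such a decorator works when the arguments are hashable and fails on unhashable inputs such as lists (numpy arrays are unhashable in the same way):

```python
import functools


def memoize(func):
    """Cache results in a dict keyed by the positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # The dict lookup hashes `args`, so every argument must be hashable.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []


@memoize
def square(x):
    calls.append(x)  # record each real execution
    return x * x


square(3)
square(3)  # served from the in-memory dict; the body is not run again


@memoize
def total(values):
    return sum(values)


try:
    total([1, 2, 3])  # a list is unhashable, so the lookup itself fails
    unhashable_ok = True
except TypeError:
    unhashable_ok = False
```

Everything in this sketch lives in process memory, so the cache is lost when the interpreter exits.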

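Caching numpy arrays

The original motivation for Memory was to provide a memoize-like pattern that works on numpy arrays. Memory checks whether its input arguments have already been computed by comparing their hashes.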
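
The idea of keying a disk cache on a hash of the inputs can be sketched with the standard library alone. This is an illustration under stated assumptions, not joblib's implementation: the decorator disk_cache and the function scale are hypothetical names, and hashing the pickled arguments is a simplification that assumes the arguments pickle to the same bytes every time.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(cache_dir):
    """Return a decorator that stores results in cache_dir, one file per input hash."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            # Hash the pickled arguments; this also works for unhashable inputs.
            payload = pickle.dumps((args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            path = os.path.join(cache_dir, func.__name__ + "-" + key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as fh:
                    return pickle.load(fh)  # cache hit: func is not executed
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


executions = []
cache_dir = tempfile.mkdtemp()


@disk_cache(cache_dir)
def scale(values, factor=2):
    executions.append(list(values))  # record each real execution
    return [v * factor for v in values]


first = scale([1, 2, 3])
second = scale([1, 2, 3])  # same input hash: loaded from disk
third = scale([1, 2, 4])   # different input hash: recomputed
```

Only two executions are recorded here, because the second call finds a file whose name matches the hash of its arguments.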

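An example

We define two functions: the first takes a number and returns an array, which is then passed as the argument to the second. We decorate both functions with Memory.cache: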

>>> import numpy as np

>>> @memory.cache
... def g(x):
...     print('A long-running calculation, with parameter %s' % x)
...     return np.hamming(x)

>>> @memory.cache
... def h(x):
...     print('A second long-running calculation, using g(x)')
...     return np.vander(x)

If we call h again with the same argument, it is not re-run:

>>> a = g(3)
A long-running calculation, with parameter 3
>>> a
array([ 0.08,  1.  ,  0.08])
>>> g(3)
array([ 0.08,  1.  ,  0.08])
>>> b = h(a)
A second long-running calculation, using g(x)
>>> b2 = h(a)
>>> b2
array([[ 0.0064,  0.08  ,  1.    ],
       [ 1.    ,  1.    ,  1.    ],
       [ 0.0064,  0.08  ,  1.    ]])
>>> np.allclose(b, b2)
True

Using `memmapping`

To speed up cache lookups of large numpy arrays, you can load them using memmapping (memory mapping):

>>> cachedir2 = mkdtemp()
>>> memory2 = Memory(cachedir=cachedir2, mmap_mode='r')
>>> square = memory2.cache(np.square)
>>> a = np.vander(np.arange(3)).astype(float)
>>> square(a)
________________________________________________________________________________
[Memory] Calling square...
square(array([[ 0.,  0.,  1.],
       [ 1.,  1.,  1.],
       [ 4.,  2.,  1.]]))
___________________________________________________________square - 0.0s, 0.0min
memmap([[  0.,   0.,   1.],
       [  1.,   1.,   1.],
       [ 16.,   4.,   1.]])

Note
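Notice that debug mode is used in the example above. It traces which calls are executed and how much time they take.

If the square function is called again with the same argument, its return value is loaded from disk using memmapping: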

>>> res = square(a)
>>> print(repr(res))
memmap([[  0.,   0.,   1.],
       [  1.,   1.,   1.],
       [ 16.,   4.,   1.]])

Note
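If the memory-mapping mode is 'r', as in the example above, the array is read-only.

On the other hand, using 'r+' or 'w+' lets you modify the array, but the modifications are propagated to disk, which corrupts the cache. If you want to modify the array in memory, we suggest the 'c' mode: copy on write.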
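
The difference between a read-only mapping and a copy-on-write mapping can be seen with the standard-library mmap module. This is only an analogy for the 'r' and 'c' modes described above, not a joblib or numpy call; the file content and variable names are illustrative.

```python
import mmap
import os
import tempfile

# Create a small file that plays the role of a cached result on disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"cached")

with open(path, "rb") as fh:
    # Read-only map, comparable to mode 'r': writes are rejected.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as ro:
        try:
            ro[0:1] = b"X"
            read_only_rejected = False
        except TypeError:
            read_only_rejected = True

    # Copy-on-write map, comparable to mode 'c': writes stay in memory.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_COPY) as cow:
        cow[0:1] = b"X"
        in_memory = bytes(cow[:])

# The file on disk is untouched by the copy-on-write modification.
with open(path, "rb") as fh:
    on_disk = fh.read()

os.remove(path)
```

The copy-on-write map sees the modified bytes while the file keeps its original content, which is why that mode cannot corrupt a cache.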

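Shelving: using references to cached values

Sometimes we do not need the result itself, only a reference to the cached result. A typical example is when many large numpy arrays must be sent to workers: instead of sending the data itself over the network, send a reference to the joblib cache and let the workers read the data from a network filesystem, potentially taking advantage of some system-level caching.

A reference to a cached result is obtained with the call_and_shelve method of the wrapped function: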

>>> result = g.call_and_shelve(4)
A long-running calculation, with parameter 4
>>> result  
MemorizedResult(cachedir="...", func="g...", argument_hash="...")

Once g has been computed, its output is saved to disk and deleted from memory. The associated value can be read back later with the get method:

>>> result.get()
array([ 0.08,  0.77,  0.77,  0.08])

The cached result can be deleted with the clear method, which removes it from disk. Any later call to get raises a KeyError exception:

>>> result.clear()
>>> result.get()  
Traceback (most recent call last):
    ...
KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist'

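A MemorizedResult instance carries everything needed to read the cached result. Its printed representation (repr) can even be copy-pasted into another Python interpreter.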
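
The shelving pattern (compute once, keep only a lightweight reference, read the value back on demand) can be sketched as follows. The class ShelvedResult and this standalone call_and_shelve function are hypothetical stand-ins written with the standard library; they are not joblib's MemorizedResult or its method.

```python
import os
import pickle
import tempfile


class ShelvedResult:
    """Hypothetical lightweight reference to a value pickled on disk."""

    def __init__(self, path):
        self.path = path

    def get(self):
        # Read the stored value back, or fail if it has been cleared.
        if not os.path.exists(self.path):
            raise KeyError("Non-existing cache value: %s" % self.path)
        with open(self.path, "rb") as fh:
            return pickle.load(fh)

    def clear(self):
        # Delete the stored value from disk.
        if os.path.exists(self.path):
            os.remove(self.path)


def call_and_shelve(func, *args):
    """Run func, store its output on disk, and return only a reference."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(func(*args), fh)
    return ShelvedResult(path)


ref = call_and_shelve(sum, [1, 2, 3])
value = ref.get()                # reads the stored output back: 6
clone = ShelvedResult(ref.path)  # the path alone is enough to rebuild a reference
clone_value = clone.get()

ref.clear()
try:
    ref.get()
    cleared = False
except KeyError:
    cleared = True
```

Because the reference holds only a path, it is cheap to send to another process that can reach the same filesystem.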

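Shelving: when the cache is disabled

When caching is disabled (e.g. Memory(cachedir=None)), the call_and_shelve method returns a NotMemorizedResult instance, which stores the full function output instead of just a reference (since there is nothing to point to). All the methods mentioned above remain valid, except for the copy-paste feature.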

Gotchas

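Across sessions, function caches are identified by the function's name (func.__name__). Thus, if you cache two functions with the same name, they overwrite each other (a 'name collision'), which causes unnecessary re-runs: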

>>> @memory.cache
... def func(x):
...     print('Running func(%s)' % x)

>>> func2 = func

>>> @memory.cache
... def func(x):
...     print('Running a different func(%s)' % x)

As long as you stay in the same interpreter session, there are no collisions (in joblib 0.8 and above), although joblib warns you that this is dangerous:

>>> func(1)
Running a different func(1)

>>> func2(1)  
memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
Running func(1)

>>> func(1) # No recomputation so far
>>> func2(1) # No recomputation so far

But if you exit the interpreter and restart it, Memory cannot tell the two functions apart, and they are re-run:

>>> func(1) 
memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (<doctest memory.rst>:...) and 'func' (<doctest memory.rst>:...)
Running a different func(1)
>>> func2(1)  
Running func(1)

As long as you stay in the same session, you are not getting needless recomputation:

>>> func(1) # No recomputation now
>>> func2(1) # No recomputation now
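
Why two functions collide can be seen without any cache at all. In this sketch the helper slot_for is a hypothetical stand-in for "identify a function by its name", following the description above; it is not a joblib function.

```python
def slot_for(function):
    """Hypothetical cache slot: a function is identified by its name only."""
    return function.__name__


def func(x):
    return "first body"


func2 = func  # keep a handle on the first definition


def func(x):
    return "second body"


# Both objects map to the same slot, although their code differs.
same_slot = slot_for(func) == slot_for(func2)
different_results = func(1) != func2(1)
```

Giving the two functions distinct names avoids sharing a slot in this sketch.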

lambda functions

Beware that in Python 2.6 every lambda function is named <lambda>, so lambdas cannot be told apart by name:

>>> def my_print(x):
...     print(x)

>>> f = memory.cache(lambda : my_print(1))
>>> g = memory.cache(lambda : my_print(2))

>>> f()
1
>>> f()
>>> g() 
memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function '<lambda>'
2
>>> g() 
>>> f() 
1
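
The shared name is easy to check directly. The following short sketch uses illustrative names and no cache; it shows that two different lambdas report the same __name__, while functions defined with def carry their own names.

```python
first = lambda: 1
second = lambda: 2

# Both lambdas report the same name, so a name-based lookup cannot separate them.
lambda_names = {first.__name__, second.__name__}


def return_one():
    return 1


def return_two():
    return 2


# Functions created with def keep distinct names.
def_names = {return_one.__name__, return_two.__name__}
```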

memory cannot cache some complex objects, such as callable objects.

However, numpy ufuncs are cached correctly:

>>> sin = memory.cache(np.sin)
>>> print(sin(0))
0.0

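Caching methods: you cannot decorate a method at class definition, because the first argument (self) is only bound when the class is instantiated, and it is not passed to the Memory object. The following code therefore does not work:

class Foo(object):

    @memory.cache  # WRONG
    def method(self, args):
        pass

The right way is to decorate at instantiation time:

class Foo(object):

    def __init__(self, args):
        # Wrap the bound method once the instance (self) exists.
        self.method = memory.cache(self.method)

    def method(self, args):
        pass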
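
The decorate-at-instantiation pattern can be tried without a cache directory. In this sketch functools.lru_cache stands in for memory.cache (an assumption made only so the example is self-contained), and the runs counter is an illustrative addition used to observe whether the method body executes.

```python
import functools


class Foo(object):

    def __init__(self):
        self.runs = 0
        # Wrap the bound method per instance; lru_cache stands in for memory.cache.
        self.method = functools.lru_cache(maxsize=None)(self.method)

    def method(self, x):
        self.runs += 1  # count real executions
        return x * 2


foo = Foo()
first = foo.method(21)
second = foo.method(21)  # cached: the body does not run again
```

The wrapper is stored as an instance attribute, so each instance gets its own cached version of the method.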

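Ignoring some arguments

Sometimes we do not want a change in certain arguments, such as a debug flag, to trigger a recomputation. Memory provides an ignore list for this purpose: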

>>> @memory.cache(ignore=['debug'])
... def my_func(x, debug=True):
...     print('Called with x = %s' % x)
>>> my_func(0)
Called with x = 0
>>> my_func(0, debug=False)
>>> my_func(0, debug=True)
>>> # my_func was not reevaluated
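
One way to picture the ignore list is as a filter applied when the cache key is built. The helper cache_key below is hypothetical and written with the standard library; it is a sketch of the idea, not joblib's implementation.

```python
import inspect


def cache_key(func, args, kwargs, ignore=()):
    """Build a hashable key from the call arguments, skipping ignored names."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in defaults so omitted arguments are included
    return tuple(
        (name, value)
        for name, value in bound.arguments.items()
        if name not in ignore
    )


def my_func(x, debug=True):
    return x


key_default = cache_key(my_func, (0,), {}, ignore=["debug"])
key_debug_off = cache_key(my_func, (0,), {"debug": False}, ignore=["debug"])
key_other_x = cache_key(my_func, (1,), {}, ignore=["debug"])
key_not_ignored = cache_key(my_func, (0,), {"debug": False})
```

Calls that differ only in debug produce the same key when debug is ignored, so they would share one cache entry; without the ignore list the keys differ.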