Git-pandas caching for faster analysis

cumei1658

Original post: https://www.pybloggers.com/2017/07/git-pandas-caching-for-faster-analysis/

Git-pandas is a Python library I wrote to make analysis of git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but those can be kind of slow, especially with many large repositories. In the past we've worked around that by running analyses offline and by sampling, but really most of the work is repeated from run to run.

Enter caching. There are a few places in the codebase where we can cache result sets by revision key and get pretty significant performance boosts when using the library in something like gitnoc. And it turns out, it's pretty straightforward.

Currently in the develop branch, we've got a new module with a custom Python decorator to handle caching by different mechanisms:


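import functools

# EphemeralCache, RedisDFCache and CacheMissException are defined elsewhere in the caching module.


def multicache(key_prefix, key_list, skip_if=None):
    """Cache the result of a Repository method in self.cache_backend."""
    def multicache_nest(func):
        @functools.wraps(func)
        def deco(self, *args, **kwargs):
            # No backend configured: call straight through
            if self.cache_backend is None:
                return func(self, *args, **kwargs)

            # Let the caller opt out of caching for certain kwargs (e.g. rev='HEAD')
            if skip_if is not None and skip_if(kwargs):
                return func(self, *args, **kwargs)

            # Only the known backend types are accepted
            if not isinstance(self.cache_backend, (EphemeralCache, RedisDFCache)):
                raise ValueError('Unknown cache backend type')

            # Key: prefix, repository name, then the selected kwarg values
            key = key_prefix + self.repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])
            try:
                return self.cache_backend.get(key)
            except CacheMissException:
                # Cache miss: compute the result, store it, and return it
                ret = func(self, *args, **kwargs)
                self.cache_backend.set(key, ret)
                return ret

        return deco
    return multicache_nest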

It looks pretty convoluted, but ends up being pretty useful. It creates a decorator that we can use on any method in the Repository class. The backend is read from the instance's cache_backend attribute (currently we have in-memory ephemeral and Redis-based options), and the decorator itself takes a key_prefix to use, a list of kwarg keys to include in the cache key, and optionally a function (typically a lambda) that is applied to the kwargs and returns whether to skip caching.

The lambda is particularly useful for cases like not wanting to cache the results for rev='HEAD', since that can change from moment to moment.
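To make the skip behavior concrete, here is a minimal, self-contained sketch. It is not the git-pandas code itself: `DictCache`, `FakeRepo`, and the `blame` method are hypothetical stand-ins invented for illustration, and the decorator is a simplified copy of the one above without the backend type check.

```python
import functools


class CacheMissException(Exception):
    """Raised by the stand-in cache when a key is absent."""


class DictCache:
    """Hypothetical stand-in for a cache backend: get() raises on a miss."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value


def multicache(key_prefix, key_list, skip_if=None):
    """Simplified copy of the decorator above (no backend type check)."""
    def multicache_nest(func):
        @functools.wraps(func)
        def deco(self, *args, **kwargs):
            if self.cache_backend is None:
                return func(self, *args, **kwargs)
            if skip_if is not None and skip_if(kwargs):
                return func(self, *args, **kwargs)
            key = key_prefix + self.repo_name + '_'.join(
                str(kwargs.get(k)) for k in key_list)
            try:
                return self.cache_backend.get(key)
            except CacheMissException:
                ret = func(self, *args, **kwargs)
                self.cache_backend.set(key, ret)
                return ret
        return deco
    return multicache_nest


class FakeRepo:
    """Hypothetical repository; `calls` counts the real computations."""

    def __init__(self, cache_backend=None):
        self.repo_name = 'demo'
        self.cache_backend = cache_backend
        self.calls = 0

    @multicache(key_prefix='blame_', key_list=['rev'],
                skip_if=lambda kw: kw.get('rev') == 'HEAD')
    def blame(self, rev='HEAD'):
        self.calls += 1
        return 'blame for ' + rev


repo = FakeRepo(cache_backend=DictCache())
repo.blame(rev='abc123')   # miss: computed and stored
repo.blame(rev='abc123')   # hit: served from the cache
repo.blame(rev='HEAD')     # skipped: computed
repo.blame(rev='HEAD')     # skipped again: HEAD is never cached
print(repo.calls)          # 3
```

One thing this sketch makes visible: both the cache key and the skip check only look at kwargs, so arguments have to be passed by keyword. Calling `repo.blame()` with no arguments would not trigger the `rev == 'HEAD'` skip, because `rev` never appears in kwargs.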

Each of the two caching backends implements your basic get/set/purge functionality, and lets you set a maximum number of keys to have something like an LRU cache.
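As a rough idea of what such a backend can look like, here is a minimal in-memory sketch with get/set/purge and a key limit. The class name `TinyLRUCache` and the `max_keys` parameter are hypothetical, and this is not the actual git-pandas implementation; it only illustrates the LRU-style eviction described above.

```python
from collections import OrderedDict


class CacheMissException(Exception):
    """Raised when a key is not in the cache."""


class TinyLRUCache:
    """Hypothetical in-memory cache with get/set/purge and a key limit."""

    def __init__(self, max_keys=1000):
        self._max_keys = max_keys
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        self._store.move_to_end(key)          # mark as most recently used
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self._max_keys:
            self._store.popitem(last=False)   # evict the least recently used key

    def purge(self):
        self._store.clear()


cache = TinyLRUCache(max_keys=2)
cache.set('a', 1)
cache.set('b', 2)
cache.get('a')             # 'a' is now the most recently used key
cache.set('c', 3)          # over the limit, so 'b' is evicted
print(list(cache._store))  # ['a', 'c']
```

Raising an exception on a miss, rather than returning None, is what lets a decorator like the one above tell "not cached" apart from a cached value that happens to be None.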

One interesting nugget from the Redis cache: the objects we cache in git-pandas are always pandas DataFrames. To store those in Redis we can serialize and deserialize them with:


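# self._cache is a connection to redis; k is the cache key and v is a DataFrame
# Note: to_msgpack/read_msgpack were removed in pandas 1.0, so this needs an older pandas
# Serialize with zlib compression and store with an expiry of self.ttl seconds
self._cache.set(k, v.to_msgpack(compress='zlib'), ex=self.ttl)
# Fetch the bytes back and deserialize them into a DataFrame
df = pd.read_msgpack(self._cache.get(k))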
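The msgpack methods were deprecated in pandas 0.25 and removed in 1.0, so on a current pandas a different serializer is needed. One possible substitute, sketched here as an assumption rather than what git-pandas does, is pickle plus zlib. The helper names are hypothetical, a plain dict stands in for the Redis connection, and a dict of lists stands in for the DataFrame so the example runs without pandas; DataFrames are picklable, so the same two helpers should work on them.

```python
import pickle
import zlib


def dumps_compressed(obj):
    """Pickle an object and compress the resulting bytes with zlib."""
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def loads_compressed(data):
    """Reverse dumps_compressed: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(data))


fake_redis = {}   # stand-in for the Redis connection (assumption)
value = {'author': ['alice', 'bob'], 'loc': [10, 20]}   # stand-in for a DataFrame

fake_redis['blame_demo_abc123'] = dumps_compressed(value)
restored = loads_compressed(fake_redis['blame_demo_abc123'])
print(restored == value)   # True
```

Since unpickling can execute arbitrary code, this only makes sense when the Redis instance holds data you wrote yourself.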

It's still being tested, and is probably one of the last things we will cram in before releasing git-pandas 2.0.0, so check out the repository over on GitHub, try it out, and let me know what you think.
