Redis diskstore

Saw this after getting around the firewall, so reposting it here!

 

Source: http://groups.google.com/group/redis-db/browse_thread/thread/d444bc786689bde9#

 

A few months after VM started to work, my feeling about it started to
be not very good. I stated in the blog, and privately on IRC,
especially talking with Pieter, that VM was not the way to go for the
future of Redis, and that the new path we were taking, about using less
memory, was a much better approach, together with cluster.

However there are a number of different models for dealing with
datasets bigger than RAM for Redis. Just to cite a few:

1) virtual memory, where we swap values on disk as needed (The Redis
Virtual Memory way)
2) storing data on disk, in a complex form so that operations can be
implemented directly in the on-disk representation, and using the OS
cache as a cache layer for the working set (let's call it the Mongo DB
way)
3) storing data on disk, but not for direct manipulation, and using
memory as a cache of objects that are active, flushing writes to disk
when these objects change.

It is now clear that VM is not the right set of tradeoffs. It was
designed to be pretty fast, but on the other hand there was too big a
price to pay for all the rest: slow restarts, slow saving, and in turn
slow replication, very complex code, and so forth.

If you want pure speed with Redis, in memory is the way to go. So as a
reaction to the email sent by Tim about his unhappiness with VM, I used
a few vacation days to start implementing a new model, that is, what was
listed above as number "3".

The new set of tradeoffs is very different. The result is called
diskstore, and this is how it works, in a few easy-to-digest points.

- In diskstore key-value pairs are stored on disk.
- Memory works as a cache for live objects. Operations are only
performed on in-memory keys, so data on disk does not need to be
stored in complex forms.
- The cache-max-memory limit is strict. Redis will never use more RAM,
even if we have 2 MB of max memory and 1 billion keys. This works
since we no longer need to keep all keys in memory.
- Data is flushed to disk asynchronously. If a key is marked as dirty,
an IO operation is scheduled for this key.
- You can control the delay between modifications of keys and disk
writes, so that if a key is modified many times in a short period, it
will be written to disk only once (see the C sketch right after this
list).
- Setting the delay to 0 means: sync it as fast as possible.
- All I/O is performed by a single dedicated thread, that is
long-running and not spawned on demand. The thread is awakened with a
condition variable.
- The system is much simpler and saner than the VM implementation, as
there is no need to "undo" operations on race conditions.
- Zero start-up time... as objects are loaded on demand.
- There is negative caching. If a key is not on disk we remember it
(if there is memory to do so). So we avoid accessing the disk again
and again for keys that are not there.
- The system is very fast if we access mostly our working set, and
this working set happens to fit memory. Otherwise the system is much
slower (I/O bound).
- The system does not support BGSAVE currently, but it will, and what
is cool, with minimal overhead and memory used in the saving child, as
data on disk is already written using the same serialization format as
.rdb files. So our child will just copy files to obtain the .rdb. In
the meantime the objects in cache are not flushed, so the system may
use more memory, but unlike copy-on-write it will use very, very
little additional memory.
- Persistence is *PER KEY*: this means there is no point-in-time persistence.
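
To make the caching layer a bit more concrete, here is a minimal sketch
in C of the idea behind the points above: a dirty key gets one scheduled
IO operation, repeated writes to the same key within the flush delay
collapse into a single disk write, and one long-running thread does all
the IO, woken with a condition variable. All the names here (io_job,
mark_dirty, serialize_to_disk, flush_delay) are hypothetical and only
illustrate the design; the real logic lives in dscache.c.

/* Sketch only: hypothetical names, not the actual dscache.c code. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define MAX_JOBS 1024

typedef struct {
    char key[64];
    time_t scheduled;   /* when the key was marked dirty */
} io_job;

static io_job queue[MAX_JOBS];
static int queue_len = 0;
static int flush_delay = 2; /* seconds; 0 would mean write as soon as possible */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Called when a command modifies a key. If the key is already scheduled
 * we do nothing: this is how many writes to the same key collapse into a
 * single disk write. */
void mark_dirty(const char *key) {
    pthread_mutex_lock(&lock);
    for (int i = 0; i < queue_len; i++) {
        if (strcmp(queue[i].key, key) == 0) {   /* already scheduled */
            pthread_mutex_unlock(&lock);
            return;
        }
    }
    if (queue_len < MAX_JOBS) {
        snprintf(queue[queue_len].key, sizeof(queue[queue_len].key), "%s", key);
        queue[queue_len].scheduled = time(NULL);
        queue_len++;
        pthread_cond_signal(&cond);             /* wake the IO thread */
    }
    pthread_mutex_unlock(&lock);
}

/* Stand-in for serializing the value in .rdb format to its file. */
void serialize_to_disk(const char *key) {
    printf("flushing key '%s' to disk\n", key);
}

/* The single, long-running IO thread: sleeps until work arrives, then
 * writes every key whose flush delay has expired. */
void *io_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (queue_len == 0)
            pthread_cond_wait(&cond, &lock);    /* wait for dirty keys */
        time_t now = time(NULL);
        for (int i = 0; i < queue_len; ) {
            if (now - queue[i].scheduled >= flush_delay) {
                serialize_to_disk(queue[i].key);
                queue[i] = queue[--queue_len];  /* remove the completed job */
            } else {
                i++;
            }
        }
        pthread_mutex_unlock(&lock);
        sleep(1);                               /* re-check pending delays */
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, io_thread, NULL);
    mark_dirty("a");
    mark_dirty("b");
    mark_dirty("a");          /* coalesced with the first mark of "a" */
    sleep(flush_delay + 2);   /* give the IO thread time to flush */
    return 0;
}

The duplicate check in mark_dirty is what implements the write
coalescing: two writes to "a" within the flush delay leave a single
pending job for "a".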

I think that the above points may give you an idea about how it works.
But let me stress the per-key persistence point a bit.

LPUSH a 0
LPUSH b 1
LPUSH a 2

So after these commands we may have two scheduled IO operations
pending, one for "a" and one for "b".
Now imagine "a" is saved, and then the server goes down, Redis is
brutally killed, or the like. The database will contain a consistent
version of "a" and "b", but the version of "b" will be old, without
the "1" pushed.

Also, currently MULTI/EXEC is not transactional, but this will be
fixed: at least inside a MULTI/EXEC there will be a guarantee that
either all values or none will be synced to disk (this will be
obtained using a journal file for transactions).
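
To make the current limitation concrete, consider a hypothetical
sequence like this:

MULTI
LPUSH a 0
LPUSH b 1
EXEC

If the server crashes after "a" has been flushed but before "b", the
disk will contain the effect of the first LPUSH and not the second,
even though both were queued in the same MULTI/EXEC. The journal file
mentioned above is what will make the sync all-or-nothing.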

Some more details. The system is composed of two layers:

diskstore.c -- implements a trivial on disk key-value store
dscache.c -- implements the more complex caching layer

diskstore.c is currently a filesystem-based KV store. This can be
replaced with a B-TREE or something like that in the future if needed.
However, even if the current implementation has a big overhead, it's
pretty cool to have data as files, with very little chance of losing
data or corruption (rename is used for writes). But well, if this does
not scale well enough we'll drop it and replace it with something
better.

The current implementation is similar to bigdis. 256 directories
containing 256 directories each are used, for a total of 65536
directories. Every key is put inside the dir addressed by SHA1(key)
translated to hex; for instance key "foo" is at:

   /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
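
As a rough illustration (this is not the actual diskstore.c code: the
function names are made up, and directory creation, fsync and most
error handling are omitted), here is how the SHA1-based path and the
rename-for-writes trick fit together, using OpenSSL's SHA1:

/* Sketch only: hypothetical diskstore_path/diskstore_write helpers. */
#include <openssl/sha.h>
#include <stdio.h>
#include <string.h>

/* Build "<root>/ab/cd/abcd..." from SHA1(key) in hex. Assumes the 65536
 * directories already exist. */
void diskstore_path(const char *root, const char *key, char *path, size_t len) {
    unsigned char digest[SHA_DIGEST_LENGTH];
    char hex[SHA_DIGEST_LENGTH*2+1];

    SHA1((const unsigned char *)key, strlen(key), digest);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(hex+i*2, "%02x", digest[i]);
    snprintf(path, len, "%s/%c%c/%c%c/%s", root,
             hex[0], hex[1], hex[2], hex[3], hex);
}

/* Write a serialized value: temp file + rename, so a crash never leaves
 * a half-written key behind. */
int diskstore_write(const char *root, const char *key,
                    const void *buf, size_t buflen) {
    char path[1024], tmp[1100];

    diskstore_path(root, key, path, sizeof(path));
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    FILE *fp = fopen(tmp, "wb");
    if (!fp) return -1;
    if (fwrite(buf, 1, buflen, fp) != buflen) { fclose(fp); return -1; }
    fclose(fp);
    return rename(tmp, path);   /* atomically replace the old version */
}

int main(void) {
    char path[1024];
    diskstore_path("/var/redis/diskstore", "foo", path, sizeof(path));
    printf("%s\n", path);
    /* prints: /var/redis/diskstore/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 */
    return 0;
}

Because rename() replaces the destination atomically on the same
filesystem, a crash in the middle of a write leaves either the old
complete version of the key or the new one, never a truncated file.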

The cool thing is, diskstore.c exports a trivial interface to Redis,
so it's very simple to replace it with something else without touching
too many internals.

Stability: the system is obviously in alpha stage, however it works
pretty well, without obvious crashes. But warning, it will crash with
an assert if you try to BGSAVE.

To try it, download the "unstable" branch and edit redis.conf to enable
diskstore. Play with it. Enjoy a Redis instance that starts in no time
even when it's full of data :)
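
For reference, a diskstore configuration in redis.conf looks roughly
like the snippet below. Take the exact option names from the commented
examples shipped in the unstable branch's redis.conf; apart from
cache-max-memory, which is mentioned above, the names and values here
are an approximation:

# enable the diskstore backend instead of keeping the whole dataset in RAM
diskstore-enabled yes
# directory holding the 256x256 key directories described above
diskstore-path /var/redis/diskstore
# strict RAM limit for the object cache
cache-max-memory 100mb
# seconds between dirtying a key and writing it to disk (0 = as soon as possible)
cache-flush-delay 5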

Feedback is really, really appreciated here. I want to know what you
think, what your impressions are on the design, tradeoffs, and so
forth, and how it feels when you experiment with it. If you want to see
the inner workings, set the log level to "debug".

The goal is to ship 2.4 ASAP with VM replaced with a good
implementation of diskstore.

Cheers,
Salvatore

 

-------------------------------------- (separator) ------------------------------------------------------

 

RDB/AOF were proposed mainly to solve the server crash problem; VM/diskstore were proposed to solve the memory space problem.

 

Also, from the comments on the Google group it looks like diskstore will become the way to solve both the server crash and the memory space problems. Doesn't that turn it into the mem$d + MySQL model? In other words, this design becomes a product we built before, Schooner mem$d.

 
