OpenStack Swift eventual consistency analysis & bottlenecks


Swift is the software behind the OpenStack Object Storage service.

This service provides a simple storage service for applications using RESTful interfaces, providing maximum data availability and storage capacity. I explain here how some parts of the storage and replication in Swift work, and show some of its current limitations. If you don't know Swift and want to read a more "shallow" overview first, you can read John Dickinson's Swift Tech Overview.

How Swift storage works

If we refer to the CAP theorem, Swift chose availability and partition tolerance and dropped consistency. That means that you'll always get your data, it will be dispersed across many places, but you could get an old version of it (or no data at all) in some odd cases (like a server overload or failure). This compromise is made to allow maximum availability and scalability of the storage platform.

But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.

The official Swift documentation explains the internal storage in a certain way, but I'm going to write my own explanation here about this.

Consistent hashing

Swift uses the principle of consistent hashing. It builds what it calls a ring. A ring represents the space of all possible computed hash values, divided into equivalent parts. Each part of this space is called a partition.

The following schema (stolen from the Riak project) shows the principle nicely:

Consistent hashing ring

In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space in 4. You would have 4 partitions, and computing hash(object) modulo 4 would tell you where to store your object: on node 0, 1, 2 or 3.

But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let's say we're going to build 2¹⁰ partitions. Since we have 4 nodes, each node will have 2¹⁰ ÷ 4 = 256 partitions. If we ever want to add a 5th node, it's easy: we just have to re-balance the partitions and move 1⁄5 of the partitions from each node to this 5th node. That means all our nodes will end up with 2¹⁰ ÷ 5 ≈ 205 partitions. We can also define a weight for each node, in order for some nodes to get more partitions than others.

With 2¹⁰ partitions, we can have up to 2¹⁰ nodes in our cluster. Yeepee.
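As an illustration, here is a minimal sketch (in Python, not Swift's actual implementation) of mapping a name to one of 2¹⁰ partitions by keeping only the top bits of an MD5 hash; the partition power and the example name are assumptions for the sketch:

import hashlib

PART_POWER = 10   # 2**10 = 1024 partitions (assumption for this example)

def partition_for(name):
    """Map a name to a partition number in [0, 2**PART_POWER - 1]."""
    digest = hashlib.md5(name.encode('utf-8')).digest()
    return int.from_bytes(digest, 'big') >> (128 - PART_POWER)

print(partition_for('/AUTH_bob/photos/cat.jpg'))   # some value in [0, 1023], stable for a given name

Adding a node then only requires reassigning some partition numbers to the new node; the hash-to-partition mapping itself never changes.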

For reference, Gregory Holt, one of the Swift authors, also wrote an explanation post about the ring.

Concretely, when building one Swift ring, you'll have to say how many partitions you want, and this is what that value is really about.

Data duplication

Now, to ensure availability and partition tolerance (as seen in the CAP theorem), we also want to store replicas of our objects. By default, Swift stores 3 copies of every object, but that's configurable.

In that case, we need to store each partition defined above not on 1 node only, but on 2 others as well. So Swift adds another concept: zones. A zone is an isolated space that does not depend on other zones, so in case of an outage in one zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It's up to you to choose anyway.

Consequently, each partition no longer has to be mapped to 1 host only, but to N hosts. Each node will therefore store this number of partitions:

number of partitions stored on one node = number of replicas × total number of partitions ÷ number of nodes


• We split the ring in 2¹⁰ = 1024 partitions. We have 3 nodes. We want 3 replicas of data.
→ Each node will store a copy of the full partition space: 3 × 2¹⁰ ÷ 3 = 2¹⁰ = 1024 partitions.
• We split the ring in 2¹¹ = 2048 partitions. We have 5 nodes. We want 3 replicas of data.
→ Each node will store 2¹¹ × 3 ÷ 5 ≈ 1229 partitions.
• We split the ring in 2¹¹ = 2048 partitions. We have 6 nodes. We want 3 replicas of data.
→ Each node will store 2¹¹ × 3 ÷ 6 = 1024 partitions.
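These three examples can be checked directly with the formula above; a trivial sketch:

def partitions_per_node(part_power, replicas, nodes):
    """Average number of partitions stored on each node."""
    return replicas * 2 ** part_power / nodes

print(partitions_per_node(10, 3, 3))   # 1024.0
print(partitions_per_node(11, 3, 5))   # 1228.8, i.e. roughly 1229 partitions
print(partitions_per_node(11, 3, 6))   # 1024.0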

Three rings to rule them all

In Swift, there are 3 categories of things to store: accounts, containers and objects.

An account is what you'd expect it to be, a user account. An account contains containers (the equivalent of Amazon S3's buckets). Each container can contain user-defined keys and values (just like a hash table or a dictionary): the values are what Swift calls objects.

Swift wants you to build 3 different and independent rings to store its 3 kinds of things (accounts, containers and objects).

Internally, the first two categories are stored as SQLite databases, whereas the last one is stored using regular files.

Note that these 3 rings can be stored and managed on 3 completely different sets of servers.

Swift storage schema

Data replication

Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored in multiple zones), let's move on to the replication practice.

When you put something into one of the 3 rings (be it an account, a container or an object), it is uploaded to all the zones responsible for the ring partition it belongs to. This upload to the different zones is the responsibility of the swift-proxy daemon.

Swift proxy schema
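To make the write path concrete, here is a minimal sketch of what the proxy does, assuming the ring is a plain dict mapping a partition number to the (host, port, device) of each zone storing it; this is illustrative only, not swift-proxy's actual code, and the URL layout is only an approximation of the object-server API:

import hashlib
import http.client

PART_POWER = 10   # assumption for the example

def put_object(ring, path, body):
    """PUT an object on every node responsible for its partition."""
    digest = hashlib.md5(path.encode('utf-8')).digest()
    partition = int.from_bytes(digest, 'big') >> (128 - PART_POWER)
    # ring[partition] is the list of (host, port, device) tuples,
    # one entry per zone, i.e. 3 entries if you keep 3 replicas
    for host, port, device in ring[partition]:
        conn = http.client.HTTPConnection(host, port)
        conn.request('PUT', '/%s/%d%s' % (device, partition, path), body)
        conn.getresponse().read()
        conn.close()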

But if one of the zones is failing, you can't upload all your copies to all zones at upload time. So you need a mechanism to make sure the failing zone will catch up to a correct state at some point.

That's the role of the swift-{container,account,object}-replicator processes. These processes run on each node that is part of a zone and replicate their contents to the nodes of the other zones.

When they run, they walk through all the contents of all the partitions on the whole file system and, for each partition, issue a special REPLICATE HTTP request to all the other zones responsible for that same partition. Each of those zones responds with information about the local state of the partition, which allows the replicator process to decide whether the remote zone has an up-to-date version of the partition.

In the case of accounts and containers, the check is not done at the partition level: each account/container contained inside each partition is checked individually.

If something is not up-to-date, it is pushed using rsync by the replicator process. This is why you'll read in the Swift documentation that replication updates are "push based".

# Pseudo code describing the replication process for accounts
# The principle is exactly the same for containers
for account in accounts:
    # Determine the partition used to store this account
    partition = hash(account) % number_of_partitions
    # The number of zones is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send an HTTP REPLICATE command to the remote swift-account-server process
        version_of_account = zone.send_HTTP_REPLICATE_for(account)
        if version_of_account < account.version():
            # Push the account database to that zone (using rsync)
            zone.push(account)

This replication process is O(number of accounts × number of replicas): the more accounts you have, and the more replicas you want for your data, the longer the replication of your accounts will take. The same rule applies to containers.

# Pseudo code describing the replication process for objects
for partition in partitions_storing_objects:
    # The number of zones is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send an HTTP REPLICATE command to the remote swift-object-server process
        version_of_partition = zone.send_HTTP_REPLICATE_for(partition)
        if version_of_partition < partition.version():
            # Use rsync to synchronize the whole partition
            # and all its objects
            zone.rsync(partition)
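For reference, the push itself is an rsync of the partition directory to the remote node. A hypothetical sketch of such an invocation follows; the paths, the rsync module name and the options are illustrative, not the exact ones used by swift-object-replicator:

import subprocess

def push_partition(local_device_path, partition, remote_host, remote_device):
    """Push a whole object partition to a remote node with rsync (illustrative)."""
    src = '%s/objects/%d/' % (local_device_path, partition)
    # rsync daemon "module" syntax: the remote node exports an rsync module
    dst = '%s::object/%s/objects/%d/' % (remote_host, remote_device, partition)
    subprocess.call(['rsync', '--recursive', '--whole-file', '--ignore-existing', src, dst])

# Example: push_partition('/srv/node/sda', 123, '10.0.0.2', 'sdb')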

This replication process is O(number of object partitions × number of replicas): the more object partitions you have, and the more replicas you want for your data, the longer the replication of your objects will take.

I think this is important to know when deciding how to build your Swift architecture: choose the right number of replicas, partitions and nodes.

Replication process bottlenecks

File accesses

The problem, as you might have guessed, is that to replicate, Swift walks through every damn thing, things being accounts, containers, or object partition hash files. This means it needs to open and read (part of) every file your node stores just to check whether the data needs to be replicated or not!

For accounts & containers replication, this is done every 30 seconds by default, but it will likely take more than 30 seconds as soon as you hit around 12 000 containers on a node (see measurements below). Therefore you'll end up checking the consistency of accounts & containers on every node all the time, obviously using a lot of CPU time.

For reference, Alex Yang also did an analysis of that same problem.

TCP connections

Worse, the HTTP connections used to send the REPLICATE commands are not pooled: a new TCP connection is established each time something has to be checked against its counterpart stored in a remote zone.

This is why you'll see these lines in Swift's Deployment Guide, listed under "general system tuning":

# disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1
# double amount of allowed conntrack
net.ipv4.netfilter.ip_conntrack_max = 262144

In my humble opinion, this is more of an ugly hack than tuning. If you don't activate this and you have a lot of containers on your node, you'll soon end up with thousands of connections in the TIME_WAIT state, and you do risk overloading the IP conntrack module.
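To illustrate where all those TIME_WAIT sockets come from, here is a minimal sketch (not Swift's code; the host, port and paths are made up) contrasting one TCP connection per REPLICATE request with a single keep-alive connection reused for several partitions:

import http.client

partitions_to_check = [17, 42, 513]

# Today (simplified): one TCP connection per REPLICATE request, leaving one
# TIME_WAIT socket behind for every partition checked on the remote node.
for part in partitions_to_check:
    conn = http.client.HTTPConnection('10.0.0.2', 6000)
    conn.request('REPLICATE', '/sda/%d' % part)
    conn.getresponse().read()
    conn.close()

# With pooling: a single keep-alive connection reused for all the REPLICATE
# requests sent to the same remote node.
conn = http.client.HTTPConnection('10.0.0.2', 6000)
for part in partitions_to_check:
    conn.request('REPLICATE', '/sda/%d' % part)
    conn.getresponse().read()
conn.close()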

Container deletion

We should also talk about container deletion. When a user deletes a container from their account, the container is only marked as deleted. And that's it: it's not actually removed. Therefore the SQLite database file representing the container will continue to be checked for synchronization, over and over.

The only way to have a container permanently deleted is to mark an account as deleted. This way the swift-account-reaper will delete all its containers and, finally, the account.


On a pretty big server, I measured replication to be done at a speed of around 350 {accounts,containers,object-partitions}/second, which can be a real problem if you chose to build a lot of partitions and you have a low number_of_nodes ⁄ number_of_replicas ratio.

For example, the default parameters run container replication every 30 seconds. To check the replication status of 12 000 containers stored on one node at a speed of 350 containers/second, you'll need around 34 seconds. In the end, you'll never stop checking the replication of your containers, and the more containers you have, the more your inconsistency window will increase.
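The arithmetic is simple enough to check:

CONTAINERS_ON_NODE = 12000
CHECK_RATE = 350        # containers checked per second (measured above)
RUN_INTERVAL = 30       # default seconds between two replication passes

pass_duration = CONTAINERS_ON_NODE / CHECK_RATE
print(round(pass_duration, 1))        # 34.3 seconds
print(pass_duration > RUN_INTERVAL)   # True: the replicator never catches its breath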


Until some of the code is fixed (HTTP connection pooling probably being the "easiest" fix), I warmly recommend choosing the different Swift parameters for your setup carefully. Optimizing the replication process consists in having the minimum number of partitions per node, which can be done by:

  • decreasing the number of partitions
  • decreasing the number of replicas
  • increasing the number of nodes

For very large setups, some code to speed up account and container synchronization, and to remove deleted containers, will be required, but as far as I know this does not exist yet.


