Twitter’s distributed photo storage system

http://engineering.twitter.com/2012/12/blobstore-twitters-in-house-photo.html

Blobstore: Twitter’s in-house photo storage system

Millions of people turn to Twitter to share and discover photos. To make it possible to upload a photo and attach it to your Tweet directly from Twitter, we partnered with Photobucket in 2011. As soon as photos became a more native part of the Twitter experience, more and more people began using this feature to share photos. 

In order to introduce new features and functionality, such as filters, and continue to improve the photos experience, Twitter’s Core Storage team began building an in-house photo storage system. In September, we began to use this new system, called Blobstore.

What is Blobstore?

Blobstore is Twitter’s low-cost and scalable storage system built to store photos and other binary large objects, also known as blobs. When we set out to build Blobstore, we had three design goals in mind:

  • Low Cost: Reduce the amount of money and time Twitter spent on storing Tweets with photos.
  • High Performance: Serve images in the low tens of milliseconds, while maintaining a throughput of hundreds of thousands of requests per second.
  • Easy to Operate: Be able to scale operational overhead with Twitter’s continuously growing infrastructure.

How does it work?

When a user tweets a photo, we send the photo off to one of a set of Blobstore front-end servers. The front-end understands where a given photo needs to be written, and forwards it on to the servers responsible for actually storing the data. These storage servers, which we call storage nodes, write the photo to a disk and then inform a Metadata store that the image has been written, instructing it to record the information required to retrieve the photo. This Metadata store, which is a non-relational key-value store cluster with automatic multi-DC synchronization capabilities, spans all of Twitter’s data centers, providing a consistent view of the data that is in Blobstore.
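
As a rough sketch of that write path, the Python below shows the shape of the interaction; the class and method names (put_photo, write, put, pick_nodes) are illustrative placeholders, not Twitter’s internal API.

```python
# Sketch of the write path, assuming hypothetical client objects for the
# storage nodes and the Metadata store.
class BlobstoreFrontend:
    def __init__(self, storage_nodes, metadata_store):
        self.storage_nodes = storage_nodes    # clients for the storage nodes
        self.metadata_store = metadata_store  # multi-DC key-value store client

    def put_photo(self, photo_id, data):
        # Decide which storage nodes should hold this photo; placement is
        # covered in the virtual-bucket discussion below.
        for node in self.pick_nodes(photo_id):
            # The storage node writes the bytes to disk and reports back where
            # it put them (fat file, byte offset, length).
            location = node.write(photo_id, data)
            # Record that location so the photo can be retrieved later.
            self.metadata_store.put(photo_id, location)

    def pick_nodes(self, photo_id):
        raise NotImplementedError("see the placement discussion below")
```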

The brain of Blobstore, the blob manager, runs alongside the front-ends, storage nodes, and index cluster. The blob manager acts as a central coordinator for the management of the cluster. It is the source of all of the front-ends’ knowledge of where files should be stored, and it is responsible for updating this mapping and coordinating data movement when storage nodes are added, or when they are removed due to failures.

Finally, we rely on Kestrel, Twitter’s existing asynchronous queue server, to handle tasks such as replicating images and ensuring data integrity across our data centers.

We guarantee that when an image is successfully uploaded to Twitter, it is immediately retrievable from the data center that initially received the image. Within a short period of time, the image is replicated to all of our other data centers, and is retrievable from those as well. Because we rely on a multi-data-center Metadata store for the central index of files within Blobstore, we are aware in a very short amount of time whether an image has been written to its original data center; we can route requests there until the Kestrel queues are able to replicate the data.

Blobstore Components

How is the data found?

When an image is requested from Blobstore, we need to determine its location in order to access the data. There are a few approaches to solving this problem, each with its own pros and cons. One such approach is to map or hash each image individually to a given server by some method. This method has a fairly major downside in that it makes managing the movement of images much more complicated. For example, if we were to add or remove a server from Blobstore, we would need to recompute a new location for each individual image affected by the change. This adds operational complexity, as it would necessitate a rather large amount of bookkeeping to perform the data movement.

We instead created a fixed-size container for individual blobs of data, called a “virtual bucket”. We map images to these containers, and then we map the containers to the individual storage nodes. We keep the total number of virtual buckets unchanged for the entire lifespan of our cluster. In order to determine which virtual bucket a given image is stored in, we perform a simple hash on the image’s unique ID. As long as the number of virtual buckets remains the same, this hashing will remain stable. The advantage of this stability is that we can reason about the movement of data at a much coarser-grained level than the individual image.
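
As a concrete sketch of this scheme (the bucket count and hash function below are assumptions for illustration, not Blobstore’s actual values):

```python
import hashlib

# Fixed for the entire lifespan of the cluster; the actual count used by
# Blobstore is not stated in the post, so this value is illustrative.
NUM_VIRTUAL_BUCKETS = 65536

def virtual_bucket_for(image_id):
    """Stable mapping from an image's unique ID to a virtual bucket."""
    digest = hashlib.md5(image_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VIRTUAL_BUCKETS

# The second-level mapping (virtual bucket -> storage nodes) is the only part
# that changes when the cluster topology changes.
bucket_to_nodes = {bucket: [] for bucket in range(NUM_VIRTUAL_BUCKETS)}

print(virtual_bucket_for("photo-123456"))  # same ID always yields the same bucket
```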

How do we place the data?

When mapping virtual buckets to physical storage nodes, we keep some rules in mind to make sure that we don’t lose data when we lose servers or hard drives. For example, if we were to put all copies of a given image on a single rack of servers, losing that rack would mean that particular image would be unavailable.

If we were to completely mirror the data on a given storage node on another storage node, it would be unlikely that we would ever have unavailable data, as the likelihood of losing both nodes at once is fairly low. However, if we were to lose a node, we would have only a single node to source from to re-replicate the data, and we would have to recover slowly so as to not impact the performance of that single remaining node.

If we were to take the opposite approach and allow any server in the cluster to share a range of data with all other servers, then we would avoid a bottleneck when recovering lost replicas, as we would essentially be able to read from the entire cluster in order to re-replicate data. However, we would also have a very high likelihood of data loss if we were to lose more nodes than the replication factor of the cluster (two per data center), as the chance that any two nodes would share some piece of data would be high. So, the optimal approach lies somewhere in the middle: for a given piece of data, there would be a limited number of machines that could share the range of data of its replica - more than one, but fewer than the entire cluster.

We took all of these things into account when we determined the mapping of data to our storage nodes. As a result, we built a library called “libcrunch” which understands the various data placement rules such as rack-awareness, understands how to replicate the data in a way that minimizes the risk of data loss while also maximizing the throughput of data recovery, and attempts to minimize the amount of data that needs to be moved upon any change in the cluster topology (such as when nodes are added or removed). It also gives us the power to fully map the network topology of our data center, so storage nodes have better data placement and we can take into account rack awareness and the placement of replicas across PDU zones and routers.
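
To make the placement rules concrete, here is a toy sketch of rack-aware, limited-spread placement; it is not libcrunch itself, and the spread and replication values are assumptions.

```python
import hashlib
from collections import namedtuple

Node = namedtuple("Node", ["name", "rack"])

def place_bucket(bucket, nodes, replication_factor=2, spread=8):
    """Pick replicas for a virtual bucket on distinct racks, drawn from a
    bounded window of candidates: more than one node, fewer than the cluster."""
    start = int(hashlib.md5(str(bucket).encode()).hexdigest(), 16) % len(nodes)
    chosen, racks_used = [], set()
    for i in range(spread):
        candidate = nodes[(start + i) % len(nodes)]
        if candidate.rack not in racks_used:  # rack-awareness rule
            chosen.append(candidate)
            racks_used.add(candidate.rack)
        if len(chosen) == replication_factor:
            break
    return chosen

nodes = [Node("node%d" % i, "rack%d" % (i % 4)) for i in range(12)]
print(place_bucket(42, nodes))  # two nodes, guaranteed to be on different racks
```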

Keep an eye out for a blog post with more information on libcrunch.

How is the data stored?

Once we know where a given piece of data is located, we need to be able to efficiently store and retrieve it. Because of their relatively high storage density, we are using standard hard drives inside our storage nodes (3.5” 7200 RPM disks). Since this means that disk seeks are very expensive, we attempted to minimize the number of disk seeks per read and write.

We pre-allocate ‘fat’ files of around 256MB each on each storage node disk using fallocate(). We store each blob of data sequentially within a fat file, along with a small header. The offset and length of the data are then stored in the Metadata store, which uses SSDs internally, as the access pattern for index reads and writes is very well-suited to solid state media. Furthermore, splitting the index from the data saves us from needing to scale out memory on our storage nodes, because we don’t need to keep any local indexes in RAM for fast lookups. The only time we end up hitting disk on a storage node is once we already have the fat file location and byte offset for a given piece of data. This means that we can generally guarantee a single disk seek for that read.
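
A simplified sketch of that layout is shown below; the 4-byte length header and the use of os.posix_fallocate (Python’s wrapper for fallocate(), Unix-only) are assumptions for illustration, not Blobstore’s actual on-disk format.

```python
import os
import struct

FAT_FILE_SIZE = 256 * 1024 * 1024  # ~256MB, pre-allocated up front

def preallocate(path):
    """Create a fat file and reserve its full size with fallocate()."""
    with open(path, "wb") as f:
        os.posix_fallocate(f.fileno(), 0, FAT_FILE_SIZE)

def append_blob(path, offset, blob):
    """Write a small header plus the blob at `offset`; return the (offset,
    length) pair that the Metadata store would record."""
    header = struct.pack(">I", len(blob))  # 4-byte big-endian length
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(header + blob)
    return offset, len(header) + len(blob)

def read_blob(path, offset):
    """One disk seek: jump straight to the recorded offset and read the blob."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length)
```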


Topology Management

As the number of disks and nodes increases, the rate of failure increases. Capacity needs to be added, disks and nodes need to be replaced after failures, and servers need to be moved. To make Blobstore operationally easy, we put a lot of time and effort into libcrunch and the tooling associated with making cluster changes.


When a storage node fails, data that was hosted on that node needs to be copied from a surviving replica to restore the correct replication factor. The failed node is marked as unavailable in the cluster topology, and so libcrunch computes a change in the mapping from the virtual buckets to the storage nodes. From this mapping change, the storage nodes are instructed to copy and migrate virtual buckets to new locations.
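
The idea can be illustrated by diffing the old and new bucket-to-node mappings; this is a sketch of the concept, not the Blob Manager’s actual logic.

```python
def migration_plan(old_mapping, new_mapping):
    """Compare {bucket: [node, ...]} mappings and emit (bucket, source, target)
    copy tasks for every bucket that gained a replica on a new node."""
    tasks = []
    for bucket, new_nodes in new_mapping.items():
        old_nodes = set(old_mapping.get(bucket, []))
        surviving = sorted(old_nodes & set(new_nodes)) or sorted(old_nodes)
        for node in new_nodes:
            if node not in old_nodes and surviving:
                tasks.append((bucket, surviving[0], node))  # copy from a survivor
    return tasks

old = {7: ["node1", "node4"], 8: ["node2", "node4"]}
new = {7: ["node1", "node5"], 8: ["node2", "node4"]}  # node4 lost bucket 7
print(migration_plan(old, new))  # [(7, 'node1', 'node5')]
```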

Zookeeper

Topology and placement rules are stored internally in one of our Zookeeper clusters. The Blob Manager handles this interaction and uses the information stored in Zookeeper when an operator makes a change to the system. A topology change can consist of adjusting the replication factor; adding, failing, or removing nodes; or adjusting other input parameters for libcrunch.
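
Purely as an illustration of what keeping topology in Zookeeper might look like, here is a sketch using the open-source kazoo client; the paths, payload format, and client choice are assumptions and are not taken from the post.

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # illustrative hosts
zk.start()

topology = {
    "replication_factor": 2,
    "nodes": [
        {"name": "node1", "rack": "rack1", "state": "up"},
        {"name": "node2", "rack": "rack2", "state": "failed"},
    ],
}

# An operator (or tooling) writes the desired topology...
zk.ensure_path("/blobstore/topology")
zk.set("/blobstore/topology", json.dumps(topology).encode("utf-8"))

# ...and the blob manager watches it and recomputes placement on changes.
@zk.DataWatch("/blobstore/topology")
def on_topology_change(data, stat):
    if data is not None:
        print("topology version", stat.version, json.loads(data))
```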

Replication across Data centers

Kestrel is used for cross-data-center replication. Because Kestrel is a durable queue, we use it to asynchronously replicate our image data across data centers.
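
Conceptually, the flow looks like the sketch below; the queue interface is a generic stand-in, not Kestrel’s actual protocol, and the helper objects are hypothetical.

```python
import json

def enqueue_replication(queue, photo_id, source_dc, target_dc):
    """After a successful local write, enqueue a durable task to copy the blob."""
    task = {"photo_id": photo_id, "from": source_dc, "to": target_dc}
    queue.put(json.dumps(task))  # durable queue: the task survives restarts

def replication_worker(queue, blob_client, metadata_store):
    """Runs near the target data center: drains the queue and copies blobs."""
    while True:
        raw = queue.get()  # blocks until a task is available
        if raw is None:
            break
        task = json.loads(raw)
        data = blob_client.fetch(task["photo_id"], dc=task["from"])
        location = blob_client.store(task["photo_id"], data)
        # Update the multi-DC Metadata store so reads can now be served locally.
        metadata_store.add_replica(task["photo_id"], task["to"], location)
```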

Data center-aware Routing

TFE (Twitter Frontend) is one of Twitter’s core components for routing. We wrote a custom plugin for TFE that extends the default routing rules. Our Metadata store spans multiple data centers, and because the metadata stored per blob is small (a few bytes), we typically replicate this information much faster than the blob data. If a user tries to access a blob that has not yet been replicated to the data center they are routed to, we look up this metadata and proxy the request to the nearest data center that has the blob data stored. This gives us the property that if replication gets delayed, we can still route requests to the data center that stored the original blob, serving the user the image at the cost of slightly higher latency until it is replicated to the closer data center.
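
The routing decision itself is roughly the following; the function and object names are assumptions for illustration, not the TFE plugin API.

```python
def route_blob_request(photo_id, local_dc, metadata_store, http_client):
    """Return (data_center, action) for a blob read request."""
    replicas = metadata_store.replicas_for(photo_id)  # e.g. {"dc1", "dc2"}
    if not replicas:
        return None, "not_found"
    if local_dc in replicas:
        return local_dc, "serve_locally"  # normal fast path
    # Replication lagged behind: proxy to a data center that already has the
    # blob, trading slightly higher latency for a successful response.
    target_dc = sorted(replicas)[0]
    http_client.proxy(photo_id, to_dc=target_dc)
    return target_dc, "proxied"
```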

Future work

We have shipped the first version of Blobstore internally. Although Blobstore started with photos, we are adding other features and use cases that require blob storage, and we are continuously iterating on it to make it more robust, scalable, and easier to maintain.

