Finding a needle in Haystack: Facebook’s photo storage -- a translation of Facebook’s distributed object storage paper

Original paper: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf

Translator's note: This is one of the classic papers in distributed storage. Facebook's photo storage system is built on the design it describes, and there are open-source implementations of the same idea, such as Seaweedfs. Reading the paper gives a clear picture of the motivation, principles, and implementation details behind this distributed file system design.

 

Abstract:

This paper describes Haystack, an object storage system optimized for Facebook’s Photos application. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (∼60 terabytes) each week and Facebook serves over one million images per second at peak. Haystack provides a less expensive and higher performing solution than our previous approach, which leveraged network attached storage appliances over NFS. Our key observation is that this traditional design incurs an excessive number of disk operations because of metadata lookups. We carefully reduce this per photo metadata so that Haystack storage machines can perform all metadata lookups in main memory. This choice conserves disk operations for reading actual data and thus increases overall throughput.

Abstract (translated):

This paper describes Haystack, an object storage system optimized specifically for storing Facebook's photos. Facebook currently stores over 260 billion images, which amounts to more than 20 PB of data. Users upload one billion new photos (roughly 60 TB) each week, and Facebook serves over one million images per second at peak. Haystack provides a cheaper and higher-performing solution than our previous approach, which was built on NAS appliances accessed over NFS. The key observation is that in this traditional design, metadata lookups cause an excessive number of disk operations. We reduce the per-photo metadata as far as possible so that Haystack can perform all metadata lookups in main memory. This choice reserves disk operations for reading actual photo data and thus raises overall throughput.

 

1 Introduction

Sharing photos is one of Facebook’s most popular features. To date, users have uploaded over 65 billion photos making Facebook the biggest photo sharing website in the world. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to over 260 billion images and more than 20 petabytes of data. Users upload one billion new photos (∼60 terabytes) each week and Facebook serves over one million images per second at peak. As we expect these numbers to increase in the future, photo storage poses a significant challenge for Facebook’s infrastructure.

1 Introduction (translated)

Sharing photos is one of Facebook's most popular features. To date, users have uploaded more than 65 billion photos, making Facebook the largest photo sharing site in the world. For every uploaded photo, Facebook generates and stores four images of different sizes, which works out to more than 260 billion images and over 20 PB of data. Users upload one billion new photos (about 60 TB) each week, and Facebook has to serve over one million images per second at peak. We expect these numbers to keep growing, so photo storage poses a significant challenge for Facebook's infrastructure.
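As a rough sanity check on these figures, the implied averages follow directly from the numbers above (an illustrative calculation only; the per-image averages are not stated in the paper):

```python
# Back-of-the-envelope arithmetic implied by the figures above (illustrative only).
uploaded_photos = 65e9          # photos uploaded to date
sizes_per_photo = 4             # Facebook stores four scaled images per upload
stored_images = uploaded_photos * sizes_per_photo
print(stored_images)                  # ~2.6e11, i.e. the "over 260 billion images"

total_bytes = 20e15                   # "more than 20 petabytes"
print(total_bytes / stored_images)    # ~77 KB average per stored image

weekly_uploads = 1e9                  # photos uploaded per week
weekly_bytes = 60e12                  # ~60 terabytes per week
print(weekly_bytes / weekly_uploads)  # ~60 KB average per uploaded photo
```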

 

This paper presents the design and implementation of Haystack, Facebook’s photo storage system that has been in production for the past 24 months. Haystack is an object store that we designed for sharing photos on Facebook where data is written once, read often, never modified, and rarely deleted. We engineered our own storage system for photos because traditional filesystems perform poorly under our workload.

This paper presents the design and implementation of Haystack, Facebook's photo storage system, which has been running in production for the past 24 months. Haystack is an object store that we designed for photo sharing on Facebook, where data is written once, read often, never modified, and rarely deleted. We built our own storage system for photos because traditional filesystems perform poorly under this workload.

 

In our experience, we find that the disadvantages of a traditional POSIX based filesystem are directories and per file metadata. For the Photos application most of this metadata, such as permissions, is unused and thereby wastes storage capacity. Yet the more significant cost is that the file’s metadata must be read from disk into memory in order to find the file itself. While insignificant on a small scale, multiplied over billions of photos and petabytes of data, accessing metadata is the throughput bottleneck. We found this to be our key problem in using a network attached storage (NAS) appliance mounted over NFS. Several disk operations were necessary to read a single photo: one (or typically more) to translate the filename to an inode number, another to read the inode from disk, and a final one to read the file itself. In short, using disk IOs for metadata was the limiting factor for our read throughput. Observe that in practice this problem introduces an additional cost as we have to rely on content delivery networks (CDNs), such as Akamai, to serve the majority of read traffic.

In our experience, the drawback of a traditional POSIX-based filesystem is the directory and per-file metadata it maintains. For the Photos application most of this metadata, such as permissions, goes unused and therefore wastes storage capacity. The more significant cost, however, is that the file's metadata must be read from disk into memory before the file itself can be found. At a small scale this overhead is negligible, but multiplied across hundreds of billions of photos and petabytes of data, accessing metadata becomes the throughput bottleneck. This was the key lesson from our earlier approach of serving photos from NAS appliances mounted over NFS. Reading a single photo required several disk operations: one (and often more) to translate the filename into an inode number, another to read the inode from disk, and a final one to read the file contents. In short, spending disk I/Os on metadata was the limiting factor for our read throughput. In practice this problem also adds cost, because we have to rely on content delivery networks (CDNs) such as Akamai to serve the majority of read traffic.
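To make the disk-operation argument concrete, here is a small illustrative tally of the read path just described versus the single-operation read Haystack aims for (the step descriptions are paraphrases, not code from the paper):

```python
# Illustrative tally of disk operations per photo read (based on the description above).

# Traditional NFS/NAS path: directory lookup(s), inode read, then the data itself.
nfs_read_steps = [
    "translate filename -> inode number (one or more directory block reads)",
    "read the inode from disk",
    "read the file contents",
]

# Haystack's goal: all metadata lookups happen in main memory,
# so only the photo data itself requires a disk operation.
haystack_read_steps = [
    "read the photo data at a known (offset, size) within a large file",
]

print(len(nfs_read_steps), "disk ops minimum per read over NFS")
print(len(haystack_read_steps), "disk op per read in Haystack")
```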

 

Given the disadvantages of a traditional approach, we designed Haystack to achieve four main goals:

High throughput and low latency. Our photo storage systems have to keep up with the requests users make. Requests that exceed our processing capacity are either ignored, which is unacceptable for user experience, or handled by a CDN, which is expensive and reaches a point of diminishing returns. Moreover, photos should be served quickly to facilitate a good user experience. Haystack achieves high throughput and low latency by requiring at most one disk operation per read. We accomplish this by keeping all metadata in main memory, which we make practical by dramatically reducing the per photo metadata necessary to find a photo on disk.

Fault-tolerant. In large scale systems, failures happen every day. Our users rely on their photos being available and should not experience errors despite the inevitable server crashes and hard drive failures. It may happen that an entire datacenter loses power or a cross-country link is severed. Haystack replicates each photo in geographically distinct locations. If we lose a machine we introduce another one to take its place, copying data for redundancy as necessary.

Cost-effective. Haystack performs better and is less expensive than our previous NFS-based approach. We quantify our savings along two dimensions: Haystack’s cost per terabyte of usable storage and Haystack’s read rate normalized for each terabyte of usable storage. In Haystack, each usable terabyte costs ∼28% less and processes ∼4x more reads per second than an equivalent terabyte on a NAS appliance.

Simple. In a production environment we cannot overstate the strength of a design that is straight-forward to implement and to maintain. As Haystack is a new system, lacking years of production-level testing, we paid particular attention to keeping it simple. That simplicity let us build and deploy a working system in a few months instead of a few years.

 

Given the disadvantages of the traditional approach, we designed Haystack with four main goals:

High throughput and low latency. Our photo storage system has to keep up with user requests. Requests that exceed our processing capacity are either dropped, which is unacceptable for the user experience, or handled by a CDN, which is expensive and eventually hits diminishing returns. Photos should also be served quickly to keep the experience good. Haystack achieves high throughput and low latency by requiring at most one disk operation per read. We accomplish this by keeping all metadata in main memory, which becomes practical once the per-photo metadata needed to locate a photo on disk is dramatically reduced.

Fault tolerance. In large-scale systems, failures happen every day. Our users rely on their photos being available and should never see errors, even though server crashes and hard-drive failures are inevitable, and even if an entire datacenter loses power or a cross-country link is cut. Haystack therefore replicates each photo across geographically distinct locations. If a machine goes down, we bring in another to take its place, copying data as needed to maintain redundancy.

Cost-effectiveness. Haystack performs better and costs less than our previous NFS-based approach. We quantify the savings along two dimensions: the cost per terabyte of usable storage, and the read rate normalized per terabyte of usable storage. In Haystack, each usable terabyte costs about 28% less and serves roughly 4x more reads per second than an equivalent terabyte on a NAS appliance.

Simplicity. In a production environment we cannot overstate the value of a design that is straightforward to implement and maintain. Because Haystack is a new system without years of production-level testing behind it, we paid particular attention to keeping it simple. That simplicity let us build and deploy a working system in a few months rather than a few years.

 

This work describes our experience with Haystack from conception to implementation of a production quality system serving billions of images a day. Our three main contributions are:

• Haystack, an object storage system optimized for the efficient storage and retrieval of billions of photos.

• Lessons learned in building and scaling an inexpensive, reliable, and available photo storage system.

• A characterization of the requests made to Facebook’s photo sharing application.

We organize the remainder of this paper as follows. Section 2 provides background and highlights the challenges in our previous architecture. We describe Haystack’s design and implementation in Section 3. Section 4 characterizes our photo read and write workload and demonstrates that Haystack meets our design goals. We draw comparisons to related work in Section 5 and conclude this paper in Section 6.

This paper describes our experience taking Haystack from concept to a production-quality system serving billions of images a day. Our three main contributions are:

• Haystack, an object storage system optimized for the efficient storage and retrieval of billions of photos.

• Lessons learned in building and scaling an inexpensive, reliable, and available photo storage system.

• A characterization of the requests made to Facebook's photo sharing application.

The rest of the paper is organized as follows. Section 2 provides background and highlights the challenges of our previous architecture. Section 3 describes Haystack's design and implementation. Section 4 characterizes our photo read and write workload and shows that Haystack meets its design goals. Section 5 compares against related work, and Section 6 concludes.

 

2 Background & Previous Design

In this section, we describe the architecture that existed before Haystack and highlight the major lessons we learned. Because of space constraints our discussion of this previous design elides several details of a production-level deployment.

2.1 Background

We begin with a brief overview of the typical design for how web servers, content delivery networks (CDNs), and storage systems interact to serve photos on a popular site. Figure 1 depicts the steps from the moment when a user visits a page containing an image until she downloads that image from its location on disk. When visiting a page the user’s browser first sends an HTTP request to a web server which is responsible for generating the markup for the browser to render. For each image the web server constructs a URL directing the browser to a location from which to download the data. For popular sites this URL often points to a CDN. If the CDN has the image cached then the CDN responds immediately with the data. Otherwise, the CDN examines the URL, which has enough information embedded to retrieve the photo from the site’s storage systems. The CDN then updates its cached data and sends the image to the user’s browser.

2 Background & Previous Design (translated)

In this section we describe the architecture that existed before Haystack and highlight the major lessons we learned. Because of space constraints, the discussion of this previous design omits several details of a production-level deployment.

2.1 Background (translated)

We begin with an overview of the typical design in which web servers, content delivery networks (CDNs), and storage systems interact to serve photos on a popular site. Figure 1 depicts the steps from the moment a user visits a page containing an image until she downloads that image from its location on disk. When visiting a page, the user's browser first sends an HTTP request to a web server, which is responsible for generating the markup for the browser to render. For each image, the web server constructs a URL directing the browser to the location from which to download the data. For popular sites this URL usually points to a CDN. If the CDN has the image cached, it responds with the data immediately. Otherwise, the CDN examines the URL, which embeds enough information to retrieve the photo from the site's storage system; after fetching the photo from the upstream storage, the CDN updates its cache and sends the image back to the user's browser.
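A minimal sketch of this serving path, with hypothetical names for the cache and storage hooks (the caching policy and interfaces are assumptions for illustration, not an actual CDN API):

```python
# Sketch of the canonical serving path described above (hypothetical names).
cdn_cache = {}  # url -> image bytes, standing in for the CDN's cache

def fetch_from_storage(url: str) -> bytes:
    """Placeholder for the site's backing photo storage (NAS/NFS or Haystack)."""
    return b"...image bytes..."

def serve_via_cdn(url: str) -> bytes:
    # Cache hit: the CDN answers immediately.
    if url in cdn_cache:
        return cdn_cache[url]
    # Cache miss: the URL carries enough information to reach the storage system.
    data = fetch_from_storage(url)
    cdn_cache[url] = data           # update the CDN cache
    return data                     # then return the image to the browser
```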

 

2.2 NFS-based Design

In our first design we implemented the photo storage system using an NFS-based approach. While the rest of this subsection provides more detail on that design, the major lesson we learned is that CDNs by themselves do not offer a practical solution to serving photos on a social networking site. CDNs do effectively serve the hottest photos— profile pictures and photos that have been recently uploaded—but a social networking site like Facebook also generates a large number of requests for less popular (often older) content, which we refer to as the long tail. Requests from the long tail account for a significant amount of our traffic, almost all of which accesses the backing photo storage hosts as these requests typically miss in the CDN. While it would be very convenient to cache all of the photos for this long tail, doing so would not be cost effective because of the very large cache sizes required.

2.2 NFS-based Design (translated)

In our first design we implemented the photo storage system using an NFS-based approach. The main lesson we took from it is that, for a popular social networking site, a CDN alone is not a practical solution for serving photos. CDNs do serve the hottest photos effectively (profile pictures and recently uploaded photos), but a social network like Facebook also generates a large number of requests for less popular, usually older, content, which we call the long tail. Requests from the long tail account for a significant share of our traffic; because they typically miss in the CDN, almost all of them reach the backing photo storage hosts. Caching every long-tail photo would be convenient, but it would not be cost-effective because of the very large cache sizes required.

 

 

Our NFS-based design stores each photo in its own file on a set of commercial NAS appliances. A set of machines, Photo Store servers, then mount all the volumes exported by these NAS appliances over NFS. Figure 2 illustrates this architecture and shows Photo Store servers processing HTTP requests for images. From an image’s URL a Photo Store server extracts the volume and full path to the file, reads the data over NFS, and returns the result to the CDN.

We initially stored thousands of files in each directory of an NFS volume which led to an excessive number of disk operations to read even a single image. Because of how the NAS appliances manage directory metadata, placing thousands of files in a directory was extremely inefficient as the directory’s blockmap was too large to be cached effectively by the appliance. Consequently it was common to incur more than 10 disk operations to retrieve a single image. After reducing directory sizes to hundreds of images per directory, the resulting system would still generally incur 3 disk operations to fetch an image: one to read the directory metadata into memory, a second to load the inode into memory, and a third to read the file contents.

To further reduce disk operations we let the Photo Store servers explicitly cache file handles returned by the NAS appliances. When reading a file for the first time a Photo Store server opens a file normally but also caches the filename to file handle mapping in memcache. When requesting a file whose file handle is cached, a Photo Store server opens the file directly using a custom system call, open by filehandle, that we added to the kernel. Regrettably, this file handle cache provides only a minor improvement as less popular photos are less likely to be cached to begin with. One could argue that an approach in which all file handles are stored in memcache might be a workable solution. However, that only addresses part of the problem as it relies on the NAS appliance having all of its inodes in main memory, an expensive requirement for traditional filesystems. The major lesson we learned from the NAS approach is that focusing only on caching— whether the NAS appliance’s cache or an external cache like memcache—has limited impact for reducing disk operations. The storage system ends up processing the long tail of requests for less popular photos, which are not available in the CDN and are thus likely to miss in our caches.

In the NFS-based design, each photo is stored as its own file on a set of commercial NAS appliances, whose volumes are mounted over NFS by a set of machines called Photo Store servers. Figure 2 illustrates this architecture, with Photo Store servers handling HTTP requests for images. From an image's URL, a Photo Store server extracts the volume and the full path to the file, reads the data over NFS, and returns the result to the CDN.

Initially, each directory of an NFS volume held thousands of files, which led to an excessive number of disk operations even to read a single image. Because of how the NAS appliances manage directory metadata, placing thousands of files in one directory was extremely inefficient: the directory's blockmap was too large to be cached effectively by the appliance, so retrieving a single image commonly took more than 10 disk operations. After reducing directory sizes to a few hundred images each, the system still generally needed 3 disk operations to fetch an image: one to read the directory metadata into memory, a second to load the inode, and a third to read the file contents.

To further reduce disk operations, we let the Photo Store servers explicitly cache the file handles returned by the NAS appliances. When reading a file for the first time, a Photo Store server opens it normally and also caches the filename-to-file-handle mapping in memcache. When a request arrives for a file whose handle is cached, the server opens the file directly using a custom system call, open by filehandle, that we added to the kernel. Unfortunately this file handle cache gives only a minor improvement, because less popular photos are unlikely to be cached in the first place. One could argue that storing all file handles in memcache might be a workable solution, but that only addresses part of the problem: it relies on the NAS appliance keeping all of its inodes in main memory, an expensive requirement for traditional filesystems. The major lesson from the NAS approach is that focusing only on caching, whether the NAS appliance's cache or an external cache like memcache, has limited impact on reducing disk operations. The storage system ends up handling the long tail of requests for less popular photos, which are not in the CDN and are therefore likely to miss in our caches as well.
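A rough sketch of the file-handle caching idea described above, assuming a memcache-like key-value map and a stand-in for the custom open-by-filehandle call (the helper names are hypothetical, not the actual kernel or memcache interfaces used at Facebook):

```python
# Sketch of the filename -> file handle cache (hypothetical helpers).
handle_cache = {}  # stands in for memcache: filename -> opaque NFS file handle

def nfs_open_and_get_handle(path: str):
    """Open the file normally over NFS and return (fd, handle) - placeholder."""
    fd = open(path, "rb")
    return fd, ("fake-handle-for", path)

def open_by_filehandle(handle):
    """Stand-in for the custom kernel call that opens a file from a cached handle."""
    _, path = handle
    return open(path, "rb")

def read_photo(path: str) -> bytes:
    handle = handle_cache.get(path)
    if handle is not None:
        # Handle hit: skip the filename -> inode translation on the NAS.
        f = open_by_filehandle(handle)
    else:
        # First access: open normally and remember the handle for next time.
        f, handle = nfs_open_and_get_handle(path)
        handle_cache[path] = handle
    with f:
        return f.read()
```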

 

2.3 Discussion

It would be difficult for us to offer precise guidelines for when or when not to build a custom storage system. However, we believe it is still helpful for the community to gain insight into why we decided to build Haystack.

Faced with the bottlenecks in our NFS-based design, we explored whether it would be useful to build a system similar to GFS. Since we store most of our user data in MySQL databases, the main use cases for files in our system were the directories engineers use for development work, log data, and photos. NAS appliances offer a very good price/performance point for development work and for log data. Furthermore, we leverage Hadoop  for the extremely large log data. Serving photo requests in the long tail represents a problem for which neither MySQL, NAS appliances, nor Hadoop are well-suited.

One could phrase the dilemma we faced as existing storage systems lacked the right RAM-to-disk ratio. However, there is no right ratio. The system just needs enough main memory so that all of the filesystem metadata can be cached at once. In our NAS-based approach, one photo corresponds to one file and each file requires at least one inode, which is hundreds of bytes large. Having enough main memory in this approach is not cost-effective. To achieve a better price/performance point, we decided to build a custom storage system that reduces the amount of filesystem metadata per photo so that having enough main memory is dramatically more cost-effective than buying more NAS appliances.

2.3 Discussion (translated)

It would be difficult to give precise guidelines for when to build a custom storage system and when not to. However, we believe it is still useful for the community to understand why we decided to build Haystack.

Faced with the bottlenecks of the NFS-based design, we explored whether it would be worthwhile to build a system similar to GFS. Since most of our user data is stored in MySQL databases, the main uses of files in our system were the directories engineers use for development work, log data, and photos. NAS appliances offer a very good price/performance point for development work and log data, and we use Hadoop for the extremely large log data. But serving photo requests in the long tail is a problem for which neither MySQL, NAS appliances, nor Hadoop are well suited.

One way to phrase the dilemma is that existing storage systems lacked the right RAM-to-disk ratio. In truth there is no single right ratio: the system simply needs enough main memory to cache all of the filesystem metadata at once. In our NAS-based approach, one photo corresponds to one file, and each file requires at least one inode, which is hundreds of bytes in size. Provisioning enough main memory for that is not cost-effective. To reach a better price/performance point, we decided to build a custom storage system that reduces the filesystem metadata per photo, so that keeping all of it in main memory becomes dramatically cheaper than buying more NAS appliances.
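A quick illustrative calculation of why caching one inode per photo is so expensive at this scale (the 500-byte and 20-byte figures are assumptions standing in for "hundreds of bytes" and for a compact index entry; Haystack's actual per-photo memory cost is given later in the paper):

```python
# Rough RAM requirement if every photo needs its inode cached (illustrative).
stored_images = 260e9            # "over 260 billion images"
bytes_per_inode = 500            # assumption: "hundreds of bytes" per inode

ram_needed = stored_images * bytes_per_inode
print(ram_needed / 1e12, "TB of RAM just for inodes")             # ~130 TB

# By contrast, if locating a photo only needed a few tens of bytes of
# in-memory metadata (the direction Haystack takes), the total would be
# a few terabytes spread across the whole storage fleet.
small_metadata = stored_images * 20                                # assumed ~20 bytes/photo
print(small_metadata / 1e12, "TB for a compact in-memory index")   # ~5.2 TB
```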

 

3 Design & Implementation

Facebook uses a CDN to serve popular images and leverages Haystack to respond to photo requests in the long tail efficiently. When a web site has an I/O bottleneck serving static content the traditional solution is to use a CDN. The CDN shoulders enough of the burden so that the storage system can process the remaining tail. At Facebook a CDN would have to cache an unreasonably large amount of the static content in order for traditional (and inexpensive) storage approaches not to be I/O bound.

Understanding that in the near future CDNs would not fully solve our problems, we designed Haystack to address the critical bottleneck in our NFS-based approach: disk operations. We accept that requests for less popular photos may require disk operations, but aim to limit the number of such operations to only the ones necessary for reading actual photo data. Haystack achieves this goal by dramatically reducing the memory used for filesystem metadata, thereby making it practical to keep all this metadata in main memory.

Recall that storing a single photo per file resulted in more filesystem metadata than could be reasonably cached. Haystack takes a straight-forward approach: it stores multiple photos in a single file and therefore maintains very large files. We show that this straightforward approach is remarkably effective. Moreover, we argue that its simplicity is its strength, facilitating rapid implementation and deployment. We now discuss how this core technique and the architectural components surrounding it provide a reliable and available storage system. In the following description of Haystack, we distinguish between two kinds of metadata. Application metadata describes the information needed to construct a URL that a browser can use to retrieve a photo. Filesystem metadata identifies the data necessary for a host to retrieve the photos that reside on that host’s disk.

3 Design & Implementation (translated)

Facebook uses a CDN to serve popular images and relies on Haystack to answer requests for the long tail of photos efficiently. When a web site hits an I/O bottleneck serving static content, the traditional solution is a CDN: the CDN shoulders enough of the load that the storage system only has to handle the remaining tail. At Facebook, however, a CDN would have to cache an unreasonably large amount of static content for a traditional (and inexpensive) storage system behind it to avoid being I/O bound.

Understanding that CDNs would not fully solve our problem in the near future, we designed Haystack to address the critical bottleneck of the NFS-based approach: disk operations. We accept that requests for less popular photos may require disk operations, but we aim to limit such operations to only those needed to read actual photo data. Haystack achieves this by dramatically shrinking the filesystem metadata, making it practical to keep all of it in main memory.

Recall that storing one photo per file produced more filesystem metadata than could reasonably be cached. Haystack takes a straightforward approach: it stores many photos in a single, very large file, which greatly reduces the metadata overhead. We show that this simple approach is remarkably effective; moreover, its simplicity is a strength, enabling rapid implementation and deployment. In the rest of this section we discuss how this core technique and the architectural components around it provide a reliable and available storage system. In the description that follows we distinguish two kinds of metadata: application metadata, the information needed to construct a URL that a browser can use to retrieve a photo, and filesystem metadata, the data a host needs to retrieve the photos that live on its disk.
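A minimal sketch of this core idea, packing many photos into one large append-only file with a small in-memory index (the layout and function names are simplified assumptions, not Haystack's actual needle format, which the paper details later):

```python
import os

# In-memory index: photo id -> (offset, size) within the large store file.
index = {}

def append_photo(store_path: str, photo_id: int, data: bytes) -> None:
    """Append a photo to the big file and remember where it landed."""
    with open(store_path, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(data)
    index[photo_id] = (offset, len(data))   # tiny per-photo in-memory metadata

def read_photo(store_path: str, photo_id: int) -> bytes:
    """One seek plus one read: no filesystem metadata lookups on the hot path."""
    offset, size = index[photo_id]
    with open(store_path, "rb") as f:
        f.seek(offset)
        return f.read(size)

# Usage sketch:
# append_photo("/data/haystack_0", 42, b"...jpeg bytes...")
# img = read_photo("/data/haystack_0", 42)
```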

 

3.1 Overview

The Haystack architecture consists of 3 core components: the Haystack Store, Haystack Directory, and Haystack Cache. For brevity we refer to these components with ‘Haystack’ elided. The Store encapsulates the persistent storage system for photos and is the only component that manages the filesystem metadata for photos. We organize the Store’s capacity by physical volumes. For example, we can organize a server’s 10 terabytes of capacity into 100 physical volumes each of which provides 100 gigabytes of storage. We further group physical volumes on different machines into logical volumes. When Haystack stores a photo on a logical volume, the photo is written to all corresponding physical volumes. This redundancy allows us to mitigate data loss due to hard drive failures, disk controller bugs, etc. The Directory maintains the logical to physical mapping along with other application metadata, such as the logical volume where each photo resides and the logical volumes with free space. The Cache functions as our internal CDN, which shelters the Store from requests for the most popular photos and provides insulation if upstream CDN nodes fail and need to refetch content.

Figure 3 illustrates how the Store, Directory, and Cache components fit into the canonical interactions between a user’s browser, web server, CDN, and storage system. In the Haystack architecture the browser can be directed to either the CDN or the Cache. Note that while the Cache is essentially a CDN, to avoid confusion we use ‘CDN’ to refer to external systems and ‘Cache’ to refer to our internal one that caches photos. Having an internal caching infrastructure gives us the ability to reduce our dependence on external CDNs.

3.1 Overview (translated)

The Haystack architecture consists of three core components: the Haystack Store, the Haystack Directory, and the Haystack Cache. For brevity we refer to them with "Haystack" elided. The Store is the persistent storage system for photos and is the only component that manages the filesystem metadata for photos. The Store's capacity is organized into physical volumes; for example, a server with 10 TB of capacity can be organized into 100 physical volumes, each providing 100 GB of storage. Physical volumes on different machines are further grouped into logical volumes. When Haystack stores a photo on a logical volume, the photo is written to every corresponding physical volume. This redundancy lets us tolerate data loss from hard-drive failures, disk-controller bugs, and similar faults. The Directory maintains the logical-to-physical mapping along with other application metadata, such as the logical volume each photo lives on and which logical volumes still have free space. The Cache acts as our internal CDN: it shields the Store from requests for the most popular photos and provides insulation when upstream CDN nodes fail and need to refetch content.

Figure 3 shows how the Store, Directory, and Cache fit into the canonical interactions between a user's browser, the web servers, the CDN, and the storage system. In the Haystack architecture the browser can be directed either to the CDN or to the Cache. Note that although the Cache is essentially a CDN, to avoid confusion we use "CDN" for external systems and "Cache" for our internal photo cache. Having an internal caching layer reduces our dependence on external CDNs.
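A small sketch of how logical volumes group replicated physical volumes and what a write to a logical volume implies (class and method names are hypothetical; the real Directory schema and Store write protocol are described in the following subsections):

```python
# Hypothetical sketch of logical volumes mapping onto replicated physical volumes.
from dataclasses import dataclass, field

@dataclass
class PhysicalVolume:
    machine: str                 # which Store machine holds this volume
    path: str                    # e.g. a 100 GB volume file on that machine
    photos: dict = field(default_factory=dict)   # photo id -> bytes (stand-in)

@dataclass
class LogicalVolume:
    volume_id: int
    replicas: list               # physical volumes on *different* machines

    def write(self, photo_id: int, data: bytes) -> None:
        # A photo stored on a logical volume is written to every replica,
        # which is what lets Haystack survive disk and machine failures.
        for pv in self.replicas:
            pv.photos[photo_id] = data

# Usage sketch: one logical volume backed by three machines.
lv = LogicalVolume(7, [PhysicalVolume("store-a", "/vol/7"),
                       PhysicalVolume("store-b", "/vol/7"),
                       PhysicalVolume("store-c", "/vol/7")])
lv.write(1234, b"...jpeg...")
```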


When a user visits a page the web server uses the Directory to construct a URL for each photo. The URL contains several pieces of information, each piece corresponding to the sequence of steps from when a user’s browser contacts the CDN (or Cache) to ultimately retrieving a photo from a machine in the Store. A typical URL that directs the browser to the CDN looks like the following:

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>

The first part of the URL specifies from which CDN to request the photo. The CDN can lookup the photo internally using only the last part of the URL: the logical volume and the photo id. If the CDN cannot locate the photo then it strips the CDN address from the URL and contacts the Cache. The Cache does a similar lookup to find the photo and, on a miss, strips the Cache address from the URL and requests the photo from the specified Store machine. Photo requests that go directly to the Cache have a similar workflow except that the URL is missing the CDN specific information.

Figure 4 illustrates the upload path in Haystack. When a user uploads a photo she first sends the data to a web server. Next, that server requests a write-enabled logical volume from the Directory. Finally, the web server assigns a unique id to the photo and uploads it to each of the physical volumes mapped to the assigned logical volume.

When a user visits a page, the web server uses the Directory to construct a URL for each photo. The URL contains several pieces of information, each corresponding to one step in the sequence from the user's browser contacting the CDN (or the Cache) to ultimately retrieving the photo from a machine in the Store. A typical URL that directs the browser to the CDN looks like this:

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>

The first part of the URL specifies which CDN to request the photo from. The CDN can look the photo up internally using only the last part of the URL: the logical volume and the photo id. If the CDN cannot locate the photo, it strips the <CDN> portion from the URL and contacts the Cache. The Cache performs a similar lookup; on a miss it strips the <Cache> portion and requests the photo from the specified Store machine. Requests that go directly to the Cache follow the same workflow except that the URL lacks the CDN-specific part.
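A sketch of how each hop can peel its own component off such a URL (the hostnames and the parsing are simplified assumptions about the layout shown above):

```python
# Hypothetical illustration of the URL layout and how each layer strips its part.
url = "http://cdn.example.com/cache.example.com/store-42/vol17,photo9001"

def strip_first_component(u: str) -> str:
    """Drop the leading host component so the request can fall through one layer."""
    scheme, rest = u.split("://", 1)
    _, remainder = rest.split("/", 1)
    return scheme + "://" + remainder

cache_url = strip_first_component(url)        # CDN miss -> forward to the Cache
store_url = strip_first_component(cache_url)  # Cache miss -> forward to the Store

print(cache_url)  # http://cache.example.com/store-42/vol17,photo9001
print(store_url)  # http://store-42/vol17,photo9001
```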

Figure 4 illustrates the upload path in Haystack. When a user uploads a photo, she first sends the data to a web server. That server then requests a write-enabled logical volume from the Directory. Finally, the web server assigns a unique id to the photo and uploads it to each physical volume mapped to the assigned logical volume.
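The same upload sequence as a short sketch, assuming hypothetical Directory and Store interfaces (this mirrors the three steps of Figure 4, not Facebook's actual RPC API):

```python
import itertools

photo_ids = itertools.count(1)   # stand-in for the web server's unique id assignment

def upload_photo(directory, data: bytes) -> int:
    # 1. Ask the Directory for a write-enabled logical volume.
    logical_volume = directory.pick_writable_logical_volume()
    # 2. Assign a unique id to the photo.
    photo_id = next(photo_ids)
    # 3. Write the photo to every physical volume backing that logical volume.
    for physical_volume in logical_volume.replicas:
        physical_volume.write(photo_id, data)
    return photo_id
```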

 

3.2 Haystack Directory

The Directory serves four main functions. First, it provides a mapping from logical volumes to physical volumes. Web servers use this mapping when uploading photos and also when constructing the image URLs for a page request. Second, the Directory load balances writes across logical volumes and reads across physical volumes. Third, the Directory determines whether a photo request should be handled by the CDN or by the Cache. This functionality lets us adjust our dependence on CDNs. Fourth, the Directory identifies those logical volumes that are read-only either because of operational reasons or because those volumes have reached their storage capacity. We mark volumes as read-only at the granularity of machines for operational ease.

When we increase the capacity of the Store by adding new machines, those machines are write-enabled; only write-enabled machines receive uploads. Over time the available capacity on these machines decreases. When a machine exhausts its capacity, we mark it as read-only. In the next subsection we discuss how this distinction has subtle consequences for the Cache and Store.

The Directory is a relatively straight-forward component that stores its information in a replicated database accessed via a PHP interface that leverages memcache to reduce latency. In the event that we lose the data on a Store machine we remove the corresponding entry in the mapping and replace it when a new Store machine is brought online.

3.2 Haystack Directory (translated)

The Directory serves four main functions. First, it provides the mapping from logical volumes to physical volumes. Web servers use this mapping when uploading photos and when constructing image URLs for page requests. Second, the Directory load-balances writes across logical volumes and reads across physical volumes. Third, it determines whether a photo request should be handled by the CDN or by the Cache, which lets us tune how much we depend on CDNs. Fourth, it identifies logical volumes that have become read-only, either for operational reasons or because they have reached their storage capacity; for operational convenience we mark volumes read-only at the granularity of whole machines.

When we grow the Store by adding new machines, those machines are write-enabled, and only write-enabled machines receive uploads. Over time their free capacity shrinks, and when a machine exhausts its capacity we mark it read-only. The next subsection discusses how this distinction has subtle consequences for the Cache and the Store.

The Directory itself is a relatively simple component that stores its information in a replicated database, accessed through a PHP interface that uses memcache to reduce latency. If we lose the data on a Store machine, we remove the corresponding entry from the mapping and replace it when a new Store machine is brought online. (Translator's aside: this subsection feels somewhat under-specified.)
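A compact sketch of the Directory's four duties as listed above (the data structures and methods are illustrative assumptions; the real Directory is a replicated database fronted by a PHP interface and memcache):

```python
import random

class Directory:
    """Illustrative in-memory model of the Haystack Directory's four duties."""

    def __init__(self):
        self.logical_to_physical = {}   # logical volume id -> list of Store machines
        self.writable = set()           # logical volumes still accepting uploads
        self.photo_location = {}        # photo id -> logical volume id (application metadata)

    def pick_writable_logical_volume(self):
        # Duty 2 (writes): balance uploads across write-enabled logical volumes.
        return random.choice(sorted(self.writable))

    def pick_read_replica(self, photo_id):
        # Duties 1 and 2 (reads): map the photo's logical volume to one physical replica.
        lv = self.photo_location[photo_id]
        return random.choice(self.logical_to_physical[lv])

    def should_use_cdn(self, photo_id) -> bool:
        # Duty 3: decide whether this request goes to the external CDN or the Cache.
        return True  # placeholder policy

    def mark_machine_read_only(self, machine):
        # Duty 4: volumes are marked read-only at machine granularity.
        for lv, replicas in self.logical_to_physical.items():
            if machine in replicas:
                self.writable.discard(lv)
```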

 

(Translation of the remaining sections is still a work in progress.)
