The Tumblr Architecture Yahoo Bought For A Cool Billion Dollars

http://highscalability.com/blog/2013/5/20/the-tumblr-architecture-yahoo-bought-for-a-cool-billion-doll.html
MONDAY, MAY 20, 2013 AT 8:25AM


It's being reported Yahoo bought Tumblr for $1.1 billion. You may recall Instagram was profiled on HighScalability and they were also bought by Facebook for a ton of money. A coincidence? You be the judge.

Just what is Yahoo buying? The business acumen of the deal is not something I can judge, but if you are doing due diligence on the technology then Tumblr would probably get a big thumbs up. To see why, please keep on reading...

With over 15 billion page views a month Tumblr has become an insanely popular blogging platform. Users may like Tumblr for its simplicity, its beauty, its strong focus on user experience, or its friendly and engaged community, but like it they do.

Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.

One of the common patterns across successful startups is the perilous chasm crossing from startup to wildly successful startup. Finding people, evolving infrastructures, servicing old infrastructures, while handling huge month over month increases in traffic, all with only four engineers, means you have to make difficult choices about what to work on. This was Tumblr's situation. Now with twenty engineers there's enough energy to work on issues and develop some very interesting solutions.

Tumblr started as a fairly typical large LAMP application. The direction they are moving in now is towards a distributed services model built around Scala, HBase, Redis, Kafka, Finagle, and an intriguing cell based architecture for powering their Dashboard. Effort is now going into fixing short term problems in their PHP application, pulling things out, and doing it right using services.


The theme at Tumblr is transition at massive scale. Transition from a LAMP stack to a somewhat bleeding edge stack. Transition from a small startup team to a fully armed and ready development team churning out new features and infrastructure. To help us understand how Tumblr is living this theme is startup veteran Blake Matheny, Distributed Systems Engineer at Tumblr. Here's what Blake has to say about the House of Tumblr:

Site: http://www.tumblr.com/



Stats

500 million page views a day
15B+ page views a month
~20 engineers
Peak rate of ~40k requests per second
1+ TB/day into Hadoop cluster
Many TB/day into MySQL/HBase/Redis/Memcache
Growing at 30% a month
~1000 hardware nodes in production
Billions of page visits per month per engineer
Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
Dashboard runs at a million writes a second, 50K reads a second, and it is growing.


Software

OS X for development, Linux (CentOS, Scientific) in production
Apache
PHP, Scala, Ruby
Redis, HBase, MySQL
Varnish, HA-Proxy, nginx
Memcache, Gearman, Kafka, Kestrel, Finagle
Thrift, HTTP
Func - a secure, scriptable remote control framework and API
Git, Capistrano, Puppet, Jenkins
Hardware

500 web servers
200 database servers (many of these are part of a spare pool we pulled from for failures)
47 pools
30 shards
30 memcache servers
22 redis servers
15 varnish servers
25 haproxy nodes
8 nginx
14 job queue servers (kestrel + gearman)




Architecture

Tumblr has a different usage pattern than other social networks.

With 50+ million posts a day, an average post goes to many hundreds of people. It's not just one or two users that have millions of followers. The graph for Tumblr users has hundreds of followers. This is different than any other social network and is what makes Tumblr so challenging to scale.

#2 social network in terms of time spent by users. The content is engaging. It's images and videos. The posts aren't byte sized. They aren't all long form, but they have the ability. People write in-depth content that's worth reading, so people stay for hours.

Users form a connection with other users, so they will go hundreds of pages back into the dashboard to read content. Other social networks are just a stream that you sample.

Implication is that given the number of users, the average reach of the users, and the high posting activity of the users, there is a huge amount of updates to handle.

Tumblr runs in one colocation site. Designs are keeping geographical distribution in mind for the future.

Two components to Tumblr as a platform: public Tumblelogs and Dashboard.

Public Tumblelog is what the public deals with in terms of a blog. Easy to cache as it's not that dynamic.

Dashboard is similar to the Twitter timeline. Users follow real-time updates from all the users they follow.

Very different scaling characteristics than the blogs. Caching isn't as useful because every request is different, especially with active followers.

Needs to be real-time and consistent. Should not show stale data. And it's a lot of data to deal with. Posts are only about 50GB a day. Follower list updates are 2.7TB a day. Media is all stored on S3.

Most users leverage Tumblr as a tool for consuming content. Of the 500+ million page views a day, 70% of that is for the Dashboard.

Dashboard availability has been quite good. Tumblelog hasn't been as good because they have a legacy infrastructure that has been hard to migrate away from. With a small team they had to pick and choose what they addressed for scaling issues.




Old Tumblr

When the company started on Rackspace it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate. This is 2007. They still have custom domains on Rackspace. They route through Rackspace back to their colo space using HAProxy and Varnish. Lots of legacy issues like this.

A traditional LAMP progression.

Historically developed with PHP. Nearly every engineer programs in PHP.

Started with a web server, database server and a PHP application and started growing from there.

To scale they started using memcache, then put in front-end caching, then HAProxy in front of the caches, then MySQL sharding. MySQL sharding has been hugely helpful.

Use a squeeze-everything-out-of-a-single-server approach. In the past year they've developed a couple of backend services in C: an ID generator and Staircar, using Redis to power Dashboard notifications.

The Dashboard uses a scatter-gather approach. Events are displayed when a user accesses their Dashboard. Events for the users you follow are pulled and displayed. This will scale for another 6 months. Since the data is time ordered, sharding schemes don't work particularly well.
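
To make the scatter-gather idea concrete, here is a minimal, hypothetical Scala sketch: fan one query out per followed user, then merge the results newest-first. The data and lookups are in-memory stand-ins for the sharded MySQL reads, not Tumblr's actual code.

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global

    object ScatterGatherDashboard {
      case class Post(authorId: Long, createdAt: Long, body: String)

      // In-memory stand-ins for the sharded MySQL lookups (illustrative data only).
      val following: Map[Long, Seq[Long]] = Map(1L -> Seq(2L, 3L))
      val postsByAuthor: Map[Long, Seq[Post]] = Map(
        2L -> Seq(Post(2L, 100L, "hello"), Post(2L, 300L, "again")),
        3L -> Seq(Post(3L, 200L, "hi")))

      def recentPostsFor(userId: Long, limit: Int): Future[Seq[Post]] =
        Future(postsByAuthor.getOrElse(userId, Nil).sortBy(-_.createdAt).take(limit))

      // Scatter: one query per followed user. Gather: merge the results newest-first.
      def dashboard(userId: Long, pageSize: Int = 20): Seq[Post] = {
        val scattered = following.getOrElse(userId, Nil).map(recentPostsFor(_, pageSize))
        val merged    = Future.sequence(scattered).map(_.flatten.sortBy(-_.createdAt).take(pageSize))
        Await.result(merged, 2.seconds)
      }

      def main(args: Array[String]): Unit =
        dashboard(1L).foreach(println) // posts from followed users, merged by time
    }

The cost of each read grows with the number of followed users, which is why this approach has the limited runway described later in the Cell Design section.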

New Tumblr

Changed to a JVM centric approach for hiring and speed of development reasons.

Goal is to move everything out of the PHP app into services and make the app a thin layer over services that does request authentication, presentation, etc.

Scala and Finagle Selection

Internally they had a lot of people with Ruby and PHP experience, so Scala was appealing.

Finagle was a compelling factor in choosing Scala. It is a library from Twitter. It handles most of the distributed issues like distributed tracing, service discovery, and service registration. You don't have to implement all this stuff. It just comes for free.

Once on the JVM, Finagle provided all the primitives they needed (Thrift, ZooKeeper, etc).

Finagle is being used by Foursquare and Twitter. Scala is also being used by Meetup.


Like the Thrift application interface. It has really good performance.

Liked Netty, but wanted out of Java, so Scala was a good choice.

Picked Finagle because it was cool, knew some of the guys, it worked without a lot of networking code and did all the work needed in a distributed system.

Node.js wasn't selected because it is easier to scale the team with a JVM base. Node.js isn't developed enough to have standards and best practices, a large volume of well tested code. With Scala you can use all the Java code. There's not a lot of knowledge of how to use it in a scalable way and they target 5ms response times, 4 9s HA, 40K requests per second and some at 400K requests per second. There's a lot in the Java ecosystem they can leverage.


Internal services are being shifted from being C/libevent based to being Scala/Finagle based.

Newer, non-relational data stores like HBase and Redis are being used, but the bulk of their data is currently stored in a heavily partitioned MySQL architecture. Not replacing MySQL with HBase.

HBase backs their URL shortener with billions of URLs and all the historical data and analytics. It has been rock solid. HBase is used in situations with high write requirements, like a million writes a second for the Dashboard replacement. HBase wasn't deployed instead of MySQL because they couldn't bet the business on HBase with the people that they had, so they started using it with smaller, less critical path projects to gain experience.

The problem with MySQL and sharding for time series data is that one shard is always really hot. They also ran into read replication lag due to insert concurrency on the slaves.
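
To make the hot-shard point concrete: if the shard is picked from a time-ordered key, every current write lands on the newest shard, while hashing a stable key such as the user ID spreads writes evenly (at the cost of fanning time-range reads out to every shard). A small illustrative sketch; the shard count and key layouts are assumptions, not Tumblr's scheme.

    object ShardRouting {
      val shardCount = 30 // assumed, echoing the "30 shards" figure above

      // Time-range routing: every write happening "now" maps to the same, newest shard.
      def shardByTime(createdAtMillis: Long, epochMillis: Long, millisPerShard: Long): Int =
        (((createdAtMillis - epochMillis) / millisPerShard) % shardCount).toInt

      // Hash routing on a stable key: writes spread out, but time-ordered scans touch every shard.
      def shardByUser(userId: Long): Int =
        (userId.abs % shardCount).toInt

      def main(args: Array[String]): Unit = {
        val now = System.currentTimeMillis
        println((1 to 5).map(_ => shardByTime(now, 0L, 86400000L)).toSet) // one hot shard
        println((1L to 5L).map(shardByUser).toSet)                        // several shards
      }
    }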
Created a common services framework.

Spent a lot of time upfront solving the operations problem of how to manage a distributed system.

Built a kind of Rails scaffolding, but for services. A template is used to bootstrap services internally.

All services look identical from an operations perspective. Checking statistics, monitoring, starting and stopping all work the same way for all services.

Tooling is put around the build process in SBT (a Scala build tool) using plugins and helpers to take care of common activities like tagging things in git, publishing to the repository, etc. Most developers don't have to get in the guts of the build system.
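
As a flavor of what that SBT tooling can look like, here is a minimal, hypothetical custom task that tags the current commit with the project version; a release alias then chains test, tag, and publish. The task and alias names are illustrative, not the plugins Tumblr actually used.

    // build.sbt (sbt build definitions are written in Scala)
    import scala.sys.process._

    val gitTagRelease = taskKey[Unit]("Tag the current commit with the project version and push the tag")

    gitTagRelease := {
      val tag = s"v${version.value}"
      val log = streams.value.log
      Seq("git", "tag", "-a", tag, "-m", s"Release $tag").!!  // shell out to git
      Seq("git", "push", "origin", tag).!!
      log.info(s"Tagged and pushed $tag")
    }

    // One command for the common path: run tests, tag, then publish to the internal repository.
    addCommandAlias("release", ";test ;gitTagRelease ;publish")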


Front-end layer uses HAProxy. Varnish might be hit for public blogs. 40 machines.

500 web servers running Apache and their PHP application.

200 database servers. Many database servers are used for high availability reasons. Commodity hardware is used and the MTBF is surprisingly low. Much more hardware than expected is lost, so there are many spares in case of failure.

6 backend services to support the PHP application. A team is dedicated to developing the backend services. A new service is rolled out every 2-3 weeks. Includes dashboard notifications, dashboard secondary index, URL shortener, and a memcache proxy to handle transparent sharding.


Put a lot of time and effort and tooling into MySQL sharding. MongoDB is not used even though it is popular in NY (their location). MySQL can scale just fine.

Gearman, a job queue system, is used for long running fire-and-forget type work.

Availability is measured in terms of reach. Can a user reach custom domains or the dashboard? Also in terms of error rate.

Historically the highest priority item is fixed. Now failure modes are analyzed and addressed systematically. The intention is to measure success from a user perspective and an application perspective. If part of a request can't be fulfilled, that is accounted for.

Initially an Actor model was used with Finagle, but that was dropped. For fire-and-forget work a job queue is used. In addition, Twitter's utility library contains a Futures implementation, and services are implemented in terms of futures. In situations when a thread pool is needed, futures are passed into a future pool. Everything is submitted to the future pool for asynchronous execution.
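
A minimal sketch of that futures-over-a-pool pattern with Twitter's utility library (com.twitter.util). Only the Future/FuturePool usage reflects the library described above; the service logic is a placeholder.

    import java.util.concurrent.Executors
    import com.twitter.util.{Await, Future, FuturePool}

    object FuturePoolExample {
      // Blocking or CPU-heavy work is wrapped in a FuturePool backed by a thread pool.
      val pool: FuturePool = FuturePool(Executors.newFixedThreadPool(8))

      // Stand-in for a call that needs a thread, e.g. a blocking store lookup.
      def loadDashboardRow(userId: Long): String = s"dashboard-row-for-$userId"

      // Services are written in terms of Futures; callers compose them without touching threads or locks.
      def dashboardFor(userId: Long): Future[String] =
        pool { loadDashboardRow(userId) }.map(row => s"rendered($row)")

      def main(args: Array[String]): Unit =
        println(Await.result(dashboardFor(42L)))
    }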


Scala encourages no shared state. Finagle is assumed correct because it's tested by Twitter in production. Mutable state is avoided using constructs in Scala or Finagle. No long running state machines are used. State is pulled from the database, used, and written back to the database. The advantage is developers don't need to worry about threads or locks.

22 Redis servers. Each server has 8 - 32 instances, so 100s of Redis instances are used in production.
Used for backend storage for dashboard notifications.
A notification is something like a user liked your post. Notifications show up in a user's dashboard to indicate actions other users have taken on their content.
High write ratio made MySQL a poor fit.
Notifications are ephemeral so it wouldn't be horrible if they were dropped, so Redis was an acceptable choice for this function.
Gave them a chance to learn about Redis and get familiar with how it works.
Redis has been completely problem free and the community is great.
A Scala futures based interface for Redis was created. This functionality is now moving into their Cell Architecture.
URL shortener uses Redis as the first level cache and HBase as permanent storage (see the sketch after this list).
Dashboard's secondary index is built around Redis.
Redis is used as Gearman's persistence layer using a memcache proxy built using Finagle.
Slowly moving from memcache to Redis. Would like to eventually settle on just one caching service. Performance is on par with memcache.
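
A rough sketch of the read-through pattern described for the URL shortener: check Redis first, fall back to the permanent store on a miss, then repopulate the cache. The Jedis client is used for brevity and the permanent-store lookup is a hypothetical placeholder; the article doesn't show Tumblr's actual client code.

    import redis.clients.jedis.Jedis

    object ShortUrlLookup {
      private val redis = new Jedis("localhost", 6379)
      private val cacheTtlSeconds = 3600

      // Hypothetical stand-in for the HBase read of the permanent short-URL mapping.
      private def loadFromPermanentStore(shortCode: String): Option[String] =
        Some(s"https://example.tumblr.com/post/$shortCode") // illustrative only

      def resolve(shortCode: String): Option[String] = {
        val key = s"shorturl:$shortCode"
        Option(redis.get(key)) match {
          case hit @ Some(_) => hit // first-level cache hit in Redis
          case None =>
            val fromStore = loadFromPermanentStore(shortCode)
            fromStore.foreach(url => redis.setex(key, cacheTtlSeconds, url)) // warm the cache
            fromStore
        }
      }
    }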






Cell Design For Dashboard Inbox

The current scatter-gather model for providing Dashboard functionality has very limited runway. It won't last much longer.

The solution is to move to an inbox model implemented using a Cell Based Architecture that is similar to Facebook Messages.

An inbox is the opposite of scatter-gather. A user's dashboard, which is made up of posts from followed users and actions taken by other users, is logically stored together in time order.

Solves the scatter-gather problem because it's an inbox. You just ask what is in the inbox, so it's less expensive than going to each user a user follows. This will scale for a very long time.

Rewriting the Dashboard is difficult. The data has a distributed nature, but it has a transactional quality; it's not OK for users to get partial updates.

The amount of data is incredible. Messages must be delivered to hundreds of different users on average, which is a very different problem than Facebook faces. Large data + high distribution rate + multiple datacenters.

Spec'ed at a million writes a second and 50K reads a second. The data set size is 2.7TB of data growth with no replication or compression turned on. The million writes a second is from the 24 byte row key that indicates what content is in the inbox.
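
The article gives only the row key's size (24 bytes), not its layout. As a hedged illustration of how such a key could be composed following common HBase practice (fixed-width IDs plus a reversed timestamp so scans return the newest inbox entries first), here is a sketch; the field choice is an assumption, not Tumblr's documented schema.

    import java.nio.ByteBuffer

    object InboxRowKey {
      // Assumed layout: 8-byte inbox owner id + 8-byte reversed timestamp + 8-byte post id = 24 bytes.
      def rowKey(inboxUserId: Long, createdAtMillis: Long, postId: Long): Array[Byte] =
        ByteBuffer.allocate(24)
          .putLong(inboxUserId)
          .putLong(Long.MaxValue - createdAtMillis) // reversed time => newest-first scans
          .putLong(postId)
          .array()

      def main(args: Array[String]): Unit =
        println(rowKey(42L, System.currentTimeMillis, 7L).length) // 24
    }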
Doing this on an already popular application that has to be kept running.

Cells

A cell is a self-contained installation that has all the data for a range of users. All the data necessary to render a user's Dashboard is in the cell.

Users are mapped into cells. Many cells exist per datacenter.

Each cell has an HBase cluster, service cluster, and Redis caching cluster.

Users are homed to a cell and all cells consume all posts via firehose updates.

Each cell is Finagle based and populates HBase via the firehose and service requests over Thrift.

A user comes into the Dashboard, users are homed to a particular cell, a service node reads their dashboard via HBase, and passes the data back.

Background tasks consume from the firehose to populate tables and process requests.

A Redis caching layer is used for posts inside a cell.

Request flow: a user publishes a post, the post is written to the firehose, all of the cells consume the post and write that post content to the post database, the cells look up to see if any of the followers of the post creator are in the cell, and if so the follower inboxes are updated with the post ID.
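
A compressed sketch of that per-cell consume path: store the post body once per cell, then fan the post ID out only to inboxes of followers homed in this cell, so a dashboard read becomes a single-cell join of inbox IDs to post bodies. All names and the in-memory stores are illustrative placeholders, not Tumblr's services.

    object CellConsumer {
      case class FirehosePost(postId: Long, authorId: Long, body: String)

      // In-memory stand-ins for the per-cell HBase tables described below.
      val postStore  = scala.collection.mutable.Map[Long, String]()     // postId -> body (one copy per cell)
      val inboxStore = scala.collection.mutable.Map[Long, List[Long]]() // userId -> post IDs, newest first
      val followersHomedHere = Map(10L -> Set(1L, 2L))                  // authorId -> followers in this cell

      def consume(post: FirehosePost): Unit = {
        // 1. Every cell stores the post content exactly once.
        postStore(post.postId) = post.body
        // 2. Only followers homed to this cell get the post ID prepended to their inbox.
        for (follower <- followersHomedHere.getOrElse(post.authorId, Set.empty[Long]))
          inboxStore(follower) = post.postId :: inboxStore.getOrElse(follower, Nil)
      }

      // Rendering a dashboard is now a single-cell read: inbox IDs joined to post bodies.
      def dashboard(userId: Long): List[String] =
        inboxStore.getOrElse(userId, Nil).flatMap(id => postStore.get(id))

      def main(args: Array[String]): Unit = {
        consume(FirehosePost(postId = 99L, authorId = 10L, body = "hello dashboard"))
        println(dashboard(1L)) // List(hello dashboard)
      }
    }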
Advantages of cell design:

Massive scale requires parallelization, and parallelization requires components be isolated from each other so there is no interaction. Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.

Cells isolate failures. One cell failure does not impact other cells.

Cells enable nice things like the ability to test upgrades, implement rolling upgrades, and test different versions of software.

The key idea that is easy to miss is: all posts are replicated to all cells.

Each cell stores a single copy of all posts. Each cell can completely satisfy a Dashboard rendering request. Applications don't ask for all the post IDs and then ask for the posts for those IDs. It can return the dashboard content for the user. Every cell has all the data needed to fulfill a Dashboard request without doing any cross cell communication.

Two HBase tables are used: one that stores a copy of each post. That data is small compared to the other table, which stores every post ID for every user within that cell. The second table tells what the user's dashboard looks like, which means they don't have to go fetch all the users a user is following. It also means across clients they'll know if you read a post, and viewing a post on a different device won't mean you read the same content twice. With the inbox model, state can be kept on what you've read.

Posts are not put directly in the inbox because the size is too great. So the ID is put in the inbox and the post content is put in the cell just once. This model greatly reduces the storage needed while making it simple to return a time ordered view of a user's inbox. The downside is each cell contains a complete copy of all posts. Surprisingly, posts are smaller than the inbox mappings. Post growth per day is 50GB per cell; the inbox grows at 2.7TB a day. Users consume more than they produce.

A user's dashboard doesn't contain the text of a post, just post IDs, and the majority of the growth is in the IDs.

As followers change, the design is safe because all posts are already in the cell. If only followed users' posts were stored in a cell then the cell would be out of date as the followers changed and some sort of back fill process would be needed.

An alternative design is to use a separate post cluster to store post text. The downside of this design is that if the cluster goes down it impacts the entire site. Using the cell design and post replication to all cells creates a very robust architecture.

A user having millions of followers who are really active is handled by selectively materializing user feeds by their access model (see Feeding Frenzy).

Different users have different access models and distribution models that are appropriate. Two different distribution modes: one for popular users and one for everyone else.

Data is handled differently depending on the user type. Posts from active users wouldn't actually be published; posts would be selectively materialized.

Users who follow millions of users are treated similarly to users who have millions of followers.

Cell size is hard to determine. The size of a cell determines the impact of a failure: the number of users homed to a cell is the impact. There's a tradeoff to make in what they are willing to accept for the user experience and how much it will cost.

Reading from the firehose is the biggest network issue. Within a cell the network traffic is manageable.

As more cells are added, cells can be placed into a cell group that reads from the firehose and then replicates to all cells within the group. A hierarchical replication scheme. This will also aid in moving to multiple datacenters.


On Being A Startup In New York

NY is a different environment. Lots of finance and advertising. Hiring is challenging because there's not as much startup experience.

In the last few years NY has focused on helping startups. NYU and Columbia have programs for getting students interesting internships at startups instead of just going to Wall Street. Mayor Bloomberg is establishing a local campus focused on technology.




Team Structure

Teams: infrastructure, platform, SRE, product, web ops, services.

Infrastructure: Layer 5 and below. IP address and below, DNS, hardware provisioning.

Platform: core app development, SQL sharding, services, web operations.

SRE: sits between the service team and the web ops team. Focused on more immediate needs in terms of reliability and scalability.

Service team: focuses on things that are slightly more strategic, that are a month or two months out.

Web ops: responsible for problem detection and response, and tuning.




Software Deployment

Started with a set of rsync scripts that distributed the PHP application everywhere. Once the number of machines reached 200 the system started having problems; deploys took a long time to finish and machines would be in various states of the deploy process.

The next phase built the deploy process (development, staging, production) into their service stack using Capistrano. Worked for services on dozens of machines, but by connecting via SSH it started failing again when deploying to hundreds of machines.

Now a piece of coordination software runs on all machines. Based around Func from RedHat, a lightweight API for issuing commands to hosts. Scaling is built into Func.

Build deployment is over Func by saying do X on a set of hosts, which avoids SSH. Say you want to deploy software on group A. The master reaches out to a set of nodes and runs the deploy command.

The deploy command is implemented via Capistrano. It can do a git checkout or pull from the repository. Easy to scale because they are talking HTTP. They like Capistrano because it supports simple directory based versioning that works well with their PHP app. Moving towards versioned updates, where each directory contains a SHA so it's easy to check if a version is correct.

The Func API is used to report back status, to say these machines have these software versions.

Safe to restart any of their services because they'll drain off connections and then restart.

All features run in dark mode before activation.

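The article doesn't say how dark mode is implemented; a common way to run a feature dark is to exercise the new code path on live traffic while still serving the old result, so it can be measured before being exposed. A minimal, hypothetical sketch of that gate:

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    object DarkLaunch {
      // Hypothetical flag store; in practice this would be config-driven per feature.
      val darkModeEnabled: Set[String] = Set("new-dashboard-renderer")

      def oldRender(userId: Long): String = s"old-dashboard($userId)"
      def newRender(userId: Long): Future[String] = Future(s"new-dashboard($userId)")

      // Serve the old path; exercise the new path in the background and only log its output.
      def render(userId: Long): String = {
        if (darkModeEnabled("new-dashboard-renderer"))
          newRender(userId).onComplete {
            case Success(out) => println(s"[dark] would have served: $out")
            case Failure(err) => println(s"[dark] new path failed: ${err.getMessage}")
          }
        oldRender(userId) // users still see the existing behavior
      }

      def main(args: Array[String]): Unit = {
        println(render(7L))
        Thread.sleep(200) // demo only: let the dark-path future finish before the JVM exits
      }
    }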



Development

Started with the philosophy that anyone could use any tool that they wanted, but as the team grew that didn't work. Onboarding new employees was very difficult, so they've standardized on a stack so they can get good with those, grow the team quickly, address production issues more quickly, and build up operations around them.

Process is roughly Scrum like. Lightweight.

Every developer has a preconfigured development machine. It gets updates via Puppet.

Dev machines can roll changes, test, then roll out to staging, and then roll out to production.

Developers use vim and Textmate.

Testing is via code reviews for the PHP application.

On the service side they've implemented a testing infrastructure with commit hooks, Jenkins, and continuous integration and build notifications.




Lessons Learned
Automation everywhere.
MySQL (plus sharding) scales, apps don't.
Redis is amazing.
Scala apps perform fantastically.
Scrap projects when you aren't sure if they will work.
Don't hire people based on their survival through a useless technological gauntlet. Hire them because they fit your team and can do the job.
Select a stack that will help you hire the people you need.
Build around the skills of your team.
Read papers and blog posts. Key design ideas like the cell architecture and selective materialization were taken from elsewhere.
Ask your peers. They talked to engineers from Facebook, Twitter, LinkedIn about their experiences and learned from them. You may not have access to this level, but reach out to somebody somewhere.
Wade, don't jump into technologies. They took pains to learn HBase and Redis before putting them into production by using them in pilot projects or in roles where the damage would be limited.
I’d like to thank Blake very much for the interview. He was very generous with his time and patient with his explanations. Please contact me if you would like to talk about having your architecture profiled.




Hiring Process

Interviews usually avoid math, puzzles, and brain teasers. Try to ask questions focused on work the candidate will actually do. Are they smart? Will they get stuff done? But "gets things done" is difficult to assess. The goal is to find great people rather than keep people out.

Focused on coding. They'll ask for sample code. During phone interviews they will use Collabedit to write shared code.

Interviews are not confrontational, they just want to find the best people. Candidates get to use all their tools, like Google, during the interview. The idea is developers are at their best when they have tools, so that's how they run the interviews.

The challenge is finding people that have the scaling experience required by Tumblr's traffic levels. Few companies in the world are working on the problems they are.

Example: for a new ID generator they needed a JVM process to generate service responses in less than 1ms at a rate of 10K requests per second, with a 500 MB RAM limit and with high availability. They found the serial collector gave the lowest latency for this particular work load. Spent a lot of time on JVM tuning.
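
As a hedged illustration of the kind of service being described (not Tumblr's actual generator, whose internals the article doesn't detail), here is a compact Snowflake-style ID generator: time-ordered 64-bit IDs built from a millisecond timestamp, a worker ID, and a per-millisecond sequence. For a small-heap, latency-sensitive JVM process like this, the serial collector mentioned above is selected with a flag such as -XX:+UseSerialGC.

    object IdGenerator {
      private val epochMillis  = 1325376000000L // assumed custom epoch (2012-01-01)
      private val workerIdBits = 10L
      private val sequenceBits = 12L
      private val maxSequence  = (1L << sequenceBits) - 1

      private var lastTimestamp = -1L
      private var sequence      = 0L

      // 64-bit ID = [timestamp | workerId | sequence]; monotonically increasing per worker.
      def nextId(workerId: Long): Long = synchronized {
        var now = System.currentTimeMillis()
        if (now == lastTimestamp) {
          sequence = (sequence + 1) & maxSequence
          if (sequence == 0) { // sequence exhausted for this millisecond; wait for the next one
            while (now <= lastTimestamp) now = System.currentTimeMillis()
          }
        } else sequence = 0
        lastTimestamp = now
        ((now - epochMillis) << (workerIdBits + sequenceBits)) | (workerId << sequenceBits) | sequence
      }

      def main(args: Array[String]): Unit =
        (1 to 3).foreach(_ => println(nextId(workerId = 1)))
    }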


On the Tumblr Engineering Blog they've posted memorials giving their respects for the passing of Dennis Ritchie & John McCarthy. It's a geeky culture.




Related Articles
Tumblr Engineering Blog - contains a lot of good articles
Building Network Services with Finagle and Ostrich by Nathan Hamblen - Finagle is awesome
Ostrich - A stats collector & reporter for Scala servers
ID Generation at Scale by Blake Matheny
Finagle - a network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language. Finagle provides a rich set of protocol-independent tools.
Finagle Redis client from Tumblr
Tumblr.  Massively Sharded MySQL by Evan Elias - one of the better presentations on MySQL sharding available
Staircar: Redis-powered notifications by Blake Matheny
Flickr Architecture - talks about Flickr's cell architecture



