How to design Instagram?

Instagram Architecture: 14 Million Users, Terabytes Of Photos, 100s Of Instances, Dozens Of Technologies

Instagram is a free photo sharing and social networking service for your iPhone that has been an instant success. Growing to 14 million users in just over a year, they reached 150 million photos in August while amassing several terabytes of photos, and they did this with just 3 Instaneers, all on the Amazon stack.

The Instagram team has written up what can be considered the canonical description of an early stage startup in this era: What Powers Instagram: Hundreds of Instances, Dozens of Technologies.

Instagram uses a pastiche of different technologies and strategies. The team is small yet has experience rapid growth riding the crest of a rising social and mobile wave, it uses a hybrid of SQL and NoSQL, it uses a ton of open source projects, they chose the cloud over colo, Amazon services are highly leveraged rather than building their own, reliability is through availability zones, async work scheduling links components together, the system is composed as much as possible of services exposing an API and external services they don't have to build, data is stored in-memory and in the cloud, most code is in a dynamic language, custom bits have been coded to link everything together, and they have gone fast and kept small. A very modern construction.

We'll just tl;dr the article here, it's very well written and to the point. Definitely worth reading. Here are the essentials: 

  • Lessons learned: 1) Keep it very simple 2) Don’t re-invent the wheel 3) Go with proven and solid technologies when you can.
  • 3 Engineers.
  • Amazon shop. They use many of Amazon's services. With only 3 engineers so don’t have the time to look at self hosting.
  • 100+ EC2 instances total for various purposes.
  • Ubuntu Linux 11.04 (“Natty Narwhal”). Solid, other Ubuntu versions froze on them.
  • Amazon’s Elastic Load Balancer routes requests and 3 nginx instances sit behind the ELB.
  • SSL terminates at the ELB, which lessens the CPU load on nginx.
  • Amazon’s Route53 for the DNS.
  • 25+ Django application servers on High-CPU Extra-Large machines.
  • Traffic is CPU-bound rather than memory-bound, so High-CPU Extra-Large machines are a good balance of memory and CPU.
  • Gunicorn as their WSGI server. Apache harder to configure and more CPU intensive.
  • Fabric is used to execute commands in parallel on all machines. A deploy takes only seconds.
  • PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
  • Twelve PostgreSQL replicas run in a different availability zone.
  • PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups. 
  • EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
  • All of their working set is stored memory. EBS doesn’t support enough disk seeks per second.
  • Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
  • XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
  • Pgbouncer is used pool connections to PostgreSQL.
  • Several terabytes of photos are stored on Amazon S3.
  • Amazon CloudFront as the CDN.
  • Redis powers their main feed, activity feed, sessions system, and other services.
  • Redis runs on several Quadruple Extra-Large Memory instances. Occasionally shard across instances.
  • Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots backup the DB dumps. Dumping on the DB on the master was too taxing.
  • Apache Solr powers the geo-search API. Like the simple JSON interface.
  • 6 memcached instances for caching. Connect using pylibmc & libmemcached. Amazon Elastic Cache service isn't any cheaper.
  • Gearman is used to: asynchronously share photos to Twitter, Facebook, etc; notifying real-time subscribers of a new photo posted; feed fan-out.
  • 200 Python workers consume tasks off the Gearman task queue.
  • Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
  • Munin to graph metrics across the system and alert on problems. Write many custom plugins using Python-Munin to graph, signups per minute, photos posted per second, etc.
  • Pingdom for external monitoring of the service.
  • PagerDuty for handling notifications and incidents.
  • Sentry for Python error reporting.

Sharding & IDs at Instagram

With more than 25 photos and 90 likes every second, we store a lot of data here at Instagram. To make sure all of our important data fits into memory and is available quickly for our users, we’ve begun to shard our data — in other words, place the data in many smaller buckets, each holding a part of the data.

Our application servers run Django with PostgreSQL as our back-end database. Our first question after deciding to shard out our data was whether PostgreSQL should remain our primary data-store, or whether we should switch to something else. We evaluated a few different NoSQL solutions, but ultimately decided that the solution that best suited our needs would be to shard our data across a set of PostgreSQL servers.

Before writing data into this set of servers, however, we had to solve the issue of how to assign unique identifiers to each piece of data in the database (for example, each photo posted in our system). The typical solution that works for a single database — just using a database’s natural auto-incrementing primary key feature — no longer works when data is being inserted into many databases at the same time. The rest of this blog post addresses how we tackled this issue.

Before starting out, we listed out what features were essential in our system:

  1. Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)
  2. IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
  3. The system should introduce as few new ‘moving parts’ as possible — a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.

Existing solutions

Many existing solutions to the ID generation problem exist; here are a few we considered:

Generate IDs in web application

This approach leaves ID generation entirely up to your application, and not up to the database at all. For example, MongoDB’s ObjectId, which is 12 bytes long and encodes the timestamp as the first component. Another popular approach is to use UUIDs.

Pros:

  1. Each application thread generates IDs independently, minimizing points of failure and contention for ID generation
  2. If you use a timestamp as the first component of the ID, the IDs remain time-sortable

Cons:

  1. Generally requires more storage space (96 bits or higher) to make reasonable uniqueness guarantees
  2. Some UUID types are completely random and have no natural sort
Generate IDs through dedicated service

Ex: Twitter’s Snowflake, a Thrift service that uses Apache ZooKeeper to coordinate nodes and then generates 64-bit unique IDs

Pros:

  1. Snowflake IDs are 64-bits, half the size of a UUID
  2. Can use time as first component and remain sortable
  3. Distributed system that can survive nodes dying

Cons:

  1. Would introduce additional complexity and more ‘moving parts’ (ZooKeeper, Snowflake servers) into our architecture
DB Ticket Servers

Uses the database’s auto-incrementing abilities to enforce uniqueness. Flickr uses this approach, but with two ticket DBs (one on odd numbers, the other on even) to avoid a single point of failure.

Pros:

  1. DBs are well understood and have pretty predictable scaling factors

Cons:

  1. Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
  2. An additional couple of machines (or EC2 instances) to admin
  3. If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.

Of all the approaches above, Twitter’s Snowflake came the closest, but the additional complexity required to run an ID service was a point against it. Instead, we took a conceptually similar approach, but brought it inside PostgreSQL.

Our solution

Our sharded system consists of several thousand ‘logical’ shards that are mapped in code to far fewer physical shards. Using this approach, we can start with just a few database servers, and eventually move to many more, simply by moving a set of logical shards from one database to another, without having to re-bucket any of our data. We used Postgres’ schemas feature to make this easy to script and administrate.

Schemas (not to be confused with the SQL schema of an individual table) are a logical grouping feature in Postgres. Each Postgres DB can h2have several schemas, each of which can contain one or more tables. Table names must only be unique per-schema, not per-DB, and by default Postgres places everything in a schema named ‘public’.

Each ‘logical’ shard is a Postgres schema in our system, and each sharded table (for example, likes on our photos) exists inside each schema.

We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.

Each of our IDs consists of:

  • 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
  • 13 bits that represent the logical shard ID
  • 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond

Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch, so to start our ID, we fill the left-most 41 bits with this value with a left-shift:

id = 1387263000 <<(64-41)

Next, we take the shard ID for this particular piece of data we’re trying to insert. Let’s say we’re sharding by user ID, and there are 2000 logical shards; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341. We fill the next 13 bits with this value:

id |= 1341 <<(64-41-13)

Finally, we take whatever the next value of our auto-increment sequence (this sequence is unique to each table in each schema) and fill out the remaining bits. Let’s say we’d generated 5,000 IDs for this table already; our next value is 5,001, which we take and mod by 1024 (so it fits in 10 bits) and include it too:

id |= (5001 % 1024)

We now have our ID, which we can return to the application server using the RETURNING keyword as part of the INSERT.

Here’s the PL/PGSQL that accomplishes all this (for an example schema insta5):

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
SELECT nextval('insta5.table_id_seq') %% 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id <<10);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

And when creating the table, we do:

CREATE TABLE insta5.our_table (
"id" bigint NOT NULL DEFAULT insta5.next_id(),
...rest of table schema...
)

And that’s it! Primary keys that are unique across our application (and as a bonus, contain the shard ID in them for easier mapping). We’ve been rolling this approach into production and are happy with the results so far. Interested in helping us figure out these problems at scale? We’re hiring!

转载于:https://www.cnblogs.com/liangmou/p/8141903.html

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值