What is tinyurl?
tinyurl is a URL shortening service: the user enters a long URL and the service returns a shorter, unique URL such as "http://tiny.me/5ie0V2". The highlighted part (here "5ie0V2") can be any 6-character string over [0-9, a-z, A-Z]. That gives 62^6 ~= 56.8 billion unique strings.
General process
- Ask questions; understand the constraints and use cases.
- Abstract design; draw diagrams.
- Identify bottlenecks, both current ones and those that appear as traffic and data volume grow.
- Address these bottlenecks using scalable system design.
1. Discuss use cases with the interviewer
- shortening: take a url => return a shorter one
- redirection: take a short url => redirect to the original one
- high availability
Suppose we have a database table with three columns: id (auto increment), actual url, and shortened url. Another way to produce the id is md5(original_url + random_salt).
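The md5(original_url + random_salt) alternative could look roughly like the sketch below. The class name Md5IdSketch, the helper names, and the 8-byte salt length are assumptions made for illustration; the notes do not specify them.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper: derives an id from md5(originalUrl + salt).
final class Md5IdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns the 32-character lowercase hex form of md5(originalUrl + salt).
    static String md5Hex(String originalUrl, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest((originalUrl + salt).getBytes(StandardCharsets.UTF_8));
            // The 16-byte digest is read as an unsigned integer and left-padded with zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumed salt format: 8 random bytes rendered as hex.
    static String randomSalt() {
        byte[] bytes = new byte[8];
        RANDOM.nextBytes(bytes);
        return new BigInteger(1, bytes).toString(16);
    }
}
```

Because the salt is random, the same long URL gets a different id on each call; the salt (or the resulting id) has to be stored with the record.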
Intuitively, we could design a hash function that maps the actual url to the shortened url. But a direct string-to-string mapping is not easy to compute.
Notice that in the database, each record has a unique id associated with it. What if we convert the id to a shortened url? For that we need a bijective mapping between ids (x) and shortened urls (y):
- Each x must be associated with one and only one y;
- Each y must be associated with one and only one x.
Map each digit value to one character, e.g. 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z. Then the problem becomes a base conversion problem (base 10 to base 62), which is a bijection as long as the id does not overflow 6 digits :).
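import java.util.Map;

// Converts a non-negative id to a 6-character short url key.
// map is the digit-to-character table described above (0-9, a-z, A-Z).
// id is a long because 62^6 does not fit in an int.
public String shorturl(long id, int base, Map<Integer, Character> map) {
    StringBuilder res = new StringBuilder();
    // Emit digits from least significant to most significant.
    while (id > 0) {
        int digit = (int) (id % base);
        res.append(map.get(digit));
        id /= base;
    }
    // Pad with the zero digit so the key is always 6 characters long.
    while (res.length() < 6) res.append('0');
    // Digits were collected in reverse order.
    return res.reverse().toString();
}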
For each input long url, the corresponding id is auto generated (in O(1) time). The base conversion algorithm runs in O(k) time, where k is the number of digits (i.e. k = 6).
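Redirection needs the inverse conversion, from the short url back to the id. A minimal sketch of both directions, assuming the digit order 0-9, a-z, A-Z given above; the class name Base62 and its method names are made up for this example.

```java
// Hypothetical helper showing that base-62 conversion is reversible.
final class Base62 {
    // Digit order follows the mapping in the notes: 0-9, then a-z, then A-Z.
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // id -> short url key, left-padded with '0' to at least 6 characters.
    static String encode(long id) {
        if (id < 0) throw new IllegalArgumentException("id must be non-negative");
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        while (sb.length() < 6) sb.append('0');
        return sb.reverse().toString();
    }

    // short url key -> id; leading '0' characters contribute nothing.
    static long decode(String key) {
        long id = 0;
        for (int i = 0; i < key.length(); i++) {
            int digit = ALPHABET.indexOf(key.charAt(i));
            if (digit < 0) throw new IllegalArgumentException("invalid character: " + key.charAt(i));
            id = id * 62 + digit;
        }
        return id;
    }
}
```

For example, 125 = 2 * 62 + 1, so Base62.encode(125) gives "000021" and Base62.decode("000021") gives 125 again. Both directions take O(k) time.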
4. Analyze bottlenecks, then scale
(1) Traffic
10% from shortening, 90% from redirection
requests per second: 400 (shortening: 40, redirection: 360)
Both operations are computationally light, so traffic is not the bottleneck.
Total urls needed: about 6 billion in 5 years (40 new urls per second)
original url: 500 bytes
short url: 6 bytes (6 characters, one byte each)
data written per second: 40 * (500 + 6) bytes ~= 20 KB
data read per second: 360 * 506 bytes ~= 180 KB
So the data going in and out of the pipe is small; I/O is not the bottleneck.
(2) Data
3 TB for all original urls, 36 GB for short urls (over 5 years)
This is the bottleneck, so storage needs to scale.
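The estimates above can be rechecked with a few lines of arithmetic. This is only a sketch of the calculation; the class and constant names are made up, and a year is taken as 365 days.

```java
// Hypothetical back-of-the-envelope calculator using the numbers from these notes.
final class CapacityEstimate {
    static final long SHORTEN_PER_SEC = 40;
    static final long REDIRECT_PER_SEC = 360;
    static final long ORIGINAL_URL_BYTES = 500;
    static final long SHORT_URL_BYTES = 6;
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60; // assumes 365-day years

    // Number of urls created over the given number of years.
    static long urlsInYears(int years) {
        return SHORTEN_PER_SEC * SECONDS_PER_YEAR * years;
    }

    // Bytes written per second: each new record stores both urls.
    static long writeBytesPerSec() {
        return SHORTEN_PER_SEC * (ORIGINAL_URL_BYTES + SHORT_URL_BYTES);
    }

    // Bytes read per second: each redirection reads one record.
    static long readBytesPerSec() {
        return REDIRECT_PER_SEC * (ORIGINAL_URL_BYTES + SHORT_URL_BYTES);
    }

    // Storage needed for urlCount values of bytesPerUrl bytes each.
    static long storageBytes(long urlCount, long bytesPerUrl) {
        return urlCount * bytesPerUrl;
    }
}
```

urlsInYears(5) comes to about 6.3 billion, which the notes round to 6 billion; 6 billion * 500 bytes is 3 TB and 6 billion * 6 bytes is 36 GB.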
5. Scalable design
(1) Application service layer
Add a load balancer and a machine cluster that grows over time: add machines when traffic spikes and remove them when it returns to normal (e.g. Amazon ELB).
(2) Data storage
1) billions of objects
2) each object is small, < 1 KB
3) no relationships between objects
4) reads are 9x more frequent than writes (360 vs. 40 per second)
5) 3 TB of original urls, 36 GB of short urls
First approach: mysql
1) use one table: short_url: varchar(6), original_url: varchar(512)
2) unique index on the short url (36 GB + index overhead), held in memory
3) sharding: take the first char of short_url mod the number of partitions
4) master-slave replication, master-master replication
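The sharding rule in 3) can be sketched as below. The class name ShardRouter is made up, and "first char" is interpreted here as the character's numeric code, which is an assumption.

```java
// Hypothetical router: picks a partition from the first character of the short url.
final class ShardRouter {
    // Returns a partition index in [0, partitions).
    static int shardFor(String shortUrl, int partitions) {
        if (shortUrl.isEmpty()) throw new IllegalArgumentException("shortUrl must not be empty");
        if (partitions <= 0) throw new IllegalArgumentException("partitions must be positive");
        // Assumption: use the character code of the first char, e.g. 'a' is 97.
        return shortUrl.charAt(0) % partitions;
    }
}
```

With 4 partitions, a short url starting with 'a' (code 97) goes to partition 1. One caveat: if short urls are zero-padded sequential ids, every key starts with '0' until the id reaches 62^5, so all early traffic lands on one partition; sharding on the last character may spread load more evenly.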
Second approach
We can use a distributed database. But maintaining such a db is much more complicated (replicating data across servers, syncing among servers to get a unique id, etc.). Alternatively, we can use a distributed key-value datastore.
Some distributed datastores (e.g. Amazon's Dynamo) use consistent hashing: servers and inputs are both hashed to integers, and the server for an input is located using the input's hash value. We can apply the base conversion algorithm to the hash value of the input.
The basic process can be:
Insert
- Hash the input long url into a single integer (the key);
- Locate a server on the ring and store the key-longUrl pair on that server;
- Compute the shortened url from the key using base conversion (from base 10 to base 62) and return it to the user.
Retrieve
- Convert the shortened url back to the key using base conversion (from base 62 to base 10);
- Locate the server containing that key and return the longUrl.
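The steps above can be sketched with an in-memory ring. This is a toy model, not how Dynamo is implemented: the class name RingSketch, the choice of CRC32 as the hash, and the per-server HashMap standing in for remote storage are all assumptions. CRC32 is used only because its 32-bit value always fits in 6 base-62 digits; two long urls with the same hash would collide, which a real system has to handle.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical in-memory model of a consistent-hashing ring of url stores.
final class RingSketch {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // Ring position -> server name.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    // Server name -> (key -> longUrl); stands in for each server's storage.
    private final Map<String, Map<Long, String>> stores = new HashMap<>();

    RingSketch(String... servers) {
        if (servers.length == 0) throw new IllegalArgumentException("need at least one server");
        for (String server : servers) {
            ring.put(hash(server), server);
            stores.put(server, new HashMap<>());
        }
    }

    // Assumed hash: CRC32 of the UTF-8 bytes, a value in [0, 2^32).
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // First server at or after the key on the ring, wrapping around to the start.
    String serverFor(long key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(key);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Insert: hash the long url, store key -> longUrl, return the base-62 key.
    String insert(String longUrl) {
        long key = hash(longUrl);
        stores.get(serverFor(key)).put(key, longUrl);
        return encode(key);
    }

    // Retrieve: decode the key, locate its server, return the long url (or null).
    String lookup(String shortUrl) {
        long key = decode(shortUrl);
        return stores.get(serverFor(key)).get(key);
    }

    static String encode(long key) {
        StringBuilder sb = new StringBuilder();
        while (key > 0) {
            sb.append(ALPHABET.charAt((int) (key % 62)));
            key /= 62;
        }
        while (sb.length() < 6) sb.append('0');
        return sb.reverse().toString();
    }

    static long decode(String shortUrl) {
        long key = 0;
        for (int i = 0; i < shortUrl.length(); i++) {
            int digit = ALPHABET.indexOf(shortUrl.charAt(i));
            if (digit < 0) throw new IllegalArgumentException("invalid character in short url");
            key = key * 62 + digit;
        }
        return key;
    }
}
```

Usage: create new RingSketch("server-a", "server-b", "server-c"), call insert(longUrl) to get a 6-character key, and pass that key to lookup to get the long url back. The server names are placeholders.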