URL shortener

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef". Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 62^6 ≈ 56.8 billion possible strings.

Edit: Due to the ongoing interest in this topic, I've uploaded the code that I used to GitHub, with implementations for Java, PHP and JavaScript. Add your solutions if you like :)

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 produce strings that are too long, so I don't think I can use them. A self-built algorithm would work, too.

My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:

short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.

That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?

@gudge The point of those functions is that they have an inverse function. This means you can have both encode() and decode() functions. The steps are, therefore: (1) Save URL in database (2) Get unique row ID for that URL from database (3) Convert integer ID to short string with encode(), e.g. 273984 to f5a4 (4) Use the short string (e.g. f5a4) in your sharable URLs (5) When receiving a request for a short string (e.g. 20a8), decode the string to an integer ID with decode() (6) Look up URL in database for given ID. For conversion, use: github.com/delight-im/ShortURL –  Marco W.  Feb 10 '15 at 10:31
 
@Marco, what's the point of storing the hash in the database? –  Maksim Vi.  Jul 11 '15 at 9:04
 
@MaksimVi. If you have an invertible function, there's none. If you had a one-way hash function, there would be one. –  Marco W.  Jul 14 '15 at 14:47
 
Would it be wrong to use a simple CRC32 algorithm to shorten a URL? A collision is very unlikely (a CRC32 output is usually 8 hexadecimal characters long, which gives us over 4 billion possibilities), but if a generated CRC32 output was already used previously and was found in the database, we could salt the long URL with a random number until we find a CRC32 output that is unique in the database. How bad or different or ugly would this be for a simple solution? –  syedrakib  Mar 22 at 9:41

19 Answers

Accepted answer (score 493)

I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background

You need a bijective function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:

  • There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
  • and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL

  1. Think of an alphabet we want to use. In your case that's [a-zA-Z0-9]. It contains 62 characters.
  2. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).

    For this example I will use 125₁₀ (125 in base 10).

  3. Now you have to convert 125₁₀ to X₆₂ (base 62).

    125₁₀ = 2×62¹ + 1×62⁰ = [2,1]

    This requires use of integer division and modulo. A pseudo-code example:

    digits = []
    
    while num > 0
      remainder = modulo(num, 62)
      digits.push(remainder)
      num = divide(num, 62)
    
    digits = digits.reverse
    

    Now map the indices 2 and 1 to your alphabet. This is what your mapping (stored in an array, for example) could look like:

    0  → a
    1  → b
    ...
    25 → z
    26 → A
    ...
    51 → Z
    52 → 0
    ...
    61 → 9
    

    With 2 → c and 1 → b you will receive cb₆₂ as the shortened URL.

    http://shor.ty/cb
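
The conversion steps above could be sketched in Java roughly as follows. This is only an illustration under assumptions, not the answerer's own code: the class name Base62Sketch and the helper toDigits are hypothetical, and the alphabet order (a-z, then A-Z, then 0-9) is taken from the mapping shown above.

```java
// Illustrative sketch; Base62Sketch and toDigits are hypothetical names.
class Base62Sketch {
    // Index 0 is 'a', 25 is 'z', 26 is 'A', 51 is 'Z', 52 is '0', 61 is '9'.
    static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Returns the base-62 digits of a non-negative id, most significant first.
    static int[] toDigits(long id) {
        if (id < 0) throw new IllegalArgumentException("id must be non-negative");
        int[] buffer = new int[11];            // 62^11 exceeds Long.MAX_VALUE, so 11 digits suffice
        int count = 0;
        do {
            buffer[count++] = (int) (id % 62); // the remainder is the next least significant digit
            id /= 62;                          // integer division drops that digit
        } while (id > 0);                      // do-while so that 0 still yields one digit
        int[] digits = new int[count];
        for (int i = 0; i < count; i++) {
            digits[i] = buffer[count - 1 - i]; // reverse into most-significant-first order
        }
        return digits;
    }

    // Maps each digit to its character in the alphabet.
    static String encode(long id) {
        StringBuilder sb = new StringBuilder();
        for (int d : toDigits(id)) {
            sb.append(ALPHABET.charAt(d));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(toDigits(125))); // [2, 1]
        System.out.println("http://shor.ty/" + encode(125));          // http://shor.ty/cb
    }
}
```

The do-while loop is a deliberate choice here: with a plain while loop, an ID of 0 would produce an empty string instead of "a".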
    

How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

  1. e9a₆₂ will be resolved to "4th, 61st, and 0th letter in alphabet".

    e9a₆₂ = [4,61,0] = 4×62² + 61×62¹ + 0×62⁰ = 19158₁₀

  2. Now find your database-record with WHERE id = 19158 and do the redirect.
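
As a rough Java sketch of the two steps above (again an illustration, not the answerer's code): the class name ResolveSketch is hypothetical, and a plain in-memory map stands in for the database table, so resolve() only imitates the WHERE id = ... lookup.

```java
// Illustrative sketch; ResolveSketch and the in-memory "table" are hypothetical.
class ResolveSketch {
    static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Reverse lookup: a character's index in the alphabet is its base-62 digit.
    static long decode(String code) {
        long id = 0;
        for (int i = 0; i < code.length(); i++) {
            int digit = ALPHABET.indexOf(code.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("not in alphabet: " + code.charAt(i));
            }
            id = id * 62 + digit; // shift the previous digits one place and add the new one
        }
        return id;
    }

    // Stand-in for the database lookup by id; returns null if no row matches.
    static String resolve(java.util.Map<Long, String> table, String code) {
        return table.get(decode(code));
    }

    public static void main(String[] args) {
        java.util.Map<Long, String> table = new java.util.HashMap<>();
        table.put(19158L, "http://www.google.de/");
        System.out.println(decode("e9a"));         // 19158
        System.out.println(resolve(table, "e9a")); // http://www.google.de/
    }
}
```

Rejecting characters outside the alphabet matters because indexOf returns -1 for them, which would otherwise silently produce a wrong ID.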


Don't forget to sanitize the URLs for malicious javascript code! Remember that javascript can be base64 encoded in a URL so just searching for 'javascript' isn't good enough. –  Bjorn Tipling  Apr 14 '09 at 8:05
A function must be bijective (injective and surjective) to have an inverse. –  Gumbo  May 4 '10 at 20:28
Food for thought, it might be useful to add a two character checksum to the url. That would prevent direct iteration of all the urls in your system. Something simple like f(checksum(id) % (62^2)) + f(id) = url_id –  koblas  Sep 4 '10 at 13:53
As far as sanitizing the urls go, one of the problems you're going to face is spammers using your service to mask their URLS to avoid spam filters. You need to either limit the service to known good actors, or apply spam filtering to the long urls. Otherwise you WILL be abused by spammers. –  Edward Falk  May 26 '13 at 15:34
Base62 may be a bad choice because it has the potential to generate f* words (for example, 3792586=='F_ck' with u in the place of _). I would exclude some characters like u/U in order to minimize this. –  Paulo Scardine  Jun 28 '13 at 16:02

Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say your character space (A-Z, a-z, 0-9, etc.) has 40 characters: convert the id to a base-40 number and use the characters as the digits.

Aside from the fact that A-Z, a-z and 0-9 = 62 chars, not 40, you are right on the mark. –  Evan Teran  Apr 12 '09 at 16:39
 
Thanks! Should I use the base-62 alphabet then?  en.wikipedia.org/wiki/Base_62 But how can I convert the ids to a base-62 number? –  Marco W.  Apr 12 '09 at 16:46
 
Using a base conversion algorithm of course - en.wikipedia.org/wiki/Base_conversion#Change_of_radix –  shoosh  Apr 12 '09 at 16:48
 
Thank you! That's really simple. :) Do I have to do this until the dividend is 0? Will the dividend always be 0 at some point? –  Marco W.  Apr 12 '09 at 17:04
With enough resources and time you can "browse" all the URLs of any URL shortening service. –  shoosh Apr 12 '09 at 21:10
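public class UrlShortener {
    // Base-62 alphabet: index 0 is 'a', index 61 is '9'.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int    BASE     = ALPHABET.length();

    // Converts a non-negative ID to its base-62 string.
    public static String encode(int num) {
        if ( num < 0 )
            throw new IllegalArgumentException("ID must not be negative: " + num);
        StringBuilder sb = new StringBuilder();
        // do-while so that 0 encodes to "a" instead of the empty string
        do {
            sb.append( ALPHABET.charAt( num % BASE ) );
            num /= BASE;
        } while ( num > 0 );
        return sb.reverse().toString();
    }

    // Converts a base-62 string back to the ID; inverse of encode().
    public static int decode(String str) {
        int num = 0;
        for ( int i = 0; i < str.length(); i++ ) {
            int digit = ALPHABET.indexOf(str.charAt(i));
            if ( digit < 0 )
                throw new IllegalArgumentException("Invalid character: " + str.charAt(i));
            num = num * BASE + digit;
        }
        return num;
    }
}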

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters so similarly that it is nearly impossible to tell them apart) and downright error-prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.
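
One way to get such a fixed num-char-char-num-char-char shape from a database ID is a mixed-radix conversion, where each position has its own character set. The Java sketch below is only an assumption about how this could be done; the class name PatternCodeSketch, the lowercase-only letter set, and the zero-padding to six characters are illustrative choices that the answer does not specify.

```java
// Illustrative sketch; PatternCodeSketch and its layout choices are hypothetical.
class PatternCodeSketch {
    static final String DIGITS = "0123456789";
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";
    // One character set per position: num-char-char-num-char-char.
    static final String[] PATTERN = { DIGITS, LETTERS, LETTERS, DIGITS, LETTERS, LETTERS };
    // Number of distinct codes: 10 * 26 * 26 * 10 * 26 * 26.
    static final long CAPACITY = 45_697_600L;

    // Fills the positions from right to left, each with its own radix.
    static String encode(long id) {
        if (id < 0 || id >= CAPACITY) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        char[] out = new char[PATTERN.length];
        for (int pos = PATTERN.length - 1; pos >= 0; pos--) {
            String set = PATTERN[pos];
            out[pos] = set.charAt((int) (id % set.length()));
            id /= set.length();
        }
        return new String(out);
    }

    // Inverse of encode(): reads left to right, multiplying by each position's radix.
    static long decode(String code) {
        if (code.length() != PATTERN.length) {
            throw new IllegalArgumentException("expected " + PATTERN.length + " characters");
        }
        long id = 0;
        for (int pos = 0; pos < PATTERN.length; pos++) {
            String set = PATTERN[pos];
            int digit = set.indexOf(code.charAt(pos));
            if (digit < 0) {
                throw new IllegalArgumentException("bad character at position " + pos);
            }
            id = id * set.length() + digit;
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(encode(125));         // 0aa0ev
        System.out.println(decode("0aa0ev"));    // 125
        System.out.println(encode(CAPACITY - 1)); // 9zz9zz
    }
}
```

With these choices the scheme covers about 45.7 million IDs, which illustrates the reduced number of combinations mentioned above.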

Thank you, very good idea. I haven't thought about that yet. It's clear that it depends on the kind of use whether that makes sense or not. –  Marco W.  Apr 12 '09 at 18:22
It won't be an issue if people are strictly copy-and-pasting the short urls. –  Edward Falk  May 26 '13 at 15:35

My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.
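
In Java, base 36 needs no hand-written conversion, because the standard library's Long.toString(long, int) and Long.parseLong(String, int) accept a radix of up to 36. A minimal sketch, assuming non-negative IDs; the class name Base36Sketch is hypothetical, and the ID 239472 is the example from the question.

```java
// Illustrative sketch; Base36Sketch is a hypothetical name.
class Base36Sketch {
    // Encodes a non-negative ID using the digits 0-9 followed by lowercase a-z.
    static String encode(long id) {
        if (id < 0) throw new IllegalArgumentException("id must be non-negative");
        return Long.toString(id, 36);
    }

    // Decodes a base-36 string; parseLong accepts upper and lower case letters alike.
    static long decode(String code) {
        return Long.parseLong(code, 36);
    }

    public static void main(String[] args) {
        System.out.println(encode(239472L)); // 54s0
        System.out.println(decode("54S0"));  // 239472
    }
}
```

Because decoding ignores case, a short URL read out over the telephone resolves to the same ID however the listener types it.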

Thanks, you're right. Whether you have 2,176,782,336 possibilities or 56,800,235,584, it's the same: Both will be enough. So I will use base 36 encoding. –  Marco W.  Apr 14 '09 at 18:22
 
It may be obvious but here is some PHP code referenced in wikipedia to do base64 encode in php tonymarston.net/php-mysql/converter.html –  Ryan White  Jul 13 '10 at 15:33 