System Design: Designing a URL Shortening Service
From a bird’s-eye view, a URL shortening service provides you with a short, easy-to-remember URL to use in place of a long original URL. Ever heard of TinyURL? It is a URL shortening service.
The URL below is long and practically impossible to remember.
https://www.yourwebsite.com/documentation-guide/make-money-online/mY7s345G
A URL shortening service could provide you with something like this:
https://www.shortservice.com/ab2sd32e
This short, relatively easy-to-remember URL is much more convenient than the original one.
Requirements
Let’s make a clear note of what is expected of our service. This is always the first step for designing any service.
There are two types of requirements: Functional and Non-Functional
Functional requirements tell you what your service is supposed to do. Non-functional requirements tell you in what manner your service is supposed to do it.
Functional Requirements
- Ability to generate a short and unique alias for a given URL.
- Ability to redirect to the original URL when the user accesses the short URL.
These are the two fundamental requirements that every URL shortening service must satisfy. But if you are building a professional service, you might want to add the following requirements as well:
- Short URLs should expire after their expiration timestamp.
- Ability to let users pick a custom short URL for their original URL. This could be a paid feature.
Non-Functional Requirements
- Redirection must be fast. Users cannot afford any delay while a short URL redirects to its original URL. The service must work with minimal latency.
- The service should be highly available. Whenever the service is down, there are no redirections. Result? Pissed off, clueless users. This is the most important requirement.
A naive design
This service appears to be simple enough. If we do not consider the non-functional requirements, the high-level design could be as simple as this: a single Short URL service that generates the short URLs and stores the short-to-long URL mappings in a single database.
Problems with this design
- This service will hit the DB for every URL redirection, which increases latency. How do we solve this problem? Introduce caching.
- The Short URL service is a single point of failure. This is a red flag in any design. Solution? Introduce multiple instances of this service.
Even after we solve these two problems, it still won't be an efficient system. Before we see why, let's make some estimates about the traffic and data that our service could encounter.
Estimating Traffic
Our service performs two main jobs. First, it generates a short URL for a given long URL and saves the pair in its database; this is a write request. Then, whenever a user follows the short URL, nothing is stored, so it is a read request. Evidently, our service will be read-heavy. A 1:100 writes-to-reads ratio is a safe assumption. (Discuss this with the interviewer.)
We will be designing our service in a scalable manner. Let's say at the peak of our service, we get 500 million URL shortening requests per month. (Again, discuss this number with the interviewer.) These are the write requests, so as per our assumed writes-to-reads ratio, we should be prepared for 100 * 500 million = 50 billion URL redirections every month. Let's calculate the QPS (Queries Per Second) for this service.
Queries Per Second
URL shortening requests per second = 500 million / (30 * 24 * 60 * 60) ≈ 200/s
Similarly, URL redirection requests per second = 100 * 200 = 20,000/s
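A quick sanity check of these numbers in code, using the assumed 500 million requests per month and the 1:100 writes-to-reads ratio:

```python
# Back-of-the-envelope traffic estimate.
# Assumptions from above: 500 million shortening requests/month, 1:100 writes-to-reads ratio.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60        # roughly 2.6 million seconds

write_qps = 500_000_000 / SECONDS_PER_MONTH  # URL shortening requests per second
read_qps = 100 * write_qps                   # URL redirection requests per second

print(f"Write QPS ~ {write_qps:.0f}")        # ~193, i.e. roughly 200/s
print(f"Read QPS  ~ {read_qps:.0f}")         # ~19,300, i.e. roughly 20,000/s
```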
Estimating Storage
To estimate the storage, we first need to decide the default expiration time of a short URL. Our service will store a short URL and its corresponding long URL for 10 years. So the total number of long URLs that we will store is:
500 million/month * 12 months * 10 years = 60 billion URLs
A long URL rarely has more than 500 characters. One character takes 1 byte, so 500 bytes per URL is a fair assumption.
Total storage needed = 60 billion * 500 bytes = 30 TB
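The same estimate in code, assuming 500 million URLs per month, 10-year retention, and roughly 500 bytes per long URL, as above:

```python
# Back-of-the-envelope storage estimate, using the assumptions above.
urls_per_month = 500_000_000
months_retained = 10 * 12        # 10-year retention
bytes_per_url = 500              # generous upper bound for a long URL

total_urls = urls_per_month * months_retained       # 60 billion URLs
total_bytes = total_urls * bytes_per_url             # 30 * 10^12 bytes

print(f"URLs stored: {total_urls:,}")                # 60,000,000,000
print(f"Storage    : {total_bytes / 1e12:.0f} TB")   # 30 TB
```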
So we will need 30 TB of storage in our database. We will be using a NoSQL wide-column database for this service. Why?
- Reads and writes are fast.
- There are no relations between our entities. We will essentially be storing key-value pairs of short-to-long URLs. (This will be explained later.)
Cassandra is a great option. Let’s go with that.
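As a rough sketch, the record we store per short URL is essentially a key-value pair keyed by the short sequence. The field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record stored in the wide-column DB; the sequence acts as the key,
# so a redirection lookup is a single key-value read.
@dataclass
class UrlRecord:
    sequence: str          # e.g. "ab2sd32e", the short URL's key
    long_url: str          # the original URL supplied by the user
    created_at: datetime
    expires_at: datetime   # consulted later by the Cleanup Service
```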
Caching
Hitting our DB for every redirection request would be highly inefficient and would increase latency, so we must cache some URLs. But which URLs?
We benefit from caching only if we cache the URLs that generate the most traffic to our service. Let's call these frequently accessed URLs "hot URLs". Assuming the 80-20 rule applies here, 20% of the URLs would generate 80% of the traffic. We will be caching these 20% hot URLs.
It makes sense to cache URLs on a daily basis rather than a monthly basis. So we will cache 20% of our daily URL redirections.
We get 20,000 * 60 * 60 * 24 = 1.7 billion URL redirections/day.
We will cache 20% * 1.7 billion * 500 bytes = 170 GB
170 GB is not much data, and can be stored on a single server. To avoid a single point of failure, we can have multiple replicas of this cache server.
Cache Eviction policy
If we want to add a new short/long URL pair to the cache and the cache is full, we need to evict a pair from the cache. For our use case, the LRU (Least Recently Used) policy makes the most sense: the URL that was used least recently gets discarded.
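A minimal in-memory LRU sketch (in practice we would use Redis or Memcached, as noted below, but this shows the eviction behaviour):

```python
from collections import OrderedDict

# Minimal LRU cache sketch: recently used keys move to the end of the ordering,
# and the least recently used key is evicted once capacity is exceeded.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: OrderedDict[str, str] = OrderedDict()

    def get(self, sequence: str) -> str | None:
        if sequence not in self._store:
            return None                       # cache miss: caller falls back to the DB
        self._store.move_to_end(sequence)     # mark as most recently used
        return self._store[sequence]

    def put(self, sequence: str, long_url: str) -> None:
        if sequence in self._store:
            self._store.move_to_end(sequence)
        self._store[sequence] = long_url
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the least recently used entry
```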
Cache Instances update
Whenever a cache server is updated, whether due to an eviction or a cache miss, we need to make sure that all the replicas are updated as well.
We can either implement our own caching system or use an off-the-shelf distributed in-memory cache like Memcached or Redis.
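If we pick Redis, the read path becomes a thin wrapper around the cache and the DB. A sketch assuming the standard redis-py client; `fetch_from_db` and the host name are hypothetical placeholders for the Cassandra lookup:

```python
import redis

cache = redis.Redis(host="cache.internal", port=6379)   # hypothetical cache host

def fetch_from_db(sequence: str) -> str | None:
    """Placeholder for the Cassandra key-value lookup."""
    ...

def resolve(sequence: str) -> str | None:
    """Return the long URL for a sequence, preferring the cache over the DB."""
    cached = cache.get(sequence)
    if cached is not None:
        return cached.decode()
    long_url = fetch_from_db(sequence)
    if long_url is not None:
        # Keep hot URLs for a day; Redis applies its configured eviction policy when full.
        cache.set(sequence, long_url, ex=24 * 60 * 60)
    return long_url
```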
Generating a short URL
The example that we discussed in the beginning had this as the generated short URL:
https://www.shortservice.com/ab2sd32e
So, we essentially want to generate a sequence like “ab2sd32e” for a given original URL and append it to our service’s domain name. Let’s see how we can generate this sequence.
First we need to decide the length of our sequence. If we use the characters 0-9, a-z and A-Z, we have a total of 62 characters from which to build our sequence.
If we use 6 characters, we can generate 62⁶ unique sequences. This number is 56.8 billion. If we use 7 characters, then we can generate 62⁷ unique sequences, which is around 3.5 trillion. As we calculated above, we will need 60 billion sequences for our service. We have 2 possible options here:
1. Use 7 characters
2. Use extra characters like ‘+’ and ‘/’ which gives us a total of 64 characters. 64⁶ = over 68 billion. This seems like a better approach for us.
(Ensure that you and the interviewer are on the same page about this.)
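The keyspace arithmetic behind these options, for reference:

```python
# Number of unique sequences for each candidate alphabet and length
print(f"62^6 = {62**6:,}")   # 56,800,235,584    (~56.8 billion): not enough for 60 billion
print(f"62^7 = {62**7:,}")   # 3,521,614,606,208 (~3.5 trillion)
print(f"64^6 = {64**6:,}")   # 68,719,476,736    (~68.7 billion): enough for 60 billion
```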
Now that we have decided on the length of the sequence, let's dive into how we can generate a unique 6-character sequence for our short URL.
There can be many intuitions about how to generate the sequences.
For instance, we could derive a sequence from the input long URL: compute a hash of the URL and then base64-encode it. This would give us a sequence, but its length would be longer than 6 characters. We could take only the last 6 characters of that encoding, but that opens the gates for duplicate sequences.
Another solution could be to generate an ever-increasing sequence of numbers and convert each number to base 64. This would work great, but if you have multiple instances of your Short URL service, it is possible that two or more instances generate the same number.
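For illustration, here is how a counter could be converted into a 6-character sequence over the 64-character alphabet; the alphabet ordering and the example counter value are assumptions, not part of the design above:

```python
# 64-character alphabet: 0-9, a-z, A-Z, plus '+' and '/'
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/"
BASE = len(ALPHABET)  # 64

def encode_sequence(counter: int, length: int = 6) -> str:
    """Convert a counter value into a fixed-length sequence over the alphabet."""
    chars = []
    for _ in range(length):
        counter, remainder = divmod(counter, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

print(encode_sequence(125))   # "00001Z" with this (assumed) alphabet ordering
```

Note that two service instances running this conversion on independent counters could still hand out the same sequence, which is exactly the problem the pre-generation approach below avoids.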
The best solution would be to generate all the 6-character sequences beforehand and save them in the DB. This solves three problems:
1. Generated sequence is independent of the input URL.
2. No resources/time spent on encoding an input URL.
3. No duplications.
We will have a separate service that generates and stores the sequences beforehand. Let's call it the Sequence Service.
Sequence Service
This service will be responsible for generating all the 6-character sequences beforehand and saving them in the DB. When we get a URL to shorten, the Sequence Service will pick an already generated sequence and assign it as the short URL for the input long URL.
We cannot have a single instance of this service either, as it would be a single point of failure. But with multiple instances we face a new problem: two instances of the Sequence Service could assign the same sequence to two different input URLs. This is not acceptable. How do we get around this?
The solution is fairly simple. We will keep two data structures for our sequences. One holds all the sequences that have been assigned to an input URL and are therefore unusable for the next 10 years; the other holds all the unused sequences. Each Sequence Service instance can load some sequences into memory to hand out to incoming URLs quickly, but it must mark those sequences as used the moment it loads them, even if they haven't actually been assigned to an input URL yet, to avoid duplicate assignments. The catch is that if a Sequence Service instance shuts down, the sequences loaded in its memory are lost. Since we have an enormous number of sequences, losing a few this way won't affect our service.
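A sketch of how a Sequence Service instance might claim and hand out sequences; `claim_unused` is a hypothetical call that atomically moves a batch from the unused structure to the used one:

```python
from collections import deque

class SequenceService:
    """Hands out pre-generated sequences; a batch is marked used as soon as it is loaded."""

    def __init__(self, store, batch_size: int = 10_000):
        self._store = store                # placeholder for the DB holding used/unused sequences
        self._batch_size = batch_size
        self._in_memory: deque[str] = deque()

    def _refill(self) -> None:
        # Claim a batch and mark it used immediately, so no other instance
        # can ever hand out the same sequences.
        batch = self._store.claim_unused(self._batch_size)   # hypothetical atomic DB call
        self._in_memory.extend(batch)

    def next_sequence(self) -> str:
        if not self._in_memory:
            self._refill()
        # If this instance crashes, the sequences still in memory are simply lost;
        # that is acceptable given the ~68 billion sequence keyspace.
        return self._in_memory.popleft()
```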
Sequence storage
Our sequence is 6 characters long, and we will need 68 billion sequences.
So total storage needed for these sequences = 6 bytes * 68 billion = 408 GB
URL Redirection
When a user tries to access a short URL, our service will check whether this short URL has a long URL associated with it: first in the cache, and if not found there, then in the used data structure in the DB. If an associated long URL is found, our service will issue an HTTP 302 redirect. Why 302 and not 301? A 301 redirect is a "permanent redirect": the browser is allowed to cache it, whereas a 302 redirect must hit our system every time. 301 could also be a good choice if we do not allow URL updates. But if our service provides the functionality of updating URLs, then 302 is the way to go.
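A minimal redirection endpoint sketch using Flask; `resolve` stands in for the cache-then-DB lookup described above:

```python
from flask import Flask, abort, redirect

app = Flask(__name__)

def resolve(sequence: str) -> str | None:
    """Placeholder for the cache + DB lookup described above."""
    ...

@app.route("/<sequence>")
def follow(sequence: str):
    long_url = resolve(sequence)
    if long_url is None:
        abort(404)   # unknown or unassigned sequence
    # 302 (temporary redirect): the browser comes back to us every time,
    # so updated URLs and usage tracking keep working.
    return redirect(long_url, code=302)
```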
Handling URL Expiration
When a URL expires, its sequence should be moved back to the unused-sequences data structure and the long URL should be deleted. But how exactly do we perform this cleanup? It is not feasible to have a service continuously checking all the stored URLs for expiration, as we could have 68 billion URLs in the worst case (the best case from a business standpoint). But a lazy cleanup is feasible. We will have a dedicated Cleanup Service for this purpose.
- Check for expiration every time a user accesses a short URL. If its expiration timestamp has passed, simply invoke the Cleanup Service for this URL and return an error message to the user. The Cleanup Service will delete the long URL and move the short URL's sequence to the unused data structure (see the sketch after this list).
- We can also run the Cleanup Service periodically, scheduled for when the traffic on our system is low.
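A sketch of the lazy, on-access expiration check; `schedule_cleanup` is a placeholder for the hand-off to the Cleanup Service:

```python
from datetime import datetime, timezone

def schedule_cleanup(sequence: str) -> None:
    """Placeholder: the Cleanup Service deletes the long URL and
    returns the sequence to the unused data structure."""
    ...

def handle_access(record) -> str | None:
    """Lazy expiration check performed on every redirect.

    `record` is the stored entry (sequence, long_url, expires_at);
    returns the long URL, or None if the link has expired.
    """
    if record.expires_at < datetime.now(timezone.utc):
        schedule_cleanup(record.sequence)   # invoke the Cleanup Service
        return None                         # caller returns an "expired link" error to the user
    return record.long_url
```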
Final Design
We did it! We have successfully built a URL shortening service. At least on paper. Here is the final design of our service:
Conclusion
In this article we learned how to design a URL shortening service. I will be posting similar System Design articles regularly, trying to cover every type of service that makes up the tech industry. If you liked this article then do follow me here; tons of interesting stuff is coming up.
Peace!