Awesome
<p align="center"> <img src="images/system_design_notebook.png"/> </p>An open source project to help you learn about system design step by step
NOTE: The work on this repo is still in progress. Some information might be lacking or missing at this point.
- Topics
- Topics Explained
- Exercises
- Questions
- Resources
- System Design
- System Design Process
- Interview Tips
- Q&A
- Credits
System Design
Requirements
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="requirements/README.md"><img src="images/requirements/requirements.png" width="160px;" alt="Requirements" /><br /><b>Requirements</b></a></td> <td align="center"><a href="requirements/README.md#functional-requirements"><img src="images/requirements/functional_requirements.png" height="120px;" alt="Functional Requirements" /><br /><b>Functional Requirements</b></a></td> <td align="center"><a href="requirements/README.md#non-functional-requirements"><img src="images/requirements/non_functional_requirements.png" height="120px;" alt="Requirements" /><br /><b>Non-Functional Requirements</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Compnents/Services
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="basic_architecture/README.md"><img src="images/basic_architecture/server.png" width="120px;" height="40px;" alt="Server" /><br /><b>Server</b></a></td> <td align="center"><a href="basic_architecture/README.md"><img src="images/basic_architecture/client.png" width="100px;" height="100px;" alt="Client" /><br /><b>Client</b></a></td> <td align="center"><a href="services/load_balancer/README.md"><img src="images/services/load_balancer.png" width="75px;" height="75px;" alt="Load Balancer" /><br /><b>Load Balancer</b></a></td> <td align="center"><a href="services/api_gateway/README.md"><img src="images/services/api_gateway.png" width="75px;" height="75px;" alt="API Gateway" /><br /><b>API Gateway</b></a></td> <td align="center"><a href="services/dns/README.md"><img src="images/dns/dns.png" width="75px;" height="75px;" alt="DNS" /><br /><b>DNS</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Scalability
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="scalability/README.md"><img src="images/scalability/scalability.png" width="70px;" height="75px;" alt="Scalability" /><br /><b>Scalability</b></a></td> <td align="center"><a href="scalability/README.md#horizontal-scaling"><img src="images/scalability/horizontal_scaling_icon.png" width="300px;" height="75px;" alt="Horizontal Scaling" /><br /><b>Horizontal Scaling</b></a></td> <td align="center"><a href="scalability/README.md"><img src="images/scalability/vertical_scaling_icon.png" width="300px;" height="60px;" alt="Vertical Scaling" /><br /><b>Vertical Scaling</b></a></td> <td align="center"><a href="scalability/README.md#scalability_factor"><img src="images/scalability/scalability_factor.png" width="100px;" height="80px;" alt="Scalability Factor" /><br /><b>Scalability Factor</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Reliability Engineering
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="reliability_engineering/README.md"><img src="images/reliability_engineering/availability.png" width="75px;" height="75px;" alt="Availability" /><br /><b>Availability</b></a></td> <td align="center"><a href="reliability_engineering/README.md"><img src="images/reliability_engineering/failover.png" width="75px;" height="75px;" alt="Availability" /><br /><b>Failover</b></a></td> </tr> <tr> <td align="center"><a href="reliability_engineering/README.md#cold-standby"><img src="images/reliability_engineering/cold.png" width="75px;" height="75px;" alt="Cold Standby" /><br /><b>Cold Standby</b></a></td> <td align="center"><a href="reliability_engineering/README.md#warm-standby"><img src="images/reliability_engineering/warm.png" width="85px;" height="75px;" alt="Warm Standby" /><br /><b>Warm Standby</b></a></td> <td align="center"><a href="reliability_engineering/README.md#hot-standby"><img src="images/reliability_engineering/hot.png" width="75px;" height="75px;" alt="Hot Standby" /><br /><b>Hot Standby</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Networking
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="networking/README.md"><img src="images/networking/networking.png" width="70px;" height="75px;" alt="Networking" /><br /><b>Networking</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Database
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="databases/README.md"><img src="images/databases/database.png" width="70px;" height="75px;" alt="Databases" /><br /><b>Databases</b></a></td> <td align="center"><a href="databases/README.md#cap-theorem"><img src="images/cap_theorem/cap.png" width="100px;" alt="CAP" /><br /><b>CAP Theorem</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Distributed Systems
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> <td align="center"><a href="distributed_systems/README.md"><img src="images/distributed_systems/distributed_systems.png" width="70px;" height="75px;" alt="Distributed Systems" /><br /><b>Distrubuted Systems</b></a></td> <td align="center"><a href="distributed_systems/README.md#clock-diff"><img src="images/distributed_systems/clock_drift.png" width="70px;" height="75px;" alt="Clock Drift" /><br /><b>Clock Diff</b></a></td> <td align="center"><a href="distributed_systems/README.md#cdn"><img src="images/distributed_systems/cdn.png" width="90px;" height="75px;" alt="CDN" /><br /><b>CDN</b></a></td> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END -->Database
<!-- ALL-TOPICS-LIST:START --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <center> <table> <tr> </tr> </table> </center> <!-- markdownlint-enable --> <!-- prettier-ignore-end --> <!-- ALL-TOPICS-LIST:END --> <td align="center"><a href="reliability_engineering/README.md"><br /><b></b></a></td>Exercises
General
"Elementary, my dear Watson"
<details> <summary>The following is an architecture of a server which runs a web server and a database on it. There are a dozen of clients connecting to the server. Answer the following questions <p align="center"> <img src="images/design/general/basic_architecture_db_web.png"/> </p>- Name two issues with this architecture
- What term/concept in system design used to describe one of the issues you specified?
- Suggest an improvement to the architecture WITHOUT adding more components
- If the suggested improvement is "vertical scaling", is it a permanent improvement?
- Suggest two improvements to the architecture, given that additional components can be added
-
Two issues:
- Single point of failure - if the server is down, the users would not be able to connect the application and the server will have to be restored or created from scratch
- If the web process is using all the RAM and/or CPU, then it might affect the DB since it won't have enough resource to handle requests. This is true also the opposite way of DB consuming all resources, not leaving enough for web server processes.
- The term used to describe the issue it "scalability". The server doesn't scale by demand and eventually users will experience issues
-
Apply vertical scaling. By adding more resources to the server (i.e. CPU & RAM) it will allow it to handle more load. This is nota permanent solution because at some point the load might scale to a point where it's not possible to add more RAM and CPU to the instance nor it's really worthwhile.
-
The two improvements would be:
- Separate DB and web server into two components (instead of both running on the same instance)
- Apply horizontal scaling - add more web server servers (it's also possible to add more DB instances)
AWS Cloud
What time is it? (stateless application)
<details> <summary>Design the most basic architecture for a web application (server based) that tells a single user what time is it (no DB, no scaling, ...) with maximum of two components </summary><br><b> <p align="center"> <img src="images/design/aws_cloud/what_time_is_it_1.png"/> </p>In this case what you need is two components:
- EC2 instance - this is where our application will run. A basic micro t2 instance is more than enough
- Elastic IP address - This is the static IP address our user will use every time to reach the application. In case the instance is not operational, we could always move the IP address to one that it is (if we manage more than one instance) </b></details>
Your instance might not be strong enough to handle requests from multiple users and soon enough you might see RAM and CPU utilized fully. One way to deal with it is, to perform what is called "Vertical Scaling" which is the act of adding more resources to your instance. In AWS case, switching to an instance type with more resources like M5 for example.
Note: The problem with vertical scaling (in case you have one node) is downtime (when upgrading the instance type, the instance will be down) so another thing you would want to do is "Horizontal Scaling" which is the act of adding more instances/resources.
<p align="center"> <img src="images/design/aws_cloud/what_time_is_it_2.png"/> </p></b></details>
<details> <summary>You would like to change the architecture offered in the solution, to not use elastic IP addresses for obvious reasons that it's not really scalable (each EC2 instance has a different IP and users are not able to remember them all). Offer an improvement </summary><br><b>Instead of using elastic IP addresses we can add a record in the DNS, using the Route 53 service, to have a record from the type A. This way, all users will be able to access the app using an hostname and the IP address will be provided to them by the DNS
<p align="center"> <img src="images/design/aws_cloud/what_time_is_it_3.png"/> </p>It's important to note that this solution is not optimal if you plan to scale down and up at some point. Due to the TTL value of a record, a user will keep contacting the same IP address, even if the node is already down.
A more proper and complete architecture would be to use an ELB
<p align="center"> <img src="images/design/aws_cloud/what_time_is_it_4.png"/> </p>But even with ELB used and "Auto scaling group" for automatically scaling the nodes, this architecture is not optimal. Can you point what is the problem with current architecture? (from two different aspecs) </b></details>
<details> <summary>State the issues with current architecture and what would you imrpove <p align="center"> <img src="images/design/aws_cloud/what_time_is_it_4.png"/> </p> </summary><br><b>With current architecture, the application is perhaps able to scale up and down, but when the availability zone is down, your entire application is down. So the first improvement would be to make both ELB and the application itself (the EC2 instances) multi-AZ.
Secondly, if you know you always need an instance (or two) for the application, you might want to have reserved nodes. Reserved nodes means you pay less for instances which means you save on costs.
<p align="center"> <img src="images/design/aws_cloud/what_time_is_it_5.png"/> </p></b></details>
"Video Games Shop" (stateful application)
<details> <summary>The following architecture was proposed for an online video games shop with the requirements of:- Support thousands of users at any given point of time
- Users can register
- Shopping cart items shouldn't be lost when the user browsing the store
The problem is that users report that when they browse for additional video games to buy, they lose their shopping cart. What is the problem with the current architecture and how to deal with it?
</summary><br><b>Such application is a stateful application, meaning it's important that we'll keep the information about the client from one session to another. It seems that with current architecture, every time the user initiates new session, it's perform against a different instance without saving client information.
There are a couple of solutions that can be applied in this case:
- Load Balancer Sticky Sessions: users will be redirected to the instance they initiated previously session with, in order to to not lose client's data. There is a disadvantage here of losing
- User cookies: the client/user stores the relevant data (shopping cart in this case) and in this case it doesn't matter with which EC2 instance the user interacts with, the data on the shopping cart will be sent from the client/user to the relevant instance. The disadvantages: HTTP requests are much heavier because data is attached with each request and it holds some security risks as cookies can be potentially modified </b></details>
"Video Games Shop"
<details> <summary>Following the last exercise, is there another way to deal with user's data (short and long term) except user's cookies and sticky sessions? </summary><br><b>There is something called "server session" where we need to add a new component to the architecture - ElastiCache or DynamoDB, to store the data on the shopping cart of each user. In order to identify it, we'll use a session ID which will be sent by the client/user every request
For long term data (user name, address, etc.) we'll use a database (e.g. RDS). There are a couple of variations as to how we can use it. A master instance where we'll write the data and a replication from which we'll read data:
<p align="center"> <img src="images/design/aws_cloud/video_games_shop_2.png"/> </p>A different approach can be to use Cache + DB, where for each session, we'll check if the data is in the cache and if it's not, then we'll access the DB (this is also called "Write Through"):
<p align="center"> <img src="images/design/aws_cloud/video_games_shop_3.png"/> </p></b></details>
Misc
Note: Exercises may repeat themselves in different variations to practice and emphasize different concepts.
"Perfectly balanced, as all things should be"
<details> <summary>You have the following simple architecture of a server handling requests from a client. What are the drawbacks of this design and how to improve it? <p align="center"> <img src="images/design/basic_architecture.png"/> </p> </summary><br><b>-
Limitations:
- Load - at some point it's possible the server will not be able to handle more requests and it will fail or cause delays
- Single point of failure - if the server goes down, nothing will be able to handle the requests
-
How to improve:<br>
<p align="center"> <img src="images/load_balancer/basic_architecture.png"/> </p> -
Further limitations:
- Load was handled as well as the server being a single point of failure, but now the load balancer is a single point of failure. </b></details>
Yes, one could use DNS load balancing.<br> Bonus question: which algorithm a DNS load balancer will use? </b></details>
<details> <summary>What are the drawbacks of round robin algorithm in load balancing?</summary><br><b>- A simple round robin algorithm knows nothing about the load and the spec of each server it forwards the requests to. It is possible, that multiple heavy workloads requests will get to the same server while other servers will got only lightweight requests which will result in one server doing most of the work, maybe even crashing at some point because it unable to handle all the heavy workloads requests by its own.
- Each request from the client creates a whole new session. This might be a problem for certain scenarios where you would like to perform multiple operations where the server has to know about the result of operation so basically, being sort of aware of the history it has with the client. In round robin, first request might hit server X, while second request might hit server Y and ask to continue processing the data that was processed on server X already. </b></details>
"For all my actions both public and private"
<details> <summary>The following is an architecture of a load balancer serving and three web servers. Assuming, we would like to have a secured architecture, that makes sense, where would you set a public IP and where would you set a private IP? <p align="center"> <img src="images/design/public_private_drag_and_drop.png"/> </p> </summary><br><b> <p align="center"> <img src="images/design/public_private_drag_and_drop_solution.png"/> </p>It makes sense to hide the web servers behind the load balancers instead of giving users direct access to them, so each one of them will have a private IP assigned to it. The load balancer should have a public IP since, we except anyone who would like to access a certain web page/resource, to go through the load balancer hence, it should be accessible to users. </b></details>
<details> <summary>What load balancing techniques are there?</summary><br><b>- Round Robin
- Weighted Round Robin
- Least Connection
- Weighted Least Connection
- Resource Based
- Fixed Weighting
- Weighted Response Time
- Source IP Hash
- URL Hash </b></details>
"Keep calm, all I want is your cash"
<details> <summary>The following is a simple architecture of a client making requests to web server which in turn, retrieves the data from a datastore. What are the drawbacks of this design and how to improve it? <p align="center"> <img src="images/design/cache_basics_1.png"/> </p> </summary><br><b>-
Limitations:
- Time - retrieving the data from the datastore every time a request is made from the client, might take a while
- Single point of failure - if the datastore is down (or even slow) it wouldn't be possible to handle the requests
- Load - the datastore getting all the requests can result in high load on the datastore which might result in a downtime
-
How to improve:<br>
<p align="center"> <img src="images/design/cache_basics_2.png"/> </p>
</b></details>
<details> <summary>Are you able to explain what is Cache and in what cases you would use it?</summary><br><b>Why to use cache?
- Save time - Accessing a remote datastore, and in general making network calls, takes time
- Reduce load - Instead of the datastore handling all the requests, we can take some of its load and reduce by accessing the cache
- Avoid repetitive tasks - Instead of querying the data and processing it every time, do it once and store the result in the cache </b></details>
For multiple reasons:
- The hardware on which we store the cache is in some cases much more expensive
- More data in the cache, the bigger it gets and longer the put/get actions will take </b></details>
"In a galaxy far, far away..."
<details> <summary>The following is a system design of a remote database and three applications servers <p align="center"> <img src="images/design/remote_database_1.png"/> </p> </summary><br><b>-
Limitations:
- Latency. Every query made to the remote database will hit latency, even if small.
- In case the remote database crashes, the app will stop working
-
How to improve:<br>
<p align="center"> <img src="images/design/remote_database_2.png"/> </p> * Replicate each database to the local app server. This has several advantages. First, we are not bound to latency anymore. Secondly, a fai -
Further limitations:
- We are bound now to bandwidth
- If the remote database isn't accessible for a long period of time, we'll have an outdated database and each app has the potential to work against a different DB </b></details>
"A bit on the slow side"
<details> <summary>The following is an improvement of the previous system design <p align="center"> <img src="images/design/remote_database_2.png"/> </p> </summary><br><b>-
Limitations:
- Queries to database might be slow, even on the server itself where the app is running
- Once the remote database isn't available, the local databases will not by in sync
-
How to improve:<br> <img src="images/design/remote_database_v2_1.png"/> </b></details>
"Always the same one"
<details> <summary>Every request sent by the same client, is routed every time to a different web server. What problem the user might face with such design? How to fix it? <p align="center"> <img src="images/design/load_balancer/lb_sticky_sessions.png"/> </p> </summary><br><b>- The problem: the user might need to authenticate every single request, because different web servers handle its requests.
- A possible solution: use sticky sessions where the user is routed to the same instance every single time
"Coming back to find we've failed"
<details> <summary>You have a design of load balancer and a couple of web instances behind it. Sometimes the instances crash and the user report the application doesn't works for them. Name one possible way to deal with such situation.</summary><br><b>One possible way to deal with it, is by using health checks. Where an instance that doesn't pass the health check, will be excluded from the list of instances used by the load balancer to forward traffic to. <img src="images/design/load_balancer/load_balancer_health_checks.png"/> </b></details>
"In any major city, minding your own business is a science"
<details> <summary>You have a production application using a database for reads and writes. Your organization would like to add another application to work against the same database but for analytics purposes (read only). What problem might arise from this new situation and what one improvement you can apply to the design? <p align="center"> <img src="images/design/db/analytics_application_database.png"/> </p> </summary><br><b>Adding another application to work against the same database can create additional load on your database which may lead to issues since the additional load might reach the limits of your database capacity constraints.
One improvement to the design could be to add a read replica instance of your database. This way the new application can work against the read replica instead of the original database. The replication will be asynchronous but in most cases, for analytics applications, that's good enough.
<img src="images/design/db/analytics_application_database_improvement.png"/> </b></details>Questions
This is a more "simplified" version of exercises section. No drawings, no evolving exercises, no strange exercises names, just straightforward questions, each in its own category.
<details> <summary>Your website usually serves on average a dozen of users and has good CPU and RAM utilization. It suddenly becomes very popular and many users try to access your web server but they are experiencing issues, and CPU, RAM utilization seems to be on 100%. How would you describe the issue?</summary><br><b>Scalability issue. The web server doesn't scales :'(<br> In order to avoid such issues, the web server has to scale based on the usage. More users -> More CPU, RAM utilization -> Add more resources (= scale up). An When there are less users accessing the website, scale down. </b></details>
Cache
<details> <summary>Tell me everything you know about Cache</summary><br><b> </b></details> <details> <summary>True or False? While caching can reduce load time, it's increasing the load on the server</summary><br><b>False. If your server doesn't have to execute the request since the result is already in the cache, then it's actually the opposite - there is less load on the server in addition to reduced load times. </b></details>
DNS
<details> <summary>What is a DNS?</summary><br><b> </b></details> <details> <summary>Can you use DNS for load balancing?</summary><br><b> </b></details>Storage
<details> <summary>What is RAID?</summary><br><b> </b></details>DNS
<details> <summary>What is CDN?</summary><br><b> </b></details>Misc
<details> <summary>How operating systems able to run tasks simultaneously? for example, open a web browser while starting a game</summary><br><b>The CPU have multiple cores. Each task is executed by a different core.<br> Also, it might only appear to run simultaneously. If every process is getting CPU allocation every nanosecond, the user might think that both processes are running simultaneously. </b></details>
<details> <summary>What is a VPS?</summary><br><b>From wikipedia: "A virtual private server (VPS) is a virtual machine sold as a service by an Internet hosting service." </b></details>
<details> <summary>True or False? VPS is basically a shared server where each user is allocated with a portion of the server OS</summary><br><b>False. You get a private VM that no one else can or should use. </b></details>
Distributed Systems
<details> <summary>What are some of the challenges one has to face when designing and managing distributed systems, but not so much when dealing with systems/services running on a single host?</summary><br><b>See Distributed Systems - Fallacies of distributed systems </b></details>
<details> <summary>What is a clock drift in regards to distributed systems?</summary><br><b>See Distributed System - Clock Drift </b></details>
Resources
There many great resources out there to learn about system design in different ways - videos, blogs, etc. I've gathered some of them here for you
By Topic
<details> <summary>System Design Introduction</summary>Articles
</details> <details> <summary>Scalability</summary>Videos
Repositories
- awesome-scalability - "An updated and organized reading list for illustrating the patterns of scalable, reliable, and performant large-scale systems"
By Resources Type
<details> <summary>Videos</summary>System Design
- Gaurav Sen - Excellent series of videos on system design topics
- System Design Interview - How to get through system design interviews. Covering both architecture and code
Scalability
</details> <details> <summary>Repositories</summary>Scalability
- awesome-scalability - "An updated and organized reading list for illustrating the patterns of scalable, reliable, and performant large-scale systems"
System Design
- system-design-primer - "Learn how to design large-scale systems. Prep for the system design interview."
Introduction
</details>System Designs
<a id="system-design-development"></a>
Development
This section focusing on the development aspect of system design (e.g. Version Control, Development Environment, etc.). Note: it might overlap with other sections such 'Cloud System Design'.
For each scenario try to identify the following:
- Problem Statement & Analysis
- Solution(s)
Big-Scale Monorepo
You are part of a team which owns a Git monorepo. Recently this monorepo grown quite a lot and now includes hundred thousands of files. Recently, developers in your team complain it takes a lot of time to run some Git commands, especially those related to filesystem state (e.g. git status
takes a couple of seconds or even minutes). What would you suggest the team to do?
Big-Scale Monorepo - Problem Statement & Analysis
Let's start with stating the facts:
- The repo has grown to include million of files
- It takes seconds or minutes to run Git operations while usually it takes approximately 1-2 seconds at most
Next, it would be good to understand how exactly these different commands work in order to understand what we can do about the problem.
In case of git status
, it runs diffs on HEAD and the index and also the index and working directory
, to understand what is being tracked and what's not. This results in running lstat() system call which returns a struct on each file (including information like device ID, permissions, inode number, timestamp info, etc.). Running this on hundred thousands of files takes time, which explains why git status
takes seconds, if not minutes, to run.
Big-Scale Monorepo - Solution(s)
The solution should be focusing on how Git can perform less work in regards to its operations. In this case it's very technology specific, but there is always something to learn, even in specific implementations, on the general design approach.
- Use the built-in
fsmonitor
(filesystem monitor) of Git. With fsmonitor (which integrated with Watchman), Git spawn a daemon that will watch for any changes continuously in the working directory of your repository and will cache them . This way, when you rungit status
instead of scanning the working directory, you are using a cached state of your index.
-
Enable
feature.manyFile
withgit config feature.manyFiles true
. This does two things: -
Sets
index.version = 4
which enables path-prefix compression in the index -
Sets
core.untrackedCache=true
which by default is set tokeep
. The untracked cache is quite important concept. What it does is to record the mtime of all the files and directories in the working directory. This way, when time comes to iterate over all the files and directories, it can skip those whom mtime wasn't updated.
Before enabling it, you might want to run git update-index --test-untracked-cache
to test it out and make sure mtime operational on your system.
-
Git also has the built-in
git-maintainence
command which optimizes Git repository so it's faster to run commands likegit add
orgit fatch
and also, the git repository takes less disk space. It's recommended to run this command periodically (e.g. each day). -
Track only what is used/modified by developers - some repositories may include generated files that are required for the project to run properly (or support certain accessibility options), but not actually being modified by any way by the developers. In that case, tracking them is futile. In order to avoid populating those file in the working directory, one can use the
sparse checkout
feature of Git. -
Finally, with certain build systems, you can know which files are being used/relevant exactly based on the component of the project that the developer is focusing on. This, together with the
sparse checkout
can lead to a situation where only a small subset of the files are being populated in the working directory. Making commands likegit add
,git status
, etc. really quick
<a id="system-design-generic-systems"></a>
Generic Systems
This section covers system designs of different types of applications. Nothing too specific, but yet quite common in the real world as a type of application.
For each type of application we are going to mention its:
- Requirements
- API endpoints (public and internal)
Payment and Reservation System for Parking Garages
Design a payment and reservation system for parking garages that supports three type of vehicles:
- regular
- large
- compact
With flat rate based on vehicle type and time
Note: if you've been told to design this type of system without any other requirements, the rate and special parking, is something you should ask about.
Payment and Reservation System for Parking Garages - Clarifications
Ask clarifying questions such as:
- Who are the users and how they are going to use the system?
- What inputs and outputs should the system support?
- How much data do we expect the system to handle?
- How many requests per second?
Payment and Reservation System for Parking Garages - Requirements
- User to be able to reserve a parking spot and receive a ticket
- User can't reserve a parking spot reserved by someone else
- System to support the following types of vehicles: regular, large and compact
- System to support flat rate based on vehicle type and time the vehicle spent in the parking
Payment and Reservation System for Parking Garages - API (Public Endpoints)
-
/reserve
- Parameters: garage_id, start_time, end_time
- Returns: (spot_id, reservation_id)
-
/cancel
- Parameters: reservation_id
-
/payment
- Parameters: reservation_id
- Use existing API like Squre, PayPal, Stripe, etc.
Payment and Reservation System for Parking Garages - API (Internal Endpoints)
-
/calculate_payment
- calculate the payment for reserving a parking spot- Parameters: reservation_id
-
/free_spots
- get free spots where the car can park (note: small car might be able to park in bigger car spot)- Parameters: garage_id, vehicle_type, time
-
/allocate_spot
- do the actual reservation of a parking spot- Parameters: garage_id, vehicle_type, time
-
/create_account
- the ability to create an account so users can use the app and reserve parking spots- Parameters: email, username, first_name (optional), last_name (optional), password (optional)
-
/login
- Parameters: email, username (optional), password
Payment and Reservation System for Parking Garages - Scale
We can assume that the number of users is limited to the number of parking spots in each garage and taking into account the number of garages of course.<br> Given that, users scale is pretty predictable and can't reach unexpected count (assuming no new garages can be added or fixed rate of new garages being added)
Payment and Reservation System for Parking Garages - Data Scheme
SQL based database with the following tables
Reservations
Field | Type |
---|---|
id | primary key, serial |
start | timestamp |
end | timestamp |
paid | boolean |
Garages
Field | Type |
---|---|
id | primary key, serial |
zipcode | varchar |
rate_compact | decimal |
rate_compact | decimal |
rate_regular | decimal |
rate_large | decimal |
Spots
Field | Type |
---|---|
id | primary key, serial |
garage_id | foreign_key, int |
vehicle_type | enum |
status | enum |
Users
Field | Type |
---|---|
id | primary key, serial |
varchar (SHA-256) | |
first_name | varchar |
last_name | varchar |
Vehicles
Field | Type |
---|---|
id | primary key, serial |
license | varchar |
type | enum |
Payment and Reservation System for Parking Garages - High-level architecture
<p align="center"> <img src="images/design/apps/garage_payment_and_reservation/high_level_architecture.png"/> </p><a id="system-design-real-world-systems"></a>
Real-World Systems
This section covers system designs of real world applications.
Each section here will include full details on the system. It's recommended, as part of your learning process, that you will NOT look at these full details and start by trying figuring them out by yourself. More specifically, for every system:
- Create a list of its functional requirements, features
- Create a list of its non functional requirements
- Design API spec
- Perform high-level design
- Perform detailed design
Features / Functional Requirements
- Messaging with individuals and groups (send and receive)
- Sharing documents, images, videos
- User status - online, last seen, etc.
- Message status - delivered, read (and who read it)
- Encryption - encrypt end-to-end communication
Non Functional Requirements
- Scale
- Minimal Latency
- High Availability
- Consistency
- Durability
API Spec
Messaging:
- Direct Chat Session
- Input: API key, user ID, user ID (recipient)
- Send Message
- Input: API key, session ID, message type, message
User account:
- Register account
- Input: API key, user data
- Validate account
- Input: API key, user ID, validation code
System Design Overview
TODO
System Design Components
TODO
Amazon Video Prime
Features / Functional Requirements
- Accounts
- Users have their account (they able to login or logout from it)
- Uploading videos
- Support uploading videos (and video thumbnail) by content creators/producers
- Only certain accounts can upload videos not everyone
- Set limit for uploads
- Video length (e.g. 5 hours)
- Size per hour of video (if 5 hours and 1GB per hour = 5GB)
- Thumbnail size (e.g. 50 MB)
- Streaming videos
- Users able to stream videos to different devices
- UI differ based on devices
- Buffering (over quality) (Note: A trade off)
- Store different quality versions of the same video
- Users able to stream videos to different devices
- Browsing videos
- Be able to tag videos (genre, date, ...)
- Allow users to search for videos (based on name, tag, ...)
Non Functional Requirements
- Reliability (Reliable video storage for storing uploaded videos and streaming them when needed)
- Scalability (be able to onboard users without hitting performance issues)
- High Availability (if one of the components fail, the system would still be usable)
System Design Process
How to perform system design? TODO(abregman): this section is not yet ready
- Define clearly the requirements - if requirements are not clear, make sure to clarify them
- Answer the question what is the projected traffic?
- Define which quality attributes are important for your system - scalability, efficiency, reliability, etc.
- TODO
System Design Interview Tips
If you are here because you have a system design interview, here are a couple of suggestions:
- Choose the simplest architecture that captures the requirements (functional requirements but also expected traffic for example) of the system. No need to complicate things :)
More specific suggestions based on the phase of the interview:
What to ask yourself when you see a system design
Note: You might want to ask yourself these questions also after you've done performing a system design
- Does it scale if I add more users?
- Is there a single point of failure in the design?
- Don't be shy about asking for clarifications on a given system design. Some system design are vague on purpose
What to ask the interviewer
- What are the requirements?
- How the system is used?
- How much users are expected to access the system?
- How often the users access the system?
- Where the users access the system from?
- Are there any constraints?
- Ask for clarifications if needed. Sometimes instructions or requirements are vague on purpose.
What to say at the beginning of the discussion on a system design
- List the required features of the system
- State problem you expect to encounter
- State the traffic you expect the system to handle
What you might want to state during the design or the discussion
- At each decision made about the system design, state what are the cons and pros of such decision
Credits
The project banner, the icons used in "General" exercises section and the system design index, designed by Arie Bregman
Contributions
If you would like to contribute to the project, please read the contribution guidelines