
How to Choose the Right Database in System Design Interviews - Part 1

Choosing the right database is one of the most important decisions in a system design interview. The selection impacts the correctness and scalability of the design. Many factors can influence the decision, but the most important ones are:

  • Structure of the Data:
    • Different databases support data models such as relational, document, key-value, graph, etc.
    • The choice of database depends on which data model best fits the structure of your data. For example, a traditional relational database like MySQL or PostgreSQL might be appropriate if the data is highly structured and relational.
    • A NoSQL database like Cassandra or MongoDB might be more suitable if the data is semi-structured or unstructured.
  • Query Pattern:
    • Understanding the types of queries and operations performed on the data helps select a database that can efficiently handle those operations.
    • Are they read-heavy or write-heavy? Do they involve complex joins, aggregations, or full-text searches? 
    • Some databases are optimized for complex analytical queries (OLAP), while others excel at fast transactional queries (OLTP).
    • Some databases are optimized for specific use cases, such as time-series data, geospatial data, or graph data.
  • Data Integrity and Consistency:
    • The database's support for enforcing data integrity constraints, such as unique keys, foreign key constraints, and transactions, is critical to maintaining data consistency.
    • Relational databases typically excel in this area, whereas some NoSQL databases may sacrifice strict consistency for performance or scalability.
  • Data Volume and Scalability:
    • The volume of data that needs to be stored and managed can also influence database selection. However, contrary to popular opinion, data volume alone is not a decisive factor in choosing between a relational and a NoSQL database.
    • A common misconception is that a traditional RDBMS is only sufficient for small to moderate-sized datasets, and that large datasets requiring horizontal scalability always need distributed databases like Apache Cassandra or Amazon DynamoDB. This is not always true.
    • Scaling is not an issue with modern distributed relational databases like Google Spanner, Amazon RDS, and CockroachDB; they scale seamlessly. However, since they are relational databases with built-in ACID support, that support may impact latency, so one needs to weigh the other factors mentioned above to make a final decision.
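To make the data-integrity point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any relational database. The schema and values are invented for illustration; the point is how unique and foreign-key constraints reject bad data, and how a transaction commits or rolls back as a unit.

```python
import sqlite3

# In-memory SQLite database as a stand-in for any relational DB.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign-key constraints

conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE          -- unique-key constraint
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),  -- foreign-key constraint
    amount  REAL NOT NULL
);
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.commit()

# A transaction: either both inserts commit or neither does.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (user_id, amount) VALUES (1, 9.99)")
        # This violates the FK constraint (no user 42) and aborts the block.
        conn.execute("INSERT INTO orders (user_id, amount) VALUES (42, 5.00)")
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 0 -- the first order was rolled back along with the second
```

This all-or-nothing behavior is exactly what many NoSQL stores relax (or scope to a single document/partition) in exchange for scalability.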
Let's discuss some of the most common use cases for databases:

  • Time-Series Database:

    • Time series databases (TSDBs) are optimized for storing, querying, and analyzing time-stamped data points. They are commonly used in various industries and applications where data is collected over time and needs to be efficiently managed and analyzed. 
    • Some common TSDBs are InfluxDB and Amazon Timestream. Bigtable and MongoDB also work well for time-series data. 
    • Some common use cases are:
      • Monitoring and Observability:
        • TSDBs are widely used for monitoring and observability applications, such as monitoring infrastructure performance (e.g., CPU usage, memory utilization, network traffic), network devices, servers, and application metrics (e.g., response times, error rates).
        • They store metrics and telemetry data collected from sensors, devices, and software systems in real-time.
        • Prominent monitoring stacks are built around TSDBs such as Prometheus and InfluxDB, often with Grafana on top for visualization.
      • Financial Services:
        • Financial services use time-series databases to track and analyze stock market data, cryptocurrency prices, trading volumes, and other financial metrics.
        • These databases help in historical analysis, real-time monitoring of market trends, algorithmic trading, risk management, and compliance reporting.
      • Log Management and Analytics:
        • TSDBs are used for storing and analyzing log data generated by applications, servers, network devices, and security systems.
        • They enable centralized log management, log aggregation, real-time log analysis, anomaly detection, and forensic investigations, helping organizations gain insights into system behavior and security incidents.
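The workload a TSDB optimizes for can be sketched in a few lines of Python. This toy `TinyTSDB` class is purely illustrative (real TSDBs like InfluxDB add compression, retention policies, and time-based indexes): append-only timestamped writes, plus windowed aggregation reads like "average CPU per minute".

```python
from collections import defaultdict

class TinyTSDB:
    """Toy in-memory time-series store: append-only writes, range reads."""

    def __init__(self):
        self.points = []  # list of (epoch_seconds, value), appended in time order

    def write(self, ts, value):
        self.points.append((ts, value))

    def range_avg(self, start, end, bucket=60):
        """Average value per `bucket`-second window within [start, end)."""
        buckets = defaultdict(list)
        for ts, v in self.points:
            if start <= ts < end:
                buckets[(ts - start) // bucket].append(v)
        return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

db = TinyTSDB()
for t, cpu in [(0, 10), (30, 30), (70, 50), (90, 70)]:  # made-up CPU samples
    db.write(t, cpu)

print(db.range_avg(0, 120))  # {0: 20.0, 1: 60.0} -- per-minute averages
```

Monitoring queries ("average CPU over the last hour, per minute") are almost always of this shape, which is why TSDBs index and compress data along the time dimension.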

  • Data warehouses:

    • Data warehouses are used for company-wide analytics across data from many different sources. For example, they can answer questions like the total number of orders, orders per geography, and the most popular items in each geography.
    • Data warehouses are specialized databases designed for storing, analyzing, and reporting large volumes of structured and unstructured data from multiple sources.
    • Data from many services is loaded into the warehouse, and reports are generated on top of it. The data from these different sources is collected into the data warehouse using ETL (Extract, Transform, Load) jobs.
    • Popular data warehouses include Google BigQuery and Amazon Redshift.
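Here is a hedged sketch of such an ETL job in Python, with SQLite standing in for a warehouse like BigQuery or Redshift. The source records and field names are invented for illustration; the shape of the job (extract, transform, load, then an OLAP-style aggregation) is the point.

```python
import sqlite3

def extract():
    # In reality this would pull from service databases, logs, or APIs.
    return [
        {"order_id": 1, "country": "US", "item": "book", "amount": "12.50"},
        {"order_id": 2, "country": "US", "item": "pen",  "amount": "1.25"},
        {"order_id": 3, "country": "IN", "item": "book", "amount": "9.00"},
    ]

def transform(rows):
    # Normalize types and field order for the warehouse schema.
    return [(r["order_id"], r["country"], r["item"], float(r["amount"]))
            for r in rows]

def load(conn, rows):
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(id INTEGER, country TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load(warehouse, transform(extract()))

# An OLAP-style aggregation: order count and revenue per geography.
for row in warehouse.execute(
        "SELECT country, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY country ORDER BY country"):
    print(row)  # ('IN', 1, 9.0) then ('US', 2, 13.75)
```

Real warehouses run these scans and aggregations over billions of rows, typically using columnar storage, which is what distinguishes them from the OLTP databases the data was extracted from.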

  • Text Searching Capabilities:

    • Systems like Amazon provide search capabilities across products based on title, product description, seller, etc.
    • Systems like Netflix let users search by movie title, genre, etc.
    • Social media platforms like Facebook, Twitter, and LinkedIn employ text search engines to enable users to search for posts, tweets, profiles, and hashtags. Users can search for specific topics, people, or conversations using keywords or hashtags.
    • CMS(Content Management System) platforms, such as WordPress, Drupal, and Joomla, use text search engines to enable users to search for articles, blog posts, pages, and other content within the CMS. Users can search by title, content, tags, categories, and metadata.
    • Text search engines like Elasticsearch can be used for such use cases.
    • They also support fuzzy search, allowing users to find approximate matches for a given query.
    • One important point to note: these are search engines, not databases. Search engines do not guarantee the durability of data, so they need to be used alongside a database. The database is always the source of truth, and data can be loaded into the search engine as needed.
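As an illustration of the underlying idea (not of how Elasticsearch works internally), here is a toy inverted index in Python with a fuzzy-match fallback via the standard library's difflib. The product catalog is made up for the example.

```python
from collections import defaultdict
import difflib

# Hypothetical product catalog: doc id -> text.
docs = {
    1: "wireless noise cancelling headphones",
    2: "mechanical keyboard with wireless dongle",
    3: "espresso machine stainless steel",
}

# Inverted index: term -> set of doc ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    results = set()
    for term in query.lower().split():
        if term in index:
            results |= index[term]
        else:
            # Fuzzy fallback: look up the closest indexed term instead.
            for close in difflib.get_close_matches(term, index.keys(), n=1):
                results |= index[close]
    return sorted(results)

print(search("wireless"))   # [1, 2]
print(search("hedphones"))  # [1] -- fuzzy match to "headphones"
```

Production search engines layer relevance scoring, tokenization/analysis, and sharding on top of this basic structure, which is why they are a separate system fed from the database rather than a replacement for it.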
We will explore SQL and NoSQL databases in the next part: what are the factors that help us choose between them?
