
Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure used to determine whether an element is a member of a set. The filter responds to queries with either a definite No or a probable Yes. The probabilistic nature of the Bloom filter lies in its "probable Yes" responses, which imply the possibility of false positives. When the Bloom filter indicates Yes, the element might or might not be present in the set. However, if the Bloom filter returns No, then the element is definitively not present in the set.

Software engineering is all about tradeoffs. When using a Bloom filter, we sacrifice a bit of accuracy and flexibility to gain a substantial boost in performance. The key consideration is identifying use cases where it is acceptable to occasionally report an element as present in the set even when it is not. Let us understand the core algorithm first, and then we will see its real-life applications:

Algorithm:

A key element in the construction of a Bloom Filter is the choice of an effective hashing function. The function should be fast and should distribute values evenly and randomly. While occasional collisions are acceptable, they should not occur frequently. Some common hashing choices include MurmurHash, Jenkins Hash, or simple modular arithmetic.

Another key component is a bit array of, say, size 'm'. During insertion, an element is hashed to produce a hash value within the range [0, m-1], and the bit at the corresponding index in the array is set to 1. To check whether an element is present in the Bloom filter, its hashed position is examined. If the bit is 1, the element is "probably" in the set; if it is 0, the element is definitely not present. This is the core idea of the Bloom filter.
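
As a minimal sketch of this single-hash core idea (in Python, using a toy character-sum hash chosen purely for illustration):

    # Toy illustration of the single-hash core idea.
    m = 10                 # size of the bit array
    bits = [0] * m

    def h(element: str) -> int:
        # Toy hash function: sum of character codes modulo m (illustration only).
        return sum(ord(c) for c in element) % m

    bits[h("A")] = 1              # insertion: set the bit at the hashed index
    print(bits[h("A")] == 1)      # True  -> "A" is probably in the set
    print(bits[h("B")] == 1)      # False -> "B" is definitely not in the set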

A Bloom filter uses multiple hash functions to improve reliability. Using more hash functions leads to a more even distribution, reducing the likelihood of collisions and false positives. A more formal algorithm follows (with a small Python sketch after the steps):
  1. Initialization
    • Create a bit array of length 'm' (say 10) and initialize all bits to 0.
    • Choose the number of hash functions 'k' to be used.

  2. Hash Functions:
    • Say we choose k = 3 hash functions. Common choices are MurmurHash, Jenkins Hash, etc. For simplicity, let's use modular arithmetic as hash functions.

  3. Insertion:
    • Insert an element, say "A":
      • Apply each hash function to the element, obtaining k hash values.
      • Set the corresponding bits at these hash positions in the bit array to 1.

  4. Membership Check:
    • Check if an element, say "A", is in the set:
      • Apply each hash function to the element, obtaining k hash values.
      • Check if all corresponding bits at these hash positions in the bit array are set to 1.

    • Similarly, check for any other element using the k hash functions. If any of the k bits is not set, the element is definitely not present in the set.
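
Below is a minimal Python sketch of these steps. The class name, the hashlib-based hash functions, and the double-hashing trick used to derive the k indices are assumptions made for illustration; they are not mandated by the algorithm itself.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch: a bit array plus k hash functions."""

        def __init__(self, m: int = 10, k: int = 3):
            self.m = m                  # size of the bit array
            self.k = k                  # number of hash functions
            self.bits = [0] * m         # step 1: all bits start at 0

        def _positions(self, element: str):
            # Step 2: derive k positions via double hashing on two base digests,
            # a common way to simulate k independent hash functions.
            h1 = int(hashlib.md5(element.encode()).hexdigest(), 16)
            h2 = int(hashlib.sha1(element.encode()).hexdigest(), 16)
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, element: str) -> None:
            # Step 3: insertion sets the k corresponding bits to 1.
            for pos in self._positions(element):
                self.bits[pos] = 1

        def might_contain(self, element: str) -> bool:
            # Step 4: membership check; any 0 bit means "definitely not present".
            return all(self.bits[pos] == 1 for pos in self._positions(element))

    bf = BloomFilter(m=10, k=3)
    bf.add("A")
    print(bf.might_contain("A"))   # True ("probably" in the set)
    print(bf.might_contain("B"))   # usually False ("definitely" not in the set)

Double hashing (combining two base hashes as h1 + i*h2) keeps the sketch short; real implementations often use several independent MurmurHash seeds instead.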

Note:

  • The bit array size and the number of hashing functions in a Bloom filter play a crucial role in its efficiency and effectiveness.
  • If the bit array is small and the dataset is large, most bits will quickly end up set, and almost every query will return a (possibly false) "probably Yes". A larger bit array offers greater precision but demands more memory. Therefore, it is essential to choose an appropriate size.
  • The number of hashing functions affects how hash values are distributed across the bit array and, with it, the false positive rate.
  • Up to a point, using more hash functions decreases the false positive rate, because a query must find every one of the k bits set before it can report a (possibly false) positive. Beyond the optimal value, however, each insertion sets more bits, the array fills up faster, and the false positive rate rises again.
  • There are mathematical derivations for the optimal number of hash functions and the bit array length, which are out of the scope of this post; the resulting standard formulas are shown in the sketch after this list.
  • One important thing to keep in mind is that items can only be added to the Bloom filter; they cannot be removed.
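
For reference (the derivations themselves remain out of scope), the widely cited closed-form approximations can be computed in a few lines. In the sketch below, n is the expected number of items, p the target false positive rate, m the bit array length, and k the number of hash functions.

    import math

    def optimal_parameters(n: int, p: float):
        # Standard approximations: m = -n*ln(p) / (ln 2)^2 and k = (m/n)*ln 2.
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    # Example: 1 million expected items with a 1% target false positive rate
    # gives roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.
    print(optimal_parameters(1_000_000, 0.01))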

Real-Life Applications:

  1. Databases based on LSM Trees, such as Cassandra and DynamoDB:
    • In an LSM tree-based database, data is often stored in multiple levels (memtable, SSTables, etc.). When querying for a particular key, the database needs to determine which level might contain the key.
    • A Bloom filter is associated with each level of the LSM tree. The Bloom filter quickly indicates whether the key might exist at that level or not.
    • Bloom filters can generate false positives, meaning they may indicate that a key exists when it does not. However, they do not produce false negatives – if the Bloom filter says a key is not present, it is definitely not in the level. The database design takes into consideration the acceptable level of false positives to balance query speed with accuracy.
    • By consulting Bloom filters before accessing specific levels, the database can avoid unnecessary disk I/O operations for keys that are likely not present.
  2. Content Delivery Networks (CDN): 
    • CDNs leverage Bloom filters to avoid caching "one-hit wonders", web pages requested only once. According to Akamai, about 75% of web pages fall into this category. Such pages are cached only upon the second request (a small sketch of this caching decision appears after this list).
    • If the incoming URL is not found in the Bloom filter, it is added to the filter but not yet cached on the Content Delivery Network (CDN).
    • This significantly reduces the caching workload and boosts the cache-hit ratio. False positives are acceptable here: at worst, a small percentage of the one-hit wonders get cached on the CDN.
  3. Systems like Password, Username, or Email ID Validators:
    • Some password validators use a Bloom filter to filter out weak passwords. The system may maintain a Bloom filter populated with known insecure credentials.
    • When a new user signs up, the password is checked against the Bloom filter, and whenever a potential match is found, the user is warned that the password is weak.
    • Occasionally, a strong password may be the victim of a false positive. This is still fine, as the system would at most ask the user to come up with another password.
    • One important thing to note here is that the Bloom filter stores only bits derived from hashes, not the passwords themselves. So, even if this data structure is leaked, the exact passwords will not be compromised.
    • Similar logic can be applied to check username or email ID availability as well.
  4. Malware Detection:
    • Malware detection often involves creating cryptographic hash values (e.g., MD5, SHA-256) of file attributes, such as the file's content, size, and metadata.
    • Known malware hashes, generated from previously identified malicious files, are inserted into the Bloom filter.
    • When a new file is encountered, its attributes are hashed using the same hash functions used to construct the Bloom filter. If the Bloom filter indicates a match, it suggests that the file might be associated with known malware.
    • False positives are acceptable to some extent in malware detection, as further analysis can be performed on flagged files to confirm their status. It provides a quick initial check for known malware without the need to perform more resource-intensive analyses on every file.
    • Bloom filters can be dynamically updated by adding new hash values for recently identified malware.
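
To make the CDN example above concrete, here is a hypothetical caching decision built on the BloomFilter sketch from earlier. The cache dictionary and the fetch_from_origin callback are made-up stand-ins for a real CDN cache and origin fetch.

    # Hypothetical "one-hit wonder" caching decision (reuses BloomFilter above).
    seen_once = BloomFilter(m=1_000_000, k=3)
    cache = {}   # stand-in for the real CDN cache

    def handle_request(url: str, fetch_from_origin):
        if url in cache:
            return cache[url]                 # cache hit
        body = fetch_from_origin(url)
        if seen_once.might_contain(url):
            cache[url] = body                 # second (or later) request: cache it
        else:
            seen_once.add(url)                # first request: remember it, skip caching
        return body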
