A Bloom filter is a space-efficient probabilistic data structure used to determine whether an element is a member of a set. The filter responds to queries with either a definite No or a probable Yes. The probabilistic nature of the Bloom filter is apparent in its 'probably Yes' responses. This implies the possibility of false positives. When the Bloom filter indicates Yes, the element might or might not be present in the set. However, if the Bloom filter returns No, then the element is definitively not present in the set.
Software engineering is all about tradeoffs. When using the Bloom filter, we sacrifice a bit of accuracy and flexibility to gain a substantial boost in performance. The key consideration is identifying use cases where it's acceptable to detect an element in the set, even if it's not actually present. Let us understand the core algorithm first then we will see its real life applications:
Algorithm:
- Initialization
- Create a bit array of length 'm' (say 10) and initialize all bits to 0.
- Choose the number of hash functions 'k' to be used.
- Hash Functions:
- Say we choose k = 3 hash functions. Common choices are MurmurHash, Jenkins Hash, etc. For simplicity, let's use modular arithmetic as hash functions.
- Insertion:
- Insert an element, say "A":
- Apply each hash function to the element, obtaining k hash values.
- Set the corresponding bits at these hash positions in the bit array to 1.
- Membership Check:
- Check if an element, say "A", is in the set:
- Apply each hash function to the element, obtaining k hash values.
- Check if all corresponding bits at these hash positions in the bit array are set to 1.
- Similarly, check for any other element using k hash functions. If one of the k bits is not set, the value must not be present in the set.
Note:
- The bit array size and the number of hashing functions in a Bloom filter play a crucial role in its efficiency and effectiveness.
- If the bit array size is small and the dataset is large, all bits will likely be set within a short duration. A larger bit array offers greater precision but demands more memory. Therefore, it is essential to choose the optimal value.
- The number of hashing functions affects how hash values are distributed across the bit array. Using more hash functions leads to a more even distribution, reducing the likelihood of collisions and false positives.
- Increasing the number of hash functions generally decreases the false positive rate. This is because the probability of any specific bit being set to 1 by a random hash function decreases with more functions.
- There are mathematical derivations for coming up with optimal number hash functions and bit array length, which is out of the scope of this post.
- One important thing to keep in mind is items can only be added to the bloom filter, they cannot be removed.
Real-Life Applications:
- Databases based on LSM Tree like Cassandra, DynamoDB
- In an LSM tree-based database, data is often stored in multiple levels (memtable, SSTables, etc.). When querying for a particular key, the database needs to determine which level might contain the key.
- A Bloom filter is associated with each level of the LSM tree. The Bloom filter quickly indicates whether the key might exist at that level or not.
- Bloom filters can generate false positives, meaning they may indicate that a key exists when it does not. However, they do not produce false negatives – if the Bloom filter says a key is not present, it is definitely not in the level. The database design takes into consideration the acceptable level of false positives to balance query speed with accuracy.
- By consulting Bloom filters before accessing specific levels, the database can avoid unnecessary disk I/O operations for keys that are likely not present.
- Content Delivery Networks (CDN):
- CDNs leverage Bloom filters to exclude caching for One Hit Wonders—web pages requested only once. As per Akamai, 75% of webpages fall into this category. These pages are cached only upon the second request.
- If the incoming webpage is not found in the Bloom filter, avoid caching it on the Content Delivery Network (CDN).
- It significantly reduces the Caching workload and boosts the Cache-hit ratio. In case of false positives, it is acceptable if a small percentage of the One Hit Wonders are added to the CDN.
- System like Password or UserName or Email ID Validators:
- Some of the password validators use the Bloom filter to filter out the weak password. The system may maintain a Bloom filter-driven stock of insecure credentials.
- When a new user is added, the password is evaluated against the Bloom filter, and whenever a potential match is found, the user is notified of weak pasword.
- Sometimes, the strong password might be a victim of a false positive. This should again be fine, as the system would max ask the user to come up with another password.
- One important thing to note here is we are storing these passwords in hashed form. So, even if this data structure is leaked, exact passwords will not be compromised.
- Similar logic can be applied to check for User Name, or email ID availability as well.
- Malware Detection:
- Malware detection often involves creating cryptographic hash values (e.g., MD5, SHA-256) of file attributes, such as the file's content, size, and metadata.
- Known malware hashes, generated from previously identified malicious files, are inserted into the Bloom filter.
- When a new file is encountered, its attributes are hashed using the same hash functions used to construct the Bloom filter. If the Bloom filter indicates a match, it suggests that the file might be associated with known malware.
- False positives are acceptable to some extent in malware detection, as further analysis can be performed on flagged files to confirm their status. It provides a quick initial check for known malware without the need to perform more resource-intensive analyses on every file.
- Bloom filters can be dynamically updated by adding new hash values for recently identified malware.
- In an LSM tree-based database, data is often stored in multiple levels (memtable, SSTables, etc.). When querying for a particular key, the database needs to determine which level might contain the key.
- A Bloom filter is associated with each level of the LSM tree. The Bloom filter quickly indicates whether the key might exist at that level or not.
- Bloom filters can generate false positives, meaning they may indicate that a key exists when it does not. However, they do not produce false negatives – if the Bloom filter says a key is not present, it is definitely not in the level. The database design takes into consideration the acceptable level of false positives to balance query speed with accuracy.
- By consulting Bloom filters before accessing specific levels, the database can avoid unnecessary disk I/O operations for keys that are likely not present.
- CDNs leverage Bloom filters to exclude caching for One Hit Wonders—web pages requested only once. As per Akamai, 75% of webpages fall into this category. These pages are cached only upon the second request.
- If the incoming webpage is not found in the Bloom filter, avoid caching it on the Content Delivery Network (CDN).
- It significantly reduces the Caching workload and boosts the Cache-hit ratio. In case of false positives, it is acceptable if a small percentage of the One Hit Wonders are added to the CDN.
- Some of the password validators use the Bloom filter to filter out the weak password. The system may maintain a Bloom filter-driven stock of insecure credentials.
- When a new user is added, the password is evaluated against the Bloom filter, and whenever a potential match is found, the user is notified of weak pasword.
- Sometimes, the strong password might be a victim of a false positive. This should again be fine, as the system would max ask the user to come up with another password.
- One important thing to note here is we are storing these passwords in hashed form. So, even if this data structure is leaked, exact passwords will not be compromised.
- Similar logic can be applied to check for User Name, or email ID availability as well.
- Malware detection often involves creating cryptographic hash values (e.g., MD5, SHA-256) of file attributes, such as the file's content, size, and metadata.
- Known malware hashes, generated from previously identified malicious files, are inserted into the Bloom filter.
- When a new file is encountered, its attributes are hashed using the same hash functions used to construct the Bloom filter. If the Bloom filter indicates a match, it suggests that the file might be associated with known malware.
- False positives are acceptable to some extent in malware detection, as further analysis can be performed on flagged files to confirm their status. It provides a quick initial check for known malware without the need to perform more resource-intensive analyses on every file.
- Bloom filters can be dynamically updated by adding new hash values for recently identified malware.
Comments
Post a Comment