
Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure used to determine whether an element is a member of a set. The filter responds to queries with either a definite No or a probable Yes. The probabilistic nature of the Bloom filter lies in its "probable Yes" responses, which imply the possibility of false positives. When the Bloom filter indicates Yes, the element might or might not be present in the set. However, if the Bloom filter returns No, then the element is definitively not present in the set.

Software engineering is all about tradeoffs. When using a Bloom filter, we sacrifice a bit of accuracy and flexibility to gain a substantial boost in performance. The key consideration is identifying use cases where it is acceptable to occasionally report an element as present in the set even when it is not. Let us understand the core algorithm first, and then we will see its real-life applications:

Algorithm:

A key element in the construction of a Bloom Filter is the choice of an effective hashing function. The function should be fast and should distribute values evenly and randomly. While occasional collisions are acceptable, they should not occur frequently. Some common hashing choices include MurmurHash, Jenkins Hash, or simple modular arithmetic.

Another key component is a bit array of, say, size 'm'. During insertion, an element is hashed to produce a hash value within the range [0, m-1], and the bit at the corresponding index in the array is set to 1. To check whether an element is present in the Bloom filter, its hashed position is examined. If the bit is 1, the element is "probably" in the set; if it is 0, the element is definitely not present. This is the core idea of the Bloom filter.
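
As a minimal sketch of this single-hash core idea (in Python, using a toy character-sum hash chosen purely for illustration):

    # Toy illustration of the single-hash core idea.
    m = 10                 # size of the bit array
    bits = [0] * m

    def h(element: str) -> int:
        # Toy hash function: sum of character codes modulo m (illustration only).
        return sum(ord(c) for c in element) % m

    bits[h("A")] = 1              # insertion: set the bit at the hashed index
    print(bits[h("A")] == 1)      # True  -> "A" is probably in the set
    print(bits[h("B")] == 1)      # False -> "B" is definitely not in the set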

A Bloom filter uses multiple hash functions to improve reliability. Using more hash functions leads to a more even distribution, reducing the likelihood of collisions and false positives. A more formal algorithm follows (with a small Python sketch after the steps):
  1. Initialization
    • Create a bit array of length 'm' (say 10) and initialize all bits to 0.
    • Choose the number of hash functions 'k' to be used.

  2. Hash Functions:
    • Say we choose k = 3 hash functions. Common choices are MurmurHash, Jenkins Hash, etc. For simplicity, let's use modular arithmetic as hash functions.

  3. Insertion:
    • Insert an element, say "A":
      • Apply each hash function to the element, obtaining k hash values.
      • Set the corresponding bits at these hash positions in the bit array to 1.

  4. Membership Check:
    • Check if an element, say "A", is in the set:
      • Apply each hash function to the element, obtaining k hash values.
      • Check if all corresponding bits at these hash positions in the bit array are set to 1.

    • Similarly, check for any other element using the k hash functions. If any of the k bits is not set, the element is definitely not present in the set.
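
Below is a minimal Python sketch of these steps. The class name, the hashlib-based hash functions, and the double-hashing trick used to derive the k indices are assumptions made for illustration; they are not mandated by the algorithm itself.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch: a bit array plus k hash functions."""

        def __init__(self, m: int = 10, k: int = 3):
            self.m = m                  # size of the bit array
            self.k = k                  # number of hash functions
            self.bits = [0] * m         # step 1: all bits start at 0

        def _positions(self, element: str):
            # Step 2: derive k positions via double hashing on two base digests,
            # a common way to simulate k independent hash functions.
            h1 = int(hashlib.md5(element.encode()).hexdigest(), 16)
            h2 = int(hashlib.sha1(element.encode()).hexdigest(), 16)
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, element: str) -> None:
            # Step 3: insertion sets the k corresponding bits to 1.
            for pos in self._positions(element):
                self.bits[pos] = 1

        def might_contain(self, element: str) -> bool:
            # Step 4: membership check; any 0 bit means "definitely not present".
            return all(self.bits[pos] == 1 for pos in self._positions(element))

    bf = BloomFilter(m=10, k=3)
    bf.add("A")
    print(bf.might_contain("A"))   # True ("probably" in the set)
    print(bf.might_contain("B"))   # usually False ("definitely" not in the set)

Double hashing (combining two base hashes as h1 + i*h2) keeps the sketch short; real implementations often use several independent MurmurHash seeds instead.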

Note:

  • The bit array size and the number of hashing functions in a Bloom filter play a crucial role in its efficiency and effectiveness.
  • If the bit array is small and the dataset is large, most bits will quickly end up set, and almost every query will return a (possibly false) "probably Yes". A larger bit array offers greater precision but demands more memory. Therefore, it is essential to choose an appropriate size.
  • The number of hashing functions affects how hash values are distributed across the bit array and, with it, the false positive rate.
  • Up to a point, using more hash functions decreases the false positive rate, because a query must find every one of the k bits set before it can report a (possibly false) positive. Beyond the optimal value, however, each insertion sets more bits, the array fills up faster, and the false positive rate rises again.
  • There are mathematical derivations for the optimal number of hash functions and the bit array length, which are out of the scope of this post; the resulting standard formulas are shown in the sketch after this list.
  • One important thing to keep in mind is that items can only be added to the Bloom filter; they cannot be removed.
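
For reference (the derivations themselves remain out of scope), the widely cited closed-form approximations can be computed in a few lines. In the sketch below, n is the expected number of items, p the target false positive rate, m the bit array length, and k the number of hash functions.

    import math

    def optimal_parameters(n: int, p: float):
        # Standard approximations: m = -n*ln(p) / (ln 2)^2 and k = (m/n)*ln 2.
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    # Example: 1 million expected items with a 1% target false positive rate
    # gives roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.
    print(optimal_parameters(1_000_000, 0.01))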

Real-Life Applications:

  1. Databases based on LSM Trees, such as Cassandra and DynamoDB:
    • In an LSM tree-based database, data is often stored in multiple levels (memtable, SSTables, etc.). When querying for a particular key, the database needs to determine which level might contain the key.
    • A Bloom filter is associated with each level of the LSM tree. The Bloom filter quickly indicates whether the key might exist at that level or not.
    • Bloom filters can generate false positives, meaning they may indicate that a key exists when it does not. However, they do not produce false negatives – if the Bloom filter says a key is not present, it is definitely not in the level. The database design takes into consideration the acceptable level of false positives to balance query speed with accuracy.
    • By consulting Bloom filters before accessing specific levels, the database can avoid unnecessary disk I/O operations for keys that are likely not present.
  2. Content Delivery Networks (CDN): 
    • CDNs leverage Bloom filters to avoid caching "one-hit wonders", web pages requested only once. According to Akamai, about 75% of web pages fall into this category. Such pages are cached only upon the second request (a small sketch of this caching decision appears after this list).
    • If the incoming URL is not found in the Bloom filter, it is added to the filter but not yet cached on the Content Delivery Network (CDN).
    • This significantly reduces the caching workload and boosts the cache-hit ratio. False positives are acceptable here: at worst, a small percentage of the one-hit wonders get cached on the CDN.
  3. Systems like Password, Username, or Email ID Validators:
    • Some password validators use a Bloom filter to filter out weak passwords. The system may maintain a Bloom filter populated with known insecure credentials.
    • When a new user signs up, the password is checked against the Bloom filter, and whenever a potential match is found, the user is warned that the password is weak.
    • Occasionally, a strong password may be the victim of a false positive. This is still fine, as the system would at most ask the user to come up with another password.
    • One important thing to note here is that the Bloom filter stores only bits derived from hashes, not the passwords themselves. So, even if this data structure is leaked, the exact passwords will not be compromised.
    • Similar logic can be applied to check username or email ID availability as well.
  4. Malware Detection:
    • Malware detection often involves creating cryptographic hash values (e.g., MD5, SHA-256) of file attributes, such as the file's content, size, and metadata.
    • Known malware hashes, generated from previously identified malicious files, are inserted into the Bloom filter.
    • When a new file is encountered, its attributes are hashed using the same hash functions used to construct the Bloom filter. If the Bloom filter indicates a match, it suggests that the file might be associated with known malware.
    • False positives are acceptable to some extent in malware detection, as further analysis can be performed on flagged files to confirm their status. It provides a quick initial check for known malware without the need to perform more resource-intensive analyses on every file.
    • Bloom filters can be dynamically updated by adding new hash values for recently identified malware.
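
To make the CDN example above concrete, here is a hypothetical caching decision built on the BloomFilter sketch from earlier. The cache dictionary and the fetch_from_origin callback are made-up stand-ins for a real CDN cache and origin fetch.

    # Hypothetical "one-hit wonder" caching decision (reuses BloomFilter above).
    seen_once = BloomFilter(m=1_000_000, k=3)
    cache = {}   # stand-in for the real CDN cache

    def handle_request(url: str, fetch_from_origin):
        if url in cache:
            return cache[url]                 # cache hit
        body = fetch_from_origin(url)
        if seen_once.might_contain(url):
            cache[url] = body                 # second (or later) request: cache it
        else:
            seen_once.add(url)                # first request: remember it, skip caching
        return body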
