Bloom Filters
- Offers very fast lookup
- space efficient, probabilistic checks
- tradeoff between space and accuracy
- Returns
- Maybe Yes (Possible False Positive)
- Firm No (No False Negative)
- Used along with Cache systems, and act as first line of check
- False Positives can be reduced by increasing Bit Vector size
- Examples
- Check Username
- Web Crawler to check if URL is already visited
- Avoid One Hit Wonders
- CDN like Akamai prevent caching one-hit wonders
- web pages that are only requested once
- 75% pages are one-hit wonders
- Facebook to avoid rarely search data in bloom filers to avoid one-hit wonders
- Chrome in past used to check URL is malicious
- Databases
- Cassandra uses it to avoid unnecessary database lookup in SSTable
- PostgresQL, Google BigTable for same reason
- LSM Tree Based NoSQL Databases
- Spell Checkers to check against dictionary
- Password validators use it to avoid weak password
Working
- Create a bit array representing 0s and 1s
- Initialize with all 0s
- Have set of hash functions h1(x), h2(x), h3(x), …
- Given an element, apply all hash functions to the element
- For each result from each hash function, flip 0 to 1
- Any new element if it enters system, has to apply all the hash functions
- If there is 1 present already in the bit array for each function, then it is a Maybe-Yes
- Else, It is a Firm-No
Example
- Assume we have following strings in our set:
| Hash Functions | ”Hello" | "World” |
|---|
| h1(x) | 3 | 4 |
| h2(x) | 5 | 7 |
- Corresponding Bit Vector will be:
- Checking new Words if they are present in the set:
| Hash Functions | ”Coca" | "Cola” |
|---|
| h1(x) | 3 | 0 |
| h2(x) | 7 | 2 |
| Present in Set | Yes (Wrong) | No (Right) |
| Interpretation | False Positive (Maybe-Yes) | False Negative (Absolute-No) |