How To Test A Bloom Filter
Suppose you are creating an account on Geekbook and want a cool username. You enter it and get the message "Username is already taken". You add your birth date to the username: still no luck. Now you add your university roll number as well, and you still get "Username is already taken". It's really frustrating, isn't it?
But have you ever thought about how quickly Geekbook checks the availability of a username by searching the millions of usernames registered with it? There are many ways to do this job:
- Linear search : Bad idea!
- Binary Search : Store all usernames alphabetically and compare the entered username with the middle one in the list. If it matches, the username is taken. Otherwise, figure out whether the entered username comes before or after the middle one; if it comes after, discard all the usernames up to and including the middle one and search the remaining half. Repeat this procedure until you get a match or the search ends with no match. This technique is better and promising, but it still requires multiple steps.
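The binary search option can be sketched with Python's standard `bisect` module. This is a minimal sketch, assuming the usernames fit in a sorted in-memory list; the helper name `is_taken` and the sample usernames are made up for illustration.

```python
from bisect import bisect_left

def is_taken(sorted_usernames, name):
    """Binary search: return True if name is in the sorted list."""
    # Leftmost position at which name could be inserted to keep the order.
    i = bisect_left(sorted_usernames, name)
    return i < len(sorted_usernames) and sorted_usernames[i] == name

# Hypothetical registered usernames, kept in sorted order.
registered = sorted(["alice", "bob", "geeks", "nerd"])
print(is_taken(registered, "geeks"))  # True
print(is_taken(registered, "cat"))    # False
```

Each lookup halves the search range, so a list of a million usernames needs about 20 comparisons.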
But there must be something better!
A Bloom filter is a data structure that can do this task.
To understand Bloom filters, you must know what hashing is. A hash function takes an input and outputs a fixed-length identifier that is used to identify the input. The identifier is not guaranteed to be unique: two different inputs can map to the same output, which is called a collision.
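As a quick illustration of the fixed-length property, the sketch below uses SHA-256 from Python's standard `hashlib` module. The choice of SHA-256 is only for demonstration and is not prescribed by the article; reducing the digest modulo 10 mirrors how a hash is later mapped to an index in a bit array.

```python
import hashlib

def digest_hex(text):
    """Return the SHA-256 digest of text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short = digest_hex("geeks")
long_ = digest_hex("geeks" * 1000)

# Inputs of very different sizes give digests of the same length.
print(len(short), len(long_))  # 64 64

# The same input always hashes to the same value.
print(digest_hex("geeks") == short)  # True

# A digest can be reduced to an index into a 10-slot array.
index = int(short, 16) % 10
print(0 <= index < 10)  # True
```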
What is Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking the availability of a username is a set membership problem, where the set is the list of all registered usernames. The price we pay for efficiency is that it is probabilistic in nature, which means there might be some false positive results. A false positive means the filter might tell you that a given username is already taken when actually it is not.
Interesting Properties of Bloom Filters
- Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
- Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
- Bloom filters never generate a false negative result, i.e., telling you that a username doesn't exist when it actually exists.
- Deleting elements from the filter is not possible because, if we delete a single element by clearing the bits at the indices generated by the k hash functions, it might cause the deletion of a few other elements. Example: if we delete "geeks" (in the example given below) by clearing the bits at 1, 4 and 7, we might end up deleting "nerd" as well, because the bit at index 4 becomes 0 and the Bloom filter claims that "nerd" is not present.
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the hashes for a given input. When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), … hk(x) are set, where the indices are calculated using the hash functions.
Example: Suppose we want to enter "geeks" in the filter, using 3 hash functions and a bit array of length 10, all set to 0 initially. First we'll calculate the hashes as follows:
h1("geeks") % 10 = 1
h2("geeks") % 10 = 4
h3("geeks") % 10 = 7
Note: These outputs are random for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1.
Next we want to enter "nerd". Similarly, we'll calculate the hashes:
h1("nerd") % 10 = 3
h2("nerd") % 10 = 5
h3("nerd") % 10 = 4
Set the bits at indices 3, 5 and 4 to 1.
Now suppose we want to check whether "geeks" is present in the filter or not. We'll do the same process, but this time reading the bits instead of setting them. We calculate the respective hashes using h1, h2 and h3 and check whether all these indices are set to 1 in the bit array. If all the bits are set, then we can say that "geeks" is probably present. If any of the bits at these indices is 0, then "geeks" is definitely not present.
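The walk-through above can be replayed in a few lines of Python. Because the article's hash outputs are illustrative, this sketch does not compute real hashes: the hypothetical `INDICES` table simply hard-codes the indices from the example, and the entry for "dog" is an extra assumption added here to show a negative lookup.

```python
# Illustrative indices from the example, standing in for h1, h2, h3 (mod 10).
# The "dog" entry is an assumption made for this sketch only.
INDICES = {
    "geeks": (1, 4, 7),
    "nerd": (3, 5, 4),
    "dog": (0, 4, 9),
}

bits = [0] * 10  # empty filter: m = 10 bits, all zero

def add(item):
    """Set the bits at every index associated with item."""
    for i in INDICES[item]:
        bits[i] = 1

def check(item):
    """True means 'probably present', False means 'definitely not present'."""
    return all(bits[i] == 1 for i in INDICES[item])

add("geeks")
add("nerd")
print(bits)            # [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
print(check("geeks"))  # True
print(check("dog"))    # False, because bits 0 and 9 are still 0
```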
False Positive in Bloom Filters
The question is why we said "probably present". Why this uncertainty? Let's understand this with an example. Suppose we want to check whether "cat" is present or not. We'll calculate the hashes using h1, h2 and h3:
h1("cat") % 10 = 1
h2("cat") % 10 = 3
h3("cat") % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that "cat" was never added to the filter. The bits at indices 1 and 7 were set when we added "geeks", and the bit at index 3 was set when we added "nerd".
So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that "cat" is present, generating a false positive result. Depending on the application, this could be a huge downside or relatively okay.
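The "cat" case can be reproduced with a toy model. The indices are the illustrative ones from the text, hard-coded in a hypothetical `INDICES` table rather than computed by real hash functions, and the `added` set exists only to label the result.

```python
# Illustrative indices taken from the article's example (not real hashes).
INDICES = {
    "geeks": (1, 4, 7),
    "nerd": (3, 5, 4),
    "cat": (1, 3, 7),
}

bits = [0] * 10
added = set()  # ground truth, kept only to label the result

def add(item):
    added.add(item)
    for i in INDICES[item]:
        bits[i] = 1

def check(item):
    return all(bits[i] for i in INDICES[item])

add("geeks")
add("nerd")

# "cat" was never added, yet every one of its bits was set by other items.
print(check("cat"), "cat" in added)  # True False
```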
We can control the probability of getting a false positive by controlling the size of the Bloom filter. More space means fewer false positives. If we want to decrease the probability of a false positive result, we have to use more hash functions and a larger bit array. This adds latency to both inserting an item and checking membership.
Operations that a Bloom Filter supports
- insert(x) : To insert an element in the Bloom Filter.
- lookup(x) : To check whether an element is already present in the Bloom Filter, with some probability of a false positive.
Note : We cannot delete an element from a Bloom Filter.
Probability of false positivity: Let m be the size of the bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter. Then the probability of a false positive p can be calculated as:
$$p = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k}$$
Size of bit array: If the expected number of elements n is known and the desired false positive probability is p, then the size of the bit array m can be calculated as:
$$m = -\frac{n \ln p}{(\ln 2)^{2}}$$
Optimum number of hash functions: The number of hash functions k must be a positive integer. If m is the size of the bit array and n is the number of elements to be inserted, then k can be calculated as:
$$k = \frac{m}{n} \ln 2$$
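To make the sizing formulas concrete, the sketch below evaluates them for n = 20 and p = 0.05. The function names are chosen here for illustration, and truncating m and k to integers is an assumption that matches the `int()` calls in the Python class later in this article.

```python
import math

def bit_array_size(n, p):
    """m = -(n * ln p) / (ln 2)^2, truncated to an integer."""
    return int(-(n * math.log(p)) / (math.log(2) ** 2))

def hash_count(m, n):
    """k = (m / n) * ln 2, truncated to an integer."""
    return int((m / n) * math.log(2))

def false_positive_prob(m, k, n):
    """p = (1 - (1 - 1/m)^(k*n))^k"""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

n, p = 20, 0.05
m = bit_array_size(n, p)
k = hash_count(m, n)
print(m, k)  # 124 4
print(round(false_positive_prob(m, k, n), 3))
```

Because m and k are truncated, the probability computed from them lands close to, but not exactly on, the requested p.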
Space Efficiency
If we want to store a large list of items in a set for the purpose of set membership, we can store it in a hashmap, a trie, a simple array or a linked list. All these methods require storing the item itself, which is not very memory efficient. For example, if we want to store "geeks" in a hashmap, we have to store the actual string "geeks" as a key-value pair {some_key : "geeks"}.
Bloom filters do not store the data item at all. As we have seen, they use a bit array, which allows hash collisions. Without hash collisions, it would not be compact.
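A rough back-of-the-envelope comparison illustrates the saving. The figures below are assumptions chosen for illustration (one million usernames averaging 10 bytes each, and a 1% false positive target), and the raw-string total ignores the per-object overhead that a real hashmap or list would add on top.

```python
import math

n = 1_000_000        # assumed number of usernames
avg_len_bytes = 10   # assumed average username length
p = 0.01             # assumed target false positive probability

# Storing the items themselves needs at least n * avg_len_bytes bytes of text.
raw_bytes = n * avg_len_bytes

# Bloom filter: m = -(n * ln p) / (ln 2)^2 bits, independent of item length.
m_bits = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
bloom_bytes = math.ceil(m_bits / 8)

print(f"raw strings  : {raw_bytes / 1e6:.1f} MB")
print(f"bloom filter : {bloom_bytes / 1e6:.1f} MB")
print(f"bits per item: {m_bits / n:.2f}")
```

Under these assumptions the filter needs fewer than 10 bits per username, regardless of how long the usernames are.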
Choice of Hash Function
The hash functions used in Bloom filters should be independent and uniformly distributed. They should also be as fast as possible. Fast, simple non-cryptographic hashes which are independent enough include murmur, the FNV series of hash functions and Jenkins hashes.
Generating hashes is the major operation in Bloom filters. Cryptographic hash functions provide stability and guarantees but are expensive to calculate. As the number of hash functions k increases, the Bloom filter becomes slower. Although non-cryptographic hash functions do not provide such guarantees, they provide a major performance improvement.
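As one possible illustration, here is the 64-bit FNV-1a hash written in plain Python, with k indices derived by prefixing a one-byte seed to the input. The seed-prefix scheme and the helper name `indices` are assumptions made for this sketch, not something the article prescribes, and pure Python is far slower than a native implementation.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep the low 64 bits
    return h

def indices(item: str, k: int, m: int) -> list:
    """Return k bit positions in [0, m) for item (seed-prefix scheme, assumed)."""
    data = item.encode("utf-8")
    # Prefixing a different seed byte makes each pass behave like another hash.
    return [fnv1a_64(bytes([seed]) + data) % m for seed in range(k)]

print(indices("geeks", 3, 10))
```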
Basic implementation of a Bloom filter class in Python 3. It uses the third-party mmh3 and bitarray packages. Save it as bloomfilter.py
Python
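import math

import mmh3
from bitarray import bitarray


class BloomFilter(object):
    """Bloom filter using murmur3 hash functions."""

    def __init__(self, items_count, fp_prob):
        """
        items_count : int
            Number of items expected to be stored in the filter
        fp_prob : float
            Desired false positive probability, as a decimal
        """
        # False positive probability, as a decimal
        self.fp_prob = fp_prob
        # Size of the bit array to use
        self.size = self.get_size(items_count, fp_prob)
        # Number of hash functions to use
        self.hash_count = self.get_hash_count(self.size, items_count)
        # Bit array of the given size, with all bits initialized to 0
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)

    def add(self, item):
        """Add an item to the filter."""
        for i in range(self.hash_count):
            # Using i as the seed makes mmh3.hash act as a different
            # hash function on each iteration
            digest = mmh3.hash(item, i) % self.size
            self.bit_array[digest] = True

    def check(self, item):
        """Return False if item is definitely absent, True if probably present."""
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            if not self.bit_array[digest]:
                # A single unset bit proves the item was never added
                return False
        return True

    @classmethod
    def get_size(cls, n, p):
        """Return the bit array size m = -(n * ln p) / (ln 2)^2."""
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(m)

    @classmethod
    def get_hash_count(cls, m, n):
        """Return the number of hash functions k = (m / n) * ln 2."""
        k = (m / n) * math.log(2)
        return int(k)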
Let's test the Bloom filter. Save this file as bloom_test.py
Python
from random import shuffle

from bloomfilter import BloomFilter

n = 20    # expected number of items to add
p = 0.05  # false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive Probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# Words to be added
word_present = ['abound', 'abounds', 'abundance', 'abundant', 'accessible',
                'bloom', 'blossom', 'bolster', 'bonny', 'bonus', 'bonuses',
                'coherent', 'cohesive', 'colorful', 'comely', 'comfort',
                'gems', 'generosity', 'generous', 'generously', 'genial']

# Words not added
word_absent = ['bluff', 'cheater', 'hate', 'war', 'humanity',
               'racism', 'hurt', 'nuke', 'gloomy', 'facebook',
               'geeksforgeeks', 'twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

# Test with 10 words that were added plus all the words that were not
test_words = word_present[:10] + word_absent
shuffle(test_words)
for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'{}' is a false positive!".format(word))
        else:
            print("'{}' is probably present!".format(word))
    else:
        print("'{}' is definitely not present!".format(word))
Output
Size of bit array:124
False positive Probability:0.05
Number of hash functions:4
'war' is definitely not present!
'gloomy' is definitely not present!
'humanity' is definitely not present!
'abundant' is probably present!
'bloom' is probably present!
'coherent' is probably present!
'cohesive' is probably present!
'bluff' is definitely not present!
'bolster' is probably present!
'hate' is definitely not present!
'racism' is definitely not present!
'bonus' is probably present!
'abounds' is probably present!
'genial' is probably present!
'geeksforgeeks' is definitely not present!
'nuke' is definitely not present!
'hurt' is definitely not present!
'twitter' is a false positive!
'cheater' is definitely not present!
'generosity' is probably present!
'facebook' is definitely not present!
'abundance' is probably present!
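With word lists this small, the observed number of false positives is noisy. A larger experiment gives a better estimate of the rate. The sketch below is a standard-library-only stand-in for the class above: it swaps mmh3 for salted BLAKE2b from `hashlib` and the bitarray package for a `bytearray`, so it is not the article's implementation. The class name `TinyBloom` and the key formats are invented for illustration.

```python
import hashlib
import math

class TinyBloom:
    """Stdlib-only Bloom filter sketch (salted BLAKE2b instead of mmh3)."""

    def __init__(self, n, p):
        self.size = int(-(n * math.log(p)) / (math.log(2) ** 2))
        self.hash_count = max(1, int((self.size / n) * math.log(2)))
        self.bits = bytearray(self.size)  # one byte per bit, for simplicity

    def _indices(self, item):
        data = item.encode("utf-8")
        for seed in range(self.hash_count):
            # A different salt per seed acts as a different hash function.
            h = hashlib.blake2b(data, digest_size=8,
                                salt=seed.to_bytes(2, "big"))
            yield int.from_bytes(h.digest(), "big") % self.size

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def check(self, item):
        return all(self.bits[i] for i in self._indices(item))

n, p = 1000, 0.01
bloom = TinyBloom(n, p)
present = [f"user{i}" for i in range(n)]        # keys that are added
absent = [f"guest{i}" for i in range(10 * n)]   # keys that are never added

for name in present:
    bloom.add(name)

false_negatives = sum(not bloom.check(name) for name in present)
false_positives = sum(bloom.check(name) for name in absent)
print("false negatives:", false_negatives)  # 0: added keys are always found
print("observed false positive rate:", false_positives / len(absent))
```

The observed rate should land near the target p; the exact value depends on the hash function and the keys used.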
C++ Implementation
Here is an implementation of a sample Bloom filter with 4 sample hash functions (k = 4) and a bit array of size 100.
C++
#include <cmath>
#include <iostream>
#include <string>
#define ll long long
using namespace std;

// hash 1: sum of the character codes
int h1(string s, int arrSize)
{
    ll int hash = 0;
    for (int i = 0; i < s.size(); i++)
    {
        hash = (hash + ((int)s[i]));
        hash = hash % arrSize;
    }
    return hash;
}

// hash 2: characters weighted by powers of 19
int h2(string s, int arrSize)
{
    ll int hash = 1;
    for (int i = 0; i < s.size(); i++)
    {
        hash = hash + pow(19, i) * s[i];
        hash = hash % arrSize;
    }
    return hash % arrSize;
}

// hash 3: polynomial rolling hash with base 31
int h3(string s, int arrSize)
{
    ll int hash = 7;
    for (int i = 0; i < s.size(); i++)
    {
        hash = (hash * 31 + s[i]) % arrSize;
    }
    return hash % arrSize;
}

// hash 4: mixes the running hash with the first character
int h4(string s, int arrSize)
{
    ll int hash = 3;
    int p = 7;
    for (int i = 0; i < s.size(); i++) {
        hash += hash * 7 + s[0] * pow(p, i);
        hash = hash % arrSize;
    }
    return hash;
}

// lookup operation: true means "probably present"
bool lookup(bool* bitarray, int arrSize, string s)
{
    int a = h1(s, arrSize);
    int b = h2(s, arrSize);
    int c = h3(s, arrSize);
    int d = h4(s, arrSize);

    return bitarray[a] && bitarray[b] && bitarray[c] && bitarray[d];
}

// insert operation
void insert(bool* bitarray, int arrSize, string s)
{
    // check whether the element is (probably) already present
    if (lookup(bitarray, arrSize, s))
        cout << s << " is Probably already present" << endl;
    else
    {
        int a = h1(s, arrSize);
        int b = h2(s, arrSize);
        int c = h3(s, arrSize);
        int d = h4(s, arrSize);

        bitarray[a] = true;
        bitarray[b] = true;
        bitarray[c] = true;
        bitarray[d] = true;

        cout << s << " inserted" << endl;
    }
}

// Driver code
int main()
{
    bool bitarray[100] = { false };
    int arrSize = 100;
    string sarray[33]
        = { "abound", "abounds", "abundance",
            "abundant", "accessable", "bloom",
            "blossom", "bolster", "bonny",
            "bonus", "bonuses", "coherent",
            "cohesive", "colorful", "comely",
            "comfort", "gems", "generosity",
            "generous", "generously", "genial",
            "bluff", "cheater", "hate",
            "war", "humanity", "racism",
            "hurt", "nuke", "gloomy",
            "facebook", "geeksforgeeks", "twitter" };
    for (int i = 0; i < 33; i++) {
        insert(bitarray, arrSize, sarray[i]);
    }
    return 0;
}
Output
abound inserted
abounds inserted
abundance inserted
abundant inserted
accessable inserted
bloom inserted
blossom inserted
bolster inserted
bonny inserted
bonus inserted
bonuses inserted
coherent inserted
cohesive inserted
colorful inserted
comely inserted
comfort inserted
gems inserted
generosity inserted
generous inserted
generously inserted
genial inserted
bluff is Probably already present
cheater inserted
hate inserted
war is Probably already present
humanity inserted
racism inserted
hurt inserted
nuke is Probably already present
gloomy is Probably already present
facebook inserted
geeksforgeeks inserted
twitter inserted
Applications of Bloom filters
- Medium uses Bloom filters for recommending posts to users by filtering out posts which have been seen by the user.
- Quora implemented a shared Bloom filter in the feed backend to filter out stories that people have seen before.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
- Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns.
References
- https://en.wikipedia.org/wiki/Bloom_filter
- https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff
- https://www.quora.com/What-are-the-best-applications-of-Bloom-filters
This article is contributed by Atul Kumar and improved by Manoj Kumar.
Source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
Posted by: robinsonadardly84.blogspot.com
