How To Test A Bloom Filter

Suppose you are creating an account on Geekbook and want to pick a cool username. You enter it and get the message "Username is already taken". You add your birth date to the username: still no luck. Now you add your university roll number as well, and still get "Username is already taken". It's really frustrating, isn't it?
But have you ever thought about how quickly Geekbook checks the availability of a username by searching the millions of usernames registered with it? There are many ways to do this job:

Linear search : Bad idea!
Binary Search : Store all usernames alphabetically and compare the entered username with the middle one in the list. If it matches, the username is taken. Otherwise, figure out whether the entered username comes before or after the middle one; if it comes after, discard all the usernames before the middle one (inclusive). Now search after the middle one, and repeat this procedure until you get a match or the search ends with no match. This technique is better and promising, but it still requires multiple steps.
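The binary search described above can be sketched with Python's standard bisect module. The username list and the helper name is_taken are made up for illustration; a real service would not keep its usernames in an in-memory list.

```python
from bisect import bisect_left


def is_taken(sorted_usernames, candidate):
    """Return True if candidate occurs in the sorted list (binary search)."""
    i = bisect_left(sorted_usernames, candidate)  # O(log n) comparisons
    return i < len(sorted_usernames) and sorted_usernames[i] == candidate


# Hypothetical registered usernames, kept in sorted order.
usernames = sorted(["alice", "bob", "geek42", "nerd", "zed"])

print(is_taken(usernames, "geek42"))  # True
print(is_taken(usernames, "geek43"))  # False
```

Even with millions of usernames this needs only a few dozen comparisons, but every comparison touches the stored strings themselves.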
But, there must be something better!!

Bloom filter is a data structure that can do this task.
To understand Bloom filters, you must know what hashing is. A hash function takes an input and outputs an identifier of fixed length, which is used to identify the input. The identifier is not guaranteed to be unique: two different inputs can produce the same hash.
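As a rough illustration of hashing, the sketch below uses SHA-256 from Python's standard hashlib module to turn strings of any length into a fixed-length digest, and then reduces that digest to an index into a 10-slot array. SHA-256 and the helper name index_for are choices made for this example only; the article's own implementation further down uses MurmurHash (mmh3).

```python
import hashlib


def index_for(text, slots):
    """Map text to an index in range(slots) via a fixed-length SHA-256 digest."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # always 32 bytes
    return int.from_bytes(digest, "big") % slots


for word in ["geeks", "nerd", "a much longer input string"]:
    digest_len = len(hashlib.sha256(word.encode("utf-8")).digest())
    print(word, "-> digest bytes:", digest_len, "index:", index_for(word, 10))
```

The same input always maps to the same index, and with only 10 slots different inputs will sometimes share one.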

What is Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking the availability of a username is a set membership problem, where the set is the list of all registered usernames. The price we pay for efficiency is that it is probabilistic in nature, which means there might be some false positive results. A false positive means the filter might tell us that a given username is already taken when actually it is not.
Interesting Properties of Bloom Filters

Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
Bloom filters never generate a false negative result, i.e., they never tell you that a username doesn't exist when it actually exists.
Deleting elements from the filter is not possible because, if we delete a single element by clearing the bits at the indices generated by the k hash functions, it might cause the deletion of a few other elements. Example: if we delete "geeks" (in the example given below) by clearing the bits at 1, 4 and 7, we might end up deleting "nerd" as well, because the bit at index 4 becomes 0 and the Bloom filter claims that "nerd" is not present.
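The deletion problem can be reproduced in a few lines. This sketch hard-codes the illustrative indices from the "geeks"/"nerd" example in the next section instead of computing real hashes, and naive_delete is a hypothetical operation shown only to demonstrate why it must not exist.

```python
# Illustrative indices from the article's example (not real hash outputs).
INDICES = {"geeks": (1, 4, 7), "nerd": (3, 5, 4)}

bits = [0] * 10


def add(word):
    for i in INDICES[word]:
        bits[i] = 1


def check(word):
    return all(bits[i] for i in INDICES[word])


def naive_delete(word):
    """Clear the word's bits: the operation a Bloom filter must NOT offer."""
    for i in INDICES[word]:
        bits[i] = 0


add("geeks")
add("nerd")
print(check("nerd"))   # True
naive_delete("geeks")  # also clears bit 4, which "nerd" shares
print(check("nerd"))   # False: a false negative for an element that was added
```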

Working of Bloom Filter

An empty Bloom filter is a bit array of m bits, all set to zero, like this:

[Figure: empty bit array]

We need k hash functions to calculate the hashes for a given input. When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), … hk(x) are set, where the indices are calculated using the hash functions.
Example: Suppose we want to enter "geeks" in the filter. We are using 3 hash functions and a bit array of length 10, all set to 0 initially. First we'll calculate the hashes as follows:

h1("geeks") % 10 = 1
h2("geeks") % 10 = 4
h3("geeks") % 10 = 7

Note: These outputs are random, for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1.

[Figure: bit array after adding "geeks"; bits 1, 4 and 7 are set]

Next we want to enter "nerd". Similarly, we'll calculate the hashes:

h1("nerd") % 10 = 3
h2("nerd") % 10 = 5
h3("nerd") % 10 = 4

Set the bits at indices 3, 5 and 4 to 1.

[Figure: bit array after adding "nerd"; bits 1, 3, 4, 5 and 7 are set]

Now suppose we want to check whether "geeks" is present in the filter or not. We'll do the same process, but this time in reverse order. We calculate the respective hashes using h1, h2 and h3 and check whether all these indices are set to 1 in the bit array. If all the bits are set, then we can say that "geeks" is probably present. If any of the bits at these indices is 0, then "geeks" is definitely not present.
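The add and check steps above can be sketched with the standard library alone. The way the three hash functions are derived here (SHA-256 over the item prefixed with a counter) is an assumption made for this example, so the indices will differ from the illustrative values 1, 4, 7 and 3, 5, 4 used in the walkthrough.

```python
import hashlib

M = 10  # number of bits, as in the example above
K = 3   # number of hash functions


def indices(item):
    """Derive K indices by hashing the item together with a counter 0..K-1."""
    out = []
    for seed in range(K):
        data = f"{seed}:{item}".encode("utf-8")
        out.append(int.from_bytes(hashlib.sha256(data).digest(), "big") % M)
    return out


bits = [0] * M


def add(item):
    for i in indices(item):
        bits[i] = 1


def check(item):
    # True means "probably present"; False means "definitely not present".
    return all(bits[i] for i in indices(item))


add("geeks")
add("nerd")
print(bits)
print(check("geeks"), check("nerd"))  # True True: added items are always found
```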

False Positive in Bloom Filters

The question is why we said "probably present"; why this uncertainty? Let's understand this with an example. Suppose we want to check whether "cat" is present or not. We'll calculate the hashes using h1, h2 and h3:

h1("cat") % 10 = 1
h2("cat") % 10 = 3
h3("cat") % 10 = 7

If we check the bit array, the bits at these indices are set to 1, but we know that "cat" was never added to the filter. The bits at indices 1 and 7 were set when we added "geeks", and bit 3 was set when we added "nerd".

[Figure: lookup of "cat"; bits 1, 3 and 7 are already set]

So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that "cat" is present, generating a false positive result. Depending on the application, this could be a huge downside or relatively okay.
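The "cat" false positive can be replayed directly. The sketch below hard-codes the illustrative indices from the walkthrough rather than computing real hashes.

```python
# Illustrative indices from the walkthrough above (not real hash outputs).
INDICES = {"geeks": (1, 4, 7), "nerd": (3, 5, 4), "cat": (1, 3, 7)}

bits = [0] * 10
for word in ("geeks", "nerd"):  # "cat" is never added
    for i in INDICES[word]:
        bits[i] = 1


def check(word):
    return all(bits[i] for i in INDICES[word])


print(bits)          # [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
print(check("cat"))  # True, although "cat" was never added
```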
We can control the probability of getting a false positive by controlling the size of the Bloom filter. More space means fewer false positives. If we want to decrease the probability of a false positive result, we have to use more hash functions and a larger bit array. This adds latency to both adding items and checking membership.
Operations that a Bloom Filter supports

insert(x) : To insert an element in the Bloom filter.
lookup(x) : To check whether an element is already present in the Bloom filter, with some probability of a false positive.

Note : We cannot delete an element from a Bloom filter.
Probability of false positives: Let m be the size of the bit array, k be the number of hash functions and n be the number of elements expected to be inserted in the filter. Then the probability of a false positive P can be calculated as:

$P=\left ( 1-\left [ 1- \frac {1}{m} \right ]^{kn} \right )^k$
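To get a feel for the formula, it can be evaluated numerically. The function name false_positive_probability is chosen for this sketch; the values m = 124, k = 4 and n = 20 are the ones reported by the test run later in the article.

```python
def false_positive_probability(m, k, n):
    """P = (1 - (1 - 1/m)^(k*n))^k for m bits, k hash functions, n elements."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


# 124 bits, 4 hash functions, 20 inserted elements.
print(round(false_positive_probability(124, 4, 20), 3))
# Same k and n with half the bits: false positives become far more likely.
print(round(false_positive_probability(62, 4, 20), 3))
```

The first value comes out close to the 0.05 that the filter in the test run was configured for.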

Size of Bit Array: If the expected number of elements n is known and the desired false positive probability is P, then the size of the bit array m can be calculated as:

$m= -\frac {n\ln P}{(\ln 2)^2}$

Optimum number of hash functions: The number of hash functions k must be a positive integer. If m is the size of the bit array and n is the number of elements to be inserted, then k can be calculated as:

$k= \frac {m}{n} \ln 2$
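As a worked example of the two formulas, take n = 20 and P = 0.05, the values used in the Python test further down (intermediate values are rounded):

$m = -\frac{20 \ln 0.05}{(\ln 2)^2} \approx \frac{20 \times 2.9957}{0.4805} \approx 124.7$

$k = \frac{m}{n} \ln 2 \approx \frac{124}{20} \times 0.6931 \approx 4.3$

The Python implementation truncates both results to integers, giving m = 124 and k = 4, which matches the output of the test run.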

Space Efficiency

If we want to store a large list of items in a set for the purpose of set membership, we can store it in a hashmap, trie, simple array or linked list. All these methods require storing the item itself, which is not very memory efficient. For example, if we want to store "geeks" in a hashmap, we have to store the actual string "geeks" as a key-value pair {some_key : "geeks"}.
Bloom filters do not store the data items at all. As we have seen, they use a bit array, which allows hash collisions. Without hash collisions, it would not be compact.
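One way to quantify this: dividing the size formula above by n gives the number of bits needed per element, which depends only on the target false positive probability and not on how long the items are. The helper name bits_per_element is made up for this sketch.

```python
import math


def bits_per_element(p):
    """Bits needed per stored element for false positive probability p (m / n)."""
    return -math.log(p) / (math.log(2) ** 2)


for p in (0.1, 0.05, 0.01, 0.001):
    print(f"p = {p}: about {bits_per_element(p):.1f} bits per element")
```

For comparison, the five characters of "geeks" alone occupy 40 bits when stored as UTF-8 text, before any container overhead.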

Choice of Hash Function

The hash functions used in Bloom filters should be independent and uniformly distributed. They should also be as fast as possible. Fast, simple non-cryptographic hashes that are independent enough include murmur, the FNV series of hash functions and Jenkins hashes.
Generating hashes is the major operation in Bloom filters. Cryptographic hash functions provide stability and guarantees but are expensive to calculate. As the number of hash functions k increases, the Bloom filter becomes slower. Non-cryptographic hash functions do not provide such guarantees, but they do provide a major performance improvement.
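To show how small a non-cryptographic hash can be, here is a pure-Python sketch of 32-bit FNV-1a, one of the FNV variants mentioned above. Deriving k hash functions by prefixing the item with a one-byte seed is an assumption of this example, not something the article prescribes, and it may not give hashes as independent as a properly seeded function such as murmur.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime


def fnv1a_32(data):
    """32-bit FNV-1a hash of a bytes object."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def k_indices(item, k, m):
    """Assumed seeding scheme: prefix the item with the hash number 0..k-1."""
    return [fnv1a_32(bytes([seed]) + item.encode("utf-8")) % m for seed in range(k)]


print(hex(fnv1a_32(b"a")))  # 0xe40c292c
print(k_indices("geeks", 4, 100))
```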

Basic implementation of a Bloom Filter class in Python3. Save it as bloomfilter.py

Python

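import math

import mmh3
from bitarray import bitarray


class BloomFilter(object):
    """Bloom filter using the murmur3 hash function."""

    def __init__(self, items_count, fp_prob):
        # items_count: expected number of items to be stored
        # fp_prob: desired false positive probability, as a decimal
        self.fp_prob = fp_prob
        # Size of the bit array to use
        self.size = self.get_size(items_count, fp_prob)
        # Number of hash functions to use
        self.hash_count = self.get_hash_count(self.size, items_count)
        # Bit array of the given size, with all bits initialized to 0
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)

    def add(self, item):
        """Add an item to the filter."""
        for i in range(self.hash_count):
            # Using i as the seed makes mmh3.hash act as a different
            # hash function on each iteration
            digest = mmh3.hash(item, i) % self.size
            self.bit_array[digest] = True

    def check(self, item):
        """Return False if item is definitely absent, True if it is probably present."""
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            if not self.bit_array[digest]:
                # A single unset bit proves the item was never added
                return False
        return True

    @classmethod
    def get_size(cls, n, p):
        """Size of the bit array: m = -(n * ln(p)) / (ln(2)^2)."""
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(m)

    @classmethod
    def get_hash_count(cls, m, n):
        """Number of hash functions: k = (m / n) * ln(2)."""
        k = (m / n) * math.log(2)
        return int(k)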

Let's test the Bloom filter. Save this file as bloom_test.py

Python

from bloomfilter import BloomFilter
from random import shuffle

n = 20    # number of items to add
p = 0.05  # false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive Probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# Words to be added to the filter
word_present = ['abound', 'abounds', 'abundance', 'abundant', 'accessible',
                'bloom', 'blossom', 'bolster', 'bonny', 'bonus', 'bonuses',
                'coherent', 'cohesive', 'colorful', 'comely', 'comfort',
                'gems', 'generosity', 'generous', 'generously', 'genial']

# Words that are never added
word_absent = ['bluff', 'cheater', 'hate', 'war', 'humanity',
               'racism', 'hurt', 'nuke', 'gloomy', 'facebook',
               'geeksforgeeks', 'twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

# Test with 10 of the added words plus all of the absent words
test_words = word_present[:10] + word_absent
shuffle(test_words)
for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'{}' is a false positive!".format(word))
        else:
            print("'{}' is probably present!".format(word))
    else:
        print("'{}' is definitely not present!".format(word))

Output

Size of bit assortment:124 Imitation positive Probability:0.05 Number of hash functions:4 'war' is definitely not present! 'gloomy' is definitely non present! 'humanity' is definitely not present! 'abundant' is probably present! 'bloom' is probably present! 'coherent' is probably present! 'cohesive' is probably present! 'bluff' is definitely not present! 'bolster' is probably nowadays! 'hate' is definitely not present! 'racism' is definitely non present! 'bonus' is probably present! 'abounds' is probably present! 'genial' is probably present! 'geeksforgeeks' is definitely not present! 'nuke' is definitely not present! 'hurt' is definitely not present! 'twitter' is a false positive! 'cheater' is definitely not present! 'generosity' is probably present! 'facebook' is definitely not nowadays! 'affluence' is probably present!

C++ Implementation

Here is the implementation of a sample Bloom filter with 4 sample hash functions (k = 4) and a bit array of size 100.

C++

#include <bits/stdc++.h>
#define ll long long
using namespace std;

// hash 1: sum of the character codes
int h1(string s, int arrSize)
{
    ll int hash = 0;
    for (int i = 0; i < s.size(); i++)
    {
        hash = (hash + ((int)s[i]));
        hash = hash % arrSize;
    }
    return hash;
}

// hash 2: characters weighted by powers of 19
int h2(string s, int arrSize)
{
    ll int hash = 1;
    for (int i = 0; i < s.size(); i++)
    {
        hash = hash + pow(19, i) * s[i];
        hash = hash % arrSize;
    }
    return hash % arrSize;
}

// hash 3: polynomial hash with multiplier 31
int h3(string s, int arrSize)
{
    ll int hash = 7;
    for (int i = 0; i < s.size(); i++)
    {
        hash = (hash * 31 + s[i]) % arrSize;
    }
    return hash % arrSize;
}

// hash 4: mixes the first character with powers of 7
int h4(string s, int arrSize)
{
    ll int hash = 3;
    int p = 7;
    for (int i = 0; i < s.size(); i++) {
        hash += hash * 7 + s[0] * pow(p, i);
        hash = hash % arrSize;
    }
    return hash;
}

// lookup operation: true means "probably present"
bool lookup(bool* bitarray, int arrSize, string s)
{
    int a = h1(s, arrSize);
    int b = h2(s, arrSize);
    int c = h3(s, arrSize);
    int d = h4(s, arrSize);

    if (bitarray[a] && bitarray[b] && bitarray[c]
        && bitarray[d])
        return true;
    else
        return false;
}

// insert operation
void insert(bool* bitarray, int arrSize, string s)
{
    // check if the element is already present or not
    if (lookup(bitarray, arrSize, s))
        cout << s << " is Probably already present" << endl;
    else
    {
        int a = h1(s, arrSize);
        int b = h2(s, arrSize);
        int c = h3(s, arrSize);
        int d = h4(s, arrSize);

        bitarray[a] = true;
        bitarray[b] = true;
        bitarray[c] = true;
        bitarray[d] = true;

        cout << s << " inserted" << endl;
    }
}

// Driver code
int main()
{
    bool bitarray[100] = { false };
    int arrSize = 100;
    string sarray[33]
        = { "abound", "abounds", "abundance",
            "abundant", "accessable", "bloom",
            "blossom", "bolster", "bonny",
            "bonus", "bonuses", "coherent",
            "cohesive", "colorful", "comely",
            "comfort", "gems", "generosity",
            "generous", "generously", "genial",
            "bluff", "cheater", "hate",
            "war", "humanity", "racism",
            "hurt", "nuke", "gloomy",
            "facebook", "geeksforgeeks", "twitter" };
    for (int i = 0; i < 33; i++) {
        insert(bitarray, arrSize, sarray[i]);
    }
    return 0;
}

Output

abound inserted
abounds inserted
abundance inserted
abundant inserted
accessable inserted
bloom inserted
blossom inserted
bolster inserted
bonny inserted
bonus inserted
bonuses inserted
coherent inserted
cohesive inserted
colorful inserted
comely inserted
comfort inserted
gems inserted
generosity inserted
generous inserted
generously inserted
genial inserted
bluff is Probably already present
cheater inserted
hate inserted
war is Probably already present
humanity inserted
racism inserted
hurt inserted
nuke is Probably already present
gloomy is Probably already present
facebook inserted
geeksforgeeks inserted
twitter inserted

Applications of Bloom filters

Medium uses Bloom filters for recommending posts to users by filtering out posts that the user has already seen.
Quora implemented a shared Bloom filter in the feed backend to filter out stories that people have seen before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns.

References

https://en.wikipedia.org/wiki/Bloom_filter
https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff
https://www.quora.com/What-are-the-best-applications-of-Bloom-filters

This article is contributed by Atul Kumar and improved by Manoj Kumar.