Probabilistic Data Structures – Bloom Filter and HyperLogLog for Big Data

When working with large volumes of data, memory and space requirements can be very high. This in turn affects scalability: suddenly your job or process either takes too long or requires more resources. Probabilistic data structures let you trade some accuracy for an immense decrease in memory usage. For example, a single HyperLogLog structure using 2.56 KB can count up to approximately 7,900,000,000 unique items with 1.625% error. This can be very efficient for analytics applications, for example, when you want to calculate how many unique users have visited a URL.

A 1.625% error means that if we were counting unique license plate numbers and our HyperLogLog counter said there were 654,192,028, we could be confident that the actual number is between 643,561,407 and 664,822,648. Furthermore, if this accuracy isn't sufficient, you can simply give the structure more memory and it will perform better: 40.96 KB of resources decreases the error from 1.625% to 0.4%. By contrast, storing the same data in a set would take 3.925 GB, even assuming no overhead!
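These figures follow from HyperLogLog's standard error formula, roughly 1.04/√m, where m is the number of registers (2.56 KB holds 4,096 five-bit registers; 40.96 KB holds 65,536). A quick sanity check of the numbers above:

```java
public class HllErrorCheck {
    public static void main(String[] args) {
        // HyperLogLog standard error ≈ 1.04 / sqrt(m), m = number of registers
        int m1 = 4096;   // 4,096 five-bit registers ≈ 2.56 KB
        int m2 = 65536;  // 65,536 five-bit registers ≈ 40.96 KB
        System.out.printf("error at 2.56 KB:  %.3f%%%n", 100 * 1.04 / Math.sqrt(m1));
        System.out.printf("error at 40.96 KB: %.3f%%%n", 100 * 1.04 / Math.sqrt(m2));
    }
}
```

This reproduces the 1.625% and roughly 0.4% figures quoted above.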

Another restriction is that both Bloom filters and HyperLogLog answer only a specific set of questions. A Bloom filter is a data structure that offers membership queries only, where the result of a lookup is one of two values: a definite no, meaning the item does not exist in the Bloom filter, and maybe, meaning there is some probability that the item exists. The false-positive rate can be tuned.
Bloom filters are used in BigTable and HBase to eliminate the need to read blocks from disk just to determine whether they contain a key.
So a Bloom filter is an efficient way to ask exactly one question: does my data structure contain a key?

Let’s create a simple application to demonstrate the use of a Bloom filter. I will be using the Google Guava implementation; details can be found in the Guava documentation. The default false-positive probability in the Guava implementation is 3%.
First, let’s get the dependencies.
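A minimal Maven snippet for Guava looks something like this (the version shown is only an example; use whatever recent release suits your project):

```xml
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>18.0</version> <!-- example version -->
</dependency>
```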


Bloom filter test application:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;

import java.util.Random;

public class BloomFilterApp {
    static Random random = new Random();

    public static void main(String[] args) {
        int totals = 100000;
        // convert an object into a funnel - specific to the Guava implementation
        Funnel<String> funnel = new Funnel<String>() {
            public void funnel(String s, PrimitiveSink primitiveSink) {
                primitiveSink.putString(s, Charsets.UTF_8);
            }
        };

        BloomFilter<String> bloomFilter = BloomFilter.create(funnel, totals);
        for (int i = 0; i < totals; i++) {
            int value = random.nextInt(10000);
            // insert only even values into the bloom filter
            if ((value % 2) == 0) {
                bloomFilter.put("key" + value);
            }
        }

        // check if a key exists in the bloom filter
        String key = "key100";
        assert bloomFilter.mightContain(key);   // even value: almost certainly inserted

        String key2 = "key5";
        assert !bloomFilter.mightContain(key2); // odd value: never inserted (a false positive is still possible)
    }
}

Hadoop also provides an implementation of a Bloom filter. It is quite easy to adapt Bloom filters for use in a distributed environment like Hadoop or Storm.
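As a sketch of what the Hadoop version looks like, here is a minimal use of `org.apache.hadoop.util.bloom.BloomFilter` (requires the hadoop-common artifact on the classpath; the sizing parameters below are arbitrary choices, not recommendations):

```java
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class HadoopBloomSketch {
    public static void main(String[] args) {
        // bit-vector size, number of hash functions, hash family - sized arbitrarily here
        BloomFilter filter = new BloomFilter(100000, 5, Hash.MURMUR_HASH);
        filter.add(new Key("key100".getBytes()));

        System.out.println(filter.membershipTest(new Key("key100".getBytes()))); // true: it was added
        System.out.println(filter.membershipTest(new Key("key5".getBytes())));   // most likely false
    }
}
```

Because Hadoop's filter implements the Writable interface, it can be serialized and shipped between tasks like any other Hadoop value.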

HyperLogLog is a data structure that lets you ask questions about cardinality, for example, how many unique users have visited a URL. HyperLogLog was first introduced in a 2007 paper to “estimate the number of distinct elements of very large data ensembles”. HyperLogLog trades precision for space.
The traditional way to calculate this type of aggregation would require a hash table, which could explode your data storage needs. HyperLogLog has a much smaller memory footprint; the trade-off is accuracy, which you can tune. For example, in 1,280 bytes HyperLogLog can estimate the count of tens of billions of distinct values with only a few percent error.

An interesting implementation of HyperLogLog is java-hll. Let’s write a simple program that shows how java-hll works.
First, you need to download the libraries from Maven Central. HLL requires a good hashing algorithm to hash values, so we will be using Google Guava’s hashing utilities. Our Maven dependencies:
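For example (the coordinates below are the java-hll artifact as published on Maven Central; the versions are examples only):

```xml
<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version> <!-- example version -->
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>18.0</version> <!-- example version -->
</dependency>
```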


Then we can create a simple test program to illustrate HyperLogLog:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import net.agkn.hll.HLL;

public class HyperLogLog {
    public static void main(String[] args) {
        final int seed = 123456;
        HashFunction hash = Hashing.murmur3_128(seed);
        // data on which to calculate the distinct count
        final Integer[] data = new Integer[]{1, 1, 2, 3, 4, 5, 6, 6,
                6, 7, 7, 7, 7, 8, 10};
        final HLL hll = new HLL(13, 5); // log2 of the number of buckets, and bits per bucket
        for (int item : data) {
            final long value = hash.newHasher().putInt(item).hash().asLong();
            hll.addRaw(value);
        }
        System.out.println("Distinct count=" + hll.cardinality());
    }
}

The result will be:
Distinct count=9

It’s easy to adapt this code into a Hadoop MapReduce job. The HLL implementation also provides a toBytes method that we can use to serialize the data structure with a Hadoop serialization framework like Avro, and a fromBytes method for deserialization. Some databases, for example Redis, support HyperLogLog as a built-in data structure.
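A sketch of round-tripping an HLL through bytes with java-hll (same dependencies as above):

```java
import net.agkn.hll.HLL;

public class HllSerializationSketch {
    public static void main(String[] args) {
        HLL hll = new HLL(13, 5);
        // addRaw expects pre-hashed 64-bit values
        hll.addRaw(123456789L);
        hll.addRaw(987654321L);

        // serialize into a byte[] - ready to wrap in an Avro bytes field or a Writable
        byte[] bytes = hll.toBytes();

        // deserialize on the other side and keep counting
        HLL restored = HLL.fromBytes(bytes);
        System.out.println("restored cardinality=" + restored.cardinality());
    }
}
```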

If you are dealing with huge amounts of data and can tolerate some margin of error, then probabilistic data structures can be of great help.
