Counting Distinct Elements in a Stream
- Distinct elements problem
- Streams and sketches
- MinHash
- The math of MinHash
- Distinct elements with MinHash
- Median trick optimization
- Space analysis
- In practice and further reading
- Conclusion
- Resources
Distinct elements problem
Most problems in computer science become challenging when they are brought to scale. The distinct elements problem is not a difficult one to understand: given a group of elements that may be identical or unique, determine the number of distinct elements. For example, consider the following list of integers: 2, 5, 2, 7, 2, 9, 5.
If all of these elements are added into a set, then the number of distinct elements is 4.
It's clear that we are just talking about the cardinality of the set.
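In code, that naive exact approach might look like the following minimal sketch (the helper name and the little driver at the bottom are mine, added just to make it runnable):

def count_distinct_exact(stream):
    # Store every element we see; the set silently drops duplicates.
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)

print(count_distinct_exact([2, 5, 2, 7, 2, 9, 5]))  # 4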
A set is the go-to solution for simple variants of this problem, but it is not interesting enough, so let's put a small twist on the original problem. In mathematical theory, a set can contain arbitrarily many elements, but in practice, a set is stored in memory. What if we have so many elements that we cannot store our set in memory (e.g. we are Google and we want to determine how many unique searches there are per day)?
Suddenly, our solution doesn't work in practice.
Streams and sketches
Many problems that can be solved with our traditional data structures fail to scale to massive datasets. In data science, we often have a stream of data coming in, and we do not have nearly enough space to store all of the information in a single database. We still want to analyze this data holistically, but realistically we only see each data point once as it arrives in the stream, because it is rarely feasible to ship every stored data point from a distributed database back to a single location for analysis.
These types of problems are commonly solved with sketch algorithms. Sketch algorithms are a family of algorithms that can compress a stream of data such that the compressed representation can be utilized in answering queries. By allowing for a small amount of error, we can significantly reduce the amount of storage necessary to compute the result of our queries. The catch is that because the dataset is compressed, it may not guarantee the exact solution to the query. However, it is possible to bound the error of the solution in such a way that the solution is still meaningful to us.
MinHash
The core of our algorithm for the Count-Distinct problem in streams is an elegant application of randomized hashing known as MinHash (Flajolet-Martin, 1985). Suppose we have a randomized hash function $h$ whose output is a uniformly random real value in $[0, 1]$, where distinct elements receive independent hash values (an idealization we will rely on in the analysis). Then, MinHash is defined as:
# X is a dataset stream and x_i is
# an individual element of X
def min_hash(X):
    s = 1
    for x_i in X:
        s = min(s, h(x_i))
    return 1/s - 1
With only four lines of pseudocode, the MinHash algorithm is shockingly short: just hash each element, track the minimum hash, and return its inverse minus 1. There isn't much more to it: the return value of min_hash is the estimated number of distinct elements in the dataset $X$.
Remember, the randomized hash function returns a float in the range $[0, 1]$. The implication here is that the inverse of the minimum hash value of the dataset is at least 1 and at most infinity. As a result, the return value of min_hash is in the range $[0, \infty)$.
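As a concrete illustration (this is my own runnable rendering, not code from the original), here is min_hash with a stand-in hash function h built from Python's blake2b: a 64-bit digest divided by $2^{64}$ plays the role of a uniform value in $[0, 1)$, which foreshadows the discretization discussed in the practical notes at the end.

import hashlib

def h(x):
    # Stand-in for the idealized hash: map x to a pseudo-uniform
    # float in [0, 1) via a 64-bit digest divided by 2**64.
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def min_hash(X):
    s = 1.0
    for x_i in X:
        s = min(s, h(x_i))
    return 1 / s - 1

# Duplicates hash identically, so they cannot change the minimum.
print(min_hash([2, 5, 2, 7, 2, 9, 5]))  # a (noisy) estimate of 4

With a single hash function the estimate is quite noisy, which is exactly what the analysis below addresses.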
The math of MinHash
As with many elegant algorithms, the theory behind the algorithm is more complex than the algorithm itself. Fortunately, the theory for MinHash mostly requires only an introductory university-level statistics background (expected value, variance, probability densities, concentration bounds, etc.).
First, let us get some intuition for MinHash. We will recycle our example from the beginning.
Our hash function should hash duplicate numbers to the same hash value (i.e. all 2s must map to the same float), so there exist only four unique hash values for this list of numbers. An initial observation is that the number of unique hash values is equal to the number of distinct elements in the list. Yet, if we try to store the unique hash values themselves, we run into the same issue as before: storing them requires an in-memory data structure that grows with the number of distinct elements.
Instead, we keep only the minimum hash value, which must somehow tell us information about how many distinct elements exist in the stream. In our pseudocode, we denote $s$ as the minimum hash value of the elements in $X$. Can we retrieve the number of distinct elements just from the minimum hash $s$?
Claim 1. For a dataset with all distinct elements, the CDF of $s$ is $\Pr[s \le \lambda] = 1 - (1 - \lambda)^n$, where $n$ is the number of elements in the dataset.
Proof. By definition, the CDF of $s$ is $F_s(\lambda) = \Pr[s \le \lambda]$. Let $X$ be our dataset, $x_i$ be the $i$-th element of $X$, and $h$ be our randomized hash function. The minimum hash exceeds $\lambda$ only when every hash does, so
$$\Pr[s \le \lambda] = 1 - \Pr[h(x_1) > \lambda, \ldots, h(x_n) > \lambda] = 1 - \prod_{i=1}^{n} \Pr[h(x_i) > \lambda] = 1 - (1 - \lambda)^n.$$
We start off by determining the CDF of $s$. This is useful because the CDF gets us access to calculating the expected value of $s$. Notice that our proof has only shown this to be true assuming all elements in $X$ are unique. Because identical elements are guaranteed to have the same hash from our randomized hash function, we can disregard any extra copies of the same element. Let $d$ be the true number of distinct elements in $X$. Then, the CDF becomes $\Pr[s \le \lambda] = 1 - (1 - \lambda)^d$. From here, we can solve for the expected value of $s$.
Claim 2. The expected value of $s$ is a function of $d$. In particular, $\mathbb{E}[s] = \frac{1}{d+1}$.
Proof. We can first take the derivative of the CDF to calculate the PDF, which we can then use to derive $\mathbb{E}[s]$:
$$f_s(\lambda) = \frac{\mathrm{d}}{\mathrm{d}\lambda}\left[1 - (1 - \lambda)^d\right] = d(1 - \lambda)^{d-1}, \qquad \mathbb{E}[s] = \int_0^1 \lambda \cdot d(1 - \lambda)^{d-1} \, \mathrm{d}\lambda = \frac{1}{d+1}.$$
It then follows from $\mathbb{E}[s] = \frac{1}{d+1}$ that $d = \frac{1}{\mathbb{E}[s]} - 1$, which is exactly the quantity min_hash estimates by plugging in $s$ for $\mathbb{E}[s]$.
Intuitively, this result actually makes a lot of sense. If we have $d$ distinct elements in our dataset, then our expectation is that these elements approximately split the range $[0, 1]$ into $d + 1$ equal parts. Imagine we have three distinct elements. If these three elements were to be perfectly distributed uniformly within the range, then we would have an element approximately at $0.25$, $0.5$, and $0.75$, creating four subranges. The minimum hash would be $0.25$, so $\frac{1}{s} = 4$ counts the subranges and $\frac{1}{s} - 1 = 3$ recovers the number of distinct elements. This is exactly $d$.
Of course, $s$ isn't guaranteed to land exactly on $0.25$ because everything is random, but we know that as long as $s$ is close to $\mathbb{E}[s]$, then our estimated distinct element count is close to $d$. If we can get $s$ to be close to $\frac{1}{d+1}$, then this is good enough to approximate $d$. That being said, it is important to recognize that $\mathbb{E}\!\left[\frac{1}{s} - 1\right] \neq d$ in general, since $\mathbb{E}\!\left[\frac{1}{s}\right] \neq \frac{1}{\mathbb{E}[s]}$.
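If you want to sanity-check Claim 2 numerically, here is a small simulation of my own (the choice of $d$ and the trial count are arbitrary): it repeatedly draws $d$ independent uniform hash values and compares the average minimum to $\frac{1}{d+1}$.

import random

def average_min_hash(d, trials=100_000, seed=0):
    # Empirically estimate E[s] by repeatedly taking the minimum
    # of d independent uniform(0, 1) hash values.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(d))
    return total / trials

d = 4
print(average_min_hash(d))  # close to 0.2
print(1 / (d + 1))          # exactly 0.2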
Note: there is a slightly more involved proof to bound the error of $\frac{1}{s} - 1$ itself, which I might append later on. The general intuition is that bounds on the error of $s$ correspond to bounds on the error of $\frac{1}{s} - 1$, so we can treat any bounds on the error of $s$ as approximate bounds on the error of our estimate.
Distinct elements with MinHash
If MinHash returns the number of distinct elements as long as $s$ is close to its expectation, then aren't we done here?
Not quite.
Randomized hashing suffers from its element of randomness. It's not impossible for an unlucky randomized hash function to cause all elements to hash close together, so that $s$ is not close to $\mathbb{E}[s]$. Let's use our previous example.
Suppose every element of our list happens to hash near $0.5$. The minimum hash is then about $0.5$, so our estimated distinct element count is $\frac{1}{0.5} - 1 = 1$: about 1, far from the true count of 4.
As with anything random, we can reduce the variance of our estimate by taking more trials of MinHash. Let $k$ be a pre-specified value denoting how many trials we want to take and $h_j$ be the $j$-th randomized hash function.
def distinct_elements(X, k):
    s = []
    # Initialize k elements of s to 1
    for j in range(k):
        s.append(1)
    for x_i in X:
        # Perform MinHash k times per element in X
        for j in range(k):
            s[j] = min(s[j], h_j(x_i))
    avg_s = 1/k * sum(s)
    return 1/avg_s - 1
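To actually run this pseudocode we need $k$ different hash functions. One hedged way to get them (my own choice, not prescribed above) is to salt a single hash with the index j; everything else below mirrors the pseudocode.

import hashlib

def make_hash(j):
    # The j-th hash function: salt the input with j so that
    # different j values behave like independent hash functions.
    def h_j(x):
        digest = hashlib.blake2b(f"{j}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64
    return h_j

def distinct_elements(X, k):
    hashes = [make_hash(j) for j in range(k)]
    s = [1.0] * k
    for x_i in X:
        for j in range(k):
            s[j] = min(s[j], hashes[j](x_i))
    avg_s = sum(s) / k
    return 1 / avg_s - 1

print(distinct_elements([2, 5, 2, 7, 2, 9, 5], k=100))  # roughly 4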
The more hash functions (trials) we use, the better we approximate the number of distinct elements in $X$. Let's take a look at the variance of a single hash function.
Claim 4. $\mathrm{Var}[s] \le \frac{1}{(d+1)^2} = \mathbb{E}[s]^2$.
Proof. We use the definition of variance while trading a precise expression for a more concise upper bound:
$$\mathrm{Var}[s] = \mathbb{E}[s^2] - \mathbb{E}[s]^2 = \frac{2}{(d+1)(d+2)} - \frac{1}{(d+1)^2} \le \frac{2}{(d+1)^2} - \frac{1}{(d+1)^2} = \frac{1}{(d+1)^2},$$
where $\mathbb{E}[s^2] = \int_0^1 \lambda^2 \cdot d(1 - \lambda)^{d-1} \, \mathrm{d}\lambda = \frac{2}{(d+1)(d+2)}$ follows from the same PDF as before.
With $k$ hash functions, the average $\bar{s} = \frac{1}{k}\sum_{j} s_j$ (avg_s in the pseudocode) satisfies $\mathbb{E}[\bar{s}] = \mathbb{E}[s]$ and $\mathrm{Var}[\bar{s}] = \frac{\mathrm{Var}[s]}{k} \le \frac{1}{k(d+1)^2}$, so our variance decreases with each additional hash function. We want it to be extremely likely that whatever $\bar{s}$ we get in the end is within a relative error of $\epsilon$ of $\mathbb{E}[s]$, where $\epsilon$ is our tolerated fraction of error from the expectation.
Fortunately for us, Chebyshev's inequality bounds exactly how likely $\bar{s}$ is to stay within that relative error:
$$\Pr\!\left[\,|\bar{s} - \mathbb{E}[s]| \ge \epsilon\,\mathbb{E}[s]\,\right] \le \frac{\mathrm{Var}[\bar{s}]}{\epsilon^2\,\mathbb{E}[s]^2} \le \frac{1}{k\epsilon^2}.$$
The probability that $\bar{s}$ deviates from $\mathbb{E}[s]$ by more than an $\epsilon$ fraction is at most $\frac{1}{k\epsilon^2}$. Since $k$ and $\epsilon$ are factors we control, we can set them such that they satisfy a predetermined acceptable level of error in $\bar{s}$ (and by extension $\frac{1}{\bar{s}} - 1$).
For example, if we want our sketch to be successful 95% of the time in returning an $\bar{s}$ within the desired relative error, our failure rate is $\delta = 0.05$. Then, if we want to ensure that $\bar{s}$ is within 1% of its expectation, $\epsilon = 0.01$. The question now is how many hash functions do we need?
From our derivations, we need $\frac{1}{k\epsilon^2} \le \delta$, i.e. $k \ge \frac{1}{\delta\epsilon^2} = \frac{1}{0.05 \cdot 0.01^2} = 200{,}000$. So if we utilize 200,000 hash functions, 95% of the time that we run our algorithm, we achieve a result with $\bar{s}$ within 1% of $\mathbb{E}[s]$.
More generally, $k = O\!\left(\frac{1}{\delta\epsilon^2}\right)$.
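If it helps, that bound translates directly into a tiny helper (my own convenience function, using the same $\delta$ and $\epsilon$ as above):

import math

def required_hash_functions(delta, eps):
    # Chebyshev gives failure probability at most 1/(k * eps^2),
    # so k >= 1/(delta * eps^2) makes it at most delta.
    return math.ceil(1 / (delta * eps ** 2))

print(required_hash_functions(0.05, 0.01))  # 200000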
And with that, we have written an algorithm that can estimate the number of distinct elements up to an acceptable error with a certain likelihood! Admittedly, our results seem less than ideal. About one out of every 20 runs, our algorithm fails to provide a distinct element count within our desired error. If this sketch is run many times, 5% is still a somewhat non-negligible failure rate. In the next section, I will talk about a neat optimization that decreases our space usage, which allows us to use lower error rates. This will increase our success rate for the same amount of space.
Median trick optimization
One part of understanding this algorithm that I handwaved earlier was how the error on $\bar{s}$ corresponds to the error on $\frac{1}{\bar{s}} - 1$. Since in practice we don't actually care about the error of $\bar{s}$, only the error of $\frac{1}{\bar{s}} - 1$, we assumed that bounding the error on $\bar{s}$ would also give corresponding bounds on the error of $\frac{1}{\bar{s}} - 1$. This much is true. However, the formal proof of those bounds limits $\epsilon$ to a certain range, and the $\frac{1}{\delta}$ factor in $k$ makes very small failure rates expensive, which prevents us from setting an arbitrarily small $\delta$ as our error rate.
The "median trick" is a way to allow us to use any that we want, at the cost of an constant factor in space usage.
def distinct_elements(X, k, t):
    s = []
    estimates = []
    # For each of the t trials, store k minimum hash values
    for t_i in range(t):
        s.append([])
        for j in range(k):
            s[t_i].append(1)
    for x_i in X:
        # Perform t independent trials
        for t_i in range(t):
            # Perform MinHash k times per element in X;
            # h(t_i, j, .) denotes the j-th hash function of trial t_i,
            # so each trial uses its own independent hash functions
            for j in range(k):
                s[t_i][j] = min(s[t_i][j], h(t_i, j, x_i))
    # Turn each trial into an estimate, then report the median
    for t_i in range(t):
        avg_s = 1/k * sum(s[t_i])
        estimates.append(1/avg_s - 1)
    return median(estimates)
The median is often used as an alternative to the mean because it is more robust to outliers. As the median (our new estimate of $d$) is likely to be within our error tolerance as long as most individual trials are, we can actually relax the per-trial error rate in MinHash to a much larger value (for example, we can use 0.2 instead of 0.05), which stays constant and does not become a $\frac{1}{\delta}$ dependency in our space analysis anymore.
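A toy illustration of that robustness (the per-trial estimates here are made up): a single badly failed trial drags the mean far from the truth but barely moves the median.

from statistics import mean, median

# Hypothetical per-trial estimates for a stream with 4 distinct
# elements, where one trial failed badly.
estimates = [4.1, 3.8, 4.0, 97.0, 3.9]

print(mean(estimates))    # ~22.6, ruined by the single outlier
print(median(estimates))  # 4.0, still close to the truth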
Note: the value of $\delta$ here is a bit confusing. In our original analysis, $\delta$ was the rate of failure for the MinHash algorithm. When we apply the median trick, $\delta$ becomes the rate of failure of the entire median-trick algorithm. When we say we are relaxing the error rate in MinHash, we are saying we allow each individual MinHash trial to fail more often. However, the error rate $\delta$ is the failure rate of the algorithm as a whole (the probability of not returning an estimate within some percent of $d$).
I will not go into the proof here because the statistics required (a Chernoff bound) goes a bit beyond introductory statistics. But the point to take away is that by running approximately $O\!\left(\log\frac{1}{\delta}\right)$ trials of our original algorithm, we can achieve an arbitrarily small error rate $\delta$ that we control.
Since the space requirement is loosened, we can achieve even greater reliability while demanding roughly the same amount of space as before.
Space analysis
The total amount of space is directly proportional to the number of hash functions: for each hash function, we must store one float (its current minimum hash). In addition, for each trial, we have to use $k$ hash functions. Originally, the number of hash functions we used was
$$k = O\!\left(\frac{1}{\delta\epsilon^2}\right),$$
where $\delta$ was the error rate of our sketch and $\epsilon$ was our margin of error for $\bar{s}$.
After applying the median trick, the total number of hash functions becomes
$$k \cdot t = O\!\left(\frac{\log\frac{1}{\delta}}{\epsilon^2}\right).$$
The main difference between the two space usages is the $\frac{1}{\delta}$ versus the $c \log\frac{1}{\delta}$, where $c$ is just a constant induced by the big-O notation.
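To get a feel for the gap, take a very small failure rate such as $\delta = 10^{-6}$ (the shared $\frac{1}{\epsilon^2}$ factor is identical on both sides): the Chebyshev-only bound pays a factor of $\frac{1}{\delta} = 10^{6}$, while the median trick pays only $\log_2 \frac{1}{\delta} \approx 20$ times the big-O constant.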
Finally, notice that our space analysis does not include the stream length $n$ or the distinct count $d$ anywhere! The space that our algorithm requires does not depend on the total number of elements at all, making it extremely scalable.
In practice and further reading
There is an issue with the hash function that we described: it is not possible in practice because we cannot create a hash function with a truly continuous range. To solve this, we can just discretize the range $[0, 1]$ of hash values (for example, hash to a 64-bit integer and divide by $2^{64}$).
Alternatively, the HyperLogLog algorithm is a more practical descendant of the Flajolet-Martin line of work (which also includes MinHash and LogLog). Instead of hashing each element to a continuous range and taking the minimum hash, each element is hashed to a binary string and we track the maximum number of leading zeros. The amount of space this algorithm uses is tiny: each counter only needs $O(\log\log n)$ bits, which is where the algorithm gets its name. The HyperLogLog algorithm is often known for its usage in Google's large-scale systems, which I will provide in the resources below.
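As a rough illustration of the leading-zeros idea (a heavily simplified, single-register sketch of my own; real HyperLogLog keeps many registers and applies bias corrections):

import hashlib

def h64(x):
    # Hash an element to a 64-bit integer.
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def leading_zeros(v, bits=64):
    # Number of leading zero bits in the 64-bit representation of v.
    return bits - v.bit_length()

def rough_loglog_estimate(X):
    # If the largest number of leading zeros seen is R, then on the
    # order of 2**R distinct hashes were needed to produce it, so we
    # return 2**R as a crude estimate. Storing R takes only
    # O(log log n) bits, hence the "LogLog" name.
    R = 0
    for x in X:
        R = max(R, leading_zeros(h64(x)))
    return 2 ** R

print(rough_loglog_estimate(range(100_000)))  # crude single-register estimate of 100000 (high variance)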
Conclusion
And that's that! The MinHash algorithm honestly blew my mind when I first learned about it. It was so surprising to me that we could sacrifice the exactness of our solution for a drastic decrease in space. Since then, I have been amazed by all the approximation algorithms out there being used for big data. I truly hope that you found this algorithm as interesting as I did! Here are some other fascinating sketch algorithms:
- Bloom filters
- Count-min sketch
- Locality-sensitive hashing
All credit for what I have written goes to Prof. Cameron Musco's 2021 offering of CS514 at UMass (probably my favorite class I have taken so far), which I will link in the resources below.